Data Governance: An AI in Pharma Requirement

By Toni Manzano (Aizon), Mario Stassen (Stassen Pharmaconsult BV), William Whitford (DPS Group) and AIO Team

In February of 2022, the efforts of Xavier Health were assumed by the AFDO/RAPS Healthcare Products Collaborative. Because of the important work done before this transition, the Collaborative has chosen to retain documents that have Xavier branding and continue to provide them to the communities through this website. If you have questions, please contact Timothy Hsu, Director of Health Technology Initiatives, at

AI is a relatively mature field, officially born at the Dartmouth Conference in 1956 [1]. The current confluence of virtually unlimited data storage space and computational power has changed many aspects of algorithm design and implementation. For example, it removes concerns about the extent of relevant data and offers the possibility of accessing algorithm technology through competitive resources. This increased power, together with AI’s growing accessibility to many (its democratization), allows its use in many different fields and applications without apparent limitation. The perfect storm generated by this, and by the many technologies based on the internet’s capabilities, has led to AI being identified as a key element of the digital transformation of society.

There are many different definitions of AI. They usually relate to the ability of algorithms to “learn” and to solve problems in a higher, almost humanly cognitive way. These definitions also encompass sub-fields such as machine learning and deep learning. AI is often identified with “expert systems” that make predictions or classifications to support product innovations such as self-driving cars, facial recognition, and automatic translation. However, this is a rather traditional and outcome-focused meaning.

In the current digital context, AI is better depicted as a cocktail of three main ingredients: math, algorithms (written in software), and computing power. But if we expect to achieve successful and valid results from this cocktail, a secret ingredient is required: good data. In an earlier blog [2] we discussed challenges and solutions for different types and sources of data, including its integrity, accessibility, and maintenance.

And in this digital era, talking about data implicitly invokes such adjectives as “electronic,” “massive,” “unstructured,” “cloud,” and “real time.” But especially when working in Pharma, we must add the “quality” attribute. In a few ways, the Wikipedia definition of Data Governance is curious:

 Data governance is a data management concept concerning the capability that enables an organization to ensure that high data quality exists throughout the complete lifecycle of the data, and data controls are implemented that support business objectives. The key focus areas of data governance include availability, usability, consistency, data integrity and data security and includes establishing processes to ensure effective data management throughout the enterprise such as accountability for the adverse effects of  poor data quality and ensuring that the data which an enterprise has can be used by the entire organization.

Changing the phrase “support business objectives” to “ensure the safety, quality, and efficacy of the pharmaceutical product” would be the beginning of a really good definition of data governance in Pharma. Efficient data governance provides many benefits, including increased consistency and confidence in decision making and improved data security. It maximizes the potential of data while minimizing or eliminating outcome failure and rework. Data governance concepts and definitions in public authority guidance generally have a rather limited scope, focused on data integrity, and do not cover the broader concept of data quality.

For further context, consider the latest PIC/S guidance, Good Practices for Data Management and Integrity in Regulated GMP/GDP Environments, PI 041-1, 1 July 2021 [3], where data governance is defined as the sum total of the arrangements to ensure that data forms a complete, consistent, and accurate record throughout the data lifecycle, irrespective of the format in which it is generated, recorded, processed, retained and used. The definitions in this guidance draw an interesting distinction between data integrity and data quality.

Data Integrity: The degree to which data are complete, consistent, accurate, trustworthy, reliable and that these characteristics of the data are maintained throughout the data life cycle.

Data Quality: The assurance that data produced is exactly what was intended to be produced and fit for its intended purpose. This incorporates ALCOA+ principles.

Data integrity, for instance, can be seen as a quality characteristic, and is therefore a subset of data quality. 21 CFR Part 601 [4] as well as Annex 11 to the EU GMP Guideline [5] cover an additional data quality characteristic: data availability. 21 CFR Part 11 [6] mentions data confidentiality as well, and shares this control objective with various data privacy and data protection laws and directives.

The concept of “quality data” requires that the source be attributable: the original source of the data must be discoverable, and its path through nodes of exchange (or its “version”) must be clear and traceable. Quality data must contain consistent metadata (information describing and qualifying the main data set). This is especially true for unstructured or unlabeled data. It must be legible (proprietary binary formats are not good configurations) and it must have an associated universal timestamp (indispensable for time-series data). After-the-fact manipulations of data include removing duplicate or irrelevant observations, repairing structural errors and ambiguities (such as diverse abbreviations), filtering errant (unwanted) outliers, and handling missing data. These must be scrutinized to prevent unexpected, including inadvertent, bias. Finally, the accuracy of the data should be capable of orthogonal confirmation (that is, confirmation by other means).
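As a rough illustration of these attributes, the sketch below models an observation that carries its source, a timezone-aware timestamp, and unit metadata, and applies two of the after-the-fact manipulations mentioned above (removing duplicates and handling missing data) while logging each one so it can be scrutinized later. The names `Observation`, `CleaningResult`, and `clean`, and the choice of fields, are assumptions made for this example only; they are not taken from any guidance or product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Observation:
    source_id: str           # attributable: which instrument or system produced the value
    recorded_at: datetime    # must be timezone-aware so it maps to a universal time
    value: Optional[float]   # None marks a missing reading
    unit: str                # metadata qualifying the value


@dataclass
class CleaningResult:
    kept: list
    audit_log: list = field(default_factory=list)


def clean(observations):
    """Drop exact duplicates and missing values, logging every removal."""
    result = CleaningResult(kept=[])
    seen = set()
    for obs in observations:
        if obs.recorded_at.tzinfo is None:
            # A naive timestamp cannot be placed on a universal timeline.
            raise ValueError(f"timestamp without timezone from {obs.source_id}")
        key = (obs.source_id, obs.recorded_at.astimezone(timezone.utc),
               obs.value, obs.unit)
        if key in seen:
            result.audit_log.append(("duplicate_removed", obs))
            continue
        seen.add(key)
        if obs.value is None:
            result.audit_log.append(("missing_value_excluded", obs))
            continue
        result.kept.append(obs)
    return result
```

Because every removal is recorded next to the untouched observation, a reviewer can check afterwards whether the cleaning step introduced bias, rather than having to trust the cleaned data set on its own.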

The data governance required in the AI field has traditionally been linked to the FAIR principles (findable, accessible, interoperable and reusable), which can be considered a good ally to the original ALCOA principles [7]. In their extended ALCOA+ form, these principles direct that data must be attributable, legible, contemporaneous, original, accurate, complete, consistent, enduring, and available.
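One simple way to make the attributes listed above operational is to treat them as a checklist and report which ones a data set review has not yet confirmed. The sketch below is only that: the function name `alcoa_plus_gaps` and the idea of recording a review as a dictionary of true/false flags are assumptions for illustration, not a method prescribed by the WHO or PIC/S documents.

```python
# The nine attributes named in the text, in the order given there.
ALCOA_PLUS = (
    "attributable", "legible", "contemporaneous", "original", "accurate",
    "complete", "consistent", "enduring", "available",
)


def alcoa_plus_gaps(review):
    """Return the attributes not marked as satisfied, in checklist order.

    `review` maps attribute names to True/False; a missing entry counts
    as not satisfied, so an incomplete review cannot pass silently.
    """
    return [name for name in ALCOA_PLUS if not review.get(name, False)]


# Hypothetical review of a single data set.
review = {"attributable": True, "legible": True, "original": False}
print(alcoa_plus_gaps(review))
```

Treating an unanswered attribute as a gap is a deliberate design choice here: the review only passes when every attribute has been explicitly confirmed.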

In the context of AI, and by its nature, data governance encompasses the system by which an organization is controlled and operates, and includes the mechanisms by which it, and its people, are held to account. Ethics, risk management, compliance and administration are now all considered elements of data governance. Instead of looking at processes and systems (or other aspects of the project), you might take the data perspective: data is what flows through the processes and systems.

The management of all these topics is a prerequisite to ensuring the quality of AI outcomes. Such data governance therefore becomes a critical discipline that should be considered a requisite component of a quality management system, shouldn’t it?


  1. Moor, J. (2006). The Dartmouth College Artificial Intelligence Conference: The Next Fifty Years. Workshop Report 4, AAAI. AI Magazine.
  2. AI in Pharma Adoption, Part 2: “Good AI Begins With Good Data”, 19 April 2021.
  3. PIC/S Guidance, Good Practices for Data Management and Integrity in Regulated GMP/GDP Environments, PI 041-1, 1 July 2021.
  4. 21 CFR Part 601, Subpart F, Sec. 601.50 and Sec. 601.51.
  5. Annex 11 to the EU GMP Guideline (Annex 11 Final 0910).
  6. 21 CFR Part 11.
  7. WHO (2016). Guidance on good data and record management practices. Technical Report Annex 5, WHO Expert Committee on Specifications for Pharmaceutical Preparations.