By Bill Whitford (DPS Group), Betsy Macht (Johnson & Johnson, retired), and Toni Manzano (Aizon) AI in Operations
Bias in analysis or prediction is the assignment of disproportionate or improper weight or importance to an idea or thing. There are many sources of bias in analysis and modeling, and they can arise anywhere from the initial design of the study to the way conclusions are applied. Bias can emerge from the very structure or application of the math or algorithm itself. One example occurs when a set of data perfectly represents the aspects of the set or population being studied, but the algorithm is constructed to apply improper importance, or weight, to particular values. In other words, the data fairly represents reality, but the process of analysis imparts a bias to the results.
For instance, suppose data from hospital treatments were used to compare complications from two different analgesics. If the algorithm were not constructed to give treatment in an ICU more weight than a brief admission for examination, very significant differences in the importance of complications caused by one analgesic over the other could be missed.
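The analgesic example above can be made concrete with a minimal sketch. The records, the setting names, and the severity weights below are all hypothetical values chosen for illustration. They are not taken from any study, and real weights would need clinical justification.

```python
from collections import defaultdict

# Hypothetical severity weights per care setting (illustrative values only).
SETTING_WEIGHT = {"icu": 5.0, "brief_admission": 1.0}

# Hypothetical records: (analgesic, setting in which a complication was
# treated, or None if the patient had no complication).
RECORDS = [
    ("A", "icu"), ("A", None), ("A", None), ("A", None),
    ("B", "brief_admission"), ("B", None), ("B", None), ("B", None),
]


def complication_scores(records, weights=None):
    """Return the per-patient complication score for each analgesic.

    With weights=None every complication counts as 1 (a plain rate).
    With a weights mapping, each complication counts as the weight of
    the setting in which it was treated.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for drug, setting in records:
        counts[drug] += 1
        if setting is not None:
            totals[drug] += 1.0 if weights is None else weights[setting]
    return {drug: totals[drug] / counts[drug] for drug in counts}


unweighted = complication_scores(RECORDS)
weighted = complication_scores(RECORDS, SETTING_WEIGHT)
```

In this toy data both analgesics show the same unweighted complication rate, yet the weighted score separates them because one complication required ICU treatment. The same representative data therefore supports different conclusions depending on how the algorithm weights it.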
However, bias can also come from the data that is being fed into the equations or models, and the data sets used to train the algorithm/model. Here we will examine some origins of bias, not in the calculations or manipulations of the data, but in the sets of data themselves.
A general definition of data bias is “The data set in question is, for any reason, not sufficiently representative of the population or phenomena considered in the study.” The data acquired for use may be from original research or acquired as secondary data. In either case, the types of bias discussed below could be present.
In the Data Science field there are two main trends. The first trend (Unmanipulated Data Bias) recommends working with fully anonymized data, following a completely blind strategy without any process knowledge of the topic under study. The second trend (Manipulated or Cleaned Data Bias) runs in the opposite direction, suggesting that subject matter experts should be involved in the data science operations from the beginning in order to achieve successful results. Both approaches, which are covered below, require managing bias in some way.
Unmanipulated Data Bias
In studies within the first trend, which use anonymized data on which no manipulation or cleaning has occurred, several biases are possible. Some assume that because the data has not been subsequently altered, it must be accurate and representative. However, several types of bias can be introduced during the original generation, gathering, labeling, or organization of the data.
SAMPLE OR SELECTION BIAS
Sample bias occurs when data is not collected from all the elements or types of the population it is intended to represent. The dataset may represent aspects of the sampled population, group, or cohort, yet not accurately reflect the factors or characteristics of the general population that the sampled group is meant to represent. For example, in testing a vaccine for safety and efficacy, a test group selected from a particular geographical region or demographic might not fairly represent such characteristics as the age, gender, and race of the general population.
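One simple way to screen for this kind of bias is to compare the make-up of the sample with known reference shares for the general population. The sketch below assumes such reference shares are available. The age bands, the shares, the enrollment counts, and the 10-point tolerance are hypothetical.

```python
from collections import Counter

# Hypothetical reference shares for the general population (illustrative only).
POPULATION_SHARE = {"18-39": 0.40, "40-64": 0.40, "65+": 0.20}

# Hypothetical trial enrollment: one age band per participant.
SAMPLE = ["18-39"] * 60 + ["40-64"] * 35 + ["65+"] * 5


def representation_gaps(sample, population_share):
    """Return sample share minus population share for each group."""
    counts = Counter(sample)
    n = len(sample)
    return {
        group: counts.get(group, 0) / n - share
        for group, share in population_share.items()
    }


def flag_gaps(gaps, tolerance=0.10):
    """Return the groups whose share differs from the reference by more than tolerance."""
    return sorted(group for group, gap in gaps.items() if abs(gap) > tolerance)


gaps = representation_gaps(SAMPLE, POPULATION_SHARE)
flagged = flag_gaps(gaps)  # groups over- or under-represented in the sample
```

Here the youngest band is over-represented and the oldest band under-represented, so both are flagged. A check like this only covers the characteristics that were recorded, so an unflagged sample is not proof of representativeness.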
Measurement bias occurs when faulty measurements of a group or cohort distort important parameters or aspects of the data examined. For example, if different cameras are used over time, across installations, or across geographies to collect the primary data, they may capture different information on the fine features of the images. Furthermore, these data may also differ from data obtained with the cameras in use when the model or statistical conclusions of the study are applied.
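As a rough illustration of the camera example, the sketch below estimates a per-device offset. It assumes each camera has imaged the same calibration target with a known reference value. The device names, the readings, and the use of a simple additive offset are all hypothetical simplifications, since real devices may differ in ways a single offset cannot capture.

```python
from statistics import mean

# Hypothetical brightness readings of the same calibration target per camera.
READINGS = {
    "camera_old": [98.0, 102.0, 100.0],
    "camera_new": [109.0, 111.0, 110.0],
}


def device_offsets(readings, reference_value):
    """Return each device's mean deviation from the known reference value."""
    return {
        device: mean(values) - reference_value
        for device, values in readings.items()
    }


def correct(value, device, offsets):
    """Remove the estimated additive offset of a device from one reading."""
    return value - offsets[device]


offsets = device_offsets(READINGS, reference_value=100.0)
corrected = correct(115.0, "camera_new", offsets)
```

A non-zero offset for one device suggests that data from it is not directly comparable with data from the others until it is corrected or the device is recalibrated.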
RECALL OR LABELING BIAS
Another way we incur bias in unmanipulated primary data is through inconsistent labeling or annotation of an otherwise representative data set. An example of this bias occurs when self-reported race is used in establishing demographic distributions in an environment. While the group or cohort of people may be properly randomized, individuals of the same genetic makeup may consider, or report, themselves in different ways. We can also see this when the subjective thoughts of people, or incomplete models in algorithms, affect the labeling of the data, resulting in inaccurate organization of otherwise representative collected data.
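Inconsistent labeling can be partly quantified by having two annotators label the same items and measuring their agreement. The sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic that is swapped in here for illustration and is not something the text above prescribes. The labels and annotator data are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Return Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators for the same four items.
annotator_1 = ["defect", "defect", "ok", "ok"]
annotator_2 = ["defect", "ok", "ok", "ok"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A kappa near 1 indicates consistent labeling, while values near 0 indicate agreement no better than chance. A low value is a signal to revisit the labeling guidelines before the data is used for training.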
An example of observer bias is what is termed “confirmation bias”. People reporting data from their own observation or experience often inadvertently remember selectively, or “cherry pick”, the data. This is often the result of seeing or recording only the events that they expect or want to see, because those events confirm what they think they “know” to be true. It can happen when researchers go into a project with subjective or preconceived thoughts about their study, either consciously or unconsciously.
Association bias occurs when the data collected reinforces or multiplies an existing bias, or the consequences of a previous one. For example, if we decide to collect data from “quality products”, we might choose to collect samples from expensive products. Or if we want to collect data from “capable” or “proficient” people, we may collect it from a profession that had previously been populated through a racial or gender bias. This bias originates in the post hoc ergo propter hoc (“after this, therefore because of this”) fallacy. Although there are more male engineers or lawyers, this does not mean that women are less capable of operating in these professions; they may simply have been previously excluded.
Manipulated or Cleaned Data Bias
Studies within this second trend use data sets that have been manipulated by experts in order to improve the outcome in some way. These activities present the possibility of introducing other types of bias.
Exclusion (sometimes called attrition) bias is most common at the data collection or preparation stage. Often it is the result of the post-randomization rejection of valuable data, or data sources, that are thought to be errant, troublesome, or spurious. Sometimes categories of data that represent a very small percentage of the group or cohort, or that are otherwise problematic, are assumed to be unimportant “clutter” and are therefore removed. For example, if a study were designed to determine the effect of a drug upon cognitive, problem-solving, or other thinking abilities, a group or cohort of randomized patients might be assembled. However, if an assistant or technician disqualified candidates because they were not efficient or punctual in completing the screening procedures, the very individuals the study most needs to include might inadvertently be excluded.
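A basic safeguard is to compare the excluded records with the retained ones before discarding anything. The sketch below follows the screening example above. The candidate data and the idea of a recorded baseline score are hypothetical.

```python
from statistics import mean

# Hypothetical candidates: (baseline_cognitive_score, completed_screening_on_time).
CANDIDATES = [
    (82, True), (78, True), (85, True), (80, True),
    (61, False), (58, False),
]


def exclusion_report(candidates):
    """Summarize how excluded candidates differ from retained ones.

    Assumes at least one candidate falls in each group; statistics.mean
    raises StatisticsError on an empty list.
    """
    retained = [score for score, on_time in candidates if on_time]
    excluded = [score for score, on_time in candidates if not on_time]
    return {
        "excluded_fraction": len(excluded) / len(candidates),
        "retained_mean": mean(retained),
        "excluded_mean": mean(excluded),
    }


report = exclusion_report(CANDIDATES)
```

In this toy data the excluded candidates have markedly lower baseline scores than those retained, which suggests the punctuality rule is filtering on the very trait the study is meant to measure.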
Algorithms may also display an uncertainty bias. Here a calculation or model may select, emphasize or present a more confident assessment where better (e.g., larger) data sets are available. This can skew processes toward favoring results from cohorts that simply provide results with a greater confidence level, and de-emphasize results from cohorts that might in fact be the better choice.
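The effect can be illustrated with a small sketch. It assumes a selection rule that ranks cohorts by the lower bound of a 95% normal-approximation (Wald) confidence interval. That rule, the cohort names, and the counts are hypothetical and chosen only to show the mechanism.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Return the normal-approximation confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Hypothetical cohorts: (successes, participants).
COHORTS = {"large_cohort": (700, 1000), "small_cohort": (16, 20)}


def pick_by_point_estimate(cohorts):
    """Choose the cohort with the highest observed success rate."""
    return max(cohorts, key=lambda name: cohorts[name][0] / cohorts[name][1])


def pick_by_lower_bound(cohorts):
    """Choose the cohort with the highest lower confidence bound."""
    return max(cohorts, key=lambda name: wald_interval(*cohorts[name])[0])


best_observed = pick_by_point_estimate(COHORTS)
best_confident = pick_by_lower_bound(COHORTS)
```

The small cohort has the higher observed rate, but its wide interval pushes its lower bound below that of the large cohort, so a confidence-driven rule selects the large cohort instead. Neither choice is necessarily wrong, but the preference for well-sampled cohorts should be a deliberate decision and not a side effect.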
In generating data for use in modeling or analysis, transparency in sources and methods is imperative. We must strive to identify areas and sources of potential bias, as well as decisions that may create bias. We must consider the potential for bias to affect decisions across the product ecosystem, influencing use, diagnostics, and product design. Finally, we must ensure that the final use of conclusions, models, or products conforms to the constraints introduced during the generation and application of the data sets in the system of product development.