Over the last two decades, advances in machine learning (ML) have delivered immensely empowering technologies, shifting the paradigm of software operating in complex domains away from highly customized, single-use technology stacks toward an increasingly modular approach that places much of the burden of handling complexity on learning by example. Frameworks such as PyTorch and TensorFlow, along with the open sharing of pre-trained neural network models, have made it more accessible than ever to try out ML solutions in many applications of interest.
Trailing this newfound accessibility of deploying ML models, there has also been growing awareness of the potential for such systems to encode biases unanticipated by those applying and deploying the algorithms. In safety-critical domains such as medicine and healthcare, failing to anticipate such issues can have serious consequences for treatment and public health, and can erode the public's trust.
In one such example, a study found that an algorithm deployed to help health systems and payers identify high-risk patients assigned Black patients experiencing more serious illness the same level of risk as healthier White patients. The study's authors showed that remedying the algorithmic scoring bias would have increased the percentage of Black patients recommended for additional care from 17.7% to 46.5%.
A large volume of unfortunate examples also emerged from the eagerness of the ML and computer vision communities to assist with the COVID-19 pandemic. In the rush to develop models that identify COVID-19 from chest radiographs and CT images, one review found 415 publications and preprints proposing such systems during 2020. After screening and systematic scrutiny, none were identified as being of potential clinical use due to methodological flaws or underlying biases. Had any of these proposals made it through to deployment, the result might have been systematic misdiagnoses with potentially life-threatening consequences.
How do issues of bias and fairness creep into ML systems?
Fundamentally, machine learning approaches can be distilled down to three components: a model, a goal task, and an optimization algorithm that runs through the available data and tunes the model to perform well on the task. Biases we perceive as unfair may be introduced into a learning system through how the task is defined, what restrictions a chosen model places on the possible solutions, how, where, and when data is gathered (data provenance), how much data is gathered, and crucially, how effectively that data captures all the possibilities that might occur.
The earlier example of identifying high-risk patients was one case where a seemingly logical definition of the task led the algorithm's predictions to propagate a systemic bias. The system in question predicted the cost of healthcare as a proxy for a patient's healthcare needs. Unfortunately, inequalities in access to care and a multitude of biases in relationships with the healthcare system have resulted in lower health spending on Black patients than on White patients with comparable levels of need. The assumed equivalence between estimating cost and estimating need therefore turned out to be a key factor behind the observed bias in risk predictions, underscoring the need for caution at the level of task definition.
In the domain of detecting COVID-19 from chest images, a prominent issue across multiple systems was the use of a pediatric (ages 1-5) pneumonia dataset as a significant source of patients not afflicted with COVID-19, whereas the images of confirmed COVID-19 cases were collected from adults. Any ML system trained and tested with this type of data bias would have appeared to its developers to accurately detect COVID-19 even if it merely learned to discriminate between adult and pediatric chest images. Without broader testing, it would have been unknown whether such a system could accurately identify adults not exhibiting COVID-19 or children who do. Clearly such an ambiguity, introduced by not verifying or clearly disclosing data provenance, could have led to systematic misdiagnoses in deployment had these issues not been identified in early reviews.
Given the many avenues by which biases raising issues of fairness or safety can enter ML system design, it is unlikely that any single recipe will ever guarantee a completely trouble-free deployment. In this light, it is worth highlighting a few guidelines that can help mitigate the risks and encourage a fruitful process of iterative correction.
Encourage collaboration along with institutional and team diversity
This may be obvious and often restated, but what appears to be a perfectly reasonable assumption to an individual from one background of life experiences may immediately raise red flags or contradictions for someone with a different path through life. Similarly, different professional disciplines bring with them deep understandings of complementary domains. In ML, this comes into play at all levels. For example, the phrasing of a question on a survey form for gathering data may impact participant responses, or the categories proposed for labeling data may not capture sufficient complexity to describe some participants. It is therefore extremely useful to have broad and regular participation during both problem definition and data gathering: from diverse disciplines and backgrounds, from the communities in which data is gathered, and from those affected by the problem the system is attempting to solve. Such participation improves the chances of catching potential sources of bias early in development and opens the door to an ongoing collaboration of iterative system refinement.
In our first example, the use of health cost as a proxy for need was not identified in early development as a potential source of bias. But fortunately, as noted by the authors of the study, the algorithm developer chose to collaborate with researchers who uncovered the issue and iterated on the system to correct the biased underestimation of health risk.
In recent years the field of AI as a whole has become increasingly aware of the critical importance of a cross-disciplinary and more inclusive approach to research and development, leading to the founding of institutes such as Stanford’s Human-Centered Artificial Intelligence (HAI) and collaborations such as Microsoft’s Project Resolve. While there is much remaining room for progress, the emergence of broad efforts such as these lends an optimistic trajectory for how the field continues to co-evolve with those that it impacts.
Take the time to know your data
In the drive to deliver a solution quickly, it is often tempting to dive directly into building and training ML models. This is especially true in the rare cases where the data already appear well organized. Such was the case with the pediatric pneumonia images used across papers eager to contribute to the diagnosis of COVID-19.
But before diving into model selection or optimization, a huge part of catching potential bias issues early in the development process lies in characterizing your data. This encompasses understanding its provenance, anticipating which aspects of data may not be independent from the task (metadata such as location where data was gathered, age of subjects, instruments used, etc.), and looking for balance issues or gaps both with respect to such metadata, as well as input and output labels defined by the prediction task.
Often, a key challenge is anticipating which metadata will be particularly important to interrogate, which is where (to our previous point) involving a breadth of perspectives becomes a great asset. Beyond this challenge, the search for data gaps and imbalance is largely an exercise in data visualization.
Fortunately, many issues with respect to important metadata can be uncovered with simple frequency plots, such as counting the number of examples representing each label of a category or some range of continuous values (e.g., the number of samples per patient age range).
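As a minimal sketch of this kind of frequency check, the snippet below bins a hypothetical list of patient ages and tallies samples per range; the ages and the 20-year bin width are illustrative assumptions, not values from any study discussed here. A dataset like the COVID-19 example above, with non-COVID images drawn from ages 1-5, would show up immediately as a lopsided histogram.

```python
from collections import Counter

# Hypothetical patient ages; in practice these come from your dataset's metadata.
ages = [3, 4, 2, 35, 42, 51, 38, 47, 60, 55, 44, 39]

def age_bin(age, width=20):
    """Map an age to a coarse range label, e.g. 42 -> '40-59'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

# Count samples per age range and print a simple text histogram.
counts = Counter(age_bin(a) for a in ages)
for label in sorted(counts):
    print(f"{label}: {'#' * counts[label]} ({counts[label]})")
```

The same pattern applies to any categorical metadata (collection site, instrument, label class): one `Counter` per field, and gaps or imbalances become visible before any model is trained.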
Beyond simple dataset statistics, classic dimensionality reduction (PCA) or embedding approaches (t-SNE, UMAP, and PHATE) are extremely useful in visualizing relationships between complex data, such as images or other multi-dimensional signals. These methods help identify subpopulations, and ease the discovery of potential duplicates, outliers, and sparsely sampled domains of data.
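To illustrate how an embedding can surface subpopulation structure, the sketch below runs PCA (via scikit-learn) on synthetic high-dimensional data containing two simulated groups, standing in for, say, images from two different scanners. The data, group sizes, and cluster offsets are all fabricated for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two hypothetical subpopulations simulated as Gaussian clusters
# in a 50-dimensional feature space (e.g., features from two data sources).
group_a = rng.normal(loc=0.0, size=(100, 50))
group_b = rng.normal(loc=3.0, size=(100, 50))
X = np.vstack([group_a, group_b])

# Project to 2-D; in practice you would scatter-plot these coordinates.
coords = PCA(n_components=2).fit_transform(X)

# If the groups separate along the leading component, the dataset likely
# contains subpopulation structure worth interrogating before training.
gap = coords[:100, 0].mean() - coords[100:, 0].mean()
print(f"separation along PC1: {abs(gap):.1f}")
```

Swapping `PCA` for t-SNE or UMAP follows the same fit-transform pattern; the nonlinear methods tend to reveal finer-grained clusters, duplicates, and outliers at the cost of less interpretable axes.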
In a recent project characterizing an aspect of the adaptive immune response to SARS-CoV-2, we utilized such an embedding space to help fairly represent global immune response to viral proteins. How different people respond to a pathogen is in part determined by which versions of major histocompatibility complex (MHC) proteins their body produces.
These molecules play a key role in the process that enables your body’s T-cells to identify self from non-self. MHCs bind to protein fragments and present them for T-cells to inspect. Binding to MHC is therefore necessary (but not sufficient) to elicit a T-cell response. MHCs are also highly variable, and the prevalence of their variations exhibit regional patterns. So, to fairly represent a global immune response we used an embedding space to identify similarly functioning MHC groups and ensured that each group was represented in the analysis, even if none of its members were of high frequency globally. This way, minority populations with potentially unique immune responses were not excluded.
There are, of course, many useful tools beyond the scope of our limited examples. For a more comprehensive exploration of data visualization techniques and good practices, a great curated collection of key literature is available here.
Once data issues are identified, it becomes the duty of ML practitioners to be transparent with all stakeholders of a project, to consult with domain experts regarding the importance of the data gaps and imbalances, and to advocate for additional data collection. Where additional data collection is not possible, they should build robustness into the system and communicate known issues transparently at each system iteration.
Introduce robustness in the face of data imbalance
When the ability to collect new data is limited, there is a constantly evolving menu of design choices that can be helpful in making a system more robust and more transparent in the face of data gaps and imbalances. The frontiers of ML research continue to evolve in this domain and staying abreast of innovations is essential. But fundamentally, it is important to always remember that if an example of a specific phenomenon is not adequately represented in data, there will always be a risk of failure or unpredictable behavior around such cases.
Quantify model uncertainty
For many tasks it is very unlikely that training examples effectively capture the complete range of possibilities that may occur once a model is deployed. This can be due to dataset biases but may also simply be due to the extreme rarity of certain examples, even in light of considerable data-gathering effort. To help prevent models from making arbitrary predictions in such under-constrained test cases, it is critical to capture an estimate of model uncertainty.
While there are many options in the literature, one simple yet effective approach is to use ensembles of models (or dropout in neural networks) to obtain a variance for each prediction. We demonstrated the effectiveness of this approach in our own work: using uncertainty from a model ensemble to identify when classification was ambiguous, and showing that accuracy improved across multiple metrics when the system abstained from prediction in ambiguous cases. The MHC binding prediction systems in which this was applied were a key component enabling our work on tracking the latent potential for T-cell evasion across evolving SARS-CoV-2 variants.
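The ensemble-with-abstention idea can be sketched in a few lines. This is not the pipeline from the work described above, just a toy illustration: a bootstrap ensemble of logistic regressions on fabricated 2-D data, where the spread of predicted probabilities across members serves as the uncertainty signal, and the model abstains on the most uncertain fraction of inputs (here, an assumed 20%).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy binary task: two overlapping Gaussian classes (hypothetical data).
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

# Bootstrap ensemble: each member is trained on a resampled dataset.
members = []
for _ in range(10):
    idx = rng.integers(0, len(X), len(X))
    members.append(LogisticRegression().fit(X[idx], y[idx]))

# Per-example uncertainty = spread of predicted probabilities across members.
probs = np.stack([m.predict_proba(X)[:, 1] for m in members])
mean_p, std_p = probs.mean(axis=0), probs.std(axis=0)

# Abstain on the 20% of examples where the ensemble disagrees most.
abstain = std_p >= np.quantile(std_p, 0.8)
preds = (mean_p > 0.5).astype(int)
covered = ~abstain

# Illustrative evaluation (on training data, for brevity only).
acc_all = (preds == y).mean()
acc_covered = (preds[covered] == y[covered]).mean()
print(f"accuracy on all: {acc_all:.2f}, on confident subset: {acc_covered:.2f}")
```

In a neural-network setting the same recipe applies with multiple dropout-enabled forward passes in place of the bootstrap members.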
Consider weighting sample importance
When training models in the presence of data imbalance, another simple but useful tactic is to increase the importance of under-represented examples, so that they have more impact on the solution learned through model optimization. This can be achieved through data resampling; however, an equally simple, model-agnostic alternative with fewer potential side effects is a smoothed data re-weighting scheme.
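One such smoothed scheme (an assumption here, not necessarily the one used in the work described below) is the "effective number of samples" heuristic, which interpolates between uniform weights and inverse-frequency weights via a single parameter. The sketch below computes per-class weights for a fabricated 95/5 imbalanced label set.

```python
import numpy as np

def smoothed_weights(labels, beta=0.999):
    """Class-balanced weights via the 'effective number of samples'
    heuristic: weight_c is proportional to (1 - beta) / (1 - beta**n_c).
    beta=0 gives uniform weights; beta -> 1 approaches inverse frequency."""
    classes, counts = np.unique(labels, return_counts=True)
    eff_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    w = 1.0 / eff_num
    w = w / w.sum() * len(classes)  # normalize so weights average to 1
    return dict(zip(classes.tolist(), w.tolist()))

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
labels = np.array([0] * 95 + [1] * 5)
weights = smoothed_weights(labels)
print(weights)  # the minority class receives a much larger weight
```

The resulting per-class weights can be mapped to per-example weights and passed to most training routines (e.g., a `sample_weight` argument or a weighted loss), without resampling the data itself.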
As with uncertainty quantification, we utilized the approach suggested here to address imbalance across multiple aspects of data in our own work. The method was used when training MHC binding prediction models to simultaneously weight positive and negative binding examples for each distinct MHC; as well as to address imbalance of total example counts between MHC molecules considered.
Check the rapidly evolving research literature
ML system robustness and safety are currently very active areas of research, and many alternative pathways to improve performance in the face of data issues are being explored. For instance, predictions or recommendations based on majority populations may be completely irrelevant for a minority group. Distributionally Robust Optimization (DRO), for example, aims to optimize task performance for the worst-performing sub-population identified in the training data. There is also a wealth of strategies that may be useful for specific applications. For example, adversarial training can encourage models to be insensitive to aspects of data (such as dataset source, assay methodology, age, gender, etc.) that one may wish to specifically ignore.
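To make the group-DRO idea concrete, the sketch below shows one exponentiated-gradient update of per-group training weights toward the worst-performing group, the core move behind online group DRO. The per-group losses and step size are made-up values for illustration; in a real training loop this update would run each step alongside the model optimizer.

```python
import numpy as np

def group_dro_weights(group_losses, step=1.0, q=None):
    """One exponentiated-gradient update of group weights: groups with
    higher loss are upweighted, so the training objective focuses on
    the worst-performing sub-population."""
    g = np.asarray(group_losses, dtype=float)
    if q is None:
        q = np.ones_like(g) / len(g)  # start from uniform group weights
    q = q * np.exp(step * g)          # upweight high-loss groups
    return q / q.sum()                # renormalize to a distribution

# Hypothetical per-group losses: the minority group (index 2) does worst.
q = group_dro_weights([0.2, 0.3, 1.5])
print(q)  # most of the training weight shifts to the high-loss group
```

Repeating this update while minimizing the weighted loss drives the model toward good worst-group performance rather than good average performance.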
For ML teams, the annual proceedings of NeurIPS are always a fantastic place to begin a review of the latest breadth of available options. In 2021, an entire NeurIPS workshop was devoted to discussing algorithmic fairness and robustness. It is worth repeating that no single advancement is likely to fully resolve the issue, which ultimately underscores the importance of transparency on the part of those developing and deploying ML systems.
Transparency about limitations
The discovery and correction of system biases will continue to be an ongoing and iterative pursuit. Obfuscating system limitations only serves to delay discovery and compound potential negative consequences over time. It is therefore important that teams responsible for ML deployments communicate transparently about all known system assumptions and limitations. This includes clear documentation of any assumptions or approximations in the definition of the task, implicit assumptions based on model choice, as well as any poorly characterized domains in data used to train the system (even if the data itself is not made public).
Not only does this help users to steer clear of employing a system outside the boundaries of its intended application; it also speeds the discovery of potential issues as the disclosures are scrutinized by users with an increasingly large breadth of expertise.
Continue to validate and iterate
There are no comprehensive approaches to “solve” the issues of bias in computing. In addition, our social constructs, categories, and concepts such as fairness are also constantly evolving. This means that to continue to address these issues we will always need to remain open to feedback and continue to iterate not only on algorithms, but also in how we define problems, how we gather data, and how we collaborate with different stakeholders. An increasing level of discussion and awareness throughout the ML community shows promise that we are headed towards a more inclusive ML future, but it is the responsibility of all of us to continue to be mindful, collaborative, and transparent at all levels of our R&D efforts.