Algorithms
  • Article
  • Open Access

28 March 2024

Challenges in Reducing Bias Using Post-Processing Fairness for Breast Cancer Stage Classification with Deep Learning

Hawaii Health Digital Lab, Information and Computer Science, University of Hawaii at Manoa, Honolulu, HI 96822, USA
Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Machine Learning in Medical Signal and Image Processing (2nd Edition)

Abstract

Breast cancer is the most common cancer affecting women globally. Despite the significant impact of deep learning models on breast cancer diagnosis and treatment, achieving fairness or equitable outcomes across diverse populations remains a challenge when some demographic groups are underrepresented in the training data. We quantified the bias of models trained to predict breast cancer stage from a dataset consisting of 1000 biopsies from 842 patients provided by AIM-Ahead (Artificial Intelligence/Machine Learning Consortium to Advance Health Equity and Researcher Diversity). Notably, the majority of data (over 70%) were from White patients. We found that prior to post-processing adjustments, all deep learning models we trained consistently performed better for White patients than for non-White patients. After model calibration, we observed mixed results, with only some models demonstrating improved performance. This work provides a case study of bias in breast cancer medical imaging models and highlights the challenges in using post-processing to attempt to achieve fairness.

1. Introduction

Cancer is the second leading cause of mortality worldwide. Breast cancer, lung cancer, and colorectal cancer account for 51% of all new cancer diagnoses among women, with breast cancer alone accounting for 32% of new cases. However, the burden of breast cancer is not consistent across demographic groups; for example, the death rate for Black women is 41% higher than for White women [].
Recent advancements in deep learning have led to the use of deep neural networks, such as convolutional neural networks (CNNs), for breast cancer prediction. This field is relatively vast, with several models developed to classify benign and malignant tumors as well as to classify the stage of cancer [,].
Unfortunately, the use of artificial intelligence (AI) for cancer diagnostics may increase health disparities []. Because AI models are trained using differing amounts of data for each demographic group, they have the potential to lead to unfair predictions for under-represented groups [,,,,,,].
Three broad classes of algorithms have been investigated to mitigate bias in algorithmic fairness: pre-processing, in-processing, and post-processing. Pre-processing involves changing the data, such as by generative data augmentation, to create equal amounts of data for each demographic group prior to training the model [,]. In-processing methods change the learning algorithm’s optimization objective function to enforce a reduction in bias during the training process. These two categories of techniques can function well if modifications to the underlying data or training process are allowed [,].
The final category of methods, post-processing, is applied after the model has been trained, using a separate set of data that was not used during the training phase. Such “black box” approaches are ideal when modifying the original AI model is impossible or infeasible []. In this work, we explore the utility of applying post-processing fairness adjustments to breast cancer stage classification using medical imaging data, testing whether standard post-processing methods adapted to the multi-class setting can mitigate bias in these models.
We structure the remainder of the paper as follows: Section 2 provides a description of the AIM-Ahead dataset we used, the fairness metrics we measured, and the deep learning models we trained. Section 3 reports the results of our analyses, characterizing biases that occur across demographic groups and describing the results of post-processing fairness modifications. Section 4 discusses the high-level implications of this work.

2. Materials and Methods

2.1. Dataset

We used a dataset from AIM-Ahead containing whole slide images from 1000 breast biopsies performed on 842 patients between 2014 and 2020 []. Each element of the dataset corresponds to an individual biopsy.
These high-resolution images, with dimensions of 100,000 × 150,000 pixels, are stored as NDPI files, averaging about 2 GB each. We used 10,856 whole slide images generated by the 1000 biopsies, an average of roughly eleven images per biopsy. Each slide is labeled according to the cancer stage associated with the biopsy. A total of 94% of these stage determinations were made within one month of the biopsy procedure [].
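As an illustration of how such slides can be loaded, the sketch below reads an NDPI file with the openslide-python package and extracts a downsampled region; the file path, pyramid level, and tile size are hypothetical, and this is not necessarily the exact pipeline used in this work.

```python
# Minimal sketch (not the authors' pipeline): reading a Hamamatsu NDPI whole-slide
# image and extracting a downsampled region for tiling. Assumes the openslide-python
# package is installed; the file path is hypothetical.
import openslide

slide = openslide.OpenSlide("example_biopsy.ndpi")  # hypothetical path
print(slide.dimensions)         # full-resolution (width, height), e.g. ~100,000 x 150,000
print(slide.level_downsamples)  # available pyramid downsample factors

# Read a 1024 x 1024 region at a lower pyramid level to keep memory manageable.
level = min(2, slide.level_count - 1)
region = slide.read_region(location=(0, 0), level=level, size=(1024, 1024))
region = region.convert("RGB")  # drop the alpha channel before feeding a CNN
slide.close()
```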
We randomly divided patients into two groups, with 80% of the data used for training and the remaining 20% reserved for evaluation. The dataset composition for binary classification is depicted in Table 1. The training sub-dataset consists of 328 biopsies collected from 234 patients, containing a total of 3273 slide images. The held-out dataset includes 41 biopsies from 41 patients and 367 slide images. We assigned a label of 1 to patients with stage 3 or 4 cancer and a label of 0 to patients who show no signs of cancer.
Table 1. Data distribution of training, validation, and test sets for the binary classification of no cancer from advanced-stage cancer.
Table 2 provides a breakdown of the training and held-out test sets when splitting the data according to a multi-stage classification formulation. In this case, we assigned a label of 0 to patients with stage 0 cancer, a label of 1 to patients with stage 1 or 2 cancer, and a label of 2 to patients with stage 3 or 4 cancer.
Table 2. Data distribution of training, validation, and test sets for the multi-class classification formulation.
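The label mappings described above can be summarized in a short sketch; the column names and the encoding of "no cancer" as stage 0/None are assumptions for illustration.

```python
# Sketch of the binary and multi-class label mappings described above.
import pandas as pd

def binary_label(stage):
    """1 for advanced-stage cancer (stage 3 or 4), 0 for no cancer; other stages excluded."""
    if stage in (3, 4):
        return 1
    if stage == 0 or stage is None:   # assumption: 'no cancer' encoded as 0/None
        return 0
    return None  # stages 1-2 are not used in the binary formulation

def multiclass_label(stage):
    """0 = stage 0, 1 = stage 1 or 2, 2 = stage 3 or 4."""
    if stage == 0:
        return 0
    if stage in (1, 2):
        return 1
    if stage in (3, 4):
        return 2
    return None

df = pd.DataFrame({"patient_id": [1, 2, 3], "stage": [0, 2, 4]})  # toy example
df["y_binary"] = df["stage"].map(binary_label)
df["y_multi"] = df["stage"].map(multiclass_label)
```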

2.2. Machine Learning Models

We evaluated several widely used CNN architectures (Figure 1) for the classification of breast cancer stages from histopathological images. These architectures include VGG, EfficientNet, ConvNeXt, RegNet, and variations of ResNet, including ResNet18, ResNet50, Wide ResNet101, and ResNet152. VGG stands out for its depth and its use of numerous small-receptive-field filters that capture fine details. EfficientNet scales CNNs using a compound coefficient for balanced efficiency. ConvNeXt adapts Transformer design principles to convolutional architectures, often enhancing performance. RegNet optimizes network structures for favorable performance/complexity ratios.
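As a rough sketch of how such backbones can be instantiated, the snippet below uses torchvision and replaces each model's final layer with a task-specific output; the choice of pretrained weights and the fine-tuning setup are assumptions rather than the exact configuration used in this study.

```python
# Sketch: instantiating a few of the CNN backbones named above with torchvision
# and swapping the final layer for a 2-class (or 3-class) head.
import torch.nn as nn
from torchvision import models

def build_backbone(name: str, num_classes: int) -> nn.Module:
    if name == "resnet50":
        m = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        m.fc = nn.Linear(m.fc.in_features, num_classes)
    elif name == "vgg16":
        m = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        m.classifier[-1] = nn.Linear(m.classifier[-1].in_features, num_classes)
    elif name == "efficientnet_b0":
        m = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
        m.classifier[-1] = nn.Linear(m.classifier[-1].in_features, num_classes)
    elif name == "convnext_tiny":
        m = models.convnext_tiny(weights=models.ConvNeXt_Tiny_Weights.DEFAULT)
        m.classifier[-1] = nn.Linear(m.classifier[-1].in_features, num_classes)
    else:
        raise ValueError(f"unknown backbone: {name}")
    return m

model = build_backbone("resnet50", num_classes=2)
```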
Figure 1. We used CNN models for image feature extraction and classification. We then applied post-processing strategies in an attempt to reduce bias. Finally, we evaluated the models using traditional algorithmic fairness metrics.
While we explored the possibility of training more modern model architectures, particularly ViT and Swin-ViT, on this dataset, our early attempts did not yield satisfactory results. This is likely due to the limited number of samples in the dataset, which makes highly parameterized models difficult to train effectively, as highlighted by Zhu et al. []. We therefore did not pursue such architectures in our analysis.
Our Slide Level Classifier, depicted in Figure 2, is tailored specifically to biomedical image data. We used Clustering-constrained Attention Multiple Instance Learning (CLAM). This weakly supervised method employs attention-based learning to automatically identify sub-regions of high diagnostic value to classify the whole slide. CLAM uses instance-level clustering over the representative regions identified to constrain and refine the feature space []. After retrieving features, we added two fully connected layers, with the first layer mapping the feature inputs to a 512-node hidden layer with ReLU activation. The second layer transforms the representation to the number of target classes. The classifier is further enhanced with feature pooling methods—average and max pooling—to synthesize information from the tile-level data of the slide images into a cohesive feature vector, which is then used for classification.
Figure 2. The workflow and architecture of our Slide Level Classifier: feature extraction, classification, and fairness-centered post-processing.
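A minimal PyTorch sketch of the slide-level head described above is given below; the tile feature dimensionality and the concatenation of the average- and max-pooled vectors are assumptions for illustration, not the exact implementation.

```python
# Sketch of the slide-level head: tile-level feature vectors (e.g. extracted with CLAM)
# are pooled into one slide-level vector and passed through a 512-unit ReLU hidden layer.
import torch
import torch.nn as nn

class SlideLevelClassifier(nn.Module):
    def __init__(self, feature_dim: int = 1024, num_classes: int = 2):
        super().__init__()
        # Assumption: the average- and max-pooled feature vectors are concatenated.
        self.fc1 = nn.Linear(2 * feature_dim, 512)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, tile_features: torch.Tensor) -> torch.Tensor:
        # tile_features: (num_tiles, feature_dim) for a single slide
        avg_pooled = tile_features.mean(dim=0)
        max_pooled = tile_features.max(dim=0).values
        slide_vector = torch.cat([avg_pooled, max_pooled], dim=0)
        hidden = torch.relu(self.fc1(slide_vector))
        return self.fc2(hidden)  # unnormalized class logits

logits = SlideLevelClassifier()(torch.randn(500, 1024))  # 500 tiles, 1024-dim features
```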
We also constructed an Ensemble model, which averages the predictions from all other models to produce a final outcome.
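A minimal sketch of this averaging step, assuming each model outputs class probabilities of shape (num_samples, num_classes):

```python
# Sketch of the ensemble: average the class probabilities predicted by the
# individual models, then take the most probable class.
import numpy as np

def ensemble_predict(probs_per_model: list[np.ndarray]) -> np.ndarray:
    avg_probs = np.mean(np.stack(probs_per_model, axis=0), axis=0)
    return avg_probs.argmax(axis=1)
```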

2.3. Fairness Definitions

Fairness metrics are crucial tools for evaluating models and ensuring equitable treatment across all demographic groups, irrespective of race, gender, or other protected characteristics. We describe the two common fairness metrics that we used to evaluate the bias of our models.

2.3.1. Equalized Odds

Equalized odds is a fairness measurement for predictive models, ensuring that a predictor $\hat{Y}$ is independent of any protected attribute A given the true outcome Y. The measurement requires equal true positive and false positive rates across demographics in binary and multi-class settings. The purpose of equalized odds is to ensure that no group is unfairly advantaged or disadvantaged by the predictions.
Definition 1.
For binary variables, equalized odds is defined as:
\Pr(\hat{Y} = 1 \mid A = 0, Y = y) = \Pr(\hat{Y} = 1 \mid A = 1, Y = y), \quad \forall\, y \in \{0, 1\}
This metric aligns with the goal of training classifiers that perform equitably across all demographics [].

2.3.2. Equal Opportunity

In binary classification, Y = 1 often represents a positive outcome, like loan repayment, college admission, or promotion. Equal opportunity is a criterion derived from equalized odds, focusing only on the advantaged group. It requires non-discrimination within this group, ensuring that those who achieve the positive outcome Y = 1 have an equal probability of doing so, regardless of the protected attribute A. This is less stringent than equalized odds and often leads to better utility.
Definition 2.
Equal opportunity for a binary predictor $\hat{Y}$ is defined as
\Pr(\hat{Y} = 1 \mid A = 0, Y = 1) = \Pr(\hat{Y} = 1 \mid A = 1, Y = 1)
This condition mandates an equal True Positive Rate (TPR) for different demographic groups without imposing requirements on the False Positive Rate (FPR), thus allowing for potentially greater overall utility of the predictor [].
We define FPR and TPR as follows, using TP to denote a True Positive, FP to denote a False Positive, TN to denote a True Negative, and FN to denote a False Negative:
\mathrm{FPR} = \frac{FP}{FP + TN}
\mathrm{TPR} = \frac{TP}{TP + FN}
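The following sketch computes these per-group rates and the corresponding equal opportunity and equalized odds gaps for a binary label and a binary protected attribute; it illustrates the definitions above rather than the exact evaluation code used in this work.

```python
# Per-group FPR/TPR from the definitions above, plus the equalized-odds and
# equal-opportunity gaps used to quantify bias (binary labels, binary group attribute).
import numpy as np

def group_rates(y_true, y_pred, group, value):
    mask = group == value
    tp = np.sum((y_pred == 1) & (y_true == 1) & mask)
    fp = np.sum((y_pred == 1) & (y_true == 0) & mask)
    tn = np.sum((y_pred == 0) & (y_true == 0) & mask)
    fn = np.sum((y_pred == 0) & (y_true == 1) & mask)
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0
    return tpr, fpr

def fairness_gaps(y_true, y_pred, group):
    tpr0, fpr0 = group_rates(y_true, y_pred, group, 0)
    tpr1, fpr1 = group_rates(y_true, y_pred, group, 1)
    equal_opportunity_gap = abs(tpr0 - tpr1)
    equalized_odds_gap = max(abs(tpr0 - tpr1), abs(fpr0 - fpr1))
    return equal_opportunity_gap, equalized_odds_gap
```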

2.4. Fairness Adjustments

We build our fairness adjustment method upon previous post-processing algorithmic fairness work. Hardt et al. [] propose a method that adjusts the model's outputs to ensure fairness when there are only two possible outcomes. Putzel et al. [] adapt this method to situations with more than two outcomes, such as the breast cancer stage classification task that we study here. To mitigate the issue of sparse samples for some groups, as is the case with our dataset, we introduce a minor adjustment, an epsilon term, to the TPR and FPR calculations to avoid division errors. By analyzing predicted and true labels alongside sensitive attributes such as race, we construct 'adjusted' predictions that meet predefined fairness criteria. The resulting predictors aim to balance false positive and true positive rates (for equalized odds) or synchronize true positive rates (for equal opportunity) to ensure fairness across different demographic groups.
We leverage ROC curves to discern optimal fairness thresholds. Aligning ROC curves across groups leads to predictors that fulfill equalized odds, whereas mismatches may necessitate varying thresholds or probabilistic adjustments to achieve fair treatment. We identify optimal predictors by analyzing the intersections of group-specific convex hulls formed from these ROC curves. We manipulate conditional probabilities within the protected attribute conditional probability matrices through linear programming, optimizing against a fairness-oriented loss function. This process also incorporates an element of flexibility, allowing the loss function to penalize inaccuracies differently based on protected group membership.
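To make the optimization concrete, the sketch below solves the binary-outcome version of this linear program (in the spirit of Hardt et al.) with scipy; the group-level rates passed in are placeholder values, and the multi-class adjustment used in this work generalizes these flip probabilities to the conditional confusion matrices described next.

```python
# Illustrative sketch of a Hardt et al.-style post-processing step in the binary case,
# solved as a linear program. Variables are p[a, yhat] = Pr(Y_adj = 1 | Yhat = yhat, A = a).
import numpy as np
from scipy.optimize import linprog

def fit_equalized_odds(tpr, fpr, p_y1, p_a):
    """tpr[a], fpr[a]: base rates of the unadjusted model per group a in {0, 1};
    p_y1[a] = Pr(Y=1 | A=a); p_a[a] = Pr(A=a)."""
    # Variable order: [p(0,0), p(0,1), p(1,0), p(1,1)] with p(a, yhat).
    c = np.zeros(4)
    for a in range(2):
        p_y0 = 1.0 - p_y1[a]
        # Coefficients of the expected error of the adjusted predictor.
        c[2 * a + 0] = p_a[a] * (-p_y1[a] * (1 - tpr[a]) + p_y0 * (1 - fpr[a]))
        c[2 * a + 1] = p_a[a] * (-p_y1[a] * tpr[a] + p_y0 * fpr[a])
    # Equalized odds: adjusted TPR and FPR must match across the two groups.
    A_eq = np.array([
        [1 - tpr[0], tpr[0], -(1 - tpr[1]), -tpr[1]],
        [1 - fpr[0], fpr[0], -(1 - fpr[1]), -fpr[1]],
    ])
    res = linprog(c, A_eq=A_eq, b_eq=np.zeros(2), bounds=[(0, 1)] * 4)
    return res.x.reshape(2, 2)  # p[a, yhat]

# Toy usage with made-up group statistics:
p = fit_equalized_odds(tpr=[0.85, 0.70], fpr=[0.20, 0.35],
                       p_y1=[0.3, 0.3], p_a=[0.78, 0.22])
```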
Our fair predictors ensure a balanced representation of demographic groups by equalizing various fairness metrics. We explore two different multi-class fairness criteria, although the method could generalize to other fairness metrics as well.
We aim to minimize the same expected loss function for multi-class classification that was used by Putzel et al. []:
\mathbb{E}\left[\ell(\hat{y}_{\mathrm{adj}}, y)\right] = \sum_{\alpha \in A} \sum_{i=1}^{|C|} \sum_{j \neq i} W_{ij}^{\alpha} \, \Pr(A = \alpha, Y = j) \, \ell(i, j, \alpha)
where $W_{ij}^{\alpha} = \Pr(\hat{Y}_{\mathrm{adj}} = i \mid \hat{Y} = j, A = \alpha)$ are the protected attribute conditional confusion matrices.
To preserve fairness at the individual prediction level, we adopt a stochastic approach: instead of simply selecting the most probable class, we construct predictions by sampling from the adjusted probabilities. Due to insufficient sample sizes within some demographic groups, we encountered instances of zero values for the FP, TP, FN, and TN counts. To implement our method, we used existing software for calculating fairness metrics, originally developed for binary classification [,]. We add an epsilon term (0.001) to the denominators formed from these four counts to prevent division errors when calculating the confusion matrix and the fairness metrics (equalized odds and equal opportunity).
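A minimal sketch of the stochastic prediction step and the epsilon smoothing is shown below; the variable names are illustrative, with W[a] holding the adjusted conditional probabilities for group a.

```python
# Sketch of the stochastic prediction step and the epsilon-smoothed rate calculation.
# W[a] holds the adjusted conditional probabilities Pr(Y_adj = i | Yhat = j, A = a).
import numpy as np

EPS = 1e-3  # added to rate denominators to avoid division by zero for sparse groups

def safe_rate(numer, denom):
    return numer / (denom + EPS)

def sample_adjusted_predictions(y_pred, group, W, rng=None):
    """Draw each adjusted label from the column of W selected by the original
    prediction and the protected attribute, instead of taking the argmax."""
    rng = np.random.default_rng(rng)
    num_classes = next(iter(W.values())).shape[0]
    adjusted = np.empty_like(y_pred)
    for idx, (yhat, a) in enumerate(zip(y_pred, group)):
        adjusted[idx] = rng.choice(num_classes, p=W[a][:, yhat])
    return adjusted
```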

2.5. Evaluation Procedure

To ensure statistical robustness, we employ 50 iterations of a bootstrapping approach. During each iteration, we randomly select a subset comprising half of the test samples. This subset is used to compute the FPR and TPR for White and non-White patient groups across all models.
We determine the mean, standard deviation, and confidence intervals of these metrics, allowing for a comparative analysis between the White and non-White cohorts. We apply the t-test to measure the statistical significance of the observed differences across groups.
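The sketch below illustrates this procedure for the TPR, reusing the group_rates helper sketched in Section 2.3; whether the half-size subsets are drawn with or without replacement is an assumption.

```python
# Sketch of the bootstrap evaluation: 50 iterations, each drawing half of the test set,
# computing per-group TPR (group_rates as sketched in Section 2.3), then comparing
# groups with an independent t-test. Inputs are expected to be numpy arrays.
import numpy as np
from scipy.stats import ttest_ind

def bootstrap_tpr(y_true, y_pred, group, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    tpr_white, tpr_nonwhite = [], []
    for _ in range(n_iter):
        idx = rng.choice(n, size=n // 2, replace=False)  # assumption: without replacement
        t0, _ = group_rates(y_true[idx], y_pred[idx], group[idx], value=0)  # White
        t1, _ = group_rates(y_true[idx], y_pred[idx], group[idx], value=1)  # non-White
        tpr_white.append(t0)
        tpr_nonwhite.append(t1)
    stat, p_value = ttest_ind(tpr_white, tpr_nonwhite)
    return np.mean(tpr_white), np.mean(tpr_nonwhite), p_value
```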

3. Results

Table 3 presents a comparative analysis, prior to fairness adjustments, of several binary classification deep learning models based on their performance metrics across two demographic stratifications of the dataset: White and non-White groups. We observe a consistent trend of higher binary accuracy, precision, and recall for the White group across all models. The Ensemble model achieves relatively high precision and recall for the White group but exhibits a significant drop in performance for the non-White group, especially in terms of accuracy and F1-score. These findings highlight the disparities in model performance for under-represented demographic groups and emphasize the need for more balanced and fair machine learning algorithms. Figure 3 illustrates this disparity in FPR and TPR between groups across the various CNN models.
Table 3. Comparison of performance metrics across models for White and non-White groups prior to fairness adjustments.
Figure 3. FPR and TPR for several binary deep learning models, distinguishing between White and non-White group performance.
Table 4 presents the results of independent t-tests conducted to compare the FPR and TPR between groups across various models before applying post-processing adjustment. The majority of the models show a statistically significant difference in FPR, highlighting concerns regarding biases in model performances across demographic groups. Although we did not consistently observe statistical significance at the 0.05 p-value cutoff for TPR, we note that the trend was always towards better performance for White groups, and some models still showed statistically significant differences in TPR. There were no models where the trend was reversed: that is, no models where the performance was better for the non-White groups.
Table 4. Results of independent t-tests comparing FPR and TPR between White (n = 32 in the test set) and non-White (n = 9 in the test set) groups across different models before applying post-processing adjustment.
Table 5 and Figure 4 offer a comprehensive view of model performance before and after fairness adjustments in the binary classification setting. Notably, we do not observe consistent improvements in either FPR or TPR post-adjustment.
Table 5. Model performance before and after fairness adjustments in the binary classification setting. We compare the FPR, TPR, and loss values before adjustment and after applying two post-processing procedures optimized for equalized odds and equalized opportunity.
Figure 4. Comparative analyses across eight machine learning models, demonstrating the impact of fairness adjustments on the FPR, TPR, and loss. (a) FPR comparisons, (b) TPR comparisons, (c) loss comparisons. We do not observe consistent trends.
Table 6 provides a comparison of performance metrics for several models in the multi-class setting. The analysis was conducted across White and non-White groups for the three labels. We did not observe consistent discrepancies in performance between the White and non-White groups in the multi-class formulation, but we are hesitant to draw conclusions from this due to the low overall performance of the models in the multi-class setting.
Table 6. Comparison of performance metrics before post-processing adjustment for the multiple class formulation, stratified by race.
Figure 5 and Table 7 present a comparative analysis of the performance metrics for the deep learning models before and after fairness adjustments in the multi-class setting. We once again do not observe any improvements in performance after the fairness adjustments, but we are hesitant to draw any conclusions from the multi-class results due to the low overall baseline performance.
Figure 5. Comparative analysis of multi-class model performance across several architectures.
Table 7. Model performance before and after fairness adjustments in the multi-class classification setting.

4. Discussion

We observe biases in the performance of the binary classification models, which consistently perform better on test data corresponding to White individuals. Our work adds further evidence to a wide body of prior work [,,] demonstrating that, without care, the integration of AI into diagnostic workflows may amplify existing healthcare disparities.
The lack of consistent disparity reductions after fairness adjustments highlights the challenges in applying post-processing techniques to reduce bias in machine learning models trained on medical imaging data. By adjusting the models after training, we had hoped to improve the equity of AI-enabled diagnostics across racial groups. However, in our experiments, these methods did not reliably reduce disparities for deep learning models applied to medical imaging.
The primary limitations of this study are (1) the small size of our evaluation dataset and (2) the possible lack of generalizability of our findings due to the use of only one dataset for evaluation. Future research on post-processing fairness in medical imaging would benefit from multi-site datasets with a larger number of patients covering a broader range of demographic attributes. Another major limitation is that we grouped all non-White patients into a single category for the fairness analyses because no race other than White was sufficiently represented. A more robust analysis would include performance metrics for each individual race; however, such an analysis requires more samples for the under-represented groups, posing a 'chicken-and-egg' problem. These limitations collectively render our study a preliminary analysis that should be followed up with more expansive experimentation.
Another interesting area of future work would be studying the explainability of the models in conjunction with fairness. Such a study could aid in the understanding of how different models arrive at their predictions and whether the reasons for arriving at a particular prediction are different across groups.

Author Contributions

Methodology, software, validation, and writing—original draft preparation, visualization, A.S.; conceptualization, funding acquisition, investigation, resources, supervision, writing—review and editing, P.W. All authors have read and agreed to the published version of the manuscript.

Funding

The project described was supported by the National Institute on Minority Health and Health Disparities of the National Institutes of Health under Award Number U54MD007601. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Data Availability Statement

The dataset used in this study is the publicly available Nightingale Open Science Dataset, which can be found at the following website: https://app.nightingalescience.org/contests/8lo46ovm2g1j#dataset, accessed on 7 January 2024. All code is publicly available on GitHub (https://github.com/arminsoltan/AI_fairness_multiple_classification/tree/main (accessed on 26 March 2024)).

Acknowledgments

We used a combination of ChatGPT and Grammarly to edit the grammar of our manuscript and to re-phrase sentences that were originally worded unclearly. However, all content in this manuscript is original, reflecting the original analyses conducted by the authors.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

References

  1. Siegel, R.L.; Giaquinto, A.N.; Jemal, A. Cancer statistics, 2024. CA Cancer J. Clin. 2024, 74, 12–49. [Google Scholar] [CrossRef] [PubMed]
  2. Golatkar, A.; Anand, D.; Sethi, A. Classification of breast cancer histology using deep learning. In Proceedings of the Image Analysis and Recognition: 15th International Conference, ICIAR 2018, Póvoa de Varzim, Portugal, 27–29 June 2018; Proceedings 15. Springer: Berlin/Heidelberg, Germany, 2018; pp. 837–844. [Google Scholar]
  3. Spanhol, F.A.; Oliveira, L.S.; Petitjean, C.; Heutte, L. Breast cancer histopathological image classification using Convolutional Neural Networks. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 2560–2567. [Google Scholar] [CrossRef]
  4. Boag, W.; Suresh, H.; Celi, L.A.; Szolovits, P.; Ghassemi, M. Racial Disparities and Mistrust in End-of-Life Care. Proc. Mach. Learn. Res. 2018, 85, 587–602. [Google Scholar]
  5. Adamson, A.S.; Smith, A. Machine Learning and Health Care Disparities in Dermatology. JAMA Dermatol. 2018, 154, 1247–1248. [Google Scholar] [CrossRef] [PubMed]
  6. Rajkomar, A.; Hardt, M.; Howell, M.; Corrado, G.; Chin, M. Ensuring Fairness in Machine Learning to Advance Health Equity. Ann. Intern. Med. 2018, 169, 866–887. [Google Scholar] [CrossRef] [PubMed]
  7. Velagapudi, L.; Mouchtouris, N.; Baldassari, M.; Nauheim, D.; Khanna, O.; Al Saiegh, F.; Herial, N.; Gooch, M.; Tjoumakaris, S.; Rosenwasser, R.; et al. Discrepancies in Stroke Distribution and Dataset Origin in Machine Learning for Stroke. J. Stroke Cerebrovasc. Dis. Off. J. Natl. Stroke Assoc. 2021, 30, 105832. [Google Scholar] [CrossRef] [PubMed]
  8. Berger, T.R.; Wen, P.Y.; Lang-Orsini, M.; Chukwueke, U.N. World Health Organization 2021 classification of central nervous system tumors and implications for therapy for adult-type gliomas: A review. JAMA Oncol. 2022, 8, 1493–1501. [Google Scholar] [CrossRef] [PubMed]
  9. Bardhan, I.R.; Chen, H.; Karahanna, E. Connecting systems, data, and people: A multidisciplinary research roadmap for chronic disease management. Manag. Inf. Syst. Q. 2020, 44, 185–200. [Google Scholar]
  10. Bostrom, N.; Yudkowsky, E. The ethics of artificial intelligence. In The Cambridge Handbook of Artificial Intelligence; Frankish, K., Ramsey, W.M., Eds.; Cambridge University Press: Cambridge, UK, 2014; pp. 316–334. [Google Scholar] [CrossRef]
  11. Futoma, J.; Simons, M.; Panch, T.; Doshi velez, F.; Celi, L. The myth of generalisability in clinical research and machine learning in health care. Lancet Digit. Health 2020, 2, e489–e492. [Google Scholar] [CrossRef] [PubMed]
  12. D’Alessandro, B.; O’Neil, C.; LaGatta, T. Conscientious Classification: A Data Scientist’s Guide to Discrimination-Aware Classification. Big Data 2017, 5, 120–134. [Google Scholar] [CrossRef] [PubMed]
  13. Bellamy, R.K.E.; Dey, K.; Hind, M.; Hoffman, S.C.; Houde, S.; Kannan, K.; Lohia, P.; Martino, J.; Mehta, S.; Mojsilovic, A.; et al. AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias. arXiv 2018, arXiv:1810.01943. [Google Scholar]
  14. Berk, R.; Heidari, H.; Jabbari, S.; Joseph, M.; Kearns, M.; Morgenstern, J.; Neel, S.; Roth, A. A Convex Framework for Fair Regression. arXiv 2017, arXiv:1706.02409. [Google Scholar]
  15. Bifulco, C.; Piening, B.; Bower, T.; Robicsek, A.; Weerasinghe, R.; Lee, S.; Foster, N.; Juergens, N.; Risley, J.; Nachimuthu, S.; et al. Identifying High-risk Breast Cancer Using Digital Pathology Images: A Nightingale Open Science Dataset. Nightingale Open Science. 2021. Available online: https://docs.ngsci.org/datasets/brca-psj-path/ (accessed on 26 March 2024).
  16. Zhu, H.; Chen, B.; Yang, C. Understanding Why ViT Trains Badly on Small Datasets: An Intuitive Perspective. arXiv 2023, arXiv:2302.03751. [Google Scholar]
  17. Lu, M.Y.; Williamson, D.F.K.; Chen, T.Y.; Chen, R.J.; Barbieri, M.; Mahmood, F. Data Efficient and Weakly Supervised Computational Pathology on Whole Slide Images. arXiv 2020, arXiv:2004.09666. [Google Scholar] [CrossRef] [PubMed]
  18. Hardt, M.; Price, E.; Srebro, N. Equality of Opportunity in Supervised Learning. arXiv 2016, arXiv:1610.02413. [Google Scholar]
  19. Putzel, P.; Lee, S. Blackbox Post-Processing for Multiclass Fairness. arXiv 2022, arXiv:2201.04461. [Google Scholar]
  20. Lee, S. Scotthlee/Fairness: Now with Support for Multiclass Outcomes. 2022. Available online: https://zenodo.org/records/6127503 (accessed on 26 March 2024).
  21. Zhang, H.; Dullerud, N.; Roth, K.; Oakden-Rayner, L.; Pfohl, S.; Ghassemi, M. Improving the Fairness of Chest X-ray Classifiers. PMLR 2022, 174, 204–233. [Google Scholar]
  22. Ghassemi, M.; Mohamed, S. Machine learning and health need better values. Npj Digit. Med. 2022, 5, 51. [Google Scholar] [CrossRef] [PubMed]
  23. Chen, I.; Johansson, F.D.; Sontag, D. Why Is My Classifier Discriminatory? arXiv 2018, arXiv:1805.12002. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
