Challenges in Reducing Bias Using Post-Processing Fairness for Breast Cancer Stage Classification with Deep Learning

Breast cancer is the most common cancer affecting women globally. Despite the significant impact of deep learning models on breast cancer diagnosis and treatment, achieving fairness or equitable outcomes across diverse populations remains a challenge when some demographic groups are underrepresented in the training data. We quantified the bias of models trained to predict breast cancer stage from a dataset consisting of 1000 biopsies from 842 patients provided by AIM-Ahead (Artificial Intelligence/Machine Learning Consortium to Advance Health Equity and Researcher Diversity). Notably, the majority of data (over 70%) were from White patients. We found that prior to post-processing adjustments, all deep learning models we trained consistently performed better for White patients than for non-White patients. After model calibration, we observed mixed results, with only some models demonstrating improved performance. This work provides a case study of bias in breast cancer medical imaging models and highlights the challenges in using post-processing to attempt to achieve fairness.


Introduction
Cancer is the second leading cause of mortality worldwide. Breast cancer, lung cancer, and colorectal cancer account for 51% of all new diagnoses among women, with breast cancer alone accounting for 32% of these diagnoses. However, breast cancer outcomes are not consistent across demographic groups. For example, the death rate for Black women is 41% higher than for White women [1].
Recent advances in deep learning have led to the use of deep neural networks, such as convolutional neural networks (CNNs), for breast cancer prediction. This field is active and broad, with several models developed to classify benign and malignant tumors as well as to classify the stage of cancer [2,3].
Unfortunately, the use of artificial intelligence (AI) for cancer diagnostics may increase health disparities [4]. Because AI models are trained using differing amounts of data for each demographic group, they have the potential to produce unfair predictions for underrepresented groups [5][6][7][8][9][10][11].
The algorithmic fairness literature has investigated three broad classes of bias-mitigation methods: pre-processing, in-processing, and post-processing. Pre-processing involves changing the data, for example through generative data augmentation, to create equal amounts of data for each demographic group prior to training the model [12,13]. In-processing methods change the learning algorithm's optimization objective to enforce a reduction in bias during the training process. These two categories of techniques can work well when modifications to the underlying data or training process are allowed [13,14].
The final category of methods, post-processing, is applied after the model has been trained, using a separate set of data that was not used during the training phase. Such "black box" approaches are ideal when modifying the original AI model is impossible or infeasible [13]. In this work, we explore the utility of applying post-processing fairness adjustments to breast cancer stage classification using medical imaging data, testing whether standard post-processing methods adapted to the multi-class setting can mitigate bias in these models.
We structure the remainder of the paper as follows: Section 2 describes the AIM-Ahead dataset we used, the fairness metrics we measured, and the deep learning models we trained. Section 3 reports the results of our analyses, characterizing biases that occur across demographic groups and describing the effects of post-processing fairness modifications. Section 4 discusses the high-level implications of this work.

Dataset
We used a dataset from AIM-Ahead containing whole slide images from 1000 breast biopsies performed on 842 patients between 2014 and 2020 [15]. Each dataset element corresponds to an individual biopsy.
These high-resolution images, with dimensions of approximately 100,000 × 150,000 pixels, are stored as NDPI files averaging about 2 GB each. We used 10,856 whole slide images generated from the 1000 biopsies. Each slide is labeled according to the cancer stage associated with the biopsy. A total of 94% of these stage determinations were made within one month of the biopsy procedure [15].
We randomly divided patients into two groups, with 80% of the data used for training and the remaining 20% reserved for evaluation. The dataset composition for binary classification is shown in Table 1. The training subset consists of 328 biopsies collected from 234 patients, containing a total of 3273 slide images. The held-out dataset includes 41 biopsies from 41 patients and 367 slide images. We assigned a label of 1 to patients with stage 3 or 4 cancer and a label of 0 to patients without cancer.
Table 2 provides a breakdown of the training and held-out test sets under the multi-class classification formulation. In this case, we assigned a label of 0 to patients with stage 0 cancer, a label of 1 to patients with stage 1 or 2 cancer, and a label of 2 to patients with stage 3 or 4 cancer.
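As a concrete sketch of the label assignments above, the stage-to-label mappings can be written as two small functions (illustrative Python; the paper's actual preprocessing code is not shown here):

```python
def binary_label(stage):
    """Binary formulation (Table 1): 1 for advanced-stage cancer
    (stages 3 and 4), 0 for no cancer; other stages are excluded."""
    if stage in (3, 4):
        return 1
    if stage == 0:
        return 0
    return None  # stages 1-2 are not used in the binary task

def multiclass_label(stage):
    """Multi-class formulation (Table 2): 0 -> stage 0,
    1 -> stages 1-2, 2 -> stages 3-4."""
    return {0: 0, 1: 1, 2: 1, 3: 2, 4: 2}[stage]
```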

Machine Learning Models
We evaluate several widely used CNN architectures (Figure 1) for the classification of breast cancer stages from histopathological images. These architectures include VGG, EfficientNet, ConvNeXt, RegNet, and variants of ResNet, including ResNet18, ResNet50, Wide ResNet101, and ResNet152. VGG stands out for its depth and its use of numerous small-receptive-field filters that capture fine details. EfficientNet scales CNNs using a compound coefficient for balanced efficiency. ConvNeXt adapts Transformer design principles to convolutional architectures, often enhancing performance. RegNet optimizes network structures for favorable performance/complexity ratios.
While we explored the possibility of training more modern architectures, particularly ViT and Swin-ViT, on this dataset, our early attempts did not yield satisfactory results. This is likely due to the limited number of samples in the dataset, which renders highly parameterized models ineffective, as highlighted by Zhu et al. [16]. We therefore did not pursue such architectures in our analysis.
Our Slide Level Classifier, depicted in Figure 2, is tailored specifically to biomedical image data. We used Clustering-constrained Attention Multiple Instance Learning (CLAM), a weakly supervised method that employs attention-based learning to automatically identify subregions of high diagnostic value for classifying the whole slide. CLAM uses instance-level clustering over the identified representative regions to constrain and refine the feature space [17]. After retrieving features, we added two fully connected layers: the first maps the feature inputs to a 512-node hidden layer with ReLU activation, and the second transforms the representation to the number of target classes. The classifier also uses feature pooling (average and max pooling) to synthesize the tile-level information from the slide images into a cohesive feature vector, which is then used for classification.
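A minimal PyTorch sketch of the classification head described above follows. The 512-unit hidden layer, ReLU activation, and average/max pooling come from the text; other details (the input feature dimension and concatenating the two pooled vectors) are assumptions:

```python
import torch
import torch.nn as nn

class SlideLevelHead(nn.Module):
    """Sketch: pool tile-level features into one slide-level vector,
    then apply a 512-unit hidden layer with ReLU and a class projection."""

    def __init__(self, in_dim=1024, hidden_dim=512, num_classes=2):
        super().__init__()
        # The avg- and max-pooled vectors are concatenated (an assumption).
        self.fc1 = nn.Linear(2 * in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, tile_features):
        # tile_features: (num_tiles, in_dim), e.g. CLAM features for one slide
        avg = tile_features.mean(dim=0)
        mx = tile_features.max(dim=0).values
        pooled = torch.cat([avg, mx], dim=0)  # (2 * in_dim,)
        return self.fc2(torch.relu(self.fc1(pooled)))
```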
We also construct an Ensemble model, which averages the predictions of all other models to produce a final outcome.
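A minimal sketch of this prediction averaging, assuming each model outputs per-class probabilities:

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average per-model class probabilities and take the argmax.
    `prob_list` holds one (num_samples, num_classes) array per model."""
    avg = np.mean(np.stack(prob_list, axis=0), axis=0)
    return avg.argmax(axis=1)
```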

Fairness Definitions
Fairness metrics are crucial tools for evaluating and mitigating bias across demographic groups, irrespective of race, gender, or other protected characteristics. We describe the two common fairness metrics that we used to evaluate the bias of our models.

Equalized Odds
Equalized odds is a fairness measurement for predictive models that requires a predictor Ŷ to be independent of any protected attribute A given the true outcome Y. The measurement requires equal true positive and false positive rates across demographic groups, in both binary and multi-class settings. The purpose of equalized odds is to ensure that no group is unfairly advantaged or disadvantaged by the predictions.

Definition 1. For binary variables, equalized odds is defined as:

P(Ŷ = 1 | A = 0, Y = y) = P(Ŷ = 1 | A = 1, Y = y), for y ∈ {0, 1}.

This metric aligns with the goal of training classifiers that perform equitably across all demographics [18].

Equal Opportunity
In binary classification, Y = 1 often represents a positive outcome, such as loan repayment, college admission, or promotion. Equal opportunity is a criterion derived from equalized odds that focuses only on this advantaged outcome. It requires non-discrimination within this group, ensuring that those who achieve the positive outcome Y = 1 have an equal probability of doing so regardless of the protected attribute A. This is less stringent than equalized odds and often leads to better utility.

Definition 2. Equal opportunity for a binary predictor Ŷ is defined as:

P(Ŷ = 1 | A = 0, Y = 1) = P(Ŷ = 1 | A = 1, Y = 1).

This condition mandates an equal True Positive Rate (TPR) for different demographic groups without imposing requirements on the False Positive Rate (FPR), thus allowing for potentially greater overall utility of the predictor [18].
We define the TPR and FPR as follows, using TP to denote a True Positive, FP to denote a False Positive, TN to denote a True Negative, and FN to denote a False Negative:

TPR = TP / (TP + FN), FPR = FP / (FP + TN).
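These per-group rates, and the corresponding fairness gaps, can be computed directly from predictions and group labels. The sketch below also includes the optional epsilon guard described in the Fairness Adjustments section; the function names are our own:

```python
import numpy as np

def group_rates(y_true, y_pred, group, eps=0.0):
    """TPR and FPR per demographic group; `eps` guards empty cells."""
    rates = {}
    for g in np.unique(group):
        m = group == g
        tp = np.sum((y_pred[m] == 1) & (y_true[m] == 1))
        fp = np.sum((y_pred[m] == 1) & (y_true[m] == 0))
        fn = np.sum((y_pred[m] == 0) & (y_true[m] == 1))
        tn = np.sum((y_pred[m] == 0) & (y_true[m] == 0))
        rates[g] = (tp / (tp + fn + eps), fp / (fp + tn + eps))
    return rates

def fairness_gaps(rates, g0, g1):
    """Equalized odds compares both rates; equal opportunity only the TPR."""
    (tpr0, fpr0), (tpr1, fpr1) = rates[g0], rates[g1]
    return {"equal_opportunity": abs(tpr0 - tpr1),
            "equalized_odds": max(abs(tpr0 - tpr1), abs(fpr0 - fpr1))}
```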

Fairness Adjustments
We build our fairness adjustment method upon previous post-processing algorithmic fairness work. Hardt et al. [18] propose a method that adjusts a model's outputs to ensure fairness when there are only two possible outcomes. Putzel et al. [19] adapt this method to settings with more than two outcomes, such as the breast cancer stage classification task that we study here. To mitigate the issue of sparse samples for some groups, as is the case with our dataset, we introduce a minor adjustment, an epsilon term, to the TPR and FPR calculations to avoid division errors. By analyzing predicted and true labels alongside sensitive attributes such as race, we derive 'adjusted' predictions that meet predefined fairness criteria. The resulting predictors aim to balance false positive and true positive rates (for equalized odds) or equalize true positive rates (for equal opportunity) across demographic groups.
We leverage ROC curves to determine optimal fairness thresholds. Aligning ROC curves across groups yields predictors that fulfill equalized odds, whereas mismatches may necessitate varying thresholds or probabilistic adjustments to achieve fair treatment. We identify optimal predictors by analyzing the intersections of group-specific convex hulls formed from these ROC curves. We adjust the conditional probabilities within the protected-attribute conditional probability matrices through linear programming, optimizing against a fairness-oriented loss function. This process also incorporates an element of flexibility, allowing the loss function to penalize inaccuracies differently based on protected group membership.
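For intuition, the two-group binary case of this linear program (following Hardt et al. [18]; the paper itself uses Putzel et al.'s multi-class extension) can be sketched with scipy. All names and the error weighting are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def equalize_odds_lp(tpr, fpr, w):
    """Solve for mixing probabilities p[a][yhat] = P(adjusted = 1 | predicted
    = yhat) for two groups a. The adjusted rates
        TPR'_a = p[a][1]*tpr[a] + p[a][0]*(1 - tpr[a])
        FPR'_a = p[a][1]*fpr[a] + p[a][0]*(1 - fpr[a])
    are linear in p, so equal-rate constraints and the expected error
    form a linear program. `w[a][y]` weights errors by group and label."""
    # Variables: [p00, p01, p10, p11] = (group 0: yhat=0,1; group 1: yhat=0,1)
    c = []
    for a in range(2):
        t, f = tpr[a], fpr[a]
        # Expected error = sum_a w[a][0]*FPR'_a + w[a][1]*(1 - TPR'_a),
        # expanded into coefficients of p[a][0] and p[a][1].
        c += [w[a][0] * (1 - f) - w[a][1] * (1 - t),
              w[a][0] * f - w[a][1] * t]
    # Equal adjusted TPR and FPR across the two groups.
    A_eq = [[1 - tpr[0], tpr[0], -(1 - tpr[1]), -tpr[1]],
            [1 - fpr[0], fpr[0], -(1 - fpr[1]), -fpr[1]]]
    res = linprog(c, A_eq=A_eq, b_eq=[0.0, 0.0], bounds=[(0, 1)] * 4)
    p = np.asarray(res.x).reshape(2, 2)
    adj = [(p[a][1] * tpr[a] + p[a][0] * (1 - tpr[a]),
            p[a][1] * fpr[a] + p[a][0] * (1 - fpr[a])) for a in range(2)]
    return p, adj
```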
Our fair predictors ensure a balanced representation of demographic groups by equalizing various fairness metrics.We explore two different multi-class fairness criteria, although the method could generalize to other fairness metrics as well.
We aim to minimize the same expected loss function for multi-class classification that was used by Putzel et al. [19]:

E[ℓ(Y_adj, Y)] = Σ_α Σ_{i,j,y} ℓ(i, y) · W^α_{ij} · P(Ŷ = j, Y = y, A = α),

where W^α_{ij} = P(Y_adj = i | Ŷ = j, A = α) are the protected-attribute conditional confusion matrices.
To preserve fairness at the individual prediction level, we adopt a stochastic approach: instead of simply selecting the most probable class, we construct predictions by sampling from the adjusted probabilities. Due to insufficient sample sizes within each demographic group, we encountered instances of zero values for FPs, TPs, FNs, and TNs. To implement our method, we used existing software for calculating fairness metrics that was originally developed for binary classification [13,20]. We add an epsilon term (0.001) to the denominator of each of the four measurements (FPs, TPs, FNs, and TNs) to prevent division errors when calculating the confusion matrix and the fairness metrics (equalized odds and equal opportunity).
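The stochastic prediction step can be sketched as follows, assuming the adjusted per-sample class probabilities have already been computed by the post-processing procedure:

```python
import numpy as np

def sample_adjusted_predictions(adjusted_probs, rng=None):
    """Draw each final label from its adjusted class distribution instead
    of taking the argmax. `adjusted_probs` is (num_samples, num_classes)
    with rows summing to 1 (a sketch; the paper derives these from the
    fitted conditional confusion matrices)."""
    rng = rng or np.random.default_rng(0)
    cum = np.cumsum(adjusted_probs, axis=1)
    u = rng.random((adjusted_probs.shape[0], 1))
    # First class whose cumulative probability exceeds the uniform draw.
    return (u < cum).argmax(axis=1)
```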

Evaluation Procedure
To ensure statistical robustness, we employ 50 iterations of a bootstrapping approach.
During each iteration, we randomly select a subset comprising half of the test samples. This subset is used to compute the FPR and TPR for the White and non-White patient groups across all models.
We determine the mean, standard deviation, and confidence intervals of these metrics, allowing for a comparative analysis between the White and non-White cohorts. We apply a t-test to measure the statistical significance of the observed differences across groups.
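The bootstrap-and-t-test loop can be sketched as follows; for brevity this resamples precomputed per-sample metric contributions rather than recomputing FPR/TPR from raw predictions each iteration, which is an assumption:

```python
import numpy as np
from scipy import stats

def bootstrap_rate_gap(metric_white, metric_nonwhite, n_iter=50, seed=0):
    """Resample half of each group's metric values for n_iter iterations,
    record the per-iteration means, and compare them with a t-test."""
    rng = np.random.default_rng(seed)
    w_means, n_means = [], []
    for _ in range(n_iter):
        w = rng.choice(metric_white, size=len(metric_white) // 2, replace=True)
        n = rng.choice(metric_nonwhite, size=len(metric_nonwhite) // 2,
                       replace=True)
        w_means.append(w.mean())
        n_means.append(n.mean())
    t, p = stats.ttest_ind(w_means, n_means)
    return np.mean(w_means), np.mean(n_means), p
```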

Results
Table 3 presents a comparative analysis, prior to fairness adjustments, of several binary classification deep learning models based on their performance metrics across two demographic stratifications of the dataset: White and non-White groups. We observe a consistent trend of higher binary accuracy, precision, and recall for the White group across all models. The Ensemble model achieves relatively high precision and recall for the White group but exhibits a significant drop in performance for the non-White group, especially in terms of accuracy and F1-score. These findings highlight the disparities in model performance for under-represented demographic groups and emphasize the need for more balanced and fair machine learning algorithms. Figure 3 illustrates this disparity, showing the FPRs and TPRs of the various CNN models for each group.
Table 4 presents the results of independent t-tests comparing the FPR and TPR between groups across models before applying post-processing adjustment. The majority of the models show a statistically significant difference in FPR, highlighting concerns regarding biases in model performance across demographic groups. Although we did not consistently observe statistical significance at the 0.05 p-value cutoff for TPR, the trend was always towards better performance for the White group, and some models still showed statistically significant differences in TPR. There were no models for which the trend was reversed, that is, no models that performed better for the non-White group.
Table 5 and Figure 4 offer a comprehensive view of model performance before and after fairness adjustments in the binary classification setting. Notably, we do not observe consistent improvements in either FPR or TPR post-adjustment.
Table 6 provides a comparison of performance metrics for several models in the multi-class setting. The analysis was conducted across White and non-White groups for three different labels. We did not observe consistent discrepancies in performance between the White and non-White groups in the multi-class formulation, but we are hesitant to draw conclusions from this due to the low overall performance of the models in the multi-class setting.
Figure 5 and Table 7 present a comparative analysis of the performance metrics for the deep learning models before and after fairness adjustments in the multi-class setting. We once again do not observe any improvements in performance after the fairness adjustments, but we are hesitant to draw conclusions from the multi-class results given the low overall baseline performance.

Discussion
We observe biases in the performance of the binary classification models, which consistently perform better on test data from White individuals. Our work adds further evidence to a wide body of prior work [21][22][23] demonstrating that, without care, the integration of AI into diagnostic workflows may amplify existing healthcare disparities.
The lack of consistent disparity reductions after fairness adjustments highlights the challenges in applying post-processing techniques to reduce bias in machine learning models trained on medical imaging data. By adjusting the models after training, we had hoped to improve the equity of AI-enabled diagnostics across racial groups. However, these methods did not reliably reduce bias for the deep learning models and dataset studied here.
The primary limitations of this study are (1) the small size of our evaluation dataset and (2) the possible lack of generalizability of our findings due to the use of only one dataset for evaluation. Future research on post-processing fairness in medical imaging would benefit from multi-site datasets with larger numbers of patients covering a broader range of demographic attributes. Another major limitation is that we grouped all non-White patients into a single category for the fairness analyses due to the lack of sufficient representation of any race other than White. A more robust analysis would have included performance metrics for each individual race; however, such an analysis requires more samples for the under-represented groups, posing a 'chicken-and-egg' problem. These limitations collectively make our study a preliminary analysis that should be followed up with more expansive experimentation.
Another interesting area of future work would be studying the explainability of the models in conjunction with fairness. Such a study could aid in understanding how different models arrive at their predictions and whether the reasons for a particular prediction differ across groups.

Figure 1.
The overall workflow: we used CNN models for image feature extraction and classification, applied post-processing strategies in an attempt to reduce bias, and evaluated the models using traditional algorithmic fairness metrics.

Figure 3.
FPR and TPR for several binary deep learning models, distinguishing between White and non-White group performance.

Figure 4.
Comparative analyses across eight machine learning models, demonstrating the impact of fairness adjustments on the FPR, TPR, and loss. (a) FPR comparisons, (b) TPR comparisons, (c) loss comparisons. We do not observe consistent trends.

Figure 5.
Comparative analysis of multi-class model performance across several architectures.

Figure 2 .
Figure 2.The workflow and architecture of our Slide Level Classifier: feature extraction, classification, and fairness-centered post-processing.

Table 1 .
Data distribution of training, validation, and test sets for the binary classification of no cancer versus advanced-stage cancer.

Table 2 .
Data distribution of training, validation, and test sets for the multi-class classification formulation.

Table 3 .
Comparison of performance metrics across models for White and non-White groups prior to fairness adjustments.

Table 4 .
Results of independent t-tests comparing FPR and TPR between White (n = 32 in the test set) and non-White (n = 9 in the test set) groups across different models before applying post-processing adjustment.

Table 6 .
Comparison of performance metrics before post-processing adjustment for the multiple class formulation, stratified by race.
Algorithms. Author manuscript; available in PMC 2024 July 03.

Table 7 .
Model performance before and after fairness adjustments in the multi-class classification setting.