Comprehensive Validation on Reweighting Samples for Bias Mitigation via AIF360

Fairness AI aims to detect and alleviate bias across the entire AI development life cycle, encompassing data curation, modeling, evaluation, and deployment, and is a pivotal aspect of ethical AI implementation. For addressing data bias, particularly concerning sensitive attributes such as gender and race, reweighting samples is an efficient approach. This paper contributes a systematic examination of reweighting samples for traditional machine learning (ML) models, employing five models for binary classification on the Adult Income and COMPAS datasets with various protected attributes. The study evaluates prediction results using five fairness metrics, uncovering the nuanced and model-specific nature of the effectiveness of reweighting samples in achieving fairness in traditional ML models, as well as revealing the complexity of bias dynamics.


I. INTRODUCTION
Artificial intelligence (AI) ethics represents an evolving and interdisciplinary domain dedicated to grappling with ethical considerations in AI [1], [2]. This field explores ethical theories, guidelines, policies, principles, rules, and regulations pertinent to AI [3]. Establishing ethical standards for AI is crucial for developing morally sound AI systems or ensuring ethical behavior in AI applications. It involves discerning ethical and moral values, the guiding principles that define what is morally right or wrong. Numerous ethical issues have been identified in AI applications and studies, including concerns about transparency, privacy, accountability, bias, discrimination, safety, security, and the potential for criminal or malicious use. In recent years, AI ethics has burgeoned into a broad and rapidly advancing research area, drawing increased attention from researchers [4].
Fairness AI endeavors to identify and mitigate bias throughout the entire life cycle of AI technique development, spanning data curation and preparation, modeling, evaluation, and deployment, which is crucial for the successful implementation of AI ethics [5], [6], [7], [8]. Bias can manifest in various forms, potentially leading to unfairness in different downstream learning tasks. These biases typically originate from different stages of the machine learning pipeline, including data curation, algorithm design, and user interactions. Data bias may arise when the data are collected from skewed sources, are incomplete, lack crucial information, or contain errors. Such biases result in unrepresentative or incomplete data, leading to biased outputs. Algorithmic bias stems from biased assumptions or criteria in algorithm design, resulting in biased outputs for downstream tasks. Bias introduced through user interactions occurs when individuals using AI systems inject their own biases or prejudices, whether consciously or unconsciously. To address these sources of bias, various approaches have been proposed. Dataset augmentation involves adding more diverse data to training datasets, enhancing representativeness and reducing bias [9]. Bias-aware algorithms are designed to consider different types of bias, working to minimize their impact on system outputs [10]. User feedback mechanisms, such as human-in-the-loop systems [11], involve soliciting feedback from users to identify and rectify biases in the system.
C. Hastings, L. Qian, P. Obiomon and X. Dong are with the Department of Electrical and Computer Engineering, Prairie View A&M University, Texas A&M University System, Prairie View, TX 77446, USA. C. Gibson is with the College of Juvenile Justice, Executive Director of Texas Juvenile Crime Prevention Center, Prairie View A&M University, Texas A&M University System, Prairie View, TX 77446, USA. Email: chastings@pvamu.edu, liqian@pvamu.edu, cbgibson@pvamu.edu, phobiomon@pvamu.edu, xidong@pvamu.edu
Addressing data bias, especially concerning sensitive attributes like gender and race, can be efficiently achieved through sample reweighting, contributing to the advancement of fairness in AI [12]. This method involves assigning weights to each sample based on the ratio of its population proportion to its sampling proportion [13]. The process makes the dataset discrimination-free through two key steps. First, specific attributes, such as gender and race, are identified within the datasets. Subsequently, higher weights are assigned to samples from underrepresented groups, while lower weights are assigned to those from overrepresented groups with respect to these specific attributes. These steps collectively contribute to achieving balance across all groups, thereby fostering fairness in the final outputs of AI algorithms trained on the reweighted data. Nevertheless, existing efforts lack a thorough and comprehensive evaluation of the effectiveness of reweighting samples in mitigating bias associated with traditional machine learning models.
This paper conducts a thorough evaluation of the efficacy of reweighting samples in addressing bias associated with traditional machine learning models using AIF360 [14]. AI Fairness 360 (AIF360) is a versatile open-source library designed to identify and alleviate bias in machine learning models across the entire AI application lifecycle. It encompasses a comprehensive array of metrics for scrutinizing biases in datasets and models, along with detailed explanations for these metrics and algorithms for bias mitigation. The evaluation in this paper focuses on the sample reweighting method available in AIF360, applied to classification tasks performed by five traditional machine learning models: Decision Tree, K Nearest Neighbor, Gaussian Naïve Bayes, Logistic Regression, and Random Forest. The reweighting process is carried out with respect to protected attributes such as sex and race. Subsequently, each traditional machine learning model performs classification on both the original datasets and the new datasets resulting from reweighting samples. The comparative analysis assesses the performance of these models on the original and new datasets, respectively, using balanced accuracy and fairness metrics. Experimental results highlight the model-specific nature of the effectiveness of reweighting samples in achieving fairness in traditional ML models, and reveal the complexity of bias dynamics.
The contributions of this paper can be summarized as follows: 1) In contrast to the previous work [15], our research involves a systematic comparison of reweighting samples for mitigating bias on five traditional ML models through the AIF360 platform. 2) We systematically examine the fairness of experimental results with five fairness metrics and provide insights into the effectiveness of reweighting samples for bias mitigation.

II. METHODOLOGY
This paper examines the effectiveness of reweighting samples in enhancing the fairness of traditional machine learning algorithms on classification tasks. It covers three components: reweighting samples, traditional machine learning models, and fairness metrics for performance evaluation.

A. Reweighting samples
Fairness in AI can be conceptualized as a multi-objective optimization challenge, aiming to optimize learning objectives while mitigating discrimination with respect to sensitive attributes [12]. In essence, achieving fairness may involve a trade-off in learning performance to minimize bias. Data preprocessing emerges as an effective technique for molding training data to foster fairness in AI. Techniques such as suppression, dataset massaging, and reweighting samples have proven effective [12].
Reweighting samples, a specific preprocessing technique, involves adjusting the significance or contribution of individual samples within the training dataset. By strategically assigning weights, it becomes possible to render the training dataset free from discrimination concerning sensitive attributes, all without altering the existing labels. One approach to determining these weights is to derive them from the frequency counts associated with the sensitive attribute [16].
This paper leveraged the sample reweighting technique from AIF360 during data preprocessing to enhance fairness in binary classification. The input for the reweighting process comprises a training dataset with samples containing attributes (including a sensitive attribute) and labels, along with the specification of the sensitive attribute. The output is a transformed dataset where sample weights are adjusted with respect to the sensitive attribute, mitigating potential classification bias. Throughout the reweighting process, an analysis of the distribution of the sensitive attribute within different groups is conducted. This analysis informs the calculation of reweighting coefficients, which, in turn, adjust sample weights to promote a more uniform distribution across groups.
Given a sensitive (protected) attribute, the privileged group comprises the samples with a positive value of the sensitive attribute, while the unprivileged group comprises the samples with a negative value. For a binary classification task, reweighting samples adjusts the weights of four categories of samples, namely $w_{pp}$ (the weight of positive privileged samples (pp)), $w_{pup}$ (the weight of positive unprivileged samples (pup)), $w_{np}$ (the weight of negative privileged samples (np)), and $w_{nup}$ (the weight of negative unprivileged samples (nup)), as below.

$$w_{pp} = \frac{N_{pos}\, N_{p}}{N_{total}\, N_{pp}}, \qquad w_{pup} = \frac{N_{pos}\, N_{up}}{N_{total}\, N_{pup}}, \qquad w_{np} = \frac{N_{neg}\, N_{p}}{N_{total}\, N_{np}}, \qquad w_{nup} = \frac{N_{neg}\, N_{up}}{N_{total}\, N_{nup}}$$

where
$N_{p}$: the number of samples in the privileged group.
$N_{pp}$: the number of samples with the positive class in the privileged group.
$N_{np}$: the number of samples with the negative class in the privileged group.
$N_{up}$: the number of samples in the unprivileged group.
$N_{pup}$: the number of samples with the positive class in the unprivileged group.
$N_{nup}$: the number of samples with the negative class in the unprivileged group.
$N_{pos}$: the number of samples with the positive class.
$N_{neg}$: the number of samples with the negative class.
$N_{total}$: the number of samples.
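The four weights can be illustrated with a small sketch. It assumes the standard frequency-count form, in which each weight is the expected count of a (group, label) cell under independence divided by its observed count, that is, (N_label × N_group) / (N_total × N_group,label). The function name `reweighing_weights`, the 1/0 encoding of groups and labels, and the toy data are hypothetical and are not taken from the paper or from AIF360.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Return {(group, label): weight} from frequency counts.

    groups: 1 = privileged, 0 = unprivileged (assumed encoding)
    labels: 1 = positive class, 0 = negative class (assumed encoding)
    """
    groups, labels = list(groups), list(labels)
    n_total = len(labels)
    n_group = Counter(groups)              # N_p, N_up
    n_label = Counter(labels)              # N_pos, N_neg
    n_joint = Counter(zip(groups, labels)) # N_pp, N_np, N_pup, N_nup
    return {
        (g, y): (n_label[y] * n_group[g]) / (n_total * n_joint[(g, y)])
        for (g, y) in n_joint
    }

# Toy data: 6 privileged samples (4 positive) and 4 unprivileged samples (1 positive).
groups = [1] * 6 + [0] * 4
labels = [1, 1, 1, 1, 0, 0] + [1, 0, 0, 0]
weights = reweighing_weights(groups, labels)

for (g, y), w in sorted(weights.items()):
    print(f"group={g} label={y} weight={w:.4f}")
```

In this toy example the rare cell (positive, unprivileged) receives the largest weight, and after weighting both groups have the same weighted positive rate, which is the balance the reweighting step is meant to achieve.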

B. Traditional Machine Learning
This paper applied five traditional machine learning models to implement binary classification tasks, including Logistic Regression, Decision Tree, K Nearest Neighbor, Gaussian Naive Bayes, and Random Forest [17].
Logistic Regression is a statistical technique employed to model the probability of a binary outcome. It finds widespread application in machine learning for scenarios involving binary classification, aiming to predict whether an instance belongs to one of two classes. The posterior probability of class $c_1$ is expressed through a logistic sigmoid applied to a linear function of the feature vector $\phi$.
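As a minimal sketch of the statement above, the posterior can be written as a sigmoid applied to a weighted sum of the features. The weight vector `w`, the optional `bias` term, and the function names are illustrative assumptions, not values or code from the paper.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def posterior_c1(w, phi, bias=0.0):
    """Posterior probability of class c1 given feature vector phi."""
    activation = sum(wi * xi for wi, xi in zip(w, phi)) + bias
    return sigmoid(activation)

# With all-zero weights the model is maximally uncertain.
p_uninformed = posterior_c1([0.0, 0.0], [1.0, 2.0])
# A positive activation pushes the posterior above 0.5.
p_positive = posterior_c1([1.0], [2.0])
```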
Decision Tree is a data mining technique employed to establish classification systems using multiple covariates or to create prediction algorithms for a target variable. This method organizes a population into branch-like segments, forming an inverted tree structure with a root node, internal nodes, and leaf nodes. Notably, the algorithm is non-parametric, allowing it to effectively handle large and complex datasets without imposing a rigid parametric structure [18]. However, one limitation of decision trees lies in their reliance on hard splits in the input space, where a single model is responsible for predictions for any given value of the input variables.
K Nearest Neighbor is a supervised machine learning algorithm used for classification and regression tasks. It is a type of instance-based learning, where the model makes predictions based on the similarity of new data points to existing labeled data points in the training set [19], [20].
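The instance-based idea can be sketched in a few lines: a new point takes the majority label of its k closest training points. This is an illustrative toy with Euclidean distance and unweighted voting, both of which are assumptions; it is not the implementation used in the experiments.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    # Pair each training point's distance to x with its label, nearest first.
    by_distance = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-cluster training set.
train_X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = [0, 0, 1, 1]

near_origin = knn_predict(train_X, train_y, (0.2, 0.3), k=3)
near_far_cluster = knn_predict(train_X, train_y, (5.5, 5.0), k=3)
```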
Gaussian Naive Bayes is a straightforward classification algorithm [17]. Its primary principle involves assigning each sample to the class that maximizes the posterior probability. This method operates under the assumption that feature contributions are conditionally independent and follow a Gaussian (normal) distribution. The discriminant function is formulated as the sum of squared distances to the centroid of each class across all features. This sum is weighted by the variance and combined with the logarithm of the a priori probability, computed on the training set using Bayes' rule. In essence, Gaussian Naive Bayes provides a probabilistic approach to classification, leveraging assumptions about the distribution of features to make predictions.
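The discriminant described above can be sketched as a per-class log score: the log prior plus, for each feature, a Gaussian log density whose main term is the variance-weighted squared distance to the class centroid. The parameter layout `{class: (mean, var, prior)}` and the numbers below are hypothetical; in practice these quantities would be estimated from the training set.

```python
import math

def gnb_log_score(x, mean, var, prior):
    """Log prior plus the sum of per-feature Gaussian log densities."""
    score = math.log(prior)
    for xj, mj, vj in zip(x, mean, var):
        score += -0.5 * math.log(2 * math.pi * vj) - (xj - mj) ** 2 / (2 * vj)
    return score

def gnb_predict(x, params):
    """Return the class with the highest log score (maximum posterior)."""
    return max(params, key=lambda c: gnb_log_score(x, *params[c]))

# Hypothetical class parameters: (per-feature means, per-feature variances, prior).
params = {
    0: ([0.0, 0.0], [1.0, 1.0], 0.5),
    1: ([3.0, 3.0], [1.0, 1.0], 0.5),
}

label_a = gnb_predict([0.5, 0.2], params)
label_b = gnb_predict([2.8, 3.1], params)
```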
Random Forest has proven to be remarkably successful as a classification and regression method. This approach involves the combination of multiple randomized decision trees, and it aggregates their predictions through averaging. Notably, it has demonstrated exceptional performance in scenarios where the number of variables is significantly greater than the number of observations. Its versatility extends to large-scale problems, making it adaptable to various ad hoc learning tasks. Additionally, it provides measures of variable importance, adding to its utility and interpretability [21].

C. Fairness Metrics
This paper employed five fairness metrics to evaluate the effectiveness of reweighting samples for mitigating bias.
Given sensitive (protected) attributes, Disparate Impact (DI) denotes inadvertent bias that may arise when predictions exhibit varying error rates or outcomes across demographic groups, where the sensitive attributes, such as race, sex, disability, and age, are deemed protected. This bias can emerge either from training models on biased data or from the predictive model itself being discriminatory. In the context of this study, Disparate Impact pertains to divergent impacts on prediction results, as defined by

$$DI = \frac{p_{pup}}{p_{pp}}$$

where $p_{pup}$ refers to the probability of a positive prediction for the unprivileged samples, while $p_{pp}$ denotes the probability of a positive prediction for the privileged samples. If the disparate impact of the predictions approaches 0, it signifies bias in favor of the privileged group. Conversely, if it exceeds 1, it indicates a bias in favor of the unprivileged group. A value of 1 implies perfect fairness in the predictions [22].
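A short sketch of the ratio described above, assuming DI is the positive-prediction rate of the unprivileged group divided by that of the privileged group. The 1/0 group encoding, the function names, and the toy predictions are hypothetical.

```python
def positive_rate(y_pred, groups, group):
    """Share of positive predictions within one group."""
    preds = [p for p, g in zip(y_pred, groups) if g == group]
    return sum(preds) / len(preds)

def disparate_impact(y_pred, groups):
    """Unprivileged (0) positive rate divided by privileged (1) positive rate."""
    return positive_rate(y_pred, groups, 0) / positive_rate(y_pred, groups, 1)

# Toy predictions: 3 of 4 privileged and 1 of 4 unprivileged samples predicted positive.
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = [1, 1, 1, 1, 0, 0, 0, 0]
di = disparate_impact(y_pred, groups)
print(f"DI = {di:.3f}")
```

Here the value is well below 1, which under the interpretation above indicates predictions that favor the privileged group.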
Average odds difference (AOD) is the average of the differences in false positive rates (FPR) and true positive rates (TPR) between the unprivileged and privileged groups. It is defined by

$$AOD = \frac{1}{2}\left[(FPR_{up} - FPR_{p}) + (TPR_{up} - TPR_{p})\right]$$

where $FPR_{up}$ and $FPR_{p}$ denote the false positive rate for unprivileged and privileged samples, respectively, within predictions, while $TPR_{up}$ and $TPR_{p}$ refer to the true positive rate for unprivileged and privileged samples, respectively, within predictions. A result of 0 signifies perfect fairness. A positive value indicates bias in favor of the unprivileged group, while a negative value indicates bias in favor of the privileged group.
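The definition can be sketched as follows, assuming AOD is half the sum of the FPR gap and the TPR gap, each taken as unprivileged minus privileged. The group encoding (1 = privileged, 0 = unprivileged), helper names, and toy data are hypothetical.

```python
def group_rates(y_true, y_pred, groups, group):
    """Return (TPR, FPR) computed only on samples of the given group."""
    tp = fp = positives = negatives = 0
    for t, p, g in zip(y_true, y_pred, groups):
        if g != group:
            continue
        if t == 1:
            positives += 1
            tp += p
        else:
            negatives += 1
            fp += p
    return tp / positives, fp / negatives

def average_odds_difference(y_true, y_pred, groups):
    tpr_up, fpr_up = group_rates(y_true, y_pred, groups, 0)
    tpr_p, fpr_p = group_rates(y_true, y_pred, groups, 1)
    return 0.5 * ((fpr_up - fpr_p) + (tpr_up - tpr_p))

groups = [1, 1, 1, 1, 0, 0, 0, 0]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
aod = average_odds_difference(y_true, y_pred, groups)
print(f"AOD = {aod:.2f}")
```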
Statistical parity difference (SPD) is the difference between the rates of favorable outcomes in the unprivileged and privileged groups. It is defined by

$$SPD = p_{pup} - p_{pp}$$

where $p_{pup}$ and $p_{pp}$ are the probabilities of a positive prediction for unprivileged and privileged samples, respectively. A score below 0 suggests benefits for the privileged group, while a score above 0 implies benefits for the unprivileged group. A score of 0 indicates that both groups receive equal benefits.
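A minimal sketch of this difference, computed here as the unprivileged positive-prediction rate minus the privileged one; that ordering, the 1/0 group encoding, and the toy data are assumptions made for illustration.

```python
def favorable_rate(y_pred, groups, group):
    """Share of favorable (positive) predictions within one group."""
    preds = [p for p, g in zip(y_pred, groups) if g == group]
    return sum(preds) / len(preds)

def statistical_parity_difference(y_pred, groups):
    """Unprivileged (0) rate minus privileged (1) rate."""
    return favorable_rate(y_pred, groups, 0) - favorable_rate(y_pred, groups, 1)

y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = [1, 1, 1, 1, 0, 0, 0, 0]
spd = statistical_parity_difference(y_pred, groups)
print(f"SPD = {spd:.2f}")
```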
Equal opportunity difference (EOD) evaluates whether all groups have an equal opportunity to benefit. EOD specifically centers on the True Positive Rate (TPR), representing the accurate identification of positives in both the unprivileged and privileged groups. The measure is defined by

$$EOD = TPR_{up} - TPR_{p}$$

where $TPR_{up}$ and $TPR_{p}$ are the true positive rates for unprivileged and privileged samples, respectively. A value of 0 signifies perfect fairness. A positive value indicates bias in favor of the unprivileged group, while a negative value indicates bias in favor of the privileged group.
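As a sketch under the assumption that EOD is the TPR of the unprivileged group minus the TPR of the privileged group, the metric can be computed as below. The group encoding and toy data are hypothetical.

```python
def true_positive_rate(y_true, y_pred, groups, group):
    """TPR restricted to samples of one group."""
    hits = [p for t, p, g in zip(y_true, y_pred, groups) if g == group and t == 1]
    return sum(hits) / len(hits)

def equal_opportunity_difference(y_true, y_pred, groups):
    """Unprivileged (0) TPR minus privileged (1) TPR."""
    return (true_positive_rate(y_true, y_pred, groups, 0)
            - true_positive_rate(y_true, y_pred, groups, 1))

groups = [1, 1, 1, 1, 0, 0, 0, 0]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
eod = equal_opportunity_difference(y_true, y_pred, groups)
print(f"EOD = {eod:.2f}")
```

Unlike AOD, this metric ignores false positives entirely, so the two can disagree when groups differ mainly in FPR.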
Theil index (TI), also called the entropy index, measures both group and individual fairness. It is defined by

$$TI = \frac{1}{n}\sum_{i=1}^{n} \frac{b_i}{\mu} \ln \frac{b_i}{\mu}$$

where $n$ is the number of samples, $b_i = \hat{y}_i - y_i + 1$, and $\mu$ is the average of $b_i$. A lower TI value indicates a more equitable distribution of classification outcomes, while a higher value suggests greater disparity.
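A small sketch of the index, assuming the usual generalized-entropy form TI = (1/n) Σ (b_i/µ) ln(b_i/µ) with the benefit b_i defined as in the text and the convention that a term with b_i = 0 contributes 0. The function name and toy data are hypothetical.

```python
import math

def theil_index(y_true, y_pred):
    """Theil index over per-sample benefits b_i = y_pred_i - y_true_i + 1."""
    b = [p - t + 1 for t, p in zip(y_true, y_pred)]
    mu = sum(b) / len(b)  # undefined if every benefit is 0
    total = 0.0
    for bi in b:
        ratio = bi / mu
        if ratio > 0:  # treat 0 * ln(0) as 0
            total += ratio * math.log(ratio)
    return total / len(b)

# One false negative (b = 0) and one false positive (b = 2) among four samples.
ti = theil_index([1, 1, 0, 0], [1, 0, 0, 1])
# Perfect predictions give every sample the same benefit, so the index is 0.
ti_perfect = theil_index([1, 0, 1, 0], [1, 0, 1, 0])
```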

III. EXPERIMENTS

A. Dataset
This paper employed two datasets, the Adult Income dataset and the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) dataset, to evaluate the effectiveness of reweighting samples for mitigating bias.
Adult Income dataset: it includes 48,842 samples with 14 attributes, which can be used for predicting whether income exceeds $50K/yr based on census data [23].
COMPAS dataset: the COMPAS system is a case management system and decision support tool initially developed and owned by Northpointe (now Equivant). It was designed for the purpose of assessing the likelihood of an individual committing a future crime. The dataset associated with COMPAS comprises more than 20 attributes and includes a substantial sample size of 11,000 instances [24].

B. Experimental metrics
Balanced Accuracy (BA) is applicable to both binary and multi-class classification scenarios and, in the binary case, is computed as the mean of sensitivity and specificity. Sensitivity is the proportion of actual positive instances that the model correctly identifies, while specificity is the proportion of actual negative instances that the model correctly identifies. A result nearing 0 signifies poor model performance, whereas a result approaching 1 indicates effective performance across both sensitivity and specificity [25].
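For the binary case, the computation can be sketched as the mean of sensitivity, TP / (TP + FN), and specificity, TN / (TN + FP). The function name and toy labels below are hypothetical.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Imbalanced toy labels: 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
ba = balanced_accuracy(y_true, y_pred)
print(f"BA = {ba:.4f}")
```

Because each class contributes equally regardless of its size, BA is less sensitive to class imbalance than plain accuracy, which matters for datasets such as Adult Income where the positive class is the minority.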

C. Experimental results and discussion
This paper carried out a comprehensive validation of reweighting samples for mitigating bias in traditional supervised machine learning models on binary classification tasks. It involved two sets of experiments on two datasets: the Adult income dataset and the COMPAS dataset.
1) Adult income: Figures 1 and 2 illustrate the performance comparison before and after reweighting samples with respect to the protected attribute Race. Prior to reweighting, various machine learning (ML) models exhibit distinct biases in the AOD and DI curves. Following reweighting, biases in ML models are mitigated to varying degrees. Notably, decision tree (DT) models demonstrate bias-free behavior in terms of AOD and DI values. Conversely, biases in logistic regression (LR), k-nearest neighbors (KNN), and Gaussian Naive Bayes (GNB) appear unchanged, while the bias in random forest (RF) becomes more unstable. This suggests that the effectiveness of reweighting samples depends on the specific ML model and may not universally apply.
Table I provides a systematic comparison using additional fairness metrics. DT's bias is eliminated in terms of AOD, EOD, and TI values. However, regarding SPD and DI values, bias persists towards the unprivileged group, emphasizing the need for a comprehensive examination of bias. Furthermore, although other ML models experience slight reductions in balanced accuracy (BA), their fairness is marginally improved, emphasizing the inherent trade-off between BA and fairness.
Additionally, Figures 3 and 4 depict comparison results for the protected attribute Sex. Similar observations are noted, where reweighting samples is effective mainly for DT, less so for KNN, and has varying impacts on LR, GNB, and RF. Table II echoes these trends, revealing slight reductions in BA and modest improvements in fairness for other ML models across various fairness metrics (SPD, AOD, DI, EOD, and TI). This underscores the nuanced effectiveness of the same debiasing technique across different ML models.
2) COMPAS: To thoroughly assess the effectiveness of reweighting samples in mitigating bias, Figures 5 and 6 provide a performance comparison before and after reweighting samples for the protected attribute Race in the COMPAS dataset. Similarly, prior to reweighting, various machine learning (ML) models exhibit diverse biases in the AOD and DI curves. Post-reweighting, DT models demonstrate a lack of bias in AOD and DI values. Conversely, biases in LR, KNN, GNB, and RF appear unchanged, as seen in the AOD and DI curves.
Table III offers a systematic comparison using additional fairness metrics on the COMPAS dataset. Similar to the Adult dataset, DT's bias is eradicated concerning AOD, EOD, and TI values. However, biases persist in terms of SPD and DI values, showing a continued bias toward the unprivileged group. Additionally, while other ML models experience slight reductions in balanced accuracy (BA), their fairness is marginally improved.
Moreover, Figures 7 and 8 present comparison results for the protected attribute Sex. Similar observations are made, where reweighting samples is effective primarily for DT but less so for KNN, GNB, and RF. Notably, LR exhibits more significant effects after reweighting. Table IV reveals similar trends, with slight reductions in BA and modest improvements in fairness across various metrics (SPD, AOD, DI, EOD, and TI).
In summary, the figures and tables illustrate the impact of reweighting samples on bias mitigation in traditional machine learning models for both the Adult income and COMPAS datasets. Notably, decision tree (DT) models showcase effective bias reduction, while logistic regression (LR), k-nearest neighbors (KNN), and Gaussian Naive Bayes (GNB) exhibit more resistance to debiasing. The trade-off between balanced accuracy (BA) and fairness is evident. Results differ across protected attributes, emphasizing the nuanced effectiveness of reweighting. Comprehensive assessments, including additional fairness metrics, reveal the complexity of bias dynamics. This study highlights the model-specific nature of the effectiveness of reweighting samples in achieving fairness in traditional machine learning models.

IV. RELATED WORK
The fairness of machine learning (ML) models has become one of the most pivotal challenges of the decade [26]. Intended to intelligently prevent errors and biases in decision-making, ML models sometimes unintentionally become sources of bias and discrimination in society. Concerns have been raised about various forms of unfairness in ML, including racial biases in criminal justice systems, disparities in employment, and biases in loan approval processes [27]. The entire life cycle of an ML model, covering input data, modeling, evaluation, and feedback, is susceptible to both external and inherent biases, potentially resulting in unjust outcomes.
Techniques for mitigating bias in machine learning models can be categorized into pre-processing, in-processing, and post-processing methods [28]. Pre-processing acknowledges that the data itself often introduces bias and transforms the data to eliminate discrimination during training [12]. It is considered the most flexible part of the data science pipeline, making no assumptions about subsequent modeling techniques [29]. In-processing adjusts modeling techniques to counter biases and incorporates fairness metrics into model optimization [30]. Post-processing addresses unfair model outputs, applying transformations to enhance prediction fairness [31].
Pre-processing assumes that the disparate impact of the trained classifier mirrors that of the training data. Techniques include massaging the dataset by adjusting class labels that were mislabeled due to bias [22], [32] and reweighting training samples to assign greater importance to sensitive ones [33], [12].

V. CONCLUSION AND FUTURE WORK
This study conducted a thorough validation of the application of reweighting samples to address bias in binary classification using traditional machine learning models. It provides systematic insight into the effectiveness of various traditional classification algorithms concerning different protected attributes, contributing to the advancement of fairness-enhanced AI. Future research will extend this exploration by incorporating more advanced machine learning algorithms, including generative AI models. Additionally, the study will incorporate new datasets and explore a broader range of protected attributes to further enhance the validation process.

Fig. 1: Performance comparison via BA vs AOD before and after reweighting samples on Adult income dataset with respect to the protected attribute Race.

Fig. 3: Performance comparison via BA vs AOD before and after reweighting samples on Adult income dataset with respect to the protected attribute Sex.

Fig. 4: Performance comparison via BA vs DI before and after reweighting samples on Adult income dataset with respect to the protected attribute Sex.

Fig. 5: Performance comparison via BA vs AOD before and after reweighting samples on COMPAS dataset with respect to the protected attribute Race.

Fig. 6: Performance comparison via BA vs DI before and after reweighting samples on COMPAS dataset with respect to the protected attribute Race.

Fig. 7: Performance comparison via BA vs AOD before and after reweighting samples on COMPAS dataset with respect to the protected attribute Sex.

TABLE I: Performance comparison before and after reweighting samples through one classification metric (BA) and fairness metrics including SPD, AOD, DI, EOD, and TI on the Adult income dataset regarding the protected attribute Race.

TABLE II: Performance comparison before and after reweighting samples through one classification metric (BA) and fairness metrics including SPD, AOD, DI, EOD, and TI on the Adult income dataset regarding the protected attribute Sex.

TABLE III: Performance comparison before and after reweighting samples through one classification metric (BA) and fairness metrics including SPD, AOD, DI, EOD, and TI on the COMPAS dataset regarding the protected attribute Race.