FAIRCAIPI: A Combination of Explanatory Interactive and Fair Machine Learning for Human and Machine Bias Reduction

Abstract: The rise of machine-learning applications in domains with critical end-user impact has led to a growing concern about the fairness of learned models, with the goal of avoiding biases that negatively impact specific demographic groups. Most existing bias-mitigation strategies adapt the importance of data instances during pre-processing. Since fairness is a contextual concept, we advocate for an interactive machine-learning approach that enables users to provide iterative feedback for model adaptation. Specifically, we propose to adapt the explanatory interactive machine-learning approach CAIPI for fair machine learning. FAIRCAIPI incorporates human feedback in the loop on predictions and explanations to improve the fairness of the model. Experimental results demonstrate that FAIRCAIPI outperforms a state-of-the-art pre-processing bias-mitigation strategy in terms of the fairness and the predictive performance of the resulting machine-learning model. We show that FAIRCAIPI can both uncover and reduce bias in machine-learning models and allows us to detect human bias.


Introduction
The discovery of discriminatory machine-learning applications has refuted the popular belief that machine-learning (ML) algorithms make objective decisions. For instance, the COMPAS tool (short for Correctional Offender Management Profiling for Alternative Sanctions) erroneously assigns Black defendants a higher risk of recidivism than white defendants, indicating a clear racial bias [1]. Another example is Microsoft's TayAI chatbot, which generated racist and anti-Semitic tweets [2]. Despite this, machine-learning algorithms are increasingly used in sensitive domains such as employment hiring assistance [3] and credit request evaluation [4]. We argue that ML engineers have an ethical obligation to ensure that ML algorithms do not reproduce data-inherent biases and thus systematically disadvantage certain groups, regardless of any legal requirements that may exist.
Numerous approaches incorporate fairness (or bias mitigation) as an additional objective, e.g., by satisfying fairness metrics during model optimization [5–7] or by entirely new classification systems [8,9]. Celis et al. (2019) derive a generalized classification algorithm that arranges multiple arbitrary fairness metrics in a linear group performance function as an optimization objective. Satisfying fairness constraints during model optimization is an entirely different approach to bias mitigation than improving fairness before training. For instance, Reweighing modifies the proportion of potentially deprived groups before fitting a classification model such that it satisfies a single specific fairness metric [10]. Methods like Reweighing are likely to outperform more generic approaches regarding specific metrics, whereas the more generic methods offer advantages when aggregating multiple fairness constraints.
Common to state-of-the-art bias-mitigation approaches is that they tend to treat fairness as a context-free, stationary concept. In general, what is considered fair is determined by the specific cultural background, be it a nation or a company, which highlights the highly subjective perception of fairness [11]. Berman et al. (1985) investigate an example in which colleagues of an Indian and a US company are each asked to distribute corporate benefits among each other. Whereas the Indian employees considered it fair to distribute the corporate benefits according to the economic need of each individual, the US employees agreed on a distribution proportional to individual merit. Given the observation that fairness is highly context-sensitive, we argue that the current approaches of stationary fairness metrics need to be extended to interactive ML methods to take user-specific cultural presuppositions of fairness into account. To reflect individually perceived fairness, users must be able to guide the optimization process of ML models. In active learning [12], users iteratively label instances that maximize the information gain of the classification model. Coactive learning is an active-learning modification where users interactively correct the predictions of the classifier [13].
Active and coactive learning presume that users capture the decision-making mechanism of problems modeled by ML, which we argue is an overly strong assumption. Given context-sensitive fairness interpretations, users have individual perceptions of fair decision-making. Explanatory and interactive ML (XIML) enriches the interactive component of active learning with explanatory ML techniques and thus equips users with an awareness of how an ML model obtains its decision [14]. Hence, XIML discloses the model's decision-making mechanism through explanations and allows the user to correct both the prediction and the explanation. Angerschmid et al. (2022) [15] show that including an explanatory component in the decision-making process improves the user's perception of fairness. Indeed, the objective of CAIPI [14], a state-of-the-art XIML algorithm, is to leverage user feedback that drives the model toward a presumably correct decision-making mechanism from the perspective of a domain expert. For this purpose, CAIPI has been enriched with a user interface in the context of medical image classification [16]. Others propose algorithmic adaptations. For instance, Schramowski et al. (2020) [17] specifically tailor CAIPI for deep learning.
Let us introduce a fictional example within the domain of credit risk management to motivate our interactive learning approach. Suppose a student is applying for credit to build or buy real estate, as shown in Figure 1. The primary objectives of credit risk management include evaluating the creditworthiness of the applicant, thereby minimizing risks and maintaining a robust loan portfolio. Initially, suppose that the applicant is a philosophy student in his first semesters, aspiring to pursue research in academia in the future, and whose current income is limited to part-time work. This application is automatically rejected due to the applicant's current age, income situation, and employment status. In this case, the machine-learning system's assessment was accurate and fair with respect to the contextual situation of the student and the perspective of the lending organization, i.e., the explanation is not biased, and no additional interaction is needed. Second, envision a scenario where the applicant has a temporary contract in academia in the domain of computer science. However, the model, solely considering the weak employment status of a temporary contract, might initially reject the credit application. Here, the domain expert intervenes because he can reasonably anticipate a strong job market in the computing domain and, therefore, proceeds to rectify the label accordingly. Subsequently, he augments the dataset with this corrected information and retrains the model for improved accuracy. Third, suppose that the applicant is a female MSc student in biology. The system rejects the application due to the gender female. The domain expert accepts the system recommendation, but he questions the explanation that the decision was based on gender. His reasoning leads to the observation that the explanation is biased; specifically, female applicants are considered less creditworthy. Despite this, he concludes that the decisive features of the credit rejection should be based on the fact that
the applicant is not employed and does not provide enough equity capital to fulfill the bank's lending policy. Consequently, the expert intervenes, rectifying this explanation bias by incorporating counterexamples into the dataset for model refinement. We present FAIRCAIPI, an XIML approach based on the CAIPI algorithm. Its novelty is that FAIRCAIPI lets the user interact with the model's decision-making mechanism from a fairness perspective. Instead of embedding presumably correct mechanisms, users are asked to provide iterative feedback to improve the fairness of the model. Biased decision-making mechanisms are mitigated by user feedback. With FAIRCAIPI, we contribute a tool to (i) uncover and (ii) reduce machine bias as well as (iii) detect human bias during model optimization. This work presents the theoretical basis and a formal derivation of the FAIRCAIPI method. We assess our approach through a simulation study and additionally compare FAIRCAIPI to Reweighing, a state-of-the-art bias-mitigation pre-processing strategy [10]. The research questions addressed by our experiments are as follows: (R1) Does the correction of explanations for fairness lead to fairer models? (R2) Does correcting explanations for fairness lead to fairer explanations? (R3) Does correcting for fair explanations have a negative impact on the predictive performance of the model? (R4) Which is superior, FAIRCAIPI or the state-of-the-art Reweighing strategy?

Approaches to Fairness in Machine Learning
Bias-mitigation strategies can be classified into three stages: pre-processing, in-processing, and post-processing [18]. For an intuitive overview, we summarize and cluster related approaches to fairness in ML in Table 1. We briefly address them in this section to locate FAIRCAIPI in the research field of fair ML. Pre-processing is responsible for satisfying fairness quality criteria during the data collection process [19]. Prominent examples exist for facial recognition [20]. More general pre-processing approaches are based on sampling. Their objective is to modify the weight of certain instances to increase fairness. For example, Reweighing optimizes the proportion of deprived and favored groups in the dataset [10]. Other examples include Disparate Impact Removal [21] and Optimized Pre-Processing [22].
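The Reweighing idea can be made concrete with a minimal sketch. This is our illustration, not the reference implementation; the function name `reweighing_weights` and the list-based encoding of the protected attribute and label are assumptions:

```python
from collections import Counter

def reweighing_weights(s_values, y_values):
    """Illustrative sketch of Reweighing [10]: weight each (group, label)
    combination by P(S=s) * P(Y=y) / P(S=s, Y=y), so that the protected
    attribute and the label become statistically independent in the
    weighted training data."""
    n = len(y_values)
    p_s = Counter(s_values)                   # marginal counts of the protected attribute
    p_y = Counter(y_values)                   # marginal counts of the label
    p_sy = Counter(zip(s_values, y_values))   # joint counts
    return [(p_s[s] / n) * (p_y[y] / n) / (p_sy[(s, y)] / n)
            for s, y in zip(s_values, y_values)]
```

Under-represented combinations (e.g., the deprived group with the favorable label) receive weights above 1, over-represented combinations weights below 1.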
Post-processing, on the contrary, changes the output of the ML model by verifying fairness constraints. Some methods tune the probability threshold to minimize outcome differences between deprived and favored groups, e.g., using simpler linear models [23] or classification models [24]. Another example is Reject Option Classification, which alters labels based on critical regions of the decision boundaries [25].
FAIRCAIPI is commonly associated with in-processing methods because fairness objectives are incorporated during model training by satisfying fairness metrics [5–8]. More sophisticated approaches include specific cost functions for different instances in their classification objective [26]. The latter approach has also been extended toward regression [27]. In-processing bias-mitigation strategies also exist for more specific ML niches, such as discrimination-free word embeddings for Natural Language Processing [28,29], fair generative models, such as Generative Adversarial Networks [30] or Variational Autoencoders [31], discrimination-free image recognition with deep learning [32,33], and fair causal models and graphs, such as Bayesian networks [34–38]. Some in-processing methods exploit explanatory ML methods, such as counterfactuals [39], causal explanations [40], or Shapley values [41], to reveal biased decision-making. In particular, the latter is closely associated with our method, as FAIRCAIPI accesses the model's mechanism by local explanations with Shapley values. Interaction with fairness in ML models is currently mostly reserved for visualization techniques that reveal biases within models in user interfaces [42,43]. A closely related approach to FAIRCAIPI is the XIML method by Nakao [44], which integrates human feedback on explanations into model optimization to increase fairness. By borrowing a mechanism for user explanation interaction [45], it allows users to change the disadvantageous behavior of a classifier for several protected attributes at once. By contrast, FAIRCAIPI aligns the user explanation interaction in an optimization cycle.
Table 1. High-level clustering of existing bias-mitigation strategies into pre-, in-, and post-processing [18]. FAIRCAIPI can be subsumed under the XIML for bias mitigation category.


Materials and Methods
FAIRCAIPI combines the technical concepts of bias mitigation and XIML. Consequently, this section covers both. It begins with a brief description of the German Credit Risk dataset, which serves as a running example for the remainder of this paper and is the subject of the simulation study that underpins FAIRCAIPI. German Credit Risk suffers from a gender bias. Since FAIRCAIPI is an extension of the CAIPI algorithm, the derivation of XIML will mostly be centered around CAIPI. We will adapt CAIPI to satisfy fairness objectives and present a human-in-the-loop ML architecture that can (i) detect and (ii) reduce machine bias, and (iii) detect human bias. We will specify the experimental setup at the end of this section. First, let us specify the basic notation used in this article:

Notation. A binary classification model f : X^n → Y is a function that maps a feature space X^n of n features to a target space Y = {0, 1}. For brevity, we omit the superscript n in the following and just write X. We denote an inference by y = f(x). An instance x ∈ X can be represented as a feature value vector x = (x_1, x_2, ..., x_n)^T ∈ X, where x_i denotes a single feature in x at index i. Furthermore, let l : X → Y be a labeling function from instances to class labels. Moreover, let L ⊆ X × Y and U ⊆ X denote subsets of labeled and unlabeled instances, where we write X_L and Y_L for the instance data and labels of L, respectively. Furthermore, we will write x^(n) (y^(n)) for the n-th feature (label) instance in X (Y) when the associated set is clear from the context, or add a subscript like x_U to indicate its associated set explicitly. We assume a procedure FIT to train and update a classification model.

Bias in the German Credit Risk Dataset
The German Credit Risk data set (https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data, accessed on 17 July 2023) suffers from a gender bias, as shown in Figure 2. In the context of fairness, bias systematically favors a privileged group while penalizing an unprivileged group [18]. Privileged and unprivileged groups are determined by the distribution of a favorable label [18] regarding a protected attribute [46,47]. Discrimination occurs when an unprivileged group is systematically disadvantaged because of a protected characteristic. Notice that discrimination can occur directly or indirectly. Direct discrimination is based on a protected attribute, while indirect discrimination is caused by apparently neutral features that are highly correlated with the protected attribute [37]. To quantify bias, we split the data into four groups [10]: the deprived group with the positive or negative label (DP, DN) and the favored group with the positive or negative label (FP, FN). Consider Figure 2, which displays the distribution of the favorable label good credit risk conditioned on the protected attribute gender: it reveals a gender bias, as the favored group male receives the positive label more often than the deprived group female. So far, we have presented bias mostly narratively, raising the question: how can it be quantified? Metrics measuring bias are called bias-detection metrics. Most bias-detection metrics measure conditional classification-model performance differences. For instance, a fair classifier requires that the recall of the model (in our case, the correctly detected good credit risks) is stable with respect to gender as a protected attribute.
Definition 1 (Protected Attribute). Let S be the protected attribute and let S = s and S = s̄ mark the privileged and unprivileged groups, respectively.

Definition 2 (Favorable Label). Let ŷ = f(x) be a prediction, where ŷ = d denotes the favorable label and ŷ = d̄ the unfavorable label, respectively.
In the following, we present five bias-detection metrics: Statistical Parity [48], Equalized Odds [23], Equal Opportunity [23], False Positive Error Rate Balance [49], and Predictive Parity [49]. According to the definition of Statistical Parity (SP) [48], a classifier is fair if the probability of receiving the unfavorable label is distributed equally across privileged and unprivileged groups, i.e.,

P(ŷ = d̄ | S = s) = P(ŷ = d̄ | S = s̄),  (1)

where P is the probability of the predictive outcome ŷ conditioned on the protected attribute S. When we condition the probability of receiving the unfavorable label, i.e., P(ŷ = d̄), on the privileged and unprivileged groups S = s and S = s̄, we obtain P(ŷ = d̄ | S = s) and P(ŷ = d̄ | S = s̄), respectively. From the definition in Equation (1), it directly follows that Statistical Parity exists if the conditional difference of receiving the unfavorable label between the privileged and unprivileged groups is zero. The differential form of (1) is given in Equation (2), which states that a fair classifier should yield a Statistical Parity estimate SP = 0, i.e.,

SP = P(ŷ = d̄ | S = s̄) − P(ŷ = d̄ | S = s).  (2)

The main difference between Statistical Parity and the remaining bias-detection metrics is that it does not require access to the ground truth, as it solely relies on the prediction conditioned on a known protected attribute. In contrast, Equalized Odds (EqOdds) [23] includes the ground truth label y in the condition of each side:

P(ŷ = d | S = s, y) = P(ŷ = d | S = s̄, y),  y ∈ {d, d̄}.  (3)

Equation (4) formulates EqOdds as an average of two differences. Each part of the sum is a performance difference between the privileged and the unprivileged group. The conditional probability is now replaced by performance metrics of a binary classification. Equalized Odds considers both the false positive rate (fpr) and the true positive rate (tpr):

EqOdds = 1/2 [(fpr_{S=s̄} − fpr_{S=s}) + (tpr_{S=s̄} − tpr_{S=s})].  (4)

In fact, the idea of (3) forms the basis of Equal Opportunity (EqOpp) [23], False Positive Error Rate Balance (FPERB) [49], and Predictive Parity (PP) [49], where EqOpp uses the tpr in (5) and FPERB the fpr in (6). Predictive Parity is slightly different, as it exploits the false discovery rate (fdr) in (7), calculated as fp × (fp + tp)^{−1}:

EqOpp = tpr_{S=s̄} − tpr_{S=s},  (5)

FPERB = fpr_{S=s̄} − fpr_{S=s},  (6)

PP = fdr_{S=s̄} − fdr_{S=s}.  (7)

Our explanatory and interactive FAIRCAIPI cycle will present the bias-detection metrics, as defined above, to the user. Its purpose is to notify and educate users about the impact their changes have on the fairness of the classifier. Furthermore, Statistical Parity (2) will play an important role in the benchmark test in the simulation study, as Reweighing is optimized for Statistical Parity.
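To make the five metrics concrete, the following minimal sketch (ours, not the authors' implementation) computes them from list-encoded predictions, ground truth, and a binary protected attribute; we assume the favorable label d is encoded as 1 and the privileged group s as 1:

```python
def _rates(y_true, y_pred, s, group):
    """True positive, false positive, and false discovery rates within one group."""
    idx = [i for i, g in enumerate(s) if g == group]
    tp = sum(1 for i in idx if y_pred[i] == 1 and y_true[i] == 1)
    fp = sum(1 for i in idx if y_pred[i] == 1 and y_true[i] == 0)
    fn = sum(1 for i in idx if y_pred[i] == 0 and y_true[i] == 1)
    tn = sum(1 for i in idx if y_pred[i] == 0 and y_true[i] == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fdr = fp / (fp + tp) if fp + tp else 0.0
    return tpr, fpr, fdr

def bias_metrics(y_true, y_pred, s):
    """Statistical Parity (2), Equalized Odds (4), Equal Opportunity (5),
    FPERB (6), and Predictive Parity (7); a value of 0 indicates fairness."""
    # SP: difference in receiving the unfavorable label (prediction 0)
    p_unpriv = sum(1 for p, g in zip(y_pred, s) if g == 0 and p == 0) / s.count(0)
    p_priv = sum(1 for p, g in zip(y_pred, s) if g == 1 and p == 0) / s.count(1)
    tpr_u, fpr_u, fdr_u = _rates(y_true, y_pred, s, 0)  # unprivileged group
    tpr_p, fpr_p, fdr_p = _rates(y_true, y_pred, s, 1)  # privileged group
    return {
        "SP": p_unpriv - p_priv,
        "EqOdds": 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p)),
        "EqOpp": tpr_u - tpr_p,
        "FPERB": fpr_u - fpr_p,
        "PP": fdr_u - fdr_p,
    }
```

Note that only SP can be computed without ground truth; all remaining metrics need `y_true`.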

Explanatory Interactive Machine Learning
The state-of-the-art XIML algorithm CAIPI [14] involves users by iteratively including prediction and explanation corrections. CAIPI has three prediction outcome states: Right for the Right Reasons (RRR), Right for the Wrong Reasons (RWR), or Wrong for the Wrong Reasons (W). The two erroneous cases require human intervention. Whereas users correct the label in the W case, they give feedback on wrong explanations in the RWR case, where so-called counterexamples are generated. Counterexamples are additional novel instances containing solely the decisive features. They are supposed to shape the model's mechanism in a presumably correct direction from a user's perspective. Practically, using suitable data-augmentation procedures, a single explanation correction yields multiple different counterexamples. For the remainder of this work, let us assume that a procedure GEN generates counterexamples using a user-induced explanation correction.

Definition 3 (Counterexample Generation). Consider the prediction ŷ = f(x) and suppose that, according to a user, the vector x* is decisive for ŷ, in the sense that the attribution effect of each non-zero feature in x* exceeds a threshold value, where x* is derived from x such that the set of non-zero components of x* is a subset of the set of non-zero components of x. Let us assume a procedure GEN that takes x, ŷ, x*, and c as inputs and returns counterexample feature and target data sets X̄ and Ȳ. For Ȳ, ŷ is repeated c times. For X̄, x is repeated c times; whenever x_i is not set in x*, x_i is disturbed, e.g., by randomization.
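Definition 3 can be sketched as follows. This is our minimal illustration, assuming numeric feature vectors as lists; disturbing non-decisive features with uniform random values is only one of several possible randomization schemes:

```python
import random

def gen(x, y_hat, x_star, c, seed=0):
    """Sketch of the GEN procedure (Definition 3): repeat the instance c
    times, keep the prediction as label, and disturb every feature that is
    not set (zero) in the user's correction x_star."""
    rng = random.Random(seed)
    X_ce, Y_ce = [], []
    for _ in range(c):
        # keep decisive features, randomize the rest
        ce = [xi if si != 0 else rng.random() for xi, si in zip(x, x_star)]
        X_ce.append(ce)
        Y_ce.append(y_hat)
    return X_ce, Y_ce
```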
We use SHapley Additive exPlanations (SHAP) [50] to obtain local explanations of a classification model, whereas traditional CAIPI uses LIME. SHAP tends to be the more fruitful option at this point, as LIME's performance is sensitive to segmentation algorithms [51]. SHAP, in contrast, performs reliably on tabular data [50]. The SHAP explanation model approximates a model f for a specific input x by an explanation g that uses a simplified input x′, which maps to the original input x via a mapping h_x(x′) = x. It ensures that g(z′) ≈ f(h_x(z′)) whenever z′ ≈ x′, where z′ ∈ {0, 1}^M and M is the number of simplified input features. The SHAP method relies on Shapley values, which measure the importance of a feature x_i for a prediction by calculating the impact of knowledge about this feature on the prediction. The contribution of a feature value x_i to a prediction outcome is known as the SHAP value φ_i ∈ R, such that the sum of all feature attributions approximates the output f(x) of the original model.
The SHAP method is built upon three desirable properties: local accuracy, missingness, and consistency [50]. SHAP approximates a model f for an input x in the sense that attributions are added such that the explanation model g(x′) for the simplified input x′ matches the output of the original model f(x):

g(x′) = φ_0 + Σ_{i=1}^{M} φ_i x′_i.  (8)

In the vanilla case, the attributions φ are added linearly, where each φ_i represents the importance of a feature (or a combination of a feature subset) and M is the number of simplified features. The baseline attribution φ_0 is calculated by φ_0 = f(h_x(0)). This simplified representation directly satisfies the local accuracy property whenever x = h_x(x′) holds [50].
Apart from local accuracy, SHAP satisfies the missingness and consistency properties [50]. Missingness states that zero (in this case, missing) feature values have an attribution value of zero. Consistency guarantees that a stronger feature importance for f is also represented by a larger attribution value φ. Ensuring all three properties, SHAP attributions are given as

φ_i(f, x) = Σ_{z′ ⊆ x′} [ |z′|! (M − |z′| − 1)! / M! ] [ f_x(z′) − f_x(z′ \ i) ],  (9)

where z′ ⊆ x′ represents all z′ vectors whose non-zero entries are a subset of the non-zero entries in x′, |z′| is the number of non-zero entries in z′, f_x(z′) = f(h_x(z′)), and z′ \ i denotes setting z′_i = 0. The right-hand side of Equation (9) reflects the original idea behind Shapley values, as it is the difference for f_x between including versus excluding the i-th feature of z′. Equation (9) is the single possible solution that follows from the properties of local accuracy, missingness, and consistency [50]. Young (1985) [52] originally proved that Shapley values are the only possible representations that satisfy local accuracy and consistency. Young (1985) [52] utilizes an additional axiom, which Lundberg and Lee (2017) prove to be non-essential. According to Lundberg and Lee (2017), the missingness property is non-essential for Shapley values themselves; however, it is essential for additive feature attribution methods such as SHAP.
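Equation (9) can be checked on a toy model by brute-force enumeration over feature subsets. The sketch below is our illustration; replacing missing features with a fixed baseline value is one common (assumed) choice of the mapping h_x:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley attributions per Equation (9) by enumerating all
    subsets of the other features; a feature 'missing' from a subset is
    replaced by its baseline value."""
    n = len(x)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # Shapley kernel weight |z'|! (M - |z'| - 1)! / M!
                weight = factorial(len(subset)) * factorial(n - len(subset) - 1) / factorial(n)
                with_i = [x[j] if j in subset or j == i else baseline[j] for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                phi += weight * (f(with_i) - f(without_i))
        phis.append(phi)
    return phis
```

For a linear model the attributions reduce to the weighted feature deviations from the baseline, and by local accuracy they sum to f(x) − f(baseline).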
Consider Figure 3 as an exemplary SHAP explanation for an arbitrary instance and a classification model with a prediction score of 0.5, where values lower than 0.5 indicate a good credit risk and values higher than or equal to 0.5 a bad credit risk. SHAP's base value lies approximately at 0.27. From this point, some attributes have a positive impact, i.e., a positive attribution value; they drag the decision toward bad credit risk. Here, having a bank account status that is not None (it exists) but with average incoming monthly payments within the last year smaller than 200 German Mark, being female (sex = 0), and not having been employed for over 4 years (employment = 4+ years = 0) are the major reasons for a bad credit risk. On the contrary, the investigated instance requested a credit_amount of 2278 German Mark, which is the only feature that receives a negative attribution value and thus contributes to a good credit risk. CAIPI has a local explanation procedure that takes a feature instance and a classification model as input and reveals the decisive features to the user. Mathematically, our procedure EXP is built upon SHAP.

Definition 4 (Local Explanation). Let φ be the attribution values assigned to x, highlighting the importance of each feature x_i in x for f(x) (9). Furthermore, let α be an importance threshold. Let e ⊆ x denote the set of features such that |φ_i| > α holds. We assume an explanation procedure EXP that takes x, f, and α as input and returns e.
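A minimal sketch of the EXP filtering step (ours), assuming a feature dict and per-feature attributions φ in the same order:

```python
def exp_procedure(x, phi, alpha):
    """Definition 4 sketch: keep the features of x whose absolute
    attribution exceeds the importance threshold alpha."""
    return {name: value
            for (name, value), p in zip(x.items(), phi)
            if abs(p) > alpha}
```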
CAIPI [14] leverages user feedback regarding the model's prediction and explanation depending on the prediction outcome state. In each iteration, it selects the most informative instance from an unlabeled dataset, that is, the instance with the prediction score closest to the decision boundary (in our case, 0.5) regarding the classifier trained on a smaller pool of labeled data. We argue that this instance is most informative, as we associate predictions close to the decision boundary with high uncertainty. Knowledge about its label maximizes the information gain for the classifier in the next iteration. The procedure MII retrieves the index of the most informative instance. At this point, we assume access to both the prediction scores and the decision boundary.

Definition 5 (Most Informative Instance).
Let the procedure MII take a set of predictions Ŷ and a decision boundary β as input. It returns the index m of the most informative instance, i.e., the prediction with the score closest to β.

CAIPI requires human feedback at two points: to evaluate the correctness of the prediction and of the explanation. In the second case, by correcting the local explanation, users can induce a desirable decision mechanism. We denote the interaction points by the procedure INTERACT and summarize CAIPI in Algorithm 1. In each iteration, CAIPI trains a model on the labeled data (line 2) and draws predictions on the unlabeled feature instances (line 3) to obtain the most informative instance (line 4). The user examines the prediction and provides the correct label if the prediction is incorrect (line 7). Otherwise, if the prediction is correct, CAIPI presents the corresponding local explanation, which can be corrected by the user if necessary (line 9). If the explanation is correct, the instance is added to the labeled dataset (line 12); otherwise, counterexamples are generated (line 14). The current most informative instance is removed from the set of unlabeled data to prepare the next iteration (line 15). In contrast to the original CAIPI algorithm [14], we formalize each component explicitly. We will utilize our explicit formalization to adapt CAIPI to fair ML in the next section.
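The MII procedure reduces to an argmin over |score − β|; a one-line sketch (ours):

```python
def mii(scores, beta=0.5):
    """Definition 5 sketch: index of the prediction score closest to the
    decision boundary beta (ties resolved by the first occurrence)."""
    return min(range(len(scores)), key=lambda i: abs(scores[i] - beta))
```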

Algorithm 1 CAIPI[L, U, β, α, c] (Explanatory Interactive Learning)
Input: labeled data L, unlabeled data U, decision boundary β, importance threshold α, number of counterexamples c
1: repeat
2:   f ← FIT(L)
3:   Ŷ ← f(X_U)
4:   m ← MII(Ŷ, β)
5:   ŷ(m) ← Ŷ(m)
6:   if INTERACT marks the prediction ŷ(m) as wrong then
7:     L ← L ∪ {(x(m)_U, y(m))}  ▷ Case W: label y(m) retrieved from human annotator
8:   else
9:     e ← EXP(x(m)_U, f, α)
10:    x* ← INTERACT(e)  ▷ user examines and, if necessary, corrects the explanation
11:    if e is correct then
12:      L ← L ∪ {(x(m)_U, ŷ(m))}
13:    else
14:      X̄, Ȳ ← GEN(x(m)_U, ŷ(m), x*, c); L ← L ∪ {(X̄, Ȳ)}  ▷ Case RWR: explanation correction and generation of counterexamples
15:  U ← U \ {x(m)_U}
16: until stopping criterion is met

Fair Explanatory and Interactive Machine Learning
FAIRCAIPI adapts the original CAIPI framework with a fairness objective in two ways: (i) it evaluates the local explanation to detect biased decision-making, and (ii) it accounts for protected attributes during the counterexample generation. Regarding adaptation (i), we recapture the groups DP, DN, FP, and FN from Section 3.1, where D indicates the deprived and F the favored group, each with either the desirable positive or the undesirable negative outcome, P or N, respectively. We argue that an over-proportional presence of DN and FP manifests a bias, as the first assigns the undesirable outcome to the deprived group and the second the desirable outcome to the favored group. Consequently, we define an explanation (a decision-making mechanism) as unfair if the fact of belonging to the deprived group is a reason for receiving the undesirable label or, conversely, belonging to the favored group is a reason for receiving the desirable label. Regarding adaptation (ii), our goal is to remove protected attributes from the decision-making mechanism. This is achieved if the protected attribute is randomized and all remaining features are held constant during counterexample generation. Randomization, in our case, means that if the fact of being male is a reason for a good credit risk, our counterexample is the identical instance, but with the gender female. Let us formalize the notion of biased decision making:

Definition 6 (Biased Decision Making). Consider features e_i from explanation e, written e_i ∈ e, for a model's outcome ŷ = f(x) ∈ {d, d̄}, and a protected attribute S with deprived group s̄ and favored group s. We define the decision-making mechanism of f to be biased if it holds that ŷ = d̄ and ∃ e_i ∈ e with e_i = (S = s̄), or ŷ = d and ∃ e_i ∈ e with e_i = (S = s).

FAIRCAIPI's bias-mitigation strategy takes place in the counterexample generation procedure GEN' (Algorithm 2), where we identify the parameterization of the protected attribute that would reproduce a bias regarding the prediction (line 2). For
example, if the prediction were a bad credit risk, then the bias-reproducing parameterization of the protected attribute gender would be female. Next, we build a set of all possible values of the protected attribute without the bias-reproducing value (line 3). To generate a counterexample instance, we repeat each feature but randomly replace the protected attribute (line 5), and we add the input label (line 6). The resulting counterexample dataset contains all initial correlations except for the correlation between the protected attribute and the target.
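The steps just described can be sketched as follows. This is our illustration, assuming dict-encoded instances; `bias_value` stands for the bias-reproducing parameterization identified in line 2, and all names are illustrative:

```python
import random

def gen_prime(x, y, protected, values, bias_value, c, seed=0):
    """Sketch of GEN' (Algorithm 2): repeat the instance c times, keep the
    label, and randomly replace the protected attribute with any value
    except the bias-reproducing one."""
    rng = random.Random(seed)
    allowed = [v for v in values if v != bias_value]  # drop the bias-reproducing value
    X_ce, Y_ce = [], []
    for _ in range(c):
        ce = dict(x)                         # copy all features (correlations preserved)
        ce[protected] = rng.choice(allowed)  # randomize only the protected attribute
        X_ce.append(ce)
        Y_ce.append(y)                       # keep the input label
    return X_ce, Y_ce
```

In the binary-gender credit example, a bad-credit prediction explained by gender = female yields counterexamples that are identical except for gender = male.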

Algorithm 2 GEN' [x, y, S, c] (Counterexample Generation)
Input: feature instance x, label y, protected attribute S, number of counterexamples c
Output: data sets of labeled counterexamples X̄, Ȳ
1: X̄ ← ∅, Ȳ ← ∅
2: s_b ← bias-reproducing value of S given y
3: S′ ← values(S) \ {s_b}
4: for n ← 1 : c do
5:   x̄ ← x with x̄_S ← sample(S′)
6:   X̄ ← X̄ ∪ {x̄}, Ȳ ← Ȳ ∪ {y}
7: return X̄, Ȳ

In contrast to the original CAIPI, FAIRCAIPI (Algorithm 3) takes the protected attribute as an input. The users still provide feedback on the model's explanation. However, their task is no longer to induce correct mechanisms but to mitigate bias. Thus, according to Definition 6, the interaction in line 10 returns True if the explanation is biased and False otherwise, where an identified bias yields bias-mitigating counterexamples (line 14). FAIRCAIPI is capable of (i) detecting and (ii) reducing machine bias and (iii) uncovering human bias. Note, however, that technically, FAIRCAIPI so far does not require human feedback as long as we have access to all elements of Algorithm 3, which often is a realistic assumption when experiments are only pseudo-unlabeled. Although this qualifies FAIRCAIPI to meet the first two capabilities, the third is still not met. Practically, we explicitly want to include a human user in the optimization cycle. The user can override FAIRCAIPI's biased-decision verdict (Definition 6). This might happen despite good human intentions (passive bias, when the users fail to correct a biased mechanism) or due to active bias, when users intentionally miss corrections or correct the mechanism such that the bias-detection metrics suffer. We enrich FAIRCAIPI to present bias-detection metrics (Section 3.1) to the user at the beginning and the end of each iteration, i.e., before and after refitting the model (line 2). Furthermore, we include an interaction step where the user has the opportunity to provide feedback, and FAIRCAIPI notifies the user in case conducted or missed corrections negatively affect the bias-detection metrics. Placing FAIRCAIPI in the previously described design (Figure 4) has an essential benefit: it educates users, as it relates human feedback directly to bias-detection metrics.
Algorithm 3 FAIRCAIPI[L, U, S, α, c] (Fair Explanatory Interactive Learning)
Input: labeled data L, unlabeled data U, protected attribute S, importance threshold α, number of counterexamples c
1: repeat
2:   f ← FIT(L) and ComputeFairnessMetrics(f, S, L)  ▷ compare fairness before and after refitting the model and present it to the user
3:   Ŷ ← f(X_U)
4:   m ← MII(Ŷ, 0.5)
5:   ŷ(m) ← Ŷ(m)
6:   if INTERACT marks the prediction ŷ(m) as wrong then
7:     L ← L ∪ {(x(m)_U, y(m))}  ▷ label y(m) retrieved from human annotator
8:   else
9:     e ← EXP(x(m)_U, f, α)
10:    b ← INTERACT(e)  ▷ True if the explanation is biased (Definition 6)
11:    if not b then
12:      L ← L ∪ {(x(m)_U, ŷ(m))}
13:    else
14:      X̄, Ȳ ← GEN'(x(m)_U, ŷ(m), S, c); L ← L ∪ {(X̄, Ȳ)}  ▷ bias-mitigating counterexamples
15:  U ← U \ {x(m)_U}
16: until stopping criterion is met
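Putting the pieces together, one FAIRCAIPI iteration can be sketched as below. This is our simplified simulation variant in which label feedback is taken from known ground truth and the procedures FIT, EXP, the bias check of Definition 6, and GEN' are passed in as callables; all names are illustrative:

```python
def faircaipi_step(fit, explain, is_biased, gen_prime, L, U, S, c):
    """One simplified FAIRCAIPI iteration: retrain, pick the most
    informative instance, and update the labeled pool via label
    correction or bias-mitigating counterexamples."""
    model = fit(L)                                    # retrain on labeled data
    scores = [model(x) for x, _ in U]                 # predict the unlabeled pool
    m = min(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))  # MII
    x, y_true = U.pop(m)                              # remove from unlabeled pool
    y_hat = int(scores[m] >= 0.5)
    if y_hat != y_true:                               # case W: label correction
        L.append((x, y_true))
    else:
        e = explain(x, model)
        if is_biased(e, y_hat, S):                    # Definition 6 check
            X_ce, Y_ce = gen_prime(x, y_hat, S, c)    # bias-mitigating counterexamples
            L.extend(zip(X_ce, Y_ce))
        else:
            L.append((x, y_hat))
    return L, U
```

In a real deployment, `is_biased` would be the user interaction (with the fairness metrics shown before and after refitting), not an automatic check.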

Figure 4. Human FAIRCAIPI interaction. FAIRCAIPI seeks user feedback regarding the prediction and explanation of a classification model. Placing bias-detection metrics at the beginning and the end of each iteration lets users relate their corrections to the model's bias. Furthermore, FAIRCAIPI notifies users if conducted or missed corrections negatively affect bias-detection metrics. FAIRCAIPI, operated by users who act out of good intentions, will make classification models fairer.
Our approach aims at improving the model quality while minimizing the number of queries, interactions, and overall cost. According to [14], active learning benefits from a significantly improved sample complexity compared to passive learning due to the targeted selection of instances for labeling. Regarding its computational complexity, our approach is influenced by its model-agnostic nature, inheriting the complexities associated with fitting a model and making predictions for specific instances with the underlying machine-learning algorithm. The core components of our approach, including FAIRCAIPI and its associated computations, such as computing fairness metrics and SHAP values, maintain reasonable complexity: notably, the SHAP computation, which is part of the EXP procedure, although generally intractable, can be efficiently approximated in polynomial time [50,53]. The procedures MII and GEN' for counterexample generation, as well as the set operations used, are at most of polynomial complexity. Hence, we argue that FAIRCAIPI is a computationally viable XIML approach.
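To illustrate why attribution computation stays tractable, here is a generic Monte Carlo permutation approximation of Shapley values, in the spirit of the polynomial-time approximations cited above; it is a didactic sketch, not the SHAP implementation used in the paper:

```python
import random

def shapley_approx(f, x, baseline, n_samples=200, seed=0):
    """Approximate Shapley values by averaging marginal contributions over
    random feature orderings; cost is O(n_samples * d) model evaluations."""
    rng = random.Random(seed)
    d = len(x)
    phi = [0.0] * d
    for _ in range(n_samples):
        order = list(range(d))
        rng.shuffle(order)
        z = list(baseline)          # start from the reference point
        prev = f(z)
        for j in order:             # add features one by one
            z[j] = x[j]
            cur = f(z)
            phi[j] += cur - prev    # marginal contribution of feature j
            prev = cur
    return [p / n_samples for p in phi]
```

For an additive model the approximation is exact; e.g., for f(z) = 2·z0 + 3·z1 with baseline (0, 0) and instance (1, 1), it returns attributions of 2 and 3.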

Simulation Study
We demonstrate FAIRCAIPI in a simulation study whose objective is to reduce gender bias in the German Credit Risk dataset (code that reproduces the simulation study and code to execute FAIRCAIPI interactively according to Figure 4 is available under https://github.com/emanuelsla/faircaipi, accessed on 11 August 2023). This makes the gender variable the only protected attribute S. Moreover, we have access to the labels of the actually unlabeled instances by Y_U = l(U) and evaluate our explanations according to Definition 6 with an attribution threshold of 0.005. We benchmark FAIRCAIPI against Reweighing [10], a pre-processing technique that assigns weights to the training instances to achieve Statistical Parity (1). From the German Credit Risk dataset, we assume 550 instances to be labeled and 150 to be unlabeled. The remaining 300 instances serve as test data. We train a Random Forest classifier with balanced class weights. A priori, we achieve an accuracy of 75%. A comparatively high starting performance is important for FAIRCAIPI, as its objective is to make decision-making mechanisms fairer. Prior research shows that fewer a priori labeled instances suffice for a high predictive performance after CAIPI optimization [16]. We run FAIRCAIPI for 100 iterations. In RWR iterations, we add a single counterexample, the identical instance with the opposite gender; we thereby neutralize the instance's gender effect.
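For reference, the Reweighing baseline can be sketched with the standard weighting formula w(s, y) = P(S=s) · P(Y=y) / P(S=s, Y=y), which makes the protected attribute and the label statistically independent in the weighted sample; the function below is our own minimal illustration, not the benchmarked implementation:

```python
from collections import Counter

def reweighing_weights(S, Y):
    """Return one weight per (group, label) combination so that, after
    weighting, the protected attribute S is independent of the label Y."""
    n = len(S)
    count_s = Counter(S)
    count_y = Counter(Y)
    count_sy = Counter(zip(S, Y))
    return {
        (s, y): (count_s[s] / n) * (count_y[y] / n) / (count_sy[(s, y)] / n)
        for (s, y) in count_sy
    }
```

Over-represented combinations (e.g., the favored group with the favorable label) receive weights below 1, under-represented ones weights above 1.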

Results
During the 100 FAIRCAIPI iterations shown in Figure 5, we observe a slight trend toward more label corrections compared to explanation corrections. This implies more iterations of type W than of type RWR. Within FAIRCAIPI, approximately 35 out of 100 iterations correct only the label, i.e., solely the predictive accuracy. Yet, within 30 iterations, the SHAP explanation is corrected. These are the only iterations where the simulated user tries to adapt the model's mechanism to mitigate the classifier's inherent bias. However, in about 35 FAIRCAIPI iterations, the prediction is correct, including an unbiased decision-making mechanism, representing RRR cases. The results for the bias-mitigation property of FAIRCAIPI are shown in Table 2. There, we compare several bias-detection metrics after 100 iterations of the FAIRCAIPI optimization, their optimal values during the 100 iterations, the state-of-the-art sampling-based pre-processing procedure Reweighing, and the default Random Forest classifier without bias-mitigation extensions. All bias-detection metrics have their optimum at zero. We observe that FAIRCAIPI is superior for every bias-detection metric except Statistical Parity. This holds for the FAIRCAIPI results after 100 iterations and is even amplified by taking its optima into account. Reweighing, which adds weights to instances to satisfy Statistical Parity, clearly outperforms the others regarding Statistical Parity but offers only minor improvements for the other bias-detection metrics compared to the default Random Forest classifier.
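Two of the bias-detection metrics compared in Table 2 can be sketched as follows; these are illustrative implementations of the common textbook definitions (the paper's formal definitions appear in Section 3.1), both with an optimum of zero:

```python
def statistical_parity_difference(y_pred, groups, deprived, favored, positive=1):
    """P(Yhat = positive | deprived) - P(Yhat = positive | favored)."""
    def rate(g):
        preds = [p for p, grp in zip(y_pred, groups) if grp == g]
        return sum(1 for p in preds if p == positive) / len(preds)
    return rate(deprived) - rate(favored)

def false_positive_rate_difference(y_true, y_pred, groups, deprived, favored, positive=1):
    """False Positive Error Rate Balance: FPR(deprived) - FPR(favored)."""
    def fpr(g):
        pairs = [(t, p) for t, p, grp in zip(y_true, y_pred, groups)
                 if grp == g and t != positive]
        return sum(1 for t, p in pairs if p == positive) / len(pairs)
    return fpr(deprived) - fpr(favored)
```

A negative Statistical Parity difference indicates that the deprived group receives the favorable prediction less often than the favored group.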
The bias-reduction trend can also be observed in Figure 7, where the numbers of unfair predictions and explanations tend to decrease during the FAIRCAIPI optimization. Figure 7 provides some interesting insights: due to the sizes of the labeled and test data, unfair predictions and explanations occur more frequently in the labeled data than in the test data. Moreover, if the explanation is unfair, i.e., the prediction is made for biased reasons, the prediction is also classified as unfair. This does not hold the other way around, as instances can be part of the deprived group and receive the unfavored label for fair reasons, i.e., not caused by the protected attribute. Hence, unfair predictions occur more often than unfair explanations.
Bias mitigation should not negatively affect the predictive performance of the classifier. Therefore, we summarize several performance metrics for binary classification in Table 3, where we compare accuracy, precision, recall, and F1-score for FAIRCAIPI, a standard Random Forest classifier, and a Random Forest classifier pre-processed with Reweighing. FAIRCAIPI is superior for each performance metric. However, we observe a large discrepancy in the performance metrics when they are conditioned on the target. Although all classifiers perform comparatively well for good credit risk, their performance suffers for bad credit risk. We underpin the prior comparison and visualize the test accuracy of each FAIRCAIPI iteration (Figure 8), but we stress the highly imbalanced setting (Figure 2) and, therefore, emphasize the results in Table 3. We only use the accuracy metric in Figure 8 as a proxy to visualize performance changes over the course of the FAIRCAIPI optimization. We see minor changes, which is the desirable behavior, as FAIRCAIPI is designed to reduce the bias of an established decision-making mechanism rather than optimize the predictive quality. Let us finalize this section and answer our research questions: (R1) Does the correction of explanations for fairness lead to fairer models?
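The per-target discrepancy corresponds to evaluating the metrics per class; a minimal sketch for binary labels (our own helper, not taken from the paper's code):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 with respect to one target class,
    e.g., positive=1 for good credit risk or positive=0 for bad credit risk."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Calling it once with positive=1 and once with positive=0 reproduces the per-target comparison of Table 3.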
Yes, according to Figure 6, the bias-detection metrics converge to their optimum as explanation corrections are added. Furthermore, unfair predictions decrease over the course of 100 FAIRCAIPI iterations (Figure 7), where explanations are corrected in 30 iterations (Figure 5). (R2) Does correcting explanations for fairness lead to fairer explanations?
Yes, Figure 7 reveals that the number of unfair explanations decreases with FAIRCAIPI. (R3) Does correcting for fair explanations have a negative impact on the predictive performance of the model? No, Figure 8 clearly illustrates only minor performance changes during the FAIRCAIPI optimization. Table 3 even indicates a slight increase in predictive quality with FAIRCAIPI compared to Random Forest classification with and without Reweighing. (R4) Which is superior, FAIRCAIPI or the state-of-the-art Reweighing strategy?
Considering Table 2, FAIRCAIPI is the superior bias-mitigation strategy for every metric except Statistical Parity, which is the optimization goal of Reweighing.

Discussion
Our experimental results show that FAIRCAIPI is a suitable bias-mitigation strategy. It satisfies two desirable properties: first, it does not negatively affect the predictive performance; on the contrary, according to our findings, it slightly increases it. Second, it mitigates bias successfully, taking several bias-detection metrics into account, and even outperforms a state-of-the-art bias-mitigation pre-processing procedure. Moreover, in the spirit of XIML, FAIRCAIPI ensures a transparent decision-making mechanism and is capable of directly involving humans in bias mitigation, which, despite legal requirements that might exist, is undoubtedly an ethical benefit. Humans do not treat fairness as a stationary concept [11]. According to our findings, the adaptation of the decision-making mechanism of a classification model through human feedback yields an overall well-suited bias mitigation, taking several bias-detection metrics into account. Optimizing for a stationary metric alone does not guarantee overall fairness improvements.
We argue that FAIRCAIPI is capable of (i) discovering and (ii) reducing machine bias and (iii) detecting human bias. Although the first two arguments are investigated experimentally, we propose only a formal architecture for the third; this requires further investigation, and user studies are applicable here. In this context, we could ask: is FAIRCAIPI able to reduce human bias? In general, the experiments face limitations, e.g., we only present a proof of concept on a single dataset, investigating a single bias. We do not address how FAIRCAIPI performs with multiple protected variables, which is empirically more often the case, or in the context of multi-label classification or regression. Nevertheless, the setup of our simulation study is in line with evaluations of selected state-of-the-art bias-mitigation strategies at the pre-processing [10,21], in-processing [8], and post-processing [24] stages. All aforementioned papers have in common that they mitigate gender bias. A subset of them uses the German Credit Risk dataset [8,10,21]. In general, FAIRCAIPI has some algorithmic shortcomings: even if users are notified when their decision negatively affects bias-detection metrics, FAIRCAIPI assumes sufficient knowledge to provide optimal feedback. However, what happens when unfair explanations are less obvious than in our vanilla case, e.g., when multivariate correlations of features reproduce bias? Then, our user assumption is probably too strong. Compared to traditional CAIPI [14], we also lack experimental evidence on how an increasing number of counterexamples affects fairness. Using our simple counterexample-generation procedure would imply repeating the identical counterexample multiple times. More sophisticated counterexample generators using statistical bootstrap methods or generative approaches are applicable here.
Let us conclude this section and place our findings into the existing research: FAIRCAIPI is an in-processing bias-mitigation method that is located in a specific niche of XIML methods with a fairness objective. According to our literature review, FAIRCAIPI is most closely related to a bias-mitigation method that lets users interact with explanations [44]. However, the major difference is that FAIRCAIPI aims for a human-machine partnership in which both parties profit: machine bias is mitigated, and the user's bias is detected. Existing XIML bias-mitigation procedures involve users more rarely. In return, they may be more applicable to practical problems because involving users over 100 iterations may be time-consuming. Nevertheless, FAIRCAIPI extends the spectrum of XIML procedures and occupies a specific niche. Our experiments show that frequent user interaction is also fruitful for bias mitigation and may be morally preferable in the context of fairness.

Conclusions
FAIRCAIPI is an in-processing bias-mitigation algorithm based on XIML that involves users. Iterative user feedback is used to nudge the model's decision-making mechanism in a fairer direction, i.e., toward less biased decision making. Experiments show that bias-detection metrics improve during the FAIRCAIPI optimization while the predictive quality of the classification model remains stable. FAIRCAIPI can detect and mitigate machine bias. Furthermore, it also detects human bias.
For future work, we plan to generalize our framework: we will mitigate bias in arbitrary classification settings and consider multiple protected attributes at once. Arbitrary classification includes, for instance, multi-label classification. This will force us to adjust our FAIRCAIPI cycle because a predictive result can be assigned to multiple prediction-outcome cases at the same time. In this regard, we will ask two research questions: What does CAIPI user feedback look like in the context of multi-label classification? In addition, how can user feedback in multi-label classification settings be converted into counterexamples? In the simplest case, multiple protected attributes will all be subject to randomization in the counterexample-generation step. However, do we need to resolve each correlation between protected attributes, or does it suffice to change a subset of them to mitigate the overall bias of the classification model? In addition, our research will focus on multivariate correlations for bias detection. We assume that FAIRCAIPI is not able to resolve biased decision-making mechanisms in the case of indirect bias because protected attributes might not be directly covered by local explanations. We believe that methods from causal statistics are a possible solution. Instead of presenting attribution values to users, as SHAP does, we will visualize underlying correlations, for instance, by Bayesian networks, to make users aware of how features influence each other. This raises the question: How can user feedback on Bayesian networks be transformed into counterexamples? One of our major future research goals is to educate users, even when biased decision making is less obvious. User education needs to be evaluated in dedicated user studies. We will develop appropriate FAIRCAIPI user interfaces. Their clarity and simplicity also offer potential future research directions.

Figure 1 .
Figure 1. Interactive learning in the domain of lending.

Figure 2 .
Figure 2. Visualization of the Favored group with Positive label (FP), Favored group with Negative label (FN), Deprived group with Positive label (DP), and Deprived group with Negative label (DN). The figure investigates the favorable label good credit risk regarding the protected feature gender and reveals a gender bias, as males are more likely to have a good credit risk than females.

Figure 3 .
Figure 3. SHAP explanation for a test instance of a classification model trained on the German Credit Risk dataset. The classification score matches the decision boundary of 0.5. SHAP estimates an attribution value for each feature. It mimics the classification model's outcome by summing the attribution-scaled contributions of all features. The algebraic sign of an attribution value determines whether the feature contributes to a positive or negative classification result. SHAP starts from a base value, the attribution when no feature information is present; here, the base value is approximately 0.27. From this point, features in red contribute to a bad credit risk, whereas features in blue drag the classification score toward a good credit risk. (Figure generated with https://shap-lrjball.readthedocs.io/en/latest/generated/shap.force_plot.html, accessed on 19 July 2023.)
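The force plot rests on SHAP's additivity property: the classification score equals the base value plus the sum of all feature attributions. A minimal numeric check, where the base value is taken from the figure and the individual attributions are hypothetical examples of our own:

```python
base_value = 0.27  # base value reported in Figure 3
# hypothetical attributions: positive values (red) push toward bad credit risk,
# negative values (blue) toward good credit risk
attributions = {"duration": 0.15, "credit_amount": 0.10, "age": -0.02}
classification_score = base_value + sum(attributions.values())
```

With these illustrative numbers, the sum lands exactly on the 0.5 decision boundary mentioned in the caption.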

Algorithm 3
FAIRCAIPI(L, U, S, c, n)
Input: labeled dataset L, unlabeled dataset U, protected attribute S, number of counterexamples c, number of iterations n
Output: classification model f

Figure 4 .
Figure 4. Human FAIRCAIPI interaction. FAIRCAIPI seeks user feedback regarding the prediction and explanation of a classification model. Placing bias-detection metrics at the beginning and the end of each iteration lets the users relate their corrections to the model's bias. Furthermore, FAIRCAIPI notifies users if conducted or missed corrections negatively affect the bias-detection metrics. FAIRCAIPI, operated by users who act out of good intentions, makes classification models fairer.

Figure 5 .
Figure 5. Number of label and explanation corrections within 100 FAIRCAIPI iterations. We observe a phase with predominantly label corrections around iterations 20 to 30. Before and afterward, both correction types are distributed proportionally across the FAIRCAIPI iterations.

Figure 6
Figure 6 visualizes the development of the investigated bias-detection metrics over the course of 100 FAIRCAIPI iterations, all of which have their optimum at zero. Each metric tends to move towards its optimum and approximately reaches it in its best iteration. Most metrics have a small amplitude and stay close to the optimum throughout the entire optimization cycle. Exceptions are Equalized Odds and False Positive Error Rate Balance: although the former steadily converges toward its optimum, the latter has an even higher amplitude and tends to diverge from the optimum again after iteration 80.

Figure 6 .
Figure 6. Bias-detection metrics during the FAIRCAIPI optimization. We compare the development of the bias-detection metrics across 100 FAIRCAIPI iterations. All metrics have their optimum at zero.

Figure 7 .
Figure 7. Unfair predictions and explanations during the FAIRCAIPI optimization. In each iteration, we calculate the number of unfair predictions and explanations on the labeled and test data.

Figure 8 .
Figure 8. Predictive quality of FAIRCAIPI during 100 iterations. Starting with a baseline accuracy of 75%, we calculate the test accuracy in each iteration. We add a 70% accuracy threshold mark.
DP Deprived (unprivileged) group with Positive (favorable) label
DN Deprived (unprivileged) group with Negative (unfavorable) label
FP Favored (privileged) group with Positive (favorable) label
FN Favored (privileged) group with Negative (unfavorable) label

Table 2 .
Bias evaluation. We compare bias metrics for a Random Forest classifier trained on the German Credit Risk dataset. Default values result from a plain Random Forest model, and Reweighing includes a Statistical Parity-optimized sampling procedure prior to training. The FAIRCAIPI column references the end result (after 100 iterations), and FAIRCAIPI (opt.) the optimum values.

Table 3 .
Evaluation of predictive quality. We compare several performance metrics for a Random Forest classifier trained on the German Credit Risk dataset. The Default column references a plain Random Forest model without modifications, Reweighing includes a bias-mitigation pre-processing sampling strategy, and FAIRCAIPI contains the results after 100 iterations.