Review

A Comprehensive Review and Benchmarking of Fairness-Aware Variants of Machine Learning Models

by George Raftopoulos, Nikos Fazakis, Gregory Davrazos and Sotiris Kotsiantis *
Department of Mathematics, University of Patras, 26504 Patras, Greece
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Algorithms 2025, 18(7), 435; https://doi.org/10.3390/a18070435
Submission received: 24 June 2025 / Revised: 9 July 2025 / Accepted: 15 July 2025 / Published: 16 July 2025

Abstract

Fairness is a fundamental virtue in machine learning systems, alongside four other critical virtues: Accountability, Transparency, Ethics, and Performance (FATE + Performance). Ensuring fairness has been a central research focus, leading to the development of various mitigation strategies in the literature. These approaches can generally be categorized into three main techniques: pre-processing (modifying data before training), in-processing (incorporating fairness constraints during training), and post-processing (adjusting outputs after model training). Beyond these, an increasingly explored avenue is the direct modification of existing algorithms, aiming to embed fairness constraints into their design while preserving or even enhancing predictive performance. This paper presents a comprehensive survey of classical machine learning models that have been modified or enhanced to improve fairness concerning sensitive attributes (e.g., gender, race). We analyze these adaptations in terms of their methodological adjustments, their impact on algorithmic bias, and their ability to maintain predictive performance comparable to the original models.

Graphical Abstract

1. Introduction

In recent years, the application of machine learning (ML) systems to decision-making in sensitive areas such as healthcare, finance, employment, and criminal justice has raised considerable controversy concerning algorithmic fairness. As demonstrated in [1], cases such as the COMPAS recidivism algorithm and Amazon’s hiring algorithm illustrate how AI systems can fall short of fairness, accountability, and transparency, leading to ethical concerns.
Machine learning models, despite their strong predictive performance and interpretability, are known to inherit and amplify biases present in historical data and to generate unfair outcomes. This has led to a growing body of research that aims to make such models fairer by developing fairness-aware interventions that minimize disparate effects across demographic groups with no detectable drop in accuracy [2].
The domain of algorithmic fairness has seen tremendous growth in the past decade, spurred by increased awareness of the social consequences of machine learning (ML) systems. Fairness-aware machine learning arose in reaction to egregious cases of algorithmic bias in areas such as criminal justice, finance, and health. Initial attempts, such as the formulation of demographic parity and equalized odds circa 2016, sought to establish formal fairness definitions for evaluating and curbing discrimination. These approaches typically seek to alter model behavior so that results are balanced across sensitive attributes such as race, gender, or age. However, fairness in ML is a contentious and complex concept; models satisfying one definition of fairness may violate another. Thus, fairness-aware variants of classical ML algorithms are not merely fairness optimizations but trade-off systems that balance accuracy, parity, and explainability.
Though various fairness-aware approaches have been proposed, including pre-processing techniques, in-processing algorithmic reformulations, and post-processing adjustments, their practical effects and relative performance across different models and datasets remain underexplored. A careful examination of such variants is needed to inform practitioners and policymakers about the approaches best suited to practical use. Moreover, much of the literature focuses on deep learning or domain-specific pipelines, overlooking the ongoing relevance of classical models such as Logistic Regression, Support Vector Machines, Decision Trees, and k-nearest neighbors to a wide range of practical applications.
This paper provides a comprehensive review and empirical comparison of fairness-aware variants of standard machine learning models. We examine their theoretical underpinnings, cluster them according to intervention strategy, and compare their performance across multiple fairness metrics and datasets across different application domains.
In Section 2, we present various fairness-aware modifications of the Naïve Bayes algorithm. Section 3 explores fairness-enhanced adaptations of Decision Trees, one of the most interpretable machine learning models. Fairness-aware variants of Logistic Regression and a modified version of the Support Vector Machine are discussed in Section 4. Finally, fairness-aware ensemble models and combinations of the exponentiated gradient technique with three different machine learning algorithms (Naive Bayes, Decision Trees, and Logistic Regression) are examined in Section 5 and Section 6. Section 7 provides a detailed account of the experimental procedure, systematically examining the datasets employed, the configuration of model parameters, and the evaluation metrics used to assess both predictive performance and fairness. The results presented in Section 8 cover all fairness-aware variants of classical machine learning algorithms across each dataset. The final sections of the paper include a comprehensive discussion of the results (Section 9) and a summary of the main conclusions (Section 10).

2. Fair Editions of Naive Bayes

This section provides a concise overview of the different Naive Bayes algorithm variants introduced in the literature. Table 1 highlights the main features and distinguishing aspects of each approach.

2.1. Calders and Verwer Fair Variants of Naive Bayes

In [3], three methods were proposed to modify the Naive Bayes classifier to ensure fairness. The first one adjusts the probability of the decision being positive so that it does not favor any sensitive group. More analytically, instead of modeling P(S | C) (where S is the protected attribute and C the result class) as in standard Naive Bayes, the classifier is altered to model P(C | S); it then iteratively modifies the probability distributions (e.g., adding probability mass to disadvantaged groups and subtracting it from advantaged groups) until discrimination is minimized, while keeping the total number of positive classifications roughly the same to avoid global distribution shifts.
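A minimal sketch of this first variant, assuming a binary sensitive attribute and access to the model's positive-class probability per group; the step size and stopping rule below are illustrative rather than the exact values used in [3].

```python
import numpy as np

def rebalance_positive_rates(p_pos_given_s, group_sizes, delta=0.01, tol=0.01, max_iter=1000):
    """Illustrative re-balancing of P(C=+ | S) for two groups.

    p_pos_given_s : P(C=+ | S=s) for the advantaged (index 0) and
                    disadvantaged (index 1) group.
    group_sizes   : number of instances per group, used to keep the expected
                    number of positive classifications roughly constant.
    """
    p = np.asarray(p_pos_given_s, dtype=float).copy()
    n = np.asarray(group_sizes, dtype=float)
    total_pos = np.dot(p, n)                 # expected number of positives overall
    for _ in range(max_iter):
        discrimination = p[0] - p[1]         # gap in positive rates between groups
        if discrimination <= tol:
            break
        # Move probability mass toward the disadvantaged group, then rescale the
        # advantaged group so the expected number of positives stays the same.
        p[1] = min(1.0, p[1] + delta)
        p[0] = max(0.0, (total_pos - p[1] * n[1]) / n[0])
    return p

# Example: the advantaged group receives positives 40% of the time, the other 10%.
print(rebalance_positive_rates([0.40, 0.10], [700, 300]))
```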
The second one trains separate Naive Bayes models for each value of the sensitive attribute and balances their predictions. For example, in the case of two groups—one favored and the other unfavored—two Naive Bayes models are trained. To remove discrimination, both models are balanced following the algorithmic logic of the first variant.
The third approach incorporates a latent variable L under the assumptions that L is independent of S and that C is determined by applying discrimination to the L labels based on S. The latent variable is estimated using the Expectation Maximization (EM) algorithm. According to this algorithm, during the Expectation Step, the probability of the latent variable L is estimated given the current classifier parameters. Subsequently, during the Maximization Step, the model parameters are optimized to maximize the likelihood.
The application of these Naive Bayes variants to the benchmark Census Income dataset demonstrated that the two-model approach (training separate Naive Bayes models for different sensitive groups) was the most effective in achieving a balance between fairness and accuracy. In contrast, while the third approach was more mathematically rigorous, it exhibited instability, limiting its practical reliability.

2.2. Fair Naive Bayes Classifier

Choi et al. [4] proposed a variation of the classical Naïve Bayes model to achieve fairness. The authors introduced the concept of discrimination patterns—instances where an individual’s predicted outcome is influenced by the presence of sensitive attributes. The algorithm detects discrimination patterns that exceed a defined threshold δ using an optimized search strategy. A cutting-plane approach iteratively refines the Naïve Bayes model by alternating between parameter learning and constraint extraction, gradually incorporating fairness constraints during training. These constraints are formulated as a signomial programming problem, ensuring that fairness adjustments do not significantly degrade model performance. At each iteration, the algorithm learns the maximum likelihood parameters subject to fairness constraints and identifies k additional patterns using the updated parameters, which are then added to the constraint set. This process continues until no further discrimination patterns are found.
This Naïve Bayes fairness variation was applied to real benchmark datasets, including the Adult dataset, the German dataset, and COMPAS, achieving better results in terms of both performance and fairness compared to the 2-Naïve Bayes model [3].

2.3. Fairness-Aware Naive Bayes

Boulitsakis-Logothetis [5] introduced the N-Naïve Bayes (NNB) algorithm, which extends the 2-Naïve Bayes classifier [3] by supporting multiple groups rather than a binary distinction. He proposed two probability-balancing routines to adjust P ( Y | S ) (the probability of a positive outcome given a sensitive feature): NNB-Parity, which enforces statistical parity, and NNB-DF, which enforces differential fairness—an extension of the group fairness concept that accounts for intersectional groups (subgroups defined by multiple overlapping sensitive attributes). The NNB classifier was evaluated on income and employment classification tasks using US Census data, a widely recognized real-world benchmark dataset. While NNB-Parity maintained higher accuracy, it tended to over-favor non-privileged groups. In contrast, NNB-DF provided stronger fairness guarantees, effectively reducing bias amplification, albeit at the cost of a more significant accuracy decline.

3. Fair Editions of Decision Trees

This section provides a concise overview of the different algorithmic variants based on the Decision Tree framework that have been introduced in the literature. Table 2 summarizes the main characteristics and comparative aspects of each method.

3.1. Discrimination-Aware Decision Tree

Kamiran, Calders, and Pechenizkiy [6] proposed an approach that directly integrates fairness constraints into the Decision Tree learning process by modifying the splitting criterion to balance accuracy and fairness. In addition to the conventional information gain (IG) with respect to class labels, they introduced information gain with respect to a protected attribute, leading to the development of three novel splitting strategies. The first strategy, IG_C − IG_S, subtracts discrimination gain from accuracy gain; the second, IG_C / IG_S, balances the trade-off between accuracy and discrimination; and the third, IG_C + IG_S, promotes homogeneity across both class labels and the sensitive attribute. Furthermore, the authors incorporate a post-processing relabeling technique that adjusts Decision Tree leaf labels to mitigate discrimination. This relabeling process is formulated as a knapsack optimization problem, aiming to maximize fairness while minimizing accuracy loss. The proposed method is evaluated on three benchmark datasets—the Census Income dataset, Dutch Census 1971 dataset, and Dutch Census 2001 dataset—demonstrating strong performance in both accuracy and fairness. Notably, the combination of the IG_C + IG_S splitting criterion and the relabeling technique outperforms existing fairness-aware classification methods.
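The three criteria can be illustrated with a short sketch that computes the information gain of a candidate split with respect to both the class labels and the protected attribute; this is a simplified illustration, not the authors' implementation.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(target, split_mask):
    """Information gain of a binary split (split_mask) with respect to `target`."""
    n = len(target)
    left, right = target[split_mask], target[~split_mask]
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(target) - weighted

def fair_split_scores(y, s, split_mask, eps=1e-12):
    """Return the three discrimination-aware criteria of [6] for one candidate split."""
    ig_c = information_gain(y, split_mask)   # gain with respect to class labels
    ig_s = information_gain(s, split_mask)   # gain with respect to the protected attribute
    return {
        "IGC - IGS": ig_c - ig_s,            # penalise splits that also separate the groups
        "IGC / IGS": ig_c / (ig_s + eps),    # trade off accuracy against discrimination
        "IGC + IGS": ig_c + ig_s,            # favour homogeneity in class AND group
    }

# Toy example: y = class labels, s = protected attribute, split on some feature.
y = np.array([1, 1, 1, 0, 1, 0, 0, 0])
s = np.array([0, 0, 0, 1, 1, 0, 1, 1])
split = np.array([True, True, True, True, False, False, False, False])
print(fair_split_scores(y, s, split))
```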

3.2. Optimal and Fair Decision Trees via Regularization (RegOCT)

The authors of [7] proposed a Mixed Integer Programming (MIP) framework for learning optimal and fair Decision Trees, addressing both Disparate Treatment and Disparate Impact in decision-making processes. They introduced two additional fairness indices: the Disparate Impact Discrimination Index (DIDI), which quantifies classification bias, and the Disparate Treatment Discrimination Index (DTDI), which assesses individual fairness violations. The framework formulates Decision Tree optimization using linear programming constraints, with an objective function that balances classification accuracy (minimizing misclassification) and fairness constraints (reducing Disparate Impact and Disparate Treatment). The authors evaluated their approach on classification and regression tasks using real-world datasets, including Credit, Adult Income, COMPAS, and Crime. The results demonstrate that the proposed method achieves near-zero discrimination across these datasets, outperforming other fairness-aware Decision Trees such as the Discrimination-Aware Decision Tree, while maintaining minimal accuracy loss (1–3%) compared to CART and other heuristic-based models.

3.3. FATT Fairness-Aware Tree Training Method

The authors of [8] proposed a genetic optimization-based algorithm that trains evolving Decision Trees using crossover and mutation strategies, prioritizing both individual fairness and accuracy. The algorithm ultimately produces a single model and not an ensemble one, thereby reducing complexity while maintaining fairness. The experimental evaluation of this method was conducted on five benchmark datasets: Adult, COMPAS, Communities and Crime, German Credit, and the Heritage Health dataset. FATT-trained models exhibited a 35–45% improvement in fairness, with only a minor accuracy reduction (3.6% on average) compared to Random Forests and CARTs.

3.4. Fair and Optimal Decision Trees

The authors of [9] addressed the challenge of constructing Decision Trees that are both optimal and fair. They proposed a novel dynamic programming algorithm that efficiently generates fair and optimal Decision Trees, overcoming the scalability limitations of Mixed Integer Programming (MIP)-based models, which struggle with large datasets. Their approach incorporates upper and lower bounds for global fairness constraints, enabling early pruning and thus significantly reducing the search space. The authors evaluated their method on twelve benchmark datasets commonly used in fairness research, including Adult, Bank, Communities and Crime, COMPAS recidivism, COMPAS violent recidivism, Dutch Census, German Credit, KDD Census Income, OULAD, Ricci, Student-Mathematics, and Student-Portuguese. They compared their approach against FairOCT [14], demonstrating superior computational efficiency and scalability.

3.5. FFTree

The authors of [10] proposed FFTree, a novel fair and flexible Decision Tree classifier that integrates multiple fairness criteria while preserving interpretability and classification performance. Unlike traditional fairness-aware models that focus on a single fairness definition, FFTree enables users to select and combine different fairness constraints (e.g., Disparate Impact, Disparate Mistreatment, and Disparate Treatment) according to the specific requirements of their application. In this algorithm, splits are selected using a modified information gain (IG) criterion that incorporates fairness constraints. A split is accepted only if it satisfies the predefined fairness conditions; otherwise, if no fair split is found, the node is converted into a leaf, thereby preventing unfair outcomes. The authors evaluate FFTree using the Adult Income Dataset and a proprietary real-world dataset from Intesa Sanpaolo Bank. The results demonstrate that FFTree achieves accuracy comparable to existing fairness-aware Decision Trees while offering greater flexibility in fairness optimization.

3.6. Optimal Fair Decision Trees

The authors of [11] proposed a Mixed Integer Optimization (MIO) framework for learning optimal Decision Trees that integrate fairness constraints while maintaining high accuracy and interpretability. They construct a flow-based graph representation of the Decision Tree and employ binary constraints to enforce fairness within the tree structure and prediction paths. Subsequently, they utilize global optimization techniques to ensure fairness at all decision levels, distinguishing their approach from heuristic-based models. The authors conduct extensive benchmarking on real-world datasets, comparing their approach against RegOCT and three fairness-aware machine learning models: Correlation Remover, Exponentiated-Gradient Reduction, and Randomized Threshold Optimizer. The proposed model is highly interpretable and achieves near-perfect fairness parity, whereas other models struggle to fully enforce fairness. This is accomplished with only a minor accuracy trade-off of up to  4.2% compared to more complex models.

3.7. Fair C4.5 Algorithm

The authors of [12] introduced FAir-C4.5, an enhanced fairness-aware Decision Tree algorithm that extends the widely used C4.5 model by integrating multiple fairness criteria, including Disparate Impact, discrimination score, consistency, and Disparate Treatment, to mitigate both group and individual discrimination. They propose three fairness-aware attribute selection strategies: (1) the lexicographic strategy, which ranks attributes based on a predefined priority order of fairness metrics; (2) the constraint-based strategy, where attributes are selected only if their fairness values fall below a predefined threshold; and (3) the gain ratio with fairness (GRXFR) strategy, which balances information gain and fairness metrics by multiplying entropy rank with fairness ranks. The algorithm is evaluated on 14 real-world benchmark datasets, including Adult, German Credit, ProPublica Recidivism, NYPD datasets related to racially biased policies, Portuguese Student Performance, Drug Consumption, Ricci, Wine Taste, Bank, Dutch Census, Law School Admission, and UFRGS GPA data. The results demonstrate that FAir-C4.5 significantly reduces discrimination scores across all fairness metrics, outperforming FFTree while maintaining competitive accuracy (0.5–2% lower than the original C4.5). Although FAir-C4.5 incurs higher computational costs than C4.5, it is twice as efficient as FFTree, making it a scalable and effective solution for fairness-aware decision-making.

3.8. SCAFF Fair Tree Classifier

A novel fair Decision Tree classifier that optimizes both classification performance and fairness, measured through strong demographic parity, is presented in [13]. The authors introduce a new Decision Tree splitting criterion, SCAFF (Splitting Criterion AUC for Fairness), which simultaneously optimizes classification performance using ROC-AUC and fairness using strong demographic parity. Extensive experiments are conducted on real-world datasets, including the Adult Income Dataset and the Bank Marketing Dataset, comparing SCAFF against standard Decision Trees and the Discrimination-Aware Decision Tree [6]. The results demonstrate that SCAFF consistently outperforms previous fairness-aware Decision Trees, achieving higher fairness with minimal accuracy loss (2–4% reduction) compared to non-fair models.

4. Fair Editions of Logistic Regression

This section provides a concise overview of the different Logistic Regression-based algorithmic variants introduced in the literature. The key characteristics and distinguishing features of each method are summarized in Table 3.

4.1. Prejudice Remover Regularizer

The authors of [15] proposed the Prejudice Remover Regularizer as a mechanism to enhance fairness in machine learning models. Initially, they introduced the Prejudice Index (PI) and the Normalized Prejudice Index (NPI) as quantitative measures of bias. Subsequently, the Prejudice Remover Regularizer was designed to directly mitigate the Prejudice Index. Their approach was integrated into the standard Logistic Regression framework, where the trade-off between classification accuracy and fairness is controlled through a regularization hyperparameter, η . Furthermore, the authors evaluated their method against the 2-Naïve Bayes classifier [3] using the Adult Income dataset. The results indicate that while the Prejudice Remover Regularizer effectively reduces bias, it is less efficient in bias mitigation compared to the 2-Naïve Bayes classifier. This algorithm has also been implemented in the AI Fairness 360 (AIF360) Library as an in-processing technique for mitigating bias in machine learning models.

4.2. η-Neutral Logistic Regression

The authors of [16] introduced the concept of η-neutrality, which defines a probabilistic model as η-neutral if the joint probability distribution of the target variable Y and the sensitive attribute V satisfies the following condition:
∀ v ∈ V, ∀ y ∈ Y:  Pr(y, v) / ( Pr(y) · Pr(v) ) ≤ 1 + η.
Here, η is a tunable neutrality parameter that controls the permissible level of dependency between Y and V.
The authors incorporated this concept into Maximum Likelihood Estimation (MLE)-based learning algorithms, specifically logistic and linear regression, to enforce fairness constraints while optimizing prediction accuracy. Their method was evaluated on five real-world classification datasets: Adult Income, German Credit Data, Bank Marketing, Credit Approval, and Dutch Census. Experimental results demonstrated that their approach outperformed existing methods, including 2-NB (Naïve Bayes) [3] and Prejudice Remover [15], in achieving a better balance between accuracy and fairness trade-offs.
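The condition can be checked empirically on a trained model's outputs by estimating the probabilities with sample frequencies, as in the sketch below; the estimator and the example data are illustrative.

```python
import numpy as np

def is_eta_neutral(y_pred, v, eta):
    """Check Pr(y, v) / (Pr(y) * Pr(v)) <= 1 + eta for every combination of
    predicted label y and sensitive value v, using empirical frequencies
    as probability estimates."""
    y_pred, v = np.asarray(y_pred), np.asarray(v)
    for y_val in np.unique(y_pred):
        p_y = np.mean(y_pred == y_val)
        for v_val in np.unique(v):
            p_v = np.mean(v == v_val)
            p_joint = np.mean((y_pred == y_val) & (v == v_val))
            if p_joint / (p_y * p_v) > 1 + eta:
                return False
    return True

# Example: predictions that are (nearly) independent of the sensitive attribute.
rng = np.random.default_rng(42)
v = rng.integers(0, 2, size=1000)
y_pred = rng.integers(0, 2, size=1000)
print(is_eta_neutral(y_pred, v, eta=0.05))
```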

Differentially Private and Fair Logistic Regression

The authors of [17] addressed the dual challenge of ensuring differential privacy and fairness in Logistic Regression models without significantly compromising model utility. In this study, fairness is considered at the group level, whereas differential privacy is defined at the individual level. The authors propose two novel methods to enforce both differential privacy and fairness in Logistic Regression. The first method, Private and Fair Logistic Regression (PFLR), incorporates fairness constraints into the Logistic Regression objective function as a penalty term. Subsequently, the functional mechanism [22] is applied to ensure differential privacy by introducing Laplace noise into the polynomial coefficients of the objective function. However, this approach results in reduced model accuracy due to excessive noise injection. To address this limitation, the authors introduce an improved method, PFLR*, which integrates the fairness constraint directly into the noise injection process rather than treating it as a penalty term. While both methods are mathematically proven to satisfy the ϵ -differential privacy criterion, PFLR* is more efficient, optimizing the trade-off between utility and privacy. The proposed methods were evaluated on two real-world datasets: the Adult Dataset and the Dutch Census. Experimental results demonstrate that PFLR* consistently outperforms PFLR, achieving higher accuracy while simultaneously preserving both privacy and fairness.

4.3. Disparate Impact-Free Logistic Regression

The authors of [18] propose a constraint-based framework that incorporates fairness constraints directly into the training process of convex classifiers, specifically Logistic Regression and Support Vector Machines (SVMs) (see the next section for details on SVMs). Their approach introduces a covariance-based fairness constraint, which represents the covariance between the decision boundary and the sensitive attribute. This constraint effectively addresses various fairness definitions, including Disparate Treatment, Disparate Impact, and Disparate Mistreatment. Experimental evaluations on real-world datasets, such as the Adult dataset and the Bank Marketing dataset, demonstrate that the proposed methodology reduces Disparate Impact by 50–80%, while maintaining a minimal accuracy drop of only 1–5%. In addition to applying Logistic Regression, Zafar et al. [18] also introduced this covariance-based fairness constraint into the training process of Support Vector Machines with both linear and non-linear kernels, enabling efficient performance while ensuring fairness.
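A minimal sketch of the idea for Logistic Regression, with the covariance term added as a soft penalty rather than the hard constraint solved in [18]; the penalty weight lam and the toy data are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def penalized_logreg(X, y, s, lam=10.0):
    """Logistic regression with a squared decision-boundary-covariance penalty.

    X : (n, d) features, y : labels in {0, 1}, s : binary sensitive attribute,
    lam : strength of the fairness penalty (illustrative hyperparameter).
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])     # add intercept column
    s_centered = s - s.mean()

    def objective(theta):
        scores = Xb @ theta                       # signed distance to the boundary
        log_loss = np.mean(np.log1p(np.exp(-(2 * y - 1) * scores)))
        cov = np.mean(s_centered * scores)        # boundary covariance term of [18]
        return log_loss + lam * cov ** 2

    res = minimize(objective, np.zeros(Xb.shape[1]), method="L-BFGS-B")
    return res.x

# Toy data in which one feature is correlated with the sensitive attribute.
rng = np.random.default_rng(0)
s = rng.integers(0, 2, 500)
X = np.column_stack([rng.normal(size=500) + s, rng.normal(size=500)])
y = (X[:, 1] + 0.5 * s + rng.normal(scale=0.5, size=500) > 0).astype(int)
print(penalized_logreg(X, y, s, lam=10.0))
```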

4.4. Constraint Logistic Regression

The authors of [19] extended the fairness-enhanced Logistic Regression model proposed in [18]. They introduced a fairness constraint based on Disparate Impact (also discussed in [18]) as well as a constraint to enforce the equalized odds fairness criterion. Experiments conducted on real-world datasets, such as Adult and COMPAS, demonstrated significant improvements in fairness, with up to a 58% reduction in bias, while incurring only a modest decrease in accuracy (approximately 5–7%). In comparison to the model in [18], which relies solely on the Disparate Impact constraint, their approach achieves better equalized odds when applied to datasets that are inherently unfair, and performs comparably on datasets that are fair or slightly unfair.

4.5. Group-Level Logistic Regression

Unlike existing fairness-aware models that explicitly impose fairness constraints, this variant of Logistic Regression [20] seeks to mitigate bias inherently by structuring model training around equal representation. The approach partitions the training data based on the sensitive attribute, after which the coefficients are updated using the median derivative value across groups. This ensures that underrepresented groups have an equal influence in defining model parameters. The model was evaluated on two real-world datasets: the Adult dataset and the Open University Learning Analytics dataset. While this method does not surpass fairness-optimized approaches such as Fairlearn, it consistently enhances fairness compared to standard Logistic Regression, particularly in binary-sensitive attribute settings.
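A minimal sketch of this training scheme, assuming a plain gradient-descent loop with illustrative learning rate and iteration count: the gradient of the logistic loss is computed separately per sensitive group, and the update uses the element-wise median of those per-group gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def group_median_logreg(X, y, s, lr=0.1, n_iter=500):
    """Logistic regression whose coefficient updates use the element-wise median
    of per-group gradients, so every sensitive group has equal influence."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # add intercept column
    theta = np.zeros(Xb.shape[1])
    groups = np.unique(s)
    for _ in range(n_iter):
        grads = []
        for g in groups:
            m = (s == g)
            err = sigmoid(Xb[m] @ theta) - y[m]
            grads.append(Xb[m].T @ err / m.sum())   # gradient from group g only
        theta -= lr * np.median(np.vstack(grads), axis=0)
    return theta
```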

4.6. Maximum Entropy Logistic Regression with Demographic Parity Constraints

While standard Logistic Regression aims to maximize the likelihood of observed class labels, the authors of [21] proposed a novel approach, Maximum-Entropy Logistic Regression, which maximizes entropy while enforcing linear fairness constraints. These constraints specifically minimize the covariance between model predictions and the protected attribute, thereby ensuring adherence to the Demographic Parity fairness criterion. Additionally, their Logistic Regression variant is integrated with a backward Stepwise Fairness-Aware Feature Selection algorithm, designed to iteratively remove features that contribute most to disparities in demographic parity. The proposed model is evaluated on five real-world benchmark datasets widely used in fairness-aware machine learning: Adult Income, Bank Marketing, COMPAS, German Credit, and Law School Admission. Experimental results demonstrate that the model performs comparably to Disparate Impact-Free Logistic Regression [18].

5. Fair Editions of Ensemble Models

This section provides a succinct overview of ensemble-based algorithmic variants introduced in the literature. Table 4 highlights the core characteristics and comparative features of each method.

5.1. AdaFair

The authors of [23] proposed AdaFair, an extension of AdaBoost, which dynamically updates instance weights not only based on classification error but also to mitigate discrimination against underrepresented groups. As a fairness metric, they adopted equalized odds. Additionally, AdaFair optimizes the number of weak learners in the final ensemble to achieve both fairness and class balance.
Experiments were conducted on four real-world datasets: Adult Income, Bank Marketing, COMPAS, and KDD Census. The KDD Census dataset is similar to the Adult Income dataset but exhibits extreme class imbalance. AdaFair demonstrates superior performance compared to existing fairness-aware methods, achieving an 8–25% improvement in balanced accuracy over competing techniques while maintaining fairness.
In [24], AdaFair was further extended to incorporate two additional fairness metrics: Statistical Parity and Equal Opportunity, which were used to formulate fairness costs. The reweighting function was modified to consider both misclassification error (as in traditional AdaBoost) and cumulative fairness costs. Unlike prior approaches that adjust for fairness independently at each boosting round, AdaFair evaluates fairness across the entire ensemble up to a given point, ensuring a more stable and consistent fairness correction.
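To convey the flavour of this cumulative reweighting (an illustrative rule, not the exact AdaFair update), the sketch below measures which group the ensemble built so far disadvantages in terms of true positive rate and boosts the weights of that group's misclassified instances on top of the usual AdaBoost factor.

```python
import numpy as np

def fairness_costs(y, y_ens, s):
    """Per-instance fairness cost: 1 for misclassified instances that belong to the
    group with the currently lower true positive rate, 0 otherwise.
    y_ens is the prediction of the ensemble built so far (illustrative rule)."""
    tpr = []
    for g in (0, 1):
        pos = (s == g) & (y == 1)
        tpr.append(np.mean(y_ens[pos] == 1) if pos.any() else 0.0)
    lagging = int(np.argmin(tpr))                       # group currently at a disadvantage
    return ((s == lagging) & (y_ens != y)).astype(float)

def reweight(w, y, y_weak, y_ens, s, alpha):
    """AdaBoost-style weight update augmented with a cumulative fairness cost term."""
    u = fairness_costs(y, y_ens, s)                     # ensemble-level (cumulative) cost
    w = w * np.exp(alpha * (y_weak != y)) * (1.0 + u)   # extra boost for unfairly treated
    return w / w.sum()
```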
The strong performance demonstrated by the original version of AdaFair in [23] across the four real-world datasets persisted in this extended version, confirming its effectiveness in maintaining both fairness and predictive accuracy.

5.2. FAIRGBM

In [25], FairGBM is proposed as an extension of Gradient Boosted Decision Trees (GBDTs) that explicitly incorporates convex fairness constraints—such as equal opportunity and equalized odds—directly into the learning objective via a Lagrangian dual optimization framework. This formulation enables FairGBM to balance predictive performance with fairness guarantees by treating fairness metrics as constrained optimization objectives rather than post hoc adjustments. Empirical results on benchmark datasets, including ACS Income, Adult, and Account Opening Fraud, indicate that FairGBM achieves substantial reductions in group-based disparities while maintaining competitive accuracy, thereby demonstrating the feasibility of fairness-aware learning in tree-based models without significant trade-offs.

5.3. FairXGBoost

The authors of [26] proposed FairXGBoost, a bias-mitigation framework that seamlessly integrates fairness constraints into XGBoost’s training process through a fairness regularizer, without requiring significant modifications to the model’s core framework. The fairness regularizer is designed to eliminate correlations between the sensitive attribute and the target variable, thereby promoting fairness. Additionally, a hyperparameter called μ regulates the strength of fairness enforcement, with higher values imposing stricter fairness constraints. By fine-tuning μ , practitioners can effectively balance predictive accuracy and bias mitigation, ensuring both performance and fairness in model outcomes. Experiments conducted on the Adult Dataset, COMPAS Dataset, Default Dataset, and Bank Marketing Dataset demonstrated minimal performance degradation while successfully achieving regulatory compliance by maintaining a Disparate Impact (DI) score of at least 0.8 (four-fifths rule).
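One way to experiment with this idea is XGBoost's custom-objective hook; the sketch below adds a covariance-style penalty between the centred sensitive attribute and the predicted probability to the standard logistic gradients. It illustrates the mechanism only, and the exact regularizer of [26] may differ.

```python
import numpy as np
import xgboost as xgb

def make_fair_logistic_obj(s, mu=1.0):
    """Return an XGBoost objective: logistic loss plus mu * (s - s_mean) * sigmoid(f),
    which penalises correlation between the sensitive attribute and the prediction."""
    s_centered = s - s.mean()

    def obj(preds, dtrain):
        y = dtrain.get_label()
        p = 1.0 / (1.0 + np.exp(-preds))                  # predicted probability
        # Gradient: logistic part plus derivative of the penalty mu*(s - s_mean)*sigmoid(f).
        grad = (p - y) + mu * s_centered * p * (1 - p)
        hess = p * (1 - p)                                # hessian of the logistic part only
        return grad, hess
    return obj

# Usage (hypothetical data): dtrain = xgb.DMatrix(X, label=y)
# booster = xgb.train({"max_depth": 4}, dtrain, num_boost_round=100,
#                     obj=make_fair_logistic_obj(s, mu=1.0))
```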

5.4. GAFairC: Group AdaBoost with Fairness Constraint

The authors of [27] proposed Group AdaBoost with Fairness Constraint (GAFairC), a modified version of AdaBoost that optimizes a group-aware loss function while enforcing fairness constraints. This study employs equalized odds, a widely used fairness metric, to assess model fairness. The proposed modifications to the classical AdaBoost algorithm include three key components: Rectified Fairness Penalty, which introduces a penalty term in the loss function that activates only when unfairness is detected, preventing unnecessary constraints; Penalty Intensity Selection, a dynamic approach that adjusts the intensity of fairness constraints using a greedy search algorithm; and post-pruning, which ensures that the final ensemble model strictly adheres to the fairness constraint while maximizing accuracy. The authors evaluated their approach on five standard benchmark datasets—Bank, KDD, Compas, Credit, and Adult—and compared its performance against methods such as AdaBoost and AdaFair, among others. The results demonstrate that GAFairC achieves superior fairness (lower equalized odds) while maintaining or improving accuracy compared to existing approaches.

5.5. FAEM: Fairness-Aware Ensemble Model

The authors of [28] propose the Fairness-Aware Ensemble Model (FAEM), which comprises two main components: Hybrid Sampling-Based Bias Alleviation and Two-Layer Stacking-Based Fairness-Aware Ensemble Learning. The first component is a pre-processing technique designed to balance both class labels and sensitive attributes by employing cross-validation-based under-sampling and sensitive attribute-based over-sampling using the Adaptive Synthetic Sampling (ADASYN) algorithm. The second component, two-layer stacking-based fairness-aware ensemble learning, consists of two stages. In the first layer, multiple base classifiers, including XGBoost, Random Forest, LightGBM, Gradient Boosting Decision Tree (GBDT), and AdaBoost, are trained to generate prediction probabilities, with high-confidence majority-class predictions adjusted to enhance fairness. In the second layer, Logistic Regression serves as a meta-classifier, aggregating the adjusted predictions to produce the final output. This two-stage approach ensures a balance between accuracy and fairness, effectively mitigating bias while preserving model performance, as demonstrated through experiments on benchmark datasets such as German Credit, Adult, Bank, and Compas.

5.6. FairBoost

FairBoost [29] is a boosting-based ensemble method that integrates fairness constraints during training while maintaining classification accuracy. The model is built upon AdaBoost.SAMME.R, a variant of the AdaBoost algorithm optimized for multi-class classification, and introduces fairness-aware instance weighting to regulate the trade-off between accuracy and fairness. Unlike similar models that focus on a single sensitive feature, FairBoost supports multiple sensitive attributes, each with one or more categories, making it more applicable to real-world scenarios. The authors evaluate FairBoost on four benchmark datasets—Adult, Arrhythmia, Drugs, and German Credit—each containing multiple sensitive features. The results demonstrate that FairBoost outperforms existing methods in fairness while maintaining competitive accuracy.

5.7. Fair Voting Ensemble Classifier

The authors of [30] proposed a post-processing multi-objective optimization-based ensemble method. Their method integrates a soft voting ensemble that combines predictions from multiple pre-trained classifiers with a dynamic weight allocation that assigns weights to base classifiers based on the group being classified (privileged/unprivileged). While standard voting ensembles use a single set of fixed weights for all predicted instances, the proposed algorithm has two sets of weights: one specifically for privileged instances and another only for unprivileged instances. The weights are optimized to balance multiple accuracy and fairness metrics using RNSGA-II (Robust Non-dominated Sorting Genetic Algorithm-II). The authors evaluated their model on three benchmark datasets: COMPAS, Adult Income, and German Credit. Their method consistently ranked competitively against the other methods with respect to both objectives (fairness and accuracy).
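The core prediction rule, group-dependent soft voting, can be sketched as follows; the RNSGA-II weight search is omitted, and the two weight vectors are assumed to be given.

```python
import numpy as np

def group_weighted_soft_vote(probas, s, w_priv, w_unpriv, privileged_value=1):
    """Soft voting with two classifier-weight vectors, one per group.

    probas : array of shape (n_classifiers, n_samples) holding P(y=1) from each
             pre-trained base classifier.
    s      : sensitive attribute per sample; w_priv / w_unpriv are the weights used
             for privileged / unprivileged samples (assumed to be provided, e.g.
             found by a multi-objective optimizer).
    """
    probas = np.asarray(probas)
    weights = np.where(s == privileged_value,
                       np.array(w_priv)[:, None],     # weights for privileged samples
                       np.array(w_unpriv)[:, None])   # weights for unprivileged samples
    combined = (weights * probas).sum(axis=0) / weights.sum(axis=0)
    return (combined >= 0.5).astype(int)
```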

6. Exponentiated Learning Technique Plus Classical Machine Learning Algorithms

Agarwal et al. [31] proposed an algorithm based on Exponentiated Gradient optimization to construct a randomized classifier that minimizes classification error while adhering to fairness constraints—such as demographic parity or equalized odds—formulated as linear inequalities over conditional expectations. The approach treats the underlying classifier as a black box, allowing compatibility with any learning algorithm. This method offers several advantages: it is flexible, as it supports a wide range of fairness definitions; model-agnostic, as it can be applied to any classifier; practical, due to its ease of implementation; and competitive, as empirical results demonstrate that it matches or outperforms existing fairness-aware algorithms on various real-world datasets.
In the present study, we employ this technique as implemented in the Python library Fairlearn [32], in conjunction with classical machine learning algorithms, including Naive Bayes, Decision Trees, and Logistic Regression.
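For illustration, a minimal sketch of such a combination using Fairlearn's reduction API; the class and argument names come from the library, while the data arrays X_train, y_train, sensitive_train, and X_test are placeholders.

```python
from fairlearn.reductions import ExponentiatedGradient, DemographicParity
from sklearn.tree import DecisionTreeClassifier

# Wrap a classical estimator (here a Decision Tree) in the Exponentiated
# Gradient reduction with a demographic-parity constraint.
mitigator = ExponentiatedGradient(
    estimator=DecisionTreeClassifier(random_state=42),
    constraints=DemographicParity(),
)

# X_train, y_train, sensitive_train and X_test are assumed to be available arrays/Series.
# mitigator.fit(X_train, y_train, sensitive_features=sensitive_train)
# y_pred = mitigator.predict(X_test)
```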

7. Experimental Procedure

This section provides a detailed overview of the datasets utilized, the fair machine learning algorithms implemented, and the quantitative metrics adopted to assess both predictive performance and fairness across different groups.

7.1. Datasets

7.1.1. Titanic Dataset

The Titanic dataset [33] is a well-known dataset, derived from passenger records of the RMS Titanic, which sank in 1912. The dataset consists of 891 records, each representing an individual passenger, and includes attributes such as age, sex, passenger class, port of embarkation, ticket fare, and survival status. The survival variable is binary, with 0 indicating non-survival (549 passengers) and 1 indicating survival (342 passengers). As a sensitive attribute, sex is encoded as male (0), represented by 577 instances, and female (1), represented by 314 instances.

7.1.2. Adult Census Dataset

The Adult Census Dataset, also known as the “Adult” or “Census Income” dataset, is extracted from the 1994 U.S. Census database [34]. It contains 48,842 cases, each representing an individual, with 14 attributes including age, education, occupation, marital status, race, sex, and income level. The primary task associated with this dataset is binary classification, which involves predicting whether an individual’s annual income exceeds USD 50,000, labeled as >50 K (11,687 instances), or not, labeled as ≤50 K (37,155 instances), based on the provided demographic and socioeconomic features. The sensitive attribute, sex, is categorized as male (0) with 32,650 instances and female (1) with 16,192 instances.

7.1.3. Bank Marketing Dataset

The Bank Marketing Dataset [35] originates from direct marketing campaigns (telephone contacts) conducted by a Portuguese banking institution between May 2008 and November 2010. It comprises 41,188 instances, each corresponding to a client interaction, and includes 16 input features—such as age, job, marital status, education, housing loan status, contact type, and outcomes of previous campaigns—along with an output variable indicating whether the client subscribed to a term deposit (yes/no). The class distribution is as follows: ‘No Subscription’ (0) with 36,548 instances and ‘Subscription’ (1) with 4640 instances. The sensitive attribute, age, is categorized as ‘Below Median’ (0) with 19,768 instances and ‘Above Median’ (1) with 21,420 instances.

7.1.4. German Credit Dataset

The German Credit Dataset [36] originates from a credit risk assessment study conducted by a German bank. It comprises 1000 instances, each representing an individual credit applicant, and includes 20 attributes such as age, sex, job, housing status, credit amount, loan duration, purpose, and credit history. The primary objective is binary classification—predicting whether an applicant is a good (low-risk) or bad (high-risk) credit customer. The class distribution includes 300 instances labeled as bad credit (0) and 700 as good credit (1). Featuring a combination of categorical and numerical variables, the dataset is widely utilized for benchmarking classification algorithms and exploring fairness in credit scoring. The sensitive attribute age is encoded as young (0) with 149 instances and older (1) with 851 instances.

7.1.5. MBA Admission Dataset

The MBA Admission Dataset [37] comprises 6094 instances, each representing an applicant to a Master of Business Administration program. Its primary objective is to predict admission outcomes based on a combination of academic and personal profile features. The dataset includes variables such as GMAT scores, undergraduate GPA, work experience, age, and gender, along with the binary admission outcome—0 indicating admission (900 cases) and 1 indicating non-admission (5194 cases). The sensitive attribute gender is encoded as 0 for male (2201 cases) and 1 for female (3893 cases).

7.1.6. Law School Dataset

The Law School Admission Dataset [38], commonly referred to as the Law School Dataset, contains data on law school applicants in the United States. It includes demographic information, academic credentials (such as undergraduate GPA and LSAT scores), and law school outcomes like graduation and bar passage. This study specifically concentrates on predicting the likelihood of a candidate passing the bar examination. For this purpose, we utilize the version of the dataset curated by Dam and Harvey [39].

7.2. Machine Learning Algorithms

While a broad range of algorithms was reviewed in Section 2, Section 3, Section 4 and Section 5, only a selected subset of machine learning algorithms was implemented in our experiments:
  • AdaFair: A summary of this algorithm is provided in Section 5.1.
  • FairGBM: A summary of this algorithm is provided in Section 5.2.
  • Fairlearn-GB: Combines the Exponentiated Gradient technique with a gradient boosting algorithm, implemented using the Fairlearn library.
  • Fairlearn-NB: Integrates the Exponentiated Gradient technique with a Naive Bayes classifier via the Fairlearn library.
  • Fairlearn-DT: Applies the Exponentiated Gradient technique in conjunction with Decision Trees, utilizing the Fairlearn library.
  • Fairlearn-LR: Combines the Exponentiated Gradient technique with Logistic Regression using the Fairlearn library.
  • Fair Decision Trees: Summarized in Section 3.8, this algorithm is based on the implementation available in the GitHub repository of [13].
  • Fair Random Forest: An extension of the Fair Decision Trees algorithm, also described in Section 3.8 and obtained from the GitHub repository of [13].
  • NNB-Parity: Briefly described in Section 2.3, this algorithm is sourced from the GitHub repository of [5].
  • NNB-DF: Also detailed in Section 2.3, this algorithm is likewise obtained from the GitHub repository of [5].
  • FairBoost: An overview of this algorithm is presented in Section 5.6, based on the work of [29].
  • Prejudice Remover Regularizer: Utilized via its implementation in the AI Fairness 360 (AIF360) library.
This selection was guided by practical considerations such as algorithmic diversity, code/library availability and relevance to the characteristics of the datasets described earlier.

Model Parameters

In all experiments, we used 10-fold cross-validation (k = 10) to ensure robust model evaluation, and a fixed random state of 42 to guarantee reproducibility. For each of the aforementioned models, we adopted the default hyperparameter settings as a baseline, allowing for a consistent comparison of model performance without the confounding effects of manual tuning.
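As an illustration of this protocol, a minimal sketch using scikit-learn; the model object and the NumPy arrays X and y are placeholders, and fairness metrics are computed per fold in the same way from the held-out predictions and sensitive attribute values.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def cross_validate_accuracy(model, X, y, k=10, seed=42):
    """k-fold cross-validation returning the mean and standard deviation of accuracy.
    X and y are assumed to be NumPy arrays; fairness metrics can be computed
    analogously per fold from the held-out predictions."""
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return np.mean(scores), np.std(scores)
```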

7.3. Metrics

For evaluating model performance, we selected accuracy as the primary performance metric. To assess fairness, we employed the following fairness metrics [40]:
  • Accuracy Difference: Difference in prediction accuracy between the two groups, expressed as Accuracy Difference = Accuracy_{Group 1} − Accuracy_{Group 2}.
  • Statistical Parity Difference: Difference in the rate of positive predictions (PPR) between groups, expressed as Statistical Parity Difference = PPR_{Group 1} − PPR_{Group 2}.
  • Equality of Opportunity: Difference in true positive rates (TPR) between groups, expressed as Equality of Opportunity = TPR_{Group 1} − TPR_{Group 2}.
  • Equalized Odds: Considers both true positive rates (TPR) and false positive rates (FPR) across groups, expressed as Equalized Odds = (FPR_{Group 1} − FPR_{Group 2}) + (TPR_{Group 1} − TPR_{Group 2}).
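These group-difference metrics can be computed directly from predictions, ground-truth labels, and a binary group indicator; the following sketch mirrors the definitions above (empty groups are not handled, for brevity).

```python
import numpy as np

def group_fairness_metrics(y_true, y_pred, group):
    """Compute the four group-difference metrics for a binary group indicator."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))

    def rates(g):
        m = (group == g)
        acc = np.mean(y_pred[m] == y_true[m])
        ppr = np.mean(y_pred[m] == 1)                      # positive prediction rate
        tpr = np.mean(y_pred[m & (y_true == 1)] == 1)      # true positive rate
        fpr = np.mean(y_pred[m & (y_true == 0)] == 1)      # false positive rate
        return acc, ppr, tpr, fpr

    (acc1, ppr1, tpr1, fpr1), (acc2, ppr2, tpr2, fpr2) = rates(1), rates(0)
    return {
        "accuracy_difference": acc1 - acc2,
        "statistical_parity_difference": ppr1 - ppr2,
        "equality_of_opportunity": tpr1 - tpr2,
        "equalized_odds": (fpr1 - fpr2) + (tpr1 - tpr2),
    }
```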

8. Results

8.1. Adult Dataset Results

Table 5 and Figure 1 present a comparative evaluation of several fair machine learning algorithms on the ADULT dataset. The analysis yields several noteworthy observations:
Among all models, FairGBM achieved the highest accuracy ( 0.8740 ± 0.0019 ), followed closely by Prejudice Remover ( 0.8504 ± 0.0019 ) and Fairboost ( 0.8475 ± 0.0159 ). However, these models exhibited notable variations in fairness performance. In particular, both FairGBM and Fairboost reported relatively high statistical parity differences (0.1730 and 0.1713, respectively), suggesting a tendency toward disparate outcomes across demographic groups despite their high predictive performance.
AdaFair achieved perfect fairness across all evaluated metrics, with all fairness values equal to zero. However, this came at the expense of lower predictive accuracy ( 0.7607 ± 0.0000 ), indicating a clear trade-off between fairness and accuracy in this case.
Models such as Fairlearn-NB and NNB-DF offered more favorable fairness metrics compared to the top-performing models in terms of accuracy. For instance, Fairlearn-NB reported low TPR and FPR differences ( 0.0150 ± 0.0107 and 0.0128 ± 0.0111 , respectively). However, these improvements in fairness were accompanied by substantially lower accuracies, generally ranging from 0.58 to 0.62, limiting their practical applicability.
Fair Decision Trees achieved very low disparities in fairness metrics but suffered from the lowest accuracy ( 0.2403 ± 0.0018 ), rendering them ineffective for real-world deployment despite their strong fairness characteristics.
Overall, Fairlearn-GB, Fairlearn-LR, and Prejudice Remover demonstrated a more favorable trade-off between fairness and accuracy. For example, Fairlearn-GB attained a strong accuracy ( 0.8662 ± 0.0021 ) alongside relatively low fairness metric disparities, including a statistical parity difference of 0.1174 ± 0.0047 and modest TPR and FPR differences.

8.2. BANK Dataset Results

Table 6, along with Figure 2, presents a comparative analysis of fair machine learning algorithms on the BANK dataset. The results reveal several key insights:
FairGBM and Fairlearn-GB achieved the highest overall accuracies ( 0.9177 ± 0.0018 and 0.9178 ± 0.0018 , respectively), while also maintaining low disparities across multiple fairness metrics. For example, Fairlearn-GB reported a statistical parity difference of just 0.0120 ± 0.0065 , along with modest true positive rate (TPR) and false positive rate (FPR) differences. These results suggest that both gradient boosting variants deliver strong predictive performance while ensuring relatively equitable treatment across demographic groups.
AdaFair, consistent with its performance on the ADULT dataset, achieved perfect fairness with all fairness metrics equal to zero. Unlike in the previous dataset, however, its accuracy on the BANK dataset was also strong ( 0.8874 ± 0.0000 ), making it a highly competitive candidate for fairness-sensitive applications with minimal compromise in performance.
Fairboost demonstrated competitive accuracy ( 0.8813 ± 0.0068 ) and an exceptionally low accuracy difference ( 0.0020 ± 0.0012 ). However, its relatively high TPR difference ( 0.0825 ± 0.0332 ) indicates potential imbalances in positive classification rates across groups.
Both Prejudice Remover and Fairlearn-LR also performed well, achieving accuracies above 0.91 and fairness metrics comparable to the top-performing models. These results underscore that linear and regularized models can attain both fairness and accuracy when properly calibrated.
In contrast, NNB-DF and NNB-Parity exhibited significantly lower accuracies ( 0.5624 ± 0.1693 and 0.7780 ± 0.0053 , respectively) and higher disparities in key fairness metrics, particularly in statistical parity and FPR. These findings indicate that, although the N-Naïve Bayes approaches offer flexibility, their fairness performance in this context was less consistent and more variable.
Fair Decision Trees recorded the lowest accuracy ( 0.5605 ± 0.0344 ) but maintained fairness metrics within acceptable bounds. Nevertheless, their limited predictive performance constrains their practical applicability.
In summary, the results on the BANK dataset indicate that several models—notably Fairlearn-GB, FairGBM, Prejudice Remover, and AdaFair—effectively balance predictive accuracy and fairness, making them strong candidates for deployment in fairness-critical decision-making scenarios.

8.3. Titanic Dataset Results

Table 7 and Figure 3 report the performance of various fair machine learning models on the TITANIC dataset. The results underscore a more pronounced trade-off between fairness and accuracy compared to the ADULT and BANK datasets.
Fairlearn-GB achieved the highest overall accuracy ( 0.8299 ± 0.0220 ), closely followed by Fairlearn-DT ( 0.8250 ± 0.0183 ) and Prejudice Remover ( 0.8216 ± 0.0125 ). However, despite their strong predictive performance, these models exhibited significant fairness disparities. For example, Fairlearn-GB recorded high statistical parity difference ( 0.7261 ± 0.0834 ) and true positive rate (TPR) difference ( 0.6740 ± 0.1089 ), indicating substantial group-based disparities in classification outcomes.
Prejudice Remover, while not the most fair, offered a more balanced performance across metrics. It maintained high accuracy while reducing disparities across multiple fairness indicators, such as accuracy difference ( 0.0161 ± 0.0134 ), statistical parity difference ( 0.5725 ± 0.0417 ), and false positive rate (FPR) difference ( 0.1847 ± 0.0993 ). This suggests that it achieves a relatively favorable trade-off between accuracy and fairness.
In contrast, AdaFair, which typically excels in fairness across datasets, exhibited anomalously poor fairness performance on the TITANIC dataset. Although its accuracy remained reasonable ( 0.7813 ± 0.0221 ), all fairness metrics were maximized at 1.0, indicating extreme and uniform disparity—a concerning result for a model designed to prioritize fairness.
Fairboost delivered solid accuracy ( 0.8127 ± 0.0164 ) and a modest accuracy difference ( 0.0279 ± 0.0166 ), but exhibited some of the worst fairness disparities among the evaluated models. Its statistical parity difference ( 0.8255 ± 0.0333 ) and FPR difference ( 0.6269 ± 0.0811 ) point to highly unequal treatment between demographic groups, undermining its utility in fairness-sensitive applications.
Models on the lower end of performance included Fair Decision Trees and the N-Naïve Bayes models (NNB-Parity and NNB-DF). These models yielded poor accuracy (generally below 0.50), with Fair Decision Trees achieving the lowest ( 0.3851 ± 0.0052 ). Although Fair Decision Trees exhibited near-perfect fairness metrics, their inadequate predictive performance renders them unsuitable for practical deployment.
Fairlearn-LR and Fair Random Forest demonstrated moderate accuracy with some fairness improvements. Fairlearn-LR achieved 0.7810 ± 0.0122 in accuracy, though its fairness metrics remained relatively high, suggesting inconsistency in fairness optimization.
In summary, the results on the TITANIC dataset highlight a sharper fairness–accuracy trade-off than observed in previous datasets.

8.4. German Credit Dataset Results

Table 8 and Figure 4 summarize the performance of various fair machine learning algorithms on the German Credit dataset. The results reveal a complex interplay between accuracy and fairness, with no model achieving an ideal balance.
In terms of predictive accuracy, the top-performing models were Fairlearn-LR ( 0.7493 ± 0.0214 ), FairGBM ( 0.7463 ± 0.0182 ), Prejudice Remover ( 0.7440 ± 0.0176 ), and Fair Random Forest ( 0.7423 ± 0.0196 ). Despite their strong predictive performance, these models demonstrated notable fairness disparities. In particular, Fair Random Forest and Prejudice Remover exhibited substantial gaps in key fairness metrics, including accuracy differences of up to 0.1519 and false positive rate (FPR) differences as high as 0.1456, indicating considerable imbalance in treatment across demographic groups.
AdaFair, by contrast, achieved perfect fairness across all evaluated metrics, with all disparity values equal to zero. However, this came at the cost of reduced accuracy (0.7000), exemplifying the classical fairness–accuracy trade-off. A similar trend, albeit less pronounced, was observed with Fair Decision Trees, which maintained relatively low fairness disparities (e.g., statistical parity difference of 0.0271 and FPR difference of 0.0369), but with a lower predictive accuracy of 0.6677.
Fairboost emerged as a strong compromise model, offering competitive accuracy ( 0.7293 ± 0.0173 ) while maintaining moderate and consistent fairness metrics. For instance, it reported a statistical parity difference of 0.0537 and a TPR difference of 0.0438, indicating a relatively balanced performance without disproportionately sacrificing either accuracy or fairness.
The NNB-based models exhibited divergent and generally less favorable behavior. NNB-DF achieved a reasonable accuracy (0.7190), but also showed the highest accuracy disparity (0.1993) and FPR difference (0.2285), reflecting substantial bias in classification outcomes. NNB-Parity performed slightly worse in terms of accuracy and still struggled with fairness, as evidenced by a TPR difference of 0.1470.
In summary, no model achieved a perfect balance between fairness and predictive performance on the German Credit dataset. While AdaFair delivered ideal fairness metrics, it did so at the cost of utility. On the other hand, Fairboost and Fairlearn-DT emerged as more pragmatic options, offering relatively fair outcomes while preserving useful levels of predictive accuracy.

8.5. EAP2024 Dataset Results

Table 9 and Figure 5 showcase the comparative performance of fair machine learning algorithms on the EAP2024 dataset. The results highlight the varying trade-offs between predictive accuracy and fairness across different models.
Fairboost achieved the highest overall accuracy ( 0.8015 ± 0.0255 ), closely followed by Prejudice Remover ( 0.7978 ± 0.0210 ) and FairGBM ( 0.7896 ± 0.0213 ). Notably, these top-performing models also maintained relatively low fairness disparities. For instance, Fairboost reported a statistical parity difference of 0.0587 and a TPR difference of 0.1068, suggesting it provides a strong balance between predictive performance and group fairness—making it a compelling candidate for practical deployment.
AdaFair demonstrated particularly low disparities across all fairness metrics (e.g., TPR difference of 0.0307), while still maintaining competitive accuracy ( 0.7463 ± 0.0248 ). This indicates that substantial fairness improvements can be achieved with only moderate compromise on predictive power, positioning AdaFair as a suitable choice for fairness-critical applications.
Models such as Fairlearn-NB, Fairlearn-DT, and Fairlearn-LR showed moderate trade-offs between accuracy and fairness. However, Fairlearn-LR exhibited a relatively high TPR disparity (0.1735), suggesting potential group-specific bias or instability in classification performance.
In contrast, NNB-Parity offered acceptable predictive performance (accuracy of 0.7299), but had the highest TPR disparity (0.2319), indicating significant inequity in positive classification rates across demographic groups. Similarly, while Fair Random Forest reported a perfect TPR difference of 0.0000, its poor accuracy (0.5769) limits its practical applicability, as fairness is achieved at the expense of reliable predictions.
Fair Decision Trees performed worst across the board, with the lowest accuracy (0.3791) and only marginal fairness benefits. This outcome suggests limited utility for this model in the context of the EAP2024 dataset.
In summary, Fairboost and FairGBM emerge as the most effective models for this dataset, offering a favorable balance between accuracy and fairness.

8.6. MBA Dataset Results

Table 10, along with Figure 6, provides a comparative evaluation of fair machine learning algorithms on the MBA dataset. The analysis highlights meaningful differences in the trade-offs between predictive accuracy and fairness across the models.
The highest-performing model in terms of accuracy was Fairlearn-LR ( 0.8529 ± 0.0012 ), closely followed by AdaFair ( 0.8524 ± 0.0000 ) and Fairlearn-NB ( 0.8487 ± 0.0071 ). Remarkably, AdaFair achieved perfect fairness across all evaluated metrics, with zero values for statistical parity difference, true positive rate (TPR) difference, and false positive rate (FPR) difference. This result suggests that strict fairness constraints or post-processing techniques may have been applied, allowing the model to enforce equitable treatment without sacrificing accuracy.
Fairlearn-LR also demonstrated exceptional fairness performance, reporting minimal disparity scores—statistical parity difference (0.0020), TPR difference (0.0124), and FPR difference (0.0014)—thereby confirming that, in this context, high predictive performance can coexist with strong fairness guarantees.
Other models such as Fairboost and Prejudice Remover maintained solid predictive performance (approximately 0.84–0.85 in accuracy), but with slightly higher fairness disparities. For example, Fairboost exhibited a statistical parity difference of 0.0544, indicating some trade-off between fairness and accuracy.
The NNB variants (NNB-Parity and NNB-DF) achieved moderate accuracy levels (around 0.81) but reported considerably higher fairness disparities. Notably, both had TPR differences in the range of 0.165–0.170, suggesting inconsistent classification behavior across demographic subgroups and raising concerns about equitable model behavior.
The weakest results came from Fair Decision Trees (accuracy: 0.4852) and Fair Random Forest (accuracy: 0.7808, with the largest TPR disparity at 0.2370). These models either lacked sufficient predictive capability or suffered from significant fairness issues, particularly large TPR disparities, limiting their suitability for real-world deployment.
In conclusion, AdaFair and Fairlearn-LR emerged as the most balanced models for the MBA dataset, delivering both high accuracy and strong fairness performance.

8.7. LAW Dataset Results

Table 11 and Figure 7 provide a comprehensive comparison of fair machine learning algorithms evaluated on the LAW dataset. The results reveal important trade-offs between predictive performance and fairness across different models.
Among all models, Prejudice Remover achieved the highest accuracy (0.9102 ± 0.0015), closely followed by Fairboost (0.9096 ± 0.0015) and the Fairlearn variants—GB and LR—both reporting accuracies of 0.9081. While these top-performing models demonstrated strong predictive capability, their fairness characteristics varied. Notably, AdaFair delivered perfect fairness across all evaluated metrics (zero disparities), with only a slight reduction in accuracy (0.9017 ± 0.0000). This suggests AdaFair enforces strict fairness guarantees, even at the expense of a minor performance loss.
Models such as FairGBM and Fairlearn-GB/LR struck a commendable balance between accuracy and fairness. For instance, Fairlearn-GB reported low fairness disparities, including a statistical parity difference of 0.0096, a TPR difference of 0.0026, and an FPR difference of 0.0427. These results indicate reliable and equitable classification behavior, making such models highly suitable for fairness-sensitive domains.
In contrast, while Prejudice Remover was the most accurate model, it also exhibited the highest false positive rate (FPR) disparity (0.1053), raising concerns about differential treatment in negative predictions across groups. Similarly, Fairboost, despite its strong accuracy, recorded a relatively higher FPR difference (0.0919), highlighting the potential fairness–performance trade-off, particularly along the FPR axis.
The lower-performing models included Fair Decision Trees, which had the lowest accuracy (0.2928 ± 0.0145), making them unsuitable for practical deployment. The NNB-based models (NNB-Parity and NNB-DF) achieved moderate accuracy (approximately 0.80–0.82) but exhibited higher disparities in fairness metrics, particularly in statistical parity and TPR difference, undermining their fairness reliability.
In summary, Fairlearn-GB, FairGBM, and Fairlearn-LR emerged as the most balanced models, successfully combining high predictive performance with low fairness disparities. These models are strong candidates for deployment in legal or policy-sensitive contexts where equitable outcomes are critical. Meanwhile, AdaFair remains the most rigorous in fairness enforcement, suitable for applications demanding zero tolerance for group-level bias. Although Prejudice Remover and Fairboost offer the highest accuracy, their elevated FPR disparities necessitate careful consideration, especially in high-stakes scenarios.
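For readers who wish to reproduce this style of evaluation, the short sketch below shows one way to obtain the three group-fairness gaps reported in Tables 5–11 (statistical parity difference, TPR difference, and FPR difference) using the Fairlearn metrics API; the arrays are synthetic placeholders rather than the outputs of the benchmarked models.
```python
# Illustrative only: synthetic predictions, not the benchmark pipeline of this study.
import numpy as np
from fairlearn.metrics import (
    MetricFrame,
    demographic_parity_difference,
    false_positive_rate,
    true_positive_rate,
)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)            # ground-truth binary labels
y_pred = rng.integers(0, 2, size=1000)            # model predictions (placeholder)
sensitive = rng.choice(["A", "B"], size=1000)     # protected attribute, e.g., gender

# Statistical parity difference: gap in positive-prediction rates between groups.
spd = demographic_parity_difference(y_true, y_pred, sensitive_features=sensitive)

# TPR and FPR differences: compute each rate per group, then take the between-group gap.
frame = MetricFrame(
    metrics={"tpr": true_positive_rate, "fpr": false_positive_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sensitive,
)
gaps = frame.difference()                          # max-min gap for each metric
print(f"SPD={spd:.4f}  TPR diff={gaps['tpr']:.4f}  FPR diff={gaps['fpr']:.4f}")
```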

9. Discussion

This study presents a comprehensive evaluation of fair machine learning models across diverse datasets—ADULT, BANK, LAW, MBA, EAP2024, GERMAN, and TITANIC—highlighting persistent trade-offs between predictive accuracy and fairness. The findings reveal that no single model universally dominates across all datasets and fairness criteria; instead, model performance is highly context-dependent, shaped by dataset characteristics and fairness goals.
A central theme is the classical fairness–accuracy trade-off: models enforcing strict fairness guarantees, such as AdaFair, often achieve near-perfect or zero disparity in fairness metrics (statistical parity, true positive rate (TPR), false positive rate (FPR) differences), but sometimes at a modest cost to accuracy, especially on more complex or imbalanced datasets like GERMAN, TITANIC, and EAP2024. However, AdaFair’s performance on datasets such as MBA and BANK demonstrates that strict fairness constraints can occasionally be met without significant accuracy loss.
Gradient boosting-based methods—including FairGBM, Fairboost, and Fairlearn-GB—consistently deliver strong overall performance, balancing high accuracy with moderate to low fairness disparities across most datasets. Their robustness and reliability make them attractive for real-world deployment, especially where both accurate predictions and equitable outcomes are desired. Similarly, Prejudice Remover often ranks among the most accurate models, although it sometimes exhibits higher disparities in FPR, raising caution in sensitive decision-making contexts. NNB-Parity and NNB-DF show inconsistent fairness performance and substantial variability in accuracy, frequently underperforming on fairness metrics like TPR and FPR differences.
Within the Fairlearn framework, linear and tree-based models—particularly Fairlearn-LR—offer interpretable and stable trade-offs, achieving near-perfect fairness with minimal accuracy loss on some datasets. Nonetheless, their effectiveness can vary with dataset complexity, as observed in datasets like TITANIC and EAP2024, where fairness disparities remain pronounced despite reasonable accuracy. Fair Decision Trees consistently show favorable fairness metrics but suffer from low predictive accuracy, limiting their practical use as standalone models. They may be better suited as components within ensemble approaches rather than primary classifiers.
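To make the reductions-based Fairlearn variants discussed above more concrete, the following minimal sketch wraps a plain logistic regression in Fairlearn's exponentiated-gradient reduction under a demographic-parity constraint. The synthetic data, split, and parameter values are placeholders, not the configuration used in our benchmark.
```python
# Minimal sketch of a Fairlearn-LR-style model; data are synthetic placeholders.
import numpy as np
from fairlearn.reductions import DemographicParity, ExponentiatedGradient
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 8))                    # tabular features
s = rng.integers(0, 2, size=2000)                 # sensitive attribute (e.g., encoded gender)
y = (X[:, 0] + 0.5 * s + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(X, y, s, random_state=0)

# Wrap the base learner in the exponentiated-gradient reduction with a
# demographic-parity constraint; the reduction reweights the training data
# iteratively until the constraint is (approximately) satisfied.
mitigator = ExponentiatedGradient(
    estimator=LogisticRegression(max_iter=1000),
    constraints=DemographicParity(),
)
mitigator.fit(X_tr, y_tr, sensitive_features=s_tr)
y_pred = mitigator.predict(X_te)
```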
Dataset characteristics substantially influence fairness–accuracy dynamics. Balanced datasets such as MBA and LAW allow multiple models to achieve both high accuracy and fairness, while more complex or imbalanced datasets, such as EAP2024 and TITANIC, exacerbate trade-offs, making it difficult to optimize both objectives simultaneously.
In summary, model selection must be guided by specific application requirements and fairness priorities. For scenarios where fairness is non-negotiable—such as legal, hiring, or lending contexts—models like AdaFair and Fairlearn-LR are promising due to their strict fairness guarantees. When higher accuracy is critical and some fairness disparities are tolerable, models like Fairboost, FairGBM, and Prejudice Remover provide compelling options, provided fairness impacts are carefully monitored.

9.1. Limitations and Hyperparameter Sensitivity

A major limitation of our benchmarking experiment is the use of default hyperparameters across all fairness-sensitive models. While this choice favors comparability and replicability across a wide range of methods and datasets, fairness–accuracy trade-offs are typically very sensitive to hyperparameter choices. For instance, hyperparameters controlling regularization strength, reweighting coefficients, or adversarial loss terms may have a large influence on both predictive accuracy and fairness metrics. As such, the reported results may understate the performance each model could reach under careful tuning. Future work would benefit from systematic sensitivity analysis or hyperparameter tuning—ideally as part of fairness-constrained cross-validation schemes—to better capture the range of attainable trade-offs and to enable more systematic comparison. Incorporating this additional step would also allow for more realistic model selection and deployment recommendations in fairness-sensitive real-world applications.
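As a concrete illustration of the sensitivity analysis suggested above, the sketch below sweeps the constraint slack eps of Fairlearn's exponentiated-gradient reduction and records how test accuracy and the demographic parity difference respond. The data and the grid of values are hypothetical placeholders rather than part of our benchmark.
```python
# Hedged sketch of a fairness-aware hyperparameter sweep; not performed in this study.
import numpy as np
from fairlearn.metrics import demographic_parity_difference
from fairlearn.reductions import DemographicParity, ExponentiatedGradient
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 6))
s = rng.integers(0, 2, size=2000)          # hypothetical sensitive attribute
y = (X[:, 0] + 0.8 * s + rng.normal(scale=0.5, size=2000) > 0).astype(int)
X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(X, y, s, random_state=0)

for eps in (0.001, 0.01, 0.05, 0.1):       # smaller eps = stricter fairness constraint
    model = ExponentiatedGradient(
        LogisticRegression(max_iter=1000),
        constraints=DemographicParity(),
        eps=eps,
    )
    model.fit(X_tr, y_tr, sensitive_features=s_tr)
    y_pred = model.predict(X_te)
    acc = accuracy_score(y_te, y_pred)
    spd = demographic_parity_difference(y_te, y_pred, sensitive_features=s_te)
    print(f"eps={eps:<6} accuracy={acc:.3f}  parity diff={spd:.3f}")
```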
Our experimental assessment focuses on typical tabular datasets that are widely used in the fairness literature owing to their interpretability, relatively low dimensionality, and well-documented sensitive attributes. While this choice facilitates systematic comparison across fairness-aware models, it limits the generalizability of our findings to more complex domains such as computer vision, natural language processing, and graph-based learning. Fairness issues in these settings involve additional layers of complexity—for example, hidden bias in word embeddings or skewed representation in image features—that tabular benchmarks cannot fully capture.

9.2. Do Fairness Models Fulfill Their Goals?

Fairness-aware models in principle make more equitable predictions by imposing constraints or modifying learning objectives. In practice, however, their performance varies greatly by context. Certain models fulfill fairness constraints but at the cost of lowered accuracy or generalizability, particularly on imbalanced or complex data. For example, adversarial debiasing or reweighting strategies may improve fairness measures at the expense of inducing training instability or undesirable side effects. As our benchmarking shows, how well such models achieve their fairness goals typically depends on dataset attributes such as label skewness, feature correlation with sensitive features, and levels of noise. This disconnect between theoretical guarantees of fairness and real-world performance underscores the need for judicious, context-aware evaluation.
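As one example of the reweighting strategies mentioned above, the following sketch implements a Kamiran and Calders style reweighing step that upweights under-represented (group, label) combinations before fitting a standard classifier; the data and variable names are illustrative, not taken from our experiments.
```python
# Hedged sketch of pre-processing reweighing; illustrative data and names.
import numpy as np
from sklearn.linear_model import LogisticRegression

def reweighing_weights(y, s):
    """Per-instance weight w(g, c) = P(S=g) * P(Y=c) / P(S=g, Y=c)."""
    y = np.asarray(y)
    s = np.asarray(s)
    weights = np.empty(len(y), dtype=float)
    for g in np.unique(s):
        for c in np.unique(y):
            mask = (s == g) & (y == c)
            p_joint = mask.mean()
            if p_joint > 0:
                weights[mask] = (s == g).mean() * (y == c).mean() / p_joint
    return weights

# Hypothetical data: X features, y binary labels, s sensitive attribute.
rng = np.random.default_rng(3)
X = rng.normal(size=(1500, 5))
s = rng.integers(0, 2, size=1500)
y = (X[:, 0] + 0.7 * s > 0).astype(int)

w = reweighing_weights(y, s)
clf = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
```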

9.3. Evaluation and Selection for Practitioners

Practitioners should be deliberate when selecting and comparing fairness-aware methods. First, they must determine which notion of fairness best serves the ethical and legal aims of the application context. Second, models should be compared along multiple dimensions: fairness metrics (e.g., demographic parity difference), predictive performance (e.g., balanced accuracy or the F1 score), and practical constraints (e.g., interpretability or scalability). Since trade-offs are often unavoidable, tools such as fairness–accuracy Pareto frontiers or domain-tailored risk analyses can help guide decisions. Practitioners should also consider how fairness interventions at different stages—pre-processing (e.g., dataset rebalancing), in-processing (e.g., regularization), and post-processing (e.g., threshold tuning)—interact, especially when deploying models in the real world.
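A fairness–accuracy Pareto frontier of the kind mentioned above can be extracted from benchmark results with a few lines of code. The helper below keeps the non-dominated candidates when higher accuracy and lower disparity are both preferred; the model names and numbers are made-up placeholders, not results from this study.
```python
# Illustrative Pareto-frontier helper over (accuracy, disparity) pairs.
def pareto_front(results):
    """results: dict name -> (accuracy, disparity); higher accuracy and lower
    disparity are better. Returns the names of non-dominated models."""
    front = []
    for name, (acc, disp) in results.items():
        dominated = any(
            (a >= acc and d <= disp) and (a > acc or d < disp)
            for other, (a, d) in results.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical candidates (accuracy, statistical parity difference).
candidates = {
    "model_A": (0.85, 0.12),   # accurate but less fair
    "model_B": (0.80, 0.03),   # fairer, slightly less accurate
    "model_C": (0.78, 0.08),   # dominated by model_B
}
print(pareto_front(candidates))   # -> ['model_A', 'model_B']
```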

9.4. Clarifying Accuracy and Performance Metrics

Accuracy must not be treated as a one-size-fits-all measure of model quality in fairness-aware ML. Standard accuracy (the proportion of correct predictions) can be misleading when fairness interventions alter class distributions or decision boundaries. Balanced accuracy, per-group precision and recall, and fairness-related metrics (e.g., equal opportunity difference) therefore offer more sensitive measurements. For instance, a model can achieve high overall accuracy yet perform poorly on underrepresented groups and thus violate fairness principles. Our work combines such nuanced metrics to uncover performance differences that aggregate accuracy alone would obscure, giving insight into where and when fairness-aware models succeed or fail.
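The sketch below illustrates this reporting style on synthetic data deliberately constructed so that a classifier looks strong on aggregate accuracy while its recall collapses on the smaller group; the group labels and sizes are hypothetical.
```python
# Synthetic illustration: aggregate accuracy can hide a weak minority-group recall.
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

rng = np.random.default_rng(7)
group = rng.choice(["priv", "unpriv"], size=1000, p=[0.8, 0.2])
y_true = rng.integers(0, 2, size=1000)
# A hypothetical classifier that is accurate on the privileged group only.
y_pred = np.where(group == "priv", y_true, rng.integers(0, 2, size=1000))

print("overall accuracy :", accuracy_score(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
for g in ("priv", "unpriv"):
    mask = group == g
    print(f"recall[{g}]:", recall_score(y_true[mask], y_pred[mask]))
```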

10. Conclusions

These findings highlight the lack of a single, universally effective fairness-aware algorithm. Instead of pursuing a one-size-fits-all approach, practitioners need to tailor their choices based on the fairness criteria most relevant to their specific use cases, the nature of their data, and the desired trade-off between fairness and performance. The results indicate that ensemble techniques, especially gradient boosting methods, hold promise in delivering fair outcomes without heavily compromising accuracy. However, this also emphasizes the importance of selecting appropriate fairness metrics and conducting thorough evaluations across diverse datasets.
Like all studies, this one also has certain limitations. Specifically, some valid variations of classical models may have been inadvertently overlooked. Furthermore, the scope of this study is restricted to tabular datasets and is solely focused on classification tasks. This study does not include fairness-enhanced algorithms designed for online streaming data, such as Fair Adaptive Random Forests [41], Fairness-Aware Hoeffding Trees [42], Fairness-Enhancing and Concept-Adapting Trees (FEAT) [43], Online Fair Naive Bayes [44], and Fair-CMNB [45]. Similarly, fairness-aware variants of neural networks [46,47,48,49] and graph neural networks [50,51] are also excluded from this review.

Directions for Future Research

Looking ahead, future research should broaden the focus to include regression models and delve deeper into the explainability of fairness-enhanced machine learning techniques. Moreover, since dataset-specific factors play a critical role in outcomes, it is crucial to carefully consider both data properties and fairness objectives when selecting or designing fair ML algorithms. Future efforts might benefit from creating adaptive approaches that balance fairness and accuracy according to each dataset’s unique bias profile.
Several further directions can advance fairness-aware ML. First, more robust benchmarking frameworks are needed that respect domain-dependent notions of fairness and the long-term impact of deployed models. Second, adaptive methods that react dynamically to evolving data distributions or shifting social norms (so-called “adaptive fairness”) remain underdeveloped. Third, interpretability and user trust are key barriers to the deployment of fairness-enhancing models; explainability-embedded fairness-aware pipelines therefore hold great promise. Finally, research into multi-objective optimization—balancing trade-offs between fairness, robustness, and performance—can better prepare practitioners for challenging real-world deployments. Our findings show that no single model outperforms the others on all datasets, but a systematic understanding of fairness trade-offs is both feasible and necessary for ethical AI deployment.

Author Contributions

Conceptualization, G.R. and S.K.; methodology, N.F. and G.D.; software, G.R. and G.D.; validation, N.F. and G.D.; formal analysis, N.F. and G.D.; investigation, G.R.; resources, N.F. and G.D.; writing—original draft preparation, G.R. and G.D.; writing—review and editing, G.D.; supervision, S.K.; project administration, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

The project with title: easyHPC@eco.plastics.industry and MIS: 6001593 is co-funded by the European Union under Competitiveness Programme (ESPA 2021-2027).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. O’Neil, C. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy; Crown Publishing Group: New York, NY, USA, 2016; ISBN 978-0-553-41881-1. [Google Scholar]
  2. Raftopoulos, G.; Davrazos, G.; Kotsiantis, S. Evaluating Fairness Strategies in Educational Data Mining: A Comparative Study of Bias Mitigation Techniques. Electronics 2025, 14, 1856. [Google Scholar] [CrossRef]
  3. Calders, T.; Verwer, S. Three Naive Bayes Approaches for Discrimination-Free Classification. Data Min. Knowl. Disc. 2010, 21, 277–292. [Google Scholar] [CrossRef]
  4. Choi, Y.; Farnadi, G.; Babaki, B.; Broeck, G.V. den Learning Fair Naive Bayes Classifiers by Discovering and Eliminating Discrimination Patterns. Proc. AAAI Conf. Artif. Intell. 2020, 34, 10077–10084. [Google Scholar] [CrossRef]
  5. Boulitsakis-Logothetis, S. Fairness-Aware Naive Bayes Classifier for Data with Multiple Sensitive Features. AAAI Spring Symposium: HFIF 23 February 2022. Available online: https://github.com/steliosbl/N-naive-Bayes (accessed on 10 June 2025).
  6. Kamiran, F.; Calders, T.; Pechenizkiy, M. Discrimination Aware Decision Tree Learning. In Proceedings of the IEEE International Conference on Data Mining, Sydney, NSW, Australia, 13–17 September 2010; pp. 869–874. [Google Scholar]
  7. Aghaei, S.; Azizi, M.J.; Vayanos, P. Learning Optimal and Fair Decision Trees for Non-Discriminative Decision-Making. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 1418–1426. [Google Scholar]
  8. Ranzato, F.; Urban, C.; Zanella, M. Fair Training of Decision Tree Classifiers. arXiv 2021, arXiv:2101.00909. [Google Scholar] [CrossRef]
  9. van der Linden, J.; de Weerdt, M.; Demirović, E. Fair and Optimal Decision Trees: A Dynamic Programming Approach. Adv. Neural Inf. Process. Syst. 2022, 35, 38899–38911. [Google Scholar]
  10. Castelnovo, A.; Cosentini, A.; Malandri, L.; Mercorio, F.; Mezzanzanica, M. FFTree: A Flexible Tree to Handle Multiple Fairness Criteria. Inf. Process. Manag. 2022, 59, 103099. [Google Scholar] [CrossRef]
  11. Jo, N.; Aghaei, S.; Benson, J.; Gomez, A.; Vayanos, P. Learning Optimal Fair Decision Trees: Trade-Offs Between Interpretability, Fairness, and Accuracy. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Montreal, QC, Canada, 8–10 August 2023; pp. 181–192. [Google Scholar]
  12. Bagriacik, M.; Otero, F.E.B. Multiple Fairness Criteria in Decision Tree Learning. Appl. Soft Comput. 2024, 167, 112313. [Google Scholar] [CrossRef]
  13. Pereira Barata, A.; Takes, F.W.; van den Herik, H.J.; Veenman, C.J. Fair Tree Classifier Using Strong Demographic Parity. Mach. Learn. 2024, 113, 3305–3324. [Google Scholar] [CrossRef]
  14. Jo, N.; Aghaei, S.; Benson, J.; Gómez, A.; Vayanos, P. Learning optimal fair classification trees. arXiv 2022, arXiv:2201.09932. [Google Scholar]
  15. Kamishima, T.; Akaho, S.; Asoh, H.; Sakuma, J. Fairness-Aware Classifier with Prejudice Remover Regularizer. In Machine Learning and Knowledge Discovery in Databases, Proceedings of the European Conference, ECML PKDD 2012, Bristol, UK, 24–28 September 2012; Flach, P.A., De Bie, T., Cristianini, N., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 35–50. [Google Scholar]
  16. Fukuchi, K.; Kamishima, T.; Sakuma, J. Prediction with Model-Based Neutrality. IEICE Trans. Inf. Syst. 2015, E98.D, 1503–1516. [Google Scholar] [CrossRef]
  17. Xu, D.; Yuan, S.; Wu, X. Achieving Differential Privacy and Fairness in Logistic Regression. In Proceedings of the Companion Proceedings of the 2019WorldWideWeb Conference, Association for Computing Machinery, San Francisco, CA, USA, 13–17 May 2019; pp. 594–599. [Google Scholar]
  18. Zafar, M.B.; Valera, I.; Gomez-Rodriguez, M.; Gummadi, K.P. Fairness Constraints: A Flexible Approach for Fair Classification. J. Mach. Learn. Res. 2019, 20, 1–42. [Google Scholar]
  19. Radovanović, S.; Petrović, A.; Delibašić, B.; Suknović, M. Enforcing Fairness in Logistic Regression Algorithm. In Proceedings of the International Conference on INnovations in Intelligent SysTems and Applications (INISTA), Novi Sad, Serbia, 24–26 August 2020; pp. 1–7. [Google Scholar]
  20. Elliott, M.; P., D. A Group-Level Learning Approach Using Logistic Regression for Fairer Decisions. In Computer Safety, Reliability, and Security. SAFECOMP 2023 Workshops, Proceedings of the ASSURE, DECSoS, SASSUR, SENSEI, SRToITS, and WAISE, Toulouse, France, 19 September 2023; Guiochet, J., Tonetta, S., Schoitsch, E., Roy, M., Bitsch, F., Eds.; Springer: Cham, Switzerland, 2023; pp. 301–313. [Google Scholar]
  21. Vancompernolle Vromman, F.; Courtain, S.; Leleux, P.; de Schaetzen, C.; Beghein, E.; Kneip, A.; Saerens, M. Maximum Entropy Logistic Regression for Demographic Parity in Supervised Classification. In Artificial Intelligence and Machine Learning, Proceedings of the 35th Benelux Conference, BNAIC/Benelearn 2023, Delft, The Netherlands, 8–10 November 2023; Oliehoek, F.A., Kok, M., Verwer, S., Eds.; Springer: Cham, Switzerland, 2025; pp. 189–208. [Google Scholar]
  22. Zhang, J.; Zhang, Z.; Xiao, X.; Yang, Y.; Winslett, M. Functional Mechanism: Regression Analysis under Differential Privacy. Proc. VLDB Endow. 2012, 5, 1364–1375. [Google Scholar] [CrossRef]
  23. Iosifidis, V.; Ntoutsi, E. AdaFair: Cumulative Fairness Adaptive Boosting. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 781–790. [Google Scholar]
  24. Iosifidis, V.; Roy, A.; Ntoutsi, E. Parity-Based Cumulative Fairness-Aware Boosting. Knowl. Inf. Syst. 2022, 64, 2737–2770. [Google Scholar] [CrossRef]
  25. Cruz, A.F.; Belém, C.; Jesus, S.; Bravo, J.; Saleiro, P.; Bizarro, P. FairGBM: Gradient Boosting with Fairness Constraints. arXiv 2023, arXiv:2209.07850. [Google Scholar] [CrossRef]
  26. Ravichandran, S.; Khurana, D.; Venkatesh, B.; Edakunni, N.U. FairXGBoost: Fairness-Aware Classification in XGBoost. arXiv 2020, arXiv:2009.01442. [Google Scholar]
  27. Xue, Z. Group AdaBoost with Fairness Constraint. In Proceedings of the SIAM International Conference on Data Mining (SDM), St. Paul Twin Cities, MN, USA, 27–29 April 2023; pp. 865–873. [Google Scholar]
  28. Zhang, W.; He, F.; Zhang, S. A Novel Fairness-Aware Ensemble Model Based on Hybrid Sampling and Modified Two-Layer Stacking for Fair Classification. Int. J. Mach. Learn. Cyber. 2023, 14, 3883–3896. [Google Scholar] [CrossRef]
  29. Colakovic, I.; Karakatič, S. FairBoost: Boosting Supervised Learning for Learning on Multiple Sensitive Features. Knowl.-Based Syst. 2023, 280, 110999. [Google Scholar] [CrossRef]
  30. Monteiro, W.R.; Reynoso-Meza, G. A Proposal of a Fair Voting Ensemble Classifier Using Multi-Objective Optimization. In Systems, Smart Technologies and Innovation for Society, Proceedings of the CITIS’2023, Kyoto, Japan, 14–17 December 2023; Salgado-Guerrero, J.P., Vega-Carrillo, H.R., García-Fernández, G., Robles-Bykbaev, V., Eds.; Springer: Cham, Switzerland, 2024; pp. 50–59. [Google Scholar]
  31. Agarwal, A.; Beygelzimer, A.; Dudík, M.; Langford, J.; Wallach, H. A Reductions Approach to Fair Classification. arXiv 2018, arXiv:1803.02453. [Google Scholar] [CrossRef]
  32. Bird, S.; Dudík, M.; Edgar, R.; Horn, B.; Lutz, R.; Milan, V.; Sameki, M.; Wallach, H.; Walker, K. Fairlearn: A Python Toolkit for Assessing and Improving Fairness in AI. 2020. Available online: https://fairlearn.org (accessed on 16 May 2025).
  33. Titanic Dataset. Available online: https://www.kaggle.com/datasets/yasserh/titanic-dataset (accessed on 8 April 2025).
  34. Becker, B.; Kohavi, R. Adult Dataset UCI Machine Learning Repository. 1996. Available online: https://archive.ics.uci.edu/dataset/2/adult (accessed on 10 April 2025).
  35. Moro, S.; Rita, P.; Cortez, P. Bank Marketing, UCI Machine Learning Repository. 2014. Available online: https://archive.ics.uci.edu/dataset/222/bank+marketing (accessed on 10 April 2025).
  36. Hofmann, H. German Credit Data, UCI Machine Learning Repository. 1994. Available online: https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data (accessed on 10 April 2025).
  37. MBA Admission Dataset, Class 2025. Available online: https://www.kaggle.com/datasets/taweilo/mba-admission-dataset/data (accessed on 2 October 2024).
  38. Le Quy, T.; Roy, A.; Iosifidis, V.; Zhang, W.; Ntoutsi, E. A Survey on Datasets for Fairness-Aware Machine Learning. WIREs Data Min. Knowl. Discov. 2022, 12, e1452. [Google Scholar] [CrossRef]
  39. Harvey, D. Law School Dataset. Available online: https://github.com/damtharvey/law-school-dataset (accessed on 10 April 2025).
  40. Hort, M.; Chen, Z.; Zhang, J.M.; Harman, M.; Sarro, F. Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey. ACM J. Responsib. Comput. 2024, 1, 1–52. [Google Scholar] [CrossRef]
  41. Zhang, W.; Bifet, A.; Zhang, X.; Weiss, J.C.; Nejdl, W. FARF: A Fair and Adaptive Random Forests Classifier. In Advances in Knowledge Discovery and Data Mining, Proceedings of the 25th Pacific-Asia Conference, PAKDD 2021, Virtual Event, 11–14 May 2021; Karlapalem, K., Cheng, H., Ramakrishnan, N., Agrawal, R.K., Reddy, P.K., Srivastava, J., Chakraborty, T., Eds.; Springer: Cham, Switzerland, 2021; pp. 245–256. [Google Scholar]
  42. Zhang, W.; Ntoutsi, E. FAHT: An Adaptive Fairness-Aware Decision Tree Classifier. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization. Macao, China, 10–16 August 2019; pp. 1480–1486. [Google Scholar]
  43. Zhang, W.; Bifet, A. FEAT: A Fairness-Enhancing and Concept-Adapting Decision Tree Classifier. In Proceedings of the Discovery Science, 23rd International Conference, Thessaloniki, Greece, 19–21 October 2020; Appice, A., Tsoumakas, G., Manolopoulos, Y., Matwin, S., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 175–189. [Google Scholar]
  44. Badar, M.; Fisichella, M.; Iosifidis, V.; Nejdl, W. Discrimination and Class Imbalance Aware Online Naive Bayes. arXiv 2022, arXiv:2211.04812. [Google Scholar] [CrossRef]
  45. Badar, M.; Fisichella, M. Fair-CMNB: Advancing Fairness-Aware Stream Learning with Naïve Bayes and Multi-Objective Optimization. Big Data Cogn. Comput. 2024, 8, 16. [Google Scholar] [CrossRef]
  46. Padala, M.; Gujar, S. FNNC: Achieving Fairness through Neural Networks. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, 7–15 January 2020; pp. 2277–2283. [Google Scholar]
  47. Datta, A.; Swamidass, S.J. Fair-Net: A Network Architecture For Reducing Performance Disparity Between Identifiable Sub-Populations. In Proceedings of the 14th International Conference on Agents and Artificial Intelligence, Virtual Event, 3–5 February 2022; pp. 645–654. [Google Scholar]
  48. Mohammadi, K.; Sivaraman, A.; Farnadi, G. FETA: Fairness Enforced Verifying, Training, and Predicting Algorithms for Neural Networks. In Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, Boston, MA, USA, 30 October–1 November 2023; pp. 1–11. [Google Scholar]
  49. Khedr, H.; Shoukry, Y. CertiFair: A Framework for Certified Global Fairness of Neural Networks. Proc. AAAI Conf. Artif. Intell. 2023, 37, 8237–8245. [Google Scholar] [CrossRef]
  50. Chen, A.; Rossi, R.A.; Park, N.; Trivedi, P.; Wang, Y.; Yu, T.; Kim, S.; Dernoncourt, F.; Ahmed, N.K. Fairness-Aware Graph Neural Networks: A Survey. ACM Trans. Knowl. Discov. Data 2024, 18, 1–23. [Google Scholar] [CrossRef]
  51. Čutura, L.; Vladimir, K.; Delač, G.; Šilić, M. Fairness in Graph-Based Recommendation: Methods Overview. In Proceedings of the 47th MIPRO ICT and Electronics Convention (MIPRO), Opatija, Croatia, 20–24 May 2024; pp. 850–855. [Google Scholar]
Figure 1. Fairness–accuracy trade-offs of various models on the Adult dataset. Each subplot visualizes the relationships between accuracy, statistical parity difference, equal opportunity difference (TPR difference), and false positive rate (FPR) difference across multiple fairness-aware algorithms. The ideal models are those in the lower-right areas (high accuracy, low fairness disparity).
Figure 2. Fairness–accuracy trade-offs of various models on the BANK dataset.
Figure 3. Fairness–accuracy trade-offs of various models on the TITANIC dataset.
Figure 4. Fairness–accuracy trade-offs of various models on the German Credit dataset.
Figure 5. Fairness–accuracy trade-offs of various models on the EAP2024 dataset.
Figure 6. Fairness–accuracy trade-offs of various models on the MBA dataset.
Figure 7. Fairness–accuracy trade-offs of various models on the LAW dataset.
Table 1. Fair editions of Naive Bayes.
Method | Key Idea | Fairness Metric(s) | Datasets | Remarks
FairBayes I [3]: Probability Mass Adjustment | Modifies P(C|S) and adjusts probability distributions to equalize outcomes between groups while preserving the global distribution. | Discrimination Score, known as Statistical (Demographic) Parity | Artificial Data, US Census Income (Adult) | Directly reduces discrimination, computationally manageable.
FairBayes II [3]: Two-Model Approach | Trains separate Naive Bayes classifiers per sensitive attribute group (2NB) and balances predictions. | | | Most effective in balancing fairness and accuracy in practice.
FairBayes III [3]: Latent Variable Model | Introduces latent variable L, assumed independent of S, with C derived from L via discrimination; trained using the EM algorithm. | | | Mathematically rigorous and best in theory, but unstable in empirical tests.
Fair Naive Bayes Classifier [4] | Efficient branch-and-bound search algorithm to detect discrimination patterns in Naïve Bayes classifiers. | Discrimination Degree | Adult, COMPAS, German Credit | Novel fairness metric, computationally effective.
Fairness-aware Naive Bayes [5] | Extends FairBayes II (2NB) to multiple sensitive attributes (NNB). | Statistical Parity, Disparate Impact, Differential Fairness | US Census Income (Adult) & Employment Data | NNB-Parity maintains better accuracy but may over-favor non-privileged groups; NNB-DF provides stronger fairness guarantees, but at the cost of a more noticeable accuracy drop.
Table 2. Overview of fairness-aware Decision Tree methods.
Method | Key Idea | Fairness Metric(s) | Datasets | Remarks
Discrimination-Aware Decision Tree [6] | Modifies splitting criterion using IG variants (IGC − IGS, IGC/IGS, IGC + IGS) and applies post-processing relabeling (knapsack optimization). | Statistical Parity | UCI Census Income, Dutch Census 1971, Dutch Census 2001 | IGC + IGS with relabeling achieves the best fairness–accuracy trade-off.
RegOCT [7] | Mixed Integer Programming (MIP) framework with fairness indices DI and DT. | Disparate Impact (DI), Disparate Treatment (DT) | Adult, Default, Crime | Near-zero discrimination with a 1–3% accuracy drop; better than the Discrimination-Aware Decision Tree; computationally demanding.
FATT [8] | Genetic optimization using mutation and crossover for fair, single-tree models. | Demographic Parity | Adult, COMPAS, Communities and Crime, German Credit, Heritage Health | Improves fairness by 35–45% with approximately 3.6% accuracy reduction.
Fair and Optimal Decision Trees [9] | Dynamic programming with pruning based on fairness bounds to improve scalability. | Bounds Based on Demographic Parity | Adult, Bank Marketing, Communities and Crime, COMPAS, Dutch Census, German Credit, KDD Census, OULAD, Ricci, Student-Mathematics/Portuguese | Outperforms FairOCT in runtime and scalability.
FFTree [10] | Allows user-defined combinations of fairness constraints in splitting; converts unfair nodes to leaves. | Demographic Parity, Predictive Equality, Equal Opportunity, Predictive Parity | Adult, Loan Granting | Supports more than one fairness metric; maintains accuracy and interpretability with fairness flexibility.
Optimal Fair Decision Trees [11] | Flow-based MIO with binary fairness constraints on paths and structure. | Statistical Parity, Conditional Statistical Parity, Equalized Odds | COMPAS, Adult, German Credit | High interpretability and near-perfect fairness with approximately 4.2% accuracy trade-off.
FAir-C4.5 [12] | Extension of C4.5 with fairness-aware attribute selection: Lexicographic, Constraint-Based, Gain Ratio-Fairness (GRXFR). | Disparate Impact Ratio, Discrimination Score, Consistency, Disparate Treatment | Adult, German Credit, Propublica Recidivism, NYPD SQF CPW, Student Mathematics/Portuguese, Drug Consumption, Ricci, Wine Taste, Bank, Dutch Census, Law School Admission, UFRGS | 0.5–2% lower accuracy than C4.5; more efficient than FFTree.
SCAFF Tree [13] | New splitting criterion (SCAFF) using ROC-AUC and demographic parity. | Demographic Parity | COMPAS, Adult | Achieves higher fairness with a 2–4% accuracy reduction; outperforms the Discrimination-Aware Decision Tree.
Table 3. Summary of fairness-aware Logistic Regression approaches.
Method | Main Idea | Fairness Criterion | Datasets | Remarks
Prejudice Remover Regularizer [15] | Regularization term to penalize bias using the Prejudice Index | Prejudice Index (PI), Normalized PI (NPI) | Adult | Implemented in the AIF360 library; less effective than the Calders & Verwer 2NB approach
η-Neutral Logistic Regression [16] | Enforces probabilistic independence between the target and the sensitive attribute | η-Neutrality | Adult, German Credit, Dutch Census, Bank Marketing, Credit Approval | Outperforms 2NB and the Prejudice Remover Regularizer in the fairness–accuracy trade-off
Differentially Private and Fair Logistic Regression [17] | Combines a fairness penalty with differential privacy via noise injection | Group Fairness + ϵ-Differential Privacy | Adult, Dutch Census | The enhanced PFLR* variant offers better utility than plain PFLR
Disparate Impact-Free Logistic Regression [18] | Adds a covariance-based fairness constraint to convex classifiers | Disparate Treatment, Impact, Mistreatment | Synthetic Data, Adult, Bank Marketing, COMPAS, NYPD SQF | Reduces bias by 50–80% with a 1–5% accuracy drop
Constraint Logistic Regression [19] | Adds an equalized odds constraint to the method of [18] | Disparate Impact, Equalized Odds | Adult, COMPAS | Up to 58% bias reduction; better equalized odds
Group-Level Logistic Regression [20] | Ensures group balance via a median gradient update across sensitive groups | Implicit Fairness via Group Influence | Adult, OULAD | Improves fairness vs. standard LR; less effective than the Fairlearn algorithms
Max-Entropy LR with Demographic Parity [21] | Maximizes entropy while enforcing fairness constraints on the model output | Demographic Parity (via covariance constraint) | Adult, COMPAS, Bank Marketing, German Credit, Law | Includes a fairness-aware feature selection algorithm
Table 4. Summary of fairness-aware ensemble methods.
Method | Key Idea | Fairness Metric(s) | Datasets | Remarks
AdaFair [23,24] | AdaBoost extension reweighting instances for fairness and accuracy; dynamically optimizes weak learners | Equalized Odds, Statistical Parity, Equal Opportunity | Adult, Bank, COMPAS, KDD Census | Integrates fairness directly into the boosting loop
FairGBM [25] | Adds convex fairness constraints into GBDT via Lagrangian optimization | Equal Opportunity, Equalized Odds | ACS Income, Adult, Account Opening Fraud | Outperforms comparable methods by achieving high fairness without significant performance loss
FairXGBoost [26] | XGBoost with a fairness regularizer controlled by hyperparameter μ to reduce correlation with sensitive attributes | Disparate Impact | Adult, COMPAS, Default, Bank | Each dataset responds differently to μ tuning
GAFairC [27] | AdaBoost with group-aware loss and fairness-constraint components: penalty, intensity selection, post-pruning | Equalized Odds | Bank, KDD, COMPAS, Credit, Adult | Achieves higher fairness while maintaining or improving accuracy; more stable across imbalanced datasets
FAEM [28] | Two-stage model: hybrid sampling (ADASYN) followed by a two-layer stacking ensemble for fairness-aware predictions | Average Odds Difference, Equal Opportunity Difference, Disparate Impact | German Credit, Adult, Bank, COMPAS | Effectively balances accuracy and fairness; outperforms prior fairness-aware models
FairBoost [29] | Based on AdaBoost.SAMME.R; supports multiple sensitive attributes via fairness-aware instance weighting | Demographic Parity, Equalized Odds Difference | Adult, Arrhythmia, Drugs, German Credit | Handles multiple sensitive features
Fair Voting Ensemble [30] | Soft voting with separate dynamic weights for privileged/unprivileged groups, optimized via RNSGA-II | Counterfactual Fairness, Equalized Odds (custom) | COMPAS, Adult, German Credit | Post-processing fairness approach
Table 5. Fair ML algorithm comparison on ADULT dataset.
Model | Accuracy | Acc. Diff | Stat. Parity Diff | TPR Diff | FPR Diff
AdaFair 0.7607 ± 0.0000 0.1913 ± 0.0046 0.0000 ± 0.0000 0.0000 ± 0.0000 0.0000 ± 0.0000
FairGBM 0.8740 ± 0.0019 0.0950 ± 0.0050 0.1730 ± 0.0056 0.0661 ± 0.0221 0.0605 ± 0.0039
Fairlearn-GB 0.8662 ± 0.0021 0.0898 ± 0.0060 0.1174 ± 0.0047 0.0232 ± 0.0165 0.0207 ± 0.0037
Fairlearn-NB 0.6185 ± 0.0140 0.0179 ± 0.0149 0.0389 ± 0.0141 0.0150 ± 0.0107 0.0128 ± 0.0111
Fairlearn-DT 0.8186 ± 0.0020 0.1223 ± 0.0045 0.1848 ± 0.0087 0.0641 ± 0.0193 0.0988 ± 0.0059
Fairlearn-LR 0.8431 ± 0.0018 0.0904 ± 0.0073 0.1042 ± 0.0065 0.0286 ± 0.0141 0.0168 ± 0.0043
Fair Random Forest 0.8351 ± 0.0081 0.0827 ± 0.0099 0.1924 ± 0.0179 0.0762 ± 0.0233 0.0825 ± 0.0148
Fair Decision Trees 0.2403 ± 0.0018 0.1896 ± 0.0057 0.0019 ± 0.0033 0.0015 ± 0.0029 0.0019 ± 0.0031
NNB-Parity 0.5863 ± 0.0213 0.0378 ± 0.0165 0.2153 ± 0.0160 0.0514 ± 0.0110 0.1601 ± 0.0143
NNB-DF 0.5837 ± 0.0214 0.0362 ± 0.0162 0.2135 ± 0.0159 0.0499 ± 0.0119 0.1589 ± 0.0139
Fairboost 0.8475 ± 0.0159 0.1037 ± 0.0209 0.1713 ± 0.0203 0.1017 ± 0.0187 0.0697 ± 0.0190
Prejudice Remover 0.8504 ± 0.0019 0.1144 ± 0.0038 0.1403 ± 0.0048 0.0551 ± 0.0132 0.0502 ± 0.0034
Table 6. Fair ML algorithm comparison on BANK dataset.
Model | Accuracy | Acc. Diff | Stat. Parity Diff | TPR Diff | FPR Diff
AdaFair 0.8874 ± 0.0000 0.0160 ± 0.0042 0.0000 ± 0.0000 0.0000 ± 0.0000 0.0000 ± 0.0000
FairGBM 0.9177 ± 0.0018 0.0123 ± 0.0024 0.0108 ± 0.0045 0.0233 ± 0.0162 0.0053 ± 0.0028
Fairlearn-GB 0.9178 ± 0.0018 0.0110 ± 0.0014 0.0120 ± 0.0065 0.0220 ± 0.0216 0.0045 ± 0.0036
Fairlearn-NB 0.8214 ± 0.0098 0.0163 ± 0.0074 0.0127 ± 0.0075 0.0347 ± 0.0146 0.0148 ± 0.0093
Fairlearn-DT 0.8894 ± 0.0022 0.0121 ± 0.0037 0.0095 ± 0.0036 0.0240 ± 0.0147 0.0047 ± 0.0026
Fairlearn-LR 0.9122 ± 0.0014 0.0140 ± 0.0028 0.0104 ± 0.0058 0.0267 ± 0.0141 0.0052 ± 0.0033
Fair Random Forest 0.8592 ± 0.0114 0.0158 ± 0.0051 0.0197 ± 0.0078 0.0250 ± 0.0162 0.0131 ± 0.0071
Fair Decision Trees 0.5605 ± 0.0344 0.0153 ± 0.0091 0.0078 ± 0.0040 0.0144 ± 0.0062 0.0110 ± 0.0083
NNB-Parity 0.7780 ± 0.0053 0.0352 ± 0.0061 0.0413 ± 0.0068 0.0174 ± 0.0123 0.0377 ± 0.0078
NNB-DF 0.5624 ± 0.1693 0.0689 ± 0.0244 0.0893 ± 0.0317 0.0475 ± 0.0131 0.0891 ± 0.0351
Fairboost 0.8813 ± 0.0068 0.0020 ± 0.0012 0.0152 ± 0.0074 0.0825 ± 0.0332 0.0160 ± 0.0054
Prejudice Remover 0.9116 ± 0.0010 0.0142 ± 0.0028 0.0082 ± 0.0054 0.0269 ± 0.0173 0.0044 ± 0.0027
Table 7. Fair ML algorithm comparison on Titanic dataset.
Model | Accuracy | Acc. Diff | Stat. Parity Diff | TPR Diff | FPR Diff
AdaFair 0.7813 ± 0.0221 0.0552 ± 0.0324 1.0000 ± 0.0000 1.0000 ± 0.0000 1.0000 ± 0.0000
FairGBM 0.8090 ± 0.0204 0.0549 ± 0.0335 0.5738 ± 0.0801 0.4563 ± 0.0932 0.2093 ± 0.1270
Fairlearn-GB 0.8299 ± 0.0220 0.0316 ± 0.0273 0.7261 ± 0.0834 0.6740 ± 0.1089 0.3932 ± 0.1916
Fairlearn-NB 0.4608 ± 0.0124 0.4807 ± 0.0693 0.0898 ± 0.0400 0.0736 ± 0.0584 0.3346 ± 0.0671
Fairlearn-DT 0.8250 ± 0.0183 0.0414 ± 0.0365 0.6780 ± 0.0383 0.5792 ± 0.0901 0.3454 ± 0.1005
Fairlearn-LR 0.7810 ± 0.0122 0.1141 ± 0.0409 0.3689 ± 0.0672 0.2232 ± 0.1007 0.0861 ± 0.0919
Fair Random Forest 0.6970 ± 0.0432 0.1215 ± 0.0630 0.1815 ± 0.0693 0.0626 ± 0.0511 0.1851 ± 0.0725
Fair Decision Trees 0.3851 ± 0.0052 0.5448 ± 0.0528 0.0030 ± 0.0069 0.0033 ± 0.0098 0.0047 ± 0.0142
NNB-Parity 0.4810 ± 0.0186 0.3133 ± 0.1185 0.1113 ± 0.0391 0.1056 ± 0.0555 0.3211 ± 0.0453
NNB-DF 0.4810 ± 0.0186 0.3133 ± 0.1185 0.1113 ± 0.0391 0.1056 ± 0.0555 0.3211 ± 0.0453
Fairboost 0.8127 ± 0.0164 0.0279 ± 0.0166 0.8255 ± 0.0333 0.7659 ± 0.0559 0.6269 ± 0.0811
Prejudice Remover 0.8216 ± 0.0125 0.0161 ± 0.0134 0.5725 ± 0.0417 0.4862 ± 0.0406 0.1847 ± 0.0993
Table 8. Fair ML algorithm comparison on German Credit dataset.
Model | Accuracy | Acc. Diff | Stat. Parity Diff | TPR Diff | FPR Diff
AdaFair 0.7000 ± 0.0000 0.1274 ± 0.0542 0.0000 ± 0.0000 0.0000 ± 0.0000 0.0000 ± 0.0000
FairGBM 0.7463 ± 0.0182 0.1063 ± 0.0626 0.1163 ± 0.0519 0.1102 ± 0.0488 0.0930 ± 0.0653
Fairlearn-GB 0.7397 ± 0.0191 0.1050 ± 0.0584 0.1107 ± 0.0623 0.0975 ± 0.0607 0.1233 ± 0.0834
Fairlearn-NB 0.6270 ± 0.0398 0.0858 ± 0.0712 0.0622 ± 0.0324 0.0932 ± 0.0588 0.1247 ± 0.0805
Fairlearn-DT 0.6750 ± 0.0206 0.0814 ± 0.0427 0.0970 ± 0.0438 0.0939 ± 0.0556 0.0668 ± 0.0761
Fairlearn-LR 0.7493 ± 0.0214 0.1282 ± 0.0494 0.0813 ± 0.0580 0.0921 ± 0.0356 0.1341 ± 0.0524
Fair Random Forest 0.7423 ± 0.0196 0.1420 ± 0.0604 0.0478 ± 0.0224 0.0667 ± 0.0363 0.1456 ± 0.0986
Fair Decision Trees 0.6677 ± 0.0289 0.1203 ± 0.0692 0.0271 ± 0.0170 0.0368 ± 0.0317 0.0369 ± 0.0332
NNB-Parity 0.6750 ± 0.0265 0.1314 ± 0.0473 0.1090 ± 0.0427 0.1470 ± 0.0645 0.1349 ± 0.0953
NNB-DF 0.7190 ± 0.0100 0.1993 ± 0.0376 0.0530 ± 0.0361 0.0718 ± 0.0357 0.2285 ± 0.0818
Fairboost 0.7293 ± 0.0173 0.0969 ± 0.0548 0.0537 ± 0.0496 0.0438 ± 0.0278 0.0987 ± 0.0819
Prejudice Remover 0.7440 ± 0.0176 0.1519 ± 0.0611 0.1359 ± 0.0742 0.1549 ± 0.1130 0.0799 ± 0.0555
Table 9. Fair ML algorithm comparison on EAP2024 dataset.
Model | Accuracy | Acc. Diff | Stat. Parity Diff | TPR Diff | FPR Diff
AdaFair 0.7463 ± 0.0248 0.0383 ± 0.0356 0.0631 ± 0.0438 0.0307 ± 0.0244 0.0644 ± 0.0434
FairGBM 0.7896 ± 0.0213 0.0484 ± 0.0201 0.0618 ± 0.0480 0.0840 ± 0.0591 0.0523 ± 0.0367
Fairlearn-GB 0.7731 ± 0.0161 0.0445 ± 0.0411 0.0933 ± 0.0501 0.0804 ± 0.0648 0.0954 ± 0.0582
Fairlearn-NB 0.7873 ± 0.0347 0.0736 ± 0.0484 0.0852 ± 0.0608 0.0890 ± 0.0517 0.1077 ± 0.0829
Fairlearn-DT 0.7522 ± 0.0277 0.0550 ± 0.0418 0.0617 ± 0.0436 0.1189 ± 0.0554 0.0735 ± 0.0520
Fairlearn-LR 0.7604 ± 0.0582 0.0600 ± 0.0526 0.0787 ± 0.0307 0.1735 ± 0.0987 0.0914 ± 0.0495
Fair Random Forest 0.5769 ± 0.0200 0.1073 ± 0.0708 0.0400 ± 0.0303 0.0000 ± 0.0000 0.0836 ± 0.0528
Fair Decision Trees 0.3791 ± 0.0601 0.0798 ± 0.0338 0.0521 ± 0.0313 0.0623 ± 0.0502 0.0477 ± 0.0346
NNB-Parity 0.7299 ± 0.0294 0.0984 ± 0.0574 0.0788 ± 0.0456 0.2319 ± 0.0918 0.0782 ± 0.0486
NNB-DF 0.6522 ± 0.0535 0.0457 ± 0.0536 0.0655 ± 0.0585 0.1505 ± 0.0717 0.0895 ± 0.0747
Fairboost 0.8015 ± 0.0255 0.0620 ± 0.0315 0.0587 ± 0.0578 0.1068 ± 0.0547 0.0836 ± 0.0708
Prejudice Remover 0.7978 ± 0.0210 0.0610 ± 0.0290 0.1000 ± 0.0642 0.1083 ± 0.0708 0.1045 ± 0.0615
Table 10. Fair ML algorithm comparison on MBA dataset.
Model | Accuracy | Acc. Diff | Stat. Parity Diff | TPR Diff | FPR Diff
AdaFair 0.8524 ± 0.0000 0.0967 ± 0.0103 0.0000 ± 0.0000 0.0000 ± 0.0000 0.0000 ± 0.0000
FairGBM 0.8345 ± 0.0061 0.0812 ± 0.0073 0.0115 ± 0.0106 0.0547 ± 0.0526 0.0091 ± 0.0062
Fairlearn-GB 0.8440 ± 0.0058 0.0840 ± 0.0124 0.0115 ± 0.0098 0.0383 ± 0.0184 0.0119 ± 0.0107
Fairlearn-NB 0.8487 ± 0.0071 0.0961 ± 0.0093 0.0026 ± 0.0031 0.0156 ± 0.0265 0.0023 ± 0.0025
Fairlearn-DT 0.7966 ± 0.0074 0.0493 ± 0.0155 0.0135 ± 0.0105 0.1011 ± 0.0466 0.0202 ± 0.0167
Fairlearn-LR 0.8529 ± 0.0012 0.0952 ± 0.0116 0.0020 ± 0.0037 0.0124 ± 0.0162 0.0014 ± 0.0021
Fair Random Forest 0.7808 ± 0.0118 0.0280 ± 0.0150 0.0212 ± 0.0128 0.2370 ± 0.0435 0.0317 ± 0.0127
Fair Decision Trees 0.4852 ± 0.0320 0.0424 ± 0.0250 0.0209 ± 0.0132 0.1129 ± 0.0501 0.0311 ± 0.0152
NNB-Parity 0.8090 ± 0.0061 0.0614 ± 0.0085 0.0146 ± 0.0119 0.1646 ± 0.0307 0.0166 ± 0.0098
NNB-DF 0.8056 ± 0.0073 0.0519 ± 0.0119 0.0156 ± 0.0125 0.1697 ± 0.0322 0.0214 ± 0.0122
Fairboost 0.8469 ± 0.0041 0.1024 ± 0.0121 0.0544 ± 0.0125 0.0811 ± 0.0458 0.0397 ± 0.0072
Prejudice Remover 0.8441 ± 0.0055 0.0957 ± 0.0114 0.0231 ± 0.0100 0.0321 ± 0.0280 0.0172 ± 0.0055
Table 11. Fair ML algorithm comparison on LAW dataset.
Model | Accuracy | Acc. Diff | Stat. Parity Diff | TPR Diff | FPR Diff
AdaFair 0.9017 ± 0.0000 0.0222 ± 0.0045 0.0000 ± 0.0000 0.0000 ± 0.0000 0.0000 ± 0.0000
FairGBM 0.9076 ± 0.0013 0.0146 ± 0.0050 0.0109 ± 0.0046 0.0024 ± 0.0020 0.0457 ± 0.0203
Fairlearn-GB 0.9081 ± 0.0019 0.0155 ± 0.0041 0.0096 ± 0.0060 0.0026 ± 0.0021 0.0427 ± 0.0246
Fairlearn-NB 0.8352 ± 0.0045 0.0093 ± 0.0065 0.0242 ± 0.0090 0.0104 ± 0.0080 0.0389 ± 0.0192
Fairlearn-DT 0.8587 ± 0.0032 0.0086 ± 0.0041 0.0098 ± 0.0039 0.0055 ± 0.0063 0.0378 ± 0.0216
Fairlearn-LR 0.9081 ± 0.0020 0.0139 ± 0.0049 0.0100 ± 0.0054 0.0023 ± 0.0018 0.0467 ± 0.0336
Fair Random Forest 0.9069 ± 0.0020 0.0149 ± 0.0021 0.0183 ± 0.0051 0.0067 ± 0.0024 0.0682 ± 0.0297
Fair Decision Trees 0.2928 ± 0.0145 0.0065 ± 0.0054 0.0126 ± 0.0067 0.0096 ± 0.0067 0.0296 ± 0.0145
NNB-Parity 0.8177 ± 0.0054 0.0155 ± 0.0084 0.0300 ± 0.0096 0.0171 ± 0.0097 0.0332 ± 0.0255
NNB-DF 0.8030 ± 0.0057 0.0161 ± 0.0084 0.0321 ± 0.0110 0.0191 ± 0.0104 0.0317 ± 0.0197
Fairboost 0.9096 ± 0.0015 0.0164 ± 0.0048 0.0213 ± 0.0044 0.0090 ± 0.0021 0.0919 ± 0.0308
Prejudice Remover 0.9102 ± 0.0015 0.0150 ± 0.0039 0.0236 ± 0.0048 0.0095 ± 0.0021 0.1053 ± 0.0333