1. Introduction
In Educational Data Mining (EDM) and Learning Analytics (LA), research has traditionally prioritized optimizing machine learning models for performance metrics [1,2]. Machine learning models significantly impact student performance analysis, learning pattern identification, and educational outcome improvement [3]. As shown in Figure 1, the educational data mining process involves several stages. Yet, these models can embed and perpetuate biases, thereby reinforcing systemic inequalities in grading, admissions, and personalized learning recommendations [4].
Bias in data refers to systematic distortions that misrepresent the reality the data are supposed to reflect. In educational data, bias can emerge from various sources such as socioeconomic status, geographic location, gender, race, or prior academic exposure. This is particularly critical in educational contexts, where biased data can reinforce existing inequalities or misguide interventions. For example, if historical data underrepresent students from marginalized communities, predictive models may underestimate their performance or potential, leading to unfair treatment in admissions, resource allocation, or personalized recommendations.
These biases are not only present in individual data points but can also manifest across structural and behavioral patterns within different types of education systems. In online learning environments, for instance, students with unstable internet access or lower digital literacy may generate incomplete or misleading learning data, skewing models toward digitally advantaged learners. Similarly, in traditional classroom settings, grading practices or teacher perceptions may introduce subjective bias into performance metrics. Understanding and evaluating these forms of bias is essential, as their unaddressed presence can compound educational disadvantage, misinform policy decisions, and diminish trust in data-driven systems.
To mitigate such impacts, fairness-aware approaches are integrated at different stages of the educational data mining process. Preprocessing techniques aim to reduce bias before model training—this may include rebalancing datasets, removing outliers, or transforming features. In-processing methods embed fairness constraints directly into the learning algorithms, such as adversarial debiasing or fair-regularized loss functions. Post-processing focuses on adjusting model outputs to improve equity, for example, through equalized odds or threshold optimization.
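As a minimal sketch (assuming the AIF360 toolkit used later in this study and a hypothetical binary gender attribute), the snippet below indicates where each family of interventions attaches in the pipeline: preprocessing operates on the data before training, while in-processing and post-processing counterparts hook into the learner and its outputs, respectively.

```python
# Minimal sketch, assuming the AIF360 toolkit (used later in this study) and a
# binary "gender" attribute with hypothetical encodings; not a prescribed pipeline.
from aif360.algorithms.preprocessing import Reweighing                         # before training
# from aif360.algorithms.inprocessing import PrejudiceRemover                  # during training
# from aif360.algorithms.postprocessing import CalibratedEqOddsPostprocessing  # after training

privileged_groups = [{"gender": 1}]
unprivileged_groups = [{"gender": 0}]

# Pre-processing: reweight training instances so that each (group, label)
# combination carries balanced influence before any classifier is fit.
rw = Reweighing(unprivileged_groups=unprivileged_groups,
                privileged_groups=privileged_groups)
# train_fair = rw.fit_transform(train_data)   # train_data: an AIF360 BinaryLabelDataset
```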
Recent studies and publications from both within and outside these fields have highlighted concerns regarding algorithmic fairness, particularly the unfair treatment of certain demographic groups, primarily based on gender and race (for example, the bestseller [5]). This recognition has spurred significant research into fairness in machine learning and artificial intelligence more broadly [6].
Baker and Hawn [7] provide a nuanced analysis of algorithmic bias in educational technologies, addressing its causes, manifestations, and potential solutions across demographic dimensions such as race, gender, nationality, socioeconomic status, and disability. Similarly, Fenu [8] underscores the need for fairness-aware pipelines, continuous fairness assessment, and transparency-enhancing tools to mitigate bias in educational applications. A systematic review by Lisha et al. [9] investigated predictive bias in educational ML applications, analyzing 49 peer-reviewed studies published since 2010. The review identified three core fairness issues: the role of protected attributes, the application of fairness measures, and bias mitigation strategies. The work highlights that imbalanced training data and flawed model design are primary contributors to systematic discrimination. Notably, while group-level fairness metrics such as equalized odds are commonly applied, there is limited exploration of intersectional bias and metric suitability for specific educational contexts, revealing gaps in existing methodologies.
Deho et al. [10] explored whether incorporating sensitive attributes—such as gender, age, disability, and home language—into LA models enhanced fairness without compromising predictive performance. Their study on a three-year dropout dataset from an Australian university found that the inclusion or exclusion of sensitive attributes had marginal effects on fairness and accuracy. Crucially, the study emphasizes that simply removing sensitive attributes does not eliminate bias but may obscure deeper inequalities. The authors advocate for targeted bias mitigation algorithms over simplistic attribute exclusion, underscoring the importance of recognizing often-overlooked sensitive attributes like home language and disability.
Idowu et al. [11] categorized fairness into individual and group fairness, evaluating algorithmic equity through metrics such as ABROCA, demographic parity, and counterfactual fairness. Their review discusses bias mitigation strategies, including class balancing, adversarial learning, fairness through awareness/unawareness, and counterfactual fairness. They found no strict tradeoff between fairness and accuracy; rather, fairness-enhancing strategies were often found to complement predictive performance. However, they found that societal biases may lead users to prefer biased systems, presenting challenges in real-world adoption.
The authors in [12] investigated the impact of class balancing techniques on mitigating algorithmic bias in predictive modeling for education. In their study, eleven class balancing techniques were applied across three predictive tasks to assess their effects on fairness and accuracy. The findings suggest that oversampling techniques, such as SMOTE and ADASYN, can significantly improve predictive fairness without sacrificing accuracy. By systematically evaluating the interplay between data balancing methods, fairness, and predictive performance, the research highlights the importance of addressing demographic imbalances in educational datasets.
While class balancing techniques such as SMOTE and ADASYN are widely used to address class imbalance and mitigate algorithmic bias, they are not without limitations. One potential drawback is the risk of overfitting, especially when synthetic instances are generated in sparse regions of the feature space, which may lead the model to learn patterns that do not generalize well to unseen data. Additionally, these methods can introduce fictitious data points that do not accurately reflect the true underlying distribution, particularly in complex or highly non-linear datasets. This is especially problematic in educational data, where subtle contextual factors influence student behavior and performance. In such cases, synthetic examples might lack the nuanced relationships present in real student data, potentially distorting the model’s learning process.
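For concreteness, the snippet below sketches how such oversampling is typically applied with the imbalanced-learn library; the synthetic data and parameters are placeholders and do not reproduce the setup of [12].

```python
# Illustrative only: class balancing with imbalanced-learn, as studied in [12].
# The synthetic data and parameters are placeholders, not the original setup.
from collections import Counter
from imblearn.over_sampling import ADASYN, SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
print("original class counts:", Counter(y))

X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)    # interpolates new minority samples
print("after SMOTE          :", Counter(y_sm))

# ADASYN concentrates synthesis on harder-to-learn minority regions, which is
# also where the overfitting risk discussed above tends to be largest.
X_ad, y_ad = ADASYN(random_state=0).fit_resample(X, y)
print("after ADASYN         :", Counter(y_ad))
```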
In [13], the authors introduced a Fair Logistic Regression (Fair-LR) model designed to address demographic biases in predictions by enforcing fairness constraints such as equalized odds. Using a dataset from high school students interacting with an Algebra II learning platform, the study compared Fair-LR’s performance against traditional fairness-unaware models like Logistic Regression, Support Vector Machine, and Random Forest. The results revealed that Fair-LR achieved significantly better fairness across demographic groups, particularly in reducing disparities in prediction outcomes (e.g., true positive and false positive rates) by race and gender, while maintaining competitive predictive accuracy.
In [14], the authors conducted a systematic literature review and applied their findings to investigate fairness in learning analytics through a case study using the OULAD [15]. They identified three types of discrimination in learning analytics—direct, indirect, and underestimation—highlighting subtle algorithmic biases, especially in underrepresented groups. They also reviewed methods to measure fairness, such as Disparate Impact (DI), Normalized Mutual Information (NMI), and the Underestimation Index (UEI), and approaches to mitigate bias, including data preprocessing, fairness optimization, and post-processing techniques. Applying these insights to the OULAD, they evaluated various predictive models, finding that algorithms tended to reproduce existing biases, with mixed success in improving fairness using constraint-based methods or excluding sensitive attributes. Their work underscores the importance of information-theoretic measures for identifying imbalances in datasets and calls for refined fairness techniques in learning analytics.
The study by [16] systematically evaluated seven well-established group fairness measures—statistical parity, equal opportunity, equalized odds, predictive parity, predictive equality, treatment equality, and absolute between-ROC area (ABROCA)—across five widely used educational datasets. These datasets vary in size, protected attributes, and class imbalances, thereby reflecting diverse educational contexts. The authors employed four commonly used machine learning models—Decision Trees, Naive Bayes, Multi-layer Perceptron, and Support Vector Machines—along with two fairness-aware algorithms, namely, Agarwal’s reduction-based approach and AdaFair. The results revealed that fairness metrics exhibit substantial variation across models and datasets, emphasizing that the selection of an appropriate fairness measure is context-dependent. Notably, ABROCA demonstrated relatively low variability across models and datasets, making it a more stable and reliable indicator of fairness.
To the best of our knowledge, the most recent research in this area is that of [17]. In their study, the authors investigated the effectiveness of bias mitigation methods in machine learning models applied to an educational dataset, specifically the High School Longitudinal Study of 2009 (HSLS:09) dataset. The authors evaluated the performance of four bias mitigation techniques—reweighting, uniform resampling, preferential resampling, and reject option-based classification (ROC) pivoting—using the DALEX library (v1.7.0) in Python, with Decision Trees as the sole machine learning model. Among the evaluated techniques, the ROC pivoting method emerged as the most balanced approach, achieving a moderate reduction in bias while preserving the original predictive performance of the model.
This study aims to critically address a significant research gap in the systematic evaluation of fairness mitigation strategies within educational machine learning applications. Specifically, we conducted a comprehensive, pipeline-wide comparison of bias mitigation techniques—including preprocessing (e.g., reweighting, learning fair representations, and disparate impact remover), in-processing (e.g., adversarial debiasing and prejudice remover), and post-processing methods (e.g., equalized odds and calibrated equalized odds). These methods were applied across three distinct learning analytics datasets: HOULAD, xAPI-Edu, and OULAD, with each presenting unique fairness-related challenges due to the influence of protected attributes on model outcomes. The need for robust strategies to balance predictive performance with algorithmic fairness is therefore both urgent and context-dependent.
Each dataset represents a critical area in education where fairness and performance are tightly intertwined. The student performance dataset focuses on the early identification of students at risk of academic failure, which is a key concern for supporting timely interventions in secondary education. The MOOC dataset targets the pervasive issue of learner attrition, aiming to predict dropout risk and inform adaptive course designs that foster sustained engagement. The student admissions dataset highlights issues of equity in access to higher education, seeking to uncover and mitigate potential biases in admission decisions driven by sensitive features such as gender or socioeconomic background.
All three tasks are framed as binary classification problems: predicting pass/fail outcomes, dropout likelihood, or admission decisions, respectively. Although these predictive tasks differ in context and implications, they share a common challenge—ensuring that model decisions do not systematically disadvantage specific groups. While we applied a variety of machine learning algorithms and fairness mitigation strategies, our central research question remained consistent: how can we effectively reduce algorithmic bias in educational predictions without sacrificing model utility?
Our motivation stems from a critical need in the field: the absence of a holistic, comparative framework for assessing fairness interventions across varied educational scenarios. Despite the existence of numerous fairness-aware algorithms, their effectiveness can vary widely based on task characteristics, data imbalance, and the nature of sensitive attributes involved. Our approach is therefore not to prescribe a one-size-fits-all solution but rather to provide empirical evidence that can guide educational stakeholders—such as data scientists, institutional decision makers, and policymakers—in selecting contextually appropriate fairness strategies aligned with their specific goals and constraints.
While our empirical focus was on three representative datasets, the broader contribution of this study lies in its systematic and task-agnostic analysis of bias mitigation techniques across the entire machine learning pipeline. By evaluating the performance of these techniques using multiple fairness metrics and under varying data conditions, we offer actionable insights into their practical strengths, limitations, and tradeoffs. Ultimately, the findings from our comparative study serve not only to inform the design of fairer predictive systems in education but also to support the development of responsible AI practices in real-world, high-stakes applications.
In contrast, Wongvorachan et al. [17] focused narrowly on four techniques—uniform and preferential resampling, reweighting, and ROC pivoting—applied solely to the HSLS:09 high school dropout dataset, reporting that the ROC pivot method best preserved accuracy while reducing false positive disparity. Le Quy et al. [16], meanwhile, did not implement mitigation algorithms but instead analyzed the stability of seven group fairness metrics (e.g., statistical parity, equalized odds, and ABROCA) across five datasets and six predictive models, illustrating the sensitivity of fairness assessments to both the chosen metric and threshold definitions.
Our work extends these efforts by offering a more generalizable evaluation that accounts for the interaction between multiple classifiers and various fairness interventions. Through this approach, we systematically examined the tradeoffs between fairness and predictive performance. While preprocessing methods like learning fair representations often achieve a balanced compromise, in-processing and post-processing techniques offer alternative advantages, though each technique has its own limitations. A nuanced understanding of these tradeoffs is essential for developing ethical and responsible machine learning systems in educational contexts.
The key contributions of this study are as follows:
We conducted a comprehensive empirical evaluation of fairness mitigation strategies (preprocessing, in-processing, and post-processing) across three representative educational datasets, with each reflecting a distinct real-world prediction task.
We analyzed the tradeoffs between fairness and predictive accuracy, offering insight into how different techniques perform under various fairness metrics and data conditions.
We demonstrated how each mitigation method (e.g., reweighting, LFR, and DIR) is practically applied to educational datasets and highlighted their effectiveness and limitations in educational settings.
Based on the experimental results, we propose practical recommendations for selecting fairness-aware methods tailored to specific educational challenges, helping bridge the gap between theoretical fairness frameworks and real-world educational applications.
The remainder of this paper is structured as follows. Section 2 introduces various bias mitigation techniques, categorizing them into preprocessing, in-processing, and post-processing approaches. Section 3.1 presents an overview of the three learning analytics datasets employed in this study, discussing their characteristics. In Section 3.4, we define the classification and fairness metrics used to evaluate the effectiveness of these techniques. The experimental design, including data preprocessing, model selection, and evaluation methodology, is also detailed in Section 3, followed by an in-depth analysis of the results in Section 4. Finally, Section 5 critically examines the findings, contextualizing them within the broader field of fairness in AI, while Section 6 summarizes key insights and outlines directions for future research.
3. Experiments
3.1. Datasets
In this subsection, we provide a concise overview of the datasets utilized in our experimental procedure for evaluating both performance and fairness.
3.1.1. Hellenic Open University Learning Analytics Dataset (HOULAD)
This dataset, described in [26], was collected as part of the Erasmus+ initiative “DevOps Competences for Smart Cities”. It initially included data from 961 MOOC participants, with 944 completing a comprehensive questionnaire. After preprocessing steps such as addressing missing data and performing data cleaning, a total of 923 cases were prepared for the final analysis. The final dataset consists of 66.6% males and 33.4% females. A significant majority, 75.5%, did not succeed in obtaining the certificate, while only 24.5% achieved certification.
The dataset is organized into three main subsets:
Demographics Subset: This subset contains mostly categorical features, such as gender and education level.
Performance Subset: It includes 10 primarily numerical features related to academic outcomes.
Activity Subset: This subset features 12 numerical attributes describing engagement on the online learning platform.
3.1.2. xAPI-Educational Mining Dataset
This dataset comprises information on 480 students across primary, secondary, and high school levels, encompassing 16 attributes [27,28]. These attributes include gender, nationality, place of birth, educational level, grade level, course topic, classroom ID, semester, parent responsible for the student, raised hands, visited resources, viewing announcements, discussion group participation, parent answering surveys, parent school satisfaction, and student absence days. The students in the dataset are from 14 predominantly Islamic countries. The dataset covers a diverse range of course topics, including IT, Math, Arabic, Science, English, Quran, Spanish, French, History, Biology, Chemistry, and Geology. It was sourced from the Kalboard 360 LMS via the Experience API (xAPI). The Class column in the dataset categorizes students’ overall performance into three levels—low, middle, and high—based on their total grades. For our study, these categories were simplified into a binary classification: Pass (1), representing middle and high performance, and Fail (0), representing low performance. This binary classification served as the target variable for this study and has also been adopted in other studies [28,29,30]. During the preprocessing phase, we excluded the attributes of nationality, place of birth, and semester to streamline the analysis. After preprocessing, 478 cases remained, comprising 63.4% male and 36.6% female participants. A significant majority, 73.9%, demonstrated medium to high academic performance, while the remaining 26.1% were classified as low-performing.
3.1.3. Open University Learning Analytics Dataset (OULAD)
The Open University Learning Analytics Dataset (OULAD) [15] is a publicly available dataset designed to support research in learning analytics and educational data mining. Developed from data collected at the Open University (OU), the largest distance-learning institution in the United Kingdom, OULAD provides a rich source of information on student demographics, course enrollments, assessment performance, and interactions with the Virtual Learning Environment (VLE). The dataset includes anonymized data from 32,593 students enrolled in seven different courses across multiple semesters in 2013 and 2014. The OULAD dataset is organized into seven distinct CSV files, with each representing specific components of student and course data. From these individual files, we constructed a unified dataset that was subsequently used for our analysis. The final outcomes of students enrolled in these courses were categorized into four groups: Distinction, Pass, Fail, and Withdrawn. For the purposes of our analysis, we transformed these groups into a binary classification problem by merging the Distinction and Pass categories. The Fail category was retained as is, while data corresponding to the Withdrawn category were excluded from the analysis. Additionally, certain features, including id_student, date_unregistration, date_registration, weighted_score, late_rate, fail_rate, region, and imd_band, were removed from the dataset. To address imbalances in features such as highest_education and age_band, we suppressed distinct categories within these features. The final dataset comprises 22,437 cases, with 54.4% male and 45.6% female participants. A significant majority, 68.6%, successfully passed the exams, while the remaining 31.4% did not.
3.2. Experimental Procedure
Beyond the preprocessing steps outlined in the preceding sections, our experimental procedure was identical across the three learning analytics datasets to ensure methodological consistency and comparability of results.
For each experiment, the dataset was split into a training set (70%) and a testing set (30%) to support model training and evaluation. This process was repeated five times, and the results were averaged to ensure robustness.
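The evaluation protocol can be summarized with the following sketch; the data and classifier are placeholders, and the seeds shown are assumptions rather than the exact values used in our runs.

```python
# Sketch of the evaluation protocol described above (70/30 split, five
# repetitions, averaged scores). The data and classifier are placeholders;
# the seeds are assumptions, not the exact values used in our runs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

scores = []
for seed in range(5):                                        # five repetitions
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                              random_state=seed)
    clf = LogisticRegression().fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, clf.predict(X_te)))

print("mean accuracy over 5 runs:", np.mean(scores))
```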
In this study, the default parameter settings provided by scikit-learn (sklearn) were used for the implementation of the logistic regression, naive Bayes, and decision tree classifiers. For logistic regression, sklearn uses the ‘lbfgs’ solver by default, with L2 regularization (penalty = ‘l2’) and a regularization strength parameter C = 1.0. The model also assumes a maximum of 100 iterations (max_iter = 100) for convergence. In the case of naive Bayes, the Gaussian naive Bayes (GaussianNB) variant was employed, which assumes features follow a normal distribution and uses default priors and a variance smoothing parameter (var_smoothing = 1e-9) to handle numerical stability. For decision trees, the DecisionTreeClassifier was applied with the Gini impurity (criterion = ‘gini’) as the default metric for node splitting, no restriction on tree depth (max_depth = None), and the default values for parameters such as min_samples_split = 2 and min_samples_leaf = 1.
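For transparency, the sketch below writes these scikit-learn defaults out explicitly; they correspond to the library defaults described above rather than tuned values.

```python
# The default scikit-learn configurations described above, written out
# explicitly. These are the library defaults, not tuned values.
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

models = {
    "logistic_regression": LogisticRegression(solver="lbfgs", penalty="l2",
                                              C=1.0, max_iter=100),
    "naive_bayes": GaussianNB(var_smoothing=1e-9),
    "decision_tree": DecisionTreeClassifier(criterion="gini", max_depth=None,
                                            min_samples_split=2,
                                            min_samples_leaf=1),
}
```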
The hyperparameter values for the various fairness mitigation techniques were kept identical across all datasets to maintain experimental rigor and eliminate variability arising from parameter selection. Specifically, for the learning fair representations (LFR) technique, most hyperparameters were kept at their default values, with one parameter set to an ad hoc value instead of its default.
In our study, the tuning parameter of the disparate impact remover (DIR), which controls the extent to which features are adjusted to mitigate bias, was selected based on a balance between fairness improvement and predictive performance. We experimented with a range of values (from 0.0, which applies no transformation, to 1.0, which enforces full repair) and evaluated each setting using both fairness metrics (e.g., disparate impact ratio and equal opportunity) and accuracy measures. The repair level selected through this procedure controls the degree of modification applied to the dataset to mitigate bias.
The prejudice remover technique employed a fixed regularization strength, which adjusts the tradeoff between accuracy and fairness. For all other techniques, default hyperparameters were utilized to maintain a baseline configuration.
Additionally, in the case of the calibrated equalized odds technique, we adopted the weighted approach as a cost constraint, ensuring that adjustments to decision thresholds were proportionally distributed to minimize disparities across demographic groups while preserving model performance.
This systematic approach allows for a controlled and replicable experimental framework, facilitating a robust evaluation of fairness interventions within learning analytics datasets.
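The following sketch shows how these mitigation techniques can be instantiated in AIF360; since the exact LFR, repair level, and regularization values are not reproduced here, the numbers shown are AIF360 defaults or placeholders, and the attribute and group encodings are hypothetical.

```python
# Hedged sketch of how the mitigation techniques above can be instantiated in
# AIF360. The exact LFR/DIR/regularization values from our runs are not
# reproduced here; the numbers shown are AIF360 defaults or placeholders, and
# the attribute/group encodings are hypothetical.
from aif360.algorithms.preprocessing import LFR, DisparateImpactRemover
from aif360.algorithms.inprocessing import PrejudiceRemover
from aif360.algorithms.postprocessing import CalibratedEqOddsPostprocessing

privileged_groups = [{"gender": 1}]
unprivileged_groups = [{"gender": 0}]

lfr = LFR(unprivileged_groups=unprivileged_groups,
          privileged_groups=privileged_groups,
          k=5, Ax=0.01, Ay=1.0, Az=50.0)                 # AIF360 default values
dir_remover = DisparateImpactRemover(repair_level=1.0)   # 0.0 = no repair, 1.0 = full repair
prej_remover = PrejudiceRemover(eta=1.0,                 # eta: fairness regularization strength
                                sensitive_attr="gender", class_attr="label")
cal_eq_odds = CalibratedEqOddsPostprocessing(unprivileged_groups=unprivileged_groups,
                                             privileged_groups=privileged_groups,
                                             cost_constraint="weighted")  # weighted cost constraint
```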
The experimental procedure (Figure 3) was built around the Python programming language, with the scikit-learn library employed for the development and evaluation of the machine learning models and AIF360 [31] serving as the library for assessing and mitigating algorithmic fairness concerns. AI Fairness 360 (AIF360) is an open-source Python library created by IBM Research to address the challenges of algorithmic fairness in machine learning.
3.3. Classification Metrics
Classification measures are essential for evaluating the performance of machine learning models. Accuracy is the most commonly used metric, representing the proportion of correctly classified instances out of the total instances. However, accuracy can be misleading when dealing with imbalanced datasets, where one class dominates the other. AUC (Area Under the Curve), specifically the AUC-ROC (Receiver Operating Characteristic curve), measures the model’s ability to distinguish between classes by evaluating the tradeoff between true positive and false positive rates. A higher AUC indicates better classification performance across different decision thresholds. Recall, also known as sensitivity or true positive rate, quantifies the model’s ability to correctly identify positive instances, making it particularly useful when the cost of missing positive cases is high. Precision, on the other hand, assesses the proportion of correctly predicted positive cases among all predicted positives, ensuring that the model does not produce excessive false positives.
Balancing precision and recall is crucial, and the F1-score provides a harmonic mean of the two, offering a single metric that accounts for both false positives and false negatives. This is especially useful when there is an uneven class distribution. Beyond these traditional measures, Cohen’s kappa evaluates classification performance while considering agreement by chance, making it a robust metric for assessing inter-rater reliability. Similarly, the Matthews correlation coefficient (MCC) provides a more comprehensive evaluation by considering all four elements of the confusion matrix (true positives, false positives, true negatives, and false negatives). The MCC is particularly valuable in cases of class imbalance, as it offers a balanced assessment of predictive performance across all classes.
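All of these measures are available in scikit-learn; the toy predictions below are purely illustrative and serve only to show how each metric is computed.

```python
# The classification measures above, computed with scikit-learn on
# placeholder predictions (y_true / y_score are illustrative only).
from sklearn.metrics import (accuracy_score, roc_auc_score, recall_score,
                             precision_score, f1_score, cohen_kappa_score,
                             matthews_corrcoef)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.4, 0.3, 0.1, 0.8, 0.7]   # predicted probabilities
y_pred = [int(p >= 0.5) for p in y_score]            # default 0.5 threshold

print("accuracy :", accuracy_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_score))
print("recall   :", recall_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("kappa    :", cohen_kappa_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
```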
3.4. Fairness Metrics
Although an extensive number of fairness metrics has been proposed in the literature [32,33,34], and it is widely acknowledged that no single fairness measure is universally applicable across all contexts, our research focuses on three of the most widely adopted and conceptually distinct measures: statistical parity, disparate impact, and consistency [35].
3.4.1. Statistical Parity
This metric requires that a model’s predictions be independent of a protected attribute (such as gender, race, or disability). It ensures that different demographic groups receive positive outcomes at equal rates. Mathematically, a model satisfies statistical parity if

$$P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b), \quad \forall a, b,$$

where $\hat{Y}$ is the predicted outcome and $A$ represents the protected attribute. This means that the proportion of favorable predictions should be the same across all groups, regardless of differences in other characteristics. In practice, we report the statistical parity difference, $P(\hat{Y} = 1 \mid A = \text{unprivileged}) - P(\hat{Y} = 1 \mid A = \text{privileged})$, for which a value of 0 indicates perfect parity.
3.4.2. Disparate Impact
Disparate impact originated as a legal concept but is also used as a statistical fairness measure. In contrast to statistical parity, which takes the form of a difference, disparate impact is defined as the ratio of favorable prediction rates between the unprivileged and privileged groups:

$$DI = \frac{P(\hat{Y} = 1 \mid A = \text{unprivileged})}{P(\hat{Y} = 1 \mid A = \text{privileged})}.$$

A value of 1 indicates parity, whereas values below 1 indicate that the unprivileged group receives favorable outcomes at a lower rate.
3.4.3. Consistency
Unlike the previous two measures, which belong to the group fairness category, consistency is an individual fairness metric. It evaluates whether similar individuals receive similar predictions from a model, based on the idea that individuals with similar characteristics should be treated similarly in terms of outcomes, regardless of their membership in a particular protected attribute group. It is defined as follows:

$$C = 1 - \frac{1}{n}\sum_{i=1}^{n} \left| \hat{y}_i - \frac{1}{k}\sum_{j \in kNN(x_i)} \hat{y}_j \right|,$$

where the following are defined:
$n$ is the number of individuals;
$\hat{y}_i$ is the predicted outcome for individual $i$;
$kNN(x_i)$ is the set of the $k$ nearest neighbors of individual $i$ (individuals with similar features);
$\hat{y}_j$ is the predicted outcome for each individual $j$ in the neighborhood.
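For reference, the sketch below computes all three measures with AIF360 on a small, entirely synthetic dataset; the column names and group encodings are hypothetical.

```python
# Computing the three fairness measures with AIF360 on a small synthetic
# dataset. Column names and group encodings are hypothetical; in our
# experiments the metrics are computed on datasets whose labels are the
# model predictions rather than the ground-truth labels shown here.
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

df = pd.DataFrame({"gender": [0, 0, 0, 1, 1, 1],          # protected attribute (1 = privileged)
                   "activity": [2.0, 5.0, 7.0, 4.0, 8.0, 9.0],
                   "label": [0, 1, 0, 1, 1, 1]})           # 1 = favorable outcome
bld = BinaryLabelDataset(df=df, label_names=["label"],
                         protected_attribute_names=["gender"],
                         favorable_label=1, unfavorable_label=0)

metric = BinaryLabelDatasetMetric(bld,
                                  unprivileged_groups=[{"gender": 0}],
                                  privileged_groups=[{"gender": 1}])
print("statistical parity difference:", metric.statistical_parity_difference())  # ideal: 0
print("disparate impact             :", metric.disparate_impact())               # ideal: 1
print("consistency                  :", metric.consistency())                    # ideal: 1
```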
3.4.4. Comparison of Measures and Methods
Fairness metrics such as demographic parity, equalized odds, and predictive parity each reflect different conceptualizations of fairness in machine learning. Demographic parity emphasizes outcome equality across groups, making it valuable in ensuring equal access to educational opportunities. However, it may inadvertently ignore differences in legitimate performance factors among groups. In contrast, equalized odds focuses on achieving similar error rates (e.g., false positives and false negatives) across groups, which is especially important in high-stakes decisions such as student admissions or targeted interventions. Meanwhile, predictive parity ensures consistency in predictive value (e.g., precision) across groups. Importantly, these criteria are often mutually incompatible—satisfying one may inherently violate another. Thus, selecting the appropriate fairness metric must be guided by the ethical and practical priorities of the specific educational context.
Fairness mitigation strategies, much like the metrics themselves, come with their own tradeoffs and implementation considerations. Preprocessing techniques—such as reweighting, learning fair representations (LFR), and the disparate impact remover (DIR)—are model-agnostic and relatively straightforward to implement. This makes them especially suitable for educational systems where internal model access is limited or models are proprietary. However, they may oversimplify nuanced bias structures embedded in the data. In contrast, in-processing methods like adversarial debiasing or fairness-constrained optimization often yield better fairness–performance tradeoffs but require modification of the learning algorithm, which may not be feasible in black-box environments. Post-processing techniques operate on the model outputs, offering flexibility when working with fixed models, though they raise ethical concerns about altering predictions without addressing underlying learning processes.
In our study, we operationalized fairness mitigation by first identifying underrepresented and overrepresented groups based on sensitive attributes such as gender or ethnicity. Group representation was assessed by comparing the proportion of samples belonging to each group either to the overall dataset or to a reference baseline. Reweighting was then applied to balance the influence of each group during training: samples from underrepresented groups were assigned higher weights, while those from overrepresented groups received lower weights. This technique, evaluated using fairness metrics like demographic parity and equal opportunity alongside performance measures, allowed us to minimize bias without compromising accuracy.
For example, we can apply reweighting to the student performance dataset, using gender as the sensitive attribute. By computing group-level frequencies, we rebalance the dataset to amplify the impact of underrepresented students during training. For the MOOC dataset, where the task was to predict dropout risk, we can employ learning fair representations (LFR). LFR transforms the original features into a latent space that preserves task-relevant information while obfuscating sensitive group membership. This allows the model to retain meaningful educational patterns while reducing its dependence on sensitive attributes, ultimately supporting fairer predictions. Finally, we can use the disparate impact remover (DIR) in the student admissions dataset, which involves sensitive attributes like ethnicity. DIR modifies the input features—such as test scores or GPA—by reducing their correlation with sensitive attributes, thereby promoting equitable prediction outcomes. This preprocessing step helps ensure that applicants’ chances of admission are not disproportionately influenced by group membership.
Together, these methods illustrate how fairness-aware interventions—tailored to the characteristics of each dataset and task—can effectively mitigate bias in educational prediction systems.
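As a concrete illustration of the reweighting workflow described above, the following sketch computes AIF360 instance weights on a toy training set and passes them to a standard scikit-learn classifier; the schema and values are placeholders, not the actual HOULAD data.

```python
# Concrete illustration of the reweighting workflow described above: AIF360
# instance weights are computed on a toy training set and passed to a standard
# scikit-learn classifier. The schema and values are placeholders, not the
# actual HOULAD data.
import pandas as pd
from aif360.algorithms.preprocessing import Reweighing
from aif360.datasets import BinaryLabelDataset
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"gender":   [0, 0, 0, 0, 1, 1, 1, 1],   # protected attribute
                   "activity": [1, 4, 2, 3, 6, 7, 5, 8],
                   "label":    [0, 0, 1, 0, 1, 1, 0, 1]})
train = BinaryLabelDataset(df=df, label_names=["label"],
                           protected_attribute_names=["gender"],
                           favorable_label=1, unfavorable_label=0)

rw = Reweighing(unprivileged_groups=[{"gender": 0}],
                privileged_groups=[{"gender": 1}])
train_rw = rw.fit_transform(train)           # same features, rebalanced instance weights

clf = LogisticRegression().fit(train_rw.features,
                               train_rw.labels.ravel(),
                               sample_weight=train_rw.instance_weights)
```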
5. Discussion
The experimental results highlight the varying effectiveness of preprocessing, in-processing, and post-processing fairness interventions across the three datasets. In the case of the HOULAD, preprocessing techniques, particularly learning fair representations (LFR), demonstrated the most balanced tradeoff between predictive performance and fairness. The reweighting method provided moderate improvements in fairness while maintaining accuracy, whereas disparate impact remover slightly reduced the model performance. In-processing methods, such as prejudice remover and adversarial debiasing, successfully mitigated bias but introduced minor accuracy tradeoffs. Similarly, post-processing approaches, like equality of odds and calibrated equalized odds, yielded improvements in fairness at the cost of slight accuracy reductions, reinforcing that fairness interventions must be carefully selected based on the desired balance between bias mitigation and predictive power.
For the xAPI-Educational Mining dataset, the results indicate that fairness-aware preprocessing methods provided the most promising outcomes. LFR once again stood out as the most effective, achieving high accuracy while significantly reducing the disparate impact value. Reweighting also improved fairness while maintaining competitive accuracy levels. In-processing techniques, particularly adversarial debiasing, exhibited notable fairness improvements, although this came at the cost of reduced accuracy, making it suitable for cases where fairness is prioritized. Post-processing techniques, such as calibrated equalized odds, further enhanced fairness by improving statistical parity and disparate impact values; however, the potential risk of overfitting to fairness constraints must be carefully considered. These results suggest that preprocessing techniques, particularly LFR, offer the best compromise between fairness and predictive performance for this dataset.
The OULAD experiments, which analyzed fairness concerning gender and disability as protected attributes, reinforce similar trends. Preprocessing techniques, particularly LFR, achieved near-perfect accuracy and fairness, making them the most effective strategy for bias mitigation. Reweighting offered minor improvements without significantly altering the model accuracy, while disparate impact remover had mixed results, sometimes reducing performance for specific classifiers. In-processing methods, including prejudice remover and adversarial debiasing, moderately enhanced fairness but did not entirely eliminate disparities, highlighting their limitations. The choice between these methods depends on the acceptable tradeoff between slight accuracy loss and fairness improvement.
Finally, post-processing interventions for OULAD, such as calibrated equalized odds, yielded perfect accuracy for all models but only marginally improved fairness. While these techniques are beneficial for maintaining high predictive performance, they may not be sufficient in addressing deeper fairness concerns without complementary preprocessing or in-processing strategies. The results suggest that preprocessing approaches, particularly LFR, consistently outperformed other bias mitigation strategies across datasets, providing the most reliable balance between predictive performance and fairness. These findings emphasize the need for dataset-specific fairness interventions, where preprocessing methods appear most effective for achieving equitable predictions without substantial accuracy tradeoffs.
The experimental results reveal that different classification models interact distinctively with various fairness-enhancing interventions. For instance, the decision tree classifier, which typically relies on the entirety of the training data to generate interpretable rules, showed substantial gains in both fairness and accuracy when preprocessing techniques were applied. In the HOULAD, its accuracy increased from 0.91 (without any processing) to 0.95 using the reweighting technique, and its fairness, measured via statistical parity and disparate impact, improved from 0.04 and 1.18 to −0.05 and 0.82, respectively. This suggests that preprocessing methods such as reweighting, which adjusts data distributions prior to learning, allow decision trees to make more equitable splits. Similarly, in the xAPI dataset, the decision tree performance peaked at 0.98 accuracy with learning fair representations (LFR), demonstrating that sophisticated preprocessing can align well with tree-based learners, improving both predictive quality and fairness outcomes.
In contrast, post-processing techniques exhibited mixed impacts on decision trees. While calibrated equalized odds increased the decision tree accuracy to 1.00 in both datasets, the fairness metrics (e.g., statistical parity at −0.05 and disparate impact at 0.77 for HOULAD) indicate moderate bias mitigation. However, compared to preprocessing, the improvement was less pronounced in terms of the disparate impact value. This may be due to the nature of post-processing, which modifies predictions after model training and does not influence the internal structure of the decision tree. Therefore, it does not fully leverage the interpretive capacity of tree-based models, suggesting that preprocessing might be the more effective pairing.
Additionally, in-processing techniques such as prejudice remover and adversarial debiasing were primarily applied to models like logistic regression and naive Bayes. These methods, which modify the learning algorithm itself, maintained high accuracy (0.91–0.93) and achieved respectable fairness scores. However, they are less interpretable and often harder to apply to non-probabilistic models like decision trees. Interestingly, logistic regression—a model that benefits from both data-driven and regularization-based strategies—achieved optimal fairness-accuracy tradeoffs under preprocessing and post-processing, with LFR and calibrated equalized odds achieving perfect or near-perfect metrics in both dimensions. This highlights that the model architecture plays a crucial role: simpler models (e.g., naive Bayes) benefit from fairness adjustments in a stable manner, while more complex ones (e.g., decision trees) respond better to techniques that reshape the input data distribution. These findings suggest that model selection should be considered jointly with the fairness technique to achieve both equitable and effective classification.
6. Conclusions and Future Work
This study evaluated the effectiveness of preprocessing, in-processing, and post-processing fairness mitigation techniques in educational data mining across three datasets: HOULAD, xAPI-Educational Mining, and OULAD. The findings demonstrate that fairness-aware preprocessing techniques, particularly learning fair representations (LFR), consistently achieved the best balance between predictive accuracy and fairness. In-processing methods, such as adversarial debiasing and prejudice remover, also effectively reduced bias but came with slight accuracy tradeoffs. Post-processing approaches, including calibrated equalized odds, improved fairness while maintaining high predictive performance, though they may not fully eliminate disparities without complementary interventions.
The comparative analysis highlights the importance of selecting fairness strategies based on dataset characteristics and the specific fairness–accuracy tradeoffs required for educational applications. While preprocessing methods generally outperform other approaches, the choice of bias mitigation technique should align with the ethical and practical considerations of each use case. Ensuring fairness in educational data mining is particularly crucial, as biased models can reinforce systemic inequalities and negatively impact student opportunities.
In this study, we employed the same hyperparameter values for techniques such as learning fair representations, disparate impact remover, and prejudice remover across all three datasets. For the cost constraint in calibrated equalized odds, we adopted a weighted approach that balanced the false negative ratio and false positive ratio, ensuring consistency across datasets.
We acknowledge that this constitutes a key limitation of our research, as it impacts both the performance and fairness metrics. Simply put, using different hyperparameter settings for certain techniques could potentially lead to different conclusions regarding their suitability. A valuable direction for future research would be to systematically explore the effects of varying these hyperparameter values. Such an investigation could offer deeper insights into their influence on both performance outcomes and fairness metrics, thereby enhancing our understanding of the tradeoffs involved.
We also recognize that intersectional bias remains an open and critical area for future work. Assessing and mitigating intersectional bias is particularly challenging due to the combinatorial explosion of group categories and the resulting data sparsity, which complicates both statistical analysis and fairness interventions.
Future research could also explore hybrid fairness interventions that combine multiple strategies to further optimize both fairness and accuracy. As educational institutions increasingly rely on data-driven decision making, ensuring algorithmic fairness will be essential in promoting equitable and inclusive learning environments.