1. Introduction
Enhancing student retention and success remains a fundamental goal for educational institutions, as it is critical to strengthening institutional accountability, meeting accreditation and compliance expectations, securing funding, and promoting equitable access to educational opportunities. The DFW rate (% of D and F grades and withdrawals) is a widely recognized metric for assessing student performance and retention in academic contexts (
Vyas & Reid, 2023). Observed DFW rates can highlight systemic barriers within course structures, design, and instructional methods; reveal gaps between student preparedness and course demands; or expose inequities in academic outcomes among student populations with different demographic and socioeconomic backgrounds (
Hatfield et al., 2022). Certain courses may also consistently exhibit higher DFW rates due to the inherent complexity of their content (
Vyas & Reid, 2023;
Bloemer et al., 2017,
2018). Given that observed DFW rates can signal multiple underlying issues, it is understandable that institutions place substantial weight on them.
Higher education institutions have limited resources to support student success, and imprecise targeting of courses for intervention can lead to inefficient or misguided strategies (
Bloemer et al., 2017). Traditionally, academia has focused on gateway courses, courses with larger enrollments, first- or second-year courses, and particularly those with high observed DFW rates to target interventions.
Bloemer et al. (
2017,
2018) highlighted the effectiveness of targeted interventions by analyzing the GAP—the difference between actual and projected DFW rate (% of D, F, and Withdraw) in a course. Their framework focuses on selecting courses for intervention based on high DFW rates and larger GAPs between expected and actual outcomes, optimizing the use of institutional resources for course improvement and student support.
This study builds on the work by
Bloemer et al. (
2017,
2018) and proposes an enhanced analytical prediction strategy within the GAP framework they proposed. We argue that observed DFW rates, while informative, may provide ambiguous signals for prioritizing course-level interventions because they obscure student composition, course design, instructional factors, and other latent influences. We propose a risk-adjusted baseline for expected DFWI rates (% of D and F grades, withdrawals, and incompletes) in courses, conditional on student heterogeneity and select controls. Course prioritization is based on both high failure rates and underperformance relative to expectations, with observed DFWI rates benchmarked against predicted DFWI rates based on student characteristics. This dual-metric approach ensures that courses are flagged for review based on course-level performance issues to the extent possible, rather than merely reflecting the enrollment of high-risk students. The framework enables fair and actionable prioritization, allowing institutions to review courses for underlying issues and implement tailored, course-based interventions grounded in inclusive practices, thereby ensuring effective resource allocation and support for all students. Conceptually, our GAP framework is analogous to value-added models (
McCaffrey et al., 2004) used in teacher evaluation, in that outcomes are adjusted for observable student characteristics to isolate performance relative to expectation. Similar logic is applied here at the course level solely for diagnostic purposes and support, rather than for evaluation or accountability.
A variety of student-level factors—including prior academic performance, race/ethnicity, first-generation status, and socioeconomic background—can influence course outcomes and DFWI rates (
Geiser & Santelices, 2007;
Johnson et al., 2001;
Drake, 2024). These factors may interact in complex ways, complicating simple linear predictions of student performance. Acknowledging these differences is crucial for predicting DFWI outcomes (grades of D or F, withdrawal, or incomplete) and for effectively targeting interventions in courses with deficiencies. While the observed DFWI rate is useful for screening courses, it can fall short in prioritization when student composition varies considerably. Ignoring this factor is inequitable, as it can unfairly associate poor outcomes with courses serving marginalized populations and stigmatize programs that already face considerable instructional challenges. Additionally, course-level issues (e.g., assessment design, pacing, pedagogy, sequencing) are addressed effectively through instructional support and course redesign. In contrast, student risk factors (e.g., preparation gaps, lack of motivation, first-generation status, financial stress) may require individualized, student-focused interventions. Separating course performance from student risk is useful as these mechanisms require different responses, prevent misattributing failure to course design or instruction, and frame analytics as diagnostic rather than punitive. Furthermore, the dual-metric, review-based approach allows program-level decision-making that preserves appropriate course rigor while targeting support where needed, reducing pressure on instructors to lower standards or inflate grades to manage course DFWI rates. The separation also enables course redesign efforts to address instructional or structural issues while allowing diversity and equity initiatives to focus directly on supporting vulnerable student populations through targeted personalized support. In doing so, the framework provides a non-punitive, analytical basis for further investigation, and it supports both instructional integrity and equity goals.
One key element of GAP analysis is effectively predicting DFWI outcomes using available student data. Recent advances in machine learning methods have expanded their use to improve prediction accuracy in academic research. However, most studies focus on individual student risk (
Cho et al., 2023;
Yang et al., 2020;
Lakkaraju et al., 2015) rather than assessing course-level performance (DFWI rate) or identifying courses for targeted review (
Bloemer et al., 2017,
2018). The main aim of this study is to develop a scalable predictive framework for student DFWI outcomes, average the predicted probabilities at the course level, and apply GAP analysis (
Bloemer et al., 2017,
2018) to prioritize courses for actionable review and intervention. Our study treats the GAP measure as a descriptive, risk-adjusted deviation rather than a causal estimate of course effectiveness. The analysis is predictive in nature; therefore, results should not be interpreted causally (
Shmueli, 2010;
Mullainathan & Spiess, 2017).
This study addresses five research questions:
R1. To what extent are student demographic and socioeconomic characteristics related to DFWI outcomes, and do these relationships indicate potential equity gaps?
R2. How effectively can different machine learning models predict student DFWI outcomes based on student characteristics?
R3. To what extent are course rankings derived from GAP analysis comparable between parametric and non-parametric models?
R4. Do course rankings based on observed DFWI rates diverge substantially from rankings based on risk-adjusted outcomes?
R5. Do predicted DFWI outcomes exhibit subgroup disparities in calibration and fairness, and do these disparities affect GAP rankings?
We conduct descriptive and association analyses, perform a multi-metric evaluation of machine learning models, calculate GAP, compare course rankings using Spearman’s (
Spearman, 1904) and Kendall’s (
Kendall, 1938) correlations, and validate them using fairness assessments, subgroup calibration analysis, and post-hoc robustness checks. Our findings show that GAP analysis, based on predicted DFWI outcomes, can identify courses with unexplained performance deviations, offering useful insights for prioritizing courses for further review and targeted intervention. However, predictive performance may be constrained by the scope and quality of available student data, persistent structural or historical biases, and latent factors that remain unobserved. We acknowledge the ethical and equity issues involved in using predictive analytics in education (
Baker & Hawn, 2021) and underscore the need for a thorough evaluation of predictive fairness and further validation before making decisions. Moreover, we note that the GAP framework outlined here is purely diagnostic and should be supplemented by a careful review of courses to further inform targeted intervention.
The rest of this article is organized as follows: We begin with a brief literature review. Next, we explore our dataset and preprocessing techniques. Then, we detail the association-strength metrics, classification algorithms, evaluation frameworks, and procedures used for GAP analysis and its validation. We also discuss the study’s ethical considerations, limitations, and assumptions. Following that, we present the results of our analysis. Finally, we highlight the key insights of our study and their implications for learning analytics and higher education practitioners.
2. Literature Review
Previous research has identified several key student-level predictors of DFW outcomes.
Geiser and Santelices (
2007) and
Al Hazaa et al. (
2021) found that high school GPA is a strong predictor of student outcomes.
Hatfield et al. (
2022) noted that introductory courses significantly affect student persistence, particularly by deterring minority students from pursuing STEM. Prior research has reported mixed findings regarding the role of gender in academic outcomes, with some studies identifying significant associations (
Matz et al., 2017) while others find limited or context-dependent effects (
Wang et al., 2017). Research has found that Pell Grant recipients tend to be older, attend part-time, have family commitments and job obligations, and often face financial insecurity (
Cox, 2016), all of which affect their academic persistence (
Titus, 2006) and course success (
Cox, 2016). According to
Engle and Tinto (
2008), low-income first-generation students at 4-year public institutions are three times more likely to leave than their more advantaged counterparts. In addition, studies have found that most DFW grades occur in large-enrollment introductory courses (
Twigg, 2005) and that first-year students enrolled in distance education courses may have higher DFW rates than their peers in in-person courses (
Urtel, 2008).
Recent advances in machine learning have expanded the use of predictive analytics in higher education, enabling models that can capture complex, nonlinear relationships in student academic outcomes.
Delen (
2010) reported success in predicting first-year student attrition using support vector machines, XGBoost, and elastic-net binary logistic regression, while
Dekker et al. (
2009) used logistic regression, decision trees, Bayesian classifier, random forest, and rule-based learners to predict student dropout risk.
Lakkaraju et al. (
2015) reported that ensemble methods, such as random forests, outperformed traditional methods in identifying students at risk of not graduating from high school on time.
Susheelamma and Ravikumar (
2019) employed XGBoost, and
Cho et al. (
2023) applied common machine learning models and a deep neural network model alongside data-balancing techniques to predict student dropout.
Bloemer et al. (
2018) used a logistic regression model to predict the probability that a student would receive a DFW based on prior GPA, prior DFW rate, student type (Native, Honors, Transfer, Online), and academic lifecycle stage. A random forest model using both institutional and in-class data was employed by
Yang et al. (
2020) to predict DFW outcomes in an introductory physics course.
Our study advances the literature on risk adjustment and value-added modeling in education by assessing outcomes with observable variables to compare performance across diverse populations (
McCaffrey et al., 2004;
Kane & Staiger, 2008). Prior research indicates that predictive models optimized for accuracy may not always reveal causal relationships (
Shmueli, 2010;
Mullainathan & Spiess, 2017). Additionally, findings in educational data science and algorithmic fairness highlight that risk-adjusted predictions can still reflect structural inequalities and students’ readiness, and that they need careful interpretation in decision-making (
Baker & Hawn, 2021). Research in explainable AI emphasizes the importance of interpretable and auditable models to ensure transparency and build trust (
Ribeiro et al., 2016). These issues are particularly relevant in education, where predictive models are used to guide advising and resource allocation. Consequently, the GAP measure in this study is framed as a predictive, risk-adjusted deviation to identify potential challenges in course design, structure, and instruction, without attributing causality. It ultimately serves as a screening tool for further review, with interventions informed by contextual factors.
3. Materials and Methods
Core-curriculum course data from a public university spanning the spring 2022 to summer 2024 terms were analyzed. The data obtained from the Registrar’s records included redacted information on students enrolled in various courses, by course section, their final grades, and demographic and socioeconomic characteristics.
Table 1 outlines the key variables employed in this study.
3.1. Definition of DFWI Rate
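The DFWI rate of any group is calculated as follows:
$$\text{DFWI rate (\%)} = \frac{\text{number of DFWI grades in the group}}{\text{total number of grades in the group}} \times 100$$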
3.2. Data Cleaning and Preprocessing
Two datasets were developed from the original data. The first dataset (Sample 1) was used for descriptive analysis and to assess the strength of association, while the second (Sample 2) was used for prediction and GAP analysis. All core curriculum courses with 15 or more students enrolled from spring 2022 to summer 2024 were included in Sample 1, resulting in 230 unique courses from an original dataset of 248. Final letter grades were classified into two categories: grades D and D+ (Poor), F (Fail), W (Withdraw), WF (Withdraw and Fail), INC (Incomplete), and PR (Work in Progress) were grouped as DFWI, while grades A+, A, A−, B+, B, B−, C+, C, and C− were classified as Non-DFWI. INC grades were included in the DFWI group to provide a more comprehensive assessment of course outcomes and to improve data balance, given the low incidence of DFW grades.
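The grade grouping described above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's code: the function names (`is_dfwi`, `dfwi_rate`) and the sample grades are hypothetical, and grade codes are written with an ASCII hyphen (for example, `A-`).

```python
# Grade codes follow the grouping described in the text; names and sample
# data are hypothetical and only illustrate the classification rule.
DFWI_GRADES = {"D", "D+", "F", "W", "WF", "INC", "PR"}
NON_DFWI_GRADES = {"A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-"}


def is_dfwi(grade: str) -> int:
    """Return 1 for a DFWI grade and 0 for a non-DFWI grade."""
    if grade in DFWI_GRADES:
        return 1
    if grade in NON_DFWI_GRADES:
        return 0
    raise ValueError(f"Unrecognized grade code: {grade!r}")


def dfwi_rate(grades: list[str]) -> float:
    """Percentage of grades in a group that fall in the DFWI category."""
    if not grades:
        raise ValueError("Cannot compute a rate for an empty group")
    return 100.0 * sum(is_dfwi(g) for g in grades) / len(grades)


# Made-up grades for one course section: D, W, INC, and F count as DFWI.
sample_grades = ["A", "B+", "D", "W", "C-", "INC", "A-", "F"]
rate = dfwi_rate(sample_grades)
print(f"DFWI rate: {rate:.1f}%")
```

Raising an error on unknown grade codes, instead of silently treating them as non-DFWI, is a design choice that surfaces data-quality problems early.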
A course format variable was created from course section numbers and classified as face-to-face (f2f), online, or hybrid. Next, a categorical variable was added to represent spring, summer, and fall to account for potential differences across semesters. We also established a non-resident student category. The dataset also included the student’s high school GPA and the credits attempted during the semester. The High School GPA variable had 11,992 (18.10%) missing values. Following
Van Buuren and Groothuis-Oudshoorn (
2011), all missing high school GPA values were imputed using Multiple Imputation by Chained Equations (MICE), treating GPA as the only variable with missing values and conditioning on gender, race or ethnicity, first-generation status, and Pell Grant status. Additionally, the observations with credit loads exceeding 21 were removed as a precaution to ensure data consistency.
Courses were then categorized into three groups based on their historical average DFWI rates: Low (≤15%), Medium (>15% to ≤30%), and High (>30%). The dichotomous categorical variables—DFWI Group, Pell Grant status, first-generation status, and non-resident status—were coded with values of 0 and 1. Additionally, one-hot encoding was applied to transform all non-dichotomous categorical variables into numeric variables, with one column dropped per category. This dataset, referred to as Sample 1, comprised 66,265 observations, including duplicates from students enrolled in multiple courses, and was utilized to analyze course characteristics, student academic classification, and DFWI rates. In contrast, 11,462 unique student observations from Sample 1 were used to examine student demographic and socioeconomic factors.
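As an illustration of the grouping and encoding steps in this paragraph, the sketch below assigns a course to a DFWI group using the stated cut-offs and one-hot encodes a categorical variable while dropping one reference column. The function names and sample values are hypothetical, and dropping the alphabetically first category is an assumption; the text does not say which column was dropped.

```python
def dfwi_group(rate_pct: float) -> str:
    """Map a historical DFWI rate (%) to Low (<=15), Medium (>15 to <=30), or High (>30)."""
    if rate_pct <= 15.0:
        return "Low"
    if rate_pct <= 30.0:
        return "Medium"
    return "High"


def one_hot_drop_first(values: list[str], prefix: str) -> list[dict[str, int]]:
    """One-hot encode a categorical variable, dropping the first (reference) category."""
    categories = sorted(set(values))
    kept = categories[1:]  # assumption: the alphabetically first level is the reference
    return [{f"{prefix}_{c}": int(v == c) for c in kept} for v in values]


# Made-up course rates chosen to exercise the boundaries of each group.
course_rates = {"COURSE_A": 12.0, "COURSE_B": 15.0, "COURSE_C": 30.0, "COURSE_D": 30.1}
groups = {course: dfwi_group(r) for course, r in course_rates.items()}

encoded = one_hot_drop_first(["fall", "spring", "summer", "fall"], "semester")
print(groups)
print(encoded[0], encoded[1])
```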
Some variables in Sample 1 had limited counts and were therefore aggregated into single variables to ensure statistical stability. American Indian or Alaskan Native, Unknown, Native Hawaiian, and Pacific Islander races were combined into the “other race or ethnicity” category. Students classified academically as seniors with a degree, med prep, and unclassified were grouped into the “other academic classification.” A revised Sample 1 dataset, reflecting these changes, including all student records, was created to analyze the strength of the association between DFWI outcomes and various variables across the Low, Medium, and High DFWI groups.
Data processing for prediction and GAP analysis using Sample 2 involved classifying observations from spring 2022 to fall 2023 as training data and those from spring to summer 2024 as test data. Courses with fewer than 15 students in the training set were excluded to ensure the models learn meaningful patterns. We imputed 11,793 (18%) missing high school GPA observations at the student level in Sample 2 using MICE. The imputation model was fitted on the training set only and then applied to the test set to prevent data leakage. We retained all student-level predictors in the final sample and removed course characteristics (except semester and course format) from the dataset. Other preprocessing steps followed those used for Sample 1. The final dataset comprised 149 unique courses, with multiple records per student. Courses were aligned across the training set (52,455 observations) and the test set (12,838 observations), enabling consistent out-of-sample predictions.
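The temporal split and course alignment described above could be implemented along the following lines. This is a sketch under stated assumptions: the term codes (such as `2022SP`), the record layout, and the function name `split_and_align` are hypothetical, and only the 15-student training threshold and the train/test term ranges come from the text.

```python
from collections import Counter

# Hypothetical term codes: spring 2022 to fall 2023 for training,
# spring and summer 2024 for testing.
TRAIN_TERMS = {"2022SP", "2022SU", "2022FA", "2023SP", "2023SU", "2023FA"}
TEST_TERMS = {"2024SP", "2024SU"}
MIN_TRAIN_ENROLLMENT = 15


def split_and_align(records):
    """Split enrollment records by term and keep courses present in both sets."""
    train = [r for r in records if r["term"] in TRAIN_TERMS]
    test = [r for r in records if r["term"] in TEST_TERMS]

    # Keep courses with at least 15 training records.
    train_counts = Counter(r["course"] for r in train)
    eligible = {c for c, n in train_counts.items() if n >= MIN_TRAIN_ENROLLMENT}

    # Align courses so every test course was seen during training.
    shared = eligible & {r["course"] for r in test}
    train = [r for r in train if r["course"] in shared]
    test = [r for r in test if r["course"] in shared]
    return train, test


# Made-up records: only COURSE_A has enough training rows and appears in the test terms.
records = (
    [{"course": "COURSE_A", "term": "2022FA"}] * 20
    + [{"course": "COURSE_A", "term": "2024SP"}] * 5
    + [{"course": "COURSE_B", "term": "2023SP"}] * 10
    + [{"course": "COURSE_B", "term": "2024SP"}] * 3
    + [{"course": "COURSE_C", "term": "2023FA"}] * 16
)
train_set, test_set = split_and_align(records)
print(len(train_set), len(test_set))
```

Splitting by term rather than at random mirrors how the predictions would be used in practice, with past terms informing expectations for later ones.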
A lower proportion of DFWI cases in the dataset skews the model toward predicting non-DFWI outcomes, reducing its accuracy in identifying actual DFWI cases (
Guanin-Fajardo et al., 2024;
Yang et al., 2020). Our approach establishes a baseline model to predict DFWI rates from student profiles, thereby enabling attribution of deviations to course- or instructional-level factors. Therefore, we chose to maintain the natural imbalance in the data to preserve the integrity of predicted probabilities and ensure valid GAP analysis. This principle guided our modeling and evaluation framework throughout the study.
3.3. Methods for Assessing the Strength of Association Between DFWI Outcomes and Variables
The point-biserial correlation coefficient (
Tate, 1954) is used to examine the relationship between binary DFWI outcomes and numerical variables, while Cramér’s V (
Cramér, 1946) is used to assess the strength of association between DFWI outcomes and categorical variables across the Low, Medium, and High DFWI groups. Duplicate student records due to students enrolling in multiple courses across semesters are retained, as they do not affect these analyses. Point-biserial correlation values range from −1 to +1, where positive values indicate a positive association with the DFWI outcome and negative values indicate a negative association. The magnitude indicates the strength of the association. Cramér’s V ranges from 0 to 1, where higher values denote a stronger association (e.g., 0.1 is weak, 0.3 is moderate, and 0.5 or more is strong).
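Both association measures can be computed directly from their textbook definitions. The sketch below is illustrative only: the function names and toy data are hypothetical, the point-biserial coefficient uses the population standard deviation (which makes it equal to Pearson's correlation between the binary outcome and the numeric variable), and Cramér's V is computed without a bias correction.

```python
import math
from collections import Counter


def point_biserial(binary, values):
    """Point-biserial correlation between a 0/1 outcome and a numeric variable."""
    n = len(values)
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    mean_all = sum(values) / n
    sd = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)  # population SD
    p = len(group1) / n
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    return (mean1 - mean0) / sd * math.sqrt(p * (1 - p))


def cramers_v(x, y):
    """Cramér's V from the chi-square statistic of the x-by-y contingency table."""
    n = len(x)
    joint, rows, cols = Counter(zip(x, y)), Counter(x), Counter(y)
    chi2 = 0.0
    for r, r_count in rows.items():
        for c, c_count in cols.items():
            expected = r_count * c_count / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(rows), len(cols)) - 1)))


# Toy data: lower GPA goes with DFWI = 1, so the correlation is negative.
dfwi = [1, 1, 0, 0]
gpa = [2.0, 2.5, 3.5, 4.0]
r_pb = point_biserial(dfwi, gpa)

v_perfect = cramers_v(["a", "a", "b", "b"], [1, 1, 0, 0])  # category determines outcome
v_none = cramers_v(["a", "a", "b", "b"], [1, 0, 1, 0])     # category unrelated to outcome
print(round(r_pb, 4), v_perfect, v_none)
```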
3.4. Methods for Predictive Modeling
The classification models used for analysis, along with their specifications, are discussed below.
Logistic Regression (LR) Classifier: The LR classifier is a supervised machine learning model widely used for binary classification in educational settings (
Bloemer et al., 2017,
2018). It estimates the log-odds of the outcome as a linear function of the input variables and employs a sigmoid function to classify outcomes into two classes.
Elastic Net Regularized Logistic Regression (ENLR) Classifier: The ENLR classifier is a hybrid supervised machine learning algorithm commonly used for binary classification tasks (
Zou & Hastie, 2005). It is a variant of LR that combines L1 (Lasso) and L2 (Ridge) regularization and employs a sigmoid function to predict binary outcomes. The ENLR often outperforms simple LR, Lasso, or Ridge models when dealing with numerous potentially correlated variables, some of which may be irrelevant.
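For reference, the two logistic models can be written compactly. The notation below is an assumption chosen for illustration: $p_i$ is the predicted DFWI probability for observation $i$, $x_{ij}$ are the predictors, $\beta$ the coefficients, $\lambda$ the regularization strength, and $\alpha$ the L1 ratio. It follows one common parameterization of the elastic net; software packages scale the penalty terms differently.

```latex
% Logistic regression: log-odds are linear in the predictors
\log\frac{p_i}{1-p_i} = \beta_0 + \sum_{j=1}^{J}\beta_j x_{ij},
\qquad
p_i = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{j=1}^{J}\beta_j x_{ij}\right)}}

% Elastic net: logistic loss plus a weighted mix of L1 and L2 penalties
\hat{\beta}^{\mathrm{ENLR}} = \arg\min_{\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log p_i + (1-y_i)\log(1-p_i)\right]
+ \lambda\left[\alpha\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\lVert\beta\rVert_2^2\right]
```

In this form, $\alpha = 1$ gives the Lasso penalty, $\alpha = 0$ gives the Ridge penalty, and $\lambda = 0$ recovers unpenalized LR.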
The Random Forest (RF) Classifier: The RF classifier is a scalable supervised ensemble method that combines predictions from multiple decision trees, with the final classification made through majority voting (
Breiman, 2001;
Yang et al., 2020). Each tree is built on a bootstrapped sample of the training data (bagging), which reduces variance and improves generalization. Splits are chosen to maximize information gain, computed from entropy, a measure of impurity or uncertainty in the data. At each node, the algorithm selects the feature and threshold that minimize the weighted entropy of the resulting sub-nodes, making them more homogeneous. Trees grow to full depth or until the stopping criteria are met, and the forest’s prediction aggregates the predictions of the individual trees. The ensemble can capture complex nonlinear relationships and interactions.
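The entropy-based split criterion can be illustrated with a single numeric feature. The sketch below is not the random forest itself: it shows only how one node would score candidate thresholds, and the function names and toy data are hypothetical.

```python
import math


def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting on feature <= threshold."""
    left = [y for x, y in zip(feature, labels) if x <= threshold]
    right = [y for x, y in zip(feature, labels) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted


def best_threshold(feature, labels):
    """Threshold with the highest information gain among the observed values."""
    candidates = sorted(set(feature))[:-1]  # the largest value would leave one side empty
    return max(candidates, key=lambda t: information_gain(feature, labels, t))


# Toy data: students with GPA at or below 2.8 all have a DFWI outcome.
gpa = [2.0, 2.4, 2.8, 3.2, 3.6, 4.0]
dfwi = [1, 1, 1, 0, 0, 0]
split = best_threshold(gpa, dfwi)
gain = information_gain(gpa, dfwi, split)
print(split, gain)
```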
Extreme Gradient Boosting (XGBoost) Classifier: The XGBoost classifier is a highly efficient and scalable supervised learning algorithm based on the gradient-boosting framework. The method uses decision trees as base learners and minimizes the logistic loss function. It incorporates both L1 and L2 regularization to enhance generalization and reduce overfitting, making it especially effective for high-dimensional, noisy datasets (
Susheelamma & Ravikumar, 2019). The model builds trees sequentially to minimize prediction errors and is effective in handling complex nonlinear relationships.
Artificial Neural Network (ANN) Classifier: The ANN classifier is a supervised learning model that captures complex nonlinear relationships in high-dimensional data. It consists of interconnected layers of neurons that transform input features through weighted combinations and nonlinear activation functions. These models are typically trained using backpropagation and gradient-based optimization algorithms, such as Adam or Stochastic Gradient Descent (SGD), to minimize the binary cross-entropy loss, which quantifies the discrepancy between predicted probabilities and actual binary outcomes. ANN serves as a complementary model to the parametric and non-parametric tree-based approaches used in the study and can capture intricate interactions among input features.
3.5. Methods for Model Evaluation and Identifying Courses
We evaluate model performance using predicted probabilities of a student receiving a DFWI outcome rather than binary classification, as results can be sensitive to the choice of classification threshold. The primary metric used is the Brier score (BR), which is defined as the mean squared difference between predicted probabilities and observed outcomes (
Brier, 1950), and is represented as
$$\mathrm{BR} = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - o_i\right)^2$$
Here, $p_i$ is the predicted probability of observation $i$, $o_i$ is the observed outcome (1 for DFWI, 0 otherwise), and $N$ is the number of observations. A lower Brier score indicates better probabilistic performance, reflecting closer agreement between predicted probabilities and observed outcomes. The Brier score captures both accuracy (correctness of outcomes) and calibration (correctness of probabilities).
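As a small worked example, the Brier score is the mean squared difference between predicted probabilities and 0/1 outcomes. The function name and the numbers below are hypothetical.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Made-up predictions for four students (1 = DFWI, 0 = non-DFWI).
predicted = [0.1, 0.8, 0.3, 0.6]
observed = [0, 1, 0, 1]
score = brier_score(predicted, observed)

# An uninformative model that always predicts 0.5 scores 0.25 on any data.
uninformative = brier_score([0.5] * 4, observed)
print(round(score, 3), uninformative)
```

Comparing a model's score with that of a constant prediction gives a quick sense of how much the predictors add.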
The area under the receiver operating characteristic curve (ROC–AUC) and average precision (AP), which summarizes the precision–recall curve (PRC), are threshold-independent metrics used for model selection and model evaluation (
Fawcett, 2006;
Saito & Rehmsmeier, 2015). They assess model performance across all probability thresholds, evaluating the ability to correctly rank positive and negative cases. The ROC curve illustrates the trade-off between the true positive rate (DFWI students correctly flagged) and the false positive rate (non-DFWI students incorrectly flagged as DFWI). An AUC of 0.5 corresponds to random guessing and an AUC of 1 to perfect classification, with 0.80 considered indicative of good discrimination (
Hosmer et al., 2013). The AP emphasizes precision (true positives among predicted positives), making it more suitable for imbalanced datasets (
Saito & Rehmsmeier, 2015).
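Both ranking metrics can be computed from first principles, which makes their interpretation concrete. In the sketch below, ROC-AUC is the share of (DFWI, non-DFWI) pairs in which the DFWI student receives the higher score, and AP is the mean of the precision values at the rank of each DFWI student. The function names and data are hypothetical, and the AP function assumes no tied scores.

```python
def roc_auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean precision at the rank of each positive case (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` students
    return total / hits


# Made-up scores for five students, two of whom had a DFWI outcome.
scores = [0.9, 0.7, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
ap = average_precision(scores, labels)
print(round(auc, 4), round(ap, 4))
```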
The predicted DFWI rate for a course is obtained by aggregating and averaging individual predicted probabilities for enrolled students in the test set, consistent with their interpretation as conditional expectations (
Hastie et al., 2009). The predicted rates are subsequently subtracted from the observed DFWI rates to identify courses with significant GAPs, which are then ranked in descending order. As a robustness test, Spearman’s rho (ρ) pair-wise rank correlations (
Spearman, 1904) are used to assess the degree of agreement between the rankings of course GAPs generated by various models (
Lakkaraju et al., 2015). The ρ value ranges from +1 (perfect positive association) to −1 (perfect negative association), with 0 indicating no association between the rank orderings. In addition to GAP rankings, the courses are ranked based on observed DFWI rates, and a concordance analysis using Kendall’s tau pairwise correlation (
Kendall, 1938) is conducted to assess the level of agreement among the rankings. Kendall’s tau ranges from −1 to +1, where values closer to +1 indicate strong agreement between rankings, values near 0 indicate little to no association, and values closer to −1 indicate strong disagreement. The dual-metric analysis was then applied to categorize courses into four groups: high-high, high-low, low-high, and low-low, based on their observed DFWI rates and GAP values.
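The course-level steps in this paragraph can be sketched as follows: average the predicted probabilities per course, subtract them from the observed rate to get the GAP, rank courses, assign dual-metric quadrants, and compare two sets of GAP values with rank correlations. All names and data are hypothetical. The quadrant cut-offs (`rate_cut`, `gap_cut`) are placeholders, since the text does not state the thresholds used, and the correlation functions assume no tied values.

```python
from collections import defaultdict


def course_gaps(records):
    """Aggregate (course, outcome, predicted_probability) rows to course level."""
    outcomes, probs = defaultdict(list), defaultdict(list)
    for course, outcome, prob in records:
        outcomes[course].append(outcome)
        probs[course].append(prob)
    summary = {}
    for course in outcomes:
        observed = 100.0 * sum(outcomes[course]) / len(outcomes[course])
        predicted = 100.0 * sum(probs[course]) / len(probs[course])
        summary[course] = {"observed": observed, "predicted": predicted,
                           "gap": observed - predicted}
    return summary


def quadrant(observed, gap, rate_cut, gap_cut):
    """Label a course as '<rate level>-<gap level>' using placeholder cut-offs."""
    rate_label = "high" if observed > rate_cut else "low"
    gap_label = "high" if gap > gap_cut else "low"
    return f"{rate_label}-{gap_label}"


def descending_ranks(values):
    """Rank 1 goes to the largest value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman_rho(a, b):
    """Spearman's rho from squared rank differences (no-ties formula)."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(descending_ranks(a), descending_ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n, s = len(a), 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (a[i] - a[j]) * (b[i] - b[j])
            s += (product > 0) - (product < 0)
    return s / (n * (n - 1) / 2)


# Made-up test-set rows: (course, observed DFWI outcome, predicted probability).
records = [
    ("COURSE_A", 1, 0.2), ("COURSE_A", 1, 0.3), ("COURSE_A", 0, 0.1), ("COURSE_A", 0, 0.2),
    ("COURSE_B", 1, 0.4), ("COURSE_B", 0, 0.3), ("COURSE_B", 0, 0.2), ("COURSE_B", 0, 0.3),
]
summary = course_gaps(records)
ranking = sorted(summary, key=lambda c: summary[c]["gap"], reverse=True)
labels = {c: quadrant(s["observed"], s["gap"], rate_cut=30.0, gap_cut=0.0)
          for c, s in summary.items()}

# Made-up GAP values for the same four courses under two different models.
gaps_model_a = [12.0, 5.0, -3.0, 8.0]
gaps_model_b = [10.0, 6.0, -1.0, 4.0]
rho = spearman_rho(gaps_model_a, gaps_model_b)
tau = kendall_tau(gaps_model_a, gaps_model_b)
print(ranking, labels, rho, round(tau, 4))
```

In this toy example, COURSE_A has a high observed rate and a large positive GAP, so it would be flagged first for review, while COURSE_B performs slightly better than its student composition predicts.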
All analyses were performed in Python 3.11.7 using the Anaconda distribution and packages such as scikit-learn 1.8.0, statsmodels 0.14.6, pandas 3.0.1, NumPy 2.4.4, XGBoost 3.2.0, TensorFlow 2.21.0, Keras 3.14.0, and Matplotlib 3.10.8. The author(s) used OpenAI’s ChatGPT (2024 version), Copilot (
https://www.copilotai.com/), GitHub, and Google Scholar to support code generation, debugging, conceptual clarification, and preliminary literature exploration. All ideas, interpretations, and outputs were developed and verified by the author(s). Language editing was supported using Grammarly v1.2.260.1887.
5. Results and Discussion
5.1. Descriptive Statistics
Table 2 summarizes the key patterns in student demographic and socioeconomic characteristics. DFWI rates show little variation by gender but differ substantially across racial and ethnic groups, with the highest rates among Black or African Americans (40.75%), American Indian or Alaskan Natives (26.67%), those identifying as two or more races (24.89%), and Hispanics (24.86%). Freshmen experience markedly higher DFWI rates (43.35%) than other class levels. Higher rates are also observed among first-generation (25.47%), Pell-eligible (31.09%), and resident students (21.49%).
Table 3 summarizes key patterns in the courses. Most courses fall in the medium DFWI rate group, with fewer in the low group and the smallest share (17.50%) in the high DFWI rate category. DFWI rates are 40.73% in the high DFWI group, followed by the medium (21.51%) and low groups (10.57%). Face-to-face and hybrid formats show higher DFWI rates than online courses, and DFWI rates are highest in fall, followed by spring and summer.
5.2. Assessing the Strength of Association Between DFWI Outcomes and Variables
Table 4 presents the results of point-biserial correlations. High School GPA shows a consistently moderate-to-strong negative correlation with DFWI outcomes, strongest in high-DFWI courses ($r_{pb}$ = −0.36) and weakest in low-DFWI courses ($r_{pb}$ = −0.23), and the correlation is statistically significant at the 1% level. Credits attempted are weakly negatively correlated with DFWI outcomes across all groups ($r_{pb}$ = −0.05 to −0.07), indicating a modest but significant association. The findings suggest a strong association between student preparedness and DFWI outcomes in more challenging courses.
Cramér’s V values and χ² results in Table 5 indicate a statistically significant moderate association between race or ethnicity and DFWI outcomes, with a Cramér’s V exceeding 0.30 in high-DFWI courses. Racial differences become more pronounced in higher-risk courses, indicating uneven exposure across different levels of course difficulty. Cramér’s V for academic classification, ranging from 0.26 to 0.29 across groups, and for Pell eligibility status, ranging from 0.13 to 0.23, indicated a similar pattern. First-generation status showed a moderate association, while gender and residence status displayed weak associations. The course attributes had low Cramér’s V values, suggesting a limited association. The findings suggest structural differences in exposure to high-DFWI course environments across some categories.
Overall, prior academic performance, race or ethnicity, student academic level, first-generation status, and Pell eligibility appear to be correlated with DFWI risk, with effects increasing in more difficult courses. While these results point towards equity concerns, they do not indicate causality.
5.3. Predictive Analysis Results: Model Evaluation
We used multiple models to predict the likelihood that students would receive a DFWI outcome. The dependent variable was the binary DFWI outcome, with independent variables including high school GPA, credits attempted, and categorical factors such as academic classification, gender, race or ethnicity, Pell eligibility, first-generation status, semester, and course format. Semester served as a control to account for any seasonal performance variations, while the course format addressed differences in instructional modality that could affect learning conditions. Educational data often displays nonlinear relationships and multicollinearity, which traditional linear models may miss. To address this, we utilized non-parametric models that can capture complex patterns with fewer assumptions. Standard scaling was applied to the high school GPA and credits attempted in the training set using StandardScaler, with the same scaling applied to the test set to avoid data leakage. Tree-based models were not feature-scaled as they are invariant to monotonic transformations. To handle class imbalance, we tuned model hyperparameters with RandomizedSearchCV and stratified 5-fold cross-validation (
Cheng et al., 2025). Model performance was evaluated using continuous predicted probabilities and threshold-independent metrics, including BR, ROC-AUC, and AP. This approach suits probabilistic risk prediction, where the goal is accurate risk estimates rather than binary classification. The random_state parameter was set to 42 for reproducibility, and the models with the optimal hyperparameters were then applied to the test set.
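The leakage-avoidance step for scaling can be illustrated without any library: the mean and standard deviation are estimated on the training data only and then reused unchanged on the test data. The sketch below mirrors what StandardScaler computes (it also uses the population standard deviation), but the function names and values are hypothetical.

```python
import statistics


def fit_scaler(train_values):
    """Estimate scaling parameters from the training data only."""
    mean = statistics.fmean(train_values)
    sd = statistics.pstdev(train_values)  # population SD, as StandardScaler uses
    return mean, sd


def transform(values, mean, sd):
    """Apply previously fitted parameters; nothing is re-estimated here."""
    return [(v - mean) / sd for v in values]


# Made-up high school GPA values.
train_gpa = [2.0, 3.0, 4.0]
test_gpa = [3.0, 5.0]

mean, sd = fit_scaler(train_gpa)
train_scaled = transform(train_gpa, mean, sd)
test_scaled = transform(test_gpa, mean, sd)  # uses training mean and SD
print(train_scaled, test_scaled)
```

Refitting the scaler on the test set would let test-period information shape the features, which is the leakage the text guards against.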
We began the predictive analysis by fitting the LR model, which served as the baseline predictive model. The ENLR model employed regularization (shrinkage to prevent overfitting) to manage model complexity. A total of 25 random combinations from a predefined hyperparameter grid were evaluated, with the regularization strength (λ) and the L1 ratio (α) tested across a range of values. The optimal parameters included a regularization strength (λ) of 10, indicating moderate regularization, and an L1 ratio (α) of 0.1, indicating that 10% of the regularization came from the Lasso (L1) component and the remaining 90% from the Ridge (L2) component. A theoretically valid set of two-way interactions was tested for ENLR but omitted because it did not improve model performance.
For the RF model, bootstrap sampling was set to its default value, allowing each tree in the forest to be trained on a random subset of the data sampled with replacement. A total of 50 random combinations were evaluated from a predefined hyperparameter grid. The identified optimal hyperparameters are 500 trees with a maximum depth of 20. The minimum number of samples required to split an internal node was 10, allowing moderate subgroup formation; the minimum number of samples needed to be at a leaf node was 1, and the number of features to consider when searching for the best split was the log2 of the total number of features.
The XGBoost model was evaluated on a predefined hyperparameter grid, with 10 randomly selected combinations using logistic loss as the internal evaluation metric. The optimal configuration included 300 trees with a maximum depth of 7, to prevent overfitting. The learning rate, the rate at which the model updates, was set to 0.2, and 80% of the data was used for each tree (subsample ratio of 0.8). All features were used to build each tree. The minimum loss reduction required to make a split (γ) was set to 0, allowing more flexible tree growth. The L2 regularization term (λ) on weights was set to 1.5, limiting large weights, and helping to reduce overfitting and improve generalization.
The ANN was implemented in Keras as a feed-forward neural network (
Figure 1), with an input layer containing all predictor features, followed by two hidden layers using Rectified Linear Unit (ReLU) activation, and a sigmoid output layer producing continuous probabilities for the binary outcome. The model was trained using binary cross-entropy loss and the Adam optimizer. The second hidden layer was fixed at 32 units and was not tuned. Hyperparameter tuning identified the optimal configuration as 128 hidden units in the first layer, a learning rate of 0.0005, a dropout rate of 0.3 for regularization, 50 training epochs, and a batch size of 32.
The results of the model comparisons are summarized in
Table 6. The ANN model achieved the best overall performance across all metrics (BR = 0.13, AUC = 0.76, AP = 0.42), although improvements over LR and ENLR were marginal. Linear models performed competitively, indicating that the relationship between predictors and the outcome is largely linear or well-approximated by additive effects. Tree-based models (RF and XGB) showed slightly lower performance across all metrics. For brevity, only the ANN model results are discussed in detail. The Brier score indicates that the model provides well-calibrated estimates of student risk, reflecting baseline probabilities rather than exact outcomes. The ANN model achieved an ROC–AUC of 0.76, indicating good ability to rank students by their probability of a DFWI outcome. With an Average Precision (AP) score of 0.42, the model demonstrates moderate effectiveness in prioritizing students more likely to experience DFWI, which is typical for imbalanced datasets. While ROC–AUC assesses ranking for all students, AP focuses on the positive class and is more sensitive to false positives.
Movahedi et al. (
2021) report an AP of 0.43, similar to the ANN model. Ultimately, the model aims to establish baseline risk expectations from student characteristics, recognizing the need for GAP analysis to account for course- or instructional-level effects. Overall, the models are adequate for this purpose.
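The three threshold-independent metrics can be computed directly from predicted probabilities. The sketch below uses synthetic risks and outcomes (an assumption, not the study's data) and adds two reference points that help interpret the values: a constant prediction equal to the prevalence p has a Brier score of p(1 − p), and a random ranking has an expected AP close to the prevalence.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             roc_auc_score)

# Synthetic risks and outcomes for an imbalanced binary target.
rng = np.random.default_rng(42)
n = 2000
true_risk = rng.beta(2, 8, size=n)            # mean risk of about 0.2
y = rng.binomial(1, true_risk)
y_prob = np.clip(true_risk + rng.normal(0, 0.05, n), 0.001, 0.999)

prevalence = y.mean()
metrics = {
    "brier": brier_score_loss(y, y_prob),
    # Reference: always predicting the prevalence gives p * (1 - p).
    "brier_baseline": brier_score_loss(y, np.full(n, prevalence)),
    "roc_auc": roc_auc_score(y, y_prob),
    "average_precision": average_precision_score(y, y_prob),
    # Reference: a random ranking has expected AP near the prevalence.
    "ap_baseline": prevalence,
}
```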
The ROC–AUC and PRC for the ANN model are presented in
Figure 2 and
Figure 3, respectively.
5.4. Predictive Analysis and GAP Results: Ranking and Identifying Courses
Predicted and actual DFWI rates were computed for each course in the test set using the best-performing ANN model. GAP values were then computed as the difference between actual and predicted course-level DFWI rates and used to rank courses. The results of the pair-wise Spearman’s rank correlation for the GAP course rankings of models are presented in
Table 7. Spearman’s rho (ρ = 1) indicates a perfect positive correlation between the course rankings derived from the LR and ENLR models. Regularization therefore did not change the ordering of courses, suggesting that the baseline LR model was already stable with minimal overfitting. The non-parametric models (RF, XGBoost, ANN) also show statistically significant ρ values, ranging from 0.96 to 0.98, indicating good alignment among them. The lowest correlation, 0.96, was observed between the ANN and XGBoost models. Overall, the ANN-based rankings agree closely with those of the other models.
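A minimal sketch of the course-level GAP computation and the rank comparison between two models follows. It assumes that a course's predicted DFWI rate is the mean of its students' predicted probabilities; the toy records and the column names (`p_model1`, `p_model2`) are hypothetical.

```python
import pandas as pd

# Toy student-level records for three courses (hypothetical values).
students = pd.DataFrame({
    "course":   ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "dfwi":     [1, 0, 1, 0,  0, 0, 1, 0,  0, 0, 0, 0],
    "p_model1": [0.30, 0.20, 0.40, 0.30,  0.20, 0.30, 0.30, 0.20,
                 0.10, 0.10, 0.20, 0.20],
    "p_model2": [0.35, 0.25, 0.35, 0.25,  0.30, 0.30, 0.30, 0.30,
                 0.10, 0.20, 0.10, 0.20],
})

# Aggregate to course level: observed rate and mean predicted risk.
by_course = students.groupby("course").agg(
    actual=("dfwi", "mean"),
    pred1=("p_model1", "mean"),
    pred2=("p_model2", "mean"),
)

# GAP = actual minus predicted course-level DFWI rate.
by_course["gap1"] = by_course["actual"] - by_course["pred1"]
by_course["gap2"] = by_course["actual"] - by_course["pred2"]

# Rank courses so that the largest GAP is rank 1.
rank1 = by_course["gap1"].rank(ascending=False)
rank2 = by_course["gap2"].rank(ascending=False)

# Spearman's rho is the Pearson correlation of the two rank vectors.
rho = rank1.corr(rank2)
```

In this toy example both models order the courses identically, so `rho` equals 1, mirroring the LR versus ENLR result reported above.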
In addition to ranking based on GAP, the courses were also ranked by observed DFWI rates. The course rankings based on observed DFWI and GAP (ANN model) are moderately correlated (Spearman’s Rho = 0.84, Kendall Tau = 0.65) and statistically significant for the full sample. However, this agreement weakens substantially for courses with >30% observed DFWI (Spearman’s Rho = 0.53, Kendall Tau = 0.38), indicating that the two ranking methods diverge precisely where prioritization matters most. Moreover, only 20% of the top 10 highest-risk courses overlap between the two rankings. Rank shifts of 0–19 places between the observed DFWI rate and GAP in these courses indicate that unadjusted outcome-based rankings may differ substantially from model-based GAP rankings, thereby motivating the use of predictive adjustment to achieve more stable prioritization.
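The overlap and rank-shift comparisons described above can be illustrated with a small pure-Python sketch. The course labels and values are invented for illustration, and the helper names `rank_positions` and `top_k_overlap` are hypothetical.

```python
def rank_positions(scores):
    """Map each course to its rank (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {course: pos for pos, course in enumerate(ordered, start=1)}


def top_k_overlap(scores_a, scores_b, k):
    """Share of the top-k courses that appear in both rankings."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b) / k


# Invented example: observed DFWI rates and GAP values for five courses.
observed = {"C1": 0.45, "C2": 0.40, "C3": 0.35, "C4": 0.32, "C5": 0.20}
gap = {"C1": 0.02, "C2": 0.10, "C3": -0.01, "C4": 0.08, "C5": 0.05}

obs_rank = rank_positions(observed)
gap_rank = rank_positions(gap)

# Absolute number of places each course moves between the two rankings.
shifts = {c: abs(obs_rank[c] - gap_rank[c]) for c in observed}
overlap_top2 = top_k_overlap(observed, gap, k=2)
```

Here the course with the highest observed DFWI rate (`C1`) drops three places under GAP ranking, and only one of the top two courses is shared between the rankings.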
When analyzing discrepancies between predicted and actual DFWI rates for each course, it is imperative to consider the implications of both large and small (or negative) GAPs. Larger GAPs, where the actual DFWI rate surpasses predictions, may indicate underlying challenges such as misalignment between course content and student preparedness, issues with instructional or course design, or structural factors affecting performance. Conversely, low or negative GAPs, in which actual DFWI rates fall below expectations, may indicate that student performance exceeds predictions. Such desirable outcomes could result from effective teaching strategies, robust support systems, course attributes, or other latent factors that facilitate student success. Reviewing courses further to understand these aspects can yield valuable insights into practices that foster positive student outcomes.
Considering both observed DFWI rates and GAP metrics together distinguishes courses with widespread student difficulty from those that underperform relative to context, enabling fair identification of courses for further review or intervention. A high observed DFWI rate, combined with a high GAP value, indicates the need for an urgent course-level review or intervention. In contrast, a high DFWI rate with a low GAP suggests that the course may require student-oriented support. A moderate-to-low DFWI rate with a high GAP indicates courses where outcomes appear positive but do not meet expectations for enrolled students, suggesting the need for monitoring or further review. Finally, a low DFWI rate with a low GAP indicates courses that pose minimal risk. This systematic approach highlights opportunities for more in-depth investigation of course design and instructional methodologies. In short, GAP analysis is a viable approach for efficiently identifying high-risk courses, reviewing them, and recommending targeted interventions.
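The four-way reading of the dual metric can be expressed as a simple decision rule. The cutoffs below are assumptions chosen for illustration only: 0.30 echoes the >30% DFWI subset discussed earlier, while the GAP cutoff of 0.05 is hypothetical. An institution would need to set both from its own data.

```python
def classify_course(dfwi_rate, gap, dfwi_cut=0.30, gap_cut=0.05):
    """Assign a course to one of the four dual-metric categories.

    dfwi_cut and gap_cut are illustrative thresholds, not values
    taken from the study.
    """
    high_dfwi = dfwi_rate > dfwi_cut
    high_gap = gap > gap_cut
    if high_dfwi and high_gap:
        return "urgent course-level review"
    if high_dfwi:
        return "student-oriented support"
    if high_gap:
        return "monitor or review further"
    return "minimal risk"


labels = {
    "high DFWI, high GAP": classify_course(0.42, 0.12),
    "high DFWI, low GAP": classify_course(0.38, -0.02),
    "low DFWI, high GAP": classify_course(0.18, 0.09),
    "low DFWI, low GAP": classify_course(0.10, 0.00),
}
```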
Our findings based on predictive analysis suggested that model choice has only a minimal impact on GAP-based course rankings. If interpretability is prioritized, LR or ENLR models may be more suitable. If performance is the primary concern, the ANN model may be preferred, although rankings may improve only marginally. Nonetheless, these incremental improvements can be significant in situations where complex factors influence predictions and impact GAPs.
5.5. Predictive Fairness, Probability Calibration, and GAP Validation
We computed BR, AUC, and AP for subgroups to compare predicted risks with actual DFWI outcomes across gender, race/ethnicity, Pell eligibility, first-generation status, non-resident status, and academic classification for the best-performing ANN model. Results (
Appendix A.1) reveal consistent predictive rankings (similar AUCs) but differing calibration, i.e., higher-risk groups (e.g., Pell-eligible, first-generation, Black or African American students, Freshmen) show greater prediction error (higher BRs), while lower-risk groups (e.g., White, non-Pell students) display better calibration. Calibration plots (
Appendix A.2) were created for each subgroup by binning predicted probabilities into 10 bins to compare mean predicted risks with observed DFWI outcomes. The ANN model demonstrates reasonable calibration at the bin level, with most errors within ±0.05 and the majority within ±0.10. Deviations partly reflect sparse observations within probability bins for certain subgroups (e.g., gender not reported, Asian, two or more race/ethnicity, other race/ethnicity, nonresident, not Pell-eligible, seniors, and freshmen). Higher variation was observed, particularly in the high-risk tail, where student population is sparse for many subgroups. Excluding the fall semester from the test set may also have caused a slight distributional shift due to variations in DFWI prevalence.
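A bin-level calibration table of the kind described above can be built as follows. This sketch assumes ten equal-width bins on [0, 1]; the text does not state whether equal-width or quantile bins were used, and the function name `calibration_table` is hypothetical. To obtain subgroup tables, the same function would be applied to each subgroup's rows separately.

```python
import numpy as np


def calibration_table(y_true, y_prob, n_bins=10):
    """Compare mean predicted risk with the observed rate in each bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0..n_bins-1; p = 1.0 lands in the last bin.
    idx = np.digitize(y_prob, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue  # skip empty bins (common in the sparse high-risk tail)
        mean_pred = float(y_prob[mask].mean())
        obs_rate = float(y_true[mask].mean())
        rows.append({
            "bin": b,
            "n": int(mask.sum()),
            "mean_pred": mean_pred,
            "obs_rate": obs_rate,
            # Positive error means risk is overpredicted in this bin.
            "error": mean_pred - obs_rate,
        })
    return rows


table = calibration_table([0, 0, 0, 1, 1], [0.05, 0.05, 0.15, 0.15, 0.95])
```

Reporting the per-bin count `n` alongside the error makes it easier to see when a large deviation rests on only a handful of students.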
While the model effectively ranks risk levels, calibration analyses indicate some non-uniformity in error distribution across subgroups and modest overprediction of risk. Post-hoc global beta calibration (
Kull et al., 2017) served as a robustness check, marginally improving probability calibration while resulting in minimal changes to unadjusted GAP course rankings. The top 10 courses showed nearly identical rankings, with differences ranging from 0 to 3. Overall, GAP rankings remained strongly correlated (Kendall’s Tau = 0.95), even for courses with DFWI rates >30%. This indicates that course prioritization is generally reliable in our data, even with moderate miscalibration. Nonetheless, results indicate that practitioners should use flexible recalibration as a cross-check before making decisions and interpret course rankings cautiously. GAP estimates and rankings should be used as risk-adjusted indicators rather than as definitive causal measures of course quality.
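One common way to fit a beta calibration map is to run a logistic regression on the two features ln(p) and −ln(1 − p) of the raw predicted probability p. The sketch below follows that recipe on synthetic, deliberately overpredicted scores. It is not the study's implementation: the data, the held-out split, and the use of a very weak penalty (`C=1e6`) to approximate an unpenalized fit are all assumptions, and the non-negativity constraint on the slope coefficients is omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def _beta_features(p, eps=1e-6):
    """Features ln(p) and -ln(1 - p) used by the beta calibration map."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return np.column_stack([np.log(p), -np.log(1 - p)])


def fit_beta_calibrator(p_calib, y_calib):
    """Fit the calibration map on a held-out calibration sample."""
    lr = LogisticRegression(C=1e6, max_iter=1000)  # effectively unpenalized
    lr.fit(_beta_features(p_calib), y_calib)
    return lr


def apply_beta_calibrator(calibrator, p):
    """Return recalibrated probabilities for raw scores p."""
    return calibrator.predict_proba(_beta_features(p))[:, 1]


# Synthetic example: raw scores systematically overpredict the true risk.
rng = np.random.default_rng(42)
true_risk = rng.uniform(0.05, 0.60, size=4000)
y = rng.binomial(1, true_risk)
raw = true_risk ** 0.7                       # inflated risk estimates

calibrator = fit_beta_calibrator(raw[:2000], y[:2000])
calibrated = apply_beta_calibrator(calibrator, raw[2000:])
y_eval = y[2000:]
raw_eval = raw[2000:]
```

After recalibration, course-level GAP values can be recomputed from the calibrated probabilities and their ranking compared with the unadjusted ranking as a robustness check.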
6. Conclusions and Implications
This study examined DFWI trends in core-curriculum courses by integrating student characteristics and applying predictive modeling within the framework of
Bloemer et al. (
2017,
2018). Higher DFWI rates were observed in lower-division courses, f2f formats, and in fall semesters. Equity gaps were evident across race/ethnicity, Pell eligibility status, and first-generation status, with strong correlations between DFWI outcomes and student factors, including high school GPA and academic classification. Predictive models indicated that student-level factors alone do not fully explain DFWI risk and that course-level and instructional factors also matter. The ANN model performed only marginally better than parametric models, reflecting the moderate predictive signal in the available data.
We found evidence that course prioritization can change significantly when using GAP-adjusted DFWI rates rather than observed DFWI rates, suggesting that risk-adjusted measures may be more informative for identifying courses that need further review and interventions in design, instruction, and structure. While the GAP measure highlights courses for review, its accuracy may be limited by the available student characteristics and by the structure and quality of the data, and it may partially reflect unobserved student heterogeneity. In this context, the dual-metric review-based approach enhances transparency and interpretability, providing additional insights. A high DFWI rate with a large GAP may indicate courses that are potentially underperforming in course design, instruction, structure, and other factors. In contrast, a high DFWI rate with a low GAP suggests that support needs to be oriented towards students. A low DFWI rate with a high GAP suggests that courses are performing below expectations relative to their student composition, warranting closer monitoring. A low DFWI rate with a low GAP, on the other hand, indicates effective practices that may be scaled to other courses.
Our work demonstrates that integrating observed DFWI rates with the risk-adjusted GAP can provide a more informative diagnostic framework for identifying courses warranting further review than observed DFWI rates alone. Because student outcomes are influenced by both student- and course-level factors, models based primarily on student characteristics capture only part of the variation in student performance. While the current predictive performance is sufficient to support the proposed diagnostic framework, improved prediction could further strengthen its precision and practical utility. Future work may explore more advanced calibration and residual estimation approaches, including hierarchical (multilevel) student outcome models that account for clustering within courses and Bayesian or empirical Bayes shrinkage methods that improve the stability of aggregated course-level estimates. Although GAP analysis provides useful diagnostic insights, we emphasize the importance of a thorough evaluation of predictive fairness and additional validation, including robustness checks, before making decisions. The GAP framework discussed here is intended strictly as a diagnostic tool and should be followed by careful course review to guide targeted interventions.
Much of the learning analytics literature to date has concentrated on identifying at-risk students who require academic support, but it often overlooks the wider instructional contexts that affect success. GAP patterns can inform institutional planning by guiding resource allocation across departments, including staffing and support services. By separating courses that consistently carry high risk from those where elevated risk reflects the composition of enrolled students, GAP estimates provide actionable insights for curriculum review and the targeted deployment of instructional support. While customization might be necessary, our methodology can be adapted to other institutions with multi-semester academic and demographic data. As universities invest in learning analytics, connecting predictive modeling with course evaluations can lead to meaningful improvements in teaching and learning.
Importantly, this approach is not at odds with broader inclusion and access objectives. Course identification is used solely for diagnostic purposes, and any resulting interventions, after careful review of courses, should be grounded in inclusive practices that benefit all students. Explicitly modeling student heterogeneity can enhance equity efforts by enabling complementary, targeted supports, such as proactive advising, tutoring referrals, and adaptive academic assistance for at-risk students, rather than relying solely on course redesign. Proper model calibration and subgroup validation not only assess overall predictive fairness but also uncover heterogeneous risk patterns that aggregate metrics may obscure. For instance, examining the high-risk tail (the top percentile of predicted risk) helps identify students who may benefit from individualized support. Calibration analyses can further reveal whether certain demographic, socioeconomic, or academic subgroups experience disproportionately high failure rates relative to their predicted risk levels. Leveraging these insights, universities can enhance student support, proactively address equity gaps, and strengthen student success and retention, ensuring that interventions are fair, evidence-based, and cost-effective.
In summary, despite its limitations, the GAP analysis offers actionable insights to improve course design, instruction, and support strategies, fostering a more equitable and responsive educational environment for everyone. By linking predictive modeling to course-level GAP analysis, this study illustrates how learning analytics can support data-informed instructional improvement and institutional decision-making, thereby enhancing student success.