Identifying Courses for Targeted Review Using GAP Analysis and Machine Learning

Joseph, Kishore; Calvert, Wesley C.; Keys, Oliver; McCrocklin, Shannon M.

doi:10.3390/educsci16050806

¹ School of Agricultural Sciences, Southern Illinois University, Carbondale, IL 62901, USA

² School of Mathematical and Statistical Sciences, Southern Illinois University, Carbondale, IL 62901, USA

³ School of Aviation and Automotive Technology, Southern Illinois University, Carbondale, IL 62901, USA

⁴ School of Languages and Linguistics, Southern Illinois University, Carbondale, IL 62901, USA

* Author to whom correspondence should be addressed.

Educ. Sci. 2026, 16(5), 806; https://doi.org/10.3390/educsci16050806 (registering DOI)

Submission received: 10 April 2026 / Revised: 12 May 2026 / Accepted: 13 May 2026 / Published: 20 May 2026

(This article belongs to the Section Higher Education)

Abstract

We examined the limitations of observed course DFWI rates (% of D and F grades, withdrawals, and incompletes) as evaluation metrics, which obscure student characteristics, course design, instruction, structure, and latent factors, posing challenges in identifying courses that need improvement. An artificial neural network (ANN) was trained using student data to model risk, accounting for variations in student characteristics. The model’s predictions on test data were averaged at the course level, producing expected DFWI rates based on student composition. Courses with high observed DFWI rates and large deviations between observed and predicted DFWI rates (the GAP) were ranked and prioritized for review, as they may reflect aspects of course design, structure, or instructional practices warranting further qualitative evaluation. Our predictions are non-causal, and modeling calibration varies across subgroups; therefore, the original GAP rankings, robust to a post-hoc calibration check, are presented as risk-adjusted indicators for prioritizing courses for further review rather than as definitive causal measures of course quality. Rankings based on observed DFWI rates differ substantially from risk-adjusted GAP rankings, indicating that relying on observed DFWI rates alone may misidentify high-risk courses. Our methodology can assist educators and administrators in making fair resource allocation decisions and improving student outcomes.

Keywords: DFWI; machine learning; GAP; equity; intervention; student success

1. Introduction

Enhancing student retention and success remains a fundamental goal for educational institutions, as it is critical to strengthening institutional accountability, meeting accreditation and compliance expectations, securing funding, and promoting equitable access to educational opportunities. The DFW rate (% of D and F grades and withdrawals) is a widely recognized metric for assessing student performance and retention in academic contexts (Vyas & Reid, 2023). Observed DFW rates can highlight systemic barriers within course structures, design, and instructional methods; reveal gaps between student preparedness and course demands; or expose inequities in academic outcomes among student populations with different demographic and socioeconomic backgrounds (Hatfield et al., 2022). Certain courses may also consistently exhibit higher DFW rates due to the inherent complexity of their content (Vyas & Reid, 2023; Bloemer et al., 2017, 2018). Given that observed DFW rates can signal multiple underlying issues, it is understandable that institutions place substantial weight on them.

Higher education institutions have limited resources to support student success, and imprecise targeting of courses for intervention can lead to inefficient or misguided strategies (Bloemer et al., 2017). Traditionally, academia has focused on gateway courses, courses with larger enrollments, first- or second-year courses, and particularly those with high observed DFW rates to target interventions. Bloemer et al. (2017, 2018) highlighted the effectiveness of targeted interventions by analyzing the GAP—the difference between actual and projected DFW rate (% of D, F, and Withdraw) in a course. Their framework focuses on selecting courses for intervention based on high DFW rates and larger GAPs between expected and actual outcomes, optimizing the use of institutional resources for course improvement and student support.

This study builds on the work by Bloemer et al. (2017, 2018) and proposes an enhanced analytical prediction strategy within the GAP framework they proposed. We argue that observed DFW rates, while informative, may provide ambiguous signals for prioritizing course-level interventions because they obscure student composition, course design, instructional factors, and other latent influences. We propose a risk-adjusted baseline for expected DFWI rates (% of D and F grades, withdrawals, and incompletes) in courses, conditional on student heterogeneity and select controls. Course prioritization is based on both high failure rates and underperformance relative to expectations, with observed DFWI rates benchmarked against predicted DFWI rates based on student characteristics. This dual-metric approach ensures that courses are flagged for review based on course-level performance issues to the extent possible, rather than merely reflecting the enrollment of high-risk students. The framework enables fair and actionable prioritization, allowing institutions to review courses for underlying issues and implement tailored, course-based interventions grounded in inclusive practices, thereby ensuring effective resource allocation and support for all students. Conceptually, our GAP framework is analogous to value-added models (McCaffrey et al., 2004) used in teacher evaluation, in that outcomes are adjusted for observable student characteristics to isolate performance relative to expectation. Similar logic is applied here at the course level solely for diagnostic purposes and support, rather than for evaluation or accountability.

A variety of student-level factors—including prior academic performance, race/ethnicity, first-generation status, and socioeconomic background—can influence course outcomes and DFWI rates (Geiser & Santelices, 2007; Johnson et al., 2001; Drake, 2024). These factors may interact in complex ways, complicating simple linear predictions of student performance. Acknowledging these differences is crucial for predicting DFWI outcomes (grades of D or F, withdrawal, or incomplete) and for effectively targeting interventions in courses with deficiencies. While the observed DFWI rate is useful for screening courses, it can fall short in prioritization when student composition varies considerably. Ignoring this factor is inequitable, as it can unfairly associate poor outcomes with courses serving marginalized populations and stigmatize programs that already face considerable instructional challenges. Additionally, course-level issues (e.g., assessment design, pacing, pedagogy, sequencing) are addressed effectively through instructional support and course redesign. In contrast, student risk factors (e.g., preparation gaps, lack of motivation, first-generation status, financial stress) may require individualized, student-focused interventions. Separating course performance from student risk is useful as these mechanisms require different responses, prevent misattributing failure to course design or instruction, and frame analytics as diagnostic rather than punitive. Furthermore, the dual-metric, review-based approach allows program-level decision-making that preserves appropriate course rigor while targeting support where needed, reducing pressure on instructors to lower standards or inflate grades to manage course DFWI rates. The separation also enables course redesign efforts to address instructional or structural issues while allowing diversity and equity initiatives to focus directly on supporting vulnerable student populations through targeted personalized support. 
In doing so, the framework provides a non-punitive, analytical basis for further investigation, and it supports both instructional integrity and equity goals.

One key element of GAP analysis is effectively predicting DFWI outcomes using available student data. Recent advances in machine learning methods have expanded their use to improve prediction accuracy in academic research. However, most studies focus on individual student risk (Cho et al., 2023; Yang et al., 2020; Lakkaraju et al., 2015) rather than assessing course-level performance (DFWI rate) or identifying courses for targeted review (Bloemer et al., 2017, 2018). The main aim of this study is to develop a scalable predictive framework for student DFWI outcomes, average the predicted probabilities at the course level, and apply GAP analysis (Bloemer et al., 2017, 2018) to prioritize courses for actionable review and intervention. Our study treats the GAP measure as a descriptive, risk-adjusted deviation rather than a causal estimate of course effectiveness. The analysis is predictive in nature; therefore, results should not be interpreted causally (Shmueli, 2010; Mullainathan & Spiess, 2017).

This research examines five pivotal inquiries central to the study’s objectives:

R1. To what extent are student demographic and socioeconomic characteristics related to DFWI outcomes, and do these relationships indicate potential equity gaps?
R2. How effectively can different machine learning models predict student DFWI outcomes based on student characteristics?
R3. To what extent are course rankings derived from GAP analysis comparable between parametric and non-parametric models?
R4. Do course rankings based on observed DFWI rates diverge substantially from rankings based on risk-adjusted outcomes?
R5. Do predicted DFWI outcomes exhibit subgroup disparities in calibration and fairness, and do these disparities affect GAP rankings?

We conduct descriptive and association analyses, perform a multi-metric evaluation of machine learning models, calculate GAP, compare course rankings using Spearman’s (Spearman, 1904) and Kendall’s (Kendall, 1938) correlations, and validate them using fairness assessments, subgroup calibration analysis, and post-hoc robustness checks. Our findings show that GAP analysis, based on predicted DFWI outcomes, can identify courses with unexplained performance deviations, offering useful insights for prioritizing courses for further review and targeted intervention. However, predictive performance may be constrained by the scope and quality of available student data, persistent structural or historical biases, and latent factors that remain unobserved. We acknowledge the ethical and equity issues involved in using predictive analytics in education (Baker & Hawn, 2021) and underscore the need for a thorough evaluation of predictive fairness and further validation before making decisions. Moreover, we note that the GAP framework outlined here is purely diagnostic and should be supplemented by a careful review of courses to further inform targeted intervention.

The rest of this article is organized as follows: We begin with a brief literature review. Next, we explore our dataset and preprocessing techniques. Then, we detail the association-strength metrics, classification algorithms, evaluation frameworks, and procedures used for GAP analysis and its validation. We also discuss the study’s ethical considerations, limitations, and assumptions. Following that, we present the results of our analysis. Finally, we highlight the key insights of our study and their implications for learning analytics and higher education practitioners.

2. Literature Review

Previous research has identified several key student-level predictors of DFW outcomes. Geiser and Santelices (2007) and Al Hazaa et al. (2021) found that high school GPA is a strong predictor of student outcomes. Hatfield et al. (2022) noted that introductory courses significantly affect student persistence, particularly by deterring minority students from pursuing STEM. Prior research has reported mixed findings regarding the role of gender in academic outcomes, with some studies identifying significant associations (Matz et al., 2017) while others find limited or context-dependent effects (Wang et al., 2017). Research has found that Pell Grant recipients tend to be older, attend part-time, have family commitments, job obligations, and often face financial insecurity (Cox, 2016), all of which affect their academic persistence (Titus, 2006) and course success (Cox, 2016). According to Engle and Tinto (2008), low-income first-generation students at 4-year public institutions are three times more likely to leave than their more advantaged counterparts. In addition, studies have found that most DFW grades occur in large-enrollment introductory courses (Twigg, 2005) and that first-year students enrolled in distance education courses may have higher DFW rates than their peers in in-person courses (Urtel, 2008).

Recent advances in machine learning have expanded the use of predictive analytics in higher education, enabling models that can capture complex, nonlinear relationships in student academic outcomes. Delen (2010) reported success in predicting first-year student attrition using support vector machines, XGBoost, and elastic-net binary logistic regression, while Dekker et al. (2009) used logistic regression, decision trees, a Bayesian classifier, random forest, and rule-based learners to predict student dropout risk. Lakkaraju et al. (2015) reported that ensemble methods, such as random forests, outperformed traditional methods in predicting the risk of students not graduating from high school on time. Susheelamma and Ravikumar (2019) employed XGBoost, and Cho et al. (2023) applied common machine learning models and a deep neural network model alongside data-balancing techniques to predict student dropout. Bloemer et al. (2018) used a logistic regression model to predict the probability that a student would receive a DFW based on prior GPA, prior DFW rate, student type (Native, Honors, Transfer, Online), and academic lifecycle stage. A random forest model using both institutional and in-class data was employed by Yang et al. (2020) to predict DFW outcomes in an introductory physics course.

Our study advances the literature on risk adjustment and value-added modeling in education by assessing outcomes with observable variables to compare performance across diverse populations (McCaffrey et al., 2004; Kane & Staiger, 2008). Prior research indicates that predictive models optimized for accuracy may not always reveal causal relationships (Shmueli, 2010; Mullainathan & Spiess, 2017). Additionally, findings in educational data science and algorithmic fairness highlight that risk-adjusted predictions can still reflect structural inequalities and students’ readiness, and that they need careful interpretation in decision-making (Baker & Hawn, 2021). Research in explainable AI emphasizes the importance of interpretable and auditable models to ensure transparency and build trust (Ribeiro et al., 2016). These issues are particularly relevant in education, where predictive models are used to guide advising and resource allocation. Consequently, the GAP measure in this study is framed as a predictive, risk-adjusted deviation to identify potential challenges in course design, structure, and instruction, without attributing causality. It ultimately serves as a screening tool for further review, with interventions informed by contextual factors.

3. Materials and Methods

Core-curriculum course data from a public university spanning the spring 2022 to summer 2024 terms were analyzed. The data obtained from the Registrar’s records included redacted information on students enrolled in various courses, by course section, their final grades, and demographic and socioeconomic characteristics. Table 1 outlines the key variables employed in this study.

3.1. Definition of DFWI Rate

The DFWI rate of any group is calculated as follows:

$$ \text{DFWI Rate} = \frac{\text{Number of DFWI students in a group}}{\text{Total number of students in the group}} \times 100 \tag{1} $$
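As a minimal sketch of Equation (1), the rate can be computed per group from binary outcome flags. The course labels and records below are hypothetical and only illustrate the arithmetic; they are not drawn from the study data.

```python
from collections import defaultdict


def dfwi_rate(outcomes):
    """Return the DFWI rate in percent for a group of 0/1 flags (1 = DFWI)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("group is empty")
    return 100.0 * sum(outcomes) / len(outcomes)


def dfwi_rate_by_group(records):
    """Map each group label to its DFWI rate; records are (group, flag) pairs."""
    grouped = defaultdict(list)
    for group, is_dfwi in records:
        grouped[group].append(is_dfwi)
    return {group: dfwi_rate(flags) for group, flags in grouped.items()}


# Hypothetical enrollment records: (course, 1 if DFWI else 0)
records = [
    ("MATH 101", 1), ("MATH 101", 0), ("MATH 101", 0), ("MATH 101", 1),
    ("ENGL 101", 0), ("ENGL 101", 0), ("ENGL 101", 1), ("ENGL 101", 0),
]
rates = dfwi_rate_by_group(records)
```

The same function applies to any grouping (course, demographic subgroup, or DFWI category), since Equation (1) is defined for "any group".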

3.2. Data Cleaning and Preprocessing

Two datasets were developed from the original data. The first dataset (Sample 1) was used for descriptive analysis and to assess the strength of association, while the second (Sample 2) was used for prediction and GAP analysis. All core curriculum courses with 15 or more students enrolled from spring 2022 to summer 2024 were included in Sample 1, resulting in 230 unique courses from an original dataset of 248. Final letter grades were classified into two categories: grades D and D+ (Poor), F (Fail), W (Withdraw), WF (Withdraw and Fail), INC (Incomplete), and PR (Work in Progress) were grouped as DFWI, while grades A+, A, A−, B+, B, B−, C+, C, and C− were classified as Non-DFWI. INC grades were included in the DFWI group to provide a more comprehensive assessment of course outcomes and to improve data balance, given the low incidence of DFW grades.
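The grade classification described above can be sketched as a simple lookup. The grade sets follow the lists in the text; how grades are spelled in the Registrar's records (for example, an ASCII hyphen versus a typographic minus sign) is an assumption, and any grade not listed in the text is returned as `None` rather than guessed.

```python
# Grade sets as listed in the text; the string spellings are assumptions.
DFWI_GRADES = {"D", "D+", "F", "W", "WF", "INC", "PR"}
NON_DFWI_GRADES = {"A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-"}


def classify_grade(grade):
    """Return 1 for a DFWI grade, 0 for a non-DFWI grade, None if unlisted."""
    # Normalize case, whitespace, and the typographic minus sign (U+2212).
    normalized = grade.strip().upper().replace("\u2212", "-")
    if normalized in DFWI_GRADES:
        return 1
    if normalized in NON_DFWI_GRADES:
        return 0
    return None  # grade not covered by the classification in the text


labels = [classify_grade(g) for g in ["A", "D+", "INC", "c\u2212", "AU"]]
```

Returning `None` for unlisted grades keeps such records visible for manual review instead of silently assigning them to either class.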

A course format variable was created from course section numbers and classified as face-to-face (f2f), online, or hybrid. Next, a categorical variable was added to represent spring, summer, and fall to account for potential differences across semesters. We also established a non-resident student category. The dataset also included the student’s high school GPA and the credits attempted during the semester. The High School GPA variable had 11,992 (18.10%) missing values. Following Van Buuren and Groothuis-Oudshoorn (2011), all missing high school GPA values were imputed using Multiple Imputation by Chained Equations (MICE), treating GPA as the only variable with missing values and conditioning on gender, race or ethnicity, first-generation status, and Pell Grant status. Additionally, the observations with credit loads exceeding 21 were removed as a precaution to ensure data consistency.

Courses were then categorized into three groups based on their historical average DFWI rates: Low (≤15%), Medium (>15% to ≤30%), and High (>30%). The dichotomous categorical variables—DFWI Group, Pell Grant status, first-generation status, and non-resident status—were coded with values of 0 and 1. Additionally, one-hot encoding was applied to transform all non-dichotomous categorical variables into numeric variables, with one column dropped per category. This dataset, referred to as Sample 1, comprised 66,265 observations, including duplicates from students enrolled in multiple courses, and was utilized to analyze course characteristics, student academic classification, and DFWI rates. In contrast, 11,462 unique student observations from Sample 1 were used to examine student demographic and socioeconomic factors.
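The grouping thresholds and the drop-one one-hot encoding can be illustrated as follows. This is a sketch only: the text does not say which level is dropped, so dropping the first level in sorted order is an assumption, as are the column names.

```python
def dfwi_group(rate_percent):
    """Assign a course to Low (<=15%), Medium (>15% to <=30%), or High (>30%)."""
    if rate_percent <= 15:
        return "Low"
    if rate_percent <= 30:
        return "Medium"
    return "High"


def one_hot_drop_first(values):
    """One-hot encode a categorical column, dropping one reference level.

    Assumption: the reference level is the first one in sorted order.
    """
    levels = sorted(set(values))
    kept = levels[1:]
    return [{f"is_{level}": int(v == level) for level in kept} for v in values]


groups = [dfwi_group(r) for r in (10.57, 15.0, 21.51, 30.0, 40.73)]
encoded = one_hot_drop_first(["fall", "spring", "summer", "fall"])
```

Dropping one column per variable avoids perfect collinearity among the indicator columns, which matters for the logistic regression models.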

Some variables in Sample 1 had limited counts and were therefore aggregated into single variables to ensure statistical stability. American Indian or Alaskan Native, Unknown, Native Hawaiian, and Pacific Islander races were combined into the “other race or ethnicity” category. Students classified academically as seniors with a degree, med prep, and unclassified were grouped into the “other academic classification.” A revised Sample 1 dataset, reflecting these changes, including all student records, was created to analyze the strength of the association between DFWI outcomes and various variables across the Low, Medium, and High DFWI groups.

Data processing for prediction and GAP analysis using Sample 2 involved classifying observations from spring 2022 to fall 2023 as training data and those from spring to summer 2024 as test data. Courses with fewer than 15 students in the training set were excluded to ensure the models learn meaningful patterns. We imputed 11,793 (18%) missing high school GPA observations at the student level in Sample 2 using MICE. To prevent data leakage, the imputation model was first fitted on the training set and then applied to the test set. We retained all student-level predictors in the final sample and removed course characteristics (except semester and course format) from the dataset. Other procedures were similar to those used for the previous dataset. The final dataset comprised 149 unique courses, with multiple records per student, ensuring that courses were aligned across the training set (52,455 observations) and the test set (12,838 observations), enabling consistent out-of-sample predictions.
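The temporal split, the course alignment, and the fit-on-training-only principle can be sketched as below. The term codes, field names, and records are hypothetical. The mean fill is a deliberately simple stand-in for MICE, used only to show that the fitted quantity comes from the training set; it is not the imputation model used in the study.

```python
from collections import Counter

# Hypothetical term codes; the text names the terms but not their encoding.
TRAIN_TERMS = {"2022SP", "2022SU", "2022FA", "2023SP", "2023SU", "2023FA"}
TEST_TERMS = {"2024SP", "2024SU"}
MIN_TRAIN_ENROLLMENT = 15  # training records per course, per the text


def temporal_split(records):
    """Split by term, then keep only courses present and usable in both sets."""
    train = [r for r in records if r["term"] in TRAIN_TERMS]
    test = [r for r in records if r["term"] in TEST_TERMS]
    train_counts = Counter(r["course"] for r in train)
    test_courses = {r["course"] for r in test}
    keep = {c for c, n in train_counts.items()
            if n >= MIN_TRAIN_ENROLLMENT and c in test_courses}
    return ([r for r in train if r["course"] in keep],
            [r for r in test if r["course"] in keep])


def impute_from_train(train, test, key="hs_gpa"):
    """Fill missing values in both sets with a statistic fitted on train only.

    Stand-in for MICE: the point is that no test-set value informs the fill.
    """
    observed = [r[key] for r in train if r[key] is not None]
    fill = sum(observed) / len(observed)
    for record in train + test:
        if record[key] is None:
            record[key] = fill
    return fill


records = (
    [{"course": "A", "term": "2022FA", "hs_gpa": 3.0} for _ in range(15)]
    + [{"course": "A", "term": "2024SP", "hs_gpa": None} for _ in range(2)]
    + [{"course": "B", "term": "2023SP", "hs_gpa": 2.0} for _ in range(3)]
    + [{"course": "B", "term": "2024SP", "hs_gpa": 4.0}]
    + [{"course": "C", "term": "2023FA", "hs_gpa": 3.5} for _ in range(20)]
)
train, test = temporal_split(records)
fill = impute_from_train(train, test)
```

In this toy data, course B is dropped for having too few training records and course C for having no test records, so only course A remains in both sets.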

A lower proportion of DFWI cases in the dataset skews the model toward predicting non-DFWI outcomes, reducing its accuracy in identifying actual DFWI cases (Guanin-Fajardo et al., 2024; Yang et al., 2020). Our approach establishes a baseline model to predict DFWI rates from student profiles, thereby enabling attribution of deviations to course- or instructional-level factors. Therefore, we chose to maintain the natural imbalance in the data to preserve the integrity of predicted probabilities and ensure valid GAP analysis. This principle guided our modeling and evaluation framework throughout the study.

3.3. Methods for Assessing the Strength of Association Between DFWI Outcomes and Variables

The point-biserial correlation coefficient (Tate, 1954) is used to examine the relationship between binary DFWI outcomes and numerical variables, while Cramér’s V (Cramér, 1946) is used to assess the strength of association between DFWI outcomes and categorical variables across the Low, Medium, and High DFWI groups. Duplicate student records due to students enrolling in multiple courses across semesters are retained, as they do not affect these analyses. Point-biserial correlation values range from +1 to -1, where positive values indicate a positive association with the DFWI outcome and negative values indicate a negative association. The magnitude indicates the strength of the association. Cramér’s V ranges from 0 to 1, where higher values denote a stronger association (e.g., 0.1 is weak, 0.3 is moderate, and 0.5 or more is strong).
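Both association measures can be written in a few lines. The sketch below computes the point-biserial coefficient as a Pearson correlation between a 0/1 variable and a numeric variable, and Cramér's V from the chi-square statistic without bias correction; whether the study applied a bias correction is not stated, so that choice is an assumption. The toy inputs are hypothetical.

```python
import math
from collections import Counter


def point_biserial(binary, values):
    """Pearson correlation between a 0/1 variable and a numeric variable."""
    n = len(binary)
    mean_b = sum(binary) / n
    mean_v = sum(values) / n
    cov = sum((b - mean_b) * (v - mean_v) for b, v in zip(binary, values))
    var_b = sum((b - mean_b) ** 2 for b in binary)
    var_v = sum((v - mean_v) ** 2 for v in values)
    return cov / math.sqrt(var_b * var_v)


def cramers_v(x, y):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), uncorrected."""
    n = len(x)
    joint = Counter(zip(x, y))
    row_totals = Counter(x)
    col_totals = Counter(y)
    chi2 = 0.0
    for r, n_r in row_totals.items():
        for c, n_c in col_totals.items():
            expected = n_r * n_c / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical data: DFWI flag, high school GPA, and course format
dfwi = [0, 0, 1, 1]
gpa = [3.8, 3.6, 2.4, 2.2]
r_pb = point_biserial(dfwi, gpa)          # negative: higher GPA, fewer DFWI
v_perfect = cramers_v(["f2f", "f2f", "online", "online"], dfwi)
v_none = cramers_v(["f2f", "online", "f2f", "online"], dfwi)
```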

3.4. Methods for Predictive Modeling

The classification models used for analysis, along with their specifications, are discussed below.

Logistic Regression (LR) Classifier: The LR classifier is a supervised machine learning model widely used for binary classification in educational settings (Bloemer et al., 2017, 2018). It estimates the log-odds of the outcome as a linear function of the input variables and employs a sigmoid function to classify outcomes into two classes.
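To make the log-odds formulation concrete, the sketch below converts a linear combination of inputs into a DFWI probability with the sigmoid function. The intercept, coefficients, and feature names are hypothetical values chosen for illustration; they are not estimates from the study.

```python
import math


def predict_dfwi_probability(intercept, coefs, features):
    """Sigmoid of the linear log-odds: p = 1 / (1 + exp(-(b0 + sum(b_j * x_j))))."""
    log_odds = intercept + sum(coefs[name] * features[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients (not fitted values)
coefs = {"hs_gpa": -1.2, "pell": 0.4}
p_low_gpa = predict_dfwi_probability(2.0, coefs, {"hs_gpa": 2.0, "pell": 1})
p_high_gpa = predict_dfwi_probability(2.0, coefs, {"hs_gpa": 3.8, "pell": 1})
```

With a negative GPA coefficient, the predicted probability falls as GPA rises, which is the direction of association reported for high school GPA in the text.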

Elastic Net Regularized Logistic Regression (ENLR) Classifier: The ENLR classifier is a hybrid supervised machine learning algorithm commonly used for binary classification tasks (Zou & Hastie, 2005). It is a variant of LR that combines L1 (Lasso) and L2 (Ridge) regularization and employs a sigmoid function to predict binary outcomes. The ENLR often outperforms simple LR, Lasso, or Ridge models when dealing with numerous potentially correlated variables, some of which may be irrelevant.
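For reference, the two logistic models can be written as follows. The elastic-net objective is shown in one common parameterization (a mixing weight α between the L1 and L2 penalties and an overall strength λ); the exact parameterization and hyperparameter values used in the study are not stated, so this form is an assumption.

```latex
% Logistic regression: the log-odds of a DFWI outcome are linear in the inputs x_i
\log \frac{p_i}{1 - p_i} = \beta_0 + x_i^{\top} \beta

% Elastic-net regularized logistic regression (assumed parameterization):
% penalized average negative log-likelihood
\min_{\beta_0, \beta} \; -\frac{1}{N} \sum_{i=1}^{N}
  \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
  + \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right)
```

Setting α = 1 recovers the Lasso penalty and α = 0 the Ridge penalty, which is why the elastic net can both shrink correlated coefficients and set irrelevant ones to zero.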

Random Forest (RF) Classifier: The RF classifier is a scalable supervised ensemble method that combines predictions from multiple decision trees, with the final classification made through majority voting (Breiman, 2001; Yang et al., 2020). Each tree is built using a bootstrapped sample of the training data via bagging, thereby reducing variance and improving generalization. Splits are determined by maximizing information gain, calculated with entropy (a measure of impurity or uncertainty in the data): at each node, the algorithm selects the feature and threshold that yield the largest reduction in entropy, creating more homogeneous sub-nodes. Trees grow to full depth or until the stopping criteria are met, and their votes are combined into the final prediction. This model can capture complex nonlinear relationships and interactions using the ensemble framework.
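The split criterion can be illustrated with a short sketch of binary entropy and the information gain of a single threshold split. The GPA values and labels are hypothetical, and a real random forest evaluates many candidate features and thresholds per node on bootstrapped samples, which this example omits.

```python
import math


def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(values, labels, threshold):
    """Parent entropy minus the size-weighted entropy of the two child nodes."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


# Hypothetical node: high school GPA and DFWI flag for four students
gpa = [2.0, 2.5, 3.5, 3.8]
dfwi = [1, 1, 0, 0]
gain_good_split = information_gain(gpa, dfwi, threshold=3.0)
gain_poor_split = information_gain(gpa, dfwi, threshold=2.0)
```

The threshold at 3.0 separates the two classes completely, so it removes all of the parent node's entropy, whereas the threshold at 2.0 leaves a mixed child node and yields a smaller gain.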

Extreme Gradient Boosting (XGBoost) Classifier: The XGBoost classifier is a highly efficient and scalable supervised learning algorithm based on the gradient-boosting framework. The method uses decision trees as base learners and minimizes the logistic loss function. It incorporates both L1 and L2 regularization to enhance generalization and reduce overfitting, making it especially effective for high-dimensional, noisy datasets (Susheelamma & Ravikumar, 2019). The model builds trees sequentially to minimize prediction errors and is effective in handling complex nonlinear relationships.

Artificial Neural Network (ANN) Classifier: The ANN classifier is a supervised learning model that captures complex nonlinear relationships in high-dimensional data. It consists of interconnected layers of neurons that transform input features through weighted combinations and nonlinear activation functions. These models are typically trained using backpropagation and gradient-based optimization algorithms, such as Adam or Stochastic Gradient Descent (SGD), to minimize the binary cross-entropy loss, which quantifies the discrepancy between predicted probabilities and actual binary outcomes. ANN serves as a complementary model to the parametric and non-parametric tree-based approaches used in the study and can capture intricate interactions among input features.
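A forward pass and the binary cross-entropy loss can be sketched in plain Python. The architecture here (one hidden layer with ReLU activations and a sigmoid output) and all weights are assumptions for illustration; the text above does not specify the network used in the study, and training via backpropagation is omitted.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(features, hidden_weights, hidden_biases, out_weights, out_bias):
    """One hidden ReLU layer followed by a sigmoid output unit."""
    hidden = [
        relu(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(out_weights, hidden)) + out_bias)


def binary_cross_entropy(probs, outcomes, eps=1e-12):
    """Mean of -(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Hypothetical weights: two inputs, two hidden units
p = forward([1.0, 2.0], [[0.5, -0.25], [-1.0, 0.5]], [0.0, 0.1], [1.0, -1.0], 0.0)
loss_uninformative = binary_cross_entropy([0.5, 0.5], [1, 0])
loss_confident = binary_cross_entropy([0.9, 0.1], [1, 0])
```

Predictions of 0.5 for every student give a loss of ln 2, and the loss shrinks as predicted probabilities move toward the observed outcomes.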

3.5. Methods for Model Evaluation and Identifying Courses

We evaluate model performance using predicted probabilities of a student receiving a DFWI outcome rather than binary classification, as results can be sensitive to the choice of classification threshold. The primary metric used is the Brier score (BR), which is defined as the mean squared difference between predicted probabilities and observed outcomes (Brier, 1950), and is represented as

$$ \mathrm{BR} = \frac{1}{N} \sum_{i = 1}^{N} (p_{i} - y_{i})^{2} \tag{2} $$

Here, $p_{i}$ is the predicted probability of observation $i$, $y_{i} \in \{0, 1\}$ is the observed outcome, and $N$ is the number of observations. A lower Brier score value indicates better probabilistic performance, reflecting closer agreement between predicted probabilities and observed outcomes. The Brier score captures both accuracy (correctness of outcomes) and calibration (correctness of probabilities).
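Equation (2) translates directly into code. The probabilities and outcomes below are hypothetical and serve only to show the calculation.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical predictions for four students
score = brier_score([0.1, 0.8, 0.3, 0.6], [0, 1, 0, 1])
baseline = brier_score([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1])
```

A constant prediction of 0.5 always scores 0.25, so a score below that on a balanced sample indicates predictions that carry information.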

The area under the receiver operating characteristic curve (ROC–AUC) and average precision (AP), which summarizes the precision–recall curve (PRC), are threshold-independent metrics used for model selection and model evaluation (Fawcett, 2006; Saito & Rehmsmeier, 2015). They assess model performance across all probability thresholds, evaluating the ability to correctly rank positive and negative cases. The ROC curve illustrates the trade-off between the true positive rate (DFWI students correctly identified) and the false positive rate (non-DFWI students incorrectly flagged as DFWI). AUC ranges from 0.5 (random guessing) to 1 (perfect classification), with 0.80 considered indicative of good discrimination (Hosmer et al., 2013). AP emphasizes precision (true positives among predicted positives), making it more suitable for imbalanced datasets (Saito & Rehmsmeier, 2015).
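Both ranking metrics can be computed without choosing a threshold. In the sketch below, ROC–AUC is the fraction of (DFWI, non-DFWI) pairs in which the DFWI student receives the higher predicted probability, with ties counted as half, and AP is the mean of the precision values at each DFWI student in the ranked list. The AP function assumes no tied probabilities and at least one positive case; the data are hypothetical.

```python
def roc_auc(probs, outcomes):
    """Probability that a random positive is ranked above a random negative."""
    positives = [p for p, y in zip(probs, outcomes) if y == 1]
    negatives = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum(
        1.0 if pos > neg else 0.5 if pos == neg else 0.0
        for pos in positives
        for neg in negatives
    )
    return wins / (len(positives) * len(negatives))


def average_precision(probs, outcomes):
    """Mean precision at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(probs, outcomes), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical predictions: 1 = DFWI
probs = [0.9, 0.8, 0.3, 0.1]
outcomes = [1, 0, 1, 0]
auc = roc_auc(probs, outcomes)
ap = average_precision(probs, outcomes)
```

The pairwise AUC loop is quadratic in the number of students and is meant for illustration; library implementations use a sort-based calculation for data of the size analyzed here.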

The predicted DFWI rate for a course is obtained by aggregating and averaging individual predicted probabilities for enrolled students in the test set, consistent with their interpretation as conditional expectations (Hastie et al., 2009). The predicted rates are subsequently subtracted from the observed DFWI rates to identify courses with significant GAPs, which are then ranked in descending order. As a robustness test, Spearman’s rho (ρ) pair-wise rank correlations (Spearman, 1904) are used to assess the degree of agreement between the rankings of course GAPs generated by various models (Lakkaraju et al., 2015). The ρ value ranges from +1 (perfect positive association) to −1 (perfect negative association), with 0 indicating no association between the rank orderings. In addition to GAP rankings, the courses are ranked based on observed DFWI rates, and a concordance analysis using Kendall’s tau pairwise correlation (Kendall, 1938) is conducted to assess the level of agreement among the rankings. Kendall’s tau ranges from −1 to +1, where values closer to +1 indicate strong agreement between rankings, values near 0 indicate little to no association, and values closer to −1 indicate strong disagreement. The dual-metric analysis was then applied to categorize courses into four groups: high-high, high-low, low-high, and low-low, based on their observed DFWI rates and GAP values.
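The course-level steps can be sketched end to end: average the predicted probabilities per course, subtract from the observed rate to obtain the GAP, rank, compare rankings, and assign quadrants. The records, tie-breaking rule, and quadrant cutoffs below are hypothetical, since the text does not state the cutoffs here. The rank-correlation formulas are the textbook versions for untied ranks.

```python
from collections import defaultdict


def course_gaps(records):
    """records: (course, predicted_prob, observed 0/1) for test-set students."""
    probs, observed = defaultdict(list), defaultdict(list)
    for course, p, y in records:
        probs[course].append(p)
        observed[course].append(y)
    result = {}
    for course in probs:
        obs_rate = 100.0 * sum(observed[course]) / len(observed[course])
        pred_rate = 100.0 * sum(probs[course]) / len(probs[course])
        result[course] = {"observed": obs_rate, "predicted": pred_rate,
                          "gap": obs_rate - pred_rate}
    return result


def rank_desc(scores):
    """Rank 1 = largest score; ties broken by course name (an assumption)."""
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return {course: i for i, course in enumerate(ordered, start=1)}


def spearman_rho(rank_a, rank_b):
    n = len(rank_a)
    d2 = sum((rank_a[c] - rank_b[c]) ** 2 for c in rank_a)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def kendall_tau(rank_a, rank_b):
    items = list(rank_a)
    concordant = discordant = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            s = ((rank_a[items[i]] - rank_a[items[j]])
                 * (rank_b[items[i]] - rank_b[items[j]]))
            concordant += s > 0
            discordant += s < 0
    n = len(items)
    return (concordant - discordant) / (n * (n - 1) / 2)


def quadrant(observed, gap, rate_cut, gap_cut):
    """Label as '<observed level>-<GAP level>'; both cutoffs are hypothetical."""
    rate_level = "high" if observed >= rate_cut else "low"
    gap_level = "high" if gap >= gap_cut else "low"
    return f"{rate_level}-{gap_level}"


records = ([("X", 0.2, y) for y in (1, 1, 0, 0)]
           + [("Y", 0.5, y) for y in (1, 0)]
           + [("Z", 0.1, y) for y in (0, 0, 0, 1)])
gaps = course_gaps(records)
gap_rank = rank_desc({c: v["gap"] for c, v in gaps.items()})
obs_rank = rank_desc({c: v["observed"] for c, v in gaps.items()})
rho = spearman_rho(gap_rank, obs_rank)
tau = kendall_tau(gap_rank, obs_rank)
labels = {c: quadrant(v["observed"], v["gap"], 30, 10) for c, v in gaps.items()}
```

In this toy example, courses X and Y have the same observed rate of 50%, but only X exceeds its expected rate, so Y drops to the bottom of the GAP ranking. This is the kind of divergence between observed and risk-adjusted rankings that the concordance analysis is meant to quantify.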

All analyses were performed in Python 3.11.7 using the Anaconda distribution and packages such as scikit-learn 1.8.0, statsmodels 0.14.6, pandas 3.0.1, NumPy 2.4.4, XGBoost 3.2.0, TensorFlow 2.21.0, Keras 3.14.0, and Matplotlib 3.10.8. The author(s) used OpenAI’s ChatGPT (2024 version), Copilot (https://www.copilotai.com/), GitHub, and Google Scholar to support code generation, debugging, conceptual clarification, and preliminary literature exploration. All ideas, interpretations, and outputs were developed and verified by the author(s). Language editing was supported using Grammarly v1.2.260.1887.

4. Ethical Considerations, Limitations, and Assumptions of the Study

4.1. Ethical Considerations

This study followed ethical standards within practical constraints and received a Not Human Subjects (NHS) determination from the Institutional Review Board (IRB). The use of machine learning models to estimate DFWI probabilities enables the aggregation of student-level predictions to identify courses with higher concentrations of at-risk outcomes, thereby supporting design, structure, and instructional review, as well as targeted academic intervention. However, ethical considerations remain, as predictions are based on historical data that may reflect structural inequities (Baker & Hawn, 2021), and aggregated results may still influence perceptions of courses or cohorts. Additionally, while ANN, RF, and XGBoost can capture nonlinear relationships, they reduce model transparency. The GAP metric discussed in the article should therefore be strictly viewed as a screening tool, not a definitive diagnosis; it should be interpreted as an indicator of where to focus attention, and it should be embedded within a broader review process that includes root-cause analysis, action, follow-up, and regular engagement among students, faculty, and administrators.

4.2. Assumptions and Limitations

The findings of this study should be considered in light of several key assumptions and limitations. The analysis relies on available institutional data, which may not fully capture all factors influencing DFWI outcomes. It assumes accurate enrollment records, stable relationships between student characteristics and outcomes, and consistent curriculum policies over time. It presumes that demographic, academic, and socioeconomic factors offer valuable insights into student DFWI outcomes. It also assumes that the average of predicted probabilities for enrolled students meaningfully reflects course-level risk. While the GAP analysis can identify courses for review, it is expected that subsequent institutional review will enable a more granular examination of underlying course or instructional vulnerabilities and support targeted interventions.

The limitations of this approach include persistent biases in historical data, class-size imbalances, and unmeasured or unobserved student characteristics that institutions do not record, which could affect predictability. Temporal changes in student behavior or course structures across academic terms can also pose challenges. Even highly accurate models may exhibit systematic performance differences across demographic groups, raising concerns about unequal impacts and potential biases. Moreover, complex machine learning models can reduce interpretability. It is important to note that the models developed in this study are correlational rather than causal, and the resulting risk indicators should only be used as screening tools. While the foundational approach and methodologies discussed here are generalizable, models will need to be locally adapted and retrained to reflect the specific context and needs of each institution.

An important limitation of the present study is logically intrinsic to its question. If targeting courses for intervention is to make sense, there must be an appreciable degree of student success that cannot be predicted from student characteristics—some impact from the course itself. Consequently, the ability of student characteristics to predict course outcomes is likely lower than one would anticipate from a standard predictive model. Accordingly, while the current predictive performance is sufficient to support the proposed GAP diagnostic framework at the course level, improved prediction could further strengthen its precision and practical utility.

5. Results and Discussion

5.1. Descriptive Statistics

Table 2 summarizes the key patterns in student demographic and socioeconomic characteristics. DFWI rates show little variation by gender but differ substantially across racial and ethnic groups, with the highest rates among Black or African Americans (40.75%), American Indian or Alaskan Natives (26.67%), those identifying as two or more races (24.89%), and Hispanics (24.86%). Freshmen experience markedly higher DFWI rates (43.35%) than other class levels. Higher rates are also observed among first-generation (25.47%), Pell-eligible (31.09%), and resident students (21.49%).

Table 3 summarizes key patterns in the courses. Most courses fall in the medium DFWI rate group, with fewer in the low group and the smallest share (17.50%) in the high DFWI rate category. DFWI rates are 40.73% in the high DFWI group, followed by the medium (21.51%) and low groups (10.57%). Face-to-face and hybrid formats show higher DFWI rates than online courses, and DFWI rates are highest in fall, followed by spring and summer.

5.2. Assessing the Strength of Association Between DFWI Outcomes and Variables

Table 4 presents the results of point-biserial correlations. High School GPA shows a consistently moderate-to-strong negative correlation with DFWI outcomes, strongest in high-DFWI courses ($r_{pb}$ = −0.36) and weakest in low-DFWI courses ($r_{pb}$ = −0.23), and is statistically significant at the 1% level. Credits attempted are weakly negatively correlated with DFWI across all groups ($r_{pb}$ = −0.05 to −0.07), indicating a modest but significant effect. The findings suggest a strong association between student preparedness and DFWI outcomes in more challenging courses.
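As an illustration of the statistic reported in Table 4, the following minimal sketch computes a point-biserial correlation between a binary DFWI indicator and a continuous predictor. The function name and the toy values (`dfwi`, `hs_gpa`) are hypothetical and are not drawn from the study data; the significance test is omitted.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(binary, values):
    """Point-biserial correlation between a 0/1 outcome and a continuous variable.

    Uses r_pb = (M1 - M0) / s * sqrt(p * q), where s is the population
    standard deviation of all values and p is the share of 1s.
    """
    if len(binary) != len(values):
        raise ValueError("inputs must have the same length")
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    if not group1 or not group0:
        raise ValueError("both outcome classes must be present")
    p = len(group1) / len(values)
    return (mean(group1) - mean(group0)) / pstdev(values) * sqrt(p * (1 - p))


# Hypothetical toy data: 1 = DFWI outcome, paired with high school GPA.
dfwi = [1, 1, 0, 0, 0, 1, 0, 0]
hs_gpa = [2.1, 2.5, 3.6, 3.2, 3.9, 2.8, 3.0, 3.4]
r_pb = point_biserial(dfwi, hs_gpa)
print(round(r_pb, 3))
```

A negative value in this toy example mirrors the direction reported for high school GPA: students with DFWI outcomes have lower average GPA.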

Cramér’s V values and χ² results in Table 5 indicate a statistically significant medium association between race or ethnicity and DFWI outcomes, with Cramér’s V exceeding 0.30 in high-DFWI courses. Racial differences become more pronounced in higher-risk courses, indicating uneven exposure across different levels of course difficulty. Cramér’s V for academic classification (0.26 to 0.29 across groups) and Pell eligibility status (0.13 to 0.23) indicated a similar pattern. First-generation status showed a moderate association, while gender and residence status displayed weak associations. The course attributes had low Cramér’s V values, suggesting a limited association. The findings suggest structural differences in exposure to high-DFWI course environments across some categories.
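For readers less familiar with Cramér’s V, the sketch below derives it from a contingency table of counts via the Pearson χ² statistic. It is a minimal illustration without bias correction; the table values are hypothetical, and it assumes no row or column is empty.

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V for an r x c contingency table given as a list of rows of counts.

    Assumes every row and column total is positive (no empty categories).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1  # smaller dimension minus one
    return sqrt(chi2 / (n * k))


# Hypothetical counts: rows are student groups, columns are [DFWI, no DFWI].
toy_table = [[40, 60], [20, 80], [10, 90]]
v = cramers_v(toy_table)
print(round(v, 3))
```

Values near 0 indicate little association between the categorical variable and the DFWI outcome, and values near 1 indicate a very strong one.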

Overall, prior academic performance, race or ethnicity, student academic level, first-generation status, and Pell eligibility appear to be correlated with DFWI risk, with effects increasing in more difficult courses. While these results point towards equity concerns, they do not indicate causality.

5.3. Predictive Analysis Results: Model Evaluation

We used multiple models to predict the likelihood that students would receive a DFWI outcome. The dependent variable was the binary DFWI outcome, with independent variables including high school GPA, credits attempted, and categorical factors such as academic classification, gender, race or ethnicity, Pell eligibility, first-generation status, semester, and course format. Semester served as a control to account for any seasonal performance variations, while the course format addressed differences in instructional modality that could affect learning conditions. Educational data often displays nonlinear relationships and multicollinearity, which traditional linear models may miss. To address this, we utilized non-parametric models that can capture complex patterns with fewer assumptions. Standard scaling was applied to the high school GPA and credits attempted in the training set using StandardScaler, with the same scaling applied to the test set to avoid data leakage. Tree-based models were not feature-scaled as they are invariant to monotonic transformations. To handle class imbalance, we tuned model hyperparameters with RandomizedSearchCV and stratified 5-fold cross-validation (Cheng et al., 2025). Model performance was evaluated using continuous predicted probabilities and threshold-independent metrics, including BR, ROC-AUC, and AP. This method is ideal for probabilistic risk prediction, focused on accurate risk estimations rather than binary classification. The random_state parameter was set to 42 for reproducibility, and optimal parameters were then applied to the test set.
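The three threshold-independent metrics named above can be sketched in a few lines. This is a plain-Python illustration, not the evaluation code used in the study; it assumes binary labels coded 0/1, at least one case in each class, and no tied scores when computing AP. The toy labels and probabilities are hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as one half.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_prob):
    """Mean of the precision values at the rank of each positive case."""
    ranked = sorted(zip(y_prob, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical toy predictions: 1 = DFWI outcome.
y_true = [1, 0, 1, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.3, 0.1]
print(brier_score(y_true, y_prob), roc_auc(y_true, y_prob), average_precision(y_true, y_prob))
```

None of the three functions applies a classification cutoff, which is why they suit probabilistic risk estimation rather than binary labeling.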

We began predictive analysis by fitting the LR model to the data. The LR model was used as the baseline predictive model. The ENLR model employed regularization (shrinkage to prevent overfitting) to manage model complexity. A total of 25 random combinations from a predefined hyperparameter grid were evaluated. Regularization strength (λ) and L1 ratio (α) were tested across a range of values. The optimal parameters identified included a regularization strength (λ) of 10, indicating moderate regularization. The alpha value was 0.1, indicating that 10% of the regularization was from the Lasso (L1) component, with the remaining 90% from the Ridge (L2) component. A theoretically valid set of two-way interactions was tested for ENLR but was omitted as they did not improve model performance.
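One common way to write the elastic-net objective makes the 10%/90% split explicit. The form below is a sketch under assumptions: ℓ(β) denotes the logistic log-likelihood, λ multiplies the whole penalty, and α is the L1 share. The exact scaling used by the fitting software is not stated here and may differ (some libraries, for example, parameterize by the inverse of the regularization strength).

```latex
\hat{\beta} = \arg\min_{\beta} \Bigl\{ -\ell(\beta)
  + \lambda \Bigl[ \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \Bigr] \Bigr\}
```

Under this form, α = 0.1 gives the L1 term a weight of 0.1λ and the L2 term a weight of 0.9λ (up to the conventional factor of 1/2), so the fitted model behaves mostly like ridge regression with mild sparsity pressure.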

For the RF model, bootstrap sampling was set to its default value, allowing each tree in the forest to be trained on a random subset of the data sampled with replacement. A total of 50 random combinations were evaluated from a predefined hyperparameter grid. The identified optimal hyperparameters are 500 trees with a maximum depth of 20. The minimum number of samples required to split an internal node was 10, allowing moderate subgroup formation; the minimum number of samples needed to be at a leaf node was 1, and the number of features to consider when searching for the best split was the log2 of the total number of features.

The XGBoost model was evaluated on a predefined hyperparameter grid, with 10 randomly selected combinations using logistic loss as the internal evaluation metric. The optimal configuration included 300 trees with a maximum depth of 7, to prevent overfitting. The learning rate, the rate at which the model updates, was set to 0.2, and 80% of the data was used for each tree (subsample ratio of 0.8). All features were used to build each tree. The minimum loss reduction required to make a split (γ) was set to 0, allowing more flexible tree growth. The L2 regularization term (λ) on weights was set to 1.5, limiting large weights, and helping to reduce overfitting and improve generalization.

The ANN was implemented in Keras as a feed-forward neural network (Figure 1), with an input layer containing all predictor features, followed by two hidden layers using Rectified Linear Unit (ReLU) activation, and a sigmoid output layer producing continuous probabilities for the binary outcome. The model was trained using binary cross-entropy loss and the Adam optimizer. The second hidden layer was fixed at 32 units and was not tuned. Hyperparameter tuning identified the optimal configuration as 128 hidden units in the first layer, a learning rate of 0.0005, a dropout rate of 0.3 for regularization, 50 training epochs, and a batch size of 32.

The results of the model comparisons are summarized in Table 6. The ANN model achieved the best overall performance across all metrics (BR = 0.13, AUC = 0.76, AP = 0.42), although improvements over LR and ENLR were marginal. Linear models performed competitively, indicating that the relationship between predictors and the outcome is largely linear or well-approximated by additive effects. Tree-based models (RF and XGB) showed slightly lower performance across all metrics. For brevity, only the ANN model results are discussed in detail. The Brier score indicates that the model provides well-calibrated estimates of student risk, reflecting baseline probabilities rather than exact outcomes. The ANN model achieved an ROC–AUC of 0.76, indicating good ability to rank students by their probability of a DFWI outcome. With an Average Precision (AP) score of 0.42, the model demonstrates moderate effectiveness in prioritizing students more likely to experience DFWI, which is typical for imbalanced datasets. While ROC–AUC assesses ranking for all students, AP focuses on the positive class and is more sensitive to false positives. Movahedi et al. (2021) report an AP of 0.43, similar to the ANN model. Ultimately, the model aims to establish baseline risk expectations from student characteristics, recognizing the need for GAP analysis to account for course- or instructional-level effects. Overall, the models are adequate for this purpose.

The ROC–AUC and PRC for the ANN model are presented in Figure 2 and Figure 3, respectively.

5.4. Predictive Analysis and GAP Results: Ranking and Identifying Courses

Predicted and actual DFWI rates were computed for each course in the test set using the best-performing ANN model. GAP values were then computed as the difference between actual and predicted course-level DFWI rates and used to rank courses. The results of the pair-wise Spearman’s rank correlation for the GAP course rankings of models are presented in Table 7. Spearman’s rho (ρ = 1) indicates perfect positive correlation between the course rankings derived from the LR and ENLR models. Regularization therefore did not change the prediction ordering, which suggests that the baseline LR model was already stable with minimal overfitting. The non-parametric models (RF, XGBoost, ANN) also show statistically significant ρ values, ranging from 0.96 to 0.98, indicating good alignment among them. The lowest correlation, 0.96, was observed between the ANN and XGBoost models. There is good agreement between the ANN model and the rankings of the other models.
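The course-level aggregation described above can be sketched as follows. The record layout, function names, and course identifiers are hypothetical; the only logic taken from the text is that the expected rate is the mean predicted probability of enrolled students and that GAP is the observed rate minus the expected rate.

```python
from collections import defaultdict


def course_gaps(records):
    """Aggregate student-level rows into course-level rates and GAP values.

    records: iterable of (course_id, observed_dfwi as 0/1, predicted_probability).
    Returns {course_id: (observed_rate, expected_rate, gap)}.
    """
    sums = defaultdict(lambda: [0, 0.0, 0])  # DFWI count, sum of predictions, enrollment
    for course, observed, predicted in records:
        entry = sums[course]
        entry[0] += observed
        entry[1] += predicted
        entry[2] += 1
    gaps = {}
    for course, (dfwi_count, pred_sum, n) in sums.items():
        observed_rate = dfwi_count / n
        expected_rate = pred_sum / n
        gaps[course] = (observed_rate, expected_rate, observed_rate - expected_rate)
    return gaps


def rank_by_gap(gaps):
    """Course ids ordered from the largest GAP to the smallest."""
    return sorted(gaps, key=lambda course: gaps[course][2], reverse=True)


# Hypothetical student-level test rows.
records = [
    ("MATH101", 1, 0.30), ("MATH101", 1, 0.20), ("MATH101", 0, 0.25), ("MATH101", 1, 0.25),
    ("ENGL101", 0, 0.40), ("ENGL101", 1, 0.50), ("ENGL101", 0, 0.30), ("ENGL101", 0, 0.40),
    ("HIST110", 1, 0.20), ("HIST110", 0, 0.20),
]
gaps = course_gaps(records)
ranking = rank_by_gap(gaps)
print(ranking)
```

In this toy example the course with the largest positive GAP is ranked first, even though another course has the higher mean predicted risk.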

In addition to ranking based on GAP, the courses were also ranked by observed DFWI rates. The course rankings based on observed DFWI and GAP (ANN model) are moderately correlated (Spearman’s Rho = 0.84, Kendall Tau = 0.65) and statistically significant for the full sample. However, this agreement weakens substantially for courses with >30% observed DFWI (Spearman’s Rho = 0.53, Kendall Tau = 0.38), indicating that the two ranking methods diverge precisely where prioritization matters most. Moreover, only 20% of the top 10 highest-risk courses overlap. The rank fluctuations of 0–19 places between the observed DFWI rate and GAP in these courses indicate that unadjusted outcome-based rankings may differ substantially from model-based GAP rankings, thereby motivating the use of predictive adjustment to achieve more stable prioritization.
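The agreement measures used in this comparison can be illustrated with a small sketch. It assumes two strict orderings of the same courses with no tied ranks (the simple Spearman formula and Kendall's tau-a); tie-corrected variants would be needed otherwise. The course labels and orderings are hypothetical.

```python
from itertools import combinations


def ranks(order):
    """Map each item to its 1-based position in an ordered list."""
    return {item: position for position, item in enumerate(order, start=1)}


def spearman_rho(order_a, order_b):
    """Spearman's rho for two strict orderings of the same items (no ties)."""
    ra, rb = ranks(order_a), ranks(order_b)
    n = len(order_a)
    d_squared = sum((ra[c] - rb[c]) ** 2 for c in order_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(order_a, order_b):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    ra, rb = ranks(order_a), ranks(order_b)
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        if (ra[x] - ra[y]) * (rb[x] - rb[y]) > 0:
            concordant += 1
        else:
            discordant += 1  # strict orderings have no ties, so the product is never 0
    n_pairs = len(order_a) * (len(order_a) - 1) // 2
    return (concordant - discordant) / n_pairs


def top_k_overlap(order_a, order_b, k):
    """Share of the top-k items that appear in both orderings."""
    return len(set(order_a[:k]) & set(order_b[:k])) / k


# Hypothetical rankings of five courses, highest priority first.
by_observed_dfwi = ["A", "B", "C", "D", "E"]
by_gap = ["B", "A", "C", "E", "D"]
rho = spearman_rho(by_observed_dfwi, by_gap)
tau = kendall_tau(by_observed_dfwi, by_gap)
print(rho, tau, top_k_overlap(by_observed_dfwi, by_gap, 1))
```

Rank correlations summarize agreement over the whole list, while the top-k overlap focuses on the head of the list, which is where review resources are actually allocated.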

When analyzing discrepancies between predicted and actual DFWI rates for each course, it is imperative to consider the implications of both large and small (or negative) GAPs. Larger GAPs, where the actual DFWI rate surpasses predictions, may indicate underlying challenges such as misalignment between course content and student preparedness, issues with instructional or course design, or structural factors affecting performance. Conversely, low or negative GAPs, in which actual DFWI rates fall below expectations, may indicate that student performance exceeds predictions. Such desirable outcomes could result from effective teaching strategies, robust support systems, course attributes, or other latent factors that facilitate student success. Reviewing courses further to understand these aspects can yield valuable insights into practices that foster positive student outcomes.

Considering both observed DFWI rates and GAP metrics together distinguishes courses with widespread student difficulty from those that underperform relative to context, enabling fair identification of courses for further review or intervention. A high observed DFWI rate, combined with a high GAP value, indicates the need for an urgent course-level review or intervention. In contrast, a high DFWI rate with a low GAP suggests that the course may require student-oriented support. A moderate-to-low DFWI rate with a high GAP indicates courses where outcomes appear positive but do not meet expectations for enrolled students, suggesting the need for monitoring or further review. In contrast, low DFWI and GAP indicate safe courses that pose minimal risk. This systematic approach highlights opportunities for more in-depth investigation of course design and instructional methodologies. In short, GAP analysis is a viable approach for efficiently identifying high-risk courses, reviewing them, and recommending targeted interventions.
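The dual-metric reading described above amounts to a two-by-two decision rule, sketched below. Both cutoffs are assumptions made for illustration: the 0.30 DFWI cutoff echoes the >30% subset discussed earlier, and the 0.05 GAP cutoff is purely hypothetical. An institution would need to set its own thresholds.

```python
def review_category(observed_rate, gap, dfwi_cutoff=0.30, gap_cutoff=0.05):
    """Assign a course to one of four review categories.

    dfwi_cutoff and gap_cutoff are illustrative assumptions, not values
    prescribed by the GAP framework.
    """
    high_dfwi = observed_rate > dfwi_cutoff
    high_gap = gap > gap_cutoff
    if high_dfwi and high_gap:
        return "urgent course-level review"
    if high_dfwi:
        return "student-oriented support"
    if high_gap:
        return "monitoring or further review"
    return "minimal risk"


# Hypothetical courses: (observed DFWI rate, GAP).
examples = {
    "course_1": (0.45, 0.12),
    "course_2": (0.40, -0.02),
    "course_3": (0.18, 0.09),
    "course_4": (0.10, -0.01),
}
categories = {name: review_category(rate, gap) for name, (rate, gap) in examples.items()}
print(categories)
```

Keeping the rule this simple makes the prioritization auditable: each label can be traced back to two numbers and two documented cutoffs.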

Our findings based on predictive analysis suggested that model choice has only a minimal impact on GAP-based course rankings. If interpretability is prioritized, LR or ENLR models may be more suitable. If performance is the primary concern, the ANN model may be preferred, although rankings may improve only marginally. Nonetheless, these incremental improvements can be significant in situations where complex factors influence predictions and impact GAPs.

5.5. Predictive Fairness, Probability Calibration, and GAP Validation

We computed BR, AUC, and AP for subgroups to compare predicted risks with actual DFWI outcomes across gender, race/ethnicity, Pell eligibility, first-generation status, non-resident status, and academic classification for the best-performing ANN model. Results (Appendix A.1) reveal consistent predictive rankings (similar AUCs) but differing calibration, i.e., higher-risk groups (e.g., Pell-eligible, first-generation, Black or African American students, Freshmen) show greater prediction error (higher BRs), while lower-risk groups (e.g., White, non-Pell students) display better calibration. Calibration plots (Appendix A.2) were created for each subgroup by binning predicted probabilities into 10 bins to compare mean predicted risks with observed DFWI outcomes. The ANN model demonstrates reasonable calibration at the bin level, with most errors within ±0.05 and the majority within ±0.10. Deviations partly reflect sparse observations within probability bins for certain subgroups (e.g., gender not reported, Asian, two or more race/ethnicity, other race/ethnicity, nonresident, not Pell-eligible, seniors, and freshmen). Higher variation was observed, particularly in the high-risk tail, where student population is sparse for many subgroups. Excluding the fall semester from the test set may also have caused a slight distributional shift due to variations in DFWI prevalence.
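The binned calibration comparison can be sketched as follows. The text specifies 10 bins but not the binning scheme, so equal-width bins on [0, 1] are an assumption here; the function name and toy data are hypothetical.

```python
def calibration_table(y_true, y_prob, n_bins=10):
    """Compare mean predicted risk with the observed DFWI rate per probability bin.

    Uses equal-width bins on [0, 1] (an assumption). Returns a list of
    (bin_index, count, mean_predicted, observed_rate) for non-empty bins.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((y, p))
    rows = []
    for index, members in enumerate(bins):
        if not members:
            continue  # sparse subgroups often leave bins empty
        count = len(members)
        mean_predicted = sum(p for _, p in members) / count
        observed_rate = sum(y for y, _ in members) / count
        rows.append((index, count, mean_predicted, observed_rate))
    return rows


# Hypothetical subgroup: 1 = DFWI outcome, with predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.08, 0.12, 0.15, 0.85, 0.95]
rows = calibration_table(y_true, y_prob)
for index, count, mean_predicted, observed_rate in rows:
    print(index, count, round(mean_predicted, 3), round(observed_rate - mean_predicted, 3))
```

Running the function separately on each subgroup's rows reproduces the structure of a per-subgroup calibration plot. Bins with very few students, as in the high-risk tail, yield unstable observed rates.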

While the model effectively ranks risk levels, calibration analyses indicate some non-uniformity in error distribution across subgroups and modest overprediction of risk. Post-hoc global beta calibration (Kull et al., 2017) served as a robustness check, marginally improving probability calibration while resulting in minimal changes to unadjusted GAP course rankings. The top 10 courses showed nearly identical rankings, with differences ranging from 0 to 3. Overall, GAP rankings remained strongly correlated (Kendall’s Tau = 0.95), even for courses with DFWI rates >30%. This indicates that course prioritization is generally reliable in our data, even with moderate miscalibration. Nonetheless, results indicate that practitioners should use flexible recalibration as a cross-check before making decisions and interpret course rankings cautiously. GAP estimates and rankings should be used as risk-adjusted indicators rather than as definitive causal measures of course quality.
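For reference, the beta calibration map of Kull et al. (2017) transforms an uncalibrated score s in (0, 1) with three parameters a, b ≥ 0 and c. The reading of "global" as a single map fitted on all students rather than per subgroup, and the symbols y_i (observed 0/1 outcome), s_i (ANN score), and n_j (enrollment in course j), are assumptions of this sketch.

```latex
\mu_{\text{beta}}(s; a, b, c)
  = \frac{1}{1 + \left( e^{c} \, \dfrac{s^{a}}{(1 - s)^{b}} \right)^{-1}}
\quad\Longleftrightarrow\quad
\operatorname{logit}\bigl(\mu_{\text{beta}}(s)\bigr) = c + a \ln s - b \ln(1 - s)

\mathrm{GAP}_j^{\text{cal}}
  = \frac{1}{n_j} \sum_{i \in j} y_i
  - \frac{1}{n_j} \sum_{i \in j} \mu_{\text{beta}}(s_i; a, b, c)
```

Setting a = b = 1 and c = 0 recovers the identity map, so the recalibrated GAP reduces to the unadjusted GAP when the original scores are already well calibrated. Because the map is monotone in s, it preserves the ordering of students by risk, but course-level averages and therefore GAP rankings can still shift.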

6. Conclusions and Implications

This study examined DFWI trends in core-curriculum courses by integrating student characteristics and applying predictive modeling within the framework of Bloemer et al. (2017, 2018). Higher DFWI rates were observed in lower-division courses, face-to-face formats, and fall semesters. Equity gaps were evident across race/ethnicity, Pell eligibility status, and first-generation status, with strong correlations between DFWI outcomes and student factors, including high school GPA and academic classification. Predictive models indicated that student-level factors alone do not fully explain DFWI risk and that course-level and instructional factors also matter. The ANN model performed only marginally better than parametric models, reflecting the moderate predictive signal in the available data.

We found evidence that course prioritization can change significantly when using GAP-adjusted DFWI rates rather than observed DFWI rates, suggesting that risk-adjusted measures may be more informative for identifying courses that need further review and interventions in design, instruction, and structure. While the GAP measure highlights courses for review, its accuracy may be limited by the available student characteristics, the structure, and the quality of the data, and may partially reflect unobserved student heterogeneity. In this context, the dual-metric review-based approach enhances transparency and interpretability, providing additional insights. A high DFWI rate with a large GAP may indicate courses that are potentially underperforming in course design, instruction, structure, and other factors. In contrast, a high DFWI rate with a low GAP suggests that support needs to be oriented towards students. A low DFWI rate with a high GAP suggests that courses are performing below expectations relative to their student composition, warranting closer monitoring. A low DFWI rate with a low GAP, on the other hand, indicates effective practices that may be scaled to other courses.

Our work demonstrates that integrating observed DFWI rates with the risk-adjusted GAP can provide a more informative diagnostic framework for identifying courses warranting further review than observed DFWI rates alone. Because student outcomes are influenced by both student- and course-level factors, models based primarily on student characteristics capture only part of the variation in student performance. Nonetheless, the results support the use of the GAP-augmented DFWI rates as a more informative indicator of course performance than observed DFWI rates alone. While the current predictive performance is sufficient to support the proposed diagnostic framework, improved prediction could further strengthen its precision and practical utility. Future work may explore more advanced calibration and residual estimation approaches, including hierarchical (multilevel) student outcome models that account for clustering within courses and Bayesian or empirical Bayes shrinkage methods that improve the stability of aggregated course-level estimates. Although GAP analysis provides useful diagnostic insights, we emphasize the importance of a thorough evaluation of predictive fairness and additional validation, including robustness checks, before making decisions. The GAP framework discussed here is intended strictly as a diagnostic tool and should be followed by careful course review to guide targeted interventions.

Much of the learning analytics literature to date has concentrated on identifying at-risk students who require academic support, but it often overlooks the wider instructional contexts that affect success. GAP patterns can inform institutional planning by guiding resource allocation across departments, including staffing and support services. By separating courses that consistently carry high risk from those where elevated risk reflects the composition of enrolled students, GAP estimates provide actionable insights for curriculum review and the targeted deployment of instructional support. While customization might be necessary, our methodology can be adapted to other institutions with multi-semester academic and demographic data. As universities invest in learning analytics, connecting predictive modeling with course evaluations can lead to meaningful improvements in teaching and learning.

Importantly, this approach is not at odds with broader inclusion and access objectives. Course identification is used solely for diagnostic purposes, and any resulting interventions, after careful review of courses, should be grounded in inclusive practices that benefit all students. Explicitly modeling student heterogeneity can enhance equity efforts by enabling complementary, targeted supports, such as proactive advising, tutoring referrals, and adaptive academic assistance for at-risk students, rather than relying solely on course redesign. Proper model calibration and subgroup validation not only assess overall predictive fairness but also uncover heterogeneous risk patterns that aggregate metrics may obscure. For instance, examining the high-risk tail (the top percentile of predicted risk) helps identify students who may benefit from individualized support. Calibration analyses can further reveal whether certain demographic, socioeconomic, or academic subgroups experience disproportionately high failure rates relative to their predicted risk levels. Leveraging these insights, universities can enhance student support, proactively address equity gaps, and strengthen student success and retention, ensuring that interventions are fair, evidence-based, and cost-effective.

In summary, despite its limitations, the GAP analysis offers actionable insights to improve course design, instruction, and support strategies, fostering a more equitable and responsive educational environment for everyone. By linking predictive modeling to course-level GAP analysis, this study illustrates how learning analytics can support data-informed instructional improvement and institutional decision-making, thereby enhancing student success.

Author Contributions

Conceptualization—K.J., W.C.C., O.K.J., and S.M.M. Methodology, analysis, and writing—original draft: K.J.; data—K.J.; writing—review and editing: K.J., W.C.C., O.K.J., and S.M.M. Visualization: K.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Southern Illinois University Carbondale; no specific grant number applies.

Institutional Review Board Statement

Not applicable. This study received a Not Human Subjects Determination (NHS Determination) from the Southern Illinois University Carbondale Institutional Review Board (IRB).

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in the analysis are confidential, not publicly available, and are the property of the Board of Trustees of Southern Illinois University Carbondale.

Acknowledgments

The authors would like to express their gratitude to the Registrar’s Office, as well as to David G. Shirley, Andrew Walker, and John Janecek from the Institutional Effectiveness Planning and Research at Southern Illinois University Carbondale, for providing institutional data. Special thanks are also given to Amber Burtis and Jeffrey P. Punske for their useful comments.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DFWI rate	% of D and F grades, withdrawals, and incompletes
DFWI outcome	Grades of D or F, withdrawal, or incomplete
GAP	The difference between the observed and predicted DFWI rate

Appendix A

Appendix A.1

Table A1. Subgroup-level predictive performance and calibration metrics.

Category	Level	Count	Brier	AUC	AP
Gender
	Female	6855	0.1215	0.7673	0.4215
	Male	5917	0.1332	0.7340	0.4201
	Not Reported	66	0.1892	0.6487	0.4082
Pell Eligible
	Yes	5633	0.1608	0.7380	0.4608
	No	7205	0.1011	0.7336	0.3519
First Generation
	Yes	4559	0.1434	0.7569	0.4381
	No	8279	0.1184	0.7519	0.4051
Nonresident
	Yes	270	0.0877	0.6119	0.3372
	No	12,568	0.1281	0.7574	0.4220
Race/Ethnicity
	2 or More	525	0.1048	0.8050	0.4187
	Asian	203	0.0816	0.7359	0.3847
	Black or African American	2618	0.1869	0.7313	0.5309
	Hispanic	1337	0.1478	0.7196	0.3946
	Other	338	0.0946	0.5779	0.1632
	White	7817	0.1079	0.7330	0.3309
Academic Classification
	Freshman	2070	0.2298	0.6452	0.5586
	Sophomore	4373	0.1239	0.6847	0.2778
	Junior	2498	0.1152	0.6944	0.2477
	Senior	3804	0.0839	0.6314	0.1429
	Other	93	0.0992	0.8230	0.4410

Note: All values are from Sample 2 test set and are generated using the best-performing ANN model.

Appendix A.2

Figure A1. Calibration by Subgroup: Gender (10 Bins).

Figure A2. Calibration by Subgroup: Race or Ethnicity (10 Bins).

Figure A3. Calibration by Subgroup: First-Generation Status (10 Bins).

Figure A4. Calibration by Subgroup: Pell Eligibility Status (10 Bins).

Figure A5. Calibration by Subgroup: Nonresident Status (10 Bins).

Figure A6. Calibration by Subgroup: Academic Classification (10 Bins).

References

Al Hazaa, K., Abdel-Salam, A. S. G., Ismail, R., Johnson, C., Al-Tameemi, R. A. N., Romanowski, M. H., BenSaid, A., Rhouma, M. B. H., & Elatawneh, A. (2021). The effects of attendance and high school GPA on student performance in first-year undergraduate courses. Cogent Education, 8(1), 1956857.
Baker, R. S., & Hawn, A. (2021). Algorithmic bias in education. International Journal of Artificial Intelligence in Education, 32(4), 1052–1092.
Bloemer, W., Day, S., & Swan, K. (2017). Gap analysis: An innovative look at gateway courses and student retention. Online Learning, 21(3), 5–14.
Bloemer, W., Swan, K., Day, S., & Bogle, L. (2018). Digging deeper into the data: The role of gateway courses in online student retention. Online Learning, 22(4), 109–127.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3.
Cheng, A., Pei, B., & Liu, C. (2025). Balancing act: Early, fair, and accurate identification of at-risk students. Journal of Learning Analytics, 12(3), 47–65.
Cho, C. H., Yu, Y. W., & Kim, H. G. (2023). A study on dropout prediction for university students using machine learning. Applied Sciences, 13(21), 12004.
Cox, R. D. (2016). Complicating conditions: Obstacles and interruptions to low-income students’ college “choices”. The Journal of Higher Education, 87(1), 1–26.
Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press.
Dekker, G. W., Pechenizkiy, M., & Vleeshouwers, J. M. (2009, July 1–3). Predicting students’ drop out: A case study [Conference session]. 2nd International Conference on Educational Data Mining “EDM 2009” (pp. 41–50), Cordoba, Spain. Available online: https://pure.tue.nl/ws/portalfiles/portal/2813648/Metis233712.pdf (accessed on 11 September 2025).
Delen, D. (2010). A comparative analysis of machine learning techniques for student retention management. Decision Support Systems, 49(4), 498–506.
Drake, B. M. (2024). The intersectionality of first-generation students and its relationship to inequitable student outcomes (AIR Professional File, Article 175). Association for Institutional Research.
Engle, J., & Tinto, V. (2008). Moving beyond access: College success for low-income, first-generation students. Pell Institute for the Study of Opportunity in Higher Education. Available online: https://files.eric.ed.gov/fulltext/ED504448.pdf (accessed on 15 December 2025).
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.
Geiser, S., & Santelices, M. V. (2007). Validity of high-school grades in predicting student success beyond the freshmen year: High-school record versus standardized tests as indicators of four-year college outcomes. Research and Occasional Paper Series: SGHE.6.07. Center for Studies in Higher Education, University of California, Berkeley. Available online: https://eric.ed.gov/?id=ED502858 (accessed on 14 July 2025).
Guanin-Fajardo, J. H., Guaña-Moya, J., & Casillas, J. (2024). Predicting academic success of college students using machine learning techniques. Data, 9(4), 60.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Springer.
Hatfield, N., Brown, N., & Topaz, C. M. (2022). Do Introductory courses disproportionately drive minoritized students out of STEM pathways? PNAS Nexus, 1(4), pgac167.
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (3rd ed.). Wiley.
Johnson, M. K., Crosnoe, R., & Elder, G. H., Jr. (2001). Students’ attachment and academic engagement: The role of race and ethnicity. Sociology of Education, 74(4), 318–340.
Kane, T. J., & Staiger, D. O. (2008). Estimating teacher impacts on student achievement: An experimental evaluation (NBER Working Paper No. 14607). National Bureau of Economic Research. Available online: https://www.nber.org/papers/w14607 (accessed on 4 May 2026).
Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81–93.
Kull, M., Silva Filho, T. M., & Flach, P. (2017). Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11(2), 5052–5080.
Lakkaraju, H., Aguiar, E., Shan, C., Miller, D., Bhanpuri, N., Ghani, R., & Addison, K. L. (2015, August 10–13). A machine learning framework to identify students at risk of adverse academic outcomes. 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Sydney, NSW, Australia.
Matz, A. K., Koester, B. P., Fiorini, S., Grom, G., Shepard, L., & Conley, A. M. (2017). Patterns of gendered performance differences in large introductory courses at five research universities. AERA Open, 3(4), 1–14.
McCaffrey, D. F., Lockwood, J. R., Koretz, D., Louis, T. A., & Hamilton, L. S. (2004). Models for value-added modeling of teacher effects. Journal of Educational and Behavioral Statistics, 29(1), 67–101.
Movahedi, F., Padman, R., & Antaki, J. F. (2021). Limitations of receiver operating characteristic curve on imbalanced data: Evaluation of assist device mortality risk scores. Journal of Thoracic and Cardiovascular Surgery, 165(4), 1433–1442.e2.
Mullainathan, S., & Spiess, J. (2017). Machine learning: An applied econometric approach. Journal of Economic Perspectives, 31(2), 87–106.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August 13–17). “Why should I trust you?” Explaining the predictions of any classifier [Conference session]. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA.
Saito, T., & Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3), e0118432.
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310.
Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1), 72–101.
Susheelamma, K. H., & Ravikumar, K. M. (2019). Student risk identification learning model using machine learning approach. International Journal of Electrical and Computer Engineering, 9(5), 3872.
Tate, M. W. (1954). Correlation between a discrete and a continuous variable: Point-biserial correlation. Annals of Mathematical Statistics, 25(3), 603–607.
Titus, M. A. (2006). Understanding the influence of the financial context of institutions on student persistence at four-year colleges and universities. The Journal of Higher Education, 77(2), 353–375.
Twigg, C. A. (2005). Increasing success in developmental mathematics: Redesigning courses for student achievement. National Center for Academic Transformation. Available online: https://www.thencat.org/Monographs/IncSuccess.pdf (accessed on 14 May 2025).
Urtel, M. G. (2008). Assessing academic performance between traditional and distance education course formats. Educational Technology & Society, 11(1), 322–330. Available online: https://scholarworks.indianapolis.iu.edu/server/api/core/bitstreams/49292de3-b0e4-4096-9d4f-daee0274d9fc/content (accessed on 15 May 2025).
Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67.
Vyas, V. S., & Reid, S. A. (2023). What moves the needle on DFW rates and student success in general chemistry? A quarter-century perspective. Journal of Chemical Education, 100(4), 1547–1556.
Wang, M.-T., & Degol, J. L. (2017). Gender gap in science, technology, engineering, and mathematics (STEM): Current knowledge, implications for practice, policy, and future directions. Educational Psychology Review, 29(1), 119–140.
Yang, J., DeVore, S., Hewagallage, D., Miller, P., Ryan, Q. X., & Stewart, J. (2020). Using machine learning to identify the most at-risk students in physics classes. Physical Review Physics Education Research, 16(2), 020130. [Google Scholar] [CrossRef]
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: (Statistical Methodology), 67(2), 301–320. [Google Scholar] [CrossRef]

Figure 1. Architecture of a feed-forward artificial neural network with two hidden layers.

Figure 2. ROC–AUC (ANN Model). The curve shows the model’s ability to discriminate between classes across all probability thresholds. The diagonal reference line represents random classification performance.

Figure 3. PRC (ANN Model). The curve summarizes precision and recall over all possible classification thresholds of the predicted probability.

Table 1. Description of key numerical and categorical variables used in this study.

Variable	Description	Categories (Levels)
Numerical
High School GPA	GPA achieved in high school	--
Credits Attempted	Credits attempted in the semester	--
Categorical
Gender	Student gender	Male, Female, Other
Race or Ethnicity	Student self-reported race or ethnicity	White, Black or African American, Hispanic, Asian, American Indian or Alaskan Native, Native Hawaiian or Other Pacific Islander, Two or More Races, Unknown
First-Generation Status	First-generation college student	Yes, No
Pell Grant Eligibility	Eligibility for Pell Grant (low-income indicator)	Yes, No
Non-Resident Status	Not a US citizen/national and in the country on a visa or temporary basis	Yes, No
Academic Classification	Student classification at the end of the semester	Freshmen, Sophomore, Junior, Senior, Senior with degree, Med Prep, Unclassified
Course Format	Course delivery mode	Face-to-Face, Online, Hybrid
Semester Type	Semester when the course is offered	Spring, Summer, Fall
DFWI Group	DFWI group classification	DFWI Rate: Low (≤15%), Medium (>15% to ≤30%), High (>30%)
DFWI Outcome	Course result: DFWI or Non-DFWI	Yes, No

Note. Variables include those directly available and those engineered from the dataset. Med Prep represents a small percentage of post-baccalaureate health profession students enrolled in undergraduate courses.
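The DFWI Group thresholds listed in Table 1 can be written as a small lookup. The following is a minimal sketch rather than the authors' code: the function name `dfwi_group` is hypothetical, and the rate is assumed to be expressed as a percentage.

```python
def dfwi_group(rate_pct: float) -> str:
    """Map a course's historical DFWI rate (%) to the groups in Table 1.

    Thresholds follow the table: Low (<=15%), Medium (>15% to <=30%), High (>30%).
    """
    if rate_pct < 0 or rate_pct > 100:
        raise ValueError("rate_pct must lie between 0 and 100")
    if rate_pct <= 15.0:
        return "Low"
    if rate_pct <= 30.0:
        return "Medium"
    return "High"


# Hypothetical course rates, used only to show the boundary behavior.
for rate in (10.0, 15.0, 22.5, 30.0, 41.0):
    print(rate, dfwi_group(rate))
```

Both boundaries are inclusive on the lower group, so a course at exactly 15% is Low and a course at exactly 30% is Medium.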

Table 2. Summary of categorical variables: Demographic and socioeconomic characteristics.

Category	Level	Total Count (N)	Proportion of Total (%)	Within-Level DFWI Count	Within-Level DFWI Rate (%)
Gender	Female	5917	51.62	1228	20.75
	Male	5510	48.07	1203	21.83
	Not Reported	35	0.31	8	22.86
Race or Ethnicity	White	7177	62.62	1071	14.92
	Black or African American	2243	19.57	914	40.75
	Hispanic	1086	9.47	270	24.86
	2 or More Races	454	3.96	113	24.89
	Unknown	271	2.36	35	12.92
	Asian	189	1.65	26	13.76
	AI or AN	30	0.26	8	26.67
	NH or PI	12	0.10	2	16.67
First Generation Status	Yes	4586	40.00	1168	25.47
	No	6876	60.00	1271	18.48
Pell Grant Eligible	Yes	4632	40.41	1440	31.09
	No	6830	59.59	999	14.63
Resident Status	Yes	11,202	97.73	2407	21.49
	No	260	2.27	32	12.31
Academic Classification	Freshmen	17,816	26.89	7724	43.35
	Sophomore	20,116	30.36	3306	16.43
	Junior	11,586	17.48	1731	14.94
	Senior	16,171	24.40	1924	11.90
	Senior with a Degree	347	0.52	71	20.46
	Unclassified	215	0.32	13	6.05
	Med Prep (P-B)	14	0.02	0	0.00

Note. All values are computed from Sample 1. Academic classification was analyzed at the enrollment level, so students enrolled in multiple courses appear more than once (number of observations (N) = 66,264), while demographic and socioeconomic factors were analyzed on unique students (N = 11,462). AI or AN represents American Indian or Alaskan Native, NH or PI represents Native Hawaiian or Other Pacific Islander, and P-B represents Post-Baccalaureate.
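The two percentage columns in Table 2 follow directly from the counts: the proportion of total divides a level's count by the category total, and the within-level DFWI rate divides the level's DFWI count by the level's own count. A minimal sketch using the Gender rows of Table 2 (the helper name `summarize_levels` is hypothetical):

```python
def summarize_levels(counts):
    """counts maps level -> (total_n, dfwi_n).

    Returns level -> (proportion of total in %, within-level DFWI rate in %),
    both rounded to two decimals as in Table 2.
    """
    grand_total = sum(total for total, _ in counts.values())
    return {
        level: (
            round(100 * total / grand_total, 2),
            round(100 * dfwi / total, 2),
        )
        for level, (total, dfwi) in counts.items()
    }


# Gender rows copied from Table 2 (unique students, N = 11,462).
gender = {
    "Female": (5917, 1228),
    "Male": (5510, 1203),
    "Not Reported": (35, 8),
}
summary = summarize_levels(gender)
for level, (share, rate) in summary.items():
    print(f"{level}: {share}% of students, DFWI rate {rate}%")
```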

Table 3. Summary of categorical variables: Course characteristics.

Category	Level	Total Count (N)	Proportion of Total (%)	Within-Level DFWI Count	Within-Level DFWI Rate (%)
DFWI Group	High	11,596	17.50	4723	40.73
	Medium	38,995	58.85	8389	21.51
	Low	15,674	23.65	1657	10.57
Course Format	Face-to-Face	47,921	72.32	11,345	23.67
	Online	18,093	27.30	3366	18.60
	Hybrid	251	0.38	58	23.11
Semester Type	Fall	28,355	42.79	7045	24.85
	Spring	32,079	48.41	6879	21.44
	Summer	5831	8.80	845	14.50

Note. All values are computed from Sample 1. Course characteristics, academic classification, and DFWI outcomes were analyzed at the enrollment level, so students enrolled in multiple courses appear more than once. N = 66,264.

Table 4. Strength of association between DFWI outcomes and numerical variables.

Variable	Low	Medium	High
Variable	r_pb	r_pb	r_pb
High School GPA	−0.23 ***	−0.31 ***	−0.36 ***
Credits Attempted	−0.05 ***	−0.07 ***	−0.05 ***

Note. All values are computed from Sample 1. r_pb represents point-biserial correlation. N = 66,264. Low (≤15%), Medium (>15% to ≤30%), and High (>30%) indicate course groups based on historical DFWI rates. *** indicates statistical significance at the 1% level.
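For readers unfamiliar with the statistic in Table 4, the point-biserial correlation between a binary outcome and a numerical variable can be computed from the two group means, the overall standard deviation, and the share of cases in each group. The sketch below uses the standard textbook formula with hypothetical toy data; it is not the authors' code and does not reproduce the values in Table 4.

```python
import math


def point_biserial(binary, values):
    """r_pb = (mean_1 - mean_0) / sd * sqrt(p * (1 - p)).

    binary: 0/1 outcomes (e.g., DFWI = 1); values: numerical variable.
    sd is the population standard deviation of all values.
    """
    n = len(values)
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    if not group1 or not group0:
        raise ValueError("both outcome classes must be present")
    mean_all = sum(values) / n
    sd = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("values must not be constant")
    p = len(group1) / n
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    return (mean1 - mean0) / sd * math.sqrt(p * (1 - p))


# Hypothetical data: students with a DFWI outcome have lower high school GPA.
dfwi = [1, 1, 1, 0, 0, 0]
hs_gpa = [2.0, 2.5, 3.0, 3.0, 3.5, 4.0]
r_pb = point_biserial(dfwi, hs_gpa)
print(round(r_pb, 4))
```

A negative value, as in Table 4, means the numerical variable tends to be lower among DFWI cases.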

Table 5. Strength of association between DFWI outcomes and categorical variables.

Variable	Low		Medium		High
Variable	V	χ²	V	χ²	V	χ²
Gender	0.03	15.15 ***	0.02	14.99 ***	0.01	2.35
Race or Ethnicity	0.18	535.87 ***	0.25	2351.92 ***	0.31	1104.57 ***
First-Generation Status	0.08	96.45 ***	0.09	316.55 ***	0.11	147.40 ***
Pell Eligibility Status	0.13	268.17 ***	0.17	1130.88 ***	0.23	595.89 ***
Resident Status	0.02	4.31 **	0.04	51.64 ***	0.05	30.32 ***
Academic Classification	0.26	1083.63 ***	0.29	3202.92 ***	0.29	1001.33 ***
Course Format	0.06	63.52 ***	0.03	27.90 ***	0.06	37.30 ***
Semester Group	0.05	37.16 ***	0.05	111.23 ***	0.09	102.56 ***

Note. All values are computed from Sample 1. V represents the Cramér’s V value, and χ² is the Chi-square statistic. Number of observations (N) = 66,264. Low (≤15%), Medium (>15% to ≤30%), and High (>30%) indicate course groups based on historical DFWI rates. *** and ** indicate statistical significance at the 1% and 5% levels, respectively.
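Cramér’s V rescales the Chi-square statistic by the sample size and the smaller table dimension, V = sqrt(χ² / (N · min(r − 1, c − 1))), so that associations can be compared across variables with different numbers of levels. The sketch below is illustrative only: it applies the standard formula, without a continuity correction, to the student-level Pell Grant counts from Table 2, so it is not expected to reproduce the enrollment-level, per-group values in Table 5. The function names are hypothetical.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    if 0 in row_sums or 0 in col_sums:
        raise ValueError("every row and column needs at least one observation")
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(chi_square(table) / (n * k))


# Rows: Pell eligible Yes / No; columns: DFWI / non-DFWI.
# Non-DFWI counts are derived as total count minus DFWI count from Table 2.
pell = [[1440, 4632 - 1440], [999, 6830 - 999]]
v_pell = cramers_v(pell)
print(round(chi_square(pell), 2), round(v_pell, 2))
```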

Table 6. Class-wise metrics for models classifying DFWI outcomes.

Model	Brier Score	ROC–AUC	AP
LR	0.1297	0.7443	0.4047
ENLR	0.1311	0.7400	0.3916
RF	0.1351	0.7303	0.3926
XGBoost	0.1367	0.7329	0.3879
ANN	0.1273	0.7566	0.4198

Note. All values are computed from Sample 2. DFWI = 0 represents non-DFWI cases, while DFWI = 1 indicates DFWI cases. Unbalanced data were used for prediction. LR, ENLR, RF, XGBoost, and ANN denote Logistic Regression, Elastic Net Regularized Logistic Regression, Random Forest, Extreme Gradient Boosting, and Artificial Neural Network models, respectively. AUC and AP denote Area Under the Curve and Average Precision, respectively.
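The three metrics in Table 6 measure different things: the Brier score is the mean squared error of the predicted probabilities (lower is better), ROC–AUC is the probability that a randomly chosen DFWI case receives a higher score than a randomly chosen non-DFWI case, and AP averages the precision obtained at each DFWI case when cases are ranked by score. The following dependency-free sketch illustrates the definitions on hypothetical labels and probabilities; it is not the authors' evaluation code, and library implementations may treat tied scores differently.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg
    )
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_prob):
    """Mean precision at each positive, scanning scores from high to low.

    Assumes no tied scores.
    """
    ranked = sorted(zip(y_prob, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    if hits == 0:
        raise ValueError("at least one positive case is required")
    return total / hits


# Hypothetical outcomes (1 = DFWI) and predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(brier_score(y_true, y_prob), roc_auc(y_true, y_prob),
      average_precision(y_true, y_prob))
```

Because DFWI cases are the minority class, AP is a useful complement to ROC–AUC: its baseline is the DFWI prevalence rather than 0.5.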

Table 7. Pair-wise rank correlation of courses by GAP using Spearman’s rho.

	LR	ENLR	RF	XGBoost	ANN
LR	1.00
ENLR	1.00	1.00
RF	0.97	0.97	1.00
XGBoost	0.96	0.96	0.97	1.00
ANN	0.98	0.98	0.98	0.96	1.00

Note. All values are computed from Sample 1. LR, ENLR, RF, XGBoost, and ANN denote Logistic Regression, Elastic Net Regularized Logistic Regression, Random Forest, Extreme Gradient Boosting, and Artificial Neural Network models, respectively. To avoid redundancy, only the lower half of the correlation matrix is displayed. N = 149. All pairwise correlations are statistically significant at the 1% level.
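The comparison in Table 7 can be sketched in two steps: compute each course's GAP under each model, then correlate the resulting course rankings with Spearman's rho. The sketch below assumes the GAP is the observed DFWI rate minus the course-level mean of the predicted probabilities, which matches the description in the abstract; the sign convention, the function names, and all data are assumptions made for illustration.

```python
import math
from collections import defaultdict


def course_gaps(records):
    """records: (course_id, dfwi_outcome in {0, 1}, predicted_probability).

    Returns {course_id: observed DFWI rate - mean predicted probability}.
    """
    outcomes, probs = defaultdict(list), defaultdict(list)
    for course, outcome, prob in records:
        outcomes[course].append(outcome)
        probs[course].append(prob)
    return {
        c: sum(outcomes[c]) / len(outcomes[c]) - sum(probs[c]) / len(probs[c])
        for c in outcomes
    }


def average_ranks(values):
    """Ranks starting at 1 for the smallest value; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rankings must not be constant")
    return cov / (sx * sy)


# Hypothetical enrollments scored by two hypothetical models.
model_a = [("C1", 1, 0.2), ("C1", 1, 0.4), ("C2", 0, 0.3),
           ("C2", 1, 0.3), ("C3", 0, 0.5), ("C3", 0, 0.1)]
model_b = [("C1", 1, 0.3), ("C1", 1, 0.5), ("C2", 0, 0.2),
           ("C2", 1, 0.2), ("C3", 0, 0.4), ("C3", 0, 0.2)]
gaps_a, gaps_b = course_gaps(model_a), course_gaps(model_b)
courses = sorted(gaps_a)
rho = spearman_rho([gaps_a[c] for c in courses], [gaps_b[c] for c in courses])
print(gaps_a, gaps_b, rho)
```

A rho close to 1, as reported in Table 7, indicates that the models order the courses almost identically by GAP even when their predicted probabilities differ.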

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Joseph, K.; Calvert, W.C.; Keys, O., Jr.; McCrocklin, S.M. Identifying Courses for Targeted Review Using GAP Analysis and Machine Learning. Educ. Sci. 2026, 16, 806. https://doi.org/10.3390/educsci16050806

