1. Introduction
Academic performance in medical education is determined by multiple individual, curricular, and institutional factors. Academic risk is the estimated probability that academic performance will fall below a critical institutional threshold in a future period, jeopardizing a student’s educational continuity [
1]. Students enrolled in medical programs must maintain high academic standards for thirteen or more semesters while simultaneously developing cognitive, procedural, and attitudinal competencies that evolve from the basic sciences toward clinical and hospital practice [
2,
3]. Poor academic performance is directly associated with academic lag, course failure, and dropout, entailing high costs for students, the educational system, and society at large [
1].
Intelligent diagnostic systems for academic performance analysis represent one of the most promising applications of artificial intelligence in higher education, as they transform large volumes of data into interpretable signals that enable the faculty tutor to take pedagogically grounded actions before academic lag consolidates [
4].
In Ecuador, 34 universities located across 12 provinces offer medical training programs. A substantial proportion of enrolled students must relocate from their home provinces to access their education, adding a socioeconomic burden that compounds an already demanding academic workload [
5].
Faculty tutors are formally responsible for individual academic monitoring under Ecuadorian higher education regulations [
6], yet must simultaneously manage teaching, research, community outreach, and institutional administration, which constrains the time available for proactive tutorial follow-up. Furthermore, the absence of intelligent diagnostic systems that provide timely and accurate information on individual academic performance reduces the possibilities of timely tutorial intervention, increasing the risk of academic failure. Compounding this problem, four critical methodological gaps persist in the existing literature. First, most intelligent diagnostic systems operate on small or single-semester datasets without modeling the longitudinal hierarchical structure of multi-period educational data [
7,
8]; the absence of studies that simultaneously employ large-volume longitudinal records spanning multiple consecutive academic periods with explicit hierarchical modeling constitutes a fundamental methodological gap that limits the operational applicability of the models proposed in the literature [
9]. Second, existing systems direct their output toward institutional managers or automated LMS platforms, not toward the individual faculty tutor as the operational recipient [
7,
10]. Third, the impact of preprocessing decisions on algorithm selection is rarely reported, limiting reproducibility [
11]. Fourth, no existing diagnostic system for medical education integrates inferential and predictive SHAP layers into a unified explanatory attribution interface deployable as a web prototype.
This study proposes the design and implementation of a single-institution intelligent diagnostic system to identify low mid-period academic performance, with the aim of activating proactive and preventive tutoring before a final assessment. The system is intended to assist faculty tutors in analyzing academic performance and does the following:
(1) Uses the first two grades of the same academic period (N1_C and N2), summing them (SUM = N1_C + N2) and using this result as an intermediate checkpoint to transform performance data into pedagogical inputs that activate proactive and preventive tutoring before the final assessment;
(2) Integrates an inferential framework based on a three-level mixed linear model (MLM) with a predictive framework based on a supervised regression pipeline for individual academic performance scoring;
(3) Applies GroupKFold cross-validation to respect the longitudinal structure of student academic trajectories;
(4) Implements an explainability and interpretability framework based on SHAP at both the population level (MLM) and the individual level (Random Forest) to produce interpretable low-performance profiles for tutors;
(5) Implements a predictive uncertainty validation framework through a robustness and subgroup equity analysis to determine whether the supervised regression model is stable, reproducible, and applicable across the distinct groups composing the student population;
(6) Explicitly reports the impact of encoding decisions on algorithm performance;
(7) Demonstrates the operational feasibility of the system through a Streamlit web prototype that integrates a dual SHAP framework together with a tutor-facing explanatory attribution matrix.
Mixed linear models (MLMs) constitute the inferential framework of the present study because they allow modeling data with multilevel hierarchical structure by decomposing total variance into three interpretable components: variance attributable to observable fixed effects (marginal R²), variance attributable to persistent interindividual differences such as academic self-regulation or resilience (the difference between conditional R² and marginal R²), and unexplained residual variance; semi-partial R² coefficients additionally isolate each predictor’s unique contribution, enabling a rigorous distinction between statistical significance and practical effect size [12,13,14]. However, MLMs quantify population-level relationships without generating individual predictions with the precision required to guide tutoring decisions and are therefore complemented by supervised machine learning algorithms.
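The variance decomposition described above is commonly written in the following form. This is a sketch that assumes a single random intercept per student; the symbols σ_f² (variance of the fixed-effect predictions), σ_u² (random-intercept variance), and σ_ε² (residual variance) are notation introduced here for illustration and are not taken from the original text.

```latex
R^2_{\text{marginal}}
  = \frac{\sigma_f^2}{\sigma_f^2 + \sigma_u^2 + \sigma_\varepsilon^2},
\qquad
R^2_{\text{conditional}}
  = \frac{\sigma_f^2 + \sigma_u^2}{\sigma_f^2 + \sigma_u^2 + \sigma_\varepsilon^2},
\qquad
R^2_{\text{conditional}} - R^2_{\text{marginal}}
  = \frac{\sigma_u^2}{\sigma_f^2 + \sigma_u^2 + \sigma_\varepsilon^2}
```

Under this notation, the difference between the two coefficients is the share of total variance carried by persistent between-student differences.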
The predictive framework for this study uses four supervised machine learning algorithms: Random Forest, XGBoost, Ridge Regression, and Multilayer Perceptron (MLP). The selection of these algorithms is based on their capacity to capture both linear and nonlinear relationships, control multicollinearity, and evaluate different levels of model complexity, in line with prior studies in medical education [
15].
Random Forest is a supervised machine learning algorithm that constructs a specified number of decision trees in parallel using bootstrap sampling and random predictor subsets per node, obtaining the final prediction as the mean across all trees to reduce variance and improve generalization [
4,
16,
17].
XGBoost builds trees sequentially and additively, where each tree minimizes the residual error of the previous ensemble through gradient descent with L1 and L2 regularization, capturing nonlinear interactions and threshold effects with greater precision on tabular data [
4,
18,
19].
Ridge Regression incorporates an L2 quadratic penalty on the coefficients that shrinks estimators toward zero without eliminating variables, making it particularly useful in the presence of multicollinearity and serving as a regularized baseline to quantify the predictive gain of ensemble models [
16,
20].
The Multilayer Perceptron (MLP) learns hierarchical nonlinear representations through hidden layers with ReLU activation and the Adam optimizer, offering competitive performance when data volume is sufficient, although its higher computational cost and lower interpretability generally place it below ensemble models on moderate-size datasets [
7,
20].
All four algorithms are evaluated via GroupKFold cross-validation, a k-fold variant that ensures all observations from the same student appear exclusively in either the training or the test set within each fold, preventing data leakage and producing realistic generalization estimates for unseen students [
16,
21].
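The grouping guarantee described above can be checked directly. The following is a minimal sketch on synthetic data; the student identifiers, the placeholder feature, and the choice of three folds are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic example: 6 hypothetical students with 2 to 4 records each.
student_ids = np.array([1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 6, 6])
X = np.arange(len(student_ids)).reshape(-1, 1)   # placeholder feature
y = np.linspace(10.0, 20.0, len(student_ids))    # placeholder target

gkf = GroupKFold(n_splits=3)
fold_overlaps = []
for train_idx, test_idx in gkf.split(X, y, groups=student_ids):
    train_students = set(student_ids[train_idx])
    test_students = set(student_ids[test_idx])
    # Students present on both sides of the split (should be none).
    fold_overlaps.append(train_students & test_students)
    print("held-out students:", sorted(test_students))
```

Every entry of `fold_overlaps` is an empty set, which is the property that prevents a student's earlier records from leaking into the evaluation of their later ones.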
Interpretability of individual predictions is achieved through SHAP (SHapley Additive exPlanations), a post hoc framework grounded in cooperative game theory that quantifies the marginal contribution of each predictor by simulating all possible variable coalitions and producing attribution values satisfying the properties of efficiency, symmetry, and additivity [
22,
23]; specifically, TreeExplainer generates a global mean-importance bar chart and an individual-level waterfall plot that transforms a black-box prediction into an actionable model-derived explanation for the tutor [
4].
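To make the coalition-based definition concrete, the sketch below computes exact Shapley values by brute-force enumeration for a three-feature toy model. It is not how TreeExplainer works internally (that algorithm exploits tree structure and a background dataset); the toy model, the instance `x`, and the all-zero `baseline` are hypothetical choices for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition (small n only)."""
    players = list(range(n_features))
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


x = [1.0, 3.0, 2.0]          # instance to explain (hypothetical)
baseline = [0.0, 0.0, 0.0]   # reference values for "absent" features


def model(v):
    # Toy model with one interaction term between features 0 and 2.
    return 2.0 * v[0] + v[1] + v[0] * v[2]


def value_fn(coalition):
    # Features in the coalition take the instance value, the rest the baseline.
    v = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return model(v)


phi = shapley_values(value_fn, 3)
print(phi, sum(phi), model(x) - model(baseline))
```

The attributions sum to the difference between the prediction and the baseline output (the efficiency property), and the interaction term is split equally between the two features involved.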
The validation framework includes robustness and subgroup equity analysis. The robustness analysis verifies that the model’s error does not exhibit a systematic structure over time or across application conditions, confirming that the metrics reported on the full dataset are representative of the expected performance in each operational subset; the RMSE/MAE ratio stratified by academic period provides the standard indicator of this property, as values consistently below √2 ≈ 1.414 across all periods confirm that uncertainty is predominantly random and does not introduce cumulative systematic bias [
24].
The subgroup equity analysis serves to verify that the predictive performance of a machine learning model is not unevenly distributed across the groups composing the population of interest (students differentiated by gender, province of origin, age range, curricular level, and academic period), and that predictive uncertainty is not significantly higher for any particular subgroup—a condition that, if unmet, would imply that the system systematically directs less accurate alerts toward students belonging to structurally more vulnerable groups, amplifying rather than reducing pre-existing educational inequalities [
4]. From the algorithmic fairness perspective, the MAE parity ratio per subgroup, computed as MAE_subgroup/MAE_global, quantifies whether the model commits proportionally larger errors for specific groups; an operational threshold of ≤ 1.25 indicates that the subgroup error does not exceed the global error by more than 25%, a standard adopted from the artificial intelligence fairness literature [
4], and its exceedance in subgroups such as students aged over 30 years or underrepresented provinces signals the need for data enrichment or model adjustment prior to institutional deployment, ensuring that the system complies with the algorithmic accountability principles required of tutoring support systems in higher education [
6,
25].
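The parity ratio defined above (MAE_subgroup/MAE_global with a 1.25 threshold) can be sketched as follows. The function name, the subgroup labels, and the six data points are hypothetical and serve only to show the arithmetic.

```python
import numpy as np
import pandas as pd


def mae_parity_ratios(y_true, y_pred, groups, threshold=1.25):
    """MAE per subgroup divided by the global MAE, flagged against a threshold."""
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    global_mae = abs_err.mean()
    table = (
        pd.DataFrame({"group": np.asarray(groups), "abs_err": abs_err})
        .groupby("group")["abs_err"]
        .agg(mae="mean", n="size")
    )
    table["parity_ratio"] = table["mae"] / global_mae
    table["exceeds_threshold"] = table["parity_ratio"] > threshold
    return table


# Made-up values: three records per subgroup.
y_true = [15.0, 16.0, 14.0, 18.0, 12.0, 17.0]
y_pred = [14.0, 16.0, 15.0, 17.0, 15.0, 17.0]
groups = ["under_30"] * 3 + ["over_30"] * 3

parity = mae_parity_ratios(y_true, y_pred, groups)
print(parity)
```

Reporting the subgroup size `n` next to each ratio is a deliberate choice here: a ratio above the threshold in a very small subgroup is a weaker signal than the same ratio in a large one.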
Predictive uncertainty is validated through MAE, RMSE, and the RMSE/MAE ratio: when this ratio approaches √2 ≈ 1.414, errors are predominantly random and follow an approximately normal distribution; when it exceeds this threshold, outlier errors dominate and indicate systematic bias [
24]; stratified decomposition of this ratio by academic period and sociodemographic subgroup constitutes the standard robustness analysis for verifying that the model’s predictive uncertainty does not compromise its operational applicability as a differentiated tutoring support tool [
24,
26].
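The stratified RMSE/MAE check described above can be sketched as a small helper. The column name `SUM_PRED` for model predictions and the eight sample rows are assumptions for illustration; the helper only computes the ratio per period so that it can be compared with the √2 reference value cited in the text.

```python
import numpy as np
import pandas as pd


def rmse_mae_by_period(df, period_col="PE_ORD", true_col="SUM", pred_col="SUM_PRED"):
    """MAE, RMSE, and their ratio for each academic period."""
    err = df[true_col] - df[pred_col]
    tmp = pd.DataFrame({"period": df[period_col], "abs": err.abs(), "sq": err ** 2})
    out = tmp.groupby("period").agg(mae=("abs", "mean"), mse=("sq", "mean"))
    out["rmse"] = np.sqrt(out["mse"])
    out["ratio"] = out["rmse"] / out["mae"]
    return out[["mae", "rmse", "ratio"]]


# Made-up data: period 0 has uniform errors, period 1 has one large outlier.
sample = pd.DataFrame({
    "PE_ORD":   [0, 0, 0, 0, 1, 1, 1, 1],
    "SUM":      [15.0, 15.0, 15.0, 15.0, 16.0, 16.0, 16.0, 16.0],
    "SUM_PRED": [14.0, 16.0, 14.0, 16.0, 16.0, 16.0, 16.0, 12.0],
})
ratios = rmse_mae_by_period(sample)
print(ratios)
print("reference value:", np.sqrt(2))
```

Both periods have the same MAE in this example, yet the ratio separates them: a single large error inflates RMSE far more than MAE.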
In mixed linear models and machine learning algorithms, the way categorical variables are encoded directly influences both the interpretation of coefficients and the predictive performance of the model. Among the most widely used methods are dummy coding, effect coding, ordinal encoding, one-hot encoding, and target encoding.
Dummy coding, or reference coding, assigns a binary column (0/1) to each category except one baseline (reference) category; in this way, the coefficients represent differences relative to that reference. This scheme is the default in libraries such as statsmodels and lme4 [
27]. Effect coding, on the other hand, centers the estimates with respect to the grand mean of the dependent variable, allowing each category to be interpreted as a deviation from the overall average, which is useful when the interest lies in main effects rather than in an arbitrary reference category [
27].
Ordinal encoding assigns consecutive integer values while preserving the natural ordering of categories—for example, curricular levels, from the seventh to the thirteenth level. This approach is particularly appropriate for variables with a hierarchical structure and is efficiently exploited by tree-based algorithms such as Random Forest and XGBoost, which learn split points along the ordinal axis [
11]. In contrast, one-hot encoding generates an independent binary column for each category without assuming any ordering relationship among them, making it the most methodologically appropriate option for nominal variables such as gender or province of origin, where imposing a numerical scale would introduce artificial hierarchies and potential biases in the models [
8].
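The contrast between ordinal and one-hot encoding can be illustrated with scikit-learn's encoders. This is a minimal sketch: the four rows and the gender labels "F" and "M" are hypothetical, and the ordered level list simply follows the seventh-to-thirteenth-level example given above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "COD_NI": [7, 9, 13, 8],          # curricular level (ordered)
    "ID_SEX": ["F", "M", "F", "F"],   # nominal; labels are hypothetical
})

# Ordinal: one integer column that preserves the order 7 < 8 < ... < 13.
ordinal = OrdinalEncoder(categories=[[7, 8, 9, 10, 11, 12, 13]])
level_codes = ordinal.fit_transform(df[["COD_NI"]])

# One-hot: one binary column per category, with no implied order.
onehot = OneHotEncoder(handle_unknown="ignore")
sex_matrix = onehot.fit_transform(df[["ID_SEX"]]).toarray()

print(level_codes.ravel())
print(onehot.categories_[0], sex_matrix, sep="\n")
```

Passing the full level list explicitly keeps the integer codes stable even when a given data partition happens to lack one of the levels.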
Target encoding replaces each category with the mean of the outcome variable within that group. This method offers important advantages for high-cardinality variables, as it avoids the excessive dimensionality increase produced by one-hot encoding [27]. Pargent et al. [28] demonstrated that regularized target encoding consistently outperforms one-hot and ordinal encoding when the number of categories is large. In practical terms, the literature treats variables with fewer than 10 categories as low cardinality, those with 10 to 50 as medium cardinality, and those with more than 50 as high cardinality. As the number of levels increases, one-hot encoding generates highly sparse matrices, which negatively affects the stability of both linear and ensemble models [29].
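One simple way to regularize target encoding is additive smoothing toward the global mean, sketched below. This is an illustrative variant, not necessarily the regularization scheme evaluated in the cited work; the function name, the `smoothing` parameter, and the four training rows are assumptions.

```python
import pandas as pd


def smoothed_target_encode(train, cat_col, target_col, smoothing=10.0):
    """Category means shrunk toward the global mean; fit on training data only."""
    global_mean = train[target_col].mean()
    stats = train.groupby(cat_col)[target_col].agg(["mean", "count"])
    encoded = (
        (stats["count"] * stats["mean"] + smoothing * global_mean)
        / (stats["count"] + smoothing)
    )
    return encoded, global_mean


# Made-up training rows: province codes and performance scores.
train = pd.DataFrame({
    "COD_PRO": ["17", "17", "17", "09"],
    "SUM": [16.0, 18.0, 17.0, 11.0],
})
enc, global_mean = smoothed_target_encode(train, "COD_PRO", "SUM", smoothing=1.0)

# Categories unseen during training fall back to the global mean.
new_codes = pd.Series(["17", "05"])
encoded_new = new_codes.map(enc).fillna(global_mean)
print(enc.to_dict(), encoded_new.tolist())
```

Because the encoding uses the outcome variable, it has to be fitted inside each training fold; fitting it on the full dataset would leak target information into the evaluation.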
The explanatory attribution matrix constitutes a tool that integrates multiple explainability metrics to estimate the relative importance of each predictor within the model [
29]. Its primary utility lies in resolving discrepancies between different interpretation methods—for example, when the coefficients of a mixed linear model differ from the SHAP values obtained from ensemble models [
4,
19]. At the global level, this matrix allows identification of whether a variable exerts a protective or risk-associated effect on academic performance, while at the individual level it facilitates recognition of which factors most strongly influence the specific performance of each student [
29]. In intelligent diagnostic systems applied to education, this interpretive layer is fundamental because it translates the model’s quantitative output into actionable information for faculty and tutors, strengthening evidence-based decision-making [
4,
24].
The remainder of the paper is structured as follows:
Section 2 reviews related work.
Section 3 describes the system architecture and computational flow and presents the web prototype implemented in Streamlit.
Section 4 details the materials and methods.
Section 5 presents the results.
Section 6 discusses the findings.
Section 7 presents the limitations and future work.
Section 8 presents the conclusions.
3. System Architecture and Computational Flow
The single-institution intelligent diagnostic system, which identifies low mid-period academic performance in order to activate proactive and preventive tutoring before a final assessment, comprises seven functionally distinct modules together with an interactive web interface, as shown in Figure 1.
3.1. Data Ingestion Module
This module loads the data into the system using the pandas 2.0 library. It receives as input a CSV file, which is read with the pd.read_csv function using a semicolon (;) as the column separator. This file is an anonymized secondary dataset. The output is a dataframe containing three academic variables: academic period (ID_PE), curricular level (COD_NI), and course (ID_MAT); three sociodemographic variables: gender (ID_SEX), province of origin (COD_PRO), and student age (ED_EST); two performance variables: first partial grade (N1) and second partial grade (N2); and one student identifier variable (COD_EST).
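The ingestion step can be sketched as follows. The two rows, the column order, and the identifier formats are made up to stand in for the anonymized file; in practice a file path would be passed to pd.read_csv instead of the in-memory buffer.

```python
import io

import pandas as pd

# Two made-up rows standing in for the anonymized institutional file.
sample_csv = io.StringIO(
    "COD_EST;ID_PE;COD_NI;ID_MAT;ID_SEX;COD_PRO;ED_EST;N1;N2\n"
    "E001;0;7;M101;F;17;21;6.4;8.5\n"
    "E002;0;8;M205;M;09;23;5.0;6.0\n"
)

# Reading code-like columns as strings preserves leading zeros (e.g. "09").
df = pd.read_csv(
    sample_csv,
    sep=";",
    dtype={"COD_EST": str, "COD_PRO": str, "ID_MAT": str},
)
print(df.dtypes)
```

Treating the province code as text is a design choice made here so that it is handled as a nominal label rather than a number in later steps.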
3.2. Exploratory Data Analysis (EDA) Module
This module is responsible for performing descriptive statistical analysis of the input variables to understand the data structure, academic context, and the presence of missing values. The libraries used in this module include matplotlib, seaborn, scipy.stats, pandas, and numpy. The EDA determined the following aspects:
Data structure: variables, data types, cardinality, normality, behavior, and dispersion measures.
Total number of records in the database.
Number of academic periods, curricular levels, and courses.
Number of curricular levels per academic period.
Number of courses per level.
Total number of unique enrolled students.
Number of repeated observations per student.
Number of female and male students.
Number of students per province.
Age: minimum, maximum, and mean.
Grades: minimum, maximum, and mean—by period, by level, by course, and other breakdowns.
At this stage, the module receives as input a dataframe with 31,970 records and produces as output a dataframe with an analytic sample of 18,604 records.
3.3. Data Preprocessing Module
This module is responsible for performing feature engineering and variable transformation to ensure data compatibility in subsequent analyses. It receives as input a dataframe with 9 variables. The first partial grade (N1) is rescaled from 8 to 10 points (N1_C) in order to facilitate comparison with the second partial grade (N2) and obtain the academic performance variable (SUM = N1_C + N2), which represents the sum of the first two partial assessments within the same academic period on a 0–20 point scale. Age is log-transformed: np.log1p(ED_EST). The variables PE_ORD and COD_NI are then transformed as ordinal variables, while COD_PRO, ID_SEX, and ID_MAT are transformed as nominal variables using the scikit-learn library. The output is a dataframe with 12 variables.
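The three numeric transformations described above can be sketched in a few lines. The function name and the output column `ED_EST_LOG` are hypothetical; the rescaling, the sum, and np.log1p follow the text.

```python
import numpy as np
import pandas as pd


def add_performance_features(df):
    """Rescale N1 to a 10-point scale, build SUM, and log-transform age."""
    out = df.copy()
    out["N1_C"] = out["N1"] * 10.0 / 8.0          # 8-point scale -> 10-point scale
    out["SUM"] = out["N1_C"] + out["N2"]          # 0-20 performance variable
    out["ED_EST_LOG"] = np.log1p(out["ED_EST"])   # hypothetical column name
    return out


raw = pd.DataFrame({"N1": [8.0, 4.0], "N2": [10.0, 6.5], "ED_EST": [20, 30]})
features = add_performance_features(raw)
print(features)
```

A full score on both partial assessments maps to SUM = 20, which confirms that the rescaled grade and N2 contribute equally to the composite.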
3.4. Inferential Analysis Module (MLM)
This module is responsible for implementing the inferential framework of the system, considering the data structure. The model formula is defined using the following variables: academic performance (SUM), age (ED_EST), province of origin (COD_PRO), gender (ID_SEX), academic period (PE_ORD), and curricular level (COD_NI) to implement the mixed linear model; the student identifier (COD_EST) is used as the grouping variable. The module was implemented using Python’s statsmodels library. Additionally, custom functions were created to obtain marginal R², conditional R², the intraclass correlation coefficient (ICC), and semi-partial R² coefficients per predictor.
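A random-intercept model of this kind, together with the derived variance-based coefficients, might be sketched with statsmodels as shown below. The data are synthetic, only the period predictor is included for brevity, and the way the coefficients are computed (variance of fixed-effect predictions, random-intercept variance, residual variance) is an assumption about how such custom functions could be written, not the authors' code. Semi-partial R² is not covered.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic longitudinal data: 60 hypothetical students, 6 periods each.
rng = np.random.default_rng(0)
n_students, n_obs = 60, 6
student = np.repeat(np.arange(n_students), n_obs)
period = np.tile(np.arange(n_obs), n_students)
intercepts = rng.normal(0.0, 1.5, n_students)[student]   # student-level effects
y = 15.0 + 0.3 * period + intercepts + rng.normal(0.0, 1.0, student.size)
df = pd.DataFrame({"SUM": y, "PE_ORD": period, "COD_EST": student})

# Random intercept per student; fixed effect for period only.
result = smf.mixedlm("SUM ~ PE_ORD", df, groups=df["COD_EST"]).fit(reml=True)

var_fixed = float(np.var(result.model.exog @ np.asarray(result.fe_params)))
var_random = float(np.asarray(result.cov_re)[0, 0])
var_resid = float(result.scale)
total = var_fixed + var_random + var_resid

r2_marginal = var_fixed / total
r2_conditional = (var_fixed + var_random) / total
icc = var_random / (var_random + var_resid)
print(round(r2_marginal, 3), round(r2_conditional, 3), round(icc, 3))
```

The gap between the conditional and marginal values reflects how much of the variation is tied to individual students rather than to the observed predictors.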
3.5. Predictive Analysis Module (Pipeline)
This module receives as input a dataframe with the 12 variables obtained from the data preprocessing module. A pipeline then applies the preprocessing steps and trains four candidate algorithms (Ridge, XGBoost, MLP, and Random Forest), which are compared via GroupKFold cross-validation (k = 5, grouped by student identifier COD_EST). The output is the best-performing predictive model based on MAE, RMSE, and R² across all folds.
A pipeline is a sequential computational structure that chains the preprocessing, feature engineering, and predictive model training stages in an ordered and reproducible manner into a single object that can be fitted, evaluated, and deployed as a coherent unit [
11]. Its primary utility lies in ensuring that all transformations applied to the training data—OneHotEncoder, OrdinalEncoder, FunctionTransformer, and StandardScaler—are fitted exclusively on the training partition of each cross-validation fold and subsequently applied to the test partition without accessing its distribution, thereby preventing data leakage that produces artificially optimistic performance estimates [
8].
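A pipeline of this shape can be sketched with scikit-learn as follows. The data are synthetic, the category values are made up, and only Ridge and Random Forest are compared to keep the example short and free of extra dependencies; XGBoost and MLP would be added to the `models` dictionary in the same way. Hyperparameters are placeholders, not the values used in the study.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, OneHotEncoder,
                                   OrdinalEncoder, StandardScaler)

# Synthetic stand-in for the preprocessed dataframe.
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "COD_EST": rng.integers(0, 60, n),
    "PE_ORD": rng.integers(0, 7, n),
    "COD_NI": rng.integers(7, 14, n),
    "ID_SEX": rng.choice(["F", "M"], n),
    "COD_PRO": rng.choice(["01", "09", "17"], n),
    "ED_EST": rng.integers(18, 45, n),
})
df["SUM"] = 12.0 + 0.4 * (df["COD_NI"] - 7) + rng.normal(0.0, 1.5, n)

preprocess = ColumnTransformer([
    ("ordinal",
     OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     ["PE_ORD", "COD_NI"]),
    ("nominal", OneHotEncoder(handle_unknown="ignore"), ["ID_SEX", "COD_PRO"]),
    ("age",
     Pipeline([("log", FunctionTransformer(np.log1p)), ("scale", StandardScaler())]),
     ["ED_EST"]),
])

models = {
    "Ridge": Ridge(alpha=1.0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}
features = ["PE_ORD", "COD_NI", "ID_SEX", "COD_PRO", "ED_EST"]  # COD_EST excluded
scores = {}
for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    # Encoders and scaler are refitted on the training part of every fold.
    cv_out = cross_validate(
        pipe, df[features], df["SUM"], groups=df["COD_EST"],
        cv=GroupKFold(n_splits=5),
        scoring=("neg_mean_absolute_error", "neg_root_mean_squared_error", "r2"),
    )
    scores[name] = {
        "MAE": -cv_out["test_neg_mean_absolute_error"].mean(),
        "RMSE": -cv_out["test_neg_root_mean_squared_error"].mean(),
        "R2": cv_out["test_r2"].mean(),
    }
print(pd.DataFrame(scores).T)
```

The student identifier is passed only through `groups`, so it steers the fold assignment without ever entering the feature matrix.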
3.6. Validation Module
This module is responsible for performing the error analysis of the winning predictive model, along with the robustness and subgroup equity analysis, using the GroupKFold splitter and the cross_val_predict function from the scikit-learn library. The outputs include error decomposition by academic period; MAE parity ratios by province of origin, age range, and gender; absolute error distribution plots by province of origin (COD_PRO) and gender; actual versus predicted performance plots; residual distribution; residuals versus fitted values; and model calibration plots.
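Out-of-fold predictions are the common input for all of these outputs. A minimal sketch of how they might be obtained is given below; the synthetic features, the group sizes, and the Random Forest settings are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_predict

# Synthetic data: 200 records belonging to 40 hypothetical students.
rng = np.random.default_rng(1)
n = 200
groups = rng.integers(0, 40, n)
X = rng.normal(size=(n, 4))
y = 14.0 + X[:, 0] + rng.normal(scale=0.5, size=n)

# Each record is predicted by a model that never saw that student's data.
oof_pred = cross_val_predict(
    RandomForestRegressor(n_estimators=30, random_state=0),
    X, y, groups=groups, cv=GroupKFold(n_splits=5),
)

residuals = y - oof_pred
mae = float(np.mean(np.abs(residuals)))
rmse = float(np.sqrt(np.mean(residuals ** 2)))
print(round(mae, 3), round(rmse, 3))
```

The resulting `residuals` array can then be grouped by period or subgroup, or plotted against the fitted values, without refitting the model.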
3.7. SHAP Explainability Module
This module is responsible for implementing the SHAP explainability framework. It receives as input the MLM and the winning model—Random Forest.
At the global level, SHAP explainability ranks model variables according to their mean importance (mean |SHAP value|) across the entire sample, identifying which predictors have the greatest systematic influence on predictions; at the individual level, it decomposes the prediction for a specific case to quantify the marginal contribution of each predictor variable to an individual prediction generated by a machine learning model [
22].
SHAP TreeExplainer was applied to the trained winning model to generate global feature importance rankings and individual-level Shapley value decompositions. The beeswarm plot and the mean absolute SHAP value bar chart constitute the interpretability layer for the tutor.
In the interactive web interface, the SHAP explainability framework generates waterfall plots and an explanatory attribution matrix to integrate the results of the inferential framework with the predictive framework, using the shap library (v0.42).
3.8. Interactive Web Interface
The interactive web interface was implemented using the Spyder 6.1 platform with Python 3.13 and Streamlit 1.54.0. The preliminary diagnostic prototype transfers the results of the inferential framework and the machine learning framework to the individual tutor in an interpretable and practical format that requires no technical expertise [
7,
10]. The preliminary diagnostic prototype integrates Module 4 (inferential layer) and Module 5 (Random Forest predictive layer) with Module 7 (SHAP explainability) into a three-tab interface, accessible at the following web address:
https://prototypeews-srcpuzg4c8otwfazefmj2q.streamlit.app/, accessed on 1 June 2026.
The interactive web interface integrates six functional elements organized across a left sidebar and a main panel with three explanatory attribution analysis tabs.
The left sidebar contains a dropdown menu where the tutor can select a student identifier, which then displays four interface elements in the main panel. The first element is the student’s predicted academic performance (SUM) on a 0–100 scale. The second is a color-coded traffic-light risk indicator (green: low risk, yellow: moderate risk, red: critical alert). The third element is a strategic action recommendation in natural language for the selected student. The fourth element is a set of three tabs. The first tab displays an individual-level waterfall plot presenting the results of the inferential SHAP (MLM layer). The second tab displays a waterfall plot presenting the results of the predictive SHAP (Random Forest model), which captures nonlinear interaction patterns that may protect performance in specific level–course–period combinations. The third tab presents the Explanatory Attribution Matrix, which maps the SHAP values from both frameworks (inferential and predictive) onto the study variables and computes a Priority Index, identifying the dominant risk-associated factor and automatically generating the natural language clinical directive for the tutor.
Figure 2 shows a screenshot of the web interface of the implemented platform.
3.9. Technical Implementation
The intelligent diagnostic system was implemented in two phases: a testing phase and a deployment phase. The testing phase was implemented using the Jupyter 7.5 platform and encompassed data ingestion, exploratory data analysis, data preprocessing, the inferential framework, the predictive framework, the SHAP explainability framework, and the validation framework. Within the predictive framework, a model comparison was conducted using the Ridge, XGBoost, MLP, and Random Forest algorithms, with the latter emerging as the winning model.
In the deployment phase, the preliminary diagnostic prototype was created through a project on the Spyder 6.1 platform using Python 3.13 and Streamlit (v1.54.0). The modules were developed in accordance with the system architecture, and each element of the prototype interface was implemented, including the alert messages and the tabs displaying the results of the inferential framework, the predictive framework, and the SHAP explanatory attribution matrix. The MLM was fitted using the MixedLM module from statsmodels; the Random Forest model was serialized using joblib for deployment within the Streamlit application. SHAP TreeExplainer and waterfall plots were generated using shap (v0.42). The application was first deployed locally, after which the source files were uploaded to GitHub Version 3.5.12 and the prototype was deployed on the Streamlit cloud at the following address:
https://prototypeews-srcpuzg4c8otwfazefmj2q.streamlit.app/, accessed on 1 June 2026.
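The serialization step mentioned above can be sketched as a joblib round trip. The model here is trained on synthetic data and the file name is hypothetical; the point is only that the restored object reproduces the original predictions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for the trained Random Forest.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)
model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_model.joblib")   # hypothetical file name
    joblib.dump(model, path)       # done once, after training
    restored = joblib.load(path)   # done by the web application at startup

same_predictions = bool(np.allclose(model.predict(X), restored.predict(X)))
print(same_predictions)
```

Loading a serialized model generally requires a compatible scikit-learn version in the deployment environment, so pinning library versions alongside the model file is advisable.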
4. Materials and Methods
4.1. Data Ingestion
The intelligent diagnostic system designed to assist faculty tutors in analyzing academic performance first analyzes input variables and their effects through an MLM, then predicts students’ academic performance using supervised learning algorithms, and uses these predictions to identify low-performance profiles that require early tutorial intervention.
The CSV file contains academic variables including: academic period (PE_ORD), curricular level (COD_NI), and course (ID_MAT); sociodemographic variables including: student age (ED_EST), gender (ID_SEX), and the student’s province of origin code (COD_PRO); and performance variables consisting of partial assessment grades: (N1) for the first partial assessment and (N2) for the second partial assessment of the same academic period, together with the student identifier (COD_EST).
4.2. Exploratory Data Analysis (EDA)
The exploratory data analysis was conducted by applying descriptive statistics to the dataset variables, grouping variables, using Python functions, and creating custom functions to understand the data structure and its academic context.
The statistical analysis revealed that the initial dataset contains a total of 31,970 records, 7 academic periods, 13 curricular levels, 65 courses, and 3 partial grades per academic period: the first grade (N1) scored out of 8 points, the second (N2) out of 10 points, and the third out of 10 points. Given the proactive and preventive tutoring paradigm, only the first two partial grades (N1) and (N2) of the same academic period were considered for this study. Under the institutional criterion, a partial assessment score of 7 points or less indicates low academic performance; applied to the sum of the first two partial assessments, this corresponds to the low-performance threshold used in this study (SUM < 14). Student ages range from 18 to 44 years; students originate from 22 provinces of Ecuador and include both male and female students.
Variable grouping enabled the extraction of the number of students with unique records, the number of male students, the number of female students, the number of students enrolled per academic period and per level, the number of observations per student, and the number of students per academic period and curricular level. This last grouping revealed that the feature matrix exhibits an unbalanced hierarchical nesting configuration with sequential left truncation. The curricular level encompasses longitudinal trajectories that asymptotically converge at the terminal level (Level 13). This architecture comprises 7 sequential sub-cohorts (from cohort 1–13 through cohort 7–13), where each subsequent level naturally omits observations from prior periods, which motivated the adoption of a selective intentional sampling strategy to ensure longitudinal comparability between period 0 and period 6, controlling for the confounding effect arising from the structural reduction in levels across periods. Accordingly, for all subsequent analyses, the dataset was restricted to curricular levels 7 through 13. This decision is justified because these levels concentrate the largest proportion of students enrolled in courses belonging to the clinical sciences and hospital practice phases, which constitute the core training stage of the medical program and present an academic structure that differs substantially from that of the initial curricular levels.
Figure 3 illustrates the hierarchical data structure, where each academic period contains levels; each level includes courses; each course is divided into parallel sections; and each parallel section includes enrolled students.
The final analytic sample comprises 18,604 records corresponding to 1264 unique students across seven consecutive academic periods (2017–2020). The number of observations per student ranges from 1 to 30. The number of courses spanning levels 7 through 13 is 30. The records in the analytic sample were restricted to students enrolled in curricular levels 7 through 13, while records corresponding to other levels were excluded due to the structural reduction in levels across academic periods. There are no missing values. Furthermore, the dataset does not include course-level, section-level, or instructor-level identifiers; the student identifier (COD_EST) is the only grouping variable available.
Table 1 presents a summary of the data selected for the final sample.
As shown in Table 1, the dataset includes the academic period (PE_ORD), the period description, the number of database records per academic period, the number of students enrolled per academic period from level 7 to level 13, the number of students enrolled by gender, and the number of underperforming students according to the institutional threshold defined for low performance (SUM < 14 points).
Figure 4 presents the distribution of academic performance by academic period, represented through box plots. In general, median academic performance levels are relatively consistent across periods, typically falling within a medium-to-high range, suggesting overall stability in academic performance over time.
4.3. Data Preprocessing
At this stage, feature engineering and variable transformation were performed to improve data quality and prepare the variables for inferential analysis and predictive modeling. Feature engineering addressed the creation of the following variables:
Province of origin (COD_PRO): This variable reflects the territorial context without requiring the collection of additional data. It is derived from the national identification number, whose first two digits indicate the student’s province of origin.
Age (ED_EST): Calculated from the date of birth. A logarithmic transformation was applied because the variable initially exhibited a positively skewed distribution (Shapiro–Wilk p < 0.001). The logarithmic transformation reduces skewness and brings the variable closer to normality, thereby satisfying the regression model assumptions [36]. The variables (ED_EST) and (COD_PRO) correspond to pre-calculated institutional indicators and are part of the anonymized database.
First partial examination grade (N1_C): Rescaled to a ten-point scale using the linear transformation G10 = (G8 × 10)/8, which ensures comparability with N2 and enables the construction of the composite academic performance variable SUM.
Academic performance (SUM): The sum of the two partial assessments (N1_C and N2) within the same academic period, yielding a continuous variable on a 0–20-point scale. This variable serves as the dependent variable in the predictive model.
The selection of variable encoding techniques was grounded in the nature of each variable.
Table 2 summarizes the techniques applied.
As shown in Table 2, the academic period (PE_ORD) and curricular level (COD_NI) variables have a natural order, which motivated an ordinal categorical transformation using the pandas expression .astype('category').cat.codes for the MLM, while OrdinalEncoder was used for the predictive models in order to respect the ordinal scale and the curricular progression sequence in tree-based models.
The EDA revealed that the age variable (ED_EST) exhibits positive skewness and lacks a Gaussian distribution; therefore, to ensure numerical stability in distribution-sensitive estimators, a logarithmic transformation was applied for both the MLM and the predictive models. Additionally, a StandardScaler() transformation was applied to this variable for use in the predictive models.
The course (ID_MAT), province of origin (COD_PRO), and gender (ID_SEX) variables, being nominal in nature, were subjected to indexed coding (Label Encoding) for the MLM, while OneHotEncoder was used for the predictive models, transforming each category into an independent binary column and eliminating any numerical hierarchy assumption that could introduce bias in the model’s linear estimators [8, 11].
The student identifier variable (COD_EST) is used as a grouping variable in the MLM, while in the predictive models it is excluded from the feature space to prevent overfitting through memorization due to its high cardinality.
The variable (SUM) is the system’s metric output dimension, providing a continuous performance scale evaluated in common by all estimators.
The preprocessing for the predictive framework was encapsulated in a scikit-learn Pipeline (sklearn.pipeline.Pipeline) built around a ColumnTransformer. This architecture ensures that scaling and encoding are fitted exclusively on the training partition of each fold, preventing test set contamination [11].
4.4. Inferential Framework: Three-Level Mixed Linear Model
For the inferential analysis, the following variables were selected: academic period (PE_ORD), curricular level (COD_NI), age (ED_EST), province of origin (COD_PRO), and gender (ID_SEX), based on their nature and structure. The course variable (ID_MAT) was excluded because of the structural confounding between the course variance component and the fixed effect of curricular level, and because including it overparameterized the random-effects structure. Accordingly, applying the parsimony criterion of Bates et al. [39], which holds that models with lower structural complexity tend to exhibit greater generalization capacity and lower risk of overfitting [40], the model was simplified to a single random effect (COD_EST), yielding more robust estimates.
The full model specification is:
$$Y_{ij} = \beta_0 + \beta_1\,\mathrm{PE\_ORD}_{ij} + \beta_2\,\mathrm{COD\_NI}_{ij} + \beta_3\,\mathrm{ID\_SEX}_{ij} + \beta_4\,\mathrm{COD\_PRO}_{ij} + \beta_5\,\log(\mathrm{ED\_EST}_{ij}) + u_j + \varepsilon_{ij}$$
where $Y_{ij}$ is the academic performance of student $j$ in period $i$, $\beta_0$ is the global intercept, $\beta_1$–$\beta_5$ are the fixed-effect coefficients for period, level, gender, province of origin, and log-transformed age, respectively, $u_j \sim N(0, \sigma^2_u)$ is the random intercept capturing persistent interindividual differences, and $\varepsilon_{ij} \sim N(0, \sigma^2)$ is the residual error. Multicollinearity was assessed using Variance Inflation Factors (VIF threshold < 5). Residual diagnostics confirmed approximate normality (Q-Q plots and Shapiro–Wilk test) and the absence of systematic heteroscedasticity patterns. Effect size was estimated using marginal and conditional $R^2$ following Nakagawa and Schielzeth [13], and semi-partial $R^2$ coefficients were computed for each fixed predictor to quantify independent contributions to outcome variance.
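For reference, the marginal and conditional coefficients of determination for a random-intercept model of this kind can be written as below. The symbol $\sigma^2_f$, the variance of the fixed-effects linear predictor, is notation introduced here for illustration; $\sigma^2_u$ and $\sigma^2$ are the random-intercept and residual variances defined above.

```latex
\begin{aligned}
R^2_{\text{marginal}} &= \frac{\sigma^2_f}{\sigma^2_f + \sigma^2_u + \sigma^2}, \\[4pt]
R^2_{\text{conditional}} &= \frac{\sigma^2_f + \sigma^2_u}{\sigma^2_f + \sigma^2_u + \sigma^2}, \\[4pt]
R^2_{\text{conditional}} - R^2_{\text{marginal}} &= \frac{\sigma^2_u}{\sigma^2_f + \sigma^2_u + \sigma^2}.
\end{aligned}
```

The marginal value is the share of total variance explained by the fixed effects alone, and the conditional value adds the share attributable to persistent differences between students. Their difference is therefore the proportion of total variance captured by the random intercept.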
4.5. Predictive Framework: Supervised Regression Process
4.5.1. Definition of the Prediction Problem
The EDA enabled the characterization of variables, from which a supervised regression problem was formulated. A variable mapping and segmentation strategy was defined for the preprocessing pipeline, structured according to the nature and cardinality of the data. The core preprocessing architecture was then built with the ColumnTransformer and Pipeline classes of scikit-learn (version 1.8.0).
The predictor matrix X comprises the following variables: PE_ORD (ordinal), COD_NI (ordinal), log-age np.log1p(ED_EST) (continuous), ID_SEX (nominal, One-Hot encoding), and COD_PRO (nominal, One-Hot encoding). The course variable (ID_MAT) was included in the final predictive model as a nominal variable with 30 unique categories corresponding to the courses spanning curricular levels 7 through 13 and was encoded using One-Hot encoding. The target variable y is SUM (continuous, 0–20 scale). The student identifier COD_EST is used exclusively as the grouping variable for GroupKFold cross-validation and is excluded from the predictor matrix X to prevent identity-based memorization and the overfitting associated with its high cardinality (1264 unique categories).
4.5.2. Hyperparameters and Software Configuration
Hyperparameter configuration was performed through a manual grid search guided by values documented in the reference literature for comparable regression tasks [4, 18, 40].
Table 3 presents the complete technical specification.
All models were encapsulated within scikit-learn Pipeline objects (sklearn.pipeline.Pipeline) with a preprocessing step (StandardScaler for numeric features; OneHotEncoder for categorical features) applied before model fitting. A fixed seed (random_state = 42) was set consistently across all stochastic components.
4.5.3. Validation Strategy
GroupKFold cross-validation (k = 5) was used, with the student identifier (COD_EST) as the grouping variable to prevent data leakage. This strategy ensured that records corresponding to the same student were not included simultaneously in the training and test sets [41], providing a realistic estimate of model performance on previously unseen students. MAE, RMSE, and R² were recorded at the fold level; means and standard deviations across the five folds are reported.
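A minimal sketch of the grouped validation loop follows. The data are synthetic, the `groups` array stands in for COD_EST, and the model settings are placeholders, so the numbers it prints bear no relation to the study's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GroupKFold

# Synthetic stand-in data: several records per (hypothetical) student.
rng = np.random.default_rng(42)
n_students, n_records = 60, 300
groups = rng.integers(0, n_students, n_records)   # stand-in for COD_EST
X = rng.normal(size=(n_records, 4))
y = 14 + 2 * X[:, 0] + rng.normal(size=n_records)

fold_metrics = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    # No student contributes records to both partitions.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

    model = RandomForestRegressor(n_estimators=50, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])

    mae = mean_absolute_error(y[test_idx], pred)
    rmse = np.sqrt(mean_squared_error(y[test_idx], pred))
    r2 = r2_score(y[test_idx], pred)
    fold_metrics.append((mae, rmse, r2))

fold_metrics = np.array(fold_metrics)
means = fold_metrics.mean(axis=0)
stds = fold_metrics.std(axis=0, ddof=1)
for name, m, s in zip(["MAE", "RMSE", "R2"], means, stds):
    print(f"{name}: {m:.3f} ± {s:.3f}")
```

The student identifier is used only to build the folds and never enters the feature matrix, which matches the role the text assigns to COD_EST.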
4.5.4. Risk Alert Threshold and Classification Metrics
The continuous predictions of the Random Forest model are converted into binary risk alerts using the institutional threshold of SUM < 14 points. Sensitivity, specificity, and AUC-ROC are computed from the cross-validated predictions alongside the regression metrics.
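The conversion from continuous predictions to alerts can be sketched as follows. The observed and predicted values are invented for illustration, and using the negated prediction as the ranking score for AUC-ROC is an assumption made here (a lower predicted SUM means higher risk); the source does not state how the score was defined.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

THRESHOLD = 14.0  # institutional cut-off: SUM < 14 flags academic risk

# Illustrative observed and cross-validated predicted SUM values.
y_true = np.array([10.0, 13.5, 14.0, 16.0, 18.0, 12.0])
y_pred = np.array([11.0, 14.5, 15.0, 13.0, 17.5, 12.5])

true_risk = (y_true < THRESHOLD).astype(int)  # 1 = student actually at risk
alert = (y_pred < THRESHOLD).astype(int)      # 1 = alert raised by the model

tn, fp, fn, tp = confusion_matrix(true_risk, alert, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)   # at-risk students correctly flagged
specificity = tn / (tn + fp)   # not-at-risk students correctly left unflagged

# Lower predicted SUM means higher risk, so the score is the negated prediction.
auc = roc_auc_score(true_risk, -y_pred)

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} AUC={auc:.3f}")
```

Sensitivity and specificity depend on the 14-point cut-off, whereas AUC-ROC summarizes ranking quality across all possible cut-offs.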
4.6. SHAP Explainability Framework
In the testing phase, inferential SHAP extracts the fixed-effect parameters, which play the role that SHAP values play for tree-based models, and produces a bar chart showing the direct impact of each variable in the mixed linear model.
Predictive SHAP extracts the estimators and preprocessors from the pipeline and processes the data matrix to obtain a strategic sample of 500 rows, computes the SHAP attribution values, and reconstructs the SHAP explanation object. The results are presented in a SHAP beeswarm plot, which provides a global explanation, consolidating the impact of all variables across the entire student population of the analytic sample into a single view, ranking features from top to bottom according to their importance.
In the deployment phase, SHAP waterfall plots are used to show the academic tutor which specific factor is pushing a student toward pedagogical risk. SHAP TreeExplainer [22] was applied to the trained Random Forest model, whereas attributions for the mixed linear model were derived from its fixed-effect coefficients. Global feature importance is quantified using mean absolute SHAP values.
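The link between fixed-effect coefficients and SHAP values can be made explicit with the standard result for linear models. The expressions below are a sketch under the usual assumption of independent features, not necessarily the exact computation used in the system; the symbols $\phi_k$ (attribution of feature $k$), $I_k$ (global importance), and $n$ (number of explained records) are introduced here for illustration.

```latex
\begin{aligned}
\phi_k(x) &= \beta_k \,\bigl(x_k - \mathbb{E}[x_k]\bigr), \\[4pt]
I_k &= \frac{1}{n} \sum_{i=1}^{n} \bigl|\phi_k(x_i)\bigr|.
\end{aligned}
```

For a linear predictor, the attribution of a feature is its coefficient multiplied by the feature's deviation from its mean, which is why the fixed-effect estimates can stand in for SHAP values at the population level. The second expression is the mean absolute SHAP value used to rank features globally, and it applies equally to the attributions produced by TreeExplainer.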
4.7. Validation Framework: Robustness and Subgroup Equity Analysis
This framework performs robustness and subgroup equity analysis to ensure that the selected model is reliable across the entire student population. Predictive robustness was assessed through error decomposition stratified by period and by demographic and geographic subgroups. The RMSE/MAE ratio serves as an indicator of error structure: values below √2 ≈ 1.414 indicate predominantly random error consistent with a normal residual distribution; values substantially above √2 indicate the presence of systematic bias [24].
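The stratified error decomposition can be sketched as a small helper. The function name `error_ratio_by_group`, the column PRED, and the toy data are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

def error_ratio_by_group(df, group_col, y_col="SUM", pred_col="PRED"):
    """Return n, MAE, RMSE and the RMSE/MAE ratio for each subgroup."""
    err = df[pred_col] - df[y_col]
    out = (
        df.assign(abs_err=err.abs(), sq_err=err ** 2)
          .groupby(group_col)
          .agg(n=("abs_err", "size"),
               MAE=("abs_err", "mean"),
               MSE=("sq_err", "mean"))
    )
    out["RMSE"] = np.sqrt(out["MSE"])
    out["ratio"] = out["RMSE"] / out["MAE"]
    return out[["n", "MAE", "RMSE", "ratio"]]

# Toy data: period 1 has evenly spread errors, period 2 has one large miss.
toy = pd.DataFrame({
    "PE_ORD": [1, 1, 1, 1, 2, 2, 2, 2],
    "SUM":    [15, 15, 15, 15, 12, 12, 12, 12],
    "PRED":   [16, 14, 16, 14, 12, 12, 12, 16],
})
table = error_ratio_by_group(toy, "PE_ORD")
table["above_sqrt2"] = table["ratio"] > np.sqrt(2)
print(table)
```

Both toy periods have the same MAE of 1, but the single four-point miss in period 2 doubles its RMSE, pushing the ratio to 2 and above the √2 reference. The same call with COD_PRO or ID_SEX as `group_col` would give the geographic and demographic breakdowns.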
8. Conclusions
This paper presents the design and implementation of a single-institution intelligent diagnostic system to identify low mid-period academic performance with the aim of activating proactive and preventive tutoring. The system is intended to assist faculty tutors in analyzing academic performance in medical education. The contribution of this work is articulated across seven dimensions with direct empirical support.
First, the system uses the first two partial grades of the same academic period as an intermediate checkpoint, transforming academic, sociodemographic, and academic performance data into pedagogical inputs that activate proactive and preventive tutoring before academic lag consolidates at the end of the period, constituting an intervention window with greater corrective potential than the reactive mechanisms documented in the literature.
Second, the integration of a mixed linear model (marginal R² = 0.297; conditional R² = 0.473; ICC = 0.176) with a Random Forest regression pipeline that produces individual risk scores (MAE = 1.267 ± 0.04; R² = 0.551 ± 0.02) simultaneously addresses the inferential question of which factors matter and by how much, and the predictive question of how this specific student will perform, a combination not documented in any prior intelligent diagnostic system for medical education.
Third, GroupKFold cross-validation grouped by student identifier (COD_EST) ensures that records from the same student never appear simultaneously in training and test sets, producing realistic generalization estimates for unseen students and avoiding the performance overestimation that affects studies that omit this grouping strategy.
Fourth, the implementation of the dual SHAP explainability framework (at the population level through MLM coefficients and at the individual level through SHAP TreeExplainer on Random Forest) produces interpretable low-performance profiles that the faculty tutor can directly use: curricular level emerges as the dominant predictor in both the inferential analysis (semi-partial R² = 0.044; β = 0.577) and the predictive analysis (mean |SHAP| = 1.28), providing cross-framework coherence that strengthens the validity of the findings.
Fifth, the robustness and subgroup equity analysis confirms that predictive uncertainty is predominantly random across all academic periods (RMSE/MAE between 1.29 and 1.38, consistently below √2 = 1.414) and that the system produces equitable estimates across 21 of 22 provinces and both genders (parity ratios ≤ 1.25), with the exception of province code 18 and students aged over 30 years, which identify priority areas for data enrichment prior to institutional deployment.
Sixth, the exploratory data analysis enables adequate characterization of variables and shows that the resulting choice of encoding and transformation for each variable materially affects the results and conclusions of the study, with direct implications for reproducibility in educational machine learning research.
Finally, the Streamlit-based web prototype demonstrates the operational feasibility of the system by integrating the dual MLM-Random Forest framework into an interface accessible to faculty tutors without technical training. Its most innovative element is the SHAP Explanatory Attribution Matrix, which synthesizes the Shapley values from both analytical layers into a Priority Index per variable and automatically generates a model-derived explanation in natural language that transforms the numerical prediction into a concrete tutorial action. This functionality is absent from all intelligent diagnostic systems for medical education documented in the literature and represents the central technological contribution of this work.