Article
Peer-Review Record

Model Drift in Deployed Machine Learning Models for Predicting Learning Success

Computers 2025, 14(9), 351; https://doi.org/10.3390/computers14090351
by Tatiana A. Kustitskaya *, Roman V. Esin and Mikhail V. Noskov
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Reviewer 4: Anonymous
Submission received: 19 July 2025 / Revised: 16 August 2025 / Accepted: 20 August 2025 / Published: 26 August 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper examines how ML models for predicting student success deteriorate after deployment. The comparison between stable profile data and dynamic LMS activity is valuable, and the use of drift detection methods like PSI and SHAP-loss provides useful insights. However, several issues need to be addressed:  

  • The paper uses SHAP-loss values to detect drifting features but stops short of examining why these drifts occur. Are they meaningful changes in student behavior, or artifacts of data collection? The analysis would benefit from exploring which types of features are most susceptible to drift (particularly click-based metrics) and whether different feature engineering approaches might yield more stable predictors.
  • The manuscript introduces multiple models (LSSp, LSSc-fall, LSSc-spring, LSC) and prediction targets, but the relationships between them are not always clearly explained. A schematic diagram or summary table showing each model, its input features, target variables, and training periods would improve clarity.
  • The conclusion that LMS-derived features offer limited predictive value seems too strong given that LSSc models actually outperform LSSp on recall in several semesters. This suggests the issue may be data quality and standardization rather than an inherent limitation of behavioral features. The conclusion should acknowledge when and under what conditions LMS data proves useful.
  • All analyses use complex models (XGBoost, tree ensembles), but no comparisons are provided with simpler models such as logistic regression or decision trees. Adding such baselines would help isolate the role of model complexity in the observed degradation.
  • In the early sections, clearly define the binary target variable “at risk,” for example: "1 = failed at least one course, 0 = passed all courses."
  • Section 3.4 should clarify the temporal alignment for SHAP-loss comparisons. When comparing week 14 across different spring semesters, are these aligned by calendar date or by academic schedule?
  • The choice of Cohen's d > 0.2 as the threshold for meaningful drift appears somewhat arbitrary. Either justify this cutoff based on domain knowledge or test the sensitivity of the findings to different thresholds (a minimal sensitivity-check sketch is included after this list).
  • A glossary-style table mapping each input feature to its source, meaning, and update frequency (static vs. dynamic) would be helpful, particularly for practitioners trying to understand the data inputs and feature stability.
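
To make the threshold-sensitivity point concrete, here is a minimal Python sketch (illustrative only, using synthetic data and a hypothetical cohens_d helper rather than the authors' implementation) of re-checking how many features would be flagged under several candidate cutoffs:

```python
# Illustrative sketch: Cohen's d between a reference-period and a current-period
# feature sample, swept over several candidate drift thresholds.
import numpy as np

def cohens_d(reference: np.ndarray, current: np.ndarray) -> float:
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(reference), len(current)
    pooled_var = ((n1 - 1) * reference.var(ddof=1) + (n2 - 1) * current.var(ddof=1)) / (n1 + n2 - 2)
    return (current.mean() - reference.mean()) / np.sqrt(pooled_var)

# Synthetic stand-in for per-feature samples from two semesters.
rng = np.random.default_rng(0)
features = {f"feature_{i}": (rng.normal(0.0, 1.0, 500), rng.normal(0.05 * i, 1.0, 500))
            for i in range(10)}

for threshold in (0.1, 0.2, 0.3, 0.5):
    flagged = [name for name, (ref, cur) in features.items()
               if abs(cohens_d(ref, cur)) > threshold]
    print(f"|d| > {threshold}: {len(flagged)} features flagged")
```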

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

This paper presents a study on model drift in models for predicting learning success. The topic is interesting and the analysis is reasonable. However, some of the findings are specific to the dataset studied in the paper and cannot be generalized to other datasets; I suggest the authors emphasize the analysis methodology as the main contribution. The presentation is weak, and the paper could be significantly shortened: Section 1, for example, runs to more than two pages and could probably be condensed into one. Overall, I believe the paper can be revised to meet the publication standards of this journal.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

This article addresses an important and understudied problem in Learning Analytics: model drift in deployed predictive systems. The work demonstrates its practical value by analyzing real-world data from Siberian Federal University's Pythia system over several academic years.
The article addresses a critical gap in the Learning Analytics literature by focusing on model performance after deployment, rather than initial development.
The distinction between Digital Profile and Fingerprint data provides valuable insight into the differing stability of these types of educational data.
The use of multiple drift detection methods (PSI, classifier-based detectors, SHAP loss) offers a comprehensive analysis.
Real-world deployment data spanning three years provides authentic validation.
While valuable, the article could be strengthened by addressing the following points:
The introduction could better situate this work in the broader context of concept drift research. While the authors identify three key issues in the existing literature, the connection to established drift detection frameworks could be stronger. It is recommended that the analysis of why educational data may exhibit unique drift characteristics compared to other domains be expanded.
The choice of drift detection methods is well justified, but the article would benefit from a clearer explanation of the practical significance thresholds used. For example, the PSI interpretation guidelines (0.1, 0.25) seem somewhat arbitrary without domain-specific justification. Furthermore, the SHAP-loss methodology, while innovative, requires a more detailed explanation of how the "practically significant effect size" was determined for educational contexts.
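
As a point of reference for the 0.1 / 0.25 guidelines, the following minimal Python sketch (synthetic data; the psi helper is a generic textbook implementation, not necessarily the one used in the paper) shows how a PSI value is computed and mapped onto those conventional cutoffs:

```python
# Illustrative sketch: Population Stability Index between a reference and a
# current feature distribution, interpreted against the usual 0.1 / 0.25 cutoffs.
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    # Quantile bin edges are taken from the reference distribution; current
    # values are clipped into that range so every observation is counted.
    edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
    ref_prop = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    cur_prop = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0] / len(current) + eps
    return float(np.sum((cur_prop - ref_prop) * np.log(cur_prop / ref_prop)))

rng = np.random.default_rng(1)
value = psi(rng.normal(0.0, 1.0, 2000), rng.normal(0.3, 1.2, 2000))
label = "stable" if value < 0.1 else "moderate drift" if value < 0.25 else "significant drift"
print(f"PSI = {value:.3f} ({label})")
```
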
Several results sections contain extensive numerical detail but lack adequate interpretation. For example, Table 3 shows performance metrics across different time periods, but the analysis could better explain the practical implications of these changes for educational stakeholders. The relationship between data drift findings and model performance degradation could be more explicit.
Figure 2 (SHAP-loss distributions) is informative but would benefit from clearer axis labels and, where possible, confidence intervals. Table A1 in the appendix contains valuable information on feature drift but is quite lengthy; the most important findings should be summarized in the main text. Figure 3 shows interesting variation across schools but needs better integration with the main narrative.
The discussion section effectively addresses data stability challenges but could say more about how the findings generalize to other educational institutions. The proposed solutions (conservative hyperparameters, feature selection) showed only modest improvements; this limitation merits more critical analysis. Consider discussing whether these modest improvements justify the additional complexity in practice.
The hybrid model approach (LSSp + LSSc) is interesting, but the article could better explain when institutions should consider these architectures instead of simpler approaches.
The finding that min-max scaled features are less stable than z-scaled features has practical implications that merit further attention. The substantial decline in prediction performance for first-year students in the fall semesters raises important equity considerations that warrant discussion.
Some statistical reporting could be more precise (e.g., reporting exact p-values rather than just significance levels).
The list of abbreviations is helpful, but some terms are introduced without an initial definition in the text.
The references are impactful and current (nearly all are from the 2021–2025 period), which is appreciated.
The article would benefit from more explicit acknowledgment of limitations and suggesting specific directions for future research, such as investigating domain adaptation techniques or developing education-specific drift detection metrics.


Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 4 Report

Comments and Suggestions for Authors

The key points of the submission, "Model Drift in Deployed Machine Learning Models for Predicting Learning Success," are its long-term monitoring of an actually deployed service and its multi-faceted approach to drift diagnosis.

The study's strength lies in its systematic evaluation of stability and degradation (RA1–RA3) using operational data from the predictive model deployed in a real university service (Pythia). This approach bridges the gap between field application and research, and the analysis was conducted in a real-world context, complete with a "traffic light" UI for early intervention. At the same time, statistical methods (PSI), a classifier-based drift detector (RF), and XAI (SHAP/SHAP-loss) were used to diagnose drift comprehensively from univariate, multivariate, and explainability perspectives, a balanced methodology that does not rely on a single metric.
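
For readers unfamiliar with the classifier-based approach mentioned above, it can be sketched in a few lines; this is a generic two-sample illustration with synthetic data, not the Pythia implementation:

```python
# Illustrative sketch: a classifier-based drift detector. A random forest is
# trained to distinguish reference-period rows from current-period rows; a
# cross-validated ROC AUC well above 0.5 signals multivariate drift.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
reference = rng.normal(0.0, 1.0, size=(1000, 8))   # e.g., last year's feature matrix
current = rng.normal(0.2, 1.1, size=(1000, 8))     # e.g., this year's feature matrix

X = np.vstack([reference, current])
y = np.concatenate([np.zeros(len(reference)), np.ones(len(current))])

auc = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                      X, y, cv=5, scoring="roc_auc").mean()
print(f"Drift-detector ROC AUC: {auc:.3f} (about 0.5 would mean no detectable drift)")
```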

However, several points require further explanation.

1. Despite combining conservative hyperparameters with SHAP-loss-based removal of high-drift features, the overall performance improvement was not statistically significant. The authors should suggest how this could be improved.

2. Performance on new data declined significantly (Fall 2023: recall 0.58, F1 0.695; Fall 2024: recall 0.689, F1 0.748, etc.), raising the possibility of overfitting. This needs to be addressed.

3. Furthermore, the validation-weighted F1 is only 0.741, suggesting that static profile data alone is insufficient to capture academic performance issues in the current year; this limitation should be addressed in the revision.

4. Sample representativeness is limited because levels of digitalization vary across departments. Some majors (taught mainly offline) are not represented in the LMS activity data at all, which can lead to data bias and representativeness issues.

5. Furthermore, prediction performance for first-year students in the fall semester is significantly lower than for upperclassmen (e.g., Fall 2023: 0.558 vs. 0.742), indicating poor prediction stability across groups. While this may partly be a statistical issue, consistency across groups is a key concern, and the authors should address whether it needs to be accounted for; a minimal per-group stability check is sketched below.
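
To make this point concrete, here is a minimal sketch (with simulated labels and predictions, not the paper's data) of reporting metrics separately per cohort:

```python
# Illustrative sketch: reporting recall and F1 separately per student group to
# check prediction stability across cohorts.
import numpy as np
from sklearn.metrics import f1_score, recall_score

rng = np.random.default_rng(3)
n = 1000
cohort = rng.choice(["first-year", "upperclass"], size=n)     # hypothetical grouping
y_true = rng.integers(0, 2, size=n)                           # 1 = at-risk, 0 = not at-risk
y_pred = np.where(rng.random(n) < 0.8, y_true, 1 - y_true)    # simulated model output

for group in np.unique(cohort):
    mask = cohort == group
    print(f"{group:>11}: recall = {recall_score(y_true[mask], y_pred[mask]):.3f}, "
          f"F1 = {f1_score(y_true[mask], y_pred[mask]):.3f}")
```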

Key suggestions for improvement include expanding data coverage, introducing stress testing and model portfolios, and adaptive learning by cohort/week.

Comments on the Quality of English Language

The English could be improved to more clearly express the research.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

This revised version makes solid improvements, although it still does not include simple model baselines. The overall contribution remains meaningful.
