1. Introduction
1.1. Research Background: Sustainability Challenges in the Educational Supply Chain
Educational supply chain (ESC) theory conceptualizes higher education institutions as core nodes in the societal talent supply chain, where student enrollment corresponds to raw material input, the teaching process to value-added transformation, and graduation to talent output [1,2]. Unlike a conventional manufacturing supply chain, however, the educational supply chain bears a distinct sustainability burden: its “product loss” is not merely an economic write-off but a truncation of a student’s learning trajectory and of the talent that a society can later draw on. Framing student attrition as a sustainability problem, therefore, connects three concrete SDG 4 targets [3]: 4.3 (equal access to affordable tertiary education), 4.4 (increase in the number of adults with relevant skills for employment), and 4.5 (eliminate gender and vulnerability disparities in access to education). Each at-risk student who is not caught early represents a local failure against all three targets: an equity loss (the weakest learners are typically the first to be lost), a skills loss (fewer graduates enter the downstream labour market), and a disparity loss (attrition tends to concentrate in already disadvantaged cohorts). This is the sense in which an EWS can be a sustainability instrument rather than an administrative convenience: catching a struggling student in Week 6 rather than after the final exam is the ESC analogue of preventing, rather than discarding, a defective unit. This framing also aligns with more recent sustainability-focused higher education research that moves from declarative SDG alignment to measurable institutional mechanisms [4,5,6,7].
In supply chain management (SCM), traditional models lacking demand forecasting and risk early warning lead to inefficient resource allocation and delayed risk response [8]; this study argues that educational management faces similar challenges. Traditional educational management models depend heavily on post hoc assessment mechanisms such as midterm and final examinations. Instructors often cannot identify struggling students until mid-to-late semester, missing the optimal intervention windows.
Recent rapid advances in artificial intelligence (AI), particularly in deep learning and generative AI, present unprecedented opportunities to address these challenges [9]. In SCM, AI has been widely applied to demand forecasting, risk identification, resource optimization, and process reengineering [10]. A natural research question is whether the same paradigm transfers to education. The present study engages this question in a deliberately conservative way: rather than assuming that deep sequence models will outperform simpler baselines on educational data, it evaluates Random Forest with SMOTE against GRU and LSTM on a single-course ESC dataset (188 students, 30 fail cases) and reports the honest trade-offs. The result (Section 4) is that traditional ML with explicit minority resampling currently dominates deep sequence models on this small, imbalanced dataset. This finding is itself relevant to the sustainability question, because low-cost, reproducible baselines are what most institutions will actually deploy.
1.2. Theoretical Foundations of AI-Driven Educational Supply Chain Transformation
Following [1,2], the ESC can be viewed as a closed-loop system of input (enrollment), transformation (teaching), output (graduate talent) and feedback (outcome assessment), in which higher education institutions act as upstream suppliers to the societal talent supply chain. Within this frame the EWS plays the role of a quality control node: rather than waiting for end-of-term grades, it monitors the transformation process and flags at-risk students early enough to intervene, reducing the “defect” handed on to the downstream labour market. Viewed this way, AI can drive transformation along four ESC nodes: (i) risk forecasting and early warning, (ii) precision allocation of tutoring resources, (iii) data-driven reengineering of teaching and warning workflows, and (iv) continuous feedback that supports organisational learning.
Table 1 summarises the mapping between canonical SCM concepts and their ESC counterparts. A longer review of the AI-in-SCM literature is deferred to Section 2 to avoid duplication.
1.3. The PED System, Its Limits, and the Role of This Study
Tamkang University’s Office of Information Technology has developed the Smart PASS (Smart Planning and Advising for Student Success) platform, combining iClass (the LMS), iSignal (commendation and early warning), and iCan (intelligent career matching) with a Data–Dashboard–Decision workflow. Within this platform the Performance and Engagement Diagram (PED) system has, since academic year 110, provided bi-weekly dashboards and automated warning e-mails for instructors, academic advisors and students. The PED currently classifies each student each week into four quadrants using fixed thresholds on normalised Performance (P) and Engagement (E). Two practical limitations of this rule-based classifier were observed in deployment and motivate the present study: (i) max–min normalisation of P and E is sensitive to outliers and to class-size variation, and (ii) during early weeks, a majority of students lie close to the origin of the P–E plane, so the fixed-threshold boundary forces nearly identical students into different quadrants and generates many false alarms.
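Limitation (i) can be illustrated with a minimal sketch. The 0.5 threshold, the quadrant labels and the raw scores below are hypothetical (the PED's actual thresholds are not stated here); the point is only that one extreme value per axis compresses every other student toward the origin under max–min normalisation.

```python
def min_max(values):
    """Max-min normalisation to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def quadrant(p, e, threshold=0.5):
    """Hypothetical fixed-threshold quadrant rule on normalised P and E."""
    high_p = p >= threshold
    high_e = e >= threshold
    if high_p and high_e:
        return "high-P/high-E"
    if high_p:
        return "high-P/low-E"
    if high_e:
        return "low-P/high-E"
    return "low-P/low-E"


# Hypothetical raw weekly scores for a five-student class.
performance = [40, 45, 50, 55, 60]
engagement = [3, 4, 5, 6, 7]

before = [
    quadrant(p, e)
    for p, e in zip(min_max(performance), min_max(engagement))
]

# Add one outlier student on both axes, then look at the original five again.
after = [
    quadrant(p, e)
    for p, e in zip(min_max(performance + [400]), min_max(engagement + [70]))
][:-1]

print(before)
print(after)
```

In this toy class the three strongest students move from the high/high quadrant to the low/low quadrant although their raw scores did not change.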
Figure 1 summarises the resulting architecture as a conceptual data-flow diagram. This study therefore evaluates a data-driven alternative that is designed to replace, not merely complement, the quadrant rule inside the existing PED pipeline. Concretely, the intended integration is: (a) the PED continues to ingest iClass activity logs; (b) instead of applying fixed P/E thresholds, the weekly feature vector of each student is fed to a trained prediction model (RF or LSTM, depending on availability of a reliable minority-class sample); (c) the model emits a per-student Fail probability that drives the existing iSignal notification channel and the instructor dashboard shown in Figure 2. The empirical work reported in Section 3 and Section 4 serves to evaluate which model family can reasonably be dropped into step (b); the institutionalisation pathway (Section 5) is offered as a design-level proposal, not an empirical claim of the present study.
1.4. Research Objectives and Contributions
The study pursues three empirical objectives (O1–O3) and two design-level objectives (O4–O5); the latter are explicitly framed as blueprints to be validated in future work, not as claims supported by the present data.
(O1, empirical) Evaluate a small family of EWS models (Random Forest with SMOTE, GRU, LSTM) under a rigorous protocol that uses Fail as the positive class, a temporal train/validation split, five random seeds, and an explicit time-masking leakage check.
(O2, empirical) Compare three temporal feature representations—original weekly values, cumulative values, and a mixed representation—under the same protocol, and report which representation maximises which metric (sensitivity vs. precision vs. F1).
(O3, empirical) Identify the earliest week at which each model family attains a pre-specified operational target (Fail-recall ≥ 0.80 with precision not degrading into dominant-class-collapse), and thus determine the intervention window each model actually supports.
(O4, design-level) Propose a three-tier institutionalisation pathway (instructor pilot → departmental expansion → Smart PASS integration) as a structured blueprint grounded in Rogers [11] and Senge [12]; this study does not yet report usage, adoption, or outcome data from the latter two tiers.
(O5, design-level) Propose a four-dimensional impact assessment framework covering student outcomes, resource efficiency, organisational learning, and talent output; this is offered as an evaluation template for 2–5 year follow-up studies, not as measured impact.
The contributions of this paper are positioned as applied, operational improvements rather than broader methodological innovations. Concretely, the paper offers (i) a benchmark of traditional ML against sequence models on a small, realistically imbalanced ESC dataset under a leakage-checked, multi-seed protocol with Fail-recall as the primary operational metric; (ii) a documented failure mode in which sequence models collapse to the majority class under naive Gaussian-noise augmentation, so that apparent high accuracy hides zero sensitivity; and (iii) design-level institutionalisation and impact-assessment scaffolding that subsequent institutional deployments can test. None of these is claimed as a methodological innovation in machine learning or learning analytics; they are practical recipes for practitioners running similar single-course pilots. Claims about downstream corporate benefits, equity gains, or system-wide transformation are reserved for future work.
3. Materials and Methods
3.1. Research Design
The study follows a two-phase design rather than a mixed-methods design in the formal sense. Phase I is empirical and quantitative: model development, evaluation and feature analysis on a temporally split dataset under a 5-seed protocol (Section 3.2, Section 3.3, Section 3.4 and Section 3.5). Phase II is design-level and conceptual: the institutionalisation pathway and the impact-assessment framework (Section 3.7 and Section 3.8) are offered as structured blueprints derived from supply-chain, organisational-learning and diffusion-of-innovation theory. Phase II is explicitly not a qualitative empirical study: no interviews, surveys, focus groups or document analyses were conducted. It is design-based theoretical reasoning, and its conclusions are framed throughout the paper as proposals to be validated by follow-up qualitative and quantitative studies.
3.2. Data Collection and Preprocessing
Learning trajectory data were collected from Tamkang University’s iClass learning management platform across four semesters (1121, 1122, 1131, 1132) of one programming language course, totalling 188 students. This is a deliberately small, single-course, single-institution dataset; its limits are treated as a first-class methodological constraint throughout the paper and a five-seed protocol with confidence intervals is used to quantify the resulting uncertainty. A temporal split strategy was employed: the first two semesters (1121, 1122; 90 students) served as the training set and the latter two (1131, 1132; 98 students) as the validation set (Table 2), so the evaluation simulates the model’s predictive capability for future cohorts rather than an in-sample random split.
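The temporal split can be sketched as a filter on the semester code. Only the four semester codes come from the text; the record layout and the example students are hypothetical.

```python
TRAIN_SEMESTERS = ("1121", "1122")
VALID_SEMESTERS = ("1131", "1132")


def temporal_split(records):
    """Split by semester so that every validation cohort is later than training."""
    train = [r for r in records if r["semester"] in TRAIN_SEMESTERS]
    valid = [r for r in records if r["semester"] in VALID_SEMESTERS]
    return train, valid


# Hypothetical student records (label 1 = Fail, 0 = Pass).
records = [
    {"student_id": "a01", "semester": "1121", "label": 0},
    {"student_id": "a02", "semester": "1122", "label": 1},
    {"student_id": "b01", "semester": "1131", "label": 0},
    {"student_id": "b02", "semester": "1132", "label": 1},
    {"student_id": "b03", "semester": "1132", "label": 0},
]

train, valid = temporal_split(records)
print(len(train), len(valid))
```

Unlike a random split, no student from a later semester can end up in the training set, which is what lets the validation score stand in for performance on future cohorts.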
Five learning-activity features were extracted from iClass, corresponding to the five activity types the platform distinguishes.
homework: weekly assignment score rate, computed over all assignment submissions completed in week w. Captures summative skill acquisition.
forum: weekly forum participation rate over posts and replies in week w, on the same normalisation. Captures peer interaction and self-regulated help-seeking, which learning-analytics literature [24,25] repeatedly finds to mediate between attendance and outcome.
exam: weekly in-class quiz/exam score rate in week w; captures summative assessment within the semester.
custom: weekly completion rate of instructor-defined activities (in this course mostly coding exercises and short concept checks that are not graded as homework).
weblink: weekly click-through/completion rate on instructor-uploaded external resources (supplementary tutorials, reference slides). A “weblink” in iClass is a hyperlinked teaching resource posted by the instructor rather than a graded activity; it records whether a student opens the material, not how long they engaged with it, which is why the empirical analysis in Section 4.4 finds its predictive contribution close to zero.
Each student generated a time-series tensor of 18 weeks × 5 activity types = 90 dimensions. Semester 1132 stored the weblink rate under the file name weblink_score.csv (renamed to weblink.csv for the pipeline); the underlying schema is identical across semesters.
Figure 2 shows the PED system instructor dashboard interface. The upper scatter plot displays all students’ Performance (x-axis) and Engagement (y-axis) distribution for the current week; the right-side gauge classifies students into green, yellow and red zones using a fixed-threshold four-quadrant rule; the lower line chart shows an individual student’s weekly trend. The purpose of the EWS proposed in this study, when integrated into the existing PED pipeline, is to replace the fixed-threshold gauge with a data-driven Fail-probability score, while keeping the scatter plot and trend chart unchanged. This separation of concerns keeps the instructor-facing dashboard stable and localises the change to the classification step.
For class imbalance, the three models were handled as follows. Random Forest was trained with SMOTE in the flattened feature space at each week cutoff; SMOTE generates synthetic minority samples by interpolating between an existing minority point and one of its k nearest minority neighbours, so the added variance reflects the geometry of the minority class rather than isotropic noise. GRU and LSTM used minority-class replication with small Gaussian noise on the 3D time-series tensor. Section 4.5 shows that this relatively low-variance augmentation is not strong enough to prevent LSTM/GRU from collapsing to the majority class in early weeks, a finding that the present study reports in its own right. All augmentation operations were performed on the training split only, and time-masking (Section 3.5) is applied after augmentation to guarantee no future-week leakage.
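The difference between the two augmentation schemes can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the study's code: `smote_like` is a simplified SMOTE-style interpolation written for this example, and the noise level `sigma`, the neighbour count `k` and the 2D minority points are all hypothetical.

```python
import random


def gaussian_replicate(minority, n_new, sigma, rng):
    """Copy random minority rows and add isotropic Gaussian noise."""
    out = []
    for _ in range(n_new):
        base = rng.choice(minority)
        out.append([x + rng.gauss(0.0, sigma) for x in base])
    return out


def smote_like(minority, n_new, k, rng):
    """Interpolate between a minority row and one of its k nearest minority rows."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    out = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: dist2(base, p))[:k]
        other = rng.choice(neighbours)
        lam = rng.random()  # position along the segment base -> other
        out.append([x + lam * (y - x) for x, y in zip(base, other)])
    return out


# Hypothetical minority (Fail) points in a 2D feature space.
minority = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.5, 0.5]]
rng = random.Random(0)

noisy_copies = gaussian_replicate(minority, n_new=6, sigma=0.01, rng=rng)
interpolated = smote_like(minority, n_new=6, k=2, rng=rng)
```

With a small `sigma`, every noisy copy sits almost on top of an existing point, whereas the interpolated points fill the space between minority neighbours.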
3.3. Feature Engineering Methods
Three temporal feature representations of each (week, activity) cell are compared. Let $x_{i,w,a}$ be the weekly score rate of student i in week w on activity a (as defined in Section 3.2), and let $n_w$ be the number of scored attempts actually logged in week w across the cohort.
(1) Original features. The tensor keeps each weekly slot independent. This preserves the temporal shape of a student’s behaviour: sudden drops (“I stopped submitting homework in week 7”) and localised bursts (“I only engaged around midterms”) are directly visible to the model.
(2) Cumulative features. The tensor is the running per-week average up to week w, $\bar{x}_{i,w,a} = \frac{1}{w}\sum_{t=1}^{w} x_{i,t,a}$, where $x_{i,t,a}$ is the weekly score rate of student i in week t on activity a; dividing by w keeps the feature bounded in $[0, 1]$ and removes the spurious trend of raw ∑-cumulatives. Intuitively, $\bar{x}_{i,w,a}$ is a smoothed trajectory that attenuates short-term fluctuations and is therefore expected to improve precision at the cost of sensitivity to abrupt change points.
(3) Mixed features. The concatenation of the original and cumulative tensors doubles the per-timestep dimensionality from 5 to 10 (so the total feature count grows from 90 to 180). The mixed representation is included to test whether the original and cumulative features carry complementary information; its limitation, with only 90 training samples, is clear: more feature dimensions with the same N increase the risk of over-fitting. The mixed representation is therefore reported mainly as a diagnostic, not as the preferred operational choice.
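The three representations can be sketched for a single student as follows. The function names and the example trajectory are illustrative assumptions; the running mean follows the "divide by w" description above.

```python
def cumulative_features(original):
    """Running per-week mean of each activity column (weeks counted from 1)."""
    running = [0.0] * len(original[0])
    out = []
    for week, row in enumerate(original, start=1):
        running = [s + x for s, x in zip(running, row)]
        out.append([s / week for s in running])
    return out


def mixed_features(original):
    """Concatenate original and cumulative values at every timestep (5 -> 10)."""
    return [o + c for o, c in zip(original, cumulative_features(original))]


# Hypothetical student: homework (column 0) stops after week 6.
# Column order assumed: homework, forum, exam, custom, weblink.
original = [
    [1.0 if week <= 6 else 0.0, 0.5, 0.5, 0.5, 0.0]
    for week in range(1, 19)
]
cumulative = cumulative_features(original)
mixed = mixed_features(original)

print(original[6][0], round(cumulative[6][0], 3))
```

In week 7 the original homework value drops straight to 0.0, while the cumulative value has only eased from 1.0 to about 0.857, which is the smoothing effect described in (2).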
3.4. Model Architecture and Training Strategy
Three models were compared:
Random Forest: 100 trees, max depth 10, min samples split 5, trained on the flattened feature matrix at each week cutoff, with SMOTE applied to the training set only (Section 3.2); used as the traditional-ML baseline.
GRU (Gated Recurrent Unit): Input dimension 5, hidden dimension 64, 2 layers, dropout 0.3; update and reset gates control information flow through the sequence.
LSTM (Long Short-Term Memory): Same network shape as GRU, plus the additional cell-state mechanism (input, forget, output gates) that is intended to capture longer-range dependencies.
Deep-learning models used the Adam optimiser (initial learning rate 0.001), Binary Cross-Entropy loss, dropout 0.3, and early stopping with patience 15. Each deep-learning configuration is trained with five random seeds and evaluated on both validation semesters, with mean ± 95% CI (t-distribution, 4 degrees of freedom) reported in Section 4. Time-masking at evaluation time zeroes every timestep $t > w$ before the forward pass, so the model cannot observe any future week when predicting at week w; a self-check (Section 3.5) verifies that this produces identical outputs whether masking is applied pre- or post-standardisation.
3.5. Evaluation Metrics
Positive class. Consistent with the operational goal of an EWS (detecting at-risk students so that timely intervention can be offered), the minority Fail class is treated as the positive class ($y = 1$ if total_score < 60, else $y = 0$). Under this convention, recall equals sensitivity to at-risk students: a recall of r means that a fraction r of actually failing students are caught by the EWS. Because the scikit-learn default pos_label = 1 is used, care is required in coding: if Pass were numerically coded as 1, the default recall would collapse to Pass-recall rather than at-risk sensitivity, so the label encoding is deliberately aligned to Fail = 1.
Metrics. Accuracy, precision, recall and F1 are reported, all with pos_label = 1 (Fail). Recall is the primary metric because the cost of a missed at-risk student (no tutoring offered, possibly attrition) is strictly larger than the cost of a false alarm (an instructor checks in with a student who is, after all, on track). This cost asymmetry is the same one that motivates recall as the primary metric in clinical screening and supply-chain defect detection.
Robustness. Each deep-learning result is computed across five seeds and reported as mean ± 95% CI; the paired RF baseline is trained with a fixed seed since SMOTE + RF with the same random_state is deterministic given the training data. Reported accuracy always refers to the full confusion matrix, so a classifier that predicts everyone as Pass can still obtain high accuracy on this dataset while having Fail-recall = 0; Fail-recall and F1 are relied upon to detect this failure mode.
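The failure mode described above is easy to reproduce by hand. The sketch below computes the four metrics with Fail coded as 1; the cohort of 50 students with 7 Fail cases is hypothetical.

```python
def fail_positive_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with Fail (label 1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical cohort: 7 Fail (1) and 43 Pass (0).
y_true = [1] * 7 + [0] * 43
all_pass = [0] * 50  # degenerate classifier that never raises a warning

collapsed = fail_positive_metrics(y_true, all_pass)
print(collapsed)
```

The all-Pass classifier scores 86% accuracy here while catching no at-risk student, which is why accuracy alone cannot serve as the headline metric.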
Leakage check. For each trained deep-learning model, the study also verifies that the week-w prediction is identical for any two tensors that agree on weeks $1, \dots, w$. All five LSTM seeds and all five GRU seeds passed this check, so week-w predictions are confirmed to depend only on weeks $\le w$.
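A minimal sketch of such a check, assuming masking is implemented by zeroing later weeks: `toy_model` is a stand-in for a trained network (any deterministic function of the full tensor), and both input tensors are hypothetical.

```python
def time_mask(sequence, week):
    """Zero every timestep after `week` (1-indexed) so future weeks are hidden."""
    return [
        list(row) if t <= week else [0.0] * len(row)
        for t, row in enumerate(sequence, start=1)
    ]


def toy_model(sequence):
    """Stand-in predictor that reads the whole tensor, including future weeks."""
    return sum(sum(row) for row in sequence) / len(sequence)


def passes_leakage_check(model, seq_a, seq_b, week):
    """True if masked predictions agree for two tensors identical up to `week`."""
    if seq_a[:week] != seq_b[:week]:
        raise ValueError("inputs must agree on weeks 1..week")
    return model(time_mask(seq_a, week)) == model(time_mask(seq_b, week))


# Two hypothetical students who are identical through week 6 and differ afterwards.
seq_a = [[0.5] * 5 for _ in range(18)]
seq_b = [[0.5] * 5 if t <= 6 else [1.0] * 5 for t in range(1, 19)]

leak_free = passes_leakage_check(toy_model, seq_a, seq_b, week=6)
unmasked_differs = toy_model(seq_a) != toy_model(seq_b)
print(leak_free, unmasked_differs)
```

The unmasked predictions differ, so the agreement after masking is evidence that the mask, not the model, is what removes access to future weeks.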
Operational warning threshold. To make statements like “the earliest week at which a model is usable” precise, an operational target is fixed in advance: Fail-recall ≥ 0.80 with Fail-precision > class base rate (so that the model is not simply alarming on everyone). Section 4.3 uses this target to identify the earliest usable warning week for each model family.
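Applied to a table of weekly results, the criterion reduces to a first-match search. The weekly (recall, precision) pairs and the 0.17 base rate in the example are hypothetical placeholders, not values from Table 5.

```python
def earliest_usable_week(weekly, base_rate, min_recall=0.80):
    """First week with recall >= min_recall and precision > base_rate, else None."""
    for week in sorted(weekly):
        recall, precision = weekly[week]
        if recall >= min_recall and precision > base_rate:
            return week
    return None


# Hypothetical (Fail-recall, Fail-precision) per prediction week.
late_model = {
    6: (0.00, 0.00),
    8: (0.10, 0.25),
    10: (0.40, 0.30),
    12: (0.70, 0.35),
    14: (0.80, 0.40),
    16: (0.83, 0.45),
}
# A model that alarms on everyone: perfect recall, precision equal to the base rate.
alarm_on_everyone = {week: (1.00, 0.17) for week in late_model}

print(earliest_usable_week(late_model, base_rate=0.17))
print(earliest_usable_week(alarm_on_everyone, base_rate=0.17))
```

The strict inequality on precision is what rejects the alarm-on-everyone model, which would otherwise satisfy the recall target from the first week.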
3.6. Use of Generative AI Tools
In line with the journal’s policy on artificial intelligence tools, the author discloses that Claude Code (Anthropic; Opus 4.6 and Opus 4.7) was used as a coding assistant for two purposes: (i) generation of the Python 3.10 source code that implements the experimental protocol described above, and (ii) generation of figure-rendering code: the TikZ source for the conceptual data-flow diagram in Figure 1, and the matplotlib 3.9 scripts that render Figure 3, Figure 4 and Figure 5 from the recorded experiment outputs. The experimental design, methodological decisions, and the conceptual layout of Figure 1 were specified by the author prior to code generation; every generated script was reviewed and executed by the author, and all reported numerical results (Section 4) are reproducible from the resulting code. No GenAI tool was used for data collection, hypothesis formulation, statistical inference, or interpretation of results. Use of GenAI tools restricted to manuscript text editing (grammar, spelling, formatting) is excluded from this disclosure under the journal’s policy.
3.7. Institutionalisation Framework (Design-Level)
The three-tier institutionalisation framework is presented here as a blueprint to be validated in subsequent deployments; no adoption, usage, or outcome data from Tiers 2 and 3 are reported in this paper. Each tier is explicitly anchored to one Rogers [11] adoption segment and one Senge [12] feedback loop, which is what distinguishes it from a generic phased rollout plan.
Tier 1—Instructor level (Rogers early adopters/Senge individual-reflection loop). Volunteer instructors run the EWS on their own course in a “pilot → feedback → optimise” cycle; the artefact of this tier is an instructor-level log of which predicted-Fail students were intervened on and how they ultimately performed, so that the warning model can be audited in situ before the next tier.
Tier 2—Departmental level (Rogers early-majority/Senge departmental curriculum loop). The EWS is integrated into departmental teaching-quality-assurance workflows; a cross-course warning dashboard is made available to academic advisors; the feedback loop is curricular (“does the department need to adjust sequencing or prerequisites?”).
Tier 3—Institutional level (Rogers institutionalisation/Senge policy learning). The EWS is embedded in the Smart PASS platform’s iSignal warning channel with automated campus-wide distribution; the feedback loop is institutional (policy changes, resource reallocation across departments).
The current paper reports data only from an instructor-level pre-pilot; Tier 2 and Tier 3 outcomes are explicitly proposals for future work (Section 5).
3.8. Impact Assessment Framework (Design-Level)
The four-dimensional framework below is likewise a template for longitudinal impact studies, not a measured outcome of the present work. It is included so that subsequent pilots have a consistent evaluation scaffold, and each dimension is operationalised as a small set of measurable indicators.
(1) Student outcomes. Pass-rate changes (course-level; term-to-term), post-warning improvement rates (conditional on receiving tutoring), self-reported self-regulated-learning scales, and course satisfaction.
(2) Resource efficiency. Tutoring cost per at-risk student, time savings for instructor risk identification (from a documented baseline of “manually scanning the class list”), and the precision of resource allocation (what fraction of tutoring hours go to students who were genuinely at risk).
(3) Organisational learning. Frequency of instructor pedagogical-strategy adjustments (as captured by the Tier 1 logs), departmental curriculum iteration speed, and institutionalisation policy adoption rates.
(4) Talent output. Graduation-rate changes, graduate employment rate and employer satisfaction, and documented contributions to SDG 4 (targets 4.3, 4.4, 4.5) at cohort level. These are long-horizon indicators (2–5 years) and are included as markers for future follow-up, not as claims of current impact.
4. Results
All results below use Fail ($y = 1$) as the positive class, averaging over six prediction weeks and two validation semesters (1131, 1132). Deep-learning results are reported as mean ± 95% CI over five random seeds; RF results are deterministic under a fixed random_state. Time-masking was verified leakage-free in 10/10 deep-learning runs (five seeds × two model families).
4.1. Model Performance Comparison
Table 3 reports the three models’ performance averaged across weeks 6–16 and both validation semesters. The clearest headline is that accuracy and Fail-recall diverge sharply: LSTM and GRU obtain the highest accuracy, but they do so largely by predicting the majority class in early weeks. RF with SMOTE attains the highest Fail-recall, the highest F1, and the only Fail-precision that is meaningfully above the class base rate.
Relative to LSTM/GRU, RF + SMOTE provides absolute gains in both Fail-recall and F1 (Table 3). LSTM is nominally the most accurate model by ≈5 percentage points, but that advantage evaporates once sensitivity to the minority class is examined. On this small (N = 90 training, 13 Fail) imbalanced ESC dataset, a traditional RF baseline with SMOTE is the operationally preferable EWS.
Table 4 gives per-semester confirmation. The ± values are the 95% CI half-widths over the five seeds for that (model, semester, week-range) cell; individual per-week CIs are in Table 5.
4.2. Feature Engineering Comparison
Because LSTM/GRU collapse to the majority class in early weeks, the most informative feature-engineering comparison is on RF + SMOTE (the model that actually separates the two classes).
Table 6 reports the three representations averaged across weeks 6–16 and two validation semesters.
The story here is a trade-off rather than a dominance: original features maximise sensitivity to at-risk students (Fail-recall 90.36%), while cumulative and mixed features trade about 4 percentage points of recall for 13–15 percentage points of precision and 6–7 points of F1. For an EWS whose explicit operational target is catching at-risk students, original features are still the recommended representation; cumulative features are a defensible second choice when false-alarm cost rises.
Between-class separation of the two representations. One possible explanation for the original-vs-cumulative difference would be a reduction in between-class separation under cumulative aggregation (a “signal-dilution” hypothesis). This is tested statistically: for every one of the 90 (week, activity) cells, the absolute standardised mean difference (Cohen’s $|d|$) between Pass and Fail students is computed under both representations on the training set. The results go in the opposite direction:
mean on original features (median );
mean on cumulative features (median );
a paired Wilcoxon test with alternative “original > cumulative” is not significant, while the reverse alternative is highly significant.
That is, cumulative features have larger, not smaller, between-class separation on average. The correct mechanistic story is therefore not signal dilution: cumulative features smooth weekly fluctuations, which improves class separation in the marginal sense but loses the fine-grained temporal events (sudden drops, local bursts) that carry most of the sensitivity. Original features give the classifier the signal it needs to flag an abrupt change before a student is past recovery; cumulative features give a more robust but slower verdict. Both statements are now supported by statistics on the full feature grid, not a single anecdote.
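For one (week, activity) cell, the separation statistic can be computed as below. The pooled-standard-deviation form of Cohen's d is an assumption (the variant used is not stated here), and the Pass and Fail values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def abs_cohens_d(group_a, group_b):
    """Absolute standardised mean difference using the pooled sample variance."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    if pooled_var == 0:
        return 0.0
    return abs(mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical score rates in one cell for Pass and Fail students.
pass_rates = [0.8, 0.9, 1.0]
fail_rates = [0.4, 0.5, 0.6]

d = abs_cohens_d(pass_rates, fail_rates)
print(round(d, 3))
```

Repeating this for every cell under both representations yields the paired samples on which a Wilcoxon signed-rank test can then be run.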
4.3. Prediction Time Point Analysis
Table 5 reports each model’s weekly Fail-recall and accuracy (two-semester pooled), so that the trade-off between timeliness and reliability can be inspected week by week. Deep-learning values are mean ± 95% CI over five seeds.
Operational “optimal warning time” criterion. Section 3.5 fixes the target as “Fail-recall ≥ 0.80 with precision materially above the class base rate.” Applying this criterion, the earliest usable warning week is:
RF + SMOTE: Week 6 (Fail-recall 87.86%, precision 43.93%, i.e., 2.6× the ≈17% base rate). The 12-week intervention window originally claimed for the EWS is therefore supportable—but only through RF, not LSTM.
LSTM: Week 14 (Fail-recall 80.00%, exactly meeting the operational target; Week 16 reaches 82.86%). The usable intervention window shrinks to ≈4 weeks.
GRU: no week meets the target. Week 16 comes closest (Fail-recall 77.57%, just below the target); Week 14 reaches only 60.00% with a ±22.94 p.p. CI half-width, so GRU is not yet operationally usable.
Non-monotonicity of LSTM recall at early weeks. LSTM recall stays at exactly 0% at Week 6 and rises at Week 8 and Week 10 only on average (Table 5): the CIs include zero because most seeds remain in the degenerate all-Pass solution while the occasional seed escapes. Recall then stabilises from Week 12 onwards. The reason is that with only 13 Fail examples replicated via Gaussian noise, the LSTM’s early-week loss surface has a single broad basin in which predicting “always Pass” is globally optimal; the network only leaves that basin once the feature tensor contains enough non-zero columns to break the degeneracy. The 95% CIs over five seeds (24.55 p.p. wide at Week 8, 26.13 p.p. at Week 10) quantify the remaining instability, and underline why LSTM is not recommended for early-week warning on this dataset.
Cautions on interpreting early-week recall. In imbalanced settings such as this one, a high recall at Week 6 under a Pass-as-positive framing would correspond to a degenerate all-Pass predictor (e.g., confusion matrix TN = 0, FP = 7 for LSTM Week 6 on semester 1131); only the Fail-as-positive recall reported here measures sensitivity to at-risk students. Under that framing the LSTM Week 6 Fail-recall is 0%. The time-masking self-check (Section 3.5) passed on all 5 seeds × 2 model families = 10 runs, excluding data leakage as a source of any high-recall artefact.
Figure 3 visualises these weekly trajectories for each model on each validation semester.
Figure 4 focuses on the single quantity that matters most for an EWS—Fail-recall—and shows directly how deterministic RF compares to the 5-seed LSTM and GRU distributions over weeks. The shaded 95% CI bands make the “degenerate basin” behaviour of the sequence models visible as a near-vertical spread at Weeks 8–12 that only collapses from Week 14 onwards, while RF stays above the 80% operational target throughout.
4.4. Feature Importance Analysis
Feature importance is reported from the models that are actually used for prediction in this paper. Because RF is the operational model and LSTM is the sequence model of interest, Gini-based importance is reported for RF and permutation-based importance for LSTM. Random Forest (the explanatory lens) gives a decomposition into individual features; LSTM (the sequence lens) gives a decomposition into activity channels. Reporting both avoids reading feature importance off the weakest model while using the strongest for prediction.
RF Gini importance (aggregated by activity type, on original features trained on the full training set).
exam ≈ 34.32%, forum ≈ 31.12%, homework ≈ 23.90%, custom ≈ 10.65%, weblink ≈ 0.00%.
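Aggregating the 90 per-feature importances into five activity shares is a grouped sum. The sketch assumes the flattened vector is ordered week by week, with the five activities in the order listed in Section 3.2; both the ordering and the importance values are hypothetical.

```python
ACTIVITIES = ["homework", "forum", "exam", "custom", "weblink"]


def aggregate_by_activity(importances, n_weeks=18):
    """Sum flattened (week, activity) importances into one total per activity."""
    n_act = len(ACTIVITIES)
    if len(importances) != n_weeks * n_act:
        raise ValueError("expected n_weeks * n_activities importances")
    totals = {name: 0.0 for name in ACTIVITIES}
    for idx, value in enumerate(importances):
        totals[ACTIVITIES[idx % n_act]] += value
    return totals


# Hypothetical importances: the same pattern every week, normalised to sum to 1.
weekly_pattern = [2, 1, 3, 1, 0]
scale = 18 * sum(weekly_pattern)
importances = [v / scale for _ in range(18) for v in weekly_pattern]

shares = aggregate_by_activity(importances)
print({name: round(share, 4) for name, share in shares.items()})
```

With an actual fitted forest, `importances` would be the model's per-feature importance vector in the same column order as the flattened training matrix.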
LSTM permutation importance (drop in Fail-recall on the 1131 validation set; mean across 5 seeds × 5 permutations per seed).
exam: drop ;
homework: drop ;
custom: drop ;
forum: drop ;
weblink: drop .
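Channel-level permutation importance can be sketched as follows: one activity channel is shuffled across students, the model is re-scored, and the drop in Fail-recall is averaged over repeats. `toy_predict` is a hypothetical stand-in for the trained LSTM, and the channel positions and student data are assumptions made for the example.

```python
import random

EXAM, WEBLINK = 2, 4  # assumed column order: homework, forum, exam, custom, weblink


def fail_recall(y_true, y_pred):
    """Share of true Fail (1) students that were predicted Fail."""
    caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    total = sum(y_true)
    return caught / total if total else 0.0


def toy_predict(sequence):
    """Stand-in model: predicts Fail when the mean exam rate is below 0.5."""
    exam_mean = sum(row[EXAM] for row in sequence) / len(sequence)
    return 1 if exam_mean < 0.5 else 0


def permutation_drop(predict, sequences, y_true, channel, n_repeats, rng):
    """Mean drop in Fail-recall after shuffling one channel across students."""
    baseline = fail_recall(y_true, [predict(s) for s in sequences])
    drops = []
    for _ in range(n_repeats):
        donors = list(range(len(sequences)))
        rng.shuffle(donors)
        permuted = [
            [row[:channel] + [sequences[d][t][channel]] + row[channel + 1:]
             for t, row in enumerate(seq)]
            for seq, d in zip(sequences, donors)
        ]
        shuffled = fail_recall(y_true, [predict(s) for s in permuted])
        drops.append(baseline - shuffled)
    return sum(drops) / len(drops)


def make_student(exam_rate):
    return [[0.5, 0.5, exam_rate, 0.5, 0.0] for _ in range(18)]


y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
sequences = [make_student(0.2 if y == 1 else 0.9) for y in y_true]
rng = random.Random(0)

exam_drop = permutation_drop(toy_predict, sequences, y_true, EXAM, 5, rng)
weblink_drop = permutation_drop(toy_predict, sequences, y_true, WEBLINK, 5, rng)
```

A channel the model never uses, such as weblink in this toy setup, produces a drop of exactly zero, which is the pattern reported for weblink above.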
Both models agree that weblink carries no predictive signal for Fail—this matches the way weblink is logged in iClass (click-through only, no engagement depth) and is the cleanest convergent finding. The two models also agree that exam, homework, and custom activities are the most informative and forum, while useful for RF Gini (which mainly isolates the week-8 forum feature), contributes little to LSTM’s sequence-level decisions. The disagreement is narrow: RF’s Gini assigns more weight to the forum because one specific (week-8) forum feature cleanly splits Pass from Fail in a single tree threshold (
Table 7), while LSTM distributes attention across the trio of active-submission activities (exam, homework, custom). For instructional redesign, the two rankings should therefore be read together: consistent weekly submission across homework and custom is the strongest in-semester sequential signal, while exam and the week-8 forum snapshot carry extra discriminative value once such snapshots are available.
4.5. Why LSTM Underperforms RF in This Setting
General architectural arguments would predict that LSTM “should” outperform RF on temporally structured data. The results on this dataset run in the opposite direction; the mechanism is specific and instructive, and the diagnosis below follows the empirical evidence.
(1) Class-imbalance handling is asymmetric across models. RF was trained with SMOTE applied in the flattened feature space, which synthesises new minority points by interpolating between k nearest minority neighbours; this gives the tree ensemble a genuinely more diverse view of the Fail class. LSTM/GRU used minority replication plus low-variance isotropic Gaussian noise in the 3D tensor space; with only 13 Fail students, such noise produces near-duplicates, so the deep model effectively still sees the same 13 trajectories. The empirical consequence is visible in Table 5: Fail-recall is exactly 0% at Week 6 and remains at zero for most seeds at Week 8.
(2) Early-week sparsity pushes the LSTM loss surface into a degenerate basin. In Weeks 6–10 most feature cells are either zero or close to zero (students have not yet produced much logged activity). The BCE loss on such inputs is dominated by the large Pass majority, so the network’s global optimum is to predict Pass for everyone; this is the “majority-class collapse” that the confusion matrix directly shows (TN , FP for LSTM Week 6 on semester 1131). RF, whose training set has been rebalanced by SMOTE before fitting, does not face this loss-surface geometry.
(3) LSTM’s cell-state advantage does not translate here because the usable sequence is short. The 18-week window contains at most 2–3 inflection points per student, which is in the regime where RF on flattened features can learn adequate splits; LSTM’s long-range dependency advantage is mostly moot.
(4) Multi-seed variance confirms the diagnosis. The LSTM Fail-recall at Week 8 has a 95% CI of
p.p. over five seeds and at Week 10 of
p.p. (
Table 5). These wide CIs are not a nuisance but a signature of the degenerate basin: most seeds land in the all-Pass solution, the occasional seed escapes it. By Week 14 the CI half-width has contracted to
p.p. as every seed reliably separates the two classes. Under a more aggressive minority-handling strategy (e.g., 3D SMOTE, focal loss, or class-weighted BCE with pos_weight), this early-week variance would be expected to shrink; testing such remedies is listed in Future Work (Section 6).
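The contrast drawn in point (1) between interpolation-based and noise-based augmentation can be made concrete with a small sketch. Everything here is illustrative: the 13 two-dimensional minority points, the noise scale `sigma=0.01`, and `k=3` are hypothetical choices, and `smote_like` is a simplified stand-in for SMOTE rather than the library implementation used in the study.

```python
import math
import random

def gaussian_replicas(minority, n_new, sigma, rng):
    """Replicate random minority points and add isotropic Gaussian noise."""
    out = []
    for _ in range(n_new):
        base = rng.choice(minority)
        out.append([v + rng.gauss(0.0, sigma) for v in base])
    return out

def smote_like(minority, n_new, k, rng):
    """Interpolate between a minority point and one of its k nearest minority neighbours."""
    out = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: math.dist(base, m),
        )[:k]
        other = rng.choice(neighbours)
        lam = rng.random()  # position along the segment base -> other
        out.append([b + lam * (o - b) for b, o in zip(base, other)])
    return out

def mean_nearest_distance(synthetic, minority):
    """Average distance from each synthetic point to its closest original point."""
    return sum(min(math.dist(s, m) for m in minority) for s in synthetic) / len(synthetic)

rng = random.Random(0)
# Hypothetical minority set: 13 points on a zigzag, standing in for 13 Fail students.
minority = [[float(i), float(i % 2)] for i in range(13)]

noise_d = mean_nearest_distance(gaussian_replicas(minority, 200, 0.01, rng), minority)
smote_d = mean_nearest_distance(smote_like(minority, 200, 3, rng), minority)
```

In this toy setting the noise replicas sit almost on top of the originals, while the interpolated points fall in the gaps between them, which is the sense in which the text describes the noise-augmented set as near-duplicates.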
The general lesson, and the cleanest take-home from this study’s revised empirical contribution, is that sequence-model superiority in educational prediction is conditional on class-imbalance handling commensurate with the architecture’s capacity. On the small, imbalanced datasets typical of programme-level courses, a traditional RF + SMOTE baseline is a more reliable starting point and should be the benchmark that subsequent sequence-model work must beat in sensitivity, not just in accuracy.
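The majority-class collapse described in point (2), and the class-weighted BCE remedy mentioned alongside it, can be illustrated with a constant predictor. The sketch below is an assumption-laden simplification: it ignores the network entirely and asks which single probability minimises the loss. The count of 13 Fail students comes from the text; the Pass count of 87 is hypothetical.

```python
import math

def weighted_bce(p, n_pos, n_neg, pos_weight=1.0):
    """Mean BCE of a constant predictor that outputs P(Fail) = p for every student."""
    total = pos_weight * n_pos * math.log(p) + n_neg * math.log(1.0 - p)
    return -total / (n_pos + n_neg)

def best_constant(n_pos, n_neg, pos_weight=1.0):
    """Closed-form minimiser of weighted_bce over p (set the derivative to zero)."""
    return pos_weight * n_pos / (pos_weight * n_pos + n_neg)

n_fail = 13   # Fail training trajectories, as stated in the text
n_pass = 87   # hypothetical Pass count, for illustration only

plain = best_constant(n_fail, n_pass)                                  # unweighted BCE
balanced = best_constant(n_fail, n_pass, pos_weight=n_pass / n_fail)   # class-weighted BCE
```

With unweighted BCE the best constant output equals the Fail base rate, which lies below a 0.5 decision threshold, so every student is predicted Pass. Setting the positive-class weight to the Pass/Fail ratio moves that optimum to 0.5, so the loss no longer rewards the all-Pass solution by default.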
5. Discussion
5.1. What the Results Support, and What They Do Not
The empirical results support three claims and rule out several others that might otherwise be drawn from the data. Both sets are stated here clearly so that subsequent institutional decisions rest on what the data actually show.
Supported by the present data. (i) An EWS that flags at-risk students at Week 6 is feasible on this single-course dataset using RF + SMOTE, with Fail-recall 87.86% and precision 2.6× the class base rate; (ii) original weekly features maximise sensitivity (Fail-recall 90.36%) while cumulative and mixed features improve precision/F1 (from 72.06% to ≈79.4%), which is a genuine trade-off rather than an across-the-board dominance; (iii) on a training set of 13 minority trajectories with noise augmentation, LSTM and GRU collapse to the majority class in early weeks and their headline accuracy is therefore misleading; this is consistent across five seeds.
Not supported. (i) A “zero-miss” framing, requiring 100% Fail-recall: with Fail as the positive class, no model in this study reaches 100% Fail-recall at Week 6, and LSTM reaches 0%. (ii) A “signal dilution” account of the original-vs-cumulative gap: on the full (week, feature) grid, cumulative features have larger average Cohen’s d (0.717 vs. 0.192; Section 4.2), so the gap is a sensitivity/precision trade-off, not dilution. (iii) Claims that the EWS, on the basis of these results alone, reduces downstream corporate training costs, measurably advances SDG 4, or produces institutional resilience: the data in this paper do not measure any of these outcomes, and the discussion below is careful to separate them from what is measured.
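The per-cell effect-size comparison behind point (ii) can be sketched as follows. This is a minimal illustration, not the study's analysis script: the pooled-standard-deviation form of Cohen's d, the use of its absolute value, and the toy weekly counts are all assumptions.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Absolute Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    if pooled_var == 0:
        return 0.0
    return abs(mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

def cumulative(weekly):
    """Running total of a weekly activity series."""
    total, out = 0.0, []
    for value in weekly:
        total += value
        out.append(total)
    return out

def per_week_d(pass_rows, fail_rows):
    """Cohen's d between Pass and Fail students for each week of one feature."""
    n_weeks = len(pass_rows[0])
    return [cohens_d([r[w] for r in pass_rows], [r[w] for r in fail_rows])
            for w in range(n_weeks)]

# Hypothetical weekly homework counts (rows = students, columns = weeks).
pass_students = [[1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
fail_students = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

d_original = per_week_d(pass_students, fail_students)
d_cumulative = per_week_d([cumulative(r) for r in pass_students],
                          [cumulative(r) for r in fail_students])
```

Repeating this for every (week, feature) cell yields paired lists of d values, which could then be compared with a paired non-parametric test such as `scipy.stats.wilcoxon`.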
5.2. Comparison with Related Work
The accuracy and recall figures align reasonably well with recent literature on small-scale educational prediction. Al-Azazi and Ghurab [28] report ≈70% accuracy at month 3 of a MOOC; Kalita et al. [27] report ≈88% accuracy with Bi-LSTM + SHAP on a large open dataset; Alnasyan et al. [26] conclude from their systematic review that deep models’ reported advantages shrink once class-imbalance controls are added. The present result (RF + SMOTE at 85.59% accuracy and 91.19% Fail-recall, beating LSTM/GRU on the minority class) fits this picture: on large and balanced data, deep sequence models can dominate; on small imbalanced data with naive augmentation, classical ML with explicit resampling is often a more dependable baseline. The per-week Fail-recall trajectory (RF 87.86% at Week 6, 92.86% from Week 10) is broadly consistent with the early-warning timing reported by Al-Azazi and Ghurab [28] on MOOC data and with the homework-habit-driven early warning of Wen et al. [32] in Sustainability. The RF-with-balancing design adopted here parallels Jawad et al. [31] but extends their single-seed RF comparison with a leakage-checked multi-seed protocol; Staneviciene et al. [33] report a related case study on sustainable e-learning prediction whose accuracy range is consistent with ours. The result is obtained on a deliberately small, programme-level dataset; generalisation therefore requires cross-course and cross-institution replication [7,20,29].
5.3. From Evidence to Supply-Chain Language
Rephrased in ESC terms, the evidence supports three cautious transitions rather than three system-level transformations:
From reactive to proactive, at the course level. An RF + SMOTE EWS allows an instructor in this course to identify ≈88% of at-risk students at Week 6, leaving a 12-week intervention window. This is a course-level finding; extrapolation to department and institution levels is a design-level claim (Section 3.7).
From undifferentiated to targeted tutoring, within the observed precision. At Week 6 the RF precision is 43.93%, meaning that of every 10 students flagged, ≈4 would actually fail without help; that is strictly better than an undifferentiated approach but it is not “just-in-time” in the narrow sense. Instructors should expect to check in with more students than will ultimately be at risk, and the EWS should be presented to them in that language.
From intuition to quantitative attention, with two decompositions. RF Gini highlights exam and forum; LSTM permutation importance highlights homework and custom. Both agree that weblink carries essentially no predictive signal. A pedagogical reform cannot be read directly off either decomposition alone, but the two together sharpen where instructors can usefully look.
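The workload implication of the Week 6 precision quoted above can be worked through numerically. The only input taken from the text is the precision of 43.93%; the helper name `flag_workload` is hypothetical.

```python
def flag_workload(precision, flagged=10):
    """Return (expected true at-risk students among `flagged`, flags needed per true case)."""
    expected_true = precision * flagged
    flagged_per_true = 1.0 / precision
    return expected_true, flagged_per_true

# Week 6 RF precision reported in the text.
expected_true, per_true = flag_workload(0.4393)
```

Here `expected_true` is about 4.4 of every 10 flagged students and `per_true` is about 2.3, so an instructor should expect to check in with a little over two students for each one who is genuinely at risk.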
Downstream effects on corporate training costs, SDG 4 targets, or whole-chain efficiency are hypotheses that the four-dimensional framework in Section 3.8 exists to test in a 2–5 year follow-up; they are not supported by the present data, and this paper does not claim they are.
5.4. Long-Term Sustainability Analysis (Design-Level)
This subsection is framed as operational recommendations grounded in the empirical results; it does not claim long-term effects that the dataset cannot measure.
5.4.1. Technical Sustainability
On this small dataset, RF + SMOTE reaches 85.59% accuracy and 91.19% Fail-recall while training in seconds on a commodity laptop, which makes it technically feasible for programme-level deployment. Per-semester performance is not uniform: RF Fail-recall goes from 85.71% (semester 1131) to 96.67% (semester 1132) while accuracy moves the other way (90.37% → 80.82%), consistent with drift in cohort composition and instructor behaviour. Continuous monitoring and periodic retraining are therefore operationally necessary.
To ensure technical sustainability, this study recommends a rolling model update strategy: retraining RF + SMOTE with the two most recent semesters at the end of each academic year; a monitoring dashboard that tracks per-cohort Fail-recall and precision and flags drops below the operational target in Section 3.5; and a modular architecture that keeps the classifier replaceable, so that if subsequent work delivers a sequence model that reliably beats RF on this task, it can be dropped into the same pipeline.
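The monitoring and rolling-update recommendations could be sketched along the following lines. This is a design illustration only: the thresholds (0.90 and 0.40) are hypothetical placeholders for the operational target in Section 3.5, the per-semester precision values and the semester label "1122" are invented for the example, and only the two Fail-recall figures come from the text.

```python
from dataclasses import dataclass

@dataclass
class CohortMetrics:
    semester: str
    fail_recall: float
    precision: float

def cohorts_needing_review(history, min_recall, min_precision):
    """Semesters whose Fail-recall or precision fell below the operational target."""
    return [m.semester for m in history
            if m.fail_recall < min_recall or m.precision < min_precision]

def rolling_training_window(semesters, window=2):
    """Most recent `window` semesters, used to retrain at the end of the academic year."""
    return semesters[-window:]

history = [
    CohortMetrics("1131", fail_recall=0.8571, precision=0.45),  # precision is hypothetical
    CohortMetrics("1132", fail_recall=0.9667, precision=0.50),  # precision is hypothetical
]

to_review = cohorts_needing_review(history, min_recall=0.90, min_precision=0.40)
train_on = rolling_training_window(["1122", "1131", "1132"])  # "1122" is hypothetical
```

Keeping the check independent of the classifier matches the modular design described above: the same metrics record can be produced by RF + SMOTE today or by a sequence model later.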
5.4.2. Economic Sustainability
The RF + SMOTE pipeline trained in under a minute on an Apple Silicon M-series laptop with 16 GB RAM and required no GPU; LSTM/GRU took 2–5 minutes per seed on the same hardware. Inference is essentially instant at the class level, so the system can run on the existing Smart PASS infrastructure without additional procurement. Batch prediction once every 1–2 weeks is sufficient for instructional support, which further reduces the energy footprint of the deployment. The cost-benefit claim is therefore: limited one-time integration cost, minimal per-semester marginal compute cost, and instructor time savings relative to manual roster review. Claims about tuition-revenue recovery or system-level cost offset are deferred to the four-dimensional framework’s 2–5 year follow-up (Section 3.8); the present data do not support them.
5.4.3. Ecosystem Sustainability
Long-term operation also requires coordination across technical maintenance, pedagogical application, and administrative management; the concrete coordination mechanisms are institution-specific and are out of scope of this paper, which reports a single-course pilot.
5.5. Scalability Assessment
Scaling from a single-course pilot to campus-wide deployment raises three issues that the present data cannot adjudicate but that the four-dimensional template (Section 3.8) is designed to test in subsequent pilots: cross-course variation in activity patterns (programming vs. general education, STEM, humanities), which may require domain-adaptation or course-clustering strategies; cross-institutional variation in LMS platforms (Moodle, Canvas, Blackboard), which requires a standardised data-exchange schema before transfer learning can be evaluated; and operational batch-processing capacity, for which the present codebase exposes class-level batch prediction and is integrable with Smart PASS automated workflows. None of these is empirically demonstrated at scale in this paper.
5.6. Institutionalization Pathway
Section 3.7 sets out the three-tier pathway anchored in Rogers [11] and Senge [12]; concretely it consists of an instructor-level pilot (1–2 semesters), departmental expansion (semesters 3–4), and Smart PASS integration (semesters 5–6). Success factors typically cited in implementation studies (administrative leadership, faculty trust through transparent model explanations, student privacy protection, and a continuous improvement culture) are not measured by the present study and are reserved for the Tier-2 and Tier-3 follow-ups.
5.7. Impact Assessment Framework
The four-dimensional template (Section 3.8) spans three time horizons: short-term (1–2 semesters; student outcomes, resource efficiency), medium-term (semesters 3–6; organisational learning), and long-term (2–5 years; talent output and SDG-4-aligned indicators). Quasi-experimental designs comparing flagged-and-intervened against flagged-but-not-intervened students, supplemented by longitudinal cohort tracking, are the recommended evaluation method. Numerical thresholds in Section 3.8 (e.g., target pass-rate improvement, target time savings) are template targets for follow-up pilots, not findings of this study.
5.8. Contributions
This study makes four contributions:
A leakage-checked, multi-seed ESC benchmark. An RF + SMOTE/GRU/LSTM comparison on a small (n = 188), realistically imbalanced ESC dataset, with Fail as the positive class, a temporal train/validation split, five seeds, and an explicit time-masking self-check. The full protocol and scripts are documented for reuse.
A counter-intuitive but well-supported finding on this dataset. In this regime, RF + SMOTE beats both LSTM and GRU on Fail-recall, and LSTM/GRU exhibit majority-class collapse in early weeks under Gaussian-noise augmentation. This is reported as an operational caution observed in a single-course pilot rather than as a methodological generalisation; cross-course and cross-institution replication is required before any broader claim can be made.
Statistical evidence on feature representation. A 90-cell Wilcoxon test on Cohen’s d shows that cumulative features have larger, not smaller, between-class separation; the original-vs-cumulative difference is a sensitivity/precision trade-off, supported by a systematic statistic rather than a single anecdote.
Design-level scaffolding (not yet validated). A Rogers × Senge three-tier institutionalisation pathway and a four-dimensional impact assessment template are offered as blueprints for subsequent institutional pilots to test. The present study reports no Tier-2 or Tier-3 deployment data, no usage or adoption outcomes, and no measured institutional impact; neither framework, therefore, constitutes a closure of the corresponding research gap, and they are positioned as scaffolding for future empirical work rather than as findings of this paper.
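The time-masking self-check named in the first contribution is not detailed in this section, so the following is only one plausible form of such a check, under stated assumptions: a pipeline is treated as leakage-free at a given week if overwriting all later weeks with an extreme probe value leaves its output unchanged. The function names, the probe value, and both toy pipelines are hypothetical.

```python
def mask_after_week(sequence, cutoff_week):
    """Zero out every week after `cutoff_week` (1-indexed) in one student's sequence."""
    return [list(week) if i < cutoff_week else [0.0] * len(week)
            for i, week in enumerate(sequence)]

def leakage_check(pipeline, sequence, cutoff_week, probe=999.0):
    """True if tampering with weeks after the cutoff does not change the pipeline output."""
    tampered = [list(week) if i < cutoff_week else [probe] * len(week)
                for i, week in enumerate(sequence)]
    return pipeline(sequence, cutoff_week) == pipeline(tampered, cutoff_week)

# Hypothetical pipelines: one masks future weeks before scoring, one forgets to.
def masked_pipeline(sequence, cutoff_week):
    return sum(sum(week) for week in mask_after_week(sequence, cutoff_week))

def leaky_pipeline(sequence, cutoff_week):
    return sum(sum(week) for week in sequence)

# Hypothetical student: 3 weeks, 2 features per week.
student = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

masked_ok = leakage_check(masked_pipeline, student, cutoff_week=2)
leaky_ok = leakage_check(leaky_pipeline, student, cutoff_week=2)
```

A check of this kind is cheap to run for every prediction week and catches the most common source of optimistic early-week results, namely features that silently include later activity.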
5.9. Limitations
The study has six substantive limitations that subsequent work must address before any of the design-level scaffolding can be claimed as realised: (1) Single course, single institution, 188 students, 30 Fail; the empirical results do not generalise cross-course or cross-institution without replication, and the broader framings used elsewhere in the manuscript (“sustainable transformation of the ESC”, institutional impact, SDG-aligned outcomes) are not supported by data of this scope and are explicitly reserved for follow-up studies that use the four-dimensional template in Section 3.8. (2) Binary Pass/Fail target; finer-grained achievement levels were not modelled. (3) Per-semester variation in Fail-recall (RF 85.7% vs. 96.7%) and accuracy (90.4% vs. 80.8%) indicates cohort drift, so a rolling-update strategy is operationally necessary rather than optional. (4) Minority-class augmentation for LSTM/GRU was deliberately kept simple (Gaussian noise) to keep the protocol comparable across seeds; stronger strategies (focal loss, 3D SMOTE, class-weighted BCE with pos_weight) are left to future work and may shift the RF-vs-LSTM conclusion. (5) Interpretability relies on RF Gini and LSTM permutation importance; SHAP-based interpretation of LSTM on this small dataset was not attempted because 13 Fail trajectories is below the recommended sample size for stable SHAP values. (6) Institutionalisation and impact frameworks (Section 3.7 and Section 3.8) are design-level; no Tier 2/Tier 3 deployment, interview, or outcome data are reported, and all long-horizon claims about SDG 4, equity, or downstream business costs are explicitly reserved for future empirical studies.
6. Conclusions
The conclusions are organised into three tiers—observed, reasonable inferences, and future proposals—so that the reader can judge at each tier how far the claim is supported by the data in this paper.
Tier 1—Observed on this single-course dataset. (i) Random Forest with SMOTE attains 85.59% accuracy and 91.19% Fail-recall on average across weeks 6–16 and two validation semesters, already at 87.86% Fail-recall by Week 6 (deterministic under a fixed seed, leakage-checked); it is the operationally preferable EWS in this regime. (ii) LSTM and GRU, under Gaussian-noise augmentation on the 13 Fail training trajectories, collapse to the majority class in Weeks 6–10 (Fail-recall – with CI half-widths up to 26 p.p.) and become usable only from Week 14 (LSTM Fail-recall 80.00%); their high headline accuracy hides very low sensitivity in early weeks. (iii) Original features maximise Fail-recall (90.36%); cumulative/mixed features improve precision and F1 but sacrifice ≈ 4 p.p. of recall. The (week, feature) Cohen’s d analysis shows that cumulative features retain larger between-class separation on average, so the gap is a sensitivity/precision trade-off rather than an information-dilution effect. (iv) RF Gini importance emphasises exam, forum, and homework; LSTM permutation importance emphasises exam, homework, and custom; both models agree that weblink click-through carries no predictive value.
Tier 2—Reasonable inferences beyond this dataset. The 12-week intervention window that RF + SMOTE unlocks is operationally meaningful for programme-level deployment, but its cross-course and cross-institution generalisation is an inference that this study cannot confirm. The observation that sequence models under naive augmentation collapse on small imbalanced data is a methodological warning that is consistent with the broader learning-analytics literature [
26,
28], but any strong claim that “RF will always beat LSTM in ESC settings” would overreach. Likewise, it is reasonable to infer that an instructor-level Tier 1 pilot grounded in Rogers/Senge theory is the right way to enter Tier 2 and Tier 3, but the theoretical grounding does not yet substitute for empirical validation.
Tier 3—Future proposals. The three-tier institutionalisation pathway (
Section 3.7) and the four-dimensional impact assessment template (
Section 3.8) are offered as proposals, not as achievements of this study. Claims about equity gains, institutional resilience, reductions in downstream corporate training cost, or measurable SDG 4 contributions are reserved for 2–5 year follow-up studies that use the template. Within the same tier sit the technical extensions most likely to shift the RF-vs-LSTM conclusion: class-weighted BCE with
pos_weight, focal loss, 3D SMOTE, attention mechanisms, and Transformer-based architectures; SHAP or integrated-gradients interpretability of a deep model that is not majority-collapsed; cross-course/cross-LMS transfer experiments; and randomised controlled trials that measure whether a flagged-and-intervened student actually outperforms a flagged-and-not-intervened counterpart.
Combining the three tiers: (1) on this single-course dataset, RF + SMOTE is an operationally usable EWS from Week 6, and the early-week collapse of LSTM/GRU under naive Gaussian-noise augmentation is a documented caution rather than a methodological generalisation; (2) the Rogers/Senge institutionalisation pathway is a coherent route from instructor pilot to campus-wide deployment, but it is a route, not an arrival: no Tier-2 or Tier-3 deployment, adoption, or outcome data are reported here; (3) claims about equity gains, downstream institutional impact, or system-wide sustainability transformation are reserved for follow-up studies that apply the four-dimensional template (Section 3.8). The author and his team will continue to extend this work along three directions: cross-course and cross-institution replication of the protocol, sequence-model improvements (focal loss, class-weighted BCE, attention/Transformer architectures) that may shift the RF-vs-LSTM conclusion, and longitudinal cohort tracking against the four-dimensional template to test the design-level scaffolding empirically.