AI-Driven Sustainable Transformation of the Educational Supply Chain: Comparative Evaluation of Machine Learning Models for an Early Warning System and Design-Level Frameworks for Institutionalization and Impact Assessment

Chi, Chen-Chung

doi:10.3390/su18115523

Chen-Chung Chi

Center for Distance Education Development, Office of Information Services, Tamkang University, New Taipei City 25137, Taiwan

Sustainability 2026, 18(11), 5523; https://doi.org/10.3390/su18115523 (registering DOI)

Submission received: 28 March 2026 / Revised: 8 May 2026 / Accepted: 29 May 2026 / Published: 1 June 2026

(This article belongs to the Special Issue AI for Sustainable Supply Chain-Driven Business Transformation)

Abstract

Higher education institutions face the persistent challenge of student attrition, a critical risk node within the educational supply chain (ESC). This study adopts a supply chain management (SCM) perspective to apply artificial intelligence (AI) for sustainable transformation of the ESC and evaluates an early warning system (EWS) for student performance prediction on a single programming course at Tamkang University. Learning trajectory data from 188 students across four semesters (90 for training, 98 for temporal validation; 30 fail cases in total) were collected from the iClass learning management system. To match the operational goal of the EWS (maximizing detection of at-risk students), the minority Fail class was treated as the positive class, so that recall directly measures sensitivity to at-risk cases. Three models were compared under a 5-seed protocol with time-masking to prevent future-week leakage: Random Forest (RF) with SMOTE, GRU, and LSTM. Averaged across weeks 6–16 and both validation semesters, RF achieved an accuracy of 85.59%, a Fail-recall of 91.19%, a precision of 58.89%, and an F1 of 70.36%, and already provided reliable warning at Week 6 (Fail-recall 87.86%). Under the same protocol, LSTM and GRU collapsed to the majority class during weeks 6–10 (Fail-recall 0–42%), yielding higher headline accuracy but substantially lower sensitivity; they became usable only from Week 14 onwards (LSTM Fail-recall 80.00% at Week 14, 82.86% at Week 16). A Wilcoxon test on Cohen’s d over 90 (week × feature) pairs showed that cumulative features exhibit larger, not smaller, between-class separation than original features ($|d|$: 0.717 vs. 0.192; $p < 0.001$), indicating that the original-vs-cumulative trade-off is one of sensitivity versus precision rather than information dilution. As design-level companions to these empirical results, the study also proposes a three-tier institutionalization framework and a four-dimensional impact assessment framework; these are offered as implementation blueprints rather than empirically validated outcomes. The contributions of this paper are operational rather than methodologically novel: (i) a reproducible EWS benchmark on a small, imbalanced ESC dataset, including a diagnosis of LSTM/GRU’s early-week majority-class collapse under naive augmentation, and (ii) design-level institutionalization and impact-assessment scaffolding offered as a template for subsequent institutional pilots, not as empirically validated outcomes of the present study.

Keywords:

artificial intelligence; educational supply chain; sustainable transformation; early warning system; learning analytics; institutionalization; impact assessment; organizational learning; talent development

1. Introduction

1.1. Research Background: Sustainability Challenges in the Educational Supply Chain

Educational supply chain (ESC) theory conceptualizes higher education institutions as core nodes in the societal talent supply chain, where student enrollment corresponds to raw material input, the teaching process to value-added transformation, and graduation to talent output [1,2]. Unlike a conventional manufacturing supply chain, however, the educational supply chain bears a distinct sustainability burden: its “product loss” is not merely an economic write-off but a truncation of a student’s learning trajectory and of the talent that a society can later draw on. Framing student attrition as a sustainability problem, therefore, connects three concrete SDG 4 targets [3]: 4.3 (equal access to affordable tertiary education), 4.4 (increase in the number of adults with relevant skills for employment), and 4.5 (eliminate gender and vulnerability disparities in access to education). Each at-risk student who is not caught early represents a local failure against all three targets—an equity loss (the weakest learners are typically the first to be lost), a skills loss (fewer graduates enter the downstream labour market), and a disparity loss (attrition tends to concentrate in already disadvantaged cohorts). This is the sense in which an EWS can be a sustainability instrument rather than an administrative convenience: catching a struggling student in Week 6 rather than after the final exam is the ESC analogue of preventing, rather than discarding, a defective unit. This framing also aligns with more recent sustainability-focused higher education research that moves from declarative SDG alignment to measurable institutional mechanisms [4,5,6,7].

In supply chain management (SCM), traditional models lacking demand forecasting and risk early warning lead to inefficient resource allocation and delayed risk response [8]; this study argues that educational management faces similar challenges. Traditional educational management models depend heavily on post hoc assessment mechanisms such as midterm and final examinations. Instructors often cannot identify struggling students until mid-to-late semester, missing optimal intervention windows.

Recent rapid advances in artificial intelligence (AI), particularly in deep learning and generative AI, present unprecedented opportunities to address these challenges [9]. In SCM, AI has been widely applied to demand forecasting, risk identification, resource optimization, and process reengineering [10]. A natural research question is whether the same paradigm transfers to education. The present study engages this question in a deliberately conservative way: rather than assuming that deep sequence models will outperform simpler baselines on educational data, it evaluates Random Forest with SMOTE against GRU and LSTM on a single-course ESC dataset (188 students, 30 fail cases) and reports the honest trade-offs; the result (Section 4) is that traditional ML with explicit minority resampling currently dominates deep sequence models on this small, imbalanced dataset—a finding that is itself relevant to the sustainability question, because low-cost, reproducible baselines are what most institutions will actually deploy.

1.2. Theoretical Foundations of AI-Driven Educational Supply Chain Transformation

Following [1,2], the ESC can be viewed as a closed-loop system of input (enrollment), transformation (teaching), output (graduate talent) and feedback (outcome assessment), in which higher education institutions act as upstream suppliers to the societal talent supply chain. Within this frame the EWS plays the role of a quality control node: rather than waiting for end-of-term grades, it monitors the transformation process and flags at-risk students early enough to intervene, reducing the “defect” handed on to the downstream labour market. Viewed this way, AI can drive transformation along four ESC nodes: (i) risk forecasting and early warning, (ii) precision allocation of tutoring resources, (iii) data-driven reengineering of teaching and warning workflows, and (iv) continuous feedback that supports organisational learning. Table 1 summarises the mapping between canonical SCM concepts and their ESC counterparts. A longer review of the AI-in-SCM literature is deferred to Section 2 to avoid duplication.

1.3. The PED System, Its Limits, and the Role of This Study

Tamkang University’s Office of Information Technology has developed the Smart PASS (Smart Planning and Advising for Student Success) platform, combining iClass (the LMS), iSignal (commendation and early warning), and iCan (intelligent career matching) with a Data–Dashboard–Decision workflow. Within this platform the Performance and Engagement Diagram (PED) system has, since academic year 110, provided bi-weekly dashboards and automated warning e-mails for instructors, academic advisors and students. The PED currently classifies each student each week into four quadrants using fixed thresholds on normalised Performance (P) and Engagement (E). Two practical limitations of this rule-based classifier were observed in deployment and motivate the present study: (i) max–min normalisation of P and E is sensitive to outliers and to class-size variation, and (ii) during early weeks, a majority of students lie close to the origin of the P–E plane, so the fixed-threshold boundary forces nearly identical students into different quadrants and generates many false alarms.

Figure 1 summarises the resulting architecture as a conceptual data-flow diagram. This study therefore evaluates a data-driven alternative that is designed to replace, not merely complement, the quadrant rule inside the existing PED pipeline. Concretely, the intended integration is: (a) the PED continues to ingest iClass activity logs; (b) instead of applying fixed P/E thresholds, the weekly feature vector of each student is fed to a trained prediction model (RF or LSTM, depending on availability of a reliable minority-class sample); (c) the model emits a per-student Fail probability that drives the existing iSignal notification channel and the instructor dashboard shown in Figure 2. The empirical work reported in Section 3 and Section 4 serves to evaluate which model family can reasonably be dropped into step (b); the institutionalisation pathway (Section 5) is offered as a design-level proposal, not an empirical claim of the present study.

1.4. Research Objectives and Contributions

The study pursues three empirical objectives (O1–O3) and two design-level objectives (O4–O5); the latter are explicitly framed as blueprints to be validated in future work, not as claims supported by the present data.

(O1, empirical) Evaluate a small family of EWS models (Random Forest with SMOTE, GRU, LSTM) under a rigorous protocol that uses Fail as the positive class, a temporal train/validation split, five random seeds, and an explicit time-masking leakage check.

(O2, empirical) Compare three temporal feature representations—original weekly values, cumulative values, and a mixed representation—under the same protocol, and report which representation maximises which metric (sensitivity vs. precision vs. F1).

(O3, empirical) Identify the earliest week at which each model family attains a pre-specified operational target (Fail-recall ≥ 0.80 with precision not degrading into dominant-class-collapse), and thus determine the intervention window each model actually supports.

(O4, design-level) Propose a three-tier institutionalisation pathway (instructor pilot → departmental expansion → Smart PASS integration) as a structured blueprint grounded in Rogers [11] and Senge [12]; this study does not yet report usage, adoption, or outcome data from the latter two tiers.

(O5, design-level) Propose a four-dimensional impact assessment framework covering student outcomes, resource efficiency, organisational learning, and talent output; this is offered as an evaluation template for 2–5 year follow-up studies, not as measured impact.

The contributions of this paper are positioned as applied, operational improvements rather than broader methodological innovations. Concretely, the paper offers (i) a benchmark of traditional ML against sequence models on a small, realistically imbalanced ESC dataset under a leakage-checked, multi-seed protocol with Fail-recall as the primary operational metric; (ii) a documented failure mode in which sequence models collapse to the majority class under naive Gaussian-noise augmentation, so that apparent high accuracy hides zero sensitivity; and (iii) design-level institutionalisation and impact-assessment scaffolding that subsequent institutional deployments can test. None of these is claimed as a methodological innovation in machine learning or learning analytics; they are practical recipes for practitioners running similar single-course pilots. Claims about downstream corporate benefits, equity gains, or system-wide transformation are reserved for future work.

2. Literature Review

2.1. AI-Driven Sustainable Transformation in Supply Chain Management

SCM is undergoing AI-driven transformation across four recurring dimensions: demand forecasting, where machine-learning techniques consistently improve on classical time-series baselines [10,13]; risk management, where data-driven analytics enable proactive mitigation before disruptions propagate [14]; resource optimisation, which uses algorithmic approaches to reduce waste [15]; and resilience analytics, which traces ripple effects through the network [16]. Large-scale empirical evidence supports an operational-performance payoff from these capabilities [17]. On the sustainability side, Industry 4.0 adoption has been linked to 10R circular-economy practices [18], illustrating that AI-driven supply chain transformation is not pursued in isolation from environmental and social objectives. The literature on AI-in-SCM therefore provides a template—forecast, control, optimise, audit—that this study transposes to the ESC setting in Section 1.3, without re-deriving its foundations here.

2.2. Educational Supply Chain and Sustainable Higher Education

The educational supply chain concept, originally proposed by Habib and Jungthirapanich [1], conceptualises higher education institutions as analogous to manufacturing supply chains: student enrolment as raw-material input, teaching and research as value transformation, and graduates and research outputs as final products, with feedback driven by societal and industry demand. Chowdhury et al. [2] extend this framework to examine alignment between higher education and 21st-century workforce demand, and Pathak and Pathak [19], drawing on Porter’s value-chain theory, propose an “education value chain” that decomposes higher education activities to analyse value creation and cost control.

The United Nations Sustainable Development Goal 4 (SDG 4: Quality Education) provides a global framework for ESC sustainability [3]. Whereas Leal Filho et al. [4] reported in 2019 that SDG implementation at universities remained in early stages, more recent work paints a mixed, and partly contradictory, picture. Findler et al. [5] and Leal Filho et al. [6] document growing but still fragmented institutional practices; sector surveys published in Sustainability in 2023–2025 increasingly emphasise concrete mechanisms—learning analytics, AI-based advising, and institutional data infrastructure—rather than declarative SDG alignment [7,20,21]. The present study positions itself within this more recent literature: it treats sustainability as operationalised through measurable early-warning mechanisms, not as a framing added post hoc.

2.3. Learning Analytics and AI Applications in Education

Learning analytics is a key enabling technology for ESC digital transformation. Verbert et al. [22] proposed a canonical framework for learning analytics dashboards (LADs) that visualise LMS log data; De Laet et al. [23] showed that LADs can support data-driven advisor–student dialogues; Lu and Cutumisu [24] found that online engagement and formative-assessment performance mediate the attendance–performance link; and Rets et al. [25] reported that students place a higher value on actionable recommendations than on descriptive analytics, indicating a necessary shift from data visualisation to action-oriented design.

On the deep-learning side, Alnasyan et al. [26] systematically reviewed CNN/DNN/LSTM applications to student-performance prediction and observed that reported accuracies are highly dataset-dependent and often reported without class-imbalance controls. Kalita et al. [27] proposed a Bi-LSTM model integrated with SHAP, achieving ≈ 88% accuracy on a large open dataset while improving interpretability, and Al-Azazi and Ghurab [28] reported ≈ 70% accuracy at month 3 in a MOOC setting with ANN-LSTM. A recurring caveat in these studies, rarely discussed explicitly, is that on strongly imbalanced educational data, a classifier can attain high headline accuracy by predicting the majority class; the revised results in Section 4 document exactly this pathology for LSTM/GRU on the present small-N dataset, and motivate reporting Fail-recall and precision alongside accuracy. Related Sustainability-indexed work has recently emphasised SMOTE-style minority resampling and multi-seed robustness in educational prediction [29,30], which the present protocol adopts. Within Sustainability itself, Jawad et al. [31] use Random Forest with explicit data-balancing to predict engagement and performance in a virtual learning environment, Wen et al. [32] construct an early-warning system driven by homework-habit features toward sustainable higher-education evaluation, and Staneviciene et al. [33] present a data-mining case study of student-performance prediction for sustainable e-learning. The present study sits squarely within this line of work, with two methodological additions: a leakage-checked 5-seed protocol and Fail-as-positive sensitivity reporting with 95% CIs.

From the sustainability-of-higher-education angle, the present study engages three empirical gaps and two design-level gaps. The empirical gaps (i)–(iii) are addressed by the experimental work in Section 3 and Section 4; the design-level gaps (iv)–(v) are not closed by this study—they are partially scoped by structured scaffolding, and empirical closure of either gap is reserved for subsequent institutional pilots described in Section 5.

(i) Empirical: model comparison under imbalance. Most published comparisons report only accuracy and Pass-recall; a protocol that explicitly treats Fail as positive, uses multi-seed runs, and reports Fail-recall with confidence intervals is largely absent. Addressed by the protocol in Section 3 and the results in Section 4.
(ii) Empirical: feature-representation trade-offs. The original-vs-cumulative question has been raised but is usually resolved by a single pairwise comparison; a (week × feature) effect-size analysis has not, to the author’s knowledge, been reported. Addressed by the Wilcoxon analysis in Section 4.2.
(iii) Empirical: operational early-warning timing. Studies report “early” prediction without a stated criterion for when a warning becomes operationally usable; the present study addresses this by fixing a Fail-recall target in advance and selecting the earliest within-semester cutoff at which it is met. Addressed in Section 4.3.
(iv) Design-level: institutionalisation pathway. Few studies provide a concrete tier-by-tier blueprint linking a technical prototype to an existing institutional platform. Not closed by this study. Section 3.7 offers structured scaffolding anchored in Rogers [11] and Senge [12], but the present paper reports no Tier-2 or Tier-3 deployment, adoption, or usage data. Empirical closure of this gap requires the pilots set out in Section 5 and is not claimed here.
(v) Design-level: impact assessment framework. Few studies scaffold the longitudinal assessment that follows deployment. Not closed by this study. Section 3.8 offers a four-dimensional template for 2–5 year follow-up, but the present paper reports no measured impact along any of the four dimensions. Empirical closure of this gap requires longitudinal cohort tracking that is outside the scope of this paper.

2.4. Organisational Learning Theory and Technology Institutionalisation

The three-tier institutionalisation pathway proposed later in this study rests on two classical theoretical sources that need to be brought into the literature review so that the pathway is not presented as an ad-hoc construct. Senge’s [12] “learning organisation” emphasises that an institution’s adaptability depends on feedback loops, systems thinking and continuous collective learning; on this reading, an EWS is not just a classifier but a sensor that closes an organisational feedback loop linking learners, instructors, advisors, and IT services. Rogers’s [11] diffusion of innovations theory predicts that a new practice typically moves through early adopters, an early majority, and institutionalisation, with distinct barriers at each stage. Combining the two, the three-tier pathway in Section 3.7 is not a generic phased plan: each tier corresponds to (a) a Rogers adoption segment (early adopters → departmental majority → institutional embedding) and (b) a Senge feedback loop (instructor-level reflection → departmental curriculum iteration → institutional policy learning). This is the justification for proposing a tiered framework in this study rather than a simpler two-stage one.

The study therefore integrates these theoretical foundations to propose a two-part analytical scaffold for “AI-driven sustainable transformation of the ESC”: a technical dimension (empirical model comparison, Section 3 and Section 4) and a complementary institutional+evaluative dimension (design-level frameworks, Section 5). The two parts are kept deliberately separate so that readers can evaluate the empirical evidence on its own terms before engaging with the design-level proposals.

3. Materials and Methods

3.1. Research Design

The study follows a two-phase design rather than a mixed-methods design in the formal sense. Phase I is empirical and quantitative: model development, evaluation and feature analysis on a temporally split dataset under a 5-seed protocol (Section 3.2, Section 3.3, Section 3.4 and Section 3.5). Phase II is design-level and conceptual: the institutionalisation pathway and the impact-assessment framework (Section 3.7 and Section 3.8) are offered as structured blueprints derived from supply-chain, organisational-learning and diffusion-of-innovation theory. Phase II is explicitly not a qualitative empirical study: no interviews, surveys, focus groups or document analyses were conducted. It is design-based theoretical reasoning, and its conclusions are framed throughout the paper as proposals to be validated by follow-up qualitative and quantitative studies.

3.2. Data Collection and Preprocessing

Learning trajectory data were collected from Tamkang University’s iClass learning management platform across four semesters (1121, 1122, 1131, 1132) of one programming language course, totalling 188 students. This is a deliberately small, single-course, single-institution dataset; its limits are treated as a first-class methodological constraint throughout the paper and a five-seed protocol with confidence intervals is used to quantify the resulting uncertainty. A temporal split strategy was employed: the first two semesters (1121, 1122; 90 students) served as the training set and the latter two (1131, 1132; 98 students) as the validation set (Table 2), so the evaluation simulates the model’s predictive capability for future cohorts rather than an in-sample random split.

Five learning-activity features were extracted from iClass, corresponding to the five activity types the platform distinguishes.

homework: weekly assignment score rate, computed as $\sum score / \sum percentage$ over all assignment submissions completed in week w. Captures summative skill acquisition.
forum: weekly forum participation rate over posts and replies in week w, on the same normalisation. Captures peer interaction and self-regulated help-seeking, which learning-analytics literature [24,25] repeatedly finds to mediate between attendance and outcome.
exam: weekly in-class quiz/exam score rate in week w; captures summative assessment within the semester.
custom: weekly completion rate of instructor-defined activities (in this course mostly coding exercises and short concept checks that are not graded as homework).
weblink: weekly click-through/completion rate on instructor-uploaded external resources (supplementary tutorials, reference slides). A “weblink” in iClass is a hyperlinked teaching resource posted by the instructor rather than a graded activity; it records whether a student opens the material, not how long they engaged with it, which is why the empirical analysis in Section 4.4 finds its predictive contribution close to zero.

Each student generated a time-series tensor of 18 weeks $\times$ 5 activity types = 90 dimensions. Semester 1132 stored the weblink rate under the file name weblink_score.csv (renamed to weblink.csv for the pipeline); the underlying schema is identical across semesters.

Figure 2 shows the PED system instructor dashboard interface. The upper scatter plot displays all students’ Performance (x-axis) and Engagement (y-axis) distribution for the current week; the right-side gauge classifies students into green, yellow and red zones using a fixed-threshold four-quadrant rule; the lower line chart shows an individual student’s weekly trend. The purpose of the EWS proposed in this study, when integrated into the existing PED pipeline, is to replace the fixed-threshold gauge with a data-driven Fail-probability score, while keeping the scatter plot and trend chart unchanged. This separation of concerns keeps the instructor-facing dashboard stable and localises the change to the classification step.

Class imbalance was handled differently for the three models. Random Forest was trained with SMOTE in the flattened feature space at each week cutoff; SMOTE generates synthetic minority samples by interpolating between k-nearest neighbours of existing minority points, so the added variance reflects the geometry of the minority class rather than isotropic noise. GRU and LSTM used minority-class replication with small Gaussian noise ($\mu = 0$, $\sigma = 0.01$) on the 3D time-series tensor. Section 4.5 shows that this relatively low-variance augmentation is not strong enough to prevent LSTM/GRU from collapsing to the majority class in early weeks, a finding that the present study reports in its own right. All augmentation operations were performed on the training split only, and time-masking (Section 3.5) is applied after augmentation to guarantee no future-week leakage.
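The replication-with-noise scheme described above for GRU and LSTM can be illustrated with a minimal NumPy sketch. It is not the author's code: the function name `replicate_minority_with_noise`, the choice to balance the two classes exactly, the absence of clipping to $[0, 1]$, and the toy minority count are all assumptions made for illustration; only $\mu = 0$, $\sigma = 0.01$ and the (students, weeks, activities) tensor layout come from the text.

```python
import numpy as np

def replicate_minority_with_noise(X, y, sigma=0.01, seed=42):
    """Replicate Fail (y == 1) rows with N(0, sigma^2) noise until classes are balanced.

    X has shape (n_students, n_weeks, n_activities); y holds 0/1 labels.
    Balancing to a 1:1 ratio is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    n_needed = int((y == 0).sum()) - minority_idx.size
    if n_needed <= 0 or minority_idx.size == 0:
        return X.copy(), y.copy()
    # Sample minority students with replacement and perturb every cell slightly.
    picks = rng.choice(minority_idx, size=n_needed, replace=True)
    noise = rng.normal(loc=0.0, scale=sigma, size=(n_needed,) + X.shape[1:])
    X_aug = np.concatenate([X, X[picks] + noise], axis=0)
    y_aug = np.concatenate([y, np.ones(n_needed, dtype=y.dtype)])
    return X_aug, y_aug

# Toy training split: 90 students, 18 weeks, 5 activities (minority count is hypothetical).
toy_rng = np.random.default_rng(0)
X_train = toy_rng.random((90, 18, 5))
y_train = np.zeros(90, dtype=int)
y_train[:15] = 1
X_aug, y_aug = replicate_minority_with_noise(X_train, y_train)
print(X_aug.shape, int(y_aug.sum()))
```

Because every synthetic row stays within a few hundredths of an existing Fail student, the augmented set adds very little new geometry. This is consistent with the text's description of the scheme as low-variance augmentation.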

3.3. Feature Engineering Methods

Three temporal feature representations of each (student, week, activity) cell are compared. Let $s_{i, w, a}$ be the weekly score rate of student $i$ in week $w$ on activity $a$ (as defined in Section 3.2), and let $N_{w}$ be the number of scored attempts actually logged in week $w$ across the cohort.

(1) Original features. The tensor $X_{i, w, a}^{orig} = s_{i, w, a}$ keeps each weekly slot independent. This preserves the temporal shape of a student’s behaviour: sudden drops (“I stopped submitting homework in week 7”) and localised bursts (“I only engaged around midterms”) are directly visible to the model.

(2) Cumulative features. The tensor $X_{i, w, a}^{cum} = \frac{1}{w} \sum_{w' = 1}^{w} s_{i, w', a}$ is the running per-week average up to week $w$; dividing by $w$ keeps the feature bounded in $[0, 1]$ and removes the spurious trend of raw ∑-cumulatives. Intuitively, $X^{cum}$ is a smoothed trajectory that attenuates short-term fluctuations and is therefore expected to improve precision at the cost of sensitivity to abrupt change points.

(3) Mixed features. The concatenation $X_{i, w, :}^{mix} = concat (X_{i, w, :}^{orig}, X_{i, w, :}^{cum})$ doubles the per-timestep dimensionality from 5 to 10 (so the total feature count grows from 90 to 180). The mixed representation is included to test whether $X^{orig}$ and $X^{cum}$ carry complementary information; its limitation, with only 90 training samples, is clear: more feature dimensions with the same N increase the risk of over-fitting, so the mixed representation is reported mainly as a diagnostic, not as the preferred operational choice.
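The three representations can be written down in a few lines of NumPy. The sketch below follows the definitions in this section; the function name `build_representations` and the single-student toy trajectory are illustrative assumptions, not taken from the study's code.

```python
import numpy as np

def build_representations(s):
    """Return (x_orig, x_cum, x_mix) for s of shape (n_students, n_weeks, n_activities)."""
    n_weeks = s.shape[1]
    # Week numbers 1..n_weeks, shaped to broadcast over students and activities.
    weeks = np.arange(1, n_weeks + 1).reshape(1, n_weeks, 1)
    x_orig = s.copy()                                # weekly slots kept independent
    x_cum = np.cumsum(s, axis=1) / weeks             # running per-week average up to week w
    x_mix = np.concatenate([x_orig, x_cum], axis=2)  # doubles the per-timestep dimension
    return x_orig, x_cum, x_mix

# Toy case: one student, one activity, submissions stop after week 2.
s = np.zeros((1, 4, 1))
s[0, :, 0] = [1.0, 1.0, 0.0, 0.0]
x_orig, x_cum, x_mix = build_representations(s)
print(x_orig[0, :, 0])  # abrupt drop visible at week 3
print(x_cum[0, :, 0])   # running averages 1, 1, 2/3, 1/2
```

In the toy trajectory the original feature falls from 1 to 0 in a single week, whereas the cumulative feature only decays to 2/3 and then 1/2. This is the smoothing effect that the text expects to trade sensitivity to abrupt change points for precision.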

3.4. Model Architecture and Training Strategy

Three models were compared:

Random Forest: 100 trees, max depth 10, min samples split 5, trained on the flattened feature matrix at each week cutoff, with SMOTE applied to the training set only (Section 3.2); used as the traditional-ML baseline.

GRU (Gated Recurrent Unit): Input dimension 5, hidden dimension 64, 2 layers, dropout 0.3; update and reset gates control information flow through the sequence.

LSTM (Long Short-Term Memory): Same network shape as GRU, plus the additional cell-state mechanism (input, forget, output gates) that is intended to capture longer-range dependencies.

Deep-learning models used the Adam optimiser (initial learning rate 0.001), Binary Cross-Entropy loss, dropout 0.3, and early stopping with patience 15. Each deep-learning configuration is trained with five random seeds $\{42, 7, 123, 2024, 314\}$ and evaluated on both validation semesters, with mean ± 95% CI (t-distribution, $df = n - 1$) reported in Section 4. Time-masking at evaluation time zeroes every timestep $w' \geq w$ before the forward pass, so the model cannot observe any future week when predicting at week $w$; a self-check (Section 3.5) verifies that this produces identical outputs whether masking is applied pre- or post-standardisation.
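The seed-level summary (mean ± 95% CI from a t-distribution with $df = n - 1$) can be sketched with the standard library alone. The helper name `mean_ci95_five_seeds` and the recall values are hypothetical; the critical value 2.776 is the usual two-sided 95% table entry for four degrees of freedom, hard-coded here on the assumption that exactly five seeds are used.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for df = 4 (five seeds), from standard tables.
T_CRIT_DF4 = 2.776

def mean_ci95_five_seeds(values):
    """Return (mean, half_width) of a 95% t-interval for exactly five seed results."""
    if len(values) != 5:
        raise ValueError("this sketch hard-codes the critical value for n = 5")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    half_width = T_CRIT_DF4 * sd / math.sqrt(len(values))
    return mean, half_width

# Hypothetical Fail-recall values, one per seed (not results from the paper).
recalls = [0.80, 0.85, 0.75, 0.80, 0.80]
mean, half_width = mean_ci95_five_seeds(recalls)
print(f"{mean:.3f} ± {half_width:.3f}")
```

With only five seeds the critical value is about 40% larger than the normal-approximation 1.96, so intervals computed this way are noticeably wider than a z-interval would suggest.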

3.5. Evaluation Metrics

Positive class. Consistent with the operational goal of an EWS (detecting at-risk students so that timely intervention can be offered), the minority Fail class is treated as the positive class ($y = 1$ if total_score < 60, else $y = 0$). Under this convention, recall equals sensitivity to at-risk students: a recall of $r$ means that $r \times 100\%$ of actually failing students are caught by the EWS. Because the scikit-learn default pos_label = 1 is used, the label encoding requires care: if Pass were numerically coded as 1, the default recall would be Pass-recall rather than at-risk sensitivity, so the label encoding is deliberately aligned to Fail $= 1$.

Metrics. Accuracy, precision, recall and F1 are reported, all with pos_label = 1 (Fail). Recall is the primary metric because the cost of a missed at-risk student (no tutoring offered, possibly attrition) is strictly larger than the cost of a false alarm (an instructor checks in with a student who is, after all, on track). This cost asymmetry is the same one that motivates recall as the primary metric in clinical screening and supply-chain defect detection.

Robustness. Each deep-learning result is computed across five seeds and reported as mean ± 95% CI; the paired RF baseline is trained with a fixed seed since SMOTE + RF with the same random_state is deterministic given the training data. Reported accuracy always refers to the full confusion matrix, so a classifier that predicts everyone as Pass can still obtain $\approx 0.83$ accuracy on this dataset while having Fail-recall = 0; Fail-recall and F1 are relied upon to detect this failure mode.
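The majority-class failure mode is easy to reproduce with a hand-rolled confusion matrix. The sketch below is illustrative only: the function name `fail_positive_metrics`, the convention of returning 0.0 when a denominator is zero, and the 100-student cohort with 17 Fail cases (chosen to give an accuracy near the 0.83 quoted above) are assumptions, not data from the study.

```python
def fail_positive_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with Fail (label 1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Undefined ratios are reported as 0.0 (an assumption of this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical cohort: 17 Fail (1) and 83 Pass (0); the classifier predicts Pass for everyone.
y_true = [1] * 17 + [0] * 83
y_all_pass = [0] * 100
collapsed = fail_positive_metrics(y_true, y_all_pass)
print(collapsed)
```

The collapsed classifier scores 0.83 on accuracy while catching no failing student at all, which is why accuracy alone cannot serve as the headline metric for an EWS.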

Leakage check. For each trained deep-learning model, the study also verifies that $f (mask (X, w)) = f (mask (X', w))$ for any two tensors $X$ and $X'$ that agree on weeks $[1, w)$. All five LSTM seeds and all five GRU seeds passed this check, so week-$w$ predictions are confirmed to depend only on weeks $1, \dots, w - 1$.
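A minimal sketch of time-masking and the leakage check follows. It assumes 1-indexed weeks stored at array index week − 1; `mask_future`, `passes_leakage_check` and `toy_model` are hypothetical names, and `toy_model` is a simple stand-in for a trained network rather than the study's GRU or LSTM.

```python
import numpy as np

def mask_future(X, w):
    """Return a copy of X with every week w' >= w zeroed (weeks are 1-indexed)."""
    X_masked = X.copy()
    X_masked[:, w - 1:, :] = 0.0
    return X_masked

def passes_leakage_check(f, X, X_alt, w, atol=1e-8):
    """True if f gives the same outputs on two tensors that agree on weeks [1, w)."""
    if not np.array_equal(X[:, :w - 1, :], X_alt[:, :w - 1, :]):
        raise ValueError("X and X_alt must agree on weeks [1, w)")
    return bool(np.allclose(f(mask_future(X, w)), f(mask_future(X_alt, w)), atol=atol))

def toy_model(X):
    # Stand-in for a trained model: one score per student from all visible cells.
    return X.mean(axis=(1, 2))

rng = np.random.default_rng(0)
X = rng.random((4, 18, 5))
X_alt = X.copy()
X_alt[:, 5:, :] = rng.random((4, 13, 5))  # the two tensors differ from week 6 onwards

ok = passes_leakage_check(toy_model, X, X_alt, w=6)
print(ok)
```

Without masking, `toy_model(X)` and `toy_model(X_alt)` differ because the later weeks differ; after masking at week 6 the two inputs are identical by construction, so any model evaluated on them must agree.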

Operational warning threshold. To make statements like “the earliest week at which a model is usable” precise, an operational target is fixed in advance: Fail-recall $\geq 0.80$ with Fail-precision > class base rate (so that the model is not simply alarming on everyone). Section 4.3 uses this target to identify the earliest usable warning week for each model family.
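The selection rule can be expressed as a short search over per-week metrics. In the sketch below, the function name `earliest_usable_week`, the dictionary layout, and every number in `weekly` and `base_rate` are hypothetical and chosen only to exercise both conditions; they are not results from the study.

```python
def earliest_usable_week(weekly_metrics, base_rate, recall_target=0.80):
    """Return the first week meeting both conditions, or None if no week qualifies.

    weekly_metrics maps week number -> {"recall": ..., "precision": ...},
    both computed with Fail as the positive class.
    """
    for week in sorted(weekly_metrics):
        m = weekly_metrics[week]
        if m["recall"] >= recall_target and m["precision"] > base_rate:
            return week
    return None

# Hypothetical per-week metrics for one model family.
weekly = {
    6: {"recall": 0.20, "precision": 0.50},   # too insensitive
    10: {"recall": 0.85, "precision": 0.10},  # high recall, but precision below base rate
    14: {"recall": 0.82, "precision": 0.55},  # first week meeting both conditions
    16: {"recall": 0.84, "precision": 0.60},
}
base_rate = 0.16  # hypothetical share of Fail students

first_week = earliest_usable_week(weekly, base_rate)
print(first_week)
```

The week-10 row shows why the precision condition matters: a model that flags nearly everyone clears the recall target but is no better than alarming at the base rate, so it is skipped.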

3.6. Use of Generative AI Tools

In line with the journal’s policy on artificial intelligence tools, the author discloses that Claude Code (Anthropic; Opus 4.6 and Opus 4.7) was used as a coding assistant for two purposes: (i) generation of the Python 3.10 source code that implements the experimental protocol described above, and (ii) generation of figure-rendering code—the TikZ source for the conceptual data-flow diagram in Figure 1, and the matplotlib 3.9 scripts that render Figure 3, Figure 4 and Figure 5 from the recorded experiment outputs. The experimental design, methodological decisions, and the conceptual layout of Figure 1 were specified by the author prior to code generation; every generated script was reviewed and executed by the author, and all reported numerical results (Section 4) are reproducible from the resulting code. No GenAI tool was used for data collection, hypothesis formulation, statistical inference, or interpretation of results. Use of GenAI tools restricted to manuscript text editing (grammar, spelling, formatting) is excluded from this disclosure under the journal’s policy.

3.7. Institutionalisation Framework (Design-Level)

The three-tier institutionalisation framework is presented here as a blueprint to be validated in subsequent deployments; no adoption, usage, or outcome data from Tiers 2 and 3 are reported in this paper. Each tier is explicitly anchored to one Rogers [11] adoption segment and one Senge [12] feedback loop, which is what distinguishes it from a generic phased rollout plan.

Tier 1—Instructor level (Rogers early adopters/Senge individual-reflection loop). Volunteer instructors run the EWS on their own course in a “pilot → feedback → optimise” cycle; the artefact of this tier is an instructor-level log of which predicted-Fail students were intervened on and how they ultimately performed, so that the warning model can be audited in situ before the next tier.

Tier 2—Departmental level (Rogers early-majority/Senge departmental curriculum loop). The EWS is integrated into departmental teaching-quality-assurance workflows; a cross-course warning dashboard is made available to academic advisors; the feedback loop is curricular (“does the department need to adjust sequencing or prerequisites?”).

Tier 3—Institutional level (Rogers institutionalisation/Senge policy learning). The EWS is embedded in the Smart PASS platform’s iSignal warning channel with automated campus-wide distribution; the feedback loop is institutional (policy changes, resource reallocation across departments).

The current paper reports data only from an instructor-level pre-pilot; Tier 2 and Tier 3 outcomes are explicitly proposals for future work (Section 5).

3.8. Impact Assessment Framework (Design-Level)

The four-dimensional framework below is likewise a template for longitudinal impact studies, not a measured outcome of the present work. It is included so that subsequent pilots have a consistent evaluation scaffold, and each dimension is operationalised as a small set of measurable indicators.

(1) Student outcomes. Pass-rate changes (course-level; term-to-term), post-warning improvement rates (conditional on receiving tutoring), self-reported self-regulated-learning scales, and course satisfaction.

(2) Resource efficiency. Tutoring cost per at-risk student, time savings for instructor risk identification (from a documented baseline of “manually scanning the class list”), and the precision of resource allocation (what fraction of tutoring hours go to students who were genuinely at risk).

(3) Organisational learning. Frequency of instructor pedagogical-strategy adjustments (as captured by the Tier 1 logs), departmental curriculum iteration speed, and institutionalisation policy adoption rates.

(4) Talent output. Graduation-rate changes, graduate employment rate and employer satisfaction, and documented contributions to SDG 4 (targets 4.3, 4.4, 4.5) at cohort level. These are long-horizon indicators (2–5 years) and are included as markers for future follow-up, not as claims of current impact.

4. Results

All results below use Fail ($y = 1$) as the positive class, averaging over six prediction weeks $\{6, 8, 10, 12, 14, 16\}$ and two validation semesters (1131, 1132). Deep-learning results are reported as mean ± 95% CI over five random seeds; RF results are deterministic under a fixed random_state. Time-masking was verified leakage-free in 10/10 deep-learning runs (five seeds × two model families).

4.1. Model Performance Comparison

Table 3 reports the three models’ performance averaged across weeks 6–16 and both validation semesters. The clearest headline is that accuracy and Fail-recall diverge sharply: LSTM and GRU obtain the highest accuracy, but they do so largely by predicting the majority class in early weeks. RF with SMOTE attains the highest Fail-recall, the highest F1, and the only Fail-precision that is meaningfully above the class base rate.

Relative to LSTM/GRU, RF + SMOTE provides absolute Fail-recall gains of $+42.17$ p.p. over LSTM and $+58.67$ p.p. over GRU, and F1 gains of $+15.87$ and $+33.25$ p.p., respectively. LSTM is nominally the most accurate model by ≈5 percentage points, but that advantage evaporates once sensitivity to the minority class is examined. On this small ($N = 90$ training, 13 Fail) imbalanced ESC dataset, a traditional RF baseline with SMOTE is the operationally preferable EWS.

Table 4 gives per-semester confirmation. The ± values are the 95% CI half-widths over the five seeds for that (model, semester, week-range) cell; individual per-week CIs are in Table 5.
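For readers who wish to reproduce the ± values, the sketch below shows one common way of obtaining a 95% CI half-width from five seeds. It assumes a Student-t interval with four degrees of freedom (the interval construction is not stated in this section), and the five recall values are made up to mimic a case where most seeds collapse.

```python
from statistics import mean, stdev

T_975_DF4 = 2.776  # two-sided 95% critical value of Student's t, 4 degrees of freedom

def mean_ci95(values, t_crit=T_975_DF4):
    """Return (mean, half_width) of a t-based 95% confidence interval.

    t_crit must match len(values) - 1 degrees of freedom; the default is for five seeds.
    """
    n = len(values)
    half_width = t_crit * stdev(values) / n ** 0.5
    return mean(values), half_width

# Made-up Fail-recall values (%) from five seeds: four collapse to 0, one escapes.
recalls = [0.0, 0.0, 0.0, 0.0, 50.0]
m, hw = mean_ci95(recalls)
print(f"{m:.2f} ± {hw:.2f}")
```

A single escaping seed is enough to produce a half-width larger than the mean, so an interval that includes zero is the expected signature of a mostly collapsed model family.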

4.2. Feature Engineering Comparison

Because LSTM/GRU collapse to the majority class in early weeks, the most informative feature-engineering comparison is on RF + SMOTE (the model that actually separates the two classes). Table 6 reports the three representations averaged across weeks 6–16 and two validation semesters.

The story here is a trade-off rather than a dominance: original features maximise sensitivity to at-risk students (Fail-recall 90.36%), while cumulative and mixed features trade about 4 percentage points of recall for 13–15 percentage points of precision and 6–7 points of F1. For an EWS whose explicit operational target is catching at-risk students, original features are still the recommended representation; cumulative features are a defensible second choice when false-alarm cost rises.

Between-class separation of the two representations. One possible explanation for the original-vs-cumulative difference would be a reduction in between-class separation under cumulative aggregation (a “signal-dilution” hypothesis). This is tested statistically: for every one of the $18 \times 5 = 90$ (week, activity) cells, the absolute standardised mean difference (Cohen’s $| d |$) between Pass and Fail students is computed under both representations on the training set. The results go in the opposite direction:

mean $| d |$ on original features $= 0.192$ (median $0.094$ );
mean $| d |$ on cumulative features $= 0.717$ (median $0.589$ );
a paired Wilcoxon test with alternative “original > cumulative” returns $p = 1$ , while the reverse alternative is highly significant.

That is, cumulative features have larger, not smaller, between-class separation on average. The correct mechanistic story is therefore not signal dilution: cumulative features smooth weekly fluctuations, which improves class separation in the marginal sense but loses the fine-grained temporal events (sudden drops, local bursts) that carry most of the sensitivity. Original features give the classifier the signal it needs to flag an abrupt change before a student is past recovery; cumulative features give a more robust but slower verdict. Both statements are now supported by statistics on the full feature grid, not a single anecdote.
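The per-cell statistic can be sketched as follows. The example assumes the pooled-variance form of Cohen's $d$ (the exact variant is not stated here) and uses toy weekly counts for a single activity; all data and function names are illustrative.

```python
from itertools import accumulate
from statistics import mean, variance

def abs_cohens_d(group_a, group_b):
    """Absolute standardised mean difference using the pooled sample variance."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    if pooled_var == 0:
        return 0.0
    return abs(mean(group_a) - mean(group_b)) / pooled_var ** 0.5

def cumulative(weekly_counts):
    """Running total of one student's weekly activity counts."""
    return list(accumulate(weekly_counts))

# Toy weekly counts for one activity (rows = students, columns = weeks 1..3).
pass_students = [[2, 0, 3], [0, 3, 3], [3, 2, 0]]
fail_students = [[1, 0, 2], [0, 2, 1], [2, 1, 1]]

week = 3  # compare the week-3 cell under both representations
d_original = abs_cohens_d([s[week - 1] for s in pass_students],
                          [s[week - 1] for s in fail_students])
d_cumulative = abs_cohens_d([cumulative(s)[week - 1] for s in pass_students],
                            [cumulative(s)[week - 1] for s in fail_students])
print(round(d_original, 3), round(d_cumulative, 3))
```

In this toy cell the weekly counts fluctuate from student to student while the running totals are stable, so the cumulative representation shows the larger $| d |$. Given the 90 paired values, a one-sided paired test could then be run with, for example, `scipy.stats.wilcoxon` and its `alternative` argument.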

4.3. Prediction Time Point Analysis

Table 5 reports each model’s weekly Fail-recall and accuracy (two-semester pooled), so that the trade-off between timeliness and reliability can be inspected week by week. Deep-learning values are mean ± 95% CI over five seeds.

Operational “optimal warning time” criterion. Section 3.5 fixes the target as “Fail-recall $\geq 0.80$ with precision materially above the class base rate.” Applying this criterion, the earliest usable warning week is:

RF + SMOTE: Week 6 (Fail-recall 87.86%, precision 43.93%, i.e., 2.6× the ≈17% base rate). The 12-week intervention window originally claimed for the EWS is therefore supportable—but only through RF, not LSTM.
LSTM: Week 14 (Fail-recall 80.00%, exactly meeting the operational target; Week 16 reaches 82.86%). The usable intervention window shrinks to ≈4 weeks.
GRU: no week strictly meets the target. Week 16 comes closest (Fail-recall 77.57%, just below the threshold); Week 14 reaches only 60.00% with a ±22.94 p.p. CI half-width, so it is not yet operationally usable.

Non-monotonicity of LSTM recall at early weeks. LSTM recall stays at exactly 0% at Week 6, jumps to $16.14 \pm 24.55\%$ at Week 8 and $42.43 \pm 26.13\%$ at Week 10 (the Week 8 interval includes zero and both intervals are wide: most seeds remain in the degenerate all-Pass solution and only the occasional seed escapes), then stabilises from Week 12 onwards. The reason is that with only 13 Fail examples replicated via $\sigma = 0.01$ Gaussian noise, the LSTM’s early-week loss surface has a single broad basin in which predicting “always Pass” is globally optimal; the network only leaves that basin once the feature tensor contains enough non-zero columns to break the degeneracy. The 95% CIs over five seeds (half-widths of 24.55 p.p. at Week 8 and 26.13 p.p. at Week 10) quantify the remaining instability, and underline why LSTM is not recommended for early-week warning on this dataset.

Cautions on interpreting early-week recall. In imbalanced settings such as this one, a high recall at Week 6 under a Pass-as-positive framing would correspond to a degenerate all-Pass predictor (e.g., confusion matrix TN = 0, FP = 7 for LSTM Week 6 on semester 1131); only the Fail-as-positive recall reported here measures sensitivity to at-risk students. Under that framing the LSTM Week 6 Fail-recall is $0.00 \pm 0.00\%$. The time-masking self-check (Section 3.5) passed on all 5 seeds × 2 model families $= 10$ runs, excluding data leakage as a source of any high-recall artefact.

Figure 3 visualises these weekly trajectories for each model on each validation semester.

Figure 4 focuses on the single quantity that matters most for an EWS—Fail-recall—and shows directly how deterministic RF compares to the 5-seed LSTM and GRU distributions over weeks. The shaded 95% CI bands make the “degenerate basin” behaviour of the sequence models visible as a near-vertical spread at Weeks 8–12 that only collapses from Week 14 onwards, while RF stays above the 80% operational target throughout.

4.4. Feature Importance Analysis

Feature importance is reported from the models that are actually used for prediction in this paper. Because RF is the operational model and LSTM is the sequence model of interest, Gini-based importance is reported for RF and permutation-based importance for LSTM. Random Forest (the explanatory lens) gives a decomposition into individual features; LSTM (the sequence lens) gives a decomposition into activity channels. Reporting both avoids reading feature importance off the weakest model while using the strongest for prediction.

RF Gini importance (aggregated by activity type, on original features trained on the full training set).

exam ≈ 34.32%, forum ≈ 31.12%, homework ≈ 23.90%, custom ≈ 10.65%, weblink ≈ 0.00%.

LSTM permutation importance (drop in Fail-recall on the 1131 validation set; mean across 5 seeds × 5 permutations per seed).

exam: drop $= 0.291 \pm 0.136$ ;
homework: drop $= 0.286 \pm 0.103$ ;
custom: drop $= 0.263 \pm 0.030$ ;
forum: drop $= 0.063 \pm 0.030$ ;
weblink: drop $= 0.000 \pm 0.000$ .

Both models agree that weblink carries no predictive signal for Fail; this matches the way weblink is logged in iClass (click-through only, no engagement depth) and is the cleanest convergent finding. Both also rank exam first. They differ on forum: RF’s Gini gives it the second-largest share, because one specific (week-8) forum feature cleanly splits Pass from Fail in a single tree threshold (Table 7), whereas forum contributes little to LSTM’s sequence-level decisions, which spread importance across the trio of active-submission activities (exam, homework, custom). The disagreement is therefore narrow. For instructional redesign, the two rankings should be read together: consistent weekly submission across homework and custom is the strongest in-semester sequential signal, while exam and the week-8 forum snapshot carry extra discriminative value once such snapshots are available.
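The permutation-based measure used for LSTM can be sketched generically: one activity channel is shuffled across students and the resulting drop in Fail-recall is recorded. The toy predictor, data, and function names below are hypothetical stand-ins, not the study's model or data.

```python
import random

def fail_recall(y_true, y_pred):
    """Recall with Fail (= 1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

def permutation_drop(predict, rows, y_true, channel, n_repeats=5, seed=0):
    """Mean drop in Fail-recall after shuffling one channel across students."""
    rng = random.Random(seed)
    baseline = fail_recall(y_true, [predict(r) for r in rows])
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[channel] for r in rows]
        rng.shuffle(shuffled)
        permuted = [{**r, channel: v} for r, v in zip(rows, shuffled)]
        drops.append(baseline - fail_recall(y_true, [predict(r) for r in permuted]))
    return sum(drops) / n_repeats

# Hypothetical stand-in for a trained model: flags Fail when no exam was submitted.
def toy_predict(row):
    return 1 if row["exam"] == 0 else 0

rows = [{"exam": 0, "weblink": 4}, {"exam": 0, "weblink": 1},
        {"exam": 2, "weblink": 3}, {"exam": 3, "weblink": 0}]
y_true = [1, 1, 0, 0]

print(permutation_drop(toy_predict, rows, y_true, "weblink"))
print(permutation_drop(toy_predict, rows, y_true, "exam"))
```

A channel the model never consults produces a drop of exactly zero under any shuffle, which is the pattern reported above for weblink.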

4.5. Why LSTM Underperforms RF in This Setting

General architectural arguments would predict that LSTM “should” outperform RF on temporally structured data. The results on this dataset run in the opposite direction; the mechanism is specific and instructive, and the diagnosis below follows the empirical evidence.

(1) Class-imbalance handling is asymmetric across models. RF was trained with SMOTE applied in the flattened feature space, which synthesises new minority points by interpolating between k nearest minority neighbours; this gives the tree ensemble a genuinely more diverse view of the Fail class. LSTM/GRU used minority replication plus isotropic Gaussian noise with $\sigma = 0.01$ in the 3D tensor space; with only 13 Fail students, $\sigma = 0.01$ produces near-duplicates, so the deep model effectively still sees the same 13 trajectories. The empirical consequence for LSTM (0% Fail-recall at Week 6 on all five seeds, and a mean of only 16.14% at Week 8) is visible in Table 5.

(2) Early-week sparsity pushes the LSTM loss surface into a degenerate basin. In Weeks 6–10 most feature cells are either zero or close to zero (students have not yet produced much logged activity). The BCE loss on such inputs is dominated by the large Pass majority, so the network’s global optimum is to predict Pass for everyone; this is the “majority-class collapse” that the confusion matrix directly shows (TN $= 0$, FP $= 7$ under a Pass-as-positive reading, equivalently TP $= 0$, FN $= 7$ with Fail as positive, for LSTM Week 6 on semester 1131). RF, whose training set has been rebalanced by SMOTE before fitting, does not face this loss-surface geometry.

(3) LSTM’s cell-state advantage does not translate here because the usable sequence is short. The 18-week window contains at most 2–3 inflection points per student, which is in the regime where RF on flattened features can learn adequate splits; LSTM’s long-range dependency advantage is mostly moot.

(4) Multi-seed variance confirms the diagnosis. The LSTM Fail-recall at Week 8 has a 95% CI of $\pm 24.55$ p.p. over five seeds and at Week 10 of $\pm 26.13$ p.p. (Table 5). These wide CIs are not a nuisance but a signature of the degenerate basin: most seeds land in the all-Pass solution, the occasional seed escapes it. By Week 14 the CI half-width has contracted to $3.73$ p.p. as every seed reliably separates the two classes. Under a more aggressive minority-handling strategy (e.g., 3D SMOTE, focal loss, or class-weighted BCE with pos_weight $= n_{neg} / n_{pos}$), this early-week variance would be expected to shrink; testing such remedies is listed in Future Work (Section 6).
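The contrast drawn in point (1) can be made concrete with a toy comparison of the two augmentation styles. The sketch below is a simplification under stated assumptions: the interpolation picks any other minority row as partner rather than one of the k nearest neighbours used by SMOTE proper, and the three minority rows are made up.

```python
import random

def replicate_with_noise(minority, n_new, sigma=0.01, seed=0):
    """Copy random minority rows and add isotropic Gaussian noise to each value."""
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, sigma) for v in rng.choice(minority)]
            for _ in range(n_new)]

def interpolate(minority, n_new, seed=0):
    """SMOTE-style synthesis: a random point on the segment between two minority rows."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        lam = rng.random()
        out.append([x + lam * (y - x) for x, y in zip(a, b)])
    return out

def dist_to_nearest(point, originals):
    """Euclidean distance from a synthetic point to its closest original row."""
    return min(sum((p - q) ** 2 for p, q in zip(point, o)) ** 0.5 for o in originals)

# Three made-up minority rows (two features each).
minority = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
noisy = replicate_with_noise(minority, n_new=50)
smote_like = interpolate(minority, n_new=50)

avg_noisy = sum(dist_to_nearest(p, minority) for p in noisy) / len(noisy)
avg_smote = sum(dist_to_nearest(p, minority) for p in smote_like) / len(smote_like)
print(avg_noisy, avg_smote)
```

The noise-replicated points stay within a few hundredths of an original row, whereas the interpolated points populate the space between rows, which is the sense in which the tree ensemble sees a more diverse Fail class.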

The general lesson, and the cleanest take-home from this study’s revised empirical contribution, is that sequence-model superiority in educational prediction is conditional on class-imbalance handling commensurate with the architecture’s capacity; on small imbalanced datasets typical of programme-level courses a traditional RF + SMOTE baseline is a more reliable starting point, and should be the benchmark that subsequent sequence-model work must beat in sensitivity, not just in accuracy.

5. Discussion

5.1. What the Results Support, and What They Do Not

The empirical results support three claims and rule out several others that might otherwise be drawn from the data. Both sets are stated here clearly so that subsequent institutional decisions rest on what the data actually show.

Supported by the present data. (i) An EWS that flags at-risk students at Week 6 is feasible on this single-course dataset using RF + SMOTE, with Fail-recall 87.86% and precision 2.6× the class base rate; (ii) original weekly features maximise sensitivity (Fail-recall 90.36%) while cumulative and mixed features improve precision/F1 (from 72.06% to ≈79.4%), which is a genuine trade-off rather than an across-the-board dominance; (iii) on a training set of 13 minority trajectories with $\sigma = 0.01$ noise augmentation, LSTM and GRU collapse to the majority class in early weeks and their headline accuracy is therefore misleading; this is consistent across five seeds.

Not supported. (i) A “zero-miss” framing, requiring 100% Fail-recall: with Fail as the positive class, no model in this study reaches 100% Fail-recall at Week 6, and LSTM reaches 0%. (ii) A “signal dilution” account of the original-vs-cumulative gap: on the full (week, feature) grid, cumulative features have larger average Cohen’s $| d |$ (0.717 vs. 0.192; Section 4.2), so the gap is a sensitivity/precision trade-off, not dilution. (iii) Claims that the EWS, on the basis of these results alone, reduces downstream corporate training costs, measurably advances SDG 4, or produces institutional resilience: the data in this paper do not measure any of these outcomes, and the discussion below is careful to separate them from what is measured.

5.2. Comparison with Related Work

The accuracy and recall figures align reasonably well with recent literature on small-scale educational prediction. Al-Azazi and Ghurab [28] report ≈ 70% accuracy at month 3 of a MOOC; Kalita et al. [27] report ≈ 88% accuracy with Bi-LSTM + SHAP on a large open dataset; Alnasyan et al. [26] conclude from their systematic review that deep models’ reported advantages shrink once class-imbalance controls are added. The present result—RF + SMOTE at 85.59% accuracy and 91.19% Fail-recall, beating LSTM/GRU on the minority class—fits this picture: on large and balanced data, deep sequence models can dominate; on small imbalanced data with naive augmentation, classical ML with explicit resampling is often a more dependable baseline. The per-week Fail-recall trajectory (RF 87.86% at Week 6, 92.86% from Week 10) is broadly consistent with the early-warning timing reported by [28] on MOOC data and with the homework-habit-driven early warning of Wen et al. [32] in Sustainability. The RF-with-balancing design adopted here parallels Jawad et al. [31] but extends their single-seed RF comparison with a leakage-checked multi-seed protocol; Staneviciene et al. [33] report a related case study on sustainable e-learning prediction whose accuracy range is consistent with ours. The result is obtained on a deliberately small, programme-level dataset; it is therefore cautioned that generalisation of the result requires cross-course and cross-institution replication [7,20,29].

5.3. From Evidence to Supply-Chain Language

Rephrased in ESC terms, the evidence supports three cautious transitions rather than three system-level transformations:

From reactive to proactive, at the course level. An RF + SMOTE EWS allows an instructor in this course to identify ≈88% of at-risk students (Fail-recall 87.86%) at Week 6, leaving a 12-week intervention window. This is a course-level finding; extrapolation to department and institution levels is a design-level claim (Section 3.7).

From undifferentiated to targeted tutoring, within the observed precision. At Week 6 the RF precision is 43.93%, meaning that of every 10 students flagged, ≈4 would actually fail without help; that is strictly better than an undifferentiated approach but it is not “just-in-time” in the narrow sense. Instructors should expect to check in with more students than will ultimately be at risk, and the EWS should be presented to them in that language.
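The workload implication of these two numbers can be worked through explicitly. The sketch below uses the Week 6 RF recall and precision quoted in the text; the cohort of 10 at-risk students is hypothetical, and the function name is illustrative.

```python
def expected_workload(n_at_risk, recall, precision):
    """Expected counts implied by a given recall and precision.

    caught  = at-risk students correctly flagged (true positives)
    flagged = all students flagged (true positives + false alarms)
    """
    caught = recall * n_at_risk
    flagged = caught / precision
    return caught, flagged, flagged - caught

# Week 6 RF figures from the text; the cohort size is a made-up example.
caught, flagged, false_alarms = expected_workload(10, recall=0.8786, precision=0.4393)
print(f"caught≈{caught:.1f}, flagged≈{flagged:.1f}, false alarms≈{false_alarms:.1f}")
```

Because the recall is almost exactly twice the precision at Week 6, the number of flagged students is roughly twice the number of students who are actually at risk, so slightly more than half of the check-ins are expected to be false alarms.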

From intuition to quantitative attention, with two decompositions. RF Gini highlights exam and forum; LSTM permutation importance highlights exam, homework and custom. Both agree that weblink carries essentially no predictive signal. A pedagogical reform cannot be read directly off either decomposition alone, but the two together sharpen where instructors can usefully look.

Downstream effects on corporate training costs, SDG 4 targets, or whole-chain efficiency are hypotheses that the four-dimensional framework in Section 3.8 exists to test in a 2–5 year follow-up; they are not supported by the present data, and this paper does not claim they are.

5.4. Long-Term Sustainability Analysis (Design-Level)

This subsection is framed as operational recommendations grounded in the empirical results; it does not claim long-term effects that the dataset cannot measure.

5.4.1. Technical Sustainability

On this small dataset, RF + SMOTE reaches 85.59% accuracy/91.19% Fail-recall while training in seconds on a commodity laptop, which is technically feasible for programme-level deployment. Per-semester performance is not uniform: RF Fail-recall goes from 85.71% (semester 1131) to 96.67% (semester 1132) while accuracy moves the other way (90.37% → 80.82%), consistent with drift in cohort composition and instructor behaviour. Continuous monitoring and periodic retraining are therefore operationally necessary.

To ensure technical sustainability, this study recommends a rolling model update strategy: retraining RF + SMOTE with the two most recent semesters at the end of each academic year; a monitoring dashboard that tracks per-cohort Fail-recall and precision and flags drops below the operational target in Section 3.5; and a modular architecture that keeps the classifier replaceable, so that if subsequent work delivers a sequence model that reliably beats RF on this task, it can be dropped into the same pipeline.
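The two operational pieces of this recommendation, the rolling training window and the per-cohort alert, can be sketched as below. The semester codes 1121 and 1122 and all metric values are made up for illustration (only 1131 and 1132 appear in the text), and both function names are hypothetical.

```python
def training_window(semesters, k=2):
    """Return the k most recent semester codes (codes are assumed to sort chronologically)."""
    return sorted(semesters)[-k:]

def cohorts_below_target(metrics, base_rate, min_recall=0.80):
    """List cohorts whose Fail-recall or Fail-precision misses the operational target."""
    return [cohort for cohort, (rec, prec) in sorted(metrics.items())
            if rec < min_recall or prec <= base_rate]

# Hypothetical semester history and made-up (Fail-recall, Fail-precision) pairs.
print(training_window([1121, 1122, 1131, 1132]))
metrics = {1131: (0.86, 0.55), 1132: (0.78, 0.60)}
print(cohorts_below_target(metrics, base_rate=0.17))
```

A dashboard built on such a check would surface a cohort as soon as either condition of the Section 3.5 target fails, which is the trigger for retraining on the most recent window.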

5.4.2. Economic Sustainability

The RF + SMOTE pipeline trained in under a minute on an Apple Silicon M-series laptop with 16 GB RAM and required no GPU; LSTM/GRU took 2–5 minutes per seed on the same hardware. Inference is essentially instant at the class level, so the system can run on the existing Smart PASS infrastructure without additional procurement. Batch prediction once every 1–2 weeks is sufficient for instructional support, which further reduces the energy footprint of the deployment. The cost-benefit claim is therefore: limited one-time integration cost, minimal per-semester marginal compute cost, and instructor time savings relative to manual roster review. Claims about tuition-revenue recovery or system-level cost offset are deferred to the four-dimensional framework’s 2–5 year follow-up (Section 3.8); the present data do not support them.

5.4.3. Ecosystem Sustainability

Long-term operation also requires coordination across technical maintenance, pedagogical application, and administrative management; the concrete coordination mechanisms are institution-specific and are out of scope of this paper, which reports a single-course pilot.

5.5. Scalability Assessment

Scaling from a single-course pilot to campus-wide deployment raises three issues that the present data cannot adjudicate but that the four-dimensional template (Section 3.8) is designed to test in subsequent pilots: cross-course variation in activity patterns (programming vs. general education, STEM, humanities) which may require domain-adaptation or course-clustering strategies; cross-institutional variation in LMS platforms (Moodle, Canvas, Blackboard) which requires a standardised data-exchange schema before transfer learning can be evaluated; and operational batch-processing capacity, for which the present codebase exposes class-level batch prediction and is integrable with Smart PASS automated workflows. None of these is empirically demonstrated at scale in this paper.

5.6. Institutionalization Pathway

Section 3.7 sets out the three-tier pathway anchored in Rogers [11] and Senge [12]; concretely it consists of an instructor-level pilot (1–2 semesters), departmental expansion (semesters 3–4), and Smart PASS integration (semesters 5–6). Success factors typically cited in implementation studies—administrative leadership, faculty trust through transparent model explanations, student privacy protection, and a continuous improvement culture—are not measured by the present study and are reserved for the Tier-2 and Tier-3 follow-ups.

5.7. Impact Assessment Framework

The four-dimensional template (Section 3.8) spans three time horizons—short-term (1–2 semesters; student outcomes, resource efficiency), medium-term (semesters 3–6; organisational learning), and long-term (2–5 years; talent output and SDG-4-aligned indicators). Quasi-experimental designs comparing flagged-and-intervened against flagged-but-not-intervened students, supplemented by longitudinal cohort tracking, are the recommended evaluation method. Numerical thresholds in Section 3.8 (e.g., target pass-rate improvement, target time savings) are template targets for follow-up pilots, not findings of this study.

5.8. Contributions

The contributions of this study:

A leakage-checked, multi-seed ESC benchmark. An RF + SMOTE/GRU/LSTM comparison on a small ( $N = 188$ ) realistically imbalanced ESC dataset, with Fail as the positive class, a temporal train/validation split, five seeds, and an explicit time-masking self-check. The full protocol and scripts are documented for reuse.
A counter-intuitive but well-supported finding on this dataset. In this regime, RF + SMOTE beats LSTM/GRU on Fail-recall by $+ 42$ and $+ 59$ p.p., and LSTM/GRU exhibit majority-class collapse in early weeks under Gaussian-noise augmentation. This is reported as an operational caution observed in a single-course pilot rather than as a methodological generalisation; cross-course and cross-institution replication is required before any broader claim can be made.
Statistical evidence on feature representation. A 90-cell Wilcoxon test on Cohen’s $| d |$ shows that cumulative features have larger, not smaller, between-class separation; the original-vs-cumulative difference is a sensitivity/precision trade-off, supported by a systematic statistic rather than a single anecdote.
Design-level scaffolding (not yet validated). A Rogers × Senge three-tier institutionalisation pathway and a four-dimensional impact assessment template are offered as blueprints for subsequent institutional pilots to test. The present study reports no Tier-2 or Tier-3 deployment data, no usage or adoption outcomes, and no measured institutional impact; neither framework, therefore, constitutes a closure of the corresponding research gap, and they are positioned as scaffolding for future empirical work rather than as findings of this paper.

5.9. Limitations

The study has six substantive limitations that subsequent work must address before any of the design-level scaffolding can be claimed as realised:

(1) Single course, single institution, 188 students, 30 Fail. The empirical results do not generalise cross-course or cross-institution without replication, and the broader framings used elsewhere in the manuscript (“sustainable transformation of the ESC”, institutional impact, SDG-aligned outcomes) are not supported by data of this scope and are explicitly reserved for follow-up studies that use the four-dimensional template in Section 3.8.

(2) Binary Pass/Fail target. Finer-grained achievement levels were not modelled.

(3) Cohort drift. Per-semester variation in Fail-recall (RF 85.7% vs. 96.7%) and accuracy (90.4% vs. 80.8%) indicates cohort drift, so a rolling-update strategy is operationally necessary rather than optional.

(4) Simple minority-class augmentation. Augmentation for LSTM/GRU was deliberately kept simple (Gaussian noise) to keep the protocol comparable across seeds; stronger strategies (focal loss, 3D SMOTE, class-weighted BCE with pos_weight) are left to future work and may shift the RF-vs-LSTM conclusion.

(5) Interpretability. Interpretation relies on RF Gini and LSTM permutation importance; SHAP-based interpretation of LSTM on this small dataset was not attempted because 13 Fail trajectories is below the recommended sample size for stable SHAP values.

(6) Design-level frameworks. The institutionalisation and impact frameworks (Section 3.7 and Section 3.8) are design-level; no Tier 2/Tier 3 deployment, interview, or outcome data are reported, and all long-horizon claims about SDG 4, equity, or downstream business costs are explicitly reserved for future empirical studies.

6. Conclusions

The conclusions are organised into three tiers—observed, reasonable inferences, and future proposals—so that the reader can judge at each tier how far the claim is supported by the data in this paper.

Tier 1: Observed on this single-course dataset. (i) Random Forest with SMOTE attains 85.59% accuracy and 91.19% Fail-recall on average across weeks 6–16 and two validation semesters, already at 87.86% Fail-recall by Week 6 (deterministic under a fixed seed, leakage-checked); it is the operationally preferable EWS in this regime. (ii) LSTM and GRU, under $\sigma = 0.01$ Gaussian-noise augmentation on the 13 Fail training trajectories, collapse to the majority class in Weeks 6–10 (Fail-recall $0\%$–$42\%$ with CI half-widths up to 26 p.p.) and become usable only from Week 14 (LSTM Fail-recall 80.00%); their high headline accuracy hides very low sensitivity in early weeks. (iii) Original features maximise Fail-recall (90.36%); cumulative/mixed features improve precision and F1 but sacrifice ≈ 4 p.p. of recall. The (week, feature) Cohen’s d analysis shows that cumulative features retain larger between-class separation on average, so the gap is a sensitivity/precision trade-off rather than an information-dilution effect. (iv) RF Gini importance emphasises exam, forum, and homework; LSTM permutation importance emphasises exam, homework, and custom; both models agree that weblink click-through carries no predictive value.

Tier 2: Reasonable inferences beyond this dataset. The 12-week intervention window that RF + SMOTE unlocks is operationally meaningful for programme-level deployment, but its cross-course and cross-institution generalisation is an inference that this study cannot confirm. The observation that sequence models under naive augmentation collapse on small imbalanced data is a methodological warning that is consistent with the broader learning-analytics literature [26,28], but any strong claim that “RF will always beat LSTM in ESC settings” would overreach. Likewise, it is reasonable to infer that an instructor-level pilot grounded in Rogers/Senge theory (institutionalisation Tier 1, Section 3.7) is the right way to enter institutionalisation Tiers 2 and 3, but the theoretical grounding does not yet substitute for empirical validation.

Tier 3: Future proposals. The three-tier institutionalisation pathway (Section 3.7) and the four-dimensional impact assessment template (Section 3.8) are offered as proposals, not as achievements of this study. Claims about equity gains, institutional resilience, reductions in downstream corporate training cost, or measurable SDG 4 contributions are reserved for 2–5 year follow-up studies that use the template. Within the same tier sit the technical extensions most likely to shift the RF-vs-LSTM conclusion: class-weighted BCE with pos_weight $= n_{neg} / n_{pos}$, focal loss, 3D SMOTE, attention mechanisms, and Transformer-based architectures; SHAP or integrated-gradients interpretability of a deep model that is not majority-collapsed; cross-course/cross-LMS transfer experiments; and randomised controlled trials that measure whether a flagged-and-intervened student actually outperforms a flagged-and-not-intervened counterpart.
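As a concrete reading of the first proposed remedy, the sketch below computes the class weight from the training counts reported in this paper (90 students, 13 Fail) and applies it in a plain-Python binary cross-entropy. The weighting convention follows the one used, for example, by the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`; the function names and the constant predicted probability are illustrative assumptions, and this remedy was not evaluated in the present study.

```python
import math

def pos_weight(n_neg, n_pos):
    """Weight on positive (Fail) examples so both classes carry equal total weight."""
    return n_neg / n_pos

def weighted_bce(y_true, p_fail, w_pos):
    """Mean binary cross-entropy with the positive-class term scaled by w_pos."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for y, p in zip(y_true, p_fail):
        p = min(max(p, eps), 1.0 - eps)
        total += -(w_pos * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Training-set counts from the text: 90 students, 13 Fail, hence 77 Pass.
w = pos_weight(n_neg=90 - 13, n_pos=13)

y = [1] * 13 + [0] * 77
always_pass = [0.05] * 90  # a confident "Pass" for every student
plain = weighted_bce(y, always_pass, w_pos=1.0)
weighted = weighted_bce(y, always_pass, w_pos=w)
print(round(w, 2), plain < weighted)
```

With the weight applied, the all-Pass solution is penalised several times more heavily than under unweighted BCE, which is why such a loss would be expected to make the degenerate basin less attractive.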

Combining the three tiers: (1) on this single-course dataset, RF + SMOTE is an operationally usable EWS from Week 6, and the early-week collapse of LSTM/GRU under naive Gaussian-noise augmentation is a documented caution rather than a methodological generalisation; (2) the Rogers/Senge institutionalisation pathway is a coherent route from instructor pilot to campus-wide deployment, but it is a route, not an arrival—no Tier-2 or Tier-3 deployment, adoption, or outcome data are reported here; (3) claims about equity gains, downstream institutional impact, or system-wide sustainability transformation are reserved for follow-up studies that apply the four-dimensional template (Section 3.8). The author and his team will continue to extend this work along three directions: cross-course and cross-institution replication of the protocol, sequence-model improvements (focal loss, class-weighted BCE, attention/Transformer architectures) that may shift the RF-vs-LSTM conclusion, and longitudinal cohort tracking against the four-dimensional template to test the design-level scaffolding empirically.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study by the institutional committee, in accordance with Article 4 of the Human Subjects Research Act.

Informed Consent Statement

Informed consent for participation was not required under local legislation (Personal Data Protection Act of the Republic of China, Article 20(1), item 5).

Data Availability Statement

The raw data presented in this study are not publicly available due to student privacy-protection policies of the institution. The de-identified analytical dataset (raw_data_anon/) on which every reported result is computed contains no student names or institutional identifiers—only per-semester anonymous IDs of the form STU_{semester}_{NNNN}, preserving the record-linking needed to reproduce the five-seed protocol. The mapping between the internal iClass user_no and these anonymous IDs is retained by the PI solely for Research-Ethics-Committee audit purposes and will be destroyed on the timetable specified by the REC exemption determination. The anonymisation script (anonymize_data.py) and the revised experiment scripts (multi_seed_revision_experiment.py, rf_feature_comparison.py) will be made available on reasonable request, so that other institutions with equivalent data can reproduce the protocol.
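
For readers who want to prepare equivalent data, the ID scheme described above can be sketched as follows. This is not the author's anonymize_data.py, which is available only on request; the function name, the shuffling step, and the example user_no values are hypothetical, and only the STU_{semester}_{NNNN} format is taken from the statement above.

```python
import random


def build_anonymous_ids(user_nos, semester, seed=None):
    """Map each internal user_no to an ID of the form STU_{semester}_{NNNN}.

    The same user_no always receives the same ID within a semester, which
    preserves record linking across weekly logs.
    """
    unique = sorted(set(user_nos))
    rng = random.Random(seed)
    rng.shuffle(unique)  # serial numbers should not follow user_no order
    return {u: f"STU_{semester}_{i:04d}" for i, u in enumerate(unique, start=1)}


# Hypothetical user_no values; a repeated value maps to a single ID.
mapping = build_anonymous_ids(["u103", "u007", "u103", "u512"], "1131", seed=0)
```

The returned dictionary corresponds to the mapping that the statement says is retained for audit purposes. The analytical dataset would keep only the values, not the keys.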

Acknowledgments

The author acknowledges the support of Tamkang University’s Office of Information Technology in providing access to the iClass learning management platform data. The author also acknowledges the use of Claude Code (Anthropic; Opus 4.6 and Opus 4.7) as a coding assistant for the data-analysis and figure-rendering scripts; details of how the tool was used are reported in Section 3.6.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
BCE	Binary Cross-Entropy
CI	Confidence Interval
EWS	Early Warning System
ESC	Educational Supply Chain
ESG	Environmental, Social, and Governance
GRU	Gated Recurrent Unit
LAD	Learning Analytics Dashboard
LMS	Learning Management System
LSTM	Long Short-Term Memory
PED	Performance and Engagement Diagram
RF	Random Forest
SCM	Supply Chain Management
SDG	Sustainable Development Goal
SHAP	SHapley Additive exPlanations
SMOTE	Synthetic Minority Over-sampling Technique

References

Habib, M.; Jungthirapanich, C. Research Framework of Educational Supply Chain Management for the Universities. In Proceedings of the 2009 International Conference on Management and Service Science; IEEE: Beijing, China, 2009; pp. 1–4.
Chowdhury, S.A.; Sharmin, R.; Al-Amin, M.; Shifat, M.N. Integrating an educational supply chain model in the higher education sector: Meeting the 21st century workforce demands in Bangladesh. Front. Educ. 2025, 10, 1521309.
United Nations. Transforming Our World: The 2030 Agenda for Sustainable Development; United Nations: New York, NY, USA, 2015.
Leal Filho, W.; Shiel, C.; Paço, A.; Mifsud, M.; Ávila, L.V.; Brandli, L.L.; Molthan-Hill, P.; Pace, P.; Azeiteiro, U.M.; Vargas, V.R.; et al. Sustainable Development Goals and sustainability teaching at universities: Falling behind or getting ahead of the pack? J. Clean. Prod. 2019, 232, 285–294.
Findler, F.; Schönherr, N.; Lozano, R.; Reider, D.; Martinuzzi, A. The impacts of higher education institutions on sustainable development: A review and conceptualization. Int. J. Sustain. High. Educ. 2019, 20, 23–38.
Leal Filho, W.; Salvia, A.L.; Frankenberger, F.; Akib, N.A.M.; Sen, S.K.; Sivapalan, S.; Novo-Corti, I.; Venkatesan, M.; Emblen-Perry, K. Governance and sustainable development at higher education institutions. Environ. Dev. Sustain. 2021, 23, 15969–15990.
Zawacki-Richter, O.; Marín, V.I.; Bond, M.; Gouverneur, F. Systematic review of research on artificial intelligence applications in higher education—Where are the educators? Int. J. Educ. Technol. High. Educ. 2019, 16, 39.
Christopher, M. Logistics & Supply Chain Management, 5th ed.; Pearson: Harlow, UK, 2016.
Dwivedi, Y.K.; Hughes, L.; Ismagilova, E.; Aarts, G.; Coombs, C.; Crick, T.; Duan, Y.; Dwivedi, R.; Edwards, J.; Eirug, A.; et al. Artificial Intelligence (AI): Multidisciplinary perspectives on emerging challenges, opportunities, and agenda for research, practice and policy. Int. J. Inf. Manag. 2021, 57, 101994.
Toorajipour, R.; Sohrabpour, V.; Nazarpour, A.; Oghazi, P.; Fischl, M. Artificial intelligence in supply chain management: A systematic literature review. J. Bus. Res. 2021, 122, 502–517.
Rogers, E.M. Diffusion of Innovations, 5th ed.; Free Press: New York, NY, USA, 2003.
Senge, P.M. The Fifth Discipline: The Art & Practice of the Learning Organization; Revised Edition; Currency Doubleday: New York, NY, USA, 2006.
Carbonneau, R.; Laframboise, K.; Vahidov, R. Application of machine learning techniques for supply chain demand forecasting. Eur. J. Oper. Res. 2008, 184, 1140–1154.
Baryannis, G.; Validi, S.; Dani, S.; Antoniou, G. Supply chain risk management and artificial intelligence: State of the art and future research directions. Int. J. Prod. Res. 2019, 57, 2179–2202.
Min, H. Artificial intelligence in supply chain management: Theory and applications. Int. J. Logist. Res. Appl. 2010, 13, 13–39.
Ivanov, D.; Dolgui, A.; Sokolov, B. The impact of digital technology and Industry 4.0 on the ripple effect and supply chain risk analytics. Int. J. Prod. Res. 2019, 57, 829–846.
Dubey, R.; Gunasekaran, A.; Childe, S.J.; Bryde, D.J.; Giannakis, M.; Foropon, C.; Roubaud, D.; Hazen, B.T. Big data analytics and artificial intelligence pathway to operational performance under the effects of entrepreneurial orientation and environmental dynamism: A study of manufacturing organisations. Int. J. Prod. Econ. 2020, 226, 107599.
Bag, S.; Gupta, S.; Kumar, S. Industry 4.0 adoption and 10R advance manufacturing capabilities for sustainable development. Int. J. Prod. Econ. 2021, 231, 107844.
Pathak, V.; Pathak, K. Reconfiguring the higher education value chain. Manag. Educ. 2010, 24, 166–171.
Crompton, H.; Burke, D. Artificial intelligence in higher education: The state of the field. Int. J. Educ. Technol. High. Educ. 2023, 20, 22.
Chen, X.; Zou, D.; Xie, H.; Cheng, G.; Liu, C. Two decades of artificial intelligence in education: Contributors, collaborations, research topics, challenges, and future directions. Educ. Technol. Soc. 2022, 25, 28–47.
Verbert, K.; Duval, E.; Klerkx, J.; Govaerts, S.; Santos, J.L. Learning analytics dashboard applications. Am. Behav. Sci. 2013, 57, 1500–1509.
De Laet, T.; Millecamp, M.; Ortiz-Rojas, M.; Jimenez, A.; Maya, R.; Verbert, K. Adoption and impact of a learning analytics dashboard supporting the advisor—Student dialogue in a higher education institute in Latin America. Br. J. Educ. Technol. 2020, 51, 1002–1018.
Lu, C.; Cutumisu, M. Online engagement and performance on formative assessments mediate the relationship between attendance and course performance. Int. J. Educ. Technol. High. Educ. 2022, 19, 2.
Rets, I.; Herodotou, C.; Bayer, V.; Hlosta, M.; Rienties, B. Exploring critical factors of the perceived usefulness of a learning analytics dashboard for distance university students. Int. J. Educ. Technol. High. Educ. 2021, 18, 46.
Alnasyan, B.; Basheri, M.; Alassafi, M. The power of Deep Learning techniques for predicting student performance in Virtual Learning Environments: A systematic literature review. Comput. Educ. Artif. Intell. 2024, 6, 100231.
Kalita, E.; Alfarwan, A.M.; El Aouifi, H.; Kukkar, A.; Hussain, S.; Ali, T.; Gaftandzhieva, S. Predicting student academic performance using Bi-LSTM: A deep learning framework with SHAP-based interpretability and statistical validation. Front. Educ. 2025, 10, 1581247.
Al-Azazi, F.A.A.; Ghurab, M. ANN-LSTM: A deep learning model for early student performance prediction in MOOC. Heliyon 2023, 9, e15382.
Susnjak, T.; Ramaswami, G.S.; Mathrani, A. Learning analytics dashboard: A tool for providing students with motivational information and intervention. Educ. Inf. Technol. 2022, 27, 1265–1285.
Hew, K.F.; Lan, M.; Tang, Y.; Jia, C.; Lo, C.K. Where is the “theory” within the field of educational technology research? Br. J. Educ. Technol. 2019, 50, 956–971.
Jawad, K.; Shah, M.A.; Tahir, M. Students’ Academic Performance and Engagement Prediction in a Virtual Learning Environment Using Random Forest with Data Balancing. Sustainability 2022, 14, 14795.
Wen, W.; Liu, Y.; Zhu, Z.; Shi, Y. A Study on the Learning Early Warning Prediction Based on Homework Habits: Towards Intelligent Sustainable Evaluation for Higher Education. Sustainability 2023, 15, 4062.
Staneviciene, E.; Gudoniene, D.; Punys, V.; Kukstys, A. A Case Study on the Data Mining-Based Prediction of Students’ Performance for Effective and Sustainable E-Learning. Sustainability 2024, 16, 10442.

Figure 1. Conceptual data-flow diagram of the EWS evaluated in this study and its intended integration with the existing PED pipeline on the Smart PASS platform. Weekly iClass activity logs are processed into a 30-dimensional feature vector (with cumulative counterparts), classified by an EWS model (RF + SMOTE, LSTM, or GRU) under a 5-seed time-masked protocol, and the resulting per-student Fail probability drives the existing iSignal notification channel and the instructor dashboard (Figure 2). The proposed EWS replaces the previously deployed fixed-threshold P/E quadrant rule at the classification step only; data ingestion (iClass) and downstream notification (iSignal, instructor dashboard) remain unchanged. The empirical scope of this paper covers the bracketed segment (feature vector → classifier → Fail probability); institutional roll-out from this single-course pilot to campus-wide deployment is offered as a design-level proposal in Section 5, not an empirical claim of the present study.

Figure 2. PED system instructor dashboard. Top: scatter plot of all students’ weekly Performance and Engagement; right: student distribution under the current fixed-threshold four-quadrant classification; bottom: individual student weekly trend chart. The dashboard interface is in Traditional Chinese; for reference, its principal on-screen labels denote, in reading order, the system title (“Learning Analytics System”), the course name, a class-wide learning-overview panel, a by-zone student-distribution panel, the number and proportion of students in each zone, and the four quadrant buttons (Quadrants 1–4). These are user-interface labels only and do not affect the scientific interpretation of the figure.

Figure 3. Weekly prediction performance across the two validation semesters. (a,c) Accuracy trends for semesters 1131 and 1132; (b,d) Fail-recall trends. RF tracks the target recall from Week 6; LSTM/GRU only reach it from Week 14 onwards.

Figure 4. Per-week Fail-recall on pooled validation semesters. RF + SMOTE is deterministic (single line, fixed random_state); LSTM and GRU are the mean of 5 seeds with the 95% CI shaded. The dotted line marks the operational target (Fail-recall $\geq 0.80$). RF reaches the target at Week 6 and stays above it; LSTM reaches it only at Week 14; GRU remains below it through Week 16 (mean 77.57% in Table 5).
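
The reading of the figure can be checked against the pooled means reported in Table 5. The sketch below is a minimal check, assuming that "reaching the target" means the earliest reported week from which mean Fail-recall stays at or above 0.80 through the last reported week. The helper name is hypothetical, and only the even-numbered weeks listed in Table 5 are used.

```python
TARGET = 0.80

# Pooled mean Fail-recall per week, copied from Table 5.
fail_recall = {
    "RF + SMOTE": {6: 0.8786, 8: 0.8786, 10: 0.9286, 12: 0.9286, 14: 0.9286, 16: 0.9286},
    "LSTM": {6: 0.0000, 8: 0.1614, 10: 0.4243, 12: 0.7271, 14: 0.8000, 16: 0.8286},
    "GRU": {6: 0.0000, 8: 0.0000, 10: 0.1414, 12: 0.4343, 14: 0.6000, 16: 0.7757},
}


def first_sustained_week(series, target=TARGET):
    """Earliest week from which recall stays >= target through the final week, else None."""
    first = None
    for week in sorted(series):
        if series[week] >= target:
            if first is None:
                first = week
        else:
            first = None  # the run was broken, so start over
    return first


weeks = {model: first_sustained_week(s) for model, s in fail_recall.items()}
# weeks == {"RF + SMOTE": 6, "LSTM": 14, "GRU": None}
```

On these means the check returns Week 6 for RF, Week 14 for LSTM, and no qualifying week for GRU, whose Week 16 mean is 77.57%.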

Figure 5. Activity-type importance distribution from the RF Gini decomposition (aggregated over all 18 weeks), shown for the five iClass activity categories (homework, forum, exam, custom, weblink) as they appear in the training data.

Table 1. Mapping between supply chain management concepts and the educational supply chain; “evidence” refers to empirical results produced in this study, “design-level” to blueprints not yet empirically validated.

SCM Concept	Educational Counterpart	Evidence in This Study (Empirical/Design-Level)
Demand forecasting	Student retention/risk prediction	Empirical: RF + SMOTE achieves Fail-recall 87.86% already at Week 6 (30 feature dims available)
Quality control	Learning outcome monitoring	Empirical: RF avg. accuracy 85.59%, Fail-recall 91.19%, F1 70.36% across weeks 6–16
Defect rate	Student failure rate	Observed at-risk ratio ≈16% (30/188)
Real-time detection	Early warning	Empirical: reliable warning from Week 6 (RF) or Week 14 (LSTM), leaving a 12-week or 4-week intervention window, respectively
Resource optimization	Tutoring resource allocation	Design-level: proposed three-tier institutionalization; not yet quantitatively evaluated
Upstream quality → downstream cost reduction	Graduate competency → reduced corporate training cost	Design-level: long-term tracking indicator in the four-dimensional impact framework

Table 2. Dataset partitioning and class distribution.

Semester	Purpose	Students	Pass	Fail	Class Ratio
1121	Training	46	41	5	8.2:1
1122	Training	44	36	8	4.5:1
Training Subtotal		90	77	13	5.9:1
1131	Validation	45	38	7	5.4:1
1132	Validation	53	43	10	4.3:1
Validation Subtotal		98	81	17	4.8:1
Total		188	158	30	5.3:1

Table 3. Average performance across weeks 6–16 and two validation semesters. Fail is the positive class. Deep-learning values are means over 5 seeds (per-week 95% CIs are given in Table 5); RF is deterministic.

Model	Accuracy	Precision (Fail)	Recall (Fail)	F1 (Fail)
Random Forest + SMOTE	85.59%	58.89%	91.19%	70.36%
GRU	88.39%	43.33%	32.52%	37.11%
LSTM	90.84%	62.14%	49.02%	54.49%
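
To make the gap between accuracy and Fail-recall concrete, the sketch below computes the four metrics with Fail as the positive class. It then evaluates a hypothetical classifier that predicts Pass for every student, using the validation counts in Table 2. The function names and the convention of returning 0.0 when a denominator is zero are assumptions of this sketch, not details taken from the study's scripts.

```python
def fail_metrics(y_true, y_pred, positive="Fail"):
    """Accuracy, precision, recall and F1 with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def all_pass_baseline(n_pass, n_fail):
    """Metrics for a classifier that predicts Pass for every student."""
    y_true = ["Pass"] * n_pass + ["Fail"] * n_fail
    y_pred = ["Pass"] * (n_pass + n_fail)
    return fail_metrics(y_true, y_pred)


# Validation semesters from Table 2: 1131 has 38 Pass / 7 Fail, 1132 has 43 Pass / 10 Fail.
sem_1131 = all_pass_baseline(38, 7)
sem_1132 = all_pass_baseline(43, 10)
mean_accuracy = (sem_1131["accuracy"] + sem_1132["accuracy"]) / 2
```

Averaged over the two semesters, this all-Pass baseline scores 82.79% accuracy with 0% Fail-recall, the same accuracy that Table 5 reports for LSTM at Week 6. This shows why headline accuracy alone says little about sensitivity to at-risk students at this class ratio.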

Table 4. Per-semester average performance (weeks 6–16).

Model	Semester	Accuracy	Precision	Recall	F1
RF + SMOTE	1131	90.37%	67.26%	85.71%	74.64%
RF + SMOTE	1132	80.82%	50.52%	96.67%	66.07%
GRU	1131	89.48%	43.33%	32.38%	37.01%
GRU	1132	87.30%	43.33%	32.67%	37.21%
LSTM	1131	92.00%	62.78%	49.05%	54.96%
LSTM	1132	89.69%	61.50%	49.00%	54.03%

Table 5. Weekly performance (Fail as positive), pooled across 2 validation semesters. LSTM/GRU values are mean ± 95% CI (5 seeds).

Week	Dims	RF Acc/Rec	LSTM Acc	LSTM Rec	GRU Rec
6	30	78.68%/87.86%	82.79 ± 1.25%	0.00 ± 0.00%	0.00 ± 0.00%
8	40	86.46%/87.86%	83.30 ± 3.02%	16.14 ± 24.55%	0.00 ± 0.00%
10	50	91.95%/92.86%	90.08 ± 4.57%	42.43 ± 26.13%	14.14 ± 21.33%
12	60	88.01%/92.86%	95.32 ± 0.52%	72.71 ± 2.79%	43.43 ± 26.81%
14	70	84.23%/92.86%	96.56 ± 0.63%	80.00 ± 3.73%	60.00 ± 22.94%
16	80	84.23%/92.86%	97.00 ± 0.58%	82.86 ± 2.15%	77.57 ± 4.31%
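
One common way to obtain a "mean ± 95% CI" from five seeds is a Student-t interval on the per-seed values. The sketch below assumes that construction, which may differ from the one used in the study's scripts. The per-seed recalls are hypothetical, chosen only to show how a few collapsed seeds widen the interval.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (5 seeds).
T_CRIT_95_DF4 = 2.776


def mean_ci95_five_seeds(values):
    """Return (mean, half_width) of a t-based 95% CI over exactly five seed results."""
    if len(values) != 5:
        raise ValueError("this sketch hard-codes the critical value for 5 seeds")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))  # standard error of the mean
    return mean, T_CRIT_95_DF4 * sem


# Hypothetical per-seed Fail-recall values: three seeds collapse to 0, two do not.
mean, half_width = mean_ci95_five_seeds([0.0, 0.0, 0.0, 0.35, 0.45])
```

A symmetric interval of this kind is not clipped to [0, 1], so an entry such as 16.14 ± 24.55% at Week 8 has a nominal lower bound below zero. Such entries are best read as a sign of high seed-to-seed variance rather than as a literal range.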

Table 6. RF + SMOTE feature-engineering comparison, averaged across weeks 6–16 and two validation semesters. Fail is the positive class.

Feature Method	Accuracy	Precision (Fail)	Recall (Fail)	F1 (Fail)
Original	86.59%	62.00%	90.36%	72.06%
Cumulative	91.01%	77.06%	86.19%	79.42%
Mixed (orig + cum)	90.27%	74.55%	86.19%	78.08%
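
The three feature sets can be pictured with a small sketch. It assumes that features are named W{week}_{activity} as in Table 7, that five activity types are logged per week (which matches the Dims column of Table 5: 30 at Week 6, 80 at Week 16), and that a cumulative feature is a running sum of the weekly value. The running-sum definition and the CUM_ prefix are assumptions of this sketch; the paper's own definition is given in Section 3.3.

```python
ACTIVITIES = ("homework", "forum", "exam", "custom", "weblink")


def visible_features(cutoff_week):
    """Names of the original features a model may see when predicting at cutoff_week."""
    return [f"W{w}_{a}" for w in range(1, cutoff_week + 1) for a in ACTIVITIES]


def cumulative_features(weekly, cutoff_week):
    """Running sum per activity up to cutoff_week; later weeks are never read."""
    out = {}
    for a in ACTIVITIES:
        running = 0.0
        for w in range(1, cutoff_week + 1):
            running += weekly.get(f"W{w}_{a}", 0.0)
            out[f"CUM_W{w}_{a}"] = running
    return out


# Hypothetical weekly counts for one student; Week 9 lies beyond the Week 6 cutoff.
student = {"W1_forum": 2, "W2_forum": 3, "W6_homework": 1, "W9_exam": 80}
cum = cumulative_features(student, cutoff_week=6)
mixed_dims = len(visible_features(6)) + len(cum)
```

Because both helpers stop at cutoff_week, the Week 9 value never enters a Week 6 prediction, which is the point of time-masking.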

Table 7. RF Gini importance, top 5 individual features (from the full-training-set model).

Rank	Feature	Week	Activity Type	Importance
1	W8_forum	Week 8	Forum	20.28%
2	W10_exam	Week 10	Exam	15.80%
3	W11_custom	Week 11	Custom	10.65%
4	W18_homework	Week 18	Homework	9.18%
5	W15_exam	Week 15	Exam	5.84%
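
Per-feature importances of this kind can be rolled up to activity types by summing over the suffix of each feature name. The sketch below does this for the five rows of Table 7 only, so the totals are partial: the remaining features are not listed in the table and are not guessed here. The helper name is hypothetical.

```python
from collections import defaultdict

# Top-5 Gini importances (percent) copied from Table 7.
top5 = {
    "W8_forum": 20.28,
    "W10_exam": 15.80,
    "W11_custom": 10.65,
    "W18_homework": 9.18,
    "W15_exam": 5.84,
}


def importance_by_activity(importances):
    """Sum importances over features that share an activity suffix (W{week}_{activity})."""
    totals = defaultdict(float)
    for name, value in importances.items():
        _, activity = name.split("_", 1)
        totals[activity] += value
    return dict(totals)


by_activity = importance_by_activity(top5)
covered = sum(top5.values())
```

These five features account for 61.75% of total importance. Within them, the two exam features together (21.64%) slightly exceed the single forum feature (20.28%).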
