Scaling Early Literacy Screening for Sustainable Education: A Cloud-Native Architecture Integrating Machine Learning and Human-in-the-Loop Validation
Abstract
1. Introduction
- RQ1. Can a cloud-native digital screening architecture achieve stable detection (recall) of at-risk learners under classroom-level class imbalance conditions?
- RQ2. Does psychometric-informed feature refinement improve signal clarity and predictive precision without sacrificing the sensitivity required for universal screening?
- RQ3. Can SHAP-based explainability align model outputs with pedagogically interpretable literacy constructs to support data-driven teacher interventions?
2. Theoretical and Research Background
2.1. Digital Literacy Assessment Infrastructure
2.2. Machine Learning-Based Risk Prediction in Digital Assessment Contexts
2.3. Speech-Recognition-Based Assessment of Oral Reading Performance
3. System Architecture and Digital Pipeline
3.1. K-KOBUKI Application Design
3.2. System Architecture and Data Flow
3.3. Application Interface
4. Method
4.1. Participants and Data Collection
4.2. Dataset Structure and Multimodal Integration
4.3. Experimental Design and Modeling Framework
- Feature Engineering and Measurement Verification
- ASR Processing and Human Verification
- Model Implementation and Evaluation Strategy
5. Results
5.1. Measurement Validation
5.2. RQ1—Stability of ML-Based Screening
- Impact of Multimodal Feature Integration
- Stability Across Classifier Families
5.3. RQ2—Impact of Psychometric Refinement
5.4. RQ3—Explainable AI Analysis
6. Discussion and Implications
6.1. System-Level Validation and Architectural Contribution
6.2. Implications for AI-Driven Digital Assessment Engineering
6.3. Limitations and Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Exploratory IRT Item Diagnostics
| Format | Item | 1PL difficulty (b) | 1PL difficulty interpretation | 2PL difficulty (b) | 2PL discrimination (a) | 2PL difficulty interpretation | 2PL discrimination interpretation |
|---|---|---|---|---|---|---|---|
| Multiple-choice | q1 | −0.9618 | Appropriate | −1.0464 | 0.6772 | Appropriate | Appropriate |
| Multiple-choice | q2 | −1.6605 | Appropriate | 8.6513 | −0.1299 | Hard | Low |
| Multiple-choice | q3 | 1.5443 | Appropriate | 3.2990 | 0.3224 | Hard | Low |
| Multiple-choice | q4 | −5.8956 | Easy | −2.1598 | 10.0375 | Easy | Excellent |
| Multiple-choice | q5 | −2.4547 | Easy | −3.4370 | 0.5082 | Easy | Appropriate |
| Multiple-choice | q6 | −2.5673 | Easy | −2.1657 | 0.9276 | Easy | Appropriate |
| Multiple-choice | q7 | −2.6877 | Easy | −2.6161 | 0.7730 | Easy | Appropriate |
| Multiple-choice | q8 | −1.5412 | Appropriate | −1.6002 | 0.7151 | Appropriate | Appropriate |
| Multiple-choice | q9 | −3.6135 | Easy | −2.2728 | 1.4063 | Easy | Appropriate |
| Multiple-choice | q10 | −0.1027 | Appropriate | −0.2183 | 0.3302 | Appropriate | Low |
| Multiple-choice | q11 | −4.3647 | Easy | −5.0235 | 0.6375 | Easy | Appropriate |
| Multiple-choice | q12 | −2.8138 | Easy | −3.7648 | 0.5353 | Easy | Appropriate |
| Multiple-choice | q13 | −0.0720 | Appropriate | −0.0992 | 0.5254 | Appropriate | Appropriate |
| Multiple-choice | q14 | −2.0504 | Easy | −1.7280 | 0.9309 | Appropriate | Appropriate |
| Multiple-choice | q15 | −1.5022 | Appropriate | −1.2283 | 0.9721 | Appropriate | Appropriate |
| Multiple-choice | q16 | −1.2052 | Appropriate | −0.8554 | 1.1975 | Appropriate | Appropriate |
| Multiple-choice | q17 | 1.5441 | Appropriate | 1.3545 | 0.8900 | Appropriate | Appropriate |
| Multiple-choice | q18 | −4.5335 | Easy | −5.9420 | 0.5527 | Easy | Appropriate |
| Multiple-choice | q19 | −2.8138 | Easy | −2.4608 | 0.8839 | Easy | Appropriate |
| Multiple-choice | q20 | −2.8791 | Easy | −1.6811 | 1.6340 | Appropriate | Excellent |
| Multiple-choice | q21 | −2.6878 | Easy | −1.6216 | 1.5418 | Appropriate | Excellent |
| Multiple-choice | q22 | 1.0377 | Appropriate | 0.7725 | 1.1203 | Appropriate | Appropriate |
| Multiple-choice | q23 | −0.9625 | Appropriate | −0.7520 | 1.0392 | Appropriate | Appropriate |
| Multiple-choice | q24 | −1.3139 | Appropriate | −0.8868 | 1.2958 | Appropriate | Appropriate |
| Multiple-choice | q25 | 1.3189 | Appropriate | 2.1021 | 0.4397 | Hard | Low |
| Recording | s1 | −3.2509 | Easy | −3.5143 | 0.6826 | Easy | Appropriate |
| Recording | s2 | −2.1458 | Easy | −1.2270 | 1.7282 | Appropriate | Excellent |
| Recording | s3 | −1.6607 | Appropriate | −1.0272 | 1.4987 | Appropriate | Appropriate |
| Recording | s4 | −3.0949 | Easy | −5.1923 | 0.4196 | Easy | Low |
| Recording | s5 | −2.4547 | Easy | −2.6388 | 0.6862 | Easy | Appropriate |
| Recording | s6 | 0.1431 | Appropriate | 1.4478 | 0.0641 | Appropriate | Low |
| Recording | s7 | −2.1456 | Easy | −4.8084 | 0.3075 | Easy | Low |
| Recording | s8 | −0.6021 | Appropriate | −0.5396 | 0.8623 | Appropriate | Appropriate |
| Recording | s9 | −1.6615 | Appropriate | −0.9393 | 1.7823 | Appropriate | Excellent |
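
For readers interpreting the difficulty (b) and discrimination (a) columns, the standard two-parameter logistic (2PL) item response function can be sketched as follows. The function name `p_correct` and the plain logistic form without the 1.7 scaling constant are assumptions made for illustration; the appendix does not state which parameterization or software produced the estimates.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """Standard 2PL item response function: P(correct | ability theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Parameter values copied from the 2PL columns of the table above.
q1 = {"a": 0.6772, "b": -1.0464}   # typical item
q2 = {"a": -0.1299, "b": 8.6513}   # negative discrimination
s6 = {"a": 0.0641, "b": 1.4478}    # near-zero discrimination

# Compare the modeled success probability at low and high ability.
for name, item in (("q1", q1), ("q2", q2), ("s6", s6)):
    low = p_correct(-2.0, item["a"], item["b"])
    high = p_correct(2.0, item["a"], item["b"])
    print(f"{name}: P(theta=-2)={low:.3f}, P(theta=+2)={high:.3f}")
```

At theta = b the function returns 0.5 by construction. A negative a, as estimated for q2, makes the modeled success probability fall as ability rises, and an a near zero, as for s6, makes the curve almost flat, which is consistent with the "Low" discrimination labels.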
Appendix B. Detailed Model Configurations
| Model | Parameter | Setting |
|---|---|---|
| Logistic Regression | Regularization | L2 |
| Logistic Regression | Solver | lbfgs/liblinear |
| Logistic Regression | Max iterations | 5000–8000 |
| Logistic Regression | Class weight | Balanced |
| Decision Tree | Class weight | Balanced |
| Decision Tree | Max depth | 4 |
| Decision Tree | Min samples per leaf | 5 |
| Random Forest | Number of estimators | 800 |
| Random Forest | Max depth | 6 |
| Random Forest | Min samples per leaf | 2 |
| Random Forest | Class weight | Balanced subsample |
| Gradient Boosting | Configuration | Default (scikit-learn) |
| XGBoost | Number of estimators | 600–800 |
| XGBoost | Learning rate | 0.05 |
| XGBoost | Max depth | 3–4 |
| XGBoost | Subsample | 0.9 |
| XGBoost | Colsample by tree | 0.9 |
| XGBoost | Gamma | 0.1 |
| XGBoost | Lambda (L2) | 1.0 |
| XGBoost | Alpha (L1) | 0.0 |
| XGBoost | Scale pos weight | Negative/positive ratio |
| Preprocessing | Resampling method | SMOTE–Tomek |
| Preprocessing | SMOTE k-neighbors | 5 |
| Preprocessing | Tomek links | Applied |
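
The settings above can be written down as keyword arguments, which makes the configuration easier to reproduce. The sketch below is one possible encoding, not the authors' code: the dictionary and function names are hypothetical, the keys follow the constructor parameter names of scikit-learn, xgboost, and imbalanced-learn, and where the table gives a range (max iterations, number of estimators, max depth) the lower bound is picked arbitrarily.

```python
# Keyword arguments mirroring the table above (lower bound used for ranges).
MODEL_PARAMS = {
    "logistic_regression": {  # sklearn.linear_model.LogisticRegression
        "penalty": "l2", "solver": "lbfgs", "max_iter": 5000,
        "class_weight": "balanced",
    },
    "decision_tree": {  # sklearn.tree.DecisionTreeClassifier
        "class_weight": "balanced", "max_depth": 4, "min_samples_leaf": 5,
    },
    "random_forest": {  # sklearn.ensemble.RandomForestClassifier
        "n_estimators": 800, "max_depth": 6, "min_samples_leaf": 2,
        "class_weight": "balanced_subsample",
    },
    "gradient_boosting": {},  # sklearn.ensemble.GradientBoostingClassifier defaults
    "xgboost": {  # xgboost.XGBClassifier
        "n_estimators": 600, "learning_rate": 0.05, "max_depth": 3,
        "subsample": 0.9, "colsample_bytree": 0.9, "gamma": 0.1,
        "reg_lambda": 1.0, "reg_alpha": 0.0,
    },
}
SMOTE_PARAMS = {"k_neighbors": 5}  # imblearn.over_sampling.SMOTE

def scale_pos_weight(labels):
    """Negative/positive ratio, with 1 marking the at-risk (positive) class."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives

def xgboost_params(train_labels):
    """XGBoost kwargs with scale_pos_weight computed from the training labels."""
    return {**MODEL_PARAMS["xgboost"],
            "scale_pos_weight": scale_pos_weight(train_labels)}
```

With the libraries installed, a model could then be built as, for example, `RandomForestClassifier(**MODEL_PARAMS["random_forest"])`. Computing the negative/positive ratio on the training labels only keeps test-fold class proportions out of the model configuration.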
References
- Cain, K.; Oakhill, J. Profiles of children with specific reading comprehension difficulties. Br. J. Educ. Psychol. 2006, 76, 683–696.
- Kaderavek, J.N.; Sulzby, E. Narrative production by children with and without specific language impairment. J. Speech Lang. Hear. Res. 2000, 43, 34–49.
- Gough, P.B.; Tunmer, W.E. Decoding, reading, and reading disability. Remedial Spec. Educ. 1986, 7, 6–10.
- Piasta, S.B.; Wagner, R.K. Developing early literacy skills: A meta-analysis of alphabet learning and instruction. Read. Res. Q. 2010, 45, 8–38.
- Ehri, L.C.; Nunes, S.R.; Willows, D.M.; Schuster, B.V.; Yaghoub-Zadeh, Z.; Shanahan, T. Phonemic awareness instruction helps children learn to read: Evidence from the National Reading Panel’s meta-analysis. Read. Res. Q. 2001, 36, 250–287.
- World Bank; UNESCO; UNICEF; FCDO; USAID; Bill & Melinda Gates Foundation. The State of Global Learning Poverty: 2022 Update (Conference Edition); World Bank: Washington, DC, USA, 2022. Available online: https://thedocs.worldbank.org/en/doc/e52f55322528903b27f1b7e61238e416-0200022022/original/Learning-poverty-report-2022-06-21-final-V7-0-conferenceEdition.pdf (accessed on 13 March 2026).
- UNESCO. When Schools Shut: New UNESCO Study Exposes Failure to Factor in Gender in COVID-19 Education Responses; UNESCO: Paris, France, 2021. Available online: https://www.unesco.org/en/articles/when-schools-shut-new-unesco-study-exposes-failure-factor-gender-covid-19-education-responses (accessed on 13 March 2026).
- Kirsten, K.; Greefrath, G.; Emmrich, R. Technology-based versus paper-pencil: Sources of mode effects in large-scale assessment. Int. J. Math. Educ. Sci. Technol. 2026, 1–28.
- Anghel, E.; Khorramdel, L.; von Davier, M. The use of process data in large-scale assessments: A literature review. Large-Scale Assess. Educ. 2024, 12, 13.
- Chuang, P.-L.; Yan, X. Language assessment in the era of generative artificial intelligence: Opportunities, challenges, and future directions. System 2025, 134, 103846.
- Zanellati, A.; Zingaro, S.P.; Gabbrielli, M. Balancing performance and explainability in academic dropout prediction. IEEE Trans. Learn. Technol. 2024, 17, 2086–2099.
- Cukurova, M.; Miao, F. AI Competency Framework for Teachers; UNESCO Publishing: Paris, France, 2024.
- Bagdonaite, J.; Dagiene, V. Artificial Intelligence in Primary Education: A Systematic Literature Review 2020–2025. Inform. Educ. 2025, 24, 697–736.
- Rathnayake, N.; Wijewardane, S. Machine learning-based Direct Normal Irradiance (DNI) forecasting using satellite data for Concentrated Solar Power (CSP) plants with Thermal Energy Storage (TES). Sci. Rep. 2026, 16, 11257.
- Siemens, G.; Baker, R.S.J.D. Learning Analytics and Educational Data Mining: Towards Communication and Collaboration. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge, Vancouver, Canada, 29 April–2 May 2012; pp. 252–254.
- Yumus, M.; Stuhr, C.; Meindl, M.; Leuschner, H.; Jungmann, T. EuleApp©: A computerized adaptive assessment tool for early literacy skills. Front. Psychol. 2025, 16, 1522740.
- Xi, X. Advancing language assessment with AI and ML–Leaning into AI is inevitable, but can theory keep up? Lang. Assess. Q. 2023, 20, 357–376.
- Baker, R.S.; Hawn, A. Algorithmic bias in education. Int. J. Artif. Intell. Educ. 2022, 32, 1052–1092.
- Kuhn, M.R.; Schwanenflugel, P.J.; Meisinger, E.B. Aligning theory and assessment of reading fluency: Automaticity, prosody, and definitions of fluency. Read. Res. Q. 2010, 45, 230–251.
- Bailly, G.; Godde, E.; Piat-Marchand, A.-L.; Bosse, M.-L. Automatic assessment of oral readings of young pupils. Speech Commun. 2022, 138, 67–79.
- Stewart, A.E.; Keirn, Z.; D’Mello, S.K. Multimodal modeling of collaborative problem-solving facets in triads. User Model. User-Adapt. Interact. 2021, 31, 713–751.
- Yan, L.; Echeverria, V.; Jin, Y.; Fernandez-Nieto, G.; Zhao, L.; Li, X.; Alfredo, R.; Swiecki, Z.; Gašević, D.; Martinez-Maldonado, R. Evidence-based multimodal learning analytics for feedback and reflection in collaborative learning. Br. J. Educ. Technol. 2024, 55, 1900–1925.
- Han, J.; Shim, Y. Inclusive design of a tool to screen literacy of lower grade elementary school students. In Proceedings of the IEEE International Conference on E-Business Engineering (ICEBE) 2023, Beijing, China, 17–19 October 2023; pp. 178–180.
- Tummalapalli, V. Using SMOTE and TOMEK Link Sampling Techniques to Address Imbalanced Data Challenges in the Machine Learning models. IJSAT-Int. J. Sci. Technol. 2025, 16, 1–6.
- Adem, H. Vocal Biomarkers of Childhood Trauma: A Machine-Learning Approach to Speech Analysis. J. Speech Lang. Hear. Res. 2026, 69, 1955–1976.
- Lardhi, J.S.; Ismail, A.F. Generative Artificial Intelligence for SDG 4: Enhancing Sustainable Quality Learning. Sustainability 2026, 18, 2498.
- Tasić, N.; Glušac, D.; Makitan, V.; Jokić, S.; Ljubojev, N.; Vignjević, K. Promoting Sustainable Education Through the Educational Software Scratch: Enhancing Attention Span Among Primary School Students in the Context of Sustainable Development Goal (SDG) 4. Sustainability 2025, 17, 9292.

| Criterion | No. Removed | Main Domains (Examples) | Decision Rule |
|---|---|---|---|
| VIF-based Removal | 7 | Print recognition (q3 *), phonological awareness (q8, q10), word recognition (q12), reading fluency (q18), vocabulary knowledge (q21, q23) | Multicollinearity (VIF > 10) |
| IRT-based Exclusion | 7 | Print recognition (q2, q3 *), phonological awareness (q10 *, s4, s6, s7), vocabulary knowledge (q25) | 2PL: a < 0.3 or extreme b |
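
The VIF rule in the table can be read through the standard definition VIF_j = 1 / (1 − R_j²), where R_j² is the coefficient of determination obtained by regressing feature j on the remaining features. A minimal sketch follows; the function names and the example R² values are illustrative assumptions, not values from the study.

```python
def vif_from_r2(r_squared: float) -> float:
    """Variance inflation factor for a feature, given the R^2 obtained by
    regressing that feature on all other features."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)

def exceeds_vif_threshold(r_squared: float, threshold: float = 10.0) -> bool:
    """Apply the table's decision rule (VIF > 10 by default)."""
    return vif_from_r2(r_squared) > threshold

# Hypothetical R^2 values, not taken from the study's data.
for r2 in (0.50, 0.85, 0.95):
    print(r2, round(vif_from_r2(r2), 2), exceeds_vif_threshold(r2))
```

Under this definition, the VIF > 10 cutoff is equivalent to flagging any item for which more than 90% of its variance is explained by the other items.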

| Domain | Max | Total M | Total SD | Typical M | Typical SD | Struggling M | Struggling SD | t | p |
|---|---|---|---|---|---|---|---|---|---|
| Print Recognition | 4 | 2.52 | 0.93 | 2.63 | 0.9 | 2.21 | 0.83 | 2.79 | 0.007 |
| Phonological Awareness | 15 | 10.9 | 2.52 | 11.26 | 2.17 | 10.0 | 2.48 | 2.91 | 0.005 |
| Word Reading | 5 | 4.02 | 0.99 | 4.19 | 0.83 | 3.54 | 1.02 | 3.70 | 0.001 |
| Vocabulary Knowledge | 7 | 5.19 | 1.47 | 5.54 | 1.2 | 4.05 | 1.43 | 6.33 | <0.001 |
| Reading Fluency | 3 | 1.74 | 0.9 | 1.91 | 0.82 | 1.13 | 0.89 | 4.96 | <0.001 |

| Feature Configuration | Recall | Precision | PR-AUC |
|---|---|---|---|
| Structured only | 0.82 | 0.36 | 0.38 |
| Structured + ASR | 0.85 | 0.41 | 0.47 |

| Model | Precision (VIF) | Precision (IRT) | Recall (VIF) | Recall (IRT) | F1 (VIF) | F1 (IRT) | ROC-AUC (VIF) | ROC-AUC (IRT) | PR-AUC (VIF) | PR-AUC (IRT) |
|---|---|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.361 | 0.372 | 0.856 | 0.854 | 0.494 | 0.514 | 0.711 | 0.728 | 0.440 | 0.456 |
| Random Forest | 0.403 | 0.411 | 0.864 | 0.861 | 0.533 | 0.537 | 0.767 | 0.772 | 0.464 | 0.474 |
| Gradient Boosting | 0.361 | 0.368 | 0.870 | 0.872 | 0.496 | 0.507 | 0.717 | 0.725 | 0.419 | 0.435 |
| XGBoost | 0.341 | 0.358 | 0.844 | 0.848 | 0.473 | 0.493 | 0.719 | 0.731 | 0.439 | 0.441 |
| Decision Tree | 0.290 | 0.305 | 0.901 | 0.902 | 0.425 | 0.454 | 0.686 | 0.701 | 0.357 | 0.368 |
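
The pattern in the table, recall around 0.85 to 0.90 with precision around 0.3 to 0.4, is typical of a recall-oriented screener under class imbalance. The sketch below computes the three threshold-based metrics from confusion counts; the helper name `screening_metrics` and the counts are hypothetical, chosen only to land in a similar range.

```python
def screening_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 for the positive (at-risk) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical fold: 20 at-risk learners, 17 flagged correctly (tp),
# 3 missed (fn), and 28 typical learners flagged unnecessarily (fp).
m = screening_metrics(tp=17, fp=28, fn=3)
print({k: round(v, 3) for k, v in m.items()})
```

When metrics are averaged over cross-validation folds, the averaged F1 generally differs from the F1 computed from the averaged precision and recall, so the F1 columns need not follow exactly from the precision and recall columns.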

| Model | Print Recognition (VIF) | Print Recognition (IRT) | Phonological Awareness (VIF) | Phonological Awareness (IRT) | Word Reading (VIF) | Word Reading (IRT) | Vocabulary Knowledge (VIF) | Vocabulary Knowledge (IRT) | Reading Fluency (VIF) | Reading Fluency (IRT) |
|---|---|---|---|---|---|---|---|---|---|---|
| Logistic (L2, balanced) | 0.066 | 0.058 | 0.070 | 0.065 | 0.575 | 0.598 | 0.125 | 0.128 | 0.163 | 0.151 |
| Random Forest | 0.046 | 0.042 | 0.052 | 0.048 | 0.738 | 0.721 | 0.055 | 0.061 | 0.110 | 0.128 |
| Gradient Boosting | 0.051 | 0.047 | 0.083 | 0.079 | 0.630 | 0.643 | 0.082 | 0.086 | 0.155 | 0.145 |
| XGBoost | 0.039 | 0.041 | 0.083 | 0.076 | 0.669 | 0.662 | 0.085 | 0.091 | 0.124 | 0.130 |
| Decision Tree | 0.073 | 0.061 | 0.102 | 0.094 | 0.496 | 0.501 | 0.054 | 0.057 | 0.275 | 0.287 |

| Item | Mean \|SHAP\| |
|---|---|
| q23 | 0.087 |
| q22 | 0.045 |
| q24 | 0.032 |
| s9 | 0.029 |
| q19 | 0.028 |
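
Mean |SHAP| is the average of a feature's absolute SHAP values across all evaluated learners, so positive and negative contributions do not cancel out. A minimal sketch of that aggregation is shown below, using hypothetical values and a hypothetical helper name (`mean_abs_shap`); in practice the per-learner values would come from a SHAP explainer rather than being typed in.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Mean absolute SHAP value per feature, sorted from most to least important.

    shap_rows: one list of per-feature SHAP values for each learner.
    """
    n = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    ranked = [(name, total / n) for name, total in zip(feature_names, totals)]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Hypothetical SHAP values for three learners and three items.
features = ["q22", "q23", "q24"]
shap_rows = [
    [0.04, -0.10, 0.03],
    [-0.05, 0.08, -0.02],
    [0.03, -0.09, 0.04],
]
for name, score in mean_abs_shap(shap_rows, features):
    print(f"{name}: {score:.3f}")
```

Because the absolute value is taken before averaging, the ranking reflects how strongly an item moves predictions in either direction, not whether it pushes them toward the at-risk or the typical class.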

| Domain | Mean \|SHAP\| |
|---|---|
| Vocabulary Knowledge (7) | 0.151 |
| Reading Fluency (3) | 0.074 |
| Phonological Awareness (15) | 0.073 |
| Word Reading (5) | 0.065 |
| Print Recognition (4) | 0.041 |

| Feature | Traditional Manual Screening | K-KOBUKI (Proposed Workflow) | Sustainability Impact (SDG 4) |
|---|---|---|---|
| Accessibility | Resource-intensive; often limited to urban areas | Cloud-supported digital delivery with potential applicability in resource-constrained contexts | Potential to improve access to screening processes under appropriate infrastructural conditions [14] |
| Resilience | Delayed feedback (weeks); deficit accumulation | Reduced assessment-to-feedback latency within a semi-automated workflow (subject to human verification processes) | Potential to support earlier identification, although not evaluated as real-time intervention in this study [13] |
| Agency | High labor burden on expert evaluators | HITL verification maintains teacher involvement in data validation and interpretation | Supports teacher-centered decision-making rather than replacing professional judgment [13] |
| Accountability | Rater-dependent; subjective | Combination of human verification and explainable model outputs (e.g., SHAP) supporting interpretability | Potential to enhance transparency, while remaining dependent on human validation processes [12] |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Lee, S.; Han, J. Scaling Early Literacy Screening for Sustainable Education: A Cloud-Native Architecture Integrating Machine Learning and Human-in-the-Loop Validation. Sustainability 2026, 18, 5142. https://doi.org/10.3390/su18105142

