Article

Scaling Early Literacy Screening for Sustainable Education: A Cloud-Native Architecture Integrating Machine Learning and Human-in-the-Loop Validation

1
The Institute of Brain-Based Learning, Korea National University of Education, Cheongju 28173, Republic of Korea
2
Department of Computer Education, Cheongju National University of Education, Cheongju 28173, Republic of Korea
*
Author to whom correspondence should be addressed.
Sustainability 2026, 18(10), 5142; https://doi.org/10.3390/su18105142
Submission received: 28 March 2026 / Revised: 6 May 2026 / Accepted: 14 May 2026 / Published: 20 May 2026
(This article belongs to the Special Issue AI for Sustainable and Creative Learning in Education)

Abstract

Early literacy screening is essential for reducing long-term educational inequality, yet traditional paper-based assessments remain difficult to scale due to logistical constraints and delayed feedback. This study presents K-KOBUKI, a cloud-based prototype screening workflow that organizes early literacy assessment as a human-validated, data-driven process. The system integrates structured assessment responses with automated speech recognition-based analysis of oral reading performance across five literacy domains and incorporates a human-in-the-loop verification stage to ensure the reliability of speech-derived features. The system was evaluated using data from 195 first-grade students. Across repeated stratified cross-validation, multiple classification models achieved stable recall (≈0.85) under class imbalance conditions, supporting consistent identification of at-risk learners. Psychometric-informed feature refinement improved precision without reducing recall, indicating enhanced signal clarity through measurement-level stabilization. Explainable AI analysis further revealed that word reading and reading fluency contributed strongly to model-level decision boundaries, while vocabulary knowledge provided complementary influence at the individual level. These findings provide prototype-level evidence that a human-validated, multimodal screening workflow can support stable early-risk detection. From a sustainability perspective, the results suggest potential design-level contributions to improving accessibility and reducing delays in early identification processes.

1. Introduction

Early literacy represents a multidimensional set of linguistic competencies that emerge prior to formal schooling and serve as foundational predictors of long-term academic trajectories [1,2]. Core components—including phonological awareness, vocabulary knowledge, decoding efficiency, and listening comprehension—form an interdependent structure that shapes children’s readiness to read [3,4]. Among these components, phonological awareness has been consistently identified as a central mechanism underlying decoding and reading fluency across orthographic systems [5]. From the perspective of Sustainable Development Goal 4 (SDG 4), early literacy is not merely an academic milestone but a foundational condition for inclusive and equitable quality education, and current global monitoring indicates that progress toward minimum reading proficiency remains insufficient to meet the 2030 agenda [6].
However, the global education system currently faces a sustainability crisis known as “learning poverty,” where approximately 70% of 10-year-olds in low- and middle-income contexts are unable to read and understand a simple text [7]. Recent global learning disruptions have further exacerbated this issue, highlighting early literacy screening as a critical prerequisite for educational equity [8]. Because foundational literacy functions as a gateway to subsequent learning, delayed identification of reading difficulty can amplify later educational exclusion; early screening is therefore a systemic issue rather than merely a testing concern [7]. Recent international monitoring likewise reports weak progress in reading proficiency together with persistent inequities in educational access, infrastructure, and teacher capacity.
Despite their widespread adoption, traditional paper-based literacy assessments face structural limitations that hinder timely and equitable screening [9,10]. These assessments are often resource-intensive, requiring extensive one-on-one sessions and manual scoring, which limits their accessibility in marginalized or rural areas where specialized personnel are scarce [8,9]. This “diagnostic divide” prevents timely intervention and reinforces systemic cycles of academic exclusion. While recent advances in artificial intelligence (AI) and learning analytics have enabled automated scoring and machine learning (ML)-based risk prediction, emerging work in AI-enabled assessment emphasizes that such systems must be evaluated not only by predictive performance but also by reliability, validity, fairness, and contextual applicability [11,12]. Accordingly, the key challenge is not automation alone, but the design of a trustworthy assessment architecture in which machine-generated outputs are supported by human validation, pedagogical accountability, and learner-rights protections [12,13].
To address these challenges, the present study frames early literacy screening as a human–AI collaborative workflow design problem. The proposed system, K-KOBUKI, is introduced as a cloud-based prototype workflow that integrates structured assessment delivery, speech-processing support, psychometric feature refinement, machine learning-based risk estimation, and explainable AI within a unified pipeline. In this workflow, automated analysis is coupled with human verification to ensure that derived features reflect validated learner performance.
Within this study, sustainability and scalability are not treated as empirically demonstrated system outcomes but as design-level implications of the proposed workflow. Specifically, sustainability is considered in terms of accessibility, operational continuity, and socio-technical accountability, while scalability is conceptualized as architectural extensibility rather than large-scale deployment. Accordingly, the study focuses on evaluating workflow-level feasibility rather than system-level deployment performance.
From this perspective, the contribution of the study is threefold. First, it demonstrates that stable early-risk detection can be achieved through the integration of multimodal assessment data and machine learning. Second, it shows that psychometric-informed feature refinement contributes to improving signal clarity within the predictive feature space. Third, it illustrates how explainable AI techniques can link model outputs to interpretable literacy domains, supporting pedagogically meaningful interpretation. These contributions are presented as prototype-level evidence.
Accordingly, the present study evaluates the proposed architecture as a human–AI collaborative screening workflow and examines its performance through the following research questions:
  • RQ1. Can a cloud-native digital screening architecture achieve stable detection (recall) of at-risk learners under classroom-level class imbalance conditions?
  • RQ2. Does psychometric-informed feature refinement improve signal clarity and predictive precision without sacrificing the sensitivity required for universal screening?
  • RQ3. Can explainability based on SHapley Additive exPlanations (SHAP) align model outputs with pedagogically interpretable literacy constructs to support data-driven teacher interventions?

2. Theoretical and Research Background

2.1. Digital Literacy Assessment Infrastructure

Early literacy assessment has traditionally relied on paper-based, individually administered formats that evaluate letter recognition, word reading, phonological awareness, dictation, and sentence comprehension. Although widely adopted in early education, these approaches require substantial time, trained personnel, and manual scoring procedures, which limit scalability in school-wide or district-level implementation [9]. From a sustainability perspective, these resource-intensive methods exacerbate the “diagnostic divide,” where students in rural or underfunded regions are excluded from early intervention opportunities [13].
The emergence of digital assessment platforms has shifted literacy evaluation from static testing instruments toward integrated learning analytics infrastructures. By enabling automated item delivery, standardized scoring, and centralized cloud-based data storage, digital systems reduce operational burden while simultaneously producing machine-readable datasets. Rather than constituting fully scalable solutions, these systems should be understood as enabling conditions that can support improvements in access and consistency within early screening workflows. This structural transformation can be interpreted as a socio-technical shift that may contribute to improving access to diagnostic processes [14]. Within this framework, technology functions as a supporting condition for improving consistency in diagnostic processes across diverse geographical and economic contexts [15].

2.2. Machine Learning-Based Risk Prediction in Digital Assessment Contexts

Advances in machine learning have expanded the analytical capabilities of digital literacy screening systems. Structured item-level responses and aggregated domain scores can be transformed into predictive feature matrices, allowing classification algorithms to estimate the probability of reading difficulties [16,17]. This approach represents a shift from descriptive score reporting toward probabilistic risk modeling.
However, predictive performance alone is insufficient for responsible deployment in educational settings. Sustainable AI integration must move beyond “black-box” optimization toward a socially accountable ecosystem that preserves human agency [12]. Recent scholarship emphasizes that educational AI should augment the teacher’s professional judgment rather than replace it, in line with the UNESCO AI Competency Framework [13]. Digital literacy screening must therefore incorporate ethical design principles, including transparency and explainability, to ensure that automated risk classifications do not lead to algorithmic stigmatization of young learners [17].
Importantly, the effectiveness of machine learning in screening contexts depends not only on model selection, but also on the quality and interpretability of the underlying feature space. As a result, psychometric validation and machine learning modeling should be treated as interdependent components of a unified assessment process rather than as separate analytical stages [18]. Within this perspective, machine learning should be understood as a decision-support component operating within a broader assessment workflow, rather than as a standalone predictive solution.

2.3. Speech-Recognition-Based Assessment of Oral Reading Performance

AI-based literacy diagnostic systems increasingly incorporate automated speech recognition (ASR) technologies to evaluate oral reading performance. Speech-processing algorithms can quantify pronunciation accuracy and temporal fluency, enabling computational analysis of decoding automaticity. Such approaches align with established research on oral reading fluency as a key indicator of reading development [19] and have recently been operationalized through AI-driven assessment systems [20].
However, the application of ASR in early literacy contexts presents inherent challenges. Children’s speech is often characterized by variability in pronunciation, incomplete articulation, and developmental differences, while classroom environments introduce additional sources of noise and recording inconsistency. These factors can reduce transcription accuracy and limit the direct interpretability of speech-derived indicators.
For this reason, prior research suggests that speech-based measures become educationally meaningful only when embedded within a reliable interpretive process [20]. Human verification plays a critical role in this process by supporting the validation of ASR outputs and ensuring that derived features reflect actual learner performance rather than recognition artifacts. From this perspective, ASR should be understood as a supportive analytical tool within a broader assessment workflow, rather than as a fully autonomous diagnostic mechanism.
Taken together, existing studies indicate that speech-based analysis can provide valuable information for early literacy screening, but its effectiveness depends on the integration of automated processing and human validation [21,22]. The present study builds on this perspective by examining how speech-derived indicators can be incorporated into a structured, multimodal screening workflow.

3. System Architecture and Digital Pipeline

3.1. K-KOBUKI Application Design

The AI-enabled digital early literacy diagnostic application, K-KOBUKI, was implemented as a prototype-level assessment system consisting of 34 assessment items, including 25 multiple-choice items and 9 speech-response items. The assessment covers five core domains of early literacy: print recognition, phonological awareness, word reading, vocabulary knowledge, and reading fluency. These domains are grounded in the Science of Reading framework, which emphasizes the interdependence of decoding automaticity and language comprehension [3].
To ensure technical transparency and pedagogical validity, the 27 predictive features were selected based on their alignment with oral reading fluency (ORF) constructs, capturing pronunciation accuracy, phoneme-level alignment, and temporal patterns [20]. These features combine structured response accuracy with speech-derived indicators, forming a multimodal representation of early literacy performance. Figure 1 presents the overall structure and item framework.
The multi-domain structure was derived from the framework proposed by Han and Shim (2023) [23] and refined through a multi-stage expert review. To support measurement reliability and interpretability, the items were further examined using a Rasch (1PL) model. Detailed results of the psychometric analysis are provided in Appendix A.

3.2. System Architecture and Data Flow

K-KOBUKI was implemented using the Flutter framework to support cross-platform compatibility. The system is designed as a cloud-based workflow in which data collection, validation, and analysis components are structurally integrated. Figure 2 illustrates the overall system architecture and data flow.
Assessment items are dynamically retrieved from a cloud database and delivered through a tablet-based interface. Learner responses are collected in two forms: structured item responses and speech recordings. These data are transmitted to the server, where they are processed through a sequential pipeline consisting of ASR processing, human verification, feature extraction, and machine learning-based analysis.
A key component of this pipeline is the human-in-the-loop (HITL) verification module, which is positioned upstream of feature extraction within the speech-processing workflow. In this stage, ASR-generated transcriptions are reviewed and corrected prior to their use in downstream analysis, ensuring that speech-derived features are based on validated response data.
Following verification, structured responses and speech-derived indicators are combined into a unified feature space for predictive modeling. This modular pipeline separates data acquisition, validation, and analysis stages, enabling a structured flow from raw input data to model-based screening outcomes.
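The ordering constraint described above, in which feature extraction only ever sees human-verified transcriptions, can be sketched in a few lines. This is a minimal illustration, not the K-KOBUKI implementation: the `SpeechResponse` schema, the function names, the `exact_match` feature, and the romanized example strings are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SpeechResponse:
    """One spoken item response moving through the pipeline (hypothetical schema)."""
    learner_id: str
    item_id: str
    target_text: str                      # expected reading for the item
    asr_text: str                         # raw ASR transcription
    verified_text: Optional[str] = None   # set only after human review


def verify(response: SpeechResponse, reviewed_text: str) -> SpeechResponse:
    """HITL stage: a reviewer confirms or corrects the ASR transcription."""
    return replace(response, verified_text=reviewed_text)


def extract_features(response: SpeechResponse) -> dict:
    """Feature extraction refuses input that has not passed human verification."""
    if response.verified_text is None:
        raise ValueError("transcription has not been verified")
    return {
        "learner_id": response.learner_id,
        "item_id": response.item_id,
        "exact_match": int(response.verified_text == response.target_text),
    }


# Romanized placeholder strings; the ASR output contains one recognition error.
raw = SpeechResponse("s001", "fluency_01", target_text="namu", asr_text="namo")
checked = verify(raw, reviewed_text="namu")   # reviewer corrects the ASR error
features = extract_features(checked)
```

Making the verified transcription a separate, initially empty field means an unverified record cannot reach the modeling stage by accident, which mirrors the upstream position of the HITL module in the pipeline.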

3.3. Application Interface

The K-KOBUKI interface was designed to capture both structured response data and speech performance within a single digital environment. The multiple-choice interface (Figure 3) evaluates foundational skills such as print recognition and semantic judgment. In alignment with socially accountable design [12], responses are automatically scored, but the final confirmation remains with the examiner before data synchronization.
The speech-response interface (Figure 4) requires learners to read aloud to evaluate decoding fluency. The integrated HITL dashboard allows teachers to verify automatically generated transcriptions, supporting the use of ASR outputs as validated input data for analysis. In this context, the system functions as a support tool for structured data collection rather than as an autonomous assessment system.
By integrating structured responses, speech-derived indicators, and human validation within a single workflow, the system provides a multimodal evidence base for early screening. However, this integration should be interpreted as supporting the feasibility of a human-validated screening workflow under controlled conditions, rather than as establishing a fully scalable or fully automated assessment infrastructure.

4. Method

4.1. Participants and Data Collection

The analytic dataset consisted of 195 first-grade elementary school children aged between 7.00 and 7.08 years. To reflect heterogeneous classroom conditions, participants were recruited from urban, rural, and coastal regions.
The assessment was administered in classroom settings using individual tablets (Figure 5). Audio data were captured in a natural classroom environment using a PCM 16-bit, 44.1 kHz mono format. Participants were instructed to produce spoken responses for a minimum of 8 s per item. To ensure high signal integrity for children’s speech, audio preprocessing—including resampling to 16 kHz and noise normalization—was conducted using the librosa library (version 0.11.0). Ground truth for literacy risk was established using an independent standardized assessment. Based on these scores, participants were categorized into struggling readers (n = 39, 20.0%) and typically developing readers (n = 156, 80.0%). All procedures were conducted under institutional review board approval (IRB No. 1301-202308-HR-0004-02).
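The recording parameters stated above imply a simple lower bound on the size of each clip, assuming a recording lasts at least the instructed 8 s and ignoring container headers. The constant names below are illustrative only.

```python
MIN_SECONDS = 8            # minimum response length per item
CAPTURE_RATE_HZ = 44_100   # recording sample rate
MODEL_RATE_HZ = 16_000     # sample rate after resampling
BYTES_PER_SAMPLE = 2       # 16-bit PCM, mono

captured_samples = MIN_SECONDS * CAPTURE_RATE_HZ
captured_bytes = captured_samples * BYTES_PER_SAMPLE
resampled_samples = MIN_SECONDS * MODEL_RATE_HZ

print(captured_samples, captured_bytes, resampled_samples)  # 352800 705600 128000
```

Resampling to 16 kHz therefore reduces each clip to roughly 36% of its captured sample count before it reaches the speech-processing stage.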

4.2. Dataset Structure and Multimodal Integration

Data collection was conducted across two implementation phases. The pilot phase (N = 351) was used for system refinement and item calibration and was excluded from predictive modeling due to changes in assessment structure between phases. The second phase involved 251 learners (Figure 6).
From the second phase, 56 cases were excluded based on predefined quality-control criteria. These included (1) incomplete data or missing identifiers (n = 11), (2) measurement validity concerns such as language mismatch or health-related issues (n = 13), and (3) technical issues including device malfunction, environmental noise, and unusable speech recordings (n = 32). These criteria ensured that the final dataset consisted of complete, reliable, and temporally aligned records suitable for analysis.
The resulting analytic dataset included 195 learners with linked demographic metadata and item-level performance data. All speech recordings (n = 1755) were reviewed and corrected prior to feature extraction to ensure data reliability.
Because verification logs were not systematically recorded for all correction events, the impact of human verification could not be evaluated as an independent variable and is therefore interpreted as a data-quality control mechanism within the pipeline. Accordingly, the results do not establish predictive validity beyond the analyzed cohort.
After preprocessing, the dataset combined structured response indicators with validated speech-derived measures. The structured component included 25 multiple-choice items across five literacy domains, producing 4875 response entries. The speech component consisted of nine items, generating 1755 verified recordings used for feature extraction.
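The counts reported in this subsection are mutually consistent, which can be checked with a short calculation. The variable names are illustrative; every number is taken from the text above.

```python
# Counts reported in Section 4.2.
phase_two_learners = 251
exclusions = {
    "incomplete data or missing identifiers": 11,
    "measurement validity concerns": 13,
    "technical issues": 32,
}
mc_items, speech_items = 25, 9

analytic_n = phase_two_learners - sum(exclusions.values())
mc_entries = analytic_n * mc_items
recordings = analytic_n * speech_items

print(analytic_n, mc_entries, recordings)  # 195 4875 1755
```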

4.3. Experimental Design and Modeling Framework

The experimental framework evaluated supervised machine learning models for identifying struggling readers using structured digital assessment features.
  • Feature Engineering and Measurement Verification
To ensure the psychometric quality of the feature space, input variables were refined through a two-stage diagnostic process. First, multicollinearity was addressed by removing items with a Variance Inflation Factor (VIF) exceeding 10. Second, as detailed in Appendix A, item characteristics were evaluated using the Rasch (1PL) model to verify parameter stability. In addition, items with low discrimination (2PL a < 0.3) were excluded to enhance signal clarity.
Rather than relying on purely data-driven optimization, the feature set was constructed based on theoretical alignment with oral reading fluency (ORF) constructs, including pronunciation accuracy, phoneme-level correspondence, and temporal fluency patterns. The final feature set consisted of 27 variables, combining structured response features (n = 18) and speech-derived indicators (n = 9). Structured features were derived from item-level correctness and domain-level aggregate scores, while speech-derived features included pronunciation accuracy, character error rate (CER), and temporal fluency measures such as response duration and reading pace [24].
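For reference, the refinement criteria named above can be written in their standard textbook forms. The notation is introduced here and does not come from the article: R_j² is the coefficient of determination obtained by regressing feature j on the remaining features, θ_i is the ability of learner i, b_j is the difficulty of item j, and a_j is its discrimination.

```latex
\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}},
\qquad
\mathrm{VIF}_j > 10 \iff R_j^{2} > 0.9
\]
\[
P(X_{ij} = 1 \mid \theta_i) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}
\quad \text{(Rasch / 1PL)}
\]
\[
P(X_{ij} = 1 \mid \theta_i) = \frac{\exp\bigl(a_j(\theta_i - b_j)\bigr)}{1 + \exp\bigl(a_j(\theta_i - b_j)\bigr)}
\quad \text{(2PL)}
\]
```

Under these definitions, the VIF cut-off of 10 removes features for which more than 90% of the variance is explained by the other features, and the a < 0.3 rule removes items whose response probability changes only weakly with ability.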
  • ASR Processing and Human Verification
Speech data were processed using a Transformer-based ASR model consistent with the Whisper architecture [25]. Transcription quality was evaluated using Character Error Rate (CER), which provides a fine-grained measure of phoneme-level accuracy in early literacy contexts.
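CER is conventionally defined as the character-level edit distance between a reference and a transcription, divided by the reference length. The sketch below follows that standard definition; it is not the authors' scoring code, and the function names are illustrative. It compares whatever characters the input strings contain, so for Korean text the result depends on whether strings are kept as syllable blocks or decomposed into jamo, a choice the article does not specify.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance over characters (substitutions, deletions, insertions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(round(cer("reading", "reeding"), 3))  # 0.143
```

Because the denominator is the reference length, CER can exceed 1.0 when a transcription contains many inserted characters.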
All speech transcriptions were subjected to HITL verification prior to feature extraction. Two trained evaluators independently reviewed each recording by comparing the child’s oral response with the expected target response, focusing on pronunciation accuracy and item-level correctness. Discrepancies were resolved through discussion to reach a consensus decision. This procedure ensured that speech-derived features reflected validated learner performance.
  • Model Implementation and Evaluation Strategy
Five supervised classification algorithms—Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and XGBoost—were implemented using standard machine learning libraries (scikit-learn version 1.6.1 and XGBoost version 3.2.0). Model configurations were specified using constrained parameter ranges to ensure stability under the small-sample condition.
To address class imbalance (20% struggling readers), the hybrid SMOTE–Tomek resampling technique was applied within the training folds. Given the relatively small minority class (n = 39), this approach was adopted as a sensitivity-oriented modeling strategy rather than a substitute for external validation.
Model evaluation was conducted using repeated stratified cross-validation (5 folds × 5 repeats) to ensure robust estimation under class imbalance conditions. Classification thresholds were optimized within each training fold using precision–recall curves to maximize the F1 score, enabling a balanced trade-off between recall and precision.
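The threshold-selection step can be illustrated with a small self-contained sketch. It scans each observed score as a candidate cut-off and keeps the one with the highest F1; the function names and the toy labels and probabilities are made up for illustration and are not study data.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Precision, recall, and F1 when scores >= threshold are flagged as at-risk."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_f1_threshold(y_true, scores):
    """Return (threshold, F1) for the candidate cut-off with the highest F1.

    Candidates are scanned in ascending order and only a strictly higher F1
    replaces the current best, so ties resolve to the lower threshold.
    """
    best_f1, best_threshold = 0.0, None
    for threshold in sorted(set(scores)):
        f1 = precision_recall_f1(y_true, scores, threshold)[2]
        if f1 > best_f1:
            best_f1, best_threshold = f1, threshold
    return best_threshold, best_f1


# Toy training-fold labels (1 = struggling reader) and predicted probabilities.
y_train = [0, 0, 0, 0, 1, 0, 1, 1]
p_train = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]
threshold, f1 = best_f1_threshold(y_train, p_train)
print(threshold, round(f1, 3))  # 0.4 0.857
```

In a leakage-free setup such as the one described here, the cut-off would be chosen on the training fold only and then applied unchanged to the held-out fold.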
Probability calibration was performed using isotonic regression to improve the reliability of predicted probabilities. All preprocessing steps—including resampling, threshold optimization, and model configuration—were conducted strictly within the training folds to prevent data leakage.
The primary optimization metric was recall (sensitivity), reflecting the screening objective of minimizing false negatives [13]. Precision, F1-score, and PR-AUC were used as complementary metrics to evaluate discriminative performance. Because no external validation dataset was available, the results should be interpreted as internal validation within a single cohort. Detailed model configurations are provided in Appendix B.

5. Results

The empirical findings are reported in four stages aligned with the study’s research questions. First, measurement validity and feature separability are examined to confirm that the digital assessment generates structurally reliable and discriminative input signals for machine learning classification. Second, the stability of machine learning-based screening performance is evaluated under repeated cross-validation (RQ1). Third, the effect of psychometric-informed feature refinement on classification performance is examined (RQ2). Finally, explainable AI analyses are conducted to interpret model predictions in relation to literacy development theories (RQ3).

5.1. Measurement Validation

Before evaluating machine learning classification performance, it was necessary to verify that the digital assessment generated both structurally stable measurement signals and sufficiently separable feature distributions across learner groups. Two complementary analyses were conducted: (1) item-level measurement stability based on IRT diagnostics and (2) group-level feature separability across the five literacy domains.
First, item-level stability was examined. As shown in Figure 7, Item Characteristic Curves (ICCs) derived from the Rasch (1PL) and exploratory 2PL models indicate that item responses follow expected probabilistic patterns across ability levels (see Appendix A for detailed results).
To improve feature quality, a two-stage refinement procedure was applied using Variance Inflation Factor (VIF) screening and IRT-based diagnostics. The criteria and results are summarized in Table 1, illustrating the complementary roles of statistical and psychometric filtering.
In addition to item-level stability, group-level performance differences were examined to determine whether the assessment captures meaningful variation between learners. As shown in Table 2, independent-samples t-tests revealed statistically significant differences (p < 0.01) between struggling and typically developing readers across all five literacy domains, including print recognition, phonological awareness, word reading, vocabulary knowledge, and reading fluency. These results indicate that the assessment successfully differentiates between groups across multiple dimensions of early literacy.
Figure 8 presents these group differences as standardized mean differences (Hedges’ g) with corresponding 95% confidence intervals across literacy domains. The magnitude of separability, particularly in vocabulary knowledge (Hedges’ g = 1.10) and reading fluency (g = 0.91), indicates that these domains provide strong discriminative signals for early screening.
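For readers less familiar with the statistic, Hedges' g in its usual form is Cohen's d computed with a pooled standard deviation and multiplied by a small-sample correction factor J. The article does not state which variant was used, so the expression below is the common textbook form, with the group sizes from Section 4.1 substituted into the correction term.

```latex
\[
g = J \cdot \frac{\bar{x}_{1} - \bar{x}_{2}}{s_{p}},
\qquad
s_{p} = \sqrt{\frac{(n_{1} - 1)\,s_{1}^{2} + (n_{2} - 1)\,s_{2}^{2}}{n_{1} + n_{2} - 2}},
\qquad
J \approx 1 - \frac{3}{4(n_{1} + n_{2}) - 9}
\]
\[
n_{1} = 39,\; n_{2} = 156
\;\Rightarrow\;
J \approx 1 - \frac{3}{771} \approx 0.996
\]
```

With 195 learners the correction is small, so under this form g differs from the uncorrected d by less than half a percent.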
Taken together, these results indicate that the feature space demonstrates statistical stability and group-level separability within the analyzed dataset. However, these findings should be interpreted with caution. Although statistically significant differences were observed across domains, the effect sizes reported here represent exploratory evidence of domain-level separability under a relatively small sample condition. Accordingly, the results support the adequacy of the feature space for subsequent predictive modeling under controlled conditions but do not, by themselves, establish generalizable group differences or predictive validity beyond the present cohort.

5.2. RQ1—Stability of ML-Based Screening

To evaluate the stability of the screening mechanism, model performance was examined across multiple feature configurations and classifier families.
  • Impact of Multimodal Feature Integration
As shown in Table 3, incorporating ASR-derived fluency indicators improved recall from 0.82 to 0.85 and increased PR-AUC from 0.38 to 0.47, indicating that speech-derived features provide additional predictive information for identifying at-risk learners.
  • Stability Across Classifier Families
Across all classifier families evaluated using repeated stratified cross-validation (5 × 5), recall values consistently ranged between 0.84 and 0.87, indicating stable detection performance under class imbalance conditions.
This pattern reflects a recall-oriented screening configuration, where sensitivity is prioritized over specificity. Accordingly, model outputs should be interpreted as preliminary screening signals requiring subsequent human verification rather than as definitive diagnostic classifications.
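When reading these recall values, it helps to keep the fold composition in mind. The calculation below assumes that stratified splitting spreads the 39 struggling readers as evenly as possible across the five test folds; the variable names are illustrative.

```python
n_struggling, n_typical = 39, 156   # group sizes from Section 4.1
n_folds, n_repeats = 5, 5           # evaluation scheme from Section 4.3

base, extra = divmod(n_struggling, n_folds)
positives_per_fold = [base + 1] * extra + [base] * (n_folds - extra)
recall_step = [1 / k for k in sorted(set(positives_per_fold))]
total_evaluations = n_folds * n_repeats

print(positives_per_fold)   # [8, 8, 8, 8, 7]
print(total_evaluations)    # 25
```

Each test fold therefore contains seven or eight struggling readers, so a single missed learner shifts fold-level recall by 0.125 to about 0.143, and the reported figures are averages over 25 such fold-level estimates.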

5.3. RQ2—Impact of Psychometric Refinement

To examine how measurement quality influences predictive performance, two feature-refinement strategies were compared: (1) VIF-based refinement and (2) IRT-informed refinement.
As shown in Table 4, IRT-informed refinement produced modest increases in precision and PR-AUC across classifiers while maintaining comparable recall levels. These results suggest that psychometric filtering contributes to improving the clarity of the feature space within the present dataset. However, these improvements should be interpreted as incremental rather than transformative.

5.4. RQ3—Explainable AI Analysis

To interpret how the screening model generates predictions, explainable AI analyses were conducted at three complementary levels: group-level statistical comparison, model-level sensitivity analysis, and instance-level contribution analysis. These correspond to different analytical perspectives. Specifically, Table 2 reflects group-level separability, Table 5 captures model-level sensitivity based on PR-AUC decrease, and Table 6 and Table 7 represent instance-level contributions based on SHAP values. Differences across these results should therefore be interpreted as reflecting distinct levels of analysis rather than as contradictory findings.
At the model level, domain importance was examined using PR-AUC decrease (Table 5). Across all classifiers, word reading produced the largest decrease in PR-AUC, indicating that it plays a dominant role in defining the global decision boundary of the model. This pattern was consistent across model types, suggesting a stable hierarchy of feature importance at the decision-boundary level.
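The article reports domain importance as a decrease in PR-AUC but does not describe the procedure. One common way to obtain such a quantity is permutation importance at the domain level: shuffle all columns belonging to one domain across learners and measure how much the score drops. The sketch below illustrates that assumption only; the function names, the toy predictor, and the toy data are hypothetical.

```python
import random


def average_precision(y_true, scores):
    """PR-AUC summarized as average precision (ties are broken by input order)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def domain_importance(predict, X, y, domains, seed=0):
    """Drop in average precision after shuffling each domain's columns across learners."""
    rng = random.Random(seed)
    baseline = average_precision(y, [predict(row) for row in X])
    drops = {}
    for name, columns in domains.items():
        order = list(range(len(X)))
        rng.shuffle(order)
        shuffled = [list(row) for row in X]
        for i, source in enumerate(order):
            for c in columns:
                shuffled[i][c] = X[source][c]   # move the whole domain together
        permuted = average_precision(y, [predict(row) for row in shuffled])
        drops[name] = baseline - permuted
    return drops


# Toy data: column 0 = word-reading score, column 1 = vocabulary score.
X = [[1, 5], [2, 3], [8, 4], [9, 1], [7, 2], [3, 6], [10, 2], [6, 5]]
y = [1, 1, 0, 0, 0, 1, 0, 0]
toy_predict = lambda row: -row[0]   # toy model: risk depends on column 0 only
drops = domain_importance(toy_predict, X, y, {"word_reading": [0], "vocabulary": [1]})
```

In this toy setup the vocabulary drop is exactly zero because the predictor ignores that column, whereas shuffling word reading can only lower a baseline that starts at a perfect score.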
At the instance level, SHAP-based analyses provided a complementary perspective. Item-level importance (Table 6) showed that predictions are influenced by cumulative contributions across multiple features rather than by isolated item errors.
When aggregated at the domain level (Table 7), vocabulary knowledge exhibited the largest average contribution across individual predictions, indicating a broader and more distributed influence compared to other domains.
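Domain-level aggregation of SHAP values can be sketched as follows, assuming the aggregate is the mean absolute SHAP value of each feature summed within its domain. The article does not specify the aggregation rule, and the attribution values below are made up for illustration.

```python
def domain_contributions(shap_values, feature_domains):
    """Sum the mean |SHAP| of each feature into its domain.

    shap_values: one row per learner, one column per feature.
    feature_domains: domain label for each feature column.
    """
    n_rows = len(shap_values)
    totals = {}
    for column, domain in enumerate(feature_domains):
        mean_abs = sum(abs(row[column]) for row in shap_values) / n_rows
        totals[domain] = totals.get(domain, 0.0) + mean_abs
    return totals


# Made-up attribution values for three learners and three features.
toy_shap = [
    [0.20, -0.10, 0.05],
    [-0.40, 0.30, 0.05],
    [0.00, 0.20, -0.05],
]
toy_domains = ["word_reading", "vocabulary", "vocabulary"]
totals = domain_contributions(toy_shap, toy_domains)
print({name: round(value, 2) for name, value in totals.items()})
# {'word_reading': 0.2, 'vocabulary': 0.25}
```

Summing within a domain favors domains represented by more features, whereas averaging would not; this is one reason instance-level aggregates can rank domains differently from a model-level sensitivity analysis.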
Taken together, these results indicate that different literacy domains contribute differently depending on the level of analysis. Word reading primarily shapes model-level decision boundaries, while vocabulary knowledge contributes more diffusely to variation across individual predictions. This distinction reflects differences between global sensitivity patterns and local contribution structures, rather than inconsistency across analytical results.
Finally, individual case patterns suggest that high-risk classifications tend to emerge from combined deficits across multiple domains, particularly vocabulary and fluency-related features. This pattern reflects multidimensional feature aggregation within the model rather than causal relationships between literacy domains.

6. Discussion and Implications

This study examined whether a human-validated multimodal screening workflow could support stable early-risk detection within a digital assessment context.
The findings indicate three key points. First, the proposed workflow achieved stable recall under class imbalance conditions, supporting consistent identification of at-risk learners. Second, psychometric-informed feature refinement contributed to modest improvements in precision, suggesting enhanced signal clarity within the feature space. Third, explainable AI analyses revealed interpretable domain-level patterns, linking model behavior to literacy-related constructs.
Differences across statistical comparisons, model-level sensitivity analyses, and SHAP-based contribution patterns reflect variations in analytical level rather than inconsistency in results. Together, these findings suggest that early literacy screening can be understood as a coordinated workflow integrating measurement design, multimodal data, human validation, and machine learning-based analysis.

6.1. System-Level Validation and Architectural Contribution

Building on the analyses in Section 5, the results show that the K-KOBUKI architecture achieved stable recall within the analyzed cohort (0.84–0.90 across models, ≈0.85 for most; Table 4). In the context of educational sustainability, this recall-oriented screening configuration may be relevant to the early-identification principle of SDG 4, where minimizing false negatives is prioritized [26,27].
Adding speech-derived (ASR) indicators to the structured responses raised recall from 0.82 to 0.85, precision from 0.36 to 0.41, and PR-AUC from 0.38 to 0.47 (Table 3), suggesting that multimodal feature combinations capture complementary aspects of reading performance. This highlights the value of combining structured responses with speech-based measures within a unified screening workflow.
An important architectural characteristic is the inclusion of a HITL verification stage, which is intended to support the reliability of speech-derived features by ensuring that input data reflect validated learner performance. At the same time, this reliance on human verification introduces constraints on scalability, as manual intervention is required within the data-processing pipeline.

6.2. Implications for AI-Driven Digital Assessment Engineering

The findings of this study suggest that AI-enabled screening can be conceptualized as a socio-technical workflow in which human judgment and machine-generated signals are structurally integrated, rather than as a fully autonomous system.
The characteristics summarized in Table 8 should be interpreted as design-level properties of the proposed workflow rather than empirically validated system outcomes. In this context, the implications of K-KOBUKI can be understood across four dimensions: accessibility, resilience, agency, and accountability.
While the cloud-based and multimodal design suggests potential improvements in access to screening processes and timeliness of feedback, these implications remain theoretical and have not been empirically evaluated in the present study. Accordingly, future research is required to examine the effectiveness of such workflow-based screening systems under real-world deployment conditions.

6.3. Limitations and Conclusions

Several limitations should be acknowledged. First, the dataset is limited to a localized cohort (N = 195), and model performance may be affected by distributional differences across regions, instructional contexts, and learner populations. In addition, no external validation dataset was available, and the results should therefore be interpreted as internal validation within a single cohort.
Second, the use of the SMOTE–Tomek resampling technique should be understood as an exploratory strategy within a small-sample context rather than as a substitute for real-world minority data.
Third, while full HITL verification improves the reliability of speech-derived features, it limits operational scalability, as manual validation is required within the data-processing pipeline.
Fourth, the relatively low precision observed across models (0.29–0.41; Table 4) implies a non-negligible rate of false-positive classifications, which may increase teacher workload in practical screening contexts. For example, in a typical class of 25 students, a precision of approximately 0.40 may result in around 3–4 false-positive identifications per class. This highlights the need for subsequent human verification to ensure practical usability.
Fifth, the study was conducted in a Korean-language context, and the structural characteristics of the Korean writing system and phonology may influence both speech-derived features and model behavior, limiting direct generalization to other languages.
Finally, the use of probabilistic risk classification raises ethical concerns, as false-positive identifications may lead to unintended labeling or stigmatization. Careful interpretation and human oversight are therefore required in practical use.
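The false-positive arithmetic in the fourth limitation can be made explicit. The expected number of false positives depends not only on precision but also on recall and on the share of at-risk learners in the class. The sketch below assumes recall of 0.85 and precision of 0.40, as reported above; the prevalence values are illustrative assumptions and are not figures reported in this section. The function name `expected_false_positives` is hypothetical.

```python
def expected_false_positives(n_students, prevalence, recall, precision):
    """Expected false positives implied by prevalence, recall, and precision."""
    true_positives = n_students * prevalence * recall
    flagged = true_positives / precision  # precision = TP / (TP + FP)
    return flagged - true_positives

# Prevalence values are assumptions chosen for illustration only.
for prevalence in (0.10, 0.12, 0.20):
    fp = expected_false_positives(25, prevalence, recall=0.85, precision=0.40)
    print(f"prevalence={prevalence:.2f}: about {fp:.1f} false positives per class of 25")
```

Under these assumptions, the range of 3–4 false positives per class corresponds to an at-risk prevalence of roughly 10–12%. Holding precision fixed while varying prevalence is a simplification, since precision itself usually shifts with prevalence, so the figures should be read as order-of-magnitude guidance only.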
Within these limitations, the findings demonstrate that stable early-risk detection can be achieved through the integration of psychometrically informed features and multimodal modeling. K-KOBUKI provides prototype-level evidence for the feasibility of a human-validated, AI-assisted screening workflow, with its primary contribution lying in the integration of assessment, validation, and predictive analytics within a unified architecture rather than in the validation of a deployable system.
Future research should examine external validity across diverse populations, explore selective HITL strategies to balance reliability and efficiency, and evaluate the practical implementation of workflow-based screening systems in real classroom settings.

Author Contributions

Conceptualization, S.L. and J.H.; methodology, S.L.; software, S.L.; validation, S.L. and J.H.; formal analysis, S.L.; investigation, S.L.; data curation, S.L.; writing—original draft preparation, S.L.; writing—review and editing, J.H.; visualization, S.L.; supervision, J.H.; project administration, J.H.; funding acquisition, J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. 2023R1A2C1006289).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Cheongju National University of Education (IRB No. 1301-202308-HR-0004-02; approved on 31 August 2023).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study and from their legal guardians.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the participating schools and teachers for their cooperation in data collection and validation processes.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Exploratory IRT Item Diagnostics

Table A1 reports item difficulty parameters estimated using the one-parameter logistic (1PL; Rasch) model and item difficulty and discrimination parameters estimated using the two-parameter logistic (2PL) model for all items included in the early literacy diagnostic tool.
In this study, the Rasch (1PL) model served as the primary measurement framework to ensure parameter stability, fairness, and interpretability given the modest sample size and mixed item formats. Difficulty parameters from the Rasch model were therefore used as the main reference for evaluating item functioning and coverage of the ability continuum.
The 2PL model was applied strictly for exploratory diagnostic purposes. Discrimination parameters and 2PL-based difficulty estimates are reported to enhance transparency and to document the item refinement process. Given the sample size and the presence of mixed-format items, these parameters are not intended for substantive interpretation or for direct comparison of item quality. In particular, extreme or unstable parameter estimates should be interpreted with caution, as they may reflect sample-specific characteristics rather than generalizable item properties.
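For readers less familiar with IRT, the roles of the difficulty (b) and discrimination (a) parameters can be illustrated with the standard logistic item response function. This is a minimal sketch of the textbook 1PL/2PL form, assuming the usual logistic parameterization; it is not the estimation code used in the study, and the function name `icc` is hypothetical. The parameter values are taken from Table A1.

```python
import math

def icc(theta, b, a=1.0):
    """Probability of a correct response at ability theta.

    a=1.0 gives the 1PL (Rasch) form; other values of a give the 2PL form.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# q1 under the Rasch model (b = -0.9618 in Table A1):
# a learner whose ability equals the item difficulty succeeds half the time.
p_q1_at_difficulty = icc(-0.9618, b=-0.9618)

# q2 under the exploratory 2PL model (b = 8.6513, a = -0.1299 in Table A1):
# a negative discrimination makes the curve fall as ability rises.
p_q2_low_ability = icc(-1.0, b=8.6513, a=-0.1299)
p_q2_high_ability = icc(1.0, b=8.6513, a=-0.1299)
```

The q2 example shows why extreme or negative discrimination estimates are treated as flags for review rather than as substantive findings: a curve that decreases with ability is not interpretable as a measure of the intended skill.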
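Table A1. Item Difficulty and Exploratory Discrimination Parameters from 1PL and 2PL IRT Models.
| Format | Question | 1PL Diff. (b) | 1PL Diff. Interp. | 2PL Diff. (b) | 2PL Disc. (a) | 2PL Diff. Interp. | 2PL Disc. Interp. |
|---|---|---|---|---|---|---|---|
| Multiple-choice | q1 | −0.9618 | Appropriate | −1.0464 | 0.6772 | Appropriate | Appropriate |
| Multiple-choice | q2 | −1.6605 | Appropriate | 8.6513 | −0.1299 | Hard | Low |
| Multiple-choice | q3 | 1.5443 | Appropriate | 3.2990 | 0.3224 | Hard | Low |
| Multiple-choice | q4 | −5.8956 | Easy | −2.1598 | 10.0375 | Easy | Excellent |
| Multiple-choice | q5 | −2.4547 | Easy | −3.4370 | 0.5082 | Easy | Appropriate |
| Multiple-choice | q6 | −2.5673 | Easy | −2.1657 | 0.9276 | Easy | Appropriate |
| Multiple-choice | q7 | −2.6877 | Easy | −2.6161 | 0.7730 | Easy | Appropriate |
| Multiple-choice | q8 | −1.5412 | Appropriate | −1.6002 | 0.7151 | Appropriate | Appropriate |
| Multiple-choice | q9 | −3.6135 | Easy | −2.2728 | 1.4063 | Easy | Appropriate |
| Multiple-choice | q10 | −0.1027 | Appropriate | −0.2183 | 0.3302 | Appropriate | Low |
| Multiple-choice | q11 | −4.3647 | Easy | −5.0235 | 0.6375 | Easy | Appropriate |
| Multiple-choice | q12 | −2.8138 | Easy | −3.7648 | 0.5353 | Easy | Appropriate |
| Multiple-choice | q13 | −0.0720 | Appropriate | −0.0992 | 0.5254 | Appropriate | Appropriate |
| Multiple-choice | q14 | −2.0504 | Easy | −1.7280 | 0.9309 | Appropriate | Appropriate |
| Multiple-choice | q15 | −1.5022 | Appropriate | −1.2283 | 0.9721 | Appropriate | Appropriate |
| Multiple-choice | q16 | −1.2052 | Appropriate | −0.8554 | 1.1975 | Appropriate | Appropriate |
| Multiple-choice | q17 | 1.5441 | Appropriate | 1.3545 | 0.8900 | Appropriate | Appropriate |
| Multiple-choice | q18 | −4.5335 | Easy | −5.9420 | 0.5527 | Easy | Appropriate |
| Multiple-choice | q19 | −2.8138 | Easy | −2.4608 | 0.8839 | Easy | Appropriate |
| Multiple-choice | q20 | −2.8791 | Easy | −1.6811 | 1.6340 | Appropriate | Excellent |
| Multiple-choice | q21 | −2.6878 | Easy | −1.6216 | 1.5418 | Appropriate | Excellent |
| Multiple-choice | q22 | 1.0377 | Appropriate | 0.7725 | 1.1203 | Appropriate | Appropriate |
| Multiple-choice | q23 | −0.9625 | Appropriate | −0.7520 | 1.0392 | Appropriate | Appropriate |
| Multiple-choice | q24 | −1.3139 | Appropriate | −0.8868 | 1.2958 | Appropriate | Appropriate |
| Multiple-choice | q25 | 1.3189 | Appropriate | 2.1021 | 0.4397 | Hard | Low |
| Recording items | s1 | −3.2509 | Easy | −3.5143 | 0.6826 | Easy | Appropriate |
| Recording items | s2 | −2.1458 | Easy | −1.2270 | 1.7282 | Appropriate | Excellent |
| Recording items | s3 | −1.6607 | Appropriate | −1.0272 | 1.4987 | Appropriate | Appropriate |
| Recording items | s4 | −3.0949 | Easy | −5.1923 | 0.4196 | Easy | Low |
| Recording items | s5 | −2.4547 | Easy | −2.6388 | 0.6862 | Easy | Appropriate |
| Recording items | s6 | 0.1431 | Appropriate | 1.4478 | 0.0641 | Appropriate | Low |
| Recording items | s7 | −2.1456 | Easy | −4.8084 | 0.3075 | Easy | Low |
| Recording items | s8 | −0.6021 | Appropriate | −0.5396 | 0.8623 | Appropriate | Appropriate |
| Recording items | s9 | −1.6615 | Appropriate | −0.9393 | 1.7823 | Appropriate | Excellent |
(a) Rasch (1PL) model results; (b) exploratory 2PL model results.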
Notes. Difficulty classifications (e.g., easy, appropriate, hard) are based on predefined thresholds used for descriptive interpretation and are referenced primarily to the Rasch (1PL) model. Discrimination-related qualitative labels derived from the 2PL model are provided for exploratory diagnostic reference only and do not imply substantive superiority or inferiority of individual items. Items exhibiting extremely high or low discrimination estimates should be interpreted as candidates for further review, revision, or removal, rather than as definitive indicators of item quality. This appendix is provided to ensure transparency in the psychometric evaluation process and to support replication and future refinement of the diagnostic instrument. Table formatting follows standard APA and SSCI appendix conventions.

Appendix B. Detailed Model Configurations

This appendix summarizes the detailed parameter settings for all models used in the study to ensure reproducibility. Model configurations were specified using constrained parameter ranges under small-sample conditions. Table A2 presents the final parameter settings applied in the modeling framework, including classifier configurations and data preprocessing procedures.
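Table A2. Summary of model configurations and preprocessing settings.
| Model | Parameter | Setting |
|---|---|---|
| Logistic Regression | Regularization | L2 |
| Logistic Regression | Solver | lbfgs/liblinear |
| Logistic Regression | Max iterations | 5000–8000 |
| Logistic Regression | Class weight | Balanced |
| Decision Tree | Class weight | Balanced |
| Decision Tree | Max depth | 4 |
| Decision Tree | Min samples per leaf | 5 |
| Random Forest | Number of estimators | 800 |
| Random Forest | Max depth | 6 |
| Random Forest | Min samples per leaf | 2 |
| Random Forest | Class weight | Balanced subsample |
| Gradient Boosting | Configuration | Default (scikit-learn) |
| XGBoost | Number of estimators | 600–800 |
| XGBoost | Learning rate | 0.05 |
| XGBoost | Max depth | 3–4 |
| XGBoost | Subsample | 0.9 |
| XGBoost | Colsample by tree | 0.9 |
| XGBoost | Gamma | 0.1 |
| XGBoost | Lambda (L2) | 1.0 |
| XGBoost | Alpha (L1) | 0.0 |
| XGBoost | Scale pos weight | Negative/positive ratio |
| Preprocessing | Resampling method | SMOTE–Tomek |
| Preprocessing | SMOTE k-neighbors | 5 |
| Preprocessing | Tomek links | Applied |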
All preprocessing steps, including resampling and model fitting, were performed within the training folds to prevent data leakage.
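The fold-internal resampling described above can be sketched in a dependency-free form. To keep the example self-contained, simple random oversampling is used as a stand-in for SMOTE–Tomek; it is not the technique used in the study. The point illustrated is only the ordering: split first, then resample the training indices, and leave the test fold untouched. The number of splits and repeats, the toy labels, and all function names are assumptions. In practice, a library implementation such as the `SMOTETomek` class in imbalanced-learn would typically replace `oversample_minority`.

```python
import random

def stratified_folds(y, n_splits, rng):
    """Assign indices to folds so that class proportions are roughly preserved."""
    folds = [[] for _ in range(n_splits)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    return folds

def oversample_minority(train_idx, y, rng):
    """Stand-in for SMOTE-Tomek: duplicate minority indices until classes balance."""
    pos = [i for i in train_idx if y[i] == 1]
    neg = [i for i in train_idx if y[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train_idx + extra

def leakage_safe_splits(y, n_splits=5, n_repeats=3, seed=0):
    """Yield (resampled_train_idx, untouched_test_idx) for repeated stratified CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = stratified_folds(y, n_splits, rng)
        for k in range(n_splits):
            test_idx = folds[k]
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            # Resampling sees training indices only; the test fold is never touched.
            yield oversample_minority(train_idx, y, rng), test_idx

# Toy labels (assumption): 10 at-risk and 40 typical learners.
y = [1] * 10 + [0] * 40
splits = list(leakage_safe_splits(y))
```

Because resampling happens after the split, no synthetic or duplicated minority case can appear in a test fold, so the reported metrics reflect the original class distribution.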

References

  1. Cain, K.; Oakhill, J. Profiles of children with specific reading comprehension difficulties. Br. J. Educ. Psychol. 2006, 76, 683–696.
  2. Kaderavek, J.N.; Sulzby, E. Narrative production by children with and without specific language impairment. J. Speech Lang. Hear. Res. 2000, 43, 34–49.
  3. Gough, P.B.; Tunmer, W.E. Decoding, reading, and reading disability. Remedial Spec. Educ. 1986, 7, 6–10.
  4. Piasta, S.B.; Wagner, R.K. Developing early literacy skills: A meta-analysis of alphabet learning and instruction. Read. Res. Q. 2010, 45, 8–38.
  5. Ehri, L.C.; Nunes, S.R.; Willows, D.M.; Schuster, B.V.; Yaghoub-Zadeh, Z.; Shanahan, T. Phonemic awareness instruction helps children learn to read: Evidence from the National Reading Panel’s meta-analysis. Read. Res. Q. 2001, 36, 250–287.
  6. World Bank; UNESCO; UNICEF; FCDO; USAID; Bill & Melinda Gates Foundation. The State of Global Learning Poverty: 2022 Update (Conference Edition); World Bank: Washington, DC, USA, 2022. Available online: https://thedocs.worldbank.org/en/doc/e52f55322528903b27f1b7e61238e416-0200022022/original/Learning-poverty-report-2022-06-21-final-V7-0-conferenceEdition.pdf (accessed on 13 March 2026).
  7. UNESCO. When Schools Shut: New UNESCO Study Exposes Failure to Factor in Gender in COVID-19 Education Responses; UNESCO: Paris, France, 2021; Available online: https://www.unesco.org/en/articles/when-schools-shut-new-unesco-study-exposes-failure-factor-gender-covid-19-education-responses (accessed on 13 March 2026).
  8. Kirsten, K.; Greefrath, G.; Emmrich, R. Technology-based versus paper-pencil: Sources of mode effects in large-scale assessment. Int. J. Math. Educ. Sci. Technol. 2026, 1–28.
  9. Anghel, E.; Khorramdel, L.; von Davier, M. The use of process data in large-scale assessments: A literature review. Large-Scale Assess. Educ. 2024, 12, 13.
  10. Chuang, P.-L.; Yan, X. Language assessment in the era of generative artificial intelligence: Opportunities, challenges, and future directions. System 2025, 134, 103846.
  11. Zanellati, A.; Zingaro, S.P.; Gabbrielli, M. Balancing performance and explainability in academic dropout prediction. IEEE Trans. Learn. Technol. 2024, 17, 2086–2099.
  12. Cukurova, M.; Miao, F. AI Competency Framework for Teachers; UNESCO Publishing: Paris, France, 2024.
  13. Bagdonaite, J.; Dagiene, V. Artificial Intelligence in Primary Education: A Systematic Literature Review 2020–2025. Inform. Educ. 2025, 24, 697–736.
  14. Rathnayake, N.; Wijewardane, S. Machine learning-based Direct Normal Irradiance (DNI) forecasting using satellite data for Concentrated Solar Power (CSP) plants with Thermal Energy Storage (TES). Sci. Rep. 2026, 16, 11257.
  15. Siemens, G.; Baker, R.S.J.D. Learning Analytics and Educational Data Mining: Towards Communication and Collaboration. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge, Vancouver, Canada, 29 April–2 May 2012; pp. 252–254.
  16. Yumus, M.; Stuhr, C.; Meindl, M.; Leuschner, H.; Jungmann, T. EuleApp©: A computerized adaptive assessment tool for early literacy skills. Front. Psychol. 2025, 16, 1522740.
  17. Xi, X. Advancing language assessment with AI and ML–Leaning into AI is inevitable, but can theory keep up? Lang. Assess. Q. 2023, 20, 357–376.
  18. Baker, R.S.; Hawn, A. Algorithmic bias in education. Int. J. Artif. Intell. Educ. 2022, 32, 1052–1092.
  19. Kuhn, M.R.; Schwanenflugel, P.J.; Meisinger, E.B. Aligning theory and assessment of reading fluency: Automaticity, prosody, and definitions of fluency. Read. Res. Q. 2010, 45, 230–251.
  20. Bailly, G.; Godde, E.; Piat-Marchand, A.-L.; Bosse, M.-L. Automatic assessment of oral readings of young pupils. Speech Commun. 2022, 138, 67–79.
  21. Stewart, A.E.; Keirn, Z.; D’Mello, S.K. Multimodal modeling of collaborative problem-solving facets in triads. User Model. User-Adapt. Interact. 2021, 31, 713–751.
  22. Yan, L.; Echeverria, V.; Jin, Y.; Fernandez-Nieto, G.; Zhao, L.; Li, X.; Alfredo, R.; Swiecki, Z.; Gašević, D.; Martinez-Maldonado, R. Evidence-based multimodal learning analytics for feedback and reflection in collaborative learning. Br. J. Educ. Technol. 2024, 55, 1900–1925.
  23. Han, J.; Shim, Y. Inclusive design of a tool to screen literacy of lower grade elementary school students. In Proceedings of the IEEE International Conference on E-Business Engineering (ICEBE) 2023, Beijing, China, 17–19 October 2023; pp. 178–180.
  24. Tummalapalli, V. Using SMOTE and TOMEK Link Sampling Techniques to Address Imbalanced Data Challenges in the Machine Learning models. IJSAT-Int. J. Sci. Technol. 2025, 16, 1–6.
  25. Adem, H. Vocal Biomarkers of Childhood Trauma: A Machine-Learning Approach to Speech Analysis. J. Speech Lang. Hear. Res. 2026, 69, 1955–1976.
  26. Lardhi, J.S.; Ismail, A.F. Generative Artificial Intelligence for SDG 4: Enhancing Sustainable Quality Learning. Sustainability 2026, 18, 2498.
  27. Tasić, N.; Glušac, D.; Makitan, V.; Jokić, S.; Ljubojev, N.; Vignjević, K. Promoting Sustainable Education Through the Educational Software Scratch: Enhancing Attention Span Among Primary School Students in the Context of Sustainable Development Goal (SDG) 4. Sustainability 2025, 17, 9292.
Figure 1. Structure and question framework of the digital-based early literacy diagnostic app (K-KOBUKI, a Korean early literacy screening system). The figure includes example Korean literacy assessment items. The first item requires learners to read a short sentence and identify the first letter of the third line. The second item assesses recognition of Korean tense consonants. The third item requires learners to select the correct written word corresponding to a spoken pronunciation. The fourth item assesses syllable-count matching, the fifth item assesses semantic association, and the final item assesses oral reading fluency through short-sentence reading.
Figure 2. K-KOBUKI System Architecture. Solid arrows indicate the primary data-processing workflow, whereas dashed arrows represent teacher-review and diagnostic-feedback processes.
Figure 3. Example multiple-choice items (MCIs) from the K-KOBUKI assessment. Korean words in the figure represent family-related and literacy-related vocabulary items.
Figure 4. Example voice-response items (VRIs) and waveform visualization. The Korean sentence in the figure reads, ‘The baby elephant family burst into laughter’.
Figure 5. Scenes of user interaction with the diagnostic app, K-KOBUKI.
Figure 6. Data collection and preprocessing flow of the analytic dataset.
Figure 7. Representative item characteristic curves for selected items under the 1PL (left) and 2PL (right) models.
Figure 8. Combined display of standardized means (z) and effect sizes (Hedges’ g with 95% confidence intervals) across literacy domains.
Table 1. Comparison of item refinement results based on VIF and IRT analyses.
| Criterion | No. Removed | Main Domains (Examples) | Decision Rule |
|---|---|---|---|
| VIF-based Removal | 7 | Print recognition (q3 *), phonological awareness (q8, q10 *), word recognition (q12), reading fluency (q18), vocabulary knowledge (q21, q23) | Multicollinearity (VIF > 10) |
| IRT-based Exclusion | 7 | Print recognition (q2, q3 *), phonological awareness (q10 *, s4, s6, s7), vocabulary knowledge (q25) | 2PL: a < 0.3 or extreme b |
* Indicates items identified in both VIF-based removal and IRT-based exclusion procedures.
Table 2. Means and standard deviations by group and literacy domain (including maximum scores).
| Domain | Max | Total M | Total SD | Typical M | Typical SD | Struggling M | Struggling SD | t | p |
|---|---|---|---|---|---|---|---|---|---|
| Print Recognition | 4 | 2.52 | 0.93 | 2.63 | 0.9 | 2.21 | 0.83 | 2.79 | 0.007 |
| Phonological Awareness | 15 | 10.9 | 2.52 | 11.26 | 2.17 | 10.0 | 2.48 | 2.91 | 0.005 |
| Word Reading | 5 | 4.02 | 0.99 | 4.19 | 0.83 | 3.54 | 1.02 | 3.70 | 0.001 |
| Vocabulary Knowledge | 7 | 5.19 | 1.47 | 5.54 | 1.2 | 4.05 | 1.43 | 6.33 | <0.001 |
| Reading Fluency | 3 | 1.74 | 0.9 | 1.91 | 0.82 | 1.13 | 0.89 | 4.96 | <0.001 |
Table 3. Impact of Feature Integration on Classification Performance.
| Feature Configuration | Recall | Precision | PR-AUC |
|---|---|---|---|
| Structured only | 0.82 | 0.36 | 0.38 |
| Structured + ASR | 0.85 | 0.41 | 0.47 |
Table 4. Cross-validated classification performance under VIF-based and IRT-informed feature refinement.
| Model | Precision (VIF) | Precision (IRT) | Recall (VIF) | Recall (IRT) | F1 (VIF) | F1 (IRT) | ROC-AUC (VIF) | ROC-AUC (IRT) | PR-AUC (VIF) | PR-AUC (IRT) |
|---|---|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.361 | 0.372 | 0.856 | 0.854 | 0.494 | 0.514 | 0.711 | 0.728 | 0.440 | 0.456 |
| Random Forest | 0.403 | 0.411 | 0.864 | 0.861 | 0.533 | 0.537 | 0.767 | 0.772 | 0.464 | 0.474 |
| Gradient Boosting | 0.361 | 0.368 | 0.870 | 0.872 | 0.496 | 0.507 | 0.717 | 0.725 | 0.419 | 0.435 |
| XGBoost | 0.341 | 0.358 | 0.844 | 0.848 | 0.473 | 0.493 | 0.719 | 0.731 | 0.439 | 0.441 |
| Decision Tree | 0.290 | 0.305 | 0.901 | 0.902 | 0.425 | 0.454 | 0.686 | 0.701 | 0.357 | 0.368 |
Table 5. Summary of domain-level importance (PR-AUC decrease-based).
| Model | Print Recognition (VIF) | Print Recognition (IRT) | Phonological Awareness (VIF) | Phonological Awareness (IRT) | Word Reading (VIF) | Word Reading (IRT) | Vocabulary Knowledge (VIF) | Vocabulary Knowledge (IRT) | Reading Fluency (VIF) | Reading Fluency (IRT) |
|---|---|---|---|---|---|---|---|---|---|---|
| Logistic (L2, balanced) | 0.066 | 0.058 | 0.070 | 0.065 | 0.575 | 0.598 | 0.125 | 0.128 | 0.163 | 0.151 |
| Random Forest | 0.046 | 0.042 | 0.052 | 0.048 | 0.738 | 0.721 | 0.055 | 0.061 | 0.110 | 0.128 |
| Gradient Boosting | 0.051 | 0.047 | 0.083 | 0.079 | 0.630 | 0.643 | 0.082 | 0.086 | 0.155 | 0.145 |
| XGBoost | 0.039 | 0.041 | 0.083 | 0.076 | 0.669 | 0.662 | 0.085 | 0.091 | 0.124 | 0.130 |
| Decision Tree | 0.073 | 0.061 | 0.102 | 0.094 | 0.496 | 0.501 | 0.054 | 0.057 | 0.275 | 0.287 |
Table 6. Top 5 Item-Level SHAP Contributions.
| Item | Mean \|SHAP\| |
|---|---|
| q23 | 0.087 |
| q22 | 0.045 |
| q24 | 0.032 |
| s9 | 0.029 |
| q19 | 0.028 |
Table 7. Domain-Level SHAP Contributions.
| Domain | Mean \|SHAP\| |
|---|---|
| Vocabulary Knowledge (7) | 0.151 |
| Reading Fluency (3) | 0.074 |
| Phonological Awareness (15) | 0.073 |
| Word Reading (5) | 0.065 |
| Print Recognition (4) | 0.041 |
Table 8. Comparison of Sustainability-Oriented Design Characteristics: Traditional Manual Screening vs. K-KOBUKI Workflow.
| Feature | Traditional Manual Screening | K-KOBUKI (Proposed Workflow) | Sustainability Impact (SDG 4) |
|---|---|---|---|
| Accessibility | Resource-intensive; often limited to urban areas | Cloud-supported digital delivery with potential applicability in resource-constrained contexts | Potential to improve access to screening processes under appropriate infrastructural conditions [14] |
| Resilience | Delayed feedback (weeks); deficit accumulation | Reduced assessment-to-feedback latency within a semi-automated workflow (subject to human verification processes) | Potential to support earlier identification, although not evaluated as real-time intervention in this study [13] |
| Agency | High labor burden on expert evaluators | HITL verification maintains teacher involvement in data validation and interpretation | Supports teacher-centered decision-making rather than replacing professional judgment [13] |
| Accountability | Rater-dependent; subjective | Combination of human verification and explainable model outputs (e.g., SHAP) supporting interpretability | Potential to enhance transparency, while remaining dependent on human validation processes [12] |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lee, S.; Han, J. Scaling Early Literacy Screening for Sustainable Education: A Cloud-Native Architecture Integrating Machine Learning and Human-in-the-Loop Validation. Sustainability 2026, 18, 5142. https://doi.org/10.3390/su18105142
