Scaling Early Literacy Screening for Sustainable Education: A Cloud-Native Architecture Integrating Machine Learning and Human-in-the-Loop Validation

Lee, Sihoon; Han, Jeonghye

doi:10.3390/su18105142


by Sihoon Lee ¹ and Jeonghye Han ²,*

¹ The Institute of Brain-Based Learning, Korea National University of Education, Cheongju 28173, Republic of Korea

² Department of Computer Education, Cheongju National University of Education, Cheongju 28173, Republic of Korea

* Author to whom correspondence should be addressed.

Sustainability 2026, 18(10), 5142; https://doi.org/10.3390/su18105142

Submission received: 28 March 2026 / Revised: 6 May 2026 / Accepted: 14 May 2026 / Published: 20 May 2026

(This article belongs to the Special Issue AI for Sustainable and Creative Learning in Education)


Abstract

Early literacy screening is essential for reducing long-term educational inequality, yet traditional paper-based assessments remain difficult to scale due to logistical constraints and delayed feedback. This study presents K-KOBUKI, a cloud-based prototype screening workflow that organizes early literacy assessment as a human-validated, data-driven process. The system integrates structured assessment responses with automated speech recognition-based analysis of oral reading performance across five literacy domains and incorporates a human-in-the-loop verification stage to ensure the reliability of speech-derived features. The system was evaluated using data from 195 first-grade students. Across repeated stratified cross-validation, multiple classification models achieved stable recall (≈0.85) under class imbalance conditions, supporting consistent identification of at-risk learners. Psychometric-informed feature refinement improved precision without reducing recall, indicating enhanced signal clarity through measurement-level stabilization. Explainable AI analysis further revealed that word reading and reading fluency contributed strongly to model-level decision boundaries, while vocabulary knowledge provided complementary influence at the individual level. These findings provide prototype-level evidence that a human-validated, multimodal screening workflow can support stable early-risk detection. From a sustainability perspective, the results suggest potential design-level contributions to improving accessibility and reducing delays in early identification processes.

Keywords: early literacy screening; sustainable education; educational equity; multimodal learning analytics; human-in-the-loop

1. Introduction

Early literacy represents a multidimensional set of linguistic competencies that emerge prior to formal schooling and serve as foundational predictors of long-term academic trajectories [1,2]. Core components—including phonological awareness, vocabulary knowledge, decoding efficiency, and listening comprehension—form an interdependent structure that shapes children’s readiness to read [3,4]. Among these components, phonological awareness has been consistently identified as a central mechanism underlying decoding and reading fluency across orthographic systems [5]. From the perspective of Sustainable Development Goal 4 (SDG 4), early literacy is not merely an academic milestone but a foundational condition for inclusive and equitable quality education, and current global monitoring indicates that progress toward minimum reading proficiency remains insufficient to meet the 2030 agenda [6].

However, the global education system currently faces a sustainability crisis known as “learning poverty,” where approximately 70% of 10-year-olds in low- and middle-income contexts are unable to read and understand a simple text [7]. Recent global learning disruptions have further exacerbated this issue, highlighting early literacy screening as a critical prerequisite for educational equity [8]. Because foundational literacy functions as a gateway to subsequent learning, delayed identification of reading difficulty can amplify later educational exclusion; early screening is therefore a systemic issue rather than merely a testing concern [7]. Recent international monitoring likewise reports weak progress in reading proficiency together with persistent inequities in educational access, infrastructure, and teacher capacity.

Despite their widespread adoption, traditional paper-based literacy assessments face structural limitations that hinder timely and equitable screening [9,10]. These assessments are often resource-intensive, requiring extensive one-on-one sessions and manual scoring, which limits their accessibility in marginalized or rural areas where specialized personnel are scarce [8,9]. This “diagnostic divide” prevents timely intervention and reinforces systemic cycles of academic exclusion. While recent advances in artificial intelligence (AI) and learning analytics have enabled automated scoring and machine learning (ML)-based risk prediction, emerging work in AI-enabled assessment emphasizes that such systems must be evaluated not only by predictive performance but also by reliability, validity, fairness, and contextual applicability [11,12]. Accordingly, the key challenge is not automation alone, but the design of a trustworthy assessment architecture in which machine-generated outputs are supported by human validation, pedagogical accountability, and learner-rights protections [12,13].

To address these challenges, the present study frames early literacy screening as a human–AI collaborative workflow design problem. The proposed system, K-KOBUKI, is introduced as a cloud-based prototype workflow that integrates structured assessment delivery, speech-processing support, psychometric feature refinement, machine learning-based risk estimation, and explainable AI within a unified pipeline. In this workflow, automated analysis is coupled with human verification to ensure that derived features reflect validated learner performance.

Within this study, sustainability and scalability are not treated as empirically demonstrated system outcomes but as design-level implications of the proposed workflow. Specifically, sustainability is considered in terms of accessibility, operational continuity, and socio-technical accountability, while scalability is conceptualized as architectural extensibility rather than large-scale deployment. Accordingly, the study focuses on evaluating workflow-level feasibility rather than system-level deployment performance.

From this perspective, the contribution of the study is threefold. First, it demonstrates that stable early-risk detection can be achieved through the integration of multimodal assessment data and machine learning. Second, it shows that psychometric-informed feature refinement contributes to improving signal clarity within the predictive feature space. Third, it illustrates how explainable AI techniques can link model outputs to interpretable literacy domains, supporting pedagogically meaningful interpretation. These contributions are presented as prototype-level evidence.

Accordingly, the present study evaluates the proposed architecture as a human–AI collaborative screening workflow and examines its performance through the following research questions:

RQ1. Can a cloud-native digital screening architecture achieve stable detection (recall) of at-risk learners under classroom-level class imbalance conditions?
RQ2. Does psychometric-informed feature refinement improve signal clarity and predictive precision without sacrificing the sensitivity required for universal screening?
RQ3. Can SHAP-based explainability align model outputs with pedagogically interpretable literacy constructs to support data-driven teacher interventions?

2. Theoretical and Research Background

2.1. Digital Literacy Assessment Infrastructure

Early literacy assessment has traditionally relied on paper-based, individually administered formats that evaluate letter recognition, word reading, phonological awareness, dictation, and sentence comprehension. Although widely adopted in early education, these approaches require substantial time, trained personnel, and manual scoring procedures, which limit scalability in school-wide or district-level implementation [9]. From a sustainability perspective, these resource-intensive methods exacerbate the “diagnostic divide,” where students in rural or underfunded regions are excluded from early intervention opportunities [13].

The emergence of digital assessment platforms has shifted literacy evaluation from static testing instruments toward integrated learning analytics infrastructures. By enabling automated item delivery, standardized scoring, and centralized cloud-based data storage, digital systems reduce operational burden while simultaneously producing machine-readable datasets. Rather than constituting fully scalable solutions, these systems should be understood as enabling conditions that can support improvements in access and consistency within early screening workflows. This structural transformation can be interpreted as a socio-technical shift that may contribute to improving access to diagnostic processes [14]. Within this framework, technology functions as a supporting condition for improving consistency in diagnostic processes across diverse geographical and economic contexts [15].

2.2. Machine Learning-Based Risk Prediction in Digital Assessment Contexts

Advances in machine learning have expanded the analytical capabilities of digital literacy screening systems. Structured item-level responses and aggregated domain scores can be transformed into predictive feature matrices, allowing classification algorithms to estimate the probability of reading difficulties [16,17]. This approach represents a shift from descriptive score reporting toward probabilistic risk modeling.

However, predictive performance alone is insufficient for responsible deployment in educational settings. Sustainable AI integration must move beyond “black-box” optimization toward a socially accountable ecosystem that preserves human agency [12]. Recent scholarship emphasizes that educational AI should augment the teacher’s professional judgment rather than replacing it, aligning with the UNESCO AI Competency Framework [13]. Digital literacy screening must therefore incorporate ethical design principles, including transparency and explainability, to ensure that automated risk classifications do not lead to algorithmic stigmatization of young learners [17].

Importantly, the effectiveness of machine learning in screening contexts depends not only on model selection, but also on the quality and interpretability of the underlying feature space. As a result, psychometric validation and machine learning modeling should be treated as interdependent components of a unified assessment process rather than as separate analytical stages [18]. Within this perspective, machine learning should be understood as a decision-support component operating within a broader assessment workflow, rather than as a standalone predictive solution.

2.3. Speech-Recognition-Based Assessment of Oral Reading Performance

AI-based literacy diagnostic systems increasingly incorporate ASR technologies to evaluate oral reading performance. Speech-processing algorithms can quantify pronunciation accuracy and temporal fluency, enabling computational analysis of decoding automaticity. Such approaches align with established research on oral reading fluency as a key indicator of reading development [19] and have recently been operationalized through AI-driven assessment systems [20].

However, the application of ASR in early literacy contexts presents inherent challenges. Children’s speech is often characterized by variability in pronunciation, incomplete articulation, and developmental differences, while classroom environments introduce additional sources of noise and recording inconsistency. These factors can reduce transcription accuracy and limit the direct interpretability of speech-derived indicators.

For this reason, prior research suggests that speech-based measures become educationally meaningful only when embedded within a reliable interpretive process [20]. Human verification plays a critical role in this process by supporting the validation of ASR outputs and ensuring that derived features reflect actual learner performance rather than recognition artifacts. From this perspective, ASR should be understood as a supportive analytical tool within a broader assessment workflow, rather than as a fully autonomous diagnostic mechanism.

Taken together, existing studies indicate that speech-based analysis can provide valuable information for early literacy screening, but its effectiveness depends on the integration of automated processing and human validation [21,22]. The present study builds on this perspective by examining how speech-derived indicators can be incorporated into a structured, multimodal screening workflow.

3. System Architecture and Digital Pipeline

3.1. K-KOBUKI Application Design

The AI-enabled digital early literacy diagnostic application, K-KOBUKI, was implemented as a prototype-level assessment system consisting of 34 assessment items, including 25 multiple-choice items and 9 speech-response items. The assessment covers five core domains of early literacy: print recognition, phonological awareness, word reading, vocabulary knowledge, and reading fluency. These domains are grounded in the Science of Reading framework, which emphasizes the interdependence of decoding automaticity and language comprehension [3].

To ensure technical transparency and pedagogical validity, the 27 predictive features were selected based on their alignment with oral reading fluency (ORF) constructs, capturing pronunciation accuracy, phoneme-level alignment, and temporal patterns [20]. These features combine structured response accuracy with speech-derived indicators, forming a multimodal representation of early literacy performance. Figure 1 presents the overall structure and item framework.

The multi-domain structure was derived from the framework proposed by Han and Shim (2023) [23] and refined through a multi-stage expert review. To support measurement reliability and interpretability, the items were further examined using a Rasch (1PL) model. Detailed results of the psychometric analysis are provided in Appendix A.

3.2. System Architecture and Data Flow

K-KOBUKI was implemented using the Flutter framework to support cross-platform compatibility. The system is designed as a cloud-based workflow in which data collection, validation, and analysis components are structurally integrated. Figure 2 illustrates the overall system architecture and data flow.

Assessment items are dynamically retrieved from a cloud database and delivered through a tablet-based interface. Learner responses are collected in two forms: structured item responses and speech recordings. These data are transmitted to the server, where they are processed through a sequential pipeline consisting of ASR processing, human verification, feature extraction, and machine learning-based analysis.

A key component of this pipeline is the human-in-the-loop (HITL) verification module, which is positioned upstream of feature extraction within the speech-processing workflow. In this stage, ASR-generated transcriptions are reviewed and corrected prior to their use in downstream analysis, ensuring that speech-derived features are based on validated response data.

Following verification, structured responses and speech-derived indicators are combined into a unified feature space for predictive modeling. This modular pipeline separates data acquisition, validation, and analysis stages, enabling a structured flow from raw input data to model-based screening outcomes.
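The ordering described above, with human verification placed upstream of feature extraction, can be sketched as a simple gate in code. This is a minimal illustration and not the K-KOBUKI implementation: the `SpeechResponse` class, its field names, the `extract_features` function, and the two placeholder features are all hypothetical, and the example strings are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechResponse:
    """One spoken item response (hypothetical structure)."""
    item_id: str
    asr_text: str                        # raw ASR transcription
    verified_text: Optional[str] = None  # set only after human review

    def verify(self, corrected_text: str) -> None:
        # A reviewer confirms or corrects the ASR output.
        self.verified_text = corrected_text


def extract_features(response: SpeechResponse, target: str) -> dict:
    """Derive placeholder features, but only from verified transcripts."""
    if response.verified_text is None:
        raise ValueError(f"item {response.item_id} has not been verified")
    text = response.verified_text
    return {
        "item_id": response.item_id,
        "exact_match": text == target,
        "length_ratio": len(text) / len(target) if target else 0.0,
    }


# Made-up example: the ASR output contains a recognition error.
response = SpeechResponse(item_id="S1", asr_text="namo")
try:
    extract_features(response, target="namu")
    blocked = False
except ValueError:
    blocked = True  # unverified input is rejected

response.verify("namu")  # reviewer corrects the transcription
features = extract_features(response, target="namu")
```

Raising an error on unverified input, rather than silently falling back to the raw ASR text, is one way to make the upstream position of the verification stage enforceable in code.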

3.3. Application Interface

The K-KOBUKI interface was designed to capture both structured response data and speech performance within a single digital environment. The multiple-choice interface (Figure 3) evaluates foundational skills such as print recognition and semantic judgment. In alignment with socially accountable design [12], responses are automatically scored, but final confirmation remains with the examiner before data synchronization.

The speech-response interface (Figure 4) requires learners to read aloud to evaluate decoding fluency. The integrated HITL dashboard allows teachers to verify automatically generated transcriptions, supporting the use of ASR outputs as validated input data for analysis. In this context, the system functions as a support tool for structured data collection rather than as an autonomous assessment system.

By integrating structured responses, speech-derived indicators, and human validation within a single workflow, the system provides a multimodal evidence base for early screening. However, this integration should be interpreted as supporting the feasibility of a human-validated screening workflow under controlled conditions, rather than as establishing a fully scalable or fully automated assessment infrastructure.

4. Method

4.1. Participants and Data Collection

The analytic dataset consisted of 195 first-grade elementary school children aged between 7.00 and 7.08 years. To reflect heterogeneous classroom conditions, participants were recruited from urban, rural, and coastal regions.

The assessment was administered in classroom settings using individual tablets (Figure 5). Audio data were captured in a natural classroom environment using a PCM 16-bit, 44.1 kHz mono format. Participants were instructed to produce spoken responses for a minimum of 8 s per item. To ensure high signal integrity for children’s speech, audio preprocessing—including resampling to 16 kHz and noise normalization—was conducted using the librosa library (version 0.11.0). Ground truth for literacy risk was established using an independent standardized assessment. Based on these scores, participants were categorized into struggling readers (n = 39, 20.0%) and typically developing readers (n = 156, 80.0%). All procedures were conducted under institutional review board approval (IRB No. 1301-202308-HR-0004-02).

4.2. Dataset Structure and Multimodal Integration

Data collection was conducted across two implementation phases. The pilot phase (N = 351) was used for system refinement and item calibration and was excluded from predictive modeling due to changes in assessment structure between phases. The second phase involved 251 learners (Figure 6).

From the second phase, 56 cases were excluded based on predefined quality-control criteria. These included (1) incomplete data or missing identifiers (n = 11), (2) measurement validity concerns such as language mismatch or health-related issues (n = 13), and (3) technical issues including device malfunction, environmental noise, and unusable speech recordings (n = 32). These criteria ensured that the final dataset consisted of complete, reliable, and temporally aligned records suitable for analysis.

The resulting analytic dataset included 195 learners with linked demographic metadata and item-level performance data. All speech recordings (n = 1755) were reviewed and corrected prior to feature extraction to ensure data reliability.

Because verification logs were not systematically recorded for all correction events, the impact of human verification could not be evaluated as an independent variable and is therefore interpreted as a data-quality control mechanism within the pipeline. Accordingly, the results do not establish predictive validity beyond the analyzed cohort.

After preprocessing, the dataset combined structured response indicators with validated speech-derived measures. The structured component included 25 multiple-choice items across five literacy domains, producing 4875 response entries. The speech component consisted of nine items, generating 1755 verified recordings used for feature extraction.
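The counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated in the text; the variable names are illustrative.

```python
# Counts reported in the text.
phase_two_learners = 251
exclusions = {
    "incomplete data or missing identifiers": 11,
    "measurement validity concerns": 13,
    "technical issues": 32,
}
multiple_choice_items = 25
speech_items = 9

excluded = sum(exclusions.values())
analytic_n = phase_two_learners - excluded
structured_entries = analytic_n * multiple_choice_items
speech_recordings = analytic_n * speech_items

print(excluded, analytic_n, structured_entries, speech_recordings)
# prints: 56 195 4875 1755
```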

4.3. Experimental Design and Modeling Framework

The experimental framework evaluated supervised machine learning models for identifying struggling readers using structured digital assessment features.

Feature Engineering and Measurement Verification

To ensure the psychometric quality of the feature space, input variables were refined through a two-stage diagnostic process. First, multicollinearity was addressed by removing items with a Variance Inflation Factor (VIF) exceeding 10. Second, as detailed in Appendix A, item characteristics were evaluated using the Rasch (1PL) model to verify parameter stability. In addition, items with low discrimination (2PL a < 0.3) were excluded to enhance signal clarity.
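For reference, the quantities used in this refinement step have standard textbook forms. They are given here in generic notation, not as the exact parameterization estimated in Appendix A. The symbols are assumptions introduced for illustration: \theta_j is the ability of learner j, b_i and a_i are the difficulty and discrimination of item i, X_{ij} is the scored response, and R_k^2 is the coefficient of determination obtained by regressing feature k on the remaining features.

```latex
\begin{align}
\mathrm{VIF}_k &= \frac{1}{1 - R_k^2} \\
P(X_{ij} = 1 \mid \theta_j) &= \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)} && \text{(Rasch, 1PL)} \\
P(X_{ij} = 1 \mid \theta_j) &= \frac{\exp\bigl(a_i(\theta_j - b_i)\bigr)}{1 + \exp\bigl(a_i(\theta_j - b_i)\bigr)} && \text{(2PL)}
\end{align}
```

Under these definitions, a VIF above 10 corresponds to R_k^2 > 0.9, and a 2PL item with a_i < 0.3 has a nearly flat response curve, so its correctness carries little information about ability.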

Rather than relying on purely data-driven optimization, the feature set was constructed based on theoretical alignment with oral reading fluency (ORF) constructs, including pronunciation accuracy, phoneme-level correspondence, and temporal fluency patterns. The final feature set consisted of 27 variables, combining structured response features (n = 18) and speech-derived indicators (n = 9). Structured features were derived from item-level correctness and domain-level aggregate scores, while speech-derived features included pronunciation accuracy, character-level error rates (CER), and temporal fluency measures such as response duration and reading pace [24].

ASR Processing and Human Verification

Speech data were processed using a Transformer-based ASR model consistent with the Whisper architecture [25]. Transcription quality was evaluated using Character Error Rate (CER), which provides a fine-grained measure of phoneme-level accuracy in early literacy contexts.
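Character Error Rate is conventionally defined as the character-level edit distance between a reference and a hypothesis, divided by the reference length. The sketch below shows that standard definition in plain Python. The function names are illustrative, and the source does not state which implementation was used or whether Korean strings were compared as syllable blocks or as decomposed jamo, a choice that changes the resulting value.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One substituted character out of four reference characters.
example_cer = character_error_rate("abcd", "abxd")  # 0.25
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis contains many inserted characters.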

All speech transcriptions were subjected to HITL verification prior to feature extraction. Two trained evaluators independently reviewed each recording by comparing the child’s oral response with the expected target response, focusing on pronunciation accuracy and item-level correctness. Discrepancies were resolved through discussion to reach a consensus decision. This procedure ensured that speech-derived features reflected validated learner performance.

Model Implementation and Evaluation Strategy

Five supervised classification algorithms—Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and XGBoost—were implemented using standard machine learning libraries (scikit-learn version 1.6.1 and XGBoost version 3.2.0). Model configurations were specified using constrained parameter ranges to ensure stability under the small-sample condition.

To address class imbalance (20% struggling readers), the hybrid SMOTE–Tomek resampling technique was applied within the training folds. Given the relatively small minority class (n = 39), this approach was adopted as a sensitivity-oriented modeling strategy rather than a substitute for external validation.

Model evaluation was conducted using repeated stratified cross-validation (5 folds × 5 repeats) to ensure robust estimation under class imbalance conditions. Classification thresholds were optimized within each training fold using precision–recall curves to maximize the F1 score, enabling a balanced trade-off between recall and precision.
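The threshold-selection step can be illustrated with a small scan over candidate cut-offs on training-fold scores. This is a sketch under stated assumptions and not the authors' code: the function names are hypothetical, the labels and scores are made up, and a library routine for precision-recall curves would normally replace the manual loop.

```python
def f1_at_threshold(y_true, scores, threshold):
    """Precision, recall, and F1 when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_f1_threshold(y_true, scores):
    """Scan each observed score as a candidate threshold and keep the best F1.

    Ties are broken toward the lower threshold, which favors recall.
    """
    best_threshold, best_f1 = None, -1.0
    for candidate in sorted(set(scores)):
        _, _, f1 = f1_at_threshold(y_true, scores, candidate)
        if f1 > best_f1:
            best_threshold, best_f1 = candidate, f1
    return best_threshold, best_f1


# Made-up training-fold labels (1 = struggling reader) and model scores.
train_labels = [0, 0, 0, 0, 1, 0, 1, 1]
train_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]
threshold, f1 = best_f1_threshold(train_labels, train_scores)
```

In this toy example the selected threshold is 0.40: all three positive cases are flagged at the cost of one false positive. The threshold found on the training fold is then applied unchanged to the held-out fold.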

Probability calibration was performed using isotonic regression to improve the reliability of predicted probabilities. All preprocessing steps—including resampling, threshold optimization, and model configuration—were conducted strictly within the training folds to prevent data leakage.
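The fold structure behind this protocol can be sketched in plain Python. The code below is an illustrative stand-in for a library splitter, not the study's implementation: the function names and the seed are assumptions, and only the class sizes (39 and 156) come from the text.

```python
import random


def stratified_folds(labels, n_splits, rng):
    """Assign each index to a fold so that every fold keeps the class ratio."""
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


def repeated_stratified_splits(labels, n_splits=5, n_repeats=5, seed=0):
    """Yield (train_indices, test_indices) for each fold of each repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = stratified_folds(labels, n_splits, rng)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, test


# Class sizes from the text: 39 struggling readers, 156 typically developing.
labels = [1] * 39 + [0] * 156
splits = list(repeated_stratified_splits(labels))
minority_per_test_fold = sorted({sum(labels[i] for i in test) for _, test in splits})
print(len(splits), minority_per_test_fold)
# prints: 25 [7, 8]
```

Any resampling, threshold search, or calibration would be fit on `train` only and applied unchanged to `test`. The output also shows why fold-level estimates are noisy at this sample size: each held-out fold contains only 7 or 8 struggling readers, so a single missed case shifts fold-level recall by roughly 0.13 to 0.14.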

The primary optimization metric was recall (sensitivity), reflecting the screening objective of minimizing false negatives [13]. Precision, F1-score, and PR-AUC were used as complementary metrics to evaluate discriminative performance. Because no external validation dataset was available, the results should be interpreted as internal validation within a single cohort. Detailed model configurations are provided in Appendix B.

5. Results

The empirical findings are reported in four stages aligned with the study’s research questions. First, measurement validity and feature separability are examined to confirm that the digital assessment generates structurally reliable and discriminative input signals for machine learning classification. Second, the stability of machine learning-based screening performance is evaluated under repeated cross-validation (RQ1). Third, the effect of psychometric-informed feature refinement on classification performance is examined (RQ2). Finally, explainable AI analyses are conducted to interpret model predictions in relation to literacy development theories (RQ3).

5.1. Measurement Validation

Before evaluating machine learning classification performance, it was necessary to verify that the digital assessment generated both structurally stable measurement signals and sufficiently separable feature distributions across learner groups. Two complementary analyses were conducted: (1) item-level measurement stability based on IRT diagnostics and (2) group-level feature separability across the five literacy domains.

First, item-level stability was examined. As shown in Figure 7, Item Characteristic Curves (ICCs) derived from the Rasch (1PL) and exploratory 2PL models indicate that item responses follow expected probabilistic patterns across ability levels (see Appendix A for detailed results).

To improve feature quality, a two-stage refinement procedure was applied using Variance Inflation Factor (VIF) screening and IRT-based diagnostics. The criteria and results are summarized in Table 1, illustrating the complementary roles of statistical and psychometric filtering.

In addition to item-level stability, group-level performance differences were examined to determine whether the assessment captures meaningful variation between learners. As shown in Table 2, independent-samples t-tests revealed statistically significant differences (p < 0.01) between struggling and typically developing readers across all five literacy domains, including print recognition, phonological awareness, word reading, vocabulary knowledge, and reading fluency. These results indicate that the assessment successfully differentiates between groups across multiple dimensions of early literacy.

Figure 8 presents these standardized mean differences together with Hedges’ g and corresponding 95% confidence intervals across literacy domains. The magnitude of separability—particularly in vocabulary knowledge (Hedges’ g = 1.10) and reading fluency (g = 0.91)—indicates that these domains provide strong discriminative signals for early screening.
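For readers unfamiliar with the effect-size measure, Hedges' g is Cohen's d multiplied by a small-sample bias correction. The sketch below uses the common approximate correction factor and a standard large-sample variance formula for the interval. The function names are illustrative, the means and standard deviations are invented placeholders rather than study data, and the source does not state which interval method was used for Figure 8.

```python
import math


def hedges_g(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Bias-corrected standardized mean difference between two groups."""
    df = n_1 + n_2 - 2
    pooled_sd = math.sqrt(((n_1 - 1) * sd_1**2 + (n_2 - 1) * sd_2**2) / df)
    cohens_d = (mean_1 - mean_2) / pooled_sd
    correction = 1 - 3 / (4 * df - 1)  # common small-sample approximation
    return cohens_d * correction


def approximate_ci(g, n_1, n_2, z=1.96):
    """Large-sample 95% interval using a standard variance approximation."""
    variance = (n_1 + n_2) / (n_1 * n_2) + g**2 / (2 * (n_1 + n_2))
    half_width = z * math.sqrt(variance)
    return g - half_width, g + half_width


# Group sizes come from the text; means and SDs are invented placeholders.
g = hedges_g(mean_1=4.0, sd_1=1.0, n_1=156, mean_2=3.0, sd_2=1.0, n_2=39)
low, high = approximate_ci(g, 156, 39)
```

With group sizes of 156 and 39, the correction factor is very close to 1, so g and d differ only in the third decimal place. The interval width is driven mainly by the smaller group.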

Taken together, these results indicate that the feature space demonstrates statistical stability and group-level separability within the analyzed dataset. However, these findings should be interpreted with caution. Although statistically significant differences were observed across domains, the effect sizes reported here represent exploratory evidence of domain-level separability under a relatively small sample condition. Accordingly, the results support the adequacy of the feature space for subsequent predictive modeling under controlled conditions but do not, by themselves, establish generalizable group differences or predictive validity beyond the present cohort.

5.2. RQ1—Stability of ML-Based Screening

To evaluate the stability of the screening mechanism, model performance was examined across multiple feature configurations and classifier families.

Impact of Multimodal Feature Integration

As shown in Table 3, incorporating ASR-derived fluency indicators improved recall from 0.82 to 0.85 and increased PR-AUC from 0.38 to 0.47, indicating that speech-derived features provide additional predictive information for identifying at-risk learners.

Stability Across Classifier Families

Across all classifier families evaluated using repeated stratified cross-validation (5 × 5), recall values consistently ranged between 0.84 and 0.87, indicating stable detection performance under class imbalance conditions.

This pattern reflects a recall-oriented screening configuration, where sensitivity is prioritized over specificity. Accordingly, model outputs should be interpreted as preliminary screening signals requiring subsequent human verification rather than as definitive diagnostic classifications.

5.3. RQ2—Impact of Psychometric Refinement

To examine how measurement quality influences predictive performance, two feature-refinement strategies were compared: (1) VIF-based refinement and (2) IRT-informed refinement.

As shown in Table 4, IRT-informed refinement produced modest increases in precision and PR-AUC across classifiers while maintaining comparable recall levels. These results suggest that psychometric filtering contributes to improving the clarity of the feature space within the present dataset. However, these improvements should be interpreted as incremental rather than transformative.

5.4. RQ3—Explainable AI Analysis

To interpret how the screening model generates predictions, explainable AI analyses were conducted at three complementary levels: group-level statistical comparison, model-level sensitivity analysis, and instance-level contribution analysis. These correspond to different analytical perspectives. Specifically, Table 2 reflects group-level separability, Table 5 captures model-level sensitivity based on PR-AUC decrease, and Table 6 and Table 7 represent instance-level contributions based on SHAP values. Differences across these results should therefore be interpreted as reflecting distinct levels of analysis rather than as contradictory findings.

At the model level, domain importance was examined using PR-AUC decrease (Table 5). Across all classifiers, word reading produced the largest decrease in PR-AUC, indicating that it plays a dominant role in defining the global decision boundary of the model. This pattern was consistent across model types, suggesting a stable hierarchy of feature importance at the decision-boundary level.

At the instance level, SHAP-based analyses provided a complementary perspective. Item-level importance (Table 6) showed that predictions are influenced by cumulative contributions across multiple features rather than by isolated item errors.

When aggregated at the domain level (Table 7), vocabulary knowledge exhibited the largest average contribution across individual predictions, indicating a broader and more distributed influence compared to other domains.

Taken together, these results indicate that different literacy domains contribute differently depending on the level of analysis. Word reading primarily shapes model-level decision boundaries, while vocabulary knowledge contributes more diffusely to variation across individual predictions. This distinction reflects differences between global sensitivity patterns and local contribution structures, rather than inconsistency across analytical results.

Finally, individual case patterns suggest that high-risk classifications tend to emerge from combined deficits across multiple domains, particularly vocabulary and fluency-related features. This pattern reflects multidimensional feature aggregation within the model rather than causal relationships between literacy domains.

6. Discussion and Implications

This study examined whether a human-validated multimodal screening workflow could support stable early-risk detection within a digital assessment context.

The findings indicate three key points. First, the proposed workflow achieved stable recall under class imbalance conditions, supporting consistent identification of at-risk learners. Second, psychometric-informed feature refinement contributed to modest improvements in precision, suggesting enhanced signal clarity within the feature space. Third, explainable AI analyses revealed interpretable domain-level patterns, linking model behavior to literacy-related constructs.

Differences across statistical comparisons, model-level sensitivity analyses, and SHAP-based contribution patterns reflect variations in analytical level rather than inconsistency in results. Together, these findings suggest that early literacy screening can be understood as a coordinated workflow integrating measurement design, multimodal data, human validation, and machine learning-based analysis.

6.1. System-Level Validation and Architectural Contribution

Building on the analyses in Section 5, the results indicate that the K-KOBUKI architecture achieved stable recall (≈0.85) within the analyzed cohort. In the context of educational sustainability, this recall-oriented screening configuration may be relevant to the early identification principle of SDG 4, where minimizing false negatives is prioritized [26,27].

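The integration of speech-derived indicators contributed additional predictive information: relative to structured responses alone, adding ASR-derived features raised recall from 0.82 to 0.85, precision from 0.36 to 0.41, and PR-AUC from 0.38 to 0.47 (Table 3). This suggests that multimodal feature combinations capture complementary aspects of reading performance and highlights the value of combining structured responses with speech-based measures within a unified screening workflow.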

An important architectural characteristic is the inclusion of a HITL verification stage, which is intended to support the reliability of speech-derived features by ensuring that input data reflect validated learner performance. At the same time, this reliance on human verification introduces constraints on scalability, as manual intervention is required within the data-processing pipeline.

6.2. Implications for AI-Driven Digital Assessment Engineering

The findings of this study suggest that AI-enabled screening can be conceptualized as a socio-technical workflow in which human judgment and machine-generated signals are structurally integrated, rather than as a fully autonomous system.

The characteristics summarized in Table 8 should be interpreted as design-level properties of the proposed workflow rather than empirically validated system outcomes. In this context, the implications of K-KOBUKI can be understood across four dimensions: accessibility, resilience, agency, and accountability.

While the cloud-based and multimodal design suggests potential improvements in access to screening processes and timeliness of feedback, these implications remain theoretical and have not been empirically evaluated in the present study. Accordingly, future research is required to examine the effectiveness of such workflow-based screening systems under real-world deployment conditions.

6.3. Limitations and Conclusions

Several limitations should be acknowledged. First, the dataset is limited to a localized cohort (N = 195), and model performance may be affected by distributional differences across regions, instructional contexts, and learner populations. In addition, no external validation dataset was available, and the results should therefore be interpreted as internal validation within a single cohort.

Second, the use of the SMOTE–Tomek resampling technique should be understood as an exploratory strategy within a small-sample context rather than as a substitute for real-world minority data.

Third, while full HITL verification improves the reliability of speech-derived features, it limits operational scalability, as manual validation is required within the data-processing pipeline.

Fourth, the relatively low precision observed across models implies a non-negligible rate of false-positive classifications, which may increase teacher workload in practical screening contexts. For example, in a typical classroom of 25 students, a precision of approximately 0.40 means that about 60% of flagged students are false positives; if five to seven students are flagged, this corresponds to around 3–4 false-positive identifications per class. This highlights the necessity of subsequent human verification to ensure practical usability.
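
The classroom example can be reproduced with a short calculation. The sketch below takes the class size (25), recall (≈0.85), and precision (≈0.40) from the text; the at-risk prevalence of 10% is a hypothetical value chosen for illustration, since the passage does not state the prevalence behind the 3–4 figure. The function name is also hypothetical.

```python
def expected_screening_counts(class_size, prevalence, recall, precision):
    """Expected counts per class implied by prevalence, recall, and precision."""
    at_risk = class_size * prevalence
    true_positives = at_risk * recall
    flagged = true_positives / precision        # precision = TP / flagged
    false_positives = flagged - true_positives  # equals TP * (1 - precision) / precision
    return {
        "at_risk": at_risk,
        "true_positives": true_positives,
        "flagged": flagged,
        "false_positives": false_positives,
    }

# prevalence=0.10 is an assumption for illustration, not a value from the study.
counts = expected_screening_counts(
    class_size=25, prevalence=0.10, recall=0.85, precision=0.40
)
print(counts)
```

With these inputs about 5.3 students are flagged per class and about 3.2 of them are false positives. At fixed recall and precision the expected number of false positives scales linearly with prevalence, so the per-class workload depends on the local base rate.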

Fifth, the study was conducted in a Korean-language context, and the structural characteristics of the Korean writing system and phonology may influence both speech-derived features and model behavior, limiting direct generalization to other languages.

Finally, the use of probabilistic risk classification raises ethical concerns, as false-positive identifications may lead to unintended labeling or stigmatization. Careful interpretation and human oversight are therefore required in practical use.

Within these limitations, the findings demonstrate that stable early-risk detection can be achieved through the integration of psychometrically informed features and multimodal modeling. K-KOBUKI provides prototype-level evidence for the feasibility of a human-validated, AI-assisted screening workflow, with its primary contribution lying in the integration of assessment, validation, and predictive analytics within a unified architecture rather than in the validation of a deployable system.

Future research should examine external validity across diverse populations, explore selective HITL strategies to balance reliability and efficiency, and evaluate the practical implementation of workflow-based screening systems in real classroom settings.

Author Contributions

Conceptualization, S.L. and J.H.; methodology, S.L.; software, S.L.; validation, S.L. and J.H.; formal analysis, S.L.; investigation, S.L.; data curation, S.L.; writing—original draft preparation, S.L.; writing—review and editing, J.H.; visualization, S.L.; supervision, J.H.; project administration, J.H.; funding acquisition, J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. 2023R1A2C1006289).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Cheongju National University of Education (IRB No. 1301-202308-HR-0004-02; approved on 31 August 2023).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study and from their legal guardians.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the participating schools and teachers for their cooperation in data collection and validation processes.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Exploratory IRT Item Diagnostics

The following table reports item difficulty parameters estimated using the one-parameter logistic (1PL; Rasch) model and item difficulty and discrimination parameters estimated using the two-parameter logistic (2PL) model for all items included in the early literacy diagnostic tool.

In this study, the Rasch (1PL) model served as the primary measurement framework to ensure parameter stability, fairness, and interpretability given the modest sample size and mixed item formats. Difficulty parameters from the Rasch model were therefore used as the main reference for evaluating item functioning and coverage of the ability continuum.

The 2PL model was applied strictly for exploratory diagnostic purposes. Discrimination parameters and 2PL-based difficulty estimates are reported to enhance transparency and to document the item refinement process. Given the sample size and the presence of mixed-format items, these parameters are not intended for substantive interpretation or for direct comparison of item quality. In particular, extreme or unstable parameter estimates should be interpreted with caution, as they may reflect sample-specific characteristics rather than generalizable item properties.
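
To make the two models concrete, the sketch below evaluates the logistic item response function for parameters copied from Table A1. It assumes the common difficulty-discrimination parameterization, P(correct | θ) = 1 / (1 + exp(−a(θ − b))), with a fixed at 1 for the Rasch model; the estimation software behind Table A1 may use a different scaling, so the values are illustrative only. The function name is hypothetical.

```python
import math

def irt_probability(theta, b, a=1.0):
    """P(correct | theta) under the logistic IRT model; a=1.0 gives the Rasch (1PL) form."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Item q1 at average ability (theta = 0), parameters from Table A1.
q1_rasch = irt_probability(0.0, b=-0.9618)
q1_2pl = irt_probability(0.0, b=-1.0464, a=0.6772)

# Item q2 has a negative 2PL discrimination estimate (a = -0.1299), so the
# modeled probability of success falls as ability rises.
q2_low_ability = irt_probability(-1.0, b=8.6513, a=-0.1299)
q2_high_ability = irt_probability(1.0, b=8.6513, a=-0.1299)

print(round(q1_rasch, 3), round(q1_2pl, 3))
print(round(q2_low_ability, 3), round(q2_high_ability, 3))
```

The q2 case shows why extreme or negative discrimination estimates are treated as diagnostic flags rather than as interpretable item properties: a curve that decreases with ability is more likely to reflect sample-specific instability than a meaningful item characteristic.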

Table A1. Item Difficulty and Exploratory Discrimination Parameters from 1PL and 2PL IRT Models.

Format	Item	1PL Diff. (b)	1PL Diff. Interp.	2PL Diff. (b)	2PL Disc. (a)	2PL Diff. Interp.	2PL Disc. Interp.
Multiple-choice
	q1	−0.9618	Appropriate	−1.0464	0.6772	Appropriate	Appropriate
	q2	−1.6605	Appropriate	8.6513	−0.1299	Hard	Low
	q3	1.5443	Appropriate	3.2990	0.3224	Hard	Low
	q4	−5.8956	Easy	−2.1598	10.0375	Easy	Excellent
	q5	−2.4547	Easy	−3.4370	0.5082	Easy	Appropriate
	q6	−2.5673	Easy	−2.1657	0.9276	Easy	Appropriate
	q7	−2.6877	Easy	−2.6161	0.7730	Easy	Appropriate
	q8	−1.5412	Appropriate	−1.6002	0.7151	Appropriate	Appropriate
	q9	−3.6135	Easy	−2.2728	1.4063	Easy	Appropriate
	q10	−0.1027	Appropriate	−0.2183	0.3302	Appropriate	Low
	q11	−4.3647	Easy	−5.0235	0.6375	Easy	Appropriate
	q12	−2.8138	Easy	−3.7648	0.5353	Easy	Appropriate
	q13	−0.0720	Appropriate	−0.0992	0.5254	Appropriate	Appropriate
	q14	−2.0504	Easy	−1.7280	0.9309	Appropriate	Appropriate
	q15	−1.5022	Appropriate	−1.2283	0.9721	Appropriate	Appropriate
	q16	−1.2052	Appropriate	−0.8554	1.1975	Appropriate	Appropriate
	q17	1.5441	Appropriate	1.3545	0.8900	Appropriate	Appropriate
	q18	−4.5335	Easy	−5.9420	0.5527	Easy	Appropriate
	q19	−2.8138	Easy	−2.4608	0.8839	Easy	Appropriate
	q20	−2.8791	Easy	−1.6811	1.6340	Appropriate	Excellent
	q21	−2.6878	Easy	−1.6216	1.5418	Appropriate	Excellent
	q22	1.0377	Appropriate	0.7725	1.1203	Appropriate	Appropriate
	q23	−0.9625	Appropriate	−0.7520	1.0392	Appropriate	Appropriate
	q24	−1.3139	Appropriate	−0.8868	1.2958	Appropriate	Appropriate
	q25	1.3189	Appropriate	2.1021	0.4397	Hard	Low
Recording items
	s1	−3.2509	Easy	−3.5143	0.6826	Easy	Appropriate
	s2	−2.1458	Easy	−1.2270	1.7282	Appropriate	Excellent
	s3	−1.6607	Appropriate	−1.0272	1.4987	Appropriate	Appropriate
	s4	−3.0949	Easy	−5.1923	0.4196	Easy	Low
	s5	−2.4547	Easy	−2.6388	0.6862	Easy	Appropriate
	s6	0.1431	Appropriate	1.4478	0.0641	Appropriate	Low
	s7	−2.1456	Easy	−4.8084	0.3075	Easy	Low
	s8	−0.6021	Appropriate	−0.5396	0.8623	Appropriate	Appropriate
	s9	−1.6615	Appropriate	−0.9393	1.7823	Appropriate	Excellent

(a) Rasch (1PL) model results; (b) exploratory 2PL model results.

Notes. Difficulty classifications (e.g., easy, appropriate, hard) are based on predefined thresholds used for descriptive interpretation and are referenced primarily to the Rasch (1PL) model. Discrimination-related qualitative labels derived from the 2PL model are provided for exploratory diagnostic reference only and do not imply substantive superiority or inferiority of individual items. Items exhibiting extremely high or low discrimination estimates should be interpreted as candidates for further review, revision, or removal, rather than as definitive indicators of item quality. This appendix is provided to ensure transparency in the psychometric evaluation process and to support replication and future refinement of the diagnostic instrument. Table formatting follows standard APA and SSCI appendix conventions.

Appendix B. Detailed Model Configurations

This appendix summarizes the detailed parameter settings for all models used in the study to ensure reproducibility. Model configurations were specified using constrained parameter ranges under small-sample conditions. Table A2 presents the final parameter settings applied in the modeling framework, including classifier configurations and data preprocessing procedures.

Table A2. Summary of model configurations and preprocessing settings.

Model	Parameter	Setting
Logistic Regression	Regularization	L2
	Solver	lbfgs/liblinear
	Max iterations	5000–8000
	Class weight	Balanced
Decision Tree	Class weight	Balanced
	Max depth	4
	Min samples per leaf	5
Random Forest	Number of estimators	800
	Max depth	6
	Min samples per leaf	2
	Class weight	Balanced subsample
Gradient Boosting	Configuration	Default (scikit-learn)
XGBoost	Number of estimators	600–800
	Learning rate	0.05
	Max depth	3–4
	Subsample	0.9
	Colsample by tree	0.9
	Gamma	0.1
	Lambda (L2)	1.0
	Alpha (L1)	0.0
	Scale pos weight	Negative/positive ratio
Preprocessing	Resampling method	SMOTE–Tomek
	SMOTE k-neighbors	5
	Tomek links	Applied

All preprocessing steps, including resampling and model fitting, were performed within the training folds to prevent data leakage.
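
The fold-wise discipline described above can be sketched in plain Python. To stay dependency-free, the sketch replaces SMOTE–Tomek with simple random oversampling of the minority class; that substitution, the function names, and the 50/145 class split are assumptions for illustration and are not taken from the study. With the imbalanced-learn library, the same effect is usually obtained by placing `SMOTETomek` inside an `imblearn.pipeline.Pipeline` that is passed to the cross-validation routine.

```python
import random

def stratified_folds(y, k, seed=0):
    """Assign each index to one of k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def oversample_minority(train_idx, y, rng):
    """Stand-in for SMOTE-Tomek: duplicate minority training indices until balanced."""
    pos = [i for i in train_idx if y[i] == 1]
    neg = [i for i in train_idx if y[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train_idx + extra

def leakage_free_splits(y, k=5, seed=0):
    """Yield (resampled_train_idx, test_idx); test folds are never resampled."""
    rng = random.Random(seed)
    folds = stratified_folds(y, k, seed)
    for f, test_idx in enumerate(folds):
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield oversample_minority(train_idx, y, rng), test_idx

# Hypothetical class split for a 195-learner cohort (1 = at risk).
y = [1] * 50 + [0] * 145
splits = list(leakage_free_splits(y, k=5))

for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx), sum(y[i] for i in test_idx))
```

Because resampling is applied only after the held-out fold has been set aside, every test fold keeps its original class balance and shares no learners with its resampled training set, which is the property the leakage statement relies on.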

References

1. Cain, K.; Oakhill, J. Profiles of children with specific reading comprehension difficulties. Br. J. Educ. Psychol. 2006, 76, 683–696.
2. Kaderavek, J.N.; Sulzby, E. Narrative production by children with and without specific language impairment. J. Speech Lang. Hear. Res. 2000, 43, 34–49.
3. Gough, P.B.; Tunmer, W.E. Decoding, reading, and reading disability. Remedial Spec. Educ. 1986, 7, 6–10.
4. Piasta, S.B.; Wagner, R.K. Developing early literacy skills: A meta-analysis of alphabet learning and instruction. Read. Res. Q. 2010, 45, 8–38.
5. Ehri, L.C.; Nunes, S.R.; Willows, D.M.; Schuster, B.V.; Yaghoub-Zadeh, Z.; Shanahan, T. Phonemic awareness instruction helps children learn to read: Evidence from the National Reading Panel’s meta-analysis. Read. Res. Q. 2001, 36, 250–287.
6. World Bank; UNESCO; UNICEF; FCDO; USAID; Bill & Melinda Gates Foundation. The State of Global Learning Poverty: 2022 Update (Conference Edition); World Bank: Washington, DC, USA, 2022. Available online: https://thedocs.worldbank.org/en/doc/e52f55322528903b27f1b7e61238e416-0200022022/original/Learning-poverty-report-2022-06-21-final-V7-0-conferenceEdition.pdf (accessed on 13 March 2026).
7. UNESCO. When Schools Shut: New UNESCO Study Exposes Failure to Factor in Gender in COVID-19 Education Responses; UNESCO: Paris, France, 2021. Available online: https://www.unesco.org/en/articles/when-schools-shut-new-unesco-study-exposes-failure-factor-gender-covid-19-education-responses (accessed on 13 March 2026).
8. Kirsten, K.; Greefrath, G.; Emmrich, R. Technology-based versus paper-pencil: Sources of mode effects in large-scale assessment. Int. J. Math. Educ. Sci. Technol. 2026, 1–28.
9. Anghel, E.; Khorramdel, L.; von Davier, M. The use of process data in large-scale assessments: A literature review. Large-Scale Assess. Educ. 2024, 12, 13.
10. Chuang, P.-L.; Yan, X. Language assessment in the era of generative artificial intelligence: Opportunities, challenges, and future directions. System 2025, 134, 103846.
11. Zanellati, A.; Zingaro, S.P.; Gabbrielli, M. Balancing performance and explainability in academic dropout prediction. IEEE Trans. Learn. Technol. 2024, 17, 2086–2099.
12. Cukurova, M.; Miao, F. AI Competency Framework for Teachers; UNESCO Publishing: Paris, France, 2024.
13. Bagdonaite, J.; Dagiene, V. Artificial Intelligence in Primary Education: A Systematic Literature Review 2020–2025. Inform. Educ. 2025, 24, 697–736.
14. Rathnayake, N.; Wijewardane, S. Machine learning-based Direct Normal Irradiance (DNI) forecasting using satellite data for Concentrated Solar Power (CSP) plants with Thermal Energy Storage (TES). Sci. Rep. 2026, 16, 11257.
15. Siemens, G.; Baker, R.S.J.D. Learning Analytics and Educational Data Mining: Towards Communication and Collaboration. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge, Vancouver, Canada, 29 April–2 May 2012; pp. 252–254.
16. Yumus, M.; Stuhr, C.; Meindl, M.; Leuschner, H.; Jungmann, T. EuleApp©: A computerized adaptive assessment tool for early literacy skills. Front. Psychol. 2025, 16, 1522740.
17. Xi, X. Advancing language assessment with AI and ML–Leaning into AI is inevitable, but can theory keep up? Lang. Assess. Q. 2023, 20, 357–376.
18. Baker, R.S.; Hawn, A. Algorithmic bias in education. Int. J. Artif. Intell. Educ. 2022, 32, 1052–1092.
19. Kuhn, M.R.; Schwanenflugel, P.J.; Meisinger, E.B. Aligning theory and assessment of reading fluency: Automaticity, prosody, and definitions of fluency. Read. Res. Q. 2010, 45, 230–251.
20. Bailly, G.; Godde, E.; Piat-Marchand, A.-L.; Bosse, M.-L. Automatic assessment of oral readings of young pupils. Speech Commun. 2022, 138, 67–79.
21. Stewart, A.E.; Keirn, Z.; D’Mello, S.K. Multimodal modeling of collaborative problem-solving facets in triads. User Model. User-Adapt. Interact. 2021, 31, 713–751.
22. Yan, L.; Echeverria, V.; Jin, Y.; Fernandez-Nieto, G.; Zhao, L.; Li, X.; Alfredo, R.; Swiecki, Z.; Gašević, D.; Martinez-Maldonado, R. Evidence-based multimodal learning analytics for feedback and reflection in collaborative learning. Br. J. Educ. Technol. 2024, 55, 1900–1925.
23. Han, J.; Shim, Y. Inclusive design of a tool to screen literacy of lower grade elementary school students. In Proceedings of the IEEE International Conference on E-Business Engineering (ICEBE) 2023, Beijing, China, 17–19 October 2023; pp. 178–180.
24. Tummalapalli, V. Using SMOTE and TOMEK Link Sampling Techniques to Address Imbalanced Data Challenges in the Machine Learning models. IJSAT-Int. J. Sci. Technol. 2025, 16, 1–6.
25. Adem, H. Vocal Biomarkers of Childhood Trauma: A Machine-Learning Approach to Speech Analysis. J. Speech Lang. Hear. Res. 2026, 69, 1955–1976.
26. Lardhi, J.S.; Ismail, A.F. Generative Artificial Intelligence for SDG 4: Enhancing Sustainable Quality Learning. Sustainability 2026, 18, 2498.
27. Tasić, N.; Glušac, D.; Makitan, V.; Jokić, S.; Ljubojev, N.; Vignjević, K. Promoting Sustainable Education Through the Educational Software Scratch: Enhancing Attention Span Among Primary School Students in the Context of Sustainable Development Goal (SDG) 4. Sustainability 2025, 17, 9292.

Figure 1. Structure and question framework of the digital-based early literacy diagnostic app (K-KOBUKI, a Korean early literacy screening system). The figure includes example Korean literacy assessment items. The first item requires learners to read a short sentence and identify the first letter of the third line. The second item assesses recognition of Korean tense consonants. The third item requires learners to select the correct written word corresponding to a spoken pronunciation. The fourth item assesses syllable-count matching, the fifth item assesses semantic association, and the final item assesses oral reading fluency through short-sentence reading.

Figure 2. K-KOBUKI System Architecture. Solid arrows indicate the primary data-processing workflow, whereas dashed arrows represent teacher-review and diagnostic-feedback processes.

Figure 3. Example multiple-choice items (MCIs) from the K-KOBUKI assessment. Korean words in the figure represent family-related and literacy-related vocabulary items.

Figure 4. Example voice-response items (VRIs) and waveform visualization. The Korean sentence in the figure reads, ‘The baby elephant family burst into laughter’.

Figure 5. Scenes of user interaction with the diagnostic app, K-KOBUKI.

Figure 6. Data collection and preprocessing flow of the analytic dataset.

Figure 7. Representative item characteristic curves for selected items under the 1PL (left) and 2PL (right) models.

Figure 8. Combined display of standardized means (z) and effect sizes (Hedges’ g with 95% confidence intervals) across literacy domains.

Table 1. Comparison of item refinement results based on VIF and IRT analyses.

Criterion	No. Removed	Main Domains (Examples)	Decision Rule
VIF-based Removal	7	Print recognition (q3 *), phonological awareness (q8, q10 *), word recognition (q12), reading fluency (q18), vocabulary knowledge (q21, q23)	Multicollinearity (VIF > 10)
IRT-based Exclusion	7	Print recognition (q2, q3 *), phonological awareness (q10 *, s4, s6, s7), vocabulary knowledge (q25)	2PL: a < 0.3 or extreme b

* Indicates items identified in both VIF-based removal and IRT-based exclusion procedures.

Table 2. Means and standard deviations by group and literacy domain (including maximum scores).

Domain	Max	Total M	Total SD	Typical M	Typical SD	Struggling M	Struggling SD	t	p
Print Recognition	4	2.52	0.93	2.63	0.9	2.21	0.83	2.79	0.007
Phonological Awareness	15	10.9	2.52	11.26	2.17	10.0	2.48	2.91	0.005
Word Reading	5	4.02	0.99	4.19	0.83	3.54	1.02	3.70	0.001
Vocabulary Knowledge	7	5.19	1.47	5.54	1.2	4.05	1.43	6.33	<0.001
Reading Fluency	3	1.74	0.9	1.91	0.82	1.13	0.89	4.96	<0.001

Table 3. Impact of Feature Integration on Classification Performance.

Feature Configuration	Recall	Precision	PR-AUC
Structured only	0.82	0.36	0.38
Structured + ASR	0.85	0.41	0.47

Table 4. Cross-validated classification performance under VIF-based and IRT-informed feature refinement.

Model	Precision (VIF)	Precision (IRT)	Recall (VIF)	Recall (IRT)	F1 (VIF)	F1 (IRT)	ROC-AUC (VIF)	ROC-AUC (IRT)	PR-AUC (VIF)	PR-AUC (IRT)
Logistic Regression	0.361	0.372	0.856	0.854	0.494	0.514	0.711	0.728	0.440	0.456
Random Forest	0.403	0.411	0.864	0.861	0.533	0.537	0.767	0.772	0.464	0.474
Gradient Boosting	0.361	0.368	0.870	0.872	0.496	0.507	0.717	0.725	0.419	0.435
XGBoost	0.341	0.358	0.844	0.848	0.473	0.493	0.719	0.731	0.439	0.441
Decision Tree	0.290	0.305	0.901	0.902	0.425	0.454	0.686	0.701	0.357	0.368

Table 5. Summary of domain-level importance (PR-AUC decrease-based).

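Model	Print Recognition (VIF)	Print Recognition (IRT)	Phonological Awareness (VIF)	Phonological Awareness (IRT)	Word Reading (VIF)	Word Reading (IRT)	Vocabulary Knowledge (VIF)	Vocabulary Knowledge (IRT)	Reading Fluency (VIF)	Reading Fluency (IRT)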
Logistic (L2, balanced)	0.066	0.058	0.070	0.065	0.575	0.598	0.125	0.128	0.163	0.151
Random Forest	0.046	0.042	0.052	0.048	0.738	0.721	0.055	0.061	0.110	0.128
Gradient Boosting	0.051	0.047	0.083	0.079	0.630	0.643	0.082	0.086	0.155	0.145
XGBoost	0.039	0.041	0.083	0.076	0.669	0.662	0.085	0.091	0.124	0.130
Decision Tree	0.073	0.061	0.102	0.094	0.496	0.501	0.054	0.057	0.275	0.287

Table 6. Top 5 Item-Level SHAP Contributions.

Item	Mean |SHAP|
q23	0.087
q22	0.045
q24	0.032
s9	0.029
q19	0.028

Table 7. Domain-Level SHAP Contributions.

Domain	Mean |SHAP|
Vocabulary Knowledge (7)	0.151
Reading Fluency (3)	0.074
Phonological Awareness (15)	0.073
Word Reading (5)	0.065
Print Recognition (4)	0.041

Table 8. Comparison of Sustainability-Oriented Design Characteristics: Traditional Manual Screening vs. K-KOBUKI Workflow.

Feature	Traditional Manual Screening	K-KOBUKI (Proposed Workflow)	Sustainability Impact (SDG 4)
Accessibility	Resource-intensive; often limited to urban areas	Cloud-supported digital delivery with potential applicability in resource-constrained contexts	Potential to improve access to screening processes under appropriate infrastructural conditions [14]
Resilience	Delayed feedback (weeks); deficit accumulation	Reduced assessment-to-feedback latency within a semi-automated workflow (subject to human verification processes)	Potential to support earlier identification, although not evaluated as real-time intervention in this study [13]
Agency	High labor burden on expert evaluators	HITL verification maintains teacher involvement in data validation and interpretation	Supports teacher-centered decision-making rather than replacing professional judgment [13]
Accountability	Rater-dependent; subjective	Combination of human verification and explainable model outputs (e.g., SHAP) supporting interpretability	Potential to enhance transparency, while remaining dependent on human validation processes [12]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
