Educational Measurement with Emerging Technologies: A Systematic Review Through Evidentiary Lens on Granularity and Constructing Measures Theory

Yu, Linwei; Wong, Gary K. W.; Zhang, Bingjie; Wang, Feifei

doi:10.3390/educsci16040661

Open Access Systematic Review

Centre for Information Technology in Education, Faculty of Education, University of Hong Kong, Pok Fu Lam, Hong Kong

* Author to whom correspondence should be addressed.

Educ. Sci. 2026, 16(4), 661; https://doi.org/10.3390/educsci16040661

Submission received: 10 March 2026 / Revised: 12 April 2026 / Accepted: 17 April 2026 / Published: 21 April 2026

(This article belongs to the Special Issue The State of the Art and the Future of Education)

Abstract

Emerging technologies (ETs), such as AI and reality techniques, are reshaping educational measurement. However, existing studies remain dispersed and are rarely synthesized in ways that clarify how ETs participate in the evidentiary work of educational measurement. Guided by PRISMA 2020, we systematically reviewed 933 empirical studies published between 2016 and 2025 in formal educational settings. We coded studies by (a) grain size (micro, meso, macro), (b) Constructing Measures Theory building blocks (construct map, item design, outcome space, measurement model), and (c) ET category. Results showed a strong concentration at the micro level (88.88%) and in outcome space and measurement model work (86.80% combined), indicating that ET-enabled innovation has focused primarily on transforming performances into indicators and modeling those indicators for interpretation and decision-making. Learning analytics and educational data mining, machine learning and deep learning, and automated scoring and feedback systems were the dominant ET clusters. These findings point to an uneven development of ET-enabled educational measurement. Included studies also indicate that recurring concerns about transparency, fairness, and governance are linked to the field’s main areas of ET-enabled concentration. We therefore argue for closer alignment among construct claims, evidence, modeling, and intended use, and offer implications for developers, researchers, and education practitioners.

Keywords:

emerging technologies; educational measurement; assessment; systematic review; evidentiary reasoning; constructing measures; granularity; grain size; artificial intelligence; mixed reality

1. Introduction

Over the past decade, emerging technologies (ETs) have become increasingly embedded in the infrastructures and practices of educational measurement. ETs, such as adaptive learning platforms, learning analytics dashboards, AI-enabled scoring tools, and immersive simulations, increasingly shape not only how evidence is collected, but also how performance is represented and acted upon (Bennett, 2015; Holmes et al., 2019; Ifenthaler & Yau, 2020; Romero & Ventura, 2020; F. Wang et al., 2025a). Technology-based measurement can broaden what is observable about learning by capturing more frequent and richer traces than conventional measurement (Bennett, 2015). In turn, ETs can expand what counts as measurement evidence by adding new digital traces, influence how interpretation (i.e., score meaning) is supported through validation and modeling choices, and reshape how uncertainty (e.g., error bands, confidence levels, and model limits) is communicated when results are used for educational inferences and decisions (AERA et al., 2014; Mislevy, 1996).

To shed light on ET-enabled educational measurement, it is important to clarify what “measurement” means. Across the sciences, measurement can be characterized in general terms as “an empirical and informational process, designed on purpose, whose input is an empirical property of an object and that produces information in the form of values of that property” (Mari et al., 2023, p. 25; see also M. Wilson, 2023). Two interpretations emerge for ET-enabled educational measures. First, measurement is intentional design work, requiring choices to be made about what is being represented and how. Second, measurement is informational. The value of a measure depends on the quality of the information it delivers, including how uncertainty and error are understood and conveyed (International Organization for Standardization [ISO], 1995; Mari et al., 2023).

In practice, however, the literature often uses assessment, evaluation, and measurement interchangeably. In this review, rather than treating the three terms as competing, we understand them as marking different emphases in educational decision-making. Assessment refers to the broader process of gathering and interpreting evidence about learning (NRC, 2001; Pellegrino, 2014). Evaluation, by contrast, focuses on judging the quality or effectiveness of programs, tools, or systems, often using evidence as one input among others (AERA et al., 2014). Educational measurement, our focal emphasis, concerns more specifically the development and use of indicators intended to represent student attributes or performance and to support important interpretations and decisions (AERA et al., 2014; Messick, 1989, 1994). This distinction becomes especially important in the ET-enabled educational measurement landscape. By centering educational measurement, the review keeps attention on the evidentiary chain from construct to observation to inference, that is, from what is represented, to what evidence is elicited, to how that evidence is modeled into indicators, and finally to what justifies the intended interpretations and decisions (AERA et al., 2014; Messick, 1989, 1994).

A central commitment in educational measurement is that interpretations and uses should be grounded in a coherent evidentiary argument, meaning a connected chain of measurement claims, evidence, and reasoning that supports score-based inferences and intended uses in educational decision-making (AERA et al., 2014; Messick, 1989, 1994, 1996; Mislevy et al., 2003a, 2003b; NRC, 2001). Here, ETs can strengthen, yet may also weaken, this evidentiary coherence. On the one hand, ETs enable new forms of observation that are difficult to obtain through conventional formats, such as interactive simulations, game-based environments, virtual laboratories, and platform traces, all of which can capture process-rich evidence (Bennett, 2015; V. J. Shute & Ventura, 2013). On the other hand, research on ET-enabled educational measurement in learning analytics and educational data mining is sometimes oriented toward prediction and optimization (Ifenthaler & Yau, 2020; Romero & Ventura, 2020). In some cases, model performance is treated as the primary indicator of success, evaluated through metrics such as classification accuracy or early risk detection, rather than through construct-based arguments about interpretability and validity.

A growing body of reviews has mapped ETs in education and emphasized either technical innovations or ethical concerns, but few have organized the literature around measurement work as such (i.e., by how technologies participate in the construction, interpretation, and use of measurement evidence). For instance, reviews of AI in education often classify studies by techniques and application functions, such as profiling, prediction, intelligent tutoring, assessment, and feedback (Holmes et al., 2019; Zawacki-Richter et al., 2019). They also point to recurring concerns, including limited educator involvement and risks tied to datafication, algorithmic bias, and commercial platform logics (Holmes et al., 2019). Reviews in learning analytics and educational data mining summarize a wide range of analytical methods, including classification, clustering, sequence mining, and social network analysis (Ifenthaler & Yau, 2020; Romero & Ventura, 2020). In parallel, reviews that focus on ethics and transparency highlight concerns about consent, data ownership, surveillance, and harm when models misclassify students (Cerratto Pargman & McGrath, 2021; Hakimi et al., 2021; Slade & Prinsloo, 2013). Research on explainable AI (XAI) in education raises a related point: explanations often emphasize local model behavior or feature importance, rather than supporting construct-aligned interpretations that are useful for educational decisions (Khosravi et al., 2022). Together, these reviews show expanding use of ETs to generate and act on educational data, alongside growing awareness of ethical and interpretive risks. However, they less often trace how ETs reshape the evidentiary argument that grounds score meaning and use. In other words, what remains less visible is how ETs enter the measurement chain itself, from defining what is being measured, to designing tasks that elicit evidence, to representing performance as outcomes, and to modeling and scaling scores for inference. What is still needed, therefore, is a measurement-centered frame that can organize this dispersed literature not only by technology type or concern, but by where ETs intervene in the evidentiary workflow of measurement and how those interventions relate to different educational decision contexts.

Our review responds to this need by bringing together two complementary measurement lenses. The first is Wilson’s Constructing Measures Theory, which treats measurement construction as coordinated evidentiary work across four building blocks: construct map, item design, outcome space, and measurement model (M. Wilson, 2023). The second is granularity, which emphasizes that educational measures operate across interconnected micro, meso, and macro grain sizes, and that evidentiary requirements change with the decisions at stake (M. Wilson, 2018, 2024a, 2024b). Taken together, these lenses allow us to examine where ET-enabled work is concentrated within the measurement process and how those patterns differ across decision contexts, rather than organizing the field by technology labels alone. They also make it possible to show both where ET-enabled work is concentrated across the measurement process and where the literature remains comparatively underdeveloped across decision contexts. This, in turn, provides a clearer basis for understanding why and how these patterns matter for the interpretation and use of measurement evidence in practice. In this way, the review offers a more focused account of how ETs are reshaping educational measurement across the current research landscape.

Accordingly, we conducted a PRISMA 2020-guided systematic review of empirical studies published between 2016 and 2025 in which emerging technologies were implemented in educational measurement within formal educational settings (Page et al., 2021). Applying the four building blocks of Constructing Measures Theory, our review offers a structured picture of where ETs are being used along the measurement construction workflow. This enables researchers and developers to identify which parts of the measurement argument are well developed and which are less examined, with implications for evidence-aligned and interpretable design (Messick, 1994; NRC, 2001; M. Wilson, 2023). Using a granularity lens, the review also examines how these patterns differ across micro, meso, and macro decision contexts. For researchers and developers, it highlights that measurement designs should be grain-appropriate, that is, well matched to the intended decision and evidence context. At the same time, it helps practitioners understand when evidence, assumptions, and uncertainty may not transfer cleanly across contexts and therefore require clear justification and strong governance, especially regarding fairness for learners and groups and accountability for indicator-based decisions (Cerratto Pargman & McGrath, 2021; Hakimi et al., 2021; Slade & Prinsloo, 2013). This systematic review addresses two research questions:

RQ1. How are emerging technology categories distributed under educational measurement decision contexts (across micro, meso, and macro grain sizes)?

RQ2. Within each grain size, how are emerging technology categories positioned in Constructing Measures processes (across the four building blocks of construct map, item design, outcome space, and measurement model)?

2. Theoretical Framework

2.1. Measurement as Evidentiary Process

Our theoretical frame begins from an evidentiary view of measurement. Educational measures are, at their core, structured reasoning from observations to claims under uncertainty (Mislevy, 1996). The National Research Council’s “assessment triangle” provides a minimal account of what such reasoning requires: defensible assessment rests on coordinated models of (a) cognition (i.e., a theory of what competence looks like), (b) observation (i.e., tasks that can elicit relevant evidence), and (c) interpretation (i.e., analytic methods that connect evidence to claims) (NRC, 2001). Evidence-centered design (ECD) makes this logic operational through linked student, task, and evidence models, emphasizing that measurement development is ultimately the construction of an interpretive argument about what scores and indicators mean and how they should be used (Mislevy et al., 2003a, 2003b).

From this standpoint, the methodological challenge in ET-enabled educational measurement is not simply collecting more data. Rather, ETs introduce additional design decisions throughout the evidentiary chain, which can complicate how evidence is defined, modeled, and justified for inference. For example, learning management systems, digital platforms, and sensor- and AI-mediated tools can produce high-frequency traces that can function simultaneously as behavioral records, potential performance evidence, and administrative monitoring signals (e.g., Gašević et al., 2016; Siemens & Baker, 2012). This broader “datafication” changes what becomes visible and actionable in educational settings, often through platform-defined categories and metrics that structure attention and decision routines (Selwyn, 2016; Williamson, 2017). ETs also reshape the representation and modeling of heterogeneous traces into indicator outputs. In learning analytics and educational data mining, predictive models, dashboards, and early warning systems are often built to support monitoring and intervention. As a result, they are typically optimized for consequential outcomes such as course success and dropout (Baker & Siemens, 2014; Ifenthaler & Yau, 2020; Long & Siemens, 2014; Romero & Ventura, 2020). From a measurement perspective, this moves practice away from single test events toward continuous evidence streams, and away from single scores toward profiles, trajectories, classifications, and risk estimates produced within ongoing activities. ETs further shape how results circulate and are acted upon. In learning analytics, dashboards are specifically built to support teachers’ ongoing monitoring, interpretation, and instructional decision-making, effectively coupling measurement with action in day-to-day orchestration, rather than just treating results as a statistical report (Van Leeuwen & Rummel, 2020). At the same time, learning analytics frameworks assume that different stakeholders will use analytics for different purposes. Therefore, the meaning of an indicator is shaped by the local implementation context and governance arrangements (Greller & Drachsler, 2012).

In this review, we treat ET-enabled educational measurement as designed evidentiary work that translates observed performance into information for decision-making. Our focus, accordingly, is to examine how ETs intervene in this evidentiary chain and reshape the construction of educational measures. This intervention is not only technical, but may also shape who or what has greater interpretive influence over how evidence is defined, read, and acted upon (Greller & Drachsler, 2012; Hakimi et al., 2021; Slade & Prinsloo, 2013), particularly when generative AI and other automated systems are used to score or provide feedback on complex student work (Koraishi, 2024; Shermis, 2025). For example, the same digital trace may function as a tentative formative signal in a classroom setting, but as an aggregated indicator for monitoring or reporting when used at broader institutional levels (Hakimi et al., 2021; M. Wilson, 2018, 2024a).

2.2. Four Building Blocks Theory in Constructing Measures

Within this broader evidentiary perspective, M. Wilson’s (2023) Constructing Measures Theory provides a specific framework for building educational measures whose interpretations can be justified through an evidence-based validity argument. Scholars define “construct” as the attribute to be measured (Messick, 1989). A construct may represent, for example, a student’s understanding of concepts, attitudes toward learning, or psychological traits. From this perspective, validity is inherently linked to the quality of the design decisions that connect theory, items, scoring, and modeling into an interpretable measurement system (Messick, 1989, 1994; M. Wilson, 2023).

In Constructing Measures, M. Wilson (2023) formalizes measurement construction as four building blocks: the (1) construct map, (2) item design, (3) outcome space, and (4) measurement model, as shown in Figure 1. The construct map specifies what is being measured and how proficiency is expected to vary or develop, often represented as ordered levels and qualitatively distinct ways of thinking or performing (Black et al., 2011; M. Wilson, 2023). Item design translates the construct map into items that capture evidence about where learners fall along the construct (M. Wilson, 2023; M. Wilson & Sloane, 2000). The outcome space defines how responses are observed, coded, and scored, emphasizing finite, exhaustive, and ordered categories that preserve construct meaning (Masters, 1982; M. Wilson, 2023). The measurement model then links observed outcomes to locations on the construct map, supporting person and item estimates on a common scale and enabling the evaluation of uncertainty and model-data fit (M. Wilson, 2023). Importantly, information from model calibration can feed back into construct refinement, supporting iterative construct modeling over cycles of development and use (M. Wilson, 2023). This Constructing Measures perspective aligns with the logic of ECD: construct maps correspond to student models; item designs mirror task models; and outcome spaces and measurement models reflect evidence models (Mislevy et al., 2003a, 2003b).
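To make the idea of placing persons and items on a common scale concrete, one familiar instance of a measurement model is the dichotomous Rasch model. It is offered here only as an illustration and as an assumption on our part; the framework described above does not prescribe this particular model, and the symbols below (person location \theta_p, item location \delta_i, scored response X_{pi}) are introduced for this sketch.

```latex
% Illustrative only: dichotomous Rasch model.
% \theta_p : location of person p on the construct map (assumed notation)
% \delta_i : location (difficulty) of item i on the same scale (assumed notation)
% X_{pi}   : scored response of person p to item i (1 = success, 0 = otherwise)
\[
  P\left(X_{pi} = 1 \mid \theta_p, \delta_i\right)
  = \frac{\exp\left(\theta_p - \delta_i\right)}{1 + \exp\left(\theta_p - \delta_i\right)}
\]
```

Under this sketch, the probability of success depends only on the difference between the person and item locations, which is what allows both to be read against the same construct map.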

In this systematic review, the four building blocks provide a way to specify where ETs are doing measurement work. ETs can support different blocks in different ways. Technology-rich task/item environments, such as simulations, games, and virtual labs, often support item design by enabling interactive performances that are difficult to capture with conventional formats, which is a common theme in technology-based assessment (Bennett, 2015). Automated scoring and feedback techniques can reshape the outcome space by changing what is treated as score-bearing evidence and how that evidence is summarized into categories or continuous scores. The literature on automated essay evaluation makes clear that the measurement challenge is not just scoring accuracy but construct alignment and controlling construct-irrelevant variance (Shermis & Burstein, 2013). When measurement relies on machine learning, Bayesian networks, or hybrid psychometric–computational approaches, the measurement model block can be shaped by altering the inferential link between evidence and claims (Ulitzsch, 2022). More broadly, these examples show that ETs do not enter measurement at a single point, but may reshape different parts of the process depending on how evidence is generated, scored, or modeled.

2.3. Measurement Granularity

The second component of our framework is granularity. Here, granularity means the level at which a measure is used and decisions are made. M. Wilson (2018, 2024a, 2024b) argues that educational measurement operates across three interconnected grain sizes: macro, meso, and micro. The selection of grain size should depend on the evidentiary requirements of the decision context in which the resulting evidence is intended to be used. Macro-level measurements (e.g., large-scale testing programs, international studies, and system indicators) are administered relatively infrequently and inform policy and accountability decisions. Meso-level measurements operate at the level of courses, programs, or schools, supporting monitoring, improvement, and progression decisions. Micro-level measurements are embedded in day-to-day classroom practice, including curriculum-embedded tasks and real-time formative checks intended to inform instruction. While ECD cognition–observation–interpretation logic applies at each level, the design constraints differ (Mislevy et al., 2003a, 2003b). Macro-level measurement prioritizes comparability and fairness across populations; micro-level measurement prioritizes timeliness and instructional usefulness; and meso-level measurement negotiates between local relevance and aggregability (NRC, 2001; Pellegrino, 2014; M. Wilson, 2018). Lehrer (2021) highlights this tension through the notion of “accountable assessment”, reflecting that educational measurement should provide actionable information for improving classroom instruction while also meeting the psychometric requirements for broader reporting.

Granularity is especially important in ET-enabled educational measurement because ETs can shift where measurement happens and make it easy for the same digital evidence or indicator to be reused across decision contexts. For example, a clickstream trace captured for micro-level feedback can later be reinterpreted as a course-level engagement indicator, an institutional risk flag, or a system-level performance metric. This transferability across grain sizes can create new measurement opportunities, but it may also raise validity risks, because evidence may be used in ways that the original interpretation does not support. Scholars of learning analytics have repeatedly emphasized that analytic outputs are not neutral reflections of learning, but are rather constructed within institutional goals, platform logics, and design choices about what is logged and how it is interpreted (Gašević et al., 2016; Wise & Shaffer, 2015). Relatedly, ethical studies in learning analytics have argued that risks intensify when indicators are reused beyond their original purpose or context, or when learners have limited ability to understand, contest, or contextualize inferences (Hakimi et al., 2021; Pardo & Siemens, 2014; Slade & Prinsloo, 2013). These concerns are granularity-sensitive because acceptable uncertainty differs across levels of use. For instance, at the micro level, an indicator may serve as a tentative signal to guide attention or next steps. However, when it is institutionalized at the meso level for routine monitoring or labeling, the same indicator may be treated as a stable fact even though uncertainty remains. If it is then tied to macro-level consequences, such as accountability or resource allocation, small errors and context mismatches can compound, leading to distorted interpretations and unfair uses. Granularity is therefore not just a way of sorting measurement settings. Rather, it is essential for ET-enabled educational measurement because it specifies the decision context in which ET-based evidence is meant to carry meaning and the conditions under which that evidence can be used appropriately and responsibly.

In our present review, these two lenses are used not only as conceptual background, but as the basis of the analytic coding framework. Constructing Measures Theory guides the identification of where ETs intervene in measurement work across the four building blocks, while granularity guides the coding of the decision context in which that work is intended to function. Together, these two lenses allow the review to classify studies according to both the location of ET-enabled work within the measurement process and the level at which the resulting evidence is expected to inform educational decisions. This analytic use of the framework provides the basis for the coding procedures described in the Methods section.

3. Methods

This systematic review was conducted and reported in accordance with the PRISMA 2020 statement to ensure transparency (Page et al., 2021).

3.1. Database Search Strategy

To identify studies at the intersection of emerging technologies, education, and measurement, we searched four databases spanning education, social sciences, and technology and engineering: Web of Science (Social Sciences Citation Index–Education & Educational Research), EBSCOhost Education Full Text, EBSCOhost ERIC, and IEEE Xplore. This combination reflects common practice in ETs in education reviews that integrate education-focused and technical sources (Sosa Neira et al., 2017; W. Xu & Ouyang, 2022; Zawacki-Richter et al., 2019; K. Zhang & Aslan, 2021). Our search window covered about ten years, from 1 January 2016 to 29 October 2025.

Our search strategy was structured around three conceptual facets: (1) educational context, to locate studies situated in formal education; (2) measurement, to ensure substantive engagement with assessment/measurement; and (3) emerging technology. We applied a generic Boolean search using “AND” to connect the three facets and “OR” to connect the terms within each facet. Table 1 summarizes the searching facets and terms.

The emerging technology facet was designed to support broad yet conceptually coherent approaches in a field where ET-terminology evolves rapidly and is not always used consistently across studies. We constructed this facet by synthesizing recurring ET terms and categories reported across prior reviews of ETs in education and aligning them with the scope of the present review (Leavy et al., 2023; Ngoc et al., 2020; Sosa Neira et al., 2017; Sembey et al., 2024). Accordingly, our search string combined an umbrella ET term (“emerg technolog”) with ET categories that appeared recurrently in earlier review literature, including artificial intelligence, learning analytics, reality technologies, intelligent tutoring, and adaptive learning. This provided a broad, review-grounded entry point into a heterogeneous literature, while full-text screening and post-retrieval coding were used to refine the classification of implemented ET categories among the retrieved studies.
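The facet logic described above (OR within a facet, AND across facets) can be sketched in a few lines of Python. This is a minimal illustration rather than the search string actually submitted to the databases: the educational context and measurement terms are hypothetical placeholders, the ET terms are the category labels named in the text, and the function name build_query is our own.

```python
# Minimal sketch of a three-facet Boolean query.
# NOTE: the terms for the first two facets are hypothetical placeholders;
# the actual search terms are those summarized in Table 1.
facets = {
    "educational_context": ["school", "higher education"],  # hypothetical
    "measurement": ["assessment", "measurement"],            # hypothetical
    "emerging_technology": [                                  # labels named in the text
        "artificial intelligence",
        "learning analytics",
        "intelligent tutoring",
        "adaptive learning",
    ],
}


def build_query(facets):
    """Join terms within a facet with OR, then join the facets with AND."""
    groups = []
    for terms in facets.values():
        quoted = [f'"{term}"' for term in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


query = build_query(facets)
print(query)
```

Database-specific syntax (field tags, truncation symbols, phrase handling) differs across platforms, so a string built this way would still need adapting for each database.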

3.2. Eligibility Criteria

Our eligibility criteria were designed to remain broadly comparable to prior ETs in education reviews while reflecting our specific interest in ET-enabled educational measurement work in formal educational settings. Figure 2 summarizes our inclusion and exclusion criteria.

First, we restricted inclusion to formal educational settings and to English-language publications with accessible full texts (Leavy et al., 2023; Ngoc et al., 2020; Sosa Neira et al., 2017). Second, consistent with prior reviews that focus on primary empirical research (e.g., Leavy et al., 2023; Ngoc et al., 2020; Sembey et al., 2024; Zawacki-Richter et al., 2019), we included only studies reporting primary data and analyses. Non-empirical publications were excluded from the dataset because our synthesis targets empirical evidence patterns. Third, as this review aims to explore ET-enabled educational measurement, we included only studies in which ETs played a measurement or assessment role. We therefore excluded studies in which ETs were used only for instruction, content delivery, or engagement, as well as studies focused primarily on assessing learning about ETs (e.g., AI literacy assessment). This distinguishes our review from broader ET syntheses centered on teaching, adoption, or general learning outcomes (e.g., Delgado et al., 2015; Ngoc et al., 2020; K. Zhang & Aslan, 2021).

Moreover, we required studies to make measurement quality visible in some form. This requirement aligns with validity as an evidentiary judgment (Messick, 1989, 1994) and with the Standards’ emphasis on evidence supporting score use (AERA et al., 2014). We did not impose a single numerical threshold because reporting conventions vary across quantitative, qualitative, and mixed-methods ET studies. Instead, studies were included only when they reported at least one explicit basis on which the quality of the resulting measure or assessment output could be judged. In this review, such a basis referred to any clearly reported information that allowed the reader to evaluate the stability, accuracy, interpretability, or trustworthiness of the output. This could include, for example, reliability indices, inter-rater or human–machine agreement evidence, model fit statistics, validity-oriented justification, rubric development, or clearly described trustworthiness procedures, given that reporting in the technology-focused literature is often uneven or under-specified (Lai & Bower, 2019, 2020). This criterion functioned as a minimum visibility threshold: it was used to determine whether a study made the quality of its measurement output explicitly judgeable in some form. It was not intended to rank the strength of the reported evidence, require a uniform psychometric reporting standard across methodological traditions, or serve as a full appraisal of overall study quality (O’Brien et al., 2014). Studies were excluded when ETs generated assessment data or outputs but the paper provided no explicit basis for judging the quality, interpretability, or trustworthiness of those outputs.
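As an illustration of what one such explicit basis can look like, the sketch below computes Cohen's kappa for human and machine scores on the same responses. It is a hedged example only: kappa is one of several possible agreement indices, the review did not require it specifically, and the score vectors are invented for demonstration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical scores to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")
    n = len(rater_a)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both raters use one identical category throughout.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Invented scores on a 1-4 rubric, for illustration only.
human_scores = [1, 2, 3, 3, 2, 1, 4, 4]
machine_scores = [1, 2, 3, 2, 2, 1, 4, 3]
kappa = cohens_kappa(human_scores, machine_scores)
print(round(kappa, 3))
```

For ordered rubric scores, a weighted variant of kappa is often preferred because it credits near misses; the unweighted form is used here only to keep the sketch short.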

3.3. Study Selection and Screening

Searches returned 4479 records from Web of Science, 4944 from EBSCOhost (Education Full Text n = 3585; ERIC n = 2053; duplicates automatically removed by system), and 467 from IEEE Xplore, yielding 9890 records before deduplication. After removing duplicates, 8644 unique records remained. Following the PRISMA 2020 guidance (Page et al., 2021) and screening procedures used in prior reviews (e.g., Sembey et al., 2024; W. Xu & Ouyang, 2022), we conducted a two-stage screening process: (1) title–abstract screening and (2) full-text screening. First, titles and abstracts were screened against the eligibility criteria. Because titles and abstracts often underreport measurement intent and decision context, we used a conservative rule: when information was insufficient to determine whether the ET played a measurement role or whether the setting was formal education, the record was retained for full-text screening rather than excluded at the title–abstract stage. This step reduced the pool from 8644 to 1124 records. We then retrieved and reviewed full texts for the 1124 remaining records. Each paper was assessed against all inclusion criteria. After full-text screening, 933 studies met all criteria and were included in the synthesis. Two authors independently screened and coded the records using the predefined criteria and codebook. Disagreements at both the screening and coding stages were first discussed between the two researchers. Agreement was reached when both researchers accepted the same final classification. When needed, a third researcher was involved to help adjudicate unresolved cases. The final dataset used for analysis was therefore consensus-coded. Ongoing discussion was also used to refine the codebook and support consistency in its application across the corpus. This procedure was informed by established work on team-based qualitative analysis and codebook development, which treats analytic rigor as being supported through explicit coding frameworks, independent initial coding, iterative refinement, and structured team discussion that makes analytic decisions transparent and contestable within the research team (MacQueen et al., 1998; O’Brien et al., 2014; O’Connor & Joffe, 2020). Figure 3 presents the PRISMA 2020 flow diagram summarizing identification, screening, exclusions, and final inclusion.
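The record counts reported above can be checked with simple arithmetic. The sketch below uses only the figures stated in the text; the three derived values are our own subtraction and assume that every record leaving the pool at a stage was excluded at that stage, so they may not match any finer breakdown shown in the flow diagram.

```python
# Counts reported in the text.
records_by_source = {
    "Web of Science": 4479,
    "EBSCOhost": 4944,  # Education Full Text and ERIC, after system deduplication
    "IEEE Xplore": 467,
}
identified = sum(records_by_source.values())
unique_records = 8644        # after removing duplicates
full_text_assessed = 1124    # retained after title-abstract screening
included = 933               # met all criteria after full-text screening

# Derived values (our arithmetic, not reported directly in the text).
duplicates_removed = identified - unique_records
excluded_at_title_abstract = unique_records - full_text_assessed
excluded_at_full_text = full_text_assessed - included

# The reported pre-deduplication total should equal the sum of the sources.
assert identified == 9890

print(duplicates_removed, excluded_at_title_abstract, excluded_at_full_text)
```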

3.4. Data Extraction and Coding

For each included study, we extracted information at two levels: (a) demographic information that allows us to contextualize ET-enabled educational measurement in the current research landscape, and (b) analytic codes that classify each study by what measurement work ETs support and the decision context in which that work is used, using the four building blocks and grain size scheme.

We recorded each article’s publication year, geographic region, formal educational level(s), subject area, sample size, and overall research method (quantitative, qualitative, or mixed-methods). The demographic information was used to characterize how ET-enabled educational measurement is distributed across regions, educational levels, subject areas, sample sizes, and methods.

Furthermore, analytic coding was conducted by applying our theoretical framework:

(1) Grain size. Each study was coded as micro, meso, and/or macro based on its primary decision context, that is, the level at which the output was intended to be acted upon. A study could receive more than one grain-size code when needed (e.g., both micro and meso).

(2) Building block(s). We coded which measurement building block(s) were reshaped by the implemented ET(s). When an ET supported multiple building blocks, we assigned multiple codes to the implemented system.

(3) Emerging technology category. To identify and categorize ETs, we applied a structured qualitative content analysis to each included full-text article and any additional materials available with the source publication that described the implemented ETs. Qualitative content analysis is well-suited for systematically classifying heterogeneous study features using a transparent codebook and documented decision rules (Hsieh & Shannon, 2005; Lombard et al., 2002). Our unit of ET analysis was the individual study. We coded an ET category only when the technology was implemented in the study’s measurement workflow, meaning that ETs contributed directly to evidence capture, scoring, feedback, measurement-relevant modeling, or the production of decision-relevant indicators. ETs mentioned only as background, motivation, or future directions were not coded. This coding approach reduces ambiguity in a rapidly evolving ET landscape where terminology is inconsistent across studies (Krippendorff, 2004; Lombard et al., 2002). ET coding proceeded through an iterative sequence. We first extracted all ET-relevant descriptions, producing an initial set of candidate analytical bases (Hsieh & Shannon, 2005). We then identified widely used variants and synonyms to improve comparability across studies (Lombard et al., 2002). Finally, we constructed and refined the ET codebook with operational definitions, as shown in Table 2. This included clarifying distinctions that are often unclear in the literature. Two coders independently applied the codebook, and disagreements were resolved through discussion until full agreement was reached (O’Connor & Joffe, 2020). Multiple codes could be assigned to a single study when needed.
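The three-part, multi-label coding scheme can be represented as a simple record per study and coder. The following is a minimal sketch under stated assumptions: the class name `CodedStudy`, the helper `disagreements`, the study identifier, and the example codes are all hypothetical, and ET category labels are left unvalidated because their operational definitions sit in the codebook (Table 2).

```python
from dataclasses import dataclass, field

# Code sets named in the text. ET category labels are left open here because
# their operational definitions live in the codebook (Table 2).
GRAIN_SIZES = {"micro", "meso", "macro"}
BUILDING_BLOCKS = {"construct map", "item design", "outcome space", "measurement model"}


@dataclass
class CodedStudy:
    """One coder's record for one study (hypothetical structure)."""
    study_id: str
    grain_sizes: set = field(default_factory=set)
    building_blocks: set = field(default_factory=set)
    et_categories: set = field(default_factory=set)

    def validate(self):
        # Multiple codes are allowed, but each must come from the scheme.
        if not self.grain_sizes <= GRAIN_SIZES:
            raise ValueError(f"unknown grain size in {self.grain_sizes}")
        if not self.building_blocks <= BUILDING_BLOCKS:
            raise ValueError(f"unknown building block in {self.building_blocks}")


def disagreements(a, b):
    """Return the fields on which two coders' records differ."""
    fields = ("grain_sizes", "building_blocks", "et_categories")
    return [f for f in fields if getattr(a, f) != getattr(b, f)]


# Hypothetical example: the coders agree on everything except grain size.
coder_a = CodedStudy("S001", {"micro"}, {"outcome space", "measurement model"}, {"LA & EDM"})
coder_b = CodedStudy("S001", {"micro", "meso"}, {"outcome space", "measurement model"}, {"LA & EDM"})
coder_a.validate()
coder_b.validate()
print(disagreements(coder_a, coder_b))
```

Fields returned by `disagreements` would be the ones taken to discussion; an empty list corresponds to a record that is already consensus-coded.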

4. Results and Discussion

4.1. Descriptive Overview

4.1.1. Descriptive Overview of Demographic Information

The included studies showed an overall increasing trend over time. This upward pattern was consistent with prior reviews, which reported growing attention to emerging learning environments, learning analytics, and technology-enhanced assessment from the late 2010s to the early 2020s (Martin et al., 2020; Sembey et al., 2024; K. Zhang et al., 2023). In this review, the observed trend was further characterized by two distinct growth stages, as shown in Figure 4. In the first stage (2016–2022), annual publication counts increased gradually with periodic plateaus: publications rose from 2016 (n = 33) to 2018 (n = 41), increased to 2019 (n = 56), remained stable in 2020 (n = 56), and then rose to a second plateau in 2021–2022 (n = 79 each year). In the second stage (2023–2025), the studies expanded rapidly, with yearly output increasing from 2023 (n = 124) to 2024 (n = 179) and peaking in 2025 (n = 246). This surge might have been driven by the combined effects of large-scale digitalization policies, the expansion of online and blended learning, and the impact of the COVID-19 pandemic on educational measurement practices (D. Chen et al., 2023; Heil & Ifenthaler, 2023; Retnawati et al., 2024).

As shown in Figure 5, geographical information was coded as the context in which empirical data were collected or the study was conducted. Across the 933 included records, the geographical distribution was highly concentrated. Excluding records with unspecified locations (NS; n = 111; 11.9%), the ten most frequently represented regions were the United States (n = 179; 21.8% of location-specified records), Chinese mainland (n = 135; 16.4%), Taiwan (n = 46; 5.6%), Australia (n = 35; 4.3%), Spain (n = 31; 3.8%), Türkiye (n = 30; 3.6%), the United Kingdom (n = 27; 3.3%), the Netherlands (n = 26; 3.2%), Germany (n = 23; 2.8%), and South Korea (n = 17; 2.1%). Collectively, these top ten regions contributed 58.8% of all records, indicating that the empirical evidence was disproportionately concentrated in a small set of regions. The remaining location-specified studies formed a long tail spanning 63 additional single-region categories (n = 212; 22.7% of all records), none contributing more than 15 studies, alongside a small number of continent-level categories (e.g., “Europe,” n = 10; “North America,” n = 2) and explicitly multi-region studies (i.e., involving two or more regions; “Multi-region,” n = 49). This concentration aligns with prior reviews and likely reflects cross-regional differences in implementation conditions for ET-enabled educational measurement. In particular, these differences may be related to (a) unequal access to digital devices and stable Internet connectivity, (b) constraints in the basic infrastructure and access required for standardized measurement, and (c) variation in teachers’ measurement and feedback competencies, which create corresponding needs for professional development (D. Chen et al., 2023; Picasso, 2024).
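The regional percentages use two denominators: individual regions are expressed as shares of location-specified records, whereas the combined top-ten share is expressed against all records. The following minimal sketch recomputes both from the reported counts; the helper `pct` is illustrative, and the top-ten share of location-specified records is derived here rather than reported in the text.

```python
total_records = 933
unspecified = 111
located = total_records - unspecified  # location-specified records

# Counts for the ten most frequently represented regions, as reported.
top_ten = {
    "United States": 179, "Chinese mainland": 135, "Taiwan": 46,
    "Australia": 35, "Spain": 31, "Türkiye": 30, "United Kingdom": 27,
    "Netherlands": 26, "Germany": 23, "South Korea": 17,
}


def pct(n, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * n / denominator, 1)


top_ten_total = sum(top_ten.values())
share_of_all = pct(top_ten_total, total_records)  # denominator: all records
share_of_located = pct(top_ten_total, located)    # denominator: located records only

print(top_ten_total, share_of_all, share_of_located)
print({region: pct(n, located) for region, n in top_ten.items()})
```

Keeping the denominator explicit avoids reading the 58.8% figure as a share of location-specified records, for which the corresponding value is higher.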

Figure 6 maps the flows linking research methods, education levels, subject areas, and sample sizes across the included studies. The flow structure indicates that studies most frequently follow a pathway from quantitative designs to higher education, after which the flows branch into multiple subject areas. In higher education, the strongest links connect to natural sciences and social sciences, with additional but thinner connections to engineering and technology and the humanities; links to medical and health sciences and multidisciplinary categories are less pronounced. Across subjects, flows terminate predominantly in small (1–100) and medium (101–1000) samples, whereas large (1001–10,000) and very large (>10,000) samples appear as minor endpoints. Flow widths indicate the number of studies.

Methodological choices in the included studies were predominantly quantitative (n = 714, 76.5%), followed by mixed-methods approaches (n = 201, 21.5%), whereas qualitative-only designs were rare (n = 18, 1.9%). This pattern aligns with the prior review of learning analytics interventions in learning management systems, which reported that studies presenting learner-generated data relied on quantitative data and that qualitative evidence was not fully employed in intervention functionality. Such concentration on quantitative methods may reflect the centrality of behavioral log-data analytics in ET-enabled educational measurement research, where techniques such as clustering or sequential analysis are used to derive granular indicators and to support predictive modeling (Z. Pan et al., 2024). At the same time, the substantial share of mixed-methods studies suggests increasing use of qualitative evidence to improve the understanding of quantitative patterns in context and to strengthen inference through triangulation (Bonami et al., 2020; Nguyen et al., 2020).

By educational level, the distribution of the included studies was highly skewed toward higher education (n = 628, 67.3%) and K–12 schooling (n = 232, 24.9%). In contrast, adult or professional education (n = 25, 2.7%) and early childhood education (n = 1, 0.1%) were rarely represented. An additional 24 studies spanned multiple educational levels (2.6%), and 23 did not report the educational level (2.5%). Taken together, the distribution indicates that current empirical research on ET-enabled educational measurement practices is predominantly situated in higher education, which echoes patterns noted in prior systematic reviews of technology-enhanced and online assessment in higher education (Heil & Ifenthaler, 2023).

Subject areas were coded into eight broad disciplinary categories. Natural sciences (n = 226, 24.2%) and social sciences (n = 210, 22.5%) were the most prevalent, together accounting for 46.7% of the included studies. Engineering and technology (n = 159, 17.0%) and humanities (n = 155, 16.6%) were the next most common categories. Together, these four fields accounted for 80.4% of all records. Medical and health sciences contributed a further 94 studies (10.1%). A modest yet meaningful share was classified as multidisciplinary (n = 69, 7.4%), indicating research spanning more than one domain. Subject information was not reported for 19 studies (2.0%). Agricultural sciences were rarely represented (n = 1, 0.1%), suggesting minimal coverage within the reviewed studies. This pattern is consistent with prior syntheses, indicating that research on ET-enabled approaches tends to concentrate in science-related and computing-related disciplines rather than being evenly distributed across subject areas (Martin et al., 2020). At the same time, prior research on technology-enhanced measurement in health professions education and on large-scale testing shows that these contexts are supported by established psychometric infrastructures, including systematic item development, item banking, and ongoing quality monitoring. These infrastructures can reduce the additional effort required for item innovation and support the scaling of measurement-oriented applications, which may partly contribute to the higher share of measurement-focused studies observed in the natural and health sciences in the included studies (Fuller et al., 2022; Wools et al., 2019).

Sample sizes were reported in 847 studies (90.8%), which together covered 7,298,004 participants based on the sum of reported sample sizes; the remaining 86 studies (9.2%) did not provide this information. The distribution of sample sizes was highly right-skewed, reaching as high as 1,100,000, with a median of 105 and an interquartile range (IQR) of 50–374. Most studies relied on small samples (1–100; n = 415, 49.0%) or medium samples (101–1000; n = 298, 35.2%), together representing 84.2% of studies with reported sample sizes. Large samples (1001–10,000) were less common (n = 81, 9.6%), and very large samples (>10,000) accounted for 52 studies (6.1%).
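The four sample-size bands and the summary statistics can be sketched as follows. The band boundaries are those stated above; the function name `size_band` and the five sample sizes are hypothetical placeholders, not data from the review. Quartile conventions differ across software, so an IQR computed this way may not match one obtained with another method.

```python
import statistics


def size_band(n):
    """Assign a reported sample size to one of the four bands used in the text."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    if n <= 100:
        return "small"
    if n <= 1000:
        return "medium"
    if n <= 10_000:
        return "large"
    return "very large"


# Hypothetical sample sizes for illustration only.
sample_sizes = [24, 50, 105, 374, 1200]

bands = [size_band(n) for n in sample_sizes]
median = statistics.median(sample_sizes)
# statistics.quantiles uses the "exclusive" method by default.
q1, _, q3 = statistics.quantiles(sample_sizes, n=4)

print(bands)
print(f"median = {median}, IQR = {q1}-{q3}")
```

With a strongly right-skewed distribution such as the one reported, the median and IQR describe typical study size better than the mean, which would be dominated by the few very large samples.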

Beyond this, we also examined the included corpus through a complementary bibliometric lens by conducting a focused bibliometric mapping using VOSviewer (https://www.vosviewer.com/) (van Eck & Waltman, 2010; Donthu et al., 2021). As shown in Figure 7, a keyword co-occurrence network was generated from the included studies to visualize topical concentrations.

From Figure 7, the strongest topical core is organized around artificial intelligence, learning analytics, educational technology, and intelligent tutoring systems, indicating that these themes form the main conceptual center of the corpus. Around this core, several secondary groupings are also visible. One strand connects natural language processing with computational modeling, testing, and training. Another groups language- and scoring-related terms, including computer software, grading, scoring rubrics, writing evaluation, and English, suggesting a visible cluster of work on language-based assessment and automated scoring. A further grouping links teaching methods, mathematics instruction, program effectiveness, middle school students, and science instruction, indicating that part of the corpus is organized around pedagogical and subject-specific applications rather than ET categories alone. The filtered map preserves this central structure while reducing weaker peripheral links, suggesting that these concentrations are relatively stable. Together, these bibliometric patterns show that the included literature is concentrated around AI-, analytics-, and tutoring-related work, with smaller but visible strands in natural language processing, scoring, language assessment, and discipline-specific instructional applications.
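The raw quantity underlying a keyword co-occurrence map is the number of records in which two keywords appear together. The following minimal sketch shows only that counting step; it is not VOSviewer's procedure, which additionally normalizes link strengths and clusters the network. The function name and the three keyword lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(keyword_lists):
    """Count how many records contain each unordered pair of keywords."""
    counts = Counter()
    for keywords in keyword_lists:
        # Normalize case and drop within-record repeats before pairing.
        unique = sorted({k.lower() for k in keywords})
        counts.update(combinations(unique, 2))
    return counts


# Hypothetical keyword lists for three records.
records = [
    ["artificial intelligence", "learning analytics"],
    ["Learning Analytics", "artificial intelligence", "intelligent tutoring systems"],
    ["natural language processing", "scoring rubrics"],
]

links = cooccurrence(records)
for pair, weight in links.most_common():
    print(pair, weight)
```

Pairs with higher counts correspond to thicker links in a map of this kind, which is why frequently co-listed terms end up forming the visual core.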

4.1.2. Descriptive Overview of Analytical Coding

Table 3 summarizes how ET-enabled educational measurement studies are distributed by grain size, building block, and emerging technology category. Most activity occurred at the micro level, where 839 tags (88.88%) were coded as micro, compared with 95 (10.06%) at the meso-level and 10 (1.06%) at the macro-level. In other words, ET-enabled educational measurement was most often developed and enacted close to learners and instructors rather than in institutional or system-level infrastructures.

Across the four building blocks, ET applications primarily focused on outcome space (n = 649; 41.39%) and measurement model (n = 712; 45.41%), which together accounted for 86.80%. This indicates that the literature has focused primarily on translating heterogeneous traces and artifacts into indicators and modeling those outputs for monitoring, prediction, or decision-making. Item design was a secondary site of activity (n = 157; 10.01%), whereas construct map work remained comparatively rare (n = 50; 3.19%).

By emerging technology category, learning analytics (LA) & educational data mining (EDM) formed the largest cluster (n = 499; 25.96%), followed by machine learning (ML) & deep learning (DL) (n = 336; 17.48%) and automated scoring & feedback systems (n = 274; 14.26%). Generative AI/LLM systems (n = 159; 8.27%), speech technologies (n = 141; 7.34%), and multimodal and sensor-based measurement (n = 130; 6.76%) also represented visible strands, whereas computer-adaptive assessment and test delivery remained marginal (n = 9; 0.47%).
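Because a study could receive multiple codes, the percentages in this overview are shares of tags rather than shares of the 933 studies. The following minimal sketch recomputes the grain-size and building-block shares from the reported tag counts; the helper name `shares` is illustrative.

```python
def shares(counts, digits=2):
    """Convert tag counts to percentage shares of the tag total."""
    total = sum(counts.values())
    return {k: round(100 * v / total, digits) for k, v in counts.items()}


# Tag counts as reported in the text.
grain_tags = {"micro": 839, "meso": 95, "macro": 10}
block_tags = {
    "construct map": 50,
    "item design": 157,
    "outcome space": 649,
    "measurement model": 712,
}

grain_shares = shares(grain_tags)
block_shares = shares(block_tags)
combined = round(
    100 * (block_tags["outcome space"] + block_tags["measurement model"])
    / sum(block_tags.values()),
    2,
)

print(sum(grain_tags.values()), grain_shares)
print(sum(block_tags.values()), block_shares, combined)
```

The tag totals (944 for grain size and 1568 for building blocks) exceed the number of included studies, which is the expected consequence of multi-label coding.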

To complement the descriptive counts for the ET overview, Figure 8 visualizes how ET categories co-occurred across the included studies, based on the technique from Marquart et al. (2021). Consistent with Table 3, the figure shows a dense, highly connected core structure centered on three categories: learning analytics & educational data mining, machine learning & deep learning, and automated scoring & feedback systems, suggesting that these ETs function as the dominant infrastructure operationalized in measurement workflows. Several categories, such as speech technologies, natural language processing (NLP), knowledge tracing & learner modeling, adaptive systems & intelligent tutoring systems, and computer vision, appeared closer to the figure core, indicating that they were more often applied in measurement designs together with the above-mentioned dominant ETs rather than as stand-alone technologies. In contrast, immersive/simulation & extended reality and multimodal & sensor-based measurement appeared farther from the figure core, indicating that they were used more in specialized measurement settings.

4.2. Emerging Technologies Across Grain Sizes and Building Blocks

To move beyond the overview, we examined how ETs were distributed across decision contexts and where, within each context, they most often intervened in the four building blocks of Constructing Measures. Table 4 summarizes these patterns and reports both diversity (entropy, H) and concentration (Herfindahl–Hirschman Index, HHI) by grain size and building block.
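Both indices are computed from the shares of ET tags within a cell. Entropy is H = -Σ p_i log p_i and the Herfindahl–Hirschman Index is HHI = Σ p_i², where p_i is the share of tags in ET category i. The following is a minimal sketch; the base of the logarithm is not stated in this section, so base 2 is an assumption here, and the tag counts are hypothetical.

```python
import math


def tag_shares(counts):
    """Shares of non-empty categories (zero counts contribute nothing)."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]


def shannon_entropy(counts):
    """Diversity: H = -sum(p * log2(p)). The base-2 log is an assumption."""
    return sum(-p * math.log2(p) for p in tag_shares(counts))


def hhi(counts):
    """Concentration: HHI = sum(p ** 2), ranging from 1/K (even) to 1 (one category)."""
    return sum(p * p for p in tag_shares(counts))


# Hypothetical tag counts across four ET categories.
even = [5, 5, 5, 5]
skewed = [17, 1, 1, 1]

for label, counts in (("even", even), ("skewed", skewed)):
    print(label, round(shannon_entropy(counts), 3), round(hhi(counts), 3))
```

The two indices move in opposite directions: spreading tags evenly over more categories raises H and lowers HHI, whereas dominance by a few categories lowers H and raises HHI.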

4.2.1. Emerging Technologies Across Grain Sizes

Across grain sizes, the literature was highly micro-centered, which means that ET-enabled educational measurement is most frequently situated in classroom- and individual-facing settings (see Table 3 and Table 4). This distribution may reflect where ET systems are most often functionalized. Many ET applications sit in local learning environments that generate dense evidence streams (such as platform traces, interaction logs, or dialogue data), enable rapid model iteration, and support fine-grained evidence capture (Abdi et al., 2019; Charleer et al., 2017; Kim et al., 2016; J. Lin et al., 2022; Niknam & Thulasiraman, 2020; M. Wilson et al., 2016).

From the overview in Table 3, we found that ET activities were concentrated on the outcome space and measurement model, with much less attention to item design and especially the construct map. As Table 4 further shows, this overall pattern was remarkably stable across grain sizes. At the micro-level, outcome space (40.8%) and measurement model (45.4%) together accounted for 86.2% of tags. Meso-level studies showed a similar dominance of outcome space (42.9%) and measurement model (48.4%). The macro-level profile was the most uneven, where measurement model alone accounted for 63.6% of macro-level tags, and construct map activity was absent. In other words, the studies most often placed ET innovation where evidence is converted into usable outputs, like scores, indicators, predictions, profiles, or feedback, rather than where constructs are explicitly specified, refined, or made developmentally interpretable.

The ET diversity indices in Table 4 reinforce this interpretation. The micro level showed high diversity and low concentration across blocks (H ≈ 3.06–3.32; HHI ≈ 0.108–0.132), suggesting a broad mix of approaches rather than a single dominant pathway. The meso and macro levels, in contrast, were more concentrated (Meso HHI ≈ 0.184–0.249; Macro HHI ≈ 0.276–0.361). Substantively, this concentration is consistent with how meso and macro measurement infrastructures tend to be organized. They rely more heavily on ETs that scale cleanly through stable data analytic workflows (e.g., LA & EDM, ML & DL, automated scoring & feedback systems operating over institutional or large-scale datasets), while more context-specific infrastructures appear less often because they are harder to standardize across building blocks, settings, and decision contexts (Bulathwela et al., 2022; Cheung et al., 2024; Costa-Mendes et al., 2021; Makhlouf & Mine, 2020).

Figure 9 visualizes the same pattern from the ET category perspective. Micro-level studies drew on the widest range of ET approaches. LA & EDM, automated scoring & feedback systems, and ML & DL were the most prominent (e.g., Charleer et al., 2017; Kim et al., 2016). At the meso-level, LA & EDM and ML & DL remained prominent, but the overall count was limited. This concentration is consistent with meso contexts, where measures are often designed to support course- and program-level decisions through relatively standardized infrastructures (e.g., Gašević et al., 2016; Gelan et al., 2018; Han & Ellis, 2020a, 2020b; Kivimäki et al., 2019). Macro-level studies remained comparatively rare in our review dataset and were dominated by modeling-oriented uses that prioritize scalable inference and transportable indicators over construct or item redesign (e.g., Bulathwela et al., 2022; Cheung et al., 2024; Makhlouf & Mine, 2020).

4.2.2. Emerging Technologies Across Building Blocks Within Each Grain Size

Micro-level practices. Figure 10 shows that micro-level studies clustered strongly in outcome space (1238 tags; 40.8%) and measurement model (1378 tags; 45.4%), which together accounted for 86.2% of all micro-level tags (see Table 4). Item design appeared as a meaningful but secondary site of activity (313 tags; 10.3%), whereas engagement with the construct map remained comparatively limited (107 tags; 3.5%). This is more than a distribution pattern. It suggests that, in the current research, ETs most often add value by expanding observable evidence and converting that evidence into score-like outputs that can support interpretation and action, rather than by re-specifying constructs. At the same time, the micro-level was the most heterogeneous: entropy remained high across blocks (H ≈ 3.06–3.32) and concentration was consistently low (HHI ≈ 0.108–0.132) (see Table 4). In other words, at the micro-level, ET use is relatively evenly distributed rather than concentrated in a few dominant pathways, with studies spread across multiple ET categories.

Construct map. Although the construct map was the smallest cluster at the micro grain size, it presented the highest entropy (H = 3.319) and the lowest concentration (HHI = 0.108) among building blocks (see Figure 10 and Table 4). Within this block, LA & EDM most often support construct map development by treating fine-grained traces as potential indicators for constructs that are hard to observe, such as engagement, self-regulation, and collaboration. For instance, clickstream patterns can be used as trace-based evidence of engagement (Vale & Falloon, 2024). Micro-level studies often apply trace indicators as construct-relevant evidence or explanation when indicators behave coherently across tasks, episodes, or contexts (e.g., Borchers et al., 2025; Y. Chen et al., 2025; Joseph & Abraham, 2023; Suraworachet et al., 2025; M. Wilson et al., 2016).

Automated scoring & feedback systems can also represent construct meaning, but often indirectly, through rubric dimensions or feedback targets. In these cases, the construct map is partially carried by the scoring design, even when it is not developed as a proficiency structure (Link et al., 2024; V. Shute et al., 2019; V. J. Shute et al., 2021; Udeozor et al., 2023). In more construct-grounded micro-level studies, item design and evidence modeling are paired with psychometric or probabilistic inference, so the construct claim is stated upfront and directly linked to the evidence being produced (V. Shute et al., 2019; V. J. Shute et al., 2021; M. Wilson et al., 2016).

Finally, ML & DL approaches sometimes operate as construct discovery tools at the micro-level. These studies use methods such as clustering, topic modeling, or representation learning to detect latent behavioral dimensions in trace data, and treat the resulting structure as initial support for potential constructs (e.g., Y. Chen et al., 2025; Joseph & Abraham, 2023; Vignesh et al., 2025; M. Wilson et al., 2016; Wu et al., 2025).

Item design. Item design formed a mid-size micro-level cluster and was slightly more concentrated (HHI = 0.132) than construct map studies. The leading ET categories in this block were automated scoring & feedback systems (74 tags), followed by LA & EDM (59 tags) and immersive/simulation & extended reality (42 tags), as shown in Table 4.

A dominant pathway comes from automated scoring & feedback systems, especially in writing and other constructed-response items. Here, prompt design is closely aligned with what the system can score and what feedback it can deliver, so item design and scoring or feedback design are developed together (e.g., Abdelhalim & Alsehibany, 2025; Butterfuss et al., 2022; C. J. Lin & Hwang, 2025; V. Shute et al., 2019; Steif et al., 2016; Tadjer et al., 2022). For example, Writing Pal illustrates this logic by operationalizing writing as scorable dimensions, using NLP-based evaluation for diagnosis, and considering the learners’ characteristics in targeted strategy instruction, so the item/task is inseparable from the feedback logic that follows it (Butterfuss et al., 2022).

LA & EDM contribute to item design less by generating new items in a traditional sense than by redefining the “item” as a set of interactions that can be mined for diagnostic information, such as annotation moves, stepwise problem solving, hint usage, revision cycles, and timing patterns (e.g., Cabı & Türkoğlu, 2025; Çakiroğlu & Kahyar, 2022; S. Y. Chen & Yeh, 2017; T. C. Yang et al., 2018). This shifts item targets toward process evidence (e.g., strategy use or regulation) rather than final correctness alone. In scalable settings, data requirements can also shape item formats. For example, option-level tracing in EDM typically requires multiple-choice items. As a result, choosing an option-level tracing method constrains item design, because only certain formats produce the option-level data structure the analysis needs (H. Li et al., 2025a).

Immersive/simulation & extended reality enters micro-level item design through simulation-based performance tasks, where an item becomes a scenario with affordances, constraints, and observable actions (Al Hakim et al., 2022; Cohen et al., 2024a, 2024b; Minty et al., 2022; D. Rodríguez et al., 2025). In these designs, evidence quality depends on implementation practicalities, such as interface design, simulation fidelity, and logging. For instance, mixed reality teacher education assessments rely on standardized simulated interactions and rubric-scored performance, and the task is pre-defined by what the simulation makes observable and recordable (Cohen et al., 2024a, 2024b).

Outcome space. Outcome space, the second largest micro-level building block, remained quite diverse (H = 3.070; HHI = 0.130). As shown in Table 4 and Figure 10, its leading ET categories include LA & EDM (296 ET tags), automated scoring & feedback systems (206 ET tags), and ML & DL (176 ET tags). Together, these patterns signal a shift from single-point test scores toward outcomes that are multidimensional, time-sensitive, and anchored in process evidence.

Across the included studies, LA & EDM construct outcome spaces by transforming digital traces (such as access logs, interaction records, and step-level traces) into observable indicators for formative monitoring and near-term instructional decisions (e.g., AlJarrah et al., 2018; Bulut et al., 2025; Cabı & Türkoğlu, 2025; Çakiroğlu & Kahyar, 2022; Dannath et al., 2025; Harindranathan & Folkestad, 2019; Wen & Song, 2021; Y. Yang et al., 2020). From a measurement perspective, these outcome spaces frequently prioritize actionability, taking forms such as learning profiles and trajectories. Charleer et al. (2017) illustrate this point by showing how institutional grade data are reorganized into dashboard summaries that structure adviser–student interpretation and inform practical action.

Automated scoring & feedback systems further expand micro-level outcome spaces, by making routine assessment feasible for written explanations, short answers, code, and dialogic responses (Dosaru et al., 2025; Firetto et al., 2025; Hirschi et al., 2025; L. Zhang et al., 2020; Hershberger et al., 2024; Kortemeyer et al., 2024; Cai et al., 2018). ML & DL approaches similarly define outcome space through classification labels, predicted probabilities, or learned representations that summarize performance states (e.g., Chejara et al., 2023; Herodotou et al., 2019; Lee et al., 2022; J. J. Lin, 2025; Prasad et al., 2024). In these lines of work, outcomes are typically model-defined indicators used to support further monitoring, prediction, or early identification decisions.

Measurement model. Measurement model was the largest micro-level building block, as reported in Table 4. ET applications here are dominated by LA & EDM (347 tags), followed by ML & DL (259 tags) and automated scoring & feedback systems (180 tags). Substantively, this pattern signals a shift in micro-level measurement: modeling is increasingly used to infer risk estimates, latent states, trajectories, or actionable classifications from dense trace-based evidence.

First, LA & EDM studies tend to sit close to practice. These studies build predictive or diagnostic models from interaction traces to generate risk estimates, engagement profiles, or process indicators that feed monitoring and feedback loops (e.g., Charleer et al., 2017; J. Chen, 2024; Frick et al., 2022; Horikoshi et al., 2016; Kokoç, 2019; Villagrán et al., 2024). Their main advantage is usability and timeliness. From a measurement standpoint, however, predictive accuracy does not, by itself, ensure interpretability, generalizability, or fairness for consequential use. This logic is evident in how some systems let log-based model summaries define the outcome. For example, Frick et al. (2022) linked temporal trace patterns to mastery-transition probabilities, while J. Chen (2024) converted completion and submission traces into SRL indicators that drove targeted nudges.
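The point that predictive accuracy alone does not establish fairness can be illustrated with a subgroup breakdown. The following is a minimal sketch on entirely hypothetical data: the group labels, outcomes, and predictions are invented for illustration, and accuracy is used only as one simple example of a metric that can be disaggregated.

```python
from collections import defaultdict


def accuracy(pairs):
    """Share of (observed, predicted) pairs that match."""
    pairs = list(pairs)
    return sum(y == y_hat for y, y_hat in pairs) / len(pairs)


def accuracy_by_group(records):
    """Accuracy computed separately for each group label."""
    groups = defaultdict(list)
    for group, y, y_hat in records:
        groups[group].append((y, y_hat))
    return {g: accuracy(p) for g, p in groups.items()}


# Hypothetical (group, observed at-risk flag, predicted at-risk flag) records.
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),
    ("B", 1, 0), ("B", 0, 0),
]

overall = accuracy((y, y_hat) for _, y, y_hat in records)
by_group = accuracy_by_group(records)

print(f"overall: {overall:.3f}")
print(by_group)
```

In this toy case the aggregate figure is high while the smaller group is served much less well, which is the kind of pattern an aggregate accuracy report would conceal.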

Second, ML & DL studies, such as Muresan et al. (2025), Doleck et al. (2020), Nahar et al. (2021), Ong et al. (2022), L. Pan et al. (2020), Sekeroglu et al. (2019), D. Wang et al. (2024), often treat learning as a dynamic latent state inferred from behavioral or multimodal evidence. Typical measurement outputs include predicted performance probabilities, state classifications, embeddings, or time-updated proficiency estimates, reported alongside standard evaluation routines such as cross-validation and robustness checks.

Third, automated scoring & feedback systems in recent studies operationalize a closed-loop measurement logic, linking evidence capture to automated scoring and then to feedback that generates new evidence (Drinkwater Gregg et al., 2025; Flodén, 2025; Forkan et al., 2023; Link et al., 2024; Shabara et al., 2024; Steinbach et al., 2025; J. Wilson et al., 2021). These studies often linked model outputs to measurement claims by showing agreement with human raters, internal consistency of related scales, or coherence with related survey measures. They then used the resulting scores right away to support formative feedback and iterative refinement.
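Agreement between automated and human scores is often summarized with a chance-corrected statistic. The following minimal sketch implements Cohen's kappa as one common choice; the studies cited above do not necessarily use this statistic, and the two score lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical scores."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and the same length")
    n = len(rater_a)
    # Observed agreement: share of responses given the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters scored independently at their own base rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical rubric scores for four responses.
human = [1, 1, 2, 2]
machine = [1, 1, 2, 1]

print(round(cohens_kappa(human, machine), 3))
```

Raw percent agreement for these lists is 0.75, but kappa is lower because half of that agreement would be expected by chance given each rater's score distribution.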

Meso-level practices. As shown in Figure 11, ET engagement at the meso-level concentrated heavily on outcome space and measurement model, with comparatively little attention to the construct map and only limited innovation in item design (see also Table 4). In particular, measurement model accounted for 48.4% (139 ET tags) and outcome space for 42.9% (123 tags), while item design (6.3%, n = 18) and construct map (2.4%, n = 7) were relatively low. The diversity and concentration indices reinforce this interpretation. Within meso-level studies, the outcome space block showed relatively higher concentration (HHI = 0.249; H = 2.340). In contrast, measurement model was somewhat less concentrated (HHI = 0.184; H = 2.529), as shown in Table 4.

Construct map. At the meso-level grain size, construct map studies are rare. Only three studies contributed to the seven ET tags in this block, including de Barros Camargo and Hernández Fernández (2024), Divjak et al. (2023), and Wei et al. (2025). Rather than using a shared approach to specifying progress variables or developmental levels, these studies operationalized construct maps in three different ways. Divjak et al. (2023) operationalized a construct map as a learning-outcomes (LOs) map, where tasks (down to sub-tasks or rubric criteria) are aligned to LOs, and LO-level profiles, supplemented by Moodle traces, are used to diagnose misalignment and refine design. Wei et al. (2025), instead, identified the construct using an LLM-assisted clustering pipeline to generate knowledge components from item text. de Barros Camargo and Hernández Fernández (2024) adopted yet another approach, where construct meaning was anchored in a survey factor structure (AI and deep learning-related dimensions) and was supplemented by exploratory neurophysiological evidence (portable EEG) under AI-supported learning conditions. Across these studies, meso-level constructs were less often specified as developmental progression maps. Instead, they more often appeared as pragmatic labels using modeling and classification steps, such as mapping decisions, clustering outputs, or factor-analytic structures (de Barros Camargo & Hernández Fernández, 2024; Divjak et al., 2023; Wei et al., 2025).

Item design. Meso-level item design showed up more often than construct mapping, but the total occurrence was limited (n = 18; H = 2.503; HHI = 0.188). As per Table 4, the dominant ETs here are ML & DL, followed by automated scoring & feedback systems, and speech technologies. In this sense, item design at the meso-level often means engineering elicitation conditions that generate analyzable evidence, for example, structured assignments, platform-mediated tasks, and oral assessment prompts. Across the five ML & DL related studies, these technologies shape not only modeling, but also the design of evidence opportunities, for instance, by supporting difficulty targeting, comparability, and scalable selection (Lim et al., 2023; Liu et al., 2022; Lokkila et al., 2023; Pereira et al., 2022; J. Xu et al., 2022). Recommender and prediction approaches can help assemble assignment or exam sets by identifying equivalent items or estimating difficulty from stems and learner traces (Pereira et al., 2022; J. Xu et al., 2022), while log-based error-pattern analyses can inform course-level task demands (Lokkila et al., 2023). Analytics-enabled design frameworks similarly position profiling and predictive insights as inputs to how assessments are structured and iterated across a course (Lim et al., 2023). Automated scoring and feedback infrastructures further influence meso-level item design. Examples include workflow-driven standardization toward rubric-friendly evidence formats, constrained by what can be scored, returned, and monitored quickly (Hansel et al., 2024), and GenAI-supported test construction that shifts attention toward what is easiest to generate and score (Ma et al., 2025). Language technologies can also inform prompt construction by linking linguistic features of questions to performance, making wording itself a design consideration (Ontong, 2024).

Outcome space. Outcome space is the core site where meso-level studies translate raw activity into score-like variables (n = 123 tags; H = 2.340; HHI = 0.249). LA & EDM dominated this block with 64 tags.

In these studies, the outcome space is most commonly constructed from platform traces, submissions, and participation signals, which are transformed into indicators for monitoring and intervention. These indicators include predicted course success or dropout risk labels (Gardner & Brooks, 2018; Gupta & Sabitha, 2019), dashboard-ready summaries of engagement and progress for advising and program improvement (Henríquez et al., 2024; Hilliger et al., 2022), and course-level process indicators derived from log data (Divjak et al., 2023; Jovanović et al., 2021).

Across domains, the underlying logic is consistent: define a finite set of computable indicators, summarize them at a decision-relevant timescale, and treat them as actionable evidence rather than as directly interpretable scale scores (Cukurova et al., 2022; Gašević et al., 2016; Lim et al., 2023). ML & DL accounted for 30 ET tags; in these studies, outcomes often take the form of performance classifications or predictions, sometimes building on feature sets engineered from logs or artifacts (e.g., Bertolini et al., 2021; Divasón et al., 2023; Talamás-Carvajal et al., 2025). Compared with LA & EDM approaches, which may emphasize interpretable dashboards, ML & DL approaches tend to emphasize how reliably these outcomes can be predicted. GenAI/LLM systems appear less often but show a clear meso-level outcome space role. Across these seven studies (Cohn et al., 2025; DiSabito et al., 2025; H. Li et al., 2025b; R. Li et al., 2025a, 2025b; Vilanti et al., 2025; Yiğiter & Boduroğlu, 2025), LLM-mediated processing transformed open-ended artifacts, such as essays, lab reports, dialogues, or handwritten responses, into structured rubric scores, analytic categories, or feedback-relevant labels that function as outcomes for review and action.

Measurement model. Meso-level measurement models are where the included studies most explicitly claim inference about learners, courses, or programs. This was also the largest meso-level building block (n = 139; H = 2.529; HHI = 0.184). Figure 11 and Table 4 identify LA & EDM (50 tags) and ML & DL (46 tags) as the leading ETs, followed by speech technologies with 10 tags. Many meso-level studies fit decision-oriented models built for early identification, monitoring, and intervention, rather than for construct-referenced scaling in a psychometric sense (e.g., Bilal et al., 2025; Cukurova et al., 2022; Divasón et al., 2023; Gupta & Sabitha, 2019; Premlatha et al., 2016). In practice, these models often function as institutional intelligence that supports advising, course redesign, or targeted support, and model quality is typically argued through predictive performance and operational usefulness (Gašević et al., 2016; Gupta & Sabitha, 2019). Within ML & DL studies, meso-level measurement outputs are commonly represented as class labels or predicted states (e.g., standing categories, proficiency levels, or risk strata), with emphasis on generalization and feature learning (Divasón et al., 2023; Gray & Perkins, 2019; Guevara-Flores et al., 2023; Guo, 2025; Mangaroska et al., 2021; Peng et al., 2023; Talamás-Carvajal et al., 2025). Speech and other multimodal technologies contribute by extracting performance-relevant signals from classroom interaction and discourse, enabling models to scale evidence from communication-rich settings (Alfredo et al., 2024; Hou et al., 2025). At the same time, several studies used automated scoring systems and NLP methods to convert open-ended artifacts into computable indicators, supporting peer assessment trustworthiness, text-based dashboards, concept-coverage measures, syllabus-based competence classification, or structured analyses of dialogue traces (e.g., Darvishi et al., 2022; Kong et al., 2025; Pereira et al., 2022; Vilanti et al., 2025; T. C. Yang, 2023).

Macro-level practices. The macro grain size, where evidence is used for system-wide reporting, accountability, large-scale monitoring, or institutional decision-making, was small overall, with only 11 studies in total, and ET activity was highly unevenly distributed across the four building blocks (Figure 12). Consistent with that macro logic, most ET activity concentrated on measurement model (14 tags; 63.6% within macro), followed by outcome space (6 tags; 27.3%) and a very limited layer of item design (2 tags; 9.1%); construct map was absent. The macro-level profile was also more concentrated and less diverse than the micro grain size. For instance, outcome space showed low diversity (H = 1.459) and high concentration (HHI = 0.361), and item design was essentially split across two ET categories only (H = 1.000; HHI = 0.500).
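The two indices can be checked against the item design case. The sketch below assumes that H is Shannon entropy computed with base-2 logarithms and that HHI is the sum of squared category shares. These definitions are not restated in this section, but they are consistent with the reported H = 1.000 and HHI = 0.500 for a block split evenly across two ET categories. The function names and the `[1, 1]` count vector (two tags, one per category) are illustrative assumptions.

```python
import math


def shannon_entropy_bits(counts):
    """Shannon diversity H in bits (base-2 logarithm), ignoring empty categories."""
    total = sum(counts)
    shares = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in shares)


def hhi(counts):
    """Herfindahl-Hirschman index: the sum of squared category shares."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


# Assumed tag counts: two macro-level item design tags, one per ET category.
item_design_macro = [1, 1]
print(round(shannon_entropy_bits(item_design_macro), 3))
print(round(hhi(item_design_macro), 3))
```

Under these assumptions, lower H and higher HHI both indicate that tags pile up in fewer ET categories; a block with all tags in a single category would give H = 0 and HHI = 1.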

Construct map. The absence of ET tags in the construct map building block at the macro-level is itself a finding. Across the macro-level studies in this review, ETs were not used to re-define system-level progress variables or to create new construct maps; rather, constructs were typically treated as policy-anchored, while ETs were applied to scale inferences after evidence had been collected.

Item design. Macro-level ET engagement in item design is extremely rare. In our review, only the study by Klang et al. (2023) contributed to this block. The ET activity here was concentrated in GenAI/LLM systems and automated scoring and processing. This study reflects a macro logic in which the goal is less to expand the item universe than to standardize complex response formats so that they remain scorable and governable at scale. Put differently, at the macro-level, new item formats become feasible when they are supported by robust quality control and reliable subsequent analytical processes.

Outcome space. Macro-level outcome space activity was modest but concentrated (n = 6; HHI = 0.361). The dominant pattern is not a shift toward richly multimodal evidence, but toward operationally portable indicators, in which only a smaller set of evidence types is aggregated and compared across contexts. In large-scale assessment settings, ETs often translate heterogeneous inputs into profile or label outcomes that can circulate across decision layers. For example, explainable profiling and clustering on PISA 2022 data operationalize student heterogeneity into interpretable groupings that function as decision-ready outcome categories (Alvarez-Garcia et al., 2024). Similarly, ML and XAI studies define outcomes as classification labels or risk-related states, such as academic resilience labels derived from PISA indices and plausible values, which support automation and cross-context reporting (Cheung et al., 2024). Outside international large-scale assessment contexts, macro-level outcome spaces often take the form of standardized labels and probability estimates, such as engagement states used for population-level personalization and recommendation (Bulathwela et al., 2022), and performance or quality categories derived from trace-like datasets (Lan, 2025). When LA appears at the macro-level, it is typically built on platform log data and translated into dashboard-ready metrics for system monitoring (Hershkovitz et al., 2022).

Measurement model. As shown in Figure 12, measurement model work was dominant at 63.6% within the macro-level, with moderate diversity (H = 2.039) but still substantial concentration (HHI = 0.276). In practical terms, this means that macro-level ET measurement is most often about building models that turn evidence into decision-relevant quantities (predictions, classifications, latent state estimates, or overall performance indicators), rather than about re-designing constructs or items. One cluster centered on ML- & DL-driven macro-level modeling, where the model targets scale-relevant outcomes such as system monitoring, performance prediction, and risk classification (Bulathwela et al., 2022; Cheung et al., 2024; Costa-Mendes et al., 2021; Lan, 2025). A second cluster centered on platform-embedded analytics infrastructures, such as early-warning architectures, institutional dashboards, and countrywide monitoring initiatives, where measurement assumptions (i.e., what counts as progress, engagement, or risk) are often implicit but operationally prioritized (Hao et al., 2022; Macarini et al., 2020; Makhlouf & Mine, 2020). Several macro-level studies were also hybrid in practice, combining LA or EDM data streams with ML & DL modeling choices (e.g., predictive layers embedded within dashboard or monitoring architectures) (Bulathwela et al., 2022; Costa-Mendes et al., 2021; Hao et al., 2022).

5. Critical Reflections on Emerging Technologies-Enabled Educational Measurement

Building on the empirical patterns reported above, this section synthesizes the major interpretive and design issues raised across the included studies. In particular, the strong concentration of ET-enabled work at the micro level, the dominance of outcome space and measurement model activity, and the central role of AI-, analytics-, and automated scoring-related approaches help explain why certain risks recur across the corpus. These issues are not minor technical details; they shape what scores represent, how confidently results can be generalized across settings, and how consequences may differ across learners and groups when outputs are used in practice. Drawing on limitations, robustness checks, and reflective discussions reported in the corpus, we identified four recurring themes: construct meaning and validity drift, robustness and generalizability, fairness and transparency, and privacy and governance.

5.1. Construct Meaning and Validity Drift

A first concern in ET-enabled educational measurement is the risk of construct meaning and validity drift. In other words, although a system may model what it captures well, this does not mean that the resulting scores represent the construct intended to be measured. This risk becomes especially salient in a literature that is heavily concentrated in outcome space and measurement model work, where ETs are often used to transform traces and artifacts into indicators before constructs are fully specified or refined. In trace-based work, the authors often acknowledge that behavioral logs are hybrid artifacts of administrative records, interface byproducts, and (only sometimes) evidence of learning. When these distinctions are unclear, the construct may drift (i.e., what the measure represents gradually shifts) toward what the platform captures most consistently and reliably (Pardo et al., 2016; Rohani et al., 2024; Tempelaar, 2017; A. Wilson et al., 2017).

A group of studies illustrates this drift with respect to time scale. Several note that indicators built from long time windows can be useful, but they may not capture short-lived states or fine-grained turning points in learning progress. Resulting scores may look stable precisely because the time scale has been flattened (Nasir et al., 2021; Saint et al., 2020; Tempelaar, 2017). Others highlight the related issue that when the time grain is too coarse, the measure cannot preserve the event structure needed for meaningful inference (Olsen et al., 2017, 2020).
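The flattening effect can be made concrete with a toy example. The sketch below uses a hypothetical sequence of attempt outcomes (not data from any included study) with a turning point midway, and compares one long-window summary with several short-window summaries; all names are illustrative.

```python
from statistics import mean

# Hypothetical per-attempt correctness (1 = correct) with a turning point midway.
attempts = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]


def windowed_means(values, window):
    """Mean of each consecutive, non-overlapping window of the given size."""
    return [mean(values[i:i + window]) for i in range(0, len(values), window)]


long_window = windowed_means(attempts, 12)   # one summary over the whole period
short_windows = windowed_means(attempts, 3)  # four finer-grained summaries

print(long_window)
print(short_windows)
```

The single long-window value sits at the midpoint and carries no trace of when the change occurred, whereas the short windows preserve the shift from consistently incorrect to consistently correct attempts.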

Text-based and LLM-mediated measurement raises similar issues, with an added mechanism: the model’s evidence can be prompt- and specification-contingent. Across studies, scores are sensitive to prompt wording, rubric specificity, response length and complexity, and model versioning. As a result, outputs do not simply reflect students’ ability. Instead, they depend on the conditions under which the model is prompted, guided, and constrained. This means that score meaning depends on the scoring setup, so scores should be interpreted as evidence of the targeted capability only under the specified conditions (Bowen & Todd, 2025; Koraishi, 2024; Núñez-Regueiro et al., 2025; Shermis, 2025; Topuz et al., 2025). Some studies further report hallucinations, calculation errors, or inconsistent outputs that undermine the assumption that a score is even repeatable under nominally identical conditions (Daly & Deglaire, 2025; Shermis, 2025). Others describe construct-adjacent proxies, where surface features can mislead a system into rewarding polish rather than substance (Geckin et al., 2023; Shermis, 2025).

In immersive and multimodal settings, construct contamination often arises because signals are shaped by ecological and hardware constraints, incomplete modality capture, and limitations in visual or behavioral recognition pipelines. Richer data strengthen measurement only when they are paired with clear design decisions and a well-specified interpretive rationale. Without that alignment, adding more variables can instead introduce construct-irrelevant variance into scores (Hou et al., 2025; Ouyang et al., 2022, 2024a, 2024b; Pang et al., 2025; F. Zhao et al., 2020).

5.2. Robustness and Generalizability

The second concern is the robustness and generalizability of ET-enabled educational measurement, that is, whether measures remain interpretable and reasonably stable across contexts, and whether their accuracy holds when applied in different settings. This issue is closely tied to the strong micro-level concentration of the field, since many ET-enabled measures are developed in local instructional settings where context-specific traces are easier to capture but more difficult to generalize across populations, platforms, or time. Many studies are single-course, single-site, or small-cohort. Randomization is often infeasible. Control groups are missing or constrained. These are not merely limitations to be acknowledged; they directly define the scope of inference (Novita et al., 2022; Radović & Seidel, 2025; Reid & Drysdale, 2024; M. E. Rodríguez et al., 2022; Rubio et al., 2018).

Several studies also highlight weaknesses that emerge from the data, such as skewed outcomes, sparse features, and imbalanced clusters. When “non-learning” cases are rare, or when high/low score bands have few cases, models may stabilize around majority patterns and behave unpredictably in the cases educators care most about (Monllao Olive et al., 2020; Nasir et al., 2021; Opoku et al., 2025; Serrano-Mamolar et al., 2023; R. Zhao et al., 2023). In clustering work, the authors explicitly note how ill-posed “optimal k” selection can be, and that apparently technical choices (e.g., assuming a fixed item order) can silently restrict what generalization means in practice (Nazaretsky et al., 2019).

Across intelligent tutoring, mastery modeling, and mastery forecasting, a related robustness issue appears as systematic miscalibration, especially false positives in which systems over-estimate mastery, together with limited responsiveness to sudden shifts of understanding (e.g., “aha moments”) (Slater & Baker, 2019). Elsewhere, models depend on platform-specific features, historical course data, or previously human-scored responses, making them fragile when curricula shift or when new items arrive without prior evidence (i.e., “cold start”) (Baral et al., 2021; Pereira et al., 2022; C. C. Y. Yang & Ogata, 2023).

LLM-based scoring and feedback adds another layer: even if a system performs well at one point in time, the underlying model, pricing, context-window behavior, and API conditions can change. The studies repeatedly note that these shifts can bound score meaning temporally unless monitoring, drift detection, and recalibration are built into operational use (Nawahdah et al., 2025; Rantanen et al., 2025; Shermis, 2025; Stewart et al., 2025).
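One simple way to operationalize such monitoring is to re-score a fixed anchor set of human-scored responses at intervals and flag when agreement degrades. The sketch below is a hypothetical illustration, not a procedure reported by the cited studies: the scores, the mean-absolute-difference statistic, and the tolerance of 0.5 score points are all assumptions chosen for the example.

```python
def mean_abs_diff(reference, current):
    """Mean absolute difference between two equally long score lists."""
    if len(reference) != len(current):
        raise ValueError("score lists must have the same length")
    return sum(abs(r - c) for r, c in zip(reference, current)) / len(reference)


def drift_flag(reference, current, tolerance):
    """True when automated scores deviate from the anchors by more than the tolerance."""
    return mean_abs_diff(reference, current) > tolerance


# Hypothetical anchor set: human scores and two automated scoring runs.
human_anchor = [3, 4, 2, 5, 3]
llm_at_launch = [3, 4, 2, 4, 3]
llm_after_update = [4, 5, 3, 5, 4]

TOLERANCE = 0.5  # assumed threshold, in score points

print(drift_flag(human_anchor, llm_at_launch, TOLERANCE))
print(drift_flag(human_anchor, llm_after_update, TOLERANCE))
```

In operational use, the agreement statistic and threshold would need to be justified against the intended decision context; a flagged run would then trigger review or recalibration rather than silent continued scoring.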

5.3. Fairness and Transparency

The third reflection concerns fairness and transparency, and the studies treat both as core measurement issues. These concerns become especially important when ET-enabled work is concentrated in modeling, scoring, and classification processes, where errors, hidden assumptions, or differential performance may be difficult to detect once outputs are presented as decision-relevant indicators. Differential error can enter through language variety and linguistic form (e.g., idioms, polysemy, grammatical variation) as well as through response content that current pipelines handle unevenly, such as mathematical expressions or image-grounded descriptions (Baral et al., 2021; Oğuz, 2025; Shermis, 2025; Stewart et al., 2025; R. Zhao et al., 2023). Others also note that subgroup fairness analyses are sometimes impossible, because demographic metadata are absent in widely used datasets, that is, an ordinary data limitation with direct consequences for validity claims (Shermis, 2025).

Bias can also be structural. Class imbalance, outcome aggregation (e.g., combining “fail” and “withdraw”), and limited feature sets can increase harm even when headline metrics look acceptable. Multiple studies warn that conventional performance reporting (e.g., accuracy) may conceal who bears the error cost (Monllao Olive et al., 2020; Opoku et al., 2025). In affect and engagement detection, additional fairness issues arise when ground truth is uncertain or culturally variable. That is, what counts as frustration in one context may not map cleanly onto another, and reduced labels can miss whole categories of disengagement (Nam et al., 2017; Padrón-Rivera et al., 2016; Standen et al., 2020).
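The point that headline accuracy can conceal who bears the error cost can be illustrated with a small, entirely hypothetical dataset. In the sketch below, the group labels, records, and function names are assumptions for illustration only; the metric shown is the false negative rate, that is, the share of truly at-risk learners whom the model fails to flag.

```python
from collections import defaultdict

# Hypothetical records: (group, true_label, predicted_label); 1 = at risk.
records = (
    [("A", 0, 0)] * 6    # group A, not at risk, correctly predicted
    + [("A", 1, 1)] * 2  # group A, at risk, correctly flagged
    + [("B", 1, 0)] * 2  # group B, at risk, missed by the model
)


def accuracy(rows):
    """Share of records whose predicted label matches the true label."""
    return sum(1 for _, truth, pred in rows if truth == pred) / len(rows)


def false_negative_rate_by_group(rows):
    """Per group: share of truly at-risk cases that the model failed to flag."""
    positives = defaultdict(int)
    misses = defaultdict(int)
    for group, truth, pred in rows:
        if truth == 1:
            positives[group] += 1
            if pred == 0:
                misses[group] += 1
    return {group: misses[group] / positives[group] for group in positives}


print(accuracy(records))
print(false_negative_rate_by_group(records))
```

Here overall accuracy is 80%, yet every at-risk learner in group B is missed while none in group A is, which is the kind of pattern that aggregate reporting leaves invisible.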

In practice, transparency and explainability are often limited by what the system chooses (and is able) to surface in an interface. As a result, explanations may help users understand what the model did, but offer limited support for stakeholders to question, contest, or audit model-based judgment in ways that matter for measurement and accountability. Several studies describe heavy documentation burdens, limited institutional capacity, or missing modalities that make explanations partial at best. Others emphasize that even interpretable methods can fail to support action when the inferred mechanisms remain unclear (Hou et al., 2025; Ortega-Morla et al., 2025; Plumley et al., 2024; Reid & Drysdale, 2024). In other words, a model can be legible, yet still not be accountable as a measure.

5.4. Privacy and Governance

Many studies point to privacy and governance constraints that shape not only the ethics of ET-enabled educational measurement, but also the evidentiary conditions under which such systems operate. These concerns are intensified when the same digital evidence is reused across decision contexts, moving from local instructional support toward broader monitoring, reporting, or institutional use. More specifically, the constraints reported include cloud uploads, uncertain retention, commercial server dependence, and cost barriers, each of which affects the evidentiary base itself (Koraishi, 2024; Novak et al., 2025; Núñez-Regueiro et al., 2025; Rai et al., 2025). Moreover, when learners know they are being monitored, or when parts of activity happen off-platform, the data no longer provide a neutral window on learning; instead, the measure increasingly reflects the conditions under which the measurement takes place (Roa Romero et al., 2021; Rohani et al., 2024; Tempelaar et al., 2024).

Governance challenges also arise from strategic responding by users (e.g., students or instructors) who adapt their responses to what the system appears to reward. The studies note that automated scoring can be influenced by surface cues, while refusals, unstable formats, and prompt-sensitive variability can undermine auditability and consistency. In these environments, maintaining validity becomes a moving target. Either the system increases controls (often unevenly effective and potentially inequitable), or the score’s meaning diminishes as responses are optimized against the scoring function rather than against the construct (B. Chen et al., 2024; Daly & Deglaire, 2025; Geckin et al., 2023; Nawahdah et al., 2025; Topuz et al., 2025). From a measurement perspective, these behaviors should be treated as expected operating conditions that must be built into design, documentation, and validation, rather than treated as exceptions.

6. Implications and Future Direction

Across the included studies, an important implication for researchers and practitioners is that ET-enabled educational measurement needs to be reported and evaluated as a validity argument, not only as a technical workflow. Specifically, studies should make explicit how construct intent, evidence, and warranted inference are linked under clearly stated conditions (AERA et al., 2014; Mislevy, 1996; NRC, 2001; M. Wilson, 2023). This chain is most likely to weaken when indicators travel across decision contexts (M. Wilson, 2018, 2024a). Moreover, in the past decade, the empirical landscape has been strongly micro-centered, and ET work has been concentrated in outcome space and measurement modeling. Our opportunity, therefore, is to rebalance the evidentiary chain by strengthening its comparatively underdeveloped components, especially construct maps and item design (AERA et al., 2014; Mislevy, 1996; M. Wilson, 2023). The implications below translate these broader patterns into audience-specific directions.

6.1. For Researchers

Our review suggests that researchers, when exploring ET-enabled educational measurement, should ensure that construct work does more than merely frame an ET model. A clearly articulated measurement construct, with an explicit developmental interpretation, serves as an anchor for both trace coding and the interpretation of model outputs (NRC, 2001; M. Wilson, 2023). Moreover, validity arguments should be written to the decision context they are meant to serve. Researchers can reduce overclaiming by stating the intended decision context up front (i.e., micro formative guidance, meso advising/monitoring, or macro reporting) and by evaluating evidence under those use conditions (AERA et al., 2014; Mislevy, 1996; M. Wilson, 2018, 2024a, 2024b). Choices such as time windows, logging rules, feature definitions, or instructional shifts may alter what a measure represents and should therefore be treated as measurement issues rather than merely technical details.

Our review also points to future directions. One is to extend research beyond micro-level settings, particularly into meso- and macro-level contexts where demands for comparability, fairness, and accountability are often stronger. Expanding research in these contexts would also help the field test whether indicators developed from rich micro-level traces can sustain stable interpretations when moved across institutions, cohorts, platforms, or time (AERA et al., 2014; Mislevy, 1996; M. Wilson, 2018, 2024a). At the same time, future research could invest more in construct maps and item design, which remain comparatively underdeveloped relative to outcome generation and modeling. Doing so would strengthen interpretability across the evidentiary measurement chain from construct claims to evidence coding to warranted inferences, as well as provide a firmer basis for evaluating ET-enabled measures across contexts.

6.2. For System Designers

The review suggests a shift from building prediction workflows to building instruments with stable meaning. A measurement-consistent design starts by stating construct intent, then engineers evidence capture and scoring so outputs remain interpretable under defined conditions (Mislevy, 1996; NRC, 2001; M. Wilson, 2023). In practice, this means making the scoring function inspectable (i.e., rubrics, evidence rules, and aggregation logic) and documenting what outputs represent and what they do not.

Granularity should also be treated as a governance issue rather than only an interface issue. Micro-level ET-enabled educational measurement tools should emphasize interpretability and uncertainty cues, because teachers need signals they can question, check, and explain in context. In contrast, meso- and macro-level systems need stronger comparability standards and change control, because even small biases can accumulate when outputs are used across institutional decisions (AERA et al., 2014; Gašević et al., 2016; M. Wilson, 2018, 2024a).

Finally, robustness and governance need to be engineered as system properties. Drift monitoring, re-calibration pathways, and version control support interpretability over time. For LLM-mediated measurement, prompt templates, scoring policies, and model identifiers should be logged with the assessment record because they shape the scoring function and therefore the meaning of scores (Khosravi et al., 2022; Shermis & Burstein, 2013). Privacy, transparency, and contestability also matter for system design because they shape the evidentiary conditions under which ET-enabled measurement operates (Hakimi et al., 2021; Slade & Prinsloo, 2013; F. Wang et al., 2025b).
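The logging recommendation for LLM-mediated measurement can be sketched as a small record structure. The following Python example is a minimal illustration under stated assumptions: the `ScoringRecord` class, its field names, the example template, and the identifier strings are all hypothetical, and a production system would also need to address retention and privacy requirements.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScoringRecord:
    """Hypothetical assessment record that keeps the scoring conditions with the score."""
    response_id: str
    score: int
    model_identifier: str
    prompt_template_sha256: str
    scoring_policy_version: str


def fingerprint(text: str) -> str:
    """SHA-256 hex digest, so any change to the template text is detectable."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Illustrative template and identifiers; none of these come from the cited studies.
template = "Score the response from 0 to 4 using the rubric below.\n{rubric}\n{response}"

record = ScoringRecord(
    response_id="resp-001",
    score=3,
    model_identifier="example-llm-v1",
    prompt_template_sha256=fingerprint(template),
    scoring_policy_version="policy-2025-01",
)

print(json.dumps(asdict(record), indent=2))
```

Storing a hash rather than the full template keeps the record compact while still revealing when two scores were produced under different prompts; the full template text would be kept under version control alongside the policy document.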

6.3. For Educators and Practitioners

For educators, the central implication is that ET-enabled educational measurement outputs should be interpreted as conditional information. Their meaning depends on whether local evidence opportunities and instructional conditions match the conditions under which the indicator was validated. In practice, dashboards and automated feedback can be valuable to support timely adjustment or improvement. Yet many such tools are optimized for actionability rather than for supporting stable claims about underlying capability. One suggestion is to use ET outputs to prompt follow-up (e.g., reviewing student work, observing participation, and checking understanding), instead of treating them as standalone conclusions.

For education practitioners involved in system-level reporting, the review cautions against inappropriate granularity transfer, namely, repurposing indicators designed for local optimization or platform monitoring for broader or higher-stakes decisions. Such transfers may intensify bias, hide uncertainty, and redirect instruction toward what platforms capture most easily. Responsible adoption therefore depends on clear documentation of what is measured, how evidence is produced, what limits interpretation, and which uses are warranted or unwarranted.

7. Conclusions

This systematic review shows how emerging technologies (ETs) have reshaped educational measurement over the past decade. ETs are changing the evidentiary chain of educational measurement: what we observe, how observations are turned into outcomes, and how outcomes become decision-relevant information. Using an evidentiary view of measurement (Mislevy, 1996; NRC, 2001) and Wilson’s Constructing Measures Theory (construct map, item design, outcome space, measurement model) (M. Wilson, 2023), we synthesized 933 empirical studies published between 2016 and 2025. We also identified the educational measurement decision context (i.e., micro, meso, or macro grain sizes) (M. Wilson, 2018, 2024a). This allowed us to show not only which ETs are used, but also where they enter measurement work and which decision contexts their outputs are intended to support.

Over the last decade, ET-enabled educational measurement has expanded quickly, with publication growth accelerating after 2021 and peaking in 2025. Yet this growth did not yield a balanced measurement context. Instead, the empirical landscape was strongly micro-centered. Most studies were situated close to learners and classroom activity (micro) rather than within program-level (meso) or system-level infrastructures (macro). This reflects practical conditions. Micro settings might be easier to instrument and iterate, and they generated dense traces (e.g., logs, interactions, discourse, multimodal signals) without requiring major changes to institutional routines. Meso-level research existed but was much smaller, while macro-level studies remained rare and uneven, often emphasizing scalability through modeling.

Across grain sizes, ET activity was heavily concentrated in outcome space and measurement model work, with comparatively limited attention to item design and especially the construct map. In other words, most studies prioritized turning traces and performances into outputs (e.g., scores, labels, indicators, predictions, profiles, and feedback), and then modeling those outputs for monitoring or decision-making. Far fewer studies advanced the work that anchors measurement meaning: specifying constructs with precision, describing how progress should be interpreted (including developmental levels when appropriate), and designing tasks systematically to elicit construct-relevant evidence. This imbalance matters, because outcome and model innovations may scale quickly even when construct meaning is under-specified. This pattern increases the risk that measurable traces drive the definition of the construct, instead of the construct driving what is measured.

The distribution of ET categories reinforces this interpretation. Learning analytics & educational data mining, machine learning & deep learning, and automated scoring/feedback systems function as core infrastructure across building blocks and grain sizes. These approaches often operate as end-to-end pipelines: platform traces become indicators, which are then modeled into predictions, classifications, or feedback routines. In contrast, immersive/simulation/extended reality and multimodal/sensor approaches appear more selectively, typically in micro contexts where the task environment is tightly coupled to what can be logged reliably. Generative AI/LLM systems represented a meaningful share of studies, broadening the feasibility of scoring and structuring open-ended artifacts, while also making score meaning more sensitive to prompt design, model conditions, and system updates.

A granularity-sensitive view further shows that ET-enabled educational measurement became more concentrated as it moved from micro to meso and macro contexts. Micro-level studies were relatively diverse, combining multiple ETs across the measurement workflow. At the meso- and macro-levels, the studies narrowed toward approaches that scale through stable analytic workflows, especially LA & EDM and ML & DL. In macro-level contexts, ET engagement largely focused on measurement model work, with little evidence that ETs are being used to develop or refine system-level construct maps. This is important because the evidentiary requirements for comparability, fairness, and accountability typically intensify as the stakes rise.

These structural patterns connect directly to critical measurement concerns raised by the included studies. Construct meaning may drift when platforms, prompts, or modality pipelines determine what becomes measurable. Robustness may be limited by local datasets, imbalanced outcomes, cold-start conditions, and temporal instability. These limits are particularly relevant for LLM-mediated scoring, where updates and prompt dependence can change score meaning over time. Fairness and transparency risks arise both statistically (e.g., differential error and subgroup harms that can be hidden by overall averages) and structurally (e.g., missing demographic metadata, culturally contingent ground truth, and unequal opportunity to generate the traces treated as evidence). Finally, privacy and governance are not external constraints. They are part of the evidentiary chain because data capture conditions, surveillance effects, commercial infrastructures, and strategic behavior can reshape what counts as evidence, thereby reshaping what the measure represents (Mislevy, 1996; M. Wilson, 2023).

Several limitations should be acknowledged within the scope of this review and point to directions for future work. First, although the search strategy was developed to capture a broad and interdisciplinary body of work, the review was limited to four databases and English-language publications. In addition, our ET facet was constructed to support broad, review-grounded retrieval across a heterogeneous field, rather than as an exhaustive inventory of all possible technical labels. This design improved cross-domain coverage, but it also means that some studies associated with rapidly emerging or shifting terminology may still have been under-captured if those labels were not indexed through the selected search string. Although full-text screening and post-retrieval coding helped refine ET categorization among the retrieved studies, they cannot fully address the limitations of the retrieval process introduced at the search stage. Future reviews may extend this coverage by incorporating additional databases, non-English publications, ET categories, and a wider range of publication formats. Second, the synthesis necessarily involved interpretive coding. Grain size, building block, and ET category were coded through a predefined theoretical framework or an iterative codebook, with consensus procedures used to support consistency. This coding process was designed as a discussion-based consensus procedure rather than as a separate inter-rater reliability study. Such an approach is consistent with methodological work on team-based qualitative coding, which treats analytic consistency as being supported not only through formal agreement statistics, but also through explicit codebook development, iterative refinement, and structured discussion among coders (MacQueen et al., 1998; O’Connor & Joffe, 2020).
At the same time, however, the robustness of classification should be understood as grounded in independent initial coding, structured discussion, codebook refinement, and adjudication when needed, rather than in a separate reliability statistic. Future reviews focused on narrow subdomains may complement such consensus procedures with formal inter-coder reliability indicators (Campbell et al., 2013; O’Connor & Joffe, 2020). In a similarly broad way, the inclusion criterion concerning measurement quality visibility was designed to accommodate the diversity of reporting conventions across quantitative, qualitative, and mixed-methods ET studies. Accordingly, this criterion supported comparability across a heterogeneous corpus, but it should be interpreted as an inclusion threshold rather than as a graded evaluation of evidentiary strength. Future reviews focused on narrow methodological or technological subdomains may apply differentiated criteria for judging the form and depth of reported quality evidence. Finally, this review was designed to map patterns in study characteristics and design emphases rather than to compare the effectiveness of specific technologies or to conduct a meta-analysis. Given the heterogeneity of the included studies in their purposes, designs, data types, and reported quality indicators, the review is best understood as a mapping and interpretive synthesis of how ETs are positioned within educational measurement, where evidentiary work is concentrated, and what validity-related concerns recur across contexts. As the field matures, future research may build on this foundation through more targeted comparative syntheses within specific ET categories, educational levels, or measurement purposes.

Taken together, this review offers a measurement-centered map of ET-enabled educational measurement that supports comparison across studies. It organizes the studies by (a) decision context (grain size) and (b) evidentiary work in measurement (the four building blocks), rather than by ET labels alone. Recent advances have meaningfully expanded what ET-enabled measurement can do. The next step, however, is to restore balance by treating construct claims, evidence coding, modeling, and inference as one coherent measurement argument. Indicators of evidence also require grain-appropriate justification. Such justification should clarify what an indicator means, the conditions under which it remains valid, and how it should or should not be used. It should also clarify who retains interpretive authority when ET-enabled systems increasingly automate parts of evidence capture, scoring, and feedback. As measurement moves from micro-level instructional use toward broader meso- or macro-level decision contexts, this question becomes more consequential. Automation may not remove human judgment, but it can relocate it, making it less visible and potentially less contestable if responsibilities for interpretation are incorporated into technical systems and platform logics. Grain-appropriate justification therefore becomes essential when an indicator is carried from classroom (micro) use to program (meso) or system-level (macro) decisions. ETs can broaden what we can observe and support more timely forms of feedback that are sensitive to learning processes. However, these benefits qualify as defensible measurement only when interpretations are clearly bounded, uncertainty is made visible, and governance anticipates reuse beyond the original context.

Author Contributions

Conceptualization, L.Y. and F.W.; methodology, L.Y. and F.W.; formal analysis, L.Y. and B.Z.; investigation, L.Y. and B.Z.; writing—original draft preparation, L.Y. and B.Z.; writing—review and editing, L.Y., B.Z., and F.W.; supervision, F.W. and G.K.W.W.; funding acquisition, G.K.W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Hong Kong Jockey Club Charities Trust Fund (grant number 2024-0086-001) and the Hong Kong Quality Education Fund (grant number 2021/1036).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created in this study.

Acknowledgments

The authors gratefully acknowledge the support of the Hong Kong Jockey Club Charities Trust Fund and the Hong Kong Quality Education Fund. The authors would also like to thank the academic editor and the anonymous reviewers for their valuable feedback and suggestions, which helped strengthen this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Abdelhalim, S. M., & Alsehibany, R. A. (2025). Integrating AI-powered tools in EFL pronunciation instruction: Effects on accuracy and L2 motivation. Computer Assisted Language Learning, 1–25.
Abdi, S., Khosravi, H., Sadiq, S., & Gasevic, D. (2019). A multivariate Elo-based learner model for adaptive educational systems. arXiv, arXiv:1910.12581.
Alfredo, R., Echeverria, V., Zhao, L., Lawrence, L., Fan, J. X., Yan, L., Li, X., Swiecki, Z., Gašević, D., & Martinez-Maldonado, R. (2024). Designing a human-centred learning analytics dashboard in-use. Journal of Learning Analytics, 11(3), 62–81.
Al Hakim, V. G., Yang, S. H., Liyanawatta, M., Wang, J. H., & Chen, G. D. (2022). Robots in situated learning classrooms with immediate feedback mechanisms to improve students’ learning performance. Computers & Education, 182, 104483.
AlJarrah, A., Thomas, M. K., & Shehab, M. (2018). Investigating temporal access in a flipped classroom: Procrastination persists. International Journal of Educational Technology in Higher Education, 15(1), 1.
Alvarez-Garcia, M., Arenas-Parra, M., & Ibar-Alonso, R. (2024). Uncovering student profiles. An explainable cluster analysis approach to PISA 2022. Computers & Education, 223, 105166.
American Educational Research Association, American Psychological Association & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Baker, R. S., & Siemens, G. (2014). Educational data mining and learning analytics. In R. K. Sawyer (Ed.), The Cambridge handbook of the learning sciences (2nd ed.). Cambridge University Press.
Baral, S., Botelho, A. F., Erickson, J. A., Benachamardi, P., & Heffernan, N. T. (2021). Improving automated scoring of student open responses in mathematics. In 14th international conference on educational data mining (EDM 2021) (pp. 130–138). International Educational Data Mining Society. Available online: https://eric.ed.gov/?id=ED615565 (accessed on 22 November 2025).
Bennett, R. E. (2015). The changing nature of educational assessment. Review of Research in Education, 39(1), 370–407.
Bertolini, R., Finch, S. J., & Nehm, R. H. (2021). Testing the impact of novel assessment sources and machine learning methods on predictive outcome modeling in undergraduate biology. Journal of Science Education and Technology, 30(2), 193–209.
Bilal, M., Omar, M., Anwar, W., Bokhari, R. H., & Choi, G. S. (2025). Bridging the gap: From traditional admissions to data-driven insights for predicting and supporting undergraduate performance. Education and Information Technologies, 30(18), 27085–27110.
Black, P., Wilson, M., & Yao, S. Y. (2011). Road maps for learning: A guide to the navigation of learning progressions. Measurement: Interdisciplinary Research & Perspective, 9(2–3), 71–123.
Bonami, B., Piazentini, L., & Dala-Possa, A. (2020). Education, big data and artificial intelligence: Mixed methods in digital platforms. Comunicar, 65, 43–52.
Borchers, C., Fleischer, H., Schanze, S., Scheiter, K., & Aleven, V. (2025). High scaffolding of an unfamiliar strategy improves conceptual learning but reduces enjoyment compared to low scaffolding and strategy freedom. Computers & Education, 236, 105364.
Bowen, N. E., & Todd, R. W. (2025). Enhancing ChatGPT-based writing research through effective prompt use. Teaching English with Technology, 25(1), 26–40.
Bulathwela, S., Verma, M., Pérez-Ortiz, M., Yilmaz, E., & Shawe-Taylor, J. (2022). Can population-based engagement improve personalisation? A novel dataset and experiments. arXiv, arXiv:2207.01504.
Bulut, O., Gorgun, G., & Yildirim-Erbasli, S. N. (2025). The impact of frequency and stakes of formative assessment on student achievement in higher education: A learning analytics study. Journal of Computer Assisted Learning, 41(1), e13087.
Butterfuss, R., Roscoe, R. D., Allen, L. K., McCarthy, K. S., & McNamara, D. S. (2022). Strategy uptake in writing pal: Adaptive feedback and instruction. Journal of Educational Computing Research, 60(3), 696–721.
Cabı, E., & Türkoğlu, H. (2025). The impact of a learning analytics based feedback system on students’ academic achievement and self-regulated learning in a flipped classroom. International Review of Research in Open and Distributed Learning, 26(1), 175–196.
Cai, Z., Graesser, A. C., Windsor, L., Cheng, Q., Shaffer, D. W., & Hu, X. (2018). Impact of corpus size and dimensionality of LSA spaces from Wikipedia articles on AutoTutor answer evaluation. Journal of Educational Data Mining. Available online: https://par.nsf.gov/biblio/10098439 (accessed on 22 November 2025).
Campbell, J. L., Quincy, C., Osserman, J., & Pedersen, O. K. (2013). Coding in-depth semistructured interviews: Problems of unitization and intercoder reliability and agreement. Sociological Methods & Research, 42(3), 294–320.
Cerratto Pargman, T., & McGrath, C. (2021). Mapping the ethics of learning analytics in higher education: A systematic literature review of empirical research. Journal of Learning Analytics, 8(2), 123–139.
Charleer, S., Vande Moere, A., Klerkx, J., Verbert, K., & De Laet, T. (2017). Learning analytics dashboards to support adviser-student dialogue. IEEE Transactions on Learning Technologies, 11(3), 389–399.
Chejara, P., Kasepalu, R., Prieto, L. P., Rodríguez-Triana, M. J., Ruiz Calleja, A., & Schneider, B. (2023). How well do collaboration quality estimation models generalize across authentic school contexts? British Journal of Educational Technology, 55(4), 1602–1624.
Chen, B., Bao, L., Zhang, R., Zhang, J., Liu, F., Wang, S., & Li, M. (2024). A multi-strategy computer-assisted EFL writing learning system with deep learning incorporated and its effects on learning: A writing feedback perspective. Journal of Educational Computing Research, 61(8), 1596–1638.
Chen, D., Jeng, A., Sun, S., & Kaptur, B. (2023). Use of technology-based assessments: A systematic review covering over 30 countries. Assessment in Education: Principles, Policy & Practice, 30(5–6), 396–428.
Chen, J. (2024). Effects of learning analytics-based feedback on students’ self-regulated learning and academic achievement in a blended EFL course. System, 124, 103388.
Chen, S. Y., & Yeh, C. C. (2017). The effects of cognitive styles on the use of hints in academic English: A learning analytics approach. Journal of Educational Technology & Society, 20(2), 251–264.
Chen, Y., Li, J., Liu, Y., Jiang, F., Zhou, A., & Li, Y. (2025). Mining the patterns of teachers’ nonverbal behavior: Automated recognition and systematic exploration. Journal of Educational Computing Research, 63(7–8), 1583–1617.
Cheung, K. C., Sit, P. S., Zheng, J. Q., Lam, C. C., Mak, S. K., & Ieong, M. K. (2024). A machine-learning model of academic resilience in the times of the COVID-19 pandemic: Evidence drawn from 79 countries/economies in the PISA 2022 mathematics study. British Journal of Educational Psychology, 94(4), 1224–1244.
Cohen, J., Anglin, K., & Wiseman, E. (2024a). Tailoring teacher supports: A mixed-methods analysis of responses to coaching and self-reflection. AERA Open, 10, 23328584241289876.
Cohen, J., Wong, V. C., Krishnamachari, A., & Erickson, S. (2024b). Experimental evidence on the robustness of coaching supports in teacher education. Educational Researcher, 53(1), 19–35.
Cohn, C., Snyder, C., Fonteles, J. H., TS, A., Montenegro, J., & Biswas, G. (2025). A multimodal approach to support teacher, researcher and AI collaboration in STEM+C learning environments. British Journal of Educational Technology, 56(2), 595–620.
Costa-Mendes, R., Oliveira, T., Castelli, M., & Cruz-Jesus, F. (2021). A machine learning approximation of the 2015 Portuguese high school student grades: A hybrid approach. Education and Information Technologies, 26(2), 1527–1547.
Cukurova, M., Khan-Galaria, M., Millán, E., & Luckin, R. (2022). A learning analytics approach to monitoring the quality of online one-to-one tutoring. Journal of Learning Analytics, 9(2), 105–120.
Çakiroğlu, Ü., & Kahyar, S. (2022). Modelling online community constructs through interaction data: A learning analytics based approach. Education and Information Technologies, 27(6), 8311–8328.
Daly, P., & Deglaire, E. (2025). AI-enabled correction: A professor’s journey. Innovations in Education and Teaching International, 62(4), 1241–1257.
Dannath, J., Deriyeva, A., & Paaßen, B. (2025). What is a step? A user study on how to sub-divide the solution process of introductory Python tasks. In C. Mills, G. Alexandron, D. Taibi, G. Lo Bosco, & L. Paquette (Eds.), 18th international conference on educational data mining (EDM 2025) (pp. 533–540). International Educational Data Mining Society. Available online: https://eric.ed.gov/?id=ED675667 (accessed on 22 November 2025).
Darvishi, A., Khosravi, H., Sadiq, S., & Gašević, D. (2022). Incorporating AI and learning analytics to build trustworthy peer assessment systems. British Journal of Educational Technology, 53(4), 844–875.
de Barros Camargo, C., & Hernández Fernández, A. (2024). Neuropedagogy and neuroimaging of artificial intelligence and deep learning. Educational Process: International Journal, 13(3), 97–115.
Delgado, A. J., Wardlow, L., McKnight, K., & O’Malley, K. (2015). Educational technology: A review of the integration, resources, and effectiveness of technology in K–12 classrooms. Journal of Information Technology Education: Research, 14, 397–416.
DiSabito, D., Hansen, L., Mennella, T., & Rodriguez, J. (2025). Exploring the frontiers of generative AI in assessment: Is there potential for a human-AI partnership? New Directions for Teaching and Learning, 2025(182), 81–96.
Divasón, J., Martínez-de-Pisón, F. J., Romero, A., & Sáenz-de-Cabezón, E. (2023). Artificial intelligence models for assessing the evaluation process of complex student projects. IEEE Transactions on Learning Technologies, 16(5), 694–707.
Divjak, B., Svetec, B., Horvat, D., & Kadoić, N. (2023). Assessment validity and learning analytics as prerequisites for ensuring student-centred learning design. British Journal of Educational Technology, 54(1), 313–334.
Doleck, T., Lemay, D. J., Basnet, R. B., & Bazelais, P. (2020). Predictive analytics in education: A comparison of deep learning frameworks. Education and Information Technologies, 25(3), 1951–1963.
Donthu, N., Kumar, S., Mukherjee, D., Pandey, N., & Lim, W. M. (2021). How to conduct a bibliometric analysis: An overview and guidelines. Journal of Business Research, 133, 285–296.
Dosaru, D. F., Simion, D. M., Ignat, A. H., Negreanu, L. C., & Olteanu, A. C. (2025). Using GenAI to assess design patterns in student written code. IEEE Transactions on Learning Technologies, 18, 869–876.
Drinkwater Gregg, K., Ryan, O., Katz, A., Huerta, M., & Sajadi, S. (2025). Expanding possibilities for generative AI in qualitative analysis: Fostering student feedback literacy through the application of a feedback quality rubric. Journal of Engineering Education, 114(3), e70024.
Firetto, C. M., Murphy, P. K., Starrett, E., Herman, E. A., Greene, J. A., Tang, Y., & Yan, L. (2025). Investigating grade-level and text genre effects in quality talk discussions: An AI-powered discourse analysis of upper primary students’ high-level comprehension. Learning and Instruction, 100, 102208.
Flodén, J. (2025). Grading exams using large language models: A comparison between human and AI grading of exams in higher education using ChatGPT. British Educational Research Journal, 51(1), 201–224.
Forkan, A. R. M., Kang, Y.-B., Jayaraman, P. P., Du, H., Thomson, S., Kollias, E., & Wieland, N. (2023). VideoDL: Video-based digital learning framework using AI question generation and answer assessment. International Journal of Advanced Corporate Learning, 16(1), 19–27.
Frick, T. W., Myers, R. D., & Dagli, C. (2022). Analysis of patterns in time for evaluating effectiveness of first principles of instruction. Educational Technology Research and Development, 70(1), 1–29.
Fuller, R., Goddard, V. C. T., Nadarajah, V. D., Treasure-Jones, T., Yeates, P., Scott, K., Webb, A., Valter, K., & Pyörälä, E. (2022). Technology enhanced assessment: Ottawa consensus statement and recommendations. Medical Teacher, 44(8), 836–850.
Gardner, J., & Brooks, C. (2018). Evaluating predictive models of student success: Closing the methodological gap. arXiv, arXiv:1801.08494.
Gašević, D., Dawson, S., Rogers, T., & Gasevic, D. (2016). Learning analytics should not promote one size fits all: The effects of instructional conditions in predicting academic success. The Internet and Higher Education, 28, 68–84.
Geckin, V., Kızıltaş, E., & Çınar, Ç. (2023). Assessing second-language academic writing: AI vs. Human raters. Journal of Educational Technology and Online Learning, 6(4), 1096–1108.
Gelan, A., Fastré, G., Verjans, M., Martin, N., Janssenswillen, G., Creemers, M., Lieben, J., Depaire, B., & Thomas, M. (2018). Affordances and limitations of learning analytics for computer-assisted language learning: A case study of the VITAL project. Computer Assisted Language Learning, 31(3), 294–319.
Gray, C. C., & Perkins, D. (2019). Utilizing early engagement and machine learning to predict student outcomes. Computers & Education, 131, 22–32.
Greller, W., & Drachsler, H. (2012). Translating learning into numbers: A generic framework for learning analytics. Journal of Educational Technology & Society, 15(3), 42–57.
Guevara-Flores, K. F., Hernández-Calderón, J. G., & Soto-Mendoza, V. (2023, November 25–26). Enhancing English proficiency test evaluation: Leveraging artificial intelligence for result classification. 2023 10th International Conference on Soft Computing & Machine Intelligence (ISCMI) (pp. 183–187), Mexico City, Mexico.
Guo, D. (2025, June 6–7). An enhanced evaluation of English teaching quality based on explainable artificial intelligence techniques. 2025 International Conference on Intelligent Computing and Knowledge Extraction (ICICKE) (pp. 1–6), Bengaluru, India.
Gupta, S., & Sabitha, A. S. (2019). Deciphering the attributes of student retention in massive open online courses using data mining techniques. Education and Information Technologies, 24(3), 1973–1994.
Hakimi, L., Eynon, R., & Murphy, V. A. (2021). The ethics of using digital trace data in education: A thematic review of the research landscape. Review of Educational Research, 91(5), 671–717.
Han, F., & Ellis, R. (2020a). Combining self-reported and observational measures to assess university student academic performance in blended course designs. Australasian Journal of Educational Technology, 36(6), 1–14.
Han, F., & Ellis, R. (2020b). Personalised learning networks in the university blended learning context. Comunicar, 28(62), 19–30.
Hansel, C. A., Ottenbreit-Leftwich, A., Quick, J. D., Greene, A. H., & Ricci, M. (2024). Gradescope in large lecture classrooms: A case study at Indiana University: How an online grading platform enhanced student learning and instructor feedback in large-scale courses. Journal of Teaching and Learning with Technology, 13(1), 33–48.
Hao, J., Gan, J., & Zhu, L. (2022). MOOC performance prediction and personal performance improvement via Bayesian network. Education and Information Technologies, 27(5), 7303–7326.
Harindranathan, P., & Folkestad, J. (2019). Learning analytics to inform the learning design: Supporting instructors’ inquiry into student learning in unsupervised technology-enhanced platforms. Online Learning, 23(3), 34–55.
Heil, J., & Ifenthaler, D. (2023). Online assessment in higher education: A systematic review. Online Learning, 27(1), 187–218.
Henríquez, V., Guerra, J., & Scheihing, E. (2024). The impact of an academic counselling learning analytics tool: Evidence from 3 years of use. British Journal of Educational Technology, 55(5), 1884–1899.
Herodotou, C., Rienties, B., Boroowa, A., Zdrahal, Z., & Hlosta, M. (2019). A large-scale implementation of predictive learning analytics in higher education: The teachers’ role and perspective. Educational Technology Research and Development, 67(5), 1273–1306.
Hershberger, P. J., Pei, Y., Bricker, D. A., Crawford, T. N., Shivakumar, A., Castle, A., Conway, K., Medaramitta, R., Rechtin, M., & Wilson, J. F. (2024). Motivational interviewing skills practice enhanced with artificial intelligence: ReadMI. BMC Medical Education, 24, 237.
Hershkovitz, A., Tabach, M., & Cohen, A. (2022). Online activity and achievements in elementary school mathematics: A large-scale exploration. Journal of Educational Computing Research, 60(1), 258–278.
Hilliger, I., Aguirre, C., Miranda, C., Celis, S., & Pérez-Sanagustín, M. (2022). Lessons learned from designing a curriculum analytics tool for improving student learning and program quality. Journal of Computing in Higher Education, 34(3), 633–657.
Hirschi, K., Kang, O., Yang, M., Hansen, J. H. L., & Beloin, K. (2025). Artificial intelligence-generated feedback for second language intelligibility: An exploratory intervention study on effects and perceptions. Language Learning, 75(S1), 204–241.
Holmes, W., Bialik, M., & Fadel, C. (2019). Artificial intelligence in education: Promises and implications for teaching and learning. Center for Curriculum Redesign.
Horikoshi, I., Noguchi, M., & Tamura, Y. (2016). Evaluation of learning unit design with use of page flip information analysis. International Association for Development of the Information Society. Available online: https://eric.ed.gov/?id=ED571426 (accessed on 28 November 2025).
Hou, R., Bühler, B., Fütterer, T., Bozkir, E., Gerjets, P., Trautwein, U., & Kasneci, E. (2025). Multimodal assessment of classroom discourse quality: A text-centered attention-based multi-task learning approach. arXiv, arXiv:2505.07902.
Hsieh, H. F., & Shannon, S. E. (2005). Three approaches to qualitative content analysis. Qualitative Health Research, 15(9), 1277–1288.
Ifenthaler, D., & Yau, J. Y. K. (2020). Utilising learning analytics to support study success in higher education: A systematic review. Educational Technology Research and Development, 68(4), 1961–1990.
International Organization for Standardization [ISO]. (1995). Guide to the expression of uncertainty in measurement (GUM). International Organization for Standardization.
Joseph, B., & Abraham, S. (2023). Identifying slow learners in an e-learning environment using k-means clustering approach. Knowledge Management & E-Learning, 15(4), 539–553.
Jovanović, J., Saqr, M., Joksimović, S., & Gašević, D. (2021). Students matter the most in learning analytics: The effects of internal and instructional conditions in predicting academic success. Computers & Education, 172, 104251.
Khosravi, H., Buckingham Shum, S., Chen, G., Conati, C., Tsai, Y.-S., Kay, J., Knight, S., Martinez-Maldonado, R., Sadiq, S., & Gašević, D. (2022). Explainable artificial intelligence in education. Computers and Education: Artificial Intelligence, 3, 100074.
Kim, D., Park, Y., Yoon, M., & Jo, I. H. (2016). Toward evidence-based learning analytics: Using proxy variables to improve asynchronous online discussion environments. The Internet and Higher Education, 30, 30–43.
Kivimäki, V., Pesonen, J., Romanoff, J., Remes, H., & Ihantola, P. (2019). Curricular concept maps as structured learning diaries: Collecting data on self-regulated learning and conceptual thinking for learning analytics applications. Journal of Learning Analytics, 6(3), 106–121.
Klang, E., Portugez, S., Gross, R., Kassif Lerner, R., Brenner, A., Gilboa, M., Ortal, T., Ron, S., Robinzon, V., Meiri, H., & Segal, G. (2023). Advantages and pitfalls in utilizing artificial intelligence for crafting medical examinations: A medical education pilot study with GPT-4. BMC Medical Education, 23, 772.
Kokoç, M. (2019). Flexibility in e-learning: Modelling its relation to behavioural engagement and academic performance. Themes in eLearning, 12(12), 1–16.
Kong, X., Liu, Z., Chen, C., Liu, S., Xu, Z., & Tang, Q. (2025). Exploratory study of an AI-supported discussion representational tool for online collaborative learning in a Chinese university. The Internet and Higher Education, 64, 100973.
Koraishi, O. (2024). The intersection of AI and language assessment: A study on the reliability of ChatGPT in grading IELTS writing task 2. Language Teaching Research Quarterly, 43, 22–42.
Kortemeyer, G., Nöhl, J., & Onishchuk, D. (2024). Grading assistance for a handwritten thermodynamics exam using artificial intelligence: An exploratory study. Physical Review Physics Education Research, 20(2), 020144.
Krippendorff, K. (2004). Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30(3), 411–433.
Lai, J. W. M., & Bower, M. (2019). How is the use of technology in education evaluated? A systematic review. Computers & Education, 133, 27–42.
Lai, J. W. M., & Bower, M. (2020). Evaluation of technology use in education: Findings from a critical analysis of systematic literature reviews. Journal of Computer Assisted Learning, 36(3), 241–259.
Lan, H. (2025, June 6–7). Quality evaluation of talent cultivation in higher vocational education based on artificial intelligence algorithms. 2025 International Conference on Intelligent Computing and Knowledge Extraction (ICICKE) (pp. 1–7), Bengaluru, India.
Leavy, A., Dick, L., Meletiou-Mavrotheris, M., Paparistodemou, E., & Stylianou, E. (2023). The prevalence and use of emerging technologies in STEAM education: A systematic review of the literature. Journal of Computer Assisted Learning, 39(4), 1061–1082.
Lee, J., Soleimani, F., Irish, I., Hosmer, J., IV, Yilmaz Soylu, M., Finkelberg, R., & Chatterjee, S. (2022). Predicting cognitive presence in at-scale online learning: MOOC and for-credit online course environments. Online Learning, 26(1), 58–79.
Lehrer, R. (2021). Accountable assessment [Keynote presentation]. In Research conference 2021: Excellent progress for every student: Proceedings and program. Australian Council for Educational Research.
Li, H., Xing, W., Li, C., Zhu, W., & Woodhead, S. (2025a). Integrating option tracing into knowledge tracing: Enhancing learning analytics for mathematics multiple-choice questions. Journal of Learning Analytics, 12(1), 322–337.
Li, H., Xing, W., Zhu, W., Zhang, S., & Liu, Z. (2025b). Should educational AI models include gender attribute? Explaining the why based on environmental psychology course with gender imbalance. Journal of Computing in Higher Education, 37(4), 1371–1412.
Li, R., Liu, Y., & Gao, N. (2025a, June 13–16). On AI assisted formative assessment of blended teaching model: Taking “cultivation of ethics and fundamentals of law” course as an example. 2025 International Conference on Distance Education and Learning (ICDEL) (pp. 166–170), Kunming, China.
Li, R., Liu, Y., & Gao, N. (2025b, May 14–16). On the effectiveness of formative assessment method assisted by artificial intelligence in college education: Taking “cultivation of ethics and fundamentals of law” course as an example. 2025 5th International Conference on Artificial Intelligence and Education (ICAIE) (pp. 670–674), Suzhou, China.
Lim, T., Gottipati, S., Cheong, M., Ng, J. W., & Pang, C. (2023). Analytics-enabled authentic assessment design approach for digital education. Education and Information Technologies, 28(7), 9025–9048.
Lin, C. J., & Hwang, G. J. (2025). Artificial intelligence-supported procedural scaffolding for promoting EFL learners’ writing performance in flipped peer assessment activities. Interactive Learning Environments, 1–15.
Lin, J., Singh, S., Sha, L., Tan, W., Lang, D., Gašević, D., & Chen, G. (2022). Is it a good move? Mining effective tutoring strategies from human–human tutorial dialogues. Future Generation Computer Systems, 127, 194–207.
Lin, J. J. (2025). AI-assisted evaluation of problem-solving performance using eye movement and handwriting. Journal of Research on Technology in Education, 57(5), 1019–1043.
Link, S., Redmon, R., Shamsi, Y., & Hagan, M. (2024). Generating genre-based automatic feedback on English for research publication purposes. CALICO Journal, 41(3), 319–346.
Liu, C., Feng, Y., & Wang, Y. (2022). An innovative evaluation method for undergraduate education: An approach based on BP neural network and stress testing. Studies in Higher Education, 47(1), 212–228.
Lokkila, E., Christopoulos, A., & Laakso, M. J. (2023). A data-driven approach to compare the syntactic difficulty of programming languages. Journal of Information Systems Education, 34(1), 84–93.
Lombard, M., Snyder-Duch, J., & Bracken, C. C. (2002). Content analysis in mass communication: Assessment and reporting of intercoder reliability. Human Communication Research, 28(4), 587–604.
Long, P., & Siemens, G. (2014). Penetrating the fog: Analytics in learning and education. Italian Journal of Educational Technology, 22(3), 132–137.
Ma, X., Pan, W., & Yu, X. N. (2025). Evaluating AI-generated examination papers in periodontology: A comparative study with human-designed counterparts. BMC Medical Education, 25(1), 1099.
Macarini, L. A., Lemos dos Santos, H., Cechinel, C., Ochoa, X., Rodés, V., Pérez Casas, A., Lucas, P. P., Maya, R., Alonso, G. E., & Díaz, P. (2020). Towards the implementation of a countrywide K-12 learning analytics initiative in Uruguay. Interactive Learning Environments, 28(2), 166–190.
MacQueen, K. M., McLellan, E., Kay, K., & Milstein, B. (1998). Codebook development for team-based qualitative analysis. Field Methods, 10(2), 31–36.
Makhlouf, J., & Mine, T. (2020). Analysis of click-stream data to predict STEM careers from student usage of an intelligent tutoring system. Journal of Educational Data Mining, 12(2), 1–18.
Mangaroska, K., Vesin, B., Kostakos, V., Brusilovsky, P., & Giannakos, M. N. (2021). Architecting analytics across multiple e-learning systems to enhance learning design. IEEE Transactions on Learning Technologies, 14(2), 173–188.
Mari, L., Wilson, M., & Maul, A. (Eds.). (2023). Measurement across the sciences: Developing a shared concept system for measurement (2nd ed.). Springer.
Marquart, C. L., Hinojosa, C., Swiecki, Z., Eagan, B., & Shaffer, D. W. (2021). Epistemic network analysis (Version 1.7.0) [Software]. University of Wisconsin–Madison.
Martin, F., Dennen, V. P., & Bonk, C. J. (2020). A synthesis of systematic review research on emerging learning environments and technologies. Educational Technology Research and Development, 68(4), 1613–1633.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
Messick, S. (1996). Validity of performance assessments. In Technical issues in large-scale performance assessment (pp. 1–18). National Center for Education Statistics.
Minty, I., Lawson, J., Guha, P., Luo, X., Malik, R., Cerneviciute, R., Kinross, J., & Martin, G. (2022). The use of mixed reality technology for the objective assessment of clinical skills: A validation study. BMC Medical Education, 22(1), 639.
Mislevy, R. J. (1996). Test theory reconceived. Journal of Educational Measurement, 33(4), 379–416.
Mislevy, R. J., Almond, R. G., & Lukas, J. F. (2003a). A brief introduction to evidence-centered design. ETS Research Report Series, 2003(1), i–29.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003b). Focus article: On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–62.
Monllao Olive, D., Huynh, D. Q., Reynolds, M., Dougiamas, M., & Wiese, D. (2020). A supervised learning framework: Using assessment to identify students at risk of dropping out of a MOOC. Journal of Computing in Higher Education, 32(1), 9–26.
Muresan, A., Cardei, M., & Cardei, I. (2025). Predicting student success with heterogeneous graph deep learning and machine learning models. In 18th international conference on educational data mining (EDM 2025) (pp. 265–275). International Educational Data Mining Society. Available online: https://eric.ed.gov/?id=ED675661 (accessed on 25 November 2025).
Nahar, K., Shova, B. I., Ria, T., Rashid, H. B., & Islam, A. S. (2021). Mining educational data to predict students performance: A comparative study of data mining techniques. Education and Information Technologies, 26(5), 6051–6067.
Nam, S., Frishkoff, G., & Collins-Thompson, K. (2017). Predicting students disengaged behaviors in an online meaning-generation task. IEEE Transactions on Learning Technologies, 11(3), 362–375.
Nasir, J., Kothiyal, A., Bruno, B., & Dillenbourg, P. (2021). Many are the ways to learn identifying multi-modal behavioral profiles of collaborative learning in constructivist activities. International Journal of Computer-Supported Collaborative Learning, 16(4), 485–523.
National Research Council. (2001). Knowing what students know: The science and design of educational assessment. National Academies Press.
Nawahdah, M., Sawalha, H., Salameh, R., & Taha, M. (2025, July 9–10). Evaluating the accuracy and effectiveness of AI-based grading in computer science education. 2025 International Conference on Smart Learning Courses (SCME) (pp. 1–6), Hebron, Palestine.
Nazaretsky, T., Hershkovitz, S., & Alexandron, G. (2019). Kappa learning: A new item-similarity method for clustering educational items from response data. In 12th international conference on educational data mining (EDM 2019) (pp. 129–138). International Educational Data Mining Society. Available online: https://eric.ed.gov/?id=ED599209 (accessed on 22 November 2025).
Ngoc, H. D., Hoang, L. H., & Hung, V. X. (2020). Transforming education with emerging technologies in higher education: A systematic literature review. International Journal of Higher Education, 9(5), 252–258.
Nguyen, Q., Rienties, B., & Whitelock, D. (2020). A mixed-method study of how instructors design for learning in online and distance education. Journal of Learning Analytics, 7(3), 64–78.
Niknam, M., & Thulasiraman, P. (2020). LPR: A bio-inspired intelligent learning path recommendation system based on meaningful learning theory. Education and Information Technologies, 25(5), 3797–3819.
Novak, M., Andročec, D., & Picek, R. (2025, September 18–20). Comparison of generative artificial intelligence tools in the assessment of student assignments. 2025 International Conference on Software, Telecommunications and Computer Networks (SoftCOM) (pp. 1–6), Split, Croatia.
Novita, S., Kusuma, P. A., Ratnasari, R. D., Khairani, R. N., Rahmayanthi, D., Noer, A. H., & Purba, F. D. (2022, October 13–15). Mathematics assessment using virtual reality: A study on Indonesian elementary school children. 2022 International Conference on Assessment and Learning (ICAL) (pp. 1–6), Bali, Indonesia.
Núñez-Regueiro, F., Falcon, S., & Bressoux, P. (2025). Modeling demands-resources fit in teacher education using open-ended data: A methodological-substantive synergy. Education and Information Technologies, 30(18), 26025–26056.
O’Brien, B. C., Harris, I. B., Beckman, T. J., Reed, D. A., & Cook, D. A. (2014). Standards for reporting qualitative research: A synthesis of recommendations. Academic Medicine, 89(9), 1245–1251.
O’Connor, C., & Joffe, H. (2020). Intercoder reliability in qualitative research: Debates and practical guidelines. International Journal of Qualitative Methods, 19, 1609406919899220.
Oğuz, E. (2025). Can generative AI figure out figurative language? The influence of idioms on essay scoring by ChatGPT, Gemini, and Deepseek. Assessing Writing, 66, 100981.
Olsen, J. K., Aleven, V., & Rummel, N. (2017). Exploring dual eye tracking as a tool to assess collaboration. In A. A. von Davier, M. Zhu, & P. C. Kyllonen (Eds.), Innovative assessment of collaboration (pp. 157–172). Springer International Publishing AG.
Olsen, J. K., Sharma, K., Rummel, N., & Aleven, V. (2020). Temporal analysis of multimodal data to predict collaborative learning outcomes. British Journal of Educational Technology, 51(5), 1527–1547.
Ong, N., Zhu, J., & Mossé, D. (2022). Towards including instructor features in student grade prediction. In A. Mitrovic, & N. Bosch (Eds.), 15th international conference on educational data mining (pp. 239–250). International Educational Data Mining Society. Available online: https://eric.ed.gov/?id=ED624131 (accessed on 22 November 2025).
Ontong, J. M. (2024). Do words matter: Investigating the association between linguistic features of accounting examinations and marks. South African Journal of Education, 44(2), 1–8.
Opoku, R. A., Pei, B., & Xing, W. (2025). Unveiling accuracy-fairness trade-offs: Investigating machine learning models in student performance prediction. Journal of Learning Analytics, 12(2), 125–139.
Ortega-Morla, J., Leis, A., Mallo, A., Moran-Fernandez, L., Guerreiro, S., Paz-Lopez, A., Perez-Sanchez, B., Sanchez-Marono, N., Rodriguez-Arias, A., Fontenla-Romero, O., & Bellas, F. (2025). ProgTutor: A robotic-based framework to support teaching and learning of programming fundamentals. IEEE Transactions on Learning Technologies, 18, 783–797.
Ouyang, F., Dai, X., & Chen, S. (2022). Applying multimodal learning analytics to examine the immediate and delayed effects of instructor scaffoldings on small groups’ collaborative programming. International Journal of STEM Education, 9(1), 45.
Ouyang, F., Xu, W., Liu, L., Cai, R., & Liu, J. (2024a). The influence of instructor support levels on collaborative knowledge construction. Learning, Culture and Social Interaction, 47, 100841.
Ouyang, F., Zhang, L., Wu, M., & Jiao, P. (2024b). Empowering collaborative knowledge construction through the implementation of a collaborative argument map tool. The Internet and Higher Education, 62, 100946.
Padrón-Rivera, G., Rebolledo-Mendez, G., Parra, P. P., & Huerta-Pacheco, N. S. (2016). Identification of action units related to affective states in a tutoring system for mathematics. Journal of Educational Technology & Society, 19(2), 77–86.
Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., Shamseer, L., Tetzlaff, J. M., Akl, E. A., Brennan, S. E., Chou, R., Glanville, J., Grimshaw, J. M., Hróbjartsson, A., Lalu, M. M., Li, T., Loder, E. W., Mayo-Wilson, E., McDonald, S., … Moher, D. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ, 372, n71.
Pan, L., Patterson, N., McKenzie, S., Rajasegarar, S., Wood-Bradley, G., Rough, J., Luo, W., Lanham, E., & Coldwell-Neilson, J. (2020). Gathering intelligence on student information behavior using data mining. Library Trends, 68(4), 636–658.
Pan, Z., Biegley, L., Taylor, A., & Zheng, H. (2024). A systematic review of learning analytics: Incorporated instructional interventions on learning management systems. Journal of Learning Analytics, 11(2), 52–72.
Pang, S., Zhang, Y., Zhang, J., Yang, Y., Sun, D., & Xiang, J. (2025). Automatic detection of students’ classroom behavior via long-term classroom videos to predict students’ learning gains. Education and Information Technologies, 30, 26961–26989.
Pardo, A., Han, F., & Ellis, R. A. (2016). Combining university student self-regulated learning indicators and engagement with online learning events to predict academic performance. IEEE Transactions on Learning Technologies, 10(1), 82–92.
Pardo, A., & Siemens, G. (2014). Ethical and privacy principles for learning analytics. British Journal of Educational Technology, 45(3), 438–450.
Pellegrino, J. W. (2014). Assessment as a positive influence on 21st-century teaching and learning: A systems approach to progress. Psicología Educativa, 20(2), 65–77.
Peng, Y., Wang, Y., & Hu, J. (2023). Examining ICT attitudes, use and support in blended learning settings for students’ reading performance: Approaches of artificial intelligence and multilevel model. Computers & Education, 203, 104846.
Pereira, F. D., Rodrigues, L., Henklain, M. H. O., Freitas, H., Oliveira, D. F., Cristea, A. I., Carvalho, L., Isotani, S., Benedict, A., Dorodchi, M., & de Oliveira, E. H. T. (2022). Toward human–AI collaboration: A recommender system to support CS1 instructors to select problems for assignments and exams. IEEE Transactions on Learning Technologies, 16(3), 457–472.
Picasso, F. (2024). Technology-enhanced assessment and feedback practices: A systematic literature review to explore academic development models. Research on Education and Media, 16(2), 2024.
Plumley, R. D., Bernacki, M. L., Greene, J. A., Kuhlmann, S., Raković, M., Urban, C. J., Hogan, K. A., Lee, C., Panter, A. T., & Gates, K. M. (2024). Co-designing enduring learning analytics prediction and support tools in undergraduate biology courses. British Journal of Educational Technology, 55(5), 1860–1883.
Prasad, L. T. V., Mythili, M., Balavivekanandhan, A., Sreela, B., Bordoloi, D., & Alphonse, F. R. (2024, October 3–5). AI-enhanced deep learning techniques for evaluating progress in English L2 learners. 2024 8th International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC) (pp. 1941–1947), Kirtipur, Nepal.
Premlatha, K. R., Dharani, B., & Geetha, T. V. (2016). Dynamic learner profiling and automatic learner classification for adaptive e-learning environment. Interactive Learning Environments, 24(6), 1054–1075.
Radović, S., & Seidel, N. (2025). Uncovering variations in learning behaviors and cognitive engagement among students with diverse learning goals and outcomes. Educational Technology Research and Development, 73(5), 2877–2895.
Rai, L., Sheng, K., & Liu, F. (2025, July 26–28). Automated essay assessment using generative AI: Evaluating DeepSeek’s performance in university-level grading. 2025 IEEE 8th International Conference on Electronic Information and Communication Technology (ICEICT) (pp. 242–247), Weihai, China.
Rantanen, P., Saari, M., Virta, U. T., & Abrahamsson, P. (2025, June 2–6). Toward AI evaluation of student essays. 2025 MIPRO 48th ICT and Electronics Convention (pp. 729–734), Opatija, Croatia.
Reid, D. P., & Drysdale, T. D. (2024). Student-facing learning analytics dashboard for remote lab practical work. IEEE Transactions on Learning Technologies, 17, 1037–1050.
Retnawati, H., Kardanova, E., Sumaryanto, S., Prasojo, L., Jailani, J., Arliani, E., Hidayati, K., Susanti, M., Lestari, H., Apino, E., Rafi, I., Rosyada, M., Tuanaya, R., Dewanti, S., Sotlikova, R., & Kassymova, G. (2024). A systematic review of the use of technology in educational assessment practices: Lesson learned and direction for future studies. International Journal of Robotics and Control Systems, 4(4), 1656–1693.
Roa Romero, Y., Tame, H., Holzhausen, Y., Petzold, M., Wyszynski, J.-V., Peters, H., Alhassan-Altoaama, M., Domanska, M., & Dittmar, M. (2021). Design and usability testing of an in-house developed performance feedback tool for medical students. BMC Medical Education, 21(1), 354.
Rodríguez, D., Guzman, M., Brito, P., & Llorens, R. (2025). Ecological validity of self-perceived voice quality and acoustic measures during voice assessments: An observational study on faculty teachers. Journal of Speech, Language, and Hearing Research, 68(2), 478–490.
Rodríguez, M. E., Guerrero-Roldán, A. E., Baneres, D., & Karadeniz, A. (2022). An intelligent nudging system to guide online learners. International Review of Research in Open and Distributed Learning, 23(1), 41–62.
Rohani, N., Gal, K., Gallagher, M., & Manataki, A. (2024). Providing insights into health data science education through artificial intelligence. BMC Medical Education, 24(1), 564.
Romero, C., & Ventura, S. (2020). Educational data mining and learning analytics: An updated survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(3), e1355.
Rubio, F., Thomas, J. M., & Li, Q. (2018). The role of teaching presence and student participation in Spanish blended courses. Computer Assisted Language Learning, 31(3), 226–250.
Saint, J., Whitelock-Wainwright, A., Gašević, D., & Pardo, A. (2020). Trace-SRL: A framework for analysis of microlevel processes of self-regulated learning from trace data. IEEE Transactions on Learning Technologies, 13(4), 861–877.
Sekeroglu, B., Dimililer, K., & Tuncal, K. (2019). Artificial intelligence in education: Application in student performance evaluation. Dilemas Contemporáneos: Educación, Política y Valores, 7(1), 1.
Selwyn, N. (2016). Education and technology: Key issues and debates. Bloomsbury Academic.
Sembey, R., Hoda, R., & Grundy, J. (2024). Emerging technologies in higher education assessment and feedback practices: A systematic literature review. Journal of Systems and Software, 211, 111988.
Serrano-Mamolar, A., Miguel-Alonso, I., Checa, D., & Pardo-Aguilar, C. (2023). Hacia una metodología de evaluación del rendimiento del alumno en entornos de aprendizaje iVR utilizando eye-tracking y aprendizaje automático. Comunicar: Revista Científica de Comunicación y Educación, 31(76), 9–20.
Shabara, R., ElEbyary, K., & Boraie, D. (2024). Teachers or ChatGPT: The issue of accuracy and consistency in L2 assessment. Teaching English with Technology, 24(2), 71–92.
Shermis, M. D. (2025). Using ChatGPT to score essays and short-form constructed responses. Assessing Writing, 66, 100988.
Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of automated essay evaluation: Current applications and new directions. Routledge.
Shute, V., Rahimi, S., & Smith, G. (2019). Game-based learning analytics in Physics Playground. In A. Tlili, & M. Chang (Eds.), Data analytics approaches in educational games and gamification systems (pp. 69–93). Springer.
Shute, V. J., Smith, G., Kuba, R., Dai, C.-P., Rahimi, S., Liu, Z., & Almond, R. (2021). The design, development, and testing of learning supports for the Physics Playground game. International Journal of Artificial Intelligence in Education, 31(3), 357–379.
Shute, V. J., & Ventura, M. (2013). Stealth assessment: Measuring and supporting learning in video games. MIT Press.
Siemens, G., & Baker, R. S. J. d. (2012). Learning analytics and educational data mining: Towards communication and collaboration. In 2nd international conference on learning analytics and knowledge (LAK’12) (pp. 252–254). Association for Computing Machinery.
Slade, S., & Prinsloo, P. (2013). Learning analytics: Ethical issues and dilemmas. American Behavioral Scientist, 57(10), 1510–1529.
Slater, S., & Baker, R. (2019). Forecasting future student mastery. Distance Education, 40(3), 380–394.
Sosa Neira, E. A., Salinas, J., & De Benito, B. (2017). Emerging technologies (ETs) in education: A systematic review of the literature published between 2006 and 2016. International Journal of Emerging Technologies in Learning, 12(5), 128–149.
Standen, P. J., Brown, D. J., Taheri, M., Galvez Trigo, M. J., Boulton, H., Burton, A., Hallewell, M. J., Lathe, J. G., Shopland, N., Blanco Gonzalez, M. A., Kwiatkowska, G. M., Milli, E., Cobello, S., Mazzucato, A., Traversi, M., & Hortal, E. (2020). An evaluation of an adaptive learning system based on multimodal affect recognition for learners with intellectual disabilities. British Journal of Educational Technology, 51(5), 1748–1765.
Steif, P. S., Fu, L., & Kara, L. B. (2016). Providing formative assessment to students solving multipath engineering problems with complex arrangements of interacting parts: An intelligent tutor approach. Interactive Learning Environments, 24(8), 1864–1880.
Steinbach, M., Fleckenstein, J., Kuklick, L., & Meyer, J. (2025). (De)motivating zero-performing students with negative feedback: Does the salience of performance information matter? Journal of Computer Assisted Learning, 41(4), e70070.
Stewart, J., Anthony, L., Batty, A. O., Nakamura, K., Nicklin, C., McLean, S., & Tomaru, K. (2025). Can we reliably score meaning recall vocabulary tests using AI? A comparison of human vs. AI scoring. Computer Assisted Language Learning, 1–23.
Suraworachet, W., Zhou, Q., & Cukurova, M. (2025). University students’ perceptions of a multimodal AI system for real-world collaboration analytics: Lessons learned from a case study. Journal of Computer Assisted Learning, 41(5), e70103.
Tadjer, H., Lafifi, Y., Seridi-Bouchelaghem, H., & Gülseçen, S. (2022). Improving soft skills based on students’ traces in problem-based learning environments. Interactive Learning Environments, 30(10), 1879–1896.
Talamás-Carvajal, J. A., Ceballos, H. G., & Hilliger, I. (2025). The facts behind the prophecy: Validating a methodology for identifying behavioural differences in higher education student subpopulations under intervention. Journal of Learning Analytics, 12(2), 211–223.
Tempelaar, D. (2017). How dispositional learning analytics helps understanding the worked-example principle. In 14th international conference on cognition and exploratory learning in digital age (CELDA 2017) (pp. 117–124). International Association for Development of the Information Society. Available online: https://eric.ed.gov/?id=ED579458 (accessed on 24 November 2025).
Tempelaar, D., Rienties, B., & Giesbers, B. (2024). Dispositional learning analytics and formative assessment: An inseparable twinship. International Journal of Educational Technology in Higher Education, 21(1), 57.
Topuz, A. C., Yıldız, M., Taşlıbeyaz, E., Polat, H., & Kurşun, E. (2025). Is generative AI ready to replace human raters in scoring EFL writing? Comparison of human and automated essay evaluation. Educational Technology & Society, 28(3), 36–50.
Udeozor, C., Chan, P., Russo Abegão, F., & Glassey, J. (2023). Game-based assessment framework for virtual reality, augmented reality and digital game-based learning. International Journal of Educational Technology in Higher Education, 20(1), 36.
Ulitzsch, E. (2022). Computational psychometrics: New methodologies for a new generation of digital learning and assessment. Psychometrika, 87(4), 1571–1574.
Vale, E., & Falloon, G. (2024). Using learning analytics to understand K–12 learner behavior in online video-based learning. Online Learning, 28(1), 44–68.
van Eck, N. J., & Waltman, L. (2010). Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics, 84(2), 523–538.
Van Leeuwen, A., & Rummel, N. (2020, March 23–27). Comparing teachers’ use of mirroring and advising dashboards. Tenth International Conference on Learning Analytics & Knowledge (pp. 26–34), Frankfurt, Germany.
Vignesh, S., Sharmitha, D. K. S., & Libisena, P. S. (2025, January 7–8). AI-powered students’ collaboration and evaluator using LDA. 2025 6th International Conference on Mobile Computing and Sustainable Informatics (ICMCSI) (pp. 1791–1796), Goathgaun, Nepal. Available online: https://ieeexplore.ieee.org/abstract/document/10883069/ (accessed on 24 November 2025).
Vilanti, T., Luiro, K., Dahlqvist, I., Piipponen, J., Hemminki-Reijonen, U., Tkalcan, S., Ketamo, H., & Koivisto, J. M. (2025). Contraception-related topics in chat dialogues between healthcare students and generative AI patients: A natural language processing analysis. BMC Medical Education, 25(1), 1458.
Villagrán, C., Nygaard, T., Gaete, M. I., Vera, M., & Cecilio-Fernandes, D. (2024). Enhancing feedback uptake and self-regulated learning in procedural skills training: Design and evaluation of a learning analytics dashboard. Journal of Learning Analytics, 11(2), 138–156.
Wang, D., Bian, C., & Chen, G. (2024). Using explainable AI to unravel classroom dialogue analysis: Effects of explanations on teachers’ trust, technology acceptance and cognitive load. British Journal of Educational Technology, 55(6), 2530–2556.
Wang, F., Cheung, A. C., Neitzel, A. J., & Chai, C. S. (2025a). Does chatting with chatbots improve language learning performance? A meta-analysis of chatbot-assisted language learning. Review of Educational Research, 95(4), 623–660.
Wang, F., Li, N., Cheung, A. C., & Wong, G. K. (2025b). In GenAI we trust: An investigation of university students’ reliance on and resistance to generative AI in language learning. International Journal of Educational Technology in Higher Education, 22(1), 59.
Wei, Y., Carvalho, P., & Stamper, J. (2025). KCluster: An LLM-based clustering approach to knowledge component discovery. arXiv, arXiv:2505.06469.
Wen, Y., & Song, Y. (2021). Learning analytics for collaborative language learning in classrooms. Educational Technology & Society, 24(1), 1–15.
Williamson, B. (2017). Big data in education: The digital future of learning, policy and practice. SAGE.
Wilson, A., Watson, C., Thompson, T. L., Drew, V., & Doyle, S. (2017). Learning analytics: Challenges and limitations. Teaching in Higher Education, 22(8), 991–1007.
Wilson, J., Huang, Y., Palermo, C., Beard, G., & MacArthur, C. A. (2021). Automated feedback and automated scoring in the elementary grades: Usage, attitudes, and associations with writing outcomes in a districtwide implementation of MI Write. International Journal of Artificial Intelligence in Education, 31(2), 234–276.
Wilson, M. (2018). Making measurement important for education: The crucial role of classroom assessment. Educational Measurement: Issues and Practice, 37(1), 5–20.
Wilson, M. (2023). Constructing measures: An item response modeling approach (2nd ed.). Routledge.
Wilson, M. (2024a). Finding the right grain-size for measurement in the classroom. Journal of Educational and Behavioral Statistics, 49(1), 3–31.
Wilson, M. (2024b). What makes measurement important for education? Educational Measurement: Issues and Practice, 43(4), 73–82.
Wilson, M., Gochyyev, P., & Scalise, K. (2016). Assessment of learning in digital interactive social networks: A learning analytics approach. Online Learning, 20(2), 97–119.
Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13(2), 181–208.
Wise, A. F., & Shaffer, D. W. (2015). Why theory matters more than ever in the age of big data. Journal of Learning Analytics, 2(2), 5–13.
Wools, S., Molenaar, M., & Hopster-den Otter, D. (2019). The validity of technology enhanced assessments—Threats and opportunities. In B. P. Veldkamp, & C. Sluijter (Eds.), Theoretical and practical advances in computer-based educational measurement (pp. 3–19). Springer.
Wu, J., Wang, J., Lei, S., Wu, F., & Gao, X. (2025). The impact of metacognitive scaffolding on deep learning in a GenAI-supported learning environment. Interactive Learning Environments, 33(9), 5166–5183.
Xu, J., Wei, T., & Lv, P. (2022, July 24–27). SQL-DP: A novel difficulty prediction framework for SQL programming problems. 15th International Conference on Educational Data Mining (pp. 86–97), Durham, UK. Available online: https://eric.ed.gov/?id=ED624132 (accessed on 27 November 2025).
Xu, W., & Ouyang, F. (2022). The application of AI technologies in STEM education: A systematic review from 2011 to 2021. International Journal of STEM Education, 9(1), 59. [Google Scholar] [CrossRef]
Yang, C. C. Y., & Ogata, H. (2023). Personalized learning analytics intervention approach for enhancing student learning achievement and behavioral engagement in blended learning. Education and Information Technologies, 28(3), 2509–2528. [Google Scholar] [CrossRef]
Yang, T. C. (2023). Application of artificial intelligence techniques in analysis and assessment of digital competence in university courses. Educational Technology & Society, 26(1), 232–243. [Google Scholar] [CrossRef]
Yang, T. C., Chen, M. C., & Chen, S. Y. (2018). The influences of self-regulated learning support and prior knowledge on improving learning performance. Computers & Education, 126, 37–52. [Google Scholar] [CrossRef]
Yang, Y., Du, Y., van Aalst, J., Sun, D., & Ouyang, F. (2020). Self-directed reflective assessment for collective empowerment among pre-service teachers. British Journal of Educational Technology, 51(6), 1961–1981. [Google Scholar] [CrossRef]
Yiğiter, M., & Boduroğlu, E. (2025). Examining the performance of artificial intelligence in scoring students’ handwritten responses to open-ended items. Education and Science, 50, 1–18. [Google Scholar] [CrossRef]
Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education—Where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 1–27. [Google Scholar] [CrossRef]
Zhang, K., & Aslan, A. B. (2021). AI technologies for education: Recent research & future directions. Computers and Education: Artificial Intelligence, 2, 100025. [Google Scholar] [CrossRef]
Zhang, K., Yılmaz, R., Ustun, A. B., & Karaoğlan Yılmaz, F. G. (2023). Learning analytics in formative assessment: A systematic literature review. Journal of Measurement and Evaluation in Education and Psychology, 14, 359–381. [Google Scholar] [CrossRef]
Zhang, L., Weitlauf, A. S., Amat, A. Z., Swanson, A., Warren, Z. E., & Sarkar, N. (2020). Assessing social communication and collaboration in autism spectrum disorder using intelligent collaborative virtual environments. Journal of Autism and Developmental Disorders, 50(1), 199–211. [Google Scholar] [CrossRef]
Zhao, F., Gaschler, R., Schnotz, W., & Wagner, I. (2020). Regulating distance to the screen while engaging in difficult tasks. Frontline Learning Research, 8(6), 59–76. [Google Scholar] [CrossRef]
Zhao, R., Zhuang, Y., Zou, D., Xie, Q., & Yu, P. L. H. (2023). AI-assisted automated scoring of picture-cued writing tasks for language assessment. Education and Information Technologies, 28(6), 7031–7063. [Google Scholar] [CrossRef]

Figure 1. The four building blocks of Constructing Measures (M. Wilson, 2023).

Figure 2. Inclusion and exclusion criteria.

Figure 3. PRISMA 2020 flow diagram. Note. * EBSCOhost includes Education Full Text and ERIC, and duplicate records between these two databases were automatically removed by the system before export.

Figure 4. Annual publication output by region (2016–2025). Note. Region is defined as the area where the empirical data were collected or the study was conducted, as reported in each publication. The solid line with numeric labels indicates the annual total number of publications. Stacked bars show yearly publication counts for the ten most frequent single-region categories, with all remaining single-region records aggregated into Other Regions. Multi-region denotes publications involving two or more regions. NS denotes publications with unspecified regions. Search coverage spans 1 January 2016 to 29 October 2025. Any 2026 entries reflect preprint posting dates after the search cut-off.

Figure 5. Geographical distribution of the included studies. Note. Orange circles indicate the number of included studies in each region; circle size is proportional to publication counts. Blue shading represents the year of the first publication on this topic in each region, with darker shades indicating earlier adoption. Regions with no identified studies are shown in grey.

Figure 6. Distribution of research methods, education levels, subject areas, and sample sizes. Note. Different colors distinguish the four category dimensions in the Sankey diagram, and node labels show the percentage of studies in each category relative to all included studies. Flow widths indicate the number of studies between adjacent categories, with wider flows representing larger counts.

Figure 7. Keyword co-occurrence network of the included corpus. Note. (a) presents the full keyword co-occurrence network (minimum link strength = 0), while (b) presents a filtered network (minimum link strength = 10) to reduce visual density and improve readability. Node size reflects keyword occurrence frequency, line thickness reflects co-occurrence strength, and colors indicate keyword clusters.

Figure 8. Co-occurrences of emerging technology categories. Note. Red points represent study-level networks (one point per study). Labeled nodes denote ET categories. Line segments indicate co-occurrence connections between ET categories; thicker/darker lines represent stronger co-occurrence. The network is projected into a two-dimensional space, where the axes (SVD1 and SVD2) are the projection dimensions, and the percentages on the axes indicate the proportion of variance explained by each dimension (SVD1 = 11.2%; SVD2 = 9.6%). Ellipses in labels reflect abbreviated display of longer category names.

Figure 9. Grain size distribution of emerging technology categories used in educational measurement. Note. Colors distinguish the three grain sizes: micro, meso, and macro.

Figure 10. Distribution of emerging technology categories across the four building blocks at the micro level. Note. Colors distinguish the four building blocks: construct map, item design, outcome space, and measurement model.

Figure 11. Distribution of emerging technology categories across the four building blocks at the meso level. Note. Colors distinguish the four building blocks: construct map, item design, outcome space, and measurement model.

Figure 12. Distribution of emerging technology categories across the four building blocks at the macro level. Note. Colors distinguish the four building blocks: construct map, item design, outcome space, and measurement model.

Table 1. Search facets and terms.

Facet	Search Terms	Rationale
Educational context	educat*; learn; teach; pedagog*; student; instruct; school; college; university; class*; K-12; K12	Identify studies situated in formal education across primary, secondary, and higher education.
Measurement	measur*; assess; evaluat*; test; exam	Identify any form of educational measurement, assessment, evaluation, testing, or exams.
Emerging technology	“emerg* technolog*”; AI; “artificial intelligence”; “learning analytics”; “virtual reality”; “augmented reality”; “mixed reality”; “intelligent tutor”; “adaptive learning”	Identify a broad set of ETs used in education.

Note. The asterisk (*) indicates truncation used in the search string to capture multiple word forms, such as educat* for education, educational, and educator.
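The truncation behavior described in the note can be illustrated with a minimal sketch. It assumes simple shell-style wildcard matching through Python's standard `fnmatch` module; bibliographic databases apply their own truncation rules, so the helper `matches_truncated` and the sample word list are hypothetical and only show the idea.

```python
from fnmatch import fnmatchcase


def matches_truncated(term: str, word: str) -> bool:
    """Return True if `word` matches `term`, where * stands for any ending."""
    # Lowercase both sides so the comparison is case-insensitive.
    return fnmatchcase(word.lower(), term.lower())


# Hypothetical sample words; only the first three share the stem "educat".
words = ["education", "educational", "educator", "edit"]
hits = [w for w in words if matches_truncated("educat*", w)]
print(hits)  # ['education', 'educational', 'educator']
```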

Table 2. Emerging technology coding scheme and definitions.

ET Category (Code)	Operational Definition	Typical Indicators in Text
Generative artificial intelligence/Large language model systems	Implemented generative artificial intelligence or a large language model to generate, transform, or interpret language or code as part of the measurement workflow.	Generative artificial intelligence (GenAI); large language model (LLM); retrieval-augmented generation (RAG); agent-based chatbot
Machine learning & deep learning (non-LLM)	Implemented machine learning or deep learning models used for prediction, classification, or representation learning when the core model is not an LLM.	Machine learning (ML); deep learning (DL); artificial neural network/deep neural network (ANN/DNN); support vector machine (SVM); random forest
Natural language processing (non-LLM)	Implemented non-LLM language processing used as measurement evidence (feature extraction, classification, or text analytics).	Natural language processing (NLP); term frequency-inverse document frequency (TF-IDF); rule-based text analysis; linguistic feature extraction
Automated scoring & feedback systems	Implemented scoring and/or feedback pipeline that converts evidence into scores and/or actionable feedback, regardless of the underlying model family.	automated scoring; automated grading; automated feedback; auto-evaluation pipeline
Learning analytics/Educational data mining	Implemented analysis of learner process data (logs/traces) that produces indicators, predictions, or monitoring outputs used for measurement or decision-making.	Learning analytics (LA); educational data mining (EDM); dashboards; early-warning indicators; log-based analytics
Knowledge tracing & learner modeling	Implemented modeling of learner knowledge states to infer mastery or trajectories over time.	Knowledge tracing (KT); Bayesian knowledge tracing (BKT); deep knowledge tracing (DKT); additive factors model (AFM); hidden Markov model (HMM)
Adaptive systems & Intelligent tutoring systems	Implemented systems that adapt instruction, practice, or support based on inferred learner state (instructional adaptation).	Intelligent tutoring system (ITS); adaptive learning technology (ALT); AI tutor; personalized adaptive system
Computer-adaptive assessment & test delivery	Implemented adaptive assessment administration focused on measurement delivery (routing/item selection).	Computer-adaptive testing (CAT); computer-adaptive assessment; adaptive test delivery
Multimodal & sensor-based measurement	Implemented multimodal sensing and/or fusion used as measurement evidence.	Multimodal learning analytics (MMLA); electroencephalography (EEG); functional near-infrared spectroscopy (fNIRS); electrodermal activity (EDA); multimodal fusion
Speech technologies	Implemented speech-based evidence capture and/or processing used for measurement.	Automatic speech recognition (ASR); speech analytics; transcription-based evidence capture; text-to-speech (TTS) when used in dialog-based measurement
Computer vision	Implemented image/video-based evidence capture and/or processing used for measurement (e.g., posture, action, facial or behavioral cues).	Computer vision (CV); video analytics; image recognition; facial/action detection for measurement
Immersive/Simulation & Extended reality	Implemented immersive or simulation environments where virtual/augmented/mixed/extended reality interaction is central to performance and evidence generation.	Virtual reality (VR); augmented reality (AR); mixed reality (MR); extended reality (XR); immersive virtual reality (IVR); virtual patient simulation

Table 3. Distribution of tags by grain size, building block, and emerging technology.

Analytic Coding Dimension	Code	n	%
Grain size	Micro	839	88.88
	Meso	95	10.06
	Macro	10	1.06
Building block	Construct map	50	3.19
	Item design	157	10.01
	Outcome space	649	41.39
	Measurement model	712	45.41
Emerging technology category	Generative AI/LLM systems	159	8.27
	Machine learning & deep learning (ML & DL)	336	17.48
	Natural language processing (NLP)	37	1.93
	Automated scoring & feedback systems	274	14.26
	Learning analytics & educational data mining (LA & EDM)	499	25.96
	Knowledge tracing & learner modeling	39	2.03
	Adaptive systems & intelligent tutoring systems	145	7.54
	Computer-adaptive assessment & test delivery	9	0.47
	Multimodal & sensor-based measurement	130	6.76
	Speech technologies	141	7.34
	Computer vision	54	2.81
	Immersive/simulation & extended reality	99	5.15

Note. Counts (n) reflect tags rather than unique studies because a single study could receive multiple codes. Percentages are computed within each analytic coding dimension.
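As a quick check on how the percentages in Table 3 relate to the tag counts, the sketch below recomputes the grain size shares. It assumes each percentage is a code's share of all tags within its own coding dimension, rounded to two decimals; the function name `tag_shares` is illustrative and not taken from the article.

```python
def tag_shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert tag counts into percentage shares of the dimension total."""
    total = sum(counts.values())
    return {code: round(100 * n / total, 2) for code, n in counts.items()}


# Grain size tag counts as reported in Table 3.
grain_size = {"Micro": 839, "Meso": 95, "Macro": 10}
shares = tag_shares(grain_size)
print(shares)  # {'Micro': 88.88, 'Meso': 10.06, 'Macro': 1.06}
```

The grain size tags sum to 944, which exceeds the 933 included studies. This is what the note describes: some studies carry more than one tag.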

Table 4. Emerging technologies across grain sizes and building blocks.

Grain Size	Building Block	Block Total ET Tags	Block % Within Grain	Block Entropy H	Block HHI	Top1 ET	Top1 n	Top2 ET	Top2 n	Top3 ET	Top3 n
Micro	Construct map	107	3.5	3.319	0.108	Learning analytics & educational data mining	20	Automated scoring & feedback systems	15	Machine learning & deep learning	13
	Item design	313	10.3	3.059	0.132	Automated scoring & feedback systems	74	Learning analytics & educational data mining	59	Immersive/simulation & extended reality	42
	Outcome space	1238	40.8	3.07	0.13	Learning analytics & educational data mining	296	Automated scoring & feedback systems	206	Machine learning & deep learning	176
	Measurement model	1378	45.4	3.16	0.123	Learning analytics & educational data mining	347	Machine learning & deep learning	259	Automated scoring & feedback systems	180
Meso	Construct map	7	2.4	2.236	0.224	Machine learning & deep learning	2	Generative AI/LLM systems	1	Learning analytics & educational data mining	1
	Item design	18	6.3	2.503	0.188	Machine learning & deep learning	5	Automated scoring & feedback systems	3	Speech technologies	3
	Outcome space	123	42.9	2.34	0.249	Learning analytics & educational data mining	64	Machine learning & deep learning	30	Generative AI/LLM systems	7
	Measurement model	139	48.4	2.529	0.184	Learning analytics & educational data mining	50	Machine learning & deep learning	46	Speech technologies	10
Macro	Construct map	0	0	0	0	—	0	—	0	—	0
	Item design	2	9.1	1	0.5	Generative AI/LLM systems	1	Automated scoring & feedback systems	1	Machine learning & deep learning	0
	Outcome space	6	27.3	1.459	0.361	Machine learning & deep learning	3	Learning analytics & educational data mining	2	Speech technologies	1
	Measurement model	14	63.6	2.039	0.276	Machine learning & deep learning	7	Learning analytics & educational data mining	4	Automated scoring & feedback systems	1

Note. Entropy (H) reflects the diversity of ET tags within a building block. The Herfindahl–Hirschman Index (HHI) reflects concentration and is computed as $\sum_{i} p_{i}^{2}$, where $p_{i}$ is the within-block proportion of ET category $i$; larger HHI values indicate greater concentration. Top1, Top2, and Top3 ETs report the three most frequently coded ET categories used in the specific building block within each grain size.
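The two indices in Table 4 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes Shannon entropy with a base-2 logarithm (the article does not state the base, but base 2 reproduces H = 1 for the macro-level item design row, where two tags are split evenly), and it assumes that an empty block is reported as 0 for both indices. The function names are hypothetical.

```python
import math


def shannon_entropy(counts: list[int]) -> float:
    """Shannon entropy (base 2, an assumption) of ET tag counts in one block."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumed convention for a block with no tags
    # Use 0.0 - sum(...) so a single-category block yields 0.0 rather than -0.0.
    return 0.0 - sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def hhi(counts: list[int]) -> float:
    """Herfindahl-Hirschman Index: sum of squared within-block proportions."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumed convention for a block with no tags
    return sum((n / total) ** 2 for n in counts)


# Macro-level item design row in Table 4: two ET categories with one tag each.
macro_item_design = [1, 1]
print(shannon_entropy(macro_item_design))  # 1.0
print(hhi(macro_item_design))              # 0.5
```

Higher entropy means tags are spread across more ET categories, while higher HHI means a few categories dominate the block. The two indices therefore tend to move in opposite directions.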

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

