Entry

Assessment Analytics in Digital Assessments

by Okan Bulut 1,* and Seyma N. Yildirim-Erbasli 2

1 Measurement, Evaluation, and Data Science, Faculty of Education, University of Alberta, Edmonton, AB T6G 2G5, Canada
2 Department of Psychology, Concordia University of Edmonton, Edmonton, AB T5B 4E4, Canada
* Author to whom correspondence should be addressed.
Encyclopedia 2026, 6(4), 81; https://doi.org/10.3390/encyclopedia6040081
Submission received: 28 February 2026 / Revised: 26 March 2026 / Accepted: 30 March 2026 / Published: 2 April 2026
(This article belongs to the Section Social Sciences)

Definition

The rapid expansion of digital and technology-enhanced assessments has enabled the capture of far more than final responses or total scores. As learners navigate traditional formats, such as multiple-choice, short-answer, and performance tasks, digital delivery platforms routinely capture response times, response revisions, navigation patterns, and item-level metadata. More advanced formats, including interactive simulations, scenario-based tasks, and game-based assessments, further record fine-grained actions such as mouse clicks, keystrokes, hint requests, sequence of operations, and decision pathways. These increasingly rich data streams provide a multidimensional view of test-taker behavior, offering evidence about cognitive processes, strategy use, persistence, and motivation that goes beyond what correctness alone can reveal. Assessment analytics refers to the systematic collection, integration, and analysis of such data generated during the assessment process. In practice, this emerging field combines principles from psychometrics, learning analytics, data science, and human-computer interaction to evaluate the quality, validity, and fairness of assessments in digital environments. The ultimate goal of assessment analytics is to produce actionable evidence about how assessments measure what they intend to measure in contemporary, technology-rich educational contexts.

1. Introduction

The digitization of assessment has fundamentally altered the evidentiary basis on which inferences about learning and proficiency are made. Traditionally, educational measurement relied primarily on outcome data: item responses, total scores, and scale scores derived from psychometric models. These indicators summarized performance at the level of correctness or proficiency but provided limited visibility into how responses were generated. With the migration of assessments to digital platforms, this evidentiary structure has expanded. Computer-based environments routinely capture detailed interaction traces, including response times, revision behaviors, navigation sequences, tool use, and other process indicators. As a result, assessment no longer produces scores alone; it generates structured records of behavior unfolding over time.
This shift creates the conditions for what may be described as assessment analytics: an emerging interdisciplinary field concerned with the systematic analysis of process data to inform measurement, validation, design, and decision-making in assessment contexts. Assessment analytics draws conceptually and methodologically from learning analytics, yet it operates under distinct constraints. Whereas learning analytics often focuses on optimizing instructional environments in ongoing educational settings [1], assessment analytics is centrally concerned with evidentiary arguments about learning outcomes. Table 1 compares learning and assessment analytics across the primary purpose, core question, theoretical foundation, primary data sources, unit of analysis, and interpretive constraints.
The conceptual contribution of assessment data lies in their potential to provide behavioral evidence linked to underlying cognitive processes. For example, response time patterns can help distinguish rapid guessing from effortful responding [2]; revision behaviors may signal monitoring or uncertainty [3]; navigation sequences may reveal alternative solution strategies [4]. Importantly, these indicators do not directly measure cognition [5]. Rather, they provide observable behavioral proxies that, when theoretically justified, may enrich evidentiary claims about how performance is produced.
Assessment analytics recognizes that digitization transforms assessments into temporally structured, data-rich environments while maintaining the central measurement question: What claims about knowledge, skills, or competencies are supported by the available evidence? By situating analytics within established validity frameworks, assessment analytics seeks to harness process data not as an end in itself, but as an extension of the evidentiary logic that underlies educational measurement. Compared with other relevant fields, such as learning analytics and educational data mining, assessment analytics occupies a distinct conceptual space within the broader landscape of educational data science. For instance, while learning analytics broadly encompasses the collection and analysis of student-related data from learning environments to optimize learning processes and provide individualized experiences [6], assessment analytics functions as a targeted subset of learning analytics with a more specific focus (i.e., monitoring, collecting, and interpreting data generated specifically within assessment systems). Similarly, while educational data mining tends to prioritize algorithmic and computational methods for pattern discovery across large educational datasets, assessment analytics is explicitly grounded in assessment and feedback theory, using those frameworks to guide both the interpretation of patterns and the design of subsequent interventions [7]. Taken together, these distinctions underscore that assessment analytics is not merely a methodological application of learning analytics to assessment data or a set of computational methods for exploring hidden patterns in assessment data. Rather, it is a conceptually independent field that requires its own theoretical grounding in the assessment, feedback, and measurement traditions.

2. Data in Digital Assessments

Today’s digital assessments generate substantially more information than traditional paper-and-pencil assessments. In addition to outcome measures (e.g., item responses and total scores), they capture fine-grained process data that document how learners engage with assessment tasks, interfaces, and embedded tools. Such data include response times, clickstream records, cursor movements, navigation patterns, and other trace indicators of interaction [8]. The availability of these process data creates significant opportunities to advance understanding of learners’ cognitive and behavioral strategies during assessment [9]. At the same time, the richness and volume of these data introduce new layers of methodological and interpretive complexity that require careful theoretical framing and analytic rigor.

2.1. Data Types

Digital assessments vary in format, ranging from selected-response items (e.g., multiple-choice and true/false questions) to interactive simulations and collaborative problem-solving tasks. Across these formats, they typically generate several broad categories of data, reflecting both outcomes and processes of engagement.
Outcome data are the most familiar and traditionally emphasized form of assessment evidence. These data include item responses (selected options, constructed text, numerical entries), partial-credit scores, total scores, and rubric-based ratings assigned by human raters or automated scoring systems. In selected-response formats, responses may be dichotomously scored as correct or incorrect (e.g., 0 = Incorrect and 1 = Correct), or polytomously to capture varying levels of accuracy (e.g., 0 = Not correct, 1 = Partially correct, and 2 = Fully correct). In constructed-response tasks, rubric-based scores reflect evaluative judgments about the quality, accuracy, coherence, or completeness of a response.
Digital assessment platforms routinely capture detailed timing data, including item-level response times, latencies between successive actions, and overall time-on-task. These temporal indicators provide insight into patterns of engagement and may reflect processing efficiency, automaticity, persistence, or strategic allocation of effort. Extremely short latencies may be associated with rapid guessing or disengagement [10], whereas unusually long response times may signal difficulty, confusion, or distraction [11,12].
Yet timing indicators are inherently multifaceted and context-dependent. Speed observed through digital assessment platforms reflects not only cognitive fluency but also reading fluency, familiarity with digital interfaces, motor coordination, and test-taking strategies [13,14]. In assessments that impose explicit time constraints, speed may become an integral component of the target construct. In other contexts, however, differential variation in speed may introduce construct-irrelevant variance [15]. Thus, the interpretation of timing indicators must be explicitly aligned with the intended role of speed in the assessment design and the theoretical definition of the construct being measured.
Navigation data includes records of item revisits, skipping behaviors, scrolling patterns, and pathway sequences through multi-part or interactive tasks [4,16]. These forms of process data are particularly salient in non-linear assessments and digital environments that allow flexible progression across items or task components. Such patterns can provide evidence of strategic planning, for example, when learners initially skip more difficult items and return to them later, or of metacognitive monitoring, as reflected in revisiting flagged questions. In more complex or exploratory tasks, navigation behaviors may also signal systematic exploration or iterative refinement of solutions. Sequence data, in particular, may reveal whether learners follow expected solution pathways or adopt alternative strategies that diverge from the intended task model [4].
Digital assessment platforms also capture extensive metadata that contextualizes both performance and process indicators. These metadata include item-level features such as content domain or cognitive demand, as well as information about the testing device and platform (e.g., browser type and screen size), and characteristics of the administration context (e.g., remote versus in-person testing conditions [17]). Item-level metadata enables analysts to link observed behaviors to specific content features or task characteristics, thereby supporting more refined inferences about strategy use and difficulty [17]. Device characteristics may systematically influence response times [18], and screen size has been shown to affect performance in certain contexts (e.g., [3]). Similarly, the provision of accommodations may alter patterns of interaction with assessment materials (e.g., [19]).
Digital assessment systems may also record detailed revision and editing traces, including answer changes, erasures, backspace use, editing distance between drafts, and complete version histories [20,21]. These indicators provide further insights into the dynamic evolution of responses and can indicate different aspects of self-regulation during task completion. Patterns of revision may signal self-monitoring, reflective evaluation, uncertainty, or shifts in strategy as learners refine their answers over time. For instance, frequent revisions in writing tasks may reflect iterative planning, elaboration, and refinement of ideas [22]. In selected-response contexts, answer changes may indicate reconsideration or re-evaluation of initially chosen options [23]. Revision traces also capture more superficial or mechanical behaviors. Backspace use, for example, may reflect routine typing corrections rather than substantive conceptual change [24].

2.2. Interpretive Cautions

Digital assessments generate a multidimensional record of performance encompassing outcomes, processes, and contextual conditions. Harnessing the analytic potential of this richness requires careful attention to data structure, principled feature engineering, and explicit acknowledgement of interpretive boundaries. Although these analytic opportunities are often substantial, there is a parallel responsibility to ensure that resulting inferences remain grounded in construct theory and evidentiary argumentation [23,24]. The granularity of process data can create an illusion of precision that exceeds what the underlying construct representation can support.
Importantly, not all cognitive processes leave observable traces [25]. The absence of recorded action does not imply the absence of reasoning, deliberation, or mental simulation. Furthermore, both performance and behavior are shaped by interface design and device characteristics [26]. Differences in input modality (e.g., mouse versus touch input), keyboard familiarity, screen size, and scrolling mechanics can influence timing, navigation, and revision traces. Differences in digital fluency may introduce systematic variance unrelated to the intended construct [27]. Effective tool use may reflect both domain understanding and prior exposure to similar interfaces; learners with limited familiarity may exhibit longer latencies or less efficient pathways independent of their conceptual knowledge.
These concerns are particularly salient when time constraints are imposed. Under tightly speeded conditions, timing measures may conflate proficiency with processing speed or test-related anxiety. Even in assessments that are not explicitly speeded, learners may differ in their subjective perceptions of time pressure [28]. If rapid responding is not central to the construct definition, heavy reliance on timing-based indicators risks introducing construct-irrelevant variance [15]. Accordingly, the integration of process indicators into validity arguments must be explicitly aligned with the theoretical role of time, technology, and interaction within the assessment framework [29].
The availability of high-volume data does not, in itself, strengthen inferential claims. The inclusion of additional features may increase model complexity without yielding commensurate gains in predictive accuracy or interpretive clarity [30]. Thus, more data and more elaborate models do not automatically translate into more valid conclusions. The quality of evidence depends not only on the quantity of available indicators but also on the strength and defensibility of the linkage between observed behaviors and theoretically defined constructs [31,32]. Process data meaningfully expand the evidentiary base for understanding performance; however, their integration into digital assessment systems requires careful modeling practices, transparent preprocessing decisions, and explicitly articulated validity arguments.

3. Analytic Approaches

Digital assessment data support a wide spectrum of analytic methods. Rather than organizing approaches by statistical technique alone, it is more useful to classify them according to the types of claims they are intended to support. Some analyses aim to describe patterns of behavior; others seek to inform measurement models; and still others focus on modeling sequences or predicting outcomes.

3.1. Descriptive and Diagnostic Analytics

Descriptive and diagnostic analytics summarize and characterize observed behavior without necessarily making strong claims about underlying traits. In digital assessments, these analyses often involve profiling typical response patterns, summarizing timing distributions, mapping navigation frequencies, or visualizing interaction logs (e.g., [33]). For example, analysts may compute average response times, rates of item revisits, frequencies of hint requests, or common paths through tasks (e.g., [10]). These analyses are often implemented using techniques such as summary statistics, distributional analysis, clustering, and anomaly detection methods [10,33].
A key application of descriptive analytics is detecting aberrant patterns and disengagement signals [2]. Extremely rapid responding, uniform answer patterns, prolonged inactivity, or inconsistent navigation behaviors may indicate low effort, random responding, or technical difficulties. While no single indicator definitively establishes disengagement, combinations of signals can inform quality control procedures. These methods are often used to flag responses for review, inform data cleaning decisions, or generate indicators of response validity [34]. At the item level, they can reveal unexpectedly long response times, high omission rates, or unusual distractor selections, suggesting potential issues with item clarity, difficulty calibration, or alignment with the target construct. At the platform level, they can identify interface bottlenecks, confusing navigation structures, or technical friction points. However, descriptive patterns should not be over-interpreted without a clear theoretical framework. They indicate what happened, not necessarily why.
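For illustration, a minimal sketch of such descriptive summaries and simple anomaly flags might look as follows in Python. The log format, the column names (student_id, item_id, response_time, correct, revisits), and the 2 s rapid-response cutoff are hypothetical choices made for this example rather than properties of any particular platform.

import pandas as pd

# Hypothetical item-level log; file and column names are illustrative only.
log = pd.read_csv("item_log.csv")  # columns: student_id, item_id, response_time, correct, revisits

# Item-level descriptive summaries: typical timing, accuracy, and revisit patterns.
item_summary = log.groupby("item_id").agg(
    median_rt=("response_time", "median"),
    p_correct=("correct", "mean"),
    mean_revisits=("revisits", "mean"),
)

# Simple anomaly flags: responses faster than 2 s or slower than the item's 99th percentile.
p99 = log.groupby("item_id")["response_time"].transform(lambda x: x.quantile(0.99))
log["rapid_flag"] = log["response_time"] < 2.0
log["slow_flag"] = log["response_time"] > p99

# Student-level disengagement signal: proportion of rapid responses per student.
student_rapid = log.groupby("student_id")["rapid_flag"].mean()
print(item_summary.head())
print(student_rapid.describe())

Summaries of this kind describe behavior; as noted above, they do not by themselves explain why a given pattern occurred.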

3.2. Measurement-Oriented Modeling

Beyond summarizing behavior, process data can play different roles within measurement models. In this family of approaches, indicators such as response time, revision counts, hint usage, and navigation features are integrated with outcome data to support proficiency inference [8]. For example, timing information may be modeled jointly with accuracy data to distinguish rapid guessing from engaged responding [2,35]. Common approaches include joint models of response accuracy and response time and mixture models that distinguish latent classes of responders [2].
In some applications, assessment data are incorporated as collateral information that improves the precision or robustness of proficiency estimates without redefining the construct being measured (e.g., joint modeling of accuracy and response time; [36]). Furthermore, such data may be specified as additional dimensions within multidimensional models. Within this modeling framework, the behaviors themselves represent construct-relevant competencies (e.g., joint ability speed models; [37]). In other cases, they are used to model latent classes (e.g., effortful vs. non-effortful responders), thereby protecting the validity of inferences [2]. These different uses carry distinct interpretive consequences and therefore require differentiated validity arguments.
Incorporating process data introduces additional modeling assumptions. Joint models of accuracy and response time typically assume specific distributions, conditional independence structures, and relationships between latent speed and ability [35]. Misspecification of these assumptions may bias parameter estimates or distort score interpretations. Also, item response behavior is often correlated with contextual factors such as item position [38], complicating the identification of construct-relevant variance.
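To make these assumptions concrete, consider one widely used formulation, sketched here for illustration rather than as the specific model used in the cited studies: a two-parameter logistic model for accuracy paired with a lognormal model for response time, with person parameters linked through a bivariate normal distribution.

\[
P(U_{ij} = 1 \mid \theta_i) = \frac{\exp\{a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}}, \qquad
\ln T_{ij} \mid \tau_i \sim N\!\left(\beta_j - \tau_i,\; \alpha_j^{-2}\right), \qquad
(\theta_i, \tau_i) \sim N_2(\mu_P, \Sigma_P),
\]

where U_ij and T_ij are the accuracy and response time of person i on item j; θ_i is ability and τ_i is speed; a_j and b_j are item discrimination and difficulty; β_j is the item's time intensity; and α_j governs the dispersion of log response times. U_ij and T_ij are assumed conditionally independent given (θ_i, τ_i), and the correlation in Σ_P captures the population-level relationship between speed and ability. Each of these components is an assumption that can be misspecified, which is precisely the concern raised above.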
The integration of digital assessment data into measurement models must be guided by a clear argument: What construct is being measured, why are these indicators relevant to it, and how do they enhance (rather than distort) inference? Without such justification, technical sophistication alone does not ensure validity.

3.3. Sequence and Process-Oriented Analyses

Learning unfolds over time, and many digital tasks generate ordered streams of actions that cannot be fully understood through final scores alone [39]. Digital environments such as simulations, inquiry tasks, and problem-based scenarios record temporally structured data, enabling analysis of how learners progress through tasks. Sequence and process-oriented analyses focus on the ordering of events and transitions between states rather than isolated behaviors [40]. This temporal perspective is widely used in assessment analytics to examine strategy use and behavioral patterns in complex tasks. Methodologically, these analyses often draw on sequence analysis, Markov models, or state-transition modeling to capture temporal dependencies in action streams [41].
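As a simple illustration of state-transition modeling, the following Python sketch estimates a first-order transition matrix from ordered action streams. The action labels and the three short sequences are hypothetical placeholders, not data from any particular assessment.

from collections import Counter
from itertools import pairwise  # Python 3.10+

import pandas as pd

# Hypothetical action sequences, one ordered list of events per test taker.
sequences = [
    ["open_item", "use_calculator", "respond", "next_item"],
    ["open_item", "respond", "revise", "respond", "next_item"],
    ["open_item", "request_hint", "respond", "next_item"],
]

# Count first-order transitions (a simple Markov view of the action stream).
transitions = Counter()
for seq in sequences:
    transitions.update(pairwise(seq))

states = sorted({s for seq in sequences for s in seq})
matrix = pd.DataFrame(0.0, index=states, columns=states)
for (a, b), n in transitions.items():
    matrix.loc[a, b] = n

# Row-normalize to transition probabilities; rows with no outgoing transitions stay at zero.
row_sums = matrix.sum(axis=1)
probs = matrix.div(row_sums.where(row_sums > 0, 1), axis=0)
print(probs.round(2))

Comparing such matrices across groups of learners is one simple way to operationalize the aggregation described next.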
By aggregating and comparing action sequences across learners, researchers can detect recurring behavioral patterns. These patterns may reflect different strategies, levels of engagement, or approaches to task completion. For example, high-performing learners may follow weekly tasks more consistently than low-performing learners [42]. Identifying such differences can support theory development about strategy variation in digital environments.
Sequence methods are particularly useful in tasks that allow multiple solution paths because, in such contexts, correctness alone does not capture meaningful variation in performance. Two learners may receive the same score but follow different trajectories through the task. Sequence analysis provides evidence about process differences that complements outcome measures. Furthermore, sequence analysis can reveal where learners commonly pause, repeat actions, or show error patterns [43]. Such patterns can inform task evaluation and design improvement.

3.4. Predictive Models

Predictive modeling uses statistical or machine learning methods—such as regression models, decision trees, random forests, gradient boosting, or neural networks—to forecast outcomes from observed data. In digital assessment contexts, predictive models may estimate the likelihood of future performance or identify students at risk of low achievement [44]. These applications can support monitoring, triage, and assessment design.
Uses of prediction include early warning systems that help allocate instructional support [45] or monitor engagement patterns that signal potential disengagement [46]. In these cases, predictive accuracy is valuable, even if the model does not provide deep explanatory insights. The primary goal is utility, supporting decisions with acceptable error rates and transparent performance metrics. However, predictive models have limitations when used for individual labeling or high-stakes decisions without interpretability. High predictive accuracy does not guarantee fairness, construct validity, or stability across contexts. Models trained on one cohort may not generalize to another due to differences in curriculum, platform design, demographic composition, or instructional practices.
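A minimal predictive-modeling sketch, in the spirit of the transparent performance reporting described above, might look as follows. The feature matrix and labels are simulated placeholders, and the random forest is only one of the model families mentioned; no claim is made that this mirrors any specific operational system.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Simulated placeholder data: rows are students, columns are engineered indicators
# (e.g., mean response time, rapid-guess rate, revision counts); y flags low achievement.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = rng.integers(0, 2, size=500)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# Report several cross-validated metrics rather than accuracy alone.
scores = cross_validate(model, X, y, cv=5,
                        scoring=["accuracy", "recall", "precision", "roc_auc"])
for metric in ["test_accuracy", "test_recall", "test_precision", "test_roc_auc"]:
    print(metric, round(scores[metric].mean(), 3))

Even with such reporting, the generalization and fairness caveats noted above still apply; cross-validation within one cohort says little about transfer to a different curriculum or platform.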

3.5. Recent Developments

Recent advances in artificial intelligence (AI), particularly in large language models (LLMs), have expanded the analytic possibilities in digital assessment environments. These developments build on earlier work in educational data mining and learning analytics but introduce new capabilities for automating aspects of scoring and generating adaptive or interactive assessment experiences.
Automated scoring has advanced significantly with AI-driven methods. In constructed-response tasks, natural language processing techniques are used to evaluate written responses [47], while in simulation-based tasks, models can evaluate performance based on sequences of actions rather than final answers alone [48]. Another area of development is adaptive and interactive assessment design [49]. AI techniques enable more flexible task environments in which item selection, feedback, or task progression can respond dynamically to learner behavior. AI-driven systems can model complex problem spaces and capture multiple solution pathways. These approaches offer scalability and efficiency but require careful validation to ensure construct alignment, consistency across populations, and resistance to bias [48].

4. Validity, Fairness, and Responsible Use

4.1. Validity Arguments

Digital assessments expand the observable record of performance. Traditionally, assessment inference relied primarily on outcome data (e.g., correctness). Process data introduces additional observable behaviors that may bear on the construct of interest. A central question in assessment analytics is not what can be modeled, but what claims can be justified. The use of process data must ultimately be justified within a validity framework. Validity theory conceptualizes validity not as a property of a test, but as an argument linking observed performances to intended score interpretations and uses [50]. From this perspective, process data—such as response times, revision patterns, navigation sequences, or tool use—do not automatically constitute evidence. They become evidence only when incorporated into a coherent interpretive argument.
If response time is used to filter rapid guessing, the argument must establish that extremely short latencies reflect disengagement rather than high proficiency [10]. If attempt frequency is incorporated into diagnostic feedback, evidence must support the claim that attempt behavior is meaningfully related to the intended construct [9].

4.2. Fairness and Accessibility

Process data raises well-documented fairness concerns. Interface design, device type [18], familiarity with digital tools [29], and accessibility accommodations (e.g., [19]) may influence interaction patterns. Such factors can introduce the possibility that behavioral indicators reflect technological fluency or environmental constraints rather than the intended construct.
In digital assessments, concerns arise not only from item content or score interpretation, but from the analytic treatment of process data. Digital assessments record fine-grained interaction patterns, which are subsequently transformed into engineered features and incorporated into models. For such indicators to support valid and fair inference, they must function equivalently across groups and administration contexts. The fairness literature in both educational measurement and algorithmic decision-making emphasizes the importance of examining subgroup error rates, false positive rates, and model stability across populations such as countries (e.g., [51]).
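In practice, such subgroup checks can be computed directly from model outputs. The sketch below contrasts false positive rates across two groups for a hypothetical set of predictions; the group labels and values are illustrative only, and a real analysis would use larger samples and additional metrics.

import pandas as pd

# Hypothetical evaluation table: true label, model prediction, and a grouping variable
# (e.g., country or accommodation status); names and values are illustrative only.
eval_df = pd.DataFrame({
    "y_true": [0, 0, 1, 1, 0, 1, 0, 0],
    "y_pred": [0, 1, 1, 0, 0, 1, 1, 0],
    "group":  ["A", "A", "A", "B", "B", "B", "B", "A"],
})

def false_positive_rate(df: pd.DataFrame) -> float:
    """FPR = predicted positives among the true negatives within a subgroup."""
    negatives = df[df["y_true"] == 0]
    if negatives.empty:
        return float("nan")
    return float((negatives["y_pred"] == 1).mean())

# Compare false positive rates across subgroups; large gaps warrant further review.
fpr_by_group = eval_df.groupby("group").apply(false_positive_rate)
print(fpr_by_group)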
The central issue in assessment analytics is not whether behavioral differences exist, but how those differences are interpreted and operationalized. When digital traces are converted into analytic indicators, the risk of construct-irrelevant variance can increase if contextual influences are not carefully examined.

4.3. Ethics, Privacy, and Governance

As analytic models become more complex, transparency becomes increasingly important. Predictive accuracy alone is insufficient for assessment contexts in which interpretations affect learners. The literature on explainable models and algorithmic accountability strongly supports the need for interpretability [52].
Digital assessments generate fine-grained process data. In assessment analytics, the secondary analysis of these data—beyond their original operational purpose—is common, as interaction logs are frequently repurposed for research, model development, quality monitoring, and system improvement (e.g., [53]). While such secondary uses can enhance understanding and strengthen analytical models, they should be governed by clear policies that specify the scope of permissible analyses, conditions of access, and data retention procedures. Principles of data minimization remain relevant in this context: only information necessary for defined and justified assessment purposes should be retained, shared, or incorporated into analytic models [54].
As assessment analytics increasingly relies on automated modeling and predictive systems, human oversight remains essential to ensure responsible use. Analytic outputs should function as decision-support tools rather than autonomous decision-makers, particularly in contexts where results inform consequential interpretations. Maintaining a human-in-the-loop approach allows experts to review model outputs, evaluate anomalies, and consider contextual information that may not be captured in the data [55]. Ongoing human monitoring also supports quality assurance by detecting model drift, unintended subgroup effects, or changes in data distributions over time. Analytic systems should be designed to expand rather than constrain human agency by strengthening transparency and supporting metacognitive engagement and teacher judgment [56]. Embedding analytics within structured oversight frameworks strengthens accountability, preserves professional responsibility, and helps ensure that automated processes remain aligned with ethical and validation principles.

5. Implementation Considerations

Process data can support a range of practical applications in assessment systems, from design refinement to operational monitoring and reporting. These applications are most defensible when grounded in clear validity arguments and transparent implementation procedures.

5.1. Assessment Design and Improvement

One of the most well-established applications of assessment data is the refinement of interfaces and tools [57]. Trace evidence can identify unexpected response patterns, prolonged latencies, high omission rates, or systematic navigation difficulties. These indicators can inform iterative improvement of assessment systems. Navigation traces and interaction logs can support usability evaluation. Repeated back-and-forth movements, excessive scrolling, or frequent tool misselection may indicate design friction. Evidence from human-computer interaction research suggests that interface design influences behavior, and process data provides a systematic way to identify design-related barriers [58]. Assessment data can also inform feedback mechanisms [57]. In formative contexts, interaction traces may help determine whether learners engage with feedback tools, revisit explanations, or use hints strategically. When aligned with learning objectives, such information can guide the design of adaptive support.
Another application of process data is item or task refinement. Educational measurement has long relied on item statistics to improve assessments, and assessment analytics extends this base by adding process data such as response times, revision patterns, and navigation behaviors.

5.2. Operational Monitoring and Quality Assurance

Beyond design improvement, process data plays a significant role in operational monitoring and quality assurance [59]. Because digital systems automatically log interactions, they enable continuous evaluation of platform reliability, administration consistency, and data integrity. System logs can detect irregularities such as incomplete sessions, delayed submissions, or technical interruptions. Such indicators help identify infrastructure problems that might affect the comparability of scores. In large-scale assessments, ensuring stable delivery conditions is central to maintaining measurement equivalence across administrations.
Process data can also support the detection of irregular behaviors. Extremely short response times, repetitive patterns, or anomalous navigation sequences may indicate disengagement, technical errors, or atypical responding [10]. While such indicators do not automatically imply invalid performance, they can inform review procedures and data quality checks. This aligns with established practices in test security and quality assurance, now enhanced by fine-grained digital evidence.

6. Two Case Studies of Assessment Analytics

6.1. A Hypothetical University Assessment

To illustrate how assessment analytics can be applied in a real-life setting, we will first consider a hypothetical case in higher education. An undergraduate “Introduction to Research Methods” course is offered at a mid-sized university in a blended format: students attend two in-person sessions per week and interact regularly with course materials and assessments through Canvas, a widely used Learning Management System (LMS). The course enrolls 312 students across four sections. Over the 13-week semester, students complete ten low-stakes weekly quizzes embedded in the LMS, each consisting of 15 four-option multiple-choice items. Quizzes are available for a 48 h window; students may attempt each quiz up to twice (with the higher score recorded), and together the quizzes account for 15% of the final course grade.
This setting is particularly well-suited for an assessment analytics illustration for three reasons. First, the low-stakes nature of the quizzes makes disengaged responding likely [60], making it a context where engagement analysis adds genuine interpretive value. Second, the LMS captures a rich record of every student’s interaction with the platform, including content access, timing, and navigation, creating multiple complementary streams of process data. Third, the analytical methods required are neither proprietary nor computationally intensive, making the approach replicable by practitioners and researchers at institutions with standard LMS infrastructure.

6.1.1. Data Sources

Two data streams are used in this hypothetical analysis: quiz response logs and LMS event logs. The quiz platform records a timestamped log for every item response, including the student identifier, the item identifier, the response selected, a correctness indicator (1/0), the attempt number, and whether the student changed their answer before submitting. Response times are derived by computing the elapsed time between consecutive item submissions within a quiz session. After removing implausible values (<0.5 s and >600 s), the dataset retains 46,800 item responses per attempt across 312 students, 10 quizzes, and 15 items each. Using the Canvas API, all platform interactions for the 312 students across the semester are extracted. Events are categorized as quiz access events (open, submit, review), content access events (page views, file downloads, video plays), communication events (discussion posts, announcement views), and performance-monitoring events (gradebook views). After removing administrative logins, the cleaned dataset contains approximately 214,000 timestamped events.
The two datasets are linked via an anonymized student identifier. For each student, a combined feature set is constructed: mean item response time, proportion of rapid guesses, an effort index (per quiz and per attempt), quiz attempt timing (hours before the deadline), and whether the student accessed any course content in the 24 h before each quiz.
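A minimal sketch of this derivation and linkage might look as follows in Python. The file names, column names (e.g., a submitted_at timestamp in the quiz log), and the exact join logic are assumptions made for illustration; they stand in for whatever a specific LMS export actually provides.

import pandas as pd

# Hypothetical exports; file and column names are illustrative only.
resp = pd.read_csv("quiz_responses.csv",
                   parse_dates=["submitted_at"])  # student_id, quiz_id, attempt, item_id, correct, submitted_at
events = pd.read_csv("lms_events.csv",
                     parse_dates=["timestamp"])   # student_id, event_type, timestamp

# Derive item response times as gaps between consecutive submissions within a quiz attempt,
# then drop implausible values (< 0.5 s or > 600 s). The first item of each attempt has no
# preceding timestamp and is therefore excluded by the filter.
resp = resp.sort_values(["student_id", "quiz_id", "attempt", "submitted_at"])
resp["rt"] = (resp.groupby(["student_id", "quiz_id", "attempt"])["submitted_at"]
                  .diff().dt.total_seconds())
resp = resp[(resp["rt"] >= 0.5) & (resp["rt"] <= 600)]

# Example student-level feature; the rapid-guess and effort indicators are defined in Section 6.1.2.
features = resp.groupby("student_id").agg(mean_rt=("rt", "mean"))

# Pre-quiz content access (any content event within 24 h of each quiz deadline) would be
# derived from the events table and joined on the anonymized student identifier; omitted here.
print(features.head())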

6.1.2. Analytical Approach

The analysis proceeds in three steps. Step 1 focuses on identifying disengaged responses. The Normative Threshold (NT) method [61] is applied using a 20% (NT20) threshold to classify each item response as engaged (response time ≥ 20% of the average item response time) or rapidly guessed (response time < 20% of the average item response time). Because the quizzes are completed in an unproctored online environment, the response time distributions are highly variable; a response time ceiling of 600 s is applied before computing item means to prevent inflated thresholds caused by paused sessions.
Step 2 aims to quantify effort across attempts. The Response Time Effort (RTE) index [62] is computed for each student on each quiz attempt by averaging the binary engagement scores across items. RTE values are computed separately for Attempt 1 and Attempt 2 for the 187 students who used the second-attempt option, enabling a direct comparison of effort across attempts.
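A minimal sketch of Steps 1 and 2, assuming a cleaned item-level log with illustrative column names, might look as follows:

import pandas as pd

# Assumes the cleaned response log from Section 6.1.1 with columns student_id, quiz_id,
# attempt, item_id, and rt (response time in seconds); names are illustrative only.
resp = pd.read_csv("cleaned_responses.csv")

# Step 1: Normative Threshold (NT20). A response is engaged if its time is at least 20%
# of the mean response time for that item; otherwise it is flagged as a rapid guess.
item_mean_rt = resp.groupby("item_id")["rt"].transform("mean")
resp["engaged"] = (resp["rt"] >= 0.20 * item_mean_rt).astype(int)

# Step 2: Response Time Effort (RTE). For each student, quiz, and attempt, RTE is the mean
# of the binary engagement scores across items (1.0 indicates fully engaged responding).
rte = (resp.groupby(["student_id", "quiz_id", "attempt"])["engaged"]
           .mean()
           .rename("rte")
           .reset_index())

# Compare effort across the two attempts for students who used both.
rte_wide = rte.pivot_table(index=["student_id", "quiz_id"], columns="attempt", values="rte")
print(rte_wide.head())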
Step 3 involves profiling students using cluster analysis. To synthesize the multi-indicator feature set into interpretable engagement profiles, a k-means cluster analysis is performed using four student-level features aggregated across all ten quizzes: mean RTE, mean proportion of NT20-flagged responses, mean quiz attempt timing (hours before deadline), and proportion of quizzes with pre-quiz content access. Features are standardized prior to clustering. The optimal number of clusters is selected using the elbow method and confirmed with the average silhouette coefficient.
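Step 3 could be sketched as follows with scikit-learn; the feature file, column names, and the final choice of k = 3 are assumptions made for illustration, since the actual number of profiles would be selected from the elbow and silhouette diagnostics.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical student-level table with the four indicators described in Step 3:
# mean_rte, mean_nt20_flag, mean_hours_before_deadline, prop_prequiz_access.
features = pd.read_csv("student_features.csv")

X = StandardScaler().fit_transform(features)

# Evaluate candidate cluster counts with inertia (elbow) and average silhouette width.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={sil:.3f}")

# Fit the chosen solution (k = 3 here, purely as an example) and profile the clusters.
final = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
features["profile"] = final.labels_
print(features.groupby("profile").mean().round(2))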

6.1.3. Implications for Practice

While the potential of assessment analytics is evident from the example above, translating these methods into routine educational practice raises a number of substantive challenges that practitioners should anticipate.
First, the quality of process data is uneven and context-dependent. In the case study, response times were derived from submission timestamps rather than from continuous event streams that would capture cursor movement, time-on-page, or keystroke sequences. This is a common situation in practice because most LMS platforms and quiz tools log outcomes rather than fine-grained behavioral events. The consequence is that response time, as typically available, reflects when a student submitted an answer but not what they were doing in the intervening period. A student who opens a quiz item, leaves their desk for five minutes, returns, reads the item, and answers quickly will produce one long recorded response time even though the actual reading-and-answering episode was brief, a value that is difficult to interpret correctly. Practitioners should be transparent about exactly what their response time variable measures and resist treating submission-gap timestamps as equivalent to reading-and-reasoning time.
Second, threshold selection in the NT method is consequential but not uniquely determined. As illustrated in the study, the choice of a 20% threshold impacts the number of responses flagged as disengaged, which in turn changes estimates of student effort, item-level disengagement rates, and ultimately any scores or decisions derived from motivation-filtered data. There is no single correct threshold: the NT method was designed as a practical heuristic, and different thresholds will be appropriate depending on item format, time limits, student population, and the specific decisions at stake. Practitioners who apply the NT method without acknowledging this sensitivity risk presenting engagement classifications as more objective than they are. A prudent approach is to report results under at least two threshold values and examine how sensitive the substantive conclusions are to that choice. If the interpretation changes substantially, the conclusions should be stated with corresponding caution.
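A simple sensitivity check of this kind, shown here with illustrative NT10 and NT20 thresholds and the same hypothetical log as above, might look as follows:

import pandas as pd

# Assumes the cleaned response log from Section 6.1.2; column names are illustrative only.
resp = pd.read_csv("cleaned_responses.csv")
item_mean_rt = resp.groupby("item_id")["rt"].transform("mean")

# Proportion of responses flagged as rapid guesses under two candidate thresholds.
for label, threshold in [("NT10", 0.10), ("NT20", 0.20)]:
    flagged = (resp["rt"] < threshold * item_mean_rt).mean()
    print(f"{label}: {flagged:.1%} of responses flagged")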
Third, disengagement and ability are difficult to disentangle. A student who answers an item in two seconds may be guessing randomly or may genuinely know the answer and respond decisively. A student who spends forty seconds on an item before selecting the wrong answer may be deeply engaged but struggling or may be distracted and choosing randomly. Neither response time nor response accuracy, considered alone, can resolve this ambiguity. The combination of the two can improve classification, but it does not eliminate the problem entirely. In practice, this means that engagement flags should be treated as probabilistic indicators rather than definitive classifications, and that consequential decisions (e.g., removing flagged responses from scoring, adjusting grades, or identifying students for intervention) should not be made based on a single indicator or a single quiz administration.
Lastly, ethical and equity considerations are not peripheral. Collecting and analyzing detailed process data about students raises questions that go beyond technical validity. Students may not be aware that their LMS navigation patterns, quiz-attempt timing, and response speeds are being analyzed to infer their level of engagement. Even where data collection is disclosed in a course syllabus, the practical implications of being profiled may not be fully transparent to students. Additionally, if engagement profiles are used to identify students for intervention, there is a risk that structural disadvantages, such as unreliable internet access, which can artifactually inflate response times, or time zone differences in asynchronous courses, which can make late quiz attempts appear as disengagement, are misread as individual behavioral deficits. Responsible implementation of assessment analytics requires not only technical rigor but also careful attention to how engagement data is communicated to students, what actions it triggers, and whether those actions are equitable across the diverse circumstances students bring to their learning.

6.2. National Assessment of Educational Progress (NAEP)

The National Assessment of Educational Progress (NAEP), widely known as the Nation’s Report Card, has been the primary measure of what U.S. students know and can do across core academic subjects for over five decades. In 2017, the NAEP mathematics assessment was administered digitally for the first time for grades 4 and 8, providing students with an engaging experience that aligned with the delivery mode of many other large-scale assessments. This transition from paper to digital delivery fundamentally expanded the evidentiary basis for the assessment by generating process data that had not existed in any prior NAEP administration.
The data logged within the testing system include student interactions with the assessment platform and the assessment tasks; they consist of time-stamped records of student-initiated actions, such as clicking an assessment feature like a drawing tool, and actions generated automatically by the system, such as entering an item. The information extracted from these data represents students’ interactions during the assessment and the processes used to arrive at answers, including the time students take to complete specific tasks, the steps or strategies they use to solve problems, and the resources or tools they utilize during the assessment [63].
The digital platform also embeds a range of Universal Design Elements directly into the assessment interface. These include color contrast and theme options, zooming, text-to-speech functionality that can read directions, text, figures, and tables aloud, a scratchwork and highlighter tool for annotating figures and performing computations, an onscreen equation editor for specific items, and a calculator for designated assessment blocks [64]. Because all of these tools are built into the platform rather than offered as separate accommodations, students’ use of each one is logged as part of the process data stream, meaning that researchers can examine not only whether a student used the text-to-speech function or the highlighter, but when they used it, for how long, and in relation to which items.
Extended time is one of the most widely used testing accommodations, yet the process by which it is allocated has long been critiqued for lacking an objective, empirical foundation. The study by Ogut et al. [65] used NAEP process data to address this gap directly, asking whether the behavioral traces students produce in the early minutes of an assessment can predict which students will run out of time before the assessment ends.
The central problem motivating the study is a misalignment between accommodation policy and students’ actual needs. On one side of this misalignment, many students who are formally granted extended time do not use it: among all students granted the extended time accommodation in the 2017 NAEP Grade 8 Mathematics assessment, only 25.1% actually used time beyond the standard 30 min limit, and approximately 72% of students with disabilities granted extended time did not use it at all. On the other side, a substantial proportion of students without any extended-time accommodation were still actively working when the timeout message appeared on their screens, indicating that they needed more time but had no formal access to it.

6.2.1. Data Sources

The study [65] drew on two restricted-use datasets from the 2017 NAEP Grade 8 Mathematics assessment: process data logs and response data from approximately 28,000 participants. The key behavioral measure derived from the process data was the receipt of a timeout message while a student was actively interacting with an item (i.e., a binary indicator treated as evidence that the student would have benefited from additional time). Because NAEP allows students to navigate through the assessment in any order, the researchers defined student “interactions” in a way that was agnostic to item order, capturing each instance of a student entering and exiting any item regardless of its position in the test. For the first ten such interactions, they recorded the total time a student spent from entering the item to exiting it (exit time) and the total number of actions taken, encompassing response selections, text field interactions, calculator key presses, and scratchwork adjustments. These interaction-level features formed the predictive input for the machine learning models.

6.2.2. Analytical Approach

The researchers used a decision-tree-based ensemble method optimized using Bayesian hyperparameter tuning to predict which students would receive a timeout message, based on features from their first 10 item interactions [65]. Ten sequential models were estimated, each incorporating information from one additional interaction: the first used only exit time and actions from the first item interaction, the second added data from the second interaction, and so on through the tenth. This sequential design addressed a practically important question: how early in the assessment can the model identify students at risk, and how does predictive accuracy change as more behavioral data accumulates?
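The following sketch illustrates the cumulative-feature design only; it uses simulated placeholder data and scikit-learn's GradientBoostingClassifier rather than the study's actual ensemble, Bayesian hyperparameter tuning, or the restricted NAEP datasets.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Simulated placeholders: exit time and action count for the first 10 item interactions
# per student, plus a binary timeout indicator loosely tied to total time for illustration.
rng = np.random.default_rng(0)
n = 2000
exit_times = rng.lognormal(mean=3.0, sigma=0.5, size=(n, 10))
actions = rng.poisson(lam=6, size=(n, 10))
timeout = (exit_times.sum(axis=1) > np.quantile(exit_times.sum(axis=1), 0.7)).astype(int)

X_full = np.hstack([exit_times, actions])
X_train, X_test, y_train, y_test = train_test_split(X_full, timeout, random_state=0)

# Sequential models: model k uses only the first k interactions (exit time and action count each).
for k in range(1, 11):
    cols = list(range(k)) + list(range(10, 10 + k))  # first k exit times, first k action counts
    clf = GradientBoostingClassifier(random_state=0).fit(X_train[:, cols], y_train)
    pred = clf.predict(X_test[:, cols])
    print(f"interactions={k:2d}  recall={recall_score(y_test, pred):.2f}  "
          f"accuracy={accuracy_score(y_test, pred):.2f}")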
The results demonstrated that early assessment behavior is a viable and surprisingly potent predictor of the need for extended time. The model built on the first item interaction alone achieved a recall rate of 98%, correctly identifying most students who would eventually receive a timeout message, though at the cost of a high false positive rate and an accuracy of only 32%. The SHAP (Shapley Additive Explanations) analysis of feature importance revealed that the single most influential predictor in the full ten-interaction model was exit time on the tenth item interaction, followed by whether the student had been granted an extended time accommodation, and the number of actions during the eighth interaction.
Notably, demographic background variables (e.g., eligibility for free or reduced-price lunch, English Language Learner status, and disability status) contributed minimally to the model’s predictions. From both validity and fairness perspectives, this finding is analytically significant. It means the model identifies extended time needs primarily through behavioral evidence of how students pace themselves during the assessment, rather than through categorical group membership. A student’s engagement with the items, not their demographic label, drives the prediction, making demographic-based differential prediction highly unlikely.

6.2.3. Implications for Practice

Overall, this study makes a compelling case for integrating assessment analytics into digital testing systems to deliver accommodations more equitably and responsively. Rather than relying solely on pre-assigned eligibility categories determined before the assessment begins, a data-driven approach could allow a testing system to identify students who are falling behind in real time and adjust their conditions accordingly. At the same time, the study highlights important interpretive cautions that apply equally to this and other applications of assessment analytics. The model predicts who will receive a timeout message, but a timeout message itself serves as a proxy for needing extended time. That is, a student who receives one may have been moving slowly because of a genuine need, momentary distraction, or an unexpectedly challenging item type. The question of when to intervene (e.g., not so early as to disrupt students who are simply thoughtful) remains one that assessment analytics can inform but not resolve on their own.

7. Conclusions

Assessment analytics represents a meaningful extension of educational measurement into digitally rich environments. Digital assessment platforms generate temporally structured, behaviorally detailed records of how learners engage with tasks, tools, and interfaces. When interpreted carefully and situated within established validity frameworks, such data can expand the evidentiary basis for understanding proficiency, monitoring engagement, and improving assessment design [9].
Each of the analytic approaches reviewed in this study (i.e., descriptive profiling, measurement-oriented modeling, sequence analysis, and predictive modeling) offers distinct capabilities and carries distinct interpretive responsibilities. Across all of them, a consistent principle holds: the value of process data lies not in their volume or granularity alone, but in the coherence of the argument linking observed behavior to intended constructs.
Several cross-cutting concerns warrant sustained attention as the field matures. First, validity arguments must be explicit about what behavioral indicators represent and why they are theoretically relevant [31]. Second, fairness analyses must examine whether trace features function equivalently across groups that differ in device access [9,52], digital fluency, or accommodation status. Third, privacy governance must bound the secondary use of interaction logs through clearly articulated purposes and data minimization. Lastly, human oversight must remain integral to analytic workflows, particularly when outputs inform consequential decisions.
In conclusion, assessment analytics does not alter the foundational question of educational measurement (i.e., what can be validly inferred from the available evidence?), but it expands the evidence available for answering it. Realizing this potential responsibly requires integrating technical rigor, theoretical grounding, and sustained attention to the ethical dimensions of data-rich assessment practice.

Directions for Future Research

Assessment analytics is a rapidly evolving field, and the directions in which it will develop are shaped as much by changes in assessment practice as by advances in methodology. Several interrelated areas warrant sustained research attention in the years ahead.
Perhaps the most consequential shift currently underway concerns the redesign of assessments in response to generative AI. As educators move away from traditional closed-response formats toward more open-ended, process-oriented, and authentic tasks, the nature of the process data these assessments produce will change substantially. A student composing a written argument, working through a multi-step simulation, or engaging in a dialog-based task generates a qualitatively richer record than a student selecting among four response options. Future research must develop engagement indicators suited to these new formats. The methods mentioned in the case study above, including the NT method for detecting rapid guessing and response-time-based effort indices, were designed primarily for selected-response contexts; their applicability to open-ended, AI-mediated task environments is largely untested and cannot be assumed. This is not a minor limitation: the validity of any engagement indicator depends on a coherent argument connecting observed behavior to the underlying construct, and that argument must be rebuilt from the ground up when the assessment format changes substantially.
Fairness and differential validity are cross-cutting concerns that deserve to be treated as primary research objectives rather than afterthoughts. Trace features such as response time may not function equivalently across groups that differ in device access, digital fluency, accommodation status, or language background. A student completing an assessment on a low-powered device may produce systematically longer response latencies than a peer on a faster machine, not because they are less engaged or less proficient, but because of a hardware constraint entirely outside their control. Similarly, students who are multilingual, who use assistive technologies, or whose motor coordination affects input speed may produce behavioral profiles that are misread by engagement detection algorithms trained on data from more homogeneous populations. Future research should explicitly examine the differential validity of engagement indicators across these groups, drawing on fairness frameworks established in psychometrics and extending them to the newer class of process-based indicators introduced by assessment analytics.
Finally, as assessment analytics increasingly relies on automation through advanced AI techniques, the question of how to maintain meaningful human oversight of analytic outputs becomes more pressing. We argue that analytic systems should function as decision-support tools rather than autonomous decision-makers, and that human-in-the-loop workflows are essential for detecting model drift, evaluating anomalous outputs, and incorporating contextual information that data alone cannot capture. Future research should examine how such oversight can be implemented at scale in practice, how analytic dashboards and reports can be designed to support rather than supplant professional judgment, and how students themselves can be engaged as active participants in understanding and contesting the inferences drawn from their behavioral data. The goal of assessment analytics, as articulated in this paper, is to expand the evidentiary basis for answering the foundational question of educational measurement: what can be validly inferred from the available evidence? Realizing that goal responsibly requires that the expansion of evidence be matched by a corresponding expansion of interpretive rigor, ethical care, and accountability to the learners whose educational futures are ultimately at stake.

Author Contributions

Conceptualization, O.B. and S.N.Y.-E.; writing—original draft preparation, O.B. and S.N.Y.-E.; writing—review and editing, O.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Long, P.; Siemens, G. Penetrating the fog: Analytics in learning and education. EDUCAUSE Rev. 2011, 46, 30–40. [Google Scholar]
  2. Yildirim-Erbasli, S.N.; Bulut, O. Designing predictive models for early prediction of students’ test-taking engagement in computerized formative assessments. J. Appl. Test. Technol. 2023, 24, 34–47. [Google Scholar]
  3. Bridgeman, B.; Lennon, M.L.; Jackenthal, A. Effects of Screen Size, Screen Resolution, And Display Rate on Computer-Based Test Performance. ETS Res. Rep. Ser. 2001, 2001, i-23. [Google Scholar] [CrossRef]
  4. van Bakel, W.T.H. Navigating Exams: Identifying Test-Taking Navigation Behaviour. Master’s Thesis, Utrecht University, Utrecht, The Netherlands, 2021. [Google Scholar]
  5. Larmuseau, C.; Cornelis, J.; Lancieri, L.; Desmet, P.; Depaepe, F. Multimodal learning analytics to investigate cognitive load during online problem solving. Br. J. Educ. Technol. 2020, 51, 1548–1562. [Google Scholar] [CrossRef]
  6. Ifenthaler, D. Learning analytics. In The SAGE Encyclopedia of Educational Technology; Spector, J.M., Ed.; Sage Publications: Thousand Oaks, CA, USA, 2015; Volume 2, pp. 447–451. [Google Scholar]
  7. Ellis, C. Broadening the scope and increasing the usefulness of learning analytics: The case for assessment analytics. Br. J. Educ. Technol. 2013, 44, 662–664. [Google Scholar] [CrossRef]
  8. He, S.; Cui, Y. A systematic review of the use of log-based process data in computer-based assessments. Comput. Educ. 2025, 228, 105245. [Google Scholar] [CrossRef]
  9. Bulut, O.; Yildirim-Erbasli, S.N.; Gorgun, G. Assessment analytics for digital assessments: Identifying, modeling, and interpreting behavioral engagement. In Assessment Analytics in Education—Designs, Methods and Solutions; Sahin, M., Ifenthaler, D., Eds.; Springer: Cham, Switzerland, 2024; pp. 35–60. [Google Scholar] [CrossRef]
  10. Yildirim-Erbasli, S.N.; Bulut, O. The impact of students’ test-taking effort on growth estimates in low-stakes educational assessments. Educ. Res. Eval. 2021, 26, 368–386. [Google Scholar] [CrossRef]
  11. Jerez, D.; Mazzullo, E.; Bulut, O. Exploring slow responses in international large-scale assessments using sequential process analysis. Computers 2026, 15, 64. [Google Scholar] [CrossRef]
  12. Lehman, B.; Graesser, A. To resolve or not to resolve? That is the big question about confusion. In International Conference on Artificial Intelligence in Education; Springer International Publishing: Cham, Switzerland, 2015; pp. 216–225. [Google Scholar]
  13. Ghafournia, N.; Afghari, A. The interaction between reading comprehension cognitive test-taking strategies, test performance, and cognitive language learning strategies. Procedia-Soc. Behav. Sci. 2013, 70, 80–84. [Google Scholar] [CrossRef]
  14. Lovett, B.J.; Lewandowski, L.J.; Potts, H.E. Test-taking speed: Predictors and implications. J. Psychoeduc. Assess. 2017, 35, 351–360. [Google Scholar] [CrossRef]
  15. Leighton, J.P.; Gokiert, R.J. The cognitive effects of test item features: Informing item generation by identifying construct irrelevant variance. In Proceedings of the Annual Meeting of the National Council on Measurement in Education, Montreal, QC, Canada, 12–14 April 2005. [Google Scholar]
  16. Bayrak, F.; Aydın, F.; Yurdugül, H. Navigational behavior patterns of learners on dashboards based on assessment analytics. In Visualizations and Dashboards for Learning Analytics; Springer International Publishing: Cham, Switzerland, 2021; pp. 251–268. [Google Scholar]
  17. Kuhfeld, M.; Soland, J. Using assessment metadata to quantify the impact of test disengagement on estimates of educational effectiveness. J. Res. Educ. Eff. 2020, 13, 147–175. [Google Scholar] [CrossRef]
  18. Camara, W.J.; Harris, D.J. Impact of technology, digital devices, and test timing on score comparability. In Integrating Timing Considerations to Improve Testing Practices; Routledge: London, UK, 2020; pp. 104–121. [Google Scholar]
  19. Lang, S.C.; Elliott, S.N.; Bolt, D.M.; Kratochwill, T.R. The effects of testing accommodations on students’ performances and reactions to testing. Sch. Psychol. Q. 2008, 23, 107. [Google Scholar] [CrossRef]
  20. Bishop, S.; Egan, K. Detecting erasures and unusual gain scores: Understanding the status quo. In Handbook of Quantitative Methods for Detecting Cheating on Tests; Routledge: London, UK, 2016; pp. 193–213. [Google Scholar]
  21. Sinharay, S.; Johnson, M.S. Three new methods for analysis of answer changes. Educ. Psychol. Meas. 2017, 77, 54–81. [Google Scholar] [CrossRef] [PubMed]
  22. Roscoe, R.D.; Snow, E.L.; Allen, L.K.; McNamara, D.S. Automated detection of essay revising patterns: Applications for intelligent feedback in a writing tutor. Grantee Submiss. 2015, 10, 59–79. [Google Scholar]
  23. Bridgeman, B. A simple answer to a simple question on changing answers. J. Educ. Meas. 2012, 49, 467–468. [Google Scholar] [CrossRef]
  24. Deane, P.; Zhang, M.; Hao, J.; Li, C. Using keystroke dynamics to detect nonoriginal text. J. Educ. Meas. 2026, 63, e12431. [Google Scholar] [CrossRef]
  25. Ahn, B.T.; Harley, J.M. Facial expressions when learning with a queer history app: Application of the control value theory of achievement emotions. Br. J. Educ. Technol. 2020, 51, 1563–1576. [Google Scholar] [CrossRef]
  26. Zhou, C.; Yuan, F.; Huang, T.; Zhang, Y.; Kaner, J. The impact of interface design element features on task performance in older adults: Evidence from eye-tracking and EEG signals. Int. J. Environ. Res. Public Health 2022, 19, 9251. [Google Scholar] [CrossRef]
  27. Mohammadyari, S.; Singh, H. Understanding the effect of e-learning on individual performance: The role of digital literacy. Comput. Educ. 2015, 82, 11–25. [Google Scholar] [CrossRef]
  28. Holmes, S. Time Limits and Speed of Working in Assessments: When, and to What Extent, Should Speed of Working Be Part of What Is Assessed; Ofqual research report 25/7267/2; Ofqual: Coventry, UK, 2025. [Google Scholar]
  29. Zumbo, B.D.; Hubley, A.M. (Eds.) Understanding and Investigating Response Processes in Validation Research; Springer International Publishing/Springer Nature: Cham, Switzerland, 2017. [Google Scholar] [CrossRef]
  30. Emerson, A.; Cloude, E.B.; Azevedo, R.; Lester, J. Multimodal learning analytics for game-based learning. Br. J. Educ. Technol. 2020, 51, 1505–1526. [Google Scholar] [CrossRef]
  31. Sharma, K.; Giannakos, M.N. Multimodal data capabilities for learning: What can multimodal data tell us about learning? Br. J. Educ. Technol. 2020, 51, 1450–1484. [Google Scholar] [CrossRef]
  32. Yan, L.; Zhao, L.; Gašević, D.; Martinez-Maldonado, R. Scalability, sustainability, and ethicality of multimodal learning analytics. In Proceedings of the 12th International Learning Analytics and Knowledge Conference, New York, NY, USA, 21–25 March 2022. [Google Scholar] [CrossRef]
  33. Yildirim-Erbasli, S.N.; Gorgun, G. Disentangling the relationship between ability and test-taking effort: To what extent the ability levels can be predicted from response behavior? Technol. Knowl. Learn. 2025, 30, 1475–1497. [Google Scholar] [CrossRef]
  34. Ulitzsch, E.; Yildirim-Erbasli, S.N.; Gorgun, G.; Bulut, O. An explanatory mixture IRT model for careless and insufficient effort responding in self-report measures. Br. J. Math. Stat. Psychol. 2022, 75, 668–698. [Google Scholar] [CrossRef]
  35. Nagy, G.; Ulitzsch, E. A multilevel mixture IRT framework for modeling response times as predictors or indicators of response engagement in IRT models. Educ. Psychol. Meas. 2022, 82, 845–879. [Google Scholar] [CrossRef] [PubMed]
  36. Wang, S.; Zhang, S.; Shen, Y. A joint modeling framework of responses and response times to assess learning outcomes. Multivar. Behav. Res. 2020, 55, 49–68. [Google Scholar] [CrossRef] [PubMed]
  37. Fox, J.P.; Marianti, S. Joint modeling of ability and differential speed using responses and response times. Multivar. Behav. Res. 2016, 51, 540–553. [Google Scholar] [CrossRef] [PubMed]
  38. Kingston, N.M.; Dorans, N.J. The effect of the position of an item within a test on item responding behavior: An analysis based on item response theory. ETS Res. Rep. Ser. 1982, 1982, i–26. [Google Scholar] [CrossRef]
  39. Reimann, P. Time is precious: Variable- and event-centred approaches to process analysis in CSCL research. Int. J. Comput.-Support. Collab. Learn. 2009, 4, 239–257. [Google Scholar] [CrossRef]
  40. Seifried, J.; Brandt, S.; Kögler, K.; Rausch, A. The computer-based assessment of domain-specific problem-solving competence: A three-step scoring procedure. Cogent Educ. 2020, 7, 1719571. [Google Scholar] [CrossRef]
  41. Han, Y.; Liu, H.; Ji, F. A sequential response model for analyzing process data on technology-based problem-solving tasks. Multivar. Behav. Res. 2022, 57, 960–977. [Google Scholar] [CrossRef]
  42. Peach, R.L.; Greenbury, S.F.; Johnston, I.G.; Yaliraki, S.N.; Lefevre, D.; Barahona, M. Data-driven modelling and characterisation of task completion sequences in online courses. arXiv 2020, arXiv:2007.07003. [Google Scholar] [CrossRef]
  43. Roque, F.V.; Junior, L.C.; Cechinel, C.; Marcon, M.Z.; Kuhnen, A.; Munoz, R.; Grellert, M. Learning Analytics for Virtual Industrial Labs: Performance Segmentation and Error Pattern Discovery via Sequential Mining. IEEE Access 2025, 13, 194401–194420. [Google Scholar] [CrossRef]
  44. Bulut, O.; Cormier, D.C.; Yildirim-Erbasli, S.N. Optimized screening for students at-risk in mathematics: A machine learning approach. Information 2022, 13, 400. [Google Scholar] [CrossRef]
  45. Wentworth, L.; Nagaoka, J. Early warning indicators in education: Innovations, uses, and optimal conditions for effectiveness. Teach. Coll. Rec. 2020, 122, 1–22. [Google Scholar] [CrossRef]
  46. Atif, A.; Richards, D.; Liu, D.; Bilgin, A.A. Perceived benefits and barriers of a prototype early alert system to detect engagement and support ‘at-risk’ students: The teacher perspective. Comput. Educ. 2020, 156, 103954. [Google Scholar] [CrossRef]
  47. Wang, Y.; Ding, Z.; Wu, X.; Sun, S.; Liu, N.; Zhai, X. Autoscore: Enhancing automated scoring with multi-agent large language models via structured component recognition. Proc. AAAI Conf. Artif. Intell. 2026, 40, 40898–40906. [Google Scholar] [CrossRef]
  48. Khalifa, A.; Tahhan, O.; Albazooni, M.; Saeed, M.; Hamdi, R.; Stanners, M.; Malik, A. Automated and artificial intelligence (AI)-derived performance assessment in surgical simulation: A systematic review. Cureus 2025, 17, e12. [Google Scholar] [CrossRef]
  49. Khine, M.S. Using AI for adaptive learning and adaptive assessment. In Artificial Intelligence in Education: A Machine-Generated Literature Overview; Springer Nature: Singapore, 2024; pp. 341–466. [Google Scholar]
  50. Kane, M.T. Current concerns in validity theory. J. Educ. Meas. 2001, 38, 319–342. [Google Scholar] [CrossRef]
  51. Gorgun, G.; Yildirim-Erbasli, S.N. Algorithmic bias in BERT for response accuracy prediction: A case study for investigating population validity. J. Educ. Meas. 2026, 63, e12420. [Google Scholar] [CrossRef]
  52. Khalil, M.; Prinsloo, P.; Slade, S. Fairness, trust, transparency, equity, and responsibility in learning analytics. J. Learn. Anal. 2023, 10, 1–7. [Google Scholar] [CrossRef]
  53. Bulut, O.; Gorgun, G.; Yildirim-Erbasli, S.N.; Wongvorachan, T.; Daniels, L.M.; Gao, Y.; Lai, K.W.; Shin, J. Standing on the shoulders of giants: Online formative assessments as the foundation for predictive learning analytics models. Br. J. Educ. Technol. 2022, 54, 19–39. [Google Scholar] [CrossRef]
  54. Ganesh, P.; Tran, C.; Shokri, R.; Fioretto, F. The data minimization principle in machine learning. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, Athens, Greece, 23–26 June 2025; pp. 3075–3093. [Google Scholar]
  55. Kleinman, E.; Shergadwala, M.; Seif El-Nasr, M.; Teng, Z.; Villareale, J.; Bryant, A.; Zhu, J. Analyzing students’ problem-solving sequences: A human-in-the-loop approach. J. Learn. Anal. 2022, 9, 138–160. [Google Scholar] [CrossRef]
  56. Viberg, O.; Poquet, O.; Kovanovic, V.; Khosravi, H. Fostering Human Agency in Age of AI: A Learning Analytics Perspective. J. Learn. Anal. 2025, 12, 1–7. [Google Scholar] [CrossRef]
  57. Sedrakyan, G.; Malmberg, J.; Verbert, K.; Järvelä, S.; Kirschner, P.A. Linking learning behavior analytics and learning science concepts: Designing a learning analytics dashboard for feedback to support learning regulation. Comput. Hum. Behav. 2020, 107, 105512. [Google Scholar] [CrossRef]
  58. Mangaroska, K.; Giannakos, M. Learning analytics for learning design: A systematic literature review of analytics-driven design to enhance learning. IEEE Trans. Learn. Technol. 2018, 12, 516–534. [Google Scholar] [CrossRef]
  59. Asatryan, S.; Hakobyan, L.; Adamyan, N. The role of big data and learning analytics in the quality assurance process of higher education. Educ. 21st Century 2025, 7, 87–99. [Google Scholar] [CrossRef]
  60. Finn, B. Measuring motivation in low-stakes assessments. ETS Res. Rep. Ser. 2015, 2015, 1–17. [Google Scholar] [CrossRef]
  61. Wise, S.L.; Ma, L. Setting response time thresholds for a CAT item pool: The normative threshold method. In Proceedings of the Annual meeting of the National Council on Measurement in Education, Vancouver, BC, Canada, 14–16 April 2012. [Google Scholar]
  62. Wise, S.L.; Kong, X. Response time effort: A new measure of examinee motivation in computer-based tests. Appl. Meas. Educ. 2005, 18, 163–183. [Google Scholar] [CrossRef]
  63. Bulut, O.; Gorgun, G.; Wongvorachan, T.; Tan, B. Rapid guessing in low-stakes assessments: Finding the optimal response time threshold with random search and genetic algorithm. Algorithms 2023, 16, 89. [Google Scholar] [CrossRef]
  64. National Center for Education Statistics. NAEP 2017 Digitally Based Mathematics Assessment. The Nation’s Report Card. 2017. Available online: https://www.nationsreportcard.gov/math_2017/about/digitally-based-assessment/?grade=4 (accessed on 24 March 2026).
  65. Ogut, B.; Circi, R.; Huo, H.; Hicks, J.; Yin, M. Running Out of Time: Leveraging Process Data to Identify Students Who May Benefit from Extended Time. Int. Electron. J. Elem. Educ. 2025, 17, 253–265. [Google Scholar] [CrossRef]
Table 1. A comparison of assessment analytics and learning analytics.
Dimension | Assessment Analytics | Learning Analytics
Primary Purpose | Support measurement and decision-making in assessment contexts | Support learning processes and instructional improvement
Core Question | What can be validly inferred about proficiency? | How can learning processes be understood and improved?
Theoretical Foundation | Educational measurement, psychometrics, validity theory | Learning sciences, educational data mining
Primary Data Sources | Assessment logs, item responses, timing data, navigation traces, item metadata | Clickstreams, discussion forums, assignments, engagement metrics
Unit of Analysis | Often items, tests, or individuals | Often learners, courses, or cohorts
Interpretive Constraints | Less flexibility; focuses on construct validity, comparability, fairness, and standardization | More flexibility; focuses on usefulness for intervention and support
