A Joint Diagnosis Model Using Response Time and Accuracy for Online Learning Assessment

Li, Xia; Chen, Yuxia; Yang, Huali; Geng, Jing

doi:10.3390/electronics14193873

Open AccessArticle

A Joint Diagnosis Model Using Response Time and Accuracy for Online Learning Assessment

¹

School of Media and Communication, Wuhan Textile University, Wuhan 430200, China

²

Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan 430079, China

³

School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan 430200, China

⁴

School of Education, Jianghan University, Wuhan 430056, China

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(19), 3873; https://doi.org/10.3390/electronics14193873

Submission received: 8 August 2025 / Revised: 24 September 2025 / Accepted: 24 September 2025 / Published: 29 September 2025

Download

Browse Figures

Versions Notes

Abstract

Cognitive diagnosis models (CDMs) assess the proficiency of examinees in specific skills. Online education has increased the amount of data that is available on the response behaviour of examinees. Traditional CDMs determine the state of skills by modelling information on item response results and ignoring vital response time information. In this study, a CDM, named RT-CDM, which models the condition dependence between response time and response accuracy on the speed-accuracy exchange criterion, is proposed. The model’s continuous latent trait function and response time function, used for more precise cognitive analyses, makes it a tractable, interpretable skill diagnosis model. The Markov chain Monte Carlo algorithm is used to estimate the parameters of the RT-CDM. We evaluate RT-CDM through controlled simulations and three real datasets—PISA 2015 computer-based mathematics, EdNet-KT1, and MATH—against multiple baselines, including classical CDMs (e.g., DINA/IRT), RT-extended IRT and joint models (e.g., 4P-IRT, JRT-DINA), and neural CDMs (e.g., NCD, ICD, MFNCD). Across datasets, RT-CDM consistently achieves superior predictive performance, demonstrates stable parameter recovery in simulations, and delivers stronger diagnostic interpretability by leveraging RT alongside RA.

Keywords:

response time; skills diagnosis; cognitive diagnosis model; Markov chain Monte Carlo; condition dependence

1. Introduction

Driven by the accelerating digital transformation of education, the proliferation of online learning has positioned personalized instruction as both a central research focus [1,2,3] and a fundamental pedagogical requirement [4,5,6]. Within these environments, accurately assessing learner performance and diagnosing latent cognitive skills is essential [7,8]. However, traditional psychometric approaches—primarily classical test theory and unidimensional item response theory (IRT) [9]—typically return a single overall ability score, offering insufficient diagnostic granularity for settings that demand fine-grained, personalized intervention [10,11].

Cognitive diagnosis theory (CDT) addresses this limitation by modelling examinees’ mastery of multiple fine-grained skills or attributes [12,13]. Within CDT, probabilistic CDMs—such as the Rule Space Model (RSM) [14], the Attribute Hierarchy Model (AHM) [14,15], and the Deterministic Inputs, Noisy “And” Gate (DINA) model with its extensions [16,17,18], including multiple-choice variants [19,20,21]—provide interpretable, skill-level feedback grounded in explicit cognitive structures. In parallel, neural CDMs (e.g., the Neural cognitive diagnosis model (NCD) [22,23], the Interpretable CD (ICD) [24], the Hierarchical cognitive diagnosis framework (hierarchical CDF) [25] and the Higher-Order Neural CD (HO-NCD) [26]) leverage representation learning to advance predictive performance. Yet across both families, modelling typically relies almost exclusively on binary response accuracy (RA) while overlooking response time (RT)—a key indicator of processing efficiency, engagement, and the speed–accuracy trade-off [27,28].

RT is now widely collected in computer-based assessments [29], and psychometric research shows that jointly modelling RA and RT helps disentangle ability and speed [23,28,30,31]. Four-parameter IRT (4P-IRT) and hierarchical RT–IRT frameworks typically link accuracy and time via higher-level correlations between latent speed and ability [28,32], with extensions such as Box–Cox transformations [28], differential speed modelling [31], and dynamic ability tracking [23]. Nonetheless, these approaches remain essentially unidimensional and often assume conditional independence of RA and RT given the latent variables—an assumption challenged by empirical evidence [33], including the classic speed–accuracy trade-off [34] and Roskam’s finding that the probability of correctness increases asymptotically with time [35]. Attempts to incorporate RT into CDMs—such as JRT-DINA within joint testlet structures [36], combining RT with CDM outputs for fluency [23], or log-normal hierarchical time models [37]—have advanced the field but still largely rely on hierarchical couplings; more recent neural approaches like MFNCD [7] integrate RT as an auxiliary feature within a neural CDM to enhance prediction but at the cost of limited interpretability at the attribute level.

Despite these advances, two gaps remain critical for diagnosis: (1) Attribute-level coupling. Many approaches are effectively unidimensional or assume conditional independence between RA and RT given latent speed/ability, which is frequently violated in practice [33] and obscures the local (attribute-level) dependence implied by the speed–accuracy trade-off [34,35]. (2) Interpretability vs. process sensitivity. Neural CDMs can be strong predictors but often lack transparent links to cognitive constructs; classical CDMs are interpretable but RT-agnostic, missing process information that can sharpen mastery inferences—especially in timed or moderately constrained assessments [38]. These limitations motivate a model that explicitly links RT [39] and RA at the attribute level, retaining interpretability while exploiting process data [40].

To address the limitations of existing CDMs, we propose RT-CDM, a cognitive diagnosis model that explicitly incorporates response time (RT) as an attribute-level covariate to capture the speed–accuracy trade-off while retaining interpretability. As a necessary foundation, we first generalize the binary mastery representation into a continuous-attribute variant (R-DINA), which relaxes the strict dichotomy of mastery versus non-mastery and provides smoother diagnostic inference. Building on this extension, RT-CDM augments the diagnostic process with a continuous latent-speed component (item-specific time intensity and discrimination), directly modeling local RT–RA dependence and yielding more precise and stable mastery inferences. Parameters are estimated via a data-augmented MCMC scheme. RT-CDM is most suitable for timed or moderately constrained assessments where RT reflects meaningful cognitive effort; in power tests with ample time, the incremental value of RT is expected to be limited.

(1): Modeling innovation. We formalize local RT–RA dependence at the attribute level and propose RT-CDM as a cognitively interpretable diagnostic mechanism that integrates response time with response accuracy. By introducing a continuous latent-speed component (item-specific time intensity and discrimination), RT-CDM uses deviations in response time as diagnostic signals, thereby improving both interpretability and robustness of mastery inference.
(2): Extension of mastery representation. As a foundational step, we generalize binary mastery into a continuous-attribute representation (R-DINA). This extension relaxes the strict dichotomy of mastery versus non-mastery, yields smoother diagnostic inference, and provides the basis upon which RT-CDM is developed.
(3): Comprehensive evaluation. We perform controlled simulations and empirical analyses on three large-scale datasets—PISA 2015, EdNet, and MATH—comparing RT-CDM against classical CDMs (e.g., IRT, DINA), RT-extended models (e.g., 4P-IRT, JRT-DINA), and neural CDMs (e.g., NCD, ICD, MFNCD). Across these datasets, RT-CDM consistently demonstrates superior predictive accuracy and calibration, stronger interpretability (e.g., higher DOA), and more robust parameter recovery.

2. Problem Formulation

The model was used to evaluate the students’ mastery of attributes (such as skills, abilities) in a cognitive assessment in a computer-based learning environment. To formulate the model, suppose an assessment consists of

J

items to measure

K

attributes and is answered by

I

examinees. Let

Q = {\{q_{j k}\}}_{J \times K}

matrix denote the relationship between

J

items and

K

attributes, and the element

q_{j k} = 1

if the

j^{t h}

item requires the

k^{t h}

attribute and 0 otherwise. Through assessing items at each test, two types of multivariate data are collected. The first is the response data of examinees, denoted by

Y_{i} = {\{y_{i j}\}}_{I \times J}

, where element

y_{i j} = 1

if the

i^{t h}

examinee answers correctly and 0 otherwise. The second is the RT data when examinees answer each item, denoted by

T_{i} = {\{T_{i j}\}}_{I \times J}

. The

i^{t h}

examinee’s speed is denoted as

τ_{i}

. In addition, the standard RT for each item

j

is denoted by

T_{j}

. The toy example of RT-CDM is shown in Figure 1.

The hierarchical model [41] assumes conditional independence between the RT and responses as follows:

P (Y_{i j}, T_{i j} |α_{i}, τ_{i}) = P (Y_{i j} |α_{i}, τ_{i}) P (T_{i j} |α_{i}, τ_{i})

(1)

However, the conditional independent distribution does not fit real-world scenarios. On the contrary, more information can be learned from the conditional dependence between RT and RA. Therefore, the proposed modelling framework adopts the following assumptions.

(1): Each examinee’s latent ability is denoted by a multidimensional binary latent variable, called attribute profile and described as $α_{i} = \{α_{i k}\}$ , the element $α_{i k} = 1$ if $i^{t h}$ examinee masters $k^{t h}$ attribute and 0 otherwise.
(2): The RA of examinee $Y_{i j}$ is not only determined by the attribute profile $α_{i k}$ and the item characteristics of the test, but also by the RT of the examinee $T_{i j}$ .
(3): The examinee’s RT $T_{i j}$ is related to the speed $τ_{i}$ , which is not constant and is affected by the item discrimination $λ_{j}$ .
(4): The matching degree between the examinee’s RT and the standard RT of $j^{t h}$ item, denoted by $ε_{i j}$ , is determined by the time intensity of $j^{t h}$ item $β_{j}$ , the examinee’s speed $τ_{i}$ , and the examinee’s RT $T_{i j}$ .

Based on these assumptions, the proposed modelling framework introduced the response time, named RT-CDM. The joint model for the condition dependence between RT and RA is the following:

P (Y_{i j}, T_{i j} |α_{i}, τ_{i}) = P (Y_{i j} |α_{i}, T_{i j}, τ_{i}) P (T_{i j} |τ_{i})

(2)

3. Methods

RT-CDM introduces a continuous item mastery function and a RT function to achieve a more precise cognitive analysis. The framework of RT-CDM is as shown in Figure 2. First, input the examinee’s response time, response result data and Q-matrix. Second, analyse the factors that affect the examinee’s cognitive process, including response speed, mistakes, guessing and other factors, and model these factors. Finally, obtain the examinee’s skill mastery according to the model.

3.1. Response Time Model

Response time (RT) has been widely modeled using a lognormal distribution [42], which assumes that the logarithm of observed RT follows a normal distribution [43]. The standard lognormal model is specified as

\log T_{i j} = β_{j} - τ_{i} + ε_{i j}, ε_{i j} \sim N (0, σ^{2})

(3)

where

T_{i j}

is the RT of examinee i on item j,

β_{j}

is the item time-intensity parameter,

τ_{i}

is the latent speed of examinee i, and

ε_{i j}

is a random error term. This formulation, however, assumes that an examinee’s speed parameter

τ_{i}

is constant across all items, which is often unrealistic in practice because different items require different baseline amounts of time to complete.

To address this issue, Fox and Marianti [31] extended the lognormal RT model by introducing an item discrimination parameter

λ_{j}

, yielding the following specification:

\log T_{i j} = β_{j} - λ_{j} τ_{i} + ε_{i j}

Here,

λ_{j}

allows examinees’ latent speed to interact with item-specific characteristics, thereby accommodating heterogeneity in item timing demands. Nevertheless, the model still inherits the limitation that

τ_{i}

is a global speed factor, assuming that examinees’ relative speed is constant across all items. This does not fully reflect the reality that items differ not only in discrimination but also in their intrinsic baseline time requirements. In order to illustrate how the

λ_{j}

works, Figure 3 displays the relationship between the RT

T_{i j}

and the error

ε_{i j}

under the different discrimination parameter

λ_{j}

. If the RT is the same for an item, the greater the value of

λ_{j}

, the greater the

ε_{i j}

, hence solving the problem that examinee speed

τ_{i}

is constant for each item.

To overcome this limitation, we introduce an item-specific baseline response time, denoted by

T_{j}

. This parameter represents the expected RT for item j under average speed, allowing observed RTs to be standardized across items. Specifically, we define the log-transformed residual workload as

ε_{i j} = (\ln T_{j} + β_{j}) - (\ln T_{i j} + λ_{j} τ_{i})

where

T_{j}

serves as a reference point that adjusts for the baseline difficulty or workload of each item. The proposed RT model can then be expressed as

\log \frac{T_{i j}}{T_{j}} = β_{j} - λ_{j} τ_{i} + ε_{i j}, ε_{i j} \sim N (0, σ^{2})

(4)

This specification ensures that RTs are comparable across items, since deviations are measured relative to each item’s baseline requirement.

Regarding the treatment of

T_{j}

, the most rigorous approach is to estimate it jointly with other model parameters within a Bayesian framework, thereby incorporating parameter uncertainty. However, due to the substantial computational cost of such an approach, in our implementation we fix

T_{j}

as the mean RT of item j computed from the training data only, and apply these values as exogenous constants when evaluating the model on the test data. This practice avoids data leakage while maintaining computational efficiency. We explicitly acknowledge this choice and its limitations in Section 7.

3.2. R-DINA Model

To capture examinees’ problem-solving accuracy, we first extend the conventional DINA model by allowing the mastery indicator to take continuous values. For examinee i and item j, the mastery indicator is defined as

η_{i j} = \frac{q_{j} α_{i}^{T}}{q_{j}^{T} q_{j}}

where

α_{i} = (α_{i 1}, \dots, α_{i K})

denotes the attribute profile of examinee i, and

q_{j}

is the j^th row of the Q-matrix indicating which attributes are required by item j. The numerator counts the number of required attributes mastered by the examinee, and the denominator is the total number of attributes required by the item. Hence,

η_{i j} \in [0, 1]

represents the proportion of required attributes mastered, with

η_{i j} = 0

indicating no mastery and

η_{i j} = 1

indicating full mastery. Intermediate values reflect partial mastery.

Given this continuous mastery indicator, R-DINA model the probability of a correct response is specified as

logit P (Y_{i j} = 1 | η_{i j}) = logit (g_{j}) + η_{i j} [logit (1 - s_{j}) - logit (g_{j})],

where

g_{j}

and

s_{j}

are the guessing and slipping parameters, constrained by

0 < g_{j}, s_{j} < 1

and

1 - s_{j} > g_{j}

. This formulation preserves the diagnostic interpretation: when

η_{i j} = 0

, the success probability equals

g_{j}

; when

η_{i j} = 1

, it equals

1 - s_{j}

. For example,

η_{i j}

is valued in the range of 0~1 for a random fraction. For example, on the condition that an item tests three attributes (1 1 1),

η_{i j}

divides examinees into four types: examinees mastering none, one, two, and three attributes. The polychotomy of

η_{i j}

divides the mastery mode of the examinee more finely.

η_{i j} = 1

indicates that the examinee has mastered all the skills required for the item (see Figure 4), and

η_{i j} \in [0, 1)

indicates that the examinee has mastered some of the skills required for the item. By allowing

η_{i j}

to vary continuously, R-DINA generalizes mastery representation, provides smoother diagnostic inference, and lays the foundation for extending CDMs to incorporate process data such as RT. This continuous formulation preserves interpretability while relaxing the strict dichotomy of DINA and provides the necessary foundation for incorporating RT information.

3.3. Response Time—Cognitive Diagnosis Model Framework

To incorporate response time into the diagnostic framework, we extend the R-DINA model by integrating the time error term

ε_{i j}

, derived from the standardized RT model in Section 3.1. The probability of a correct response is modeled as a Bernoulli random variable with mean

μ_{i j}

:

logit (μ_{i j}) = logit (g_{j}) + η_{i j} [logit (1 - s_{j}) - logit (g_{j})] + b_{j} ψ_{j} (ε_{i j}),

(5)

where

Y_{i j} \sim Bernoulli (μ_{i j}) .

ε_{i j}

is the standardized time error, representing the discrepancy between the observed RT and the expected workload of item j;

ψ_{j} (ε_{i j})

is a monotonic transformation of the time error (commonly

ψ_{j} (ε_{i j}) = - 4 ε_{i j}

), centered so that its mean is zero;

b_{j}

is an item-specific coefficient quantifying the magnitude and direction of the RT effect. This specification ensures interpretability: when

η_{i j} = 0

and

ψ_{j} (ε_{i j}) = 0

, the success probability reduces to

g_{j}

; when

η_{i j} = 1

and

ψ_{j} (ε_{i j}) = 0

,, it reduces to

1 - s_{j}

.

Cognitive interpretation. This additive logit model can be interpreted as combining two independent sources of evidence:

(1): Mastery-based evidence ( $η_{i j}$ ) from cognitive attributes;
(2): Time-based evidence (ε_ij) reflecting efficiency relative to expectations.

When

b_{j} > 0

, taking longer than expected (negative ε_ij) lowers the log-odds of success, consistent with the interpretation that hesitation acts as negative evidence. Conversely, efficient use of time (positive ε_ij) strengthens the probability of success.

Thus, RT-CDM retains the classical diagnostic meaning of guessing and slipping, while incorporating RT deviations as an additional diagnostic signal. This allows the model to capture local RT–RA dependence and explain the speed–accuracy trade-off in a cognitively interpretable manner. A graphical representation of the RT-CDM that jointly models RT and RA is displayed in Figure 5.

Monotonicity. The proposed RT-CDM preserves the monotonicity property widely recognized in cognitive diagnostic modelling [44,45]. By retaining the diagnostic structure of the R-DINA framework, the probability of a correct response remains non-decreasing with respect to attribute mastery. Furthermore, the time-related component ε is modelled as a monotonically decreasing function of response time, reflecting the empirical observation that longer latencies typically indicate lower efficiency or uncertainty. Together, these design choices ensure that the incorporation of response time does not undermine the fundamental monotonicity assumption of CDMs.

Identifiability. Regarding identifiability, the RT-CDM builds upon the R-DINA framework, whose identifiability conditions have been well established in the literature. Although our model introduces additional behavioral parameters through the integration of response time, potential identifiability issues are mitigated by fixing anchoring parameters, employing a well-structured Q-matrix, and applying regularized estimation [46]. Moreover, parameter constraints (β, τ > 0; λ ∈ (0,1); g, s ∈ (0,1)) and hierarchical priors are imposed to avoid over-parameterization. Empirical evidence further shows that the model achieves stable convergence across multiple runs, supporting its practical identifiability.

4. Estimation

A fully Bayesian formulation with Markov chain Monte Carlo (MCMC) was adopted to estimate the parameters of the RT-CDM.

4.1. Model Parameter Estimation

We adopt a fully Bayesian framework for parameter estimation in RT-CDM. Let the observed response accuracy and response time be

Y = {y_{i j}}, T = {t_{i j}}, i = 1, \dots, N, j = 1, \dots, J,

with corresponding latent mastery profiles

α_{i}

.

(1): Likelihood function.

The probability of a correct response conditional on skills and time error

ε_{i j}

is

P (y_{i j} = 1 | α_{i}, β_{j}, λ_{j}, τ_{i}, g_{j}, s_{j}, ε_{i j}) = (1 - s_{j}) η_{i j} g_{j}^{1 - η i j} f (ε_{i j}),

(6)

where

η_{i j}

is the conjunctive mastery indicator and

f (ε_{i j}) = \frac{1}{1 + \exp (ε_{i j})}

is the time-error function. The log-response time is modeled as

\log T_{i j} \sim N (β_{j} - λ_{j} τ_{i}, σ^{2})

.

Thus, the joint likelihood is

L (β, λ, τ, g, s, α | Y, T) = \prod_{i = 1}^{N} \prod_{j = 1}^{J} P {(y_{i j} | α_{i}, \cdot)}^{y_{i j}} {[1 - P (y_{i j} | α_{i}, \cdot)]}^{1 - y_{i j}} p (t_{i j} | β_{j}, λ_{j}, τ_{i}) .

(7)

(2): Prior distributions.

We use weakly informative priors to ensure identifiability and shrinkage:

β_{j} \sim N (0, 1), λ_{j} \sim N (0, 1), τ_{i} \sim N (0, 1), α_{i k} \sim B e r n o u l l i (0.5), g_{j}, s_{j} \sim B e t a (2, 2) .

(8)

To address scale non-identifiability between

λ_{j}

and

τ_{i}

, we fix the mean of

τ_{i}

at zero and its variance at one. This provides a location–scale anchor and avoids over-parameterization.

(3): Posterior distribution.

The joint posterior distribution of

β, λ, τ, α, g

given

Y

and

T

is

L (β, λ, g, s, τ, α |Y, T) \propto L (β, λ, s, g; α, τ) P (τ) P (α) P (β) P (λ) P (g) P (s)

(9)

4.2. Markov Chain Monte Carlo

We employ data-augmented MCMC for posterior sampling. Specifically,

Sampler: No-U-Turn Sampler (NUTS) for continuous parameters

(β, λ, τ)

, and Gibbs updates for discrete mastery profiles α.

Blocking & vectorization: Item-level parameters

(β_{j}, λ_{j}, g_{j}, s_{j})

are updated in parallel, and likelihood terms are vectorized to reduce runtime from

O (N I)

per iteration to batched matrix operations.

Convergence checks: Chains are run with multiple seeds; convergence is assessed using Gelman–Rubin

\hat{R} < 1.1

and effective sample size (ESS).

This yields a stable and computationally efficient estimation routine while ensuring transparency in the connection between the likelihood and the sampling procedure.

5. Real Data Experiments

In this section, we evaluated the proposed RT-CDM model across three datasets and compared it with various baseline methods. We aimed to assess both the predictive accuracy and the interpretability of the model. Extensive experiments were conducted to investigate whether incorporating response time information improves prediction performance and whether the model can provide reliable diagnostic insights into students’ attribute mastery. Based on these experiments and analyses, we addressed the following research questions:

RQ1. How effectively does RT-CDM predict student performance compared to baseline methods?

RQ2. To what extent does RT-CDM enhance interpretability in diagnosing students’ attribute mastery relative to existing cognitive diagnosis models?

5.1. Data Description

To evaluate the performance of the proposed model, we conducted experiments on three representative datasets: PISA2015 (standardized assessment), EdNet-KT1 (large-scale longitudinal learning logs), and MATH (static exam data), covering diverse educational scenarios.

PISA (https://www.oecd.org/pisa/data/2015database/): The PISA 2015 mathematics dataset evaluates the mathematical literacy of 15-year-old students by measuring their ability to apply mathematical knowledge and skills in diverse real-world contexts [47]. In this study, we focus on seven mathematics-related attributes defined in the PISA framework: change and relationships (α₁), space and shape (α₂), quantity (α₃), uncertainty and data (α₄), formulating situations mathematically (α₅), employing mathematical concepts, procedures, and reasoning (α₆), and interpreting, applying, and evaluating mathematical outcomes (α₇). In addition to students’ responses, the dataset also records response times, enabling analyses not only of problem-solving accuracy but also of cognitive processing efficiency and test-taking behaviors. This dual information provides richer insights into students’ mathematical understanding and their readiness to apply mathematics in personal, occupational, and societal contexts. The corresponding Q-matrix, constructed and validated by domain experts in prior work [48], is presented in Table 1.

EdNet (https://github.com/riiid/ednet): The EdNet-KT1 dataset, collected from the Santa Academy learning platform, provides large-scale longitudinal logs of students’ learning activities. Each record includes a student identifier, question ID, answer submission, timestamp, session ID, and the elapsed time spent on solving each problem. With millions of interaction records covering thousands of unique questions, the dataset captures both the correctness of responses and the dynamics of response times, offering opportunities to analyze students’ knowledge acquisition, learning behaviours, and temporal patterns of problem-solving.

Math (https://edudata.readthedocs.io/en/latest/tutorial/zh/DataSet.html#math2015): The MATH dataset, developed by iFLYTEK Co., Ltd., is based on data collected via the iFLYTEK Learning Machine from a final mathematics examination for high school students. As a static single-assessment dataset, it features a dense structural design that captures detailed student responses, exemplifying the traditional model of standardized evaluation while offering a solid foundation for analysing performance patterns and learning outcomes.

In the preprocessing stage, we first removed incomplete or erroneous records (e.g., missing responses, non-positive or incoherent timestamps) and anonymized all student identifiers. We standardized interaction logs by aligning item identifiers, normalizing correctness labels, and unifying response-time units. For RT specifically, we applied a log transform and, for each item j, computed a baseline time

T_{j}

(item-level average/robust mean) to form a centered measure

\log (T_{i j} / T_{j})

). To improve robustness, we trimmed/winsorized extreme latencies using percentile-based rules (dataset-specific tails) and flagged potential rapid-guessing or off-task records for exclusion from the time likelihood while retaining accuracy information. RT missingness/censoring (e.g., timeouts) was handled under a missing-at-random assumption conditional on item and learner factors; such cases were either excluded from the time component or imputed using a truncated normal model [17,49]. To address sparsity, we filtered students and items with insufficient interactions. The cleaned data were then converted into sequential formats suitable for knowledge tracing and model training. Summary statistics after preprocessing are reported in Table 2.

5.2. Analysis

Based on the real dataset, eight baseline models were selected to compare with the proposed model from the perspectives of accuracy and convergence. Two Markov chains were run for each model, with 8000 iterations for each chain. The first 5000 iterations in each chain were discarded as burn-in, and the last 3000 iterations were used to compute the point estimates of model parameters.

5.2.1. Baseline Models

To ensure a comprehensive and fair evaluation, we selected representative baseline models from two perspectives: (1) We considered whether response time (RT) information is incorporated into the modeling process. Models that focus solely on response accuracy (RA), such as IRT, DINA, NCD, and ICD, provide a comparison point for our refined R-DINA model, while models that jointly model RA and RT, including JRT-DINA, 4P-IRT, and MFNCD, serve as counterparts to the proposed RT-CDM. (2) We categorized the baselines by methodological paradigm: probability-based statistical models (IRT, DINA, JRT-DINA, 4P-IRT) versus neural network-based approaches (NCD, ICD, MFNCD). This dual classification not only clarifies the rationale for selecting these models but also ensures that the evaluation of RT-CDM is conducted from multiple perspectives, strengthening the validity of the comparative analysis.

IRT. The 3PL IRT model predicts the probability that an examinee with a given ability level (θ) will answer a test item correctly, using three item characteristics: discrimination (a), difficulty (b), and guessing parameter (c).

P (θ) = c + \frac{1 - c}{1 + e^{- a (θ - b)}}

(10)

DINA model. The DINA model is the most popular and most used CDM, which uses the examinees’ binary response results to perform binary modelling.

P (Y_{i j} = 1 |α_{i}) = {(1 - s_{j})}^{η_{i j}} g_{j}^{1 - η_{i j}} | η_{i j} = \prod_{k = 1}^{K} α_{i k}^{q_{j k}}

(11)

R-DINA. The R-DINA model is a refined model, which ignores the RT of RT-CDM. Compared to traditional deterministic-input, noisy “and” gate (DINA) models that assume binary latent attributes, R-DINA relaxes the binary assumption by introducing probabilistic or continuous representations of mastery. However, it remains more restricted than generalized models such as G-DINA or LCDM, which allow for complex, multi-way interactions among attributes.

P (Y_{i j} = 1 |α_{i}) = {(1 - s_{j})}^{η_{i j}} g_{j}^{1 - η_{i j}}, η_{i j} = \frac{q_{j} α_{i}^{T}}{q_{j}^{T} q_{j}}

(12)

4P-IRT [32]. The typical 4PIRT model that extends the 3PIRT predicts the probability of students correctly answering exercises by including response time

t_{i j}

and parameters representing exercise and student slowness.

P (Y_{i j} = 1 | θ_{i}, ρ_{i}, a_{j}, b_{j}, c_{j}, d_{j}, t_{i j}) = c_{j} + \frac{1 - c_{j}}{1 + \exp {- 1.7 a_{j} [θ_{i} - (ρ_{i} d_{j} / t_{i j}) - b_{j}]}}

(13)

where

a_{j}, b_{j}, c_{j}

are the discrimination, difficulty, and guessing parameter of exercise;

θ_{i}

is the ability parameter of student i. The difference from the 3PIRT model is that 4PIRT model adds parameter

d_{j}

that represents the exercise’s slowness parameter, parameter

ρ_{i}

that represents the student’s speed parameter, parameter

t_{i j}

that denotes student’s response time on exercise j.

JRT-DINA [50]. The JRT model is a hierarchical modelling framework to model the RA and RT of examinees.

\begin{array}{l} logit (P (T_{i j})) = ζ_{j} - τ_{i} + ε_{i j}, ε_{i j} \sim N (0, σ^{2}) \\ logit (P (Y_{i j} = 1)) = β_{j} + δ_{i} \prod_{k = 1}^{K} α_{i k}^{q_{j k}}, \\ logit (P (α_{i k} = 1)) = γ_{k} θ_{i} - λ_{k} \end{array}

(14)

NCD [5]: NCD is a deep learning-based cognitive diagnostic model that combines IRT to assess students’ cognitive attributes and exercise performance.

ICD [5]: ICD is a novel neural network-based cognitive diagnostic model that considers the interactions between knowledge concepts and the quantitative relationships between exercises and concepts.

MFNCD [7]: MFNCD integrates multidimensional features by using students’ reaction times as process information. This facilitates the simultaneous modelling of students’ reaction accuracy and reaction speed using neural networks.

5.2.2. Experimental Details

Evaluation metrics. Item parameter recovery was examined using the RMSE (Root Mean Square Error), MAE (Mean Absolute Error), ACC (Predication Accuracy) and AUC (Area Under ROC) of the estimated category response function for each latent class of the true values. RMSE and MAE represent the error between estimated value and the true value; therefore, the smaller the value, the better the model. AUC and ACC indicate the accuracy of model [3]; therefore, the bigger the value, the better the model. We perform an 80%/20% train/test split of datasets, using the first 80% data of each student to train parameters. Then, we inference each student’s proficiency after finishing his/her training records, which is used to predict scores on his/her testing (last 20%) data.

(1): RMSE is defined as follows:

$R M S E = \sqrt{\frac{1}{I \times J} \sum_{j = 1}^{J} \sum_{i = 1}^{I} {(Y_{i j} - P (Y_{i j} = 1))}^{2}}$

(15)
(2): MAE is defined as follows:

$M A E = \frac{1}{I \times J} \sum_{j = 1}^{J} \sum_{i = 1}^{I} |Y_{i j} - P (Y_{i j} = 1)|$

(16)
(3): ACC is defined as follows:

$A C C = \sum_{i = 1}^{I} \frac{I [Y_{i j} = {\hat{Y}}_{i j}]}{I}$

(17)

where $I (\cdot)$ is an indicator function. If $\cdot$ is true, the value is 1; otherwise, it is 0.
(4): AUC provides a robust metric for binary prediction evaluation.

Experimental setting. To ensure fair comparison across models, we implemented consistent training strategies with hyperparameter tuning for all baselines. For the IRT and 4P-IRT models, we performed grid search over learning rates [1e-4,1e-3,1e-2], L2 regularization coefficients [0.0,1e-4,1e-3], and batch sizes. The IRT model was trained using the EM algorithm with tuned convergence thresholds. The JRT-DINA model employed an Adam optimizer with a tuned learning rate and dropout regularization when applicable. For the neural network models (NCD, ICD, MFNCD), we searched the learning rate from {0.001, 0.002, 0.005}, batch size from {16, 32, 64}, and hidden dimensions from {8, 16, 32}. For MFNCD in particular, the dimensions of fully connected layers were tuned within {512–256–1, 256–128–1}. All models used the Sigmoid activation function. Hyperparameter selection was based on 5-fold cross-validation performance on the training set, and the best configuration was applied to the test set. These settings were selected based on validation performance to mitigate overfitting and ensure equitable optimization conditions across models.

The RT-CDM model was implemented using the PyMC3 probabilistic programming framework, leveraging the No-U-Turn Sampler (NUTS) for efficient Markov Chain Monte Carlo (MCMC) inference. All experiments were conducted on a workstation equipped with an AMD Ryzen 9 7950X CPU and 64 GB RAM (Advanced Micro Devices (AMD), Santa Clara, CA, USA). For the PISA dataset (n ≈ 6000 students, 17 items), we performed 8000 sampling iterations, which required approximately 7.5 h for model convergence. Convergence was assessed using Gelman-Rubin statistics (

\hat{R} < 1.1

) and effective sample size metrics.

To improve runtime and scalability, all item–student likelihood computations were performed using vectors, avoiding explicit loops. The model likelihood was evaluated in block form, enabling efficient memory use and stable gradients during NUTS sampling. These optimizations ensured that the computational complexity scaled linearly with the number of responses (O(NJ)), making the method feasible for medium- to large-scale assessments. For even larger datasets, the model can be parallelized across chains or implemented in GPU-enabled frameworks (e.g., PyMC v4/NumPyro) to further reduce runtime.

5.2.3. Experimental Results

To rigorously evaluate the predictive performance of the proposed methods, experiments were conducted on three representative datasets: PISA2015 (standardized assessment data), EdNet-KT1 (longitudinal learning log data), and MATH (static single-exam data). Baseline models were organized along two dimensions. From the perspective of data utilization, models were classified into those relying solely on response accuracy (RA) and those jointly modeling response accuracy and response time (RA-RT). From the perspective of modeling paradigm, models were further categorized into probabilistic models (e.g., IRT, DINA, 4P-IRT) and neural network-based models (e.g., NCD, ICD, MFNCD). Within this framework, the proposed R-DINA model is positioned as an enhanced RA-based probabilistic model, strengthening cognitive diagnostic capacity, while the proposed RT-CDM model is positioned as an RA-RT joint probabilistic model, designed to integrate accuracy and temporal behavior in diagnosing student performance. This design enables a comprehensive comparison across data types and modeling paradigms to verify the robustness and generalizability of the proposed approaches. Table 3 presents the overall results for three datasets.

As shown in Table 3, RT-CDM achieves consistently strong performance across datasets. On PISA 2015 and EdNet, where RT information is available, RT-CDM yields clear gains over classical CDMs (e.g., DINA, R-DINA) in terms of both accuracy and AUC, confirming that response time provides useful diagnostic signals at the attribute level. Compared with neural models such as MFNCD, RT-CDM performs competitively: while MFNCD sometimes achieves lower MAE or RMSE, RT-CDM maintains higher discriminative ability (AUC), highlighting its advantage in interpretability-oriented diagnosis. On the MATH dataset, which lacks RT information, RT-CDM reduces to R-DINA and produces nearly identical results, as expected. This confirms the model’s internal consistency and ensures that RT-CDM does not degrade performance when RT is unavailable. Overall, these results indicate that RT-CDM balances predictive performance and interpretability: it leverages RT to enhance diagnostic inference where available, while remaining stable in settings without RT. Neural CDMs may outperform in certain error-based metrics, suggesting complementary strengths that future research could explore.

Figure 6 shows the prediction performance for each item of the four models. From each subfigure, we can observe that the RT-CDM model outperforms almost all the baselines on most items and the R-DINA model is also better than the other two models. Moreover, JRT-DINA model’s performance is the most unstable among the four models, and the results fluctuate greatly on different items, indicating that the JRT-DINA model is the most affected by the factors of the items.

To verify the precision and interpretability of the proposed RT-CDM model in diagnosing students’ attribute mastery, the Degree of Agreement (DOA) metric was adopted, which corresponds to a monotonicity analysis [26]. The rationale of DOA is that if a student exhibits higher estimated proficiency in a given attribute than another student, the former should consistently achieve better performance on exercises associated with that attribute. By averaging DOA values across all attributes, the overall plausibility of diagnostic results can be evaluated. Since other models do not provide explicit outputs of attribute mastery, DINA, NCD, ICD, and MFNCD were selected as baseline models, and experiments were conducted on three datasets, the results are shown in Figure 7.

As shown in Figure 6, the results demonstrate that the proposed model outperforms these state-of-the-art baselines. Specifically, it surpasses DINA, indicating that continuous modelling of attribute mastery is superior to dichotomous approaches, and it also achieves higher interpretability than neural network-based models, thereby highlighting its dual advantage in both diagnostic precision and explanatory power.

Computational complexity. RT-CDM is estimated via a data-augmented MCMC scheme. Per iteration, (a) sampling examinee attributes

α

costs

O (N K)

; (b) updating item/attribute and time parameters costs

O (I K)

; and (c) evaluating the joint likelihood scales linearly with the number of observed interactions

M \leq N I

. Thus, the overall per-iteration time is approximately

O (M K)

with moderate constants. Mixing can slow when the number of attributes K is large or when the Q-matrix is highly sparse/imbalanced. In practice, wall-clock time is reduced via parallel updates across items/examinees, batched likelihood evaluations, and vectorized implementations, with convergence monitored by

\hat{R}

and effective sample size (ESS) and iteration caps used to control compute.

6. Simulation Study

A follow-up simulation study was conducted to further evaluate model parameter recovery and to compare the R-DINA and RT-CDM models in ideal simulated conditions. The study method is to simulate the response results matrix and the RT matrix of the examinees by fixing the number of attributes, items, and examinees, and by estimating the parameters of items and obtaining the skill state of the examinees.

6.1. Data Generation

The study simulates five separate attributes, which can generate

2^{5} = 32

skill states and simulates 31 items (excluding the mode (00000)). The Q-matrix for this data is given in Figure 8. (1) RT parameters:

β \sim N (9, 1)

;

τ \sim N (0, 1)

;

λ

obeys truncated normal distribution [12], the lower limit is 0.0001;

ε

obeys a uniform distribution with a mean of 0 and a variance of [0.3,0.7]. (2) RA parameters:

α_{i k} \sim Bernoulli (0.7)

;

g_{j} \sim 4 - Beta (1, 2, 0, 0.6)

;

s_{j} \sim 4 - Beta (1, 2, 0, 0.6)

. According to these parameters, the study simulates the response results and RT of 5000 examinees in total.

6.2. Analysis

With respect to the classification of individual attributes and profiles, this study computed the ACCR (Attribute Correct Classification Rate) and the PCCR (Pattern Correct Classification Rate). The ACCR evaluates the accuracy of the individual attribute classifications and the PCCR evaluates the accuracy of the attribute vector classifications which is a concatenation of the individual attribute classifications.

A C C R = \sum_{i = 1}^{I} \frac{I [{\hat{α}}_{i k} = α_{i k}]}{I}, P C C R = \sum_{i = 1}^{I} \frac{I [{\hat{α}}_{i} = α_{i}]}{I}

(18)

where

I (\cdot)

is an indicator function. If

\cdot

is true, the value is 1; otherwise, it is 0.

Additionally, the recovery of the model, the MAE, RMSE, ACC and AUC, were computed. Based on these evaluation indicators, the RT-CDM model and the R-DINA model were compared to illustrate the impact of the RT in cognitive diagnosis.

6.3. Results

In this section, the two models were compared from three aspects: the accuracy of model, the recovery of item parameters and the accuracy of attribute classification.

Model accuracy. Table 4 displays the RMSE, MAE, ACC, and AUC to compare the two models: (1) the RMSE and MAE represent the error between the estimated value and the true value, and (2) the AUC and ACC indicate the accuracy of two models. These measures were computed for each replication, and the results in Table 3 are averages over 100 replications. On the one hand, MAE and RMSE errors of RT-CDM are lower than for the R-DINA model from the perspective of project parameter recovery, and the MAE result of RT-CDM is less than 0.05. On the other hand, the accuracy of the RT-CDM is much higher than that of the R-DINA model. The ACC of RT-CDM is greater than 0.97. This result shows that if the RT and the response result are combined in modelling, it can significantly improve the accuracy of the model.

Item parameters recovery. Table 5 displays the recovery of item parameters for the two models by presenting the MAE and RMSE between estimated and true values of all item parameters. For the parameters, the results of the RT-DINA model are much better than the results of the R-DINA model; for the parameter g, the results of the two models are equivalent, and the result of the RT-DINA model is only 0.001 higher than the result of the R-DINA. From the overall situation, the item parameter recovery of the RT-DINA is relatively stable.

Attribute classification. Table 5 presents the recovery of individual attributes (ACCR) and attribute patterns (PCCR) of the RT-CDM and R-DINA model. The values in the table are computed by comparing the true and estimated classifications and represent the percentage of correct classifications across the replication results and examinees’ real response results. The RT-CDM was higher than the R-DINA model in terms of both ACCR and PCCR, which indicates that ignoring the effect of the RT on the RA would reduce ACCR and PCCRs.

From the results of Table 5, the RT-CDM model’s item parameters estimation was stable, and accuracy was based on the MCMC algorithm. The results of Table 6 demonstrate that by using the RT-CDM model to analyse the data, dramatic improvement in classification accuracy can be achieved over conventional tests and analyses.

As shown in Table 6, the proposed RT-CDM models the local dependency between RA and RT, achieving higher ACCR and PCCR than competing approaches. This indicates that RT-CDM can more effectively differentiate examinees’ attribute mastery and provides stronger interpretability and diagnostic power. By contrast, MFNCD, an educational neural network model, shows markedly lower ACCR and PCCR, reflecting weaker interpretability and limited attribute discrimination. JRT-DINA, as a hierarchical model, incorporates response times but assumes independence between RA and RT; as a result, its overall performance is comparable to that of R-DINA, which relies solely on response accuracy.

7. Discussion

Several key findings regarding the validity and positioning of RT-CDM relative to existing cognitive diagnostic frameworks are as follows:

(1): Comparative advantages over JRT-DINA

RT-CDM’s attribute-level local dependence plus continuous mastery produces more robust parameter recovery, stronger interpretability, and more stable, better-calibrated predictions than JRT-DINA [41]. Compared with JRT-DINA, RT-CDM models the attribute-level coupling between response time (RT) and response accuracy (RA) and replaces binary mastery with a continuous proficiency representation (R-DINA). This makes RT an explicit, interpretable modulator of the Bernoulli mean at the skill level, so the model naturally captures the speed–accuracy trade-off and preserves diagnostic monotonicity (greater mastery → higher correctness) while improving calibration and prediction. By contrast, JRT-DINA embeds RT through higher-level (e.g., testlet) factors and typically assumes conditional independence between RA and RT given those factors, so observed time does not directly inform item-skill probabilities. When data exhibit strong speed–accuracy exchange or heterogeneous attribute effects, this architecture can misallocate dependence, letting RT effects be absorbed by testlet or person factors, which weakens identifiability, increases collinearity between speed and ability, and often yields slow, unstable estimation. In addition, DINA’s hard (binary) gating is sensitive to thresholding and Q-matrix sparsity; with short tests or uneven item coverage, parameter estimates and mastery profiles become brittle.

(2): Positioning relative to traditional and neural CDMs.

Compared with traditional CDMs such as DINA and its variants, RT-CDM extends the binary mastery paradigm into a continuous attribute representation (R-DINA) and further enriches the diagnostic process by integrating temporal information. This allows the model to capture more nuanced differences in students’ proficiency and processing efficiency, overcoming the restrictive dichotomy of mastery versus non-mastery. In comparison with neural CDMs (e.g., NCD, ICD, MFNCD), RT-CDM achieves a balance between predictive accuracy and interpretability. While neural models often exhibit strong performance, their latent representations are difficult to align with meaningful cognitive constructs. By contrast, RT-CDM retains the interpretability of classical CDMs while delivering improved predictive performance through the incorporation of RT.

(3): Comparative analysis with published results.

To highlight its positioning, we compared RT-CDM against publicly reported results along methodological dimensions rather than raw ACC/AUC values, since data preprocessing and evaluation criteria vary across studies. As summarized in Table 7, RT-CDM demonstrates clear advantages in terms of continuous representation, explicit use of RT, interpretability, parameter stability, and calibration, while maintaining moderate computational demand. This suggests that RT-CDM provides a balanced framework that integrates the predictive validity of neural models with the diagnostic transparency of classical CDMs.

(4): Applicability conditions.

RT-CDM is most effective in timed or moderately constrained computer-based assessments where response time (RT) carries meaningful process signal and exhibits attribute-level dependence with response accuracy (RA). In power tests (ample time, little speed–accuracy pressure) or when RT is dominated by non-cognitive factors, the marginal value of RT is limited; under our hierarchical priors, time-modulation parameters naturally shrink, yielding a benign degeneration to an RA-dominant diagnosis.

(5): Limitations and boundary conditions.

Design dependence. Gains rely on a substantive speed–accuracy trade-off. Short tests or uneven attribute coverage can increase confounding between time and ability; adequate Q-matrix coverage and prior shrinkage help mitigate this.

Data quality and preprocessing. RT data are typically skewed and may contain extremely short (rapid-guessing) or long (off-task/external) latencies. In practice, we apply a log transform, trim or winsorize extremes using percentile-based rules (dataset-specific thresholds), and flag potential rapid-guessing records for exclusion. While these procedures improve robustness, they may also introduce subjective choices that affect reproducibility. Future work should investigate more automated and model-based approaches to handling extreme or missing RTs.

Computational costs and application scope. The MCMC estimation required by RT-CDM has significantly higher computational demands compared with neural CDMs. This limits its feasibility in environments requiring real-time diagnostic feedback. Accordingly, the true value of RT-CDM lies in offline, detailed analysis contexts—such as post hoc evaluation of large-scale assessments, curriculum studies, and high-stakes testing—where interpretability, diagnostic accuracy, and robustness are paramount. In contrast, for adaptive testing or classroom-level real-time feedback, more computationally efficient models may be preferable.

Baseline time and prior sensitivity. In principle, the item-level baseline should be treated as an unknown parameter and estimated jointly with other parameters to fully capture its uncertainty. For computational tractability, however, we fix

T_{j}

as the mean RT of item j computed from the training data only, and then apply these values as exogenous constants in the test phase. This avoids information leakage while maintaining efficiency. We note that mild mis-specification of

T_{j}

primarily affects the residual time component, which is partially compensated by the time-modulation parameters. Hierarchical priors further provide shrinkage to mitigate overfitting and identifiability concerns.

External heterogeneity. Device/platform differences (mobile vs. desktop), UI latency, and strategy shifts can contaminate RT; such covariates can be included in the time model, or heavy-tailed residuals (e.g., log-t) can be used to enhance robustness.

8. Conclusions

This study set out to enhance cognitive diagnosis by explicitly incorporating response time (RT) alongside response accuracy (RA) within a unified diagnostic framework. The proposed RT-CDM directly models the local dependence between RA and RT at the attribute level, thereby addressing limitations in both traditional psychometric models and recent neural CDMs [51]. To evaluate its effectiveness, we conducted simulation studies and analyses on multiple real datasets which consistently demonstrated that RT-CDM outperforms classical CDMs, RT-extended IRT models, and neural CDMs in terms of classification accuracy, parameter recovery, and predictive stability. These findings confirm that incorporating RT not only improves diagnostic precision but also yields deeper insights into learners’ proficiency and processing efficiency.

Although the current study provides strong empirical and theoretical support for RT-CDM, several avenues remain open for future research. First, more comprehensive sensitivity analyses should be conducted, particularly with respect to the specification of time parameters such as, to examine the robustness of RT-CDM under different testing conditions. Second, while this study validated RT-CDM on large-scale assessment data, future work may extend the framework to adaptive testing and classroom-based formative assessment, where time constraints vary dynamically. Third, integrating RT with additional behavioral signals (e.g., eye-tracking, clickstream data, or affective measures) may further enhance the ecological validity of cognitive diagnosis. Finally, hybridizing RT-CDM with neural architectures could provide a promising direction for combining interpretability with the flexibility of representation learning. These extensions will not only refine the methodological contributions of RT-CDM but also broaden its applicability in real-world educational settings.

Author Contributions

Conceptualization, X.L., H.Y. and J.G.; Writing—review & editing, X.L., Y.C. and H.Y.; Validation, H.Y.; Methodology, J.G.; Writing—original draft, J.G. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is supported by Planning Foundation of Social Science and Humanity, China Ministry of Education, No. 19YJA880026.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Liu, Q. Towards a new generation of cognitive diagnosis. In Proceedings of the International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 19–26 August 2021; pp. 4961–4964. [Google Scholar]
Reinhold, F.; Leuders, T.; Loibl, K.; Nückles, M.; Beege, M.; Boelmann, J.M. Learning mechanisms explaining learning with digital tools in educational settings: A cognitive process framework. Educ. Psychol. Rev. 2024, 36, 14. [Google Scholar] [CrossRef]
Tatsuoka, K.K. Rule space: An approach for dealing with misconceptions based on item response theory. J. Educ. Meas. 1983, 20, 345–354. [Google Scholar] [CrossRef]
Huang, Y.-M.; Lin, Y.-T.; Cheng, S.-C. An adaptive testing system for supporting versatile educational assessment. Comput. Educ. 2009, 52, 53–67. [Google Scholar] [CrossRef]
Beck, Z.A.P.J.E.; Heffernan, C.R.N.T. The composition effect: Conjunctive or compensatory? An analysis of multi-skill math questions in ITS. Educ. Data Min. 2008, 2008, 147. [Google Scholar]
Tuxtayevich, K.I.; Ahmatovna, P.S.; Turgunbayevna, M.N.; Rasulovna, R.M.; Qizi, T.F.R.; Qizi, Y.N.A. Different Approaches to Enhance Critical Thinking in Digital Education. SPAST Rep. 2024, 1, 7. [Google Scholar] [CrossRef]
Huang, T.; Geng, J.; Yang, H.; Hu, S.; Ou, X.; Hu, J.; Yang, Z. Interpretable neuro-cognitive diagnostic approach incorporating multidimensional features. Knowl.-Based Syst. 2024, 304, 112432. [Google Scholar] [CrossRef]
Lindsley, O.R. Precision teaching: Discoveries and effects. J. Appl. Behav. Anal. 1992, 25, 51–57. [Google Scholar] [CrossRef]
Huang, H.Y.; Wang, W.C. Multilevel higher-order item response theory models. Educ. Psychol. Meas. 2013, 74, 495–515. [Google Scholar] [CrossRef]
Li, X.; Wang, W.-C. Assessment of differential item functioning under cognitive diagnosis models: The DINA model example. J. Educ. Meas. 2015, 52, 28–54. [Google Scholar] [CrossRef]
Ulitzsch, E.; von Davier, M.; Pohl, S. A Multiprocess Item Response Model for not-reached items due to time limits and quitting. Educ. Psychol. Meas. 2020, 80, 522–547. [Google Scholar] [CrossRef]
Appleby, J.; Samuels, P.; Treasure-Jones, T. Diagnosys—A knowledge-based diagnostic test of basic mathematical skills. Comput. Educ. 1997, 28, 113–131. [Google Scholar] [CrossRef]
Xu, G.; Zhang, S. Identifiability of diagnostic classification models. Psychometrika 2016, 81, 625–649. [Google Scholar] [CrossRef]
Leighton, J.P.; Gierl, M.J.; Hunka, S.M. The attribute hierarchy method for cognitive assessment: A variation on Tatsuoka’s rule-space approach. J. Educ. Meas. 2004, 41, 205–237. [Google Scholar] [CrossRef]
Henson, R.; Douglas, J. Test construction for cognitive diagnosis. Appl. Psychol. Meas. 2005, 29, 262–277. [Google Scholar] [CrossRef]
Chen, H.; Chen, J. Retrofitting non-cognitive-diagnostic reading assessment under the generalized DINA model framework. Lang. Assess. Q. 2016, 13, 218–230. [Google Scholar] [CrossRef]
de la Torre, J. DINA model and parameter estimation: A didactic. J. Educ. Behav. Stat. 2008, 34, 115–130. [Google Scholar] [CrossRef]
de la Torre, J. The generalized DINA model framework. Psychometrika 2011, 76, 179–199. [Google Scholar] [CrossRef]
Chen, Y.; Culpepper, S.A.; Chen, Y.; Douglas, J. Bayesian Estimation of the DINA Q matrix. Psychometrika 2018, 83, 89–108. [Google Scholar] [CrossRef]
de la Torre, J. A cognitive diagnosis model for cognitively based multiple-choice options. Appl. Psychol. Meas. 2009, 33, 163–183. [Google Scholar] [CrossRef]
Ozaki, K. DINA Models for multiple-choice items with few parameters: Considering incorrect answers. Appl. Psychol. Meas. 2015, 39, 431–447. [Google Scholar] [CrossRef]
Wang, F.; Liu, Q.; Chen, E.; Huang, Z.; Chen, Y.; Yin, Y.; Huang, Z.; Wang, S. Neural cognitive diagnosis for intelligent education systems. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 6153–6161. [Google Scholar]
van der Linden, W.J. A lognormal model for response times on test items. J. Educ. Behav. Stat. 2006, 31, 181–204. [Google Scholar] [CrossRef]
Qi, T.; Ren, M.; Guo, L.; Li, X.; Li, J.; Zhang, L. ICD: A new interpretable cognitive diagnosis model for intelligent tutor systems. Expert Syst. Appl. 2023, 215, 119309. [Google Scholar] [CrossRef]
Li, J.; Wang, F.; Liu, Q.; Zhu, M.; Huang, W.; Huang, Z.; Chen, E.; Su, Y.; Wang, S. Hiercdf: A bayesian network-based hierarchical cognitive diagnosis framework. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 14–18 August 2022; pp. 904–913. [Google Scholar]
Huang, T.; Chen, Y.; Geng, J.; Yang, H.; Hu, S. A higher-order neural cognitive diagnosis model with hierarchical attention networks. Expert Syst. Appl. 2025, 273, 126848. [Google Scholar] [CrossRef]
De Boeck, P.; Jeon, M. An overview of models for response times and processes in cognitive tests. Front. Psychol. 2019, 10, 102. [Google Scholar] [CrossRef]
Entink, R.H.K.; van Der Linden, W.J.; Fox, J.-P. A Box-Cox normal model for response times. Br. J. Math. Stat. Psychol. 2009, 62, 321–640. [Google Scholar] [CrossRef]
Li, X.; Guo, S.; Wu, J.; Zheng, C. An interpretable polytomous cognitive diagnosis framework for predicting examinee performance. Inf. Process. Manag. 2025, 62, 103913. [Google Scholar] [CrossRef]
de la Torre, J.; Douglas, J.A. Higher-order latent trait models for cognitive diagnosis. Psychometrika 2004, 69, 333–353. [Google Scholar] [CrossRef]
Fox, J.-P.; Marianti, S. Joint modeling of ability and differential speed using responses and response times. Multivar. Behav. Res. 2016, 51, 540–553. [Google Scholar] [CrossRef]
van der Linden, W.J. Conceptual issues in response-time modeling. J. Educ. Meas. 2009, 46, 247–272. [Google Scholar] [CrossRef]
Bolsinova, M.; Molenaar, D. Modeling nonlinear conditional dependence between response time and accuracy. Front. Psychol. 2018, 9, 1525. [Google Scholar] [CrossRef]
Luce, R.D. Response Times: Their Role in Inferring Elementary Mental Organization; Oxford University Press on Demand: Kettering, UK, 1986. [Google Scholar]
Roskam, E.E. Models for speed and time-limit tests. In Handbook of Modern Item Response Theory; van der Linden, W.J., Hambleton, R.K., Eds.; Springer: Berlin/Heidelberg, Germany, 1997; pp. 187–208. [Google Scholar]
Zhan, P.; Liao, M.; Bian, Y. Joint testlet cognitive diagnosis modeling for paired local item dependence in response times and response accuracy. Front. Psychol. 2018, 9, 607. [Google Scholar] [CrossRef] [PubMed]
Ranger, J.; Kuhn, J.T.; Ortner, T.M. Modeling responses and response times in tests with the Hierarchical Model and the Three-Parameter Lognormal Distribution. Educ. Psychol. Meas. 2020, 80, 1059–1089. [Google Scholar] [CrossRef] [PubMed]
Kaya, Y.; Leite, W.L. Assessing change in latent skills across time with longitudinal cognitive diagnosis modeling: An evaluation of model performance. Educ. Psychol. Meas. 2017, 77, 369–388. [Google Scholar] [CrossRef] [PubMed]
Zhan, P.; Jiao, H.; Liao, D. Cognitive diagnosis modelling incorporating item response times. Br. J. Math. Stat. Psychol. 2018, 71, 262–286. [Google Scholar] [CrossRef]
Wang, S.; Chen, Y. Using response times and response accuracy to measure fluency within cognitive diagnosis models. Psychometrika 2020, 85, 600–629. [Google Scholar] [CrossRef]
van der Linden, W.J. A hierarchical framework for modeling speed and accuracy on test items. Psychometrika 2007, 72, 287. [Google Scholar] [CrossRef]
Manning, S.; Dix, A. Identifying students’ mathematical skills from a multiple-choice diagnostic test using an iterative technique to minimise false positives. Comput. Educ. 2008, 51, 1154–1171. [Google Scholar] [CrossRef]
Wang, S.; Zhang, S.; Shen, Y. A joint modeling framework of responses and response times to assess learning outcomes. Multivar. Behav. Res. 2020, 55, 49–68. [Google Scholar] [CrossRef]
Liu, Q.; Wu, R.; Chen, E.; Xu, G.; Su, Y.; Chen, Z.; Hu, G. Fuzzy Cognitive Diagnosis for Modelling Examinee Performance. ACM Trans. Intell. Syst. Technol. 2018, 9, 1–26. [Google Scholar] [CrossRef]
Liu, Y.; Zhang, T.; Wang, X.; Yu, G.; Li, T. New development of cognitive diagnosis models. Front. Comput. Sci. 2023, 17, 171604. [Google Scholar] [CrossRef]
Verhelst, N.D.; Verstralen, H.H.F.M.; Jansen, M.G.H. A logistic model for time-limit tests. In Handbook of Modern Item Response Theory; van der Linden, W.J., Hambleton, R.K., Eds.; Springer: Berlin/Heidelberg, Germany, 1997; pp. 169–185. [Google Scholar]
OECD. PISA 2015 Assessment and Analytical Framework: Science, Reading, Mathematic and Financial Literacy; PISA, OECD Publishing: Paris, France, 2016. [Google Scholar]
Wang, T.; Hanson, B.A. Development and calibration of an item response model that incorporates response time. Appl. Psychol. Meas. 2005, 29, 323–339. [Google Scholar] [CrossRef]
Junker, B.W.; Sijtsma, K. Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Appl. Psychol. Meas. 2001, 25, 258–272. [Google Scholar] [CrossRef]
Wu, X.; Wu, R.; Chang, H.H.; Kong, Q.; Zhang, Y. International comparative study on PISA mathematics achievement test based on cognitive diagnostic models. Front. Psychol. 2020, 11, 2230. [Google Scholar] [CrossRef]
Zhang, Z.; Wu, L.; Liu, Q.; Liu, J.; Huang, Z.; Yin, Y.; Zhuang, Y.; Gao, W.; Chen, E. Understanding and improving fairness in cognitive diagnosis. Sci. China Inf. Sci. 2024, 67, 152106. [Google Scholar] [CrossRef]

Figure 1. A toy example of RT-CDM.

Figure 2. The framework of RT-CDM.

Figure 3. The relationship between response time and the error term ε under different values of the time discrimination parameter λ. The curves illustrate how increasing λ shifts the decay pattern of ε, with all cases showing a downward convergence as response time increases. This figure highlights the role of λ in controlling the sensitivity of the error term to response time, which is a key component of the proposed RT-CDM model.

Figure 4. The latent response variate η.

Figure 5. RT-CDM parameter graph.

Figure 6. Item-level prediction performance of the compared models. Note. The four subplots present results across 17 items under different evaluation metrics: MAE, RMSE, ACC, and AUC. Each line corresponds to one baseline or proposed model (DINA, R-DINA, RT-CDM, JRT). RT-CDM consistently outperforms other models, demonstrating improved prediction accuracy and stability by incorporating response time.

Figure 7. DOA results across three datasets (PISA2015, EdNet, and MATH). The figure compares baseline models (DINA, NCD, ICD, MFNCD) with the proposed RT-CDM model. The results show that RT-CDM consistently achieves the highest DOA scores, indicating superior diagnostic interpretability and precision compared to both traditional probabilistic models and neural network-based models.

Figure 8. Q-matrix for simulation study. Note. Blue = 0; green = 1.

Table 1. Q-Matrix for PISA 2015 computer-based mathematics items.

Items	α₁	α₂	α₃	α₄	α₅	α₆	α₇
CM033Q01	0	1	0	0	0	0	1
CM474Q01	0	0	1	0	0	1	0
CM155Q01	1	0	0	0	0	1	0
CM155Q04	1	0	0	0	0	0	1
CM411Q01	0	0	1	0	0	1	0
CM411Q02	0	0	0	1	0	0	1
CM803Q01	0	0	0	1	1	0	0
CM442Q02	0	0	1	0	0	0	1
CM034Q01	0	1	0	0	1	0	0
CM305Q01	0	1	0	0	0	1	0
CM496Q01	0	0	1	0	1	0	0
CM496Q02	0	0	1	0	0	1	0
CM423Q01	0	0	0	1	0	0	1
CM603Q01	0	0	1	0	0	1	0
CM571Q01	1	0	0	0	0	0	1
CM564Q01	0	0	1	0	1	0	0
CM564Q02	0	0	0	1	1	0	0

Table 2. Dataset statistics.

Statistics	PISA2015	EdNet	MATH
# Students	6000	5739	4209
# Items	17	27	20
# Attributes	7	17	11
# Item/Attribute	2.43	1.59	1.82
Response time	√	√	×
Positive label rate	51.4%	48.9%	63.9%

Table 3. Experimental results on student performance prediction. The best results are bolded.

	Models	PISA 2015				EdNet				MATH
	Models	MAE	RMSE	ACC	AUC	MAE	RMSE	ACC	AUC	MAE	RMSE	ACC	AUC
RA-based model	IRT	0.417	0.664	0.559	0.566	0.375	0.478	0.653	0.652	0.353	0.482	0.767	0.768
	DINA	0.401	0.633	0.599	0.606	0.345	0.587	0.655	0.595	0.326	0.477	0.773	0.796
	NCD	0.351	0.474	0.691	0.747	0.312	0.438	0.708	0.746	0.315	0.418	0.741	0.817
	ICD	0.324	0.523	0.661	0.745	0.308	0.440	0.710	0.748	0.308	0.423	0.778	0.819
	R-DINA	0.212	0.46	0.653	0.759	0.289	0.471	0.708	0.732	0.292	0.402	0.723	0.765
RT-RA Jointly model	JRT	0.349	0.591	0.651	0.733	0.301	0.532	0.691	0.741	-	-	-	-
	4P-IRT	0.218	0.475	0.642	0.689	0.215	0.461	0.672	0.708	-	-	-	-
	MFNCD	0.211	0.441	0.71	0.776	0.211	0.442	0.711	0.745	0.306	0.414	0.797	0.830
	RT-CDM	0.207	0.455	0.733	0.792	0.225	0.462	0.706	0.743	0.292	0.402	0.718	0.762

Table 4. The MAE, RMSE, ACC and AUC of two models.

Models	Error		Accuracy
Models	MAE	RMSE	ACC	AUC
R-DINA	0.148	0.384	0.852	0.656
RT-CDM	0.026	0.161	0.974	0.802

Table 5. Item parameter recovery of two models.

Index		RT-CDM				R-DINA
Index		s	g	$β$	$λ$	s	g
MAE	Mean	0.146	0.188	0.581	1.207	0.383	0.184
MAE	SD	0	0.001	0.018	0.014	0	0
RMSE	Mean	0.176	0.227	1.587	1.273	0.393	0.224
RMSE	SD	0	0.001	0.003	0.01	0	0

Table 6. Percentage of correctly classified attributes.

100 Replications	ACCR					PCCR
100 Replications	α₁	α₂	α₃	α₄	α₅	PCCR
MFNCD	0.732	0.716	0.723	0.756	0.745	0.523
JRT-DINA	0.941	0.932	0.921	0.935	0.938	0.742
R-DINA	0.938	0.944	0.911	0.925	0.944	0.726
RT-CDM	0.953	0.957	0.933	0.957	0.958	0.815

Table 7. Qualitative comparison of RT-CDM with existing cognitive diagnostic frameworks.

Dimension	Traditional CDMs (IRT, DINA)	JRT-DINA	Neural CDMs (NCD, ICD)	MFNCD	RT-CDM (Ours)
Representation of mastery	Binary	Binary	Continuous/latent, opaque	Continuous/latent, opaque	Continuous, interpretable (R-DINA)
Use of response time (RT)	×	◎ (indirect)	×	◎ (latent embedding)	√ (explicit coupling)
Interpretability	√	◎	×	×	√
Predictive performance	×	◎	√	√	√
Best applicable scenarios	Classical diagnostic needs	Limited	Large-scale prediction tasks	Large-scale prediction tasks	Timed/moderately constrained CBTs

Notes: √ denotes the model performs well or has a clear advantage in this dimension. × denotes the model lacks this capability or does not support this dimension. ◎ denotes the model partially supports this dimension or works under specific conditions, but with limitations.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, X.; Chen, Y.; Yang, H.; Geng, J. A Joint Diagnosis Model Using Response Time and Accuracy for Online Learning Assessment. Electronics 2025, 14, 3873. https://doi.org/10.3390/electronics14193873

AMA Style

Li X, Chen Y, Yang H, Geng J. A Joint Diagnosis Model Using Response Time and Accuracy for Online Learning Assessment. Electronics. 2025; 14(19):3873. https://doi.org/10.3390/electronics14193873

Chicago/Turabian Style

Li, Xia, Yuxia Chen, Huali Yang, and Jing Geng. 2025. "A Joint Diagnosis Model Using Response Time and Accuracy for Online Learning Assessment" Electronics 14, no. 19: 3873. https://doi.org/10.3390/electronics14193873

APA Style

Li, X., Chen, Y., Yang, H., & Geng, J. (2025). A Joint Diagnosis Model Using Response Time and Accuracy for Online Learning Assessment. Electronics, 14(19), 3873. https://doi.org/10.3390/electronics14193873

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Joint Diagnosis Model Using Response Time and Accuracy for Online Learning Assessment

Abstract

1. Introduction

2. Problem Formulation

3. Methods

3.1. Response Time Model

3.2. R-DINA Model

3.3. Response Time—Cognitive Diagnosis Model Framework

4. Estimation

4.1. Model Parameter Estimation

4.2. Markov Chain Monte Carlo

5. Real Data Experiments

5.1. Data Description

5.2. Analysis

5.2.1. Baseline Models

5.2.2. Experimental Details

5.2.3. Experimental Results

6. Simulation Study

6.1. Data Generation

6.2. Analysis

6.3. Results

7. Discussion

8. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI