1. Introduction
Driven by the accelerating digital transformation of education, the proliferation of online learning has positioned personalized instruction as both a central research focus [1,2,3] and a fundamental pedagogical requirement [4,5,6]. Within these environments, accurately assessing learner performance and diagnosing latent cognitive skills is essential [7,8]. However, traditional psychometric approaches—primarily classical test theory and unidimensional item response theory (IRT) [9]—typically return a single overall ability score, offering insufficient diagnostic granularity for settings that demand fine-grained, personalized intervention [10,11].
Cognitive diagnosis theory (CDT) addresses this limitation by modelling examinees’ mastery of multiple fine-grained skills or attributes [12,13]. Within CDT, probabilistic CDMs—such as the Rule Space Model (RSM) [14], the Attribute Hierarchy Model (AHM) [14,15], and the Deterministic Inputs, Noisy “And” Gate (DINA) model with its extensions [16,17,18], including multiple-choice variants [19,20,21]—provide interpretable, skill-level feedback grounded in explicit cognitive structures. In parallel, neural CDMs (e.g., the Neural cognitive diagnosis model (NCD) [22,23], the Interpretable CD (ICD) [24], the Hierarchical cognitive diagnosis framework (hierarchical CDF) [25], and the Higher-Order Neural CD (HO-NCD) [26]) leverage representation learning to advance predictive performance. Yet across both families, modelling typically relies almost exclusively on binary response accuracy (RA) while overlooking response time (RT)—a key indicator of processing efficiency, engagement, and the speed–accuracy trade-off [27,28].
RT is now widely collected in computer-based assessments [29], and psychometric research shows that jointly modelling RA and RT helps disentangle ability and speed [23,28,30,31]. Four-parameter IRT (4P-IRT) and hierarchical RT–IRT frameworks typically link accuracy and time via higher-level correlations between latent speed and ability [28,32], with extensions such as Box–Cox transformations [28], differential speed modelling [31], and dynamic ability tracking [23]. Nonetheless, these approaches remain essentially unidimensional and often assume conditional independence of RA and RT given the latent variables—an assumption challenged by empirical evidence [33], including the classic speed–accuracy trade-off [34] and Roskam’s finding that the probability of correctness increases asymptotically with time [35]. Attempts to incorporate RT into CDMs—such as JRT-DINA within joint testlet structures [36], combining RT with CDM outputs for fluency [23], or log-normal hierarchical time models [37]—have advanced the field but still largely rely on hierarchical couplings; more recent neural approaches like MFNCD [7] integrate RT as an auxiliary feature within a neural CDM to enhance prediction, but at the cost of limited interpretability at the attribute level.
Despite these advances, two gaps remain critical for diagnosis: (1) Attribute-level coupling. Many approaches are effectively unidimensional or assume conditional independence between RA and RT given latent speed/ability, which is frequently violated in practice [33] and obscures the local (attribute-level) dependence implied by the speed–accuracy trade-off [34,35]. (2) Interpretability vs. process sensitivity. Neural CDMs can be strong predictors but often lack transparent links to cognitive constructs; classical CDMs are interpretable but RT-agnostic, missing process information that can sharpen mastery inferences—especially in timed or moderately constrained assessments [38]. These limitations motivate a model that explicitly links RT [39] and RA at the attribute level, retaining interpretability while exploiting process data [40].
To address the limitations of existing CDMs, we propose RT-CDM, a cognitive diagnosis model that explicitly incorporates response time (RT) as an attribute-level covariate to capture the speed–accuracy trade-off while retaining interpretability. As a necessary foundation, we first generalize the binary mastery representation into a continuous-attribute variant (R-DINA), which relaxes the strict dichotomy of mastery versus non-mastery and provides smoother diagnostic inference. Building on this extension, RT-CDM augments the diagnostic process with a continuous latent-speed component (item-specific time intensity and discrimination), directly modeling local RT–RA dependence and yielding more precise and stable mastery inferences. Parameters are estimated via a data-augmented MCMC scheme. RT-CDM is most suitable for timed or moderately constrained assessments where RT reflects meaningful cognitive effort; in power tests with ample time, the incremental value of RT is expected to be limited. The main contributions of this work are summarized as follows:
- (1)
Modeling innovation. We formalize local RT–RA dependence at the attribute level and propose RT-CDM as a cognitively interpretable diagnostic mechanism that integrates response time with response accuracy. By introducing a continuous latent-speed component (item-specific time intensity and discrimination), RT-CDM uses deviations in response time as diagnostic signals, thereby improving both interpretability and robustness of mastery inference.
- (2)
Extension of mastery representation. As a foundational step, we generalize binary mastery into a continuous-attribute representation (R-DINA). This extension relaxes the strict dichotomy of mastery versus non-mastery, yields smoother diagnostic inference, and provides the basis upon which RT-CDM is developed.
- (3)
Comprehensive evaluation. We perform controlled simulations and empirical analyses on three large-scale datasets—PISA 2015, EdNet, and MATH—comparing RT-CDM against classical CDMs (e.g., IRT, DINA), RT-extended models (e.g., 4P-IRT, JRT-DINA), and neural CDMs (e.g., NCD, ICD, MFNCD). Across these datasets, RT-CDM consistently demonstrates superior predictive accuracy and calibration, stronger interpretability (e.g., higher DOA), and more robust parameter recovery.
2. Problem Formulation
The model evaluates students’ mastery of attributes (such as skills and abilities) in a cognitive assessment within a computer-based learning environment. To formulate the model, suppose an assessment consists of J items measuring K attributes and is answered by N examinees. Let the J × K matrix Q = (qjk) denote the relationship between the J items and the K attributes, with element qjk = 1 if the jth item requires the kth attribute and 0 otherwise. Through the assessment items of each test, two types of multivariate data are collected. The first is the response data of the examinees, denoted by X = (xij), where element xij = 1 if the ith examinee answers the jth item correctly and 0 otherwise. The second is the RT data recorded as examinees answer each item, denoted by T = (tij). The ith examinee’s speed is denoted as τi. In addition, the standard RT for each item j is denoted by t̄j. A toy example of RT-CDM is shown in Figure 1.
The hierarchical model [41] assumes conditional independence between the RT and the responses as follows:

P(xij, tij | αi, τi) = P(xij | αi) · f(tij | τi).

However, this conditionally independent distribution does not fit real-world scenarios; on the contrary, more information can be learned from the conditional dependence between RT and RA. Therefore, the proposed modelling framework adopts the following assumptions.
- (1)
Each examinee’s latent ability is denoted by a multidimensional binary latent variable, called the attribute profile and described as αi = (αi1, …, αiK), where αik = 1 if examinee i masters attribute k and 0 otherwise.
- (2)
The RA of examinee i is determined not only by the attribute profile αi and the item characteristics of the test, but also by the examinee’s RT tij.
- (3)
The examinee’s RT tij is related to the speed τi, whose effect is not constant across items and is modulated by the item discrimination φj.
- (4)
The matching degree between the examinee’s RT and the standard RT of the item, denoted by εij, is determined by the time intensity βj of item j, the examinee’s speed τi, and the examinee’s RT tij.
Based on these assumptions, the proposed modelling framework, named RT-CDM, introduces response time into the diagnosis. The joint model for the conditional dependence between RT and RA is the following:

P(xij, tij | αi, τi) = f(tij | τi, βj, φj) · P(xij | αi, εij),

where the time error εij, itself a function of tij, τi, βj, and φj, carries the RT information into the response model, so that RA and RT remain locally dependent given the latent variables.
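To fix notation, the sketch below encodes a toy instance of these objects in NumPy; all sizes and values are invented for illustration, not drawn from the paper’s datasets:

```python
# Minimal illustration of the notation in Section 2; shapes and values are
# invented for demonstration, not taken from the paper's data.
import numpy as np

N, J, K = 4, 3, 2                 # examinees, items, attributes

Q = np.array([[1, 0],             # q_jk = 1 if item j requires attribute k
              [0, 1],
              [1, 1]])

X = np.array([[1, 0, 1],          # x_ij = 1 if examinee i answers item j correctly
              [0, 1, 0],
              [1, 1, 1],
              [0, 0, 1]])

T = np.array([[12.4,  8.1, 30.2], # t_ij: response time in seconds
              [25.0, 14.7, 41.9],
              [ 9.8,  7.3, 22.5],
              [18.6, 11.2, 35.0]])

t_bar = T.mean(axis=0)            # standard (baseline) RT per item
assert Q.shape == (J, K) and X.shape == T.shape == (N, J)
```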
3. Methods
RT-CDM introduces a continuous item mastery function and an RT function to achieve a more precise cognitive analysis. The framework of RT-CDM is shown in Figure 2. First, the examinee’s response times, response results, and the Q-matrix are input. Second, the factors that affect the examinee’s cognitive process, including response speed, mistakes, guessing, and other factors, are analysed and modelled. Finally, the examinee’s skill mastery is obtained from the model.
3.1. Response Time Model
Response time (RT) has been widely modeled using a lognormal distribution [42], which assumes that the logarithm of observed RT follows a normal distribution [43]. The standard lognormal model is specified as

ln tij = βj − τi + εij,  εij ~ N(0, σj²),

where tij is the RT of examinee i on item j, βj is the item time-intensity parameter, τi is the latent speed of examinee i, and εij is a random error term. This formulation, however, assumes that an examinee’s speed parameter τi is constant across all items, which is often unrealistic in practice because different items require different baseline amounts of time to complete.
To address this issue, Fox and Marianti [31] extended the lognormal RT model by introducing an item discrimination parameter φj, yielding the following specification:

ln tij = βj − φj τi + εij.

Here, φj allows examinees’ latent speed to interact with item-specific characteristics, thereby accommodating heterogeneity in item timing demands. Nevertheless, the model still inherits the limitation that τi is a global speed factor, assuming that examinees’ relative speed is constant across all items. This does not fully reflect the reality that items differ not only in discrimination but also in their intrinsic baseline time requirements. To illustrate how φj works, Figure 3 displays the relationship between the RT tij and the error εij under different values of the discrimination parameter φj. For the same RT on an item, the greater the value of φj, the greater the εij, thereby relaxing the assumption that examinee speed τi acts identically on every item.
To overcome this limitation, we introduce an item-specific baseline response time, denoted by t̄j. This parameter represents the expected RT for item j under average speed, allowing observed RTs to be standardized across items. Specifically, we define the log-transformed residual workload as

zij = ln tij − ln t̄j,

where t̄j serves as a reference point that adjusts for the baseline difficulty or workload of each item. The proposed RT model can then be expressed as

zij = βj − φj τi − εij,  εij ~ N(0, σj²),

so that the time error εij = (βj − φj τi) − zij is positive when examinee i responds faster than the model expects and negative when slower. This specification ensures that RTs are comparable across items, since deviations are measured relative to each item’s baseline requirement.
Regarding the treatment of t̄j, the most rigorous approach is to estimate it jointly with the other model parameters within a Bayesian framework, thereby incorporating parameter uncertainty. However, due to the substantial computational cost of such an approach, in our implementation we fix t̄j as the mean RT of item j computed from the training data only, and apply these values as exogenous constants when evaluating the model on the test data. This practice avoids data leakage while maintaining computational efficiency. We explicitly acknowledge this choice and its limitations in Section 7.
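In code, fixing the baseline from the training split and forming the residual workload amounts to the following sketch; the data and the values of βj, φj, and τi are placeholders for illustration, not estimates:

```python
# A minimal NumPy sketch of the standardized RT model above; the baseline
# t_bar_j is fixed to the item-mean RT on the training split, as described.
import numpy as np

rng = np.random.default_rng(1)
T_train = rng.lognormal(mean=3.0, sigma=0.4, size=(800, 20))  # train RTs (sec)
T_test = rng.lognormal(mean=3.0, sigma=0.4, size=(200, 20))   # test RTs

t_bar = T_train.mean(axis=0)          # item baseline RT from training data only

def residual_workload(T, t_bar):
    """z_ij = ln t_ij - ln tbar_j: log RT centered on each item's baseline."""
    return np.log(T) - np.log(t_bar)

z_test = residual_workload(T_test, t_bar)

# Time error eps_ij = (beta_j - phi_j * tau_i) - z_ij: positive when the
# examinee is faster than expected, negative when slower.
beta = np.zeros(20)                   # residual time intensity (illustrative)
phi = np.full(20, 0.5)                # time discrimination (illustrative)
tau = rng.normal(0.0, 1.0, size=200)  # latent speed (illustrative)
eps = (beta - phi * tau[:, None]) - z_test
```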
3.2. R-DINA Model
To capture examinees’ problem-solving accuracy, we first extend the conventional DINA model by allowing the mastery indicator to take continuous values. For examinee i and item j, the mastery indicator is defined as

ηij = (Σk αik qjk) / (Σk qjk),

where αi denotes the attribute profile of examinee i, and qj is the jth row of the Q-matrix indicating which attributes are required by item j. The numerator counts the number of required attributes mastered by the examinee, and the denominator is the total number of attributes required by the item. Hence, ηij represents the proportion of required attributes mastered, with ηij = 0 indicating no mastery and ηij = 1 indicating full mastery. Intermediate values reflect partial mastery.
Given this continuous mastery indicator, the R-DINA model specifies the probability of a correct response as

P(xij = 1 | ηij) = gj + (1 − sj − gj) ηij,

where gj and sj are the guessing and slipping parameters, constrained by gj ∈ (0, 1) and sj ∈ (0, 1). This formulation preserves the diagnostic interpretation: when ηij = 1, the success probability equals 1 − sj; when ηij = 0, it equals gj. In between, ηij takes fractional values in the range 0–1. For example, if an item tests three attributes (1 1 1), ηij divides examinees into four types: those mastering none, one, two, or all three attributes (ηij = 0, 1/3, 2/3, 1). The polychotomy of ηij thus partitions examinees’ mastery modes more finely: ηij = 1 indicates that the examinee has mastered all the skills required for the item (see Figure 4), and 0 < ηij < 1 indicates that the examinee has mastered some of the skills required for the item. By allowing ηij to vary continuously, R-DINA generalizes the mastery representation, provides smoother diagnostic inference, and preserves interpretability while relaxing the strict dichotomy of DINA, laying the necessary foundation for incorporating process data such as RT.
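The continuous mastery indicator and the response probability are straightforward to compute; the sketch below (with illustrative values, under the linear guess–slip interpolation described above) shows both steps:

```python
# Continuous mastery indicator eta and R-DINA response probability.
import numpy as np

Q = np.array([[1, 1, 1],        # item 1 requires all three attributes
              [1, 0, 1]])       # item 2 requires attributes 1 and 3
alpha = np.array([[1, 0, 1],    # examinee A masters attributes 1 and 3
                  [0, 0, 1]])   # examinee B masters attribute 3 only

eta = (alpha @ Q.T) / Q.sum(axis=1)   # proportion of required attributes mastered
# eta = [[2/3, 1.0], [1/3, 0.5]]

g = np.array([0.2, 0.25])       # guessing parameters (illustrative)
s = np.array([0.1, 0.15])       # slipping parameters (illustrative)
p_correct = g + (1.0 - s - g) * eta   # equals g at eta=0 and 1-s at eta=1
```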
3.3. Response Time—Cognitive Diagnosis Model Framework
To incorporate response time into the diagnostic framework, we extend the R-DINA model by integrating the time error term εij derived from the standardized RT model in Section 3.1. The probability of a correct response is modeled as a Bernoulli random variable with mean Pij:

xij ~ Bernoulli(Pij),  logit(Pij) = logit(gj + (1 − sj − gj) ηij) + λj f(εij),

where εij is the standardized time error, representing the discrepancy between the observed RT and the expected workload of item j; f(·) is a monotonic transformation of the time error, centered so that its mean is zero; and λj is an item-specific coefficient quantifying the magnitude and direction of the RT effect. This specification ensures interpretability: when ηij = 1 and f(εij) = 0, the success probability reduces to 1 − sj; when ηij = 0 and f(εij) = 0, it reduces to gj.
Cognitive interpretation. This additive logit model can be interpreted as combining two independent sources of evidence:
- (1)
Mastery-based evidence (ηij) from cognitive attributes;
- (2)
Time-based evidence (εij) reflecting efficiency relative to expectations.
When λj > 0, taking longer than expected (negative εij) lowers the log-odds of success, consistent with the interpretation that hesitation acts as negative evidence. Conversely, efficient use of time (positive εij) strengthens the probability of success.
Thus, RT-CDM retains the classical diagnostic meaning of guessing and slipping, while incorporating RT deviations as an additional diagnostic signal. This allows the model to capture local RT–RA dependence and explain the speed–accuracy trade-off in a cognitively interpretable manner. A graphical representation of the RT-CDM that jointly models RT and RA is displayed in
Figure 5.
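To make the mechanics of the additive logit concrete, the following minimal sketch computes the RT-CDM success probability; the identity choice for f and all parameter values are assumptions for demonstration only:

```python
# RT-CDM success probability combining mastery and time-error evidence
# on the logit scale; f is taken as the identity here for illustration.
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rt_cdm_prob(eta, eps, g, s, lam):
    """P(x=1) = sigmoid(logit(g + (1-s-g)*eta) + lam * f(eps)), f = identity."""
    base = g + (1.0 - s - g) * eta        # R-DINA mastery-based probability
    return sigmoid(logit(base) + lam * eps)

# Full mastery, faster than expected (eps > 0): probability rises above 1-s.
print(rt_cdm_prob(eta=1.0, eps=0.8, g=0.2, s=0.1, lam=0.6))   # ~0.94
# Full mastery but unusually slow (eps < 0): hesitation lowers the probability.
print(rt_cdm_prob(eta=1.0, eps=-0.8, g=0.2, s=0.1, lam=0.6))  # ~0.85
```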
Monotonicity. The proposed RT-CDM preserves the monotonicity property widely recognized in cognitive diagnostic modelling [44,45]. By retaining the diagnostic structure of the R-DINA framework, the probability of a correct response remains non-decreasing with respect to attribute mastery. Furthermore, the time-related component εij is modelled as a monotonically decreasing function of response time, reflecting the empirical observation that longer latencies typically indicate lower efficiency or uncertainty. Together, these design choices ensure that the incorporation of response time does not undermine the fundamental monotonicity assumption of CDMs.
Identifiability. Regarding identifiability, the RT-CDM builds upon the R-DINA framework, whose identifiability conditions have been well established in the literature. Although our model introduces additional behavioral parameters through the integration of response time, potential identifiability issues are mitigated by fixing anchoring parameters, employing a well-structured Q-matrix, and applying regularized estimation [46]. Moreover, parameter constraints (β, τ > 0; λ ∈ (0,1); g, s ∈ (0,1)) and hierarchical priors are imposed to avoid over-parameterization. Empirical evidence further shows that the model achieves stable convergence across multiple runs, supporting its practical identifiability.
4. Estimation
A fully Bayesian formulation with Markov chain Monte Carlo (MCMC) was adopted to estimate the parameters of the RT-CDM.
4.1. Model Parameter Estimation
We adopt a fully Bayesian framework for parameter estimation in RT-CDM. Let the observed response accuracy and response time be X = (xij) and T = (tij), with corresponding latent mastery profiles α = (α1, …, αN).
- (1)
Likelihood function.
The probability of a correct response conditional on skills and time error εij is

logit(Pij) = logit(gj + (1 − sj − gj) ηij) + λj f(εij),

where ηij is the conjunctive mastery indicator and f(εij) is the time-error function. The log-response time is modeled as ln tij ~ N(ln t̄j + βj − φj τi, σj²).
Thus, the joint likelihood is

L(Θ; X, T) = ∏i ∏j Pij^xij (1 − Pij)^(1−xij) · f(tij | t̄j, βj, φj, τi, σj²).
- (2)
Prior distributions.
We use weakly informative priors to ensure identifiability and shrinkage: Beta priors for the bounded parameters gj, sj, and λj; truncated normal priors for the positive parameters βj and φj; and a standard normal prior for the latent speed τi. To address scale non-identifiability between the speed τi and the time parameters βj and φj, we fix the mean of τ at zero and its variance at one. This provides a location–scale anchor and avoids over-parameterization.
- (3)
Posterior distribution.
The joint posterior distribution of Θ = (α, g, s, λ, β, φ, τ, σ²) given X and T is

p(Θ | X, T) ∝ L(Θ; X, T) · p(Θ).
4.2. Markov Chain Monte Carlo
We employ data-augmented MCMC for posterior sampling. Specifically,
Sampler: the No-U-Turn Sampler (NUTS) for the continuous parameters (g, s, λ, β, φ, τ, σ²), and Gibbs updates for the discrete mastery profiles α.
Blocking & vectorization: item-level parameters are updated in parallel, and likelihood terms are vectorized, reducing per-iteration runtime from elementwise loops over all N × J responses to batched matrix operations.
Convergence checks: chains are run with multiple seeds; convergence is assessed using the Gelman–Rubin statistic (R̂) and the effective sample size (ESS).
This yields a stable and computationally efficient estimation routine while ensuring transparency in the connection between the likelihood and the sampling procedure.
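For concreteness, the sketch below expresses the joint RA–RT likelihood in PyMC3, the framework used in Section 5.2.2. It is a simplified illustration rather than the exact estimation routine: for brevity, the discrete mastery profiles α are relaxed to continuous values sampled by NUTS instead of being updated by Gibbs steps, and all prior choices, data, and dimensions are assumptions:

```python
# Simplified PyMC3 sketch of the RT-CDM joint likelihood; all data and
# priors are illustrative, and alpha is relaxed to continuous values.
import numpy as np
import pymc3 as pm

N, J, K = 100, 20, 5                                # examinees, items, attributes
rng = np.random.default_rng(0)
Q = rng.integers(0, 2, size=(J, K)); Q[Q.sum(1) == 0, 0] = 1
X = rng.integers(0, 2, size=(N, J)).astype(float)   # response accuracy
logT = rng.normal(3.0, 0.5, size=(N, J))            # log response times
log_t_bar = logT.mean(axis=0)                       # item baseline (train mean)

with pm.Model() as rt_cdm:
    alpha = pm.Beta("alpha", 1.0, 1.0, shape=(N, K))            # relaxed mastery
    eta = pm.Deterministic("eta", pm.math.dot(alpha, Q.T) / Q.sum(axis=1))
    g = pm.Beta("g", 2.0, 5.0, shape=J)             # guessing
    s = pm.Beta("s", 2.0, 5.0, shape=J)             # slipping
    lam = pm.Beta("lam", 2.0, 2.0, shape=J)         # RT-effect size, in (0,1)
    beta = pm.HalfNormal("beta", 1.0, shape=J)      # residual time intensity
    phi = pm.HalfNormal("phi", 1.0, shape=J)        # time discrimination
    tau = pm.Normal("tau", 0.0, 1.0, shape=N)       # latent speed (anchored)
    sigma = pm.HalfNormal("sigma", 1.0, shape=J)

    # Time model: z_ij = ln t_ij - ln tbar_j = beta_j - phi_j * tau_i - eps_ij
    z = logT - log_t_bar
    mu_z = beta - phi * tau[:, None]
    pm.Normal("z_obs", mu=mu_z, sigma=sigma, observed=z)

    # Accuracy model: logit(P) = logit(g + (1-s-g)*eta) + lam * eps
    eps = mu_z - z                                  # positive when faster than expected
    p_base = g + (1.0 - s - g) * eta
    p = pm.math.invlogit(pm.math.logit(p_base) + lam * eps)
    pm.Bernoulli("x_obs", p=p, observed=X)

    trace = pm.sample(1000, tune=1000, target_accept=0.9, cores=2)
```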
5. Real Data Experiments
In this section, we evaluated the proposed RT-CDM model across three datasets and compared it with various baseline methods. We aimed to assess both the predictive accuracy and the interpretability of the model. Extensive experiments were conducted to investigate whether incorporating response time information improves prediction performance and whether the model can provide reliable diagnostic insights into students’ attribute mastery. Based on these experiments and analyses, we addressed the following research questions:
RQ1. How effectively does RT-CDM predict student performance compared to baseline methods?
RQ2. To what extent does RT-CDM enhance interpretability in diagnosing students’ attribute mastery relative to existing cognitive diagnosis models?
5.1. Data Description
To evaluate the performance of the proposed model, we conducted experiments on three representative datasets: PISA2015 (standardized assessment), EdNet-KT1 (large-scale longitudinal learning logs), and MATH (static exam data), covering diverse educational scenarios.
PISA (https://www.oecd.org/pisa/data/2015database/): The PISA 2015 mathematics dataset evaluates the mathematical literacy of 15-year-old students by measuring their ability to apply mathematical knowledge and skills in diverse real-world contexts [47]. In this study, we focus on seven mathematics-related attributes defined in the PISA framework: change and relationships (α1), space and shape (α2), quantity (α3), uncertainty and data (α4), formulating situations mathematically (α5), employing mathematical concepts, procedures, and reasoning (α6), and interpreting, applying, and evaluating mathematical outcomes (α7). In addition to students’ responses, the dataset also records response times, enabling analyses not only of problem-solving accuracy but also of cognitive processing efficiency and test-taking behaviors. This dual information provides richer insights into students’ mathematical understanding and their readiness to apply mathematics in personal, occupational, and societal contexts. The corresponding Q-matrix, constructed and validated by domain experts in prior work [48], is presented in Table 1.
EdNet (
https://github.com/riiid/ednet): The EdNet-KT1 dataset, collected from the Santa Academy learning platform, provides large-scale longitudinal logs of students’ learning activities. Each record includes a student identifier, question ID, answer submission, timestamp, session ID, and the elapsed time spent on solving each problem. With millions of interaction records covering thousands of unique questions, the dataset captures both the correctness of responses and the dynamics of response times, offering opportunities to analyze students’ knowledge acquisition, learning behaviours, and temporal patterns of problem-solving.
Math (
https://edudata.readthedocs.io/en/latest/tutorial/zh/DataSet.html#math2015): The MATH dataset, developed by iFLYTEK Co., Ltd., is based on data collected via the iFLYTEK Learning Machine from a final mathematics examination for high school students. As a static single-assessment dataset, it features a dense structural design that captures detailed student responses, exemplifying the traditional model of standardized evaluation while offering a solid foundation for analysing performance patterns and learning outcomes.
In the preprocessing stage, we first removed incomplete or erroneous records (e.g., missing responses, non-positive or incoherent timestamps) and anonymized all student identifiers. We standardized interaction logs by aligning item identifiers, normalizing correctness labels, and unifying response-time units. For RT specifically, we applied a log transform and, for each item j, computed a baseline time t̄j (item-level average/robust mean) to form the centered measure zij = ln tij − ln t̄j. To improve robustness, we trimmed/winsorized extreme latencies using percentile-based rules (dataset-specific tails) and flagged potential rapid-guessing or off-task records for exclusion from the time likelihood while retaining accuracy information. RT missingness/censoring (e.g., timeouts) was handled under a missing-at-random assumption conditional on item and learner factors; such cases were either excluded from the time component or imputed using a truncated normal model [17,49]. To address sparsity, we filtered students and items with insufficient interactions. The cleaned data were then converted into sequential formats suitable for knowledge tracing and model training. Summary statistics after preprocessing are reported in Table 2.
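The RT preprocessing steps described above can be summarized in a short pandas sketch; the percentile bounds and the rapid-guessing cutoff below are illustrative, not the dataset-specific rules used in the paper:

```python
# Illustrative RT preprocessing: winsorize, flag rapid guesses, log-center.
import numpy as np
import pandas as pd

logs = pd.DataFrame({
    "student": [1, 1, 2, 2, 3],
    "item":    [10, 11, 10, 11, 10],
    "correct": [1, 0, 1, 1, 0],
    "rt_sec":  [12.0, 45.0, 8.0, 0.3, 300.0],
})

# Winsorize extreme latencies per item at the 1st/99th percentiles (illustrative).
lo = logs.groupby("item")["rt_sec"].transform(lambda x: x.quantile(0.01))
hi = logs.groupby("item")["rt_sec"].transform(lambda x: x.quantile(0.99))
logs["rt_sec"] = logs["rt_sec"].clip(lo, hi)

# Flag likely rapid-guessing responses (e.g., under 1 s) for exclusion from
# the time likelihood while keeping their accuracy information.
logs["rapid_guess"] = logs["rt_sec"] < 1.0

# Log-transform and center on the item baseline: z_ij = ln t_ij - ln tbar_j.
logs["log_rt"] = np.log(logs["rt_sec"])
logs["z"] = logs["log_rt"] - logs.groupby("item")["log_rt"].transform("mean")
```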
5.2. Analysis
Based on the real datasets, eight baseline models were selected for comparison with the proposed model from the perspectives of accuracy and convergence. Two Markov chains were run for each model, with 8000 iterations per chain. The first 5000 iterations in each chain were discarded as burn-in, and the last 3000 iterations were used to compute the point estimates of the model parameters.
5.2.1. Baseline Models
To ensure a comprehensive and fair evaluation, we selected representative baseline models from two perspectives: (1) We considered whether response time (RT) information is incorporated into the modeling process. Models that focus solely on response accuracy (RA), such as IRT, DINA, NCD, and ICD, provide a comparison point for our refined R-DINA model, while models that jointly model RA and RT, including JRT-DINA, 4P-IRT, and MFNCD, serve as counterparts to the proposed RT-CDM. (2) We categorized the baselines by methodological paradigm: probability-based statistical models (IRT, DINA, JRT-DINA, 4P-IRT) versus neural network-based approaches (NCD, ICD, MFNCD). This dual classification not only clarifies the rationale for selecting these models but also ensures that the evaluation of RT-CDM is conducted from multiple perspectives, strengthening the validity of the comparative analysis.
IRT. The 3PL IRT model predicts the probability that an examinee with a given ability level (θ) will answer a test item correctly, using three item characteristics: discrimination (a), difficulty (b), and a guessing parameter (c).
DINA model. The DINA model is the most widely used CDM; it performs binary modelling of examinees’ binary response results.
R-DINA. The R-DINA model is the refined model obtained by removing the RT component from RT-CDM. Compared with traditional deterministic-input, noisy “and” gate (DINA) models that assume binary latent attributes, R-DINA relaxes the binary assumption by introducing probabilistic or continuous representations of mastery. However, it remains more restricted than generalized models such as G-DINA or the LCDM, which allow complex, multi-way interactions among attributes.
4P-IRT [32]. The 4PIRT model extends the 3PIRT model to predict the probability of students correctly answering exercises by incorporating response time together with parameters representing exercise and student slowness. As in 3PIRT, the discrimination, difficulty, and guessing parameters characterize each exercise j, and θi is the ability parameter of student i. The difference from the 3PIRT model is that 4PIRT adds a parameter representing the exercise’s slowness, a parameter representing the student’s speed, and the student’s response time tij on exercise j.
JRT-DINA [50]. The JRT-DINA model is a hierarchical modelling framework for jointly modelling the RA and RT of examinees.
NCD [5]: NCD is a deep learning-based cognitive diagnostic model that combines IRT to assess students’ cognitive attributes and exercise performance.
ICD [5]: ICD is a neural network-based cognitive diagnostic model that considers the interactions between knowledge concepts and the quantitative relationships between exercises and concepts.
MFNCD [7]: MFNCD integrates multidimensional features by using students’ response times as process information, which facilitates the simultaneous modelling of students’ response accuracy and response speed using neural networks.
5.2.2. Experimental Details
Evaluation metrics. Item parameter recovery was examined using the RMSE (Root Mean Square Error), MAE (Mean Absolute Error), ACC (Prediction Accuracy), and AUC (Area Under the ROC Curve) of the estimated category response function for each latent class against the true values. RMSE and MAE represent the error between the estimated value and the true value; therefore, the smaller the value, the better the model. AUC and ACC indicate the accuracy of the model [3]; therefore, the larger the value, the better the model. We perform an 80%/20% train/test split of the datasets, using the first 80% of each student’s data to train parameters. We then infer each student’s proficiency from his/her training records and use it to predict scores on his/her testing (last 20%) data.
- (1)
RMSE is defined as follows:
RMSE = √((1/n) Σi (ŷi − yi)²)
- (2)
MAE is defined as follows:
MAE = (1/n) Σi |ŷi − yi|
- (3)
ACC is defined as follows:
ACC = (1/n) Σi I(ŷi = yi)
where I(·) is an indicator function. If the condition is true, the value is 1; otherwise, it is 0.
- (4)
AUC provides a robust metric for binary prediction evaluation.
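For reference, the four metrics can be computed as in the short sketch below, using toy predictions; ACC thresholds the predicted probability at 0.5:

```python
# Computing RMSE, MAE, ACC, and AUC on binary response data.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 1])                 # observed correctness
y_prob = np.array([0.83, 0.34, 0.62, 0.49, 0.18, 0.91])  # predicted probabilities

rmse = np.sqrt(np.mean((y_prob - y_true) ** 2))
mae = np.mean(np.abs(y_prob - y_true))
acc = np.mean((y_prob >= 0.5).astype(int) == y_true)  # threshold at 0.5
auc = roc_auc_score(y_true, y_prob)
print(f"RMSE={rmse:.3f} MAE={mae:.3f} ACC={acc:.3f} AUC={auc:.3f}")
```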
Experimental setting. To ensure fair comparison across models, we implemented consistent training strategies with hyperparameter tuning for all baselines. For the IRT and 4P-IRT models, we performed grid search over learning rates [1e-4,1e-3,1e-2], L2 regularization coefficients [0.0,1e-4,1e-3], and batch sizes. The IRT model was trained using the EM algorithm with tuned convergence thresholds. The JRT-DINA model employed an Adam optimizer with a tuned learning rate and dropout regularization when applicable. For the neural network models (NCD, ICD, MFNCD), we searched the learning rate from {0.001, 0.002, 0.005}, batch size from {16, 32, 64}, and hidden dimensions from {8, 16, 32}. For MFNCD in particular, the dimensions of fully connected layers were tuned within {512–256–1, 256–128–1}. All models used the Sigmoid activation function. Hyperparameter selection was based on 5-fold cross-validation performance on the training set, and the best configuration was applied to the test set. These settings were selected based on validation performance to mitigate overfitting and ensure equitable optimization conditions across models.
The RT-CDM model was implemented using the PyMC3 probabilistic programming framework, leveraging the No-U-Turn Sampler (NUTS) for efficient Markov chain Monte Carlo (MCMC) inference. All experiments were conducted on a workstation equipped with an AMD Ryzen 9 7950X CPU and 64 GB RAM (Advanced Micro Devices (AMD), Santa Clara, CA, USA). For the PISA dataset (n ≈ 6000 students, 17 items), we performed 8000 sampling iterations, which required approximately 7.5 h for model convergence. Convergence was assessed using the Gelman–Rubin statistic (R̂) and effective sample size metrics.
To improve runtime and scalability, all item–student likelihood computations were vectorized, avoiding explicit loops. The model likelihood was evaluated in block form, enabling efficient memory use and stable gradients during NUTS sampling. These optimizations ensured that the computational complexity scaled linearly with the number of responses (O(NJ)), making the method feasible for medium- to large-scale assessments. For even larger datasets, the model can be parallelized across chains or implemented in GPU-enabled frameworks (e.g., PyMC v4/NumPyro) to further reduce runtime.
5.2.3. Experimental Results
To rigorously evaluate the predictive performance of the proposed methods, experiments were conducted on three representative datasets: PISA2015 (standardized assessment data), EdNet-KT1 (longitudinal learning log data), and MATH (static single-exam data). Baseline models were organized along two dimensions. From the perspective of data utilization, models were classified into those relying solely on response accuracy (RA) and those jointly modeling response accuracy and response time (RA-RT). From the perspective of modeling paradigm, models were further categorized into probabilistic models (e.g., IRT, DINA, 4P-IRT) and neural network-based models (e.g., NCD, ICD, MFNCD). Within this framework, the proposed R-DINA model is positioned as an enhanced RA-based probabilistic model, strengthening cognitive diagnostic capacity, while the proposed RT-CDM model is positioned as an RA-RT joint probabilistic model, designed to integrate accuracy and temporal behavior in diagnosing student performance. This design enables a comprehensive comparison across data types and modeling paradigms to verify the robustness and generalizability of the proposed approaches.
Table 3 presents the overall results for three datasets.
As shown in
Table 3, RT-CDM achieves consistently strong performance across datasets. On PISA 2015 and EdNet, where RT information is available, RT-CDM yields clear gains over classical CDMs (e.g., DINA, R-DINA) in terms of both accuracy and AUC, confirming that response time provides useful diagnostic signals at the attribute level. Compared with neural models such as MFNCD, RT-CDM performs competitively: while MFNCD sometimes achieves lower MAE or RMSE, RT-CDM maintains higher discriminative ability (AUC), highlighting its advantage in interpretability-oriented diagnosis. On the MATH dataset, which lacks RT information, RT-CDM reduces to R-DINA and produces nearly identical results, as expected. This confirms the model’s internal consistency and ensures that RT-CDM does not degrade performance when RT is unavailable. Overall, these results indicate that RT-CDM balances predictive performance and interpretability: it leverages RT to enhance diagnostic inference where available, while remaining stable in settings without RT. Neural CDMs may outperform in certain error-based metrics, suggesting complementary strengths that future research could explore.
Figure 6 shows the prediction performance of the four models for each item. From each subfigure, we can observe that the RT-CDM model outperforms almost all the baselines on most items, and the R-DINA model is also better than the other two models. Moreover, the JRT-DINA model’s performance is the most unstable among the four models, and its results fluctuate greatly across items, indicating that JRT-DINA is the most affected by item-level factors.
To verify the precision and interpretability of the proposed RT-CDM model in diagnosing students’ attribute mastery, the Degree of Agreement (DOA) metric was adopted, which corresponds to a monotonicity analysis [26]. The rationale of DOA is that if a student exhibits higher estimated proficiency in a given attribute than another student, the former should consistently achieve better performance on exercises associated with that attribute. By averaging DOA values across all attributes, the overall plausibility of diagnostic results can be evaluated. Because this analysis requires explicit outputs of attribute mastery, which not all models provide, DINA, NCD, ICD, and MFNCD were selected as baseline models, and experiments were conducted on the three datasets; the results are shown in Figure 7.
As shown in Figure 7, the results demonstrate that the proposed model outperforms these state-of-the-art baselines. Specifically, it surpasses DINA, indicating that continuous modelling of attribute mastery is superior to dichotomous approaches, and it also achieves higher interpretability than the neural network-based models, highlighting its dual advantage in diagnostic precision and explanatory power.
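For clarity, a simplified version of the DOA computation is sketched below; it follows the pairwise monotonicity rationale just described, while the exact weighting in the cited definition [26] may differ:

```python
# Simplified DOA: for each attribute k, over student pairs (a, b) where a's
# estimated mastery of k exceeds b's, count how often a also outperforms b
# on items requiring k. Intended for small N (quadratic in students).
import numpy as np

def doa(mastery, X, Q):
    """mastery: (N, K) estimated proficiencies; X: (N, J) 0/1 responses; Q: (J, K)."""
    N, K = mastery.shape
    scores = []
    for k in range(K):
        items = np.where(Q[:, k] == 1)[0]
        agree = total = 0
        for a in range(N):
            for b in range(N):
                if mastery[a, k] > mastery[b, k]:
                    diff = X[a, items] - X[b, items]
                    total += np.sum(diff != 0)   # pairs that disagree on an item
                    agree += np.sum(diff > 0)    # a correct where b is not
        scores.append(agree / total if total else np.nan)
    return np.nanmean(scores)                    # average DOA over attributes
```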
Computational complexity. RT-CDM is estimated via a data-augmented MCMC scheme. Per iteration, (a) sampling examinee attributes costs O(NK); (b) updating item/attribute and time parameters costs O(J); and (c) evaluating the joint likelihood scales linearly with the number of observed interactions, O(NJ). Thus, the overall per-iteration time is approximately O(NJ + NK) with moderate constants. Mixing can slow when the number of attributes K is large or when the Q-matrix is highly sparse/imbalanced. In practice, wall-clock time is reduced via parallel updates across items/examinees, batched likelihood evaluations, and vectorized implementations, with convergence monitored by R̂ and effective sample size (ESS), and iteration caps used to control compute.
6. Simulation Study
A follow-up simulation study was conducted to further evaluate model parameter recovery and to compare the R-DINA and RT-CDM models under ideal simulated conditions. The study simulates the examinees’ response result matrix and RT matrix with fixed numbers of attributes, items, and examinees, estimates the item parameters, and recovers the skill states of the examinees.
6.1. Data Generation
The study simulates five separate attributes, which can generate 2^5 = 32 skill states, and simulates 31 items (excluding the null pattern (00000)). The Q-matrix for these data is given in Figure 8. (1) RT parameters: the time intensity βj and the latent speed τi are drawn from normal distributions; the discrimination φj obeys a truncated normal distribution [12] with a lower limit of 0.0001; the error term εij follows a normal distribution with mean 0 and item-specific variance drawn from [0.3, 0.7]. (2) RA parameters: the guessing gj, slipping sj, and RT-effect λj parameters are drawn from uniform distributions over their admissible ranges. According to these parameters, the study simulates the response results and RTs of 5000 examinees in total.
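A hedged sketch of this generating process is given below; because the generating distributions are only partially specified above, the concrete choices (normal means and the uniform ranges for g, s, and λ) are illustrative assumptions:

```python
# Illustrative data generation for the simulation design in Section 6.1.
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
K, N = 5, 5000
# All 31 non-null attribute patterns serve as the rows of the Q-matrix.
Q = np.array([p for p in product([0, 1], repeat=K) if any(p)])
J = Q.shape[0]                                   # 31 items

alpha = rng.integers(0, 2, size=(N, K))          # true attribute profiles
eta = (alpha @ Q.T) / Q.sum(axis=1)              # continuous mastery indicator

beta = rng.normal(0.0, 1.0, size=J)              # time intensity
tau = rng.normal(0.0, 1.0, size=N)               # latent speed
phi = np.clip(rng.normal(0.5, 0.2, size=J), 1e-4, None)  # truncated at 0.0001
sigma2 = rng.uniform(0.3, 0.7, size=J)           # per-item error variance
eps = rng.normal(0.0, np.sqrt(sigma2), size=(N, J))

g = rng.uniform(0.05, 0.3, size=J)               # guessing
s = rng.uniform(0.05, 0.3, size=J)               # slipping
lam = rng.uniform(0.2, 0.8, size=J)              # RT-effect coefficients

z = beta - phi * tau[:, None] - eps              # simulated log residual workloads
base = g + (1.0 - s - g) * eta                   # R-DINA mastery probability
p = 1.0 / (1.0 + np.exp(-(np.log(base / (1.0 - base)) + lam * eps)))
X = rng.binomial(1, p)                           # simulated responses
```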
6.2. Analysis
With respect to the classification of individual attributes and profiles, this study computed the ACCR (Attribute Correct Classification Rate) and the PCCR (Pattern Correct Classification Rate). The ACCR evaluates the accuracy of the individual attribute classifications, and the PCCR evaluates the accuracy of the attribute vector classifications, the vector being a concatenation of the individual attribute classifications:

ACCRk = (1/N) Σi I(α̂ik = αik),  PCCR = (1/N) Σi I(α̂i = αi),

where I(·) is an indicator function. If the condition is true, the value is 1; otherwise, it is 0.
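Both rates reduce to simple element-wise and row-wise comparisons of the true and estimated profiles, as in the following sketch with toy values:

```python
# ACCR (per-attribute accuracy) and PCCR (whole-pattern accuracy).
import numpy as np

alpha_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
alpha_hat  = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 0]])

accr = (alpha_hat == alpha_true).mean(axis=0)        # per-attribute accuracy
pccr = (alpha_hat == alpha_true).all(axis=1).mean()  # whole-pattern accuracy
print(accr, pccr)   # [1. 0.667 1.] and 0.667
```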
Additionally, the recovery of the model was assessed by computing the MAE, RMSE, ACC, and AUC. Based on these evaluation indicators, the RT-CDM model and the R-DINA model were compared to illustrate the impact of RT in cognitive diagnosis.
6.3. Results
In this section, the two models were compared from three aspects: the accuracy of model, the recovery of item parameters and the accuracy of attribute classification.
Model accuracy. Table 4 displays the RMSE, MAE, ACC, and AUC to compare the two models: (1) the RMSE and MAE represent the error between the estimated value and the true value, and (2) the AUC and ACC indicate the accuracy of the two models. These measures were computed for each replication, and the results in Table 4 are averages over 100 replications. On the one hand, the MAE and RMSE errors of RT-CDM are lower than those of the R-DINA model from the perspective of item parameter recovery, and the MAE of RT-CDM is less than 0.05. On the other hand, the accuracy of RT-CDM is much higher than that of the R-DINA model, with an ACC greater than 0.97. This result shows that combining the RT and the response results in modelling can significantly improve the accuracy of the model.
Item parameter recovery.
Table 5 displays the recovery of item parameters for the two models by presenting the MAE and RMSE between the estimated and true values of all item parameters. For most item parameters, the results of the RT-CDM model are much better than those of the R-DINA model; for the parameter g, the results of the two models are equivalent, the RT-CDM result being only 0.001 higher than that of R-DINA. Overall, the item parameter recovery of RT-CDM is relatively stable.
Attribute classification. Table 6 presents the recovery of individual attributes (ACCR) and attribute patterns (PCCR) for the RT-CDM and R-DINA models. The values in the table are computed by comparing the true and estimated classifications and represent the percentage of correct classifications across the replications and examinees’ real response results. RT-CDM was higher than the R-DINA model in terms of both ACCR and PCCR, which indicates that ignoring the effect of RT on RA would reduce the ACCR and PCCR.
The results of Table 5 show that the RT-CDM model’s item parameter estimation under the MCMC algorithm was stable and accurate. The results of Table 6 demonstrate that analysing the data with the RT-CDM model yields a dramatic improvement in classification accuracy over conventional tests and analyses.
As shown in
Table 6, the proposed RT-CDM models the local dependency between RA and RT, achieving higher ACCR and PCCR than competing approaches. This indicates that RT-CDM can more effectively differentiate examinees’ attribute mastery and provides stronger interpretability and diagnostic power. By contrast, MFNCD, an educational neural network model, shows markedly lower ACCR and PCCR, reflecting weaker interpretability and limited attribute discrimination. JRT-DINA, as a hierarchical model, incorporates response times but assumes independence between RA and RT; as a result, its overall performance is comparable to that of R-DINA, which relies solely on response accuracy.
7. Discussion
Several key findings regarding the validity and positioning of RT-CDM relative to existing cognitive diagnostic frameworks are as follows:
- (1)
Comparative advantages over JRT-DINA
RT-CDM’s attribute-level local dependence plus continuous mastery produces more robust parameter recovery, stronger interpretability, and more stable, better-calibrated predictions than JRT-DINA [41]. Compared with JRT-DINA, RT-CDM models the attribute-level coupling between response time (RT) and response accuracy (RA) and replaces binary mastery with a continuous proficiency representation (R-DINA). This makes RT an explicit, interpretable modulator of the Bernoulli mean at the skill level, so the model naturally captures the speed–accuracy trade-off and preserves diagnostic monotonicity (greater mastery → higher correctness) while improving calibration and prediction. By contrast, JRT-DINA embeds RT through higher-level (e.g., testlet) factors and typically assumes conditional independence between RA and RT given those factors, so observed time does not directly inform item-skill probabilities. When data exhibit strong speed–accuracy exchange or heterogeneous attribute effects, this architecture can misallocate dependence, letting RT effects be absorbed by testlet or person factors, which weakens identifiability, increases collinearity between speed and ability, and often yields slow, unstable estimation. In addition, DINA’s hard (binary) gating is sensitive to thresholding and Q-matrix sparsity; with short tests or uneven item coverage, parameter estimates and mastery profiles become brittle.
- (2)
Positioning relative to traditional and neural CDMs.
Compared with traditional CDMs such as DINA and its variants, RT-CDM extends the binary mastery paradigm into a continuous attribute representation (R-DINA) and further enriches the diagnostic process by integrating temporal information. This allows the model to capture more nuanced differences in students’ proficiency and processing efficiency, overcoming the restrictive dichotomy of mastery versus non-mastery. In comparison with neural CDMs (e.g., NCD, ICD, MFNCD), RT-CDM achieves a balance between predictive accuracy and interpretability. While neural models often exhibit strong performance, their latent representations are difficult to align with meaningful cognitive constructs. By contrast, RT-CDM retains the interpretability of classical CDMs while delivering improved predictive performance through the incorporation of RT.
- (3)
Comparative analysis with published results.
To highlight its positioning, we compared RT-CDM against publicly reported results along methodological dimensions rather than raw ACC/AUC values, since data preprocessing and evaluation criteria vary across studies. As summarized in
Table 7, RT-CDM demonstrates clear advantages in terms of continuous representation, explicit use of RT, interpretability, parameter stability, and calibration, while maintaining moderate computational demand. This suggests that RT-CDM provides a balanced framework that integrates the predictive validity of neural models with the diagnostic transparency of classical CDMs.
- (4)
Applicability conditions.
RT-CDM is most effective in timed or moderately constrained computer-based assessments where response time (RT) carries meaningful process signal and exhibits attribute-level dependence with response accuracy (RA). In power tests (ample time, little speed–accuracy pressure) or when RT is dominated by non-cognitive factors, the marginal value of RT is limited; under our hierarchical priors, time-modulation parameters naturally shrink, yielding a benign degeneration to an RA-dominant diagnosis.
- (5)
Limitations and boundary conditions.
Design dependence. Gains rely on a substantive speed–accuracy trade-off. Short tests or uneven attribute coverage can increase confounding between time and ability; adequate Q-matrix coverage and prior shrinkage help mitigate this.
Data quality and preprocessing. RT data are typically skewed and may contain extremely short (rapid-guessing) or long (off-task/external) latencies. In practice, we apply a log transform, trim or winsorize extremes using percentile-based rules (dataset-specific thresholds), and flag potential rapid-guessing records for exclusion. While these procedures improve robustness, they may also introduce subjective choices that affect reproducibility. Future work should investigate more automated and model-based approaches to handling extreme or missing RTs.
Computational costs and application scope. The MCMC estimation required by RT-CDM has significantly higher computational demands compared with neural CDMs. This limits its feasibility in environments requiring real-time diagnostic feedback. Accordingly, the true value of RT-CDM lies in offline, detailed analysis contexts—such as post hoc evaluation of large-scale assessments, curriculum studies, and high-stakes testing—where interpretability, diagnostic accuracy, and robustness are paramount. In contrast, for adaptive testing or classroom-level real-time feedback, more computationally efficient models may be preferable.
Baseline time and prior sensitivity. In principle, the item-level baseline t̄j should be treated as an unknown parameter and estimated jointly with other parameters to fully capture its uncertainty. For computational tractability, however, we fix t̄j as the mean RT of item j computed from the training data only, and then apply these values as exogenous constants in the test phase. This avoids information leakage while maintaining efficiency. We note that mild mis-specification of t̄j primarily affects the residual time component, which is partially compensated by the time-modulation parameters. Hierarchical priors further provide shrinkage to mitigate overfitting and identifiability concerns.
External heterogeneity. Device/platform differences (mobile vs. desktop), UI latency, and strategy shifts can contaminate RT; such covariates can be included in the time model, or heavy-tailed residuals (e.g., log-t) can be used to enhance robustness.
8. Conclusions
This study set out to enhance cognitive diagnosis by explicitly incorporating response time (RT) alongside response accuracy (RA) within a unified diagnostic framework. The proposed RT-CDM directly models the local dependence between RA and RT at the attribute level, thereby addressing limitations in both traditional psychometric models and recent neural CDMs [51]. To evaluate its effectiveness, we conducted simulation studies and analyses on multiple real datasets, which consistently demonstrated that RT-CDM outperforms classical CDMs, RT-extended IRT models, and neural CDMs in terms of classification accuracy, parameter recovery, and predictive stability. These findings confirm that incorporating RT not only improves diagnostic precision but also yields deeper insights into learners’ proficiency and processing efficiency.
Although the current study provides strong empirical and theoretical support for RT-CDM, several avenues remain open for future research. First, more comprehensive sensitivity analyses should be conducted, particularly with respect to the specification of time parameters such as the item baseline t̄j, to examine the robustness of RT-CDM under different testing conditions. Second, while this study validated RT-CDM on large-scale assessment data, future work may extend the framework to adaptive testing and classroom-based formative assessment, where time constraints vary dynamically. Third, integrating RT with additional behavioral signals (e.g., eye-tracking, clickstream data, or affective measures) may further enhance the ecological validity of cognitive diagnosis. Finally, hybridizing RT-CDM with neural architectures could provide a promising direction for combining interpretability with the flexibility of representation learning. These extensions will not only refine the methodological contributions of RT-CDM but also broaden its applicability in real-world educational settings.