Article

Enhancing Ability Estimation with Time-Sensitive IRT Models in Computerized Adaptive Testing

Electrical and Electronics Engineering, Faculty of Engineering, Gaziantep University, Gaziantep 27010, Turkey
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 6999; https://doi.org/10.3390/app15136999
Submission received: 26 April 2025 / Revised: 3 June 2025 / Accepted: 18 June 2025 / Published: 21 June 2025
(This article belongs to the Special Issue Applications of Smart Learning in Education)

Abstract
This study investigates the impact of response time on ability estimation within an Item Response Theory (IRT) framework, introducing time-sensitive formulations to enhance student assessment accuracy. Seven models were evaluated, including standard 1PL-IRT and six response-time-adjusted variants: TP-IRT, STP-IRT, TWD-IRT, NRT-IRT, DTA-IRT, and ART-IRT. Three optimization techniques—Maximum Likelihood Estimation (MLE), full parameter optimization, and K-fold Cross-Validation (CV)—were employed to assess model performance. Empirical validation was conducted using data from 150 students solving 30 mathematics items on the “TestYourself” platform, integrating response accuracy and timing metrics. Student abilities (θ), item difficulties (b), and time–effect parameters (λ) were estimated using the L-BFGS-B algorithm to ensure numerical stability. The results indicate that subtractive models, particularly DTA-IRT, achieved the lowest AIC/BIC values, highest AUC, and improved parameter stability, confirming their effectiveness in penalizing excessive response times without disproportionately affecting moderate-speed students. In contrast, multiplicative models (TWD-IRT, ART-IRT) exhibited higher variability, weaker generalizability, and increased instability, raising concerns about their applicability in adaptive testing. K-fold CV further validated the robustness of subtractive models, emphasizing their suitability for real-world assessments. These findings highlight the importance of incorporating response time as an additive factor to improve ability estimation while maintaining fairness and interpretability. Future research should explore multidimensional IRT extensions, behavioral response–time analysis, and adaptive testing environments that dynamically adjust item difficulty based on response behavior.

1. Introduction

In modern educational assessment, the implementation of adaptive testing based on Item Response Theory (IRT) has significantly enhanced the evaluation of student abilities. Unlike traditional fixed-form examinations, adaptive tests dynamically adjust item difficulty based on a test-taker’s performance, improving measurement precision and optimizing test length [1,2]. This adaptability ensures that each examinee is presented with questions that match their proficiency level, leading to a more accurate skill assessment [3,4]. As the foundation of this methodology, IRT theory models the probability of a correct response as a function of student ability (θ) and item difficulty (b) [5,6]. Among IRT models, the One-Parameter Logistic (1PL) Rasch model is widely adopted due to its computational efficiency and reliability in ability estimation [6,7]. However, conventional IRT models focus solely on the accuracy of responses (correct/incorrect) and overlook additional cognitive dimensions, such as response time, which can provide deeper insights into test-taking behaviors and cognitive load [8]. Recent research suggests that response time reflects not only the examinee’s familiarity with the content but also their cognitive effort and decision-making process [9,10]. Consequently, integrating response time into IRT-based models has the potential to refine ability estimation and enhance the interpretability of assessments [10,11]. In response to this, researchers have explored response-time-adjusted IRT models, categorized broadly into subtractive and multiplicative approaches. Subtractive models, such as Time-Penalized-IRT (TP-IRT) and Dynamic Time-Adjusted-IRT (DTA-IRT), penalize extended response times to prevent undue disadvantages for slower respondents [12]. On the other hand, multiplicative models, including Time-Weighted Difficulty-IRT (TWD-IRT) and Adaptive Response Time-IRT (ART-IRT), dynamically adjust item difficulty based on response speed [13]. These methods aim to incorporate response-time efficiency into ability estimation, refining the overall adaptive testing experience [14].
One key motivation for this study arises from a recurring problem in computer-based testing environments: the mismatch between response accuracy and response time when estimating student ability. Traditional IRT-based models evaluate correctness but overlook the timing dimension, which can lead to over- or underestimation of student performance. For example, a high-ability student might quickly and confidently solve an item, while a lower-ability student could take significantly longer to reach the same correct answer, potentially through guessing or trial-and-error. Failing to account for this timing discrepancy can lead to biased ability estimates. The TestYourself platform—used in this study—was designed to address this by collecting both correctness and response time data. This dual data stream allows for the implementation of time-aware IRT models that aim to improve fairness and diagnostic insight. Therefore, the primary aim of this study is to compare various time-sensitive IRT models under real-world conditions to determine which models most effectively integrate response time into ability estimation.
In recent years, the integration of response time into Item Response Theory (IRT) models has gained significant attention, particularly in the context of computer-based and adaptive testing. Researchers have increasingly recognized that response accuracy alone may not fully capture a test-taker’s ability, as response behavior and timing can provide valuable information about cognitive effort, problem-solving strategy, and test engagement. For instance, Huang, Luo, and Jeon [15] proposed a response-time-based mixture IRT model capable of distinguishing dynamic item-solving strategies, showing that time-sensitive modeling can improve the precision of ability estimation. Similarly, Wallin, Chen, Lee, and Li [16] introduced a latent variable model with change points to capture shifts in behavior under time pressure, demonstrating the psychological relevance of response timing. In the realm of AI-based adaptivity, the work of researchers published in Nature Scientific Reports—namely, Johnson, Tanaka, and Clarke [17]—demonstrated how machine learning techniques can be leveraged to optimize adaptive testing, including the modeling of time-based variables in real-time test environments. In terms of practical implementation, Huda et al. [18] presented an IRT-based computer adaptive testing system that emphasizes both digital scalability and test precision. Additionally, Smith and Alvarez [19] contributed a recent methodological tutorial on sample-size planning in IRT models, offering updated guidelines that are particularly relevant when incorporating multidimensional elements such as response time. Despite these advancements, the literature lacks a systematic comparison of multiple response-time-integrated IRT models under the same testing conditions using real-world data. This study aims to address this gap by evaluating six distinct time-sensitive IRT models alongside the traditional 1PL model, using empirical data collected from the authors’ adaptive testing platform, TestYourself. By comparing these models across multiple performance indicators (AIC, BIC, AUC), this research contributes a comprehensive, up-to-date investigation into how response time can be effectively incorporated into modern educational measurement systems.
A recent empirical study utilizing the TestYourself platform investigated the effectiveness of these models using data from 150 students completing 30 mathematics items. Model performance was evaluated using key statistical measures, including the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Area Under the Curve (AUC), and Item–Test Correlation [20]. The results indicated that the DTA-IRT model exhibited greater stability, highlighting a trade-off between responsiveness and predictive reliability. These findings suggest that integrating response time into IRT-based adaptive testing can significantly enhance student ability estimation while ensuring fairness. As the field progresses, further research is needed to explore large-scale implementations of response-time-aware IRT models across diverse student populations. Additionally, integrating Artificial Intelligence (AI) into adaptive testing frameworks may further refine real-time decision-making and test personalization [21,22].

2. Literature Review

The integration of response time into Item Response Theory (IRT) frameworks has addressed critical limitations of traditional accuracy-only models, enhancing educational assessment precision and fairness [23,24]. Classical IRT models, such as the 1PL (One-Parameter Logistic) and Rasch models, focus on estimating student ability (θ) and item difficulty (b) through response accuracy but overlook cognitive processes and engagement levels reflected in response times [25,26]. This gap has stimulated the development of time-sensitive IRT models that include response time as an auxiliary parameter to improve ability prediction and adaptive testing strategies [27,28].
Regarding the role of response time in IRT, response times provide insights into cognitive load, problem-solving behavior, and test-taker engagement [29]. Van der Linden [27] pioneered a hierarchical speed–accuracy trade-off model, showing how response times distinguish deliberate problem-solving from rapid guessing. Subsequent work, such as Wang and Xu [30], advanced Bayesian hierarchical models to improve parameter estimation in IRT-based response-time analyses. However, response times remain susceptible to external factors like test anxiety, cognitive fatigue, and item complexity, leading to parameter instability [31,32]. To address this, researchers have proposed data-smoothing methods and Bayesian priors for response-time parameters [33]. In this context, Yao et al. (2014) investigated how Multidimensional Computerized Adaptive Testing (MCAT) can simultaneously achieve accurate ability estimation and minimize test duration, finding that the use of MCAT significantly reduced test time compared to full-length selection tests without causing substantial losses in accuracy. The study also noted that incorporating response time constraints into item selection helped to further lower test time, although it introduced potential content imbalance, which could be mitigated by applying content constraints [34].
The use of Response Time (RT) in Computerized Adaptive Testing (CAT) has gained significant attention as researchers strive to optimize both the efficiency and security of adaptive test systems. Choe, Kern, and Chang (2017) explored how response time could be incorporated into CAT frameworks to not only improve the efficiency of item selection but also to minimize test duration, an aspect traditionally overlooked in standard item-based efficiency metrics [35].
Building on previous research by Fan et al. (2012), the study highlights the MI with Time (MIT) criterion, which maximizes the Fisher information-to-response time ratio for item selection. MIT effectively reduces the average test completion time in fixed-length tests. However, a critical drawback identified by the authors is that MIT excessively favors items with high discrimination (a) and low expected RT, resulting in imbalanced item exposure, a situation that poses risks to test security and fairness [36]. To address these issues and use response times in a more balanced way, the study investigated three alternative methods: (1) b-partitioned MIT (BMIT), (2) b-matching MI (MIB), and (3) Generalized MIT (GMIT). Among these, GMIT was found to be the most promising approach. GMIT optimizes the item selection criterion by centering the expected RT around a mean value (v) and applying a weighting exponent (w). This dual mechanism allows for flexible adjustment of the impact of expected RT on item selection, with the mean value (v) focusing on aligning the expected RT with a target value, improving item pool usage, and stabilizing test duration. The weighting (w) controls the influence of the centered RT term. Simulations and real-world data analyses demonstrated that GMIT significantly reduced both the average and variability of test times, as well as the average and variability of test overlap (a measure of item exposure), without requiring additional item exposure control mechanisms. These improvements in time and overlap reduction contributed to increased test score validity by mitigating the likelihood of hasty or random responses due to time pressure and decreasing the chance of prior knowledge influencing item responses. In conclusion, Choe et al. (2017) emphasized the central role of response time in enhancing time efficiency and test security within CAT systems. The study showed that GMIT, compared to other RT-based approaches (BMIT, MIB), provided a more balanced and effective solution for optimizing item selection algorithms, thereby improving the validity of test scores [35].
Time-sensitive IRT models fall into two categories: (1) subtractive models (e.g., DTA-IRT), which penalize excessive response times by reducing ability estimates and discouraging inefficient time use [37]; and (2) multiplicative models (e.g., TWD-IRT, ART-IRT), which apply response time as a weighting factor, directly modulating ability estimates but introducing higher variability [38,39]. Empirical comparisons show that subtractive models achieve superior stability in high-stakes testing, as evidenced by lower AIC/BIC values [40], while multiplicative models risk inflating ability estimates, reducing their utility in adaptive testing [41].
With respect to optimization techniques, parameter estimation in time-sensitive IRT models demands robust numerical methods. Traditional Maximum Likelihood Estimation (MLE) struggles with computational complexity for response-time parameters (λ) [42]. Advanced methods like full parameter optimization and K-fold Cross-Validation (CV) mitigate overfitting [43]. The L-BFGS-B algorithm has emerged as particularly effective, balancing numerical stability and efficiency in high-dimensional parameter spaces [44]. Recent studies confirm its superiority over alternative optimization methods [45].
Turning to adaptive testing and empirical insights, modern Computerized Adaptive Testing (CAT) frameworks leverage response-time metrics to dynamically adjust test difficulty, improving efficiency and fairness [46]. Van der Linden and Jiang [33] demonstrated that integrating response-time variables via the shadow-test approach reduces test length without sacrificing precision. Empirical studies in medical education highlight adaptive testing’s motivational benefits, particularly when aligning item difficulty with individual proficiency [28]. However, system efficacy depends on item bank diversity; limited variability can introduce measurement bias [47]. Despite progress, key gaps persist; for example, large-scale empirical evaluations of subtractive models (e.g., DTA-IRT) remain scarce [48]. Regarding dynamic item selection, the interaction between response time and item difficulty in real-time adaptive testing requires deeper exploration [49].
Previous research has explored methods to incorporate response time into IRT-based assessments. One of the earliest approaches involved penalizing excessive response times by adjusting ability estimates [50]. Other studies examined time-efficiency trade-offs in test-taking, proposing methods to weight or normalize response time in ability estimation. However, these approaches were often limited in scope and lacked systematic comparisons across multiple response-time-influenced models. The integration of response time in Item Response Theory (IRT) has significantly improved the accuracy and efficiency of adaptive testing methods. Traditional IRT models primarily rely on response accuracy to estimate student ability, but incorporating response time provides deeper insights into cognitive processing and test-taking behavior. A major challenge in adaptive testing is selecting the most informative items dynamically. Van der Linden and Pashley (2009) [50] introduced an advanced item selection framework that optimizes the trade-off between response time and test precision. This approach aligns with recent developments in computerized adaptive testing, where response time models are used to adjust test difficulty in real time, reducing test anxiety and cognitive overload [51].
Despite these advancements, gaps remain in the empirical validation of response time models across diverse testing conditions. Future research should explore hybrid models that integrate IRT with machine learning to refine adaptive testing strategies [52]. Among the few response-time-enhanced IRT models, some approaches modify the probability function by incorporating response time as a multiplicative factor, while others subtract a penalty based on response time. However, there is no consensus on which approach best optimizes adaptive testing performance, as no comprehensive evaluation has been conducted across multiple models under controlled conditions [53]. Despite growing interest in response time modeling, a systematic comparison of response time-integrated IRT models in adaptive testing remains largely unexplored. While previous studies have proposed individual modifications to IRT models, no study has empirically evaluated multiple models on real student data using standardized model comparison metrics such as AIC, BIC, AUC, and item–test correlation. This study addresses this gap by systematically evaluating six alternative response-time-enhanced IRT models alongside the standard 1PL Rasch model. These models vary in their approach to integrating response time, either multiplicatively or subtractively, offering unique perspectives on how response time influences ability estimation. Through empirical testing on real-world data from 150 students answering 30 test items, this study aims to identify which response-time-aware model best enhances adaptive test efficiency. By conducting this comparative analysis, this research contributes to the development of next-generation adaptive testing systems, where response time can be leveraged not only to refine ability estimation but also to optimize real-time question selection strategies.

3. Methods

3.1. Data Collection

This study employs a real-world dataset collected through the TestYourself platform, a web-based adaptive testing system developed by the authors to provide personalized and scalable educational assessments. The participants of this study consisted of 150 undergraduate students enrolled in the 2nd, 3rd, and 4th years of the Faculty of Engineering at Gaziantep Islam, Science and Technology University. Students volunteered to participate in the study without any academic or financial compensation. While age data were not collected, all participants were actively registered undergraduate students at the time of the assessment.
The dataset includes responses from these 150 students, each of whom completed 30 multiple-choice mathematics test items in a web-based, self-paced, and untimed testing environment. The TestYourself system was developed not only to administer adaptive tests but also to record accuracy (correct/incorrect responses) and response time (measured in seconds) for each item.
The test content was carefully constructed to include a diverse range of difficulty levels, ensuring that the assessment effectively discriminates between students of varying proficiency. The adaptive functionality of TestYourself follows item selection algorithms that dynamically adjust question difficulty based on real-time performance, closely mimicking the behavior of large-scale adaptive testing environments such as the Graduate Record Examination (GRE) and the Computerized Adaptive Testing (CAT) framework.
No time constraints were imposed on participants, allowing students to complete the assessment at their own pace and ensuring natural variation in response times.
To ensure compliance with ethical research standards, all participants provided informed consent before taking part in the study. The data collection process was approved by the institution’s Ethics Committee for Educational Research, adhering to guidelines on data privacy and student anonymity. To enhance data quality, extreme outliers in response times (e.g., significantly low values due to accidental clicks or high values due to interruptions) were identified and removed using statistical anomaly detection techniques.

3.2. Derived Item Response Models

This study evaluates seven IRT-based models, including the 1PL Rasch model as the baseline and six alternative models incorporating response time as a parameter.
In the subsequent formulations, Tjavg is defined as the average response time for item j across all test-takers. Formally,
T_j^{avg} = \frac{1}{N} \sum_{i=1}^{N} T_{ij} \quad (1)
where N denotes the total number of students and Tij is the response time of student i to item j. This convention will be used consistently in all time-normalized models presented below.

3.2.1. The Standard 1PL Rasch Model

The standard 1PL Rasch model, widely used in adaptive testing, assumes that the probability of a correct response is a function of the difference between a student’s latent ability (θ) and the difficulty of an item (b):
P(Y_{ij} = 1 \mid \theta_i, b_j) = \frac{1}{1 + e^{-(\theta_i - b_j)}} \quad (2)
In Equations (1) and (2), the standard 1PL model captures the relationship between student ability (θi) and item difficulty (bj). The probability of a correct response increases as the student’s ability exceeds the item’s difficulty. This model is simple and interpretable, making it widely used in educational assessments. However, it ignores external factors like time spent, motivation, or testing conditions, limiting its applicability in contexts where these variables significantly impact performance.
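As a point of reference, Equation (2) can be written as a one-line function. The following Python sketch mirrors that formula; the function name and example values are ours, for illustration only.

```python
import numpy as np

def p_correct_1pl(theta: float, b: float) -> float:
    """1PL (Rasch) probability of a correct response for ability theta and difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Example: a student with theta = 0.5 answering an item of difficulty b = -0.2
print(round(p_correct_1pl(0.5, -0.2), 3))  # ~0.668
```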
Although the 1PL Rasch model effectively estimates ability, it does not account for response time, which can provide additional insights into cognitive effort and test-taking strategy. To address this limitation, six alternative models were implemented, integrating response time either multiplicatively or subtractively.

3.2.2. Time-Weighted Difficulty Model (TWD-IRT): Incorporates Response Time as a Multiplicative Factor

P(Y_{ij} = 1 \mid \theta_i, b_j, T_{ij}) = \frac{1}{1 + e^{-\lambda \cdot (T_j^{avg}/T_{ij}) \cdot (\theta_i - b_j)}} \quad (3)
In Equation (3), time (Tij) scales the ability–difficulty interaction. If λ⋅Tij > 1, the effect of the ability–difficulty difference is amplified, suggesting that time enhances the student’s capacity to leverage their ability (e.g., in untimed tests, where reflection improves accuracy). This model suits scenarios where time acts as a positive resource, but multiplicative terms can complicate parameter interpretation and increase overfitting risks.

3.2.3. Normalized Response Time Model (NRT-IRT): Adjusts Difficulty Using a Time Normalization Factor

P(Y_{ij} = 1 \mid \theta_i, b_j, T_{ij}) = \frac{1}{1 + e^{-(T_j^{avg}/T_{ij}) \cdot (\theta_i - b_j)}} \quad (4)
Here, the ratio Tjavg/Tij scales the ability–difficulty difference. If Tij < Tjavg, the ratio exceeds 1, amplifying the effect of θi − bj. This implies that faster responders gain an advantage, as seen in speed-based assessments. While intuitive, multiplicative scaling can distort parameter estimates if time ratios are extreme.

3.2.4. Time-Penalized Model (TP-IRT): Applies a Penalty Based on Response Time

P(Y_{ij} = 1 \mid \theta_i, b_j, T_{ij}) = \frac{1}{1 + e^{-(\theta_i - b_j - \lambda \cdot T_{ij})}} \quad (5)
This model assumes that the time spent (Tij) linearly modifies the effective difficulty parameter. For example, if λ > 0, longer time spent (Tij) increases the perceived difficulty (bj + λTij), reflecting scenarios where excessive time correlates with higher stress or cognitive load (e.g., timed exams). This model is ideal for analyzing negative time effects but oversimplifies complex time–performance relationships due to its linear assumption.

3.2.5. Scaled Time-Penalized Model (STP-IRT): Introduces a Scaled Penalty Using the Average Response Time

P(Y_{ij} = 1 \mid \theta_i, b_j, T_{ij}) = \frac{1}{1 + e^{-(\theta_i - b_j - T_j^{avg}/T_{ij})}} \quad (6)
This model incorporates the relative time effect by comparing individual time (Tij) to the average time (Tjavg). If Tij < Tjavg, the term Tjavg/Tij > 1 increases the effective difficulty, penalizing students who answer too quickly (e.g., rushed responses under time pressure). It is useful for exams with strict time limits but may suffer numerical instability if Tij approaches zero.

3.2.6. Dynamic Time-Adjusted Model (DTA-IRT): Applies a Subtractive Time-Based Adjustment

P(Y_{ij} = 1 \mid \theta_i, b_j, T_{ij}) = \frac{1}{1 + e^{-(\theta_i - b_j - \lambda \cdot (T_j^{avg}/T_{ij}))}} \quad (7)
This model combines the relative time effect (Tjavg/Tij) with a weighting parameter (λ) to adjust the difficulty. For λ > 0, spending less time than average (Tij < Tjavg) increases effective difficulty, penalizing rushed answers. It is useful for modeling time pressure effects in high-stakes exams but requires sufficient data to estimate λ reliably.

3.2.7. Adaptive Response Time Model (ART-IRT): Uses a Multiplicative Normalization Factor for Adaptive Question Selection

P(Y_{ij} = 1 \mid \theta_i, b_j, T_{ij}) = \frac{1}{1 + e^{-\lambda \cdot (T_j^{avg}/T_{ij}) \cdot (\theta_i - b_j)}} \quad (8)
This model dynamically scales the ability–difficulty interaction using both the relative time ratio and a weighting parameter (λ). For example, λ⋅(Tjavg/Tij) > 1 amplifies the impact of ability, suggesting that efficient time use enhances performance. It is powerful for capturing complex time interactions but risks overfitting due to multiple parameters.
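To make the six formulations concrete, the following Python sketch expresses Equations (3)–(8) as probability functions. The function and argument names are ours and purely illustrative; the bodies follow the equations as reconstructed above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Time-aware variants of the 1PL logit (theta: ability, b: difficulty,
# t: individual response time T_ij, t_avg: item average T_j^avg, lam: λ).
def p_twd(theta, b, t, t_avg, lam):   # TWD-IRT, Equation (3): weighted multiplicative scaling
    return sigmoid(lam * (t_avg / t) * (theta - b))

def p_nrt(theta, b, t, t_avg):        # NRT-IRT, Equation (4): normalization by the time ratio
    return sigmoid((t_avg / t) * (theta - b))

def p_tp(theta, b, t, lam):           # TP-IRT, Equation (5): linear time penalty
    return sigmoid(theta - b - lam * t)

def p_stp(theta, b, t, t_avg):        # STP-IRT, Equation (6): scaled penalty via the time ratio
    return sigmoid(theta - b - t_avg / t)

def p_dta(theta, b, t, t_avg, lam):   # DTA-IRT, Equation (7): weighted subtractive adjustment
    return sigmoid(theta - b - lam * (t_avg / t))

def p_art(theta, b, t, t_avg, lam):   # ART-IRT, Equation (8): weighted multiplicative scaling
    return sigmoid(lam * (t_avg / t) * (theta - b))
```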

3.3. Parameter Estimation

The estimation of parameters in this study involved three distinct optimization frameworks, each tailored to balance model complexity, computational efficiency, and statistical accuracy. The primary parameters estimated were student abilities (θ), item difficulties (b), and time–effect coefficients (λ), all of which were optimized through different methodologies based on model type.

3.3.1. First Approach Maximum Likelihood Estimation (MLE) with Fixed Item Difficulties

In the first optimization approach, item difficulties (b) were precomputed using the logit transformation of correct response probabilities, ensuring numerical stability by clipping probabilities between 0.05 and 0.95:
b_j = \ln\left(\frac{p_j}{1 - p_j}\right) \quad (9)
where pj represents the proportion of correct responses for item j. Once determined, b remained fixed, and only θ and λ were optimized. This approach minimized computational complexity but introduced potential bias if initial b estimates were inaccurate.
The L-BFGS-B optimization algorithm was used to estimate θ and λ, enforcing theoretical constraints:
  • θ ∈ [−3, 3]: Student abilities were constrained to a realistic range, preventing extreme values that could distort model interpretability.
  • b ∈ [−1, 1]: Item difficulties were bounded to ensure balanced item calibration, avoiding excessively easy or difficult items.
  • λ > 0 for multiplicative models: Time–effect coefficients were constrained to positive values to maintain interpretability in multiplicative interactions.
The L-BFGS-B optimization algorithm was employed for parameter estimation due to its ability to handle bound constraints, ensuring that parameters such as student abilities (θ ∈ [−3, 3]), item difficulties (b ∈ [−1, 1]), and time–effect coefficients (λ > 0) remain within theoretically valid ranges. Its gradient-based approach and memory efficiency make it particularly suitable for large-scale problems, such as those encountered in IRT models with multiple parameters. Additionally, L-BFGS-B’s robustness against local minima and its integration into widely-used libraries like scipy.optimize facilitate reliable and efficient estimation of model parameters.
The objective function was defined as the regularized negative log-likelihood:
L = -\sum_{i=1}^{N} \sum_{j=1}^{M} \left[ Y_{ij} \cdot \ln P_{ij} + (1 - Y_{ij}) \cdot \ln(1 - P_{ij}) \right] + \lambda_r \sum_{k} \theta_k^2 \quad (10)
where λr = 0.01 represents the L2 penalty to prevent overfitting, and Pij = σ(θi − bj) represents the probability of a correct response under the logistic function.
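A minimal sketch of this first estimation approach is given below. It assumes a DTA-style time adjustment inside the logit and uses scipy’s L-BFGS-B implementation; the function name, default values, and adjustment choice are illustrative rather than the authors’ exact code.

```python
import numpy as np
from scipy.optimize import minimize

def fit_theta_lambda(Y, T, lam_reg=0.01):
    """Approach 1 sketch: fix item difficulties from the logit of proportion correct,
    then estimate abilities (theta) and a time coefficient (lambda) with L-BFGS-B.
    Y: (N, M) 0/1 response matrix; T: (N, M) response times in seconds."""
    N, M = Y.shape
    p = np.clip(Y.mean(axis=0), 0.05, 0.95)          # proportion correct per item, clipped
    b = np.log(p / (1 - p))                          # fixed difficulties, Equation (9) as written
    t_avg = T.mean(axis=0)                           # item-level average response times

    def neg_log_lik(params):
        theta, lam = params[:N], params[N]
        # DTA-style subtractive adjustment, used purely for illustration
        logits = theta[:, None] - b[None, :] - lam * (t_avg[None, :] / T)
        P = np.clip(1.0 / (1.0 + np.exp(-logits)), 1e-9, 1 - 1e-9)
        ll = np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))
        return -ll + lam_reg * np.sum(theta ** 2)    # regularized negative log-likelihood, Eq. (10)

    x0 = np.concatenate([np.zeros(N), [0.1]])
    bounds = [(-3, 3)] * N + [(1e-6, None)]          # theta in [-3, 3], lambda > 0
    res = minimize(neg_log_lik, x0, method="L-BFGS-B", bounds=bounds)
    return res.x[:N], res.x[N], b
```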

3.3.2. Second Approach Full Parameter Optimization (θ, b, λ Joint Estimation)

In the second approach, all parameters θ, b, and λ were optimized simultaneously rather than fixing b beforehand. This allowed greater flexibility in estimating item characteristics while preserving interpretability. However, this method required careful initialization to prevent local minima. The regularized log-likelihood function was expanded to include constraints for b and λ, ensuring a stable estimation process:
L = -\sum_{i=1}^{N} \sum_{j=1}^{M} \left[ Y_{ij} \cdot \ln P_{ij} + (1 - Y_{ij}) \cdot \ln(1 - P_{ij}) \right] + \lambda_r \left( \sum_{k} \theta_k^2 + \sum_{j} b_j^2 \right) \quad (11)
where the additional L2 penalty on b prevents large variations across item difficulties. Gradient norms were monitored during optimization, and termination criteria were set at 500 iterations or a tolerance of 10−6 for parameter updates. The optimization process minimized the regularized negative log-likelihood function, which incorporated an L2 penalty (λr = 0.01) to mitigate overfitting.
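The joint estimation of θ, b, and λ could be sketched as follows, again assuming a DTA-style adjustment. The bounds, starting values, and stopping criteria follow the constraints stated above, but the routine itself is an illustrative sketch, not the authors’ implementation.

```python
import numpy as np
from scipy.optimize import minimize

def fit_all_parameters(Y, T, lam_reg=0.01, max_iter=500, tol=1e-6):
    """Approach 2 sketch: jointly estimate theta, b, and lambda with L-BFGS-B under
    bound constraints (theta in [-3, 3], b in [-1, 1], lambda > 0)."""
    N, M = Y.shape
    t_avg = T.mean(axis=0)

    def objective(params):
        theta, b, lam = params[:N], params[N:N + M], params[-1]
        logits = theta[:, None] - b[None, :] - lam * (t_avg[None, :] / T)
        P = np.clip(1.0 / (1.0 + np.exp(-logits)), 1e-9, 1 - 1e-9)
        nll = -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))
        return nll + lam_reg * (np.sum(theta ** 2) + np.sum(b ** 2))  # L2 on theta and b, Eq. (11)

    x0 = np.concatenate([np.zeros(N), np.zeros(M), [0.1]])
    bounds = [(-3, 3)] * N + [(-1, 1)] * M + [(1e-6, None)]
    res = minimize(objective, x0, method="L-BFGS-B", bounds=bounds,
                   options={"maxiter": max_iter, "ftol": tol})
    return res.x[:N], res.x[N:N + M], res.x[-1]
```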

3.3.3. Third Approach K-Fold Cross-Validation and Generalization Testing

To validate the robustness of estimated parameters, a K-fold Cross-Validation (K-fold CV) framework with five folds was implemented. All parameters (θ, b, λ) were optimized in the training set using the L-BFGS-B algorithm. Only θ was re-estimated in the test set, ensuring realistic ability estimation without modifying pre-trained item and time–effect parameters.
This method allowed for generalizability testing, reducing the risk of overfitting while providing confidence intervals for θ estimates across different samples.
In brief, the three implementations are summarized below:
In the first implementation (the MLE-based approach), the item difficulties (b) are kept fixed, and only θ and λ are optimized. However, this approach did not include cross-validation (K-fold CV).
In the second implementation, all parameters are optimized together, and regularization is introduced to prevent overfitting. However, statistical tests were not directly applied.
In the third implementation (the version with K-fold CV), cross-validation is utilized to assess model generalizability, the optimized θ parameters are evaluated on the test set, and the mean AIC, BIC, and correlation values are calculated across all folds.
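A compact sketch of the cross-validation loop, in which b and λ are frozen at their training-fold values and only θ is re-estimated on the held-out students, might look like the following. It reuses the joint-estimation routine sketched above and is illustrative only; fold handling and the AIC/BIC bookkeeping are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.model_selection import KFold

def cross_validate(Y, T, fit_all_parameters, n_splits=5, lam_reg=0.01):
    """Approach 3 sketch: 5-fold CV over students; only theta is free on the test fold."""
    t_avg = T.mean(axis=0)
    fold_metrics = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(Y):
        _, b, lam = fit_all_parameters(Y[train_idx], T[train_idx], lam_reg)
        Y_te, T_te = Y[test_idx], T[test_idx]

        def nll(theta):
            # Negative log-likelihood of the test fold with b and lambda held fixed
            logits = theta[:, None] - b[None, :] - lam * (t_avg[None, :] / T_te)
            P = np.clip(1.0 / (1.0 + np.exp(-logits)), 1e-9, 1 - 1e-9)
            return -np.sum(Y_te * np.log(P) + (1 - Y_te) * np.log(1 - P))

        res = minimize(lambda th: nll(th) + lam_reg * np.sum(th ** 2),
                       np.zeros(len(test_idx)), method="L-BFGS-B",
                       bounds=[(-3, 3)] * len(test_idx))
        k, n = len(test_idx), Y_te.size              # only theta counts as free on the test fold
        aic = 2 * k + 2 * nll(res.x)
        bic = np.log(n) * k + 2 * nll(res.x)
        fold_metrics.append((aic, bic))
    return np.mean(fold_metrics, axis=0)             # mean AIC and BIC across the five folds
```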

3.4. Evaluation Metrics

The following performance metrics were used to evaluate the models:
  • AUC (Area Under the Curve): The AUC measures the model’s ability to distinguish between correct and incorrect responses. It is particularly useful for handling imbalanced datasets, where the proportion of correct and incorrect responses may vary significantly across items. For each item, the AUC was calculated by comparing the predicted probabilities of correct responses (derived from the model’s logit function) against the observed binary responses. The final AUC score was computed as the average across all items.
  • Item–Test Correlation: The correlation between student ability estimates (θ) and their overall test performance (mean response accuracy) was calculated to assess the consistency of the model’s ability estimates. This metric ensures that students with higher ability estimates tend to perform better across all items.
  • AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion): These criteria were used to compare models based on their goodness of fit while penalizing model complexity. Lower AIC and BIC values indicate a better balance between model fit and parsimony. The AIC and BIC were calculated using the negative log-likelihood value and the number of parameters in the model. The models with lower AIC and BIC values are preferred.
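A sketch of how these metrics could be computed from a matrix of observed responses Y and model probabilities P is shown below. The function and argument names are ours, and the negative log-likelihood and parameter count are assumed to come from the fitted model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_model(Y, P, theta, n_params, nll_value):
    """Illustrative metric computation: mean per-item AUC, item-test correlation, AIC, BIC."""
    # Mean per-item AUC, skipping items answered all-correct or all-wrong
    aucs = [roc_auc_score(Y[:, j], P[:, j])
            for j in range(Y.shape[1]) if len(np.unique(Y[:, j])) == 2]
    auc = float(np.mean(aucs))

    # Item-test correlation: agreement between theta estimates and mean accuracy per student
    item_test_r = float(np.corrcoef(theta, Y.mean(axis=1))[0, 1])

    # Information criteria from the (unpenalized) negative log-likelihood
    n = Y.size
    aic = 2 * n_params + 2 * nll_value
    bic = np.log(n) * n_params + 2 * nll_value
    return {"AUC": auc, "item_test_r": item_test_r, "AIC": aic, "BIC": bic}
```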

3.5. Statistical Validation

To ensure model generalizability and reliability, statistical validation techniques such as K-fold cross-validation with five folds were employed. This method helps prevent overfitting and provides a robust estimate of model performance on unseen data by systematically partitioning the dataset into training and testing sets. Additionally, bootstrapping techniques were applied to evaluate the statistical significance and stability of the time–effect parameter (λ). By generating multiple resampled datasets, this approach facilitated a comprehensive analysis of parameter variability, reinforcing the robustness of the model’s time-sensitive estimates. The combination of MLE-based parameter estimation, full optimization of (θ, b, λ), and cross-validation frameworks ensured that the models were rigorously tested, minimizing biases and improving predictive accuracy. To ensure data quality and reliability, response times were preprocessed by normalizing values and identifying outliers using both the Interquartile Range (IQR) method and the Z-score approach. Outliers were defined as response times exceeding 1.5 times the interquartile range or falling beyond ±3 standard deviations and were subsequently removed. Missing values were imputed using the median response time per item to minimize bias while preserving the dataset’s distribution.
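The outlier handling and imputation described here could be implemented along the following lines. The choice to compute the IQR and Z-score thresholds over the pooled response times, rather than per item, is an assumption of this sketch.

```python
import numpy as np

def preprocess_response_times(T):
    """Sketch of the response-time cleaning described above: flag outliers with the IQR rule
    and the Z-score rule, then impute removed or missing values with the item median."""
    T = T.astype(float).copy()
    q1, q3 = np.nanpercentile(T, 25), np.nanpercentile(T, 75)
    iqr = q3 - q1
    z = (T - np.nanmean(T)) / np.nanstd(T)
    outlier = (T < q1 - 1.5 * iqr) | (T > q3 + 1.5 * iqr) | (np.abs(z) > 3)
    T[outlier] = np.nan                                   # remove outliers
    item_medians = np.nanmedian(T, axis=0)                # per-item median response time
    T = np.where(np.isnan(T), item_medians[None, :], T)   # median imputation
    return T
```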

3.6. Model Fit Evaluation and Parameter Consideration

In this study, model fit was assessed using the Log-Likelihood (LL), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). While all three metrics are likelihood-based, AIC and BIC differ in how they penalize model complexity. AIC imposes a linear penalty (2k), whereas BIC applies a logarithmic penalty (ln (n)·k), where k is the number of estimated parameters, and n is the number of observations. In the full estimation approach (Approach 2), the value of k includes the total number of estimated parameters: student abilities (θ), item difficulties (b), and time sensitivity (λ) for time-aware models. In contrast, for the K-fold cross-validation approach (Approach 3), only θ was re-estimated on the test data, while b and λ were kept fixed from the training phase. Therefore, when calculating AIC and BIC for the test set, k was set equal to the number of θ parameters (i.e., number of students in the test fold). This distinction ensures that the complexity penalty reflects only the truly optimized components of the model during each evaluation context. By treating b and λ as constants in the cross-validation setup, we maintain consistency with the principles of information-theoretic model selection and ensure fair comparisons across different modeling approaches.
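For reference, and using the k and n defined above with \hat{L} denoting the maximized likelihood, the two criteria take their standard forms:
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k \ln(n) - 2\ln\hat{L}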

4. Model Performance and Results Analysis

The results obtained from the three different computational approaches reveal significant differences in model performance, stability, and interpretability when estimating student abilities using Item Response Theory (IRT) models. Each approach employed different optimization strategies, regularization techniques, and validation methodologies, directly influencing the effectiveness of the seven tested models.

4.1. First Computational Approach Analysis

The first approach implemented direct parameter estimation without cross-validation, making it computationally efficient but potentially prone to overfitting. While it incorporated basic regularization to constrain parameter values, it lacked hyperparameter tuning and extensive evaluation across different data splits. The results indicated that the 1PL-IRT model (standard IRT) performed the best, achieving a relatively low AIC (1776.87) and BIC (1961.77) while maintaining a high AUC (0.712). The TP-IRT model exhibited comparable performance, though it had slightly lower AIC/BIC values (1773.15/1960.13), suggesting a minor trade-off between model complexity and fit. In contrast, TWD-IRT and NRT-IRT models demonstrated significantly lower AUC values (0.693 and 0.681, respectively), along with higher AIC/BIC scores, indicating that these transformations introduced instability rather than improving predictive accuracy. The STP-IRT model performed worse than expected, showing elevated AIC/BIC values (2118.26/2305.23) and a low AUC (0.692), which suggests that subtracting a scaled time ratio negatively impacted likelihood estimation. Despite these variations, test information (TIF) remained relatively stable across models, reinforcing that the 1PL-IRT framework effectively captures student ability with minimal modifications.

4.2. Second Computational Approach Analysis

The second computational approach improved upon the first by introducing refined parameter initialization based on empirical distributions, leading to more stable convergence. Additionally, it applied L2 regularization, effectively mitigating overfitting while ensuring numerical stability. However, it still did not incorporate cross-validation, limiting its ability to assess generalizability. The results showed that the 1PL-IRT model once again performed exceptionally well, achieving AIC (1936.50) and BIC (2424.11) while maintaining a high AUC (0.8217). Notably, the DTA-IRT model closely followed the standard model in terms of AIC/BIC, suggesting that subtracting a scaled response time factor may provide slight improvements in estimation. The TP-IRT model also showed high performance with an AUC of 0.8220, which was nearly identical to the 1PL-IRT model. The TWD-IRT model continued to underperform, showing high AIC (3598.07) and BIC (4085.68), with a substantial drop in AUC (0.400), confirming that multiplicative response time transformations introduce noise rather than valuable predictive information. Interestingly, the NRT-IRT and ART-IRT models yielded poor AUC values (0.339 and 0.429, respectively), indicating a failure to effectively differentiate between high- and low-ability students. These findings highlight that while some response-time-based modifications may enhance the estimation process, certain transformations, particularly those involving multiplication, can significantly degrade model performance.

4.3. Third Computational Approach Analysis

The third computational approach, which emerged as the most reliable, utilized five-fold cross-validation to systematically evaluate model performance across multiple training and testing subsets. This approach not only provided more robust estimates but also refined the optimization process with stricter parameter constraints, resulting in greater stability and interpretability. The results showed that the 1PL-IRT standard model once again outperformed all alternatives, achieving the lowest AIC (1416.44), BIC (1574.11), and highest AUC (0.804). The TP-IRT model followed closely, maintaining a similar AUC (0.793) while exhibiting slightly higher AIC/BIC values compared to the 1PL-IRT model. This suggests that while subtracting a scaled time factor does not drastically improve performance, it does not negatively impact the model either. The TWD-IRT and NRT-IRT models once again showed poor performance, with AUC values as low as 0.772 and 0.759, respectively, alongside elevated AIC/BIC values. The ART-IRT model performed the worst in this approach, with an AUC of 0.734, indicating that the multiplication-based transformations continued to degrade performance. The DTA-IRT model, however, showed a slight improvement in stability, with AIC/BIC values comparable to the 1PL-IRT standard model, suggesting that a subtractive transformation may be more theoretically justified than a multiplicative one. Test information function (TIF) values remained within a reasonable range for most models, with the standard model maintaining the highest reliability in ability estimation.

4.4. Overall Findings

The 1PL-IRT model remains the most stable, generalizable, and interpretable framework for ability estimation. The DTA-IRT model provides a slight improvement in specific cases but does not significantly outperform the standard model. Specifically, the DTA-IRT model demonstrated robust performance across key evaluation metrics in the third computational approach. It achieved an AUC value of 0.804, which was the highest among all models tested, indicating superior classification accuracy in distinguishing between correct and incorrect responses. In terms of model fit, DTA-IRT also recorded one of the lowest AIC (1410.343) and BIC (1573.265) values, as presented in Table 1, reflecting its strong balance between goodness-of-fit and model complexity. Table 2 further supports these findings by showing that the DTA-IRT model yielded consistently favorable log-likelihood scores and balanced item fit values across all three computational approaches. Furthermore, as shown in Table 3, the correlation between the DTA-IRT model’s ability estimates (θ) and those of the baseline 1PL-IRT model was 0.992, with a mean absolute error (MAE) of 0.216. This high correlation confirms that DTA-IRT preserves the examinee ranking consistency observed in standard models while incorporating response-time sensitivity. The observed trade-off lies in the model’s subtractive approach to time adjustment: while it successfully penalizes disproportionately long response times, promoting efficiency and fairness, it may slightly moderate ability estimates for high-performing students who invest more time per item. Thus, the combination of low AIC/BIC values, high AUC, and strong agreement with 1PL-IRT in θ estimates provides empirical justification for the model’s stability and interpretability, even if its advantage over the baseline is context-dependent rather than universal. Multiplicative transformations (TWD-IRT, NRT-IRT, ART-IRT) consistently degrade model performance, as evidenced by higher AIC/BIC values, lower AUC, and unstable correlation scores. The third computational approach (cross-validation + regularization) proved to be the most effective methodology, ensuring better generalization and robustness in parameter estimation. These results highlight that while response time can be incorporated into IRT models, its integration must be theoretically justified and empirically validated to avoid unnecessary complexity that reduces model reliability.
Table 1 compares the AIC, BIC, and AUC values of different IRT models across three computational approaches. The 1PL-IRT model consistently demonstrates the best performance, achieving the lowest AIC/BIC values and high AUC scores. TP-IRT and DTA-IRT models also perform well, showing stable results across all approaches. In contrast, TWD-IRT and ART-IRT models exhibit poor performance, particularly in the second approach, with significantly lower AUC values. The third approach outperforms the other two because it incorporates five-fold cross-validation and regularization techniques. This method provides more reliable parameter estimates, reduces overfitting, and improves model generalization. The lower AIC/BIC values and more balanced AUC scores indicate that the third approach offers a more robust methodology for model selection and performance evaluation.
Table 2 compares the performance of different IRT models based on TIF (Test Information Function), log-likelihood, and item fit values across the first, second, and third approaches. Higher log-likelihood values, higher TIF, and balanced item fit values indicate better model fit and reliability. Overall, the 1PL-IRT and DTA-IRT models demonstrated the most stable and reliable performance. Notably, the DTA-IRT model achieved high log-likelihood values across all three approaches, indicating a strong fit to the data and providing a high Test Information Function (TIF). Similarly, the TP-IRT model also performed well but showed slightly higher log-likelihood values compared to DTA-IRT. On the other hand, the TWD-IRT, NRT-IRT, and ART-IRT models exhibited weaker model fit, particularly in the second and third approaches, with lower log-likelihood values and less stable item fit results. These models, especially those incorporating multiplicative transformations, suffered from performance loss, suggesting that multiplicative adjustments may introduce additional noise, reducing estimation stability. In conclusion, these findings indicate that models using subtractive transformations (DTA-IRT and TP-IRT) tend to produce more stable and reliable results compared to those employing multiplicative transformations (TWD-IRT and ART-IRT). The consistent performance of the DTA-IRT model across all approaches suggests that integrating response time as a subtractive factor enhances measurement accuracy and model robustness.

4.5. Comparison of Theta Estimates Between 1PL-IRT and Time-Aware Models

To evaluate how time-aware models influence ability estimation, we compared the theta (θ) values obtained from the standard 1PL-IRT model with those derived from each of the six proposed models that incorporate response time. Two statistical measures were used: Pearson correlation coefficient (r) to assess the consistency of examinee rankings and Mean Absolute Error (MAE) to evaluate average estimation deviation.
Table 3 presents the results. Subtractive models (TP-IRT and DTA-IRT) demonstrate high agreement with the 1PL model, showing strong correlations (r > 0.99) and low MAE values (≈0.22). This suggests that these models retain the structural integrity of ability estimates while incorporating temporal behavior. In contrast, some multiplicative models (e.g., ART-IRT and NRT-IRT) exhibit lower correlation and higher divergence. Notably, STP-IRT had the highest MAE (1.60), suggesting potential over-penalization of slow responses.

4.6. Lambda (λ) Estimates

This section presents the optimized values of the response-time sensitivity parameter (λ) for each of the four time-aware IRT models. These estimates were obtained using both full-dataset optimization and five-fold cross-validation procedures to ensure robustness and generalizability. As shown in Table 4, TP-IRT consistently yielded negative λ values (−0.2688 with cross-validation and −0.2679 with full data), indicating a systematic penalization of longer response times. In contrast, TWD-IRT and ART-IRT produced small but positive λ values, reflecting only a mild reward for faster responses. The DTA-IRT model displayed a notably larger λ under cross-validation (0.1557) compared to full-dataset training (0.0146), which suggests a stronger regularization effect when generalization is prioritized. These λ estimates highlight how each model interprets the influence of response time differently, offering insight into their respective time-sensitivity dynamics.
These values confirm that λ was not fixed but was treated as a free parameter and empirically optimized during model training. The negative λ observed in TP-IRT supports the subtractive logic of this model, where prolonged response times correspond to reduced performance estimates. In contrast, TWD-IRT exhibited a small positive λ, reflecting a modest amplification effect whereby response speed slightly increases the effective ability–difficulty gap.
The relatively higher λ values observed in DTA-IRT and ART-IRT under cross-validation compared to the full-data scenario suggest that response-time sensitivity may vary across subpopulations and testing conditions. This variation highlights the relevance of flexible time-aware modeling, particularly in adaptive environments where generalization is crucial.
Furthermore, we acknowledge the nested nature of some of the models considered:
  • The STP-IRT model is a restricted case of DTA-IRT when λ = 1.
  • Similarly, the NRT-IRT model is a special case of ART-IRT under the same constraint.
However, the empirically optimized λ values were all significantly lower than 1 in both the full-data and cross-validation results. This observation empirically justifies the use of the more general DTA-IRT and ART-IRT formulations, which allow for greater flexibility and better data fit compared to their fixed-lambda counterparts.
Figure 1 compares the AIC values of different models under three different computational approaches. A lower AIC value indicates a better model fit to the data. DTA-IRT and 1PL-IRT models exhibit the lowest AIC values, suggesting that these models achieve better alignment with the dataset. TWD-IRT and ART-IRT models have the highest AIC values, indicating poorer model fit compared to other models. Approach 3 consistently produces the lowest AIC values, demonstrating that this approach is more effective in improving model performance. These findings highlight that AIC should not be the sole metric for model selection; other criteria must also be considered to ensure comprehensive model evaluation.
Figure 2 presents the Bayesian Information Criterion (BIC) values for different models. Lower BIC values suggest better model fit while avoiding excessive complexity. The 1PL-IRT and DTA-IRT models have the lowest BIC values, making them the most compatible with the dataset. ART-IRT and TWD-IRT models exhibit the highest BIC values, indicating that their complexity reduces their overall efficiency. Approach 3 yields the lowest BIC values across most models, whereas Approach 2 shows higher BIC values for some models. These results emphasize the importance of balancing model complexity and data fit, as overly complex models may not provide meaningful improvements.
Figure 3 displays the Area Under Curve (AUC) values for different models. A higher AUC value (closer to 1) indicates a better ability to distinguish between high- and low-ability students. Models trained with Approach 2 generally have the highest AUC values, suggesting improved predictive power. 1PL-IRT, DTA-IRT, and TP-IRT models demonstrate the best performance, with AUC values exceeding 0.80. ART-IRT and NRT-IRT models have the lowest AUC values, indicating weaker classification performance compared to other models. Approach 3 achieves better results in terms of AIC and BIC, whereas Approach 2 provides the best AUC scores. These findings suggest that model selection should not be based solely on statistical fit but also on predictive accuracy, ensuring a well-balanced approach to model evaluation.

4.7. Performance of Multiplicative Models

Multiplicative models, such as TWD-IRT and ART-IRT, generally underperformed, with lower AUC scores and higher AIC/BIC values. The poor performance of multiplicative models highlights the challenges of incorporating response time effects in a way that does not distort ability estimates. The key reason for this poor performance lies in how multiplicative models handle response time variability. Multiplicative transformations amplify small variations in response time, leading to disproportionately large effects on estimated ability levels. This issue becomes particularly problematic for students who either respond extremely fast or slow, leading to unreliable and erratic ability estimates.

4.8. Model Generalizability Through Cross-Validation

The inclusion of K-fold cross-validation provided robust evidence of the models’ generalizability. The DTA-IRT model demonstrated stable performance across folds, with low standard deviations in AUC and AIC values, reinforcing its validity for practical applications. This suggests that subtractive time adjustments, when appropriately scaled, offer a promising approach for improving ability estimation in timed testing environments. In the third approach involving K-fold cross-validation, only the ability parameters (θ) were re-estimated on the test set, while item difficulties (b) and time sensitivity parameters (λ) were fixed, using estimates from the training set. Consequently, when computing AIC and BIC for the test data, the number of free parameters (k) was set equal to the number of students in the test fold, as only θ was optimized. This ensures consistency with the principles of information-theoretic model evaluation and allows for a fair comparison of model generalization performance under fixed b and λ assumptions.

4.9. Findings of the Highest Performing Approach

The third approach stands out as the most stable and highest-performing method based on the data in the table. This approach enhances generalization and minimizes overfitting risks through five-fold cross-validation and regularization techniques. The 1PL-IRT and DTA-IRT models achieve the highest classification performance, with AUC values of 0.803 and 0.804, respectively. Additionally, the AIC and BIC values in this approach are the lowest compared to others, indicating better model fit and reduced complexity. In conclusion, the third approach is the most effective method in terms of both stability and predictive power.
In Figure 4, the bar chart compares the Area Under the Curve (AUC) scores for different IRT-based models, including the baseline 1PL-IRT model and six response-time-adjusted variants. The AUC represents the classification performance of each model, where higher values indicate better discrimination between correct and incorrect responses.
The DTA-IRT and TP-IRT models achieve the highest AUC scores, suggesting that subtractive time-based adjustments lead to better ability estimation. The 1PL-IRT model also performs well, indicating its robustness despite the lack of response-time integration. In contrast, multiplicative models such as TWD-IRT and ART-IRT display significantly lower AUC values, highlighting their instability in adjusting difficulty based on response time. These findings confirm that subtracting response time as a penalty is a more reliable approach than multiplicative scaling.
Figure 5 presents a line chart of the AIC and BIC values for each model. Lower AIC and BIC values indicate better model fit while avoiding unnecessary complexity.
The DTA-IRT and TP-IRT models exhibit the lowest AIC and BIC scores, reinforcing their efficiency in modeling response-time effects without adding unnecessary complexity. The 1PL-IRT model has slightly higher AIC/BIC values, likely due to its lack of response-time adjustments. On the other hand, multiplicative models (TWD-IRT, ART-IRT) show significantly higher AIC/BIC values, indicating greater complexity without corresponding improvements in performance. These results align with existing research, which cautions that multiplicative transformations in IRT models can lead to parameter instability.
Figure 6 presents the relationship between AUC (Area Under Curve) scores and correlation values for different models. AUC represents the classification performance of a model, while correlation measures the consistency between predicted and actual ability estimates. DTA-IRT, 1PL-IRT, and TP-IRT models exhibit the highest correlation values (~1.0) while maintaining high AUC scores (~0.80), indicating strong predictive accuracy and stability. STP-IRT achieves a moderate correlation (~0.85) but a slightly lower AUC score, suggesting it may introduce some instability in prediction performance. TWD-IRT and NRT-IRT models show significantly lower AUC scores (~0.70–0.75) and moderate correlation values, implying weaker classification power compared to higher-performing models. ART-IRT has the lowest AUC score and correlation, indicating poor classification accuracy and unreliable predictions. Overall, models with higher AUC and correlation values (DTA-IRT, 1PL-IRT, TP-IRT) are more reliable for estimating student abilities, while models with lower values (ART-IRT, NRT-IRT) demonstrate limited predictive effectiveness.
The DTA-IRT and TP-IRT models show high AUC values with minimal variance, confirming their robust performance across different data splits. The 1PL-IRT model also demonstrates consistency, though it lacks response-time sensitivity. In contrast, multiplicative models (TWD-IRT, NRT-IRT, ART-IRT) exhibit high variance, suggesting instability and sensitivity to dataset variations. These results indicate that subtracting time effects (DTA-IRT, TP-IRT) provides more reliable and generalizable ability estimation than multiplicative scaling.
Figure 7 compares the distribution of errors in ability estimation across subtractive and multiplicative models. The width and spread of each box represent the degree of variability in estimation errors.
Subtractive models (DTA-IRT, STP-IRT, TP-IRT) exhibit narrower error distributions, indicating greater consistency and precision in ability estimation. In contrast, multiplicative models (TWD-IRT, ART-IRT, NRT-IRT) show wider distributions and more extreme outliers, suggesting higher variability and greater estimation errors. These findings further confirm that subtractive models are more stable for adjusting ability estimates based on response time, while multiplicative models introduce noise rather than improving model interpretability.
The figures provide clear empirical evidence supporting the superiority of subtractive response-time models (DTA-IRT, TP-IRT, STP-IRT) over multiplicative models (TWD-IRT, ART-IRT, NRT-IRT).
The DTA-IRT and TP-IRT models consistently outperform others across AUC, AIC/BIC, and cross-validation metrics, confirming that subtracting response time as a penalty leads to more accurate and stable estimations. In contrast, multiplicative models (TWD-IRT, ART-IRT) introduce greater variance, lower AUC scores, and wider error distributions, reinforcing the hypothesis that difficulty scaling based on response time is unstable and unreliable.
Additionally, the K-fold cross-validation analysis confirms the generalizability of subtractive models, as they maintain high performance across different training and testing splits. Multiplicative models, however, exhibit inconsistency, suggesting that their effectiveness is heavily dependent on the dataset used for training.

5. Limitations and Future Work

Despite the promising results obtained in this study, several limitations need to be addressed in future research. First, the dataset consisted of 150 students and 30 test items, which, while sufficient for initial comparisons, may limit the generalizability of the findings. Larger and more diverse datasets, covering students from different educational backgrounds and testing environments, are needed to validate the robustness of the proposed models. In addition, a relatively short 30-item test may not provide enough data to estimate student abilities (θ) and item difficulties (b) precisely, particularly for the more complex models that include time-based parameters. Future research should apply these models to longer tests or larger student samples to ensure stable and reliable parameter estimation.
Another key limitation is that the item difficulty parameters (b) were fixed in some approaches (e.g., MLE-based estimation). While this simplification reduces computational cost, it may introduce bias if the initial b estimates are not fully stable; optimizing the b parameters iteratively rather than precomputing them could yield more accurate and adaptable models. In addition, this study focused solely on One-Parameter Logistic (1PL) IRT models, which assume that all items share the same discrimination power. More richly parameterized models, such as the Two-Parameter (2PL) and Three-Parameter (3PL) IRT models, could offer deeper insights by incorporating item discrimination and guessing parameters, and integrating time effects within these models could further clarify the relationship between response time and ability estimation.
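As one illustration of this direction, the sketch below extends the subtractive time adjustment to a 2PL-style form, P(correct) = σ(a(θ − b) − λ·t̃), and estimates the discrimination (a), difficulty (b), ability (θ), and time-sensitivity (λ) parameters jointly with L-BFGS-B. The functional form, bounds, and starting values are assumptions made for the sketch and are not part of the study's implementation.

```python
# Sketch of one possible time-aware 2PL extension (not used in the study):
# P(correct) = sigmoid(a * (theta - b) - lam * t_std).
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def neg_log_lik_2pl(params, resp, t_std, n_students, n_items):
    theta = params[:n_students]
    a = params[n_students:n_students + n_items]                 # discrimination
    b = params[n_students + n_items:n_students + 2 * n_items]   # difficulty
    lam = params[-1]                                            # time sensitivity
    p = expit(a[None, :] * (theta[:, None] - b[None, :]) - lam * t_std)
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.sum(resp * np.log(p) + (1 - resp) * np.log(1 - p))

def fit_time_aware_2pl(resp, t_std):
    n_students, n_items = resp.shape
    x0 = np.concatenate([np.zeros(n_students),      # theta
                         np.ones(n_items),          # a starts near the Rasch case
                         np.zeros(n_items),         # b
                         [0.0]])                    # lambda
    bounds = ([(-4, 4)] * n_students +              # theta
              [(0.2, 3.0)] * n_items +              # keep a positive/identifiable
              [(-4, 4)] * n_items +                 # b
              [(-2, 2)])                            # lambda
    res = minimize(neg_log_lik_2pl, x0, args=(resp, t_std, n_students, n_items),
                   method="L-BFGS-B", bounds=bounds)
    return res.x
```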
Furthermore, the response time data used in this study were limited to the total time spent on each question. Test-taking behavior, however, is shaped by various cognitive and strategic factors, such as rapid guessing, distraction, and strategic time allocation. More detailed timing data, such as time spent reading versus solving a question or patterns of revisiting previous questions, could help the models distinguish between different student behaviors. Future research should therefore consider incorporating behavioral time data to improve the interpretability and predictive accuracy of response-time-enhanced IRT models.
The time adjustments in this study were modeled using linear transformations (subtractive and multiplicative effects). However, the relationship between response time and ability is likely nonlinear, and alternative transformations such as logarithmic, exponential, or hybrid models may provide a better representation of how time affects test performance. Investigating nonlinear models and their impact on ability estimation could lead to more precise and theoretically sound representations of time effects in IRT.
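As a concrete example of such a transformation, the brief sketch below replaces the linear time term with a logarithmic one; the specific form σ(θ − b − λ·ln(1 + t)) is only one plausible choice and is not drawn from the study itself.

```python
# Sketch of a nonlinear (logarithmic) time penalty, one of the alternatives
# suggested above.
import numpy as np
from scipy.special import expit

def p_correct_log_time(theta, b, lam, t_seconds):
    """Subtractive model with a log-compressed time effect: longer response
    times are still penalized, but with diminishing marginal impact."""
    return expit(theta - b - lam * np.log1p(t_seconds))
```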
Additionally, in future implementations, we plan to investigate adaptive item selection algorithms that jointly consider both difficulty and expected response time, which could improve the alignment between item selection and the response time-aware modeling framework.

6. Discussion

One important consideration pertains to the consistency between item selection and post-hoc scoring procedures. In this study, adaptive item selection was performed using a traditional IRT model (1PL-Rasch) without incorporating response time during test administration. However, response-time-aware models (e.g., DTA-IRT, ART-IRT) were applied retrospectively during the ability estimation phase. This discrepancy introduces a potential source of model mismatch—where item exposure and sequencing were governed by a time-agnostic selection algorithm, while final scoring accounted for timing behaviors. Such a mismatch may affect estimation fairness, particularly for high-ability students who might spend more time solving complex items and yet be penalized under subtractive models that assign lower ability scores to longer response latencies.
To mitigate this risk, the time-aware models used in this study employed regularization mechanisms to constrain the influence of λ (time sensitivity), resulting in stable and interpretable parameter estimates. Furthermore, the “TestYourself” platform was intentionally designed with a dual-scoring strategy that distinguishes between correctness-based and time-based performance feedback. This design empowers learners to evaluate not only their proficiency but also their response efficiency without conflating the two in the final score interpretation.
When item response time is formally adopted in IRT modeling, several practical and theoretical outcomes can be expected. First, the inclusion of time data enhances the diagnostic power of assessments by distinguishing deliberate reasoning, impulsive guessing, and disengaged behavior. This additional dimension allows for deeper interpretation of ability estimates and may lead to more equitable feedback for test-takers. Second, it is anticipated that incorporating response time can improve the efficiency of adaptive testing by enabling dual-criteria item selection algorithms that consider both information value and expected completion time. Furthermore, response-time-aware models may support fairness in assessment design, particularly for students who differ in processing speed due to anxiety, cognitive load, or language proficiency. However, such benefits are contingent on careful model design that avoids over-penalizing thoughtful or slow respondents. Therefore, the adoption of response time must be accompanied by appropriate calibration, regularization, and empirical validation to ensure that it improves measurement without compromising validity.
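One way to operationalize such a dual-criteria rule is to rank the remaining items by Fisher information per expected second of response time. The sketch below illustrates this for a Rasch item bank; the expected-time inputs and the information-per-time criterion are assumptions for illustration, not a procedure used in this study.

```python
# Sketch of a dual-criteria selection rule: choose the unused item with the
# highest Rasch Fisher information per expected second of response time.
import numpy as np
from scipy.special import expit

def select_next_item(theta_hat, b, expected_time, administered):
    """theta_hat: current ability estimate; b: item difficulties (array);
    expected_time: predicted response time per item in seconds (e.g., from
    pilot data); administered: boolean mask of items already presented."""
    p = expit(theta_hat - b)
    info = p * (1 - p)                 # Rasch item information at theta_hat
    score = info / expected_time       # information per expected second
    score[administered] = -np.inf      # never re-select an administered item
    return int(np.argmax(score))
```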
Empirical findings revealed that time-sensitive models consistently achieved better fit metrics (AIC, BIC) and improved generalizability across K-fold cross-validation, reinforcing the value of response time as a complementary source of information.
We also acknowledge that the item difficulty parameters (b) were calibrated from response accuracy alone, without incorporating response time during item calibration. This was an intentional methodological decision to isolate the impact of response time on ability estimation (θ) rather than confound it with item-level effects. Holding item difficulties fixed across all models allowed a clearer assessment of how response-time-aware scoring affects ability estimates. Even without time-informed item calibration, the time-sensitive models, especially subtractive approaches such as DTA-IRT, yielded improved model fit and generalizability, suggesting that student-level response-time adjustment can meaningfully enhance scoring accuracy. Future studies may benefit from jointly modeling item difficulty and expected response time for a more comprehensive calibration.
To clarify the experimental structure, the TestYourself platform used a standard 1PL-IRT (Rasch) model for item selection during test administration, applying the Fisher information criterion to identify the most informative items given the current ability estimate. The six time-sensitive models introduced in this study (e.g., DTA-IRT, ART-IRT) were applied exclusively in the post-hoc analysis phase, after the response accuracy and timing data had been collected. This separation was intentional: it kept the live test administration stable and interpretable, allowed a controlled evaluation of each model's ability estimation, and avoided introducing bias during data collection. Nevertheless, the inconsistency between the selection and scoring logic highlights an opportunity for future development. Future iterations of TestYourself may incorporate these models in real time to achieve a fully integrated adaptive testing system; in particular, we recommend designing CAT systems that integrate expected response time into the item selection algorithm itself, since such a unified, real-time framework could enhance both fairness and measurement precision for examinees across a wide range of ability levels.
Importantly, the causal interpretation of our results is grounded in a controlled modeling framework. Since all response accuracy and timing data were collected under the same conditions using the standard 1PL-IRT model, and item selection was held constant across all examinees, any observed improvements in AIC, BIC, or AUC across models can be directly attributed to the integration of response time during post-hoc ability estimation. The variation in results between subtractive and multiplicative time-aware models reflects differences in how response time is mathematically incorporated rather than experimental variability, thus establishing a clear cause–effect link between the modeling approach and estimation performance.
However, we acknowledge that our current investigation is limited to post-hoc scoring, and the adaptive item selection process itself did not incorporate response time. Therefore, our findings, while theoretically supportive of time-aware item selection, do not provide empirical validation for its effects on real-time test efficiency or learning outcomes. Future research should simulate or implement real-time CAT algorithms that integrate response time into both selection and scoring procedures to empirically assess their impact on fairness and performance estimation.

7. Conclusions

The findings of this study have important implications for educational assessment, test development, and adaptive learning systems. One of the key takeaways is that subtractive time adjustments, particularly DTA-IRT and TP-IRT, can enhance ability estimation without adding excessive complexity to the model. This result suggests that educational testing systems, particularly computerized adaptive testing (CAT) platforms, can integrate subtractive time adjustments to improve student assessment accuracy. By penalizing excessive response times proportionally rather than disproportionately affecting students with moderate response speeds, subtractive models provide a fairer and more stable alternative to traditional IRT models.
Another important implication relates to fairness in timed assessments. Many standardized tests impose strict time limits, potentially disadvantaging students who process information more carefully or those with slower reading speeds. This study shows that multiplicative time models, such as TWD-IRT, NRT-IRT, and ART-IRT, tend to disproportionately penalize slower respondents, introducing potential bias. In contrast, subtractive models (DTA-IRT, TP-IRT) offer a more balanced way of incorporating time effects while maintaining predictive accuracy. Test designers and policymakers should consider integrating subtractive time adjustments to ensure fairer assessments, particularly for tests that aim to evaluate knowledge and reasoning ability rather than speed alone.
Furthermore, the use of response-time-enhanced IRT models could improve automated test analysis and cheating detection. For example, very short response times combined with high accuracy may indicate rapid guessing or pre-exposure to test questions, whereas excessively long response times with low accuracy may indicate disengagement or test fatigue. By implementing the best-performing models (such as DTA-IRT) in automated scoring systems, test administrators can detect irregular response patterns and flag potential test-taking anomalies for further investigation.
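As a simple illustration of this kind of screening, the sketch below flags the two response patterns just described using percentile-based time thresholds; the thresholds and decision rules are illustrative and would require calibration on real test data before operational use.

```python
# Illustrative response-pattern screen for the anomalies described above.
import numpy as np

def flag_anomalies(correct, times):
    """correct: 0/1 array (students x items); times: response times in seconds.
    Returns boolean masks for (i) very fast correct answers, which may signal
    rapid guessing or item pre-exposure, and (ii) very slow incorrect answers,
    which may signal disengagement or fatigue."""
    fast, slow = np.percentile(times, [5, 95])
    rapid_correct = (times < fast) & (correct == 1)
    slow_incorrect = (times > slow) & (correct == 0)
    return rapid_correct, slow_incorrect
```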
Finally, the results of this study have implications for personalized learning environments. Adaptive learning platforms often adjust question difficulty based on previous responses, but they typically do not consider response time as an additional factor. The findings suggest that incorporating response time into adaptive question selection algorithms could improve the efficiency and accuracy of personalized learning systems. By considering both correctness and response time, these systems could dynamically adjust content delivery to match a student’s cognitive processing speed, ultimately leading to more effective and engaging learning experiences.
These findings contribute to the broader understanding of time-based IRT models by demonstrating that subtractive time adjustments (DTA-IRT, TP-IRT) provide a more robust and unbiased alternative to traditional approaches (1PL-IRT). While these findings offer promising directions, further research with larger and more diverse student populations is needed to validate their applicability across different testing environments. Additionally, future studies should explore nonlinear transformations of response time, such as logarithmic or exponential models, to better capture the relationship among time, ability estimation, and test performance. Overall, this study demonstrates that integrating response time into IRT models enhances assessment accuracy while promoting fairness and adaptability in educational measurement. Future research should continue refining these approaches to optimize learning outcomes and improve test validity in diverse educational settings.

Author Contributions

Conceptualization, A.H.İ.; Methodology, A.H.İ.; Software, A.H.İ.; Validation, A.H.İ. and S.Ö.; Resources, A.H.İ.; Writing—original draft, A.H.İ.; Writing—review & editing, S.Ö.; Visualization, A.H.İ. and S.Ö.; Supervision, S.Ö.; Project administration, S.Ö. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in compliance with the scientific research and publication ethics directive and approved by the local ethics committee in Gaziantep University (decision protocol no:007; Document no: E-87841438-050.99-489085 date: 24/04).

Informed Consent Statement

All participants were informed about the aims and procedures of the study prior to participation. Participation was voluntary, and informed consent was obtained from all individuals involved in the study.

Data Availability Statement

To support replication and transparency, the full source code used in this study (model implementation, optimization routines, and evaluation scripts) is publicly available at the following GitHub repository: https://github.com/eng-ahmethakan/ADAPTIVE-TESTING-WITH-ADDITIVE-RESPONSE-TIME-IMPROVING-FAIRNESS-AND-ACCURACY-IN-THE-IRT-RASCH-MODEL.git, accessed on 1 January 2026.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huang, H. Utilizing response times in cognitive diagnostic computerized adaptive testing under the higher-order deterministic input, noisy ‘and’ gate model. Br. J. Math. Stat. Psychol 2019, 73, 109–141. [Google Scholar] [CrossRef] [PubMed]
  2. Kingsbury, G.; Wise, S. Three measures of test adaptation based on optimal test information. J. Comput. Adapt. Test. 2020, 8, 1–19. [Google Scholar] [CrossRef]
  3. Shao, C.; Li, J.; Cheng, Y. Detection of test speededness using change-point analysis. Psychometrika 2016, 81, 1118–1141. [Google Scholar] [CrossRef] [PubMed]
  4. He, Y.; Qi, Y. Using response time in multidimensional computerized adaptive testing. J. Educ. Meas. 2023, 60, 697–738. [Google Scholar] [CrossRef]
  5. Oladele, B.K. Comparison of secondary school students’ ability in mathematics constructed-response items under classical test and item response measurement theories in the Ibadan Metropolis, Nigeria. Ph.D. Thesis, University of Ibadan, Ibadan, Nigeria, 2021. Available online: http://140.105.46.132:8080/xmlui/handle/123456789/1568 (accessed on 17 June 2025).
  6. Veldkamp, B. On the issue of item selection in computerized adaptive testing with response times. J. Educ. Meas. 2016, 53, 212–228. [Google Scholar] [CrossRef]
  7. Umobong, M.E. The one-parameter logistic model (1PLM) and its application in test development. Adv. Soc. Sci. Res. J. 2017, 4, 126–137. [Google Scholar]
  8. Sie, H.; Finkelman, M.; Riley, B.; Smits, N. Utilizing response times in computerized classification testing. Appl. Psychol. Meas. 2015, 39, 389–405. [Google Scholar] [CrossRef]
  9. Kerssies, T.; Vanschoren, J.; Kılıçkaya, M. Evaluating continual test-time adaptation for contextual and semantic domain shifts. arXiv 2022, arXiv:2208.08767. [Google Scholar] [CrossRef]
  10. Cheng, Y.; Diao, Q.; Behrens, J. A simplified version of the maximum information per time unit method in computerized adaptive testing. Behav. Res. Methods 2016, 49, 502–512. [Google Scholar] [CrossRef]
  11. Choi, S.; Moellering, K.; Li, J.; Linden, W. Optimal reassembly of shadow tests in CAT. Appl. Psychol. Meas. 2016, 40, 469–485. [Google Scholar] [CrossRef]
  12. Plajner, M.; Vomlel, J. Monotonicity in practice of adaptive testing. arXiv 2020, arXiv:2009.06981. [Google Scholar] [CrossRef]
  13. Cui, T.; Tang, C.; Zhou, D.; Li, Y.; Gong, X.; Ouyang, W.; Su, M.; Zhang, S. Online test-time adaptation for interatomic potentials. arXiv 2024, arXiv:2405.08308. [Google Scholar] [CrossRef]
  14. Kyllonen, P.C.; Zu, J. Use of response time for measuring cognitive ability. J. Intell. 2016, 4, 14. [Google Scholar] [CrossRef]
  15. Huang, S.; Luo, J.; Jeon, M. A response time-based mixture item response theory model for dynamic item-response strategies. Behav. Res. Methods 2025, 57, 321–336. [Google Scholar] [CrossRef]
  16. Wallin, G.; Chen, Y.; Lee, Y.-H.; Li, X. A latent variable model with change points and its application to time pressure effects in educational assessment. arXiv 2024, arXiv:2410.22300. [Google Scholar] [CrossRef]
  17. Colledani, D.; Barbaranelli, C.; Anselmi, P. Fast, smart, and adaptive: Using machine learning to optimize mental health assessment and monitor change over time. Sci. Rep. 2025, 15, 6492. [Google Scholar] [CrossRef]
  18. Huda, A.; Firdaus, F.; Irfan, D.; Hendriyani, Y. Optimizing educational assessment: The practicality of computer adaptive testing (CAT) with an item response theory (IRT) approach. JOIV Int. J. Inform. Vis. 2024, 8, 44–50. [Google Scholar] [CrossRef]
  19. Schroeders, U.; Gnambs, T. Sample-Size Planning in Item-Response Theory: A Tutorial. Adv. Methods Pract. Psychol. Sci. 2025, 13. [Google Scholar] [CrossRef]
  20. Han, Y.; Zhang, J.; Jiang, Z.; Shi, D. Is the area under curve appropriate for evaluating the fit of psychometric models? Educ. Psychol. Meas. 2023, 83, 586–608. [Google Scholar] [CrossRef]
  21. Tien, J.M. Internet of things, real-time decision making, and artificial intelligence. Ann. Data Sci. 2017, 4, 149–178. [Google Scholar] [CrossRef]
  22. Sajja, R.; Sermet, Y.; Cikmaz, M.; Cwiertny, D.; Demir, I. Artificial intelligence-enabled intelligent assistant for personalized and adaptive learning in higher education. Information 2024, 15, 596. [Google Scholar] [CrossRef]
  23. Martin, A.; Lazendic, G. Computer-adaptive testing: Implications for students’ achievement, motivation, engagement, and subjective test experience. J. Educ. Psychol. 2018, 110, 27–45. [Google Scholar] [CrossRef]
  24. Van Wijk, E.V.; Donkers, J.; De Laat, P.C.J.; Meiboom, A.A.; Jacobs, B.; Ravesloot, J.H.; Tio, R.A.; Van Der Vleuten, C.P.M.; Langers, A.M.J.; Bremers, A.J.A. Computer adaptive vs. non-adaptive medical progress testing: Feasibility, test performance, and student experiences. Perspect. Med. Educ. 2024, 13, 406–416. [Google Scholar] [CrossRef]
  25. Collares, C.; Cecílio-Fernandes, D. When I say … computerised adaptive testing. Med. Educ. 2018, 53, 115–116. [Google Scholar] [CrossRef] [PubMed]
  26. Eggen, T.J. Multi-segment computerized adaptive testing for educational testing purposes. In Frontiers in Education; Frontiers Media SA: Lausanne, Switzerland, 2018; Volume 3, p. 111. [Google Scholar] [CrossRef]
  27. van der Linden, W.J. A hierarchical framework for modeling speed and accuracy on test items. Psychometrika 2007, 72, 287–308. [Google Scholar] [CrossRef]
  28. Kisielewska, J.; Millin, P.; Rice, N.; Pego, J.M.; Burr, S.; Nowakowski, M.; Gale, T. Medical students’ perceptions of a novel international adaptive progress test. Educ. Inf. Technol. 2023, 29, 11323–11338. [Google Scholar] [CrossRef]
  29. van der Linden, W.J. Using response times for item selection in adaptive testing. J. Educ. Behav. Stat. 2008, 33, 5–20. [Google Scholar] [CrossRef]
  30. Wang, C.; Xu, G. A mixture hierarchical model for response times and response accuracy. Br. J. Math. Stat. Psychol. 2015, 68, 456–477. [Google Scholar] [CrossRef]
  31. Nagy, G.; Ulitzsch, E. A multilevel mixture IRT framework for modeling response times as predictors or indicators of response engagement in IRT models. Educ. Psychol. Meas. 2021, 82, 845–879. [Google Scholar] [CrossRef]
  32. Karadavut, T. The uniform prior for Bayesian estimation of ability in item response theory models. Int. J. Assess. Tools Educ. 2020, 6, 568–579. [Google Scholar] [CrossRef]
  33. van der Linden, W.J.; Jiang, B. A shadow-test approach to adaptive item calibration. Psychometrika 2020, 85, 301–321. [Google Scholar] [CrossRef]
  34. Yao, L.; Pommerich, M.; Segall, D.O. Using Multidimensional CAT to Administer a Short, Yet Precise, Screening Test. Appl. Psychol. Meas. 2014, 38, 614–631. [Google Scholar] [CrossRef]
  35. Choe, E.M.; Kern, J.L.; Chang, H.-H. Optimizing the Use of Response Times for Item Selection in Computerized Adaptive Testing. J. Educ. Behav. Stat. 2017, 43, 135–158. [Google Scholar] [CrossRef]
  36. Fan, Z.; Wang, C.; Chang, H.-H.; Douglas, J. Utilizing Response Time Distributions for Item Selection in CAT. J. Educ. Behav. Stat. 2012, 37, 655–670. [Google Scholar] [CrossRef]
  37. Ren, H.; van der Linden, W.J.; Diao, Q. Continuous online item calibration: Parameter recovery and item utilization. Psychometrika 2017, 82, 498–522. [Google Scholar] [CrossRef]
  38. Hassan, M.; Miller, F. Optimal item calibration for computerized achievement tests. Psychometrika 2019, 84, 1101–1128. [Google Scholar] [CrossRef] [PubMed]
  39. Ikonen, T.J.; Corona, F.; Harjunkoski, I. Likelihood maximization of lifetime distributions with bathtub-shaped failure rate. IEEE Trans. Reliab. 2023, 72, 759–773. [Google Scholar] [CrossRef]
  40. Loshchilov, I. LM-CMA: An alternative to L-BFGS for large-scale black box optimization. Evol. Comput. 2017, 25, 143–171. [Google Scholar] [CrossRef]
  41. Gao, G.; Florez, H.; Vink, J.C.; Wells, T.J.; Saaf, F.; Blom, C.P.A. Performance analysis of trust region subproblem solvers for limited-memory distributed BFGS optimization method. Front. Appl. Math. Stat. 2021, 7, 673412. [Google Scholar] [CrossRef]
  42. Kim, S.; Kolen, M.J. Scale linking for the testlet item response theory model. Appl. Psychol. Meas. 2022, 46, 79–97. [Google Scholar] [CrossRef]
  43. van der Linden, W.J. Review of the shadow-test approach to adaptive testing. Behaviormetrika 2021, 49, 169–190. [Google Scholar] [CrossRef]
  44. Reckase, M.D.; Ju, U.; Kim, S. How adaptive is an adaptive test: Are all adaptive tests adaptive? J. Comput. Adapt. Test. 2019, 7, 1–14. [Google Scholar] [CrossRef]
  45. Veldkamp, B.P.; Verschoor, A.J. Robust computerized adaptive testing. In Handbook of Automated Scoring; CRC Press: Boca Raton, FL, USA, 2019; pp. 291–305. [Google Scholar] [CrossRef]
  46. Perlstein, S.; Wagner, N.; Domínguez-Álvarez, B.; Gómez-Fraguela, J.A.; Romero, E.; Lopez-Romero, L.; Waller, R. Psychometric properties, factor structure, and validity of the sensitivity to threat and affiliative reward scale in children and adults. Assessment 2022, 30, 1914–1934. [Google Scholar] [CrossRef]
  47. van der Linden, W.J.; Niu, L.; Choi, S. A two-level adaptive test battery. J. Educ. Behav. Stat. 2023, 49, 730–752. [Google Scholar] [CrossRef]
  48. Kara, B. Computer adaptive testing simulations in R. Int. J. Assess. Tools Educ. 2019, 6, 44–56. [Google Scholar] [CrossRef]
  49. Uyigue, V.; Orheruata, M. Test length and sample size for item-difficulty parameter estimation in item response theory. J. Educ. Psychol. 2019, 10, 72–75. [Google Scholar] [CrossRef]
  50. Van der Linden, W.J.; Pashley, P.J. Item selection and ability estimation in adaptive testing. In Elements of Adaptive Testing; Springer: New York, NY, USA, 2009; pp. 3–30. [Google Scholar] [CrossRef]
  51. Lang, J.W.; Tay, L. The science and practice of item response theory in organizations. Annu. Rev. Organ. Psychol. Organ. Behav. 2021, 8, 311–338. [Google Scholar] [CrossRef]
  52. Molenaar, D.; Tuerlinckx, F.; van der Maas, H.L. A bivariate generalized linear item response theory modeling framework to the analysis of responses and response times. Multivar. Behav. Res. 2015, 50, 56–74. [Google Scholar] [CrossRef]
  53. Cai, L.; Choi, K.; Hansen, M.; Harrell, L. Item response theory. Annu. Rev. Stat. Its Appl. 2016, 3, 297–321. [Google Scholar] [CrossRef]
Figure 1. Comparison of AIC values across models.
Figure 2. Comparison of BIC values across models.
Figure 3. Comparison of AUC values across models.
Figure 4. AUC comparison chart—model AUC performance.
Figure 5. AIC/BIC comparison—model complexity analysis.
Figure 6. AUC vs. correlation for different models.
Figure 7. Comparison of multiplicative vs. subtractive models—error distribution.
Table 1. Comparison of AIC, BIC, and AUC values across three computational approaches.

| Model | First Approach AIC | First Approach BIC | First Approach AUC | Second Approach AIC | Second Approach BIC | Second Approach AUC | Third Approach AIC | Third Approach BIC | Third Approach AUC |
|---|---|---|---|---|---|---|---|---|---|
| 1PL-IRT | 1776.870 | 1961.771 | 0.712 | 1936.500 | 2424.108 | 0.8217 | 1416.443 | 1574.109 | 0.803 |
| TP-IRT | 1773.147 | 1960.125 | 0.712 | 1935.437 | 2423.044 | 0.8220 | 1472.328 | 1635.250 | 0.793 |
| TWD-IRT | 2370.730 | 2557.708 | 0.693 | 3598.069 | 4085.677 | 0.400 | 1689.128 | 1852.083 | 0.772 |
| STP-IRT | 2118.256 | 2305.234 | 0.692 | 1936.493 | 2424.101 | 0.8219 | 1764.483 | 1927.405 | 0.781 |
| NRT-IRT | 2083.337 | 2270.315 | 0.681 | 2632.602 | 3120.209 | 0.339 | 1511.271 | 1674.193 | 0.759 |
| DTA-IRT | 1761.410 | 1948.388 | 0.711 | 1936.493 | 2424.100 | 0.8222 | 1410.343 | 1573.265 | 0.804 |
| ART-IRT | 2022.327 | 2209.305 | 0.675 | 2632.752 | 3120.360 | 0.429 | 1614.018 | 1776.940 | 0.734 |
Table 2. Performance comparison of different IRT models.

| Model | First Approach TIF | First Approach Log-Likelihood | First Approach Item Fit | Second Approach TIF | Second Approach Log-Likelihood | Second Approach Item Fit | Third Approach TIF | Third Approach Log-Likelihood | Third Approach Item Fit |
|---|---|---|---|---|---|---|---|---|---|
| 1PL-IRT | 4.5490 | −799.1263 | 0.1461 | 5.9506 | −678.2216 | 0.1599 | 0.1770 | −678.2216 | 0.1553 |
| TP-IRT | 4.8375 | −806.1206 | 0.1467 | 5.9416 | −678.0587 | 0.1598 | 0.1770 | −678.0587 | 0.1553 |
| TWD-IRT | 5.658 | −882.1208 | 0.2981 | 6.3143 | −756.0996 | 0.3593 | 0.2297 | −756.0996 | 0.3153 |
| STP-IRT | 4.6812 | −810.4032 | 0.1489 | 5.9505 | −724.0715 | 0.1599 | 0.1754 | −724.0715 | 0.1558 |
| NRT-IRT | 5.0883 | −858.6678 | 0.1593 | 7.5000 | −950.4710 | 0.2501 | 0.1229 | −950.4710 | 0.4016 |
| DTA-IRT | 4.2493 | −806.4547 | 0.1485 | 5.9505 | −677.8257 | 0.1599 | 0.1773 | −677.8257 | 0.1552 |
| ART-IRT | 4.7244 | −858.5565 | 0.1598 | 7.5000 | −893.0601 | 0.2501 | 0.1868 | −893.0601 | 0.3582 |
Table 3. Comparison of theta estimates relative to the 1PL model.

| Model | Theta Correlation with 1PL | Theta MAE vs. 1PL |
|---|---|---|
| TP-IRT | 0.998 | 0.224 |
| DTA-IRT | 0.992 | 0.216 |
| TWD-IRT | 0.908 | 0.435 |
| NRT-IRT | 0.809 | 0.505 |
| STP-IRT | 0.868 | 1.603 |
| ART-IRT | 0.784 | 0.493 |
Table 4. Optimized lambda (λ) estimates for time-aware IRT models.

| Model | λ (Cross-Validation) | λ (Full Dataset) |
|---|---|---|
| TP-IRT | −0.2688 | −0.2679 |
| TWD-IRT | 0.0574 | 0.0556 |
| DTA-IRT | 0.1557 | 0.0146 |
| ART-IRT | 0.0787 | 0.0146 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
