Abstract
The root mean square deviation (RMSD) is a widely used item fit statistic in item response models. However, the sample RMSD is known to exhibit positive bias in small samples. To address this, seven alternative bias-corrected RMSD estimators are proposed and evaluated in a simulation study involving items with uniform differential item functioning (DIF). The results demonstrate that the proposed estimators effectively reduce the bias of the original RMSD statistic. Their performance is compared, and the most favorable estimators are highlighted for empirical research. Finally, the application of the various RMSD statistics is illustrated using PISA 2006 reading data.
1. Introduction
Item response theory (IRT) models [1,2,3,4,5,6] are multivariate statistical models for vectors of discrete random variables. They are widely used in the social sciences, particularly in educational large-scale assessment (LSA; [7]) studies, where cognitive tasks are administered.
This article considers dichotomous (binary) random variables. Let $\bm{X} = (X_1, \ldots, X_I)$ denote a vector of $I$ items, with $X_i \in \{0, 1\}$. A unidimensional IRT model [8] specifies the probability distribution of $\bm{X}$ as

$P(\bm{X} = \bm{x}; \bm{\delta}, \bm{\gamma}) = \int \prod_{i=1}^{I} P_i(\theta; \bm{\gamma}_i)^{x_i} \left[ 1 - P_i(\theta; \bm{\gamma}_i) \right]^{1 - x_i} \phi(\theta; \mu, \sigma) \, \mathrm{d}\theta \quad (1)$
where $\phi(\theta; \mu, \sigma)$ is the normal density with mean $\mu$ and SD $\sigma$. The latent variable $\theta$, often referred to as a trait or ability, has distribution parameters collected in $\bm{\delta} = (\mu, \sigma)$. The vector $\bm{\gamma} = (\bm{\gamma}_1, \ldots, \bm{\gamma}_I)$ contains the item parameters for the parametric item response functions (IRFs) $P_i(\theta; \bm{\gamma}_i) = P(X_i = 1 \mid \theta)$. The latent ability variable $\theta$ can be viewed as a unidimensional summary of the high-dimensional contingency table of item responses (see [9]). Larger values of $\theta$ (typically positive) are associated with more able persons who solve more items, whereas smaller values of $\theta$ (typically negative) indicate less able persons.
The two-parameter logistic (2PL; [10]) model is among the most widely used IRT models. Its IRF is given by

$P_i(\theta; \bm{\gamma}_i) = \Psi\big( a_i (\theta - b_i) \big) \quad (2)$
where $a_i$ and $b_i$ denote item discrimination and item difficulty, respectively, and $\Psi(x) = [1 + \exp(-x)]^{-1}$ is the logistic function. The item parameter vector is denoted by $\bm{\gamma}_i = (a_i, b_i)$. The 2PL model accommodates differences in how strongly items relate to the ability variable through variation in the discrimination parameter $a_i$, with larger positive values indicating stronger associations and values near zero indicating weaker associations. Variation in the marginal probability of answering an item correctly is mainly captured by the difficulty parameter $b_i$, where positive values correspond to more difficult items and negative values correspond to easier items.
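To make the parameterization concrete, the following minimal R sketch evaluates the 2PL IRF in (2) on a grid of ability values; the function name irf_2pl and all parameter values are illustrative and not taken from the article.

```r
# 2PL item response function P_i(theta) = Psi(a_i * (theta - b_i));
# plogis() is the logistic function Psi in base R
irf_2pl <- function(theta, a, b) {
  plogis(a * (theta - b))
}

theta <- seq(-4, 4, by = 0.1)
p_easy <- irf_2pl(theta, a = 1.2, b = -1.0)  # easy item (negative difficulty)
p_hard <- irf_2pl(theta, a = 1.2, b = 1.5)   # hard item (positive difficulty)
```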
For a sample of $N$ individuals with independent and identically distributed observations $\bm{x}_1, \ldots, \bm{x}_N$ of $\bm{X}$, the parameters of the IRT model in (1) are consistently estimated via marginal maximum likelihood (MML; [11]), commonly implemented using the EM algorithm [12].
In LSA applications such as the Programme for International Student Assessment (PISA; [13,14]), IRFs are typically modeled parametrically as in (1). These studies involve multiple countries, and it is generally assumed that item parameters are invariant across countries. In practice, this assumption may be violated, as certain items can systematically advantage or disadvantage specific countries. This phenomenon is known as differential item functioning (DIF; [15,16]), although alternative terms such as measurement bias or item bias are also used [17]. The presence of DIF may bias group differences, which motivates the search for items with DIF to ensure that comparisons are less distorted [18]. However, DIF may also cancel out across the test (i.e., balanced DIF; [19,20,21]), a situation in which removing DIF items from the test might not be required.
As a result, the assumed IRF $P_i$ represents a (slight) misspecification of the true IRT model (1) for a given country (or group, henceforth). The distribution of the multivariate random vector $\bm{X}$ can then be expressed as

$P(\bm{X} = \bm{x}; \bm{\delta}) = \int \prod_{i=1}^{I} \tilde{P}_i(\theta)^{x_i} \left[ 1 - \tilde{P}_i(\theta) \right]^{1 - x_i} \phi(\theta; \mu, \sigma) \, \mathrm{d}\theta \quad (3)$

where $P_i$ in (1) denotes the assumed IRF and $\tilde{P}_i$ represents the IRF in the data-generating model for the group. In practice, the approximation of $\tilde{P}_i$ by $P_i$ is generally expected to introduce only minimal distortion in estimating the distribution parameters $\mu$ and $\sigma$.
Assessing the adequacy of parametric IRFs (i.e., item fit; [22,23,24,25,26]) is a central issue in psychometrics. The discrepancy between the true IRF and the assumed parametric IRF should be quantified using an appropriate effect size measure, ideally accompanied by statistical inference. Of particular interest is the identification of misfitting items $i$ for which the assumed IRFs $P_i$ deviate substantially from the true IRFs $\tilde{P}_i$.
The present study focuses on the root mean square deviation (RMSD; [27,28,29,30,31,32,33,34,35,36]). The motivation for examining the RMSD lies in its widespread use in current PISA studies.
A weighted RMSD statistic is defined following Joo et al. [37]. Let $\omega_i$ denote a weighting function for item $i$ that integrates to 1; that is, $\int \omega_i(\theta) \, \mathrm{d}\theta = 1$ (see [37]). The weighting function may be item-specific or common across items. The primary objective is to quantify the discrepancy between a data-generating IRF $\tilde{P}_i$ and the model-assumed IRF $P_i$. The RMSD statistic applies the weights to the deviations $\tilde{P}_i(\theta) - P_i(\theta)$ to provide a summarized measure of model-data discrepancy. The weighted RMSD statistic summarizes squared IRF differences by calculating

$\mathrm{RMSD}_i = \sqrt{ \int \left( \tilde{P}_i(\theta) - P_i(\theta) \right)^2 \omega_i(\theta) \, \mathrm{d}\theta } \quad (4)$
If the weighting function is chosen as the normal density $\phi(\theta; \mu, \sigma)$ with group mean $\mu$ and group SD $\sigma$ and is the same across all items, the RMSD statistic is referred to as the distribution-weighted RMSD. If the weighting function is item-specific and chosen as the normal density $\phi(\theta; b_i, 1)$ with mean $b_i$ (the assumed item difficulty in the model) and an SD of 1, it is referred to as the difficulty-weighted RMSD.
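The two weighting schemes can be illustrated with a short R sketch that evaluates both weighting functions on a discrete theta grid; the grid and the parameter values are illustrative assumptions.

```r
theta <- seq(-6, 6, length.out = 121)  # theta grid
mu <- 0; sigma <- 1                    # group mean and SD
b_i <- 1.5                             # assumed item difficulty

# distribution-weighted: common normal weight phi(theta; mu, sigma)
w_dist <- dnorm(theta, mean = mu, sd = sigma)
w_dist <- w_dist / sum(w_dist)         # normalize to sum to 1 on the grid

# difficulty-weighted: item-specific normal weight phi(theta; b_i, 1)
w_diff <- dnorm(theta, mean = b_i, sd = 1)
w_diff <- w_diff / sum(w_diff)
```

For an item with difficulty far from the group mean, w_diff concentrates the weight in the region where the item discriminates, whereas w_dist emphasizes the region where most persons are located.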
The assumed IRF $P_i$ typically depends on item parameters $\bm{\gamma}_i$ that are either known (i.e., fixed) or estimated. In LSA studies, the item parameters used in $P_i$ often originate from an international scaling step in which item parameters are obtained by pooling all country-specific datasets into a single total dataset. The resulting international item parameters are then treated as fixed in the subsequent country-specific scaling step. Differences between country-specific IRFs $\tilde{P}_i$ and the model-assumed IRF $P_i$ typically arise from country DIF [38]; that is, the functioning of items differs slightly across countries. For instance, some items become easier in a particular country relative to the international average, whereas others become more difficult. The RMSD statistic (4) is therefore used to detect items exhibiting country DIF.
As another application, $P_i$ in (4) can represent the assumed IRF in the 2PL model. If the IRT model is treated as a working model, the true IRF $\tilde{P}_i$ differs from the model-assumed IRF $P_i$. For example, the true IRF might include a guessing parameter [39] or could take the form of any monotone function of the ability variable $\theta$ (see [40,41]). In this context, the RMSD statistic can be employed to detect misfitting items.
The RMSD definition (4) is formulated at the population level and does not involve sample data. In empirical research, model fit is assessed in datasets with limited sample sizes. Therefore, it is essential to investigate estimators of the RMSD statistic that perform well in small-sample settings.
A sample-based version of the RMSD statistic is now defined. Let $\hat{p}_{it}$ denote the observed IRF, a sample-based estimate of $\tilde{P}_i$, evaluated at a theta grid point $\theta_t$ ($t = 1, \ldots, T$) as

$\hat{p}_{it} = \frac{\sum_{n=1}^{N} h_n(\theta_t) \, x_{ni}}{\sum_{n=1}^{N} h_n(\theta_t)} \quad (5)$
where $h_n(\theta_t)$ represents the posterior distribution of person $n$ at grid point $\theta_t$, and $x_{ni}$ denotes the response of person $n$ to item $i$. The posterior distribution is typically obtained by fitting the IRT model via MML [11], so the quantities in (5) can be computed directly from standard software output. A discrete evaluation of the weighting function is then defined as

$w_{it} = \frac{\omega_i(\theta_t)}{\sum_{u=1}^{T} \omega_i(\theta_u)} \quad (6)$
The sample-based RMSD is then defined as

$\widehat{\mathrm{RMSD}}_i = \sqrt{ \sum_{t=1}^{T} w_{it} \left( \hat{p}_{it} - p_{it} \right)^2 } \quad (7)$

where $p_{it} = P_i(\theta_t; \bm{\gamma}_i)$ denotes the model-implied item response probability at grid point $\theta_t$.
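A minimal R sketch of the computations in (5) to (7) is given below, assuming that a posterior matrix h (persons × grid points), the item responses x_i, the model-implied probabilities p_i, and the discrete weights w_i are available from an MML fit; all object names are illustrative.

```r
# sample-based RMSD (7) computed from the observed IRF (5)
sample_rmsd <- function(h, x_i, p_i, w_i) {
  # observed IRF (5): posterior-weighted proportion correct per grid point
  p_hat <- colSums(h * x_i) / colSums(h)
  # weighted root mean square deviation (7)
  sqrt(sum(w_i * (p_hat - p_i)^2))
}
```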
It has been shown in [33] that the RMSD estimator (7) is positively biased in small samples. Consequently, it is desirable to develop bias-corrected RMSD estimators with improved properties. This paper proposes seven alternative bias-corrected RMSD estimators, derived based on asymptotic bias considerations. The performance of these estimators is evaluated in a simulation study in which items exhibit DIF.
The rest of the article is organized as follows. Section 2 introduces the newly proposed bias-corrected RMSD estimators. Section 3 reports results from a simulation study examining the performance of these RMSD statistics under uniform DIF. Section 4 presents an empirical example using PISA 2006 reading data. Finally, Section 5 closes with a discussion.
2. Derivation of Bias-Corrected RMSD Estimators
This section discusses alternative bias-corrected estimators of the RMSD statistic. The population RMSD value is defined as

$\mathrm{RMSD}_{0i} = \sqrt{ \sum_{t=1}^{T} w_{it} \left( \tilde{p}_{it} - p_{it} \right)^2 } \quad (8)$

where $\tilde{p}_{it} = \tilde{P}_i(\theta_t)$. This statistic depends on the true data-generating item response probabilities $\tilde{p}_{it}$, which are unknown and must be inferred to compute the RMSD statistic. The quantities $w_{it}$ and $p_{it}$ are treated as fixed in the following discussion.
The originally proposed RMSD statistic is defined as

$\widehat{\mathrm{RMSD}}_i = \sqrt{ \sum_{t=1}^{T} w_{it} \left( \hat{p}_{it} - p_{it} \right)^2 } \quad (9)$

and depends on the estimated item response probabilities $\hat{p}_{it}$. The substitution of $\tilde{p}_{it}$ with $\hat{p}_{it}$ introduces sampling errors, which lead to a positive bias because these errors enter the squared terms in (9) (see [33,42]).
A bias correction approach relies on quantifying the sampling variances of the estimates $\hat{p}_{it}$. Let $V_i$ denote the variance matrix of $(\hat{p}_{i1}, \ldots, \hat{p}_{iT})$. This matrix can be estimated using the framework of M-estimation [43], as the estimates satisfy the estimating equations

$\sum_{n=1}^{N} h_n(\theta_t) \left( x_{ni} - \hat{p}_{it} \right) = 0 \quad \text{for } t = 1, \ldots, T \quad (10)$
where $h_n(\theta_t)$ represents the posterior probability of subject $n$ at the theta grid point $\theta_t$. The variance of $\hat{p}_{it}$ is given by

$\widehat{\mathrm{Var}}(\hat{p}_{it}) = \frac{\sum_{n=1}^{N} h_n(\theta_t)^2 \left( x_{ni} - \hat{p}_{it} \right)^2}{\left( \sum_{n=1}^{N} h_n(\theta_t) \right)^2} \quad (11)$
and the covariance between $\hat{p}_{it}$ and $\hat{p}_{iu}$ is expressed as

$\widehat{\mathrm{Cov}}(\hat{p}_{it}, \hat{p}_{iu}) = \frac{\sum_{n=1}^{N} h_n(\theta_t) h_n(\theta_u) \left( x_{ni} - \hat{p}_{it} \right) \left( x_{ni} - \hat{p}_{iu} \right)}{\left( \sum_{n=1}^{N} h_n(\theta_t) \right) \left( \sum_{n=1}^{N} h_n(\theta_u) \right)} \quad (12)$
For distant grid points $\theta_t$ and $\theta_u$, the numerator in (12) shows that the covariance is approximately zero, since $h_n(\theta_t) h_n(\theta_u) \approx 0$ when the posterior distribution of subject $n$ is concentrated around a specific $\theta$ value. Note that the formulas (11) and (12) also appeared in [44].
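Under the same illustrative conventions as above, the variance and covariance formulas (11) and (12) can be computed jointly as a $T \times T$ matrix:

```r
# M-estimation-based covariance matrix of the observed IRF estimates;
# entry (t, u) implements (12), and the diagonal implements (11)
phat_covmat <- function(h, x_i, p_hat) {
  r <- h * outer(x_i, p_hat, "-")      # h_n(theta_t) * (x_ni - p_hat_it)
  denom <- colSums(h)                  # sum_n h_n(theta_t)
  (t(r) %*% r) / outer(denom, denom)   # T x T variance matrix
}
```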
The bias in the squared RMSD, $\widehat{\mathrm{MSD}}_i = \widehat{\mathrm{RMSD}}_i^2$, is first considered. It holds that

$E\big[\widehat{\mathrm{MSD}}_i\big] = \mathrm{RMSD}_{0i}^2 + B_i \quad (13)$

with the bias term $B_i$ given below. Note that

$(\hat{p}_{it} - p_{it})^2 = (\tilde{p}_{it} - p_{it})^2 + 2 e_{it} (\tilde{p}_{it} - p_{it}) + e_{it}^2 \quad (14)$

where $e_{it} = \hat{p}_{it} - \tilde{p}_{it}$. Because $E[e_{it}] = 0$ holds approximately, this directly implies

$E\big[(\hat{p}_{it} - p_{it})^2\big] = (\tilde{p}_{it} - p_{it})^2 + \mathrm{Var}(\hat{p}_{it}),$

which proves (13).
The second term in (13) represents the positive bias in $\widehat{\mathrm{MSD}}_i$ and is given by

$B_i = \sum_{t=1}^{T} w_{it} \, \mathrm{Var}(\hat{p}_{it}) \quad (15)$
This result motivates the bias-corrected RMSD estimator

$\widehat{\mathrm{RMSD}}_{1i} = \sqrt{\big[\, \widehat{\mathrm{MSD}}_i - \hat{B}_i \,\big]_{+}} \quad (16)$

where $[x]_{+} = \max(x, 0)$ and $\hat{B}_i$ denotes the estimate of $B_i$ obtained by inserting the variance estimates (11) into (15). This approach was also proposed in [33], but without incorporating the more appropriate variance estimation given in (11) and (12).
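A sketch of the estimator (16) in R, reusing the covariance matrix V from the previous sketch, could look as follows; all names are illustrative.

```r
# bias-corrected estimator (16): correct the squared RMSD by the bias
# term (15) and truncate at zero before taking the square root
rmsd_bc1 <- function(p_hat, p_i, w_i, V) {
  msd_hat <- sum(w_i * (p_hat - p_i)^2)  # squared RMSD, i.e., (9) squared
  B_hat   <- sum(w_i * diag(V))          # estimated bias term (15) via (11)
  sqrt(max(msd_hat - B_hat, 0))          # truncated correction (16)
}
```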
However, Jensen’s inequality [43,45] implies that

$E\big[\sqrt{Z}\big] \leq \sqrt{E[Z]} \quad (17)$

for any positive random variable $Z$. Applying (17) to $Z = \widehat{\mathrm{MSD}}_i - \hat{B}_i$ and disregarding negative values yields

$E\big[\widehat{\mathrm{RMSD}}_{1i}\big] \leq \sqrt{E\big[\widehat{\mathrm{MSD}}_i - \hat{B}_i\big]} = \mathrm{RMSD}_{0i} \quad (18)$
Thus, $\widehat{\mathrm{RMSD}}_{1i}$ exhibits a negative bias. A bias correction applied to the squared RMSD (i.e., to $\widehat{\mathrm{MSD}}_i$) removes the bias on the MSD scale, but introduces a negative bias on the RMSD scale. Therefore, it is more appropriate to derive a bias correction directly for $\widehat{\mathrm{RMSD}}_i$ that operates on the square-root scale.
A second type of bias correction can be derived using a Taylor expansion of $\widehat{\mathrm{RMSD}}_i$ around $\mathrm{RMSD}_{0i}$, defined with respect to the true IRF $\tilde{P}_i$. Because the RMSD statistic is defined as the square root of a positively valued statistic that exhibits bias, the correction relies on the quadratic Taylor expansion of the square root function given by

$\sqrt{x} \approx \sqrt{x_0} + \frac{1}{2\sqrt{x_0}} (x - x_0) - \frac{1}{8 x_0^{3/2}} (x - x_0)^2 \quad (19)$
for positive $x$ and with $x$ close to $x_0$. Letting $x = y^2$ and $x_0 = y_0^2$ yields the approximation

$y \approx y_0 + \frac{y^2 - y_0^2}{2 y_0} - \frac{\left( y^2 - y_0^2 \right)^2}{8 y_0^3} \quad (20)$
In (20), replacing $y$ by the sample RMSD $\widehat{\mathrm{RMSD}}_i$ and $y_0$ by the population RMSD $\mathrm{RMSD}_{0i}$ gives

$\widehat{\mathrm{RMSD}}_i \approx \mathrm{RMSD}_{0i} + \frac{\widehat{\mathrm{MSD}}_i - \mathrm{MSD}_{0i}}{2\, \mathrm{RMSD}_{0i}} - \frac{\left( \widehat{\mathrm{MSD}}_i - \mathrm{MSD}_{0i} \right)^2}{8\, \mathrm{RMSD}_{0i}^3} \quad (21)$
where $\mathrm{MSD}_{0i} = \mathrm{RMSD}_{0i}^2$ is the population MSD statistic. Taking expectations in (21) results in

$E\big[\widehat{\mathrm{RMSD}}_i\big] \approx \mathrm{RMSD}_{0i} + \frac{E\big[\widehat{\mathrm{MSD}}_i\big] - \mathrm{MSD}_{0i}}{2\, \mathrm{RMSD}_{0i}} - \frac{E\Big[ \big( \widehat{\mathrm{MSD}}_i - \mathrm{MSD}_{0i} \big)^2 \Big]}{8\, \mathrm{RMSD}_{0i}^3} \quad (22)$
Assuming that $E[( \widehat{\mathrm{MSD}}_i - \mathrm{MSD}_{0i} )^2]$ is approximately equal to $\mathrm{Var}(\widehat{\mathrm{MSD}}_i)$, the bias of $\widehat{\mathrm{RMSD}}_i$ can then be obtained from (22) as

$\mathrm{Bias}\big( \widehat{\mathrm{RMSD}}_i \big) \approx \frac{B_i}{2\, \mathrm{RMSD}_{0i}} - \frac{\mathrm{Var}(\widehat{\mathrm{MSD}}_i)}{8\, \mathrm{RMSD}_{0i}^3} \quad (23)$
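The quality of such a quadratic approximation can be checked numerically with a toy example in R; the chi-square distribution is used here only because its mean and variance are known, and it is not part of the article.

```r
# toy check of the quadratic approximation: for a positive variable Y,
# E[sqrt(Y)] is approximately sqrt(E[Y]) - Var(Y) / (8 * E[Y]^(3/2))
set.seed(1)
y <- rchisq(1e6, df = 20)      # E[Y] = 20, Var(Y) = 40
mean(sqrt(y))                  # simulated value, approx. 4.415
sqrt(20) - 40 / (8 * 20^1.5)   # Taylor-based approximation, approx. 4.416
```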
The bias approximation (23) is used to derive bias-corrected estimators for $\mathrm{RMSD}_{0i}$. At first, only the linear term from the Taylor expansion is employed; that is, the bias component involving $B_i$ (i.e., $B_i / (2\, \mathrm{RMSD}_{0i})$) from (23), which can be readily estimated as in (15), is subtracted, while the term that involves $\mathrm{Var}(\widehat{\mathrm{MSD}}_i)$ is ignored. The unknown quantity $\mathrm{RMSD}_{0i}$ in this term can be estimated using $\widehat{\mathrm{RMSD}}_i$ or $\widehat{\mathrm{RMSD}}_{1i}$, yielding the following bias-corrected RMSD estimators:

$\widehat{\mathrm{RMSD}}_{2i} = \widehat{\mathrm{RMSD}}_i - \frac{\hat{B}_i}{2\, \widehat{\mathrm{RMSD}}_i} \quad (24)$

$\widehat{\mathrm{RMSD}}_{3i} = \widehat{\mathrm{RMSD}}_i - \frac{\hat{B}_i}{2\, \widehat{\mathrm{RMSD}}_{1i}} \quad (25)$
Because $\widehat{\mathrm{RMSD}}_{1i}$ is smaller than $\widehat{\mathrm{RMSD}}_i$, the bias-corrected estimator $\widehat{\mathrm{RMSD}}_{2i}$ is always at least as large as $\widehat{\mathrm{RMSD}}_{3i}$.
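The two linear Taylor-based corrections (24) and (25) are straightforward to compute once the original statistic, the estimated bias term, and the truncated estimator (16) are available; the following R sketch uses illustrative names.

```r
# linear Taylor-based corrections (24) and (25)
rmsd_bc2 <- function(rmsd_hat, B_hat) {
  rmsd_hat - B_hat / (2 * rmsd_hat)
}
rmsd_bc3 <- function(rmsd_hat, B_hat, rmsd1) {
  if (rmsd1 == 0) return(0)          # guard against division by zero
  rmsd_hat - B_hat / (2 * rmsd1)
}
```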
Further bias-corrected estimators follow by retaining the second term in (23) that involves $\mathrm{Var}(\widehat{\mathrm{MSD}}_i)$. First, note that

$\widehat{\mathrm{MSD}}_i - E\big[\widehat{\mathrm{MSD}}_i\big] = \sum_{t=1}^{T} w_{it} \big( 2 d_{it} e_{it} + e_{it}^2 - \mathrm{Var}(\hat{p}_{it}) \big) \quad (26)$

where $d_{it} = \tilde{p}_{it} - p_{it}$.
The term $e_{it}^2 - \mathrm{Var}(\hat{p}_{it})$ can be regarded as small relative to $2 d_{it} e_{it}$, which motivates the approximation of $\widehat{\mathrm{MSD}}_i - E[\widehat{\mathrm{MSD}}_i]$ by $2 \sum_{t} w_{it} d_{it} e_{it}$. An approximate expression for $\mathrm{Var}(\widehat{\mathrm{MSD}}_i)$ is

$\mathrm{Var}(\widehat{\mathrm{MSD}}_i) \approx 4 \sum_{t=1}^{T} \sum_{u=1}^{T} w_{it} w_{iu} d_{it} d_{iu} \, \mathrm{Cov}(\hat{p}_{it}, \hat{p}_{iu}) \quad (27)$
A natural estimator of this variance is given by

$\hat{V}_{0i} = 4 \sum_{t=1}^{T} \sum_{u=1}^{T} w_{it} w_{iu} \hat{d}_{it} \hat{d}_{iu} \, \widehat{\mathrm{Cov}}(\hat{p}_{it}, \hat{p}_{iu}), \quad \hat{d}_{it} = \hat{p}_{it} - p_{it}, \quad (28)$
but the substitution of $d_{it}$ with $\hat{d}_{it}$ introduces an additional variance component that arises solely from sampling variability. This additional component contributes a positive bias to the estimator $\hat{V}_{0i}$ and is approximately given by

$E\big[\hat{V}_{0i}\big] - \mathrm{Var}(\widehat{\mathrm{MSD}}_i) \approx 4 \sum_{t=1}^{T} \sum_{u=1}^{T} w_{it} w_{iu} \big[ \widehat{\mathrm{Cov}}(\hat{p}_{it}, \hat{p}_{iu}) \big]^2 \quad (29)$
The estimators $\widehat{\mathrm{RMSD}}_{4i}$ and $\widehat{\mathrm{RMSD}}_{5i}$ replace $\mathrm{RMSD}_{0i}$ in the bias approximation (23) by $\widehat{\mathrm{RMSD}}_i$ and differ in the quantity used to substitute for $\mathrm{Var}(\widehat{\mathrm{MSD}}_i)$. The statistic $\hat{V}_{1i}$, defined as $\hat{V}_{0i}$ minus the right-hand side of (29), corrects for the positive bias in the variance estimate that affects $\hat{V}_{0i}$:

$\widehat{\mathrm{RMSD}}_{4i} = \widehat{\mathrm{RMSD}}_i - \frac{\hat{B}_i}{2\, \widehat{\mathrm{RMSD}}_i} + \frac{\hat{V}_{0i}}{8\, \widehat{\mathrm{RMSD}}_i^{\,3}} \quad (30)$

$\widehat{\mathrm{RMSD}}_{5i} = \widehat{\mathrm{RMSD}}_i - \frac{\hat{B}_i}{2\, \widehat{\mathrm{RMSD}}_i} + \frac{\hat{V}_{1i}}{8\, \widehat{\mathrm{RMSD}}_i^{\,3}} \quad (31)$
Finally, the last two bias-corrected RMSD estimators are derived by substituting $\widehat{\mathrm{RMSD}}_{1i}$ for $\widehat{\mathrm{RMSD}}_i$ in the correction terms of (30) and (31), resulting in

$\widehat{\mathrm{RMSD}}_{6i} = \widehat{\mathrm{RMSD}}_i - \frac{\hat{B}_i}{2\, \widehat{\mathrm{RMSD}}_{1i}} + \frac{\hat{V}_{0i}}{8\, \widehat{\mathrm{RMSD}}_{1i}^{\,3}} \quad (32)$

$\widehat{\mathrm{RMSD}}_{7i} = \widehat{\mathrm{RMSD}}_i - \frac{\hat{B}_i}{2\, \widehat{\mathrm{RMSD}}_{1i}} + \frac{\hat{V}_{1i}}{8\, \widehat{\mathrm{RMSD}}_{1i}^{\,3}} \quad (33)$
In practice, divisions by zero values of $\widehat{\mathrm{RMSD}}_{1i}$ must be prevented when computing the bias-corrected estimators. In such cases, the bias-corrected RMSD estimates $\widehat{\mathrm{RMSD}}_{3i}$, $\widehat{\mathrm{RMSD}}_{6i}$, and $\widehat{\mathrm{RMSD}}_{7i}$ are set to zero.
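The quadratic Taylor-based estimators (30) to (33) and the variance estimates (28) and (29) can be sketched in R as follows, with V the covariance matrix of the observed IRF and d_hat the vector of estimated deviations; all names are illustrative.

```r
# variance estimate (28) and its bias-corrected version via (29)
var_msd <- function(w_i, d_hat, V) {
  W  <- outer(w_i, w_i)
  V0 <- 4 * sum(W * outer(d_hat, d_hat) * V)  # plug-in estimate (28)
  V1 <- V0 - 4 * sum(W * V^2)                 # subtract the bias term (29)
  c(V0 = V0, V1 = V1)
}

# generic quadratic correction used in (30) to (33): rmsd_ref is either
# the original statistic (9) or the truncated estimator (16)
rmsd_bc_quad <- function(rmsd_hat, rmsd_ref, B_hat, V_msd) {
  if (rmsd_ref == 0) return(0)   # zero-division guard described above
  rmsd_hat - B_hat / (2 * rmsd_ref) + V_msd / (8 * rmsd_ref^3)
}
```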
3. Simulation Study
3.1. Method
The 2PL IRT model was applied for both data generation and analysis. The latent trait $\theta$ followed a normal distribution with mean 0 and standard deviation 1. The simulation study comprised $I = 40$ items. Ten base items were defined with discrimination parameters $a_i = 1$ for all items and difficulty parameters $b_i$ fixed at ten prespecified values covering easy to difficult items. These ten items were each duplicated four times to yield a total test length of 40 items. Two of the 40 items were simulated to exhibit DIF: one pair of items received uniform DIF in the difficulty parameters with values $+\delta$ and $-\delta$, respectively, and three different choices of the DIF item pair were considered. The DIF effect size $\delta$ was varied across two values representing large DIF magnitudes [16,42].
The sample size N was varied across 125, 250, 500, 1000, 2000, and 4000, reflecting typical conditions in large-scale assessment studies [7].
For each of the 6 (sample size N) × 3 (type of DIF items) × 2 (DIF effect size $\delta$) simulation conditions, 1500 replications were performed. Item parameters were fixed at the values of the base items when fitting the 2PL model. While all item parameters remained fixed, the distribution parameters $\mu$ and $\sigma$ of the latent trait were freely estimated. The presence of DIF was ignored during the scaling step. The distribution-weighted and difficulty-weighted versions of the original RMSD statistic $\widehat{\mathrm{RMSD}}_i$ and of the seven bias-corrected statistics $\widehat{\mathrm{RMSD}}_{1i}, \ldots, \widehat{\mathrm{RMSD}}_{7i}$ were computed (see Section 2).
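For illustration, one replication of the data-generating design can be sketched in R as follows; the base difficulties, the DIF item positions, and the DIF effect size of 0.6 are illustrative stand-ins for the values described above.

```r
set.seed(123)
N <- 500; I <- 40
b_base <- seq(-1.8, 1.8, length.out = 10)  # illustrative base difficulties
b <- rep(b_base, each = 4)                 # each base item duplicated four times
a <- rep(1, I)                             # equal discriminations
b_dif <- b
b_dif[3] <- b[3] + 0.6                     # uniform DIF, illustrative positions
b_dif[8] <- b[8] - 0.6
theta <- rnorm(N, mean = 0, sd = 1)        # latent trait
prob <- plogis(matrix(a, N, I, byrow = TRUE) * outer(theta, b_dif, "-"))
dat  <- 1 * (matrix(runif(N * I), N, I) < prob)  # dichotomous responses
```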
A true RMSD value was defined using the IRF from the data-generating model that included the DIF effects. Table 1 presents the population RMSD values for items exhibiting DIF effects. The results indicate that the distribution-weighted RMSD varies with item difficulty $b_i$ for a fixed DIF effect $\delta$, whereas the difficulty-weighted RMSD statistic remains unaffected. For the items without DIF effects (38 of the 40 items), the population RMSD values were zero.
Table 1.
Simulation Study: Population values of the root mean square deviation (RMSD) statistic as a function of item difficulty $b_i$ and DIF effect size $\delta$.
The bias and root mean square error (RMSE) of the RMSD statistics were calculated with respect to the true RMSD values. A relative RMSE was defined by dividing the RMSE of each RMSD statistic by the RMSE of the corresponding original distribution-weighted or difficulty-weighted RMSD statistic and multiplying by 100. In addition, the 5th, 10th, 25th, 50th, 75th, 90th, and 95th percentiles were computed for the RMSD statistics.
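A minimal R sketch of these evaluation criteria, assuming est is a vector of RMSD estimates across replications, est_orig the corresponding estimates of the original statistic, and true_rmsd the population value:

```r
bias <- function(est, true_rmsd) mean(est - true_rmsd)
rmse <- function(est, true_rmsd) sqrt(mean((est - true_rmsd)^2))

# relative RMSE: RMSE of an estimator divided by the RMSE of the
# original RMSD statistic, multiplied by 100
rel_rmse <- function(est, est_orig, true_rmsd) {
  100 * rmse(est, true_rmsd) / rmse(est_orig, true_rmsd)
}
```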
All analyses were conducted using the open-source statistical software R (Version 4.4.1; [46]). The 2PL model was estimated with the sirt::xxirt() function from the R package sirt (Version 4.2-133; [47]). Custom R functions were developed to compute the various RMSD statistics. These functions, along with the replication materials, are available at https://osf.io/amd8k (accessed on 17 October 2025). Percentile plots were generated using the R package ggplot2 (Version 4.0.0; [48]).
3.2. Results
Table 2 presents the bias and relative RMSE for the distribution-weighted RMSD statistics of items with DIF. Substantial positive biases were observed for the original RMSD statistic at the smaller sample sizes (N = 125 and N = 250), which were markedly reduced by the bias-corrected RMSD statistics. The average absolute bias was clearly larger for the original statistic than for the bias-corrected statistics. The positive bias of the original statistic appeared relatively unaffected by item difficulty $b_i$. Across all simulation conditions, the RMSE was higher for the bias-corrected RMSD statistics than for the original statistic. Among the bias-corrected statistics, the lowest average RMSE values were obtained for estimators based on the quadratic Taylor expansion.
Table 2.
Simulation Study: Bias and relative root mean square error (RMSE) of distribution-weighted RMSD statistics (original and bias-corrected) for items with differential item functioning (DIF) as a function of item difficulty $b_i$, DIF effect size $\delta$, and sample size N.
Figure 1 shows percentile plots for the original and a bias-corrected distribution-weighted RMSD statistic for items with DIF. The results indicate that sampling variability decreased substantially with increasing sample size. However, the variability of both statistics remained considerable at the smallest sample size (N = 125). For the easy item, the bias-corrected RMSD statistic was frequently estimated as 0 at small sample sizes. The figure also illustrates that the median of the original statistic slightly decreased with increasing sample size and eventually converged to the true RMSD value, whereas the median of the bias-corrected RMSD increased with increasing sample size.
Figure 1.
Simulation Study: Percentile plots for the original and a bias-corrected distribution-weighted RMSD statistic for items with differential item functioning (DIF) as a function of item difficulty $b_i$ and sample size N. Note. Outlined in black: 5th–95th percentile; light shading: 10th–90th percentile; dark gray: 25th–75th percentile. Medians are shown as horizontal lines.
Table 3 presents the bias and relative RMSE for the difficulty-weighted RMSD statistics of items with DIF. The positive bias of the original difficulty-weighted RMSD statistic was more pronounced than that of the distribution-weighted RMSD statistic, particularly for easy or difficult items whose item difficulties $b_i$ substantially differed from 0. As for the distribution-weighted statistics, the bias-corrected statistics had the lowest average absolute biases. Notably, a positive bias of 0.024 remained in one condition even at a relatively large sample size. Overall, bias decreased with increasing sample size, but large samples are required to obtain precise estimates of the difficulty-weighted RMSD statistic.
Table 3.
Simulation Study: Bias and relative root mean square error (RMSE) of difficulty-weighted RMSD statistics (original and bias-corrected) for items with differential item functioning (DIF) as a function of item difficulty $b_i$, DIF effect size $\delta$, and sample size N.
Figure 2 shows percentile plots for the original and a bias-corrected difficulty-weighted RMSD statistic for items with DIF. Substantial variability was observed in the RMSD statistics, particularly at the smaller sample sizes (N = 125 and N = 250). For items with moderate difficulty (i.e., $b_i$ close to 0), the median RMSD values remained relatively stable across sample sizes.
Figure 2.
Simulation Study: Percentile plots for the original and a bias-corrected difficulty-weighted RMSD statistic for items with differential item functioning (DIF) as a function of item difficulty $b_i$ and sample size N. Note. Outlined in black: 5th–95th percentile; light shading: 10th–90th percentile; dark gray: 25th–75th percentile. Medians are shown as horizontal lines.
Table 4 presents the bias and relative RMSE for the distribution-weighted RMSD statistics of items without DIF. The condition was selected in which the third and eighth items had DIF effects of 0.6 and −0.6, respectively. The true RMSD value was 0 for items without DIF. A positive bias was evident for all RMSD statistics, but it was much more pronounced for the original RMSD statistic. The smallest average absolute bias was observed for one of the bias-corrected statistics, which also had the smallest average RMSE.
Table 4.
Simulation Study: Bias and relative root mean square error (RMSE) of distribution-weighted RMSD statistics (original and bias-corrected) for items without differential item functioning (DIF) as a function of item difficulty $b_i$ and sample size N.
Figure 3 shows percentile plots for the original and a bias-corrected distribution-weighted RMSD statistic for items without DIF. The positive bias was evident for the original statistic, while the median of the bias-corrected statistic was 0, which coincided with the true RMSD value. As expected, the sampling variability of both statistics for non-DIF items was also reduced with increasing sample size.

Figure 3.
Simulation Study: Percentile plots for the original and a bias-corrected distribution-weighted RMSD statistic for items without differential item functioning (DIF) as a function of item difficulty $b_i$ and sample size N. Note. Outlined in black: 5th–95th percentile; light shading: 10th–90th percentile; dark gray: 25th–75th percentile. Medians are shown as horizontal lines.
Table 5 presents the bias and relative RMSE for the difficulty-weighted RMSD statistics of items without DIF. The positive bias of the difficulty-weighted RMSD statistics was most pronounced for smaller sample sizes and for items with more extreme difficulty levels. Similar to the distribution-weighted case, the bias-corrected statistics showed the lowest average absolute biases, and the lowest average RMSE value was also observed for one of the bias-corrected statistics.
Table 5.
Simulation Study: Bias and relative root mean square error (RMSE) of difficulty-weighted RMSD statistics (original and bias-corrected) for items without differential item functioning (DIF) as a function of item difficulty $b_i$ and sample size N.
Figure 4 shows percentile plots for the original and a bias-corrected difficulty-weighted RMSD statistic for items without DIF. Across nearly all conditions, the median of the bias-corrected RMSD statistic was 0 and matched the true RMSD value. Furthermore, the upper percentiles of the bias-corrected statistic were not greater than those of the original statistic, indicating that using the bias-corrected statistic instead of the original one for DIF assessment in non-DIF items does not introduce additional risk.
Figure 4.
Simulation Study: Percentile plots for the original and a bias-corrected difficulty-weighted RMSD statistic for items without differential item functioning (DIF) as a function of item difficulty $b_i$ and sample size N. Note. Outlined in black: 5th–95th percentile; light shading: 10th–90th percentile; dark gray: 25th–75th percentile. Medians are shown as horizontal lines.
In summary, the original difficulty-weighted RMSD statistic exhibited larger positive biases and substantially greater variability than the original distribution-weighted RMSD statistic. The proposed bias-corrected RMSD statistics effectively reduced bias, although they introduced a small to moderate increase in variability. Notably, for items without DIF, the bias-corrected RMSD statistics had median values of 0 in most conditions, indicating favorable performance for DIF detection.
The bias-corrected RMSD statistics that applied the bias correction directly to $\widehat{\mathrm{RMSD}}_i$ using a Taylor approximation performed better than the statistic based on a bias correction of the squared RMSD $\widehat{\mathrm{MSD}}_i$ followed by taking its (positive) square root.
4. Empirical Example
4.1. Method
In this section, alternative RMSD statistics are illustrated using a subdataset of the PISA 2006 dataset [49] for the reading domain. The complete PISA 2006 dataset is publicly available at https://www.oecd.org/en/data/datasets/pisa-2006-database.html (accessed on 17 October 2025). Processed data and the derived subdataset are available in the data.pisa2006Read dataset from the R package sirt (Version 4.2-133; [47]). The analysis focuses on item responses from Booklet 1 to examine RMSD statistics for sample sizes of approximately 200 to 300 students per item, reflecting conditions typical of PISA field trials.
This analysis included 22 countries with valid responses on all 28 reading items from Booklet 1. Sample sizes per country ranged from 287 to 1738 (with a mean of about 631 students per country), yielding a total sample size of 13,876 students. In the original data.pisa2006Read dataset, student weights were normalized within each country to represent a hypothetical sample size of 5000, ensuring equal contributions across countries. Because students were randomly assigned to booklets, the total student weights were approximately balanced across countries.
Among the 28 reading items, several were originally scored polytomously but were recoded dichotomously in this analysis, with only the highest category coded as correct. The remaining items were treated as dichotomous, in line with the officially reported PISA analysis.
In the first step of the analysis, joint item parameters of the 2PL model, assumed to be invariant across countries, were estimated. The 2PL model was fitted as a single-group IRT model using student weights to ensure that all 22 countries contributed approximately equally to the analysis. The estimated item discriminations $\hat{a}_i$ and item difficulties $\hat{b}_i$ are reported in Table 6. Item discriminations ranged from 0.59 to 1.95, and item difficulties reached values up to 3.23.
Table 6.
Empirical Example, PISA 2006 Reading, Booklet 1, Estonia: International item parameters and estimated RMSD statistics (standard errors in parentheses).
In the second step, the 2PL model was fitted separately for each of the 22 countries, fixing the item parameters to the joint international estimates while estimating only the country mean $\mu$ and country SD $\sigma$.
Finally, the original and bias-corrected distribution-weighted and difficulty-weighted RMSD statistics were computed under the assumption of independent student sampling. Although this assumption is violated in practice, the clustered sampling design primarily affects means and SDs rather than item parameters. Furthermore, the same independence assumption is employed in PISA reporting for both the field trial and main study analyses.
Statistical inference for the RMSD statistics was obtained using a nonparametric bootstrap [50] of students within countries. The bootstrap procedure involved independent resampling of students with replacement, maintaining the original sample size for each country. During the computation of the RMSD statistics, bootstrap sampling was applied to the posterior distribution, resulting in varying observed IRF estimates $\hat{p}_{it}$ across bootstrap samples and, consequently, a distribution of RMSD values. Standard errors were estimated using a normal approximation to the empirical distributions, and 95% symmetric confidence intervals were computed.
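A sketch of this bootstrap scheme in R, reusing the sample_rmsd() sketch from Section 1, might look as follows; the number of bootstrap samples B = 500 and all object names are assumptions.

```r
boot_rmsd <- function(h, x_i, p_i, w_i, B = 500, cutoff = 0.05) {
  N <- nrow(h)
  stats <- replicate(B, {
    idx <- sample.int(N, N, replace = TRUE)  # resample students with replacement
    sample_rmsd(h[idx, , drop = FALSE], x_i[idx], p_i, w_i)
  })
  est <- sample_rmsd(h, x_i, p_i, w_i)
  se  <- sd(stats)                            # normal-approximation standard error
  ci  <- est + c(-1.96, 1.96) * se            # symmetric 95% confidence interval
  list(est = est, se = se, ci = ci,
       misfit = ci[1] > cutoff)               # minimum effect size test (see below)
}
```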
The computed confidence intervals for the RMSD statistics enabled testing a minimum effect size hypothesis [51], assessing whether a given RMSD statistic was significantly greater than a prespecified cutoff value, such as 0.05. Item misfit was identified when the lower bound of the confidence interval exceeded this cutoff.
Instead of relying on a fixed cutoff value for the RMSD statistic, a data-driven approach proposed by von Davier and Bezirhan [36] identifies items with large RMSD values as outliers, which are subsequently flagged if they exceed a specified outlier criterion.
4.2. Results
In the first analysis, results for the RMSD statistics are presented for Estonia, which included 287 students with item responses from Booklet 1. Table 6 reports the estimated distribution-weighted and difficulty-weighted RMSD statistics. In most cases, the distribution-weighted RMSD values were smaller than their difficulty-weighted counterparts. As expected, the bias-corrected RMSD statistics were smaller than the original statistics and were often estimated as zero.
For six items (R055Q03, R055Q05, R104Q05, R219Q01E, R219Q02, and R220Q05), the original difficulty-weighted RMSD statistic exceeded the commonly used cutoff value of 0.08, whereas this was not the case for the widely applied original distribution-weighted RMSD statistic. The standard errors of the RMSD statistics were relatively large, indicating that statistical uncertainty should be considered when making item selection decisions.
The subsequent analyses examined RMSD statistics for items across all 22 countries. Table 7 presents the percentages of items flagged when the RMSD statistics exceeded a specified cutoff value, were significantly greater than a given cutoff, or were identified as outliers according to the method proposed by von Davier and Bezirhan.
Table 7.
Empirical Example, PISA 2006 Reading, Booklet 1: Percentage of flagged items across items and countries according to several criteria regarding RMSD statistics.
Using a strict cutoff value of 0.05, 59.1% and 72.4% of items were classified as misfitting based on the original distribution-weighted and difficulty-weighted RMSD statistics, respectively. These proportions decreased slightly when the corresponding bias-corrected RMSD statistics were applied. When a cutoff value of 0.12 was used, the flagging rates dropped to 10.1% for the distribution-weighted and 19.8% for the difficulty-weighted statistic. Incorporating statistical inference into the decision process further reduced the flagging rates substantially.
The data-driven RMSD cutoffs proposed by von Davier and Bezirhan resulted in flagging rates between 2.9% and 6.8%, indicating that most items fit the 2PL model adequately. The estimated data-driven cutoffs for the different distribution-weighted and difficulty-weighted RMSD statistics were 0.152, 0.165, 0.177, 0.152, 0.172, 0.191, 0.209, 0.223, 0.199, and 0.222, respectively.
Figure 5 presents a percentile plot of the empirical distribution of the RMSD statistics across all items and countries. The bias-corrected RMSD statistics yielded slightly smaller values than the original RMSD statistics. However, the reduction in magnitude due to bias correction did not substantially alter the overall distribution of the original RMSD statistics.
Figure 5.
Empirical Example, PISA 2006 Reading, Booklet 1: Percentile plot of the empirical distribution of RMSD statistics across all items and countries. Note. Outlined in black: 5th–95th percentile; light shading: 10th–90th percentile; dark gray: 25th–75th percentile. Medians are shown as horizontal lines. Red triangle: data-driven RMSD cutoff according to von Davier and Bezirhan [36]. Both the distribution-weighted and difficulty-weighted RMSD statistics (original and bias-corrected) are displayed.
5. Discussion
This paper proposed and examined seven alternative bias-corrected distribution-weighted and difficulty-weighted RMSD estimators. The key conclusion is that bias correction should be applied directly to the RMSD rather than to its squared form. The correction methods were based on either a first-order (linear) or a second-order (quadratic) Taylor expansion. All approaches effectively reduced the positive bias observed in the original RMSD statistic, though they slightly increased variance. Bias-corrected RMSD statistics that incorporated the quadratic terms of the Taylor expansion generally demonstrated higher precision. Notably, these bias-corrected RMSD statistics often had a median value of zero for items without DIF, corresponding to the population RMSD value of zero in such cases. In empirical research, the bias-corrected distribution-weighted and difficulty-weighted RMSD statistics based on the quadratic Taylor expansion are recommended, as they substantially reduce bias without markedly increasing variance.
The empirical example used PISA data from a single booklet. The results showed that the percentage of items flagged as misfitting by the RMSD statistic varied substantially depending on the chosen RMSD cutoff value. Moreover, the proportion of misfitting items was very low when using the data-driven cutoff proposed by [36].
The findings also indicated that sampling variance in the estimated item response functions, which are required for computing the RMSD statistic, becomes particularly critical in small samples, leading to positive bias in the RMSD. The proposed bias-correction methods are therefore especially relevant for small-scale applications aimed at detecting item misfit, such as field trials in PISA.
As an anonymous reviewer critically noted, the simulation study assumed equal item discriminations of 1, implying that the one-parameter logistic (1PL) model served as the data-generating model, whereas the 2PL model was fitted to the responses. This discrepancy can be viewed as a substantial limitation. The behavior of the different bias-correction approaches for the RMSD statistic may depend on variation in item discriminations. A more detailed investigation of this issue may be addressed in future research.
The study was limited to dichotomous items. Future research could extend the bias-corrected RMSD estimators to polytomous items. However, several alternative generalizations of the RMSD statistic exist for polytomous data, and it remains unclear which formulation should be preferred in applied research to enable a coherent generalization of the cutoff values used for dichotomous items.
The approach presented in this study is applicable to any weighted RMSD statistic [37]. It remains unclear whether the distribution-weighted RMSD statistic has notable disadvantages compared to the difficulty-weighted RMSD statistic. In the originally proposed distribution-weighted RMSD, misfitting items with extreme difficulties tend to receive lower RMSD values than misfitting items with moderate difficulty. When likelihood-based inference includes precision weighting in the estimation of group means, it is reasonable for item misfit to be partially influenced by the likelihood contribution. Items with low precision, which contribute weakly to the likelihood, should arguably not be strongly flagged in the item misfit analysis.
Funding
This research received no external funding.
Data Availability Statement
Replication material for the Simulation Study in Section 3 can be found at https://osf.io/amd8k (accessed on 17 October 2025). The dataset data.pisa2006Read used in the empirical example in Section 4 is available from the R package sirt (Version 4.2-133; https://doi.org/10.32614/CRAN.package.sirt; accessed on 27 September 2025).
Conflicts of Interest
The author declares no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| 1PL | one-parameter logistic |
| 2PL | two-parameter logistic |
| DIF | differential item functioning |
| IRF | item response function |
| IRT | item response theory |
| LSA | large-scale assessment |
| MML | marginal maximum likelihood |
| PISA | programme for international student assessment |
| RMSD | root mean square deviation |
| RMSE | root mean square error |
| SD | standard deviation |
References
- Baker, F.B.; Kim, S.H. Item Response Theory: Parameter Estimation Techniques; CRC Press: Boca Raton, FL, USA, 2004. [Google Scholar] [CrossRef]
- Bock, R.D.; Moustaki, I. Item response theory in a general framework. Handb. Stat. 2007, 26, 469–513. [Google Scholar] [CrossRef]
- Bock, R.D.; Gibbons, R.D. Item Response Theory; Wiley: Hoboken, NJ, USA, 2021. [Google Scholar] [CrossRef]
- Chen, Y.; Li, X.; Liu, J.; Ying, Z. Item response theory – A statistical framework for educational and psychological measurement. Stat. Sci. 2025, 40, 167–194. [Google Scholar] [CrossRef]
- Tutz, G. A Short Guide to Item Response Theory Models; Springer: Berlin/Heidelberg, Germany, 2025. [Google Scholar] [CrossRef]
- Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154. [Google Scholar]
- Rutkowski, L.; von Davier, M.; Rutkowski, D. (Eds.) A Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis; Chapman Hall/CRC Press: London, UK, 2013. [Google Scholar] [CrossRef]
- van der Linden, W.J. Unidimensional logistic response models. In Handbook of Item Response Theory, Volume 1: Models; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 11–30. [Google Scholar] [CrossRef]
- von Davier, M. Why sum scores may not tell us all about test takers. Newborn Infant Nurs. Rev. 2010, 10, 27–36. [Google Scholar] [CrossRef]
- Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479. [Google Scholar]
- Glas, C.A.W. Maximum-likelihood estimation. In Handbook of Item Response Theory, Volume 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 197–216. [Google Scholar] [CrossRef]
- Bock, R.D.; Aitkin, M. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika 1981, 46, 443–459. [Google Scholar] [CrossRef]
- OECD. PISA 2018. Technical Report; OECD: Paris, France, 2020; Available online: https://bit.ly/3zWbidA (accessed on 23 November 2025).
- OECD. PISA 2022. Technical Report; OECD: Paris, France, 2024; Available online: https://rb.gy/crutc6 (accessed on 23 November 2025).
- Holland, P.W.; Wainer, H. (Eds.) Differential Item Functioning: Theory and Practice; Lawrence Erlbaum: Hillsdale, NJ, USA, 1993. [Google Scholar] [CrossRef]
- Penfield, R.D.; Camilli, G. Differential item functioning and item bias. Handb. Stat. 2007, 26, 125–167. [Google Scholar] [CrossRef]
- Mellenbergh, G.J. Item bias and item response theory. Int. J. Educ. Res. 1989, 13, 127–143. [Google Scholar] [CrossRef]
- Millsap, R.E. Statistical Approaches to Measurement Invariance; Routledge: New York, NY, USA, 2011. [Google Scholar] [CrossRef]
- Robitzsch, A.; Lüdtke, O. Mean comparisons of many groups in the presence of DIF: An evaluation of linking and concurrent scaling approaches. J. Educ. Behav. Stat. 2022, 47, 36–68. [Google Scholar] [CrossRef]
- Pohl, S.; Schulze, D.; Stets, E. Partial measurement invariance: Extending and evaluating the cluster approach for identifying anchor items. Appl. Psychol. Meas. 2021, 45, 477–493. [Google Scholar] [CrossRef]
- Schulze, D.; Reuter, B.; Pohl, S. Measurement invariance: Dealing with the uncertainty in anchor item choice by model averaging. Struct. Equ. Model. 2022, 22, 521–530. [Google Scholar] [CrossRef]
- Douglas, J.; Cohen, A. Nonparametric item response function estimation for assessing parametric model fit. Appl. Psychol. Meas. 2001, 25, 234–243. [Google Scholar] [CrossRef]
- van Rijn, P.W.; Sinharay, S.; Haberman, S.J.; Johnson, M.S. Assessment of fit of item response theory models used in large-scale educational survey assessments. Large-Scale Assess. Educ. 2016, 4, 10. [Google Scholar] [CrossRef]
- Sinharay, S.; Haberman, S.J. How often is the misfit of item response theory models practically significant? Educ. Meas. 2014, 33, 23–35. [Google Scholar] [CrossRef]
- Sinharay, S.; Monroe, S. Assessment of fit of item response theory models: A critical review of the status quo and some future directions. Brit. J. Math. Stat. Psychol. 2025, 78, 711–733. [Google Scholar] [CrossRef] [PubMed]
- Swaminathan, H.; Hambleton, R.K.; Rogers, H.J. Assessing the fit of item response theory models. Handb. Stat. 2007, 26, 683–718. [Google Scholar] [CrossRef]
- Buchholz, J.; Hartig, J. Comparing attitudes across groups: An IRT-based item-fit statistic for the analysis of measurement invariance. Appl. Psychol. Meas. 2019, 43, 241–250. [Google Scholar] [CrossRef]
- Buchholz, J.; Hartig, J. Measurement invariance testing in questionnaires: A comparison of three multigroup-CFA and IRT-based approaches. Psychol. Test Assess. Model. 2020, 62, 29–53. Available online: https://bit.ly/38kswHh (accessed on 23 November 2025).
- Khorramdel, L.; Shin, H.J.; von Davier, M. GDM software mdltm including parallel EM algorithm. In Handbook of Diagnostic Classification Models; von Davier, M., Lee, Y.S., Eds.; Springer: Berlin/Heidelberg, Germany, 2019; pp. 603–628. [Google Scholar] [CrossRef]
- Kunina-Habenicht, O.; Rupp, A.A.; Wilhelm, O. A practical illustration of multidimensional diagnostic skills profiling: Comparing results from confirmatory factor analysis and diagnostic classification models. Stud. Educ. Eval. 2009, 35, 64–70. [Google Scholar] [CrossRef]
- Joo, S.H.; Khorramdel, L.; Yamamoto, K.; Shin, H.J.; Robin, F. Evaluating item fit statistic thresholds in PISA: Analysis of cross-country comparability of cognitive items. Educ. Meas. 2021, 40, 37–48. [Google Scholar] [CrossRef]
- Joo, S.; Ali, U.; Robin, F.; Shin, H.J. Impact of differential item functioning on group score reporting in the context of large-scale assessments. Large-Scale Assess. Educ. 2022, 10, 18. [Google Scholar] [CrossRef]
- Robitzsch, A. Statistical properties of estimators of the RMSD item fit statistic. Foundations 2022, 2, 488–503. [Google Scholar] [CrossRef]
- Sueiro, M.J.; Abad, F.J. Assessing goodness of fit in item response theory with nonparametric models: A comparison of posterior probabilities and kernel-smoothing approaches. Educ. Psychol. Meas. 2011, 71, 834–848. [Google Scholar] [CrossRef]
- Tijmstra, J.; Bolsinova, M.; Liaw, Y.L.; Rutkowski, L.; Rutkowski, D. Sensitivity of the RMSD for detecting item-level misfit in low-performing countries. J. Educ. Meas. 2020, 57, 566–583. [Google Scholar] [CrossRef]
- von Davier, M.; Bezirhan, U. A robust method for detecting item misfit in large scale assessments. Educ. Psychol. Meas. 2023, 83, 740–765. [Google Scholar] [CrossRef] [PubMed]
- Joo, S.; Valdivia, M.; Svetina Valdivia, D.; Rutkowski, L. Alternatives to weighted item fit statistics for establishing measurement invariance in many groups. J. Educ. Behav. Stat. 2024, 49, 465–493. [Google Scholar] [CrossRef]
- Glas, C.A.W.; Jehangir, M. Modeling country-specific differential functioning. In A Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis; Rutkowski, L., von Davier, M., Rutkowski, D., Eds.; Chapman Hall/CRC Press: London, UK, 2013; pp. 97–115. [Google Scholar] [CrossRef]
- Lord, F.M.; Novick, M.R. Statistical Theories of Mental Test Scores; Addison-Wesley: Reading, MA, USA, 1968. [Google Scholar]
- Falk, C.F.; Cai, L. Semiparametric item response functions in the context of guessing. J. Educ. Meas. 2016, 53, 229–247. [Google Scholar] [CrossRef]
- Feuerstahler, L.M. Metric transformations and the filtered monotonic polynomial item response model. Psychometrika 2019, 84, 105–123. [Google Scholar] [CrossRef]
- Robitzsch, A.; Lüdtke, O. A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psychol. Test Assess. Model. 2020, 62, 233–279. Available online: https://bit.ly/3ezBB05 (accessed on 23 November 2025).
- Boos, D.D.; Stefanski, L.A. Essential Statistical Inference; Springer: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
- Kondratek, B. Item-fit statistic based on posterior probabilities of membership in ability groups. Appl. Psychol. Meas. 2022, 46, 462–478. [Google Scholar] [CrossRef]
- Held, L.; Sabanés Bové, D. Applied Statistical Inference; Springer: Berlin, Germany, 2014. [Google Scholar] [CrossRef]
- R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2024; Available online: https://www.R-project.org (accessed on 15 June 2024).
- Robitzsch, A. sirt: Supplementary Item Response Theory Models; R Package Version 4.2-133; R Core Team: Vienna, Austria, 2025. [Google Scholar] [CrossRef]
- Wickham, H. ggplot2: Elegant Graphics for Data Analysis; Springer: New York, NY, USA, 2016. [Google Scholar] [CrossRef]
- OECD. PISA 2006. Technical Report; OECD: Paris, France, 2009; Available online: https://bit.ly/38jhdzp (accessed on 23 November 2025).
- Davison, A.C.; Hinkley, D.V. Bootstrap Methods and Their Application; Cambridge University Press: Cambridge, UK, 1997. [Google Scholar] [CrossRef]
- Grissom, R.J.; Kim, J.J. Effect Sizes for Research: A Broad Practical Approach; Lawrence Erlbaum: Mahwah, NJ, USA, 2005. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).