1. Introduction
Item response theory (IRT) models [1,2,3,4,5,6] are multivariate statistical models for vectors of discrete random variables. They are widely used in the social sciences, particularly in educational large-scale assessment (LSA; [7]) studies, where cognitive tasks are administered.
This article considers dichotomous (binary) random variables. Let $\mathbf{X} = (X_1, \ldots, X_I)$ denote a vector of $I$ items, with $X_i \in \{0, 1\}$. A unidimensional IRT model [8] specifies the probability distribution of $\mathbf{X}$ as
$$P(\mathbf{X} = \mathbf{x}) = \int \prod_{i=1}^{I} P_i(\theta)^{x_i} \left[ 1 - P_i(\theta) \right]^{1 - x_i} \phi(\theta; \mu, \sigma) \, \mathrm{d}\theta , \quad (1)$$
where $\phi(\theta; \mu, \sigma)$ is the normal density with mean $\mu$ and SD $\sigma$. The latent variable $\theta$, often referred to as a trait or ability, has parameters collected in $\boldsymbol{\delta} = (\mu, \sigma)$. The vector $\boldsymbol{\gamma} = (\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_I)$ contains the item parameters for the parametric item response functions (IRFs) $P_i(\theta) = P(X_i = 1 \mid \theta)$. The latent ability variable $\theta$ can be viewed as a unidimensional summary of the high-dimensional contingency table of item responses $\mathbf{X}$ (see [9]). Larger values of $\theta$ (typically positive) are associated with more able persons who solve more items, whereas smaller values of $\theta$ (typically negative) indicate less able persons.
The two-parameter logistic (2PL; [10]) model is among the most widely used IRT models. Its IRF is given by
$$P_i(\theta) = \Psi\left( a_i ( \theta - b_i ) \right) , \quad (2)$$
where $a_i$ and $b_i$ denote item discrimination and difficulty, respectively, and $\Psi(x) = [1 + \exp(-x)]^{-1}$ is the logistic function. The item parameter vector is denoted by $\boldsymbol{\gamma}_i = (a_i, b_i)$. The 2PL model accommodates differences in how strongly items relate to the ability variable $\theta$ through variation in the discrimination parameter $a_i$, with larger positive values indicating stronger associations and values near zero indicating weaker associations. Variation in the marginal probability of answering an item correctly is mainly captured by the difficulty parameter $b_i$, where positive values correspond to more difficult items and negative values correspond to easier items.
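To make the notation concrete, the 2PL IRF of (2) can be written as a one-line R function; the name irf_2pl is illustrative and not part of any package.

```r
# 2PL item response function, Equation (2): Psi(a * (theta - b))
irf_2pl <- function(theta, a, b) {
  plogis(a * (theta - b))  # plogis() is the logistic function Psi
}

# Example: an item with discrimination a = 1.2 and difficulty b = 0.5,
# evaluated at theta = 0
irf_2pl(theta = 0, a = 1.2, b = 0.5)
```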
For a sample of $N$ individuals with independent and identically distributed observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$ of $\mathbf{X}$, the parameters of the IRT model in (1) are consistently estimated via marginal maximum likelihood (MML; [11]), commonly implemented using the EM algorithm [12].
In LSA applications such as the Programme for International Student Assessment (PISA; [13,14]), IRFs are typically modeled parametrically as in (1). These studies involve multiple countries, and it is generally assumed that item parameters are invariant across countries. In practice, this assumption may be violated, as certain items can systematically advantage or disadvantage specific countries. This phenomenon is known as differential item functioning (DIF; [15,16]), although alternative terms such as measurement bias or item bias are also used [17]. The presence of DIF may bias group differences, which motivates the search for items with DIF to ensure that comparisons are less distorted [18]. However, DIF may also cancel out across the test (i.e., balanced DIF; [19,20,21]), a situation in which removing DIF items from the test might be required.
As a result, the assumed IRF represents a (slight) misspecification of the true IRT model (1) for a given country (or group, henceforth). The multivariate random vector $\mathbf{X}$ can then be expressed as
$$P(\mathbf{X} = \mathbf{x}) = \int \prod_{i=1}^{I} P_i^{\ast}(\theta)^{x_i} \left[ 1 - P_i^{\ast}(\theta) \right]^{1 - x_i} \phi(\theta; \mu, \sigma) \, \mathrm{d}\theta , \quad (3)$$
where $P_i$ denotes the assumed IRF and $P_i^{\ast}$ represents the IRF in the data-generating model for the group. In practice, the approximation of $P_i^{\ast}$ by $P_i$ is generally expected to introduce only minimal distortion in estimating the distribution parameters $\mu$ and $\sigma$.
Assessing the adequacy of parametric IRFs (i.e., item fit; [22,23,24,25,26]) is a central issue in psychometrics. The discrepancy between the true IRF $P_i^{\ast}$ and the assumed parametric IRF $P_i$ should be quantified using an appropriate effect size measure, ideally accompanied by statistical inference. Of particular interest is the identification of misfitting items $i$ for which the assumed IRFs $P_i$ deviate substantially from $P_i^{\ast}$.
The present study focuses on the root mean square deviation (RMSD; [27,28,29,30,31,32,33,34,35,36]). The motivation for examining RMSD lies in its widespread use in current PISA studies.
A weighted RMSD statistic is defined following Joo et al. [37]. Let $w_i$ denote a weighting function for item $i$ such that it integrates to 1, that is, $\int w_i(\theta) \, \mathrm{d}\theta = 1$ (see [37]). The weighting function $w_i$ may be item-specific or common across items. The primary objective is to quantify the discrepancy between a data-generating IRF $P_i^{\ast}$ and the model-assumed IRF $P_i$. The RMSD statistic applies weights to the deviations $P_i^{\ast}(\theta) - P_i(\theta)$ to provide a summarized measure of model-data discrepancy. The weighted RMSD statistic summarizes squared IRF differences by calculating
$$\mathrm{RMSD}_i = \sqrt{ \int \left[ P_i^{\ast}(\theta) - P_i(\theta) \right]^2 w_i(\theta) \, \mathrm{d}\theta } . \quad (4)$$
If the weighting function $w_i$ is chosen as the normal density with group mean $\mu$ and group SD $\sigma$, and is the same across all items, the RMSD statistic is referred to as the distribution-weighted RMSD. If the weighting function $w_i$ is item-specific, chosen as the normal density with mean $b_i$ (the assumed item difficulty in the model) and SD of 1, it is referred to as the difficulty-weighted RMSD.
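As a numerical illustration of (4), the following R sketch approximates the population RMSD on a discrete theta grid for a hypothetical 2PL item with uniform DIF; the grid, parameter values, and DIF effect of 0.3 are illustrative assumptions, and irf_2pl refers to the sketch above.

```r
theta <- seq(-6, 6, length.out = 121)              # discrete theta grid

P_true    <- irf_2pl(theta, a = 1, b = 0.5 + 0.3)  # data-generating IRF with DIF
P_assumed <- irf_2pl(theta, a = 1, b = 0.5)        # model-assumed IRF

# Distribution-weighted: normal density with group mean 0 and SD 1
w_dist <- dnorm(theta, mean = 0, sd = 1); w_dist <- w_dist / sum(w_dist)
# Difficulty-weighted: normal density centered at the item difficulty, SD 1
w_diff <- dnorm(theta, mean = 0.5, sd = 1); w_diff <- w_diff / sum(w_diff)

rmsd_dist <- sqrt(sum(w_dist * (P_true - P_assumed)^2))  # Equation (4), discretized
rmsd_diff <- sqrt(sum(w_diff * (P_true - P_assumed)^2))
```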
The assumed IRF $P_i$ typically depends on item parameters that are either known (i.e., fixed) or estimated. In LSA studies, the item parameters used in $P_i$ often originate from an international scaling step in which item parameters are obtained by pooling all country-specific datasets into a single total dataset. The resulting international item parameters are then treated as fixed in the subsequent country-specific scaling step. Differences between country-specific IRFs $P_i^{\ast}$ and the model-assumed IRF $P_i$ typically arise from country DIF [38]; that is, the functioning of items differs slightly across countries. For instance, some items become easier in a particular country relative to the international average, whereas others become more difficult. The RMSD statistic (4) is therefore used to detect items exhibiting country DIF.
As another application, $P_i$ in (4) can represent the assumed IRF in the 2PL model. If the IRT model is treated as a working model, the true IRF $P_i^{\ast}$ differs from the model-assumed IRF $P_i$. For example, the true IRF might include a guessing parameter [39] or could take the form of any monotone function of the ability variable $\theta$ (see [40,41]). In this context, the RMSD statistic can be employed to detect misfitting items.
The RMSD definition (4) is formulated at the population level and does not involve sample data. In empirical research, model fit is assessed in datasets with limited sample sizes. Therefore, it is essential to investigate estimators of the RMSD statistic that perform well in small-sample settings.
A sample-based version of the RMSD statistic is now defined. Let $\hat{p}_{it}$ denote the observed IRF, a sample-based estimate of $P_i^{\ast}(\theta_t)$, evaluated at a theta point $\theta_t$ ($t = 1, \ldots, T$) as
$$\hat{p}_{it} = \frac{\sum_{n=1}^{N} x_{ni} f_{nt}}{\sum_{n=1}^{N} f_{nt}} , \quad (5)$$
where $f_{nt}$ represents the posterior distribution of person $n$ at grid point $\theta_t$. The posterior distribution is typically obtained by fitting the IRT model via MML [11], so the quantities in (5) can be computed directly from standard software output. A discrete evaluation of the weighting function $w_i$ is then defined as
$$\tilde{w}_{it} = \frac{w_i(\theta_t)}{\sum_{u=1}^{T} w_i(\theta_u)} . \quad (6)$$
The sample-based RMSD is then defined as
$$\widehat{\mathrm{RMSD}}_{i} = \sqrt{ \sum_{t=1}^{T} \tilde{w}_{it} \left( \hat{p}_{it} - p_{it} \right)^{2} } , \quad (7)$$
where $p_{it} = P_i(\theta_t)$.
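Equations (5)-(7) translate directly into a few lines of R; the sketch below assumes a 0/1 response vector x_i and an $N \times T$ posterior matrix f taken from standard MML software output, with illustrative names throughout.

```r
# Hedged sketch of Equations (5)-(7): sample-based RMSD for one item.
sample_rmsd <- function(x_i, f, p_i, w_tilde) {
  p_hat <- colSums(x_i * f) / colSums(f)  # observed IRF, Equation (5)
  sqrt(sum(w_tilde * (p_hat - p_i)^2))    # Equation (7); w_tilde sums to 1, Equation (6)
}
```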
It has been shown in [33] that the RMSD estimator (7) is positively biased in small samples. Consequently, it is desirable to develop bias-corrected RMSD estimators with improved properties. This paper proposes seven alternative bias-corrected RMSD estimators, derived based on asymptotic bias considerations. The performance of these estimators is evaluated in a simulation study in which items exhibit DIF.
The rest of the article is organized as follows.
Section 2 introduces the newly proposed bias-corrected RMSD estimators.
Section 3 reports results from a simulation study examining the performance of these RMSD statistics under uniform DIF.
Section 4 presents an empirical example using PISA 2006 reading data. Finally,
Section 5 closes with a discussion.
2. Derivation of Bias-Corrected RMSD Estimators
This section discusses alternative bias-corrected estimators of the RMSD statistic. The population RMSD value is defined as
$$\mathrm{RMSD}_{i} = \sqrt{ \sum_{t=1}^{T} \tilde{w}_{it} \left( p_{it}^{\ast} - p_{it} \right)^{2} } , \quad (8)$$
where $p_{it}^{\ast} = P_i^{\ast}(\theta_t)$. This statistic depends on the true data-generating item response probabilities $p_{it}^{\ast}$, which are unknown and must be inferred to compute the RMSD statistic. The quantities $p_{it}$ and $\tilde{w}_{it}$ are treated as fixed in the following discussion.
The originally proposed RMSD statistic $\widehat{\mathrm{RMSD}}_{i0}$ is defined as
$$\widehat{\mathrm{RMSD}}_{i0} = \sqrt{ \sum_{t=1}^{T} \tilde{w}_{it} \left( \hat{p}_{it} - p_{it} \right)^{2} } \quad (9)$$
and depends on the estimated item response probabilities $\hat{p}_{it}$. The substitution of $p_{it}^{\ast}$ with $\hat{p}_{it}$ introduces sampling error, which leads to a positive bias because these errors appear in the squared terms of (9) (see [33,42]).
A bias correction approach relies on quantifying the sampling variances of the estimates $\hat{p}_{it}$. Let $\mathbf{V}_i$ denote the variance matrix of $\hat{\mathbf{p}}_i = ( \hat{p}_{i1}, \ldots, \hat{p}_{iT} )$. This matrix can be estimated using the framework of M-estimation [43], as the estimates $\hat{p}_{it}$ satisfy the estimating equations
$$\sum_{n=1}^{N} f_{nt} \left( x_{ni} - \hat{p}_{it} \right) = 0 \quad (t = 1, \ldots, T) , \quad (10)$$
where $f_{nt}$ represents the posterior probability of subject $n$ at the theta grid point $\theta_t$. The variance of $\hat{p}_{it}$ is given by
$$\widehat{\mathrm{Var}}( \hat{p}_{it} ) = \frac{\sum_{n=1}^{N} f_{nt}^{2} \left( x_{ni} - \hat{p}_{it} \right)^{2}}{\left( \sum_{n=1}^{N} f_{nt} \right)^{2}} , \quad (11)$$
and the covariance between $\hat{p}_{it}$ and $\hat{p}_{iu}$ is expressed as
$$\widehat{\mathrm{Cov}}( \hat{p}_{it}, \hat{p}_{iu} ) = \frac{\sum_{n=1}^{N} f_{nt} f_{nu} \left( x_{ni} - \hat{p}_{it} \right) \left( x_{ni} - \hat{p}_{iu} \right)}{\left( \sum_{n=1}^{N} f_{nt} \right) \left( \sum_{n=1}^{N} f_{nu} \right)} . \quad (12)$$
For distant $\theta$ points $\theta_t$ and $\theta_u$, the numerator in (12) shows that the covariance $\mathrm{Cov}( \hat{p}_{it}, \hat{p}_{iu} )$ is approximately zero, since $f_{nt} f_{nu} \approx 0$ when the posterior distribution of subject $n$ is concentrated around a specific $\theta$ value. Note that the formulas (11) and (12) also appeared in [44].
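The variance matrix in (11) and (12) can be transcribed directly into R; the following is a minimal sketch assuming a 0/1 response vector x_i of length $N$ and an $N \times T$ posterior matrix f, with names that are illustrative rather than taken from the sirt package.

```r
# Hedged sketch of Equations (11)-(12): M-estimation-based variance matrix
# of the observed IRF estimates for a single item.
irf_vcov <- function(x_i, f) {
  F_t   <- colSums(f)                      # total posterior mass per grid point
  p_hat <- colSums(x_i * f) / F_t          # observed IRF, Equation (5)
  s     <- f * outer(x_i, p_hat, "-")      # scores f_nt * (x_ni - p_hat_it)
  V     <- crossprod(s) / outer(F_t, F_t)  # Equation (11) (diagonal) and (12)
  list(p_hat = p_hat, V = V)
}
```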
The bias in the squared RMSD, $\widehat{\mathrm{RMSD}}_{i0}^{2}$, is first considered. Note that
$$\mathbb{E}\left( \widehat{\mathrm{RMSD}}_{i0}^{2} \right) = \mathrm{RMSD}_{i}^{2} + \sum_{t=1}^{T} \tilde{w}_{it} \, \mathrm{Var}( \hat{p}_{it} ) , \quad (13)$$
where $e_{it} = \hat{p}_{it} - p_{it}^{\ast}$ denotes the sampling error with $\mathbb{E}(e_{it}) = 0$ and $\mathrm{Var}( \hat{p}_{it} ) = \mathbb{E}( e_{it}^{2} )$. Writing $\hat{p}_{it} - p_{it} = ( p_{it}^{\ast} - p_{it} ) + e_{it}$ directly implies
$$\mathbb{E}\left[ \left( \hat{p}_{it} - p_{it} \right)^{2} \right] = \left( p_{it}^{\ast} - p_{it} \right)^{2} + \mathrm{Var}( \hat{p}_{it} ) , \quad (14)$$
which proves (13) after multiplying by $\tilde{w}_{it}$ and summing over $t$.
The second term in (13) represents the positive bias in $\widehat{\mathrm{RMSD}}_{i0}^{2}$ and is given by
$$B_{i} = \sum_{t=1}^{T} \tilde{w}_{it} \, \mathrm{Var}( \hat{p}_{it} ) . \quad (15)$$
This result motivates the bias-corrected RMSD estimator
$$\widehat{\mathrm{RMSD}}_{i1} = \sqrt{ \max\left( 0, \, \widehat{\mathrm{RMSD}}_{i0}^{2} - \hat{B}_{i} \right) } , \quad (16)$$
where $\hat{B}_{i} = \sum_{t=1}^{T} \tilde{w}_{it} \, \widehat{\mathrm{Var}}( \hat{p}_{it} )$ plugs the variance estimates (11) into (15). This approach was also proposed in [33], but without incorporating the more appropriate variance estimation given in (11) and (12).
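A minimal sketch of the squared-scale correction in (15) and (16) follows, reusing the output of irf_vcov() above; the max(0, ·) truncation reflects the reconstruction given here.

```r
# Hedged sketch of Equation (16): bias correction on the squared-RMSD scale.
rmsd_bc1 <- function(rmsd0, w_tilde, V) {
  B_hat <- sum(w_tilde * diag(V))  # estimated bias term, Equations (15) and (11)
  sqrt(max(0, rmsd0^2 - B_hat))    # Equation (16)
}
```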
However, Jensen’s inequality [43,45] implies that
$$\mathbb{E}\left( \sqrt{Z} \right) \leq \sqrt{ \mathbb{E}( Z ) } \quad (17)$$
for any positive random variable $Z$. Applying (17) to $Z = \widehat{\mathrm{RMSD}}_{i0}^{2} - \hat{B}_{i}$ and disregarding negative values yields
$$\mathbb{E}\left( \widehat{\mathrm{RMSD}}_{i1} \right) \leq \sqrt{ \mathbb{E}\left( \widehat{\mathrm{RMSD}}_{i0}^{2} - \hat{B}_{i} \right) } = \mathrm{RMSD}_{i} . \quad (18)$$
Thus, $\widehat{\mathrm{RMSD}}_{i1}$ exhibits a negative bias. A bias correction applied to the squared RMSD (i.e., to $\widehat{\mathrm{RMSD}}_{i0}^{2}$) removes bias in $\widehat{\mathrm{RMSD}}_{i1}^{2}$ as an estimator of $\mathrm{RMSD}_{i}^{2}$, but introduces bias in $\widehat{\mathrm{RMSD}}_{i1}$ itself. Therefore, it is more appropriate to derive a bias correction directly for $\widehat{\mathrm{RMSD}}_{i0}$ that operates on the square-root scale.
A second type of bias correction can be derived using a Taylor expansion of the sample statistic $\widehat{\mathrm{RMSD}}_{i0}$ around the population value $\mathrm{RMSD}_{i}$. Because the RMSD statistic is defined as the square root of a positively valued statistic that exhibits bias, the correction relies on the quadratic Taylor expansion of the square root function given by
$$\sqrt{x} \approx \sqrt{x_{0}} + \frac{1}{2 \sqrt{x_{0}}} \left( x - x_{0} \right) - \frac{1}{8 x_{0}^{3/2}} \left( x - x_{0} \right)^{2} \quad (19)$$
for positive $x$ and $x_{0}$ with $x$ close to $x_{0}$. Letting $x = y^{2}$ and $x_{0} = y_{0}^{2}$ yields the approximation
$$y \approx y_{0} + \frac{1}{2 y_{0}} \left( y^{2} - y_{0}^{2} \right) - \frac{1}{8 y_{0}^{3}} \left( y^{2} - y_{0}^{2} \right)^{2} . \quad (20)$$
In (20), replacing $y$ by the sample RMSD $\widehat{\mathrm{RMSD}}_{i0}$ and $y_{0}$ by the population RMSD $\mathrm{RMSD}_{i}$ gives
$$\widehat{\mathrm{RMSD}}_{i0} \approx \mathrm{RMSD}_{i} + \frac{1}{2 \, \mathrm{RMSD}_{i}} \left( \widehat{\mathrm{RMSD}}_{i0}^{2} - \mathrm{MSD}_{i} \right) - \frac{1}{8 \, \mathrm{RMSD}_{i}^{3}} \left( \widehat{\mathrm{RMSD}}_{i0}^{2} - \mathrm{MSD}_{i} \right)^{2} , \quad (21)$$
where $\mathrm{MSD}_{i} = \mathrm{RMSD}_{i}^{2}$ is the population MSD statistic. Taking expectations in (21) results in
$$\mathbb{E}\left( \widehat{\mathrm{RMSD}}_{i0} \right) \approx \mathrm{RMSD}_{i} + \frac{B_{i}}{2 \, \mathrm{RMSD}_{i}} - \frac{1}{8 \, \mathrm{RMSD}_{i}^{3}} \, \mathbb{E}\left[ \left( \widehat{\mathrm{RMSD}}_{i0}^{2} - \mathrm{MSD}_{i} \right)^{2} \right] . \quad (22)$$
Assuming that $\mathbb{E}[ ( \widehat{\mathrm{RMSD}}_{i0}^{2} - \mathrm{MSD}_{i} )^{2} ]$ is approximately equal to $\mathrm{Var}( \widehat{\mathrm{RMSD}}_{i0}^{2} )$, the bias of $\widehat{\mathrm{RMSD}}_{i0}$ can then be obtained from (22) as
$$\mathrm{Bias}\left( \widehat{\mathrm{RMSD}}_{i0} \right) \approx \frac{B_{i}}{2 \, \mathrm{RMSD}_{i}} - \frac{\mathrm{Var}\left( \widehat{\mathrm{RMSD}}_{i0}^{2} \right)}{8 \, \mathrm{RMSD}_{i}^{3}} . \quad (23)$$
The bias approximation (23) is used to derive bias-corrected estimators for $\mathrm{RMSD}_{i}$. At first, only the linear term from the Taylor expansion is employed; that is, the bias component involving $B_{i}$ (i.e., $B_{i} / ( 2 \, \mathrm{RMSD}_{i} )$) from (23), which can be readily estimated by $\hat{B}_{i}$ based on (15), while ignoring the term that involves $\mathrm{Var}( \widehat{\mathrm{RMSD}}_{i0}^{2} )$. The quantity $\mathrm{RMSD}_{i}$ in this term can be estimated using $\widehat{\mathrm{RMSD}}_{i0}$ or $\widehat{\mathrm{RMSD}}_{i1}$, yielding the following bias-corrected RMSD estimators:
$$\widehat{\mathrm{RMSD}}_{i2} = \widehat{\mathrm{RMSD}}_{i0} - \frac{\hat{B}_{i}}{2 \, \widehat{\mathrm{RMSD}}_{i0}} \quad \text{and} \quad \widehat{\mathrm{RMSD}}_{i3} = \widehat{\mathrm{RMSD}}_{i0} - \frac{\hat{B}_{i}}{2 \, \widehat{\mathrm{RMSD}}_{i1}} . \quad (24)$$
Because $\widehat{\mathrm{RMSD}}_{i1}$ is smaller than $\widehat{\mathrm{RMSD}}_{i0}$, the bias-corrected estimator $\widehat{\mathrm{RMSD}}_{i2}$ is always at least as large as $\widehat{\mathrm{RMSD}}_{i3}$.
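As a small illustration, the linear corrections in (24) can be computed as follows; this is a hedged sketch in which the handling of a zero-valued $\widehat{\mathrm{RMSD}}_{i1}$ anticipates the truncation rule stated at the end of this section, and the function name is illustrative.

```r
# Hedged sketch of Equation (24): linear Taylor bias corrections.
# rmsd0 and rmsd1 are the estimates from Equations (9) and (16),
# B_hat the estimated bias term from Equations (15) and (11).
rmsd_bc_linear <- function(rmsd0, rmsd1, B_hat) {
  c(rmsd2 = rmsd0 - B_hat / (2 * rmsd0),
    rmsd3 = if (rmsd1 > 0) rmsd0 - B_hat / (2 * rmsd1) else 0)
}
```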
Further bias-corrected estimators follow by retaining the second term in (23) that involves $\mathrm{Var}( \widehat{\mathrm{RMSD}}_{i0}^{2} )$. First, note that
$$\widehat{\mathrm{RMSD}}_{i0}^{2} = \sum_{t=1}^{T} \tilde{w}_{it} \left( d_{it} + e_{it} \right)^{2} = \sum_{t=1}^{T} \tilde{w}_{it} d_{it}^{2} + 2 \sum_{t=1}^{T} \tilde{w}_{it} d_{it} e_{it} + \sum_{t=1}^{T} \tilde{w}_{it} e_{it}^{2} , \quad (25)$$
where $d_{it} = p_{it}^{\ast} - p_{it}$. The term $\sum_{t} \tilde{w}_{it} e_{it}^{2}$ can be regarded as small relative to $2 \sum_{t} \tilde{w}_{it} d_{it} e_{it}$, which motivates the approximation of $\mathrm{Var}( \widehat{\mathrm{RMSD}}_{i0}^{2} )$ by $\mathrm{Var}( 2 \sum_{t} \tilde{w}_{it} d_{it} e_{it} )$. An approximate expression for $\mathrm{Var}( \widehat{\mathrm{RMSD}}_{i0}^{2} )$ is
$$\mathrm{Var}\left( \widehat{\mathrm{RMSD}}_{i0}^{2} \right) \approx 4 \sum_{t=1}^{T} \sum_{u=1}^{T} \tilde{w}_{it} \tilde{w}_{iu} \, d_{it} d_{iu} \, \mathrm{Cov}( \hat{p}_{it}, \hat{p}_{iu} ) . \quad (26)$$
A natural estimator of this variance is given by
$$\widehat{\mathrm{Var}}_{1}\left( \widehat{\mathrm{RMSD}}_{i0}^{2} \right) = 4 \sum_{t=1}^{T} \sum_{u=1}^{T} \tilde{w}_{it} \tilde{w}_{iu} \, \hat{d}_{it} \hat{d}_{iu} \, \widehat{\mathrm{Cov}}( \hat{p}_{it}, \hat{p}_{iu} ) \quad \text{with} \quad \hat{d}_{it} = \hat{p}_{it} - p_{it} , \quad (27)$$
but the substitution of $d_{it}$ with $\hat{d}_{it}$ introduces an additional variance component that arises solely from sampling variability. This additional component contributes bias to the estimator and is approximately given by
$$4 \sum_{t=1}^{T} \sum_{u=1}^{T} \tilde{w}_{it} \tilde{w}_{iu} \left[ \widehat{\mathrm{Cov}}( \hat{p}_{it}, \hat{p}_{iu} ) \right]^{2} , \quad (28)$$
which leads to the bias-corrected variance estimate
$$\widehat{\mathrm{Var}}_{2}\left( \widehat{\mathrm{RMSD}}_{i0}^{2} \right) = \widehat{\mathrm{Var}}_{1}\left( \widehat{\mathrm{RMSD}}_{i0}^{2} \right) - 4 \sum_{t=1}^{T} \sum_{u=1}^{T} \tilde{w}_{it} \tilde{w}_{iu} \left[ \widehat{\mathrm{Cov}}( \hat{p}_{it}, \hat{p}_{iu} ) \right]^{2} . \quad (29)$$
The estimators $\widehat{\mathrm{RMSD}}_{i4}$ and $\widehat{\mathrm{RMSD}}_{i5}$ replace $\mathrm{RMSD}_{i}$ in (23) by $\widehat{\mathrm{RMSD}}_{i0}$ and differ in the quantity used to substitute for $\mathrm{Var}( \widehat{\mathrm{RMSD}}_{i0}^{2} )$. The statistic $\widehat{\mathrm{RMSD}}_{i5}$ corrects for the positive bias in the variance estimate that affects $\widehat{\mathrm{RMSD}}_{i4}$:
$$\widehat{\mathrm{RMSD}}_{i4} = \widehat{\mathrm{RMSD}}_{i0} - \frac{\hat{B}_{i}}{2 \, \widehat{\mathrm{RMSD}}_{i0}} + \frac{\widehat{\mathrm{Var}}_{1}( \widehat{\mathrm{RMSD}}_{i0}^{2} )}{8 \, \widehat{\mathrm{RMSD}}_{i0}^{3}} \quad (30)$$
and
$$\widehat{\mathrm{RMSD}}_{i5} = \widehat{\mathrm{RMSD}}_{i0} - \frac{\hat{B}_{i}}{2 \, \widehat{\mathrm{RMSD}}_{i0}} + \frac{\widehat{\mathrm{Var}}_{2}( \widehat{\mathrm{RMSD}}_{i0}^{2} )}{8 \, \widehat{\mathrm{RMSD}}_{i0}^{3}} . \quad (31)$$
Finally, the last two bias-corrected RMSD estimators are derived by substituting $\widehat{\mathrm{RMSD}}_{i1}$ for $\widehat{\mathrm{RMSD}}_{i0}$ as the estimate of $\mathrm{RMSD}_{i}$ in (30) and (31), resulting in
$$\widehat{\mathrm{RMSD}}_{i6} = \widehat{\mathrm{RMSD}}_{i0} - \frac{\hat{B}_{i}}{2 \, \widehat{\mathrm{RMSD}}_{i1}} + \frac{\widehat{\mathrm{Var}}_{1}( \widehat{\mathrm{RMSD}}_{i0}^{2} )}{8 \, \widehat{\mathrm{RMSD}}_{i1}^{3}} \quad \text{and} \quad \widehat{\mathrm{RMSD}}_{i7} = \widehat{\mathrm{RMSD}}_{i0} - \frac{\hat{B}_{i}}{2 \, \widehat{\mathrm{RMSD}}_{i1}} + \frac{\widehat{\mathrm{Var}}_{2}( \widehat{\mathrm{RMSD}}_{i0}^{2} )}{8 \, \widehat{\mathrm{RMSD}}_{i1}^{3}} . \quad (32)$$
In practice, divisions by zero values of $\widehat{\mathrm{RMSD}}_{i1}$ must be prevented when computing the bias-corrected estimators. In such cases, the bias-corrected RMSD estimates $\widehat{\mathrm{RMSD}}_{i3}$, $\widehat{\mathrm{RMSD}}_{i6}$, and $\widehat{\mathrm{RMSD}}_{i7}$ are set to zero.
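The following R sketch assembles the quadratic corrections; it follows the reconstruction in (27)-(32) under the stated assumptions, with V denoting the estimated variance matrix of $\hat{\mathbf{p}}_{i}$ from (11) and (12), and is not code from any package.

```r
# Hedged sketch of Equations (27)-(32): quadratic Taylor bias corrections.
rmsd_bc_quadratic <- function(p_hat, p_i, w_tilde, V) {
  d_hat <- p_hat - p_i                           # estimated deviations
  rmsd0 <- sqrt(sum(w_tilde * d_hat^2))          # Equation (9)
  B_hat <- sum(w_tilde * diag(V))                # Equations (15) and (11)
  rmsd1 <- sqrt(max(0, rmsd0^2 - B_hat))         # Equation (16)
  W     <- outer(w_tilde, w_tilde)
  var1  <- 4 * sum(W * outer(d_hat, d_hat) * V)  # Equation (27)
  var2  <- var1 - 4 * sum(W * V^2)               # Equation (29)
  corr  <- function(tau, v) {                    # bias correction from (23)
    if (tau <= 0) return(0)                      # prevent division by zero
    rmsd0 - B_hat / (2 * tau) + v / (8 * tau^3)
  }
  c(rmsd4 = corr(rmsd0, var1), rmsd5 = corr(rmsd0, var2),
    rmsd6 = corr(rmsd1, var1), rmsd7 = corr(rmsd1, var2))
}
```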
In the simulation study in Section 3 and the empirical example in Section 4, the distribution-weighted RMSD statistics are denoted by $\widehat{\mathrm{RMSD}}_{ik}$ and the difficulty-weighted RMSD statistics by $\widehat{\mathrm{RMSD}}^{\ast}_{ik}$ ($k = 0, \ldots, 7$).
4. Empirical Example
4.1. Method
In this section, alternative RMSD statistics are illustrated using a subdataset of the PISA 2006 dataset [49] for the reading domain. The complete PISA 2006 dataset is publicly available at https://www.oecd.org/en/data/datasets/pisa-2006-database.html (accessed on 17 October 2025). Processed data and the derived subdataset are available in the data.pisa2006Read dataset from the R package sirt (Version 4.2-133; [47]). The analysis focuses on item responses from Booklet 1 to examine RMSD statistics for sample sizes of approximately 200 to 300 students per item, reflecting conditions typical of PISA field trials.
This analysis included 22 countries with valid responses on all 28 reading items from Booklet 1. Sample sizes per country ranged from 287 to 1738, yielding a total sample size of 13,876 students. In the original data.pisa2006Read dataset, student weights were normalized within each country to represent a hypothetical sample size of 5000, ensuring equal contributions across countries. Because students were randomly assigned to booklets, the total student weights were approximately balanced across countries.
Among the 28 reading items, several were originally scored polytomously but were recoded dichotomously in this analysis, with only the highest category coded as correct. The remaining items were treated as dichotomous, in line with the officially reported PISA analysis.
In the first step of the analysis, joint item parameters of the 2PL model, assumed to be invariant across countries, were estimated. The 2PL model was fitted as a single-group IRT model using student weights to ensure that all 22 countries contributed approximately equally to the analysis. The estimated item discriminations $\hat{a}_i$ and item difficulties $\hat{b}_i$ are reported in Table 6. Item discriminations ranged from 0.59 to 1.95, and item difficulties extended up to 3.23.
In the second step, the 2PL model was fitted separately for each of the 22 countries, fixing the item parameters to the joint international estimates while estimating only the country mean and country SD.
Finally, the original and bias-corrected distribution-weighted and difficulty-weighted RMSD statistics were computed under the assumption of independent student sampling. Although this assumption is violated in practice, the clustered sampling design primarily affects means and SDs rather than item parameters. Furthermore, the same independence assumption is employed in PISA reporting for both the field trial and main study analyses.
Statistical inference for the RMSD statistics was obtained using a nonparametric bootstrap [50] of students within countries. The bootstrap procedure involved independent resampling of students with replacement, maintaining the original sample size for each country. During the computation of the RMSD statistics, bootstrap sampling was applied to the posterior distribution, resulting in varying $\hat{p}_{it}$ values across bootstrap samples and, consequently, a distribution of RMSD values. Standard errors were estimated using a normal approximation to the empirical distributions, and 95% symmetric confidence intervals were computed.
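A hedged sketch of this bootstrap is given below; x is one country's response matrix, f the corresponding posterior matrix, and rmsd_stat() stands for any of the RMSD estimators above (all names are illustrative).

```r
# Nonparametric bootstrap of students within one country.
set.seed(1)
R <- 1000
boot_vals <- replicate(R, {
  idx <- sample(nrow(x), replace = TRUE)  # resample students with replacement
  rmsd_stat(x[idx, ], f[idx, ])           # RMSD recomputed from resampled posterior
})
rmsd_hat <- rmsd_stat(x, f)               # point estimate from the original sample
se <- sd(boot_vals)                       # normal-approximation standard error
ci <- rmsd_hat + c(-1, 1) * qnorm(0.975) * se  # 95% symmetric confidence interval
```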
The computed confidence intervals for the RMSD statistics enabled testing a minimum effect size hypothesis [51], assessing whether a given RMSD statistic was significantly greater than a prespecified cutoff value, such as 0.05. Item misfit was identified when the lower bound of the confidence interval exceeded this cutoff.
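With the quantities from the bootstrap sketch above, this minimum-effect decision reduces to a one-sided comparison of the lower confidence bound with the cutoff; a short illustrative sketch:

```r
# Flag an item as misfitting when the lower bound of the 95% confidence
# interval exceeds the minimum-effect cutoff; rmsd_hat and se are taken
# from the bootstrap sketch above.
cutoff <- 0.05
flagged <- (rmsd_hat - qnorm(0.975) * se) > cutoff
```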
Instead of relying on a fixed cutoff value for the RMSD statistic, a data-driven approach proposed by von Davier and Bezirhan [36] identifies items with large RMSD values as outliers, which are subsequently flagged if they exceed a specified outlier criterion.
4.2. Results
In the first analysis, results for the RMSD statistics are presented for Estonia, which included 287 students with item responses from Booklet 1. Table 6 reports the estimated distribution-weighted and difficulty-weighted RMSD statistics. In most cases, the distribution-weighted RMSD values $\widehat{\mathrm{RMSD}}_{ik}$ were smaller than their difficulty-weighted counterparts $\widehat{\mathrm{RMSD}}^{\ast}_{ik}$ ($k = 0, \ldots, 7$). As expected, the bias-corrected RMSD statistics were smaller than the original statistics and were often estimated as zero.

For six items (R055Q03, R055Q05, R104Q05, R219Q01E, R219Q02, and R220Q05), the original difficulty-weighted RMSD statistic $\widehat{\mathrm{RMSD}}^{\ast}_{i0}$ exceeded the commonly used cutoff value of 0.08, whereas this was not the case for the widely applied original distribution-weighted RMSD statistic $\widehat{\mathrm{RMSD}}_{i0}$. The standard errors of the RMSD statistics were relatively large, indicating that statistical uncertainty should be considered when making item selection decisions.
The subsequent analyses examined RMSD statistics for items across all 22 countries. Table 7 presents the percentages of items flagged when the RMSD statistics exceeded a specified cutoff value, were significantly greater than a given cutoff, or were identified as outliers according to the method proposed by von Davier and Bezirhan.

Using a strict cutoff value of 0.05, 59.1% and 72.4% of items were classified as misfitting based on the original distribution-weighted and difficulty-weighted RMSD statistics $\widehat{\mathrm{RMSD}}_{i0}$ and $\widehat{\mathrm{RMSD}}^{\ast}_{i0}$, respectively. These proportions decreased slightly when the corresponding bias-corrected RMSD statistics were applied. When a cutoff value of 0.12 was used, the flagging rates dropped to 10.1% for $\widehat{\mathrm{RMSD}}_{i0}$ and 19.8% for $\widehat{\mathrm{RMSD}}^{\ast}_{i0}$. Incorporating statistical inference into the decision process further reduced the flagging rates substantially.

The data-driven RMSD cutoffs proposed by von Davier and Bezirhan resulted in flagging rates between 2.9% and 6.8%, indicating that most items fit the 2PL model adequately. Across the different RMSD statistics, the estimated data-driven cutoffs were 0.152, 0.165, 0.177, 0.152, 0.172, 0.191, 0.209, 0.223, 0.199, and 0.222.
Figure 5 presents a percentile plot of the empirical distribution of RMSD statistics across all items and countries. The bias-corrected RMSD statistics ($\widehat{\mathrm{RMSD}}_{ik}$ and $\widehat{\mathrm{RMSD}}^{\ast}_{ik}$ for $k = 1, \ldots, 7$) yielded slightly smaller values than the original RMSD statistics ($\widehat{\mathrm{RMSD}}_{i0}$ and $\widehat{\mathrm{RMSD}}^{\ast}_{i0}$). However, the reduction in magnitude due to bias correction did not substantially alter the overall distribution of the original RMSD statistics.
5. Discussion
This paper proposed and examined seven alternative bias-corrected distribution-weighted and difficulty-weighted RMSD estimators. The key conclusion is that bias correction should be applied directly to the RMSD rather than to its squared form. The correction methods were based on either a first-order (linear) or a second-order (quadratic) Taylor expansion. All approaches effectively reduced the positive bias observed in the original RMSD statistic, though they slightly increased variance. Bias-corrected RMSD statistics that incorporated the quadratic terms of the Taylor expansion generally demonstrated higher precision. Notably, these bias-corrected RMSD statistics often had a median value of zero for items without DIF, corresponding to the population RMSD value of zero in such cases. In empirical research, the distribution-weighted and difficulty-weighted bias-corrected RMSD statistics that incorporate the quadratic Taylor terms are recommended, as they substantially reduce bias without substantially increasing variance.
The empirical example used PISA data from a single booklet. The results showed that the percentage of items flagged as misfitting by the RMSD statistic varied substantially depending on the chosen RMSD cutoff value. Moreover, the proportion of misfitting items was very low when using the data-driven cutoff proposed by [36].
The findings also indicated that sampling variance in the estimated item response functions, which are required for computing the RMSD statistic, becomes particularly critical in small samples, leading to positive bias in the RMSD. The proposed bias-correction methods are therefore especially relevant for small-scale applications aimed at detecting item misfit, such as field trials in PISA.
As an anonymous reviewer critically noted, the simulation study assumed equal item discriminations of 1, implying that the one-parameter logistic (1PL) model served as the data-generating model, whereas the 2PL model was fitted to the responses. This discrepancy can be viewed as a substantial limitation. The behavior of the different bias-correction approaches for the RMSD statistic may depend on variation in item discriminations. A more detailed investigation of this issue may be addressed in future research.
The study was limited to dichotomous items. Future research could extend the bias-corrected RMSD estimators to polytomous items. However, several alternative generalizations of the RMSD statistic exist for polytomous data, and it remains unclear which formulation should be preferred in applied research to enable a coherent generalization of the cutoff values used for dichotomous items.
The approach presented in this study is applicable to any weighted RMSD statistic [37]. It remains unclear whether the distribution-weighted RMSD statistic has notable disadvantages compared to the difficulty-weighted RMSD statistic. In the originally proposed distribution-weighted RMSD, misfitting items with extreme difficulties tend to receive lower RMSD values than misfitting items with moderate difficulty. When likelihood-based inference includes precision weighting in the estimation of group means, it is reasonable for item misfit to be partially influenced by the likelihood contribution. Items with low precision, which contribute weakly to the likelihood, should arguably not be strongly flagged in the item misfit analysis.