1. Introduction
Item response theory (IRT) models [1] are an important class of statistical models for the analysis of multivariate binary random variables (i.e., dichotomous variables). IRT models can be regarded as a factor-analytic multivariate technique to summarize a high-dimensional contingency table by a few latent factor variables of interest. Of particular interest is the application of IRT models in educational large-scale assessment studies (LSA; [2]) like the Programme for International Student Assessment (PISA; [3]) that summarize the ability of students on test items in different cognitive domains. This article focuses on unidimensional IRT models that involve a unidimensional latent variable used for describing the discrete multivariate data. Moreover, we only consider dichotomous items, although LSA studies typically involve dichotomous and polytomous items.
Let $\mathbf{X} = (X_1, \ldots, X_I)$ be the vector of $I$ dichotomous items $X_i \in \{0, 1\}$. There are $2^I$ different realizations for the multivariate variable $\mathbf{X}$. A unidimensional IRT model [4,5] is a statistical model for the probability distribution

$$P(\mathbf{X} = \mathbf{x}) = \int \prod_{i=1}^I P_i(\theta)^{x_i} \bigl(1 - P_i(\theta)\bigr)^{1 - x_i} f(\theta) \,\mathrm{d}\theta \quad (1)$$

for $\mathbf{x} \in \{0, 1\}^I$, where $f$ is a univariate density function of the latent variable $\theta$. In the rest of the article, we fix this distribution to be standard normal, but this can be weakened [6,7,8]. The functions $P_i(\theta) = P(X_i = 1 \mid \theta)$ are denoted as item response functions (IRF).
It is important to note that in (1), item responses $X_i$ are conditionally independent given the latent variable $\theta$. This means that after controlling for the latent ability $\theta$, pairs of items $X_i$ and $X_j$ are conditionally uncorrelated. This local independence assumption can be statistically tested [5,9].
In most cases, a parametric model is utilized to estimate the IRF appearing in (1). In more detail, for each item, a parametric IRF $P_i(\theta; \boldsymbol{\gamma}_i)$ is assumed. The vectors of item parameters $\boldsymbol{\gamma}_i$ are estimated in the IRT model. The one-parameter logistic (1PL) model (also referred to as the Rasch model; [10]) employs the IRF $P_i(\theta; \boldsymbol{\gamma}_i) = \Psi(a(\theta - b_i))$, where $\Psi(x) = (1 + \exp(-x))^{-1}$ is the logistic link function, $b_i$ is the item difficulty of item $i$, and $a$ is the common item discrimination. Note that $a$ can alternatively be set to 1, and the standard deviation of the trait $\theta$ is estimated instead. As an alternative, the two-parameter logistic (2PL) model [11] is also frequently used in practice. The 2PL model employs the IRF $P_i(\theta; \boldsymbol{\gamma}_i) = \Psi(a_i(\theta - b_i))$ and has two item-specific parameters $\boldsymbol{\gamma}_i = (a_i, b_i)$. In contrast to the 1PL model, item discriminations are allowed to be item-specific.
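As a minimal sketch (with hypothetical item parameters chosen purely for illustration, not taken from any dataset in this article), the 1PL and 2PL IRFs can be written as:

```python
import math

def logistic(x):
    """Logistic link function Psi(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def irf_1pl(theta, b_i, a=1.0):
    """1PL IRF: common discrimination a, item-specific difficulty b_i."""
    return logistic(a * (theta - b_i))

def irf_2pl(theta, a_i, b_i):
    """2PL IRF: item-specific discrimination a_i and difficulty b_i."""
    return logistic(a_i * (theta - b_i))
```

At $\theta = b_i$, both models yield a response probability of 0.5; a larger discrimination produces a steeper curve around the item difficulty.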
Typically, the parametric assumption will be a (slight) misspecification of the true IRT model (1). That is, the multivariate vector $\mathbf{X}$ is represented by:

$$P(\mathbf{X} = \mathbf{x}) \approx \int \prod_{i=1}^I P_i(\theta; \boldsymbol{\gamma}_i)^{x_i} \bigl(1 - P_i(\theta; \boldsymbol{\gamma}_i)\bigr)^{1 - x_i} f(\theta) \,\mathrm{d}\theta \quad (2)$$

In practical applications, it can be hoped that the approximation of $P_i$ by $P_i(\cdot; \boldsymbol{\gamma}_i)$ is good enough because the shape of the IRF is used for describing and selecting items in an educational test.
The parameters $\boldsymbol{\gamma}_i$ of the estimated IRFs in Equation (2) can be estimated by (marginal) maximum likelihood using an expectation-maximization algorithm [12,13,14]. In practice, the integral in (2) can be approximated by fixed (rectangular) quadrature integration. If a standard normal density $f$ is used, a quadrature grid of 21 or 41 equidistant $\theta$ points between $-6$ and 6 is often used in software implementations.
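This quadrature approximation can be sketched as follows (the grid endpoints and the renormalization of the weights follow the description above; the item parameters are hypothetical):

```python
import math

def normal_density(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def quadrature_grid(T=21, lo=-6.0, hi=6.0):
    """Equidistant theta grid with standard normal weights renormalized to sum to 1."""
    thetas = [lo + t * (hi - lo) / (T - 1) for t in range(T)]
    raw = [normal_density(th) for th in thetas]
    total = sum(raw)
    return thetas, [w / total for w in raw]

def pattern_probability(x, irfs, thetas, weights):
    """Approximate P(X = x) in Equation (2) by summation over the quadrature grid."""
    prob = 0.0
    for theta, f_t in zip(thetas, weights):
        likelihood = 1.0
        for x_i, irf in zip(x, irfs):
            p = irf(theta)
            likelihood *= p if x_i == 1 else 1.0 - p
        prob += likelihood * f_t
    return prob
```

Because the weights are renormalized to sum to 1, summing the approximated probabilities over all $2^I$ response patterns recovers 1 exactly (up to floating-point error).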
The assessment of the adequacy of parametric IRFs (i.e., item fit; [15,16]) is an active field in psychometric research. The main idea is to assess the discrepancy between a true IRF $P_i$ and the assumed parametric IRF $P_i(\cdot; \boldsymbol{\gamma}_i)$. Of vital interest is to find those misfitting items $i$ for which the assumed IRFs $P_i(\cdot; \boldsymbol{\gamma}_i)$ are seriously incorrect. In these cases, different functional forms of the IRF might be used, or item $i$ might be deleted from further analysis. In this article, we study the statistical behavior of the root mean square deviation (RMSD; [17,18,19,20]) item fit statistic. It is shown that misfit of some items also affects the item fit assessment of the fitting items because the misspecified IRT model allocates the misspecification of one item to other, fitting items. Moreover, we demonstrate that the expected value of the RMSD statistic depends on the sample size. To circumvent this obstacle, three alternative bias-corrected estimators of the RMSD statistic are investigated.
The rest of the article is structured as follows. In Section 2, the RMSD statistic is introduced and a few population and finite-sample properties are presented. Section 3 proposes three alternative bias-corrected RMSD estimators. In Section 4, four numerical experiments are carried out in order to compare the performance of the original RMSD estimator with the bias-corrected RMSD alternatives. Moreover, we also study the behavior of the population RMSD value as a function of the proportion of misfitting items. Finally, the paper closes with a discussion in Section 5.
2. RMSD Item Fit Statistic
In this section, we introduce the RMSD item fit statistic. The item fit can be defined as the discrepancy between the true IRF $P_i$ and the parametric IRF $P_i(\cdot; \boldsymbol{\gamma}_i)$. In practice, the parametric IRFs $P_i(\cdot; \boldsymbol{\gamma}_i)$ are obtained, but the true IRFs $P_i$ can be nonparametrically defined and are not directly accessible. Nevertheless, one can define:

$$\mathrm{RMSD}_i = \sqrt{\int \bigl(P_i(\theta) - P_i(\theta; \boldsymbol{\gamma}_i)\bigr)^2 f(\theta) \,\mathrm{d}\theta} \quad (3)$$
For a fitted IRT model with a parametric assumption (see Equation (2)), the involved true but unknown IRFs $P_i$ must be replaced by some estimate. As already mentioned in Section 1, the estimation of the IRT model relies on evaluating the integral in (2) on a grid $\theta_1, \ldots, \theta_T$ of $T$ quadrature points for the ability variable $\theta$. Hence, all involved integrations in model fitting and item fit assessment will be replaced by summations that involve the finite grid of quadrature points.
As pointed out by an anonymous reviewer, the RMSD statistic in (3) is only designed to detect misfit in the functional form of the IRF. The RMSD is insensitive to violations of the local independence and unidimensionality assumptions. However, the RMSD can be effectively utilized for studying differential item functioning (see Section 4.3).
For $I$ dichotomous items $X_1, \ldots, X_I$, there are $2^I$ different item response patterns. For a vector $\mathbf{x} = (x_1, \ldots, x_I)$, we define the index $p$ of an item response pattern by $p = 1 + \sum_{i=1}^I x_i 2^{i-1}$. Hence, we can associate the vector of item responses with item response patterns. According to the local independence assumption, we can compute the individual likelihood function for pattern $p$ based on the true or assumed parametric IRF, respectively, by:

$$f_p(\theta) = \prod_{i=1}^I P_i(\theta)^{x_{pi}} \bigl(1 - P_i(\theta)\bigr)^{1 - x_{pi}} \quad (4)$$

$$f_p(\theta; \boldsymbol{\gamma}) = \prod_{i=1}^I P_i(\theta; \boldsymbol{\gamma}_i)^{x_{pi}} \bigl(1 - P_i(\theta; \boldsymbol{\gamma}_i)\bigr)^{1 - x_{pi}} \quad (5)$$
In Equations (1) and (2), the normal distribution is typically fixed. Hence, values of the density $f$ evaluated at the discrete quadrature grid are known as $f_t = f(\theta_t)$ with $\sum_{t=1}^T f_t = 1$ (after normalization). Note also that the data-generating model (1) can be rewritten by replacing the integration with summation as

$$\pi_p = P(\mathbf{X} = \mathbf{x}_p) = \sum_{t=1}^T f_p(\theta_t) f_t \quad (6)$$

Clearly, it also holds that $\sum_{p=1}^{2^I} \pi_p = 1$.
The estimation of the unknown IRF $P_i$ in Equation (3) is based on individual posterior distributions [20,21,22]. For each pattern $p$ and each quadrature point $\theta_t$, the posterior distribution $h_p(\theta_t)$ is given by

$$h_p(\theta_t) = \frac{f_p(\theta_t; \boldsymbol{\gamma}) f_t}{\pi_p(\boldsymbol{\gamma})} \quad (7)$$

where $\pi_p(\boldsymbol{\gamma}) = \sum_{t=1}^T f_p(\theta_t; \boldsymbol{\gamma}) f_t$. Finally, the observed IRF $\tilde{P}_i$ as an estimate of $P_i$ is defined by

$$\tilde{P}_i(\theta_t) = \frac{\sum_{p=1}^{2^I} x_{pi} \pi_p h_p(\theta_t)}{\sum_{p=1}^{2^I} \pi_p h_p(\theta_t)} \quad (8)$$
Then, the RMSD statistic from (3) can be rewritten as:

$$\mathrm{RMSD}_i = \sqrt{\sum_{t=1}^T \bigl(\tilde{P}_i(\theta_t) - P_i(\theta_t; \boldsymbol{\gamma}_i)\bigr)^2 f_t} \quad (9)$$

The RMSD statistic in Equation (9) refers to a population value because the probabilities $\pi_p$ of item response patterns are known. For sample data, observed frequencies $\hat{\pi}_p$ instead of $\pi_p$ are used for defining an estimate of the true IRF. This estimate is given by:

$$\hat{P}_i(\theta_t) = \frac{\sum_{p=1}^{2^I} x_{pi} \hat{\pi}_p h_p(\theta_t)}{\sum_{p=1}^{2^I} \hat{\pi}_p h_p(\theta_t)} \quad (10)$$

A sample-based RMSD statistic is then defined as:

$$\widehat{\mathrm{RMSD}}_i = \sqrt{\sum_{t=1}^T \bigl(\hat{P}_i(\theta_t) - P_i(\theta_t; \hat{\boldsymbol{\gamma}}_i)\bigr)^2 f_t} \quad (11)$$

Note that the item parameter $\hat{\boldsymbol{\gamma}}_i$ in (11) might be known or unknown.
The RMSD fit statistic has broad applicability in educational assessment [21,23,24,25,26]. It is primarily used as an effect size of item misfit [15,27], and RMSD values larger than 0.05 or 0.08 might indicate a notable violation of the parametric IRF assumption [19,22,28,29,30]. The RMSD item fit statistic bears similarity to residual-based test statistics developed by Haberman and colleagues [15,31,32,33]. Related research based on residual statistics can be found in [26,34].
2.1. Unbiasedness of the Population Value of the RMSD Statistic for a Correctly Specified IRT Model
We now show unbiasedness of the population RMSD statistic (see Equation (9)) if the IRT model is correctly specified. In this case, we have $P_i(\theta_t) = P_i(\theta_t; \boldsymbol{\gamma}_i)$ for all $i$ and $t$, $f_p(\theta_t) = f_p(\theta_t; \boldsymbol{\gamma})$ for all $p$ and $t$, and $\pi_p = \pi_p(\boldsymbol{\gamma})$ for all $p$. The finding has also been presented by [32]. We only have to show that $\tilde{P}_i(\theta_t) = P_i(\theta_t; \boldsymbol{\gamma}_i)$. We analyze the numerator and the denominator of $\tilde{P}_i$ in (8). For the numerator of $\tilde{P}_i(\theta_t)$, we get:

$$\sum_{p=1}^{2^I} x_{pi} \pi_p h_p(\theta_t) = \sum_{p=1}^{2^I} x_{pi} f_p(\theta_t) f_t = f_t \sum_{p=1}^{2^I} x_{pi} f_p(\theta_t) = f_t P_i(\theta_t) \quad (12)$$

where the last equality holds because summing the pattern likelihoods over all patterns with $x_{pi} = 1$ marginalizes out the remaining items. For the denominator of $\tilde{P}_i(\theta_t)$, we obtain:

$$\sum_{p=1}^{2^I} \pi_p h_p(\theta_t) = \sum_{p=1}^{2^I} f_p(\theta_t) f_t = f_t \quad (13)$$

Hence, we get $\tilde{P}_i(\theta_t) = P_i(\theta_t) = P_i(\theta_t; \boldsymbol{\gamma}_i)$ for all $t$. If the IRT model is correctly specified, the RMSD population value is zero, and we get unbiasedness.
2.2. Population RMSD Statistic for Misspecified IRT Models
Now, we derive the population value of the RMSD statistic if the IRT model is misspecified. This means that the assumed parametric IRF $P_i(\cdot; \boldsymbol{\gamma}_i)$ differs from the true data-generating IRF $P_i$. Consequently, it follows that $\pi_p \neq \pi_p(\boldsymbol{\gamma})$. Define $d_p = \pi_p / \pi_p(\boldsymbol{\gamma})$. We now study the numerator and the denominator of $\tilde{P}_i$ in Equation (8). For the numerator, we get

$$\sum_{p=1}^{2^I} x_{pi} \pi_p h_p(\theta_t) = f_t \sum_{p=1}^{2^I} x_{pi} d_p f_p(\theta_t; \boldsymbol{\gamma}) \quad (14)$$

where $d_p = 1 + \delta_p$ with $\delta_p = (\pi_p - \pi_p(\boldsymbol{\gamma})) / \pi_p(\boldsymbol{\gamma})$. Similar calculations for the denominator result in

$$\sum_{p=1}^{2^I} \pi_p h_p(\theta_t) = f_t \sum_{p=1}^{2^I} d_p f_p(\theta_t; \boldsymbol{\gamma}) \quad (15)$$

Hence, the observed IRF $\tilde{P}_i$ can be determined as:

$$\tilde{P}_i(\theta_t) = \frac{\sum_{p=1}^{2^I} x_{pi} d_p f_p(\theta_t; \boldsymbol{\gamma})}{\sum_{p=1}^{2^I} d_p f_p(\theta_t; \boldsymbol{\gamma})} \quad (16)$$

By applying a Taylor expansion of (16) and ignoring higher-order terms, we get:

$$\tilde{P}_i(\theta_t) \approx P_i(\theta_t; \boldsymbol{\gamma}_i) + \sum_{p=1}^{2^I} \bigl(x_{pi} - P_i(\theta_t; \boldsymbol{\gamma}_i)\bigr) \delta_p f_p(\theta_t; \boldsymbol{\gamma}) \quad (17)$$

Notably, misspecified IRFs enter the pattern probabilities $\pi_p$, which subsequently enter the $\delta_p$ terms in (17). Interestingly, the observed IRF of fitting items (i.e., items with $P_i = P_i(\cdot; \boldsymbol{\gamma}_i)$) will also typically be biased if there are some misfitting items in the test. Therefore, the RMSD statistic for fitting items will be larger than zero. It is unclear how Equation (17) affects the RMSD population values for misfitting items. In our experience from empirical applications, the RMSD value for misfitting items will be much smaller than the pseudo-true RMSD value defined in Equation (3) (see [21]).
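This spillover effect can be checked numerically. The following sketch is illustrative only (hypothetical parameters: the true IRF of the first item uses discrimination 2, while the assumed model fixes all discriminations at 1); it evaluates Equations (6) to (9) directly:

```python
import numpy as np

def quadrature(T=41):
    """Equidistant grid on [-6, 6] with renormalized standard normal weights f_t."""
    thetas = np.linspace(-6.0, 6.0, T)
    f = np.exp(-0.5 * thetas**2)
    return thetas, f / f.sum()

def irf_table(a, b, thetas):
    """P[i, t] = Psi(a_i * (theta_t - b_i))."""
    return 1.0 / (1.0 + np.exp(-a[:, None] * (thetas[None, :] - b[:, None])))

def pattern_likelihoods(P):
    """X[p] = response pattern p; L[p, t] = f_p(theta_t) under the IRFs in P."""
    I = P.shape[0]
    X = np.array([[(p >> i) & 1 for i in range(I)] for p in range(2**I)])
    L = np.prod(np.where(X[:, :, None] == 1, P[None, :, :], 1.0 - P[None, :, :]), axis=1)
    return X, L

def population_rmsd(P_true, P_model, f):
    """Population RMSD per item via Equations (6)-(9)."""
    X, L_true = pattern_likelihoods(P_true)
    _, L_model = pattern_likelihoods(P_model)
    pi_true = L_true @ f                            # Eq. (6): true pattern probabilities
    pi_model = L_model @ f                          # model-implied normalizing constants
    H = L_model * f[None, :] / pi_model[:, None]    # Eq. (7): misspecified posteriors
    W = pi_true[:, None] * H                        # pi_p * h_p(theta_t)
    P_obs = (X.T @ W) / W.sum(axis=0)[None, :]      # Eq. (8): observed IRFs
    return np.sqrt((((P_obs - P_model) ** 2) * f[None, :]).sum(axis=1))  # Eq. (9)

thetas, f = quadrature()
b = np.array([-1.0, 0.0, 1.0])
P_true = irf_table(np.array([2.0, 1.0, 1.0]), b, thetas)  # first item truly steeper
P_model = irf_table(np.ones(3), b, thetas)                # assumed 1PL with a = 1
rmsd = population_rmsd(P_true, P_model, f)
```

The misfitting first item receives the largest population RMSD, but the two correctly specified items also obtain strictly positive values, because the misspecified posteriors in (7) redistribute part of the misfit across items.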
2.3. On the Positive Bias of the Sample-Based RMSD Statistic
We now show that the expected value of the sample-based RMSD statistic $\widehat{\mathrm{RMSD}}_i$ is typically larger than the population RMSD statistic. The reason is that we now use observed frequencies $\hat{\pi}_p$ instead of item response pattern probabilities $\pi_p$ in the computation of the estimated observed IRF $\hat{P}_i$. By applying a multivariate Taylor expansion of first order, we obtain

$$\hat{P}_i(\theta_t) \simeq \tilde{P}_i(\theta_t) + \sum_{p=1}^{2^I} \frac{\partial \tilde{P}_i(\theta_t)}{\partial \pi_p} (\hat{\pi}_p - \pi_p) \quad (18)$$

We can simplify (18) to

$$\hat{P}_i(\theta_t) \simeq \tilde{P}_i(\theta_t) + \frac{1}{\sum_{q=1}^{2^I} \pi_q h_q(\theta_t)} \sum_{p=1}^{2^I} \bigl(x_{pi} - \tilde{P}_i(\theta_t)\bigr) h_p(\theta_t) (\hat{\pi}_p - \pi_p) \quad (19)$$

Therefore, we can write

$$\hat{P}_i(\theta_t) - P_i(\theta_t; \boldsymbol{\gamma}_i) \simeq \bigl(\tilde{P}_i(\theta_t) - P_i(\theta_t; \boldsymbol{\gamma}_i)\bigr) + u_i(\theta_t) \quad (20)$$

where $u_i(\theta_t)$ is the second term after the $\simeq$ sign in (19) and has an expected value of zero because $\mathbb{E}(\hat{\pi}_p) = \pi_p$. Hence, we get an expected value of the square of the sample-based RMSD statistic of

$$\mathbb{E}\bigl(\widehat{\mathrm{RMSD}}_i^2\bigr) \simeq \sum_{t=1}^T \bigl(\tilde{P}_i(\theta_t) - P_i(\theta_t; \boldsymbol{\gamma}_i)\bigr)^2 f_t + \sum_{t=1}^T \mathrm{Var}\bigl(u_i(\theta_t)\bigr) f_t = \mathrm{RMSD}_i^2 + \sum_{t=1}^T \mathrm{Var}\bigl(u_i(\theta_t)\bigr) f_t \quad (21)$$

As a consequence of (21), sample-based estimates of the RMSD statistic typically turn out to be larger on average than their population-based counterparts.
5. Discussion
In this article, we systematically studied the behavior of the RMSD estimators in infinite sample sizes (i.e., at the population level) and finite sample sizes. It turned out that the population RMSD value depended on the proportion of misfitting items. With a larger proportion of misfitting items, RMSD values of misfitting items decrease, but RMSD values of fitting items increase. This means that the RMSD item fit statistic must always be interpreted as a relative fit statistic. The RMSD item fit statistic depends on the properties of the other items appearing in the test.
As with all simulation studies, our study is limited to the studied conditions. We only investigated relatively short test lengths, although the findings can be expected to generalize to longer tests. We also used only a few simulation factor levels for the proportion of misfitting items. Finally, we restricted ourselves to the study of a misspecified 1PL model (see [21,46] for more complex misspecified item response functions) and uniform differential item functioning.
Moreover, it was demonstrated that the RMSD estimator was positively biased in smaller samples. This can be explained by the fact that the RMSD is defined as a discrepancy statistic: due to sampling variability, such a statistic will always be positive in small samples. This property has also been shown for global fit statistics in structural equation modeling [47,48]. Following the developments in structural equation modeling [49], we pursued the route of constructing bias-corrected estimators for the RMSD based on an analytical treatment as well as a fully computational solution based on bootstrap and jackknife resampling approaches. While the original RMSD estimator showed a positive bias for misfitting items, our proposed bias-corrected RMSD alternatives were negatively biased. However, the analytical bias-corrected RMSD estimator had the most desirable properties and can be recommended for default use in applied research. Future research might consider averaging the original RMSD estimate and a bias-corrected RMSD estimate to obtain an estimator with an even lower bias that does not increase the standard deviation of the resulting estimator.
Interestingly, other fit statistics such as item outfit [50] or the $\chi^2$-type statistic [46] also involve the distance $P_i(\theta) - P_i(\theta; \boldsymbol{\gamma}_i)$ as an effect size but replace the weighting by the density $f$ with a weighting function that standardizes the squared distance by $P_i(\theta; \boldsymbol{\gamma}_i)(1 - P_i(\theta; \boldsymbol{\gamma}_i))$. Pursuing this idea further, it might be interesting to investigate a more general RMSD statistic of the type

$$\mathrm{RMSD}_i = \sqrt{\int \bigl(P_i(\theta) - P_i(\theta; \boldsymbol{\gamma}_i)\bigr)^2 \omega(\theta) \,\mathrm{d}\theta}$$

with an appropriate weighting function $\omega$ (see also [51]).
Finally, we argued that the RMSD values depend on test length and the proportion of misfitting items. Hence, using a general cutoff value for declaring misfitting items might not be justified. Indeed, von Davier and Bezirhan [52] also argued that misfitting items should be detected by assuming a mixture distribution of RMSD values of misfitting and fitting items. Items with large RMSD values are treated as outliers and can be detected by techniques from robust statistics ([52]; see also [53]). We think that such an approach is a promising direction for future research. The approach of von Davier and Bezirhan implies that RMSD cutoff values must be selected depending on the conditions of a particular dataset. Identifying misfitting items as outliers corresponds to the idea that only a portion of the items in a test do not follow an assumed functional form of the item response function. In our opinion, it can be questioned whether item misfit might instead be unsystematically distributed. We argued elsewhere that a particular IRT model is chosen on purpose, and item or model misfit should play no or only a minor role in model selection [54].