1. Introduction
Item response theory (IRT) models [1] are an important class of statistical models for the analysis of multivariate binary random variables (i.e., dichotomous variables). IRT models can be regarded as a factor-analytic multivariate technique to summarize a high-dimensional contingency table by a few latent factor variables of interest. Of particular interest is the application of IRT models in educational large-scale assessment studies (LSA; [2]) like the Programme for International Student Assessment (PISA; [3]) that summarize the ability of students on test items in different cognitive domains. This article focuses on unidimensional IRT models that involve a unidimensional latent variable used for describing the discrete multivariate data. Moreover, we only consider dichotomous items, although LSA studies typically involve dichotomous and polytomous items.
Let $\mathbf{X} = (X_1, \ldots, X_I)$ be the vector of $I$ dichotomous items $X_i \in \{0, 1\}$. There are $2^I$ different realizations for the multivariate variable $\mathbf{X}$. A unidimensional IRT model [4,5] is a statistical model for the probability distribution

$$P(\mathbf{X} = \mathbf{x}) = \int \prod_{i=1}^I P_i(\theta)^{x_i} \bigl(1 - P_i(\theta)\bigr)^{1 - x_i} f(\theta) \,\mathrm{d}\theta \quad (1)$$

for $\mathbf{x} \in \{0, 1\}^I$, where $f$ is a univariate density function of the latent variable $\theta$. In the rest of the article, we fix this distribution to be standard normal, but this can be weakened [6,7,8]. The functions $P_i(\theta) = P(X_i = 1 \mid \theta)$ are denoted as item response functions (IRF).
It is important to note that in (1), item responses $X_i$ are conditionally independent given the latent variable $\theta$. This means that after controlling for the latent ability $\theta$, pairs of items $X_i$ and $X_j$ are conditionally uncorrelated. This local independence assumption can be statistically tested [5,9].
In most cases, a parametric model is utilized to estimate the IRF appearing in (1). In more detail, for each item, a parametric IRF $P_i(\theta; \boldsymbol{\gamma}_i)$ is assumed. The vectors of item parameters $\boldsymbol{\gamma}_i$ are estimated in the IRT model. The one-parameter logistic (1PL) model (also referred to as the Rasch model; [10]) employs the IRF $P_i(\theta; \boldsymbol{\gamma}_i) = \Psi(a(\theta - b_i))$, where $\Psi(x) = (1 + \exp(-x))^{-1}$ is the logistic link function, $b_i$ is the item difficulty of item $i$, and $a$ is the common item discrimination. Note that $a$ can alternatively be set to 1, and the standard deviation of the trait $\theta$ is estimated instead. As an alternative, the two-parameter logistic (2PL) model [11] is also frequently used in practice. The 2PL model employs the IRF $P_i(\theta; \boldsymbol{\gamma}_i) = \Psi(a_i(\theta - b_i))$ and has two item-specific parameters $\boldsymbol{\gamma}_i = (a_i, b_i)$. In contrast to the 1PL model, item discriminations are allowed to be item-specific.
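As a minimal sketch (with hypothetical item parameters chosen purely for illustration, not taken from any dataset in this article), the 1PL and 2PL IRFs can be written as:

```python
import math

def logistic(x):
    """Logistic link function Psi(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def irf_1pl(theta, b_i, a=1.0):
    """1PL IRF: common discrimination a, item-specific difficulty b_i."""
    return logistic(a * (theta - b_i))

def irf_2pl(theta, a_i, b_i):
    """2PL IRF: item-specific discrimination a_i and difficulty b_i."""
    return logistic(a_i * (theta - b_i))
```

At $\theta = b_i$, both models yield a response probability of 0.5; a larger discrimination produces a steeper curve around the item difficulty.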
Typically, the parametric assumption will be a (slight) misspecification of the true IRT model (1). That is, the multivariate vector $\mathbf{X}$ is represented by:

$$P(\mathbf{X} = \mathbf{x}) \approx \int \prod_{i=1}^I P_i(\theta; \boldsymbol{\gamma}_i)^{x_i} \bigl(1 - P_i(\theta; \boldsymbol{\gamma}_i)\bigr)^{1 - x_i} f(\theta) \,\mathrm{d}\theta \quad (2)$$

In practical applications, it can be hoped that the approximation of $P_i$ by $P_i(\cdot; \boldsymbol{\gamma}_i)$ is good enough because the shape of the IRF is used for describing and selecting items in an educational test.
The parameters $\boldsymbol{\gamma}_i$ of the estimated IRFs in Equation (2) can be estimated by (marginal) maximum likelihood using an expectation-maximization algorithm [12,13,14]. In practice, the integral in (2) can be approximated by fixed (rectangular) quadrature integration. If a standard normal density $f$ is used, a quadrature grid of 21 or 41 equidistant $\theta$ points between $-6$ and 6 is often used in software implementations.
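This quadrature approximation can be sketched as follows (the grid endpoints and the renormalization of the weights follow the description above; the item parameters are hypothetical):

```python
import math

def normal_density(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def quadrature_grid(T=21, lo=-6.0, hi=6.0):
    """Equidistant theta grid with standard normal weights renormalized to sum to 1."""
    thetas = [lo + t * (hi - lo) / (T - 1) for t in range(T)]
    raw = [normal_density(th) for th in thetas]
    total = sum(raw)
    return thetas, [w / total for w in raw]

def pattern_probability(x, irfs, thetas, weights):
    """Approximate P(X = x) in Equation (2) by summation over the quadrature grid."""
    prob = 0.0
    for theta, f_t in zip(thetas, weights):
        likelihood = 1.0
        for x_i, irf in zip(x, irfs):
            p = irf(theta)
            likelihood *= p if x_i == 1 else 1.0 - p
        prob += likelihood * f_t
    return prob
```

Because the weights are renormalized to sum to 1, summing the approximated probabilities over all $2^I$ response patterns recovers 1 exactly (up to floating-point error).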
The assessment of the adequacy of parametric IRFs (i.e., item fit; [15,16]) is an active field in psychometric research. The main idea is to assess the discrepancy between a true IRF $P_i$ and the assumed parametric IRF $P_i(\cdot; \boldsymbol{\gamma}_i)$. Of vital interest is to find those misfitting items $i$ for which the assumed IRFs $P_i(\cdot; \boldsymbol{\gamma}_i)$ are seriously incorrect. In these cases, different functional forms of the IRF might be used, or item $i$ might be deleted from further analysis. In this article, we study the statistical behavior of the root mean square deviation (RMSD; [17,18,19,20]) item fit statistic. It is shown that misfit of some items also affects the item fit assessment of the fitting items because the misspecified IRT model allocates the misspecification of one item to other, fitting items. Moreover, we demonstrate that the expected value of the RMSD statistic depends on the sample size. To circumvent this obstacle, three alternative bias-corrected estimators of the RMSD statistic are investigated.
The rest of the article is structured as follows. In Section 2, the RMSD statistic is introduced and a few population and finite-sample properties are presented. Section 3 proposes three alternative bias-corrected RMSD estimators. In Section 4, four numerical experiments are carried out in order to compare the performance of the original RMSD estimator with the bias-corrected RMSD alternatives. Moreover, we also study the behavior of the population RMSD value as a function of the proportion of misfitting items. Finally, the paper closes with a discussion in Section 5.
2. RMSD Item Fit Statistic
In this section, we introduce the RMSD item fit statistic. The item fit can be defined as the discrepancy between the true IRF $P_i$ and the parametric IRF $P_i(\cdot; \boldsymbol{\gamma}_i)$. In practice, the parametric IRFs $P_i(\cdot; \boldsymbol{\gamma}_i)$ are obtained, but the true IRFs $P_i$ can be nonparametrically defined and are not directly accessible. Nevertheless, one can define:

$$\mathrm{RMSD}_i = \sqrt{\int \bigl(P_i(\theta) - P_i(\theta; \boldsymbol{\gamma}_i)\bigr)^2 f(\theta) \,\mathrm{d}\theta} \quad (3)$$
For a fitted IRT model with a parametric assumption (see Equation (2)), the involved true but unknown IRFs $P_i$ must be replaced by some estimate. As already mentioned in Section 1, the estimation of the IRT model relies on evaluating the integral in (2) on a grid $\theta_1, \ldots, \theta_T$ of $T$ quadrature points for the ability variable $\theta$. Hence, all involved integrations in model fitting and item fit assessment will be replaced by summations that involve the finite grid of quadrature points.
As pointed out by an anonymous reviewer, the RMSD statistic in (3) is only designed to detect misfit in the functional form of the IRF. The RMSD is insensitive to violations of the local independence and unidimensionality assumptions. However, the RMSD can be effectively utilized for studying differential item functioning (see Section 4.3).
For $I$ dichotomous items $X_1, \ldots, X_I$, there are $2^I$ different item response patterns. For a vector $\mathbf{x} = (x_1, \ldots, x_I)$, we define the index $p$ of an item response pattern by $p = 1 + \sum_{i=1}^I x_i 2^{i-1}$. Hence, we can associate the vector of item responses with item response patterns. According to the local independence assumption, we can compute the individual likelihood function for pattern $p$ based on the true or assumed parametric IRF, respectively, by:

$$f_p(\theta) = \prod_{i=1}^I P_i(\theta)^{x_{pi}} \bigl(1 - P_i(\theta)\bigr)^{1 - x_{pi}} \quad (4)$$

$$f_p(\theta; \boldsymbol{\gamma}) = \prod_{i=1}^I P_i(\theta; \boldsymbol{\gamma}_i)^{x_{pi}} \bigl(1 - P_i(\theta; \boldsymbol{\gamma}_i)\bigr)^{1 - x_{pi}} \quad (5)$$
In Equations (1) and (2), the normal distribution is typically fixed. Hence, values of the density $f$ evaluated at the discrete quadrature grid are known as $f_t = f(\theta_t)$ with $\sum_{t=1}^T f_t = 1$ (after normalization). Note also that the data-generating model (1) can be rewritten by replacing the integration with summation as

$$\pi_p = P(\mathbf{X} = \mathbf{x}_p) = \sum_{t=1}^T f_p(\theta_t) f_t \quad (6)$$

Clearly, it also holds that $\sum_{p=1}^{2^I} \pi_p = 1$.
The estimation of the unknown IRF $P_i$ in Equation (3) is based on individual posterior distributions [20,21,22]. For each pattern $p$ and each quadrature point $\theta_t$, the posterior distribution $h_p(\theta_t)$ is given by

$$h_p(\theta_t) = \frac{f_p(\theta_t; \boldsymbol{\gamma}) f_t}{\pi_p(\boldsymbol{\gamma})} \quad (7)$$

where $\pi_p(\boldsymbol{\gamma}) = \sum_{t=1}^T f_p(\theta_t; \boldsymbol{\gamma}) f_t$. Finally, the observed IRF $\tilde{P}_i$ as an estimate of $P_i$ is defined by

$$\tilde{P}_i(\theta_t) = \frac{\sum_{p=1}^{2^I} x_{pi} \pi_p h_p(\theta_t)}{\sum_{p=1}^{2^I} \pi_p h_p(\theta_t)} \quad (8)$$
Then, the RMSD statistic from (3) can be rewritten as:

$$\mathrm{RMSD}_i = \sqrt{\sum_{t=1}^T \bigl(\tilde{P}_i(\theta_t) - P_i(\theta_t; \boldsymbol{\gamma}_i)\bigr)^2 f_t} \quad (9)$$

The RMSD statistic in Equation (9) refers to a population value because the probabilities $\pi_p$ of item response patterns are known. For sample data, observed frequencies $\hat{\pi}_p$ instead of $\pi_p$ are used for defining an estimate of the true IRF. This estimate is given by:

$$\hat{P}_i(\theta_t) = \frac{\sum_{p=1}^{2^I} x_{pi} \hat{\pi}_p h_p(\theta_t)}{\sum_{p=1}^{2^I} \hat{\pi}_p h_p(\theta_t)} \quad (10)$$

A sample-based RMSD statistic is then defined as:

$$\widehat{\mathrm{RMSD}}_i = \sqrt{\sum_{t=1}^T \bigl(\hat{P}_i(\theta_t) - P_i(\theta_t; \hat{\boldsymbol{\gamma}}_i)\bigr)^2 f_t} \quad (11)$$

Note that the item parameter $\hat{\boldsymbol{\gamma}}_i$ in (11) might be known or unknown.
The RMSD fit statistic has broad applicability in educational assessment [21,23,24,25,26]. It is primarily used as an effect size of item misfit [15,27], and RMSD values larger than 0.05 or 0.08 might indicate a notable violation of the parametric IRF assumption [19,22,28,29,30]. The RMSD item fit statistic bears similarity to residual-based test statistics developed by Haberman and colleagues [15,31,32,33]. Related research based on residual statistics can be found in [26,34].
2.1. Unbiasedness of the Population Value of the RMSD Statistic for a Correctly Specified IRT Model
We now show unbiasedness of the population RMSD statistic (see Equation (9)) if the IRT model is correctly specified. In this case, we have $P_i(\theta_t) = P_i(\theta_t; \boldsymbol{\gamma}_i)$ for all $i$ and $t$, $f_p(\theta_t) = f_p(\theta_t; \boldsymbol{\gamma})$ for all $p$ and $t$, and $\pi_p = \pi_p(\boldsymbol{\gamma})$ for all $p$. The finding has also been presented by [32]. We only have to show that $\tilde{P}_i(\theta_t) = P_i(\theta_t; \boldsymbol{\gamma}_i)$. We analyze the numerator and the denominator of $\tilde{P}_i$ in (8). For the numerator of $\tilde{P}_i(\theta_t)$, we get:

$$\sum_{p=1}^{2^I} x_{pi} \pi_p h_p(\theta_t) = \sum_{p=1}^{2^I} x_{pi} f_p(\theta_t) f_t = f_t \sum_{p=1}^{2^I} x_{pi} f_p(\theta_t) = f_t P_i(\theta_t) \quad (12)$$

where the last equality holds because summing the pattern likelihoods over all patterns with $x_{pi} = 1$ marginalizes out the remaining items. For the denominator of $\tilde{P}_i(\theta_t)$, we obtain:

$$\sum_{p=1}^{2^I} \pi_p h_p(\theta_t) = \sum_{p=1}^{2^I} f_p(\theta_t) f_t = f_t \quad (13)$$

Hence, we get $\tilde{P}_i(\theta_t) = P_i(\theta_t) = P_i(\theta_t; \boldsymbol{\gamma}_i)$ for all $t$. If the IRT model is correctly specified, the RMSD population value is zero, and we get unbiasedness.
2.2. Population RMSD Statistic for Misspecified IRT Models
Now, we derive the population value of the RMSD statistic if the IRT model is misspecified. This means that the assumed parametric IRF $P_i(\cdot; \boldsymbol{\gamma}_i)$ differs from the true data-generating IRF $P_i$. Consequently, it follows that $\pi_p \neq \pi_p(\boldsymbol{\gamma})$. Define $d_p = \pi_p / \pi_p(\boldsymbol{\gamma})$. We now study the numerator and the denominator of $\tilde{P}_i$ in Equation (8). For the numerator, we get

$$\sum_{p=1}^{2^I} x_{pi} \pi_p h_p(\theta_t) = f_t \sum_{p=1}^{2^I} x_{pi} d_p f_p(\theta_t; \boldsymbol{\gamma}) \quad (14)$$

where $d_p = 1 + \delta_p$ with $\delta_p = (\pi_p - \pi_p(\boldsymbol{\gamma})) / \pi_p(\boldsymbol{\gamma})$. Similar calculations for the denominator result in

$$\sum_{p=1}^{2^I} \pi_p h_p(\theta_t) = f_t \sum_{p=1}^{2^I} d_p f_p(\theta_t; \boldsymbol{\gamma}) \quad (15)$$

Hence, the observed IRF $\tilde{P}_i$ can be determined as:

$$\tilde{P}_i(\theta_t) = \frac{\sum_{p=1}^{2^I} x_{pi} d_p f_p(\theta_t; \boldsymbol{\gamma})}{\sum_{p=1}^{2^I} d_p f_p(\theta_t; \boldsymbol{\gamma})} \quad (16)$$

By applying a Taylor expansion of (16) and ignoring higher-order terms, we get:

$$\tilde{P}_i(\theta_t) \approx P_i(\theta_t; \boldsymbol{\gamma}_i) + \sum_{p=1}^{2^I} \bigl(x_{pi} - P_i(\theta_t; \boldsymbol{\gamma}_i)\bigr) \delta_p f_p(\theta_t; \boldsymbol{\gamma}) \quad (17)$$

Notably, misspecified IRFs enter the pattern probabilities $\pi_p$, which subsequently enter the $\delta_p$ terms in (17). Interestingly, the observed IRF of fitting items (i.e., items with $P_i = P_i(\cdot; \boldsymbol{\gamma}_i)$) will also typically be biased if there are some misfitting items in the test. Therefore, the RMSD statistic for fitting items will be larger than zero. It is unclear how Equation (17) affects the RMSD population values for misfitting items. In our experience from empirical applications, the RMSD value for misfitting items will be much smaller than the pseudo-true RMSD value defined in Equation (3) (see [21]).
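This spillover effect can be checked numerically. The following sketch is illustrative only (hypothetical parameters: the true IRF of the first item uses discrimination 2, while the assumed model fixes all discriminations at 1); it evaluates Equations (6) to (9) directly:

```python
import numpy as np

def quadrature(T=41):
    """Equidistant grid on [-6, 6] with renormalized standard normal weights f_t."""
    thetas = np.linspace(-6.0, 6.0, T)
    f = np.exp(-0.5 * thetas**2)
    return thetas, f / f.sum()

def irf_table(a, b, thetas):
    """P[i, t] = Psi(a_i * (theta_t - b_i))."""
    return 1.0 / (1.0 + np.exp(-a[:, None] * (thetas[None, :] - b[:, None])))

def pattern_likelihoods(P):
    """X[p] = response pattern p; L[p, t] = f_p(theta_t) under the IRFs in P."""
    I = P.shape[0]
    X = np.array([[(p >> i) & 1 for i in range(I)] for p in range(2**I)])
    L = np.prod(np.where(X[:, :, None] == 1, P[None, :, :], 1.0 - P[None, :, :]), axis=1)
    return X, L

def population_rmsd(P_true, P_model, f):
    """Population RMSD per item via Equations (6)-(9)."""
    X, L_true = pattern_likelihoods(P_true)
    _, L_model = pattern_likelihoods(P_model)
    pi_true = L_true @ f                            # Eq. (6): true pattern probabilities
    pi_model = L_model @ f                          # model-implied normalizing constants
    H = L_model * f[None, :] / pi_model[:, None]    # Eq. (7): misspecified posteriors
    W = pi_true[:, None] * H                        # pi_p * h_p(theta_t)
    P_obs = (X.T @ W) / W.sum(axis=0)[None, :]      # Eq. (8): observed IRFs
    return np.sqrt((((P_obs - P_model) ** 2) * f[None, :]).sum(axis=1))  # Eq. (9)

thetas, f = quadrature()
b = np.array([-1.0, 0.0, 1.0])
P_true = irf_table(np.array([2.0, 1.0, 1.0]), b, thetas)  # first item truly steeper
P_model = irf_table(np.ones(3), b, thetas)                # assumed 1PL with a = 1
rmsd = population_rmsd(P_true, P_model, f)
```

The misfitting first item receives the largest population RMSD, but the two correctly specified items also obtain strictly positive values, because the misspecified posteriors in (7) redistribute part of the misfit across items.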
2.3. On the Positive Bias of the Sample-Based RMSD Statistic
We now show that the expected value of the sample-based RMSD statistic $\widehat{\mathrm{RMSD}}_i$ is typically larger than the population RMSD statistic. The reason is that we now use observed frequencies $\hat{\pi}_p$ instead of item response pattern probabilities $\pi_p$ in the computation of the estimated observed IRF $\hat{P}_i$. By applying a multivariate Taylor expansion of first order, we obtain

$$\hat{P}_i(\theta_t) \simeq \tilde{P}_i(\theta_t) + \sum_{p=1}^{2^I} \frac{\partial \tilde{P}_i(\theta_t)}{\partial \pi_p} (\hat{\pi}_p - \pi_p) \quad (18)$$

We can simplify (18) to

$$\hat{P}_i(\theta_t) \simeq \tilde{P}_i(\theta_t) + \frac{1}{\sum_{q=1}^{2^I} \pi_q h_q(\theta_t)} \sum_{p=1}^{2^I} \bigl(x_{pi} - \tilde{P}_i(\theta_t)\bigr) h_p(\theta_t) (\hat{\pi}_p - \pi_p) \quad (19)$$

Therefore, we can write

$$\hat{P}_i(\theta_t) - P_i(\theta_t; \boldsymbol{\gamma}_i) \simeq \bigl(\tilde{P}_i(\theta_t) - P_i(\theta_t; \boldsymbol{\gamma}_i)\bigr) + u_i(\theta_t) \quad (20)$$

where $u_i(\theta_t)$ is the second term after the $\simeq$ sign in (19) and has an expected value of zero because $\mathbb{E}(\hat{\pi}_p) = \pi_p$. Hence, we get an expected value of the square of the sample-based RMSD statistic of

$$\mathbb{E}\bigl(\widehat{\mathrm{RMSD}}_i^2\bigr) \simeq \sum_{t=1}^T \bigl(\tilde{P}_i(\theta_t) - P_i(\theta_t; \boldsymbol{\gamma}_i)\bigr)^2 f_t + \sum_{t=1}^T \mathrm{Var}\bigl(u_i(\theta_t)\bigr) f_t = \mathrm{RMSD}_i^2 + \sum_{t=1}^T \mathrm{Var}\bigl(u_i(\theta_t)\bigr) f_t \quad (21)$$

As a consequence of (21), sample-based estimates of the RMSD statistic typically turn out to be larger on average than their population-based counterparts.
5. Discussion
In this article, we systematically studied the behavior of the RMSD estimators in infinite sample sizes (i.e., at the population level) and finite sample sizes. It turned out that the population RMSD value depended on the proportion of misfitting items. With a larger proportion of misfitting items, RMSD values of misfitting items decrease, but RMSD values of fitting items increase. This means that the RMSD item fit statistic must always be interpreted as a relative fit statistic. The RMSD item fit statistic depends on the properties of the other items appearing in the test.
As with all simulation studies, our study is limited to the studied conditions. We only investigated relatively short test lengths, although the findings can be expected to generalize to longer tests. We also used only a few simulation factor levels for the proportion of misfitting items. Finally, we restricted ourselves to the study of a misspecified 1PL model (see [21,46] for more complex misspecified item response functions) and uniform differential item functioning.
Moreover, it was demonstrated that the RMSD estimator was positively biased in smaller samples. This can be explained by the fact that the RMSD is defined as a discrepancy statistic: due to sampling variability, such a statistic will always be positive in small samples. This property has also been shown for global fit statistics in structural equation modeling [47,48]. Following the developments in structural equation modeling [49], we pursued the route of constructing bias-corrected estimators for the RMSD based on an analytical treatment as well as a fully computational solution based on bootstrap and jackknife resampling approaches. While the original RMSD estimator showed a positive bias for misfitting items, our proposed bias-corrected RMSD alternatives were negatively biased. However, the analytical bias-corrected RMSD estimator had the most desirable properties and can be recommended for default use in applied research. Future research might consider averaging the original RMSD estimate and a bias-corrected RMSD estimate to obtain an estimator with an even lower bias that does not increase the standard deviation of the resulting estimator.
Interestingly, other fit statistics such as item outfit [50] or the $\chi^2$-type statistic [46] also involve the distance $P_i(\theta) - P_i(\theta; \boldsymbol{\gamma}_i)$ as an effect size but replace the weighting by the density $f$ with a weighting function that standardizes the squared distance by $P_i(\theta; \boldsymbol{\gamma}_i)(1 - P_i(\theta; \boldsymbol{\gamma}_i))$. Pursuing this idea further, it might be interesting to investigate a more general RMSD statistic of the type

$$\mathrm{RMSD}_i = \sqrt{\int \bigl(P_i(\theta) - P_i(\theta; \boldsymbol{\gamma}_i)\bigr)^2 \omega(\theta) \,\mathrm{d}\theta}$$

with an appropriate weighting function $\omega$ (see also [51]).
Finally, we argued that the RMSD values depend on test length and the proportion of misfitting items. Hence, using a general cutoff value for declaring misfitting items might not be justified. Indeed, von Davier and Bezirhan [52] also argued that misfitting items should be detected by assuming a mixture distribution of RMSD values of misfitting and fitting items. Items with large RMSD values are treated as outliers and can be detected by techniques from robust statistics ([52]; see also [53]). We think that such an approach is a promising direction for future research. The approach of von Davier and Bezirhan implies that RMSD cutoff values must be selected depending on the conditions of a particular dataset. Identifying misfitting items as outliers corresponds to the idea that only a portion of the items in a test do not follow an assumed functional form of the item response function. In our opinion, it can be questioned whether item misfit might instead be unsystematically distributed. We argued elsewhere that a particular IRT model is chosen on purpose, and item or model misfit should play no or only a minor role in model selection [54].