Comparing Weighted RMSD, Weighted MD, Infit, and Outfit Item Fit Statistics Under Uniform Differential Item Functioning

Robitzsch, Alexander

doi:10.3390/math13233752

by Alexander Robitzsch ¹,²

¹ IPN – Leibniz Institute for Science and Mathematics Education, Olshausenstraße 62, 24118 Kiel, Germany

² Centre for International Student Assessment (ZIB), Olshausenstraße 62, 24118 Kiel, Germany

Mathematics 2025, 13(23), 3752; https://doi.org/10.3390/math13233752 (registering DOI)

Submission received: 4 October 2025 / Revised: 17 November 2025 / Accepted: 18 November 2025 / Published: 23 November 2025

(This article belongs to the Special Issue Computational Statistics, Data Analysis and Applications)

Abstract

In educational large-scale assessment studies, uniform differential item functioning (DIF) across countries often challenges the application of a common item response model, such as the two-parameter logistic (2PL) model, to all participating countries. DIF occurs when certain items provide systematic advantages or disadvantages to specific groups, potentially biasing ability estimates and secondary analyses. Identifying misfitting items caused by DIF is therefore essential, and several item fit statistics have been proposed in the literature for this purpose. This article investigates the performance of four commonly used item fit statistics under uniform DIF: the weighted root mean square deviation (RMSD), the weighted mean deviation (MD), the infit, and the outfit statistics. Analytical approximations were derived to relate the uniform DIF effect size to these item fit statistics, and the theoretical findings were confirmed through a comprehensive simulation study. The results indicate that distribution-weighted RMSD and MD statistics are less sensitive to DIF in very easy or very difficult items, whereas difficulty-weighted RMSD and MD exhibit consistent detection performance across all item difficulty levels. However, the sampling variance of the difficulty-weighted statistics is notably higher for items with extreme difficulty. Infit and outfit statistics were largely ineffective in detecting DIF in items of moderate difficulty, with sensitivity limited to very easy or very difficult items. To illustrate the practical application of these statistics, they were computed for the PISA 2006 reading study, and the distribution of the statistics across participating countries was descriptively examined. The findings guide selecting appropriate item fit statistics in large-scale assessments and highlight the strengths and limitations of different approaches under uniform DIF conditions.

Keywords: item response model; 2PL model; differential item functioning; RMSD; MD; infit; outfit; item fit

MSC: 62H25; 62P25

1. Introduction

Item response theory (IRT) models [1,2,3,4,5,6] are multivariate statistical models designed to analyze a vector of discrete random variables. IRT models are widely applied in the social sciences, particularly in educational large-scale assessment (LSA; [7,8]) studies, where cognitive tasks are administered. The use of IRT is advantageous because it accommodates complex test designs [9] and addresses measurement error in estimated abilities for secondary analyses [10].

This article focuses on dichotomous (i.e., binary) random variables. Let

X = (X_{1}, \dots, X_{I})

denote a vector of I random variables

X_{i} \in \{0, 1\}

, commonly referred to as items or scored item responses. A unidimensional IRT model [11] defines a parametric model for the probability distribution

P (X = x)

for

x = (x_{1}, \dots, x_{I}) \in {\{0, 1\}}^{I}

as

P (X = x; δ, γ) = \int \prod_{i = 1}^{I} [{P_{i} (θ; γ_{i})}^{x_{i}} {(1 - P_{i} (θ; γ_{i}))}^{1 - x_{i}}] ϕ (θ; μ, σ) d θ,

(1)

where

ϕ

denotes the density of the normal distribution parameterized by the mean

μ

and standard deviation (SD)

σ

. The distribution parameters

μ

and

σ

of the latent variable

θ

, often referred to as a trait or ability variable, are collected in the vector

δ = (μ, σ)

. The vector

γ = (γ_{1}, \dots, γ_{I})

comprises the item parameters for the item response functions (IRFs)

P_{i} (θ; γ_{i}) = P (X_{i} = 1 | θ)

for

i = 1, \dots, I

. The IRF of the two-parameter logistic (2PL) model [12] is given by

P_{i} (θ; γ_{i}) = Ψ (a_{i} (θ - b_{i})),

(2)

where

a_{i}

and

b_{i}

denote the item discrimination and item difficulty parameters, respectively, and

Ψ (x) = {(1 + exp (- x))}^{- 1}

is the logistic distribution function. In this formulation, the item parameter vector is

γ_{i} = (a_{i}, b_{i})

.

The Rasch model [13,14] represents a special case of the 2PL model in which all item discrimination parameters

a_{i}

are fixed to 1.
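As a minimal illustration of the IRF in (2), the following Python sketch evaluates the 2PL and Rasch response probabilities. The function names `logistic`, `irf_2pl`, and `irf_rasch` are hypothetical helpers introduced here for illustration and are not taken from any software referenced in this article.

```python
import math

def logistic(x):
    """Logistic distribution function Psi(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def irf_2pl(theta, a, b):
    """2PL item response function: Psi(a * (theta - b))."""
    return logistic(a * (theta - b))

def irf_rasch(theta, b):
    """Rasch model as the special case of the 2PL model with a = 1."""
    return irf_2pl(theta, 1.0, b)

# At theta = b the success probability is 0.5 for any discrimination.
p_at_difficulty = irf_2pl(0.5, 1.3, 0.5)
# One logit above the difficulty in the Rasch model: Psi(1).
p_above = irf_rasch(1.0, 0.0)
```

The item difficulty is the ability value at which the success probability equals 0.5, and the discrimination controls how steeply the probability changes around that point.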

For a sample of N individuals with independently and identically distributed observations (also referred to as cases, subjects, students, or persons) based on the realizations

x_{1}, \dots, x_{N}

from the distribution of the random variable

X

, the model parameters of the IRT model in (1) can be consistently estimated using marginal maximum likelihood estimation (MML; [15,16]), typically implemented with the expectation–maximization algorithm [17,18].

In LSA applications such as the Programme for International Student Assessment (PISA; [19]), a parametric model for the IRFs in (1) is typically specified. These studies involve a large number of countries, and it is generally assumed that the item parameters are identical across countries, implying parameter invariance. However, this assumption may be violated in practice because certain items can provide systematic advantages or disadvantages to specific countries. This phenomenon is known as differential item functioning (DIF; [20,21]), although alternative terms such as measurement bias or item bias are also used [22].

As a result, the assumed IRF represents a (slight) misspecification of the true IRT model (1) for a given country (or group, henceforth). The multivariate random vector

X

is represented as

P (X = x; δ, γ) ≃ \int \prod_{i = 1}^{I} [{P_{i}^{*} (θ; γ_{i})}^{x_{i}} {(1 - P_{i}^{*} (θ; γ_{i}))}^{1 - x_{i}}] ϕ (θ; μ, σ) d θ,

(3)

where

P_{i}^{*}

denotes the assumed IRF and

P_{i}

represents the IRF in the data-generating model for the group. In practical applications, it is typically expected that the approximation of

P_{i}

by

P_{i}^{*}

introduces only minimal distortion in estimating the distribution parameters

μ

and

σ

.

The assessment of the adequacy of parametric IRFs (i.e., item fit; [23,24,25]) is a central topic in psychometrics. The discrepancy between the true IRF

P_{i}

and the assumed parametric IRF

P_{i}^{*}

should be quantified using an appropriate effect size measure. Ideally, statistical inference for this effect size should also be available. Of primary interest is the identification of misfitting items i for which the assumed IRFs

P_{i}^{*}

deviate substantially from

P_{i}

.

A wide range of item fit statistics has been proposed in the literature [25]. The present study focuses on the root mean square deviation (RMSD; [26,27,28,29,30,31,32,33,34,35,36,37,38]) and the related mean deviation (MD) statistic, as well as the infit and outfit statistics [39,40]. The motivation for examining these statistics arises from their operational use in current PISA studies. Item fit statistics are routinely analyzed in both the PISA field and main studies [41]. The field study generally involves relatively small sample sizes per item at the country level. In this context, the fit statistics are computed for each country and reported to participating PISA countries to identify potential issues with items, such as translation errors or country-specific test administration problems at the item level. The investigation of item fit statistics under small sample sizes is therefore particularly relevant, as the field study reports are expected to provide reliable information—that is, item fit statistics with sufficiently small sampling error. In the PISA main study, RMSD and MD item fit statistics are employed to identify items whose parameters deviate substantially from the assumed international item parameters defined across countries. Items flagged for DIF are then assigned country-specific parameters in the operational scaling procedure of the PISA main study [42]. Consequently, reliable DIF detection based on these item fit statistics is essential, as it influences decisions concerning the scaling model applied in official PISA reporting.

The performance of these item fit statistics is investigated under uniform DIF, that is, when the assumed item difficulties

b_{i}

in the IRF

P_{i}^{*}

of the 2PL model differ from the true item difficulties in the IRF

P_{i}

. The behavior of these statistics under uniform DIF is examined through analytical derivations and a simulation study. To the best of current knowledge, this study is the first to provide analytical approximations of these item fit statistics under DIF conditions.

The remainder of the article is structured as follows. Section 2 presents analytical results for various weighted RMSD and MD statistics under DIF. Section 3 provides analytical findings for the infit and outfit statistics under DIF. Results from a simulation study examining the performance of these item fit statistics under uniform DIF are reported in Section 4. An empirical example using PISA 2006 data is presented in Section 5. Finally, Section 6 concludes with a discussion.

2. Weighted RMSD and Weighted MD Under DIF

In this section, the definitions of weighted RMSD and weighted MD statistics, as proposed by Joo et al. [43], are reviewed. These item fit statistics are examined under uniform DIF in the 2PL model.

Assume that

P_{i}^{*} (θ) = Ψ (a_{i} (θ - b_{i}))

denotes the assumed IRF, while the IRF in the data-generating model is given by

P_{i} (θ) = Ψ (a_{i} (θ - b_{i} - δ_{i}))

. Here,

δ_{i}

represents the uniform DIF effect, also interpreted as a DIF effect size on the logit metric within the 2PL model. The RMSD and MD statistics assign weights to deviations

P_{i} (θ) - P_{i}^{*} (θ)

to provide a summarized measure of model–data discrepancy.

2.1. Weighted RMSD and Weighted MD

In the definitions of the weighted RMSD and weighted MD statistics, a weighting function

w_{i}

for item i is specified such that it integrates to 1; that is,

\int w_{i} (θ) d θ = 1

(see [43]). The weighting function

w_{i}

may be item-specific or common across items.

The weighted MD statistic is defined as

{MD}_{w, i} = \int [P_{i} (θ) - P_{i}^{*} (θ)] w_{i} (θ) d θ

(4)

and represents the weighted average of the IRF differences

P_{i} - P_{i}^{*}

. The MD statistic is signed and operates on the probability difference metric. Since

P_{i} (θ)

and

P_{i}^{*} (θ)

take values in

[0, 1]

, the MD statistic is bounded between

- 1

and 1, although substantial deviations from 0 are unlikely in practice. This statistic focuses on average differences implied by a misfitting IRF.

The RMSD statistic summarizes squared IRF differences by computing

{RMSD}_{w, i} = \sqrt{\int {[P_{i} (θ) - P_{i}^{*} (θ)]}^{2} w_{i} (θ) d θ} .

(5)

Unlike the MD statistic, differences between

P_{i}

and

P_{i}^{*}

that cancel out on average do not vanish under RMSD. Consequently, RMSD can reveal discrepancies that remain hidden in MD. In particular, the presence of nonuniform DIF [21] (i.e., DIF in item discriminations

a_{i}

) may be detected by RMSD but remain undetected by MD.

The MD and RMSD statistics are mathematically related. Jensen’s inequality [44] implies

{[E (X)]}^{2} \leq E (X^{2})

for any square-integrable random variable X, leading to

| E (X) | \leq \sqrt{E (X^{2})} .

(6)

Applying this result to

X = P_{i} (θ) - P_{i}^{*} (θ)

, where the expectation is taken with respect to the probability density

w_{i} (θ)

, yields

| {MD}_{w, i} | \leq {RMSD}_{w, i},

(7)

demonstrating that the absolute value of the MD statistic is bounded by the RMSD statistic.
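The definitions (4) and (5) and the bound (7) can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a midpoint rule on a finite grid, renormalizes the weights on that grid, and the helper name `weighted_md_rmsd` and all parameter values are hypothetical choices made here.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def weighted_md_rmsd(p_true, p_assumed, weight, lo=-8.0, hi=8.0, n=4000):
    """Midpoint-rule versions of (4) and (5); weights are renormalized on the grid."""
    step = (hi - lo) / n
    total_w = md = msd = 0.0
    for k in range(n):
        theta = lo + (k + 0.5) * step
        w = weight(theta)
        diff = p_true(theta) - p_assumed(theta)
        total_w += w
        md += diff * w
        msd += diff * diff * w
    return md / total_w, math.sqrt(msd / total_w)

def assumed(theta):
    return logistic(theta)            # a = 1, b = 0

def weight(theta):
    return normal_pdf(theta, 0.0, 1.0)

# Uniform DIF: the true difficulty is shifted by 0.5 logits.
md_u, rmsd_u = weighted_md_rmsd(lambda t: logistic(t - 0.5), assumed, weight)
# Nonuniform DIF: the true discrimination is 1.6 instead of 1.
md_n, rmsd_n = weighted_md_rmsd(lambda t: logistic(1.6 * t), assumed, weight)
```

In the nonuniform case with the item centered at the group mean, the positive and negative IRF differences cancel, so the MD is essentially zero while the RMSD remains clearly positive.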

2.2. Distribution-Weighted RMSD and MD Statistics

The RMSD and MD statistics were originally defined using a weighting function given by the normal density

w (θ) = ϕ (θ; μ, σ)

(see [28,32,36]), where

μ

and

σ

denote the mean and SD of the

θ

distribution in the group. This approach is referred to as distribution-weighted (dist) item fit statistics for RMSD and MD. In this case, the weighting function is identical across items and depends solely on the location and shape of the

θ

distribution within the group.

To evaluate the integrals appearing in the RMSD and MD statistics, products of logit IRFs and normal densities must be computed. These integrals become easier to handle when the logit IRF is approximated by a probit IRF, resulting in expressions involving normal distribution functions and densities. In some cases, no closed-form solution exists, or the available closed form is too complex to interpret. In such situations, linear and quadratic Taylor expansions with respect to the uniform DIF effect

δ_{i}

are applied, enabling the examination of the behavior of the item fit statistics for

δ_{i}

values close to zero.

The difference

P_{i} - P_{i}^{*}

between the IRFs

P_{i}

and

P_{i}^{*}

can be approximated as

P_{i} (θ) - P_{i}^{*} (θ) ≃ Φ (D^{- 1} a_{i} (θ - b_{i} - δ_{i})) - Φ (D^{- 1} a_{i} (θ - b_{i})) ≃ - δ_{i} a_{i} D^{- 1} ϕ (D^{- 1} a_{i} (θ - b_{i})),

(8)

where

ϕ

denotes the standard normal density function. The first approximation in (8) replaces the logit IRFs with probit IRFs, expressed through the standard normal distribution function

Φ

, using the constant

D = 1.701

(see [45,46,47]). The absolute difference between the logit IRF and its probit approximation does not exceed 0.0095. The second approximation in (8) applies a first-order Taylor expansion to the difference of probit IRFs.
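The quality of the probit approximation in (8) can be inspected with a simple grid search. The following sketch assumes a grid from -8 to 8 in steps of 0.001, which is a choice made here for illustration, and reports the largest absolute difference between the logit IRF and its probit counterpart with D = 1.701.

```python
import math

D = 1.701  # scaling constant used in (8)

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_cdf(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Evaluate |Psi(x) - Phi(x / D)| on a fine grid and keep the maximum.
grid = [k / 1000.0 for k in range(-8000, 8001)]
max_abs_diff = max(abs(logistic(x) - normal_cdf(x / D)) for x in grid)
```

The resulting maximum is slightly below 0.01, in line with the bound stated above.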

The distribution-weighted MD statistic can be approximately expressed as (see also [48])

{MD}_{dist, i} ≃ - δ_{i} \frac{a_{i}}{\sqrt{2 π (a_{i}^{2} σ^{2} + D^{2})}} exp [- \frac{a_{i}^{2} {(μ - b_{i})}^{2}}{2 (a_{i}^{2} σ^{2} + D^{2})}],

(9)

where the approximations in (8) and (A1) from Appendix A were applied. Setting

a_{i} = 1

and

σ = 1

simplifies (9) to

{MD}_{dist, i} ≃ - 0.202 δ_{i} exp (- 0.128 {(μ - b_{i})}^{2}) .

(10)

Hence, it holds that

{MD}_{dist, i} ≃ - 0.202 δ_{i} for | μ - b_{i} | = 0 .

(11)

Furthermore, we have

{MD}_{dist, i} ≃ - 0.178 δ_{i}

for

| μ - b_{i} | = 1

, and

{MD}_{dist, i} ≃ - 0.121 δ_{i}

for

| μ - b_{i} | = 2

. This indicates that the absolute value of the MD statistic decreases as the assumed item difficulties deviate further from the group mean

μ

.
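The approximation (9) can be compared with a direct numerical evaluation of (4) using logistic IRFs. The sketch below is illustrative only: the function names, the midpoint rule, and the chosen DIF effect of 0.1 are assumptions made here, not part of the original derivation.

```python
import math

D = 1.701

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def md_dist_approx(delta, a, b, mu, sigma):
    """Closed-form approximation (9) of the distribution-weighted MD."""
    v = a * a * sigma * sigma + D * D
    return (-delta * a / math.sqrt(2.0 * math.pi * v)
            * math.exp(-a * a * (mu - b) ** 2 / (2.0 * v)))

def md_dist_numeric(delta, a, b, mu, sigma, n=4000, width=8.0):
    """Midpoint rule for (4) with logistic IRFs and a normal weight."""
    lo = mu - width * sigma
    step = 2.0 * width * sigma / n
    total = 0.0
    for k in range(n):
        t = lo + (k + 0.5) * step
        diff = logistic(a * (t - b - delta)) - logistic(a * (t - b))
        total += diff * normal_pdf(t, mu, sigma) * step
    return total

# Coefficients of delta for a = 1, sigma = 1 and |mu - b| = 0, 1, 2.
coefs = [md_dist_approx(1.0, 1.0, d, 0.0, 1.0) for d in (0.0, 1.0, 2.0)]
approx_value = md_dist_approx(0.1, 1.0, 0.0, 0.0, 1.0)
numeric_value = md_dist_numeric(0.1, 1.0, 0.0, 0.0, 1.0)
```

For a small DIF effect, the closed form and the numerical integral agree closely, and the coefficients shrink as the item difficulty moves away from the group mean.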

The distribution-weighted RMSD can be approximated as

{RMSD}_{dist, i} ≃ | δ_{i} | \frac{D^{- 1} a_{i}}{\sqrt{2 π} {(2 a_{i}^{2} σ^{2} / D^{2} + 1)}^{1 / 4}} exp [- \frac{a_{i}^{2} {(μ - b_{i})}^{2}}{2 (2 a_{i}^{2} σ^{2} + D^{2})}],

(12)

based on the approximations in (8) and (A3). With

a_{i} = 1

and

σ = 1

, (12) reduces to

{RMSD}_{dist, i} ≃ 0.206 | δ_{i} | exp (- 0.102 {(μ - b_{i})}^{2}) .

(13)

The coefficient

0.206

in (13) is only slightly larger than the coefficient

0.202

in (11) for the MD. This indicates that the absolute MD and the RMSD are nearly equivalent when uniform DIF is the source of item misfit, at least at the population level. However, the simulation study in Section 4 demonstrates that, in small to moderately sized samples, sample-based RMSD estimates are considerably larger than the corresponding MD estimates. Similar to the MD statistic, the RMSD decreases as the absolute difference between item difficulty

b_{i}

and group mean

μ

increases.

The computation of the RMSD statistic in (12) relies on two key approximations. First, the logit IRF is approximated by a probit IRF. Second, the difference between the group-specific probit IRF and the assumed probit IRF is approximated using a linear Taylor expansion with respect to the uniform DIF effect

δ_{i}

. An upper bound for the resulting approximation error in the computed RMSD statistic is provided in Appendix B.
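The two approximation steps can be checked against a numerical evaluation of (5). In the sketch below, the prefactor of the closed form is written as a_i / (D √(2π) (2 a_i² σ² / D² + 1)^{1/4}), which is the form that reproduces the coefficient 0.206 in (13). The function names and the DIF effect of 0.1 are illustrative assumptions.

```python
import math

D = 1.701

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def rmsd_dist_approx(delta, a, b, mu, sigma):
    """Closed-form approximation of the distribution-weighted RMSD."""
    ratio = 2.0 * a * a * sigma * sigma / (D * D) + 1.0
    scale = a / (D * math.sqrt(2.0 * math.pi) * ratio ** 0.25)
    v = 2.0 * a * a * sigma * sigma + D * D
    return abs(delta) * scale * math.exp(-a * a * (mu - b) ** 2 / (2.0 * v))

def rmsd_dist_numeric(delta, a, b, mu, sigma, n=4000, width=8.0):
    """Midpoint rule for (5) with logistic IRFs and a normal weight."""
    lo = mu - width * sigma
    step = 2.0 * width * sigma / n
    total = 0.0
    for k in range(n):
        t = lo + (k + 0.5) * step
        diff = logistic(a * (t - b - delta)) - logistic(a * (t - b))
        total += diff * diff * normal_pdf(t, mu, sigma) * step
    return math.sqrt(total)

coef_center = rmsd_dist_approx(1.0, 1.0, 0.0, 0.0, 1.0)
approx_value = rmsd_dist_approx(0.1, 1.0, 0.0, 0.0, 1.0)
numeric_value = rmsd_dist_numeric(0.1, 1.0, 0.0, 0.0, 1.0)
```

For this configuration, the closed form and the numerical integral differ by well under 0.002 on the probability metric.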

2.3. Difficulty-Weighted RMSD and MD

Section 2.2 shows that both MD and RMSD depend on the magnitude of

| μ - b_{i} |

. This dependence can be critical when the group mean is much smaller than the item difficulties for most items, which poses challenges for detecting item misfit in low-performing countries in LSA studies [37]. Furthermore, distribution-weighted RMSD and MD are known to perform poorly in detecting uniform DIF for very easy or very difficult items, that is, when

| μ - b_{i} |

is large. To address these limitations, difficulty-weighted RMSD and MD statistics have been proposed in [43].

Difficulty-weighted (denoted by diff) MD and RMSD statistics (i.e.,

{MD}_{diff, i}

and

{RMSD}_{diff, i}

) use the normal density

ϕ (θ; b_{i}, 1)

with mean

b_{i}

and SD 1 as the weighting function. In this way, item misfit receives maximal weight at the assumed item difficulty

b_{i}

. By definition, the difficulty-weighted RMSD and MD statistics can be obtained as special cases of the distribution-weighted counterparts by setting

μ = b_{i}

and

σ = 1

in Equations (9) and (12), yielding

{MD}_{diff, i} ≃ - δ_{i} \frac{a_{i}}{\sqrt{2 π (a_{i}^{2} + D^{2})}},

(14)

{RMSD}_{diff, i} ≃ | δ_{i} | \frac{D^{- 1} a_{i}}{\sqrt{2 π} {(2 a_{i}^{2} / D^{2} + 1)}^{1 / 4}} .

(15)

The item difficulty

b_{i}

no longer appears in the expressions (14) and (15). Consequently, for

a_{i} = 1

, the approximate identities

{MD}_{diff, i} ≃ - 0.202 δ_{i}

and

{RMSD}_{diff, i} ≃ 0.206 | δ_{i} |

hold, regardless of

σ

,

μ

, and

b_{i}

.

2.4. Information-Weighted RMSD and MD

As an alternative weighting function, Joo et al. [43] proposed using the item information function corresponding to

P_{i}^{*}

to weight differences between the IRFs

P_{i}

and

P_{i}^{*}

. The item information function

I_{i} (θ)

for item i is defined as

I_{i} (θ) = a_{i}^{2} P_{i}^{*} (θ) (1 - P_{i}^{*} (θ)) .

(16)

The factor

a_{i}^{2}

cancels in the weighting function

w_{i}

, as this function must integrate to 1. Moreover, the product of logit functions in (16) can be reasonably well approximated by a normal density function

ϕ

as

Ψ (x) (1 - Ψ (x)) ≃ ϕ (x; 0, S),

(17)

where the constant

S = 1.595

is obtained by matching the values of the functions on both sides of the approximation at

x = 0

. The difference between

Ψ (1 - Ψ)

and its normal density approximation does not exceed 0.0115. Applying the approximation (17), the weighting function

w_{i}

for item i is given by

ϕ (a_{i} (θ - b_{i}); 0, S) = \frac{ϕ (θ; b_{i}, S / a_{i})}{a_{i}} \propto ϕ (θ; b_{i}, S / a_{i}) \equiv w_{i} .

(18)

Information-weighted (denoted by info) RMSD and MD statistics are obtained as special cases of distribution-weighted RMSD and MD statistics with

μ = b_{i}

and

σ = S / a_{i}

. Using (9) and (12), the information-weighted MD and RMSD statistics are approximately

{MD}_{info, i} ≃ - δ_{i} \frac{a_{i}}{\sqrt{2 π (S^{2} + D^{2})}} and

(19)

{RMSD}_{info, i} ≃ | δ_{i} | \frac{D^{- 1} a_{i}}{\sqrt{2 π} {(2 S^{2} / D^{2} + 1)}^{1 / 4}} .

(20)

Information-weighted RMSD and MD statistics share with difficulty-weighted statistics the property of providing an assessment of item misfit due to uniform DIF independent of

σ

,

μ

, and

b_{i}

. However, they remain dependent on item discrimination

a_{i}

. For items with

a_{i} = 1

, the difficulty-weighted statistics typically appear larger in absolute value than the corresponding information-weighted MD and RMSD statistics.
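The normal density approximation (17) and the comparison between difficulty-weighted and information-weighted MD can be illustrated numerically. The sketch below derives S from the matching condition at x = 0, so it uses S = 4 / √(2π) rather than the rounded value. The grid and helper names are assumptions introduced here.

```python
import math

D = 1.701
# Matching at x = 0: 1 / (S * sqrt(2 * pi)) = Psi(0) * (1 - Psi(0)) = 0.25.
S = 4.0 / math.sqrt(2.0 * math.pi)

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def info_kernel(x):
    """Psi(x) * (1 - Psi(x)), the kernel of the item information function."""
    p = logistic(x)
    return p * (1.0 - p)

grid = [k / 1000.0 for k in range(-8000, 8001)]
max_abs_diff = max(abs(info_kernel(x) - normal_pdf(x, 0.0, S)) for x in grid)

def md_coef(a, sigma):
    """Absolute MD per unit of delta from (9), evaluated at mu = b_i."""
    return a / math.sqrt(2.0 * math.pi * (a * a * sigma * sigma + D * D))

md_diff_coef = md_coef(1.0, 1.0)  # difficulty weighting: sigma = 1
md_info_coef = md_coef(1.0, S)    # information weighting: sigma = S / a with a = 1
```

Because S exceeds 1, the information weight is more spread out than the difficulty weight for a_i = 1, which yields the smaller MD coefficient of about 0.171 compared with 0.202.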

2.5. Uniformly Weighted RMSD and MD

Finally, Joo et al. [43] proposed a uniform density on a chosen interval

[L, U]

with

Δ \equiv U - L > 0

as a weighting function. The lower and upper bounds of the interval can, for instance, be set to the 1st and 99th percentiles of the group-specific

θ

distribution. For a standard normal distribution, this corresponds to

[- 2.33, 2.33]

. The weighting function

w_{i}

is defined as

w_{i} (θ) = Δ^{- 1} 1 (L \leq θ \leq U),

(21)

where

1

denotes the indicator function, taking the value 0 outside

[L, U]

. As the IRF difference

P_{i} - P_{i}^{*}

is typically small outside

[L, U]

,

w_{i}

can be approximated by the improper constant weighting function

{\tilde{w}}_{i} (θ) \equiv Δ^{- 1}

, which does not integrate to 1.

RMSD and MD statistics using

w_{i}

are referred to as uniformly weighted (denoted by unif) RMSD and MD statistics. For an improper weighting function, Raju [49] provided the explicit expression

\int [P_{i} (θ) - P_{i}^{*} (θ)] d θ = - δ_{i},

(22)

which is termed the signed area statistic. Using this result, the uniformly weighted MD is approximately

{MD}_{unif, i} ≃ - δ_{i} \frac{1}{Δ} .

(23)

It is evident that

{MD}_{unif, i}

approaches zero as the interval length

Δ

increases. Therefore, using the improper uniform weighting function equal to 1 for MD assessment is theoretically more appropriate, yielding a value of

- δ_{i}

. This approach also allows direct estimation of the DIF effect size

δ_{i}

, improving statistical efficiency when the signed area measure is computed. Raju [49] further derived

\int |P_{i} (θ) - P_{i}^{*} (θ)| d θ = | δ_{i} |,

(24)

which is referred to as the unsigned area statistic.

For the uniformly weighted RMSD statistic, the difference in IRFs is approximated using (8). Applying a linear Taylor approximation to the resulting integral yields

{RMSD}_{unif, i} ≃ | δ_{i} | \sqrt{\frac{a_{i}}{2 D Δ \sqrt{π}}} .

(25)

For

a_{i} = 1

, this simplifies to

{RMSD}_{unif, i} ≃ 0.407 | δ_{i} | / \sqrt{Δ}

.
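The signed area result (22) and the approximation (25) can be verified numerically. The following sketch is a check under assumed values: the integration limits of ±40 stand in for the real line, and the item parameters and function names are hypothetical.

```python
import math

D = 1.701

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def irf_difference(theta, delta, a, b):
    return logistic(a * (theta - b - delta)) - logistic(a * (theta - b))

def signed_area(delta, a, b, lo=-40.0, hi=40.0, n=80000):
    """Midpoint rule for the signed area in (22) over a wide interval."""
    step = (hi - lo) / n
    return sum(irf_difference(lo + (k + 0.5) * step, delta, a, b)
               for k in range(n)) * step

def rmsd_unif_approx(delta, a, width):
    """Approximation (25) for an interval of length width."""
    return abs(delta) * math.sqrt(a / (2.0 * D * width * math.sqrt(math.pi)))

def rmsd_unif_numeric(delta, a, b, lo, hi, n=20000):
    """Midpoint rule for (5) with the uniform weight (21) on [lo, hi]."""
    step = (hi - lo) / n
    total = sum(irf_difference(lo + (k + 0.5) * step, delta, a, b) ** 2
                for k in range(n)) * step
    return math.sqrt(total / (hi - lo))

area = signed_area(0.3, 1.2, 0.5)
coef_unit = rmsd_unif_approx(1.0, 1.0, 1.0)
approx_value = rmsd_unif_approx(0.1, 1.0, 4.66)
numeric_value = rmsd_unif_numeric(0.1, 1.0, 0.0, -2.33, 2.33)
```

The signed area equals the negative DIF effect regardless of the discrimination, whereas the uniformly weighted RMSD shrinks with the square root of the interval length.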

2.6. Estimation of Weighted RMSD and Weighted MD

Until this point, the variants of the RMSD and MD statistics have been described only at the population level, assuming that the group-specific IRF

P_{i}

is known, as is the assumed IRF

P_{i}^{*}

. The observed IRF

{\hat{P}}_{i}

, a sample-based estimate of

P_{i}

, is defined for a theta point

θ_{t}

(

t = 1, \dots, T

) as

{\hat{P}}_{i} (θ_{t}) = \frac{\sum_{n = 1}^{N} h_{n t} x_{n i}}{\sum_{n = 1}^{N} h_{n t}},

(26)

where

h_{n t}

is the posterior probability of person n at grid point

θ_{t}

. The posterior distribution is typically obtained by fitting the IRT model via MML [15], so the quantities in (26) can be directly computed from software output. A discrete evaluation of the weighting function

w_{i}

is defined as

w_{i t} = \frac{w_{i} (θ_{t})}{\sum_{u = 1}^{T} w_{i} (θ_{u})} for t = 1, \dots, T .

(27)

The sample-based MD statistic is then defined as

{\hat{MD}}_{w, i} = \sum_{t = 1}^{T} [{\hat{P}}_{i} (θ_{t}) - P_{i}^{*} (θ_{t})] w_{i t} .

(28)

Following the derivations in [35], it holds that

E [{\hat{MD}}_{w, i}] = {MD}_{w, i} + \sum_{t = 1}^{T} E [{\hat{P}}_{i} (θ_{t}) - P_{i} (θ_{t})] w_{i t} = {MD}_{w, i},

(29)

because

E [{\hat{P}}_{i} (θ_{t})] = P_{i} (θ_{t})

(at least, asymptotically). Thus, on average, the MD statistic is largely unaffected by sampling errors in the estimate of

P_{i}

.

The sample-based RMSD is defined as

{\hat{RMSD}}_{w, i} = \sqrt{\sum_{t = 1}^{T} {[{\hat{P}}_{i} (θ_{t}) - P_{i}^{*} (θ_{t})]}^{2} w_{i t}} .

(30)

Along the lines of [35] (see also [48]), it follows that

E {({\hat{RMSD}}_{w, i})}^{2} = {({RMSD}_{w, i})}^{2} + \sum_{t = 1}^{T} E {[{\hat{P}}_{i} (θ_{t}) - P_{i} (θ_{t})]}^{2} w_{i t} .

(31)

Equation (31) shows that the sampling variance of

{\hat{P}}_{i} (θ_{t})

contributes explicitly to the expected value of the sample RMSD. As a result, the sample-based RMSD typically exceeds the population-based RMSD. The quantity

E {[{\hat{P}}_{i} (θ_{t}) - P_{i} (θ_{t})]}^{2}

is larger for extreme

θ_{t}

points, reflecting higher variability of the observed IRF at the tails. This variance contribution can be minimized by choosing

w_{i t}

as precision weights, that is, the inverse variances associated with

{\hat{P}}_{i} (θ_{t})

.
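The estimation steps in (26), (27), (28), and (30) can be sketched as follows. The posterior probabilities, responses, and three-point grid are toy values invented for illustration. In practice these quantities would come from the output of the fitted IRT model, and the function names are hypothetical.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def observed_irf(responses, posterior):
    """Equation (26): posterior-weighted proportion correct per grid point."""
    n_grid = len(posterior[0])
    p_hat = []
    for t in range(n_grid):
        num = sum(h[t] * x for h, x in zip(posterior, responses))
        den = sum(h[t] for h in posterior)
        p_hat.append(num / den)
    return p_hat

def discrete_weights(weight_values):
    """Equation (27): normalize weight evaluations to sum to 1."""
    total = sum(weight_values)
    return [w / total for w in weight_values]

def sample_md_rmsd(p_hat, p_assumed, w):
    """Equations (28) and (30)."""
    md = sum((ph - pa) * wt for ph, pa, wt in zip(p_hat, p_assumed, w))
    msd = sum((ph - pa) ** 2 * wt for ph, pa, wt in zip(p_hat, p_assumed, w))
    return md, math.sqrt(msd)

# Toy data (assumed): 4 persons, 3 grid points, one item.
theta_grid = [-1.0, 0.0, 1.0]
posterior = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6], [0.1, 0.2, 0.7]]
responses = [0, 1, 1, 1]

p_hat = observed_irf(responses, posterior)
p_assumed = [logistic(t) for t in theta_grid]   # assumed IRF with a = 1, b = 0
w = discrete_weights([normal_pdf(t, 0.0, 1.0) for t in theta_grid])
md_hat, rmsd_hat = sample_md_rmsd(p_hat, p_assumed, w)
```

With such a small toy sample, the observed IRF is noisy, which illustrates why the sample-based RMSD tends to exceed its population counterpart.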

3. Infit and Outfit Under DIF

In this section, the expected values of the item fit statistics infit and outfit under uniform DIF are examined. These statistics are primarily applied for model fit assessment in the Rasch model [14,39,40,50,51,52].

First, the statistics are formally introduced. Let n denote a case and

x_{n i}

the item response of case n on a dichotomous item i. Define the standardized residual [14,53,54] as

z_{n i} = \frac{x_{n i} - E_{n i}^{*}}{\sqrt{V_{n i}^{*}}},

(32)

where

E_{n i}^{*}

and

V_{n i}^{*}

denote the expected value and variance of

x_{n i}

under the assumption that

P_{i}^{*}

is the true IRF.

Early studies used individual ability estimates

{\hat{θ}}_{n}

to compute

E_{n i}^{*}

and

V_{n i}^{*}

, which are given by

E_{n i}^{*} = P_{i}^{*} ({\hat{θ}}_{n}) and V_{n i}^{*} = P_{i}^{*} ({\hat{θ}}_{n}) [1 - P_{i}^{*} ({\hat{θ}}_{n})] .

(33)

However, using

{\hat{θ}}_{n}

instead of the true ability

θ_{n}

may bias the distribution of

z_{n i}

even if the model holds. To address this, Wu and Adams [53,55] proposed computing infit and outfit statistics based on individual posterior distributions

h_{n}

rather than

{\hat{θ}}_{n}

[53,54]. Accordingly, the expected value and variance are computed as

E_{n i}^{*} = \int P_{i}^{*} (θ) h_{n} (θ) d θ and V_{n i}^{*} = \int P_{i}^{*} (θ) [1 - P_{i}^{*} (θ)] h_{n} (θ) d θ .

(34)

In practice,

θ

is evaluated on a finite grid

θ_{t}

(

t = 1, \dots, T

) and the posterior distribution is represented by probabilities

h_{n t}

. The quantities in (34) are then approximated by

E_{n i}^{*} = \sum_{t = 1}^{T} P_{i}^{*} (θ_{t}) h_{n t} and V_{n i}^{*} = \sum_{t = 1}^{T} P_{i}^{*} (θ_{t}) [1 - P_{i}^{*} (θ_{t})] h_{n t} .

(35)

The outfit statistic is defined as

{Outfit}_{i} = \frac{1}{N} \sum_{n = 1}^{N} z_{n i}^{2} .

(36)

If the IRT model holds (i.e., no item misfit such as uniform DIF), the residuals

z_{n i}

are approximately normally distributed, and

z_{n i}^{2}

follows a chi-square distribution with one degree of freedom. The expected value of

z_{n i}^{2}

is therefore 1. Consequently, deviations of the outfit statistic from 1, such as values below 0.7 or above 1.3, indicate potential item misfit [51,56,57].

The infit statistic weights the squared standardized residuals

z_{n i}^{2}

according to their variances:

{Infit}_{i} = \frac{\sum_{n = 1}^{N} V_{n i}^{*} z_{n i}^{2}}{\sum_{n = 1}^{N} V_{n i}^{*}} .

(37)

Equivalently, it can be expressed as

{Infit}_{i} = \frac{\frac{1}{N} \sum_{n = 1}^{N} {(x_{n i} - E_{n i}^{*})}^{2}}{\frac{1}{N} \sum_{n = 1}^{N} V_{n i}^{*}} .

(38)

As a weighted average of chi-square distributed variables

z_{n i}^{2}

, the infit statistic also has an expected value of 1 under the model, and cutoffs of 0.7 and 1.3 are frequently used to detect item misfit.

Statistical inference for infit and outfit should be applied when stringent model fit assessment is required [50,54,58].
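The posterior-based computation in (35), (36), and (38) can be sketched in a few lines. The data below are toy values assumed for illustration only, and the helper names are hypothetical. The sketch also evaluates the variance-weighted form (37) to show that it coincides with (38).

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def expected_and_variance(p_assumed, h_n):
    """Equation (35): posterior-averaged expectation and variance for one person."""
    e = sum(p * h for p, h in zip(p_assumed, h_n))
    v = sum(p * (1.0 - p) * h for p, h in zip(p_assumed, h_n))
    return e, v

def outfit_infit(responses, posterior, p_assumed):
    """Outfit (36) and infit in the forms (37) and (38)."""
    sq_resid, variances = [], []
    for x, h_n in zip(responses, posterior):
        e, v = expected_and_variance(p_assumed, h_n)
        sq_resid.append((x - e) ** 2)
        variances.append(v)
    z_sq = [r / v for r, v in zip(sq_resid, variances)]
    outfit = sum(z_sq) / len(responses)
    infit_37 = sum(v * z for v, z in zip(variances, z_sq)) / sum(variances)
    infit_38 = sum(sq_resid) / sum(variances)
    return outfit, infit_37, infit_38

# Toy data (assumed): 4 persons, 3 grid points, one item with a = 1, b = 0.
theta_grid = [-1.0, 0.0, 1.0]
posterior = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6], [0.1, 0.2, 0.7]]
responses = [0, 1, 1, 1]
p_assumed = [logistic(t) for t in theta_grid]

outfit, infit_37, infit_38 = outfit_infit(responses, posterior, p_assumed)
```

The outfit gives every squared standardized residual the same weight, whereas the infit downweights persons for whom the item response has little variance.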

The following sections present the expected values of the outfit and infit statistics in the presence of uniform DIF.

3.1. Expected Value of Outfit Statistic

In this section, the expected value of the outfit statistic under uniform DIF is computed. The approach consists of evaluating the expected values of the squared residuals

z_{n i}^{2}

in the outfit statistic while letting N tend to infinity.

First, note that

E {[x_{n i} - P_{i}^{*} (θ_{n})]}^{2} = Var (x_{n i}) + {[P_{i} (θ_{n}) - P_{i}^{*} (θ_{n})]}^{2} .

(39)

For a large number of items (i.e., estimated ability values

{\hat{θ}}_{n}

converge to true ability values

θ_{n}

) and a large number of cases (

N \to \infty

), the outfit statistic

\frac{1}{N} \sum_{n = 1}^{N} z_{n i}^{2}

converges to

\int \frac{P_{i} (θ) [1 - P_{i} (θ)]}{P_{i}^{*} (θ) [1 - P_{i}^{*} (θ)]} ϕ (θ; μ, σ) d θ + \int \frac{{[P_{i} (θ) - P_{i}^{*} (θ)]}^{2}}{P_{i}^{*} (θ) [1 - P_{i}^{*} (θ)]} ϕ (θ; μ, σ) d θ = O_{1} + O_{2},

(40)

where

O_{1}

and

O_{2}

denote the two summands in (40). Using the normal density approximation (17) for the item information function, the first summand in the outfit statistic in (40) can be expressed as

O_{1} ≃ \int \frac{ϕ (a_{i} (θ - b_{i} - δ_{i}); 0, S)}{ϕ (a_{i} (θ - b_{i}); 0, S)} ϕ (θ; μ, σ) d θ .

(41)

The integral on the right side of (41) has a closed form using the identity (A2) from Appendix A. A quadratic Taylor approximation of the obtained closed form around

δ_{i} = 0

yields

O_{1} ≃ 1 + δ_{i} \frac{a_{i}^{2}}{S^{2}} (μ - b_{i}) + δ_{i}^{2} [\frac{a_{i}^{4}}{2 S^{4}} (σ^{2} + {(μ - b_{i})}^{2}) - \frac{a_{i}^{2}}{2 S^{2}}] .

(42)

Next, the computation of

O_{2}

is considered. Using the probit IRF approximation of the logit IRF difference

P_{i} - P_{i}^{*}

and its linear Taylor expansion with respect to

δ_{i}

(see (8)) in (40), together with the normal density approximation (17) of the information function

P_{i}^{*} (1 - P_{i}^{*})

, the integral

O_{2}

can be expressed in closed form by applying the identity (A4) from Appendix A. A quadratic Taylor expansion of this result yields

O_{2} ≃ δ_{i}^{2} K, where K = \frac{a_{i}^{2} S}{2 D^{2} σ \sqrt{π (C + \frac{1}{2 σ^{2}})}} exp (- \frac{C {(μ - b_{i})}^{2}}{2 σ^{2} (C + \frac{1}{2 σ^{2}})}) and C = a_{i}^{2} (\frac{1}{D^{2}} - \frac{1}{2 S^{2}}) .

(43)

Overall, the expected value of the outfit statistic can be approximated as

E [{Outfit}_{i}] = O_{1} + O_{2} ≃ 1 + δ_{i} \frac{a_{i}^{2}}{S^{2}} (μ - b_{i}) + δ_{i}^{2} f_{2} (μ - b_{i}),

(44)

where

f_{2}

is a function of the squared difference

{(μ - b_{i})}^{2}

, which can be derived from (42) and (43). For

a_{i} = 1

and

σ = 1

, this reduces to

E [{Outfit}_{i}] ≃ 1 + 0.393 δ_{i} (μ - b_{i}) + δ_{i}^{2} [- 0.119 + 0.078 {(μ - b_{i})}^{2} + 0.193 exp (- 0.115 {(μ - b_{i})}^{2})] .

(45)

Since the outfit statistic cannot be negative, the approximation (45) is valid only for sufficiently small

δ_{i}

. The linear term shows that the outfit decreases below 1 for positive DIF effects (

δ_{i} > 0

) when

μ - b_{i} < 0

, i.e., for difficult items, and increases when

μ - b_{i} > 0

. Interestingly, for items with

b_{i} \approx μ

, (45) simplifies to

1 + 0.074 δ_{i}^{2}

. This implies that very large DIF effects are required to detect item misfit in the outfit statistic for items of moderate difficulty. In this respect, the outfit statistic behaves in the opposite way of distribution-weighted RMSD and MD statistics, which are more sensitive to uniform DIF when

b_{i} \approx μ

.
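The approximation of the expected outfit can be evaluated directly from (42) and (43). The sketch below codes these two expressions as printed. The function name `expected_outfit` and the chosen parameter values are assumptions made for illustration.

```python
import math

D = 1.701
S = 1.595

def expected_outfit(delta, a, b, mu, sigma):
    """Sum of O_1 from (42) and O_2 from (43)."""
    d = mu - b
    o1 = (1.0 + delta * a ** 2 / S ** 2 * d
          + delta ** 2 * (a ** 4 / (2.0 * S ** 4) * (sigma ** 2 + d ** 2)
                          - a ** 2 / (2.0 * S ** 2)))
    c = a ** 2 * (1.0 / D ** 2 - 1.0 / (2.0 * S ** 2))
    g = c + 1.0 / (2.0 * sigma ** 2)
    k = (a ** 2 * S / (2.0 * D ** 2 * sigma * math.sqrt(math.pi * g))
         * math.exp(-c * d ** 2 / (2.0 * sigma ** 2 * g)))
    return o1 + delta ** 2 * k

# Item of moderate difficulty (b = mu): only the quadratic term remains.
outfit_center = expected_outfit(1.0, 1.0, 0.0, 0.0, 1.0)
# Central difference in delta isolates the linear coefficient for mu - b = 1.
h = 0.01
linear_coef = (expected_outfit(h, 1.0, -1.0, 0.0, 1.0)
               - expected_outfit(-h, 1.0, -1.0, 0.0, 1.0)) / (2.0 * h)
# Difficult item (b = 2) with positive DIF: expected outfit falls below 1.
outfit_difficult = expected_outfit(0.3, 1.0, 2.0, 0.0, 1.0)
```

Even a DIF effect of one logit moves the expected outfit of a moderately difficult item only to about 1.07, which is far inside the conventional cutoffs of 0.7 and 1.3.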

3.2. Expected Value of Infit Statistic

Next, the expected value of the infit statistic is computed. First, consider the numerator in the infit statistic (38) and decompose it as

\int P_{i} (θ) [1 - P_{i} (θ)] ϕ (θ; μ, σ) d θ + \int {[P_{i} (θ) - P_{i}^{*} (θ)]}^{2} ϕ (θ; μ, σ) d θ = I_{1} + I_{2} .

(46)

The denominator in (38) converges to

\int P_{i}^{*} (θ) [1 - P_{i}^{*} (θ)] ϕ (θ; μ, σ) d θ = I_{3} .

(47)

Using the approximation in (17), it follows that

I_{1} ≃ \frac{1}{\sqrt{2 π V_{1}}} exp (- \frac{{(μ - b_{i} - δ_{i})}^{2}}{2 V_{1}}), where V_{1} = σ^{2} + \frac{S^{2}}{a_{i}^{2}} .

(48)

Similarly,

I_{3} ≃ \frac{1}{\sqrt{2 π V_{1}}} exp (- \frac{{(μ - b_{i})}^{2}}{2 V_{1}}) .

(49)

Moreover,

I_{2} ≃ δ_{i}^{2} \frac{a_{i}^{2}}{2 D^{2} \sqrt{2} π σ \sqrt{A}} exp (- \frac{a_{i}^{2}}{2 σ^{2} D^{2} A} {(μ - b_{i})}^{2}), where A = \frac{a_{i}^{2}}{D^{2}} + \frac{1}{2 σ^{2}} .

(50)

Taking the expected value of the fraction in the infit statistic and applying a Taylor approximation around

δ_{i} = 0

yields

E [{Infit}_{i}] ≃ \frac{I_{1}}{I_{3}} + \frac{I_{2}}{I_{3}} ≃ 1 + δ_{i} \frac{μ - b_{i}}{V_{1}} + δ_{i}^{2} (\frac{{(μ - b_{i})}^{2}}{2 V_{1}^{2}} - \frac{1}{2 V_{1}}) + δ_{i}^{2} \frac{a_{i}^{2} \sqrt{V_{1}}}{2 D^{2} \sqrt{π} σ \sqrt{A}} exp (- \frac{{(μ - b_{i})}^{2}}{2} [\frac{a_{i}^{2}}{D^{2} σ^{2} A} - \frac{1}{V_{1}}]) .

(51)
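The first ratio in (51) can be written out explicitly. The following worked step is a sketch that uses only (48) and (49); the remainder term O(δ_{i}^{3}) is notation added here for illustration.

```latex
\frac{I_{1}}{I_{3}}
  \simeq \exp\left( - \frac{(\mu - b_{i} - \delta_{i})^{2} - (\mu - b_{i})^{2}}{2 V_{1}} \right)
  = \exp\left( \frac{\delta_{i} (\mu - b_{i})}{V_{1}} - \frac{\delta_{i}^{2}}{2 V_{1}} \right)
  = 1 + \delta_{i} \frac{\mu - b_{i}}{V_{1}}
      + \delta_{i}^{2} \left( \frac{(\mu - b_{i})^{2}}{2 V_{1}^{2}} - \frac{1}{2 V_{1}} \right)
      + O(\delta_{i}^{3})
```

The last term of (51) is the ratio I_{2} / I_{3} obtained from (49) and (50), which is already of order δ_{i}^{2}.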

Setting

a_{i} = 1

and

σ = 1

simplifies the expression (51) to

E [{Infit}_{i}] ≃ 1 + 0.282 δ_{i} (μ - b_{i}) + δ_{i}^{2} (- 0.141 + 0.040 {(μ - b_{i})}^{2} + 0.200 exp (- 0.063 {(μ - b_{i})}^{2})) .

(52)

Comparing the linear term of the outfit statistic (0.393, see (45)) with the corresponding linear term in the infit statistic (0.282, see (52)) shows that uniform DIF is expected to induce larger deviations from 1 in the outfit than in the infit statistic. Hence, the power to detect model misfit caused by uniform DIF may be higher for the outfit than for the infit statistic.

As with the outfit, positive uniform DIF effects yield infit statistics smaller than 1 when

b_{i} > μ

and larger than 1 when

b_{i} < μ

. For

b_{i} \approx μ

, values slightly above 1 are expected, as indicated by the approximation

1 + 0.059 δ_{i}^{2}

(see (52)).

4. Simulation Study

4.1. Method

The 2PL model was used for both data generation and analysis. The mean and SD of the normally distributed factor variable

θ

were set to 0 and 1, respectively. Non-normal distributions of

θ

were not simulated, as substantially different results were not expected in such conditions.

The simulation study included

I = 50

items to examine the behavior of the fit statistics under more typical conditions of longer tests, where ability estimates are relatively precise. Ten base items with known item discriminations

a_{i}

and item difficulties

b_{i}

were defined in the simulation and duplicated five times to create a test of 50 items. All item discriminations

a_{i}

were fixed at 1, resulting in a test with average discrimination. The item difficulties

b_{i}

were set to −2.00, −1.56, −1.11, −0.67, −0.22, 0.22, 0.67, 1.11, 1.56, and 2.00, producing a test with item difficulties uniformly distributed across a broad range of ability values. Exactly two out of the 50 items were simulated to exhibit uniform DIF, corresponding to 4% of the total items. A relatively small proportion of DIF items was selected to minimize the influence of DIF item misfit on the non-DIF items (i.e., items without DIF). Items j and

10 - j + 1

for

j = 1, \dots, 5

were specified to have uniform DIF in difficulties

b_{i}

with values

δ

and

- δ

, respectively. For instance, for

j = 1

, the first and tenth items were affected by DIF and assigned difficulties

- 2.00 + δ

and

2.00 - δ

. The DIF effect size

δ

was set to

- 0.6

and

0.6

, representing large DIF magnitudes [21,59].
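The design described above can be sketched in Python. This is a minimal illustration rather than the replication code: the function names are hypothetical, and placing the two DIF items in the first copy of the ten base items is an assumption, since the text only states that exactly two of the 50 items carry DIF.

```python
import numpy as np

# Base item difficulties as listed in the text.
BASE_B = np.array([-2.00, -1.56, -1.11, -0.67, -0.22,
                   0.22, 0.67, 1.11, 1.56, 2.00])


def generating_difficulties(j, delta, n_copies=5):
    """Return the 50 data-generating difficulties for DIF item pair j.

    Assumption: DIF is placed in the first copy of the ten base items.
    """
    b = np.tile(BASE_B, n_copies)
    b[j - 1] += delta   # item j receives +delta
    b[10 - j] -= delta  # item 10 - j + 1 receives -delta (0-based index 10 - j)
    return b


def simulate_2pl(n, b, a=1.0, mu=0.0, sigma=1.0, seed=1):
    """Draw dichotomous responses from a 2PL model with common discrimination a."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(mu, sigma, size=n)
    prob = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    return (rng.random(prob.shape) < prob).astype(int)


b_gen = generating_difficulties(j=1, delta=0.6)
responses = simulate_2pl(n=250, b=b_gen)
print(b_gen[[0, 9]], responses.shape)
```

In the scaling step, the item parameters would then be fixed at the base values (without DIF), so that the misfit of the two DIF items can only show up in the item fit statistics.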

Sample sizes N were set to 250, 500, 1000, 2000, and 4000, reflecting typical applications of item fit statistics in large-scale assessments using the 2PL model [8].

In each of the 5 (sample size N) × 5 (type of selected DIF items) × 2 (DIF effect size

δ

)

= 50

simulation conditions, 750 replications were conducted. Item parameters were fixed at the parameters of the base items. Hence, the presence of DIF was ignored in the scaling step in order to allow detecting DIF by the item fit statistics

{RMSD}_{d i s t}

,

{MD}_{d i s t}

,

{RMSD}_{d i f f}

,

{MD}_{d i f f}

,

Infit

and

Outfit

. In this scaling step, only the mean

μ

and the SD

σ

were estimated.

The uniformly weighted RMSD and MD fit statistics were not computed in this simulation study, as the appropriate bounds for the uniform distribution domain are not clearly defined. The information-weighted RMSD and MD fit statistics were also not evaluated in this simulation, as they are expected to behave similarly to the difficulty-weighted fit statistics, with additional scaling by item discriminations.

Means, SDs, and percentiles (5th, 25th, 50th, 75th, and 95th) were computed for all item fit statistics.

Analyses were conducted in R (Version 4.4.1; [60]) using the TAM package (Version 4.3-25; [61]) for 2PL model fitting. Custom R functions were written for RMSD and MD statistics, while infit and outfit statistics were computed with TAM::msq.itemfit(), which uses numerical integration rather than the stochastic integration in TAM::tam.fit(). Replication materials are available at https://zenodo.org/records/17241167 (accessed on 4 October 2025). Figures were created using the R package ggplot2 (Version 4.0.0; [62,63]).

4.2. Results

Table 1 reports the mean and SD of the item fit statistics for DIF items as a function of item difficulty

b_{i}

, DIF effect size

δ

, and sample size N. For items with difficulties close to the mean

μ = 0

(e.g.,

b_{i} = - 0.22

), the distribution-weighted RMSD (

{RMSD}_{d i s t}

) exhibited mean and SD values very similar to the difficulty-weighted RMSD (

{RMSD}_{d i f f}

). The empirical RMSD closely matched the approximate expected value of

0.2 \cdot 0.6 = 0.12

derived in Section 2. MD values reflected the sign of the uniform DIF

δ_{i}

. Consistent with Section 3, infit and outfit statistics did not indicate uniform DIF for items with

b_{i}

near the mean, remaining close to 1 (e.g., for

b_{i} = - 0.22

).

For easier items (i.e., lower

b_{i}

), RMSD and absolute MD values weighted by the distribution were smaller than the difficulty-weighted counterparts. Notably, difficulty-weighted RMSD and MD exhibited similar means across large samples regardless of

b_{i}

, whereas

{RMSD}_{d i s t}

and

{MD}_{d i s t}

approached 0 for items with extreme difficulties. However, difficulty-weighted RMSD and MD exhibited substantially larger SDs, indicating that sampling error strongly affected these statistics for extreme items. The SD of these fit statistics decreased as the sample size increased. Bias due to sampling error in small to moderate samples was also more pronounced for difficulty-weighted RMSD.

As expected from the analytical results in Section 3, infit and outfit statistics detected uniform DIF for items with difficulties far from the mean. Deviations from 1 were asymmetric for items with negative versus positive DIF. For instance, for

b_{i} = - 2.00

and

N = 4000

, infit was 0.729 for

δ_{i} = - 0.6

and 1.404 for

δ_{i} = 0.6

, highlighting potential limitations of symmetric cutoffs such as 0.8 and 1.2 in misfit detection. The SD of outfit was substantially larger than that of infit, particularly for items with extreme difficulties.
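The asymmetry in the reported infit values can be compared with the analytical approximation (52). The sketch below simply evaluates (52); the function name is hypothetical, and the coefficients apply only to the case a_{i} = 1 and σ = 1 for which (52) was stated.

```python
import math


def expected_infit(delta, mu_minus_b):
    """Approximate expected infit from (52); valid for a_i = 1 and sigma = 1 only."""
    d = mu_minus_b
    quadratic = -0.141 + 0.040 * d**2 + 0.200 * math.exp(-0.063 * d**2)
    return 1.0 + 0.282 * delta * d + delta**2 * quadratic


# Item with b_i = -2.00 and mu = 0, so that mu - b_i = 2.
for delta in (-0.6, 0.6):
    print(delta, round(expected_infit(delta, mu_minus_b=2.0), 3))
```

For this item, the approximation gives roughly 0.72 and 1.40, which is close to the simulated means of 0.729 and 1.404 at N = 4000. The asymmetry around 1 stems from the quadratic term, which is positive for both signs of δ_{i}.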

Figure 1 presents percentile plots (5th, 25th, 50th, 75th, and 95th) of distribution-weighted and difficulty-weighted RMSD statistics for DIF items as a function of item difficulty, with a DIF effect of

δ_{i} = - 0.6

. Variability in item fit statistics was markedly higher for small sample sizes (e.g.,

N = 250

). In particular, the sampling distribution of the difficulty-weighted RMSD (

{RMSD}_{d i f f}

) for the item with

b_{i} = - 2.00

remained wide even for a large sample size of

N = 2000

. This limitation may restrict the operational applicability of this statistic. Positive bias due to sampling error for

{RMSD}_{d i f f}

at

b_{i} = - 2.00

was also evident in small to moderate samples.

Table 2 reports the mean and SD for items without DIF under simulation conditions with

δ = - 0.6

, where the fourth and seventh items exhibited DIF. While MD statistics were centered around the expected value of 0, RMSD statistics showed positive values that deviated substantially from 0 in small to moderate sample sizes. Asymptotically, RMSD values are expected to approach 0; however, sampling error induced notable bias, particularly for items with extreme difficulty (

b_{i} = - 2.00

). For instance, the mean of the difficulty-weighted RMSD for this item was 0.158 at

N = 250

. In contrast, infit and outfit statistics closely matched their expected value of 1.

Figure 2 presents percentile plots of RMSD statistics for items without DIF. Similar to items with DIF, the sampling distribution of the difficulty-weighted RMSD (

{RMSD}_{d i f f}

) was wide, with sampling bias particularly pronounced for items with extreme difficulty. In contrast, the distribution-weighted RMSD (

{RMSD}_{d i s t}

) exhibited comparable variability across items with different difficulties.

5. Empirical Example

5.1. Method

This section employs the PISA 2006 dataset [64] for the reading domain. The dataset comprises participants from 26 selected countries that participated in PISA 2006. The complete PISA 2006 dataset is publicly available at https://www.oecd.org/en/data/datasets/pisa-2006-database.html (accessed on 4 October 2025) and can be accessed as the data.pisa2006Read dataset from the R package sirt (Version 4.2-133; [65]).

In PISA 2006, reading items were administered to a subset of students. The present analysis included only students who responded to at least one item from the reading domain. In total, 110,236 students were included, with country-specific sample sizes ranging from 2010 to 12,142 (mean = 4239.8, SD = 3046.7).

Student (sampling) weights were applied throughout the analysis. Within each country, weights were normalized to sum to 5000, ensuring equal contribution across countries. Weights normalized to a constant sum per country are also referred to as senate weights.
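The weight normalization can be sketched as follows. The function name and the toy data are hypothetical; only the target sum of 5000 per country is taken from the text.

```python
import numpy as np


def normalize_weights(weights, country, target_sum=5000.0):
    """Rescale student weights so that they sum to target_sum within each country."""
    weights = np.asarray(weights, dtype=float)
    country = np.asarray(country)
    out = np.empty_like(weights)
    for c in np.unique(country):
        mask = country == c
        out[mask] = weights[mask] * target_sum / weights[mask].sum()
    return out


# Toy example with two countries of different size.
w_norm = normalize_weights([1.0, 3.0, 2.0, 2.0, 6.0], ["A", "A", "B", "B", "B"])
print(w_norm)
```

The rescaling leaves the relative weights of students within a country unchanged and only equalizes the total weight across countries.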

Among the 28 reading items, some were originally scored polytomously but were recoded dichotomously for this empirical example, with only the highest category coded as correct. The remaining items were treated as dichotomous, consistent with the original PISA analysis.

In the first step, international item parameters were estimated using the 2PL model applied to the weighted pooled dataset across all countries. The resulting item discriminations

a_{i}

and item difficulties

b_{i}

are reported in Table 3. In the subsequent country-wise scaling step, only the country mean

μ

and country SD

σ

were estimated, while item parameters were fixed at their international values from the pooled analysis. Item fit statistics were then computed at the country level based on these fixed parameters, which corresponds to the assumption of full invariance.

In the next step, the distribution parameters

μ

and

σ

were fixed, and country-specific DIF effects

δ_{i}

were estimated by allowing item difficulties to vary freely while keeping item discriminations fixed at their international values. The estimated

δ_{i}

can be interpreted as a DIF effect size on the logit scale of the 2PL model, whereas the additional item fit statistics represent alternative DIF quantifications.

The entire procedure was repeated 80 times using balanced repeated replicate weights [64], which allow estimation of standard errors (SE) for approximately normally distributed statistics [66,67]. Replicate weights were used to obtain standard errors for RMSD, MD, infit, outfit, and

δ_{i}

statistics.
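A standard error based on replicate weights can be computed as in the sketch below. It assumes Fay's variant of balanced repeated replication with a factor of 0.5, which is the convention documented for PISA; whether exactly this variant was used here is an assumption, and the function name is hypothetical.

```python
import numpy as np


def brr_standard_error(full_estimate, replicate_estimates, fay_k=0.5):
    """Standard error from replicate estimates (Fay's BRR; fay_k = 0.5 is assumed).

    full_estimate: statistic computed with the full-sample weights.
    replicate_estimates: the same statistic recomputed with each replicate weight.
    """
    rep = np.asarray(replicate_estimates, dtype=float)
    n_rep = rep.shape[0]
    variance = np.sum((rep - full_estimate) ** 2, axis=0) / (n_rep * (1.0 - fay_k) ** 2)
    return np.sqrt(variance)


# Toy example: 80 replicate estimates scattered around a full-sample value of 1.0.
rng = np.random.default_rng(0)
print(brr_standard_error(1.0, 1.0 + 0.02 * rng.standard_normal(80)))
```

With 80 replicates and a factor of 0.5, the divisor equals 20, so the variance is the sum of squared deviations of the replicate estimates from the full-sample estimate divided by 20.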

Items were flagged (i.e., indicated as misfitting) if certain cutoffs of the fit statistics were exceeded: absolute

δ_{i}

values larger than 0.4, RMSD values larger than 0.08, and infit statistics smaller than 0.7 or larger than 1.3.
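The flagging rules can be expressed compactly. The following sketch uses the cutoffs stated above; the function and argument names are hypothetical.

```python
def flag_item(delta, rmsd_dist, rmsd_diff, infit,
              delta_cut=0.4, rmsd_cut=0.08, infit_low=0.7, infit_high=1.3):
    """Return one misfit flag per statistic for a single item in one country."""
    return {
        "delta": abs(delta) > delta_cut,
        "rmsd_dist": rmsd_dist > rmsd_cut,
        "rmsd_diff": rmsd_diff > rmsd_cut,
        "infit": infit < infit_low or infit > infit_high,
    }


# Illustrative values only (not taken from Table 3).
print(flag_item(delta=-0.55, rmsd_dist=0.05, rmsd_diff=0.09, infit=1.02))
```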

5.2. Results

Item fit statistics for PISA 2006 reading results in the Netherlands (NLD) are first reported as an example. A total of 2666 students participated in this country, and results for the 28 reading items are presented. The estimated country mean for NLD was 0.093 (SE = 0.032), and the estimated SD was 1.019 (SE = 0.030). The SD of the estimated DIF effects

δ_{i}

, commonly referred to as the DIF SD [68], was 0.432.

Table 3 reports the item fit statistics for NLD along with their estimated SEs. For instance, 9 of the 28 items exceeded the cutoff value of 0.08 for the RMSD statistic weighted by the distribution (

{RMSD}_{d i s t}

). For 4 additional items (e.g., Item R067Q01), the difficulty-weighted RMSD (

{RMSD}_{d i f f}

) exceeded the threshold, while

{RMSD}_{d i s t}

did not. These items had difficulties that deviated substantially from 0. In most cases, items flagged by RMSD were also flagged by the corresponding MD statistic. For NLD, no further items were identified by infit or outfit beyond those already detected as misfitting by

| δ_{i} |

, RMSD, or MD statistics.

This study did not attempt to provide substantive interpretations for why DIF occurred in some items for NLD. Although potential explanations such as translation issues or differences in opportunities-to-learn may be conjectured, clear substantive interpretations of DIF are generally the exception rather than the rule [69]. It is also relevant that current operational PISA practice removes items with country-specific DIF effects from group comparisons by assigning them unique item parameters in the scaling model [19,70]. As a result, PISA does not aim to explain DIF but instead treats it as construct-irrelevant [71], a stance that has been subject to criticism [72,73].

In the second part of this section, the distribution of item fit statistics across all countries is examined. The analysis is based on 724 cases (i.e., items crossed with countries, with some items excluded in certain countries for technical reasons in the official PISA dataset).

Figure 3 displays histograms of the distribution-weighted RMSD (

{RMSD}_{d i s t}

) and the difficulty-weighted RMSD (

{RMSD}_{d i f f}

). On average, the distribution-weighted RMSD produced smaller values (

M = 0.060

) than the difficulty-weighted RMSD counterpart (

M = 0.072

). Both distributions were notably skewed (skewness for

{RMSD}_{d i s t}

: 1.843; for

{RMSD}_{d i f f}

: 1.712).

Figure 4 presents a correlation plot of the computed item fit statistics. Corresponding

RMSD

and

| MD |

were highly correlated (

r = 0.99

), whereas

{RMSD}_{d i s t}

and

{RMSD}_{d i f f}

showed a lower correlation of 0.82, indicating substantial differences between these measures. The DIF effect size

| δ_{i} |

and the RMSD and MD statistics were largely uncorrelated with infit or outfit. Notably,

| δ_{i} |

correlated slightly more with

{MD}_{d i f f}

than with

{MD}_{d i s t}

.

Finally, Table 4 presents the patterns of item flagging based on

| δ_{i} |

,

{RMSD}_{d i s t}

,

{RMSD}_{d i f f}

, and infit. Overall, 65.1% of items were not flagged, indicating that 34.9% were identified by at least one fit statistic. Specifically, 20.7% of items were flagged by

| δ_{i} |

, indicating uniform DIF as the source of misfit. Additionally, 24.0% of items were flagged by

{RMSD}_{d i s t}

, 32.7% by

{RMSD}_{d i f f}

, and 6.4% by infit.

Only 3.0% of items were flagged by all four statistics. Moreover, 2.8% of items were flagged by

{RMSD}_{d i f f}

but not by

{RMSD}_{d i s t}

. Notably, 14.2% of items were flagged when

| δ_{i} | < 0.4

, suggesting that the misfit could be due to nonuniform DIF or misspecification of the IRF functional form. Most of these cases (12.8% of all items) were additionally detected by

{RMSD}_{d i f f}

, indicating limited incremental value of the infit statistic.
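A table of flagging patterns of this kind can be tabulated from per-case flags. The sketch below is illustrative only: the function name is hypothetical and the four toy cases are made up, not taken from Table 4.

```python
from collections import Counter


def flag_pattern_percentages(flags):
    """Percentage of cases per flagging pattern.

    flags: list of tuples of booleans, one tuple per item-by-country case,
    e.g. ordered as (|delta|, RMSD_dist, RMSD_diff, infit).
    """
    counts = Counter(tuple(bool(f) for f in row) for row in flags)
    n_cases = len(flags)
    return {pattern: 100.0 * k / n_cases for pattern, k in counts.items()}


# Made-up example with four cases.
demo = [(True, True, True, False),
        (False, False, False, False),
        (False, False, False, False),
        (False, False, True, False)]
print(flag_pattern_percentages(demo))
```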

Table 5 displays the percentages of flagged items at the country level for the 26 countries included in this study. The results indicate considerable variation in the proportion of detected DIF items across countries. For instance, the average percentage was 24.1% for the distribution-weighted RMSD statistic (

{RMSD}_{d i s t}

), with an SD of

6.53%

, ranging from 7.1% to 37.0%. For the difficulty-weighted RMSD statistic (

{RMSD}_{d i f f}

), the average percentage increased to 32.74% with an SD of 9.94%, ranging from 17.9% to 51.9%. These findings suggest that DIF may be viewed as a country-specific characteristic, with items in some countries exhibiting greater susceptibility to DIF than in others.

6. Discussion

This article aimed to provide analytical and simulation evidence on how the uniform DIF effect

δ_{i}

is reflected in the item fit statistics RMSD, MD, infit, and outfit.

Closed-form approximations were derived for the item fit statistics. The distribution-weighted RMSD and MD statistics, as originally proposed and now applied in operational practice [42], attain the largest absolute values when item difficulties align with the mean of the ability distribution. In this case, RMSD and the absolute MD statistics are approximately

0.20 | δ_{i} |

, whereas they decrease to around

0.13 | δ_{i} |

when

| μ - b_{i} | = 2

. The dependency of the item fit statistic on item difficulty is removed by using the difficulty-weighted RMSD and MD statistics. However, the difficulty-weighted statistics have the disadvantage of exhibiting substantially larger variances than their distribution-weighted counterparts.

Moreover, analytical approximations of the expected values of the outfit and infit statistics showed that they exhibit only slight deviations from 1 under uniform DIF when item difficulty

b_{i}

is close to the group mean

μ

. For positive DIF effects, the expected values of the infit and outfit statistics fall below 1 for difficult items (i.e.,

b_{i} > μ

), while they exceed 1 for easy items.

The simulation study indicated that distribution-weighted RMSD and MD statistics consistently detected uniform DIF, but only for items with difficulty levels near the mean. In contrast, difficulty-weighted RMSD and MD were sensitive to uniform DIF even for items with difficulties far from 0. However, the difficulty-weighted statistics were also more affected by sampling variability, exhibiting elevated SDs and positive bias for items with extreme difficulties, particularly in small-to-moderate samples. Notably, distribution-weighted RMSD produced substantially lower values for items with extreme difficulties than difficulty-weighted RMSD, complicating DIF detection for items with identical DIF effect sizes

δ_{i}

in the item difficulty parameter. Researchers might therefore consider using the difficulty-weighted versions. However, because these versions suffer from substantial sampling variability in small samples, particularly for items with extreme difficulties, their effectiveness in detecting uniform DIF for such items is limited. Consequently, sufficiently large sample sizes are required if uniform DIF for items with extreme difficulties is to be detected using the difficulty-weighted RMSD and MD statistics.

The results of the simulation study and analytical derivations further indicate that item infit and outfit statistics are not suitable for detecting uniform DIF, except in the case of items with extreme difficulties. However, in such cases, difficulty-weighted RMSD and MD fit statistics provide viable alternatives.

An anonymous reviewer noted that in countries with extremely low or high performance, distribution-weighted RMSD and MD statistics may behave differently even under the same extent of DIF [37]. In such cases, difficulty-weighted or information-weighted RMSD and MD statistics may be preferable. However, these statistics would also exhibit greater variability when the item difficulty deviates substantially from the group mean.

A substantial body of literature addresses methods for DIF detection [74,75]. These methods often assess DIF directly through deviations in item parameters, such as differences in item difficulties or discriminations in the 2PL model [76]. In contrast, the RMSD, MD, infit, and outfit statistics provide aggregated measures that summarize DIF across IRT model parameters into a single discrepancy metric. The choice of metric for DIF detection ultimately depends on the researcher’s objectives. This study also demonstrated that uniform DIF is challenging to detect for items with extreme difficulty when using distribution-weighted RMSD or MD statistics. However, it may be argued that DIF in items with extreme difficulties is of lesser practical importance than DIF in items with moderate difficulties, as the latter receive greater weights in likelihood-based estimation of group differences, whereas the former contribute less. Consequently, for researchers primarily concerned with the practical impact of DIF in likelihood-based estimation, distribution-weighted RMSD or MD statistics may be more appropriate than their difficulty-weighted counterparts.

The empirical application using PISA data demonstrated that the item fit statistics were far from perfectly correlated with one another. This highlights the importance of deciding which aspects of misfit are of interest. The assessment of standard errors should also be incorporated in operational practice, as item fit statistics can exhibit considerable variability, particularly in smaller samples, such as those in PISA field test studies.

This article focused exclusively on weighted RMSD item fit statistics as originally defined. RMSD has been shown to exhibit positive bias for small to moderate sample sizes. Future research could explore bias-corrected RMSD estimates to reduce such bias in finite samples (see [35]).

An additional limitation is the exclusive focus on uniform DIF, that is, DIF in item difficulties. Nonuniform DIF, defined as DIF in item discriminations, may also be relevant in empirical applications and represents a promising direction for future research, although evidence indicates that uniform DIF occurs more frequently in practice than nonuniform DIF [77,78,79]. The MD statistic is expected to be less effective for nonuniform DIF than the RMSD fit statistic. Item infit and outfit statistics are expected to show greater sensitivity to misspecified item discriminations [54], and therefore, they may be more suitable for detecting nonuniform DIF than uniform DIF.

An anonymous reviewer raised the question of what would happen if anchoring were not perfect and no fixed item parameters were used. In the simulation study, only two of the 50 items were assumed to exhibit uniform DIF, which is likely a lower proportion of misfitting items than would typically be found in practice. This choice was made to examine the behavior of the fit statistics under ideal conditions, in which the anchor item set shows no DIF and has minimal influence on the item under investigation. The behavior of the item fit statistics can be expected to deteriorate as the proportion of DIF items increases. More specifically, for the RMSD item fit statistic, the average RMSD for non-DIF items would increase notably above zero, whereas the average RMSD for DIF items would decrease as the proportion of DIF items increases [35]. Consequently, as the quality of the anchor item set decreases in terms of DIF absence, detecting DIF in truly affected items becomes more difficult. Moreover, average values of the item fit statistics are not expected to change substantially if item parameters are estimated rather than fixed, although the variability of the item fit statistics would increase when item parameters are estimated.

Finally, this study focused solely on the 2PL model. It has been argued that a potentially misfitting 2PL model (or even the Rasch model) may still be the preferred operational scaling model, even when a more complex IRT model holds [73,80]. Notably, misspecification in the IRF is also reflected in the RMSD statistic in addition to DIF. This property may help explain why RMSD is preferred over MD in applications such as the PISA study: positive and negative deviations can cancel in MD, so that it captures IRF misfit only partially. If the group-specific item parameters of the 2PL model are interpreted as the best approximation that minimizes the Kullback–Leibler distance, it may be advantageous to compute an RMSD statistic directly on group-specific 2PL item parameters rather than relying on nonparametric estimation of group-specific IRFs.

Funding

This research received no external funding.

Data Availability Statement

Replication material for the simulation study in Section 4 can be found at https://zenodo.org/records/17241167 (accessed on 4 October 2025). The dataset data.pisa2006Read used in the empirical example in Section 5 is available from the R package sirt (https://doi.org/10.32614/CRAN.package.sirt; accessed on 4 October 2025).

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

2PL	two-parameter logistic
DIF	differential item functioning
IRF	item response function
IRT	item response theory
LSA	large-scale assessment
MD	mean deviation
MML	marginal maximum likelihood
NLD	the Netherlands
PISA	Programme for International Student Assessment
RMSD	root mean square deviation
SD	standard deviation
SE	standard error

Appendix A. Integral Identities for the Normal Distribution

Let

ϕ

denote the density of the standard normal distribution. Moreover, let

a > 0

,

c > 0

, b and d be real numbers. The following integral identities hold according to Owen [81]:

\int_{- \infty}^{\infty} ϕ (a x + b) ϕ (x) d x = \frac{1}{\sqrt{2 π (a^{2} + 1)}} exp (- \frac{b^{2}}{2 (a^{2} + 1)}),

(A1)

\int_{- \infty}^{\infty} ϕ (a x + b) ϕ (c x + d) ϕ (x) d x = \frac{1}{2 π \sqrt{a^{2} + c^{2} + 1}} exp (- \frac{1}{2} [b^{2} + d^{2} - \frac{{(a b + c d)}^{2}}{a^{2} + c^{2} + 1}]), and

(A2)

\int_{- \infty}^{\infty} ϕ {(a x + b)}^{2} ϕ (x) d x = \frac{1}{2 π \sqrt{2 a^{2} + 1}} exp (- \frac{b^{2}}{2 a^{2} + 1}) .

(A3)

Note that (A3) is obtained from (A2) by setting

c = a

and

d = b

.

The computation of the above integrals is based on the identity

\int_{- \infty}^{\infty} exp (- \frac{1}{2} [A x^{2} + 2 B x + C]) d x = \sqrt{\frac{2 π}{A}} exp (- \frac{1}{2} [C - \frac{B^{2}}{A}]),

(A4)

where A, B, and C are real numbers and A is positive.
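The identities (A1) to (A3) can be checked numerically. The sketch below compares a simple trapezoidal rule with the closed forms; the helper names, the integration range, and the parameter values are arbitrary choices made for illustration.

```python
import numpy as np


def phi(x):
    """Standard normal density."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)


def integrate(f, lo=-12.0, hi=12.0, n=24001):
    """Trapezoidal rule on an equally spaced grid."""
    x = np.linspace(lo, hi, n)
    y = f(x)
    h = x[1] - x[0]
    return h * (y.sum() - 0.5 * (y[0] + y[-1]))


def rhs_a1(a, b):
    return np.exp(-b**2 / (2.0 * (a**2 + 1.0))) / np.sqrt(2.0 * np.pi * (a**2 + 1.0))


def rhs_a2(a, b, c, d):
    s = a**2 + c**2 + 1.0
    return np.exp(-0.5 * (b**2 + d**2 - (a * b + c * d) ** 2 / s)) / (2.0 * np.pi * np.sqrt(s))


def rhs_a3(a, b):
    return np.exp(-b**2 / (2.0 * a**2 + 1.0)) / (2.0 * np.pi * np.sqrt(2.0 * a**2 + 1.0))


a, b, c, d = 0.7, -0.4, 1.3, 0.5
print(integrate(lambda x: phi(a * x + b) * phi(x)), rhs_a1(a, b))
print(integrate(lambda x: phi(a * x + b) * phi(c * x + d) * phi(x)), rhs_a2(a, b, c, d))
print(integrate(lambda x: phi(a * x + b) ** 2 * phi(x)), rhs_a3(a, b))
```

Each printed pair should agree up to numerical integration error.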

Appendix B. Approximation Error in the Computation of the RMSD Statistic

A bound is now derived for the approximation error when computing the distribution-weighted RMSD statistic based on the probit IRF approximation and the Taylor approximation of the difference in the probit IRFs. To simplify notation, let

P_{i} (θ) = Ψ (a_{i} (θ - b_{i} - δ_{i}))

and

P_{i}^{*} (θ) = Ψ (a_{i} (θ - b_{i}))

denote the true IRFs. Furthermore, as in (8), their respective probit IRF approximations are

{\tilde{P}}_{i} (θ) = Φ (D^{- 1} a_{i} (θ - b_{i} - δ_{i}))

and

{\tilde{P}}_{i}^{*} (θ) = Φ (D^{- 1} a_{i} (θ - b_{i}))

. It was noted in Section 2 that the absolute difference

| Ψ (x) - Φ (D^{- 1} x) |

does not exceed

ϵ = 0.0095

.

The approximation error of the linear Taylor approximation (8) of the difference

{\tilde{P}}_{i} (θ) - {\tilde{P}}_{i}^{*} (θ)

with respect to the uniform DIF effect

δ_{i}

is now determined. This requires a bound on the remainder term of the Taylor approximation. The necessary second derivative of

{\tilde{P}}_{i}^{*}

is

\frac{\partial^{2}}{\partial b_{i}^{2}} {\tilde{P}}_{i}^{*} (θ) = - \frac{D^{3}}{a_{i}^{3}} (θ - b_{i}) ϕ (D^{- 1} a_{i} (θ - b_{i})) .

(A5)

Hence, the approximation error in the Taylor expansion (8) is bounded by

|{\tilde{P}}_{i} (θ) - {\tilde{P}}_{i}^{*} (θ) - U_{i} (θ)| \leq \frac{D^{3}}{2 \sqrt{2 π} a_{i}^{3}} | θ - b_{i} | δ_{i}^{2},

(A6)

where

U_{i} (θ) = - δ_{i} a_{i} D^{- 1} ϕ (D^{- 1} a_{i} (θ - b_{i}))

denotes the linear Taylor approximation of

{\tilde{P}}_{i} (θ) - {\tilde{P}}_{i}^{*} (θ)

.

Let

M_{1}

denote the square of the RMSD (i.e., the MSD statistic) obtained using the two approximations, formally defined as

M_{1} = \int U_{i} {(θ)}^{2} ϕ (θ; μ, σ) d θ .

(A7)

The true value

M_{0}

of the square of the RMSD statistic, without approximations, is

M_{0} = \int {(P_{i} (θ) - P_{i}^{*} (θ))}^{2} ϕ (θ; μ, σ) d θ .

(A8)

The following identity holds

[P_{i} (θ) - P_{i}^{*} (θ)] - U_{i} (θ) = [P_{i} (θ) - {\tilde{P}}_{i} (θ)] - [P_{i}^{*} (θ) - {\tilde{P}}_{i}^{*} (θ)] + [{\tilde{P}}_{i} (θ) - {\tilde{P}}_{i}^{*} (θ) - U_{i} (θ)] .

(A9)

Using (A6), this yields

| [P_{i} (θ) - P_{i}^{*} (θ)] - U_{i} (θ) | \leq 2 ϵ + \frac{D^{3}}{2 \sqrt{2 π} a_{i}^{3}} | θ - b_{i} | δ_{i}^{2} .

(A10)

Hence, a bound for the approximation error of the MSD statistic is obtained as

| M_{1} - M_{0} | \leq 4 ϵ^{2} + ϵ \frac{D^{3}}{\sqrt{2 π} a_{i}^{3}} δ_{i}^{2} (| μ - b_{i} | + σ \sqrt{\frac{2}{π}}) + \frac{D^{6}}{4 π a_{i}^{6}} δ_{i}^{4} ({(μ - b_{i})}^{2} + σ^{2}) \equiv E,

(A11)

where

E

denotes the upper bound of the approximation error in the computed MSD statistic.

Finally, a bound for the resulting approximation error of the RMSD statistic is derived. For positive x and real-valued e, the following inequality holds:

|\sqrt{x + e} - \sqrt{x}| \leq \frac{| e |}{\sqrt{x}} .

(A12)

Let

R_{1} = \sqrt{M_{1}}

and

R_{0} = \sqrt{M_{0}}

denote the RMSD values obtained with and without approximation, respectively. Using (A12), the approximation error in the RMSD statistic satisfies

| R_{1} - R_{0} | \leq \frac{| E |}{R_{0}} .

(A13)
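The size of the approximation error can also be inspected numerically. The sketch below computes R_{0} from (A8) by numerical integration of the logistic IRFs and R_{1} from the closed form of (A7), which coincides with the expression in (50). The value D = 1.702 is an assumption (the usual logistic-probit scaling constant, consistent with ϵ = 0.0095); the function names and the integration grid are hypothetical choices.

```python
import numpy as np

D = 1.702  # assumed scaling constant of the probit approximation


def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))


def rmsd_exact(delta, a, b, mu, sigma, n=20001, half_width=10.0):
    """R_0: square root of (A8), evaluated with a trapezoidal rule."""
    theta = np.linspace(mu - half_width * sigma, mu + half_width * sigma, n)
    diff = logistic(a * (theta - b - delta)) - logistic(a * (theta - b))
    dens = np.exp(-0.5 * ((theta - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    y = diff**2 * dens
    h = theta[1] - theta[0]
    return np.sqrt(h * (y.sum() - 0.5 * (y[0] + y[-1])))


def rmsd_approx(delta, a, b, mu, sigma):
    """R_1: square root of (A7), using the closed form given in (50)."""
    big_a = a**2 / D**2 + 1.0 / (2.0 * sigma**2)
    m1 = (delta**2 * a**2 / (2.0 * D**2 * np.sqrt(2.0) * np.pi * sigma * np.sqrt(big_a))
          * np.exp(-a**2 * (mu - b) ** 2 / (2.0 * sigma**2 * D**2 * big_a)))
    return np.sqrt(m1)


for b in (0.0, 2.0):
    r0 = rmsd_exact(delta=0.6, a=1.0, b=b, mu=0.0, sigma=1.0)
    r1 = rmsd_approx(delta=0.6, a=1.0, b=b, mu=0.0, sigma=1.0)
    print(b, r0, r1)
```

For b_{i} = μ, a_{i} = 1, and σ = 1, the closed form gives approximately 0.21 |δ_{i}|, in line with the factor of about 0.20 discussed in Section 6.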

Appendix C. Country Labels for PISA 2006 Reading Study

The country labels used in Table 5 are as follows: AUS = Australia; AUT = Austria; BEL = Belgium; CAN = Canada; CHE = Switzerland; CZE = Czech Republic; DEU = Germany; DNK = Denmark; ESP = Spain; EST = Estonia; FIN = Finland; FRA = France; GBR = United Kingdom; GRC = Greece; HUN = Hungary; IRL = Ireland; ISL = Iceland; ITA = Italy; JPN = Japan; KOR = Korea; LUX = Luxembourg; NLD = Netherlands; NOR = Norway; POL = Poland; PRT = Portugal; SWE = Sweden.

References

Baker, F.B.; Kim, S.H. Item Response Theory: Parameter Estimation Techniques; CRC Press: Boca Raton, FL, USA, 2004. [Google Scholar] [CrossRef]
Bock, R.D.; Moustaki, I. 15 item response theory in a general framework. In Handbook of Statistics: Psychometrics; Rao, C.R., Sinharay, S., Eds.; North Holland (Elsiver): Amsterdam, The Netherlands, 2007; Volume 26, pp. 469–513. [Google Scholar] [CrossRef]
Bock, R.D.; Gibbons, R.D. Item Response Theory; Wiley: Hoboken, NJ, USA, 2021. [Google Scholar] [CrossRef]
Chen, Y.; Li, X.; Liu, J.; Ying, Z. Item response theory—A statistical framework for educational and psychological measurement. Stat. Sci. 2025, 40, 167–194. [Google Scholar] [CrossRef]
Tutz, G. A Short Guide to Item Response Theory Models; Springer: Cham, Switzerland, 2025. [Google Scholar] [CrossRef]
Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154. [Google Scholar]
Lietz, P.; Cresswell, J.C.; Rust, K.F.; Adams, R.J. (Eds.) Implementation of Large-Scale Education Assessments; Wiley: New York, NY, USA, 2017. [Google Scholar] [CrossRef]
Rutkowski, L.; von Davier, M.; Rutkowski, D. (Eds.) A Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis; Chapman Hall/CRC Press: London, UK, 2013. [Google Scholar] [CrossRef]
Frey, A.; Hartig, J.; Rupp, A.A. An NCME instructional module on booklet designs in large-scale assessments of student achievement: Theory and practice. Educ. Meas. 2009, 28, 39–53. [Google Scholar] [CrossRef]
Braun, H.; von Davier, M. The use of test scores from large-scale assessment surveys: Psychometric and statistical considerations. Large-Scale Assess. Educ. 2017, 5, 17. [Google Scholar] [CrossRef]
van der Linden, W.J. Unidimensional logistic response models. In Handbook of Item Response Theory: Models; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; Volume 1, pp. 11–30. [Google Scholar] [CrossRef]
Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479. [Google Scholar]
Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational Research: Copenhagen, Denmark, 1960. [Google Scholar]
Bond, T.; Yan, Z.; Heene, M. Applying the Rasch Model; Routledge: New York, NY, USA, 2020. [Google Scholar] [CrossRef]
Glas, C.A.W. Maximum-likelihood estimation. In Handbook of Item Response Theory: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; Volume 2, pp. 197–216. [Google Scholar] [CrossRef]
Robitzsch, A. A note on a computationally efficient implementation of the EM algorithm in item response models. Quant. Comput. Methods Behav. Sc. 2021, 1, e3783. [Google Scholar] [CrossRef]
Aitkin, M. Expectation maximization algorithm and extensions. In Handbook of Item Response Theory: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; Volume 2, pp. 217–236. [Google Scholar] [CrossRef]
Bock, R.D.; Aitkin, M. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika 1981, 46, 443–459. [Google Scholar] [CrossRef]
OECD. PISA 2018; Technical Report; OECD: Paris, France, 2020; Available online: https://bit.ly/3zWbidA (accessed on 4 October 2025).
Holland, P.W.; Wainer, H. (Eds.) Differential Item Functioning: Theory and Practice; Lawrence Erlbaum: Hillsdale, NJ, USA, 1993. [Google Scholar] [CrossRef]
Penfield, R.D.; Camilli, G. Differential item functioning and item bias. In Handbook of Statistics: Psychometrics; Rao, C.R., Sinharay, S., Eds.; North Holland (Elsevier): Amsterdam, The Netherlands, 2007; Volume 26, pp. 125–167. [Google Scholar] [CrossRef]
Mellenbergh, G.J. Item bias and item response theory. Int. J. Educ. Res. 1989, 13, 127–143. [Google Scholar] [CrossRef]
Douglas, J.; Cohen, A. Nonparametric item response function estimation for assessing parametric model fit. Appl. Psychol. Meas. 2001, 25, 234–243. [Google Scholar] [CrossRef]
Sinharay, S.; Haberman, S.J. How often is the misfit of item response theory models practically significant? Educ. Meas. 2014, 33, 23–35. [Google Scholar] [CrossRef]
Swaminathan, H.; Hambleton, R.K.; Rogers, H.J. Assessing the fit of item response theory models. In Handbook of Statistics: Psychometrics; Rao, C.R., Sinharay, S., Eds.; North Holland (Elsevier): Amsterdam, The Netherlands, 2007; Volume 26, pp. 683–718. [Google Scholar] [CrossRef]
Buchholz, J.; Hartig, J. Comparing attitudes across groups: An IRT-based item-fit statistic for the analysis of measurement invariance. Appl. Psychol. Meas. 2019, 43, 241–250. [Google Scholar] [CrossRef]
Buchholz, J.; Hartig, J. Measurement invariance testing in questionnaires: A comparison of three multigroup-CFA and IRT-based approaches. Psychol. Test Assess. Model. 2020, 62, 29–53. Available online: https://bit.ly/38kswHh (accessed on 4 October 2025).
Khorramdel, L.; Shin, H.J.; von Davier, M. GDM software mdltm including parallel EM algorithm. In Handbook of Diagnostic Classification Models; von Davier, M., Lee, Y.S., Eds.; Springer: Cham, Switzerland, 2019; pp. 603–628. [Google Scholar] [CrossRef]
Kim, Y.K.; Cai, L.; Kim, Y. Evaluation of item fit with output from the EM algorithm: RMSD index based on posterior expectations. Educ. Psychol. Meas. 2025; Epub ahead of print. [Google Scholar] [CrossRef]
Köhler, C.; Robitzsch, A.; Hartig, J. A bias-corrected RMSD item fit statistic: An evaluation and comparison to alternatives. J. Educ. Behav. Stat. 2020, 45, 251–273. [Google Scholar] [CrossRef]
Köhler, C.; Robitzsch, A.; Fährmann, K.; von Davier, M.; Hartig, J. A semiparametric approach for item response function estimation to detect item misfit. Brit. J. Math. Stat. Psychol. 2021, 74, 157–175. [Google Scholar] [CrossRef]
Kunina-Habenicht, O.; Rupp, A.A.; Wilhelm, O. A practical illustration of multidimensional diagnostic skills profiling: Comparing results from confirmatory factor analysis and diagnostic classification models. Stud. Educ. Eval. 2009, 35, 64–70. [Google Scholar] [CrossRef]
Joo, S.H.; Khorramdel, L.; Yamamoto, K.; Shin, H.J.; Robin, F. Evaluating item fit statistic thresholds in PISA: Analysis of cross-country comparability of cognitive items. Educ. Meas. 2021, 40, 37–48. [Google Scholar] [CrossRef]
Joo, S.; Ali, U.; Robin, F.; Shin, H.J. Impact of differential item functioning on group score reporting in the context of large-scale assessments. Large-Scale Assess. Educ. 2022, 10, 18. [Google Scholar] [CrossRef]
Robitzsch, A. Statistical properties of estimators of the RMSD item fit statistic. Foundations 2022, 2, 488–503. [Google Scholar] [CrossRef]
Sueiro, M.J.; Abad, F.J. Assessing goodness of fit in item response theory with nonparametric models: A comparison of posterior probabilities and kernel-smoothing approaches. Educ. Psychol. Meas. 2011, 71, 834–848. [Google Scholar] [CrossRef]
Tijmstra, J.; Bolsinova, M.; Liaw, Y.L.; Rutkowski, L.; Rutkowski, D. Sensitivity of the RMSD for detecting item-level misfit in low-performing countries. J. Educ. Meas. 2020, 57, 566–583. [Google Scholar] [CrossRef]
von Davier, M.; Bezirhan, U. A robust method for detecting item misfit in large scale assessments. Educ. Psychol. Meas. 2023, 83, 740–765. [Google Scholar] [CrossRef]
Wright, B.D.; Stone, M.H. Best Test Design; Mesa Press: Chicago, IL, USA, 1979; Available online: https://bit.ly/38jnLMX (accessed on 4 October 2025).
Wu, M.; Tam, H.P.; Jen, T.H. Educational Measurement for Applied Researchers; Springer: Singapore, 2016. [Google Scholar] [CrossRef]
Fährmann, K.; Köhler, C.; Hartig, J.; Heine, J.H. Practical significance of item misfit and its manifestations in constructs assessed in large-scale studies. Large-Scale Assess. Educ. 2022, 10, 7. [Google Scholar] [CrossRef]
von Davier, M.; Yamamoto, K.; Shin, H.J.; Chen, H.; Khorramdel, L.; Weeks, J.; Davis, S.; Kong, N.; Kandathil, M. Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assess. Educ. 2019, 26, 466–488. [Google Scholar] [CrossRef]
Joo, S.; Valdivia, M.; Svetina Valdivia, D.; Rutkowski, L. Alternatives to weighted item fit statistics for establishing measurement invariance in many groups. J. Educ. Behav. Stat. 2024, 49, 465–493. [Google Scholar] [CrossRef]
Held, L.; Sabanés Bové, D. Applied Statistical Inference; Springer: Berlin/Heidelberg, Germany, 2014. [Google Scholar] [CrossRef]
Camilli, G. Origin of the scaling constant d = 1.7 in item response theory. J. Educ. Stat. 1994, 19, 293–295. [Google Scholar] [CrossRef]
Camilli, G. The scaling constant D in item response theory. Open J. Stat. 2017, 7, 780–785. [Google Scholar] [CrossRef]
Savalei, V. Logistic approximation to the normal: The KL rationale. Psychometrika 2006, 71, 763–767. [Google Scholar] [CrossRef]
Robitzsch, A.; Lüdtke, O. A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psychol. Test Assess. Model. 2020, 62, 233–279. Available online: https://bit.ly/3ezBB05 (accessed on 4 October 2025).
Raju, N.S. The area between two item characteristic curves. Psychometrika 1988, 53, 495–502. [Google Scholar] [CrossRef]
Joo, S.H.; Lee, P. Detecting differential item functioning using posterior predictive model checking: A comparison of discrepancy statistics. J. Educ. Meas. 2022, 59, 442–469. [Google Scholar] [CrossRef]
Linacre, J.M. Understanding Rasch measurement: Estimation methods for Rasch measures. J. Outcome Meas. 1999, 3, 382–405. Available online: https://bit.ly/2UV6Eht (accessed on 4 October 2025). [PubMed]
van der Linden, W.J.; Hambleton, R.K. (Eds.) Handbook of Modern Item Response Theory; Springer: New York, NY, USA, 1997. [Google Scholar] [CrossRef]
Adams, R.J.; Wu, M.L. The mixed-coefficients multinomial logit model: A generalized form of the Rasch model. In Multivariate and Mixture Distribution Rasch Models; von Davier, M., Carstensen, C.H., Eds.; Springer: New York, NY, USA, 2007; pp. 57–75. [Google Scholar] [CrossRef]
Wu, M.; Adams, R.J. Properties of Rasch residual fit statistics. J. Appl. Meas. 2013, 14, 339–355. [Google Scholar]
Adams, R.; Wu, M. (Eds.) PISA 2000 Technical Report; OECD: Paris, France, 2003; Available online: https://tinyurl.com/y79c3kmp (accessed on 4 October 2025).
Lamprianou, I. Applying the Rasch Model in Social Sciences Using R and BlueSky Statistics; Routledge: New York, NY, USA, 2019. [Google Scholar] [CrossRef]
Wilson, M. Constructing Measures: An Item Response Modeling Approach; Routledge: New York, NY, USA, 2004. [Google Scholar] [CrossRef]
Silva Diaz, J.A.; Köhler, C.; Hartig, J. Performance of infit and outfit confidence intervals calculated via parametric bootstrapping. Appl. Meas. Educ. 2022, 35, 116–132. [Google Scholar] [CrossRef]
Osterlind, S.J.; Everson, H.T. Differential Item Functioning; Sage: Newcastle upon Tyne, UK, 2009. [Google Scholar] [CrossRef]
R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2024; Available online: https://www.R-project.org (accessed on 15 June 2024).
Robitzsch, A.; Kiefer, T.; Wu, M. TAM: Test Analysis Modules. R Package Version 4.3-25. 2025. Available online: https://cran.r-project.org/web/packages/TAM/index.html (accessed on 28 August 2025).
Wickham, H.; Chang, W.; Henry, L.; Pedersen, T.L.; Takahashi, K.; Wilke, C.; Woo, K.; Yutani, H.; Dunnington, D.; van den Brand, T.; et al. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. R Package Version 4.0.0. 2025. Available online: https://cran.r-project.org/web/packages/ggplot2/index.html (accessed on 11 September 2025).
Wickham, H. ggplot2: Elegant Graphics for Data Analysis; Springer: New York, NY, USA, 2016. [Google Scholar] [CrossRef]
OECD. PISA 2006 Technical Report; OECD: Paris, France, 2009; Available online: https://bit.ly/38jhdzp (accessed on 4 October 2025).
Robitzsch, A. sirt: Supplementary Item Response Theory Models. R Package Version 4.2-133. 2025. Available online: https://cran.r-project.org/web/packages/sirt/index.html (accessed on 27 September 2025).
Kolenikov, S. Resampling variance estimation for complex survey data. Stata J. 2010, 10, 165–199. [Google Scholar] [CrossRef]
Rao, J.N.K.; Wu, C.F.J. Resampling inference with complex survey data. J. Am. Stat. Assoc. 1988, 83, 231–241. [Google Scholar] [CrossRef]
Longford, N.T.; Holland, P.W.; Thayer, D.T. Stability of the MH D-DIF statistics across populations. In Differential Item Functioning; Holland, P.W., Wainer, H., Eds.; Routledge: London, UK, 1993; pp. 171–196. [Google Scholar]
Ackerman, T.A.; Ma, Y. Examining differential item functioning from a multidimensional IRT perspective. Psychometrika 2024, 89, 4–41. [Google Scholar] [CrossRef]
von Davier, M.; Khorramdel, L.; He, Q.; Shin, H.J.; Chen, H. Developments in psychometric population models for technology-based large-scale assessments: An overview of challenges and opportunities. J. Educ. Behav. Stat. 2019, 44, 671–705. [Google Scholar] [CrossRef]
Camilli, G. The case against item bias detection techniques based on internal criteria: Do item bias procedures obscure test fairness issues? In Differential Item Functioning: Theory and Practice; Holland, P.W., Wainer, H., Eds.; Erlbaum: Hillsdale, NJ, USA, 1993; pp. 397–417. [Google Scholar]
Adams, R.J. Comments on Kreiner 2011: Is the Foundation Under PISA Solid? A Critical Look at the Scaling Model Underlying International Comparisons of Student Attainment; Technical Report; OECD: Paris, France, 2011; Available online: https://bit.ly/3wVUKo0 (accessed on 4 October 2025).
Robitzsch, A.; Lüdtke, O. Some thoughts on analytical choices in the scaling model for test scores in international large-scale assessment studies. Meas. Instrum. Soc. Sci. 2022, 4, 9. [Google Scholar] [CrossRef]
Bauer, D.J. Enhancing measurement validity in diverse populations: Modern approaches to evaluating differential item functioning. Brit. J. Math. Stat. Psychol. 2023, 76, 435–461. [Google Scholar] [CrossRef]
Kopf, J.; Zeileis, A.; Strobl, C. Anchor selection strategies for DIF analysis: Review, assessment, and new approaches. Educ. Psychol. Meas. 2015, 75, 22–56. [Google Scholar] [CrossRef]
Lord, F.M. Applications of Item Response Theory to Practical Testing Problems; Erlbaum: Hillsdale, NJ, USA, 1980. [Google Scholar] [CrossRef]
Boer, D.; Hanke, K.; He, J. On detecting systematic measurement error in cross-cultural research: A review and critical reflection on equivalence and invariance tests. J. Cross-Cult. Psychol. 2018, 49, 713–734. [Google Scholar] [CrossRef]
He, J.; Barrera-Pedemonte, F.; Buchholz, J. Cross-cultural comparability of noncognitive constructs in TIMSS and PISA. Assess. Educ. 2019, 26, 369–385. [Google Scholar] [CrossRef]
Rutkowski, L.; Svetina, D. Assessing the hypothesis of measurement invariance in the context of large-scale international surveys. Educ. Psychol. Meas. 2014, 74, 31–57. [Google Scholar] [CrossRef]
Robitzsch, A. On the choice of the item response model for scaling PISA data: Model selection based on information criteria and quantifying model uncertainty. Entropy 2022, 24, 760. [Google Scholar] [CrossRef] [PubMed]
Owen, D.B. A table of normal integrals. Commun. Stat. Simul. Comput. 1980, 9, 389–419. [Google Scholar] [CrossRef]

Figure 1. Simulation Study: Percentile plot of RMSD item fit statistics for items with differential item functioning (DIF) showing the 5th–95th percentile range (black rectangle), the 25th–75th percentile range (gray-shaded rectangle) and the median (thick black line).

Figure 2. Simulation Study: Percentile plot of RMSD item fit statistics for items without differential item functioning (DIF) showing the 5th–95th percentile range (black rectangle), the 25th–75th percentile range (gray-shaded rectangle) and the median (thick black line).

Figure 3. Empirical Example, PISA 2006 Reading: Histogram of distribution-weighted RMSD (${RMSD}_{dist}$; left panel) and difficulty-weighted RMSD (${RMSD}_{diff}$; right panel) statistics.

Figure 4. Empirical Example, PISA 2006 Reading: Correlation plot of item fit statistics.

Table 1. Simulation Study: Mean and standard deviation (SD) of item fit statistics for items with differential item functioning (DIF) as a function of item difficulty $b_{i}$, DIF effect size $δ_{i}$, and sample size $N$.

$b_{i}$	$δ_{i}$	$N$	Mean RMSD (dist)	Mean MD (dist)	Mean RMSD (diff)	Mean MD (diff)	Mean Infit	Mean Outfit	SD RMSD (dist)	SD MD (dist)	SD RMSD (diff)	SD MD (diff)	SD Infit	SD Outfit
−0.22	−0.6	250	0.129	0.117	0.133	0.120	0.982	0.975	0.027	0.027	0.028	0.028	0.047	0.066
		500	0.126	0.118	0.129	0.121	0.983	0.977	0.020	0.019	0.021	0.020	0.033	0.046
		1000	0.125	0.119	0.128	0.122	0.981	0.973	0.014	0.014	0.015	0.014	0.023	0.032
		2000	0.123	0.118	0.126	0.121	0.982	0.974	0.010	0.010	0.011	0.010	0.016	0.022
		4000	0.123	0.118	0.125	0.121	0.982	0.974	0.007	0.007	0.008	0.007	0.012	0.016
	0.6	250	0.134	−0.123	0.133	−0.122	1.064	1.088	0.028	0.028	0.027	0.028	0.051	0.076
		500	0.131	−0.123	0.129	−0.122	1.063	1.086	0.020	0.020	0.020	0.020	0.036	0.053
		1000	0.129	−0.124	0.128	−0.123	1.064	1.087	0.014	0.014	0.014	0.014	0.026	0.038
		2000	0.127	−0.123	0.126	−0.122	1.063	1.085	0.011	0.010	0.010	0.010	0.017	0.025
		4000	0.127	−0.123	0.126	−0.122	1.064	1.086	0.007	0.007	0.007	0.007	0.013	0.019
−1.11	−0.6	250	0.108	0.092	0.152	0.125	0.839	0.791	0.025	0.023	0.052	0.048	0.056	0.085
		500	0.104	0.092	0.139	0.123	0.841	0.793	0.018	0.016	0.037	0.035	0.041	0.062
		1000	0.102	0.092	0.130	0.120	0.836	0.786	0.013	0.011	0.025	0.024	0.027	0.041
		2000	0.100	0.091	0.125	0.118	0.839	0.790	0.009	0.008	0.018	0.017	0.021	0.032
		4000	0.100	0.092	0.125	0.120	0.839	0.791	0.006	0.006	0.012	0.012	0.014	0.022
	0.6	250	0.126	−0.112	0.142	−0.120	1.237	1.323	0.029	0.029	0.035	0.042	0.080	0.136
		500	0.121	−0.112	0.132	−0.119	1.237	1.322	0.019	0.019	0.025	0.028	0.057	0.099
		1000	0.119	−0.111	0.128	−0.120	1.235	1.322	0.014	0.013	0.017	0.019	0.040	0.070
		2000	0.118	−0.112	0.126	−0.120	1.237	1.324	0.010	0.010	0.013	0.014	0.028	0.048
		4000	0.117	−0.111	0.124	−0.120	1.236	1.322	0.007	0.007	0.010	0.010	0.021	0.036
−2.00	−0.6	250	0.077	0.057	0.215	0.145	0.731	0.670	0.021	0.019	0.118	0.118	0.096	0.154
		500	0.072	0.057	0.174	0.126	0.729	0.671	0.015	0.013	0.087	0.085	0.063	0.102
		1000	0.070	0.058	0.154	0.124	0.727	0.668	0.011	0.009	0.067	0.062	0.046	0.073
		2000	0.069	0.057	0.140	0.121	0.730	0.671	0.008	0.006	0.048	0.044	0.032	0.052
		4000	0.069	0.058	0.132	0.120	0.729	0.671	0.006	0.004	0.031	0.028	0.022	0.037
	0.6	250	0.101	−0.083	0.177	−0.107	1.416	1.537	0.027	0.026	0.067	0.097	0.143	0.260
		500	0.094	−0.080	0.150	−0.111	1.400	1.522	0.019	0.018	0.046	0.062	0.102	0.182
		1000	0.091	−0.080	0.137	−0.113	1.401	1.525	0.014	0.012	0.035	0.044	0.068	0.125
		2000	0.091	−0.081	0.129	−0.116	1.404	1.527	0.009	0.009	0.024	0.029	0.049	0.092
		4000	0.090	−0.080	0.124	−0.115	1.404	1.528	0.007	0.006	0.018	0.022	0.035	0.063

Note. RMSD = root mean square deviation; MD = mean deviation; dist = RMSD and MD weighted by the group-specific distribution; diff = RMSD and MD weighted by a normal density located at item difficulty $b_{i}$.

Table 2. Simulation Study: Mean and standard deviation (SD) of item fit statistics for items without differential item functioning (DIF) as a function of item difficulty $b_{i}$ and sample size $N$.

$b_{i}$	$N$	Mean RMSD (dist)	Mean MD (dist)	Mean RMSD (diff)	Mean MD (diff)	Mean Infit	Mean Outfit	SD RMSD (dist)	SD MD (dist)	SD RMSD (diff)	SD MD (diff)	SD Infit	SD Outfit
−0.22	250	0.051	0.000	0.053	0.000	0.997	0.997	0.016	0.028	0.017	0.029	0.047	0.066
	500	0.036	0.000	0.037	0.000	0.997	0.997	0.011	0.020	0.012	0.020	0.033	0.046
	1000	0.025	0.000	0.026	−0.001	0.997	0.996	0.008	0.014	0.008	0.014	0.023	0.032
	2000	0.018	0.000	0.019	0.000	0.997	0.996	0.006	0.010	0.006	0.010	0.016	0.023
	4000	0.013	0.000	0.013	0.000	0.997	0.996	0.004	0.007	0.004	0.007	0.011	0.016
−1.11	250	0.048	0.000	0.079	0.001	0.998	0.995	0.015	0.025	0.039	0.044	0.068	0.109
	500	0.035	−0.001	0.055	−0.003	0.998	0.997	0.011	0.018	0.027	0.033	0.048	0.079
	1000	0.025	−0.001	0.040	−0.002	0.998	0.998	0.007	0.013	0.017	0.022	0.034	0.055
	2000	0.017	−0.001	0.028	−0.002	0.999	0.997	0.005	0.009	0.012	0.016	0.023	0.036
	4000	0.012	−0.001	0.020	−0.002	0.998	0.997	0.004	0.006	0.008	0.011	0.017	0.028
−2.00	250	0.045	0.000	0.158	0.023	1.003	1.005	0.014	0.021	0.098	0.114	0.112	0.191
	500	0.033	−0.001	0.116	0.008	1.003	0.999	0.010	0.015	0.073	0.081	0.081	0.133
	1000	0.022	0.000	0.082	0.005	1.000	0.997	0.007	0.011	0.052	0.055	0.056	0.092
	2000	0.016	0.000	0.060	0.001	0.999	0.993	0.005	0.008	0.034	0.038	0.042	0.067
	4000	0.011	−0.001	0.042	−0.002	1.002	1.000	0.004	0.005	0.023	0.026	0.028	0.046

Note. RMSD = root mean square deviation; MD = mean deviation; dist = RMSD and MD weighted by the group-specific distribution; diff = RMSD and MD weighted by a normal density located at item difficulty $b_{i}$.

Table 3. Empirical Example, PISA 2006 Reading, Netherlands: International item parameters, DIF effect size $δ_{i}$, and item fit statistics (with standard errors in parentheses).

Item	$a_{i}$	$b_{i}$	$δ_{i}$	RMSD (dist)	MD (dist)	RMSD (diff)	MD (diff)	Infit	Outfit
R055Q01	1.394	−1.487	0.655 (0.054)	0.122 (0.012)	−0.118 (0.011)	0.133 (0.014)	−0.128 (0.013)	1.666 (0.008)	2.645 (0.030)
R055Q02	1.378	0.043	0.288 (0.048)	0.073 (0.012)	−0.070 (0.011)	0.072 (0.012)	−0.070 (0.011)	1.057 (0.005)	1.113 (0.011)
R055Q03	1.619	−0.334	0.176 (0.043)	0.048 (0.010)	−0.042 (0.011)	0.055 (0.010)	−0.049 (0.011)	0.989 (0.004)	1.030 (0.009)
R055Q05	2.116	−0.779	−0.055 (0.043)	0.037 (0.008)	0.018 (0.008)	0.042 (0.008)	0.003 (0.011)	0.850 (0.002)	0.719 (0.003)
R067Q01	1.226	−2.073	0.523 (0.064)	0.077 (0.013)	−0.064 (0.010)	0.108 (0.033)	−0.093 (0.048)	1.572 (0.009)	1.724 (0.011)
R067Q04	0.832	0.723	0.723 (0.093)	0.123 (0.013)	−0.118 (0.013)	0.132 (0.015)	−0.130 (0.015)	0.934 (0.003)	0.919 (0.003)
R067Q05	1.087	−0.307	0.304 (0.060)	0.072 (0.013)	−0.069 (0.013)	0.076 (0.014)	−0.072 (0.014)	1.051 (0.003)	1.087 (0.007)
R102Q04A	1.460	0.669	−0.371 (0.047)	0.094 (0.012)	0.090 (0.011)	0.094 (0.013)	0.088 (0.012)	1.103 (0.004)	1.215 (0.009)
R102Q05	1.330	0.244	−0.611 (0.048)	0.151 (0.012)	0.145 (0.011)	0.150 (0.012)	0.144 (0.011)	1.046 (0.004)	1.086 (0.008)
R102Q07	1.417	−1.494	0.823 (0.047)	0.177 (0.011)	−0.153 (0.010)	0.179 (0.016)	−0.153 (0.021)	2.000 (0.010)	2.415 (0.019)
R104Q01	1.626	−1.322	0.087 (0.072)	0.017 (0.014)	−0.012 (0.012)	0.021 (0.021)	−0.018 (0.028)	1.044 (0.004)	0.959 (0.006)
R104Q02	0.584	1.334	0.157 (0.112)	0.025 (0.010)	−0.017 (0.013)	0.018 (0.011)	0.000 (0.021)	0.969 (0.001)	0.960 (0.001)
R104Q05	1.133	3.129	0.622 (0.147)	0.034 (0.008)	−0.024 (0.004)	0.175 (0.087)	−0.166 (0.078)	0.639 (0.004)	0.594 (0.003)
R111Q01	1.365	−0.604	0.470 (0.039)	0.113 (0.009)	−0.108 (0.009)	0.112 (0.011)	−0.101 (0.014)	1.228 (0.004)	1.408 (0.012)
R111Q02B	1.047	1.912	−0.324 (0.065)	0.048 (0.010)	0.046 (0.010)	0.038 (0.016)	0.035 (0.019)	1.228 (0.004)	1.341 (0.006)
R111Q06B	1.588	0.542	0.162 (0.044)	0.042 (0.011)	−0.039 (0.010)	0.046 (0.012)	−0.044 (0.011)	0.990 (0.004)	0.973 (0.007)
R219Q01E	1.633	−0.250	−0.236 (0.054)	0.074 (0.014)	0.058 (0.013)	0.081 (0.015)	0.067 (0.014)	1.021 (0.006)	1.008 (0.010)
R219Q01T	1.860	−0.664	−0.246 (0.062)	0.077 (0.017)	0.054 (0.013)	0.099 (0.023)	0.081 (0.020)	1.023 (0.006)	1.049 (0.013)
R219Q02	1.533	−1.179	−0.120 (0.055)	0.031 (0.011)	0.020 (0.009)	0.038 (0.016)	0.026 (0.016)	0.921 (0.002)	1.048 (0.013)
R220Q01	1.761	0.305	0.170 (0.033)	0.058 (0.008)	−0.053 (0.008)	0.058 (0.009)	−0.054 (0.008)	0.981 (0.005)	0.986 (0.009)
R220Q02B	1.520	−0.376	0.102 (0.047)	0.050 (0.010)	−0.029 (0.012)	0.058 (0.012)	−0.032 (0.016)	1.006 (0.004)	1.060 (0.010)
R220Q04	1.301	−0.312	−0.260 (0.055)	0.062 (0.011)	0.054 (0.012)	0.060 (0.012)	0.051 (0.014)	0.888 (0.003)	0.854 (0.004)
R220Q05	1.976	−1.145	−0.007 (0.050)	0.023 (0.011)	−0.002 (0.009)	0.029 (0.017)	−0.013 (0.019)	0.920 (0.004)	0.701 (0.004)
R220Q06	1.166	−0.675	−0.110 (0.061)	0.032 (0.008)	0.018 (0.012)	0.038 (0.014)	0.015 (0.018)	0.917 (0.003)	0.882 (0.004)
R227Q01	0.778	−0.151	−0.778 (0.066)	0.131 (0.010)	0.129 (0.010)	0.131 (0.011)	0.129 (0.011)	0.942 (0.002)	0.937 (0.003)
R227Q02T	0.993	0.793	−0.907 (0.066)	0.194 (0.013)	0.184 (0.013)	0.193 (0.013)	0.183 (0.013)	1.230 (0.004)	1.289 (0.006)
R227Q03	1.664	−0.183	−0.159 (0.043)	0.057 (0.012)	0.046 (0.011)	0.063 (0.013)	0.054 (0.012)	1.042 (0.005)	1.094 (0.011)
R227Q06	1.765	−0.777	−0.286 (0.053)	0.082 (0.012)	0.064 (0.009)	0.100 (0.017)	0.088 (0.015)	0.903 (0.004)	0.876 (0.007)

Note. $a_{i}$ = item discrimination (international item parameter); $b_{i}$ = item difficulty (international item parameter); RMSD = root mean square deviation; MD = mean deviation; dist = RMSD and MD weighted by the group-specific distribution; diff = RMSD and MD weighted by a normal density located at item difficulty $b_{i}$. The following entries are printed in bold font: absolute $δ_{i}$ values larger than 0.4, RMSD values larger than 0.08, and infit statistics smaller than 0.7 or larger than 1.3.

Table 4. Empirical Example, PISA 2006 Reading: Relative frequencies for patterns of flagged items based on the DIF effect size $δ_{i}$, distribution-weighted and difficulty-weighted RMSD (${RMSD}_{dist}$ and ${RMSD}_{diff}$, respectively), and infit statistic.

Flagged Items
$|δ_{i}|$	${RMSD}_{dist}$	${RMSD}_{diff}$	Infit	%
0	0	0	0	65.1
0	0	0	1	0.8
0	0	1	0	5.5
0	0	1	1	0.4
0	1	1	0	7.3
0	1	1	1	0.1
1	0	0	0	1.1
1	0	0	1	0.3
1	0	1	0	1.1
1	0	1	1	1.7
1	1	1	0	13.5
1	1	1	1	3.0

Note. Flagged items are indicated by an entry of 1. Items were flagged according to the following rules: absolute $δ_{i}$ values larger than 0.4, RMSD values larger than 0.08, and infit statistics smaller than 0.7 or larger than 1.3.

Table 5. Empirical Example, PISA 2006 Reading: Percentages of flagged items based on the DIF effect size $δ_{i}$, distribution-weighted and difficulty-weighted RMSD (${RMSD}_{dist}$ and ${RMSD}_{diff}$, respectively), and infit statistic at the country level.

cnt	$I$	Flagged $|δ_{i}|$ (%)	Flagged ${RMSD}_{dist}$ (%)	Flagged ${RMSD}_{diff}$ (%)	Flagged Infit (%)
AUS	28	7.1	21.4	25.0	0.0
AUT	27	11.1	11.1	18.5	0.0
BEL	28	14.3	7.1	17.9	10.7
CAN	28	17.9	21.4	32.1	3.6
CHE	28	25.0	17.9	28.6	7.1
CZE	28	25.0	25.0	28.6	3.6
DEU	28	21.4	21.4	28.6	7.1
DNK	27	29.6	37.0	44.4	3.7
ESP	28	17.9	32.1	39.3	7.1
EST	28	17.9	35.7	42.9	10.7
FIN	28	10.7	10.7	17.9	3.6
FRA	28	14.3	25.0	39.3	3.6
GBR	28	25.0	35.7	42.9	7.1
GRC	28	25.0	28.6	35.7	3.6
HUN	28	21.4	17.9	32.1	3.6
IRL	28	17.9	21.4	28.6	3.6
ISL	28	21.4	17.9	21.4	10.7
ITA	28	25.0	21.4	25.0	7.1
JPN	28	25.0	35.7	42.9	17.9
KOR	27	25.9	33.3	51.9	11.1
LUX	27	14.8	11.1	18.5	7.4
NLD	28	32.1	32.1	46.4	14.3
NOR	28	25.0	32.1	42.9	3.6
POL	28	21.4	25.0	39.3	3.6
PRT	28	32.1	32.1	35.7	3.6
SWE	28	14.3	14.3	25.0	7.1

Note. cnt = country label (see Appendix C); $I$ = number of items. Items were flagged according to the following rules: absolute $δ_{i}$ values larger than 0.4, RMSD values larger than 0.08, and infit statistics smaller than 0.7 or larger than 1.3.
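
The flagging rules stated in the table notes can be illustrated with a short sketch. The code below applies them to the rounded Netherlands values printed in Table 3 and computes the percentage of flagged items per criterion, which can be compared with the NLD row of Table 5. The function name `flag_item` and the data layout are hypothetical choices made for this illustration, not part of the original analysis; because the inputs are the rounded table entries, items close to a threshold could in principle be classified differently from the unrounded analysis.

```python
# Rounded values transcribed from Table 3 (Netherlands):
# (item, delta, RMSD_dist, RMSD_diff, infit)
TABLE3_NLD = [
    ("R055Q01", 0.655, 0.122, 0.133, 1.666),
    ("R055Q02", 0.288, 0.073, 0.072, 1.057),
    ("R055Q03", 0.176, 0.048, 0.055, 0.989),
    ("R055Q05", -0.055, 0.037, 0.042, 0.850),
    ("R067Q01", 0.523, 0.077, 0.108, 1.572),
    ("R067Q04", 0.723, 0.123, 0.132, 0.934),
    ("R067Q05", 0.304, 0.072, 0.076, 1.051),
    ("R102Q04A", -0.371, 0.094, 0.094, 1.103),
    ("R102Q05", -0.611, 0.151, 0.150, 1.046),
    ("R102Q07", 0.823, 0.177, 0.179, 2.000),
    ("R104Q01", 0.087, 0.017, 0.021, 1.044),
    ("R104Q02", 0.157, 0.025, 0.018, 0.969),
    ("R104Q05", 0.622, 0.034, 0.175, 0.639),
    ("R111Q01", 0.470, 0.113, 0.112, 1.228),
    ("R111Q02B", -0.324, 0.048, 0.038, 1.228),
    ("R111Q06B", 0.162, 0.042, 0.046, 0.990),
    ("R219Q01E", -0.236, 0.074, 0.081, 1.021),
    ("R219Q01T", -0.246, 0.077, 0.099, 1.023),
    ("R219Q02", -0.120, 0.031, 0.038, 0.921),
    ("R220Q01", 0.170, 0.058, 0.058, 0.981),
    ("R220Q02B", 0.102, 0.050, 0.058, 1.006),
    ("R220Q04", -0.260, 0.062, 0.060, 0.888),
    ("R220Q05", -0.007, 0.023, 0.029, 0.920),
    ("R220Q06", -0.110, 0.032, 0.038, 0.917),
    ("R227Q01", -0.778, 0.131, 0.131, 0.942),
    ("R227Q02T", -0.907, 0.194, 0.193, 1.230),
    ("R227Q03", -0.159, 0.057, 0.063, 1.042),
    ("R227Q06", -0.286, 0.082, 0.100, 0.903),
]


def flag_item(delta, rmsd_dist, rmsd_diff, infit):
    """Return flags in the column order |delta|, RMSD_dist, RMSD_diff, infit."""
    return (
        abs(delta) > 0.4,            # absolute DIF effect size larger than 0.4
        rmsd_dist > 0.08,            # distribution-weighted RMSD larger than 0.08
        rmsd_diff > 0.08,            # difficulty-weighted RMSD larger than 0.08
        infit < 0.7 or infit > 1.3,  # infit outside the interval [0.7, 1.3]
    )


# One flag pattern per item, keyed by item label
flags = {row[0]: flag_item(*row[1:]) for row in TABLE3_NLD}

# Percentage of flagged items per criterion, rounded to one decimal
n_items = len(flags)
percent_flagged = [
    round(100 * sum(pattern[k] for pattern in flags.values()) / n_items, 1)
    for k in range(4)
]

print(percent_flagged)   # [32.1, 32.1, 46.4, 14.3]
print(flags["R104Q05"])  # (True, False, True, True)
```

With these inputs, the four percentages agree with the NLD entries in Table 5 (32.1, 32.1, 46.4, and 14.3). Item R104Q05 shows the pattern in which the DIF effect size, the difficulty-weighted RMSD, and the infit statistic flag the item while the distribution-weighted RMSD does not.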

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
