Bias and Linking Error in Fixed Item Parameter Calibration

Alexander Robitzsch

doi:10.3390/appliedmath4030063

¹

IPN—Leibniz Institute for Science and Mathematics Education, Olshausenstraße 62, 24118 Kiel, Germany

²

Centre for International Student Assessment (ZIB), Olshausenstraße 62, 24118 Kiel, Germany

AppliedMath 2024, 4(3), 1181-1191; https://doi.org/10.3390/appliedmath4030063

Abstract

The two-parameter logistic (2PL) item response theory (IRT) model is frequently applied to analyze group differences for multivariate binary random variables. The item parameters in the 2PL model are frequently fixed when estimating the mean and the standard deviation for a group of interest. This method is also called fixed item parameter calibration (FIPC). In this article, the bias and the linking error of the FIPC approach are analytically derived in the presence of random uniform differential item functioning (DIF). The adequacy of the analytical findings was validated in a simulation study. It turned out that the extent of the bias and the variance in distribution parameters increases with increasing variance of random DIF effects.

Keywords:

item response theory; 2PL model; bias; linking error; fixed item parameter calibration; differential item functioning

MSC:

62H12; 62H17; 62H25; 62P25

1. Introduction

Item response theory (IRT) models [1,2,3,4,5,6,7] represent a statistical framework to model multivariate discrete random variables. IRT models are frequently employed in the context of educational or psychological research. Let

X = (X_{1}, \dots, X_{I})

be the vector of

I \in N

dichotomous (i.e., binary) random variables

X_{i} \in {0, 1}

that are typically referred to as (scored) items (or scored item responses). A unidimensional IRT model [8,9] is a statistical model for the probability distribution

P (X = x)

for

x = (x_{1}, \dots, x_{I}) \in {0, 1}^{I}

, where

P (X = x; δ, γ) = \int \prod_{i = 1}^{I} [{P_{i} (θ; γ_{i})}^{x_{i}} {(1 - P_{i} (θ; γ_{i}))}^{1 - x_{i}}] ϕ (θ; μ, σ) d θ .

(1)

In (1),

ϕ

denotes the density of the univariate normal distribution with mean

μ

and standard deviation

σ

. The parameters of the distribution for the latent (factor) variable

θ

, also labeled as trait or ability, define the parameter of interest

δ = (μ, σ)

. The vector

γ = (γ_{1}, \dots, γ_{I})

collects all estimated item parameters of item response functions (IRF)

P_{i} (θ; γ_{i}) = P (X_{i} = 1 | θ)

(

i = 1, \dots, I

). The two-parameter logistic (2PL) model [10,11,12] has the IRF:

P_{i} (θ; γ_{i}) = Ψ (a_{i} (θ - b_{i})),

(2)

where

a_{i}

and

b_{i}

are the item discrimination and item difficulty, respectively, and the logistic distribution function

Ψ

is given by the following:

Ψ (x) = \frac{1}{1 + exp (- x)} .

(3)

In this case, the item parameters in (1) are given as

γ_{i} = (a_{i}, b_{i})

. The model parameters in the IRT model (1) are frequently estimated using (marginal) maximum likelihood estimation [13,14].
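As a minimal sketch of Equations (2) and (3), the 2PL item response function can be written in a few lines of Python. The function names `logistic` and `irf_2pl` and the item parameter values are illustrative assumptions and do not come from the article.

```python
import math


def logistic(x):
    """Logistic distribution function Psi(x) = 1 / (1 + exp(-x)), Eq. (3)."""
    # Branch on the sign of x to avoid overflow in exp() for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def irf_2pl(theta, a, b):
    """2PL item response function P_i(theta) = Psi(a_i (theta - b_i)), Eq. (2)."""
    return logistic(a * (theta - b))


# Hypothetical item with discrimination a = 1.2 and difficulty b = 0.5.
p_at_difficulty = irf_2pl(theta=0.5, a=1.2, b=0.5)
p_high_ability = irf_2pl(theta=2.0, a=1.2, b=0.5)
```

When the ability equals the item difficulty, the argument of the logistic function is zero, so the success probability is 0.5 regardless of the discrimination.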

IRT models are frequently used to compare the distribution of a particular group, such as a country, in a test (i.e., on a set of items) with a reference distribution with respect to the factor variable

θ

in the IRT model (1). Fixed item parameter calibration (FIPC, [15,16,17,18,19]) is frequently used for this purpose. It assumes that item parameters are known (or have been estimated) in the total sample, so that only the distribution parameters must be estimated. The application of the FIPC method using the 2PL model can be assumed to provide consistent distribution parameter estimates if the item parameters also hold for the group under study. However, the item parameters may be group-specific. This property is referred to as differential item functioning (DIF) in the literature (see, for example, [20,21,22,23,24]). It has been shown in the linking literature that DIF produces additional variability in the estimated mean

μ

and standard deviation

σ

[25,26,27,28,29,30,31]. Moreover, it has been demonstrated that the occurrence of DIF can introduce bias in the distribution parameters [32,33]. In particular, bias in FIPC due to random DIF can emerge because a misspecified IRF is assumed that ignores DIF effects. Importantly, bias in distribution parameter estimates emerges even if DIF effects are random (i.e., average to zero in the population). Moreover, the presence of random DIF also induces additional variability in the estimates of the FIPC method.

In this article, the performance of FIPC is investigated in the presence of random DIF [34,35,36,37,38,39,40]. An analytical derivation and a simulation study show that biased estimates can result in the presence of random DIF. The analytical derivation utilizes standard techniques from calculus. However, to our knowledge, such a derivation of the bias and the variance of the FIPC estimates in the presence of DIF has not yet been presented in the literature.

The rest of this article is organized as follows. Section 2 presents an analytical derivation of the bias and the variance of the parameter estimate of the FIPC method. Section 3 reports the findings of a simulation study that aims to confirm the analytical findings of Section 2. Finally, the article closes with a discussion in Section 4.

2. Analytical Derivation of Bias and Linking Error

In this section, we compute approximations of the bias and the variance (i.e., the square of the linking error) for the FIPC method in the presence of random uniform DIF. The derivation operates on the assumption of an infinite sample size of persons. Hence, the effects of sampling error (i.e., sampling of subjects) are neglected.

Let

x_{h} = (x_{h 1}, \dots, x_{h I}) \in {0, 1}^{I}

be a vector of item responses. The index h of an item response pattern can be defined as

h = 1 + \sum_{i = 1}^{I} 2^{i - 1} x_{h i}

, where

h = 1, \dots, 2^{I}

. In total, there are

H = 2^{I}

item response patterns for I items. The likelihood of the item response pattern

x_{h}

is given by (see [41])

Like (x_{h}, μ, σ, a, b) = \sum_{t = 1}^{T} f_{t} \prod_{i = 1}^{I} p_{i t}^{x_{h i}} {(1 - p_{i t})}^{1 - x_{h i}} with p_{i t} = Ψ (a_{i} (σ θ_{t} + μ - b_{i})),

(4)

where

f_{t}

are known weights that are proportional to the standard normal density, and the likelihood (4) is evaluated at a discrete grid

θ_{t}

for

t = 1, \dots, T

of

θ

points. The distribution parameters consist of the mean

μ

and the standard deviation

σ

.

a = (a_{1}, \dots, a_{I})

is the vector of item discriminations, and

b = (b_{1}, \dots, b_{I})

is the vector of item difficulties. Note that

μ

and

σ

also enter the item response probabilities in (4) because the prior probability weights

f_{t}

are fixed at those of a standard normal distribution.
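The likelihood in Equation (4) can be sketched as follows. This is an illustration under assumptions: the grid (61 equally spaced points on [-6, 6]), the three item parameters, and the function names `normal_grid` and `pattern_likelihood` are hypothetical choices and are not taken from the article.

```python
import itertools
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def normal_grid(T=61, lo=-6.0, hi=6.0):
    """Grid theta_t with weights f_t proportional to the standard normal
    density, normalized to sum to one (grid size is an assumption)."""
    step = (hi - lo) / (T - 1)
    theta = [lo + t * step for t in range(T)]
    dens = [math.exp(-0.5 * th * th) for th in theta]
    total = sum(dens)
    return theta, [d / total for d in dens]


def pattern_likelihood(x, mu, sigma, a, b, theta, f):
    """Eq. (4): sum_t f_t prod_i p_it^x_i (1 - p_it)^(1 - x_i),
    with p_it = Psi(a_i (sigma * theta_t + mu - b_i))."""
    like = 0.0
    for th, w in zip(theta, f):
        term = w
        for xi, ai, bi in zip(x, a, b):
            p = logistic(ai * (sigma * th + mu - bi))
            term *= p if xi == 1 else 1.0 - p
        like += term
    return like


# Hypothetical item parameters for I = 3 items.
a = [0.8, 1.0, 1.2]
b = [-1.0, 0.0, 1.0]
theta, f = normal_grid()

# Summing the likelihood over all H = 2^I patterns must give one.
patterns = list(itertools.product((0, 1), repeat=len(a)))
total = sum(pattern_likelihood(x, 0.3, 1.2, a, b, theta, f) for x in patterns)
```

Because the weights sum to one and the pattern probabilities sum to one at every grid point, the likelihoods of all response patterns sum to one for any choice of the mean and the standard deviation.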

The item response data are generated by additionally assuming uniform DIF effects. The probability of a pattern

x_{h}

is given by the following (see [41]):

w_{h} = w_{h} (e) = P (X = x_{h}) = Like (x_{h}, μ^{*}, σ^{*}, a^{*}, b^{*} + e),

(5)

and the notation

μ^{*}

and

σ^{*}

indicates that the distribution parameters are fixed for data generation. Note that

w_{h}

is a function of random uniform DIF effects

e_{i}

, where

e = (e_{1}, \dots, e_{I})

.

The log-likelihood function l is used to obtain parameter estimates

\hat{δ}

for the population parameter

δ = (μ, σ)

. The log-likelihood function is given by the following:

l (δ, e) = \sum_{h = 1}^{H} w_{h} log L_{h} = \sum_{h = 1}^{H} w_{h} (e) log L_{h} (δ), where

(6)

L_{h} = L_{h} (δ) = Like (x_{h}, μ, σ, a^{*}, b^{*}) .

(7)

The parameter estimate is obtained as the maximizer of l, which means that

\hat{δ}

is the root of a score equation:

\hat{δ} = \underset{δ}{arg max} l (δ, e) such that l_{δ} (\hat{δ}, e) = 0 .

(8)

Importantly, DIF effects only affect probabilities

w_{h}

, and the DIF effects are not modelled.
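The estimation problem in Equations (5) to (8) can be sketched for a small test in which all pattern probabilities are computed exactly (infinite sample size). Everything specific here is an assumption for illustration: the four item parameters, the DIF effects, the grid, and the simple compass search used as the optimizer. The article itself fitted the model with the R package TAM.

```python
import itertools
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_grid(T=41, lo=-6.0, hi=6.0):
    step = (hi - lo) / (T - 1)
    theta = [lo + t * step for t in range(T)]
    dens = [math.exp(-0.5 * th * th) for th in theta]
    s = sum(dens)
    return theta, [d / s for d in dens]


def like(x, mu, sigma, a, b, theta, f):
    """Pattern likelihood of Eq. (4)."""
    total = 0.0
    for th, w in zip(theta, f):
        term = w
        for xi, ai, bi in zip(x, a, b):
            p = logistic(ai * (sigma * th + mu - bi))
            term *= p if xi else 1.0 - p
        total += term
    return total


def fipc_estimate(w, patterns, a, b, theta, f, start=(0.0, 1.0), tol=1e-6):
    """Maximize l(delta) = sum_h w_h log L_h(delta), Eqs. (6)-(8), over
    delta = (mu, sigma) with fixed item parameters a and b.
    The compass search is an illustrative optimizer choice."""
    def loglik(mu, sigma):
        return sum(wh * math.log(like(x, mu, sigma, a, b, theta, f))
                   for wh, x in zip(w, patterns))

    mu, sigma = start
    best, step = loglik(mu, sigma), 0.5
    while step > tol:
        improved = False
        for dmu, dsig in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            cand_mu, cand_sig = mu + dmu, sigma + dsig
            if cand_sig <= 0.0:          # sigma must stay positive
                continue
            val = loglik(cand_mu, cand_sig)
            if val > best:
                mu, sigma, best, improved = cand_mu, cand_sig, val, True
        if not improved:
            step /= 2.0                  # refine the search mesh
    return mu, sigma


# Hypothetical fixed item parameters (a*, b*) and true distribution parameters.
a = [0.8, 1.0, 1.2, 1.1]
b = [-1.0, -0.3, 0.4, 1.0]
mu_true, sigma_true = 0.3, 1.2
theta, f = make_grid()
patterns = list(itertools.product((0, 1), repeat=len(a)))

# Without DIF: data-generating and analysis model coincide.
w0 = [like(x, mu_true, sigma_true, a, b, theta, f) for x in patterns]
mu0, sigma0 = fipc_estimate(w0, patterns, a, b, theta, f)

# With hypothetical uniform DIF effects e_i added to the difficulties, Eq. (5).
e = [0.3, -0.2, 0.1, -0.2]
b_dif = [bi + ei for bi, ei in zip(b, e)]
w = [like(x, mu_true, sigma_true, a, b_dif, theta, f) for x in patterns]
mu_hat, sigma_hat = fipc_estimate(w, patterns, a, b, theta, f)
```

Without DIF, the maximizer coincides with the data-generating parameters up to the optimizer tolerance, which corresponds to the condition that the score vanishes at the true parameter when all DIF effects are zero. With DIF, the estimates generally deviate from the true values because the DIF effects only enter the pattern probabilities and not the analysis model.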

We now derive an approximation of the parameter estimate

\hat{δ}

as a function of random uniform DIF effects

e

. A Taylor expansion of the partial derivative

l_{δ}

in (8) is utilized in the approximation (see [42] for an extensive usage of this technique). We assume independent random uniform DIF effects with

E (e_{i}) = 0

and DIF variance

E (e_{i}^{2}) = τ^{2}

(i.e., a standard deviation

SD (e_{i}) = τ

; see [38,43]) and use a Taylor expansion around

δ = \hat{δ}

and

e = 0

l_{δ} (\hat{δ}, e) = l_{δ} (δ, 0) + l_{δ δ} (δ, 0) (\hat{δ} - δ) + \sum_{i = 1}^{I} l_{δ e_{i}} (δ, 0) e_{i} + \frac{1}{2} \sum_{i = 1}^{I} l_{δ e_{i} e_{i}} (δ, 0) e_{i}^{2} .

(9)

Because we also have

l_{δ} (δ, 0) = 0

(i.e., the true parameter

δ

is obtained if DIF effects are absent), we arrive at the following:

\hat{δ} - δ = - l_{δ δ}^{- 1} \sum_{i = 1}^{I} l_{δ e_{i}} e_{i} - \frac{1}{2} l_{δ δ}^{- 1} \sum_{i = 1}^{I} l_{δ e_{i} e_{i}} e_{i}^{2},

(10)

while omitting arguments in the corresponding partial derivatives. Then, we can obtain the expected bias (due to

E (e_{i}) = 0

and

E (e_{i}^{2}) = τ^{2}

) from (10) by taking the expectation as follows:

Bias (\hat{δ}) = - \frac{τ^{2}}{2} {(\frac{1}{I} l_{δ δ})}^{- 1} (\frac{1}{I} \sum_{i = 1}^{I} l_{δ e_{i} e_{i}}) .

(11)
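The step from (10) to (11) can be spelled out as a short worked derivation. It only uses the two moment assumptions on the DIF effects stated above; the partial derivatives are treated as fixed quantities evaluated at the true parameter and at zero DIF.

```latex
\begin{align*}
E(\hat{\delta} - \delta)
  &= - l_{\delta\delta}^{-1} \sum_{i=1}^{I} l_{\delta e_i} \, E(e_i)
     - \frac{1}{2} \, l_{\delta\delta}^{-1} \sum_{i=1}^{I} l_{\delta e_i e_i} \, E(e_i^2) \\
  &= - \frac{\tau^2}{2} \, l_{\delta\delta}^{-1} \sum_{i=1}^{I} l_{\delta e_i e_i} \\
  &= - \frac{\tau^2}{2} \left( \frac{1}{I} \, l_{\delta\delta} \right)^{-1}
       \left( \frac{1}{I} \sum_{i=1}^{I} l_{\delta e_i e_i} \right).
\end{align*}
```

The linear term drops out because the DIF effects have zero mean, so the bias is driven entirely by the second-order term and is proportional to the DIF variance. The two factors 1/I in the last line cancel and only serve to express both terms as averages over items.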

Note that because

l_{δ δ}

and

\sum_{i = 1}^{I} l_{δ e_{i} e_{i}}

both increase with I, the bias is asymptotically independent of the number of items I. The variance of

\hat{δ}

can be computed as follows:

Var (\hat{δ}) = \frac{1}{I} τ^{2} {(\frac{1}{I} l_{δ δ})}^{- 1} (\frac{1}{I} \sum_{i = 1}^{I} l_{δ e_{i}} l_{δ e_{i}}^{⊤}) {(\frac{1}{I} l_{δ δ})}^{- ⊤} .

(12)

The linking error is defined as the square root of the variance [26].
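The variance in (12) can be obtained from the first-order term of (10). The following sketch assumes that the second-order term is neglected for the variance and uses the independence of the DIF effects:

```latex
\begin{align*}
\mathrm{Var}(\hat{\delta})
  &\approx l_{\delta\delta}^{-1} \,
     \mathrm{Var}\!\left( \sum_{i=1}^{I} l_{\delta e_i} e_i \right)
     l_{\delta\delta}^{-\top}
   = \tau^2 \, l_{\delta\delta}^{-1}
     \left( \sum_{i=1}^{I} l_{\delta e_i} l_{\delta e_i}^{\top} \right)
     l_{\delta\delta}^{-\top} \\
  &= \frac{1}{I} \, \tau^2
     \left( \frac{1}{I} \, l_{\delta\delta} \right)^{-1}
     \left( \frac{1}{I} \sum_{i=1}^{I} l_{\delta e_i} l_{\delta e_i}^{\top} \right)
     \left( \frac{1}{I} \, l_{\delta\delta} \right)^{-\top}.
\end{align*}
```

In this form, the three averaged factors stabilize as the number of items grows, so the variance decreases at the rate 1/I, whereas the bias in (11) does not vanish with an increasing number of items.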

We can now more explicitly compute the appearing terms in (11) and (12). The partial derivatives of l with respect to

δ

are given by the following:

l_{δ} = \sum_{h = 1}^{H} w_{h} \frac{\partial log L_{h}}{\partial δ} and l_{δ δ} = \sum_{h = 1}^{H} w_{h} \frac{\partial^{2} log L_{h}}{\partial δ \partial δ^{⊤}},

(13)

while the partial derivatives (denoted by the operator ∂) with respect to uniform DIF effects

e_{i}

can be computed as follows:

l_{δ e_{i}} = \sum_{h = 1}^{H} \frac{\partial w_{h}}{\partial e_{i}} \frac{\partial log L_{h}}{\partial δ} and l_{δ e_{i} e_{i}} = \sum_{h = 1}^{H} \frac{\partial^{2} w_{h}}{\partial e_{i} \partial e_{i}} \frac{\partial log L_{h}}{\partial δ} .

(14)

In detail, the estimating equation for

\hat{μ}

is given by the following:

\frac{\partial log L_{h}}{\partial μ} = \frac{\frac{\partial L_{h}}{\partial μ}}{L_{h}} = \frac{\sum_{t = 1}^{T} \frac{\partial L_{h t}}{\partial μ}}{L_{h}} = \frac{\sum_{t = 1}^{T} L_{h t} G_{μ, h t}}{L_{h}}, where

(15)

L_{h t} = f_{t} \prod_{i = 1}^{I} p_{i t}^{x_{h i}} {(1 - p_{i t})}^{1 - x_{h i}} and G_{μ, h t} = \sum_{i = 1}^{I} a_{i} (x_{h i} - p_{i t}) .

(16)

Moreover, we obtain the estimating equation for

\hat{σ}

as follows:

\frac{\partial log L_{h}}{\partial σ} = \frac{\sum_{t = 1}^{T} L_{h t} G_{σ, h t}}{L_{h}}, where G_{σ, h t} = \sum_{i = 1}^{I} θ_{t} a_{i} (x_{h i} - p_{i t}) .

(17)

The second-order derivatives in

l_{δ δ}

are given as follows:

\frac{\partial^{2} log L_{h}}{\partial μ \partial μ} = \frac{(\sum_{t = 1}^{T} L_{h t} (G_{μ, h t}^{2} + G_{μ μ, h t})) L_{h} - {(\sum_{t = 1}^{T} L_{h t} G_{μ, h t})}^{2}}{L_{h}^{2}} with G_{μ μ, h t} = - \sum_{i = 1}^{I} a_{i}^{2} p_{i t} (1 - p_{i t}),

(18)

\frac{\partial^{2} log L_{h}}{\partial σ \partial σ} = \frac{(\sum_{t = 1}^{T} L_{h t} (G_{σ, h t}^{2} + G_{σ σ, h t})) L_{h} - {(\sum_{t = 1}^{T} L_{h t} G_{σ, h t})}^{2}}{L_{h}^{2}} with G_{σ σ, h t} = - \sum_{i = 1}^{I} a_{i}^{2} θ_{t}^{2} p_{i t} (1 - p_{i t}) and

(19)

\frac{\partial^{2} log L_{h}}{\partial μ \partial σ} = \frac{(\sum_{t = 1}^{T} L_{h t} (G_{μ, h t} G_{σ, h t} + G_{μ σ, h t})) L_{h} - (\sum_{t = 1}^{T} L_{h t} G_{μ, h t}) (\sum_{t = 1}^{T} L_{h t} G_{σ, h t})}{L_{h}^{2}},

(20)

where

G_{μ σ, h t} = - \sum_{i = 1}^{I} a_{i}^{2} θ_{t} p_{i t} (1 - p_{i t})

. Furthermore, we now compute the derivatives of

w_{h}

with respect to

e_{i}

. We obtain the following for the derivatives evaluated at

e_{i} = 0

:

\frac{\partial w_{h}}{\partial e_{i}} = a_{i} \sum_{t = 1}^{T} L_{h t} p_{i t}^{1 - x_{h i}} {(1 - p_{i t})}^{x_{h i}} {(- 1)}^{x_{h i}} and \frac{\partial^{2} w_{h}}{\partial e_{i} \partial e_{i}} = - a_{i}^{2} \sum_{t = 1}^{T} L_{h t} p_{i t}^{1 - x_{h i}} {(1 - p_{i t})}^{x_{h i}} (1 - 2 p_{i t}) {(- 1)}^{x_{h i}} .

(21)
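The derivatives in (21) can be checked numerically against finite differences. The sketch below is an illustration under assumptions: the item parameters, the grid, the step size `h`, and the function names are hypothetical and not taken from the article.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_grid(T=41, lo=-6.0, hi=6.0):
    step = (hi - lo) / (T - 1)
    theta = [lo + t * step for t in range(T)]
    dens = [math.exp(-0.5 * th * th) for th in theta]
    s = sum(dens)
    return theta, [d / s for d in dens]


def pattern_prob(x, mu, sigma, a, b, theta, f):
    """w_h as in Eq. (5), with DIF effects already added to b."""
    total = 0.0
    for th, w in zip(theta, f):
        term = w
        for xj, aj, bj in zip(x, a, b):
            p = logistic(aj * (sigma * th + mu - bj))
            term *= p if xj else 1.0 - p
        total += term
    return total


def dw_analytic(x, i, mu, sigma, a, b, theta, f):
    """First and second derivative of w_h with respect to e_i at e = 0, Eq. (21)."""
    d1 = d2 = 0.0
    for th, w in zip(theta, f):
        L_ht = w
        for xj, aj, bj in zip(x, a, b):
            p = logistic(aj * (sigma * th + mu - bj))
            L_ht *= p if xj else 1.0 - p
        p_i = logistic(a[i] * (sigma * th + mu - b[i]))
        q = (1.0 - p_i) if x[i] else p_i        # p^(1 - x) (1 - p)^x
        sign = -1.0 if x[i] else 1.0            # (-1)^x
        d1 += L_ht * q * sign
        d2 += L_ht * q * (1.0 - 2.0 * p_i) * sign
    return a[i] * d1, -a[i] ** 2 * d2


def dw_numeric(x, i, mu, sigma, a, b, theta, f, h=1e-4):
    """Central finite differences in the DIF effect e_i around zero."""
    def w(e_i):
        b_shift = list(b)
        b_shift[i] += e_i
        return pattern_prob(x, mu, sigma, a, b_shift, theta, f)
    d1 = (w(h) - w(-h)) / (2.0 * h)
    d2 = (w(h) - 2.0 * w(0.0) + w(-h)) / h ** 2
    return d1, d2


# Hypothetical setup with I = 3 items and one response pattern.
a = [0.8, 1.0, 1.2]
b = [-1.0, 0.0, 1.0]
theta, f = make_grid()
x = (1, 0, 1)
analytic = [dw_analytic(x, i, 0.3, 1.2, a, b, theta, f) for i in range(3)]
numeric = [dw_numeric(x, i, 0.3, 1.2, a, b, theta, f) for i in range(3)]
```

Agreement between `analytic` and `numeric` up to the finite-difference error supports the signs and factors in (21) for both correct and incorrect item responses.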

It appears that the bias-determining term in (11) is mainly affected by partial derivatives with respect to the standard deviation

σ

. The bias-determining term can be computed as follows:

\sum_{i = 1}^{I} l_{δ e_{i} e_{i}} = \sum_{i = 1}^{I} \frac{\partial^{3} l}{\partial σ \partial e_{i} \partial e_{i}} = \sum_{i = 1}^{I} (\sum_{h = 1}^{H} \frac{\partial^{2} w_{h}}{\partial e_{i} \partial e_{i}} \frac{\partial log L_{h}}{\partial σ}) = \sum_{h = 1}^{H} \frac{\partial log L_{h}}{\partial σ} \sum_{i = 1}^{I} \frac{\partial^{2} w_{h}}{\partial e_{i} \partial e_{i}} .

(22)

This expression can be simplified to the following:

\sum_{i = 1}^{I} l_{δ e_{i} e_{i}} = - \sum_{h = 1}^{H} [(\sum_{t = 1}^{T} \frac{L_{h t}}{L_{h}} \sum_{i = 1}^{I} a_{i} θ_{t} (x_{h i} - p_{i t})) (\sum_{t = 1}^{T} L_{h t} \sum_{i = 1}^{I} a_{i}^{2} p_{i t}^{1 - x_{h i}} {(1 - p_{i t})}^{x_{h i}} (1 - 2 p_{i t}) {(- 1)}^{x_{h i}})] .

(23)

It is difficult to obtain clear insights into the expression (23). However, the sign of this term can be reasoned under simplifying assumptions. Assume item difficulties

b_{i}

are close to

μ

. First, we investigate the first factor of (23) that involves the following term:

α_{h i t} = θ_{t} (x_{h i} - p_{i t}) .

(24)

Assume that

p_{i t} > 0.5

. Then, we expect

x_{h i} = 1

, and

θ_{t}

will typically be positive. In this case, the term

α_{h i t}

will be positive. If

p_{i t} < 0.5

, we expect

x_{h i} = 0

, and

θ_{t}

will typically be negative. Again,

α_{h i t}

will be positive. Second, we investigate the second factor:

β_{h i t} = p_{i t}^{1 - x_{h i}} {(1 - p_{i t})}^{x_{h i}} (1 - 2 p_{i t}) {(- 1)}^{x_{h i}} .

(25)

If

p_{i t} > 0.5

, we expect

x_{h i} = 1

, and

β_{h i t}

is positive. If

p_{i t} < 0.5

, we expect

x_{h i} = 0

, which also implies that

β_{h i t}

is positive. Because there is a negative sign for

\sum_{i = 1}^{I} l_{δ e_{i} e_{i}}

in (23), we expect a negative contribution of

\sum_{i = 1}^{I} l_{δ e_{i} e_{i}}

. Furthermore, the second-order derivative matrix

l_{δ δ}

is negative definite for a maximizer. Moreover, from (11), we have the total bias contribution

- l_{δ δ}^{- 1} \sum_{i = 1}^{I} l_{δ e_{i} e_{i}}

, which includes an additional negative sign. By taking these three negative signs into account, we reason that the bias in

\hat{δ}

can be expected to be negative.

3. Simulation Study

3.1. Methods

The 2PL model was used as the IRT model to simulate item responses. We fixed the mean

μ

and the SD

σ

for the normally distributed ability variable

θ

at 0.3 and 1.2, respectively.

The number I of items was fixed at 15 in the simulation. The item discriminations

a_{i}

of the 15 items had a mean of 1.007 and a standard deviation of 0.225 and ranged between 0.6 and 1.3. The item difficulties

b_{i}

had a mean of 0.000, a standard deviation of 1.272, and ranged between −2.0 and 2.0. The concrete item parameters can be found at https://osf.io/uf2xc (accessed on 4 August 2024). These item parameters were used in the FIPC scaling model.

For data generation using the 2PL model, random uniform DIF effects

e_{i}

(

i = 1, \dots, I

) were added to item difficulties

b_{i}

. The standard deviation

SD (e_{i}) = τ

was chosen at seven factor levels: 0 (indicating no DIF), 0.1, 0.2, 0.3, 0.4, 0.5, and 0.6. Note that random DIF effects were simulated in each replication of the simulation study.

Item responses were simulated according to the 2PL model for sample sizes N of 250, 500, 1000, 2000, and an infinite sample size of subjects (denoted by Inf). In the case of an infinite sample size, the probabilities for all

H = 2^{I}

item response patterns

x \in {0, 1}^{I}

according to the data-generating model were directly computed, and no item responses were simulated.
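For finite sample sizes, one replication of the data generation can be sketched as follows. Several details are assumptions made for illustration: the DIF effects are drawn from a normal distribution (the text above only specifies their standard deviation), the five item parameters are hypothetical (the 15 item parameters of the study are available in the replication material), and the function name `simulate_replication` is not taken from the article.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_replication(n, mu, sigma, a, b, tau, seed=None):
    """Draw new random uniform DIF effects and simulate 2PL item responses.

    Returns the DIF effects e and an n x I list of 0/1 item responses.
    """
    rng = random.Random(seed)
    # Assumption: normally distributed DIF effects with mean 0 and SD tau.
    e = [rng.gauss(0.0, tau) for _ in a]
    responses = []
    for _ in range(n):
        theta = rng.gauss(mu, sigma)             # ability of one person
        row = [int(rng.random() < logistic(ai * (theta - bi - ei)))
               for ai, bi, ei in zip(a, b, e)]   # DIF shifts the difficulty
        responses.append(row)
    return e, responses


# Hypothetical item parameters for I = 5 items.
a = [0.7, 0.9, 1.0, 1.1, 1.3]
b = [-2.0, -1.0, 0.0, 1.0, 2.0]
e, data = simulate_replication(n=250, mu=0.3, sigma=1.2, a=a, b=b,
                               tau=0.3, seed=1)
e_none, data_none = simulate_replication(n=250, mu=0.3, sigma=1.2, a=a, b=b,
                                         tau=0.0, seed=1)
```

Calling the function once per replication draws a fresh set of DIF effects each time, in line with the statement that random DIF effects were simulated in each replication.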

FIPC was applied as the analysis model in all conditions. The item parameters of the 2PL model were fixed at the prespecified

a_{i}

and

b_{i}

values, as described above.

In each of the 5 (sample size N) × 7 (DIF standard deviation

τ

) = 35 cells of the simulation study, 20,000 replications were conducted. We computed the empirical bias and the root mean square error (RMSE) for the estimated mean

\hat{μ}

and the estimated standard deviation

\hat{σ}

to assess the estimation quality of parameter estimates (see [44]).
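The two performance measures can be sketched with their usual definitions: the bias as the average estimate minus the true value, and the RMSE as the square root of the average squared deviation from the true value. The function name `bias_rmse` and the toy estimates are illustrative assumptions.

```python
import math


def bias_rmse(estimates, true_value):
    """Empirical bias and root mean square error across replications."""
    n = len(estimates)
    bias = sum(estimates) / n - true_value
    rmse = math.sqrt(sum((est - true_value) ** 2 for est in estimates) / n)
    return bias, rmse


# Toy example: two hypothetical estimates of sigma with true value 1.2.
bias, rmse = bias_rmse([1.1, 1.3], true_value=1.2)
```

In the toy example, the two estimates deviate symmetrically from the true value, so the bias vanishes while the RMSE equals the common absolute deviation.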

The R software (version 4.4.1) [45] was used for the entire analysis in this simulation study. The 2PL model was fitted using the R package TAM [46]. Dedicated R functions for this simulation study were written by the author of this article. These functions and the replication material for this simulation study can be found at https://osf.io/uf2xc (accessed on 4 August 2024).

3.2. Results

Table 1 reports the bias and the RMSE for the estimates of the mean

μ

and the standard deviation

σ

. Overall, it turned out that the bias was larger for

σ

than for

μ

. Moreover, the bias increased as a function of the standard deviation

τ

. The bias exceeded a non-negligible cutoff value of 0.010 (in absolute value) for

μ

for

τ \geq 0.5

and for

σ

for

τ \geq 0.3

. Also, the bias was slightly larger for smaller sample sizes compared to large sample sizes or an infinite sample size. For a fixed value of DIF standard deviation

τ

, the RMSE was smallest for an infinite sample size. In particular, the RMSE was zero in the case of an infinite sample size of subjects (i.e.,

N = Inf

) and no DIF (i.e.,

τ = 0

).

Table 1. Simulation study: Bias and root mean square error (RMSE) for the estimate of the mean

μ

and standard deviation

σ

as a function of the uniform DIF standard deviation

τ

and sample size N.

The results in Table 1 are also graphically displayed in Figure 1. Notably, bias and RMSE for

μ

and

σ

increase with increasing DIF standard deviation

τ

. The bias is hardly affected by sample size. The RMSE is almost perfectly linearly related to

τ

for an infinite sample size. Moreover, the RMSE increases with decreasing sample size N.

Figure 1. Simulation study: Graphical visualization of the bias and root mean square error (RMSE) for the estimate of the mean

μ

and standard deviation

σ

as a function of the uniform DIF standard deviation

τ

and sample size N.

4. Discussion

Through analytical arguments and a simulation study, this article shows that the FIPC method provides biased estimates in the presence of random uniform DIF. Analytical arguments demonstrated that the bias was a function of the DIF standard deviation

τ

. The analytically obtained results were also confirmed based on a simulation study. Overall, the bias in the estimated standard deviation

\hat{σ}

was larger than in the estimated mean

\hat{μ}

.

Note that we derived the bias of distribution parameter estimates under a random DIF assumption. The derivation of the bias under fixed DIF could also rely on the Taylor expansion (9). However, it is likely that only the linear expansion with respect to

e_{i}

must be considered because we now cannot assume

E (e_{i}) = 0

as in the random DIF case. Note that identification assumptions must be imposed in the fixed DIF case (e.g., all DIF effects sum to zero; see [47,48,49]). This property illustrates the circularity of the DIF definition in the case of fixed DIF, and general statements of the bias of a linking method like FIPC crucially depend on the chosen identification assumptions imposed by individual researchers [50].

Our derivation of the bias and the variance only involved uniform DIF effects (i.e., DIF effects in item difficulty). However, the Taylor expansion could be extended to include nonuniform DIF effects (i.e., DIF effects in item discriminations). Future research might focus on generalizing the findings to the more general DIF case.

Moreover, we only investigated the 2PL model with random DIF. The findings regarding the FIPC method can most likely be transferred to more complex IRT models, such as the three-parameter logistic (3PL) model [51] or other IRT models [52,53,54].

It should be emphasized that our findings were also observed for the nonlinear Haebara and Stocking–Lord linking methods [32]. These linking methods (see [55,56,57,58] for general treatments) also result in biased means and standard deviations in the presence of random uniform DIF. In contrast, the mean–geometric mean linking method [55,59] that directly operates on estimated item parameters typically provides unbiased estimates [32]. Except for very small sample sizes, we believe that mean–geometric mean linking should be the preferred scaling (i.e., linking) method in applications if researchers cannot exclude the situation of random DIF.

Our study did not consider the estimation of the variance of distribution parameters in FIPC estimation, which is important for statistical inference. Linking errors for the FIPC method can be obtained by using an item jackknife that repeatedly estimates the IRT model with FIPC while excluding one item or one group of items [60]. Computational shortcuts for item jackknife for the FIPC method using the 2PL model were discussed in [61]. In the statistical inference of distribution parameters, standard errors due to the sampling of persons and linking errors due to randomness in item parameters must be disentangled. Ref. [28] addressed this topic for linking methods, but the estimation technique does not apply to the FIPC method.
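The idea of an item jackknife can be sketched generically. The sketch uses the standard delete-one jackknife variance formula, which is an assumption here and not necessarily the exact variant of [60] or [61]. The argument `estimate_fn` is a hypothetical placeholder for any procedure that returns a distribution parameter estimate from a subset of items, for example a FIPC estimate of the mean.

```python
import math


def item_jackknife_se(items, estimate_fn):
    """Delete-one item jackknife standard error (linking error sketch).

    items:       list of items (any objects describing the items)
    estimate_fn: callable mapping a list of items to a scalar estimate
    """
    n = len(items)
    # Re-estimate the parameter with each item excluded in turn.
    loo = [estimate_fn(items[:i] + items[i + 1:]) for i in range(n)]
    center = sum(loo) / n
    var = (n - 1) / n * sum((est - center) ** 2 for est in loo)
    return math.sqrt(var)


# Toy illustration: the "estimate" is the mean of item-level values.
toy_items = [1.0, 2.0, 3.0, 4.0]
se = item_jackknife_se(toy_items, lambda sub: sum(sub) / len(sub))
```

For the toy estimator (a simple mean), the jackknife standard error coincides with the usual standard error of the mean, which serves as a quick plausibility check of the formula. Excluding groups of items instead of single items would require replacing the delete-one loop by a loop over item groups.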

By interpreting the findings for FIPC and the Haebara and Stocking–Lord linking methods, it appears that the source of the bias is the attempt to align a misspecified item response function that contains DIF onto an item response function that does not represent the DIF in its item parameters. A misfitting item results in a smaller estimated item discrimination. Because the estimated standard deviation represents an average estimated item discrimination, the bias in the estimated mean and standard deviation can be explained as a consequence of item misfit. Hence, the bias of FIPC also transfers to the recalibration linking technique [62,63] that is used in some large-scale educational studies [64,65].

Our findings also shed some light on the practical applications of linking methods to educational large-scale assessment studies [66], like the Programme for International Student Assessment (PISA; [67]). Older PISA studies effectively implemented the FIPC method for country comparisons [68]. More recent PISA studies opted for a partial invariance approach in which country-specific item parameters were assumed in the 2PL model only for the items with the largest DIF effects [68,69,70]. Hence, an item with a large DIF effect in one country is essentially excluded from country comparisons that involve this country [50]. Nevertheless, there is very likely random DIF in the remaining items, which will subsequently (slightly) bias the distribution parameters. Hence, there is no statistical evidence for preferring scaling models with invariant item parameters (i.e., using the FIPC method for a country) or partially invariant item parameters (i.e., using the FIPC method for a subset of items for a country) over linking methods that allow for DIF effects in all items. However, identification assumptions are imposed by model choice, and any assumption can, in principle, be defended due to the definitional ambiguity.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

2PL	two-parameter logistic
FIPC	fixed item parameter calibration
Inf	infinite sample size
IRF	item response function
IRT	item response theory
RMSE	root mean square error

References

Bock, R.D.; Moustaki, I. Item response theory in a general framework. In Handbook of Statistics, Vol. 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Elsevier: Amsterdam, The Netherlands, 2007; pp. 469–513. [Google Scholar] [CrossRef]
Chen, Y.; Li, X.; Liu, J.; Ying, Z. Item Response Theory—A Statistical Framework for Educational and Psychological Measurement. 2024. Ahead of Print. Available online: https://arxiv.org/abs/2108.08604 (accessed on 4 August 2024).
De Ayala, R.J. The Theory and Practice of Item Response Theory; Guilford Publications: New York, NY, USA, 2022. [Google Scholar]
Formann, A.K.; Kohlmann, T. Structural latent class models. Sociol. Methods Res. 1998, 26, 530–565. [Google Scholar] [CrossRef]
Martinková, P.; Hladká, A. Computational Aspects of Psychometric Methods: With R; Chapman and Hall/CRC: Boca Raton, FL, USA, 2023. [Google Scholar] [CrossRef]
Noventa, S.; Heller, J.; Kelava, A. Toward a unified perspective on assessment models, part I: Foundations of a framework. J. Math. Psychol. 2024, 122, 102872. [Google Scholar] [CrossRef]
Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, WA, USA, 2006; pp. 111–154. [Google Scholar]
van der Linden, W.J. Unidimensional logistic response models. In Handbook of Item Response Theory, Volume 1: Models; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 11–30. [Google Scholar] [CrossRef]
Meijer, R.R.; Tendeiro, J.N. Unidimensional item response theory. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 413–443. [Google Scholar] [CrossRef]
Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479. [Google Scholar]
von Davier, M. A general diagnostic model applied to language testing data. Br. J. Math. Stat. Psychol. 2008, 61, 287–307. [Google Scholar] [CrossRef] [PubMed]
von Davier, M.; Yamamoto, K. Partially observed mixtures of IRT models: An extension of the generalized partial-credit model. Appl. Psychol. Meas. 2004, 28, 389–406. [Google Scholar] [CrossRef]
Bock, R.D.; Aitkin, M. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika 1981, 46, 443–459. [Google Scholar] [CrossRef]
Glas, C.A.W. Maximum-likelihood estimation. In Handbook of Item Response Theory, Vol. 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 197–216. [Google Scholar] [CrossRef]
Kang, T.; Petersen, N.S. Linking item parameters to a base scale. Asia Pac. Educ. Rev. 2012, 13, 311–321. [Google Scholar] [CrossRef]
Kim, K.Y. Two IRT fixed parameter calibration methods for the bifactor model. J. Educ. Meas. 2020, 57, 29–50. [Google Scholar] [CrossRef]
Kim, S. A comparative study of IRT fixed parameter calibration methods. J. Educ. Meas. 2006, 43, 355–381. [Google Scholar] [CrossRef]
Kim, S.; Kolen, M.J. Application of IRT fixed parameter calibration to multiple-group test data. Appl. Meas. Educ. 2019, 32, 310–324. [Google Scholar] [CrossRef]
König, C.; Khorramdel, L.; Yamamoto, K.; Frey, A. The benefits of fixed item parameter calibration for parameter accuracy in small sample situations in large-scale assessments. Educ. Meas. Issues Pract. 2021, 40, 17–27. [Google Scholar] [CrossRef]
Magis, D.; Béland, S.; Tuerlinckx, F.; De Boeck, P. A general framework and an R package for the detection of dichotomous differential item functioning. Behav. Res. Methods 2010, 42, 847–862. [Google Scholar] [CrossRef]
Mellenbergh, G.J. Item bias and item response theory. Int. J. Educ. Res. 1989, 13, 127–143. [Google Scholar] [CrossRef]
Millsap, R.E. Statistical Approaches to Measurement Invariance; Routledge: New York, NY, USA, 2011. [Google Scholar] [CrossRef]
Penfield, R.D.; Camilli, G. Differential item functioning and item bias. In Handbook of Statistics, Vol. 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Elsevier: Amsterdam, The Netherlands, 2007; pp. 125–167. [Google Scholar] [CrossRef]
Soares, T.M.; Gonçalves, F.B.; Gamerman, D. An integrated Bayesian model for DIF analysis. J. Educ. Behav. Stat. 2009, 34, 348–377. [Google Scholar] [CrossRef]
Michaelides, M.P.; Haertel, E.H. Selection of common items as an unrecognized source of variability in test equating: A bootstrap approximation assuming random sampling of common items. Appl. Meas. Educ. 2014, 27, 46–57. [Google Scholar] [CrossRef]
Monseur, C.; Berezner, A. The computation of equating errors in international surveys in education. J. Appl. Meas. 2007, 8, 323–335. Available online: https://bit.ly/2WDPeqD (accessed on 4 August 2024). [PubMed]
Robitzsch, A. Linking error in the 2PL model. J 2023, 6, 58–84. [Google Scholar] [CrossRef]
Robitzsch, A. Estimation of standard error, linking error, and total error for robust and nonrobust linking methods in the two-parameter logistic model. Stats 2024, 7, 592–612. [Google Scholar] [CrossRef]
Sachse, K.A.; Roppelt, A.; Haag, N. A comparison of linking methods for estimating national trends in international comparative large-scale assessments in the presence of cross-national DIF. J. Educ. Meas. 2016, 53, 152–171. [Google Scholar] [CrossRef]
Sachse, K.A.; Haag, N. Standard errors for national trends in international large-scale assessments in the case of cross-national differential item functioning. Appl. Meas. Educ. 2017, 30, 102–116. [Google Scholar] [CrossRef]
Wu, M. Measurement, sampling, and equating errors in large-scale assessments. Educ. Meas. Issues Pract. 2010, 29, 15–27. [Google Scholar] [CrossRef]
Robitzsch, A. Bias-reduced Haebara and Stocking-Lord linking. J 2024, 7, 373–384.
Robitzsch, A. SIMEX-based and analytical bias corrections in Stocking-Lord linking. Analytics 2024, 3, 368–388.
De Boeck, P. Random item IRT models. Psychometrika 2008, 73, 533–559.
Fox, J.P. Bayesian Item Response Modeling; Springer: New York, NY, USA, 2010.
Fox, J.P.; Verhagen, A.J. Random item effects modeling for cross-national survey data. In Cross-Cultural Analysis: Methods and Applications; Davidov, E., Schmidt, P., Billiet, J., Eds.; Routledge: London, UK, 2010; pp. 461–482.
de Jong, M.G.; Steenkamp, J.B.E.M.; Fox, J.P. Relaxing measurement invariance in cross-national consumer research using a hierarchical IRT model. J. Consum. Res. 2007, 34, 260–278.
Longford, N.T.; Holland, P.W.; Thayer, D.T. Stability of the MH D-DIF statistics across populations. In Differential Item Functioning; Holland, P.W., Wainer, H., Eds.; Routledge: London, UK, 1993; pp. 171–196.
Van den Noortgate, W.; De Boeck, P. Assessing and explaining differential item functioning using logistic mixed models. J. Educ. Behav. Stat. 2005, 30, 443–464.
Robitzsch, A. Robust and nonrobust linking of two groups for the Rasch model with balanced and unbalanced random DIF: A comparative simulation study and the simultaneous assessment of standard errors and linking errors with resampling techniques. Symmetry 2021, 13, 2198.
Bock, R.D.; Gibbons, R.D. Item Response Theory; Wiley: Hoboken, NJ, USA, 2021.
Boos, D.D.; Stefanski, L.A. Essential Statistical Inference; Springer: New York, NY, USA, 2013.
Penfield, R.D.; Algina, J. A generalized DIF effect variance estimator for measuring unsigned differential test functioning in mixed format tests. J. Educ. Meas. 2006, 43, 295–312.
Morris, T.P.; White, I.R.; Crowther, M.J. Using simulation studies to evaluate statistical methods. Stat. Med. 2019, 38, 2074–2102.
R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2023; Available online: https://www.R-project.org (accessed on 15 March 2023).
Robitzsch, A.; Kiefer, T.; Wu, M. TAM: Test Analysis Modules. 2024. R Package Version 4.2-21. Available online: https://doi.org/10.32614/CRAN.package.TAM (accessed on 19 February 2024).
Bechger, T.M.; Maris, G. A statistical test for differential item pair functioning. Psychometrika 2015, 80, 317–340.
Camilli, G. The case against item bias detection techniques based on internal criteria: Do item bias procedures obscure test fairness issues? In Differential Item Functioning: Theory and Practice; Holland, P.W., Wainer, H., Eds.; Erlbaum: Hillsdale, NJ, USA, 1993; pp. 397–417.
Doebler, A. Looking at DIF from a new perspective: A structure-based approach acknowledging inherent indefinability. Appl. Psychol. Meas. 2019, 43, 303–321.
Robitzsch, A.; Lüdtke, O. A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psychol. Test Assess. Model. 2020, 62, 233–279. Available online: https://bit.ly/3ezBB05 (accessed on 4 August 2024).
Lord, F.M.; Novick, M.R. Statistical Theories of Mental Test Scores; Addison-Wesley: Reading, MA, USA, 1968.
Bolt, D.M.; Deng, S.; Lee, S. IRT model misspecification and measurement of growth in vertical scaling. J. Educ. Meas. 2014, 51, 141–162.
Loken, E.; Rulison, K.L. Estimation of a four-parameter item response theory model. Br. J. Math. Stat. Psychol. 2010, 63, 509–525.
Shim, H.; Bonifay, W.; Wiedermann, W. Parsimonious asymmetric item response theory modeling with the complementary log-log link. Behav. Res. Methods 2023, 55, 200–219.
Kolen, M.J.; Brennan, R.L. Test Equating, Scaling, and Linking; Springer: New York, NY, USA, 2014.
Lee, W.C.; Lee, G. IRT linking and equating. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 639–673.
von Davier, M.; von Davier, A.A. A unified approach to IRT scale linking and scale transformations. Methodology 2007, 3, 115–124.
von Davier, A.A.; Carstensen, C.H.; von Davier, M. Linking Competencies in Educational Settings and Measuring Growth; (Research Report No. RR-06-12); Educational Testing Service: Princeton, NJ, USA, 2006.
Haberman, S.J. Linking Parameter Estimates Derived from an Item Response Model through Separate Calibrations; (Research Report No. RR-09-40); Educational Testing Service: Princeton, NJ, USA, 2009.
Monseur, C.; Sibberns, H.; Hastedt, D. Linking errors in trend estimation for international surveys in education. IERI Monogr. Ser. 2008, 1, 113–122.
Robitzsch, A. Analytical approximation of the jackknife linking error in item response models utilizing a Taylor expansion of the log-likelihood function. AppliedMath 2023, 3, 49–59.
Martin, M.O.; Mullis, I.V.S.; Foy, P.; Brossman, B.; Stanco, G.M. Estimating linking error in PIRLS. IERI Monogr. Ser. 2012, 5, 35–47.
Robitzsch, A. A comparison of linking methods for two groups for the two-parameter logistic item response model in the presence and absence of random differential item functioning. Foundations 2021, 1, 116–144.
Foy, P.; Yin, L. Scaling the PIRLS 2016 achievement data. In Methods and Procedures in PIRLS 2016; Martin, M.O., Mullis, I.V., Hooper, M., Eds.; IEA: Boston College, Chestnut Hill, MA, USA, 2017.
Foy, P.; Fishbein, B.; von Davier, M.; Yin, L. Implementing the TIMSS 2019 scaling methodology. In Methods and Procedures: TIMSS 2019 Technical Report; Martin, M.O., von Davier, M., Mullis, I.V., Eds.; IEA: Boston College, Chestnut Hill, MA, USA, 2020.
Rutkowski, L.; von Davier, M.; Rutkowski, D. (Eds.) A Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis; Chapman Hall/CRC Press: London, UK, 2013.
OECD. PISA 2018. Technical Report; OECD: Paris, France, 2020; Available online: https://bit.ly/3zWbidA (accessed on 4 August 2024).
OECD. PISA 2012. Technical Report; OECD: Paris, France, 2014; Available online: https://bit.ly/2YLG24g (accessed on 4 August 2024).
Oliveri, M.E.; von Davier, M. Investigation of model fit and score scale comparability in international assessments. Psychol. Test Assess. Model. 2011, 53, 315–333. Available online: https://bit.ly/3k4K9kt (accessed on 4 August 2024).
von Davier, M.; Yamamoto, K.; Shin, H.J.; Chen, H.; Khorramdel, L.; Weeks, J.; Davis, S.; Kong, N.; Kandathil, M. Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assess. Educ. Princ. Policy Pract. 2019, 26, 466–488.

Figure 1. Simulation study: Graphical visualization of the bias and root mean square error (RMSE) for the estimate of the mean $μ$ and standard deviation $σ$ as a function of the uniform DIF standard deviation $τ$ and sample size N.

Table 1. Simulation study: Bias and root mean square error (RMSE) for the estimate of the mean $μ$ and standard deviation $σ$ as a function of the uniform DIF standard deviation $τ$ and sample size N.

		Bias					RMSE
		N					N
Par	$τ$	250	500	1000	2000	Inf	250	500	1000	2000	Inf
$μ$	0	−0.001	−0.001	0.000	0.000	0.000	0.086	0.061	0.043	0.030	0.000
	0.1	0.000	0.000	0.000	0.000	−0.001	0.091	0.067	0.051	0.041	0.027
	0.2	−0.003	−0.001	−0.002	−0.003	−0.002	0.102	0.081	0.069	0.062	0.054
	0.3	−0.005	−0.004	−0.006	−0.005	−0.006	0.118	0.102	0.091	0.086	0.081
	0.4	−0.010	−0.009	−0.008	−0.010	−0.010	0.139	0.124	0.116	0.111	0.108
	0.5	−0.014	−0.014	−0.015	−0.015	−0.016	0.160	0.148	0.142	0.137	0.134
	0.6	−0.020	−0.020	−0.021	−0.021	−0.022	0.182	0.171	0.166	0.163	0.161
$σ$	0	−0.004	−0.002	−0.001	0.000	0.000	0.075	0.053	0.038	0.027	0.000
	0.1	−0.005	−0.003	−0.002	−0.002	−0.001	0.077	0.055	0.040	0.031	0.016
	0.2	−0.009	−0.007	−0.005	−0.005	−0.005	0.081	0.062	0.049	0.041	0.032
	0.3	−0.013	−0.013	−0.011	−0.011	−0.010	0.089	0.071	0.061	0.054	0.048
	0.4	−0.022	−0.021	−0.019	−0.019	−0.018	0.098	0.083	0.074	0.069	0.064
	0.5	−0.033	−0.031	−0.030	−0.028	−0.029	0.109	0.096	0.088	0.083	0.079
	0.6	−0.044	−0.043	−0.042	−0.041	−0.041	0.121	0.109	0.102	0.099	0.096

Note. Par = parameter; Inf = infinite sample size; biases with absolute values larger than 0.010 are printed in bold font.
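
The bias and RMSE columns of Table 1 can be related through the standard identity RMSE² = bias² + SD², where SD is the standard deviation of the estimates across replications. The following Python sketch is only an illustration and is not the author's simulation code: the function names `bias_and_rmse` and `sd_from_bias_rmse` and the toy replication values are assumptions introduced here. It shows how the two summaries are typically computed from replicated estimates and how the variability implied by a table entry can be recovered.

```python
import math


def bias_and_rmse(estimates, true_value):
    """Empirical bias and RMSE of replicated estimates of one parameter."""
    n = len(estimates)
    bias = sum(estimates) / n - true_value
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    return bias, rmse


def sd_from_bias_rmse(bias, rmse):
    """Standard deviation implied by RMSE^2 = bias^2 + SD^2."""
    return math.sqrt(max(rmse ** 2 - bias ** 2, 0.0))


# Toy replications (hypothetical values, not taken from the study).
toy_bias, toy_rmse = bias_and_rmse([0.9, 1.1, 1.0, 1.2], true_value=1.0)

# Entry read from Table 1: mean, tau = 0.6, infinite sample size.
sd_mu = sd_from_bias_rmse(bias=-0.022, rmse=0.161)
print(f"implied SD of the estimated mean: {sd_mu:.4f}")
```

For the infinite sample size columns, sampling error is absent, so the implied standard deviation reflects only the variability induced by random DIF. In this entry the bias contributes little to the RMSE, which is dominated by that variability.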

© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
