1. Introduction
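Item response theory (IRT) models [1, 2] are statistical models for multivariate discrete variables. IRT models are frequently applied in educational or psychological research. This article investigates group comparisons in unidimensional IRT models [3]. Let $\mathbf{X} = (X_1, \ldots, X_I)$ be the vector of $I$ dichotomous (i.e., binary) random variables $X_i \in \{0, 1\}$ that are typically referred to as items in the psychometric literature. A unidimensional IRT model [4] is a statistical model for the probability distribution $P(\mathbf{X} = \mathbf{x})$ for $\mathbf{x} = (x_1, \ldots, x_I) \in \{0, 1\}^I$, where
$$P(\mathbf{X} = \mathbf{x}; \boldsymbol{\delta}, \boldsymbol{\gamma}) = \int \prod_{i=1}^{I} P_i(\theta; \boldsymbol{\gamma}_i)^{x_i} \left[ 1 - P_i(\theta; \boldsymbol{\gamma}_i) \right]^{1 - x_i} \phi(\theta; \mu, \sigma) \, \mathrm{d}\theta . \quad (1)$$
In (1), $\phi$ denotes the density of the normal distribution with mean $\mu$ and standard deviation $\sigma$. The parameters of the distribution for the latent (factor) variable $\theta$ (also labeled as a trait or ability) are contained in the vector $\boldsymbol{\delta} = (\mu, \sigma)$ of the distribution parameters. The vector $\boldsymbol{\gamma} = (\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_I)$ contains all the estimated item parameters of the item response functions (IRF) $P_i(\theta; \boldsymbol{\gamma}_i) = P(X_i = 1 \mid \theta)$ ($i = 1, \ldots, I$). The two-parameter logistic (2PL) model [5] possesses the IRF
$$P_i(\theta; \boldsymbol{\gamma}_i) = \Psi\left( a_i (\theta - b_i) \right) , \quad (2)$$
where $a_i$ and $b_i$ are the item discrimination and item difficulty, respectively, and $\Psi(x) = [1 + \exp(-x)]^{-1}$ denotes the logistic distribution function. For independent and identically distributed realizations of the random variable $\mathbf{X}$, the unknown model parameters in (1) can be estimated by (marginal) maximum likelihood estimation [6, 7].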
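As a minimal sketch of the model in (1) and the 2PL item response function, the following Python code evaluates the probability of a response pattern by numerical integration over the ability distribution. The function names, the rectangle-rule quadrature on an equidistant grid, and the number of nodes are illustrative assumptions and are not the estimation routines used in this article.

```python
import math
from itertools import product


def logistic(x):
    """Numerically stable logistic distribution function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def irf_2pl(theta, a, b):
    """2PL item response function with discrimination a and difficulty b."""
    return logistic(a * (theta - b))


def normal_density(theta, mu=0.0, sigma=1.0):
    z = (theta - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def pattern_probability(x, a, b, mu=0.0, sigma=1.0, n_nodes=121):
    """Approximate P(X = x) by a rectangle rule with normalized weights.

    The grid (mu +/- 6 sigma) and n_nodes are assumptions of this sketch.
    """
    nodes = [mu + sigma * (-6.0 + 12.0 * t / (n_nodes - 1)) for t in range(n_nodes)]
    dens = [normal_density(th, mu, sigma) for th in nodes]
    total = sum(dens)
    prob = 0.0
    for th, d in zip(nodes, dens):
        lik = 1.0
        for x_i, a_i, b_i in zip(x, a, b):
            p = irf_2pl(th, a_i, b_i)
            lik *= p if x_i == 1 else 1.0 - p
        prob += lik * d / total
    return prob


# Hypothetical three-item test; the pattern probabilities sum to one.
a = [1.0, 1.5, 0.8]
b = [-0.5, 0.0, 0.7]
total_prob = sum(
    pattern_probability(x, a, b, mu=0.3, sigma=1.2)
    for x in product([0, 1], repeat=3)
)
print(total_prob)
```

Because the quadrature weights are normalized, the probabilities of all $2^I$ response patterns sum to one up to rounding error, which is a simple check of the implementation.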
In many educational applications, IRT models are employed to compare the distribution of two groups in a test (i.e., on a set of items) regarding the factor variable $\theta$ in the IRT model (1). Linking methods [8, 9, 10] estimate the 2PL model separately in the two groups in the first step. In the second step, they process the estimated item parameters to calculate the difference between the two groups in the mean $\mu$ and the standard deviation $\sigma$. The separate application of the 2PL model in each of the two groups allows the items to function differently across the groups. This property is termed differential item functioning (DIF; [11, 12]) in the literature. It has been pointed out that the occurrence of DIF causes additional variability in the estimated mean $\hat{\mu}$ and standard deviation $\hat{\sigma}$ when applying a linking method [13, 14, 15, 16]. Moreover, it has been shown that DIF can bias the group differences when applying a linking method [17].
We now discuss the group comparisons within the 2PL model in the presence of DIF. Assume that the 2PL model holds in the first group, with the item discriminations and item difficulties in the first group being defined as $a_{i1} = a_i$ and $b_{i1} = b_i$, respectively. For identification reasons, we assume $\theta \sim N(0, 1)$ in the first group. In the second group, the ability distribution $\theta \sim N(\mu, \sigma^2)$ is assumed, where $\mu$ and $\sigma$ are the mean and the standard deviation (SD), respectively. The item discriminations are assumed to be invariant across the two groups; that is, item $i$ has the discrimination $a_i$ in both groups. A random uniform DIF [11] effect $e_i$ is assumed for the item difficulties [18, 19]. Then, the item difficulties $b_i^{*}$ in the second group are provided by
$$b_i^{*} = b_i + e_i \quad \text{with} \quad E(e_i) = 0 \ \text{ and } \ \mathrm{Var}(e_i) = \tau^2 . \quad (3)$$
The variance $\tau^2$ is also called the DIF variance [20], and $\tau$ is labeled the DIF SD.
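A linking method consists of two steps. In the first step, the 2PL model is separately fitted within the two groups while assuming a standard normal distribution $N(0, 1)$ for the ability variable $\theta$. Due to sampling errors (i.e., the sampling of persons), the estimated item parameters $\hat{a}_{i1}$ and $\hat{b}_{i1}$ will slightly differ from the data-generating parameters $a_i$ and $b_i$. In the second group, the original item parameters $a_i$ and $b_i + e_i$ are not recovered because the 2PL model is fitted with $\theta \sim N(0, 1)$, but the data-generating model imposes $\theta \sim N(\mu, \sigma^2)$. Writing $\theta = \sigma \theta^{*} + \mu$ with $\theta^{*} \sim N(0, 1)$, a simple calculation provides
$$a_i (\sigma \theta^{*} + \mu - b_i - e_i) = a_i \sigma \left( \theta^{*} - \sigma^{-1} (b_i + e_i - \mu) \right) . \quad (4)$$
Therefore, the identified item parameters in the second group are provided by
$$a_{i2} = a_i \sigma \quad \text{and} \quad b_{i2} = \sigma^{-1} (b_i + e_i - \mu) . \quad (5)$$
Due to sampling errors, the estimated item parameters $\hat{a}_{i1}$ and $\hat{b}_{i1}$ (or $\hat{a}_{i2}$ and $\hat{b}_{i2}$) will slightly differ from $a_{i1}$ and $b_{i1}$ (or $a_{i2}$ and $b_{i2}$).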
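The reparametrization of the second group can be checked numerically. The sketch below is illustrative only: the function name `identified_parameters` and all numerical values are hypothetical. It verifies that the data-generating 2PL model with ability $\theta \sim N(\mu, \sigma^2)$ and the model with the identified item parameters and standardized ability $\theta^{*}$ yield the same response probability.

```python
import math


def logistic(x):
    """Numerically stable logistic distribution function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def identified_parameters(a, b, e, mu, sigma):
    """Item parameters identified in group 2 when theta is fitted as N(0, 1)."""
    return a * sigma, (b + e - mu) / sigma


a, b, e = 1.2, 0.4, 0.25   # hypothetical item with uniform DIF effect e
mu, sigma = 0.3, 1.2       # hypothetical mean and SD of the second group
a2, b2 = identified_parameters(a, b, e, mu, sigma)

theta_star = -0.7                   # ability on the standardized scale
theta = sigma * theta_star + mu     # same ability on the original scale

p_original = logistic(a * (theta - b - e))
p_identified = logistic(a2 * (theta_star - b2))
print(p_original, p_identified)
```

Both probabilities agree up to rounding error, which is the content of the simple calculation in the text.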
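In the presence of sampling errors or DIF, a linking function $H$ is chosen that estimates $\mu$ and $\sigma$ based on the estimated item parameters $\hat{a}_{ig}$ and $\hat{b}_{ig}$ for $i = 1, \ldots, I$ and $g = 1, 2$. The Stocking–Lord (SL; [21]) linking method determines the parameter of interest $\boldsymbol{\delta} = (\mu, \sigma)$ as a minimizer of the weighted squared distance of the test characteristic functions
$$H(\mu, \sigma) = \sum_{t=1}^{T} \omega_t \left[ \sum_{i=1}^{I} \Psi\left( \hat{a}_{i1} (\theta_t - \hat{b}_{i1}) \right) - \sum_{i=1}^{I} \Psi\left( \hat{a}_{i2} \sigma^{-1} (\theta_t - \sigma \hat{b}_{i2} - \mu) \right) \right]^2 . \quad (6)$$
In (6), the IRFs are evaluated on a finite grid of $\theta$ values $\theta_1, \ldots, \theta_T$, and $\omega_t$ represents the known weight. Previous research highlighted that SL linking is superior to other linking methods, such as Haebara [22] linking [23, 24, 25]. However, the conclusion of this research was based on the absence of DIF. It has been shown in [17] that the presence of (uniform) DIF can lead to biased estimates of the mean $\mu$ and the SD $\sigma$, which quantify the differences between the two groups.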
In this article, we study bias correction methods for SL linking. Simulation extrapolation (SIMEX; [26]) is a method to correct for bias due to measurement error. It is particularly suited for nonlinear statistical models because it is entirely simulation-based and does not require analytical derivations that must be carried out for a particular model under study. This article interprets uniform DIF as a measurement error in applying the SIMEX method to the SL linking method, corresponding to a particular nonlinear model. Our proposed SIMEX method is compared with an alternative analytical bias correction that has been recently proposed for a wide class of linking methods [27].
The rest of the article is organized as follows. In Section 2, we review the analytical bias correction methods in SL linking. The newly proposed SIMEX-based bias correction method for SL linking is presented in Section 3. Section 4 presents a numerical illustration based on idealized datasets to demonstrate the bias in the original SL linking. The newly proposed SIMEX-based SL linking is compared with the original SL linking and other analytical bias correction SL methods through three simulation studies in Section 5. Finally, the article closes with a discussion in Section 6.
2. Analytical Bias Correction in Stocking–Lord Linking
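In this section, analytical bias correction for a general linking method [27], including the SL linking method as an example, is reviewed. In a linking method, the vector $\boldsymbol{\delta} = (\mu, \sigma)$ is the statistical parameter of interest. The parameter estimate for $\boldsymbol{\delta}$ is denoted as $\hat{\boldsymbol{\delta}}$. The linking function $H$ is defined as a function of $\boldsymbol{\delta}$ and estimated item parameters $\hat{\boldsymbol{\gamma}}_g$ (where $g = 1, 2$) in the two groups such that
$$\hat{\boldsymbol{\delta}} = \arg\min_{\boldsymbol{\delta}} H(\boldsymbol{\delta}, \hat{\boldsymbol{\gamma}}_1, \hat{\boldsymbol{\gamma}}_2) , \quad (7)$$
where $H$ is a sufficiently smooth (i.e., at least three times differentiable) function. The SL linking function defined in (6) is an example of the general formulation (7). The parameter estimate $\hat{\boldsymbol{\delta}}$ fulfills the estimating equation
$$H_{\boldsymbol{\delta}}(\hat{\boldsymbol{\delta}}, \hat{\boldsymbol{\gamma}}_1, \hat{\boldsymbol{\gamma}}_2) = \mathbf{0} , \quad \text{where } H_{\boldsymbol{\delta}} = \partial H / \partial \boldsymbol{\delta} . \quad (8)$$
The linking method that is associated with the linking function $H$ recovers the true group mean $\mu_0$ and the true group SD $\sigma_0$ if there is no sampling error and no DIF (i.e., $\tau = 0$). This property is formalized as
$$H_{\boldsymbol{\delta}}(\boldsymbol{\delta}_0, \boldsymbol{\gamma}_1, \boldsymbol{\gamma}_2) = \mathbf{0} , \quad (9)$$
where $\boldsymbol{\delta}_0 = (\mu_0, \sigma_0)$, the group-wise item parameters $\boldsymbol{\gamma}_1$ and $\boldsymbol{\gamma}_2$ are implied by the joint item parameters $a_i$ and $b_i$, and all uniform DIF effects $e_i$ are assumed to be zero.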
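The analytical bias correction due to the presence of DIF effects $e_i$ is based on a second-order Taylor expansion of $H_{\boldsymbol{\delta}}$. The estimated item parameters $\hat{\boldsymbol{\gamma}}_{ig}$ for item $i$ are a function of a common item parameter $\boldsymbol{\gamma}_i$ and the uniform DIF effect $e_i$. The estimating Equation (8) can be rewritten as
$$H_{\boldsymbol{\delta}}(\hat{\boldsymbol{\delta}}, \boldsymbol{\gamma}, \mathbf{e}) = \mathbf{0} , \quad (10)$$
where $\boldsymbol{\gamma}$ collects all item parameters and $\mathbf{e} = (e_1, \ldots, e_I)$ denotes the vector of DIF effects. Note that $H_{\boldsymbol{\delta}}(\boldsymbol{\delta}_0, \boldsymbol{\gamma}, \mathbf{e}) = \mathbf{0}$ for $\mathbf{e} = \mathbf{0}$.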
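By assuming independent DIF effects $e_i$ for $i = 1, \ldots, I$, we apply a first-order Taylor expansion with respect to $\boldsymbol{\delta}$ and a second-order Taylor expansion with respect to DIF effects $e_i$ to $H_{\boldsymbol{\delta}}$ and obtain
$$H_{\boldsymbol{\delta}}(\hat{\boldsymbol{\delta}}, \boldsymbol{\gamma}, \mathbf{e}) \approx H_{\boldsymbol{\delta}}(\boldsymbol{\delta}_0, \boldsymbol{\gamma}, \mathbf{0}) + H_{\boldsymbol{\delta}\boldsymbol{\delta}} (\hat{\boldsymbol{\delta}} - \boldsymbol{\delta}_0) + \sum_{i=1}^{I} H_{\boldsymbol{\delta} e_i} e_i + \frac{1}{2} \sum_{i=1}^{I} H_{\boldsymbol{\delta} e_i e_i} e_i^2 , \quad (11)$$
where $\boldsymbol{\delta}_0$ denotes the true parameter and all derivatives are evaluated at $(\boldsymbol{\delta}_0, \boldsymbol{\gamma}, \mathbf{0})$. Note that Taylor expansion (11) relies on the assumption of independent DIF effects $e_i$. Because $H_{\boldsymbol{\delta}}(\hat{\boldsymbol{\delta}}, \boldsymbol{\gamma}, \mathbf{e}) = \mathbf{0}$ and $H_{\boldsymbol{\delta}}(\boldsymbol{\delta}_0, \boldsymbol{\gamma}, \mathbf{0}) = \mathbf{0}$, we obtain from (11)
$$\hat{\boldsymbol{\delta}} - \boldsymbol{\delta}_0 \approx - H_{\boldsymbol{\delta}\boldsymbol{\delta}}^{-1} \left[ \sum_{i=1}^{I} H_{\boldsymbol{\delta} e_i} e_i + \frac{1}{2} \sum_{i=1}^{I} H_{\boldsymbol{\delta} e_i e_i} e_i^2 \right] . \quad (12)$$
Because DIF effects can be assumed as random variables with zero mean (i.e., $E(e_i) = 0$), the expected bias of the parameter estimate $\hat{\boldsymbol{\delta}}$ can be computed as
$$E(\hat{\boldsymbol{\delta}}) - \boldsymbol{\delta}_0 \approx - \frac{1}{2} H_{\boldsymbol{\delta}\boldsymbol{\delta}}^{-1} \sum_{i=1}^{I} H_{\boldsymbol{\delta} e_i e_i} E(e_i^2) . \quad (13)$$
Equation (13) can be used to construct a bias correction term for $\hat{\boldsymbol{\delta}}$. We compute estimated DIF effects $\hat{e}_i$ as a proxy for the true DIF effect $e_i$ as
$$\hat{e}_i = \hat{\sigma} \hat{b}_{i2} + \hat{\mu} - \hat{b}_{i1} . \quad (14)$$
Then, we obtain a bias-corrected estimate $\hat{\boldsymbol{\delta}}_{\mathrm{bc}}$ of $\boldsymbol{\delta}_0$ from an empirical version of (13) as
$$\hat{\boldsymbol{\delta}}_{\mathrm{bc}} = \hat{\boldsymbol{\delta}} + \frac{1}{2} H_{\boldsymbol{\delta}\boldsymbol{\delta}}^{-1} \sum_{i=1}^{I} H_{\boldsymbol{\delta} e_i e_i} \hat{e}_i^2 . \quad (15)$$
The vector $H_{\boldsymbol{\delta} e_i e_i}$ contains second-order derivatives of $H_{\boldsymbol{\delta}}$ with respect to $e_i$ and is provided by $H_{\boldsymbol{\delta} e_i e_i} = \partial^2 H_{\boldsymbol{\delta}} / \partial e_i^2$.
An alternative bias correction estimate can be obtained by using
$$\hat{\boldsymbol{\delta}}_{\mathrm{bc}} = \hat{\boldsymbol{\delta}} + \frac{1}{2} H_{\boldsymbol{\delta}\boldsymbol{\delta}}^{-1} \sum_{i=1}^{I} H_{\boldsymbol{\delta} e_i e_i} \hat{\tau}^2 , \quad (16)$$
where $\hat{\tau}^2$ is an estimate of $\tau^2$.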
We now present the required partial derivatives for the bias correction in SL linking. The estimating equations for SL linking defined in (
6) for the mean
and the SD
are provided as
The second-order derivatives
can be similarly obtained.
The second-order derivatives of
and
with respect to
can be computed as
where
and
denote the first and the second derivative of the logistic function
, respectively. The terms can be inserted into the Formula (
15) for the bias-corrected estimates.
The application of the bias correction estimate (15) to SL linking is denoted by SLA1 in the rest of this paper. If $\tau^2$ is estimated using the empirical variance of the estimates $\hat{e}_i$, the corresponding bias correction estimate for SL linking is denoted by SLA2. The empirical variance of $\hat{e}_i$ also contains sampling errors due to the sampling of persons. By obtaining variance estimates for $\hat{b}_{i1}$ and $\hat{b}_{i2}$ from the group-wise fitted 2PL models, the part of the variance of $\hat{e}_i$ that is due to sampling of persons can be computed. By computing the difference between the empirical variance and the average variance of the $\hat{e}_i$ estimates due to sampling error, we obtain a bias-corrected variance $\hat{\tau}^2_{\mathrm{bc}}$. If this difference is negative, the variance estimate is set to zero. If the variance estimate $\hat{\tau}^2_{\mathrm{bc}}$ is used in the bias correction Formula (16) for SL linking, the corresponding estimate is denoted by SLA3.
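The computation of estimated DIF effects and of the bias-corrected DIF variance can be sketched as follows. This is a hedged illustration, not the implementation used in this article: the function names are hypothetical, the empirical variance uses the divisor $I$, and the sampling variance of an estimated DIF effect is approximated by treating the linking constants as fixed. All three choices are assumptions of this sketch.

```python
from statistics import mean, pvariance


def estimated_dif_effects(b1, b2, mu_hat, sigma_hat):
    """DIF effects implied by linked difficulties: sigma * b_i2 + mu - b_i1."""
    return [sigma_hat * b2_i + mu_hat - b1_i for b1_i, b2_i in zip(b1, b2)]


def dif_sampling_variances(var_b1, var_b2, sigma_hat):
    """Approximate sampling variance of each estimated DIF effect.

    Assumption: mu_hat and sigma_hat are treated as fixed constants.
    """
    return [v1 + sigma_hat ** 2 * v2 for v1, v2 in zip(var_b1, var_b2)]


def corrected_dif_variance(e_hat, sampling_var):
    """Empirical variance minus average sampling variance, truncated at zero."""
    return max(0.0, pvariance(e_hat) - mean(sampling_var))


# Hypothetical example: difficulties in group 2 generated from known DIF effects.
mu_hat, sigma_hat = 0.3, 1.2
b1 = [-1.0, 0.0, 1.0, 0.5]
e_true = [0.3, -0.2, 0.1, -0.2]
b2 = [(b + e - mu_hat) / sigma_hat for b, e in zip(b1, e_true)]

e_hat = estimated_dif_effects(b1, b2, mu_hat, sigma_hat)
samp_var = dif_sampling_variances([0.01] * 4, [0.01] * 4, sigma_hat)
tau2_bc = corrected_dif_variance(e_hat, samp_var)
print(e_hat, tau2_bc)
```

The truncation at zero mirrors the rule stated in the text for the case in which the sampling variance exceeds the empirical variance of the estimated DIF effects.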
Note that the second-order derivatives $H_{\boldsymbol{\delta} e_i e_i}$ vanish in linear linking methods such as mean–geometric-mean linking [9]. Therefore, no bias would occur in these methods [27].
3. SIMEX-Based Bias Correction in Stocking–Lord Linking
In this section, we describe how the simulation extrapolation (SIMEX) method [28, 29, 30] is adapted to conduct a bias correction for SL linking in the presence of uniform DIF. We first briefly describe the main idea of the SIMEX method. Let a dataset contain parts $\mathbf{Z}$ and $\mathbf{W}$, where the variables in the matrix $\mathbf{Z}$ are measured without error and $\mathbf{W}$ contains measurement error. Assume that rows in these matrices refer to independent observations such that $\mathbf{z}_i$ and $\mathbf{w}_i$ are the observed data of case $i$. In our application, we only consider a one-dimensional variable $W$ that is prone to measurement error. We assume the decomposition
$$w_i = w_i^{*} + u_i ,$$
where $w_i^{*}$ corresponds to an error-free measurement and $u_i$ denotes the measurement error. It is assumed that the measurement error variance $\sigma_u^2$ is known or can be consistently estimated from data. Furthermore, we assume that measurement errors $u_i$ are normally distributed with zero mean.
A statistical procedure provides a parameter estimate $\hat{\zeta} = f(\mathbf{Z}, \mathbf{W})$ for a specified function $f$ to estimate an unknown parameter $\zeta$. Frequently, the bias of the estimated parameter $\hat{\zeta}$ depends on the measurement error variance $\sigma_u^2$. The SIMEX method aims at removing (or, at least, reducing) the bias in $\hat{\zeta}$ due to measurement error by means of a simulation-based method. The general idea is to induce additional measurement error in the data and to compute the parameter estimate $\hat{\zeta}$ on these modified datasets that contain additional measurement error. More formally, the values $w_i$ of variable $W$ are modified as
$$w_i(\lambda) = w_i + \varepsilon_i ,$$
where $\varepsilon_i$ is a random draw from a normal distribution with zero mean and a standard deviation of $\sqrt{\lambda} \sigma_u$. Hence, $w_i(\lambda)$ contains the measurement error $u_i + \varepsilon_i$ that has the variance $(1 + \lambda) \sigma_u^2$. SIMEX evaluates the estimate $\hat{\zeta}$ as a function of $\lambda$ at a grid of values. Ref. [31] recommended using the $\lambda$ grid of 0.5, 1.0, 1.5, and 2.0. To reduce Monte Carlo error, the estimation is applied for a repeated simulation of the dataset for a fixed $\lambda$ value. To this end, the parameter estimate is a function of $\lambda$, resulting in a parameter curve $\hat{\zeta}(\lambda)$. This first part of the procedure refers to the simulation step in SIMEX. The second part of SIMEX is the extrapolation step. A regression function $g(\lambda)$ is estimated, and the predicted value at $\lambda = -1$ provides a parameter estimate for $\zeta$ extrapolated to the case of no measurement error variance (i.e., $(1 + \lambda) \sigma_u^2 = 0$). The extrapolating function could be linear or quadratic:
$$g(\lambda) = c_0 + c_1 \lambda \quad \text{or} \quad g(\lambda) = c_0 + c_1 \lambda + c_2 \lambda^2 . \quad (23)$$
In the case of linear extrapolation, the final parameter estimate for $\zeta$ is provided as $\hat{c}_0 - \hat{c}_1$. In the case of quadratic extrapolation, the final parameter estimate is provided by $\hat{c}_0 - \hat{c}_1 + \hat{c}_2$. The SIMEX method is a particularly flexible measurement error correction method because it can be applied to any class of statistical models [26, 32].
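The extrapolation step can be sketched with ordinary least squares on the parameter curve. The helper names `fit_polynomial` and `simex_extrapolate` are hypothetical, and the inclusion of the unmodified estimate at $\lambda = 0$ in the regression is an assumption of this sketch. The parameter curve used below is made up for illustration.

```python
def fit_polynomial(x, y, degree):
    """Least-squares polynomial coefficients (c_0, ..., c_degree).

    Solves the normal equations by Gaussian elimination with partial pivoting.
    """
    n = degree + 1
    A = [[sum(x_i ** (j + k) for x_i in x) for k in range(n)] for j in range(n)]
    r = [sum(y_i * x_i ** j for x_i, y_i in zip(x, y)) for j in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[pivot] = A[pivot], A[col]
        r[col], r[pivot] = r[pivot], r[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            r[row] -= factor * r[col]
    coef = [0.0] * n
    for row in reversed(range(n)):
        rest = sum(A[row][k] * coef[k] for k in range(row + 1, n))
        coef[row] = (r[row] - rest) / A[row][row]
    return coef


def simex_extrapolate(lambdas, estimates, degree=2):
    """Evaluate the fitted regression function at lambda = -1."""
    coef = fit_polynomial(lambdas, estimates, degree)
    return sum(c * (-1.0) ** j for j, c in enumerate(coef))


# Made-up parameter curve: estimates at lambda = 0 and on the recommended grid.
lambdas = [0.0, 0.5, 1.0, 1.5, 2.0]
estimates = [2.0 - 0.5 * lam + 0.1 * lam ** 2 for lam in lambdas]

linear_estimate = simex_extrapolate(lambdas, estimates, degree=1)
quadratic_estimate = simex_extrapolate(lambdas, estimates, degree=2)
print(linear_estimate, quadratic_estimate)
```

For a curve that is exactly quadratic in $\lambda$, the quadratic extrapolation reproduces the value of the curve at $\lambda = -1$, whereas the linear extrapolation corrects only part of the bias.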
We now interpret the application of Stocking–Lord (SL) linking in the presence of uniform DIF as a case of measurement error. Note that DIF, as a measurement error phenomenon, also occurs at the population level and is not restricted to the finite sample size of persons. The input data of SL linking are the item parameters $a_{ig}$ and $b_{ig}$ ($i = 1, \ldots, I$; $g = 1, 2$). In what follows, we assume that item parameters do not contain sampling errors of persons and that there is only uniform DIF. Note that we have (see (5))
$$b_{i2} = \sigma^{-1} (b_{i1} + e_i - \mu) , \quad (24)$$
where $\boldsymbol{\delta} = (\mu, \sigma)$ is the parameter vector of interest. The uniform DIF effect $e_i$ is considered as measurement error and has a variance $\tau^2$. An application of the original SL linking method provides initial estimates $\hat{\mu}$ and $\hat{\sigma}$ from which DIF effects $e_i$ can be estimated as
$$\hat{e}_i = \hat{\sigma} b_{i2} + \hat{\mu} - b_{i1} .$$
We discussed in Section 2 how $\tau^2$ can be estimated from the data. In the application of SIMEX in this paper, we employ the bias-corrected variance estimate $\hat{\tau}^2_{\mathrm{bc}}$ (see the end of Section 2). The general idea of applying the SIMEX method is to induce additional DIF in the data and to study the expected parameter estimate in SL linking for an increased DIF variance. That is, modified DIF effects $\hat{e}_i(\lambda)$ are computed as
$$\hat{e}_i(\lambda) = \hat{e}_i + \varepsilon_i ,$$
where $\varepsilon_i$ is a random draw from a normal distribution with zero mean and variance $\lambda \hat{\tau}^2_{\mathrm{bc}}$. Overall, the variable $\hat{e}_i(\lambda)$ has a variance of $(1 + \lambda) \tau^2$. Finally, we recompute item difficulties in the second group as (see (24))
$$b_{i2}(\lambda) = \hat{\sigma}^{-1} (b_{i1} + \hat{e}_i(\lambda) - \hat{\mu})$$
and subsequently apply the SL linking method to the set of modified item parameters $a_{i1}$, $b_{i1}$, $a_{i2}$, and $b_{i2}(\lambda)$. As a consequence, modified parameter estimates $\hat{\mu}(\lambda)$ and $\hat{\sigma}(\lambda)$ are obtained for item parameters that contain more uniform DIF (i.e., more measurement error) than the original item parameters. For the parameter of interest $\mu$ or $\sigma$, the SIMEX method provides a parameter curve $\hat{\mu}(\lambda)$ or $\hat{\sigma}(\lambda)$ from which a linear or quadratic regression function can be calculated (see (23)). The extrapolation of the regression function at $\lambda = -1$ is used as a final parameter estimate that aims to reduce potential bias due to DIF.
The originally proposed SIMEX method simulates new datasets. To reduce Monte Carlo error, the simulation is repeatedly conducted for a fixed $\lambda$ value, and the parameter estimates are averaged afterward. In the case of SL linking, the critical issue is that there are only a few cases (i.e., items), which leads to relatively large Monte Carlo errors. Furthermore, reducing the Monte Carlo error requires a (very) large number of simulations. However, the parameter estimates in the SL linking method should only be corrected for asymptotic bias. This kind of bias in SL linking would occur in a test of infinite length (i.e., $I \to \infty$) for a fixed DIF variance $\tau^2$. To this end, we propose a quasi-Monte Carlo variant of SIMEX in which SL linking is applied to a test that contains $I \cdot M$ pseudo-items for an integer $M$. The idea is to replicate item parameters and induce normally distributed uniform DIF systematically across all items. Let $z_1, \ldots, z_M$ be quantiles of the standard normal distribution such that the mean is approximately zero and the variance is one. In this paper, we chose
and computed the inverse of the standard normal distribution
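$\Phi^{-1}$ (i.e., quantiles) on the grid seq(1/(2*M+2), 1-1/(2*M+2), length=M). The $z_j$ values are obtained using qnorm(seq(1/(2*M+2), 1-1/(2*M+2), length=M)). For a fixed $\lambda$, the set of values $\sqrt{\lambda} \hat{\tau}_{\mathrm{bc}} z_j$ has approximately a zero mean and a variance $\lambda \hat{\tau}^2_{\mathrm{bc}}$. The item difficulty for the pseudo-item $(i, j)$ for $i = 1, \ldots, I$ and $j = 1, \ldots, M$ is defined as
$$b_{ij,2}(\lambda) = \hat{\sigma}^{-1} \left( b_{i1} + \hat{e}_i + \sqrt{\lambda} \hat{\tau}_{\mathrm{bc}} z_j - \hat{\mu} \right) .$$
All other item parameters $a_{i1}$, $b_{i1}$, and $a_{i2}$ are left unmodified across the values of $\lambda$. By this method, a set of item parameters of pseudo-items is constructed that has a DIF variance of $(1 + \lambda) \tau^2$. The SL linking method is applied to the set of pseudo-items for different $\lambda$ values, and a SIMEX parameter curve $\hat{\boldsymbol{\delta}}(\lambda)$ is obtained. Note that our modified SIMEX method does not contain simulation error, but at the price of only removing (or reducing) asymptotic bias. Finally, the extrapolation of a regression function fitted to the values $\hat{\boldsymbol{\delta}}(\lambda)$ at $\lambda = -1$ provides a final parameter estimate.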
In this article, we used the quadratic function for SIMEX extrapolation in SL linking (denoted as SLSQ). We also considered the linear extrapolation function in SIMEX for SL linking, abbreviated as SLSL.
It should be emphasized that the described quasi-Monte Carlo SIMEX procedure can be applied to any linking method that uses item parameters as input and is potentially biased in the presence of DIF.
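The quasi-Monte Carlo SIMEX procedure described in this section can be sketched end to end. This is an illustration under several assumptions that are not taken from this article: the SL criterion uses unit weights on an equidistant $\theta$ grid, the minimization uses a simple compass search, the DIF SD `tau` is passed as known, the estimate at $\lambda = 0$ enters the regression, and $M = 5$ is used only to keep the run time small. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def logistic(x):
    """Numerically stable logistic distribution function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sl_link(a1, b1, a2, b2, start=(0.0, 1.0), tol=1e-5):
    """Stocking-Lord estimates (mu, sigma) found by a simple compass search."""
    grid = [-4.0 + 0.4 * t for t in range(21)]  # assumed theta grid, unit weights
    tcc1 = [sum(logistic(a * (th - b)) for a, b in zip(a1, b1)) for th in grid]

    def loss(mu, sigma):
        total = 0.0
        for th, t1 in zip(grid, tcc1):
            t2 = sum(logistic(a / sigma * (th - sigma * b - mu)) for a, b in zip(a2, b2))
            total += (t1 - t2) ** 2
        return total

    mu, sigma = start
    step, best = 0.25, loss(mu, sigma)
    while step > tol:
        moved = False
        for d_mu, d_sigma in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            if sigma + d_sigma <= 0.05:  # keep the SD away from zero
                continue
            value = loss(mu + d_mu, sigma + d_sigma)
            if value < best:
                mu, sigma, best, moved = mu + d_mu, sigma + d_sigma, value, True
        if not moved:
            step /= 2.0
    return mu, sigma

def normal_quantile_grid(M):
    """Python analogue of qnorm(seq(1/(2*M+2), 1-1/(2*M+2), length=M))."""
    lo = 1.0 / (2 * M + 2)
    return [NormalDist().inv_cdf(lo + (1.0 - 2.0 * lo) * j / (M - 1)) for j in range(M)]

def fit_polynomial(x, y, degree):
    """Least-squares polynomial coefficients via the normal equations."""
    n = degree + 1
    A = [[sum(x_i ** (j + k) for x_i in x) for k in range(n)] for j in range(n)]
    r = [sum(y_i * x_i ** j for x_i, y_i in zip(x, y)) for j in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[pivot], r[col], r[pivot] = A[pivot], A[col], r[pivot], r[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            A[row] = [v - factor * p for v, p in zip(A[row], A[col])]
            r[row] -= factor * r[col]
    coef = [0.0] * n
    for row in reversed(range(n)):
        rest = sum(A[row][k] * coef[k] for k in range(row + 1, n))
        coef[row] = (r[row] - rest) / A[row][row]
    return coef

def simex_sl(a1, b1, a2, b2, tau, M=5, lambdas=(0.5, 1.0, 1.5, 2.0), degree=2):
    """Return the original and the SIMEX-corrected estimates of (mu, sigma)."""
    naive = sl_link(a1, b1, a2, b2)
    z = normal_quantile_grid(M)
    a1_rep = [a for a in a1 for _ in z]   # each item is replicated M times
    b1_rep = [b for b in b1 for _ in z]
    a2_rep = [a for a in a2 for _ in z]
    lam_values, curve = [0.0], [naive]
    for lam in lambdas:
        shift = math.sqrt(lam) * tau / naive[1]  # added DIF on the group 2 scale
        b2_rep = [b + shift * z_j for b in b2 for z_j in z]
        curve.append(sl_link(a1_rep, b1_rep, a2_rep, b2_rep, start=naive))
        lam_values.append(lam)
    corrected = []
    for k in range(2):  # k = 0: mean, k = 1: SD
        coef = fit_polynomial(lam_values, [est[k] for est in curve], degree)
        corrected.append(sum(c * (-1.0) ** j for j, c in enumerate(coef)))
    return naive, tuple(corrected)

# Hypothetical test: 5 base difficulties crossed with 5 DIF levels.
mu, sigma, tau = 0.3, 1.2, 0.5
base_b, z5 = [-2.0, -1.0, 0.0, 1.0, 2.0], normal_quantile_grid(5)
a1, b1 = [1.0] * 25, [b for b in base_b for _ in z5]
a2 = [sigma] * 25
b2 = [(b + tau * z_j - mu) / sigma for b in base_b for z_j in z5]
naive, corrected = simex_sl(a1, b1, a2, b2, tau)
print(naive, corrected)
```

Because the pseudo-items induce the additional DIF deterministically, repeated runs give identical results, which reflects the absence of simulation error noted above. Setting `degree=1` would give the linear extrapolation variant.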
4. Numerical Illustration
In this section, we demonstrate the bias in SL linking for an idealized dataset in the presence of uniform DIF. Because the bias occurs at the population level, there is no need to simulate item responses. Instead, we show the bias with computations that are solely based on item parameters.
We consider the situation of linking two groups using the 2PL model. The item discriminations $a_i$ ($i = 1, \ldots, I$) are chosen equal to 1 in both groups. Eleven base items are defined with equidistant item difficulties −2, −1.6, …, 1.6, and 2. These item parameters are duplicated eleven times. As described in the definition of pseudo-items in our SIMEX modification to SL linking, we define a grid of values $z_1, \ldots, z_M$ that is computed by qnorm(seq(1/(2*M+2), 1-1/(2*M+2), length=M)), where $M = 11$ (i.e., M <- 11). These values are approximately standard normally distributed and have zero mean and a standard deviation of approximately one. Overall, $I = 11 \cdot 11 = 121$ items, indexed by pairs $(i, j)$ of a base item $i$ and a DIF level $z_j$, are used in this illustration. In the first group, the item parameters are duplicated $M = 11$ times. In the second group, for item $(i, j)$, the item discrimination $a_{ij,2}$ equals $a_i \sigma$. The item difficulty $b_{ij,2}$ is defined as
$$b_{ij,2} = \sigma^{-1} (b_i + \tau z_j - \mu) .$$
By this construction, the DIF variance in the test is $\tau^2$. Note that we constructed DIF deterministically in a systematic way such that items of all difficulties are crossed with all levels of uniform DIF. We computed sets of item parameters for values of $\tau^2$ between 0 and 1. True group differences were simulated by setting $\mu = 0.3$ and $\sigma = 1.2$ in this illustration.
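The construction of the idealized item parameters can be sketched as follows. The sketch assumes true values $\mu = 0.3$ and $\sigma = 1.2$ and a hypothetical DIF variance of 0.5. The SL criterion uses unit weights on an assumed $\theta$ grid and a simple compass search for minimization, so the resulting numbers are not expected to reproduce the values reported in this article exactly. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def logistic(x):
    """Numerically stable logistic distribution function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sl_link(a1, b1, a2, b2, tol=1e-5):
    """Stocking-Lord estimates (mu, sigma) found by a simple compass search."""
    grid = [-4.0 + 0.4 * t for t in range(21)]  # assumed theta grid, unit weights
    tcc1 = [sum(logistic(a * (th - b)) for a, b in zip(a1, b1)) for th in grid]

    def loss(mu, sigma):
        total = 0.0
        for th, t1 in zip(grid, tcc1):
            t2 = sum(logistic(a / sigma * (th - sigma * b - mu)) for a, b in zip(a2, b2))
            total += (t1 - t2) ** 2
        return total

    mu, sigma, step = 0.0, 1.0, 0.25
    best = loss(mu, sigma)
    while step > tol:
        moved = False
        for d_mu, d_sigma in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            if sigma + d_sigma <= 0.05:  # keep the SD away from zero
                continue
            value = loss(mu + d_mu, sigma + d_sigma)
            if value < best:
                mu, sigma, best, moved = mu + d_mu, sigma + d_sigma, value, True
        if not moved:
            step /= 2.0
    return mu, sigma


def normal_quantile_grid(M):
    """Python analogue of qnorm(seq(1/(2*M+2), 1-1/(2*M+2), length=M))."""
    lo = 1.0 / (2 * M + 2)
    return [NormalDist().inv_cdf(lo + (1.0 - 2.0 * lo) * j / (M - 1)) for j in range(M)]


def illustration_items(tau, mu=0.3, sigma=1.2, M=11):
    """Cross 11 base difficulties with M deterministic DIF levels tau * z_j."""
    base_b = [-2.0 + 0.4 * k for k in range(11)]
    z = normal_quantile_grid(M)
    a1 = [1.0 for _ in base_b for _ in z]
    b1 = [b for b in base_b for _ in z]
    a2 = [sigma for _ in base_b for _ in z]
    b2 = [(b + tau * z_j - mu) / sigma for b in base_b for z_j in z]
    return a1, b1, a2, b2


mu_no_dif, sigma_no_dif = sl_link(*illustration_items(tau=0.0))
mu_dif, sigma_dif = sl_link(*illustration_items(tau=math.sqrt(0.5)))
print(mu_no_dif, sigma_no_dif)
print(mu_dif, sigma_dif)
```

Without DIF, the sketch recovers the true mean and SD up to the tolerance of the search. With DIF, both estimates fall below the true values, which is the direction of the bias discussed in this section.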
The original SL linking method was applied to these datasets of item parameters. This method was compared with the two SIMEX-based SL methods SLSQ (quadratic extrapolation) and SLSL (linear extrapolation), as well as the analytical bias correction SLA1. We studied the parameter estimates $\hat{\mu}$ and $\hat{\sigma}$ as a function of the DIF variance $\tau^2$. SIMEX was applied with the four $\lambda$ values 0.5, 1.0, 1.5, and 2.0.
Figure 1 displays the SIMEX parameter curves $\hat{\mu}(\lambda)$ and $\hat{\sigma}(\lambda)$ for the set of item parameters with the DIF variance
(which corresponds to a DIF SD
). It can be seen in
Figure 1 that the linear SIMEX regression curve slightly differed from the quadratic SIMEX regression curve. The estimated mean $\hat{\mu}$ for the original SL linking method was 0.285, which led to a bias of −0.015 relative to the true mean $\mu = 0.3$. In contrast, SIMEX-based linking based on quadratic extrapolation (SLSQ) resulted in an almost unbiased estimate of 0.299. The other two bias-corrected methods came close, with estimates of 0.297 (SLA1) and 0.296 (SLSL).
The differences between the linking methods were more pronounced for the estimated SD $\hat{\sigma}$. SL linking resulted in an estimate of 1.140, which led to a bias of −0.060 relative to the true SD $\sigma = 1.2$. The SIMEX-based linking method SLSQ resulted in an estimate of 1.196, which again was almost unbiased. In contrast, the methods SLA1 (with 1.188) and SLSL (with 1.183) resulted in slight biases. However, all three bias correction SL methods clearly outperformed the original SL linking method in terms of bias.
Figure 2 presents parameter estimates $\hat{\mu}$ and $\hat{\sigma}$ as a function of the DIF variance $\tau^2$. It is evident that SL linking provided (strongly) biased estimates for $\mu$ and $\sigma$. The bias is an approximately linear function of $\tau^2$. Notably, the bias correction SL methods SLSQ, SLA1, and SLSL were superior to SL because they resulted in parameter estimates close to the true values $\mu = 0.3$ and $\sigma = 1.2$. For large DIF variances, the SIMEX-based linking method SLSQ should be preferred over alternative bias-corrected SL linking methods in terms of bias. It should be noted that the bias for SL linking for the estimated SD is much larger than for the estimated mean.
6. Discussion
In this article, different bias correction estimators for SL linking were compared. The previous literature highlighted that SL linking results in a substantial negative bias for the estimated group standard deviation and moderately biased group mean. This article proposed a SIMEX-based bias correction for SL linking that removed most of the bias in (large) DIF conditions and did not lead to practical efficiency losses in no-DIF conditions. Overall, the SIMEX-based SL methods had slight advantages over the analytical bias corrections for SL linking. However, the main advantage of the SIMEX method is that it is entirely computational and does not require analytical work. Hence, SIMEX-based bias correction can be applied to any linking method that could be affected by DIF.
It has been repeatedly pointed out that the presence of DIF effects requires identification constraints for the group mean and group standard deviation if the item parameters are not assumed to be invariant [39, 40, 41, 42]. For example, one could assume that the mean of uniform DIF effects equals zero. This case is referred to as DIF cancellation [43] or balanced DIF [44]. In this article, we assumed that the DIF effects have zero means in the population (i.e., $E(e_i) = 0$); that is, they are centered in hypothetical replications of the experiment. The random DIF assumption is in stark contrast to the ordinarily employed fixed DIF assumption in which the DIF effects are treated as fixed parameters. We think that researchers intentionally define a pseudo-true parameter in the latter situation by choosing a particular linking method [45]. Hence, any choice of a linking method and a structural assumption on DIF effects can be defended by a researcher in an empirical study. There is a tendency in the psychometric literature to believe in a partial invariance assumption of DIF effects (e.g., [46, 47, 48]). We tend to believe that the bias correction methods for Stocking–Lord linking proposed in this article are adequate for the random DIF situation but would likely be less effective for the fixed DIF case.
An anonymous reviewer wondered why we only specified Stocking–Lord linking in a “naïve” implementation in which the presence of DIF effects is essentially ignored. This reviewer suggested removing the identified DIF items from the linking method, an approach that is discussed as iterated linking or scale purification in the literature [49, 50, 51, 52, 53]. We do not think that it is generally advisable to mindlessly eliminate items that are potentially prone to DIF from group comparisons in linking procedures because, in our belief, researchers should only remove items from a scale (or an analysis) if DIF is shown to be construct-irrelevant [54, 55]. Unfortunately, such iterative approaches are also implemented in major large-scale educational assessment studies like the Programme for International Student Assessment (PISA; [56, 57]) that serve as methodological blueprints for empirical research. Effectively, iterative linking procedures frequently lead to findings similar to those of the regularized estimation approach [58, 59, 60] to DIF effects (see [61]).
Simulation Study 3 only considered one type of misspecified IRT model. It might be interesting to investigate whether the findings of this simulation study generalize to other complex IRT models, such as the filtered monotonic IRT model [62, 63, 64, 65] or the four-parameter logistic IRT model [66, 67].
A reviewer wondered whether the proposed bias correction methods could also be used to remove the potential bias due to sampling variation in the item parameters. However, the findings of Simulation Study 2 in the condition of a small sample size and no DIF demonstrated that such bias correction methods would even hurt the accuracy of the estimates. For larger sample sizes, the sampling error in the item parameters could be less relevant than the extent of the variation in the DIF effects that primarily bias Stocking–Lord linking.
An anonymous reviewer wondered whether the assumption of independent DIF effects could be weakened. In fact, the Taylor expansion could be extended to include differential testlet effects [68]. Therefore, the bias correction methods can also accommodate testlet effects (see [14] for a similar approach). Moreover, the SIMEX method can also include variance portions that refer to additional testlet effects. The presence of differential testlet effects could be investigated in future research.
An anonymous reviewer commented that it might be unclear how the modified linking methods would perform in the presence of nonuniform DIF. First, we think, according to our research experience, that uniform DIF is more prevalent than nonuniform DIF [69]. Second, both bias correction approaches can be extended to handle nonuniform DIF effects. The Taylor expansion in Section 2 can be extended to include nonuniform DIF effects, which subsequently also provides a bias-corrected estimator. Moreover, SIMEX can also be applied to multivariate predictor variables prone to measurement error [26]. Only the variance matrix of the DIF effects instead of a scalar DIF standard deviation must be known (or estimated) to apply the SIMEX method to Stocking–Lord linking. Finally, we would also like to note that even recent articles in highly ranked journals using regularization approaches to handle DIF effects only treated the case of uniform DIF effects [70].
Future research might also investigate the performance of SIMEX-based bias correction methods for nonrobust or robust variants of SL linking (see also [71, 72, 73, 74, 75, 76, 77]). Furthermore, it would be interesting to adopt the methodology for polytomous items. In our study, we only considered asymmetric SL linking. SIMEX-based bias correction could also be applied to symmetric SL linking [17, 78]. However, previous studies have shown that symmetric SL linking in its original form already has a smaller bias than asymmetric SL linking in the presence of DIF.