Abstract
In this article, the Rasch model is used for assessing a mean difference between two groups for a test of dichotomous items. It is assumed that random differential item functioning (DIF) exists that can bias group differences. The case of balanced DIF is distinguished from the case of unbalanced DIF. In balanced DIF, DIF effects cancel out on average. In contrast, in unbalanced DIF, the expected value of DIF effects can differ from zero and on average favor a particular group. Robust linking methods (e.g., invariance alignment) aim at determining group mean differences that are robust to the presence of DIF. In contrast, group differences obtained from nonrobust linking methods (e.g., Haebara linking) can be affected by the presence of a few DIF effects. Alternative robust and nonrobust linking methods are compared in a simulation study under various simulation conditions. The simulation results indicate that robust linking methods are preferable to nonrobust alternatives in the case of unbalanced DIF effects. Moreover, the theory of M-estimation, an important approach to robust statistical estimation suitable for data with asymmetric errors, is used to study the asymptotic behavior of linking estimators as the number of items tends to infinity. These results give insights into the asymptotic bias and the estimation of linking errors that represent the variability in estimates due to the selection of items in a test. M-estimation is also used in an analytical treatment to assess standard errors and linking errors simultaneously. Finally, double jackknife and double half sampling methods are introduced and evaluated in a simulation study for the simultaneous assessment of standard errors and linking errors. Half sampling outperformed jackknife estimators in assessing the variability of estimates from robust linking methods.
1. Introduction
The analysis of psychological or educational tests is an important field in the social sciences. The test items (i.e., tasks presented in these tests) are often analyzed using item response theory (IRT, [,]) models. In this article, the Rasch model (RM; [,]) is used for comparing two groups on test items. For example, groups could be demographic groups, countries, studies, or time points. The group comparisons are carried out using linking methods [,]. An important impediment in applying linking methods is that the items could behave differently in the two groups (i.e., differential item functioning, DIF; []); that is, it cannot be expected that the Rasch model holds in the two groups with item parameters that are independent of a group membership.
In this article, we study the performance of linking methods in the presence of DIF that can bias group differences. In contrast to habitually used (i.e., nonrobust) linking methods, robust linking methods aim at deriving estimates of group differences that are robust to the presence of DIF. Importantly, DIF effects can be considered as asymmetric error distributions, and robust statistical methods for location measures are applied for determining a group difference.
This article systematically compares alternative linking methods in the RM. Furthermore, linking errors that quantify the uncertainty in group differences due to the randomness associated with DIF are analytically treated using M-estimation theory and computationally assessed using single and double jackknife and (balanced) half sampling, respectively.
The paper is structured as follows. In Section 2, the RM with random DIF is introduced. In Section 3, several nonrobust and robust linking methods are discussed. In Section 4, M-estimation theory is applied to the study of linking methods for the statistical inference of linking errors. In Section 5, M-estimation theory is applied for the simultaneous assessment of standard errors and linking errors. The resampling techniques double jackknife and double half sampling are introduced in Section 6 for empirically assessing standard errors and linking errors. In Section 7, we present the results of a simulation study in which different robust and nonrobust linking methods are systematically compared across various data-generating models for DIF effects. Section 8 presents a simulation study that investigates the empirical performance of the proposed resampling estimators from Section 6. Finally, the article concludes with a discussion in Section 9.
2. Differential Item Functioning in the Rasch Model
2.1. Rasch Model
The RM [,,,,,,,] is a statistical model for dichotomous item responses $X_i \in \{0, 1\}$ for items $i = 1, \ldots, I$. A latent variable $\theta$ (the so-called ability) accounts for the dependence among item responses. The item response function (IRF) for item $i$ in the RM is defined as
$$P(X_i = 1 \mid \theta) = \Psi(\theta - b_i),$$
where $b_i$ is the item difficulty, $\theta$ is the latent ability, and $\Psi(x) = [1 + \exp(-x)]^{-1}$ denotes the logistic link function.
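As a concrete illustration, the IRF can be evaluated numerically. This is a minimal sketch in Python; the function name `irf` is ours and does not come from any IRT package:

```python
import numpy as np

def irf(theta, b):
    """Rasch IRF: P(X_i = 1 | theta) for an item with difficulty b (logistic link)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

p_mid = irf(theta=1.2, b=1.2)    # ability equal to difficulty -> probability 0.5
p_easy = irf(theta=1.0, b=-1.0)  # easy item: probability above 0.5
p_hard = irf(theta=1.0, b=3.0)   # hard item: probability below 0.5
```

At an ability equal to the item difficulty, the solving probability is exactly one half; the probability increases in ability and decreases in difficulty.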
The RM in a random sampling perspective [,] also relies on a local independence assumption and poses a parametric distribution function on the latent ability :
where , , and a finite-dimensional parameter . In many applications, the distribution is assumed to be normal with a mean of zero. In this case, the parameter only contains the standard deviation that must be estimated in addition to item parameters . It has been empirically shown that distributional misspecifications of might not strongly bias estimates of item difficulties if many items are available [,,]. However, in (very) short tests and with a strong deviation from normality, the bias in item parameter estimates can be non-negligible.
Note that the parameters of the RM are identified up to a constant. Hence, either the mean of the abilities or the mean of item difficulties has to be fixed to zero for reasons of identification [,]. If a normal distribution for is assumed, the mean is set to zero, and only the standard deviation is estimated. The RM is typically estimated with marginal maximum likelihood estimation [].
In practice, it is unlikely that the distribution of item responses can be adequately represented by the RM. In real datasets, more complex IRFs might be necessary, such as the family of logistic IRT models with two, three, or four parameters per item [] or even more flexible nonparametric monotone IRFs [,,,]. In large-scale educational datasets, items could have different discriminations [], and guessing and slipping behavior have been reported [,]. However, fitting a misspecified RM to data might be justified if the latent ability $\theta$ should be defined such that all items contribute equally to $\theta$ [,]. By fitting more complex IRT models, the meaning of $\theta$ might change, which raises validity concerns. Note that the RM has been used in the Programme for International Student Assessment (PISA) study in the past [] and in many other national [,] and international large-scale assessment studies [].
The estimated item parameters with marginal maximum likelihood estimation [] can be interpreted as a pseudo-true parameter (see []) that maximizes the Kullback–Leibler information [,,] between the true distribution Q and its parametric approximation :
where the sum is defined over the different item response patterns for . The distribution Q is defined as the true data-generating distribution , which is a multinomial distribution on the S item response patterns. The RM is the best approximation to Q with respect to the Kullback–Leibler information. By using different loss functions (i.e., estimation methods, for example, unweighted least squares estimation, []), different pseudo-true parameters of the RM will be obtained.
2.2. Differential Item Functioning
Now assume that I items are administered in two groups $g = 1, 2$. The estimation of group differences is of interest. Abilities in the first group follow a normal distribution with zero mean, that is, $\theta \sim N(0, \sigma_1^2)$. In the second group, we also assume a normal distribution, i.e., $\theta \sim N(\mu, \sigma_2^2)$. The parameter $\mu$ can be interpreted as the average ability difference between the two groups.
In practical applications, it is unlikely that item difficulties are equal across groups (i.e., that they are measurement invariant). In this case, DIF occurs, and there exist group-specific item difficulties $b_{ig}$ for groups $g = 1, 2$. Item-specific DIF effects are defined as the difference $e_i = b_{i1} - b_{i2}$ of the group-specific item difficulties. In the absence of DIF, all DIF effects would be equal to zero. Identification constraints on DIF effects must be posed to disentangle group mean differences from average DIF effects [,,].
In this paper, we distinguish the case of random items from fixed items with random DIF effects. In the first case of random items, it is assumed that the bivariate vector of item difficulties follows a bivariate distribution G. In the second case, it is assumed that with random effects , but item difficulties are regarded as fixed. This means that items are fixed, but DIF effects represent a random variable. DIF effects follow a univariate distribution G in this case.
To identify the group difference , identification constraints on G have to be imposed in both cases. The main idea is that the set of items can be partitioned into two distinct sets and . The set of items in (also denoted as reference items; []) is deemed valid for obtaining unbiased group differences. The set refers to approximate measurement invariant (AMI; []) items. These items are allowed to have DIF effects that on average cancel out. A special case is a set of anchor items in which all items in this set have zero DIF effects []. Items in the set (also denoted as biased items; []) have the potential to bias group differences (see []). The partitioning is modeled with a mixture distribution for G [,,]:
where is the proportion of items in the set . In the fixed items case, it is assumed that the expected value for DIF effects of items from is zero, while it can be different from zero for items from . More formally, it holds that
In the random items case, define by the univariate distribution of DIF effects. Based on the mixture representation of the bivariate distribution G, one can decompose the distribution of DIF effects accordingly. The condition for DIF effects in the random items case is the same as in Equation (5).
The test is said to have balanced DIF if the expected DIF effect is zero (i.e., $\mathrm{E}(e_i) = 0$), and it has unbalanced DIF if $\mathrm{E}(e_i) \neq 0$ (see [,,]). It is important to emphasize that the definition of the mixture distribution allows the identification of group differences. The total DIF impact on the test containing all items can be calculated as (for notational simplicity, only in the fixed items case)
With a low proportion of biased items, the presence of DIF effects is not expected to have a large impact on estimated group differences.
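The mixture of reference and biased items can be sketched in a short simulation. All numeric choices below (30% biased items, a DIF standard deviation of 0.3, a bias shift of 1.0) are illustrative assumptions, and the function name is ours:

```python
import numpy as np

rng = np.random.default_rng(2021)

def simulate_dif(n_items, prop_biased, bias_mean, sd=0.3):
    """Draw random DIF effects from the two-component mixture: reference
    items have mean-zero effects; biased items are shifted by bias_mean."""
    biased = rng.random(n_items) < prop_biased
    e = rng.normal(0.0, sd, n_items)
    e[biased] += bias_mean
    return e

# Balanced DIF: the biased component also has zero mean -> effects cancel out.
impact_bal = simulate_dif(10000, prop_biased=0.3, bias_mean=0.0).mean()
# Unbalanced DIF: biased items favor one group; expected impact is 0.3 * 1.0.
impact_unb = simulate_dif(10000, prop_biased=0.3, bias_mean=1.0).mean()
```

Under unbalanced DIF, the average DIF effect approaches the proportion of biased items times their mean shift, which is exactly the DIF impact discussed above.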
In this article, we only consider random DIF effects. In probably the largest part of the literature, DIF effects are considered fixed (e.g., []). In this case, the condition for balanced DIF replaces the expected value by the mean associated with the fixed item parameters []. There is no additional uncertainty introduced in the estimation of group differences with fixed DIF effects because the item parameters are held fixed in repeated sampling. In contrast, with random DIF effects, the group mean difference is affected by the sampled DIF effects even for an infinite sample size of persons. This kind of uncertainty is explicitly addressed in this article.
In many applications, the estimation of group differences involves a previous step in which DIF is detected by applying statistical techniques [,,,]. DIF detection statistics aim to classify items into a set of items that possess DIF, which should be optimally equal to the set . However, DIF detection techniques rely on previous knowledge about DIF-free items or a known group difference [,]. Hence, the decision of whether an item has DIF or not requires additional assumptions that cannot be statistically tested (see also [,,]). In this paper, we do not thoroughly investigate DIF detection techniques, but rather study the performance of linking methods to estimate group differences. We distinguish robust from nonrobust linking methods. Robust linking methods adequately handle the presence of biased items (i.e., items in the set of ) that lead to unbalanced DIF, while nonrobust linking approaches result in biased estimates of group differences in unbalanced DIF situations.
If the RM does not hold, DIF between groups means that IRFs can differ across the two groups. If the misspecified RM is fitted to data, DIF in item difficulties can be interpreted as a summary of DIF between IRFs. It is acknowledged that more complex DIF, such as nonuniform DIF in item discriminations [] or DIF in guessing parameters, might occur. However, if these model aspects are intentionally ignored by fitting the RM, DIF effects in other aspects of the IRFs only indirectly enter the DIF assessment through item difficulties. Moreover, DIF effects in item difficulties are more frequently found in empirical applications than in item discriminations [,,]. In the rest of the paper, statistical inference regarding the population of persons and the population of items is discussed that is even valid if the fitted IRT model is misspecified.
2.3. Identified Item Parameters in Group-Specific Scaling Models
Linking methods rely on group-specific item parameters estimated in separate scaling models in each group. By doing so, the group-specific scaling models are not misspecified by falsely assuming invariance.
In the first group, the ability variable in the data-generating model follows a normal distribution with zero mean, that is, $\theta \sim N(0, \sigma_1^2)$. In a separate estimation for the first group with an infinite sample size of persons, the estimated item difficulties equal the data-generating parameters (i.e., $\hat{b}_{i1} = b_{i1}$). In the second group, the distribution of the ability variable is $N(\mu, \sigma_2^2)$. In the estimation, the mean of the ability variable is fixed to zero for reasons of identification. Hence, estimated item difficulties also include the group difference parameter. We obtain
$$\Psi(\theta - b_{i2}) = \Psi\big(\tilde{\theta} - (b_{i2} - \mu)\big) \quad \text{with} \quad \tilde{\theta} = \theta - \mu,$$
where the standardized ability $\tilde{\theta}$ is normally distributed with zero mean (i.e., $\tilde{\theta} \sim N(0, \sigma_2^2)$). Consequently, it follows that $\hat{b}_{i2} = b_{i2} - \mu$.
3. Linking Methods
In this section, we review several linking methods [,,,,] that allow the estimation of the group difference $\mu$. We assume that estimated identified item parameters $\hat{b}_{i1}$ and $\hat{b}_{i2}$ ($i = 1, \ldots, I$) are available (see Section 2.3). We define differences $d_i = \hat{b}_{i1} - \hat{b}_{i2}$.
3.1. Mean-Mean Linking (MM)
Mean-mean linking (MM; [,]) is one of the most popular linking methods. The group difference is estimated by
$$\hat{\mu}_{\mathrm{MM}} = \frac{1}{I} \sum_{i=1}^{I} d_i .$$
Note that $\hat{\mu}_{\mathrm{MM}}$ is determined as the least-squares estimate of the item difficulty differences $d_i$:
$$\hat{\mu}_{\mathrm{MM}} = \arg\min_{\mu} \sum_{i=1}^{I} (d_i - \mu)^2 .$$
We now derive the bias of MM and assume fixed items with random DIF effects e that follow a distribution G. The distribution is given by the mixture representation (see Equation (4)). It holds that and (see Equation (5)). Then, we obtain for the bias under MM
The bias coincides with the DIF impact on the test (see Equation (7)). The bias vanishes in the case of balanced DIF (i.e., ) or in the absence of biased items (). MM can be considered as a nonrobust linking method because biased items can affect the estimated group difference. As an alternative to such a nonrobust approach, it may be recommended to use linking methods based on robust statistical methodology [] designed for resistant estimation under contamination (especially for data contaminated by outlying values). The following linking methods realize some kind of robustness against the presence of biased DIF items.
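Before turning to the robust alternatives, the bias of MM linking under unbalanced DIF can be illustrated numerically. The sketch assumes the identification of Section 2.3, under which the item-wise differences of identified difficulties equal the true group difference plus the DIF effects; all numeric values (20% biased items, a shift of 1.0) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

I, mu_true = 2000, 0.5              # number of items, true group difference
e = rng.normal(0.0, 0.2, I)         # random DIF effects of reference items
biased = rng.random(I) < 0.2        # 20% biased items ...
e[biased] += 1.0                    # ... with DIF effects favoring one group
d = mu_true + e                     # differences of identified difficulties
mm_est = d.mean()                   # MM linking estimate (least squares)
bias = mm_est - mu_true             # close to 0.2 * 1.0, the DIF impact
```

The empirical bias approaches the proportion of biased items times their mean DIF shift, matching the DIF impact derived above.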
3.2. Asymmetrically Trimmed Mean (ATR)
An intuitive idea borrowed from robust statistics [,,] is to consider biased DIF items as outliers [,] in estimating the location measure that is given as the group difference. Hence, robust alternatives to the mean (i.e., MM linking) can be established.
The asymmetrically trimmed mean (ATR) removes items with large differences from the estimation. By defining a trimming proportion , the ATR linking estimate is defined as the average of values for which the absolute differences are below the -quantile of these discrepancies. The main idea is that large discrepancies can be regarded as biased items and should be removed from group comparisons. The ATR estimate is formally defined as
where denotes the -quantile, the indicator function, and the median. The median instead of the mean is used because the median is typically more robust concerning outliers (i.e., biased DIF items). ATR linking has the potential to properly handle the situation of unbalanced DIF because it explicitly allows that there could be only biased items with unidirectional signs. The ATR estimator is related to the least trimmed absolute estimator [,], which is especially suitable for asymmetric contamination in the data. A similar idea to the ATR estimator is used in robust structural equation modeling for defining case weights used for downweighting outlying cases (see [,]). As an alternative to the ATR estimator, the least weighted squares estimator may be applied as an estimator of location with high robustness as well as high efficiency [].
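A minimal implementation sketch of the ATR estimator; the function name, the trimming proportion, and the example data are ours:

```python
import numpy as np

def atr(d, trim=0.2):
    """Asymmetrically trimmed mean: average the d_i whose absolute deviation
    from the median lies below the (1 - trim)-quantile of these deviations."""
    d = np.asarray(d, dtype=float)
    dev = np.abs(d - np.median(d))
    keep = dev < np.quantile(dev, 1.0 - trim)
    return d[keep].mean()

# Two of the ten differences are outlying (biased items, unidirectional DIF).
d = np.array([0.48, 0.52, 0.50, 0.47, 0.55, 0.51, 0.49, 0.53, 1.9, 2.1])
est = atr(d, trim=0.2)   # trims the two outliers, unlike the plain mean
```

The plain mean of these differences is 0.805, whereas the ATR estimate stays close to the bulk of the items near 0.5.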
3.3. Elimination of DIF Items with Subsequent Mean-Mean Linking (EL)
Another popular approach is to remove DIF items from the group comparison. The identification of DIF items in the first step requires the definition of an appropriate statistic. In the simulation study, we assume that a preliminary group difference is estimated by the median of all differences $d_i$. An item is declared to have DIF if its absolute deviation from this preliminary estimate exceeds a prespecified cutoff K. In many studies, the mean instead of the median is used, and the corresponding condition is referred to as the equal-mean anchor []. However, the median might be a more robust location estimate than the mean in the presence of DIF effects. The items with detected DIF are removed for the subsequent computation of MM linking []. More formally, the EL estimate can be written as
The EL linking method by eliminating DIF items can be interpreted as another variant of a trimmed mean.
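The two-step EL procedure can be sketched in a few lines; the cutoff K = 0.5 and the example data are illustrative:

```python
import numpy as np

def el_linking(d, K=0.5):
    """EL sketch: preliminary median-based estimate, removal of items whose
    deviation exceeds the cutoff K, then mean-mean linking on the rest."""
    d = np.asarray(d, dtype=float)
    mu0 = np.median(d)               # robust preliminary group difference
    keep = np.abs(d - mu0) <= K      # items not flagged as DIF items
    return d[keep].mean()

d = np.array([0.48, 0.52, 0.50, 0.47, 0.55, 1.9, 2.1])
est = el_linking(d)                  # the two large-DIF items are eliminated
```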
3.4. Bisquare Linking (BSQ)
Another robust estimate of the location parameter is based on the bisquare loss function (see []) that is defined by
$$\rho(x) = \begin{cases} \dfrac{K^2}{6}\left[1 - \left(1 - (x/K)^2\right)^3\right] & \text{for } |x| \le K, \\[4pt] \dfrac{K^2}{6} & \text{for } |x| > K, \end{cases}$$
where K is a prespecified threshold value. The group difference is estimated by
$$\hat{\mu}_{\mathrm{BSQ}} = \arg\min_{\mu} \sum_{i=1}^{I} \rho(d_i - \mu) .$$
Note that the bisquare loss function is also known as the Tukey biweight function [].
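BSQ linking can be computed by iteratively reweighted averaging, a standard way to minimize the bisquare loss; in this sketch (function name, tuning constant K, and data are ours), residuals beyond K receive zero weight:

```python
import numpy as np

def bsq_linking(d, K=1.0, n_iter=50):
    """Minimize the bisquare (Tukey biweight) loss by iteratively reweighted
    averaging; residuals with |d_i - mu| >= K receive zero weight."""
    d = np.asarray(d, dtype=float)
    mu = np.median(d)                       # robust starting value
    for _ in range(n_iter):
        u = (d - mu) / K
        w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)
        mu = np.sum(w * d) / np.sum(w)      # weighted-mean update
    return mu

d = np.array([0.48, 0.52, 0.50, 0.47, 0.55, 1.9, 2.1])
est = bsq_linking(d)                        # the two outliers get zero weight
```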
3.5. Invariance Alignment (IA)
The bisquare loss function in Equation (15) can be replaced by any robust (or nonrobust) loss function. In invariance alignment (IA; [,,,]), the power loss function $\rho(x) = |x|^p$ ($p > 0$) is employed. The group mean estimate is given by
$$\hat{\mu}_{\mathrm{IA}} = \arg\min_{\mu} \sum_{i=1}^{I} |d_i - \mu|^p .$$
By choosing a small power p, the extent of noninvariance is minimized. Hence, the group mean difference relies on items that have small DIF effects while removing items with large DIF effects from the comparison []. IA was originally proposed with the power $p = 0.5$ []. IA with $p = 2$ is equivalent to MM. Note that IA with small p is particularly suited to the situation of partial invariance in which the distribution of DIF effects concentrates at zero (i.e., all DIF effects of invariant items are zero or close to zero) and fails for symmetrically distributed DIF effects [,].
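A sketch of IA linking via a simple grid search over candidate group differences; the epsilon smoothing of the nondifferentiable power loss and the grid search are our illustrative choices, standing in for a proper optimizer:

```python
import numpy as np

def ia_linking(d, p=0.5, eps=1e-4):
    """IA sketch: minimize sum_i (|d_i - mu|^2 + eps)^(p/2) over a grid of
    candidate mu values; eps smooths the nondifferentiable power loss."""
    d = np.asarray(d, dtype=float)
    grid = np.linspace(d.min(), d.max(), 20001)
    loss = ((d[None, :] - grid[:, None]) ** 2 + eps) ** (p / 2.0)
    return grid[np.argmin(loss.sum(axis=1))]

d = np.array([0.48, 0.52, 0.50, 0.47, 0.55, 1.9, 2.1])
est_robust = ia_linking(d, p=0.5)   # concentrates on the small-DIF items
est_mm = ia_linking(d, p=2.0)       # p = 2 reproduces the mean (MM linking)
```

With p = 0.5, the estimate stays near the cluster of invariant items, while p = 2 returns the nonrobust mean.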
3.6. Haebara Linking (HAE)
In contrast to MM, ATR, BSQ and IA linking methods, Haebara (HAE) linking [] aligns IRFs instead of aligning item parameters. The linking function is defined as
with a power $p > 0$ and a weight function $\omega$ that integrates to one. The originally proposed HAE linking uses $p = 2$ []. The robust alternative $p = 1$ was treated in [,,], while cases with $p < 1$ were studied in [].
To get more insight into the relation of IA and HAE, we apply a Taylor approximation of the second IRF in Equation (17) under the assumption of small effects . We obtain
where . Using the approximation (18), Equation (17) can be rewritten as
where item-specific weights are given by . Hence, HAE linking can be interpreted as IA with item-specific weights, and a similar performance of HAE and IA can be expected.
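The alignment of IRFs in HAE linking can be sketched by numerically integrating the loss over a theta grid; the standard-normal weight, the grids, and the sign conventions in this sketch are illustrative assumptions:

```python
import numpy as np

def psi(x):
    """Logistic link function."""
    return 1.0 / (1.0 + np.exp(-x))

def haebara(b1, b2, p=2.0):
    """Haebara-type linking sketch: choose mu minimizing the power-p distance
    between group-1 IRFs and shifted group-2 IRFs, integrated with a
    standard-normal weight over a theta grid (grid search over mu)."""
    theta = np.linspace(-4.0, 4.0, 61)
    w = np.exp(-0.5 * theta**2)
    w /= w.sum()                         # normalized integration weights
    best_mu, best_loss = 0.0, np.inf
    for mu in np.linspace(-2.0, 2.0, 801):
        diff = psi(theta[:, None] - b1[None, :]) - psi(theta[:, None] + mu - b2[None, :])
        loss = np.sum(w[:, None] * np.abs(diff) ** p)
        if loss < best_loss:
            best_mu, best_loss = mu, loss
    return best_mu

b1 = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
b2 = b1 + 0.3                            # DIF-free shift by a difference of 0.3
mu_hat = haebara(b1, b2)
```

In the DIF-free case, the shifted group-2 IRFs coincide with the group-1 IRFs exactly at the true group difference, so the grid search recovers 0.3.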
3.7. Gini Linking (GI)
Recently, a linking procedure based on the Gini index (GI; []) has been proposed. The linking function is very similar to IA linking and tries to define a group difference that is primarily based on items with small DIF effects. The group mean difference is determined by
where the power can be chosen by the user. The original proposal used []. Previous experience of the authors indicates that GI also works with , but it does not perform satisfactorily with . It has been shown that IA and GI provided similar results in small case studies [], but GI linking has not yet been systematically compared with other linking methods.
3.8. Robustness of the Different Linking Methods
The linking methods mean-mean linking (MM) and Haebara linking (HAE) with can be considered as nonrobust. The linking methods based on the asymmetrically trimmed mean (ATR), elimination of DIF items with subsequent mean-mean linking (EL), bisquare linking (BSQ), invariance alignment (IA) with , Haebara linking (HAE) with and Gini linking (GI) can be considered as robust linking methods that ensure some protection to the presence of biased DIF items.
4. An Analytical Treatment for Assessing Linking Errors
In this section, the computation of linking errors is investigated. Linking errors refer to the uncertainty of the randomness associated with items [,,,,,,]. The estimated group difference is affected by random DIF. The linking error quantifies this source of variance. In this section, an analytical treatment for assessing linking errors is presented. For this section, we assume an infinite sample size of persons. That means that identified item parameters are estimated without a sampling error. This assumption is dropped in the next Section 5.
In Section 2.2, we assume that random DIF (or item parameters) follow(s) a mixture distribution , where denotes the distribution associated with reference items for which DIF effects on average cancel out and denotes the distribution of biased items that can impact estimated group differences. The estimation of the group difference can be interpreted as an estimation problem of a location parameter in robust statistics where the location parameter (i.e., the group difference ) should be based on . However, the observed mixture distribution G contains a contaminated asymmetric error distribution [,,] that might bias the estimate . As discussed in Section 2.2, two cases of random DIF can be distinguished. First, items can be considered random, and the bivariate vector of group-specific item difficulties is modeled with a distribution (see Section 4.1). Second, items can be regarded as fixed, but DIF effects are modeled as a random variable (see Section 4.2). Although these cases are very different, their consequences lead to similar estimates of variances. Hence, estimated errors (i.e., linking errors) due to the randomness associated with items are practically identical.
4.1. Random Item Parameters
In this subsection, we discuss the estimation for random item parameters. We introduce a slightly more general notation to cover the linking methods (except for GI linking) from Section 3. The “data” for item i is given by the vector . The linking method must be additive with respect to functions of this data. More formally, let H be a linking function that is defined by
The linking parameter (e.g., a group difference) of interest is estimated by
Assuming differentiability of h implies that can be obtained by solving the equation
where .
Equation (24) provides an estimating equation for the parameter. The corresponding estimator is labeled as an M-estimator []. It is evident that the estimated group mean differences in MM, IA and HAE linking are M-estimators by defining a univariate parameter. The linking methods EL and ATR are so-called two-step estimators because their computation relies on the median computed in a first step. Because the estimating equation for the median is well defined, Equation (24) also applies to these two-step estimators: the pair consisting of the median and the group difference can be interpreted as a bivariate one-step M-estimator with stacked estimating equations (see [], chp. 7).
We now apply the theory of M-estimators ([], chp. 7; [,]) to study the asymptotic behavior of . Because we are concerned with linking errors, asymptotic behavior is meant with respect to the number of items. By letting the number of items tend to infinity, the left side in Equation (24) converges to
where and () denote the random variables associated with estimated item parameters. As already mentioned, the distribution of item parameters G follows the mixture distribution . Assume that densities for the involved distributions exist (i.e., continuous or count densities): , , and . Equation (25) can be written as
The parameter obtained from a linking method with an infinite number of items is given as the root of the equation
Note that this root is a function of the linking function h, the distribution G, and the mixture proportion. For a given dataset, the mixture proportion and G are fixed but unknown. However, the linking function h is chosen by the user.
The pseudo-true parameter is defined as the estimate if all items would be reference items. That is, the linking parameter would only be determined by the mixture component :
Ideally, a component of the solution (in the bivariate case) or the solution itself (in the univariate case) should provide an asymptotically unbiased estimate of the group difference by choosing an appropriate linking function h (or its derivative). In the following, we assume that h is differentiable, although the main propositions for M-estimators do not require differentiability []. However, one could always approximate a nondifferentiable linking function h by a differentiable approximation. For example, the nondifferentiable and nonnegative linking function $h(x) = |x|$ can be approximated by $\tilde{h}(x) = \sqrt{x^2 + \varepsilon}$ for a sufficiently small $\varepsilon > 0$ [,,,].
4.1.1. Asymptotic Behavior
We now study the asymptotic behavior of the estimator . For a large number of items, converges to . The derivation of relies on a Taylor approximation of and closely follows []. Due to (26), we get
We now apply a first-order Taylor approximation of around :
where is the matrix of partial derivatives. From (29) and (30), we get
Hence, we obtain from (31),
If we assume that allows the unbiased estimation of , Equation (32) provides an expression of the asymptotic bias of . Of crucial importance is that the linking function downweights observations from the distribution of biased items (i.e., ). The linking function has to be chosen so that biased items are automatically removed for group comparison. The next subsection discusses how the linking function should be chosen to enable an unbiased estimation of .
4.1.2. Choosing an Optimal Linking Function m
Again, the derivation of the choice of the linking function follows the exposition in []. Assume that the true parameter is determined by the distribution (with density ) of reference items. Hence, is given as the maximizer of the log-likelihood function and fulfills
Based on (33), the linking function can be chosen in order to obtain unbiased estimates of group mean differences (see []):
with the weight function w defined as
Note that and the weighting function w weighs observations according to their closeness to the distribution . Observations with large density values are downweighted in w. Using (33), it can be shown that
4.1.3. Asymptotic Normal Distribution
We now show that the M-estimator follows an asymptotic normal (AN) distribution (see [], chp. 7). The same Taylor expansion as in (30) provides
The approximation (37) can be substituted into the estimating Equation (24):
Hence, we obtain from (38)
Therefore, we obtain the asymptotic normal distribution of as
The involved matrices and can be estimated from sample data by
Notably, the distribution stated in Equation (42) only holds for a sufficiently large number of items I.
4.1.4. Scalar Linking Parameter
We now specialize our results if the estimated parameter coincides with the estimated group difference . In this case, m is a univariate linking function. Assume that and are the roots of the following equations, respectively:
The asymptotic behavior of can be described as (see Equation (32))
where is the derivative of m with respect to . Furthermore, is asymptotically normally distributed (see Equation (42)):
Again, the involved integrals for the variance estimate in (49) can be estimated using sample data (see Equations (44) and (45)).
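For the scalar case, the sandwich variance can be computed directly from the psi function of the chosen linking method. The sketch below uses a Huber psi for concreteness (the bisquare or power losses discussed above would be handled analogously); all names and the tuning constant are ours:

```python
import numpy as np

def huber_mu_and_se(d, K=1.0, n_iter=50):
    """Huber M-estimate of a scalar location plus its sandwich standard
    error sqrt(mean(psi^2) / mean(psi')^2 / I), following the asymptotic
    normality result for scalar M-estimators."""
    d = np.asarray(d, dtype=float)
    mu = np.median(d)                              # robust starting value
    for _ in range(n_iter):
        psi = np.clip(d - mu, -K, K)               # Huber psi function
        dpsi = (np.abs(d - mu) <= K).astype(float) # its derivative (0 or 1)
        mu = mu + psi.mean() / dpsi.mean()         # Newton-type update
    psi = np.clip(d - mu, -K, K)
    dpsi = (np.abs(d - mu) <= K).astype(float)
    var = np.mean(psi**2) / np.mean(dpsi) ** 2 / len(d)
    return mu, np.sqrt(var)

rng = np.random.default_rng(7)
d = 0.5 + rng.normal(0.0, 0.3, 500)   # item differences without biased items
mu_hat, se = huber_mu_and_se(d)
```

With DIF-free normal differences, the sandwich standard error is close to the classical standard error of the mean.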
4.2. Fixed Item Parameters , Random DIF Effects
In this subsection, we consider the case of fixed item parameters , but DIF effects are random. The “data” in Section 4.1 was given by , and DIF effects follow a distribution G. Now, only is random and we define the data as and .
The estimating Equation (24) for the linking parameter can be rewritten as
The term in (50) converges to
One has to assume that exists. Define by
Then, we can derive an asymptotic normal distribution for :
The involved matrices and can be estimated by
Interestingly, these estimators coincide with estimated standard errors in the case of random item parameters (see Equations (44) and (45)). Hence, no practical differences regarding the estimated linking parameters and their estimated standard errors can be expected. Only conceptual differences emerge for the two treatments of DIF effects.
5. An Analytical Treatment for the Simultaneous Assessment of Standard Errors and Linking Errors
In practice, the variance in the group mean difference is affected by the sampling of persons (i.e., standard error) and the randomness associated with items (i.e., linking error). There have been attempts at an analytical treatment of simultaneous inference with respect to the two modes [,,]. In this section, we apply M-estimation theory to the simultaneous assessment of standard errors and linking errors. The general idea in this kind of inference is to investigate the asymptotic behavior of the M-estimator if the number of persons P and the number of items I tend to infinity. We only consider the case of random items, but the treatment of the case with fixed items and random DIF effects is similar.
In the notation of Section 4, denotes the vector of (true) identified item parameters. In finite samples of size P, only estimates are available. For , it holds that . In long tests, the estimated item parameters are approximately independent between items []. Hence, we can assume that are approximately independent of each other. M-estimation theory applied to the person side guarantees an asymptotic normal distribution:
where is a function of true item parameters . We now use a Taylor expansion with respect to and
Using the same approach as in Section 4.1.3, we get an approximation of the estimating equation as
Then, we obtain
By definition, we have for
Moreover, the following limit exists as in Section 4.1.3:
Because the estimated item parameters converge to the true identified item parameters, the second term in the right bracket in (61) vanishes asymptotically for $P \to \infty$
For the computation of the covariance matrix, we have
This shows the asymptotic normal distribution when the simultaneous inference with respect to persons and items is conducted:
It is evident from (67) that the number of persons and the number of items are both part of the statistical inference. The involved matrices can be estimated from sample data. However, for example, in (65), the true identified item parameters in the left term have to be replaced by the estimated item parameters, which can cause slight biases in estimated variance matrices. Because of this disadvantage, we propose resampling techniques for the simultaneous inference of standard errors and linking errors in the next section.
6. Resampling Methods for the Simultaneous Assessment of Standard Errors and Linking Errors
We now derive estimation formulas for resampling methods [,] for persons and items. The derivation is motivated by assuming the following data-generating model
$$x_{pi} = \mu + u_p + v_i + e_{pi},$$
where $x_{pi}$ is the observed data for person p (or person groups) and item i (or item groups). The random variables $u_p$, $v_i$, and $e_{pi}$ are all independent of each other, with variances $\sigma_u^2$, $\sigma_v^2$, and $\sigma_e^2$, respectively. We now derive the variance for the mean estimate $\hat{\mu} = (PI)^{-1} \sum_{p=1}^{P} \sum_{i=1}^{I} x_{pi}$. Its variance is given by
$$\mathrm{Var}(\hat{\mu}) = \frac{\sigma_u^2}{P} + \frac{\sigma_v^2}{I} + \frac{\sigma_e^2}{PI} .$$
The variance in (70) contains error sources for persons and items. Hence, it allows a simultaneous inference for both error facets. Following the terminology of errors in item response modeling for the large-scale assessment of students [], the variance $\sigma_u^2 / P$ quantifies the sampling error due to sampling persons, the variance $\sigma_v^2 / I$ the linking error due to sampling items, and the variance $\sigma_e^2 / (PI)$ can be interpreted as measurement error.
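Numerically, the decomposition of the variance of the mean into person, item, and interaction components can be sketched as follows; the variance components and sample sizes are illustrative assumptions, chosen so that the item (linking error) component dominates, as is typical for short tests:

```python
# Variance of the mean under the two-way decomposition: persons contribute
# a person variance / P, items an item variance / I, and the person-by-item
# interaction its variance / (P * I). All component values are illustrative.
P, I = 1000, 25
sigma2_u, sigma2_v, sigma2_e = 1.0, 0.04, 0.25

var_persons = sigma2_u / P            # sampling error component
var_items = sigma2_v / I              # linking error component
var_interaction = sigma2_e / (P * I)  # measurement error component
var_total = var_persons + var_items + var_interaction
```

With many persons but only 25 items, the linking error component (0.0016) exceeds the person sampling component (0.001), illustrating why the item facet cannot be ignored.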
6.1. Single Jackknife (SJK)
The classical single jackknife (SJK; [,,,]) approach removes one unit (e.g., a (group of) person(s) or a (group of) item(s)) from an analysis for computing standard errors. First, we investigate the jackknife estimate in which only persons are removed. Let be the mean estimate in which person p is removed:
We now derive the expected value of the square in Equation (72):
Now, we define
By using (72), we now obtain
Equation (75) allows the computation of the standard error associated with person sampling. From Equation (70), we can attribute the variance to person sampling. From (75), we get by replacing the expected value with the observed value
In the single jackknife, the person-by-item interaction variance component is typically ignored and the variance due to person sampling is, hence, estimated by .
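The leave-one-person-out mechanics can be sketched under the same hypothetical components model; the scaling is the standard jackknife factor, and the exact expected values derived in (72)–(76) are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
P, I = 200, 40
# Toy data following y_pi = u_p + v_i + e_pi (all SDs hypothetical)
y = (rng.normal(0, 0.5, P)[:, None] + rng.normal(0, 0.3, I)[None, :]
     + rng.normal(0, 0.2, (P, I)))

mu_hat = y.mean()
# Leave-one-person-out estimates of the grand mean
mu_loo = np.array([np.delete(y, p, axis=0).mean() for p in range(P)])

# Standard jackknife scaling: v_P = (P - 1)/P * sum_p (mu_(-p) - mu_hat)^2
v_person = (P - 1) / P * np.sum((mu_loo - mu_hat) ** 2)
print(v_person)
```

As noted above, this estimate absorbs a small share of the person-by-item interaction variance in addition to the person component.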
Similarly, we can derive the properties of the SJK estimate in which a single item i (or an item group) is removed from the analysis:
The SJK variance estimate for item sampling utilizes the sum of squares term
By replacing the expected value with the observed value , the quantity is used as the variance estimate concerning the item facet. For the joint inference of persons and items, the variance terms for persons and items are added:
Note that this variance estimate is biased because
Consequently, so-called double jackknife resampling should be employed to remove the bias from the estimated variance.
6.2. Double Jackknife (DJK)
The double jackknife (DJK; [,,]) removes a person (or a group of persons) and an item (or a group of items) from an analysis for the determination of the standard error. The elimination and repeated analysis are carried out for all persons and items. Let be the mean estimate in which person p and item i are removed. In more detail, it is
The estimate only removes person p, and the estimate only removes item i. The corresponding estimates have already been studied as SJK estimates in Section 6.1.
We now consider an analysis in which one person and one item are removed. One obtains
It follows that
Now, define
We then obtain by using (84)
One can use Equations (75), (79), and (86) as estimating equations by equating the expected values of the sums of squares with their observed counterparts. We have three equations for three unknowns
We further simplify (87) to
Now substitute the first and second equation in (88) in the third equation. We obtain
Hence, we get from (89)
Further, the variance components for persons and items can be computed as
The quantities in (90) and (91) can be used to estimate the population variance defined in (70). The crucial issue is how to handle negative variance estimates in estimation. Based on experience from preliminary simulation studies, the following variance estimate turned out to be most satisfactory:
where is nonnegative, and is defined in Equation (90).
6.3. Single Half Sampling (SHS)
In single half sampling (SHS; []), half of the sample is used to reanalyze the data to compute standard errors. Let be the h-th half sample for persons in which half of the persons are sampled. Without loss of generality, let P be even. The h-th half sample consists of persons. We define half sample h in which the first persons are sampled and compute the mean estimate
Then, we obtain
Hence, we get from (95)
Now, there are H (potentially balanced) half samples (see []) with estimates . Define the variance
Using (95), it follows that
Similarly, one can consider half samples of items. Assume that in half sample k, the first items are sampled. Let
One can define the variance in estimates due to different half samples of items. Define the variance
Using the same derivations, we get
Based on the expected values in (97) and (100), one can define a variance estimate of by adding the variance components regarding persons and items as
Notably, this estimate is positively biased because
As in the case of SJK, SHS also results in a biased variance estimate. In the next section, we investigate double half sampling that removes the bias component.
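The mechanics of SHS for the person facet can be sketched with the same toy model. The half samples here are drawn at random; balanced designs would use a Hadamard matrix instead. For the simple mean, the squared deviation of a half-sample estimate from the full-sample estimate directly estimates the person variance component, mirroring the expected values above.

```python
import numpy as np

rng = np.random.default_rng(2)
P, I, H = 200, 40, 20          # persons, items, number of half samples
y = (rng.normal(0, 0.5, P)[:, None] + rng.normal(0, 0.3, I)[None, :]
     + rng.normal(0, 0.2, (P, I)))
mu_hat = y.mean()

# H random half samples of persons; each keeps all items
est = []
for _ in range(H):
    half = rng.choice(P, size=P // 2, replace=False)
    est.append(y[half, :].mean())
est = np.asarray(est)

# A half sample doubles the person sampling variance, so the squared
# deviation from the full estimate needs no rescaling for the mean:
# E[(mu_half - mu_hat)^2] = sigma_u^2 / P (plus a small interaction part)
v_half = np.mean((est - mu_hat) ** 2)
print(v_half)
```

Because only persons are halved, the item effects cancel exactly in each half-sample deviation; the analogous computation over item half samples isolates the linking error component.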
6.4. Double Half Sampling (DHS)
In double half sampling (DHS), half samples of persons and items are created and the analysis is replicated for these half samples. Let h be a half sample of persons, and k be a half sample of items for this dataset of persons. Let be the mean estimate for the half sample for persons and items and be the estimate for the half sample of persons.
Define the variance
Using the same derivation as in (100), one obtains
Hence, an unbiased estimate of the variance for using DHS is obtained by
where .
In practice, one can use balanced half samples based on Hadamard matrices to obtain the most efficient variance estimates, minimizing the Monte Carlo error induced by the choice of half samples []. In the simulation study (see Section 8), only balanced half samples are considered.
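A minimal sketch of DHS with approximately balanced half samples from a Sylvester Hadamard matrix is given below. The final combination of the three replicate variances is derived for this simple components model only, where it cancels the person-by-item interaction bias; the exact combination for the linking estimators is the one given in Equation (105) of the text, which is not reproduced here.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(3)
P, I, Z = 200, 40, 20                # persons, items, number of zones
y = (rng.normal(0, 0.5, P)[:, None] + rng.normal(0, 0.3, I)[None, :]
     + rng.normal(0, 0.2, (P, I)))
mu_hat = y.mean()

# Persons and items are grouped into Z zones each; a +1 entry in a
# Hadamard row selects the corresponding zone for the half sample
Hmat = hadamard(32)                  # smallest power of two above Z
zones_p = np.array_split(np.arange(P), Z)
zones_i = np.array_split(np.arange(I), Z)

def select(zones, signs):
    return np.concatenate([z for z, s in zip(zones, signs) if s > 0])

S2 = Sp = Si = 0.0
reps = Hmat[1:Z + 1, :Z]             # Z approximately balanced sign patterns
for signs in reps:
    p_half = select(zones_p, signs)
    i_half = select(zones_i, signs)
    Sp += (y[p_half, :].mean() - mu_hat) ** 2               # persons only
    Si += (y[:, i_half].mean() - mu_hat) ** 2               # items only
    S2 += (y[np.ix_(p_half, i_half)].mean() - mu_hat) ** 2  # both facets
Sp, Si, S2 = Sp / Z, Si / Z, S2 / Z

# Under this components model, the single-facet terms each carry one
# interaction share and the double term carries three, so the combination
# 2*(Sp + Si) - S2 is unbiased for the target variance in (70)
v_dhs = max(2 * (Sp + Si) - S2, 0.0)
print(v_dhs)
```

Truncating at zero mirrors the handling of negative variance estimates discussed for DJK above.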
6.5. Double Bootstrap
It might be tempting to consider a double bootstrap resampling approach for persons and items as an alternative to DJK and DHS [,,,]. We believe that bootstrapping items cannot be recommended because duplicating items introduces additional local dependence into IRT models, which, in turn, biases the estimated item parameters and linking parameters. Hence, the variability obtained from a double bootstrap will also include portions of this bias.
7. Simulation Study 1: Comparing the Performance of Different Linking Methods
In Simulation Study 1, we compare the performance of robust and nonrobust linking methods for the RM in the presence and absence of random DIF. This study systematically compares several robust linking methods. In particular, the recently proposed GI method is compared with alternative methods.
7.1. Design
Data were simulated according to the RM with random DIF in two groups. In the first group, the ability distribution was simulated as . In the second group, we simulated (i.e., ). Item difficulties were fixed in the simulation and were chosen equidistant in the interval . Hence, in this study, we assumed fixed item difficulties , but simulated random DIF effects according to a mixture distribution (see Section 2.2). The distribution of DIF effects of reference items was chosen as a centered normal distribution; that is, . For the distribution of DIF effects of biased items , we chose a two-point distribution for balanced DIF with values and and corresponding probabilities . For unbalanced DIF, we simulated a one-point distribution at with probability , which favored the first group. In the simulation, we fixed to 0.60. The bias for MM linking is expected to be (see Equation (11)). It vanishes for balanced DIF and is a function of in the case of unbalanced DIF.
In the simulation, five factors were varied. First, we chose the sample size N of persons as 250, 500, 1000, and 5000. Second, we varied the number of items by and . Third, we chose the proportion of biased items . With , no biased DIF items were simulated. Fourth, we varied the standard deviation (SD) of DIF effects of reference items as 0, 0.1, 0.2, and 0.3. Fifth, we simulated three different distributions of DIF effects if : a normal distribution, a uniform distribution, and a t-distribution with four degrees of freedom. With , reference items do not have DIF effects. The distributions of DIF effects were appropriately scaled in order to match the SD . In total, 1000 datasets were simulated and analyzed in each condition.
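The mixture distribution of random DIF effects can be generated as in the following sketch. The interval endpoints for the item difficulties and the equal probability split for balanced DIF are assumptions for illustration; the DIF effect size of biased items is set to 0.60 as in the text.

```python
import numpy as np

rng = np.random.default_rng(4)
I = 20            # number of items
prop_bias = 0.2   # proportion of biased items
tau = 0.60        # DIF effect size of biased items (fixed at 0.60)
sd_ref = 0.3      # SD of DIF effects of reference items

# Item difficulties equidistant; the interval [-2, 2] is an assumption here
b = np.linspace(-2, 2, I)

# Mixture distribution of random DIF effects:
#   reference item (prob. 1 - prop_bias): e_i ~ N(0, sd_ref^2)
#   biased item    (prob. prop_bias):     balanced DIF draws -tau or +tau
#   (equal probabilities assumed); unbalanced DIF would use +tau only
biased = rng.random(I) < prop_bias
e = np.where(biased,
             rng.choice([-tau, tau], size=I),
             rng.normal(0, sd_ref, I))

# Group 1 uses difficulties b; group 2 uses b + e
b1, b2 = b, b + e
print(np.round(e, 3))
```

Replacing the two-point draw with a constant +tau reproduces the unbalanced condition, in which the DIF effects no longer cancel on average and mean-based linking becomes biased.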
7.2. Analysis
The RM was estimated separately in the two groups. The linking methods introduced in Section 3 were applied. We chose a cutoff value of 0.4 for DIF detection in the EL method. In ATR linking, we chose trimming proportions of 0.20 and 0.40. In BSQ linking, we chose 0.4 as the threshold parameter K. IA was estimated using the powers , and 0.1. GI linking was utilized with powers 1 and 2. HAE linking was specified with powers , 1, 0.5, 0.25, and 0.1.
The parameter of interest was the estimated group mean difference . For this parameter, the bias and the root mean square error (RMSE) were computed. To reduce the dependence of the RMSE on the sample size and the number of items, we computed a relative RMSE in which the RMSE of a linking method is divided by the RMSE of the best-performing linking method. Hence, the relative RMSE takes its lowest value of 100 for the best linking method.
To summarize the contribution of each manipulated factor in the simulation, we conducted an analysis of variance (ANOVA). We used a variance decomposition to assess the importance of each factor in the presence and absence of DIF.
Moreover, we classified linking methods on whether they showed satisfactory performance in a particular condition. We defined satisfactory performance for the bias if the absolute bias in the estimated mean was smaller than 0.01. An estimator had satisfactory performance concerning the RMSE if the relative RMSE was smaller than 125.
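The relative RMSE and the satisfactory-performance classification can be computed as in this small sketch; the method names and RMSE values are invented for illustration.

```python
# Hypothetical RMSE values for four linking methods in one condition
rmse = {"MM": 0.062, "HAE(2)": 0.060, "IA(0.5)": 0.066, "BSQ": 0.071}

best = min(rmse.values())
# Relative RMSE: 100 for the best method; values of 125 or more
# count as unsatisfactory precision
rel_rmse = {m: round(100 * v / best, 1) for m, v in rmse.items()}
satisfactory = {m: r < 125 for m, r in rel_rmse.items()}
print(rel_rmse, satisfactory)
```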
In all analyses, the statistical software R [] was used. The R package sirt [] was employed for estimating the RM with marginal maximum likelihood as the estimation method. The linking methods were estimated using R functions written specifically for this paper.
7.3. Results
In Table 1, the variance decomposition of the ANOVA summarized across conditions of no DIF is presented. For the bias, sample size N, the number of items I, as well as the linking method (Meth in Table 1) have an impact. However, as we will see later, the bias is of negligible size in the situation of no DIF. For the RMSE, the linking method constitutes the major source of differences. In contrast, sample size and the number of items only have small effects on the RMSE.

Table 1.
Variance proportions of different factors in the simulation study for bias and RMSE in the condition of no differential item functioning (DIF).
In Table 2, the variance decomposition of the ANOVA summarized across conditions of balanced and unbalanced DIF, respectively, is presented. All terms up to three-way interactions were included. For balanced DIF (column BAL), RMSE is more important than bias. It is evident that linking methods produced the largest variability in estimates, followed by the SD of DIF effects of reference items, the proportion of biased items , sample size N, and the type of distribution (column Dist) of DIF effects. For unbalanced DIF, the bias is primarily affected by and and their interaction. As for balanced DIF, the linking method substantially explains the variability in the RMSE of group mean differences.

Table 2.
Variance proportions of different factors in the simulation study for bias and RMSE for balanced and unbalanced differential item functioning (DIF).
Table 3 summarizes the performance of the different linking methods across all conditions with no DIF, balanced DIF, and unbalanced DIF. In the absence of DIF, all linking methods produced unbiased estimates. However, IA with small powers p of 0.25 and 0.1 as well as HAE with resulted in less precise estimates. Interestingly, GI linking always resulted in a substantially increased variability in estimated group mean differences compared to all other linking methods.

Table 3.
Summary of satisfactory performance of linking methods for bias and RMSE for no, balanced and unbalanced differential item functioning (DIF).
In the conditions with balanced DIF (column “BAL”), all linking methods (except for GI in a few conditions) produced unbiased estimates. However, using robust linking methods (i.e., EL, ATR, BSQ, IA, GI, HAE(p) with ) resulted in an efficiency loss in the RMSE compared to nonrobust linking methods (i.e., MM, HAE(2)). Among the robust linking methods, MM linking with the elimination of DIF items (i.e., EL) as well as IA and HAE with performed best.
Finally, the situation of unbalanced DIF (column “UNBAL”) is most challenging because linking methods have to handle the presence of biased items. Notably, robust linking methods are preferred over nonrobust linking in such a situation. In particular, MM and HAE(2) always resulted in biased estimates. Among the robust linking methods, BSQ and IA with and 0.1 resulted in the fewest simulation conditions with biased estimates. Concerning the RMSE, EL and ATR with a trimming proportion of 0.4 performed best, followed by IA with , HAE with , ATR with a trimming proportion of 0.2, and BSQ linking.
Table 4 shows the RMSE for balanced DIF for items as a function of sample sizes (N), proportion of biased items (), and standard deviation of DIF effects of reference items (). For balanced DIF, all linking methods produced unbiased estimates (not shown in the table). However, there were slight differences between the linking methods with respect to the RMSE. In the situation of partial invariance (i.e., ), the efficiency loss of robust linking methods compared to the nonrobust linking methods MM and HAE(2) was acceptable. However, GI resulted in more variable estimates. Moreover, note that GI linking with outperformed GI with in most conditions. The robust linking methods IA and HAE with very small power values p (e.g., or 0.1) also caused a non-negligible RMSE increase.

Table 4.
RMSE for balanced differential item functioning (DIF) for items as a function of sample sizes (N), proportion of biased items (), and standard deviation of DIF effects of reference items ().
The efficiency loss of robust linking methods is much larger if the reference items also possess DIF (i.e., ). Only IA with can to some extent compete with MM and HAE(2) linking. The variance increase in the robust linking methods IA and HAE with very small powers is apparent. It also has to be stated that GI linking produced large RMSE values in balanced DIF conditions.
Table 5 shows the bias and the RMSE for unbalanced DIF for items as a function of sample sizes (N), proportion of biased items (), and standard deviation of DIF effects of reference items (). All linking methods show biases in at least one condition. Notably, nonrobust linking methods MM and HAE(2) showed the largest bias. Robust linking methods reduce the bias in all conditions. The most critical condition is and . In this condition, BSQ linking has the least bias, followed by IA with small powers 0.25 and 0.1. In this condition, it is also interesting to note that biases for a large sample size of are smaller than for .

Table 5.
Bias and RMSE for unbalanced differential item functioning (DIF) for items as a function of sample sizes (N), proportion of biased items (), and standard deviation of DIF effects of reference items ().
With respect to the RMSE, EL, ATR, BSQ, and IA with powers 0.5 and 0.25 can be recommended. It is important to emphasize that GI linking with performed well in the case of partial invariance (i.e., ) and outperformed the recently proposed GI linking using . Interestingly, DIF detection with subsequent MM linking (method EL) was also relatively effective as long as the proportion of biased items is not too large.
8. Simulation Study 2: Performance of Resampling Methods for Computing Standard Errors and Linking Errors
In Simulation Study 2, we investigate the performance of resampling methods for estimating the variability of group mean differences. DJK and DHS have not yet been systematically studied for linking methods in the literature. In particular, there is a lack of research on resampling methods for robust linking methods.
8.1. Design
The data-generating model closely follows that of Simulation Study 1 (see Section 7.1). Only a selected number of conditions was simulated because resampling methods are computationally demanding. In contrast to Simulation Study 1, we set . Only balanced DIF was simulated because the assessment of variability (and not bias) was the focus of this simulation. The proportion of biased items was chosen as or . The SD of DIF effects for reference items was set to 0.3. We considered sample sizes and and fixed the number of items to . In total, 2000 replications were conducted in each condition of the simulation study.
8.2. Analysis
To further reduce computation time, we chose only a subset of the linking methods that provided unbiased estimates in Simulation Study 1; that is, MM, ATR, IA, and HAE. We assessed the variability in estimated group mean differences with the resampling methods SJK (Equation (80)), DJK (Equation (92)), SHS (Equation (101)), and DHS (Equation (105)). We applied the resampling methods with 20 replication zones (containing 500/20 = 25 or 2000/20 = 100 persons and 40/20 = 2 items in each zone). Approximately balanced half sampling was used; the half samples were constructed from the upper part of a Hadamard matrix with a minimum dimension larger than 20. We computed confidence intervals based on the standard errors estimated by the respective methods as . The proportion of replications in which the true difference is contained in is defined as the coverage rate. Coverage rates are classified as satisfactory if they fall within the interval for a condition in the simulation. As in Simulation Study 1, we used R [] and the R package sirt [].
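The coverage computation can be sketched as follows; the group difference, its sampling SD, and the noise model for the resampling-based standard error are hypothetical stand-ins, not the estimators evaluated in the study.

```python
import numpy as np

rng = np.random.default_rng(5)
true_diff = 0.3   # hypothetical true group mean difference
R = 2000          # replications
z = 1.96          # normal quantile for a 95% confidence interval

covered = 0
for _ in range(R):
    # Stand-ins for the linking estimate and its resampling-based SE:
    # the estimate is unbiased with SD 0.05; the SE estimate is noisy
    est = rng.normal(true_diff, 0.05)
    se = 0.05 * np.sqrt(rng.chisquare(40) / 40)
    if est - z * se <= true_diff <= est + z * se:
        covered += 1

coverage = covered / R   # proportion of intervals covering the truth
print(coverage)
```

Biased SE estimators shift this proportion away from the nominal level, which is exactly what the coverage criterion in the simulation detects.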
8.3. Results
In Table 6, coverage rates for the resampling methods are displayed. By construction, the single resampling methods (SJK and SHS) result in slightly wider confidence intervals than the double resampling methods (DJK and DHS) and, in turn, produce higher coverage rates. It can be seen that SJK and DJK failed to produce acceptable coverage rates. In particular, jackknife error estimates performed worse for robust linking methods. This is in line with results in robust statistics showing that the jackknife does not work for nondifferentiable statistics. However, SJK can be used for the nonrobust linking method HAE(2). In contrast, the half sampling methods outperformed the jackknife. As expected, SHS produced a slight overcoverage, but DHS produced acceptable coverage in all conditions. Particularly noteworthy is the fact that DHS also performed successfully for robust linking methods. Overall, these findings indicate that half sampling methods should be preferred over jackknife resampling.

Table 6.
Coverage rates for linking methods for balanced differential item functioning for items as a function of sample size (N) and the proportion of biased items ().
9. Discussion
In this article, we investigated the performance of robust and nonrobust linking methods as well as the assessment of standard error and linking error estimates of group mean differences. We assumed random DIF with a mixture distribution model. Items are implicitly classified into a set of reference items (that are valid for group comparisons) and biased items that potentially bias group mean differences. We studied the nonrobust linking methods mean-mean linking (MM) and Haebara linking (HAE) with , as well as the robust linking methods based on the asymmetrically trimmed mean (ATR), elimination of DIF items with subsequent mean-mean linking (EL), bisquare linking (BSQ), invariance alignment (IA) with , Haebara linking (HAE) with and Gini linking (GI).
We found that robust linking methods can be very effective in reducing biases in the presence of biased items in unbalanced DIF situations. However, in the presence of DIF on reference items (i.e., in the absence of partial invariance), robust linking methods can result in reduced efficiency of estimates compared to nonrobust methods such as mean-mean linking or Haebara linking, in particular in the situation of balanced DIF. Our study also compared the recently proposed Gini linking with alternative linking methods. Surprisingly, GI performed worse than its competitors and only showed acceptable performance with a modified GI version using a power . In our view, it is hard to recommend a particular linking estimator in the unbalanced DIF situation. It is only evident that mean-mean linking and Haebara linking with are prone to bias and should not be used. Moreover, the recently proposed Gini linking produced much more variable estimates than competing linking estimators. The usual practice in psychometrics (linking method EL), which eliminates DIF items in the first step of the analysis and computes group differences based on the DIF-free items in the second step, provides results comparable to those of robust linking methods (see also [,]). Note that we used the median as the preliminary location estimate in the first step of the EL method, which differs from the common practice that employs the equal mean difficulty assumption (i.e., uses the mean instead of the median; see []).
We also studied the variability of group mean difference estimates due to random DIF. The randomness of DIF introduces an additional source of error (i.e., the linking error) beyond the standard error associated with the sampling of persons. We analytically derived the distribution of the group difference through M-estimation theory. These results are primarily relevant for a (very) large number of items. Because we used a relatively small number of items in the simulation, and large item pools are not often available in applications, we investigated (single and double) jackknife and (single and double) half sampling methods for persons and items for assessing the variability of the estimates from the linking methods. We found that our proposed double half sampling outperformed jackknife-based error estimates. In contrast to the jackknife, half sampling can also be applied satisfactorily to nondifferentiable robust linking methods. These findings indicate that half sampling methods could find their way into the assessment of linking errors in empirical applications.
In this article, we focused on the estimation of group differences. In the investigation of DIF in applied research, how to choose the correct anchor is always crucial [,,,,,]. The studied robust linking estimators can be used to transform estimated item difficulties onto the same scale. Differences in transformed item difficulties can be investigated for DIF effects. Resampling procedures (single jackknife or single half sampling) can be employed for assessing the statistical significance of DIF effects.
As an alternative to separate scaling with subsequent robust or nonrobust linking, concurrent scaling assuming invariant item parameters can be utilized. Although such a one-step approach might be preferred from the practitioner’s point of view, the presence of DIF effects likely introduces some bias in estimated group differences [,]. Surprisingly, the bias is even present for balanced DIF []. Robust linking methods have the advantage that a few outlying DIF effects are automatically removed from group comparisons []. Moreover, concurrent calibration might have computational disadvantages [,]. As a further alternative, concurrent calibration assuming partial invariance can be pursued [,,]. In this approach, DIF for items is investigated in a first step, and items that showed DIF receive group-specific item parameters in the concurrent calibration approach, while invariance is assumed for the remaining items.
Furthermore, the precision of linking estimates can be improved by including additional person covariates in the analysis [,]. This could be particularly true if DIF effects also exist for person covariates. There is a lack of research on including person covariates in robust linking methods.
Finally, we assumed that the Rasch model was correctly specified. This assumption might be unrealistic in practice, and much more complex item response functions could have generated the item responses [,]. It would be interesting to study the performance of the different linking methods and the assessment of standard errors and linking errors for misspecified models. We would like to emphasize that M-estimation theory and resampling techniques also provide valid inference in the case of misspecified models. It can always be debated whether estimates from a misspecified Rasch model are practically relevant or should be interpreted. We tend to argue that parameter estimates of misspecified models summarize a population distribution, and model fitting is not always (or maybe should not be) targeted at estimating the model that has generated the data. In this sense, we think that approaches that include model error as an additional component in statistical inference [] might be beneficial.
Funding
This research received no external funding.
Acknowledgments
We would like to thank four anonymous reviewers, the academic editor and the assistant editor for valuable comments that helped to improve the article.
Conflicts of Interest
The author declares no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
1PL | one-parameter logistic model |
AN | asymptotic normal distribution |
ATR | asymmetrically trimmed mean linking |
BSQ | bisquare kernel linking |
DHS | double half sampling |
DIF | differential item functioning |
DJK | double jackknife |
EL | elimination of DIF items with subsequent mean-mean linking |
HAE | Haebara linking |
IA | invariance alignment |
IRF | item response function |
IRT | item response theory |
MM | mean-mean linking |
PISA | program for international student assessment |
RM | Rasch model |
RMSE | root mean square error |
SD | standard deviation |
SHS | single half sampling |
SJK | single jackknife |
References
- Van der Linden, W.J.; Hambleton, R.K. (Eds.) Handbook of Modern Item Response Theory; Springer: New York, NY, USA, 1997. [Google Scholar] [CrossRef]
- Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154. [Google Scholar]
- Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational Research: Copenhagen, Denmark, 1960. [Google Scholar]
- Fischer, G.H.; Molenaar, I.W. (Eds.) Rasch Models. Foundations, Recent Developments, and Applications; Springer: New York, NY, USA, 1995. [Google Scholar] [CrossRef]
- Kolen, M.J.; Brennan, R.L. Test Equating, Scaling, and Linking; Springer: New York, NY, USA, 2014. [Google Scholar] [CrossRef]
- Lee, W.C.; Lee, G. IRT linking and equating. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 639–667. [Google Scholar] [CrossRef]
- Penfield, R.D.; Camilli, G. Differential item functioning and item bias. In Handbook of Statistics, Volume 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Routledge: Oxford, UK, 2007; pp. 125–167. [Google Scholar] [CrossRef]
- Andrich, D.; Marais, I. A Course in Rasch Measurement Theory; Springer: New York, NY, USA, 2019. [Google Scholar] [CrossRef]
- Kubinger, K.D. Psychological test calibration using the Rasch model—Some critical suggestions on traditional approaches. Int. J. Test. 2005, 5, 377–394. [Google Scholar] [CrossRef]
- Linacre, J.M. Understanding Rasch measurement: Estimation methods for Rasch measures. J. Outcome Meas. 1999, 3, 382–405. [Google Scholar]
- Linacre, J.M. Rasch model estimation: Further topics. J. Appl. Meas. 2004, 5, 95–110. [Google Scholar]
- Rost, J. Was ist aus dem Rasch-Modell geworden? [Where has the Rasch model gone?]. Psychol. Rundsch. 1999, 50, 140–156. [Google Scholar] [CrossRef]
- Von Davier, M. The Rasch model. In Handbook of Item Response Theory, Volume 1: Models; CRC Press: Boca Raton, FL, USA, 2016; pp. 31–48. [Google Scholar] [CrossRef]
- Holland, P.W. On the sampling theory foundations of item response theory models. Psychometrika 1990, 55, 577–601. [Google Scholar] [CrossRef]
- San Martin, E. Identification of item response theory models. In Handbook of Item Response Theory, Volume 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 127–150. [Google Scholar] [CrossRef]
- Robitzsch, A. A comprehensive simulation study of estimation methods for the Rasch model. Stats 2021, 4, 48. [Google Scholar] [CrossRef]
- Xu, X.; Jia, Y. The Sensitivity of Parameter Estimates to the Latent Ability Distribution; (Research Report No. RR-11-40); Educational Testing Service: Princeton, NJ, USA, 2011. [Google Scholar] [CrossRef]
- Zwinderman, A.H.; Van den Wollenberg, A.L. Robustness of marginal maximum likelihood estimation in the Rasch model. Appl. Psychol. Meas. 1990, 14, 73–81. [Google Scholar] [CrossRef]
- Fischer, G.H. Rasch models. In Handbook of Statistics, Volume 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Routledge: Oxford, UK, 2007; pp. 515–585. [Google Scholar] [CrossRef]
- San Martin, E.; Rolin, J. Identification of parametric Rasch-type models. J. Stat. Plan. Inference 2013, 143, 116–130. [Google Scholar] [CrossRef]
- Glas, C.A.W. Maximum-likelihood estimation. In Handbook of Item Response Theory, Vol. 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 197–216. [Google Scholar] [CrossRef]
- Loken, E.; Rulison, K.L. Estimation of a four-parameter item response theory model. Brit. J. Math. Stat. Psychol. 2010, 63, 509–525. [Google Scholar] [CrossRef]
- Falk, C.F.; Cai, L. Semiparametric item response functions in the context of guessing. J. Educ. Meas. 2016, 53, 229–247. [Google Scholar] [CrossRef]
- Feuerstahler, L. Flexible item response modeling in R with the flexmet package. Psych 2021, 3, 31. [Google Scholar] [CrossRef]
- Ramsay, J.O.; Winsberg, S. Maximum marginal likelihood estimation for semiparametric item analysis. Psychometrika 1991, 56, 365–379. [Google Scholar] [CrossRef]
- Rossi, N.; Wang, X.; Ramsay, J.O. Nonparametric item response function estimates with the EM algorithm. J. Educ. Behav. Stat. 2002, 27, 291–317. [Google Scholar] [CrossRef]
- Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479. [Google Scholar]
- Battauz, M. Regularized estimation of the four-parameter logistic model. Psych 2020, 2, 20. [Google Scholar] [CrossRef]
- Culpepper, S.A. The prevalence and implications of slipping on low-stakes, large-scale assessments. J. Educ. Behav. Stat. 2017, 42, 706–725. [Google Scholar] [CrossRef]
- Camilli, G. IRT scoring and test blueprint fidelity. Appl. Psychol. Meas. 2018, 42, 393–400. [Google Scholar] [CrossRef]
- Robitzsch, A.; Lüdtke, O. Reflections on analytical choices in the scaling model for test scores in international large-scale assessment studies. PsyArXiv 2021. [Google Scholar] [CrossRef]
- OECD. PISA 2012. Technical Report; OECD: Paris, France, 2014; Available online: https://bit.ly/2YLG24g (accessed on 30 June 2021).
- Becker, B.; Weirich, S.; Mahler, N.; Sachse, K.A. Testdesign und Auswertung des IQB-Bildungstrends 2018: Technische Grundlagen [Test design and analysis of the IQB education trend 2018: Technical foundations]. In IQB-Bildungstrend 2018. Mathematische und naturwissenschaftliche Kompetenzen am Ende der Sekundarstufe I im zweiten Ländervergleich; Stanat, P., Schipolowski, S., Mahler, N., Weirich, S., Henschel, S., Eds.; Waxmann: Münster, Germany, 2019; pp. 411–425. Available online: https://bit.ly/3mTvgRX (accessed on 30 June 2021).
- Pohl, S.; Carstensen, C. NEPS Technical Report–Scaling the Data of the Competence Tests; (NEPS Working Paper No. 14); Otto-Friedrich-Universität, Nationales Bildungspanel: Bamberg, Germany, 2012; Available online: https://bit.ly/2XThQww (accessed on 30 June 2021).
- Wendt, H.; Bos, W.; Goy, M. On applications of Rasch models in international comparative large-scale assessments: A historical review. Educ. Res. Eval. 2011, 17, 419–446. [Google Scholar] [CrossRef]
- Hoff, P.; Wakefield, J. Bayesian sandwich posteriors for pseudo-true parameters. J. Stat. Plan. Inference 2013, 10, 1638–1642. [Google Scholar] [CrossRef]
- Boos, D.D.; Stefanski, L.A. Essential Statistical Inference; Springer: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
- Sun, Y. Constructing a Misspecified Item Response Model That Yields a Specified Estimate and a Specified Model Misfit Value. Ph.D. Thesis, The Ohio State University, Columbus, OH, USA, 2015. Available online: https://bit.ly/3AGJPgm (accessed on 30 June 2021).
- White, H. Maximum likelihood estimation of misspecified models. Econometrica 1982, 50, 1–25. [Google Scholar] [CrossRef]
- Forero, C.G.; Maydeu-Olivares, A. Estimation of IRT graded response models: Limited versus full information methods. Psychol. Methods 2009, 14, 275–299. [Google Scholar] [CrossRef]
- Bechger, T.M.; Maris, G. A statistical test for differential item pair functioning. Psychometrika 2015, 80, 317–340. [Google Scholar] [CrossRef]
- Cho, S.J.; Suh, Y.; Lee, W.Y. After differential item functioning is detected: IRT item calibration and scoring in the presence of DIF. Appl. Psychol. Meas. 2016, 40, 573–591. [Google Scholar] [CrossRef]
- Doebler, A. Looking at DIF from a new perspective: A structure-based approach acknowledging inherent indefinability. Appl. Psychol. Meas. 2019, 43, 303–321. [Google Scholar] [CrossRef]
- Robitzsch, A.; Lüdtke, O. Mean comparisons of many groups in the presence of DIF: An evaluation of linking and concurrent scaling approaches. J. Educ. Behav. Stat. 2021. Epub ahead of print. [Google Scholar] [CrossRef]
- Van de Schoot, R.; Kluytmans, A.; Tummers, L.; Lugtig, P.; Hox, J.; Muthén, B. Facing off with scylla and charybdis: A comparison of scalar, partial, and the novel possibility of approximate measurement invariance. Front. Psychol. 2013, 4, 770. [Google Scholar] [CrossRef]
- Frederickx, S.; Tuerlinckx, F.; De Boeck, P.; Magis, D. RIM: A random item mixture model to detect differential item functioning. J. Educ. Meas. 2010, 47, 432–457. [Google Scholar] [CrossRef]
- Robitzsch, A.; Lüdtke, O. A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psych. Test Assess. Model. 2020, 62, 233–279. [Google Scholar]
- De Boeck, P. Random item IRT models. Psychometrika 2008, 73, 533–559. [Google Scholar] [CrossRef]
- Soares, T.M.; Gonçalves, F.B.; Gamerman, D. An integrated Bayesian model for DIF analysis. J. Educ. Behav. Stat. 2009, 34, 348–377. [Google Scholar] [CrossRef]
- Pohl, S.; Schulze, D. Assessing group comparisons or change over time under measurement non-invariance: The cluster approach for nonuniform DIF. Psych. Test Assess. Model. 2020, 62, 281–303. [Google Scholar]
- Pohl, S.; Schulze, D.; Stets, E. Partial measurement invariance: Extending and evaluating the cluster approach for identifying anchor items. Appl. Psychol. Meas. 2021. Epub ahead of print. [Google Scholar] [CrossRef]
- Kopf, J.; Zeileis, A.; Strobl, C. Anchor selection strategies for DIF analysis: Review, assessment, and new approaches. Educ. Psychol. Meas. 2015, 75, 22–56. [Google Scholar] [CrossRef]
- Magis, D.; Béland, S.; Tuerlinckx, F.; De Boeck, P. A general framework and an R package for the detection of dichotomous differential item functioning. Behav. Res. Methods 2010, 42, 847–862. [Google Scholar] [CrossRef]
- Millsap, R.E. Statistical Approaches to Measurement Invariance; Routledge: New York, NY, USA, 2011. [Google Scholar] [CrossRef]
- Camilli, G. The case against item bias detection techniques based on internal criteria: Do item bias procedures obscure test fairness issues? In Differential Item Functioning: Theory and Practice; Holland, P.W., Wainer, H., Eds.; Erlbaum: Hillsdale, NJ, USA, 1993; pp. 397–417. [Google Scholar]
- Welzel, C.; Inglehart, R.F. Misconceptions of measurement equivalence: Time for a paradigm shift. Comp. Political Stud. 2016, 49, 1068–1094. [Google Scholar] [CrossRef]
- Welzel, C.; Brunkert, L.; Kruse, S.; Inglehart, R.F. Non-invariance? An overstated problem with misconceived causes. Sociol. Methods Res. 2021. Epub ahead of print. [Google Scholar] [CrossRef]
- Oliveri, M.E.; von Davier, M. Investigation of model fit and score scale comparability in international assessments. Psych. Test Assess. Model. 2011, 53, 315–333. Available online: https://bit.ly/3k4K9kt (accessed on 30 June 2021).
- Rutkowski, L.; Svetina, D. Measurement invariance in international surveys: Categorical indicators and fit measure performance. Appl. Meas. Educ. 2017, 30, 39–51. [Google Scholar] [CrossRef]
- Von Davier, M.; Khorramdel, L.; He, Q.; Shin, H.J.; Chen, H. Developments in psychometric population models for technology-based large-scale assessments: An overview of challenges and opportunities. J. Educ. Behav. Stat. 2019, 44, 671–705. [Google Scholar] [CrossRef]
- González, J.; Wiberg, M. Applying Test Equating Methods. Using R; Springer: New York, NY, USA, 2017. [Google Scholar] [CrossRef]
- Sansivieri, V.; Wiberg, M.; Matteucci, M. A review of test equating methods with a special focus on IRT-based approaches. Statistica 2017, 77, 329–352. [Google Scholar] [CrossRef]
- Von Davier, A.A.; Carstensen, C.H.; von Davier, M. Linking Competencies in Educational Settings and Measuring Growth; (Research Report No. RR-06-12); Educational Testing Service: Princeton, NJ, USA, 2006. [Google Scholar] [CrossRef]
- Manna, V.F.; Gu, L. Different Methods of Adjusting for Form Difficulty under the Rasch Model: Impact on Consistency of Assessment Results; (Research Report No. RR-19-08); Educational Testing Service: Princeton, NJ, USA, 2019. [Google Scholar] [CrossRef]
- Jureckova, J.; Picek, J. Robust Statistical Methods with R; CRC Press: Boca Raton, FL, USA, 2019. [Google Scholar] [CrossRef]
- Huber, P.J.; Ronchetti, E.M. Robust Statistics; Wiley: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
- Maronna, R.A.; Martin, R.D.; Yohai, V.J. Robust Statistics: Theory and Methods; Wiley: New York, NY, USA, 2006. [Google Scholar] [CrossRef]
- Ronchetti, E. The main contributions of robust statistics to statistical science and a new challenge. Metron 2021, 79, 127–135. [Google Scholar] [CrossRef]
- Magis, D.; De Boeck, P. Identification of differential item functioning in multiple-group settings: A multivariate outlier detection approach. Multivar. Behav. Res. 2011, 46, 733–755. [Google Scholar] [CrossRef]
- Magis, D.; De Boeck, P. A robust outlier approach to prevent type I error inflation in differential item functioning. Educ. Psychol. Meas. 2012, 72, 291–311. [Google Scholar] [CrossRef]
- Rusiecki, A. Robust learning algorithm based on LTA estimator. Neurocomputing 2013, 120, 624–632. [Google Scholar] [CrossRef]
- Wilcox, R. Modern Statistics for the Social and Behavioral Sciences: A Practical Introduction; CRC Press: Boca Raton, FL, USA, 2017. [Google Scholar] [CrossRef]
- Yuan, K.H.; Bentler, P.M.; Chan, W. Structural equation modeling with heavy tailed distributions. Psychometrika 2004, 69, 421–436. [Google Scholar] [CrossRef]
- Yuan, K.H.; Zhang, Z. Structural equation modeling diagnostics using R package semdiag and EQS. Struct. Equ. Model. 2012, 19, 683–702. [Google Scholar] [CrossRef]
- Kalina, J. Implicitly weighted methods in robust image analysis. J. Math. Imaging Vis. 2012, 44, 449–462. [Google Scholar] [CrossRef]
- Fox, J. Applied Regression Analysis and Generalized Linear Models; Sage: Thousand Oaks, CA, USA, 2016. [Google Scholar]
- Hampel, F.R.; Ronchetti, E.M.; Rousseeuw, P.J.; Stahel, W.A. Robust Statistics: The Approach Based on Influence Functions; Wiley: New York, NY, USA, 1986. [Google Scholar] [CrossRef]
- Asparouhov, T.; Muthén, B. Multiple-group factor analysis alignment. Struct. Equ. Model. 2014, 21, 495–508. [Google Scholar] [CrossRef]
- Muthén, B.; Asparouhov, T. IRT studies of many groups: The alignment method. Front. Psychol. 2014, 5, 978. [Google Scholar] [CrossRef]
- Pokropek, A.; Lüdtke, O.; Robitzsch, A. An extension of the invariance alignment method for scale linking. Psych. Test Assess. Model. 2020, 62, 303–334. [Google Scholar]
- Robitzsch, A. Lp loss functions in invariance alignment and Haberman linking with few or many groups. Stats 2020, 3, 19. [Google Scholar] [CrossRef]
- Muthén, B.; Asparouhov, T. Recent methods for the study of measurement invariance with many groups: Alignment and random effects. Sociol. Methods Res. 2018, 47, 637–664. [Google Scholar] [CrossRef]
- Pokropek, A.; Davidov, E.; Schmidt, P. A Monte Carlo simulation study to assess the appropriateness of traditional and newer approaches to test for measurement invariance. Struct. Equ. Model. 2019, 26, 724–744. [Google Scholar] [CrossRef]
- Haebara, T. Equating logistic ability scales by a weighted least squares method. Jpn. Psychol. Res. 1980, 22, 144–149. [Google Scholar] [CrossRef]
- He, Y.; Cui, Z.; Osterlind, S.J. New robust scale transformation methods in the presence of outlying common items. Appl. Psychol. Meas. 2015, 39, 613–626. [Google Scholar] [CrossRef]
- He, Y.; Cui, Z. Evaluating robust scale transformation methods with multiple outlying common items under IRT true score equating. Appl. Psychol. Meas. 2020, 44, 296–310. [Google Scholar] [CrossRef]
- Robitzsch, A. Robust Haebara linking for many groups: Performance in the case of uniform DIF. Psych 2020, 2, 14. [Google Scholar] [CrossRef]
- Strobl, C.; Kopf, J.; Kohler, L.; von Oertzen, T.; Zeileis, A. Anchor point selection: Scale alignment based on an inequality criterion. Appl. Psychol. Meas. 2021, 45, 214–230. [Google Scholar] [CrossRef]
- Monseur, C.; Berezner, A. The computation of equating errors in international surveys in education. J. Appl. Meas. 2007, 8, 323–335. [Google Scholar]
- Monseur, C.; Sibberns, H.; Hastedt, D. Linking errors in trend estimation for international surveys in education. IERI Monogr. Ser. 2008, 1, 113–122. [Google Scholar]
- Robitzsch, A.; Lüdtke, O. Linking errors in international large-scale assessments: Calculation of standard errors for trend estimation. Assess. Educ. 2019, 26, 444–465. [Google Scholar] [CrossRef]
- Sachse, K.A.; Haag, N. Standard errors for national trends in international large-scale assessments in the case of cross-national differential item functioning. Appl. Meas. Educ. 2017, 30, 102–116. [Google Scholar] [CrossRef]
- Wu, M. Measurement, sampling, and equating errors in large-scale assessments. Educ. Meas. 2010, 29, 15–27. [Google Scholar] [CrossRef]
- Jaeckel, L.A. Robust estimates of location: Symmetry and asymmetric contamination. Ann. Math. Stat. 1971, 42, 1020–1034. [Google Scholar] [CrossRef]
- Xu, X.; Chen, X. A practical method of robust estimation in case of asymmetry. J. Stat. Theory Pract. 2018, 12, 370–396. [Google Scholar] [CrossRef]
- Stefanski, L.A.; Boos, D.D. The calculus of M-estimation. Am. Stat. 2002, 56, 29–38. [Google Scholar] [CrossRef]
- Huber, P.J. Robust estimation of a location parameter. Ann. Math. Stat. 1964, 35, 73–101. [Google Scholar] [CrossRef]
- Simakhin, V.A.; Shamanaeva, L.G.; Avdyushina, A.E. Robust parametric estimates of heterogeneous experimental data. Russ. Phys. J. 2021, 63, 1510–1518. [Google Scholar] [CrossRef]
- Hunter, J.E. Probabilistic foundations for coefficients of generalizability. Psychometrika 1968, 33, 1–18. [Google Scholar] [CrossRef]
- Husek, T.R.; Sirotnik, K. Item Sampling in Educational Research; CSEIP Occasional Report No. 2.; University of California: Los Angeles, CA, USA, 1967; Available online: https://bit.ly/3k47t1s (accessed on 30 June 2021).
- Yuan, K.H.; Cheng, Y.; Patton, J. Information matrices and standard errors for MLEs of item parameters in IRT. Psychometrika 2014, 79, 232–254. [Google Scholar] [CrossRef]
- Kolenikov, S. Resampling variance estimation for complex survey data. Stata J. 2010, 10, 165–199. [Google Scholar] [CrossRef]
- Rao, J.N.K.; Wu, C.F.J. Resampling inference with complex survey data. J. Am. Stat. Assoc. 1988, 83, 231–241. [Google Scholar] [CrossRef]
- Brennan, R.L. Generalizability Theory; Springer: New York, NY, USA, 2001. [Google Scholar] [CrossRef]
- Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; CRC Press: Boca Raton, FL, USA, 1994. [Google Scholar] [CrossRef]
- Haberman, S.J.; Lee, Y.H.; Qian, J. Jackknifing Techniques for Evaluation of Equating Accuracy; (Research Report No. RR-09-02); Educational Testing Service: Princeton, NJ, USA, 2009. [Google Scholar] [CrossRef]
- Rao, J.N.K.; Wu, C.F.J. Inference from stratified samples: Second-order analysis of three methods for nonlinear statistics. J. Am. Stat. Assoc. 1985, 80, 620–630. [Google Scholar] [CrossRef]
- Xu, X.; von Davier, M. Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study; (Research Report No. RR-10-10); Educational Testing Service: Princeton, NJ, USA, 2010. [Google Scholar] [CrossRef]
- Battauz, M. Multiple equating of separate IRT calibrations. Psychometrika 2017, 82, 610–636. [Google Scholar] [CrossRef] [PubMed]
- Michaelides, M.P.; Haertel, E.H. Selection of common items as an unrecognized source of variability in test equating: A bootstrap approximation assuming random sampling of common items. Appl. Meas. Educ. 2014, 27, 46–57. [Google Scholar] [CrossRef]
- Tong, Y.; Brennan, R.L. Bootstrap estimates of standard errors in generalizability theory. Educ. Psychol. Meas. 2007, 67, 804–817. [Google Scholar] [CrossRef]
- R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2020; Available online: https://www.R-project.org/ (accessed on 20 August 2020).
- Robitzsch, A. sirt: Supplementary Item Response Theory Models. R package version 3.10-111; R Core Team: Vienna, Austria, 2021; Available online: https://github.com/alexanderrobitzsch/sirt (accessed on 25 June 2021).
- DeMars, C.E. Alignment as an alternative to anchor purification in DIF analyses. Struct. Equ. Model. 2020, 27, 56–72. [Google Scholar] [CrossRef]
- Chen, Y.; Li, C.; Xu, G. DIF statistical inference and detection without knowing anchoring items. arXiv 2021, arXiv:2110.11112. Available online: https://arxiv.org/abs/2110.11112 (accessed on 21 October 2021).
- Kopf, J.; Zeileis, A.; Strobl, C. A framework for anchor methods and an iterative forward approach for DIF detection. Appl. Psychol. Meas. 2015, 39, 83–103. [Google Scholar] [CrossRef]
- Tutz, G.; Schauberger, G. A penalty approach to differential item functioning in Rasch models. Psychometrika 2015, 80, 21–43. [Google Scholar] [CrossRef]
- Yuan, K.H.; Liu, H.; Han, Y. Differential item functioning analysis without a priori information on anchor items: QQ plots and graphical test. Psychometrika 2021, 86, 345–377. [Google Scholar] [CrossRef]
- Robitzsch, A. A comparison of linking methods for two groups for the two-parameter logistic item response model in the presence and absence of random differential item functioning. Foundations 2021, 1, 9. [Google Scholar] [CrossRef]
- Andersson, B. Asymptotic variance of linking coefficient estimators for polytomous IRT models. Appl. Psychol. Meas. 2018, 42, 192–205. [Google Scholar] [CrossRef]
- Von Davier, M.; Yamamoto, K.; Shin, H.J.; Chen, H.; Khorramdel, L.; Weeks, J.; Davis, S.; Kong, N.; Kandathil, M. Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assess. Educ. 2019, 26, 466–488. [Google Scholar] [CrossRef]
- Glas, C.A.W.; Jehangir, M. Modeling country-specific differential functioning. In A Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis; Rutkowski, L., von Davier, M., Rutkowski, D., Eds.; Chapman Hall/CRC Press: London, UK, 2013; pp. 97–115. [Google Scholar] [CrossRef]
- Albano, A.D.; Wiberg, M. Linking with external covariates: Examining accuracy by anchor type, test length, ability difference, and sample size. Appl. Psychol. Meas. 2019, 43, 597–610. [Google Scholar] [CrossRef]
- Sansivieri, V.; Wiberg, M. Linking scales in item response theory with covariates. J. Res. Educ. Scie. Technol. 2018, 3, 12–32. Available online: https://bit.ly/3ze7qEF (accessed on 30 June 2021).
- Wu, H.; Browne, M.W. Quantifying adventitious error in a covariance structure as a random effect. Psychometrika 2015, 80, 571–600. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).