Robust and Nonrobust Linking of Two Groups for the Rasch Model with Balanced and Unbalanced Random DIF: A Comparative Simulation Study and the Simultaneous Assessment of Standard Errors and Linking Errors with Resampling Techniques

Abstract: In this article, the Rasch model is used for assessing a mean difference between two groups on a test of dichotomous items. It is assumed that random differential item functioning (DIF) exists that can bias group differences. The case of balanced DIF is distinguished from the case of unbalanced DIF. In balanced DIF, DIF effects cancel out on average. In unbalanced DIF, in contrast, the expected value of DIF effects can differ from zero and on average favor a particular group. Robust linking methods (e.g., invariance alignment) aim at determining group mean differences that are robust to the presence of DIF. In contrast, group differences obtained from nonrobust linking methods (e.g., Haebara linking) can be distorted by even a few DIF effects. Alternative robust and nonrobust linking methods are compared in a simulation study under various conditions. Robust linking methods turned out to be preferable over nonrobust alternatives in the case of unbalanced DIF effects. Moreover, the theory of M-estimation, an important approach to robust statistical estimation suitable for data with asymmetric errors, is used to study the asymptotic behavior of linking estimators as the number of items tends to infinity. These results give insights into the asymptotic bias and the estimation of linking errors, which represent the variability in estimates due to the selection of items in a test. M-estimation is also used in an analytical treatment to assess standard errors and linking errors simultaneously. Finally, double jackknife and double half sampling methods are introduced and evaluated in a simulation study for the simultaneous assessment of standard errors and linking errors. Half sampling outperformed jackknife estimators for assessing the variability of estimates from robust linking methods.


Introduction
The analysis of psychological or educational tests is an important field in the social sciences. The test items (i.e., tasks presented in these tests) are often analyzed using item response theory (IRT, [1,2]) models. In this article, the Rasch model (RM; [3,4]) is used for comparing two groups on test items. For example, groups could be demographic groups, countries, studies, or time points. The group comparisons are carried out using linking methods [5,6]. An important impediment in applying linking methods is that the items could behave differently in the two groups (i.e., differential item functioning, DIF; [7]); that is, it cannot be expected that the Rasch model holds in the two groups with item parameters that are independent of a group membership.
In this article, we study the performance of linking methods in the presence of DIF that can bias group differences. In contrast to habitually used (i.e., nonrobust) linking methods, robust linking methods aim at deriving estimates of group differences that are robust to the presence of DIF. Importantly, DIF effects can be considered as asymmetric error distributions, and robust statistical methods for location measures are applied for determining a group difference.
This article systematically compares alternative linking methods in the RM. Furthermore, linking errors that quantify the uncertainty in group differences due to the randomness associated with DIF are analytically treated using M-estimation theory and computationally assessed using single and double jackknife and (balanced) half sampling, respectively.
The paper is structured as follows. In Section 2, the RM with random DIF is introduced. In Section 3, several nonrobust and robust linking methods are discussed. In Section 4, M-estimation theory is applied to the study of linking methods for the statistical inference of linking errors. In Section 5, M-estimation theory is applied for the simultaneous assessment of standard errors and linking errors. The resampling techniques double jackknife and double half sampling are introduced in Section 6 for empirically assessing standard errors and linking errors. In Section 7, we present the results of a simulation study in which different robust and nonrobust linking methods are systematically compared across various data-generating models for DIF effects. Section 8 presents a simulation study that investigates the empirical performance of the proposed resampling estimators from Section 6. Finally, the article concludes with a discussion in Section 9.

Rasch Model
The RM [3,4,8–13] is a statistical model for dichotomous item responses X_i for items i = 1, . . . , I. A latent variable θ (the so-called ability) accounts for the dependence among item responses. The item response function (IRF) P_i for item i in the RM is defined as

$$P_i(\theta) = P(X_i = 1 \mid \theta) = \Psi(\theta - b_i),$$

where b_i is the item difficulty, θ is the latent ability, and Ψ(x) = [1 + exp(−x)]^{−1} denotes the logistic link function. The RM in a random sampling perspective [14,15] also relies on a local independence assumption and poses a parametric distribution function F_α on the latent ability θ:

$$P(\boldsymbol{X} = \boldsymbol{x}) = \int \prod_{i=1}^{I} P_i(\theta)^{x_i} \left[1 - P_i(\theta)\right]^{1 - x_i} \, dF_\alpha(\theta),$$

where X = (X_1, . . . , X_I), x = (x_1, . . . , x_I), b = (b_1, . . . , b_I), and α is a finite-dimensional parameter. In many applications, the distribution F_α is assumed to be normal with a mean of zero. In this case, the parameter α only contains the standard deviation σ that must be estimated in addition to the item parameters b. It has been empirically shown that distributional misspecifications of F_α might not strongly bias estimates of item difficulties b_i if many items are available [16–18]. However, in (very) short tests and with a strong deviation from normality, the bias in item parameter estimates can be non-negligible. Note that the parameters of the RM are identified only up to a constant. Hence, either the mean of the abilities or the mean of the item difficulties has to be fixed to zero for reasons of identification [19,20]. If a normal distribution for θ is assumed, the mean is set to zero, and only the standard deviation is estimated. The RM is typically estimated with marginal maximum likelihood estimation [21].
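As a minimal numerical sketch (with hypothetical item difficulties and sample sizes, not taken from the article), the IRF and the simulation of dichotomous responses under the RM can be written as follows:

```python
import numpy as np

rng = np.random.default_rng(1)

def irf(theta, b):
    # Rasch IRF: P_i(theta) = Psi(theta - b_i), Psi the logistic function
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# hypothetical test: I = 20 items, P = 5000 persons, theta ~ N(0, 1)
b = np.linspace(-2.0, 2.0, 20)                   # item difficulties
theta = rng.normal(0.0, 1.0, 5000)               # latent abilities
prob = irf(theta[:, None], b[None, :])           # 5000 x 20 response probabilities
X = (rng.random(prob.shape) < prob).astype(int)  # simulated dichotomous responses

p_correct = X.mean(axis=0)  # easier items (lower b) are solved more often
```

The proportion-correct vector decreases with item difficulty, as the IRF implies.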
In practice, it is unlikely that the distribution of item responses can be adequately represented by the RM. In real datasets, more complex IRFs might be necessary, such as the family of logistic IRT models with two, three or four parameters per item [22] or even more flexible nonparametric monotone IRFs [23][24][25][26]. In large-scale educational datasets, items could have different discriminations [27], and guessing and slipping behavior have been reported [28,29]. However, fitting a misspecified RM to data might be justified if the latent ability θ should be defined such that all items contribute equally to θ [30,31]. By fitting more complex IRT models, the meaning of θ might change, which raises validity concerns. Note that the RM has been used in the popular program for international student assessment (PISA) study in the past [32] and in many other national [33,34] and international large-scale assessment studies [35].
The estimated item parameters b̂ obtained with marginal maximum likelihood estimation [21] can be interpreted as pseudo-true parameters (see [36]) that minimize the Kullback-Leibler divergence KL [37–39] between the true distribution Q and its parametric approximation P(b, σ):

$$\mathrm{KL}\big(Q, P(b, \sigma)\big) = \sum_{x} Q(x) \log \frac{Q(x)}{P(X = x; b, \sigma)},$$

where the sum is defined over the S = 2^I different item response patterns for X. The distribution Q is defined as the true data-generating distribution Q(x) = P(X = x), which is a multinomial distribution on the S item response patterns. The RM is the best approximation to Q with respect to the Kullback-Leibler divergence. By using different loss functions (i.e., estimation methods, for example, unweighted least squares estimation, [40]), different pseudo-true parameters of the RM will be obtained.

Differential Item Functioning
Now assume that I items are administered in two groups g = 1, 2. The estimation of group differences is of interest. Abilities in the first group follow a normal distribution with zero mean, that is, θ ∼ N(0, σ 2 1 ). In the second group, we also assume a normal distribution, i.e., θ ∼ N(µ 2 , σ 2 2 ). The parameter µ 2 can be interpreted as the average difference between the two groups.
In practical applications, it is unlikely that item difficulties b_i are equal across groups (i.e., measurement invariant). In this case, DIF occurs, and there exist group-specific item difficulties b_ig for groups g = 1, 2. Item-specific DIF effects e_i are defined as e_i = b_i2 − b_i1. In the absence of DIF, all DIF effects e_i would equal zero. Identification constraints on DIF effects must be imposed to disentangle group mean differences from average DIF effects [41–43].
In this paper, we distinguish the case of random items from fixed items with random DIF effects. In the first case of random items, it is assumed that the bivariate vector of item difficulties (b i1 , b i2 ) follows a bivariate distribution G. In the second case, it is assumed that b i2 = b i1 + e i with random effects e i , but item difficulties b i1 are regarded as fixed. This means that items are fixed, but DIF effects represent a random variable. DIF effects e i follow a univariate distribution G in this case.
To identify the group difference µ_2, identification constraints on G have to be imposed in both cases. The main idea is that the set of items I = {1, . . . , I} can be partitioned into two distinct sets I_AMI and I_bias. The set of items in I_AMI (also denoted as reference items; [44]) is deemed valid for obtaining unbiased group differences. The set I_AMI refers to approximately measurement invariant (AMI; [45]) items. These items are allowed to have DIF effects that on average cancel out. A special case is a set of anchor items in which all items in this set have zero DIF effects [46]. Items in the set I_bias (also denoted as biased items; [44]) have the potential to bias group differences (see [47]). The partitioning is modeled with a mixture distribution for G [46,48,49]:

$$G = (1 - \pi_{bias})\, G_{AMI} + \pi_{bias}\, G_{bias}, \quad (4)$$

where π_bias is the proportion of items in the set I_bias. In the fixed items case, it is assumed that the expected value of DIF effects is zero for items from I_AMI, while it can differ from zero for items from I_bias. More formally, it holds that

$$\delta_{AMI} = \int e \, dG_{AMI}(e) = 0 \quad \text{and} \quad \delta_{bias} = \int e \, dG_{bias}(e). \quad (5)$$

In the random items case, define by G* the univariate distribution of DIF effects e_i = b_i2 − b_i1. Based on the mixture representation of the bivariate distribution G, one can decompose the distribution of DIF effects as G* = (1 − π_bias) G*_AMI + π_bias G*_bias. The condition for DIF effects in the random items case is the same as in Equation (5):

$$\delta_{AMI} = \int e \, dG^{*}_{AMI}(e) = 0 \quad \text{and} \quad \delta_{bias} = \int e \, dG^{*}_{bias}(e). \quad (6)$$
The test is said to have balanced DIF if δ_bias = 0, and unbalanced DIF if δ_bias ≠ 0 (see [47,50,51]). It is important to emphasize that the definition of the mixture distribution allows the identification of group differences. The total DIF impact δ on the test containing all items can be calculated as (for notational simplicity, only in the fixed items case)

$$\delta = \int e \, dG(e) = \pi_{bias}\, \delta_{bias}. \quad (7)$$
With a low proportion π bias of biased items, the presence of DIF effects is not expected to have a large impact on estimated group differences.
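The mixture representation of random DIF effects and the resulting DIF impact δ = π_bias δ_bias can be illustrated with a short simulation sketch (all parameter values below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_dif_effects(I, pi_bias, tau_ami, delta_bias, rng):
    """Draw random DIF effects e_i from the two-component mixture
    G = (1 - pi_bias) * G_AMI + pi_bias * G_bias, with normal components
    assumed here: E_{G_AMI}[e] = 0 and E_{G_bias}[e] = delta_bias."""
    biased = rng.random(I) < pi_bias
    return np.where(biased,
                    rng.normal(delta_bias, 0.1, I),  # biased items favor one group
                    rng.normal(0.0, tau_ami, I))     # AMI items cancel on average

# unbalanced DIF: delta_bias != 0, so the total DIF impact is
# delta = pi_bias * delta_bias = 0.2 * 0.5 = 0.1
e = draw_dif_effects(I=100000, pi_bias=0.2, tau_ami=0.3, delta_bias=0.5, rng=rng)
delta_hat = e.mean()
```

With a large number of simulated items, the empirical mean of the DIF effects approaches the theoretical DIF impact δ = 0.1.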
In this article, we only consider random DIF effects. In probably the largest part of the literature, DIF effects are considered fixed (e.g., [52]). In this case, the condition for balanced DIF replaces the expected value by the mean of the fixed item parameters [44]. With fixed DIF effects, no additional uncertainty is introduced in the estimation of group differences because the item parameters are held fixed in repeated sampling. In contrast, with random DIF effects, the group mean difference is affected by the sampled DIF effects even for an infinite sample size of persons. This kind of uncertainty is explicitly addressed in this article.
In many applications, the estimation of group differences involves a previous step in which DIF is detected by applying statistical techniques [7,[52][53][54]. DIF detection statistics aim to classify items into a set of items that possess DIF, which should be optimally equal to the set I bias . However, DIF detection techniques rely on previous knowledge about DIF-free items or a known group difference [43,55]. Hence, the decision of whether an item has DIF or not requires additional assumptions that cannot be statistically tested (see also [31,56,57]). In this paper, we do not thoroughly investigate DIF detection techniques, but rather study the performance of linking methods to estimate group differences. We distinguish robust from nonrobust linking methods. Robust linking methods adequately handle the presence of biased items (i.e., items in the set of I bias ) that lead to unbalanced DIF, while nonrobust linking approaches result in biased estimates of group differences in unbalanced DIF situations.
If the RM does not hold, DIF between groups means that IRFs can differ across the two groups. If the misspecified RM is fitted to data, DIF in item difficulties can be interpreted as a summary of DIF between IRFs. It is acknowledged that more complex DIF, such as nonuniform DIF in item discriminations [7] or DIF in guessing parameters, might occur. However, if these model aspects are intentionally ignored by fitting the RM, DIF effects in other aspects of the IRFs only indirectly enter the DIF assessment through item difficulties. Moreover, DIF effects in item difficulties are more frequently found in empirical applications than in item discriminations [58][59][60]. In the rest of the paper, statistical inference regarding the population of persons and the population of items is discussed that is even valid if the fitted IRT model is misspecified.

Identified Item Parameters in Group-Specific Scaling Models
Linking methods rely on group-specific item parameters estimated in separate scaling models in each group. By doing so, there is no misspecification in the IRT model due to noninvariance.
In the first group, the ability variable in the data-generating model follows θ ∼ N(0, σ²_1), that is, µ_1 = 0. In a separate estimation for the first group with an infinite sample size of persons, the estimated item difficulties b̂_i1 equal the data-generating parameters b_i1 (i.e., b̂_i1 = b_i1). In the second group, the distribution of the ability variable is θ ∼ N(µ_2, σ²_2). In the estimation, the mean of the ability variable is fixed to zero for reasons of identification. Hence, the estimated item difficulties also absorb the group difference µ_2. We obtain

$$P(X_i = 1 \mid \theta) = \Psi(\theta - b_{i2}) = \Psi\big(\theta^* - (b_{i2} - \mu_2)\big) \quad \text{with} \quad \theta^* = \theta - \mu_2,$$

where the standardized ability θ* is normally distributed (i.e., N(0, σ²_2)). Consequently, it follows that b̂_i2 = b_i2 − µ_2.

Mean-Mean Linking (MM)
Mean-mean linking (MM; [5,64]) is one of the most popular linking methods. The group difference µ_2 is estimated by

$$\hat{\mu}_2 = \frac{1}{I} \sum_{i=1}^{I} \left(\hat{b}_{i1} - \hat{b}_{i2}\right).$$

Note that μ̂_2 is determined as the least-squares estimate of the item difficulty differences ν_i = b̂_i1 − b̂_i2:

$$\hat{\mu}_2 = \arg\min_{\mu} \sum_{i=1}^{I} \left(\hat{b}_{i1} - \hat{b}_{i2} - \mu\right)^2.$$

We now derive the bias of MM, assuming fixed items with random DIF effects e that follow a distribution G. The distribution is given by the mixture representation G = (1 − π_bias) G_AMI + π_bias G_bias (see Equation (4)). It holds that ∫ e dG_AMI(e) = 0 and δ_bias = ∫ e dG_bias(e) (see Equation (5)). Because ν_i = b̂_i1 − b̂_i2 = µ_2 − e_i, we obtain the bias under MM as

$$E(\hat{\mu}_2) - \mu_2 = -\pi_{bias}\, \delta_{bias}.$$

The magnitude of the bias coincides with the DIF impact on the test (see Equation (7)). The bias vanishes in the case of balanced DIF (i.e., δ_bias = 0) or in the absence of biased items (π_bias = 0). MM can be considered a nonrobust linking method because biased items can affect the estimated group difference. As an alternative to such a nonrobust approach, it may be recommended to use linking methods based on robust statistical methodology [65] designed for resistant estimation under contamination (especially for data contaminated by outlying values). The following linking methods realize some kind of robustness against the presence of biased DIF items.
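A small Monte Carlo sketch (hypothetical parameter values) illustrates that, under unbalanced DIF, the MM estimate is shifted by roughly the DIF impact π_bias δ_bias (here downward, because ν_i = µ_2 − e_i):

```python
import numpy as np

rng = np.random.default_rng(3)

mu2, I = 0.4, 5000             # hypothetical true group difference, test length
b1 = rng.normal(0.0, 1.0, I)   # group-1 item difficulties (fixed items case)

# unbalanced DIF: pi_bias = 0.2 biased items with delta_bias = 0.6
biased = rng.random(I) < 0.2
e = np.where(biased, rng.normal(0.6, 0.1, I), rng.normal(0.0, 0.2, I))
b2 = b1 + e

# identified parameters under infinite person samples: b2_hat = b2 - mu2
nu = b1 - (b2 - mu2)           # nu_i = mu2 - e_i

mu2_mm = nu.mean()             # mean-mean linking estimate
bias_mm = mu2_mm - mu2         # approximately -pi_bias * delta_bias = -0.12
```

The estimate systematically underestimates µ_2 by about π_bias δ_bias = 0.12 in this configuration.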

Asymmetrically Trimmed Mean (ATR)
An intuitive idea borrowed from robust statistics [66][67][68] is to consider biased DIF items as outliers [69,70] in estimating the location measure that is given as the group difference. Hence, robust alternatives to the mean (i.e., MM linking) can be established.
The asymmetrically trimmed mean (ATR) removes items with large differences from the estimation. Given a trimming proportion α, the ATR linking estimate μ̂_2 is defined as the average of the ν_i values for which the absolute differences |ν_i − mdn(ν)| are below the (1 − α)-quantile of these discrepancies. The main idea is that large discrepancies can be attributed to biased items, which should be removed from group comparisons. The ATR estimate is formally defined as

$$\hat{\mu}_2 = \frac{\sum_{i=1}^{I} \nu_i \, \mathbf{1}\{|\nu_i - \mathrm{mdn}(\nu)| < q_{1-\alpha}\}}{\sum_{i=1}^{I} \mathbf{1}\{|\nu_i - \mathrm{mdn}(\nu)| < q_{1-\alpha}\}},$$

where q_{1−α} denotes the (1 − α)-quantile of the discrepancies |ν_i − mdn(ν)|, 1 the indicator function, and mdn the median. The median mdn(ν) instead of the mean ν̄ is used because the median is typically more robust to outliers (i.e., biased DIF items). ATR linking has the potential to properly handle the situation of unbalanced DIF because it explicitly allows for biased items with unidirectional signs. The ATR estimator is related to the least trimmed absolute estimator [71,72], which is especially suitable for asymmetric contamination in the data. A similar idea to that of the ATR estimator is used in robust structural equation modeling for defining case weights used for downweighting outlying cases (see [73,74]). As an alternative to the ATR estimator, the least weighted squares estimator may be applied as a location estimator with high robustness as well as high efficiency [75].
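The ATR estimator can be sketched in a few lines (the trimming proportion and the ν values below are illustrative):

```python
import numpy as np

def atr_mean(nu, alpha=0.2):
    # asymmetrically trimmed mean: keep the nu_i whose distance from the
    # median is below the (1 - alpha)-quantile of all such distances
    d = np.abs(nu - np.median(nu))
    keep = d < np.quantile(d, 1.0 - alpha)
    return nu[keep].mean()

# eight invariant items and two biased items; the biased items shift
# only the plain mean (MM), not the trimmed estimate
nu = np.array([0.4] * 8 + [1.4, 1.4])
est_atr = atr_mean(nu)   # stays at the invariant value 0.4
est_mm = nu.mean()       # pulled upward to 0.6
```

The example shows the unidirectional (unbalanced) contamination that ATR is designed to resist.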

Elimination of DIF Items with Subsequent Mean-Mean Linking (EL)
Another popular approach is that DIF items are removed from the group comparison. The identification of DIF items in the first step requires the definition of an appropriate statistic. In the simulation study, we assume that a preliminary group difference is estimated by the median mdn(ν) of all differences ν_i. An item is declared to have DIF if |ν_i − mdn(ν)| exceeds a prespecified cutoff K. In many studies, the mean instead of the median is used, and the corresponding condition is referred to as the equal-mean anchor [52]. However, the median might be a more robust location estimate than the mean in the presence of DIF effects. The items with detected DIF are removed for the subsequent computation of MM linking [52]. More formally, the EL estimate can be written as

$$\hat{\mu}_2 = \frac{\sum_{i=1}^{I} \nu_i \, \mathbf{1}\{|\nu_i - \mathrm{mdn}(\nu)| \le K\}}{\sum_{i=1}^{I} \mathbf{1}\{|\nu_i - \mathrm{mdn}(\nu)| \le K\}}.$$

The EL linking method, by eliminating DIF items, can be interpreted as another variant of a trimmed mean.
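A corresponding sketch of the EL estimator (the cutoff K and the ν values are hypothetical):

```python
import numpy as np

def el_mean(nu, K=0.5):
    # eliminate items flagged as DIF (|nu_i - median| > K), then apply
    # mean-mean linking (the plain mean) to the remaining items
    keep = np.abs(nu - np.median(nu)) <= K
    return nu[keep].mean()

# five approximately invariant items plus two clearly biased items
nu = np.array([0.35, 0.40, 0.45, 0.42, 0.38, 1.60, 1.55])
est = el_mean(nu, K=0.5)   # the two biased items are eliminated
```

With the two biased items removed, the estimate equals the mean of the remaining five differences, 0.40.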

Bisquare Linking (BSQ)
Another robust estimate of the location parameter is based on the bisquare loss function ρ (see [76]) that is defined by

$$\rho(x) = \begin{cases} \dfrac{K^2}{6}\left[1 - \left(1 - \left(\dfrac{x}{K}\right)^2\right)^3\right] & \text{for } |x| \le K, \\[2mm] \dfrac{K^2}{6} & \text{for } |x| > K, \end{cases}$$

where K is a prespecified threshold value. The group difference µ_2 is estimated by

$$\hat{\mu}_2 = \arg\min_{\mu} \sum_{i=1}^{I} \rho(\nu_i - \mu).$$

Note that the bisquare loss function is also known as the Tukey biweight function [77].
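A sketch of BSQ linking, using a simple grid search over candidate values of µ rather than a dedicated optimizer (all values are hypothetical):

```python
import numpy as np

def bisquare_rho(x, K):
    # Tukey bisquare loss: roughly quadratic near 0, constant K^2/6 for |x| > K
    z = np.clip(np.abs(x) / K, 0.0, 1.0)
    return (K**2 / 6.0) * (1.0 - (1.0 - z**2)**3)

def bsq_linking(nu, K=1.0):
    # minimize sum_i rho(nu_i - mu) over a grid of candidate mu values
    grid = np.linspace(nu.min(), nu.max(), 2001)
    loss = bisquare_rho(nu[None, :] - grid[:, None], K).sum(axis=1)
    return grid[np.argmin(loss)]

# 50 approximately invariant items around 0.4 plus 10 biased items at 2.0
nu = np.concatenate([np.random.default_rng(4).normal(0.4, 0.05, 50),
                     np.full(10, 2.0)])
est = bsq_linking(nu, K=1.0)
```

Because the biased items lie beyond the threshold K, their loss contribution is capped at K²/6 each, and the minimizer stays near the invariant value 0.4.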

Invariance Alignment (IA)
The bisquare loss function ρ in Equation (15) can be replaced by any robust (or nonrobust) loss function. In invariance alignment (IA; [78–81]), the L_p power loss function ρ(x) = |x|^p (p ∈ (0, 2]) is employed. The group mean estimate is given by

$$\hat{\mu}_2 = \arg\min_{\mu} \sum_{i=1}^{I} |\nu_i - \mu|^p.$$

By choosing p ≤ 1, the extent of noninvariance is minimized. Hence, the group mean difference relies on items that have small DIF effects while effectively removing items with large DIF effects from the comparison [78]. IA was originally proposed with the power p = 0.5 [78]. IA with p = 2 is equivalent to MM. Note that IA with p < 1 is particularly suited to the situation of partial invariance in which G_AMI concentrates at zero (i.e., all DIF effects in I_AMI are zero or close to zero) and fails for symmetrically distributed DIF effects [82,83].
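The L_p criterion of IA can likewise be sketched with a grid search; the nonsmooth |x|^p is replaced by the smooth approximation (x² + ε)^{p/2}, as suggested in the alignment literature (all values below are hypothetical):

```python
import numpy as np

def ia_linking(nu, p=0.5, eps=1e-4):
    # minimize sum_i (|nu_i - mu|^2 + eps)^(p/2) over a grid of mu values
    grid = np.linspace(nu.min(), nu.max(), 4001)
    loss = ((nu[None, :] - grid[:, None])**2 + eps)**(p / 2.0)
    return grid[np.argmin(loss.sum(axis=1))]

# partial invariance: 20 invariant items at 0.4, 5 biased items at 1.6
nu = np.concatenate([np.full(20, 0.4), np.full(5, 1.6)])
est_robust = ia_linking(nu, p=0.5)   # concentrates on the invariant items
est_mm = ia_linking(nu, p=2.0)       # p = 2 reproduces mean-mean linking
```

With p = 0.5 the estimate stays at 0.4, whereas p = 2 yields the plain mean (20·0.4 + 5·1.6)/25 = 0.64.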
To gain more insight into the relation between IA and HAE, we apply a Taylor approximation of the second IRF in Equation (17) under the assumption of small DIF effects e_i:

$$\Psi(\theta - \hat{b}_{i2} - \mu) \approx \Psi(\theta - \hat{b}_{i1}) + \Psi'(\theta - \hat{b}_{i1})\left(\hat{b}_{i1} - \hat{b}_{i2} - \mu\right) = \Psi(\theta - \hat{b}_{i1}) + \Psi'(\theta - \hat{b}_{i1})(\nu_i - \mu). \quad (18)$$

Using the approximation (18), Equation (17) can be rewritten as

$$\hat{\mu}_2 \approx \arg\min_{\mu} \sum_{i=1}^{I} \omega_i \, |\nu_i - \mu|^p,$$

where the item-specific weights ω_i are obtained by integrating powers of the term Ψ'(θ − b̂_i1) over the ability distribution used in the Haebara criterion. Hence, HAE linking can be interpreted as IA with item-specific weights, and a similar performance of HAE and IA can be expected.

Gini Linking (GI)
Recently, a linking procedure based on the Gini index (GI; [88]) has been proposed. The linking function is very similar to IA linking and aims to define a group difference that is primarily based on items with small DIF effects. The group mean difference μ̂_2 is determined by optimizing a Gini-index-based criterion of the absolute differences |ν_i − µ|^p, where the power p > 0 can be chosen by the user. The original proposal used p = 1 [88]. Previous experience of the authors indicates that GI also works with p > 1, but it does not perform satisfactorily with p < 1. It has been shown that IA and GI provide similar results in small case studies [88], but GI linking has not yet been systematically compared with other linking methods.

Robustness of the Different Linking Methods
The linking methods mean-mean linking (MM) and Haebara linking (HAE) with p = 2 can be considered nonrobust. The linking methods based on the asymmetrically trimmed mean (ATR), elimination of DIF items with subsequent mean-mean linking (EL), bisquare linking (BSQ), invariance alignment (IA) with p ≤ 1, Haebara linking (HAE) with p ≤ 1, and Gini linking (GI) can be considered robust linking methods that offer some protection against the presence of biased DIF items.

An Analytical Treatment for Assessing Linking Errors
In this section, the computation of linking errors is investigated. Linking errors refer to the uncertainty due to the randomness associated with items [32,81,89–93]. The estimated group difference μ̂_2 is affected by random DIF, and the linking error quantifies this source of variance. In this section, an analytical treatment for assessing linking errors is presented. For this section, we assume an infinite sample size of persons; that is, identified item parameters are estimated without a sampling error. This assumption is dropped in Section 5.
In Section 2.2, we assume that random DIF (or item parameters) follow(s) a mixture distribution G = (1 − π bias )G AMI + π bias G bias , where G AMI denotes the distribution associated with reference items for which DIF effects on average cancel out and G bias denotes the distribution of biased items that can impact estimated group differences. The estimation of the group difference can be interpreted as an estimation problem of a location parameter in robust statistics where the location parameter (i.e., the group difference µ 2 ) should be based on G AMI . However, the observed mixture distribution G contains a contaminated asymmetric error distribution [67,94,95] that might bias the estimateμ 2 . As discussed in Section 2.2, two cases of random DIF can be distinguished. First, items can be considered random, and the bivariate vector (b i1 , b i2 ) of group-specific item difficulties is modeled with a distribution (see Section 4.1). Second, items can be regarded as fixed, but DIF effects e i = b i2 − b i1 are modeled as a random variable (see Section 4.2). Although these cases are very different, their consequences lead to similar estimates of variances. Hence, estimated errors (i.e., linking errors) due to the randomness associated with items are practically identical.

Random Item Parameters
In this subsection, we discuss the estimation for random item parameters. We introduce a slightly more general notation to cover the linking methods (except for GI linking) from Section 3. The "data" for item i is given by the vector y_i = (b̂_i1, b̂_i2). The linking method must be additive with respect to functions of this data. More formally, let H be a linking function that is defined by

$$H(\gamma) = \frac{1}{I} \sum_{i=1}^{I} h(y_i; \gamma). \quad (21)$$

The linking parameter γ (e.g., a group difference) of interest is estimated by

$$\hat{\gamma} = \arg\min_{\gamma} H(\gamma). \quad (22)$$

For a large number of items I, note that H(γ) defined in (21) converges to

$$H_\infty(\gamma) = \int h(y; \gamma) \, dG(y). \quad (23)$$

Assuming differentiability of h implies that γ̂ can be obtained by solving the equation

$$\frac{1}{I} \sum_{i=1}^{I} m(y_i; \hat{\gamma}) = 0, \quad (24)$$

where m = ∂h/∂γ. Equation (24) provides an estimating equation for the parameter γ. The corresponding estimator is labeled an M-estimator [96]. It is evident that the estimated group mean differences μ̂_2 in MM, IA, and HAE linking are M-estimators obtained by defining the univariate parameter γ̂ = (μ̂_2). The linking methods EL and ATR are so-called two-step estimators because their computation relies on the median μ̃_2 = mdn(ν) computed in a first step. Because the estimating equation for the median is clearly defined, Equation (24) also applies to two-step estimators: they can be interpreted as bivariate one-step M-estimators by defining γ̂ = (μ̃_2, μ̂_2) (see [37], chp. 7).
We now apply the theory of M-estimators ([37], chp. 7; [96,97]) to study the asymptotic behavior of γ̂. Because we are concerned with linking errors, asymptotic behavior is meant with respect to the number of items. By letting the number of items tend to infinity, the left side in Equation (24) converges to

$$S(\gamma) = \int m(y; \gamma) \, dG(y), \quad (25)$$

where y = (b̂_1, b̂_2) denotes the random variable associated with estimated item parameters. As already mentioned, the distribution of item parameters G follows the mixture distribution G = (1 − π_bias) G_AMI + π_bias G_bias. Assume that densities for the involved distributions exist (i.e., continuous or count densities): dG = g dy, dG_AMI = g_AMI dy, and dG_bias = g_bias dy. Equation (25) can be written as

$$S(\gamma) = (1 - \pi_{bias}) \int m(y; \gamma)\, g_{AMI}(y)\, dy + \pi_{bias} \int m(y; \gamma)\, g_{bias}(y)\, dy. \quad (26)$$

The parameter γ_∞ obtained from a linking method with an infinite number of items is given as the root of the equation

$$S(\gamma_\infty) = 0. \quad (27)$$

Note that γ_∞ is a function of µ_2, G, and m. For a dataset, µ_2 and G are fixed but unknown. However, m is chosen by the user through the linking method.
The pseudo-true parameter γ_AMI is defined as the estimate that would be obtained if all items were reference items; that is, the linking parameter γ would only be determined by the mixture component G_AMI:

$$\int m(y; \gamma_{AMI}) \, dG_{AMI}(y) = 0. \quad (28)$$

Ideally, a component of γ_AMI (in the bivariate case) or γ_AMI itself (in the univariate case) should provide an asymptotically unbiased estimate of µ_2 by choosing an appropriate linking function h (or its derivative m). In the following, we assume that m is differentiable, although the main propositions for M-estimators do not require differentiability [37]. However, one could always approximate a nondifferentiable linking function h by a differentiable approximation h_A. For example, the nondifferentiable and nonnegative linking function h(x) can be approximated by $h_A(x) = \sqrt{h(x)^2 + \varepsilon}$ for a sufficiently small ε > 0 [78,80,81,87].

Asymptotic Behavior
We now study the asymptotic behavior of the estimator γ̂. For a large number of items, γ̂ converges to γ_∞. The derivation of γ_∞ relies on a Taylor approximation of m and closely follows [98]. Due to (26), we get

$$(1 - \pi_{bias}) \int m(y; \gamma_\infty)\, dG_{AMI}(y) + \pi_{bias} \int m(y; \gamma_\infty)\, dG_{bias}(y) = 0. \quad (29)$$

We now apply a first-order Taylor approximation of m around γ = γ_AMI:

$$m(y; \gamma_\infty) \approx m(y; \gamma_{AMI}) + M_\gamma(y; \gamma_{AMI})\,(\gamma_\infty - \gamma_{AMI}), \quad (30)$$

where M_γ = ∂m/∂γ is the matrix of partial derivatives. From (29) and (30), using ∫ m(y; γ_AMI) dG_AMI(y) = 0, we get

$$\left[\int M_\gamma(y; \gamma_{AMI})\, dG(y)\right](\gamma_\infty - \gamma_{AMI}) + \pi_{bias} \int m(y; \gamma_{AMI})\, dG_{bias}(y) \approx 0. \quad (31)$$

Hence, we obtain from (31)

$$\gamma_\infty - \gamma_{AMI} \approx -\pi_{bias} \left[\int M_\gamma(y; \gamma_{AMI})\, dG(y)\right]^{-1} \int m(y; \gamma_{AMI})\, dG_{bias}(y). \quad (32)$$

If we assume that γ_AMI allows the unbiased estimation of µ_2, Equation (32) provides an expression of the asymptotic bias of γ̂. Of crucial importance is that the linking function m downweights observations from the distribution of biased items G_bias (i.e., ∫ m(y; γ_∞) dG_bias(y) ≈ 0). The linking function m has to be chosen so that biased items are automatically removed from the group comparison. The next subsection discusses how the linking function m should be chosen to enable an unbiased estimation of µ_2.

Choosing an Optimal Linking Function m
Again, the derivation of the choice of the linking function m follows the exposition in [98]. Assume that the true parameter µ_2 is determined by the distribution G_AMI (with density g_AMI) of reference items, and let f(y; µ) denote the model density of the item parameters. Hence, µ_2 is given as the maximizer of the log-likelihood function

$$\mu_2 = \arg\max_{\mu} \int \log f(y; \mu)\, g_{AMI}(y)\, dy \quad (33)$$

and fulfills

$$\int \frac{\partial \log f(y; \mu_2)}{\partial \mu}\, g_{AMI}(y)\, dy = 0.$$

Based on (33), the linking function m can be chosen in order to obtain unbiased estimates of the group mean difference µ_2 (see [98]):

$$m(y; \mu) = w(y; \mu)\, \frac{\partial \log f(y; \mu)}{\partial \mu} \quad (34)$$

with the weight function w defined as

$$w(y; \mu) = \frac{(1 - \pi_{bias})\, g_{AMI}(y)}{g(y)}. \quad (35)$$

Note that 0 < w(y, µ_2) ≤ 1, and the weight function w weighs observations y according to their closeness to the distribution G_AMI. Observations y with large density values g_bias(y) are downweighted in w. Using (33), it can be shown that

$$\int m(y; \mu_2)\, dG(y) = (1 - \pi_{bias}) \int \frac{\partial \log f(y; \mu_2)}{\partial \mu}\, g_{AMI}(y)\, dy = 0;$$

that is, this choice of m provides an asymptotically unbiased estimate of µ_2.

Asymptotic Normal Distribution
We now show that the M-estimator γ̂ follows an asymptotic normal (AN) distribution (see [37], chp. 7). The same Taylor expansion as in (30) provides

$$m(y_i; \hat{\gamma}) \approx m(y_i; \gamma_\infty) + M_\gamma(y_i; \gamma_\infty)\,(\hat{\gamma} - \gamma_\infty). \quad (37)$$

The approximation (37) can be substituted into the estimating Equation (24):

$$\frac{1}{I}\sum_{i=1}^{I} m(y_i; \gamma_\infty) + \left[\frac{1}{I}\sum_{i=1}^{I} M_\gamma(y_i; \gamma_\infty)\right](\hat{\gamma} - \gamma_\infty) \approx 0. \quad (38)$$

Hence, we obtain from (38)

$$\hat{\gamma} - \gamma_\infty \approx -\left[\frac{1}{I}\sum_{i=1}^{I} M_\gamma(y_i; \gamma_\infty)\right]^{-1} \frac{1}{I}\sum_{i=1}^{I} m(y_i; \gamma_\infty). \quad (39)$$

For I → ∞, we have (see (24))

$$\frac{1}{I}\sum_{i=1}^{I} M_\gamma(y_i; \gamma_\infty) \;\xrightarrow{p}\; A(\gamma_\infty) = \int M_\gamma(y; \gamma_\infty)\, dG(y).$$

Therefore, we obtain the asymptotic normal distribution of γ̂ as

$$\hat{\gamma} \sim \mathrm{AN}\!\left(\gamma_\infty,\; \frac{1}{I}\, A(\gamma_\infty)^{-1} B(\gamma_\infty) A(\gamma_\infty)^{-\top}\right) \quad \text{with} \quad B(\gamma_\infty) = \int m(y; \gamma_\infty)\, m(y; \gamma_\infty)^\top\, dG(y). \quad (42)$$

The involved matrices A(γ_∞) and B(γ_∞) can be estimated from sample data by

$$\hat{A}(\hat{\gamma}) = \frac{1}{I}\sum_{i=1}^{I} M_\gamma(y_i; \hat{\gamma}) \quad (44) \qquad \text{and} \qquad \hat{B}(\hat{\gamma}) = \frac{1}{I}\sum_{i=1}^{I} m(y_i; \hat{\gamma})\, m(y_i; \hat{\gamma})^\top. \quad (45)$$

Notably, the distribution stated in Equation (42) only holds for a sufficiently large number of items I.
For the univariate group mean difference, the asymptotic behavior of μ̂_2 can be described as (see Equation (32))

$$\mu_{2,\infty} - \mu_2 \approx -\pi_{bias}\, \frac{\int m(y; \mu_2)\, dG_{bias}(y)}{\int \dot{m}(y; \mu_2)\, dG(y)},$$

where ṁ is the derivative of m with respect to µ_2. Furthermore, μ̂_2 is asymptotically normally distributed (see Equation (42)):

$$\hat{\mu}_2 \sim \mathrm{AN}\!\left(\mu_{2,\infty},\; \frac{1}{I}\, \frac{\int m(y; \mu_{2,\infty})^2\, dG(y)}{\left[\int \dot{m}(y; \mu_{2,\infty})\, dG(y)\right]^2}\right). \quad (49)$$

Again, the involved integrals for the variance estimate in (49) can be estimated using sample data (see Equations (44) and (45)).
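For the univariate case, the sandwich formula can be checked numerically. The sketch below uses the simplest M-estimator, the mean with m(y, γ) = y − γ, for which the sandwich standard error must reduce to the classical standard error of the mean (data and sample size hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
I = 400
y = rng.normal(0.3, 0.8, I)    # "item data", e.g., differences nu_i

# estimating equation (1/I) sum_i m(y_i, gamma) = 0 with m(y, gamma) = y - gamma
gamma_hat = y.mean()

# sample-based sandwich estimates in the style of Equations (44)-(45):
A_hat = -1.0                             # (1/I) sum_i dm/dgamma = -1 for the mean
B_hat = np.mean((y - gamma_hat)**2)      # (1/I) sum_i m(y_i, gamma_hat)^2
se_sandwich = np.sqrt(B_hat / (A_hat**2 * I))

# for this m, the sandwich SE equals the usual standard error sd(y)/sqrt(I)
se_classic = y.std(ddof=0) / np.sqrt(I)
```

For robust linking functions m (e.g., the bisquare score), A_hat and B_hat are computed analogously from the derivatives and squared scores of m.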

Fixed Item Parameters b i , Random DIF Effects e i
In this subsection, we consider the case of fixed item parameters b_i with random DIF effects e_i. The "data" in Section 4.1 was given by y_i = (b̂_i1, b̂_i2), and DIF effects e_i follow a distribution G. Now, only e_i is random, and we define the data as y_i = (b_i, e_i). The estimating Equation (24) for the linking parameter γ can be rewritten as

$$\frac{1}{I}\sum_{i=1}^{I} m(b_i, e_i; \hat{\gamma}) = 0. \quad (50)$$

The term in (50) converges to

$$\tilde{S}_0(\gamma; G) = \lim_{I \to \infty} \frac{1}{I}\sum_{i=1}^{I} \int m(b_i, e; \gamma)\, dG(e).$$

One has to assume that S̃_0(γ; G) exists. Define γ_∞ by

$$\tilde{S}_0(\gamma_\infty; G) = 0.$$

Then, we can derive an asymptotic normal distribution for γ̂:

$$\hat{\gamma} \sim \mathrm{AN}\!\left(\gamma_\infty,\; \frac{1}{I}\, \tilde{A}(\gamma_\infty)^{-1} \tilde{B}(\gamma_\infty) \tilde{A}(\gamma_\infty)^{-\top}\right).$$

The involved matrices Ã(γ_∞) and B̃(γ_∞) can be estimated by

$$\hat{\tilde{A}}(\hat{\gamma}) = \frac{1}{I}\sum_{i=1}^{I} M_\gamma(b_i, e_i; \hat{\gamma}) \qquad \text{and} \qquad \hat{\tilde{B}}(\hat{\gamma}) = \frac{1}{I}\sum_{i=1}^{I} m(b_i, e_i; \hat{\gamma})\, m(b_i, e_i; \hat{\gamma})^\top.$$

Interestingly, these estimators coincide with the estimated standard errors in the case of random item parameters (see Equations (44) and (45)). Hence, no practical differences regarding the estimated linking parameters and their estimated standard errors are to be expected; only conceptual differences emerge between the two treatments of DIF effects.

An Analytical Treatment for the Simultaneous Assessment of Standard Errors and Linking Errors
In practice, the variance of the group mean difference is affected by the sampling of persons (i.e., the standard error) and the randomness associated with items (i.e., the linking error). There have been attempts at an analytical treatment of simultaneous inference with respect to these two modes [81,99,100]. In this section, we apply M-estimation theory for the simultaneous assessment of standard errors and linking errors. The general idea is to investigate the asymptotic behavior of the M-estimator γ̂ if the number of persons P and the number of items I tend to infinity. We only consider the case of random items; the treatment of the case with fixed items and random DIF effects is similar.
In the notation of Section 4, y i denotes the vector of (true) identified item parameters.
In finite samples of size P, only estimates ŷ_i are available. For P → ∞, it holds that ŷ_i converges in probability to y_i. In long tests, the estimated item parameters are approximately independent across items [101]. Hence, we can assume that the ŷ_i are approximately independent of each other. M-estimation theory applied to the person side guarantees an asymptotic normal distribution:

$$\hat{y}_i \sim \mathrm{AN}\!\left(y_i,\; \frac{1}{P}\, V_i(y_i)\right), \quad (57)$$

where V_i(y_i) is a function of the true item parameters y_i. We now use a Taylor expansion with respect to γ and y_i:

$$m(\hat{y}_i; \hat{\gamma}) \approx m(y_i; \gamma_\infty) + M_\gamma(y_i; \gamma_\infty)(\hat{\gamma} - \gamma_\infty) + M_y(y_i; \gamma_\infty)(\hat{y}_i - y_i),$$

where M_y = ∂m/∂y. Using the same approach as in Section 4.1.3, we get an approximation of the estimating equation as

$$\frac{1}{I}\sum_{i=1}^{I} m(y_i; \gamma_\infty) + A(\gamma_\infty)(\hat{\gamma} - \gamma_\infty) + \frac{1}{I}\sum_{i=1}^{I} M_y(y_i; \gamma_\infty)(\hat{y}_i - y_i) \approx 0.$$

Then, we obtain

$$\hat{\gamma} - \gamma_\infty \approx -A(\gamma_\infty)^{-1}\left[\frac{1}{I}\sum_{i=1}^{I} m(y_i; \gamma_\infty) + \frac{1}{I}\sum_{i=1}^{I} M_y(y_i; \gamma_\infty)(\hat{y}_i - y_i)\right].$$

The first term in the bracket reflects the sampling of items, and the second term the sampling of persons; because ŷ_i − y_i converges in probability to 0 for P → ∞, the second term vanishes asymptotically with increasing sample sizes of persons. For the computation of the covariance matrix, the two terms are independent, and we have

$$\mathrm{Var}(\hat{\gamma}) \approx \frac{1}{I}\, A(\gamma_\infty)^{-1} B(\gamma_\infty) A(\gamma_\infty)^{-\top} + \frac{1}{PI}\, A(\gamma_\infty)^{-1} C(\gamma_\infty) A(\gamma_\infty)^{-\top} \quad \text{with} \quad C(\gamma_\infty) = \lim_{I \to \infty} \frac{1}{I}\sum_{i=1}^{I} M_y(y_i; \gamma_\infty)\, V_i(y_i)\, M_y(y_i; \gamma_\infty)^\top.$$

This shows the asymptotic normal distribution when simultaneous inference with respect to persons and items is conducted:

$$\hat{\gamma} \sim \mathrm{AN}\!\left(\gamma_\infty,\; \frac{1}{I}\, A^{-1} B A^{-\top} + \frac{1}{PI}\, A^{-1} C A^{-\top}\right). \quad (67)$$

It is evident from (67) that both the number of persons and the number of items enter the statistical inference. The involved matrices A(γ_∞), B(γ_∞), and C(γ_∞) can be estimated from sample data. However, the true identified item parameters y_i have to be replaced by the estimated item parameters ŷ_i, which can cause slight biases in estimated variance matrices. Because of this disadvantage, we propose resampling techniques for the simultaneous inference of standard errors and linking errors in the next section.

Resampling Methods for the Simultaneous Assessment of Standard Errors and Linking Errors
We now derive estimation formulas for resampling methods [102,103] for persons and items. The derivation is motivated by assuming the following data-generating model

Y_pi = μ + u_p + v_i + e_pi,

where Y_pi is the observed data for person p (or person group) and item i (or item group). The random variables u_p, v_i, and e_pi have zero means and variances σ²_P, σ²_I, and σ²_{P×I}, respectively, and are all independent of each other. We now derive the variance for the mean estimate μ̂ = (1/(PI)) Σ_p Σ_i Y_pi. Its variance is given by

Var(μ̂) = σ²_P/P + σ²_I/I + σ²_{P×I}/(PI).      (70)

The variance in (70) contains error sources for persons and items. Hence, it allows a simultaneous inference for both error facets. Following the terminology of errors in item response modeling for the large-scale assessment of students [93], the variance σ²_P/P quantifies the sampling error due to sampling persons, the variance σ²_I/I the linking error due to sampling items, and the variance σ²_{P×I}/(PI) can be interpreted as measurement error.
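To make the decomposition in (70) concrete, the following sketch simulates the crossed model with unit variance components and compares the empirical variance of μ̂ with the analytic formula. The original analyses were carried out in R with the sirt package; this Python fragment is purely illustrative, and all names in it are our own.

```python
import numpy as np

def analytic_var(sigma2_P, sigma2_I, sigma2_PI, P, I):
    # Equation (70): variance of the grand mean under the crossed model
    return sigma2_P / P + sigma2_I / I + sigma2_PI / (P * I)

def simulate_mu_hat(P, I, rng):
    # Y_pi = mu + u_p + v_i + e_pi, all variance components equal to 1
    u = rng.normal(0, 1, size=(P, 1))   # person effects (sampling error)
    v = rng.normal(0, 1, size=(1, I))   # item effects (linking error)
    e = rng.normal(0, 1, size=(P, I))   # interaction (measurement error)
    Y = 0.5 + u + v + e                 # true mu = 0.5
    return Y.mean()

rng = np.random.default_rng(1)
P, I = 50, 20
mu_hats = np.array([simulate_mu_hat(P, I, rng) for _ in range(2000)])
emp_var = mu_hats.var()
print(emp_var, analytic_var(1.0, 1.0, 1.0, P, I))
```

With these settings, the analytic value is 1/50 + 1/20 + 1/1000 = 0.071, and the Monte Carlo variance of the 2000 simulated means should land close to it.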

Single Jackknife (SJK)
The classical single jackknife (SJK; [104–107]) approach removes one unit (e.g., a (group of) person(s) or a (group of) item(s)) from an analysis for computing standard errors. First, we investigate the jackknife estimate in which only persons are removed. Let μ̂_(−p,0) be the mean estimate in which person p is removed. We now derive the expected value of the square in Equation (72) and define the sum of squares S²_P. By using (72), we obtain Equation (75), which allows the computation of the standard error associated with person sampling. From Equation (70), we can attribute the variance σ²_P/P to person sampling. From (75), a variance estimate is obtained by replacing the expected value E S²_P with the observed value S²_P. In the single jackknife, the person-by-item interaction variance component σ²_{P×I} is typically ignored, and the variance due to person sampling is, hence, estimated by ((P−1)/P) S²_P. Similarly, we can derive the properties of the SJK estimate in which a single item i (or an item group) is removed from the analysis. The SJK variance estimate for item sampling utilizes the corresponding sum of squares S²_I. By exchanging the indices i and p in Equation (75), we obtain the analogous expectation. By replacing the expected value E S²_I with the observed value S²_I, the quantity ((I−1)/I) S²_I is used as the variance estimate concerning the item facet. For the joint inference of persons and items, the variance terms for persons and items are added:

v̂_SJK = ((P−1)/P) S²_P + ((I−1)/I) S²_I.      (80)

Note that this variance estimate is biased because the expected values of S²_P and S²_I both contain the interaction component σ²_{P×I}. Consequently, so-called double jackknife resampling should be employed to remove the bias from the estimated variance.
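As a sanity check on the person facet, note that for the plain mean the SJK estimate ((P − 1)/P) S²_P collapses to the textbook jackknife value s²/P. A minimal Python illustration (our own naming, not code from the study):

```python
import numpy as np

def sjk_person_variance(y):
    # Single jackknife over the person facet for the mean:
    # (P-1)/P * S_P^2 with S_P^2 = sum_p (mu_hat_(-p) - mu_hat)^2
    P = len(y)
    mu_hat = y.mean()
    loo = (y.sum() - y) / (P - 1)        # leave-one-person-out means
    S2_P = ((loo - mu_hat) ** 2).sum()
    return (P - 1) / P * S2_P

y = np.array([0.2, 0.5, 0.9, 1.4, 0.1, 0.7])
# For the plain mean, this equals the textbook estimate s^2 / P
print(sjk_person_variance(y), y.var(ddof=1) / len(y))
```

The leave-one-out means are computed in closed form here; for linking estimators, each μ̂_(−p,0) would instead require refitting the linking for every replicate.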

Double Jackknife (DJK)
The double jackknife (DJK; [104,106,108]) removes a person (or a group of persons) and an item (or a group of items) from an analysis for the determination of the standard error. The elimination and repeated analysis are carried out for all persons and items. Let μ̂_(−p,−i) be the mean estimate in which person p and item i are removed. The estimate μ̂_(−p,0) only removes person p, and the estimate μ̂_(0,−i) only removes item i. The corresponding estimates have already been studied as SJK estimates in Section 6.1.
We now consider an analysis in which one person and one item are removed. One obtains the estimate μ̂_(−p,−i), and its expected squared deviation follows. Now, define the corresponding sum of squares S²_{P×I}; we then obtain its expectation by using (84). One can use Equations (75), (79), and (86) as estimating equations by equating the expected values of the sums of squares with their observed counterparts. This yields three equations for the three unknown variance components σ²_P, σ²_I, and σ²_{P×I}. We further simplify (87) to (88). Substituting the first and second equations in (88) into the third equation, we obtain

(P − 1)(I − 1) S²_{P×I} = (I − 1)(P − 1)I S²_P + (P − 1)P(I − 1) S²_I + σ²_{P×I}.      (89)
Hence, we obtain σ̂²_{P×I} from (89). Further, the variance components for persons and items can be computed as in (90) and (91). The quantities in (90) and (91) can be used to estimate the population variance defined in (70). The crucial issue is how to handle negative variance estimates. Based on experience from preliminary simulation studies, the variance estimate in (92) turned out to be most satisfactory, where the truncated estimate max(σ̂²_{P×I}, 0) is nonnegative and σ̂²_{P×I} is defined in Equation (90).
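The mechanics of DJK for a simple mean can be sketched as follows. The snippet only computes the three sums of squares S²_P, S²_I, and S²_{P×I} in closed form; how they are combined into variance components is exactly the content of Equations (88)–(92) and is therefore left to the text. All identifiers are illustrative.

```python
import numpy as np

def djk_building_blocks(Y):
    # Leave-one-out means over persons, items, and person-item pairs.
    # The combination of these sums of squares into variance components
    # follows Equations (88)-(92); here we only compute the ingredients.
    P, I = Y.shape
    mu = Y.mean()
    total = Y.sum()
    row = Y.sum(axis=1)                   # per-person sums
    col = Y.sum(axis=0)                   # per-item sums
    mu_p = (total - row) / ((P - 1) * I)  # person p removed
    mu_i = (total - col) / (P * (I - 1))  # item i removed
    # person p and item i removed jointly:
    mu_pi = (total - row[:, None] - col[None, :] + Y) / ((P - 1) * (I - 1))
    S2_P = ((mu_p - mu) ** 2).sum()
    S2_I = ((mu_i - mu) ** 2).sum()
    S2_PI = ((mu_pi - mu) ** 2).sum()
    return S2_P, S2_I, S2_PI

Y = np.arange(12.0).reshape(4, 3)
print(djk_building_blocks(Y))
```

The closed-form updates avoid recomputing each submatrix mean from scratch; for an actual linking estimator, every μ̂_(−p,−i) would require a full reanalysis.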

Single Half Sampling (SHS)
In single half sampling (SHS; [107]), half of the sample is used to reanalyze the data to compute standard errors. Let μ̂_{P,h} be the mean estimate from the h-th half sample of persons, in which half of the persons are sampled. Without loss of generality, let P be even; the h-th half sample then consists of P/2 persons. We define a half sample h in which the first P/2 persons are sampled and compute the corresponding mean estimate. Then, we obtain the expectation in (95) and, hence, (96). Now, there are H (potentially balanced) half samples (see [102]) with estimates μ̂_{P,h}. Define the variance U²_P of these estimates; using (95), its expectation follows in (97). Similarly, one can consider half samples of items. Assume that in half sample k, the first I/2 items are sampled, with mean estimate μ̂_{I,k}. One can define the variance U²_I in estimates due to different half samples of items; using the same derivations, we get (100). Based on the expected values in (97) and (100), one can define a variance estimate of μ̂ by adding the variance components regarding persons and items:

v̂_SHS = U²_P + U²_I.      (101)

Notably, this estimate is positively biased because the person-by-item interaction variance enters both U²_P and U²_I. As in the case of SJK, SHS also results in a biased variance estimate. In the next section, we investigate double half sampling, which removes the bias component.
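A useful exact identity underlies SHS for the mean: averaged over all possible half samples of persons, the squared deviations (μ̂_{P,h} − μ̂)² equal s²/P, i.e., they directly estimate the variance of μ̂ without a jackknife-style (P − 1)/P correction. The following sketch verifies this by full enumeration (illustrative Python, not study code):

```python
import numpy as np
from itertools import combinations

def shs_all_halves_variance(y):
    # Average of (mu_hat_h - mu_hat)^2 over ALL half samples of persons.
    # By the without-replacement variance formula with n = P/2, this
    # equals s^2 / P exactly, i.e., the variance of the full-sample mean.
    y = np.asarray(y, dtype=float)
    P = len(y)
    mu_hat = y.mean()
    devs = [np.mean(y[list(idx)]) - mu_hat
            for idx in combinations(range(P), P // 2)]
    return np.mean(np.square(devs))

y = [0.3, 1.1, -0.4, 0.8, 2.0, 0.5]
print(shs_all_halves_variance(y), np.var(y, ddof=1) / len(y))
```

In practice, one does not enumerate all half samples but uses a small number of balanced replicates, which preserve this property at much lower cost.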

Double Half Sampling (DHS)
In double half sampling (DHS), half samples of persons and items are created, and the analysis is replicated for these half samples. Let h be a half sample of persons, and k be a half sample of items for this dataset of persons. Let μ̂_{I:P,kh} be the mean estimate for the half sample of both persons and items, and μ̂_{P,h} be the estimate for the half sample of persons.
Define the variance U²_{I:P} of the estimates μ̂_{I:P,kh} around μ̂_{P,h}. Using the same derivation as in (100), one obtains its expectation in (104). Hence, an unbiased estimate of the variance for μ̂ using DHS is obtained by (105), where σ̂²_{P×I}/(PI) = max(U²_{I:P} − U²_P, 0). In practice, one can use balanced half samples based on Hadamard matrices for the most efficient variance estimates that minimize the Monte Carlo error in creating half samples [102]. In the simulation study (see Section 8), only balanced half samples are considered.
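The building blocks of DHS can be sketched as follows. The snippet computes U²_P, U²_I, and the nested quantity U²_{I:P} for given lists of half samples; their combination into the final estimate is given by Equation (105) in the text. The half-sample lists below are hand-picked for illustration rather than derived from a Hadamard matrix, and all names are our own.

```python
import numpy as np

def dhs_components(Y, person_halves, item_halves):
    # U2_P: variance of person-half-sample means around the grand mean;
    # U2_I: same for item half samples;
    # U2_IP: variance of joint (person, item) half-sample means around
    #        the corresponding person-half-sample mean.
    mu = Y.mean()
    mu_P = np.array([Y[h, :].mean() for h in person_halves])
    mu_I = np.array([Y[:, k].mean() for k in item_halves])
    U2_P = np.mean((mu_P - mu) ** 2)
    U2_I = np.mean((mu_I - mu) ** 2)
    U2_IP = np.mean([(Y[np.ix_(h, k)].mean() - mu_P[j]) ** 2
                     for j, h in enumerate(person_halves)
                     for k in item_halves])
    return U2_P, U2_I, U2_IP

rng = np.random.default_rng(0)
Y = rng.normal(size=(8, 6))
person_halves = [[0, 1, 2, 3], [4, 5, 6, 7], [0, 2, 4, 6], [1, 3, 5, 7]]
item_halves = [[0, 1, 2], [3, 4, 5], [0, 2, 4], [1, 3, 5]]
print(dhs_components(Y, person_halves, item_halves))
```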

Double Bootstrap
It might be tempting to consider a double bootstrap resampling approach of persons and items as an alternative to DJK and DHS [104,109–111]. We believe that bootstrapping items should not be recommended because duplicating items introduces additional local dependence in IRT models, which, in turn, induces bias in estimated item parameters and linking parameters. Hence, the variability obtained from a double bootstrap will also include portions of bias.

Simulation Study 1: Comparing the Performance of Different Linking Methods
In Simulation Study 1, we compare the performance of robust and nonrobust linking methods for the RM in the presence and absence of random DIF. This study systematically compares several robust linking methods. In particular, the recently proposed GI method is compared with alternative methods.

Design
Data were simulated according to the RM with random DIF in two groups. In the first group, the ability distribution was simulated as θ ∼ N(0, 1). In the second group, we simulated θ ∼ N(0.5, 1) (i.e., µ_2 = 0.5). Item difficulties b_i were fixed in the simulation and were chosen equidistant in the interval [−2, 2]. Hence, in this study, we assumed fixed item difficulties b_i but simulated random DIF effects e_i according to a mixture distribution G = (1 − π_bias) G_AMI + π_bias G_bias (see Section 2.2). The distribution of DIF effects of reference items was chosen as a centered normal distribution; that is, G_AMI = N(0, τ²_AMI). For the distribution of DIF effects of biased items, G_bias, we chose a two-point distribution for balanced DIF with values −δ_bias and δ_bias and corresponding probabilities π_bias/2. For unbalanced DIF, we simulated a one-point distribution at δ_bias with probability π_bias that favored the first group. In the simulation, we fixed δ_bias to 0.60. The bias for MM linking is expected to be −π_bias δ_bias (see Equation (11)). It vanishes for balanced DIF and is a function of π_bias in the case of unbalanced DIF.
In the simulation, five factors were varied. First, we chose the sample size N of persons as 250, 500, 1000, and 5000. Second, we varied the number of items as I = 20 and I = 40. Third, we chose the proportion of biased items π_bias = 0, 0.1, or 0.3. With π_bias = 0, no biased DIF items were simulated. Fourth, we varied the standard deviation (SD) of DIF effects of reference items, τ_AMI, as 0, 0.1, 0.2, and 0.3. Fifth, we simulated three different distributions of DIF effects if τ_AMI > 0: a normal distribution, a uniform distribution, and a t-distribution with four degrees of freedom. With τ_AMI = 0, reference items do not have DIF effects. The distributions of DIF effects were appropriately scaled in order to match the SD τ_AMI. In total, 1000 datasets were simulated and analyzed in each condition.
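The DIF-generating mixture of this design can be sketched in a few lines. The study itself was implemented in R; the following Python fragment (all names ours) draws reference-item DIF effects from N(0, τ²_AMI) and replaces a random subset of items with balanced or unbalanced bias effects:

```python
import numpy as np

def simulate_dif_effects(I, pi_bias, tau_ami, delta_bias, balanced, rng):
    # DIF effects from the mixture G = (1 - pi_bias) * N(0, tau_ami^2)
    #                                 + pi_bias * G_bias   (Section 2.2)
    e = rng.normal(0.0, tau_ami, size=I)   # reference-item DIF
    biased = rng.random(I) < pi_bias       # which items are biased
    if balanced:
        # two-point distribution at -delta and +delta, probability 1/2 each
        signs = rng.choice([-1.0, 1.0], size=I)
        e[biased] = signs[biased] * delta_bias
    else:
        # one-point distribution at +delta: unbalanced, favors group 1;
        # E[e] = pi_bias * delta_bias, matching the MM bias -pi_bias*delta
        e[biased] = delta_bias
    return e

rng = np.random.default_rng(42)
e = simulate_dif_effects(I=40, pi_bias=0.3, tau_ami=0.2,
                         delta_bias=0.6, balanced=False, rng=rng)
print(e.mean())
```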

Analysis
The RM was estimated separately in the two groups. The linking methods introduced in Section 3 were applied. We chose a cutoff value of 0.4 for DIF detection in the EL method. In ATR linking, we chose trimming proportions of 0.20 and 0.40. In BSQ linking, we chose 0.4 as the threshold parameter K. IA was estimated using the powers p = 1, 0.5, 0.25, and 0.1. GI linking was utilized with powers 1 and 2. HAE linking was specified with powers p = 2, 1, 0.5, 0.25, and 0.1.
The parameter of interest was the estimated group mean difference μ̂_2. For this parameter, the bias and root mean square error (RMSE) were computed. To reduce the dependence of the RMSE on the sample size and the number of items, we computed a relative RMSE, for which the RMSE of a linking method is divided by the RMSE of the linking method with the best performance. Hence, the relative RMSE attains its lowest value of 100 for the best linking method.
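The relative RMSE used here is a simple rescaling; a small illustrative helper (our own naming) makes the convention explicit:

```python
import numpy as np

def relative_rmse(rmse_by_method):
    # Divide each method's RMSE by the best (smallest) RMSE, scaled to 100.
    # The best-performing method therefore attains the minimum value 100.
    rmse = np.asarray(list(rmse_by_method.values()), dtype=float)
    rel = 100.0 * rmse / rmse.min()
    return dict(zip(rmse_by_method, rel))

print(relative_rmse({"MM": 0.08, "IA(1)": 0.09, "GI(1)": 0.14}))
```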
To summarize the contribution of each of the manipulated factors in the simulation, we conducted an analysis of variance (ANOVA). We used a variance decomposition for assessing the importance of each factor in the presence and absence of DIF.
Moreover, we classified linking methods according to whether they showed satisfactory performance in a particular condition. We defined satisfactory performance for the bias if the absolute bias in the estimated mean μ̂_2 was smaller than 0.01. An estimator had satisfactory performance concerning the RMSE if the relative RMSE was smaller than 125. In all analyses, the statistical software R [112] was used. The R package sirt [113] was employed for estimating the RM with marginal maximum likelihood as the estimation method. The linking methods were estimated using R functions written particularly for this paper.

Results
In Table 1, the variance decomposition of the ANOVA summarized across conditions of no DIF is presented. For bias, sample size N, the number of items I, as well as the linking method ("Meth" in Table 1) have an impact. However, as we will see later, the bias is of negligible size in the situation of no DIF. For RMSE, linking methods constitute the major source of differences. In contrast, sample size and the number of items only have small effects on the RMSE. Table 1. Variance proportions of different factors in the simulation study for bias and RMSE in the condition of no differential item functioning (DIF).

In Table 2, the variance decomposition of the ANOVA summarized across conditions of balanced and unbalanced DIF, respectively, is presented. All terms up to three-way interactions were included. For balanced DIF (column "BAL"), RMSE is more important than bias. It is evident that linking methods produced the largest variability in estimates, followed by the SD τ_AMI of DIF effects of reference items, the proportion of biased items π_bias, sample size N, and the type of distribution (column "Dist") of DIF effects. For unbalanced DIF, the bias is primarily affected by π_bias and τ_AMI and their interaction. As for balanced DIF, the linking method substantially explains the variability in the RMSE of group mean differences. Table 3 summarizes the performance of the different linking methods across all conditions with no DIF, balanced DIF, and unbalanced DIF. In the absence of DIF, all linking methods produced unbiased estimates. However, IA with small powers p of 0.25 and 0.1 as well as HAE with p = 0.1 resulted in less precise estimates. Interestingly, GI linking always resulted in substantially increased variability in estimated group mean differences compared to all other linking methods.
In the conditions with balanced DIF (column "BAL"), all linking methods (except for GI in a few conditions) produced unbiased estimates. However, using robust linking methods (i.e., EL, ATR, BSQ, IA, GI, HAE(p) with p ≤ 1) resulted in an efficiency loss in the RMSE compared to nonrobust linking methods (i.e., MM, HAE(2)). Among the robust linking methods, MM linking with the elimination of DIF items (i.e., EL) as well as IA and HAE with p = 1 performed best.
Finally, the situation of unbalanced DIF (column "UNBAL") is the most challenging because linking methods have to handle the presence of biased items. Notably, robust linking methods are preferred over nonrobust linking in such a situation. In particular, MM and HAE(2) always resulted in biased estimates. Among the robust linking methods, BSQ and IA with p = 0.25 and 0.1 resulted in the fewest simulation conditions with biased estimates. Concerning RMSE, EL and ATR with a trimming proportion of 0.4 performed best, followed by IA with p = 1, HAE with p = 1, ATR with a trimming proportion of 0.2, and BSQ linking. Table 4 shows the RMSE for balanced DIF for I = 40 items as a function of sample size (N), proportion of biased items (π_bias), and standard deviation of DIF effects of reference items (τ_AMI). For balanced DIF, all linking methods produced unbiased estimates (not shown in the table). However, there were slight differences between the linking methods with respect to the RMSE. In the situation of partial invariance (i.e., τ_AMI = 0), the efficiency loss of robust linking methods compared to the nonrobust linking methods MM and HAE(2) was acceptable. However, GI resulted in more variable estimates. Moreover, note that GI linking with p = 2 outperformed GI with p = 1 in most conditions. The robust linking methods IA and HAE with very small power values p (e.g., p = 0.25 or 0.1) also caused a non-negligible RMSE increase. The efficiency loss of robust linking methods is much larger if the reference items also possess DIF (i.e., τ_AMI > 0). Only IA with p = 1 can, to some extent, compete with MM and HAE(2) linking. The variance increase in the robust linking methods IA and HAE with very small powers is apparent. It also has to be stated that GI linking produced large RMSE values in balanced DIF conditions.
Table 5 shows the bias and the RMSE for unbalanced DIF for I = 40 items as a function of sample size (N), proportion of biased items (π_bias), and standard deviation of DIF effects of reference items (τ_AMI). All linking methods show biases in at least one condition. Notably, the nonrobust linking methods MM and HAE(2) showed the largest bias. Robust linking methods reduce the bias in all conditions. The most critical condition is π_bias = 0.3 and τ_AMI = 0.2. In this condition, BSQ linking has the least bias, followed by IA with small powers 0.25 and 0.1. It is also interesting to note that, in this condition, biases for a large sample size of N = 1000 are smaller than for N = 250.
With respect to the RMSE, EL, ATR, BSQ, and IA with powers 0.5 and 0.25 can be recommended. It is important to emphasize that GI linking with p = 2 performed well in the case of partial invariance (i.e., τ_AMI = 0) and outperformed the recently proposed GI linking using p = 1. Interestingly, DIF detection with subsequent MM linking (method EL) was also relatively effective as long as the proportion of biased items is not too large. Table 5. Bias and RMSE for unbalanced differential item functioning (DIF) for I = 40 items as a function of sample size (N), proportion of biased items (π_bias), and standard deviation of DIF effects of reference items (τ_AMI).

Simulation Study 2: Performance of Resampling Methods for Computing Standard Errors and Linking Errors
In Simulation Study 2, we investigate the performance of resampling methods for estimating the variability of group mean differences. DJK and DHS have not yet been systematically studied for linking methods in the literature. In particular, there is a lack of research on resampling methods for robust linking methods.

Design
The data-generating model closely follows that of Simulation Study 1 (see Section 7.1). Only a selected number of conditions was simulated because resampling methods are computationally demanding. In contrast to Simulation Study 1, we set δ_bias = 0.3. Only balanced DIF was simulated because the assessment of variability (and not bias) was the focus of this simulation. The proportion of biased items was chosen as π_bias = 0 or π_bias = 0.3. The SD of DIF effects for reference items, τ_AMI, was set to 0.3. We considered sample sizes N = 500 and N = 2000 and fixed the number of items to I = 40. In total, 2000 replications were conducted in each condition of the simulation study.

Analysis
To further reduce computation time, we only chose a selected number of linking methods that provided unbiased estimates in Simulation Study 1; that is, MM, ATR, IA, and HAE. We assessed the variability in estimated group mean differences with the resampling methods SJK (Equation (80)), DJK (Equation (92)), SHS (Equation (101)), and DHS (Equation (105)). We applied the resampling methods with 20 replication zones (containing 500/20 = 25 or 2000/20 = 100 persons and 40/20 = 2 items in each zone). Approximately balanced half sampling was used by constructing zones from the upper part of a Hadamard matrix with a minimum dimension larger than 20. We computed confidence intervals based on the standard errors ŝ estimated by the respective methods as CI(μ̂_2) = μ̂_2 ± 1.96 · ŝ. The proportion of replications in which the true difference µ_2 is contained in CI(μ̂_2) is defined as the coverage rate. Coverage rates are classified as satisfactory if they range within the interval [92.5, 97.5] for a condition in a simulation. As in Simulation Study 1, we used R [112] and the R package sirt [113].
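The coverage criterion can be written down in a few lines; the following illustrative helper (names are ours) computes the coverage rate from replication-wise estimates and standard errors:

```python
import numpy as np

def coverage_rate(mu_hats, ses, mu_true, z=1.96):
    # Proportion of replications whose CI  mu_hat +/- z * se  covers mu_true
    mu_hats, ses = np.asarray(mu_hats), np.asarray(ses)
    covered = (mu_hats - z * ses <= mu_true) & (mu_true <= mu_hats + z * ses)
    return covered.mean()

rate = coverage_rate([0.48, 0.55, 0.40, 0.58], [0.05, 0.04, 0.04, 0.05],
                     mu_true=0.5)
print(rate)  # 3 of the 4 intervals cover 0.5, so the rate is 0.75
```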

Results
In Table 6, coverage rates for the resampling methods are displayed. By construction, single resampling methods (SJK and SHS) result in slightly wider confidence intervals than double resampling methods (DJK and DHS) and, in turn, produce higher coverage rates. It can be seen that SJK and DJK failed to produce acceptable coverage rates. In particular, jackknife error estimates performed worse for robust linking methods. This is in line with results in robust statistics showing that the jackknife does not work for nondifferentiable statistics. However, SJK can be used for the nonrobust linking method HAE(2). In contrast, half sampling methods outperformed the jackknife. As expected, SHS produced slight overcoverage, but DHS produced acceptable coverage in all conditions. Particularly noteworthy is the fact that DHS also performed successfully for robust linking methods. Overall, these findings indicate that half sampling methods should be preferred over jackknife resampling. Table 6. Coverage rates for linking methods for balanced differential item functioning for I = 40 items as a function of sample size (N) and the proportion of biased items (π_bias). Note. SJK = single jackknife (Equation (80)); DJK = double jackknife (Equation (92)); SHS = single half sampling (Equation (101)); DHS = double half sampling (Equation (105)); MM = mean-mean linking; ATR = asymmetrically trimmed mean linking; IA = invariance alignment; HAE = Haebara linking. Coverage rates within the interval [92.5, 97.5] are printed in bold.

Discussion
In this article, we investigated the performance of robust and nonrobust linking methods as well as the assessment of standard error and linking error estimates of group mean differences. We assumed random DIF with a mixture distribution model. Items are implicitly classified into a set of reference items (that are valid for group comparisons) and biased items that potentially bias group mean differences. We studied the nonrobust linking methods mean-mean linking (MM) and Haebara linking (HAE) with p = 2, as well as the robust linking methods based on the asymmetrically trimmed mean (ATR), elimination of DIF items with subsequent mean-mean linking (EL), bisquare linking (BSQ), invariance alignment (IA) with p ≤ 1, Haebara linking (HAE) with p ≤ 1 and Gini linking (GI).
We found that robust linking methods can be very effective in reducing biases in the presence of biased items in unbalanced DIF situations. However, in the presence of DIF on reference items (i.e., in the absence of partial invariance), robust linking methods can result in reduced efficiency of estimates compared to nonrobust methods such as mean-mean linking or Haebara linking, in particular in the situation of balanced DIF. Our study also compared the recently proposed Gini linking with alternative linking methods. Surprisingly, GI performed worse than its competitors and only showed acceptable performance in a modified version using the power p = 2. In our view, it is hard to recommend a particular linking estimator in the unbalanced DIF situation. It is only evident that mean-mean linking and Haebara linking with p = 2 are prone to bias and should not be used. Moreover, the recently proposed Gini linking produced much more variable estimates than competing linking estimators. The usual practice in psychometrics (linking method EL), which eliminates DIF items in the first step of the analysis and computes group differences based on the DIF-free items in the second step, provides comparable results to robust linking methods (see also [47,114]). Note that we used the median as the preliminary location estimate in the first step of the EL method, which differs from the practice that employs the equal mean difficulty assumption (i.e., uses the mean instead of the median; see [52]).
We also studied the variability of group mean difference estimates due to random DIF. The randomness of DIF introduces an additional source of error (i.e., the linking error) in addition to the standard error associated with the sampling of persons. We analytically derived the distribution of the group difference through M-estimation theory. These results are relevant for a (very) large number of items. Because we used a relatively small number of I = 40 items in the simulation, and large item pools are often not available in applications, we investigated (single and double) jackknife and (single and double) half sampling resampling methods for persons and items for assessing the variability in estimates of the linking methods. We found that the proposed double half sampling outperformed jackknife-based error estimates. In contrast to the jackknife, half sampling can also satisfactorily be applied to nondifferentiable robust linking methods. These findings indicate that half sampling methods could find their way into the assessment of linking errors in empirical applications.
In this article, we focused on the estimation of group differences. In the investigation of DIF in applied research, how to choose the correct anchor is always crucial [41,50,115–118]. The studied robust linking estimators can be used to transform estimated item difficulties onto the same scale. Differences in transformed item difficulties can be investigated for DIF effects. Resampling procedures (single jackknife or single half sampling) can be employed for assessing the statistical significance of DIF effects.
As an alternative to separate scaling with subsequent robust or nonrobust linking, concurrent scaling assuming invariant item parameters can be utilized. Although such a one-step approach might be preferred from the practitioner's point of view, the presence of DIF effects likely introduces some bias in estimated group differences [5,119]. Surprisingly, the bias is even present for balanced DIF [119]. Robust linking methods have the advantage that a few outlying DIF effects are automatically removed from group comparisons [119]. Moreover, concurrent calibration might have computational disadvantages [44,120]. As a further alternative, concurrent calibration assuming partial invariance can be pursued [47,121,122]. In this approach, DIF for items is investigated in a first step, and items that showed DIF receive group-specific item parameters in the concurrent calibration approach, while invariance is assumed for the remaining items.
Furthermore, the precision of linking estimates can be improved by including further person covariates in the analysis [123,124]. This could be particularly true if there also exist DIF effects for person covariates. There is a lack of studies that include person covariates in robust linking methods.
Finally, we assumed that the Rasch model was correctly specified. This assumption might be unrealistic in practice, and much more complex item response functions could have generated the item responses [23,29]. It would be interesting to study the performance of the different linking methods and the assessment of standard errors and linking errors for misspecified models. We would like to emphasize that M-estimation theory and resampling techniques also provide valid inference in the case of misspecified models. It can always be debated whether estimates from a misspecified Rasch model are practically relevant or should be interpreted. We tend to argue that parameter estimates of misspecified models summarize a population distribution, and model fitting is not always (or maybe should not be) targeted at estimating the model that has generated the data. In this sense, we think that approaches that include model error as an additional component in statistical inference [125] might be beneficial.
Funding: This research received no external funding.