Article

Robust and Nonrobust Linking of Two Groups for the Rasch Model with Balanced and Unbalanced Random DIF: A Comparative Simulation Study and the Simultaneous Assessment of Standard Errors and Linking Errors with Resampling Techniques

by
Alexander Robitzsch
1,2
1
IPN—Leibniz Institute for Science and Mathematics Education, Olshausenstraße 62, 24118 Kiel, Germany
2
Centre for International Student Assessment (ZIB), Olshausenstraße 62, 24118 Kiel, Germany
Symmetry 2021, 13(11), 2198; https://doi.org/10.3390/sym13112198
Submission received: 14 October 2021 / Revised: 9 November 2021 / Accepted: 13 November 2021 / Published: 18 November 2021
(This article belongs to the Special Issue Symmetry and Asymmetry in Multivariate Statistics and Data Science)

Abstract
In this article, the Rasch model is used for assessing a mean difference between two groups for a test of dichotomous items. It is assumed that random differential item functioning (DIF) exists that can bias group differences. The case of balanced DIF is distinguished from the case of unbalanced DIF. In balanced DIF, DIF effects on average cancel out. In contrast, in unbalanced DIF, the expected value of DIF effects can differ from zero and on average favor a particular group. Robust linking methods (e.g., invariance alignment) aim at determining group mean differences that are robust to the presence of DIF. In contrast, group differences obtained from nonrobust linking methods (e.g., Haebara linking) can be affected by the presence of a few DIF effects. Alternative robust and nonrobust linking methods are compared in a simulation study under various simulation conditions. It turned out that robust linking methods are preferred over nonrobust alternatives in the case of unbalanced DIF effects. Moreover, the theory of M-estimation, as an important approach to robust statistical estimation suitable for data with asymmetric errors, is used to study the asymptotic behavior of linking estimators if the number of items tends to infinity. These results give insights into the asymptotic bias and the estimation of linking errors that represent the variability in estimates due to selecting items in a test. Moreover, M-estimation is also used in an analytical treatment to assess standard errors and linking errors simultaneously. Finally, double jackknife and double half sampling methods are introduced and evaluated in a simulation study to assess standard errors and linking errors simultaneously. Half sampling outperformed jackknife estimators for the assessment of variability of estimates from robust linking methods.

1. Introduction

The analysis of psychological or educational tests is an important field in the social sciences. The test items (i.e., tasks presented in these tests) are often analyzed using item response theory (IRT, [1,2]) models. In this article, the Rasch model (RM; [3,4]) is used for comparing two groups on test items. For example, groups could be demographic groups, countries, studies, or time points. The group comparisons are carried out using linking methods [5,6]. An important impediment in applying linking methods is that the items could behave differently in the two groups (i.e., differential item functioning, DIF; [7]); that is, it cannot be expected that the Rasch model holds in the two groups with item parameters that are independent of a group membership.
In this article, we study the performance of linking methods in the presence of DIF that can bias group differences. In contrast to habitually used (i.e., nonrobust) linking methods, robust linking methods aim at deriving estimates of group differences that are robust to the presence of DIF. Importantly, DIF effects can be considered as asymmetric error distributions, and robust statistical methods for location measures are applied for determining a group difference.
This article systematically compares alternative linking methods in the RM. Furthermore, linking errors that quantify the uncertainty in group differences due to the randomness associated with DIF are analytically treated using M-estimation theory and computationally assessed using single and double jackknife and (balanced) half sampling, respectively.
The paper is structured as follows. In Section 2, the RM with random DIF is introduced. In Section 3, several nonrobust and robust linking methods are discussed. In Section 4, M-estimation theory is applied to the study of linking methods for the statistical inference of linking errors. In Section 5, M-estimation theory is applied for the simultaneous assessment of standard errors and linking errors. The resampling techniques double jackknife and double half sampling are introduced in Section 6 for empirically assessing standard errors and linking errors. In Section 7, we present the results of a simulation study in which different robust and nonrobust linking methods are systematically compared across various data-generating models for DIF effects. Section 8 presents a simulation study that investigates the empirical performance of the proposed resampling estimators from Section 6. Finally, the article concludes with a discussion in Section 9.

2. Differential Item Functioning in the Rasch Model

2.1. Rasch Model

The RM [3,4,8,9,10,11,12,13] is a statistical model for dichotomous item responses $X_i$ for items $i = 1, \ldots, I$. A latent variable $\theta$ (the so-called ability) accounts for the dependence among item responses. The item response function (IRF) $P_i$ for item $i$ in the RM is defined as
$$P(X_i = 1 \mid \theta) = P_i(\theta; b_i) = \Psi(\theta - b_i), \tag{1}$$
where $b_i$ is the item difficulty, $\theta$ is the latent ability, and $\Psi(x) = [1 + \exp(-x)]^{-1}$ denotes the logistic link function.
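The IRF of Equation (1) can be sketched in a few lines of code (a minimal illustration; the function name is ours and not taken from any particular IRT package):

```python
import math

def rasch_irf(theta, b):
    """Rasch item response function of Equation (1):
    P(X_i = 1 | theta) = Psi(theta - b_i) with the logistic link
    Psi(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# An examinee whose ability equals the item difficulty solves the item
# with probability 0.5; the probability increases with ability.
print(rasch_irf(0.0, 0.0))   # 0.5
print(rasch_irf(1.0, 0.0))   # about 0.73
```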
The RM in a random sampling perspective [14,15] also relies on a local independence assumption and posits a parametric distribution function $F_\alpha$ for the latent ability $\theta$:
$$P(\boldsymbol{X} = \boldsymbol{x}; \boldsymbol{b}, \boldsymbol{\alpha}) = P(\boldsymbol{x}; \boldsymbol{b}, \boldsymbol{\alpha}) = \int \prod_{i=1}^{I} P_i(\theta; b_i)^{x_i} \left( 1 - P_i(\theta; b_i) \right)^{1 - x_i} \, \mathrm{d}F_\alpha(\theta), \tag{2}$$
where $\boldsymbol{X} = (X_1, \ldots, X_I)$, $\boldsymbol{x} = (x_1, \ldots, x_I)$, $\boldsymbol{b} = (b_1, \ldots, b_I)$, and $\boldsymbol{\alpha}$ is a finite-dimensional parameter. In many applications, the distribution $F_\alpha$ is assumed to be normal with a mean of zero. In this case, the parameter $\boldsymbol{\alpha}$ only contains the standard deviation $\sigma$, which must be estimated in addition to the item parameters $\boldsymbol{b}$. It has been shown empirically that distributional misspecifications of $F_\alpha$ might not strongly bias estimates of the item difficulties $b_i$ if many items are available [16,17,18]. However, in (very) short tests and with a strong deviation from normality, the bias in item parameter estimates can be non-negligible.
Note that the parameters of the RM are identified only up to a constant. Hence, either the mean of the abilities or the mean of the item difficulties has to be fixed to zero for identification [19,20]. If a normal distribution for $\theta$ is assumed, its mean is set to zero, and only the standard deviation is estimated. The RM is typically estimated with marginal maximum likelihood estimation [21].
In practice, it is unlikely that the distribution of item responses can be adequately represented by the RM. In real datasets, more complex IRFs might be necessary, such as the family of logistic IRT models with two, three, or four parameters per item [22], or even more flexible nonparametric monotone IRFs [23,24,25,26]. In large-scale educational datasets, items can have different discriminations [27], and guessing and slipping behavior have been reported [28,29]. However, fitting a misspecified RM to data might be justified if the latent ability $\theta$ should be defined such that all items contribute equally to $\theta$ [30,31]. By fitting more complex IRT models, the meaning of $\theta$ might change, which raises validity concerns. Note that the RM has been used in the popular Programme for International Student Assessment (PISA) study in the past [32] and in many other national [33,34] and international large-scale assessment studies [35].
The item parameters $\hat{\boldsymbol{b}}$ estimated with marginal maximum likelihood [21] can be interpreted as pseudo-true parameters (see [36]) that minimize the Kullback–Leibler divergence $KL$ [37,38,39] between the true distribution $Q$ and its parametric approximation $P(\boldsymbol{b}, \sigma)$:
$$(\hat{\boldsymbol{b}}, \hat{\sigma}) = \arg\min_{(\boldsymbol{b}, \sigma)} KL\left( Q \,\|\, P(\boldsymbol{b}, \sigma) \right) = \arg\max_{(\boldsymbol{b}, \sigma)} \sum_{s=1}^{S} Q(\boldsymbol{x}_s) \log P(\boldsymbol{x}_s; \boldsymbol{b}, \sigma), \tag{3}$$
where the sum runs over the $S = 2^I$ different item response patterns for $\boldsymbol{X}$. The distribution $Q$ is the true data-generating distribution $Q(\boldsymbol{x}) = P(\boldsymbol{X} = \boldsymbol{x})$, a multinomial distribution on the $S$ item response patterns. The RM is thus the best approximation to $Q$ with respect to the Kullback–Leibler divergence. By using different loss functions (i.e., estimation methods, for example, unweighted least squares estimation [40]), different pseudo-true parameters of the RM are obtained.

2.2. Differential Item Functioning

Now assume that $I$ items are administered in two groups $g = 1, 2$, and that the estimation of group differences is of interest. Abilities in the first group follow a normal distribution with zero mean, that is, $\theta \sim N(0, \sigma_1^2)$. In the second group, we also assume a normal distribution, $\theta \sim N(\mu_2, \sigma_2^2)$. The parameter $\mu_2$ can be interpreted as the average difference between the two groups.
In practical applications, it is unlikely that item difficulties $b_i$ are equal across groups (i.e., that they are measurement invariant). In this case, DIF occurs, and there exist group-specific item difficulties $b_{ig}$ for groups $g = 1, 2$. Item-specific DIF effects $e_i$ are defined as $e_i = b_{i2} - b_{i1}$. In the absence of DIF, all DIF effects $e_i$ equal zero. Identification constraints on DIF effects must be imposed to disentangle group mean differences from average DIF effects [41,42,43].
In this paper, we distinguish the case of random items from the case of fixed items with random DIF effects. In the first case of random items, it is assumed that the bivariate vector of item difficulties $(b_{i1}, b_{i2})$ follows a bivariate distribution $G$. In the second case, it is assumed that $b_{i2} = b_{i1} + e_i$ with random effects $e_i$, while the item difficulties $b_{i1}$ are regarded as fixed. This means that items are fixed, but DIF effects represent a random variable; the DIF effects $e_i$ then follow a univariate distribution $G$.
To identify the group difference $\mu_2$, identification constraints on $G$ have to be imposed in both cases. The main idea is that the set of items $\mathcal{I} = \{1, \ldots, I\}$ can be partitioned into two disjoint sets $\mathcal{I}_{\mathrm{AMI}}$ and $\mathcal{I}_{\mathrm{bias}}$. The items in $\mathcal{I}_{\mathrm{AMI}}$ (also denoted as reference items [44]) are deemed valid for obtaining unbiased group differences. The set $\mathcal{I}_{\mathrm{AMI}}$ refers to approximately measurement invariant (AMI [45]) items. These items are allowed to have DIF effects that cancel out on average. A special case is a set of anchor items in which all items have zero DIF effects [46]. Items in the set $\mathcal{I}_{\mathrm{bias}}$ (also denoted as biased items [44]) have the potential to bias group differences (see [47]). The partitioning is modeled with a mixture distribution for $G$ [46,48,49]:
$$G = (1 - \pi_{\mathrm{bias}}) G_{\mathrm{AMI}} + \pi_{\mathrm{bias}} G_{\mathrm{bias}}, \tag{4}$$
where $\pi_{\mathrm{bias}}$ is the proportion of items in the set $\mathcal{I}_{\mathrm{bias}}$. In the fixed items case, it is assumed that the expected value of the DIF effects of items from $\mathcal{I}_{\mathrm{AMI}}$ is zero, while it can differ from zero for items from $\mathcal{I}_{\mathrm{bias}}$. More formally, it holds that
$$\delta_{\mathrm{AMI}} = \int e \, \mathrm{d}G_{\mathrm{AMI}}(e) = 0 \quad \text{and} \quad \delta_{\mathrm{bias}} = \int e \, \mathrm{d}G_{\mathrm{bias}}(e). \tag{5}$$
In the random items case, denote by $G^\ast$ the univariate distribution of the DIF effects $e_i = b_{i2} - b_{i1}$. Based on the mixture representation of the bivariate distribution $G$, one can decompose the distribution of DIF effects as $G^\ast = (1 - \pi_{\mathrm{bias}}) G^\ast_{\mathrm{AMI}} + \pi_{\mathrm{bias}} G^\ast_{\mathrm{bias}}$. The condition for DIF effects in the random items case is the same as in Equation (5):
$$\delta_{\mathrm{AMI}} = \int e \, \mathrm{d}G^\ast_{\mathrm{AMI}}(e) = 0 \quad \text{and} \quad \delta_{\mathrm{bias}} = \int e \, \mathrm{d}G^\ast_{\mathrm{bias}}(e). \tag{6}$$
The test is said to have balanced DIF if $\delta_{\mathrm{bias}} = 0$ and unbalanced DIF if $\delta_{\mathrm{bias}} \neq 0$ (see [47,50,51]). It is important to emphasize that the definition of the mixture distribution allows the identification of group differences. The total DIF impact $\delta$ on the test containing all items can be calculated as (for notational simplicity, only in the fixed items case)
$$\delta = \int e \, \mathrm{d}G(e) = \pi_{\mathrm{bias}} \delta_{\mathrm{bias}}. \tag{7}$$
With a low proportion $\pi_{\mathrm{bias}}$ of biased items, the presence of DIF effects is not expected to have a large impact on estimated group differences.
In this article, we only consider random DIF effects. In most of the literature, however, DIF effects are considered fixed (e.g., [52]). In that case, the condition for balanced DIF replaces the expected value with the mean of the fixed item parameters [44]. With fixed DIF effects, no additional uncertainty is introduced in the estimation of group differences because the item parameters are held fixed in repeated sampling. In contrast, with random DIF effects, the group mean difference is affected by the sampled DIF effects even for an infinite sample size of persons. This kind of uncertainty is explicitly addressed in this article.
In many applications, the estimation of group differences involves a preliminary step in which DIF is detected by applying statistical techniques [7,52,53,54]. DIF detection statistics aim to classify items into a set of items that possess DIF, which should ideally coincide with the set $\mathcal{I}_{\mathrm{bias}}$. However, DIF detection techniques rely on prior knowledge about DIF-free items or a known group difference [43,55]. Hence, the decision of whether an item has DIF requires additional assumptions that cannot be statistically tested (see also [31,56,57]). In this paper, we do not thoroughly investigate DIF detection techniques but rather study the performance of linking methods for estimating group differences. We distinguish robust from nonrobust linking methods. Robust linking methods adequately handle the presence of biased items (i.e., items in the set $\mathcal{I}_{\mathrm{bias}}$) that lead to unbalanced DIF, while nonrobust linking approaches yield biased estimates of group differences in unbalanced DIF situations.
If the RM does not hold, DIF between groups means that IRFs can differ across the two groups. If the misspecified RM is fitted to data, DIF in item difficulties can be interpreted as a summary of DIF between IRFs. It is acknowledged that more complex DIF, such as nonuniform DIF in item discriminations [7] or DIF in guessing parameters, might occur. However, if these model aspects are intentionally ignored by fitting the RM, DIF effects in other aspects of the IRFs only indirectly enter the DIF assessment through item difficulties. Moreover, DIF effects in item difficulties are more frequently found in empirical applications than in item discriminations [58,59,60]. In the rest of the paper, statistical inference regarding the population of persons and the population of items is discussed that is even valid if the fitted IRT model is misspecified.

2.3. Identified Item Parameters in Group-Specific Scaling Models

Linking methods rely on group-specific item parameters estimated in separate scaling models in each group. By doing so, there is no misspecification in the IRT model due to noninvariance.
In the first group, the ability variable in the data-generating model follows $\theta \sim N(0, \sigma_1^2)$, that is, $\mu_1 = 0$. In a separate estimation for the first group with an infinite sample size of persons, the estimated item difficulties $\hat{b}_{i1}$ equal the data-generating parameters $b_{i1}$ (i.e., $\hat{b}_{i1} = b_{i1}$). In the second group, the distribution of the ability variable is $\theta \sim N(\mu_2, \sigma_2^2)$. In the estimation, the mean of the ability variable is fixed to zero for reasons of identification. Hence, the estimated item difficulties also absorb the group difference $\mu_2$. We obtain
$$\theta - b_{i2} = \sigma_2 \theta^\ast + \mu_2 - b_{i2} = \sigma_2 \left( \theta^\ast - \sigma_2^{-1} (b_{i2} - \mu_2) \right), \tag{8}$$
where the standardized ability $\theta^\ast$ follows a standard normal distribution (i.e., $\theta^\ast \sim N(0, 1)$). Consequently, it follows that $\hat{b}_{i2} = b_{i2} - \mu_2$.

3. Linking Methods

In this section, we review several linking methods [5,6,61,62,63] that allow the estimation of the group difference $\mu_2$. We assume that estimated identified item parameters $\hat{b}_{i1}$ and $\hat{b}_{i2}$ ($i = 1, \ldots, I$) are available (see Section 2.3). We define the differences $\nu_i = \hat{b}_{i1} - \hat{b}_{i2}$.

3.1. Mean-Mean Linking (MM)

Mean-mean linking (MM; [5,64]) is one of the most popular linking methods. The group difference $\mu_2$ is estimated by
$$\hat{\mu}_2 = \frac{1}{I} \sum_{i=1}^{I} \hat{b}_{i1} - \frac{1}{I} \sum_{i=1}^{I} \hat{b}_{i2} = \frac{1}{I} \sum_{i=1}^{I} \nu_i. \tag{9}$$
Note that $\hat{\mu}_2$ is the least-squares estimate based on the item difficulty differences $\hat{b}_{i1} - \hat{b}_{i2}$:
$$\hat{\mu}_2 = \arg\min_{\mu_2} \frac{1}{I} \sum_{i=1}^{I} (\mu_2 - \nu_i)^2. \tag{10}$$
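As a toy illustration (the $\nu_i$ values below are invented for this example), the MM estimate is just the mean of the differences, which a single item with a large DIF effect can pull away from the bulk of the items:

```python
def mm_linking(nu):
    """Mean-mean linking: the least-squares estimate of the group
    difference mu_2 is the mean of the differences
    nu_i = b_hat_i1 - b_hat_i2."""
    return sum(nu) / len(nu)

nu = [0.1, -0.2, 0.05, 0.0, 1.5]   # hypothetical; last item has large DIF
print(mm_linking(nu))               # approx 0.29, pulled toward the outlier
```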
We now derive the bias of MM, assuming fixed items with random DIF effects $e_i$ that follow a distribution $G$. The distribution is given by the mixture representation $G = (1 - \pi_{\mathrm{bias}}) G_{\mathrm{AMI}} + \pi_{\mathrm{bias}} G_{\mathrm{bias}}$ (see Equation (4)). It holds that $\int e \, \mathrm{d}G_{\mathrm{AMI}}(e) = 0$ and $\delta_{\mathrm{bias}} = \int e \, \mathrm{d}G_{\mathrm{bias}}(e)$ (see Equation (5)). Since $\nu_i = \mu_2 - e_i$, we obtain for the bias under MM
$$\mathrm{Bias}(\hat{\mu}_2) = -\frac{1}{I} \sum_{i=1}^{I} E(e_i) = -(1 - \pi_{\mathrm{bias}}) \int e \, \mathrm{d}G_{\mathrm{AMI}}(e) - \pi_{\mathrm{bias}} \int e \, \mathrm{d}G_{\mathrm{bias}}(e) = -\pi_{\mathrm{bias}} \delta_{\mathrm{bias}}. \tag{11}$$
The magnitude of the bias coincides with the DIF impact on the test (see Equation (7)). The bias vanishes in the case of balanced DIF (i.e., $\delta_{\mathrm{bias}} = 0$) or in the absence of biased items ($\pi_{\mathrm{bias}} = 0$). MM can be considered a nonrobust linking method because biased items can affect the estimated group difference. As an alternative to such a nonrobust approach, it may be recommended to use linking methods based on robust statistical methodology [65] designed for resistant estimation under contamination by outlying values. The following linking methods offer some robustness against the presence of biased DIF items.

3.2. Asymmetrically Trimmed Mean (ATR)

An intuitive idea borrowed from robust statistics [66,67,68] is to treat biased DIF items as outliers [69,70] when estimating the location measure given by the group difference. Hence, robust alternatives to the mean (i.e., to MM linking) can be established.
The asymmetrically trimmed mean (ATR) removes items with large differences from the estimation. Given a trimming proportion $\alpha$, the ATR linking estimate $\hat{\mu}_2$ is defined as the average of the $\nu_i$ values whose absolute deviations $|\nu_i - \mathrm{mdn}(\boldsymbol{\nu})|$ lie below the $(1-\alpha)$-quantile of these deviations. The main idea is that items with large deviations can be regarded as biased and should be removed from group comparisons. The ATR estimate is formally defined as
$$\hat{\mu}_2 = \frac{\sum_{i=1}^{I} \nu_i \, \mathbf{1}_{\{ |\nu_i - \mathrm{mdn}(\boldsymbol{\nu})| \le q_{1-\alpha}(|\boldsymbol{\nu} - \mathrm{mdn}(\boldsymbol{\nu})|) \}}}{\sum_{i=1}^{I} \mathbf{1}_{\{ |\nu_i - \mathrm{mdn}(\boldsymbol{\nu})| \le q_{1-\alpha}(|\boldsymbol{\nu} - \mathrm{mdn}(\boldsymbol{\nu})|) \}}}, \tag{12}$$
where $q_{1-\alpha}$ denotes the $(1-\alpha)$-quantile, $\mathbf{1}$ the indicator function, and $\mathrm{mdn}$ the median. The median $\mathrm{mdn}(\boldsymbol{\nu})$ is used instead of the mean $\bar{\nu}$ because the median is typically more robust to outliers (i.e., biased DIF items). ATR linking has the potential to handle unbalanced DIF properly because it explicitly allows for biased items with unidirectional signs. The ATR estimator is related to the least trimmed absolute value estimator [71,72], which is especially suitable for asymmetric contamination of the data. A similar idea is used in robust structural equation modeling for defining case weights that downweight outlying cases (see [73,74]). As an alternative to the ATR estimator, the least weighted squares estimator may be applied as a location estimator with high robustness as well as high efficiency [75].
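The trimming rule can be sketched as follows (the empirical quantile convention and the $\nu_i$ values are assumptions of this illustration, not taken from the article):

```python
import statistics

def atr_linking(nu, alpha=0.2):
    """Asymmetrically trimmed mean: average the nu_i whose absolute
    deviation from the median is at or below the (1 - alpha)-quantile
    of those deviations (a sketch with a simple empirical quantile)."""
    med = statistics.median(nu)
    dev = sorted(abs(v - med) for v in nu)
    q = dev[int((1.0 - alpha) * (len(dev) - 1))]  # empirical quantile
    keep = [v for v in nu if abs(v - med) <= q]
    return sum(keep) / len(keep)

nu = [0.1, -0.2, 0.05, 0.0, 1.5]   # hypothetical differences
print(atr_linking(nu, alpha=0.2))  # the outlying item nu = 1.5 is trimmed
```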

3.3. Elimination of DIF Items with Subsequent Mean-Mean Linking (EL)

Another popular approach is to remove DIF items from the group comparison. The identification of DIF items in a first step requires the definition of an appropriate statistic. In the simulation study, we assume that a preliminary group difference is estimated by the median $\mathrm{mdn}(\boldsymbol{\nu})$ of all differences $\nu_i$. An item is flagged as having DIF if $|\nu_i - \mathrm{mdn}(\boldsymbol{\nu})|$ exceeds a prespecified cutoff $K$. In many studies, the mean instead of the median is used, and the corresponding condition is referred to as the equal-mean anchor [52]. However, the median might provide a more robust location estimate in the presence of DIF effects. Items with detected DIF are removed before MM linking is applied to the remaining items [52]. More formally, the EL estimate can be written as
$$\hat{\mu}_2 = \frac{\sum_{i=1}^{I} \nu_i \, \mathbf{1}_{\{ |\nu_i - \mathrm{mdn}(\boldsymbol{\nu})| \le K \}}}{\sum_{i=1}^{I} \mathbf{1}_{\{ |\nu_i - \mathrm{mdn}(\boldsymbol{\nu})| \le K \}}}. \tag{13}$$
The EL method, which eliminates DIF items, can be interpreted as another variant of a trimmed mean.
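Equation (13) amounts to a two-line rule (the $\nu_i$ values and the cutoff below are hypothetical):

```python
import statistics

def el_linking(nu, K):
    """EL linking: flag item i as DIF if |nu_i - median(nu)| > K,
    then apply mean-mean linking to the remaining items."""
    med = statistics.median(nu)
    keep = [v for v in nu if abs(v - med) <= K]
    return sum(keep) / len(keep)

nu = [0.1, -0.2, 0.05, 0.0, 1.5]   # hypothetical differences
print(el_linking(nu, K=0.5))       # the item with nu = 1.5 is eliminated
```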

3.4. Bisquare Linking (BSQ)

Another robust estimate of the location parameter is based on the bisquare loss function $\rho$ (see [76]), defined by
$$\rho(x; K) = \begin{cases} \dfrac{K^2}{6} \left[ 1 - \left( 1 - \dfrac{x^2}{K^2} \right)^{3} \right] & \text{if } |x| \le K, \\[1ex] \dfrac{K^2}{6} & \text{otherwise}, \end{cases} \tag{14}$$
where $K$ is a prespecified threshold value. The group difference $\mu_2$ is estimated by
$$\hat{\mu}_2 = \arg\min_{\mu_2} \frac{1}{I} \sum_{i=1}^{I} \rho(\mu_2 - \nu_i; K). \tag{15}$$
Note that the bisquare loss function is also known as the Tukey biweight function [77].
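Because the bisquare loss is bounded, items far from the bulk contribute only a constant $K^2/6$ and cannot drag the minimizer toward them. A crude sketch (the grid search, $K$, and the $\nu_i$ values are choices of this illustration; an actual implementation would use a numerical optimizer):

```python
def bisquare_loss(x, K):
    """Tukey bisquare (biweight) loss of Equation (14)."""
    if abs(x) <= K:
        return (K ** 2 / 6.0) * (1.0 - (1.0 - (x / K) ** 2) ** 3)
    return K ** 2 / 6.0

def bsq_linking(nu, K=1.0, grid=2001):
    """Bisquare linking: minimize the average bisquare loss of
    mu_2 - nu_i over a grid of candidate values."""
    lo, hi = min(nu), max(nu)
    best_mu, best_val = lo, float("inf")
    for j in range(grid):
        mu = lo + (hi - lo) * j / (grid - 1)
        val = sum(bisquare_loss(mu - v, K) for v in nu) / len(nu)
        if val < best_val:
            best_mu, best_val = mu, val
    return best_mu

nu = [0.1, -0.2, 0.05, 0.0, 1.5]   # hypothetical differences
print(bsq_linking(nu, K=1.0))      # stays near the bulk of the items
```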

3.5. Invariance Alignment (IA)

The bisquare loss function $\rho$ in Equation (15) can be replaced by any robust (or nonrobust) loss function. In invariance alignment (IA; [78,79,80,81]), the $L_p$ power loss function $\rho(x) = |x|^p$ ($p \in (0, 2]$) is employed. The group mean estimate is given by
$$\hat{\mu}_2 = \arg\min_{\mu_2} \frac{1}{I} \sum_{i=1}^{I} |\mu_2 - \nu_i|^p. \tag{16}$$
By choosing $p \le 1$, the influence of noninvariant items is minimized. Hence, the group mean difference relies on items with small DIF effects while items with large DIF effects are effectively removed from the comparison [78]. IA was originally proposed with the power $p = 0.5$ [78]. IA with $p = 2$ is equivalent to MM. Note that IA with $p < 1$ is particularly suited to the situation of partial invariance, in which $G_{\mathrm{AMI}}$ concentrates at zero (i.e., all DIF effects in $\mathcal{I}_{\mathrm{AMI}}$ are zero or close to zero), and fails for symmetrically distributed DIF effects [82,83].
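A grid-search sketch of Equation (16) (the grid, the smoothing constant, and the $\nu_i$ values are assumptions of this illustration):

```python
def ia_linking(nu, p=0.5, eps=1e-6, grid=4001):
    """Invariance alignment with the L_p power loss, sketched via grid
    search.  The small eps implements the smooth approximation
    sqrt(x^2 + eps) of |x| used for nondifferentiable loss functions."""
    lo, hi = min(nu), max(nu)
    best_mu, best_val = lo, float("inf")
    for j in range(grid):
        mu = lo + (hi - lo) * j / (grid - 1)
        val = sum(((mu - v) ** 2 + eps) ** (p / 2.0) for v in nu) / len(nu)
        if val < best_val:
            best_mu, best_val = mu, val
    return best_mu

nu = [0.1, -0.2, 0.05, 0.0, 1.5]   # hypothetical differences
print(ia_linking(nu, p=0.5))       # concentrates on the cluster of items
                                   # with small DIF effects
```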

3.6. Haebara Linking (HAE)

In contrast to the MM, ATR, BSQ, and IA linking methods, Haebara (HAE) linking [84] aligns IRFs instead of item parameters. The linking function is defined as
$$\hat{\mu}_2 = \arg\min_{\mu_2} \frac{1}{I} \sum_{i=1}^{I} \int \left| \Psi(\theta - \hat{b}_{i1}) - \Psi(\theta - \hat{b}_{i2} - \mu_2) \right|^p \omega(\theta) \, \mathrm{d}\theta \tag{17}$$
with a power $p > 0$ and a weight function $\omega$ that fulfills $\int \omega(\theta) \, \mathrm{d}\theta = 1$. The originally proposed HAE linking uses $p = 2$ [84]. The robust alternative $p = 1$ was treated in [47,85,86], while cases $p < 1$ were studied in [87].
To gain more insight into the relation between IA and HAE, we apply a Taylor approximation to the second IRF in Equation (17) under the assumption of small DIF effects $e_i$. We obtain
$$\Psi(\theta - \hat{b}_{i2} - \mu_2) \approx \Psi(\theta - \hat{b}_{i1}) - \Psi_1(\theta - \hat{b}_{i1}) (\mu_2 - \nu_i), \tag{18}$$
where $\Psi_1(x) = \mathrm{d}\Psi/\mathrm{d}x = \Psi(x)(1 - \Psi(x)) \ge 0$. Using the approximation (18), Equation (17) can be rewritten as
$$\hat{\mu}_2 = \arg\min_{\mu_2} \frac{1}{I} \sum_{i=1}^{I} w_i \, |\mu_2 - \nu_i|^p, \tag{19}$$
where the item-specific weights are given by $w_i = \int \left[ \Psi_1(\theta - \hat{b}_{i1}) \right]^p \omega(\theta) \, \mathrm{d}\theta$. Hence, HAE linking can be interpreted as IA with item-specific weights, and a similar performance of HAE and IA can be expected.
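Equation (17) can be sketched with a standard normal weight $\omega$ approximated on a discrete grid of quadrature points (the quadrature grid, search range, and item parameters below are choices of this illustration):

```python
import math

def psi(x):
    return 1.0 / (1.0 + math.exp(-x))

def hae_linking(b1, b2, p=2.0, grid=1201):
    """Haebara linking: choose mu_2 so that the group-1 IRFs
    Psi(theta - b1_i) and the shifted group-2 IRFs
    Psi(theta - b2_i - mu_2) agree, integrated over a (discretized)
    standard normal weight for theta on [-4, 4]."""
    thetas = [-4.0 + 0.2 * k for k in range(41)]
    w = [math.exp(-0.5 * t * t) for t in thetas]
    nu = [a - b for a, b in zip(b1, b2)]
    lo, hi = min(nu) - 0.5, max(nu) + 0.5
    best_mu, best_val = lo, float("inf")
    for j in range(grid):
        mu = lo + (hi - lo) * j / (grid - 1)
        val = sum(
            wt * abs(psi(t - bi1) - psi(t - bi2 - mu)) ** p
            for bi1, bi2 in zip(b1, b2)
            for t, wt in zip(thetas, w)
        )
        if val < best_val:
            best_mu, best_val = mu, val
    return best_mu

# No DIF, true group difference 0.3: identified group-2 difficulties
# are shifted by -mu_2 (Section 2.3), and HAE recovers mu_2.
b1 = [0.0, 0.5, -0.5, 1.0]
b2 = [b - 0.3 for b in b1]
print(hae_linking(b1, b2))   # close to 0.3
```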

3.7. Gini Linking (GI)

Recently, a linking procedure based on the Gini index (GI; [88]) was proposed. The linking function is very similar to IA linking and aims to define a group difference that is primarily based on items with small DIF effects. The group mean difference is determined by
$$\hat{\mu}_2 = \arg\max_{\mu_2} \frac{\sum_{i=1}^{I} \sum_{j=1}^{I} \left| \, |\mu_2 - \nu_i|^p - |\mu_2 - \nu_j|^p \, \right|}{2 I \sum_{i=1}^{I} |\mu_2 - \nu_i|^p}, \tag{20}$$
where the power $p > 0$ can be chosen by the user. The original proposal used $p = 1$ [88]. Previous experience of the authors indicates that GI also works with $p > 1$ but does not perform satisfactorily with $p < 1$. IA and GI have been shown to provide similar results in small case studies [88], but GI linking has not yet been systematically compared with other linking methods.

3.8. Robustness of the Different Linking Methods

The linking methods mean-mean linking (MM) and Haebara linking (HAE) with $p = 2$ can be considered nonrobust. The methods based on the asymmetrically trimmed mean (ATR), elimination of DIF items with subsequent mean-mean linking (EL), bisquare linking (BSQ), invariance alignment (IA) with $p \le 1$, Haebara linking (HAE) with $p \le 1$, and Gini linking (GI) can be considered robust linking methods that offer some protection against the presence of biased DIF items.

4. An Analytical Treatment for Assessing Linking Errors

In this section, the computation of linking errors is investigated. Linking errors refer to the uncertainty due to the randomness associated with items [32,81,89,90,91,92,93]. The estimated group difference $\hat{\mu}_2$ is affected by random DIF, and the linking error quantifies this source of variance. Throughout this section, we assume an infinite sample size of persons, which means that the identified item parameters are estimated without sampling error. This assumption is dropped in Section 5.
In Section 2.2, we assumed that random DIF effects (or item parameters) follow a mixture distribution $G = (1 - \pi_{\mathrm{bias}}) G_{\mathrm{AMI}} + \pi_{\mathrm{bias}} G_{\mathrm{bias}}$, where $G_{\mathrm{AMI}}$ denotes the distribution associated with reference items whose DIF effects cancel out on average, and $G_{\mathrm{bias}}$ denotes the distribution of biased items that can affect estimated group differences. The estimation of the group difference can be interpreted as the estimation of a location parameter in robust statistics, where the location parameter (i.e., the group difference $\mu_2$) should be based on $G_{\mathrm{AMI}}$. However, the observed mixture distribution $G$ contains a contaminating asymmetric error distribution [67,94,95] that might bias the estimate $\hat{\mu}_2$. As discussed in Section 2.2, two cases of random DIF can be distinguished. First, items can be considered random, and the bivariate vector $(b_{i1}, b_{i2})$ of group-specific item difficulties is modeled with a distribution (see Section 4.1). Second, items can be regarded as fixed, but the DIF effects $e_i = b_{i2} - b_{i1}$ are modeled as random variables (see Section 4.2). Although these cases are conceptually different, they lead to similar variance estimates. Hence, the estimated errors (i.e., linking errors) due to the randomness associated with items are practically identical.

4.1. Random Item Parameters ( b i 1 , b i 2 )

In this subsection, we discuss estimation with random item parameters. We introduce slightly more general notation to cover the linking methods from Section 3 (except for GI linking). The "data" for item $i$ are given by the vector $\boldsymbol{y}_i = (\hat{b}_{i1}, \hat{b}_{i2}) = (b_{i1}, b_{i2} - \mu_2)$. The linking method must be additive with respect to functions of these data. More formally, let $H$ be a linking function defined by
$$H(\boldsymbol{\gamma}) = \frac{1}{I} \sum_{i=1}^{I} h(\boldsymbol{y}_i, \boldsymbol{\gamma}). \tag{21}$$
The linking parameter $\boldsymbol{\gamma}$ of interest (e.g., a group difference) is estimated by
$$\hat{\boldsymbol{\gamma}} = \arg\min_{\boldsymbol{\gamma}} H(\boldsymbol{\gamma}). \tag{22}$$
For a large number of items $I$, note that $H(\boldsymbol{\gamma})$ defined in (21) converges to
$$H_0(\boldsymbol{\gamma}; G) = \int h(\boldsymbol{y}, \boldsymbol{\gamma}) \, \mathrm{d}G(\boldsymbol{y}). \tag{23}$$
Assuming differentiability of $h$ implies that $\hat{\boldsymbol{\gamma}}$ can be obtained by solving the equation
$$S(\boldsymbol{\gamma}) = \frac{\partial}{\partial \boldsymbol{\gamma}} H(\boldsymbol{\gamma}) = \frac{1}{I} \sum_{i=1}^{I} \frac{\partial}{\partial \boldsymbol{\gamma}} h(\boldsymbol{y}_i, \boldsymbol{\gamma}) = \frac{1}{I} \sum_{i=1}^{I} \boldsymbol{m}(\boldsymbol{y}_i, \boldsymbol{\gamma}) = \boldsymbol{0}, \tag{24}$$
where $\boldsymbol{m} = \partial h / \partial \boldsymbol{\gamma}$.
Equation (24) provides an estimating equation for the parameter $\boldsymbol{\gamma}$. The corresponding estimator is labeled an M-estimator [96]. It is evident that the estimated group mean differences $\hat{\mu}_2$ in MM, IA, and HAE linking are M-estimators, obtained by defining the univariate parameter $\hat{\boldsymbol{\gamma}} = (\hat{\mu}_2)$. The linking methods EL and ATR are so-called two-step estimators because their computation relies on the median $\tilde{\mu}_2 = \mathrm{mdn}(\boldsymbol{\nu})$ computed in a first step. Because the estimating equation for the median is clearly defined, Equation (24) also applies to these two-step estimators: the estimator can be interpreted as a bivariate one-step M-estimator by defining $\hat{\boldsymbol{\gamma}} = (\tilde{\mu}_2, \hat{\mu}_2)$ (see [37], chp. 7).
We now apply the theory of M-estimators ([37], chp. 7; [96,97]) to study the asymptotic behavior of $\hat{\boldsymbol{\gamma}}$. Because we are concerned with linking errors, asymptotic behavior is meant with respect to the number of items. Letting the number of items tend to infinity, the left side of Equation (24) converges to
$$S_0(\boldsymbol{\gamma}; G) = \int \boldsymbol{m}(\boldsymbol{y}, \boldsymbol{\gamma}) \, \mathrm{d}G(\tilde{\boldsymbol{y}}), \tag{25}$$
where $\tilde{\boldsymbol{y}} = (\hat{b}_1, \hat{b}_2 + \mu_2)$ and $\hat{b}_g$ ($g = 1, 2$) denote the random variables associated with the estimated item parameters. As already mentioned, the distribution of item parameters $G$ follows the mixture $G = (1 - \pi_{\mathrm{bias}}) G_{\mathrm{AMI}} + \pi_{\mathrm{bias}} G_{\mathrm{bias}}$. Assume that densities for the involved distributions exist (i.e., continuous or count densities): $\mathrm{d}G = g \, \mathrm{d}\theta$, $\mathrm{d}G_{\mathrm{AMI}} = g_{\mathrm{AMI}} \, \mathrm{d}\theta$, and $\mathrm{d}G_{\mathrm{bias}} = g_{\mathrm{bias}} \, \mathrm{d}\theta$. Equation (25) can then be written as
$$S_0(\boldsymbol{\gamma}; G) = (1 - \pi_{\mathrm{bias}}) S_0(\boldsymbol{\gamma}; G_{\mathrm{AMI}}) + \pi_{\mathrm{bias}} S_0(\boldsymbol{\gamma}; G_{\mathrm{bias}}). \tag{26}$$
The parameter $\boldsymbol{\gamma}_\infty$ obtained from a linking method with an infinite number of items is given as the root of the equation
$$S_0(\boldsymbol{\gamma}_\infty; G) = \int \boldsymbol{m}(\boldsymbol{y}, \boldsymbol{\gamma}_\infty) \, \mathrm{d}G(\tilde{\boldsymbol{y}}) = \boldsymbol{0}. \tag{27}$$
Note that $\boldsymbol{\gamma}_\infty$ is a function of $\mu_2$, $G$, and $\boldsymbol{m}$. For a given dataset, $\mu_2$ and $G$ are fixed but unknown, whereas $\boldsymbol{m}$ is chosen by the user through the linking method.
The pseudo-true parameter $\boldsymbol{\gamma}_{\mathrm{AMI}}$ is defined as the estimate that would be obtained if all items were reference items; that is, the linking parameter is determined by the mixture component $G_{\mathrm{AMI}}$ alone:
$$S_0(\boldsymbol{\gamma}_{\mathrm{AMI}}; G_{\mathrm{AMI}}) = \int \boldsymbol{m}(\boldsymbol{y}, \boldsymbol{\gamma}_{\mathrm{AMI}}) \, \mathrm{d}G_{\mathrm{AMI}}(\tilde{\boldsymbol{y}}) = \boldsymbol{0}. \tag{28}$$
Ideally, a component of $\boldsymbol{\gamma}_{\mathrm{AMI}}$ (in the bivariate case) or $\boldsymbol{\gamma}_{\mathrm{AMI}}$ itself (in the univariate case) should provide an asymptotically unbiased estimate of $\mu_2$ through an appropriate choice of the linking function $h$ (or its derivative $\boldsymbol{m}$). In the following, we assume that $\boldsymbol{m}$ is differentiable, although the main propositions about M-estimators do not require differentiability [37]. Moreover, one can always approximate a nondifferentiable linking function $h$ by a differentiable function $h_A$. For example, a nondifferentiable and nonnegative linking function $h(x)$ can be approximated by $h_A(x) = \sqrt{h(x)^2 + \varepsilon}$ for a sufficiently small $\varepsilon > 0$ [78,80,81,87].
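The smoothing device $h_A(x) = \sqrt{h(x)^2 + \varepsilon}$ is easy to check numerically; this small sketch (function names are ours) verifies that the approximation stays within $\sqrt{\varepsilon}$ of $|x|$ and is well defined at the kink $x = 0$:

```python
import math

def h_abs(x):
    """Nondifferentiable L1 linking function h(x) = |x|."""
    return abs(x)

def h_smooth(x, eps=1e-4):
    """Differentiable approximation h_A(x) = sqrt(h(x)^2 + eps)."""
    return math.sqrt(h_abs(x) ** 2 + eps)

# h_A stays within sqrt(eps) of h everywhere; the largest gap is at
# x = 0, where |x| has its kink.
print(max(abs(h_smooth(x) - h_abs(x)) for x in [-2.0, -0.5, 0.0, 0.5, 2.0]))
```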

4.1.1. Asymptotic Behavior

We now study the asymptotic behavior of the estimator $\hat{\boldsymbol{\gamma}}$. For a large number of items, $\hat{\boldsymbol{\gamma}}$ converges to $\boldsymbol{\gamma}_\infty$. The derivation of $\boldsymbol{\gamma}_\infty$ relies on a Taylor approximation of $\boldsymbol{m}$ and closely follows [98]. Due to (26), we get
$$S_0(\boldsymbol{\gamma}_\infty; G) = (1 - \pi_{\mathrm{bias}}) S_0(\boldsymbol{\gamma}_\infty; G_{\mathrm{AMI}}) + \pi_{\mathrm{bias}} S_0(\boldsymbol{\gamma}_\infty; G_{\mathrm{bias}}) = \boldsymbol{0}. \tag{29}$$
We now apply a first-order Taylor approximation of $\boldsymbol{m}$ around $\boldsymbol{\gamma}_\infty = \boldsymbol{\gamma}_{\mathrm{AMI}}$:
$$\boldsymbol{m}(\boldsymbol{y}, \boldsymbol{\gamma}_\infty) \approx \boldsymbol{m}(\boldsymbol{y}, \boldsymbol{\gamma}_{\mathrm{AMI}}) + \boldsymbol{M}_\gamma(\boldsymbol{y}, \boldsymbol{\gamma}_{\mathrm{AMI}})(\boldsymbol{\gamma}_\infty - \boldsymbol{\gamma}_{\mathrm{AMI}}), \tag{30}$$
where $\boldsymbol{M}_\gamma = \partial \boldsymbol{m} / \partial \boldsymbol{\gamma}$ is the matrix of partial derivatives. Substituting (30) into (29) and using (28), we get
$$(1 - \pi_{\mathrm{bias}}) \int \boldsymbol{M}_\gamma(\boldsymbol{y}, \boldsymbol{\gamma}_{\mathrm{AMI}}) \, \mathrm{d}G_{\mathrm{AMI}}(\tilde{\boldsymbol{y}}) \, (\boldsymbol{\gamma}_\infty - \boldsymbol{\gamma}_{\mathrm{AMI}}) + \pi_{\mathrm{bias}} \int \boldsymbol{m}(\boldsymbol{y}, \boldsymbol{\gamma}_\infty) \, \mathrm{d}G_{\mathrm{bias}}(\tilde{\boldsymbol{y}}) = \boldsymbol{0}. \tag{31}$$
Hence, we obtain from (31)
$$\boldsymbol{\gamma}_\infty - \boldsymbol{\gamma}_{\mathrm{AMI}} = -\frac{\pi_{\mathrm{bias}}}{1 - \pi_{\mathrm{bias}}} \left[ \int \boldsymbol{M}_\gamma(\boldsymbol{y}, \boldsymbol{\gamma}_{\mathrm{AMI}}) \, \mathrm{d}G_{\mathrm{AMI}}(\tilde{\boldsymbol{y}}) \right]^{-1} \int \boldsymbol{m}(\boldsymbol{y}, \boldsymbol{\gamma}_\infty) \, \mathrm{d}G_{\mathrm{bias}}(\tilde{\boldsymbol{y}}). \tag{32}$$
If we assume that $\boldsymbol{\gamma}_{\mathrm{AMI}}$ allows the unbiased estimation of $\mu_2$, Equation (32) provides an expression for the asymptotic bias of $\hat{\boldsymbol{\gamma}}$. It is of crucial importance that the linking function $\boldsymbol{m}$ downweights observations from the distribution of biased items $G_{\mathrm{bias}}$ (i.e., $\int \boldsymbol{m}(\boldsymbol{y}, \boldsymbol{\gamma}_\infty) \, \mathrm{d}G_{\mathrm{bias}}(\tilde{\boldsymbol{y}}) \approx \boldsymbol{0}$). The linking function $\boldsymbol{m}$ has to be chosen so that biased items are automatically removed from the group comparison. The next subsection discusses how $\boldsymbol{m}$ should be chosen to enable an unbiased estimation of $\mu_2$.

4.1.2. Choosing an Optimal Linking Function m

Again, the derivation of the choice of the linking function m follows the exposition in [98]. Assume that the true parameter μ 2 is determined by the distribution G AMI (with density g AMI ) of reference items. Hence, μ 2 is given as the maximizer of the log-likelihood function and fulfills
E[ d/dμ_2 log g_AMI(ỹ) ] = ∫ ( d/dμ_2 log g_AMI(ỹ) ) g_AMI(ỹ) dỹ = 0.
Based on (33), the linking function m can be chosen in order to obtain unbiased estimates of group mean differences μ_2 (see [98]):
m(y, μ_2) = ( d/dμ_2 log g_AMI(ỹ) ) w(y, μ_2)
with the weight function w defined as
w(y, μ_2) = (1 − π_bias) g_AMI(ỹ)/g(ỹ) = [ 1 + π_bias g_bias(ỹ) / ((1 − π_bias) g_AMI(ỹ)) ]^{−1}.
Note that 0 < w(y, μ_2) ≤ 1, and the weighting function w weighs observations y according to their closeness to the distribution G_AMI. Observations y with large density values g_bias(ỹ) are downweighted in w. Using (33), it can be shown that
∫ m(y, μ_2) g(ỹ) dỹ = 0.
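The shape of this weight function can be inspected numerically. The sketch below uses hypothetical normal densities for reference and biased items (the parameter values are assumptions for illustration only) and shows that observations near the reference distribution receive weights close to one, while observations near the biased cluster are strongly downweighted:

```python
import math

# Hypothetical densities: reference DIF effects ~ N(0, tau^2) and
# biased DIF effects ~ N(delta, tau^2); the values are purely illustrative.
pi_bias, tau, delta = 0.1, 0.2, 0.6

def dnorm(x, mu, sd):
    """Normal density."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def weight(x):
    """Weight function w = [1 + pi_bias*g_bias / ((1 - pi_bias)*g_AMI)]^(-1)."""
    g_ami = dnorm(x, 0.0, tau)
    g_bias = dnorm(x, delta, tau)
    return 1.0 / (1.0 + pi_bias * g_bias / ((1.0 - pi_bias) * g_ami))

w_reference = weight(0.0)  # close to 1: effectively kept in the comparison
w_biased = weight(delta)   # close to 0: downweighted
```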

4.1.3. Asymptotic Normal Distribution

We now show that the M-estimator γ̂ follows an asymptotic normal (AN) distribution (see [37], chp. 7). The same Taylor expansion as in (30) provides
m(y, γ̂) ≈ m(y, γ) + M_γ(y, γ)(γ̂ − γ).
The approximation (37) can be substituted into the estimating Equation (24):
0 = (1/I) ∑_{i=1}^{I} m(y_i, γ̂) ≈ (1/I) ∑_{i=1}^{I} m(y_i, γ) + [ (1/I) ∑_{i=1}^{I} M_γ(y_i, γ) ] (γ̂ − γ).
Hence, we obtain from (38)
γ̂ − γ ≈ − [ (1/I) ∑_{i=1}^{I} M_γ(y_i, γ) ]^{−1} (1/I) ∑_{i=1}^{I} m(y_i, γ).
For I → ∞, we have (see (24))
(1/I) ∑_{i=1}^{I} M_γ(y_i, γ) →_p A(γ) = ∫ M_γ(y, γ) dG(ỹ)  and
(1/I) ∑_{i=1}^{I} m(y_i, γ) →_p S_0(γ, G) = ∫ m(y, γ) dG(ỹ) = 0.
Therefore, we obtain the asymptotic normal distribution of γ̂ as
γ̂ ~ AN( γ , (1/I) A(γ)^{−1} B(γ) A(γ)^{−T} ) , where
B(γ) = ∫ m(y, γ) m(y, γ)^T dG(ỹ).
The involved matrices A(γ) and B(γ) can be estimated from sample data by
Â(γ̂) = (1/I) ∑_{i=1}^{I} M_γ(y_i, γ̂)  and
B̂(γ̂) = (1/I) ∑_{i=1}^{I} m(y_i, γ̂) m(y_i, γ̂)^T.
Notably, the distribution stated in Equation (42) only holds for a sufficiently large number of items I.

4.1.4. Scalar Linking Parameter

We now specialize our results to the case in which the estimated parameter γ̂ coincides with the estimated group difference μ̂_2. In this case, m is a univariate linking function. Assume that μ_{2,AMI} and μ_{2,∞} are the roots of the following equations, respectively:
∫ m(y, μ_{2,AMI}) dG_AMI(ỹ) = 0  and
∫ m(y, μ_{2,∞}) dG(ỹ) = 0.
The asymptotic behavior of μ̂_2 can be described as (see Equation (32))
μ_{2,∞} − μ_{2,AMI} = − (π_bias/(1 − π_bias)) · ∫ m(y, μ_{2,∞}) dG_bias(ỹ) / ∫ m′(y, μ_{2,AMI}) dG_AMI(ỹ) ,
where m′ is the derivative of m with respect to μ_2. Furthermore, μ̂_2 is asymptotically normally distributed (see Equation (42)):
μ̂_2 ~ AN( μ_{2,∞} , (1/I) · ∫ m(y, μ_{2,∞})² dG(ỹ) / [ ∫ m′(y, μ_{2,∞}) dG(ỹ) ]² ).
Again, the integrals involved in the variance estimate in (49) can be estimated using sample data (see Equations (44) and (45)).
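For the scalar case, the plug-in variance estimate can be written in a few lines. The sketch below fits a Huber-type M-estimator to simulated item-parameter differences and computes the sandwich standard error by replacing the integrals in (49) with sample means (the sample size, true values, and tuning constant are illustrative assumptions):

```python
import math
import random

random.seed(2)

# Simulated differences y_i = mu_2 + e_i for I items (illustrative values).
I, mu_2, tau = 40, 0.5, 0.3
y = [mu_2 + random.gauss(0.0, tau) for _ in range(I)]

k = 0.3  # Huber tuning constant (an assumption, not taken from the article)

def psi(r):
    """Huber psi function, playing the role of the univariate linking function m."""
    return max(-k, min(k, r))

def psi_prime(r):
    """Derivative of psi (exists almost everywhere)."""
    return 1.0 if abs(r) <= k else 0.0

# Solve (1/I) * sum_i psi(y_i - mu) = 0 by bisection (decreasing in mu).
lo, hi = min(y), max(y)
while hi - lo > 1e-10:
    mid = 0.5 * (lo + hi)
    if sum(psi(v - mid) for v in y) > 0:
        lo = mid
    else:
        hi = mid
mu_hat = 0.5 * (lo + hi)

# Plug-in sandwich variance: Var(mu_hat) ~ (1/I) * B / A^2 with the sample
# analogs A = mean psi'(y_i - mu_hat) and B = mean psi(y_i - mu_hat)^2.
A = sum(psi_prime(v - mu_hat) for v in y) / I
B = sum(psi(v - mu_hat) ** 2 for v in y) / I
se = math.sqrt(B / (A * A * I))
```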

4.2. Fixed Item Parameters b i , Random DIF Effects e i

In this subsection, we consider the case of fixed item parameters b_i but random DIF effects e_i. The “data” in Section 4.1 was given by y_i = (b̂_1i, b̂_2i) = (b_i, −μ_2 + b_i + e_i), and the DIF effects e_i follow a distribution G. Now, only e_i is random, and we define the data as y_i = b̂_2i − b̂_1i = −μ_2 + e_i and ỹ_i = y_i + μ_2.
The estimating Equation (24) for the linking parameter γ can be rewritten as
S(γ) = (1/I) ∑_{i=1}^{I} m(b_i, y_i, γ) = 0.
The term in (50) converges to
S̃_0(γ; G) = lim_{I→∞} (1/I) ∑_{i=1}^{I} ∫ m(b_i, y_i, γ) dG(ỹ_i).
One has to assume that this limit S̃_0(γ; G) exists. Define γ as the root of
S̃_0(γ; G) = lim_{I→∞} (1/I) ∑_{i=1}^{I} ∫ m(b_i, y_i, γ) dG(ỹ_i) = 0.
Then, we can derive an asymptotic normal distribution for γ̂:
γ̂ ~ AN( γ , (1/I) Ã(γ)^{−1} B̃(γ) Ã(γ)^{−T} ) , where
Ã(γ) = lim_{I→∞} (1/I) ∑_{i=1}^{I} ∫ M_γ(b_i, y_i, γ) dG(ỹ_i)  and
B̃(γ) = lim_{I→∞} (1/I) ∑_{i=1}^{I} ∫ m(b_i, y_i, γ) m(b_i, y_i, γ)^T dG(ỹ_i).
The involved matrices Ã(γ) and B̃(γ) can be estimated by
Â(γ̂) = (1/I) ∑_{i=1}^{I} M_γ(b_i, y_i, γ̂)  and
B̂(γ̂) = (1/I) ∑_{i=1}^{I} m(b_i, y_i, γ̂) m(b_i, y_i, γ̂)^T.
Interestingly, these estimators coincide with the estimated standard errors in the case of random item parameters (see Equations (44) and (45)). Hence, no practical differences regarding the estimated linking parameters and their estimated standard errors can be expected; only conceptual differences emerge between the two treatments of DIF effects.

5. An Analytical Treatment for the Simultaneous Assessment of Standard Errors and Linking Errors

In practice, the variance in the group mean difference is affected by the sampling of persons (i.e., the standard error) and the randomness associated with items (i.e., the linking error). There have been attempts at an analytical treatment of simultaneous inference with respect to the two modes [81,99,100]. In this section, we apply M-estimation theory to the simultaneous assessment of standard errors and linking errors. The general idea in this kind of inference is to investigate the asymptotic behavior of the M-estimator γ̂ if the number of persons P and the number of items I tend to infinity. We only consider the case of random items, but the treatment of the case with fixed items and random DIF effects is similar.
In the notation of Section 4, y_i denotes the vector of (true) identified item parameters. In finite samples of size P, only estimates ŷ_i are available. For P → ∞, it holds that ŷ_i →_p y_i. In long tests, the estimated item parameters are approximately independent between items [101]. Hence, we can assume that the ŷ_i are approximately independent of each other. M-estimation theory applied to the person side guarantees an asymptotic normal distribution:
ŷ_i ~ AN( y_i , (1/P) V_i(y_i) ),
where V_i(y_i) is a function of the true item parameters y_i. We now use a Taylor expansion with respect to γ and y_i:
m(ŷ_i, γ̂) ≈ m(y_i, γ) + M_γ(y_i, γ)(γ̂ − γ) + M_y(y_i, γ)(ŷ_i − y_i).
Using the same approach as in Section 4.1.3, we get an approximation of the estimating equation as
(1/I) ∑_{i=1}^{I} [ m(y_i, γ) + M_γ(y_i, γ)(γ̂ − γ) + M_y(y_i, γ)(ŷ_i − y_i) ] = 0.
Then, we obtain
γ̂ − γ = − [ (1/I) ∑_{i=1}^{I} M_γ(y_i, γ) ]^{−1} [ (1/I) ∑_{i=1}^{I} m(y_i, γ) + (1/I) ∑_{i=1}^{I} M_y(y_i, γ)(ŷ_i − y_i) ].
By definition, we have for I → ∞
(1/I) ∑_{i=1}^{I} m(y_i, γ) →_p S_0(γ, G) = ∫ m(y, γ) dG(ỹ) = 0.
Moreover, the following limit exists as in Section 4.1.3:
(1/I) ∑_{i=1}^{I} M_γ(y_i, γ) →_p A(γ) = ∫ M_γ(y, γ) dG(ỹ).
Because ŷ_i − y_i →_p 0 for P → ∞, the second term in the right bracket in (61) vanishes asymptotically for I → ∞:
(1/I) ∑_{i=1}^{I} M_y(y_i, γ)(ŷ_i − y_i) →_p 0.
For the computation of the covariance matrix, we have
(1/I) ∑_{i=1}^{I} m(y_i, γ) m(y_i, γ)^T →_p B(γ) = ∫ m(y, γ) m(y, γ)^T dG(ỹ)  and
(1/I) ∑_{i=1}^{I} M_y(y_i, γ) V_i(y_i) M_y(y_i, γ)^T →_p C(γ) = ∫ M_y(y, γ) V_i(y) M_y(y, γ)^T dG(ỹ).
This shows the asymptotic normal distribution when simultaneous inference with respect to persons and items is conducted:
γ̂ ~ AN( γ , (1/I) A(γ)^{−1} [ B(γ) + (1/P) C(γ) ] A(γ)^{−T} ).
It is evident from (67) that both the number of persons and the number of items enter the statistical inference. The involved matrices A(γ), B(γ), and C(γ) can be estimated from sample data. However, for example, in (65), the true identified item parameters y_i in the left-hand term have to be replaced by the estimated item parameters ŷ_i, which can cause slight biases in the estimated variance matrices. Because of this disadvantage, we propose resampling techniques for the simultaneous inference of standard errors and linking errors in the next section.

6. Resampling Methods for the Simultaneous Assessment of Standard Errors and Linking Errors

We now derive estimation formulas for resampling methods [102,103] for persons and items. The derivation is motivated by assuming the following data-generating model:
X_pi = μ + u_p + v_i + e_pi ,  Var(u_p) = σ_P² , Var(v_i) = σ_I² , Var(e_pi) = σ_{P×I}² ,
where X_pi is the observed data for person p (or person group) and item i (or item group). The random variables u_p, v_i, and e_pi are all independent of each other. We now derive the variance of the mean estimate μ̂:
μ̂ = (1/(PI)) ∑_{p=1}^{P} ∑_{i=1}^{I} X_pi.
Its variance is given by
Var(μ̂) = σ_P²/P + σ_I²/I + σ_{P×I}²/(PI).
The variance in (70) contains error sources for both persons and items. Hence, it allows simultaneous inference for both error facets. Following the terminology of errors in item response modeling for the large-scale assessment of students [93], the variance σ_P²/P quantifies the sampling error due to sampling persons, the variance σ_I²/I the linking error due to sampling items, and the variance σ_{P×I}²/(PI) can be interpreted as measurement error.
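The decomposition in (70) is easy to verify by simulation. The following sketch repeatedly generates data from the two-way random-effects model and compares the Monte Carlo variance of μ̂ with the analytic formula (the variance components are illustrative assumptions):

```python
import random

random.seed(3)

P, I = 50, 20
sd_P, sd_I, sd_PI = 0.5, 0.4, 0.3  # SDs of u_p, v_i, e_pi (illustrative values)

def mu_hat():
    """One draw of the mean estimate under X_pi = mu + u_p + v_i + e_pi (mu = 0)."""
    u = [random.gauss(0.0, sd_P) for _ in range(P)]
    v = [random.gauss(0.0, sd_I) for _ in range(I)]
    total = sum(u[p] + v[i] + random.gauss(0.0, sd_PI)
                for p in range(P) for i in range(I))
    return total / (P * I)

R = 2000
draws = [mu_hat() for _ in range(R)]
mean = sum(draws) / R
var_mc = sum((d - mean) ** 2 for d in draws) / (R - 1)

# Analytic variance from Eq. (70): sigma_P^2/P + sigma_I^2/I + sigma_PxI^2/(P*I)
var_theory = sd_P**2 / P + sd_I**2 / I + sd_PI**2 / (P * I)
```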

6.1. Single Jackknife (SJK)

The classical single jackknife (SJK; [104,105,106,107]) approach removes one unit (e.g., a (group of) person(s) or a (group of) item(s)) from an analysis for computing standard errors. First, we investigate the jackknife estimate in which only persons are removed. Let μ ^ ( p , 0 ) be the mean estimate in which person p is removed:
μ̂_(p,0) = (1/((P−1)I)) ∑_{q=1, q≠p}^{P} ∑_{i=1}^{I} X_qi.
μ̂_(p,0) − μ̂ = −(1/P) u_p + ∑_{q≠p} (1/(P(P−1))) u_q − (1/(PI)) ∑_{i=1}^{I} e_pi + ∑_{q≠p} ∑_{i=1}^{I} (1/(P(P−1)I)) e_qi.
We now derive the expected value of the square in Equation (72):
E(μ̂_(p,0) − μ̂)² = (1/(P(P−1))) σ_P² + (1/(P(P−1)I)) σ_{P×I}².
Now, we define
S_P² = ∑_{p=1}^{P} (μ̂_(p,0) − μ̂)².
By using (72), we now obtain
E S_P² = (1/(P−1)) σ_P² + (1/((P−1)I)) σ_{P×I}².
Equation (75) allows the computation of the standard error associated with person sampling. From Equation (70), the variance σ_{P×I}²/(PI) can also partly be attributed to person sampling. From (75), we get by replacing the expected value E S_P² with the observed value S_P²
(1/P) σ_P² = ((P−1)/P) S_P² − (1/(PI)) σ_{P×I}².
In the single jackknife, the person-by-item interaction variance component σ_{P×I}² is typically ignored, and the variance due to person sampling is, hence, estimated by ((P−1)/P) S_P².
Similarly, we can derive the properties of the SJK estimate in which a single item i (or an item group) is removed from the analysis:
μ̂_(0,i) = (1/(P(I−1))) ∑_{p=1}^{P} ∑_{j=1, j≠i}^{I} X_pj.
The SJK variance estimate for item sampling utilizes the sum of squares term
S_I² = ∑_{i=1}^{I} (μ̂_(0,i) − μ̂)².
By exchanging the indices i and p in Equation (75), we obtain
E S_I² = (1/(I−1)) σ_I² + (1/(P(I−1))) σ_{P×I}².
By replacing the expected value E S_I² with the observed value S_I², the quantity ((I−1)/I) S_I² is used as the variance estimate concerning the item facet. For the joint inference of persons and items, the variance terms for persons and items are added:
Var̂(μ̂) = ((P−1)/P) S_P² + ((I−1)/I) S_I².
Note that this variance estimate is biased because
E Var̂(μ̂) − Var(μ̂) = (1/(PI)) σ_{P×I}² > 0.
Consequently, so-called double jackknife resampling should be employed to remove the bias from the estimated variance.

6.2. Double Jackknife (DJK)

The double jackknife (DJK; [104,106,108]) removes a person (or a group of persons) and an item (or a group of items) from an analysis for the determination of the standard error. The elimination and repeated analysis are carried out for all persons and items. Let μ̂_(p,i) be the mean estimate in which person p and item i are removed. In more detail, it is
μ̂_(p,i) = (1/((P−1)(I−1))) ∑_{q=1, q≠p}^{P} ∑_{j=1, j≠i}^{I} X_qj.
The estimate μ ^ ( p , 0 ) only removes person p, and the estimate μ ^ ( 0 , i ) only removes item i. The corresponding estimates have already been studied as SJK estimates in Section 6.1.
We now consider an analysis in which one person and one item are removed. One obtains
μ̂_(p,i) − μ̂ = −(1/P) u_p + ∑_{q≠p} (1/(P(P−1))) u_q − (1/I) v_i + ∑_{j≠i} (1/(I(I−1))) v_j − (1/(PI)) ( e_pi + ∑_{j≠i} e_pj + ∑_{q≠p} e_qi ) + ∑_{q≠p} ∑_{j≠i} ((P+I−1)/((P−1)(I−1)PI)) e_qj.
It follows that
E(μ̂_(p,i) − μ̂)² = (1/(P(P−1))) σ_P² + (1/(I(I−1))) σ_I² + ((P+I−1)/(PI(P−1)(I−1))) σ_{P×I}².
Now, define
S_{P×I}² = ∑_{p=1}^{P} ∑_{i=1}^{I} (μ̂_(p,i) − μ̂)².
We then obtain by using (84)
E S_{P×I}² = (I/(P−1)) σ_P² + (P/(I−1)) σ_I² + ((P+I−1)/((P−1)(I−1))) σ_{P×I}².
One can use Equations (75), (79) and (86) as estimating equations by equating the expected values of the sums of squares with their observed counterparts. We have three equations for three unknowns:
S_P² = (1/(P−1)) σ_P² + (1/((P−1)I)) σ_{P×I}² ,  S_I² = (1/(I−1)) σ_I² + (1/((I−1)P)) σ_{P×I}² ,  S_{P×I}² = (I/(P−1)) σ_P² + (P/(I−1)) σ_I² + ((P+I−1)/((P−1)(I−1))) σ_{P×I}².
We further simplify (87) to
(P−1) I S_P² = I σ_P² + σ_{P×I}² ,  P (I−1) S_I² = P σ_I² + σ_{P×I}² ,  (P−1)(I−1) S_{P×I}² = I(I−1) σ_P² + P(P−1) σ_I² + (P+I−1) σ_{P×I}².
Now substitute the first and second equations of (88) into the third equation. We obtain
(P−1)(I−1) S_{P×I}² = (I−1)(P−1) I S_P² + (P−1) P (I−1) S_I² + σ_{P×I}².
Hence, we get from (89)
σ_{P×I}²/(PI) = (P−1)(I−1) [ S_{P×I}²/(PI) − (1/P) S_P² − (1/I) S_I² ].
Furthermore, the variance components for persons and items can be computed as
σ_P²/P = ((P−1)/P) S_P² − σ_{P×I}²/(PI)  and  σ_I²/I = ((I−1)/I) S_I² − σ_{P×I}²/(PI).
The quantities in (90) and (91) can be used to estimate the population variance defined in (70). A crucial issue is how to handle negative variance estimates. Based on experience from preliminary simulation studies, the following variance estimate turned out to be most satisfactory:
Var̂(μ̂) = ((P−1)/P) S_P² + ((I−1)/I) S_I² − σ̂_{P×I}²/(PI),
where σ̂_{P×I}² = max(σ_{P×I}², 0) is nonnegative and σ_{P×I}² is defined in Equation (90).
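The SJK and DJK computations can be sketched compactly because all leave-one-out means follow from the row, column, and grand totals. The code below implements Equations (80), (90), and (92) for one simulated dataset from the two-way model (the variance components are illustrative assumptions):

```python
import random

random.seed(4)

P, I = 30, 20
sd_P, sd_I, sd_PI = 0.5, 0.4, 0.3  # illustrative variance components

u = [random.gauss(0.0, sd_P) for _ in range(P)]
v = [random.gauss(0.0, sd_I) for _ in range(I)]
X = [[u[p] + v[i] + random.gauss(0.0, sd_PI) for i in range(I)] for p in range(P)]

total = sum(map(sum, X))
mu = total / (P * I)
row = [sum(X[p]) for p in range(P)]                       # person totals
col = [sum(X[p][i] for p in range(P)) for i in range(I)]  # item totals

# Leave-one-out means obtained from the totals (no re-summation needed)
mu_p0 = [(total - row[p]) / ((P - 1) * I) for p in range(P)]
mu_0i = [(total - col[i]) / (P * (I - 1)) for i in range(I)]

S_P = sum((m - mu) ** 2 for m in mu_p0)
S_I = sum((m - mu) ** 2 for m in mu_0i)
S_PI = sum(((total - row[p] - col[i] + X[p][i]) / ((P - 1) * (I - 1)) - mu) ** 2
           for p in range(P) for i in range(I))

var_sjk = (P - 1) / P * S_P + (I - 1) / I * S_I   # Eq. (80): positively biased
inter = (P - 1) * (I - 1) * (S_PI / (P * I) - S_P / P - S_I / I)  # Eq. (90)
var_djk = var_sjk - max(inter, 0.0)               # Eq. (92): bias-corrected
```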

6.3. Single Half Sampling (SHS)

In single half sampling (SHS; [107]), half of the sample is used to reanalyze the data for computing standard errors. Let μ̂_{P,h} be the estimate from the h-th half sample of persons, in which half of the persons are sampled. Without loss of generality, let P be even; each half sample then consists of P/2 persons. We define the half sample h in which the first P/2 persons are sampled and compute the mean estimate
μ̂_{P,h} = (1/((P/2) I)) ∑_{p=1}^{P/2} ∑_{i=1}^{I} X_pi.
Then, we obtain
μ̂_{P,h} − μ̂ = (1/P) ∑_{p=1}^{P/2} u_p − (1/P) ∑_{p=P/2+1}^{P} u_p + (1/(PI)) ∑_{p=1}^{P/2} ∑_{i=1}^{I} e_pi − (1/(PI)) ∑_{p=P/2+1}^{P} ∑_{i=1}^{I} e_pi.
Hence, we get from (95)
E(μ̂_{P,h} − μ̂)² = σ_P²/P + σ_{P×I}²/(PI).
Now, there are H (potentially balanced) half samples (see [102]) with estimates μ̂_{P,h}. Define the variance
U_P² = (1/H) ∑_{h=1}^{H} (μ̂_{P,h} − μ̂)².
Using (95), it follows that
E U_P² = σ_P²/P + σ_{P×I}²/(PI).
Similarly, one can consider half samples of items. Assume that in half sample k, the first I/2 items are sampled. Let
μ̂_{I,k} = (1/(P(I/2))) ∑_{p=1}^{P} ∑_{i=1}^{I/2} X_pi.
One can define the variance in estimates due to the K different half samples of items:
U_I² = (1/K) ∑_{k=1}^{K} (μ̂_{I,k} − μ̂)².
Using the same derivations, we get
E U_I² = σ_I²/I + σ_{P×I}²/(PI).
Based on the expected values in (97) and (100), one can define a variance estimate of μ̂ by adding the variance components regarding persons and items as
Var̂(μ̂) = U_P² + U_I².
Notably, this estimate is positively biased because
E(Var̂(μ̂)) − Var(μ̂) = σ_{P×I}²/(PI) > 0.
As in the case of SJK, SHS also results in a biased variance estimate. In the next section, we investigate double half sampling that removes the bias component.

6.4. Double Half Sampling (DHS)

In double half sampling (DHS), half samples of persons and items are created and the analysis is replicated for these half samples. Let h be a half sample of persons, and k be a half sample of items for this dataset of persons. Let μ ^ I : P , k h be the mean estimate for the half sample for persons and items and μ ^ P , h be the estimate for the half sample of persons.
Define the variance
U_{I:P}² = (1/(KH)) ∑_{h=1}^{H} ∑_{k=1}^{K} (μ̂_{I:P,kh} − μ̂_{P,h})².
Using the same derivation as in (100), one obtains
E U_{I:P}² = σ_I²/I + σ_{P×I}²/((P/2) I) = σ_I²/I + 2 σ_{P×I}²/(PI).
Hence, an unbiased estimate of the variance of μ̂ using DHS is obtained by
Var̂(μ̂) = U_P² + U_I² − σ̂_{P×I}²/(PI),
where σ̂_{P×I}²/(PI) = max(U_{I:P}² − U_I², 0).
In practice, one can use balanced half samples based on Hadamard matrices for the most efficient variance estimates that minimize the Monte Carlo error for creating half samples [102]. In the simulation study (see Section 8), only balanced half samples are considered.
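A minimal implementation of SHS and DHS is given below. For brevity, random half samples are drawn instead of balanced half samples from a Hadamard matrix, and the nested term U_{I:P}² is computed from H paired half samples; all settings are illustrative assumptions:

```python
import random

random.seed(5)

P, I, H = 40, 20, 50
sd_P, sd_I, sd_PI = 0.5, 0.4, 0.3  # illustrative variance components

u = [random.gauss(0.0, sd_P) for _ in range(P)]
v = [random.gauss(0.0, sd_I) for _ in range(I)]
X = [[u[p] + v[i] + random.gauss(0.0, sd_PI) for i in range(I)] for p in range(P)]

def mean_over(persons, items):
    """Mean of X over the given person and item index sets."""
    return sum(X[p][i] for p in persons for i in items) / (len(persons) * len(items))

mu = mean_over(range(P), range(I))
half_p = [random.sample(range(P), P // 2) for _ in range(H)]  # person halves
half_i = [random.sample(range(I), I // 2) for _ in range(H)]  # item halves

U_P = sum((mean_over(hp, range(I)) - mu) ** 2 for hp in half_p) / H
U_I = sum((mean_over(range(P), hi) - mu) ** 2 for hi in half_i) / H
# Nested term: item half samples within each person half sample
U_IP = sum((mean_over(hp, hi) - mean_over(hp, range(I))) ** 2
           for hp, hi in zip(half_p, half_i)) / H

var_shs = U_P + U_I                          # Eq. (101): positively biased
var_dhs = U_P + U_I - max(U_IP - U_I, 0.0)   # Eq. (105): bias-corrected
```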

6.5. Double Bootstrap

It might be tempting to consider a double bootstrap resampling approach of persons and items as an alternative to DJK and DHS [104,109,110,111]. We believe that bootstrapping items should not be recommended because duplicating items introduces additional local dependence in IRT models, which, in turn, induces bias in estimated item parameters and linking parameters. Hence, the variability obtained from a double bootstrap will also include portions of bias.

7. Simulation Study 1: Comparing the Performance of Different Linking Methods

In Simulation Study 1, we compare the performance of robust and nonrobust linking methods for the RM in the presence and absence of random DIF. This study systematically compares several robust linking methods. In particular, the recently proposed GI method is compared with alternative methods.

7.1. Design

Data were simulated according to the RM with random DIF in two groups. In the first group, the ability distribution was simulated as θ ~ N(0, 1). In the second group, we simulated θ ~ N(0.5, 1) (i.e., μ_2 = 0.5). Item difficulties b_i were fixed in the simulation and were chosen equidistant in the interval [−2, 2]. Hence, in this study, we assumed fixed item difficulties b_i, but simulated random DIF effects e_i according to a mixture distribution G = (1 − π_bias) G_AMI + π_bias G_bias (see Section 2.2). The distribution of DIF effects of reference items was chosen as a centered normal distribution; that is, G_AMI = N(0, τ_AMI²). For the distribution of DIF effects of biased items G_bias, we chose, for balanced DIF, a two-point distribution with values δ_bias and −δ_bias and corresponding probabilities π_bias/2. For unbalanced DIF, we simulated a one-point distribution at δ_bias with probability π_bias, which favored the first group. In the simulation, we fixed δ_bias to 0.60. The bias for MM linking is expected to be π_bias δ_bias (see Equation (11)). It vanishes for balanced DIF and is a function of π_bias in the case of unbalanced DIF.
In the simulation, five factors were varied. First, we chose the sample size N of persons as 250, 500, 1000, and 5000. Second, we varied the number of items by I = 20 and I = 40 . Third, we chose the proportion of biased items π bias = 0 , 0.1 , 0.3 . With π bias = 0 , no biased DIF items were simulated. Fourth, we varied the standard deviation (SD) of DIF effects of reference items τ AMI as 0, 0.1, 0.2, and 0.3. Fifth, we simulated three different distributions of DIF effects if τ AMI > 0 : a normal distribution, a uniform distribution, and a t-distribution with four degrees of freedom. With τ AMI = 0 , reference items do not have DIF effects. The distributions of DIF effects were appropriately scaled in order to match the SD τ AMI . In total, 1000 datasets were simulated and analyzed in each condition.
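The random-DIF generating scheme for e_i can be written as a short function. The sketch below draws DIF effects from the mixture for balanced and unbalanced DIF and checks the implied expected values (0 and π_bias · δ_bias, respectively); for simplicity, only the normal G_AMI variant is used, and the condition values are one arbitrary cell of the design:

```python
import random

random.seed(6)

I, tau_ami, delta_bias, pi_bias = 40, 0.3, 0.6, 0.3

def dif_effects(balanced):
    """Draw I DIF effects from G = (1 - pi_bias) G_AMI + pi_bias G_bias."""
    e = []
    for _ in range(I):
        if random.random() < pi_bias:
            # Biased item: two-point distribution (balanced DIF) or
            # one-point distribution at delta_bias (unbalanced DIF)
            e.append(random.choice([-delta_bias, delta_bias]) if balanced
                     else delta_bias)
        else:
            e.append(random.gauss(0.0, tau_ami))  # reference item
    return e

reps = 500
avg_bal = sum(sum(dif_effects(True)) / I for _ in range(reps)) / reps
avg_unb = sum(sum(dif_effects(False)) / I for _ in range(reps)) / reps
# avg_bal is close to 0; avg_unb is close to pi_bias * delta_bias = 0.18
```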

7.2. Analysis

The RM model was separately estimated in the two groups. The linking methods introduced in Section 3 were applied. We chose a cutoff value of 0.4 for DIF detection in the EL method. In ATR linking, we chose trimming proportions of 0.20 and 0.40. In BSQ linking, we chose 0.4 as the threshold parameter K. IA was estimated using the powers p = 1 , 0.5 , 0.25 , and 0.1. GI linking was utilized with powers 1 and 2. HAE linking was specified with powers p = 2 , 1, 0.5, 0.25, and 0.1.
The parameter of interest was the estimated group mean difference μ̂_2. For this parameter, the bias and root mean square error (RMSE) were computed. To reduce the dependence of the RMSE on the sample size and the number of items, we computed a relative RMSE, for which the RMSE of a linking method is divided by the RMSE of the best-performing linking method (and multiplied by 100). Hence, the relative RMSE takes its lowest value of 100 for the best linking method.
To summarize the contribution of each of the manipulated factors in the simulation, we conducted an analysis of variance (ANOVA). We used a variance decomposition for assessing the importance in the presence and absence of DIF.
Moreover, we classified linking methods according to whether they showed satisfactory performance in a particular condition. We defined performance as satisfactory with respect to bias if the absolute bias in the estimated mean μ̂_2 was smaller than 0.01, and as satisfactory with respect to the RMSE if the relative RMSE was smaller than 125.
In all analyses, the statistical software R [112] was used. The R package sirt [113] was employed for estimating the RM model with marginal maximum likelihood as the estimation method. The linking methods were estimated using R functions particularly written for this paper.

7.3. Results

In Table 1, the variance decomposition of the ANOVA summarized across conditions without DIF is presented. For the bias, the sample size N, the number of items I, and the linking method (Meth in Table 1) have an impact. However, as we will see later, the bias is of negligible size in the situation of no DIF. For the RMSE, the linking method constitutes the major source of differences. In contrast, sample size and the number of items only have small effects on the RMSE.
In Table 2, the variance decomposition of the ANOVA summarized across conditions of balanced and unbalanced DIF, respectively, is presented. All terms up to three-way interactions were included. For balanced DIF (column BAL), RMSE is more important than bias. It is evident that linking methods produced the largest variability in estimates, followed by the SD τ AMI of DIF effects of reference items, the proportion of biased items π bias , sample size N, and the type of distribution (column Dist) of DIF effects. For unbalanced DIF, the bias is primarily affected by π bias and τ AMI and their interaction. Like for balanced DIF, the linking method substantially explains the variability in the RMSE of group mean differences.
Table 3 summarizes the performance of the different linking methods across all conditions with no DIF, balanced DIF, and unbalanced DIF. In the absence of DIF, all linking methods produced unbiased estimates. However, IA with small powers p of 0.25 and 0.1 as well as HAE with p = 0.1 resulted in less precise estimates. Interestingly, GI linking always resulted in a substantially increased variability in estimated group mean differences compared to all other linking methods.
In the conditions with balanced DIF (column “BAL”), all linking methods (except for GI in a few conditions) produced unbiased estimates. However, using robust linking methods (i.e., EL, ATR, BSQ, IA, GI, HAE(p) with p 1 ) resulted in an efficiency loss in the RMSE compared to nonrobust linking methods (i.e., MM, HAE(2)). Among the robust linking methods, MM linking with the elimination of DIF items (i.e., EL) as well as IA and HAE with p = 1 performed best.
Finally, the situation of unbalanced DIF (column “UNBAL”) is most challenging because linking methods have to handle the presence of biased items. Notably, robust linking methods are preferred over nonrobust linking in such a situation. In particular, MM and HAE(2) always resulted in biased estimates. Among the robust linking methods, BSQ and IA with p = 0.25 and 0.1 resulted in the least simulation conditions with biased estimates. Concerning RMSE, EL and ATR with a trimming proportion of 0.4 performed best, followed by IA with p = 1 , HAE with p = 1 , ATR with a trimming proportion of 0.2 and BSQ linking.
Table 4 shows the RMSE for balanced DIF with I = 40 items as a function of sample size (N), the proportion of biased items (π_bias), and the standard deviation of DIF effects of reference items (τ_AMI). For balanced DIF, all linking methods produced unbiased estimates (not shown in the table). However, there were slight differences between the linking methods with respect to the RMSE. In the situation of partial invariance (i.e., τ_AMI = 0), the efficiency loss of robust linking methods compared to the nonrobust linking methods MM and HAE(2) was acceptable. However, GI resulted in more variable estimates. Moreover, note that GI linking with p = 2 outperformed GI with p = 1 in most conditions. Robust linking methods IA and HAE with very small power values p (e.g., p = 0.25 or 0.1) also caused a non-negligible RMSE increase.
The efficiency loss of robust linking methods is much larger if the reference items also possess DIF (i.e., τ AMI > 0 ). Only IA with p = 1 can somehow compete with MM and HAE(2) linking. The variance increase in robust linking methods IA and HAE with very small powers is apparent. It also has to be stated that GI linking produced large RMSE values in balanced DIF conditions.
Table 5 shows the bias and the RMSE for unbalanced DIF for I = 40 items as a function of sample sizes (N), proportion of biased items ( π bias ), and standard deviation of DIF effects of reference items ( τ AMI ). All linking methods show biases in at least one condition. Notably, nonrobust linking methods MM and HAE(2) showed the largest bias. Robust linking methods reduce the bias in all conditions. The most critical condition is π bias = 0.3 and τ AMI = 0.2 . In this condition, BSQ linking has the least bias, followed by IA with small powers 0.25 and 0.1. In this condition, it is also interesting to note that biases for a large sample size of N = 1000 are smaller than for N = 250 .
With respect to the RMSE, EL, ATR, BSQ, and IA with powers 0.5 and 0.25 can be recommended. It is important to emphasize that GI linking with p = 2 performed well in the case of partial invariance (i.e., τ_AMI = 0) and outperformed the recently proposed GI linking with p = 1. Interestingly, DIF detection with subsequent MM linking (method EL) was also relatively effective as long as the proportion of biased items was not too large.

8. Simulation Study 2: Performance of Resampling Methods for Computing Standard Errors and Linking Errors

In Simulation Study 2, we investigate the performance of resampling methods for estimating the variability of group mean differences. DJK and DHS have not yet been systematically studied for linking methods in the literature. In particular, there is a lack of research on resampling methods for robust linking methods.

8.1. Design

The data-generating model closely follows that of Simulation Study 1 (see Section 7.1). Only a selected number of conditions was simulated because resampling methods are computationally demanding. In contrast to Simulation Study 1, we set δ_bias = 0.3. Only balanced DIF was simulated because the assessment of variability (and not bias) was the focus of this simulation. The proportion of biased items was chosen as π_bias = 0 or π_bias = 0.3. The SD of DIF effects for reference items, τ_AMI, was set to 0.3. We considered sample sizes N = 500 and N = 2000 and fixed the number of items to I = 40. In total, 2000 replications were conducted in each condition of the simulation study.

8.2. Analysis

To further reduce computation time, we only chose a selected number of linking methods that provided unbiased estimates in Simulation Study 1; that is, MM, ATR, IA, and HAE. We assessed the variability in estimated group mean differences with the resampling methods SJK (Equation (80)), DJK (Equation (92)), SHS (Equation (101)), and DHS (Equation (105)). We applied the resampling methods with 20 replication zones (containing 500/20 = 25 or 2000/20 = 100 persons and 40/20 = 2 items in each zone). Approximately balanced half sampling was used by specifying zones constructed from the upper part of a Hadamard matrix with a minimum dimension larger than 20. We computed confidence intervals based on the standard errors ŝ estimated by the respective methods as CI(μ̂_2) = μ̂_2 ± 1.96 · ŝ. The proportion of replications in which the true difference μ_2 is contained in CI(μ̂_2) is defined as the coverage rate. Coverage rates were classified as satisfactory if they ranged within the interval [92.5, 97.5]. As in Simulation Study 1, we used R [112] and the R package sirt [113].
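The coverage criterion can be illustrated with a toy example in which the estimator and its standard error are known to be correct (a simple normal mean instead of a linking estimator):

```python
import math
import random

random.seed(7)

mu_2, sigma, N, reps = 0.5, 1.0, 250, 2000  # illustrative values
covered = 0
for _ in range(reps):
    xs = [random.gauss(mu_2, sigma) for _ in range(N)]
    m = sum(xs) / N
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (N - 1))
    se_hat = sd / math.sqrt(N)
    # Check whether the true value lies in CI = m +/- 1.96 * se_hat
    if m - 1.96 * se_hat <= mu_2 <= m + 1.96 * se_hat:
        covered += 1
coverage = 100.0 * covered / reps  # close to the nominal 95
```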

8.3. Results

In Table 6, coverage rates for the resampling methods are displayed. By construction, single resampling methods (SJK and SHS) result in slightly wider confidence intervals than double resampling methods (DJK and DHS) and, in turn, produce higher coverage rates. It can be seen that SJK and DJK failed to produce acceptable coverage rates. In particular, jackknife error estimates performed worse for robust linking methods. This is in line with results in robust statistics showing that the jackknife does not work for nondifferentiable statistics. However, SJK can be used for the nonrobust linking method HAE(2). In contrast, half sampling methods outperformed the jackknife. As expected, SHS produced slight overcoverage, but DHS produced acceptable coverage in all conditions. Particularly noteworthy is the fact that DHS also performed successfully for robust linking methods. Overall, these findings indicate that half sampling methods should be preferred over jackknife resampling.

9. Discussion

In this article, we investigated the performance of robust and nonrobust linking methods as well as the assessment of standard error and linking error estimates of group mean differences. We assumed random DIF with a mixture distribution model. Items are implicitly classified into a set of reference items (that are valid for group comparisons) and biased items that potentially bias group mean differences. We studied the nonrobust linking methods mean-mean linking (MM) and Haebara linking (HAE) with p = 2 , as well as the robust linking methods based on the asymmetrically trimmed mean (ATR), elimination of DIF items with subsequent mean-mean linking (EL), bisquare linking (BSQ), invariance alignment (IA) with p 1 , Haebara linking (HAE) with p 1 and Gini linking (GI).
We found that robust linking methods can be very effective in reducing biases in the presence of biased items in unbalanced DIF situations. However, in the presence of DIF on reference items (i.e., in the absence of partial invariance), robust linking methods can result in reduced efficiency of estimates compared to nonrobust methods such as mean-mean linking or Haebara linking, in particular in the situation of balanced DIF. Our study also compared the recently proposed Gini linking with alternative linking methods. Surprisingly, GI performed worse than its competitors and only showed acceptable performance for a modified GI version with power p = 2. In our view, it is hard to recommend a particular linking estimator in the unbalanced DIF situation. It is only evident that mean-mean linking and Haebara linking with p = 2 are prone to bias and should not be used. Moreover, the recently proposed Gini linking produced much more variable estimates than competing linking estimators. The usual practice in psychometrics (linking method EL), which eliminates DIF items in a first step and computes group differences based on the DIF-free items in a second step, provides comparable results to robust linking methods (see also [47,114]). Note that we used the median as the preliminary location estimate in the first step of the EL method, which differs from the practice that employs the equal mean difficulty assumption (i.e., uses the mean instead of the median; see [52]).
We also studied the variability of group mean difference estimates due to random DIF. The randomness of DIF introduces an additional source of error (i.e., the linking error) in addition to the standard error associated with the sampling of persons. We analytically derived the distribution of the group difference through M-estimation theory. These results are of importance for a (very) large number of items. Because we used a relatively small number of I = 40 items in the simulation and large item pools are often not available in applications, we investigated (single and double) jackknife and (single and double) half sampling resampling methods for persons and items for assessing the variability in estimates of the linking methods. We found that our proposed double half sampling outperformed jackknife-based error estimates. In contrast to the jackknife, half sampling can also satisfactorily be applied to nondifferentiable robust linking methods. These findings indicate that half sampling methods could find their way into the assessment of linking errors in empirical applications.
In this article, we focused on the estimation of group differences. In applied DIF research, the choice of the correct anchor items is always crucial [41,50,115,116,117,118]. The studied robust linking estimators can be used to transform estimated item difficulties onto a common scale. Differences in the transformed item difficulties can then be inspected for DIF effects. Resampling procedures (single jackknife or single half sampling) can be employed to assess the statistical significance of DIF effects.
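As a hypothetical illustration of this DIF screening step (simulated numbers, not the article's data; the flagging threshold of 0.4 is an arbitrary choice for this sketch), item difficulties of a second group can be transformed onto the first group's scale via a robust linking constant, after which large residual differences indicate DIF:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical item difficulties: group 2 difficulties are shifted by a
# group difference of 0.3, and four items additionally carry DIF.
b1 = rng.normal(0, 1, 40)                  # group 1 item difficulties
dif = np.zeros(40)
dif[:4] = 0.8                              # four DIF items
b2 = b1 + 0.3 + dif + rng.normal(0, 0.05, 40)

link = np.median(b2 - b1)                  # robust linking constant
resid = (b2 - link) - b1                   # residual DIF effects after linking
flagged = np.where(np.abs(resid) > 0.4)[0] # flag items with large residuals
print(flagged)
```

In practice, the flagging threshold would be replaced by a resampling-based significance criterion as described above.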
As an alternative to separate scaling with subsequent robust or nonrobust linking, concurrent scaling assuming invariant item parameters can be utilized. Although such a one-step approach might be preferred from a practitioner's point of view, the presence of DIF effects likely introduces some bias into estimated group differences [5,119]. Surprisingly, this bias is present even for balanced DIF [119]. Robust linking methods have the advantage that a few outlying DIF effects are automatically excluded from group comparisons [119]. Moreover, concurrent calibration might have computational disadvantages [44,120]. As a further alternative, concurrent calibration assuming partial invariance can be pursued [47,121,122]. In this approach, DIF is investigated in a first step, and items showing DIF receive group-specific item parameters in the concurrent calibration, while invariance is assumed for the remaining items.
Furthermore, the precision of linking estimates can be improved by including additional person covariates in the analysis [123,124]. This could be particularly true if DIF effects also exist for person covariates. There is a lack of research on robust linking methods that include person covariates.
Finally, we assumed that the Rasch model was correctly specified. This assumption might be unrealistic in practice, and much more complex item response functions could have generated the item responses [23,29]. It would be interesting to study the performance of the different linking methods and of the standard error and linking error estimators under misspecified models. We emphasize that M-estimation theory and resampling techniques also provide valid inference in the case of misspecified models. It can always be debated whether estimates from a misspecified Rasch model are practically relevant or should be interpreted at all. We tend to argue that parameter estimates of misspecified models summarize a population distribution, and that model fitting is not always (and perhaps should not be) targeted at recovering the model that generated the data. In this sense, we think that approaches that include model error as an additional component in statistical inference [125] might be beneficial.

Funding

This research received no external funding.

Acknowledgments

We would like to thank the four anonymous reviewers, the academic editor, and the assistant editor for their valuable comments that helped to improve the article.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

   The following abbreviations are used in this manuscript:
1PL     one-parameter logistic model
AN      asymptotic normal distribution
ATR     asymmetrically trimmed mean linking
BSQ     bisquare kernel linking
DHS     double half sampling
DIF     differential item functioning
DJK     double jackknife
EL      elimination of DIF items with subsequent mean-mean linking
HAE     Haebara linking
IA      invariance alignment
IRF     item response function
IRT     item response theory
MM      mean-mean linking
PISA    Programme for International Student Assessment
RM      Rasch model
RMSE    root mean square error
SD      standard deviation
SHS     single half sampling
SJK     single jackknife

References

1. Van der Linden, W.J.; Hambleton, R.K. (Eds.) Handbook of Modern Item Response Theory; Springer: New York, NY, USA, 1997.
2. Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154.
3. Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational Research: Copenhagen, Denmark, 1960.
4. Fischer, G.H.; Molenaar, I.W. (Eds.) Rasch Models. Foundations, Recent Developments, and Applications; Springer: New York, NY, USA, 1995.
5. Kolen, M.J.; Brennan, R.L. Test Equating, Scaling, and Linking; Springer: New York, NY, USA, 2014.
6. Lee, W.C.; Lee, G. IRT linking and equating. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 639–667.
7. Penfield, R.D.; Camilli, G. Differential item functioning and item bias. In Handbook of Statistics, Volume 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Routledge: Oxford, UK, 2007; pp. 125–167.
8. Andrich, D.; Marais, I. A Course in Rasch Measurement Theory; Springer: New York, NY, USA, 2019.
9. Kubinger, K.D. Psychological test calibration using the Rasch model—Some critical suggestions on traditional approaches. Int. J. Test. 2005, 5, 377–394.
10. Linacre, J.M. Understanding Rasch measurement: Estimation methods for Rasch measures. J. Outcome Meas. 1999, 3, 382–405.
11. Linacre, J.M. Rasch model estimation: Further topics. J. Appl. Meas. 2004, 5, 95–110.
12. Rost, J. Was ist aus dem Rasch-Modell geworden? [Where has the Rasch model gone?]. Psychol. Rundsch. 1999, 50, 140–156.
13. Von Davier, M. The Rasch model. In Handbook of Item Response Theory, Volume 1: Models; CRC Press: Boca Raton, FL, USA, 2016; pp. 31–48.
14. Holland, P.W. On the sampling theory foundations of item response theory models. Psychometrika 1990, 55, 577–601.
15. San Martin, E. Identification of item response theory models. In Handbook of Item Response Theory, Volume 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 127–150.
16. Robitzsch, A. A comprehensive simulation study of estimation methods for the Rasch model. Stats 2021, 4, 48.
17. Xu, X.; Jia, Y. The Sensitivity of Parameter Estimates to the Latent Ability Distribution; (Research Report No. RR-11-40); Educational Testing Service: Princeton, NJ, USA, 2011.
18. Zwinderman, A.H.; Van den Wollenberg, A.L. Robustness of marginal maximum likelihood estimation in the Rasch model. Appl. Psychol. Meas. 1990, 14, 73–81.
19. Fischer, G.H. Rasch models. In Handbook of Statistics, Volume 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Routledge: Oxford, UK, 2007; pp. 515–585.
20. San Martin, E.; Rolin, J. Identification of parametric Rasch-type models. J. Stat. Plan. Inference 2013, 143, 116–130.
21. Glas, C.A.W. Maximum-likelihood estimation. In Handbook of Item Response Theory, Volume 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 197–216.
22. Loken, E.; Rulison, K.L. Estimation of a four-parameter item response theory model. Brit. J. Math. Stat. Psychol. 2010, 63, 509–525.
23. Falk, C.F.; Cai, L. Semiparametric item response functions in the context of guessing. J. Educ. Meas. 2016, 53, 229–247.
24. Feuerstahler, L. Flexible item response modeling in R with the flexmet package. Psych 2021, 3, 31.
25. Ramsay, J.O.; Winsberg, S. Maximum marginal likelihood estimation for semiparametric item analysis. Psychometrika 1991, 56, 365–379.
26. Rossi, N.; Wang, X.; Ramsay, J.O. Nonparametric item response function estimates with the EM algorithm. J. Educ. Behav. Stat. 2002, 27, 291–317.
27. Birnbaum, A. Some latent trait models and their use in inferring an examinee's ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479.
28. Battauz, M. Regularized estimation of the four-parameter logistic model. Psych 2020, 2, 20.
29. Culpepper, S.A. The prevalence and implications of slipping on low-stakes, large-scale assessments. J. Educ. Behav. Stat. 2017, 42, 706–725.
30. Camilli, G. IRT scoring and test blueprint fidelity. Appl. Psychol. Meas. 2018, 42, 393–400.
31. Robitzsch, A.; Lüdtke, O. Reflections on analytical choices in the scaling model for test scores in international large-scale assessment studies. PsyArXiv 2021.
32. OECD. PISA 2012. Technical Report; OECD: Paris, France, 2014; Available online: https://bit.ly/2YLG24g (accessed on 30 June 2021).
33. Becker, B.; Weirich, S.; Mahler, N.; Sachse, K.A. Testdesign und Auswertung des IQB-Bildungstrends 2018: Technische Grundlagen [Test design and analysis of the IQB education trend 2018: Technical foundations]. In IQB-Bildungstrend 2018. Mathematische und naturwissenschaftliche Kompetenzen am Ende der Sekundarstufe I im zweiten Ländervergleich; Stanat, P., Schipolowski, S., Mahler, N., Weirich, S., Henschel, S., Eds.; Waxmann: Münster, Germany, 2019; pp. 411–425. Available online: https://bit.ly/3mTvgRX (accessed on 30 June 2021).
34. Pohl, S.; Carstensen, C. NEPS Technical Report–Scaling the Data of the Competence Tests; (NEPS Working Paper No. 14); Otto-Friedrich-Universität, Nationales Bildungspanel: Bamberg, Germany, 2012; Available online: https://bit.ly/2XThQww (accessed on 30 June 2021).
35. Wendt, H.; Bos, W.; Goy, M. On applications of Rasch models in international comparative large-scale assessments: A historical review. Educ. Res. Eval. 2011, 17, 419–446.
36. Hoff, P.; Wakefield, J. Bayesian sandwich posteriors for pseudo-true parameters. J. Stat. Plan. Inference 2013, 10, 1638–1642.
37. Boos, D.D.; Stefanski, L.A. Essential Statistical Inference; Springer: New York, NY, USA, 2013.
38. Sun, Y. Constructing a Misspecified Item Response Model That Yields a Specified Estimate and a Specified Model Misfit Value. Ph.D. Thesis, The Ohio State University, Columbus, OH, USA, 2015. Available online: https://bit.ly/3AGJPgm (accessed on 30 June 2021).
39. White, H. Maximum likelihood estimation of misspecified models. Econometrica 1982, 50, 1–25.
40. Forero, C.G.; Maydeu-Olivares, A. Estimation of IRT graded response models: Limited versus full information methods. Psychol. Methods 2009, 14, 275–299.
41. Bechger, T.M.; Maris, G. A statistical test for differential item pair functioning. Psychometrika 2015, 80, 317–340.
42. Cho, S.J.; Suh, Y.; Lee, W.Y. After differential item functioning is detected: IRT item calibration and scoring in the presence of DIF. Appl. Psychol. Meas. 2016, 40, 573–591.
43. Doebler, A. Looking at DIF from a new perspective: A structure-based approach acknowledging inherent indefinability. Appl. Psychol. Meas. 2019, 43, 303–321.
44. Robitzsch, A.; Lüdtke, O. Mean comparisons of many groups in the presence of DIF: An evaluation of linking and concurrent scaling approaches. J. Educ. Behav. Stat. 2021. Epub ahead of print.
45. Van de Schoot, R.; Kluytmans, A.; Tummers, L.; Lugtig, P.; Hox, J.; Muthén, B. Facing off with Scylla and Charybdis: A comparison of scalar, partial, and the novel possibility of approximate measurement invariance. Front. Psychol. 2013, 4, 770.
46. Frederickx, S.; Tuerlinckx, F.; De Boeck, P.; Magis, D. RIM: A random item mixture model to detect differential item functioning. J. Educ. Meas. 2010, 47, 432–457.
47. Robitzsch, A.; Lüdtke, O. A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psych. Test Assess. Model. 2020, 62, 233–279.
48. De Boeck, P. Random item IRT models. Psychometrika 2008, 73, 533–559.
49. Soares, T.M.; Gonçalves, F.B.; Gamerman, D. An integrated Bayesian model for DIF analysis. J. Educ. Behav. Stat. 2009, 34, 348–377.
50. Pohl, S.; Schulze, D. Assessing group comparisons or change over time under measurement non-invariance: The cluster approach for nonuniform DIF. Psych. Test Assess. Model. 2020, 62, 281–303.
51. Pohl, S.; Schulze, D.; Stets, E. Partial measurement invariance: Extending and evaluating the cluster approach for identifying anchor items. Appl. Psychol. Meas. 2021. Epub ahead of print.
52. Kopf, J.; Zeileis, A.; Strobl, C. Anchor selection strategies for DIF analysis: Review, assessment, and new approaches. Educ. Psychol. Meas. 2015, 75, 22–56.
53. Magis, D.; Béland, S.; Tuerlinckx, F.; De Boeck, P. A general framework and an R package for the detection of dichotomous differential item functioning. Behav. Res. Methods 2010, 42, 847–862.
54. Millsap, R.E. Statistical Approaches to Measurement Invariance; Routledge: New York, NY, USA, 2011.
55. Camilli, G. The case against item bias detection techniques based on internal criteria: Do item bias procedures obscure test fairness issues? In Differential Item Functioning: Theory and Practice; Holland, P.W., Wainer, H., Eds.; Erlbaum: Hillsdale, NJ, USA, 1993; pp. 397–417.
56. Welzel, C.; Inglehart, R.F. Misconceptions of measurement equivalence: Time for a paradigm shift. Comp. Political Stud. 2016, 49, 1068–1094.
57. Welzel, C.; Brunkert, L.; Kruse, S.; Inglehart, R.F. Non-invariance? An overstated problem with misconceived causes. Sociol. Methods Res. 2021. Epub ahead of print.
58. Oliveri, M.E.; von Davier, M. Investigation of model fit and score scale comparability in international assessments. Psych. Test Assess. Model. 2011, 53, 315–333. Available online: https://bit.ly/3k4K9kt (accessed on 30 June 2021).
59. Rutkowski, L.; Svetina, D. Measurement invariance in international surveys: Categorical indicators and fit measure performance. Appl. Meas. Educ. 2017, 30, 39–51.
60. Von Davier, M.; Khorramdel, L.; He, Q.; Shin, H.J.; Chen, H. Developments in psychometric population models for technology-based large-scale assessments: An overview of challenges and opportunities. J. Educ. Behav. Stat. 2019, 44, 671–705.
61. González, J.; Wiberg, M. Applying Test Equating Methods. Using R; Springer: New York, NY, USA, 2017.
62. Sansivieri, V.; Wiberg, M.; Matteucci, M. A review of test equating methods with a special focus on IRT-based approaches. Statistica 2017, 77, 329–352.
63. Von Davier, A.A.; Carstensen, C.H.; von Davier, M. Linking Competencies in Educational Settings and Measuring Growth; (Research Report No. RR-06-12); Educational Testing Service: Princeton, NJ, USA, 2006.
64. Manna, V.F.; Gu, L. Different Methods of Adjusting for Form Difficulty under the Rasch Model: Impact on Consistency of Assessment Results; (Research Report No. RR-19-08); Educational Testing Service: Princeton, NJ, USA, 2019.
65. Jureckova, J.; Picek, J. Robust Statistical Methods with R; CRC Press: Boca Raton, FL, USA, 2019.
66. Huber, P.J.; Ronchetti, E.M. Robust Statistics; Wiley: New York, NY, USA, 2009.
67. Maronna, R.A.; Martin, R.D.; Yohai, V.J. Robust Statistics: Theory and Methods; Wiley: New York, NY, USA, 2006.
68. Ronchetti, E. The main contributions of robust statistics to statistical science and a new challenge. Metron 2021, 79, 127–135.
69. Magis, D.; De Boeck, P. Identification of differential item functioning in multiple-group settings: A multivariate outlier detection approach. Multivar. Behav. Res. 2011, 46, 733–755.
70. Magis, D.; De Boeck, P. A robust outlier approach to prevent type I error inflation in differential item functioning. Educ. Psychol. Meas. 2012, 72, 291–311.
71. Rusiecki, A. Robust learning algorithm based on LTA estimator. Neurocomputing 2013, 120, 624–632.
72. Wilcox, R. Modern Statistics for the Social and Behavioral Sciences: A Practical Introduction; CRC Press: Boca Raton, FL, USA, 2017.
73. Yuan, K.H.; Bentler, P.M.; Chan, W. Structural equation modeling with heavy tailed distributions. Psychometrika 2004, 69, 421–436.
74. Yuan, K.H.; Zhang, Z. Structural equation modeling diagnostics using R package semdiag and EQS. Struct. Equ. Model. 2012, 19, 683–702.
75. Kalina, J. Implicitly weighted methods in robust image analysis. J. Math. Imaging Vis. 2012, 44, 449–462.
76. Fox, J. Applied Regression Analysis and Generalized Linear Models; Sage: Thousand Oaks, CA, USA, 2016.
77. Hampel, F.R.; Ronchetti, E.M.; Rousseeuw, P.J.; Stahel, W.A. Robust Statistics: The Approach Based on Influence Functions; Wiley: New York, NY, USA, 1986.
78. Asparouhov, T.; Muthén, B. Multiple-group factor analysis alignment. Struct. Equ. Model. 2014, 21, 495–508.
79. Muthén, B.; Asparouhov, T. IRT studies of many groups: The alignment method. Front. Psychol. 2014, 5, 978.
80. Pokropek, A.; Lüdtke, O.; Robitzsch, A. An extension of the invariance alignment method for scale linking. Psych. Test Assess. Model. 2020, 62, 303–334.
81. Robitzsch, A. Lp loss functions in invariance alignment and Haberman linking with few or many groups. Stats 2020, 3, 19.
82. Muthén, B.; Asparouhov, T. Recent methods for the study of measurement invariance with many groups: Alignment and random effects. Sociol. Methods Res. 2018, 47, 637–664.
83. Pokropek, A.; Davidov, E.; Schmidt, P. A Monte Carlo simulation study to assess the appropriateness of traditional and newer approaches to test for measurement invariance. Struct. Equ. Model. 2019, 26, 724–744.
84. Haebara, T. Equating logistic ability scales by a weighted least squares method. Jpn. Psychol. Res. 1980, 22, 144–149.
85. He, Y.; Cui, Z.; Osterlind, S.J. New robust scale transformation methods in the presence of outlying common items. Appl. Psychol. Meas. 2015, 39, 613–626.
86. He, Y.; Cui, Z. Evaluating robust scale transformation methods with multiple outlying common items under IRT true score equating. Appl. Psychol. Meas. 2020, 44, 296–310.
87. Robitzsch, A. Robust Haebara linking for many groups: Performance in the case of uniform DIF. Psych 2020, 2, 14.
88. Strobl, C.; Kopf, J.; Kohler, L.; von Oertzen, T.; Zeileis, A. Anchor point selection: Scale alignment based on an inequality criterion. Appl. Psychol. Meas. 2021, 45, 214–230.
89. Monseur, C.; Berezner, A. The computation of equating errors in international surveys in education. J. Appl. Meas. 2007, 8, 323–335.
90. Monseur, C.; Sibberns, H.; Hastedt, D. Linking errors in trend estimation for international surveys in education. IERI Monogr. Ser. 2008, 1, 113–122.
91. Robitzsch, A.; Lüdtke, O. Linking errors in international large-scale assessments: Calculation of standard errors for trend estimation. Assess. Educ. 2019, 26, 444–465.
92. Sachse, K.A.; Haag, N. Standard errors for national trends in international large-scale assessments in the case of cross-national differential item functioning. Appl. Meas. Educ. 2017, 30, 102–116.
93. Wu, M. Measurement, sampling, and equating errors in large-scale assessments. Educ. Meas. 2010, 29, 15–27.
94. Jaeckel, L.A. Robust estimates of location: Symmetry and asymmetric contamination. Ann. Math. Stat. 1971, 42, 1020–1034.
95. Xu, X.; Chen, X. A practical method of robust estimation in case of asymmetry. J. Stat. Theory Pract. 2018, 12, 370–396.
96. Stefanski, L.A.; Boos, D.D. The calculus of M-estimation. Am. Stat. 2002, 56, 29–38.
97. Huber, P.J. Robust estimation of a location parameter. Ann. Math. Stat. 1964, 35, 73–101.
98. Simakhin, V.A.; Shamanaeva, L.G.; Avdyushina, A.E. Robust parametric estimates of heterogeneous experimental data. Russ. Phys. J. 2021, 63, 1510–1518.
99. Hunter, J.E. Probabilistic foundations for coefficients of generalizability. Psychometrika 1968, 33, 1–18.
100. Husek, T.R.; Sirotnik, K. Item Sampling in Educational Research; CSEIP Occasional Report No. 2; University of California: Los Angeles, CA, USA, 1967; Available online: https://bit.ly/3k47t1s (accessed on 30 June 2021).
101. Yuan, K.H.; Cheng, Y.; Patton, J. Information matrices and standard errors for MLEs of item parameters in IRT. Psychometrika 2014, 79, 232–254.
102. Kolenikov, S. Resampling variance estimation for complex survey data. Stata J. 2010, 10, 165–199.
103. Rao, J.N.K.; Wu, C.F.J. Resampling inference with complex survey data. J. Am. Stat. Assoc. 1988, 83, 231–241.
104. Brennan, R.L. Generalizability Theory; Springer: New York, NY, USA, 2001.
105. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; CRC Press: Boca Raton, FL, USA, 1994.
106. Haberman, S.J.; Lee, Y.H.; Qian, J. Jackknifing Techniques for Evaluation of Equating Accuracy; (Research Report No. RR-09-02); Educational Testing Service: Princeton, NJ, USA, 2009.
107. Rao, J.N.K.; Wu, C.F.J. Inference from stratified samples: Second-order analysis of three methods for nonlinear statistics. J. Am. Stat. Assoc. 1985, 80, 620–630.
108. Xu, X.; von Davier, M. Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study; (Research Report No. RR-10-10); Educational Testing Service: Princeton, NJ, USA, 2010.
109. Battauz, M. Multiple equating of separate IRT calibrations. Psychometrika 2017, 82, 610–636.
110. Michaelides, M.P.; Haertel, E.H. Selection of common items as an unrecognized source of variability in test equating: A bootstrap approximation assuming random sampling of common items. Appl. Meas. Educ. 2014, 27, 46–57.
111. Tong, Y.; Brennan, R.L. Bootstrap estimates of standard errors in generalizability theory. Educ. Psychol. Meas. 2007, 67, 804–817.
112. R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2020; Available online: https://www.R-project.org/ (accessed on 20 August 2020).
113. Robitzsch, A. sirt: Supplementary Item Response Theory Models. R package version 3.10-111; 2021; Available online: https://github.com/alexanderrobitzsch/sirt (accessed on 25 June 2021).
114. DeMars, C.E. Alignment as an alternative to anchor purification in DIF analyses. Struct. Equ. Model. 2020, 27, 56–72.
115. Chen, Y.; Li, C.; Xu, G. DIF statistical inference and detection without knowing anchoring items. arXiv 2021, arXiv:2110.11112. Available online: https://arxiv.org/abs/2110.11112 (accessed on 21 October 2021).
116. Kopf, J.; Zeileis, A.; Strobl, C. A framework for anchor methods and an iterative forward approach for DIF detection. Appl. Psychol. Meas. 2015, 39, 83–103.
117. Tutz, G.; Schauberger, G. A penalty approach to differential item functioning in Rasch models. Psychometrika 2015, 80, 21–43.
118. Yuan, K.H.; Liu, H.; Han, Y. Differential item functioning analysis without a priori information on anchor items: QQ plots and graphical test. Psychometrika 2021, 86, 345–377.
119. Robitzsch, A. A comparison of linking methods for two groups for the two-parameter logistic item response model in the presence and absence of random differential item functioning. Foundations 2021, 1, 9.
120. Andersson, B. Asymptotic variance of linking coefficient estimators for polytomous IRT models. Appl. Psychol. Meas. 2018, 42, 192–205.
121. Von Davier, M.; Yamamoto, K.; Shin, H.J.; Chen, H.; Khorramdel, L.; Weeks, J.; Davis, S.; Kong, N.; Kandathil, M. Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assess. Educ. 2019, 26, 466–488.
122. Glas, C.A.W.; Jehangir, M. Modeling country-specific differential functioning. In A Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis; Rutkowski, L., von Davier, M., Rutkowski, D., Eds.; Chapman Hall/CRC Press: London, UK, 2013; pp. 97–115.
123. Albano, A.D.; Wiberg, M. Linking with external covariates: Examining accuracy by anchor type, test length, ability difference, and sample size. Appl. Psychol. Meas. 2019, 43, 597–610.
124. Sansivieri, V.; Wiberg, M. Linking scales in item response theory with covariates. J. Res. Educ. Sci. Technol. 2018, 3, 12–32. Available online: https://bit.ly/3ze7qEF (accessed on 30 June 2021).
125. Wu, H.; Browne, M.W. Quantifying adventitious error in a covariance structure as a random effect. Psychometrika 2015, 80, 571–600.
Table 1. Variance proportions of different factors in the simulation study for bias and RMSE in the condition of no differential item functioning (DIF).
Source       Bias   RMSE
N            31.9    0.6
I             2.3    1.4
Meth         23.1   95.0
N × I        20.2    0.1
N × Meth     14.7    2.2
I × Meth      1.3    0.6
Residual      6.6    0.2
Note. N = sample size; I = number of items; Meth = linking method. Percentage values larger than 1.0 are printed in bold.
Table 2. Variance proportions of different factors in the simulation study for bias and RMSE for balanced and unbalanced differential item functioning (DIF).
                               BAL              UNBAL
Source                     Bias   RMSE      Bias   RMSE
N                           0.3    5.5       1.1    9.8
I                           0.2    0.1       0.0    0.2
Meth                       29.7   39.5       7.5   16.5
π_bias                      0.2    5.7      40.2    1.9
τ_AMI                       6.9    9.2      20.6    0.7
Dist                        3.4    4.8       2.1    0.4
N × I                       0.0    0.1       0.0    0.3
N × Meth                    1.0    4.5       0.8    8.1
N × π_bias                  0.1    1.2       0.5    2.5
N × τ_AMI                   0.7    1.0       0.3    2.3
N × Dist                    0.6    1.9       0.2    0.2
I × Meth                    0.7    0.2       0.1    0.3
I × π_bias                  0.0    0.0       0.0    0.3
I × τ_AMI                   0.1    0.1       0.1    0.1
I × Dist                    0.0    0.0       0.0    0.0
Meth × π_bias               3.1    5.2       1.4    5.3
Meth × τ_AMI               12.8    5.4       5.4   13.9
Meth × Dist                 5.6    1.8       0.4    0.8
π_bias × τ_AMI              0.2    0.9      10.1    3.1
π_bias × Dist               0.6    0.0       0.5    0.7
τ_AMI × Dist                2.1    1.9       1.7    1.2
N × I × Meth                0.3    0.1       0.0    0.2
N × I × π_bias              0.0    0.0       0.0    0.1
N × I × τ_AMI               0.1    0.1       0.0    0.1
N × I × Dist                0.0    0.0       0.0    0.0
N × Meth × π_bias           1.0    3.2       0.4    4.8
N × Meth × τ_AMI            2.9    1.3       0.4    9.7
N × Meth × Dist             1.3    0.6       0.0    0.4
N × π_bias × τ_AMI          0.2    0.1       0.3    2.0
N × π_bias × Dist           0.2    0.0       0.0    0.5
N × τ_AMI × Dist            0.5    0.4       0.1    0.4
I × Meth × π_bias           0.3    0.0       0.0    0.1
I × Meth × τ_AMI            0.9    0.4       0.1    0.3
I × Meth × Dist             0.4    0.1       0.0    0.0
I × π_bias × τ_AMI          0.0    0.0       0.1    0.0
I × π_bias × Dist           0.0    0.0       0.0    0.0
I × τ_AMI × Dist            0.1    0.0       0.0    0.0
Meth × π_bias × τ_AMI       2.8    0.9       3.5    4.6
Meth × π_bias × Dist        3.3    0.2       0.2    0.4
Meth × τ_AMI × Dist         4.7    0.8       0.5    0.3
π_bias × τ_AMI × Dist       0.6    0.0       0.4    0.2
Residual                   12.2    2.6       1.0    7.7
Note. BAL = balanced DIF; UNBAL = unbalanced DIF; N = sample size; I = number of items; Meth = linking method; Dist = distribution of DIF effects; π_bias = proportion of biased DIF items; τ_AMI = standard deviation of DIF effects of reference items. Percentage values larger than 1.0 are printed in bold.
Table 3. Summary of satisfactory performance of linking methods for bias and RMSE for no, balanced and unbalanced differential item functioning (DIF).
                       Bias                     RMSE
Method       NODIF    BAL  UNBAL      NODIF    BAL  UNBAL
MM             100    100      0        100    100     57
EL(0.4)        100    100     58        100     68     87
ATR(0.2)       100    100     46        100     56     71
ATR(0.4)       100    100     58        100     57     85
BSQ(0.4)       100    100     72        100     35     69
IA(1)          100    100     45        100     75     73
IA(0.5)        100    100     61        100     33     59
IA(0.25)       100    100     69         88     13     35
IA(0.1)        100    100     70         75      5     14
GI(1)          100     95     63          0      0      1
GI(2)          100     96     61          0      3     18
HAE(2)         100    100      0        100    100     58
HAE(1)         100    100     46        100     69     72
HAE(0.5)       100    100     54        100     35     54
HAE(0.25)      100    100     49        100     12     24
HAE(0.1)       100     99     41         63      2      4
Note. NODIF = no DIF; BAL = balanced DIF; UNBAL = unbalanced DIF; MM = mean-mean linking; EL = elimination of DIF items with subsequent mean-mean linking; ATR = asymmetrically trimmed mean linking; BSQ = bisquare kernel linking; IA = invariance alignment; GI = Gini linking; HAE = Haebara linking. Percentage values larger than 67 are printed in bold.
Table 4. RMSE for balanced differential item functioning (DIF) for I = 40 items as a function of sample sizes (N), proportion of biased items ( π bias ), and standard deviation of DIF effects of reference items ( τ AMI ).
τ_AMI                       0                                 0.2
π_bias            0        0.1        0.3          0         0.1        0.3
N            250 1000   250 1000   250 1000   250 1000   250 1000   250 1000
Method
MM           100  100   100  100   100  100   100  100   100  100   100  100
EL(0.4)      102  100   104  102   109  108   108  110   110  115   119  135
ATR(0.2)     105  106   104  105   109  129   108  116   108  116   119  160
ATR(0.4)     107  107   106  108   109  108   111  120   111  124   119  133
BSQ(0.4)     110  101   112  102   126  105   132  133   136  141   164  171
IA(1)        103  104   103  104   106  108   105  110   106  113   111  125
IA(0.5)      109  108   111  111   121  116   118  129   121  133   137  160
IA(0.25)     115  112   118  113   133  118   129  141   136  151   160  188
IA(0.1)      121  111   126  115   147  122   142  155   151  168   182  215
GI(1)        173  153   157  127   181  238   213  254   214  236   282  396
GI(2)        157  154   133  112   162  151   188  232   186  207   251  379
HAE(2)       100  100   100  100   100  102   100  100   100  101   101  102
HAE(1)       103  104   104  105   107  109   106  112   107  116   113  130
HAE(0.5)     110  112   111  114   117  118   115  127   118  133   135  158
HAE(0.25)    114  115   116  119   128  125   127  143   130  148   154  182
HAE(0.1)     118  118   123  125   137  135   137  153   141  164   172  207
Note. MM = mean-mean linking; EL = elimination of DIF items with subsequent mean-mean linking; ATR = asymmetrically trimmed mean linking; BSQ = bisquare kernel linking; IA = invariance alignment; GI = Gini linking; HAE = Haebara linking. RMSE values smaller than 125 are printed in bold.
Table 5. Bias and RMSE for unbalanced differential item functioning (DIF) for I = 40 items as a function of sample size (N), proportion of biased items (π_bias), and standard deviation of DIF effects of reference items (τ_AMI).

Bias
                      τ_AMI = 0                       τ_AMI = 0.2
             π_bias=0.1      π_bias=0.3      π_bias=0.1      π_bias=0.3
Method       250     1000    250     1000    250     1000    250     1000
MM           −0.06   −0.06   −0.18   −0.18   −0.06   −0.06   −0.18   −0.18
EL(0.4)      −0.02    0.00   −0.08   −0.02   −0.03   −0.01   −0.14   −0.10
ATR(0.2)     −0.02   −0.01   −0.10   −0.06   −0.03   −0.01   −0.15   −0.12
ATR(0.4)     −0.02   −0.01   −0.08   −0.02   −0.03   −0.02   −0.14   −0.09
BSQ(0.4)      0.00    0.00   −0.03    0.00   −0.02    0.00   −0.11   −0.02
IA(1)        −0.03   −0.02   −0.12   −0.06   −0.04   −0.03   −0.16   −0.13
IA(0.5)      −0.02   −0.01   −0.08   −0.03   −0.03   −0.02   −0.13   −0.09
IA(0.25)     −0.01   −0.01   −0.06   −0.02   −0.02   −0.02   −0.12   −0.06
IA(0.1)      −0.01   −0.01   −0.05   −0.02   −0.01   −0.01   −0.12   −0.06
GI(1)         0.03    0.02    0.01    0.01   −0.01    0.03   −0.11   −0.11
GI(2)         0.04    0.02    0.00    0.01    0.00    0.05   −0.13   −0.13
HAE(2)       −0.06   −0.06   −0.18   −0.18   −0.06   −0.06   −0.18   −0.18
HAE(1)       −0.03   −0.02   −0.12   −0.06   −0.04   −0.03   −0.15   −0.13
HAE(0.5)     −0.02   −0.01   −0.08   −0.03   −0.03   −0.02   −0.15   −0.10
HAE(0.25)    −0.02   −0.01   −0.08   −0.04   −0.03   −0.02   −0.16   −0.12
HAE(0.1)     −0.02   −0.01   −0.08   −0.06   −0.03   −0.02   −0.17   −0.14

RMSE
                      τ_AMI = 0                       τ_AMI = 0.2
             π_bias=0.1      π_bias=0.3      π_bias=0.1      π_bias=0.3
Method       250     1000    250     1000    250     1000    250     1000
MM           109     157     154     382     100     121     107     163
EL(0.4)      100     100     104     115     101     100     100     114
ATR(0.2)     101     103     108     161     100     101     101     127
ATR(0.4)     104     108     105     112     103     110     100     106
BSQ(0.4)     108     100     100     100     123     120     118     100
IA(1)        102     108     119     169     100     108     102     129
IA(0.5)      109     113     109     131     109     119     105     112
IA(0.25)     117     117     111     134     123     136     113     118
IA(0.1)      127     126     116     140     136     153     120     132
GI(1)        150     132     133     173     194     213     162     255
GI(2)        128     110     117     102     167     175     153     251
HAE(2)       109     156     154     380     101     121     108     163
HAE(1)       103     108     118     164     101     108     102     128
HAE(0.5)     108     113     111     132     109     119     108     124
HAE(0.25)    113     118     120     232     120     133     122     159
HAE(0.1)     119     123     132     342     130     148     132     195
Note. MM = mean-mean linking; EL = elimination of DIF items with subsequent mean-mean linking; ATR = asymmetrically trimmed mean linking; BSQ = bisquare kernel linking; IA = invariance alignment; GI = Gini linking; HAE = Haebara linking; Absolute bias values smaller than 0.05 and RMSE values smaller than 125 are printed in bold.
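The bias pattern in Table 5 can be illustrated with a toy computation (my own simplified setup, not the study's data-generating model): with unbalanced DIF, a fraction π_bias of the item-wise difficulty differences is shifted in the same direction, so mean-mean linking absorbs roughly π_bias times the DIF effect size as bias, while a symmetric trimmed mean, used here as a simplified stand-in for the robust methods (ATR, BSQ, IA), discards the extreme differences first and retains much less bias.

```python
import random

def mean_mean_link(diffs):
    """Nonrobust mean-mean linking: average of the item-wise differences."""
    return sum(diffs) / len(diffs)

def trimmed_link(diffs, trim=0.2):
    """Robust alternative: drop the trim share of smallest and largest
    item-wise differences before averaging."""
    s = sorted(diffs)
    k = int(len(s) * trim)
    core = s[k:len(s) - k] if k > 0 else s
    return sum(core) / len(core)

random.seed(1)
true_mu, n_items, pi_bias, dif_size = 0.3, 40, 0.3, 0.6
diffs = [true_mu
         + (dif_size if i < int(pi_bias * n_items) else 0.0)  # unbalanced DIF
         + random.gauss(0, 0.05)                              # estimation noise
         for i in range(n_items)]

# Mean-mean linking absorbs roughly pi_bias * dif_size = 0.18 as bias,
# matching the -0.18 entries for MM in Table 5; trimming removes most of it.
print(f"mean-mean bias: {mean_mean_link(diffs) - true_mu:+.3f}")
print(f"trimmed   bias: {trimmed_link(diffs) - true_mu:+.3f}")
```

With 20% trimming and π_bias = 0.3, some shifted items survive the trim, so the bias shrinks but does not vanish, mirroring the residual bias of the robust methods in the table.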
Table 6. Coverage rates for linking methods for balanced differential item functioning for I = 40 items as a function of sample size (N) and the proportion of biased items (π_bias).

                          N = 500                      N = 2000
Method          SJK    DJK    SHS    DHS      SJK    DJK    SHS    DHS
No biased items (π_bias = 0)
MM              94.8   92.2   95.1   93.6     93.9   91.8   94.4   93.6
ATR(0.2)        97.6   95.5   96.5   93.5     95.0   84.0   95.3   93.4
ATR(0.4)        98.9   97.4   96.8   93.7     95.9   89.6   95.4   92.7
IA(1)           98.2   96.5   96.5   94.3     93.7   89.3   94.9   93.1
IA(0.5)         99.8   99.6   97.7   94.2     98.2   97.3   97.2   93.9
IA(0.25)        99.8   99.7   98.8   95.1     99.6   99.1   98.5   94.0
HAE(2)          94.9   91.8   95.1   93.9     93.9   91.3   94.1   93.5
HAE(1)          98.5   97.9   96.9   94.3     95.2   91.9   95.3   92.8
HAE(0.5)        99.8   99.8   98.2   95.9     99.2   98.2   97.7   94.5
HAE(0.25)       99.8   99.8   98.9   95.5     99.8   99.1   98.0   93.6
Biased items (π_bias = 0.3)
MM              95.8   93.1   95.5   94.6     97.2   96.1   97.1   96.5
ATR(0.2)        97.8   95.3   97.1   94.2     93.6   83.8   97.1   95.9
ATR(0.4)        98.2   95.8   96.9   93.4     91.9   83.3   95.5   93.9
IA(1)           97.8   95.8   96.5   94.2     93.4   87.7   96.8   95.8
IA(0.5)         99.2   99.1   97.8   93.9     96.0   93.6   97.5   95.2
IA(0.25)        99.5   99.3   98.0   95.1     98.1   96.5   97.5   94.1
HAE(2)          95.6   93.0   96.1   94.6     97.1   96.1   97.0   96.4
HAE(1)          98.1   97.4   96.7   94.1     95.0   91.5   96.9   95.5
HAE(0.5)        99.5   99.2   98.0   95.1     97.7   95.4   98.2   96.3
HAE(0.25)       99.9   99.7   98.3   95.5     99.6   98.0   98.6   95.8
Note. SJK = single jackknife (Equation (80)); DJK = double jackknife (Equation (92)); SHS = single half sampling (Equation (101)); DHS = double half sampling (Equation (105)); MM = mean-mean linking; ATR = asymmetrically trimmed mean linking; IA = invariance alignment; HAE = Haebara linking; Coverage rates within the interval [92.5, 97.5] are printed in bold.
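The resampling estimators compared in Table 6 operate at the item level: the jackknife deletes one item at a time, while half sampling recomputes the linking estimate on random halves of the item set; the double variants additionally resample persons to capture standard errors alongside linking errors. A minimal sketch of the two single (item-level) versions applied to mean-mean linking, using my own simplified formulas (the delete-one jackknife variance and a half-sampling variance of the balanced-repeated-replication type), not necessarily the exact Equations (80) and (101):

```python
import random

def jackknife_se(item_stats, estimator):
    """Single jackknife over items: recompute the estimate with each
    item deleted once; SE^2 = (I-1)/I * sum_i (theta_(-i) - theta_bar)^2."""
    I = len(item_stats)
    loo = [estimator(item_stats[:i] + item_stats[i + 1:]) for i in range(I)]
    mean_loo = sum(loo) / I
    return ((I - 1) / I * sum((t - mean_loo) ** 2 for t in loo)) ** 0.5

def half_sampling_se(item_stats, estimator, n_rep=500, seed=0):
    """Single half sampling over items: recompute the estimate on random
    halves; SE^2 = average squared deviation from the full estimate."""
    rng = random.Random(seed)
    full = estimator(item_stats)
    reps = [estimator(rng.sample(item_stats, len(item_stats) // 2))
            for _ in range(n_rep)]
    return (sum((t - full) ** 2 for t in reps) / n_rep) ** 0.5

# Item-wise difficulty differences (hypothetical); mean-mean linking
# estimates the group mean difference as their average.
mean = lambda xs: sum(xs) / len(xs)
diffs = [0.21, 0.35, 0.28, 0.41, 0.30, 0.25, 0.38, 0.22]
print(jackknife_se(diffs, mean))      # linking error via jackknife
print(half_sampling_se(diffs, mean))  # linking error via half sampling
```

For a plain mean, both estimators target the same quantity (the standard error of the mean over items), so their values agree up to Monte Carlo noise; they diverge for robust, nonsmooth estimators, which is where Table 6 shows half sampling outperforming the jackknife.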