Robust Haebara Linking for Many Groups in the Case of Partial Invariance

: The comparison of group means in item response models constitutes an important issue in empirical research. The present article discusses an extension of Haebara linking by proposing a ﬂexible class of robust linking functions for comparisons of many groups. These robust linking functions are particularly suited to item response data that are generated under partial invariance. In a simulation study, it is shown that the newly proposed robust Haebara linking approach outperforms existing approaches of Haebara linking. In an empirical application using PISA data, it is illustrated that country means can be sensitive to the choice of linking functions.


Introduction
One major goal of empirical studies in psychology and education is to compare cognitive outcomes across many groups. For example, the programme for international student assessment (PISA; [1]) provides international comparisons of student performance for a large group of countries (72 countries in PISA 2015). A major obstacle to these comparisons is that cognitive tests often show differential item functioning (DIF; [2]).
In this article, we propose a robust variant of Haebara linking [3] for many groups. We use a two-parameter logistic model (2PL) item response model to introduce the methodology. It is shown that approximately unbiased group comparisons can be conducted with robust Haebara linking when group-specific subsets of items show DIF (i.e., partial invariance). Importantly, no additional steps for identifying items with DIF are needed; items that possess DIF are essentially treated as outliers [4,5] in the linking procedure.
The paper is structured as follows. Section 2 describes the 2PL model under partial invariance. Section 3 introduces the robust Haebara linking method. In Section 4, the proposed method is evaluated in a simulation study. Section 5 presents an empirical example of PISA data. Finally, Section 6 concludes with a discussion that focuses on limitations and potential gaps for future research.

2PL Model with Partial Invariance
In the following, we introduce the concept of partial invariance for multiple groups. For G groups (g = 1, . . . , G), I items (i = 1, . . . , I) are administered. It is assumed that a unidimensional item response model holds in each group with group-specific item response functions (IRF) P ig (θ), indicating the probability of a correct item response X ig , conditional on ability θ. The IRFs in the 2PL model [6] are given as P(X ig = 1|θ g ) = Ψ(a ig (θ g − b ig )) , θ g ∼ N(µ g , σ 2 g ) , where b ig are group-specific item difficulties for item i (i = 1, . . . , I) in group g (g = 1, . . . , G), and a ig are group-specific item loadings. In this article, we assume uniform DIF [2] that presupposes that item loadings are invariant across groups, i.e., a i1 = . . . = a iG ≡ a i . Group-specific item difficulties are decomposed into b ig = b i + e ig , where b i indicates common item difficulties and e ig are denoted as uniform DIF effects. In Equation 1, Ψ denotes the logistic distribution function, and it is assumed that the abilities within each group g are normally distributed with mean µ g and standard deviation σ g . It is well known that not all DIF effects e ig and group means µ g can be simultaneously identified in the 2PL model (see [7,8]). To resolve the identification issue, the set of items for each group is partitioned into two distinct sets (see [9]). More specifically, we assume that for each group g, a subset of so-called anchor items J A,g ⊂ J = {1, . . . , I} exists such that e ig = 0 for all i ∈ J A,g . The set of biased items is defined as J B,g = J \ J A,g . Biased items are allowed to possess DIF effects e ig = 0, which differs from zero. This situation is also referred in the literature as partial invariance [10,11]. If there are no biased items, all item parameters are invariant, which is denoted as full invariance. One central argument in the DIF literature is that items with DIF effects have the potential to bias the estimated ability distributions (i.e., group means or group standard deviations) and should, therefore, not be included in group comparisons (e.g., [1], for arguments in the PISA study, or [12]). Biased estimates of group means can be particularly expected in the case that all DIF effects of items within a group have the same sign (i.e., unbalanced DIF).
In practice, it is not known which items serve as anchor items for group g. The choice can be based on a substantive basis (e.g., considerations outside of psychometrics, see [13]) or using psychometric methods. In this article, the identification of group means and group standard deviations is conducted using psychometric methods, namely linking methods (see [14][15][16][17][18] for overviews). Linking methods rely on separate scalings for all groups. In more detail, the 2PL model is fitted for each group (under the assumption θ ∼ N(0, 1)), resulting in estimated item loadingsâ g and estimated item interceptsb g for all groups. In the second step, estimated parameters (â g ,b g ) are used to determine the vector of group means µ = (µ 1 , . . . , µ G ) and σ = (σ 1 , . . . , σ G ) standard deviations.

Haebara Linking
In this section, we introduce the robust Haebara linking method that determines group means µ, group standard deviations σ, common item slopes a = (a 1 , . . . , a I ), and common item difficulties b = (b 1 , . . . , b I ) based on estimated item loadingsâ g and estimated item interceptsb g for all groups g. A linking function H is employed that minimizes the distances between group-specific IRFs and aligned common IRFs for computing unknown parameters (µ, σ, a, b) where ρ is a loss function, and ω is a weighting function that fulfills ω(θ)dθ = 1. In all subsequent analyses, we choose the standard normal density function as the weighting function ω. Linking based on the function H in Equation 2 is referred to as robust Haebara linking and generalizes the originally proposed Haebara linking method for two groups [3] that uses the loss function ρ(x) = x 2 . He and colleagues [19,20] considered the loss function ρ(x) = |x| for two groups. Haebara linking for multiple groups was investigated in several articles [9,[21][22][23]. In particular, it was shown in [9] that the loss function ρ(x) = |x| was efficient in handling the situation of partial invariance for multiple groups.
In this article, we consider the class of loss function ρ(x) = |x| p with nonnegative power values p. In Figure 1, the loss functions for different values of p are shown. It can be seen that p = 1 and p = 2 put different weights for values near zero. In the limiting case of p → 0, ρ(x) is the step function that takes the value 0 if x is zero, and 1 for all other x values. With ρ(x) = |x| p for very small p values (e.g., p = 0.02) in Equation 2, the linking function essentially counts the number of events in which the group-specific IRF deviates from the aligned common IRF. In the minimization of H defined in Equation 2, the unknown parameters can be obtained by setting the first derivatives to 0, i.e., ∂H ∂µ = 0, ∂H ∂σ = 0, ∂H ∂a = 0, and ∂H ∂b = 0. However, the loss function ρ(x) = |x| p is not differentiable for p ≤ 1, and the first derivative must be replaced by a subdifferential. Moreover, due to nondifferentiability of ρ, standard optimization algorithms that rely on derivatives cannot be used. However, in robust Haebara linking, the function ρ(x) = |x| p is replaced by a differentiable approximating function ρ D (x) = (x 2 + ε) p/2 using a small ε > 0 (e.g., ε = .001). Because ρ D is differentiable, quasi-Newton minimization approaches can be used that are implemented in standard optimizers in R [24]. The implementation of robust Haebara linking in the sirt [25] package specifies a sequence of decreasing values of ε in the optimization, each using the previous solution as initial values (see [26] for a similar approach).
It should be noted that there are competitive linking methods to Haebara linking. The Stocking-Lord method [27] minimizes the difference of the integrated squared difference of the sum of group-specific IRFs and the sum of aligned common IRFs. There are also alternative linking approaches that directly rely on estimated item parameters instead of IRFs, such as mean-mean linking [16], Haberman linking based on regression modeling [28], invariance alignment [29], and distance-based measures (like χ 2 ; [30,31]), to name a few. For Haberman linking and invariance alignment, robust alternatives were recently studied [9,32,33]. The linking approach is a two-step method as separate scalings are applied group-wise in the first step. However, it can be shown that one can reformulate the two-step estimation problem as a one-step estimation problem with side conditions [34].

Estimated Group Means as a Function of DIF Effects
Now, assume that the vector of joint item parameters a and b and group standard deviations σ are already identified. We now investigate the estimated group meanμ g and use the part in Equation 2 that relates to the group mean µ g . The estimateμ g can be determined as By using two Taylor approximations, we can formulate the estimated group meanμ g as a function of the true mean µ g and weighted DIF effects e ig . For p = 1, we get (see Appendix A; Equation A11) where w ig = |e ig | p−2 W i (θ)ω(θ)dθ, and W i is the information function of item i. The item-specific weights w ig consist of two factors. First, the factor |e ig | p−2 governs the influence of DIF effects. Items with large DIF effects e ig are down-weighted for p < 2. Second, the factor W i (θ)ω(θ)dθ is the integrated information function with respect to ω. The influence of this factor is largest for items with large item loadings a i and item difficulties b i that are located in the center of the ability distribution. We now consider two important special cases of Equation 4. For p = 2, we obtain the Haebara linking proposed in [3], and it holds that All DIF effects are weighted according to their item information function. There is no down-weighting of large DIF effects. For p = 0, the estimated group meansμ g are heavily influenced by items with small DIF effects because items large DIF effects e ig are strongly down-weighted. To illustrate the behavior in the situation of partial invariance, let us assume that for the DIF effects of anchor items i ∈ J A,g it holds that |e ig | ≈ ε with a small ε > 0 and the DIF effects of biased items i ∈ J B,g are much larger than zero in absolute value, i.e., |e ig | > e 0 . The weights in Equation 4 are computed as w ig = |e ig | −2 W i (θ)ω(θ)dθ. Further, assume for simplicity that W ≈ W i (θ)ω(θ)dθ, meaning that the integrated information is nearly constant across items. For biased items, we get w ig < e −2 0 W, and for anchor items it holds that w ig = ε −2 W. Inserting these relations in Equation 4, we obtain for the bias in estimated group means By letting ε → 0 in Equation 6, we obtain and the bias is determined by the DIF effects of the anchor items. As the DIF effects of anchor items were assumed to be small in the derivation, we get unbiased estimated group means in the situation of partial invariance.
In the case of p = 1 (as proposed in [19,20]), it can be shown that the bias in estimated group means in robust Haebara linking is a weighted median of DIF effects (see Equation A14 in Appendix A.5).

Simulation Design
In this study, we generated dichotomous item responses and investigated the performance of robust Haebara linking for the 2PL model. We adopted a simulation design that was used in [9]. We simulated item responses from a 2PL model for G = 9 groups. For each group g, abilities were normally distributed with mean µ g and standard deviation σ g . Across all conditions and replications of the simulation, the group means and standard deviations were held fixed (see Appendix B for values used in the simulation). The total population comprising all groups had a mean of 0 and a standard deviation of 1.
Item responses X ig for item i in group g were simulated according to the 2PL model where DIF effects in item difficulties were defined as e ig = Z ig δ. The DIF indicator variables Z ig had values of 0, 1, or −1, where values different from zero indicated uniform DIF effects. For each country, either all nonzero Z ig values were 1 or were −1, meaning that all DIF effects had the same direction (i.e., unbalanced DIF). Item loadings a i were assumed to be invariant across groups. The DIF effect size was chosen as δ = 0.6. A fixed proportion π B of biased items was selected and was equal across groups, i.e., ∑ I i=1 |Z ig | = Iπ B for all groups g = 1, . . . , G. For example, if 30% out of I = 20 items have DIF effects, 6 items have values of Z ig that differ from zero. The item parameters were held constant across conditions and replications (see Appendix B for data-generating parameters). In total, I = 20 items were used in the simulation.
For each condition of the simulation design, R = 300 replications were generated. We manipulated the number of persons per group (N = 250, 500, 1000, and 5000). We also varied the proportion π B of biased items with DIF effects (0%, 10%, and 30%).

Analysis Methods
The performance of robust Haebara linking with powers p = 2, 1, 0.5, 0.25, 0.1, and 0.02 for estimated group means were compared with the scaling approach that relies on full invariance of all item parameters. The approach with full invariance (FI) was specified as a 2PL multiple group item response model.
To identify group means and group standard deviations in the linking procedure, for the first group, the mean was set to 0, and the standard deviation was set to 1. After estimating all group means and group standard deviations, these parameters were transformed to obtain a mean of 0 and a standard deviation 1 for the total sample comprising all groups. These conditions were also fulfilled in the data generating model (see Subsection 4.1).
The average root mean square error (ARMSE) is computed as the average of the root mean square error (RMSE) of all group means: In all analyses, the statistical software R [24] was used. Robust Haebara linking was carried out with the sirt::linking.haebara() function in the R package sirt [25]. The TAM::tam.mml.2pl() function in the R package TAM [35] was used for estimating the 2PL model with marginal maximum likelihood as the estimation method.

Results
In Table 1, average absolute bias (ABIAS) and average RMSE (ARMSE) as a function of sample size are shown. If there are no biased items, all linking methods provided unbiased estimates. As indicated by the ARMSE, there were some efficiency losses by using robust Haebara approaches (p ≤ 1) compared to nonrobust approaches (p = 2 or the FI model). The pattern of results for ABIAS and ARMSE for 10% biased items mimic findings for 30% biased items but were less strongly pronounced. Hence, we only describe the results for 30% biased items. The most biased estimates were obtained for the FI model and p = 2. Using small values of p resulted in a reduction of bias. Notably, the smallest biases were obtained for p = 0.02. However, biases for robust Haeabara linking were larger for smaller sample sizes. For group sizes N = 500, 1000, and 5000, the pattern of RMSE followed that of the bias. Very small values of p are preferred in terms of most precise estimates. However, for N = 250, the smallest ARMSE was obtained for p = 0.5. Probably, uncertainty in estimated item parameters adds additional variation and outweighs the smaller bias for small p. To sum up, robust Haebara linking effectively handles situations of partial invariance. Interestingly, values of the power p smaller than 1 are preferred in terms of ABIAS and ARMSE and are superior to previously proposed approaches that use p = 2 [3] and p = 1 [20].

Empirical Example: PISA 2006 Reading Competence
In order to illustrate the choice of different values for the power p in robust Haebara linking in the case of many groups, we analyzed the data from the PISA 2006 assessment [36]. In this case, groups constitute countries. In this reanalysis, we included 26 OECD countries that participated in 2006 and focused on the reading domain (see [37] for a similar analysis; but see also [9,38,39] for findings using the same dataset). Reading items were only administered to a subset of the participating students, and we included only those students who received a test booklet with at least one reading item. This resulted in a total sample size of 110,236 students (ranging from 2,010 to 12,142 between countries). In total, 28 reading items nested within eight testlets were used in PISA 2006. Six of the 28 items were polytomous and were dichotomously recoded, with only the highest category being recoded as correct. We used seven different analysis models to obtain estimates of the country means: a full invariance approach (concurrent scaling with multiple groups; FI), and robust Haebara linking using powers p = 2, 1, 0.5, 0.25, 0.1, and 0.02. For all analyses, the 2PL model was estimated using student weights. Within a country, student weights were normalized to a sum of 5,000, so that all countries contributed equally to the analyses. Finally, all estimated country means were linearly transformed such that the distribution containing all (weighted) students in all 26 countries had a mean of 500 (points) and a standard deviation of 100. Note that this transformation is not equivalent to the one used in officially published PISA publications.  Table 2, the country mean estimates obtained from the seven different analysis models are shown. Within a country, the range of country means differed between 0.4 (AUT, Austria) and 16.1 (KOR, South Korea) points (M = 3.2, SD = 3.1) across the different models. These differences between the methods can be traced back to different amounts of country DIF. The model based on full invariance and Haebara linking with p = 2 appeared to be similar, resulting in a large correlation of estimated country means (r = .997) and small absolute differences (M = 1.2, SD = 1.1). In contrast, Haebara linking for p = 2 and p = 0.02 differed quite a lot, resulting in a correlation of r = .980 and non-negligible absolute differences between methods (M = 3.2, SD = 3.1). Given that standard errors due to sampling of students in country means in PISA are typically about 3 points, in some cases, differences between different model estimates would provide different statements regarding statistical significance. Interestingly, the country mean estimate for South Korea (KOR) dropped from 560.5 (p = 2) to 544.4 (p = 0.02). The reason is that robust Haebara linking down-weights items with large DIF effects from the computation of country means. For South Korea, there are four items with large negative DIF effects (a relative advantage) and no items with large positive DIF effects (a relative disadvantage) that are most strongly down-weighted (see [9]). Hence, it can be concluded the choice of a particular linking method has the potential to impact the ranking of countries in PISA (see also [40,41]).

Discussion
In this article, we investigated the performance of an extension of Haebara linking in many groups. By using a robust loss function family ρ(x) = |x| p (p > 0) it was shown that the method efficiently handles the case of partial invariance.
Originally, Haebara linking has been proposed for p = 2 [3] and has been robustified using p = 1 in [20]. Our simulation study showed that power values p smaller than 1 has superior performance to the previous proposals in the literature. In more detail, in the case of many groups, p values of at most 0.25 are particularly advantageous.
As it is true for all simulation studies, our study has some limitations. First, we restricted the number of groups to 9. For international large-scale assessments like PISA (e.g., [1,36]), the number of groups-countries in this case-are much larger, say 30, or even 50. On the other hand, we believe that the robust Haebara linking method could also be attractive in the case of two groups [42] or a few groups [43]. Second, we only used 20 dichotomous items in the simulation studies. The performance of robust Haebara linking with a very low or higher number of items could be a relevant topic of future research. Third, we restricted ourselves to dichotomous data. Robust Haebara linking could be extended to polytomous items (see, e.g., [44]).
In the simulation study, it was shown that robust Haebara linking shows desirable performance in the situation of partial invariance. However, DIF effects could also be rather unsystematically distributed that cancel on average. This situation is sometimes referred to as approximate invariance (or random DIF, see [32,[45][46][47][48][49]). It can be concluded that in the presence of approximate invariance, power values of p = 2 are probably optimal [32,33], and the use of robust Haebara linking can lead to inferior statistical performance.
It should be emphasized that we did not investigate the computation of standard errors in our linking approach. There is ample literature that derives standard error formulas for linking due to sampling of persons (e.g., [44,[50][51][52][53][54][55]) Alternatively, variability in estimated group means due to the selection of items has been studied as linking errors in the literature [38,[56][57][58][59]. In future research, it would be interesting to accompany robust Haebara linking with error components that reflects these sources of uncertainty. Both analytical standard errors (e.g., [52]) and resampling procedures (e.g., [60]) could be useful in this respect.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: Let ρ(x) = |x + a| p for p > 0, p = 1, and a = 0. We now apply a Taylor approximation up to the second order around x = 0. We get ρ (x) = p|x + a| p−1 sign(a) = p|x + a| p−2 a and ρ (x) = p(p − 1)|x + a| p−2 . Then, we obtain the following approximation

. Minimization of a Quadratic Function
For the derivation of an estimated group mean in robust Haebara linking, we consider the following quadratic minimization problem where A, B, and C are real numbers. By taking the first derivative in Equation A2, we obtain

. Taylor Approximation of Item Response Function with DIF Effects
We now apply a Taylor expansion for the difference of item response functions that appear in robust Haebara linking: Let W i (θ) be the item information in the 2PL model. A Taylor approximation in Equation A4 around where wmdn denotes the weighted median based on data (x i , w i ), and x i are data values and w i sample weights. A further simplification of Equation A13 provideŝ
In Table A1, common item parameters (i.e., item loadings and item difficulties) are shown. Tables A2 and A3 show the values of the DIF indicator variable Z ig for the condition of 10% and 30% biased items, respectively.