A Comparison of Linking Methods for Two Groups for the Two-Parameter Logistic Item Response Model in the Presence and Absence of Random Differential Item Functioning

Abstract: This article investigates the comparison of two groups based on the two-parameter logistic item response model. It is assumed that there is random differential item functioning (DIF) in item difficulties and item discriminations. The group difference is estimated using separate calibration with subsequent linking, as well as concurrent calibration. The following linking methods are compared: mean-mean linking, log-mean-mean linking, invariance alignment, Haberman linking, asymmetric and symmetric Haebara linking, different recalibration linking methods, anchored item parameters, and concurrent calibration. It is analytically shown that log-mean-mean linking and mean-mean linking provide consistent estimates if random DIF effects have zero means. The performance of the linking methods was evaluated through a simulation study. It turned out that (log-)mean-mean and Haberman linking performed best, followed by symmetric Haebara linking and a newly proposed recalibration linking method. Interestingly, linking methods frequently found in applications (i.e., asymmetric Haebara linking, recalibration linking in the variant used in current large-scale assessment studies, anchored item parameters, concurrent calibration) perform worse in the presence of random differential item functioning. In line with the previous literature, differences between linking methods turned out to be negligible in the absence of random differential item functioning. The different linking methods were also applied in an empirical example that performed a linking of PISA 2006 to PISA 2009 for Austrian students. This application showed that estimated trends in the means and standard deviations depended on the chosen linking method and the employed item response model.


Introduction
The analysis of educational and psychological tests is an important field in the social sciences. The test items (i.e., tasks presented in these tests) are often modeled using item response theory (IRT; [1][2][3]; for applications, see, e.g., [4][5][6][7][8][9][10][11][12][13][14]) models. In this article, the two-parameter logistic (2PL; [15]) IRT model is investigated to compare two groups on test items. For example, groups could be demographic groups, countries, studies, or time points. Group comparisons are carried out using linking methods [16]. A significant obstacle in applying linking methods is that the test items could behave differently in the two groups (i.e., differential item functioning); that is, it cannot be expected that the two groups share a common set of statistical parameters for the test items. Such a situation is particularly important in educational large-scale assessment (LSA; [17][18][19]) studies in which several countries are compared. It can be expected that test items function differently because there are curricular differences between those countries.
The paper is structured as follows. In Section 2, the 2PL model with differential item functioning is introduced. In Section 3, several linking methods are discussed. In Section 4, we present the results of a simulation study in which these linking methods are compared. In Section 5, an empirical example using the PISA datasets from 2006 and 2009 for Austria is presented. Finally, the paper closes with a discussion in Section 6.

2PL Model
For dichotomous items i = 1, . . . , I, the item response function (IRF) of the 2PL model is given by [3,20]:

P(X_i = 1 | θ) = Ψ(a_i(θ − b_i)),    (1)

where Ψ(y) = exp(y)/(1 + exp(y)) is the logistic link function, a_i is the item discrimination, and b_i is the item difficulty. The one-parameter logistic (1PL) IRT model (also referred to as the Rasch model; [21]) fixes all item discriminations a_i to one. The 1PL or the 2PL model is usually estimated under a local independence assumption, that is:

P(X = x) = ∫ ∏_{i=1}^{I} P(X_i = x_i | θ) f(θ) dθ,    (2)

where f is the density of θ and x = (x_1, . . . , x_I). In many applications, θ is assumed to be normally distributed (i.e., θ ∼ N(µ, σ²)). In (2), the multivariate contingency table of X with corresponding probabilities P(X = x) involving 2^I item response patterns is summarized by a unidimensional latent variable θ.
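The IRF in (1) and the marginal pattern probability in (2) can be sketched in a few lines of Python (an illustrative sketch with a crude quadrature grid; the analyses in this article were carried out in R):

```python
import numpy as np

def irf_2pl(theta, a, b):
    """2PL item response function: P(X_i = 1 | theta) = Psi(a * (theta - b))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def pattern_probability(x, a, b, mu=0.0, sigma=1.0, n_quad=81):
    """Marginal probability P(X = x) under local independence:
    integrate the product of item probabilities over theta ~ N(mu, sigma^2)."""
    theta = np.linspace(mu - 5 * sigma, mu + 5 * sigma, n_quad)
    w = np.exp(-0.5 * ((theta - mu) / sigma) ** 2)
    w /= w.sum()  # normalized quadrature weights
    p = irf_2pl(theta[:, None], np.asarray(a), np.asarray(b))
    lik = np.prod(np.where(np.asarray(x) == 1, p, 1.0 - p), axis=1)
    return float(np.sum(w * lik))
```

Because the quadrature weights sum to one, the probabilities of all 2^I response patterns sum to one, mirroring the contingency-table interpretation of (2).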

Linking Design
To assess distributional differences between two groups, an appropriate linking design must be established. Such a linking design is displayed in Figure 1. For two groups g = 1, 2, there is a set I_0 of common items (also referred to as anchor items or link items) that are administered in both groups. There are also group-specific items I_g that are uniquely administered in each of the two groups. This linking design is also referred to as a common-items nonequivalent-groups design [22]. In many equating applications, no common items exist; instead, two test forms are administered to equivalent groups [23]. Equating is often performed with the goal of producing an equivalency table for the sum score in a test, while linking determines linking constants in order to identify group differences with respect to the latent variable θ. In this article, we are interested in determining distributional differences between the two groups for the latent variable θ.
It has to be emphasized that differences between the two groups are mainly determined by the set of common items if the linking relies on an IRT model. The employed IRT models can predict the expected performance of students on items that were not administered. The crucial assumption is that item responses on nonadministered unique items can be inferred (or imputed) from common items. This corresponds to an ignorability assumption in the missing data literature and translates into the conditional independence of item responses on unique items conditional on item responses on common items (see [24,25]). If a unidimensional IRT model with a local independence assumption (2) holds, this condition is automatically fulfilled.

Random Differential Item Functioning
Common items in the set I 0 are administered in two groups g = 1, 2 (see Figure 1). It is assumed that the 2PL model (see Section 2.1) holds in both groups. The situation in which item parameters do not differ between groups is called measurement invariance [26][27][28]. However, it is likely in many applications that items function differently in the two groups. The existence of group-specific item parameters is labeled differential item functioning (DIF; [29][30][31][32][33]; for applications, see, e.g., [34][35][36][37][38]). In this case, items possess group-specific item discriminations a ig and item difficulties b ig . DIF in item discriminations is referred to as nonuniform DIF (NUDIF), while DIF in item difficulties is referred to as uniform DIF (UDIF) if there is no DIF in item discriminations (see [32,39,40]). Note that the local independence assumption (2) holds within each group even though the item parameters can differ across groups.
For the rest of this article, we assume random DIF [41][42][43][44][45][46][47][48][49][50]. The main idea is that the difference of item parameters between groups is modeled by a distribution. Hence, a population perspective on a universe of items is adopted. Random DIF for item discriminations a_ig (i = 1, . . . , I; g = 1, 2) is defined as:

a_{i1} = a_i − f_i  and  a_{i2} = a_i + f_i,    (3)

where the DIF effects f_i are considered as a random variable that follows a distribution F_f. Common item discriminations a_i can be considered either fixed or random; in the following, we consider them as fixed. Item difficulties b_ig are also subject to random DIF, with DIF effects e_i following a distribution F_e:

b_{i1} = b_i − e_i  and  b_{i2} = b_i + e_i,    (4)

where b_i are common item difficulties. It is important to emphasize that there is an inherent nonidentifiability in the simultaneous determination of average DIF effects and average group differences [51][52][53][54]. Hence, some structural assumptions must be posed on the distributions of the DIF effects F_e and F_f. One possible choice is to assume E(e_i) = E(f_i) = 0; that is, random DIF is described by unsystematic differences in the item parameters (see Appendix A for further details). A frequent assumption is that DIF effects are normally distributed with zero means:

e_i ∼ N(0, τ_b²/2)  and  f_i ∼ N(0, τ_a²/2).    (5)

Means of DIF effects different from zero cannot be identified because average DIF effects are confounded with average group differences. Note that it holds that a_{i2} − a_{i1} = 2f_i ∼ N(0, 2τ_a²) and b_{i2} − b_{i1} = 2e_i ∼ N(0, 2τ_b²). As an alternative, a sparsity assumption might be posed on the DIF effects. In this case, the majority of items have DIF effects of zero, while only a few items have DIF effects different from zero [44,55]. If DIF effects are considered as fixed, this situation is known as partial invariance [56][57][58]. The random DIF distribution is then a mixture distribution with two classes: one class with zero DIF effects and the other class containing DIF effects different from zero.
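The DIF-generating process for item parameters can be sketched as follows (Python sketch; the item parameter values are hypothetical placeholders rather than the values from Table A1, and the DIF effects are scaled so that the between-group differences have the variances 2τ_a² and 2τ_b² stated above):

```python
import numpy as np

rng = np.random.default_rng(42)
I = 20
a = rng.uniform(0.8, 1.6, I)        # common item discriminations (hypothetical)
b = rng.normal(0.0, 1.0, I)         # common item difficulties (hypothetical)
tau_a, tau_b = 0.15, 0.3            # DIF SDs as manipulated in the simulation

# Additive random DIF effects with zero means
f = rng.normal(0.0, tau_a / np.sqrt(2.0), I)
e = rng.normal(0.0, tau_b / np.sqrt(2.0), I)

a1, a2 = a - f, a + f               # group-specific discriminations, Eq. (3)
b1, b2 = b - e, b + e               # group-specific difficulties, Eq. (4)
```

By construction, the group differences equal 2f_i and 2e_i, while the common parameters are the item-wise averages of the group-specific parameters.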
The mixture distribution can be simultaneously estimated in an IRT model with two groups ( [55]; see also [59]). Alternatively, DIF detection methods can be used to identify items whose DIF effects differ from zero [32,60,61]. These items can be removed from linking in subsequent analysis (see, e.g., [62]). However, some research has shown that simultaneous treatment of the linking and modeling of DIF effects can result in superior statistical performance [54,[63][64][65][66].
We would also like to point out that DIF effects in item discriminations in Equation (3) follow an additive model. Alternatively, a multiplicative model for DIF effects can be assumed:

a_{i1} = a_i f_i^{−1}  and  a_{i2} = a_i f_i,    (6)

which corresponds to an additive model in logarithmized item discriminations:

log a_{i1} = log a_i − log f_i  and  log a_{i2} = log a_i + log f_i.    (7)

It is hard to decide in empirical applications whether DIF effects for item discriminations should be modeled in the untransformed metric (see Equation (3)) or a logarithmized metric (see Equation (7)).
The simulation study only considered normally distributed DIF effects with zero means and an additive model for DIF effects in item discriminations.

Identified Item Parameters in Separate Calibrations in the Two Groups
In Section 3, we discuss some linking methods that rely on item parameters obtained from separate calibrations. That is, the 2PL model is separately fitted for the two groups. For reasons of identifiability, it has to be assumed that µ_1 = 0 and σ_1 = 1 in the first group. In a separate estimation for the first group with an infinite sample size, the estimated item discriminations â_{i1} are equal to the data-generating parameters a_{i1}. The same holds for the estimated item difficulties, that is, b̂_{i1} = b_{i1}.
In the second group, there are group-specific item parameters a_{i2} and b_{i2} and distribution parameters µ_2 ≠ 0 and σ_2 ≠ 1. In a separate estimation for the second group, the mean is nevertheless set to zero and the standard deviation (SD) is set to one. Hence, the estimated item parameters also absorb the distribution parameters. We obtain:

P(X_i = 1 | θ*) = Ψ(a_{i2}σ_2(θ* − (b_{i2} − µ_2)/σ_2)),    (8)

where the standardized ability θ* = (θ − µ_2)/σ_2 is standard normally distributed (i.e., N(0, 1)). From Equation (8), it follows that:

â_{i2} = a_{i2}σ_2  and  b̂_{i2} = (b_{i2} − µ_2)/σ_2.    (9)
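The identity behind Equations (8) and (9) can be checked numerically (a small sketch with hypothetical parameter values): substituting θ = µ_2 + σ_2 θ* leaves the IRF argument unchanged.

```python
import numpy as np

# Hypothetical true group-2 item parameters and distribution parameters
mu2, sigma2 = 0.3, 1.2
a2 = np.array([1.0, 1.4])
b2 = np.array([-0.5, 0.8])

# Identified parameters from a separate calibration that standardizes
# theta to N(0, 1) in group 2 (Eq. (9)): they absorb mu2 and sigma2
a2_hat = a2 * sigma2
b2_hat = (b2 - mu2) / sigma2
```

For any θ* and θ = µ_2 + σ_2 θ*, the quantities a_{i2}(θ − b_{i2}) and â_{i2}(θ* − b̂_{i2}) coincide, so the two parameterizations imply identical IRFs.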

The Role of Normally Distributed Random DIF in Educational Assessment
It should be noted that the presence of random DIF implies that there is no single item for which the two groups share the same item parameters. Hence, all items are allowed to have different item parameters. By posing a normal distribution on DIF effects, it is assumed that DIF across groups vanishes on average for an infinite number of items. The presence of normally distributed random DIF is strongly different from the situation of partial invariance in which only a few items (or a few item parameters) possess DIF, while the rest of the items (or item parameters) do not have DIF. The two kinds of DIF effects can also occur simultaneously [65]. In our experience, we have mainly observed random fluctuations of group-specific item parameters, which correspond to normally distributed random DIF rather than to partial invariance. However, it seems that ensuring partial invariance is seen as a measurement ideal in educational LSA studies [67][68][69][70]. We have argued against such a perspective elsewhere [54,65,71,72] and think that the random DIF perspective with normally distributed DIF effects is more relevant in real-world applications. Moreover, we think removing items due to DIF from group comparisons is unwise because DIF can be interpreted as construct-relevant [32,52,[73][74][75]. In this situation, group comparisons that rely on a purified set of items can yield biased estimates of group differences with respect to the means and standard deviations.

Linking Methods
In the following, we discuss linking methods that allow estimating the distribution parameters µ_2 = E(θ) and σ_2 = SD(θ) of the second group. Let γ_0 denote item parameters a_i and b_i of common items I_0 and γ_g item parameters from the group-specific sets of unique items I_g for g = 1, 2. Furthermore, denote by X_g the matrix of observed item responses from group g for items i ∈ I_0 ∪ I_g. For group g = 1, 2, the log-likelihood function is defined by:

l(µ_g, σ_g, (γ_0, γ_g); X_g) = Σ_{p=1}^{N_g} log ∫ ∏_{i ∈ I_0 ∪ I_g} P(X_i = x_pgi | θ) f(θ; µ_g, σ_g) dθ,    (10)

where x_pgi is the item response of person p in group g at item i, and f(θ; µ_g, σ_g) denotes the density of the N(µ_g, σ_g²) distribution. The IRT model in (10) can be estimated by marginal maximum likelihood (MML) estimation and an expectation-maximization algorithm [76][77][78][79].
For reasons of identification, we define µ_1 = 0 and σ_1 = 1 and identify the distribution parameters µ_2 and σ_2 of the second group. Separate calibrations for the two groups can be carried out and result in group-specific item parameter estimates â_ig and b̂_ig (see Section 2.3.1). These item parameters are subsequently transformed utilizing linking methods in order to obtain estimates µ̂_2 and σ̂_2 . See [16,22,[80][81][82] for an overview of linking methods. In the next subsections, we discuss several linking methods.

Log-Mean-Mean Linking
In log-mean-mean linking (logMM; [16]), the means of the logarithmized item discriminations and of the item difficulties in the two groups are set equal for identifying group means and group SDs. The SD σ_2 of the second group is estimated as:

σ̂_2 = exp( (1/|I_0|) Σ_{i∈I_0} (log â_{i2} − log â_{i1}) ),    (11)

where |I_0| is the number of items in the set I_0. The estimation in (11) corresponds to the assumption of additive DIF effects in logarithmized item discriminations (see (7)). The mean µ_2 of the second group is estimated by:

µ̂_2 = (1/|I_0|) Σ_{i∈I_0} (b̂_{i1} − σ̂_2 b̂_{i2}).    (12)

It can be shown that logMM estimates are consistent under weak conditions. Here, consistency means that the number of common items (i.e., |I_0|) tends toward infinity, while group-specific item parameters are assumed to have been estimated with an infinite sample size (i.e., N → ∞). Consistency proofs can easily be modified to the case of finite sample sizes (we use →_p as the mathematical symbol for convergence in probability in the following). Then, consistency is meant under a double sampling scheme in which the number of persons and the number of items tend to infinity (i.e., N → ∞ and |I_0| → ∞). We now present the consistency result.

Proposition 1. Assume that DIF effects e_i fulfill E(e_i) = 0. For DIF effects f_i, one of the following conditions holds: (i) for additive DIF effects f_i (Equation (3)), it holds that E(f_i) = 0 and f_i has a symmetric distribution; (ii) for multiplicative DIF effects f_i (Equation (7)), it holds that E(log f_i) = 0. Then, the logMM estimates are consistent (i.e., µ̂_2 →_p µ_2 and σ̂_2 →_p σ_2).
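The two logMM estimation formulas can be sketched directly in code (illustrative Python; in a noise-free situation without DIF, the estimates recover µ_2 and σ_2 exactly):

```python
import numpy as np

def log_mean_mean(a1_hat, b1_hat, a2_hat, b2_hat):
    """logMM linking (Eqs. (11) and (12)): estimate (mu2, sigma2) from
    separate-calibration item parameters of the common items."""
    sigma2 = np.exp(np.mean(np.log(a2_hat) - np.log(a1_hat)))
    mu2 = np.mean(b1_hat - sigma2 * b2_hat)
    return mu2, sigma2
```

Feeding in identified parameters of the form â_{i2} = a_i σ_2 and b̂_{i2} = (b_i − µ_2)/σ_2 (see Equation (9)) returns the data-generating µ_2 and σ_2.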

Mean-Mean Linking
In mean-mean linking (MM; [16]), the means of untransformed item discriminations are matched. Hence, the estimation in (11) is substituted by:

σ̂_2 = ( Σ_{i∈I_0} â_{i2} ) / ( Σ_{i∈I_0} â_{i1} ).    (13)

This estimation corresponds to additive DIF effects in untransformed item discriminations (see (3)). The estimation formula for µ̂_2 in (12) remains unaltered.
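The MM estimator differs from logMM only in the σ_2 step (a minimal Python sketch):

```python
import numpy as np

def mean_mean(a1_hat, b1_hat, a2_hat, b2_hat):
    """MM linking: Eq. (13) for sigma2, combined with Eq. (12) for mu2."""
    sigma2 = np.sum(a2_hat) / np.sum(a1_hat)
    mu2 = np.mean(b1_hat - sigma2 * b2_hat)
    return mu2, sigma2
```

As with logMM, noise-free identified parameters recover the data-generating distribution parameters exactly.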
It can also be shown that MM estimates are consistent under weak conditions.
Proposition 2. Assume that DIF effects e_i fulfill E(e_i) = 0. For DIF effects f_i, one of the following conditions holds: (i) for additive DIF effects f_i (Equation (3)), it holds that E(f_i) = 0; (ii) for multiplicative DIF effects f_i (Equation (7)), it holds that E(log f_i) = 0 and log f_i has a symmetric distribution. Then, the MM estimates are consistent (i.e., µ̂_2 →_p µ_2 and σ̂_2 →_p σ_2).
Finally, it is instructive to study the effects on the estimates in MM linking if the true DIF effects do not have zero means. For additive DIF effects f_i, we assume E(f_i) = δ_f and E(e_i) = δ_e. Using similar derivations as in Appendix C, we obtain:

σ̂_2 →_p λ σ_2  and  µ̂_2 →_p λ µ_2 + (1 − λ) b̄ − (1 + λ) δ_e,    (16)

where

λ = (ā + δ_f)/(ā − δ_f),  ā = E(a_i),  and  b̄ = E(b_i).    (17)

If DIF effects do not have zero means, Equations (16) and (17) show that biased estimates for the group mean and the group standard deviation can be expected.
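These limiting expressions can be verified numerically. The following sketch sets all DIF effects equal to the constants δ_f and δ_e, so that the large-sample limits in Equations (16) and (17) hold exactly for the MM estimates:

```python
import numpy as np

a = np.linspace(0.8, 1.6, 10)     # common discriminations (hypothetical)
b = np.linspace(-1.0, 1.0, 10)    # common difficulties (hypothetical)
mu2, sigma2 = 0.3, 1.2
delta_f, delta_e = 0.1, 0.2       # nonzero means of the DIF effects

# Identified parameters with constant (mean-only) DIF effects
a1_hat, b1_hat = a - delta_f, b - delta_e
a2_hat, b2_hat = (a + delta_f) * sigma2, (b + delta_e - mu2) / sigma2

# MM estimates, Eqs. (13) and (12)
sigma2_mm = np.sum(a2_hat) / np.sum(a1_hat)
mu2_mm = np.mean(b1_hat - sigma2_mm * b2_hat)

# Predicted limits from Eqs. (16) and (17)
lam = (a.mean() + delta_f) / (a.mean() - delta_f)
sigma2_pred = lam * sigma2
mu2_pred = lam * mu2 + (1 - lam) * b.mean() - (1 + lam) * delta_e
```

With δ_f = 0.1 and ā ≈ 1.2, λ exceeds one, so the SD of the second group is overestimated and the mean estimate is shifted away from µ_2.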

Haberman Linking (HAB and HAB-Nolog)
Haberman linking (HAB; [71,83,84]) provides a generalization of logMM and MM to multiple groups while simultaneously estimating common item parameters a and b. HAB consists of two steps. In the first step, group-specific SDs are estimated. In the second step, group-specific means are estimated.
The originally proposed HAB linking [83] operates on logarithmized item discriminations (HAB). To estimate σ_2, the linking function in HAB for the particular case of two groups is:

H_{1,log}(σ_2, a) = Σ_{g=1,2} Σ_{i ∈ I_0 ∪ I_g} (log â_{ig} − log σ_g − log a_i)²,    (18)

with σ_1 = 1. Parameters in HAB linking are estimated as the minimizer of (18):

(σ̂_2, â) = arg min_{σ_2, a} H_{1,log}(σ_2, a).    (19)

For i ∈ I_g for g = 1, 2, we obtain â_i = â_{ig}. Furthermore, for i ∈ I_0, the minimization of (18) corresponds to a two-way analysis of variance. We obtain (see Equation (A31) in Appendix D):

σ̂_2 = exp( (1/|I_0|) Σ_{i∈I_0} (log â_{i2} − log â_{i1}) ),    (20)

which shows the equivalence to logMM linking with respect to the estimation of σ_2.
In the second step, group means are estimated. The linking function is given as:

H_2(µ_2, b) = Σ_{g=1,2} Σ_{i ∈ I_0 ∪ I_g} (σ̂_g b̂_{ig} + µ_g − b_i)²,    (21)

with µ_1 = 0 and σ̂_1 = 1. The parameters are estimated as:

(µ̂_2, b̂) = arg min_{µ_2, b} H_2(µ_2, b).    (22)

Using the same derivation as for H_{1,log}, we obtain:

µ̂_2 = (1/|I_0|) Σ_{i∈I_0} (b̂_{i1} − σ̂_2 b̂_{i2}).    (23)

Notably, Equation (23) coincides with the estimation in logMM linking (see Equation (12)).
As an alternative, Haberman linking can also be conducted based on untransformed item discriminations [71]. This method is labeled HAB-nolog. It turned out that HAB-nolog outperformed HAB in some situations for multiple groups [71,84]. The linking function of HAB-nolog is given by:

H_1(σ_2, a) = Σ_{g=1,2} Σ_{i ∈ I_0 ∪ I_g} (â_{ig} − σ_g a_i)²,    (24)

with σ_1 = 1. Parameter estimates in HAB-nolog linking are determined by:

(σ̂_2, â) = arg min_{σ_2, a} H_1(σ_2, a).    (25)

The SD of the second group is given in closed form (see Equation (A36) in Appendix D). The linking function for µ_2 in HAB-nolog is the same as in HAB (see Equation (21)).
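For two groups, the HAB-nolog minimization in (24) and (25) over the common items can be sketched via alternating least squares, iterating the exact coordinate updates for a_i and σ_2 (an illustrative implementation, not the R code used in the analyses):

```python
import numpy as np

def haberman_nolog_sigma(a1_hat, a2_hat, n_iter=200):
    """First step of HAB-nolog for two groups and common items:
    minimize sum_i (a1_hat_i - a_i)^2 + (a2_hat_i - sigma2 * a_i)^2
    with sigma_1 = 1, by alternating exact updates of a_i and sigma_2."""
    sigma2 = 1.0
    for _ in range(n_iter):
        # optimal common discriminations a_i for fixed sigma_2
        a = (a1_hat + sigma2 * a2_hat) / (1.0 + sigma2 ** 2)
        # optimal sigma_2 for fixed a_i
        sigma2 = np.sum(a * a2_hat) / np.sum(a ** 2)
    return sigma2
```

In the noise-free case with â_{i2} = σ_2 â_{i1}, the iteration converges to the data-generating σ_2.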

Invariance Alignment with p = 2
Asparouhov and Muthén [85,86] proposed the method of invariance alignment (IA) to define a linking that maximizes the extent of invariant item parameters. IA can also be regarded as a linking method that handles noninvariant item parameters [71].
The IA method is based on estimated group-specific item intercepts ν̂_ig = â_ig b̂_ig and item discriminations â_ig obtained from separate calibrations. IA was originally formulated to detect only a few noninvariant items. It has been pointed out that IA in its original proposal is not an acceptable linking method in the presence of normally distributed DIF effects [87][88][89]. In [71], IA was studied using a general class of so-called L_p-type robust linking functions. The original IA formulation used p = 0.5 [85]. For normally distributed DIF effects, p = 2 is an adequate choice [71,89]. Hence, we investigate IA with the loss function using p = 2 (IA2).
It has been shown that IA estimation, originally formulated as a joint estimation problem, can be reformulated as a two-step estimation method [71]. In the first step, group SDs are computed; in the second step, group means are computed. For two groups, σ_2 is estimated in the first step and µ_2 in the second step. Note that the first group serves as the reference group (i.e., it holds that µ_1 = 0 and σ_1 = 1). The formulas in [71] simplify in the case of two groups and provide closed-form expressions for the estimates µ̂_2 and σ̂_2 (see Equations (A41) and (A45) in Appendix E).

Haebara Linking Methods (HAE-Asymm, HAE-Symm, HAE-Joint)
In contrast to the MM, logMM, HAB, HAB-nolog, and IA linking methods, Haebara (HAE) linking [90] aligns IRFs instead of directly aligning item parameters. The linking function in asymmetric HAE (HAE-asymm; [90]) linking is given as:

H_asymm(µ_2, σ_2) = Σ_{i∈I_0} ∫ [ Ψ(â_{i1}(θ − b̂_{i1})) − Ψ( (â_{i2}/σ_2)(θ − µ_2 − σ_2 b̂_{i2}) ) ]² ω(θ) dθ,    (29)

where ω is a weight function (e.g., the standard normal density). Estimated distribution parameters are obtained by minimizing (29):

(µ̂_2, σ̂_2) = arg min_{µ_2, σ_2} H_asymm(µ_2, σ_2).    (30)

The linking function (29) aligns IRFs of the second group to those of the first group. Hence, it is asymmetric because parameters of the second group are expected to behave similarly to the first group. However, the first group could alternatively be aligned to the second group. To robustify HAE linking, both directions of alignment are considered in symmetric Haebara linking (HAE-symm; [91,92]), which employs the linking function:

H_symm(µ_2, σ_2) = H_asymm(µ_2, σ_2) + Σ_{i∈I_0} ∫ [ Ψ(â_{i2}(θ − b̂_{i2})) − Ψ( σ_2 â_{i1}(θ − (b̂_{i1} − µ_2)/σ_2) ) ]² ω(θ) dθ.    (31)

In a similar vein, the estimated mean and SD for the second group are defined as:

(µ̂_2, σ̂_2) = arg min_{µ_2, σ_2} H_symm(µ_2, σ_2).    (32)

A generalization of HAE linking to the general case of multiple groups was proposed in [93]. In this joint Haebara (HAE-joint) linking approach, distribution parameters are simultaneously estimated with common item parameters. The linking function in HAE-joint is defined as [65,93,94]:

H_joint(µ_2, σ_2, a, b) = Σ_{g=1,2} Σ_{i∈I_0} ∫ [ Ψ(â_{ig}(θ − b̂_{ig})) − Ψ( a_i σ_g(θ − (b_i − µ_g)/σ_g) ) ]² ω(θ) dθ,    (33)

where a and b denote common item parameters (and µ_1 = 0, σ_1 = 1). Parameter estimates are given by minimizing (33):

(µ̂_2, σ̂_2, â, b̂) = arg min_{µ_2, σ_2, a, b} H_joint(µ_2, σ_2, a, b).    (34)

A variant of joint Haebara linking was proposed in [84]. Note that in joint Haebara linking, common IRFs are aligned to group-specific IRFs.
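The asymmetric Haebara loss in (29) can be sketched with a simple normal quadrature grid (illustrative Python; a real implementation would pair this with a numerical optimizer over (µ_2, σ_2)):

```python
import numpy as np

def psi(x):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def hae_asymm_loss(mu2, sigma2, a1_hat, b1_hat, a2_hat, b2_hat):
    """Asymmetric Haebara loss (sketch of Eq. (29)): squared differences of
    group-1 IRFs and group-2 IRFs transformed to the group-1 metric,
    weighted by a discretized standard normal density."""
    theta = np.linspace(-4.0, 4.0, 41)
    w = np.exp(-0.5 * theta ** 2)
    w /= w.sum()
    p1 = psi(a1_hat * (theta[:, None] - b1_hat))
    p2 = psi((a2_hat / sigma2) * (theta[:, None] - mu2 - sigma2 * b2_hat))
    return float(np.sum(w[:, None] * (p1 - p2) ** 2))
```

In the noise-free case, the loss vanishes at the data-generating (µ_2, σ_2) and is strictly positive elsewhere, which is what the minimization in (30) exploits.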
Recalibration Linking (RC1, RC2, and RC3)

A further linking technique is recalibration (RC) linking, which is based on item parameters obtained from separate calibrations. The core idea is that group differences in distributions can be inferred by recalibrating a study with one group using item parameters from the other group. RC methods are mainly used in LSA studies such as the Programme for International Student Assessment (PISA; [95]), the Progress in International Reading Literacy Study (PIRLS; [96]), and the Trends in International Mathematics and Science Study (TIMSS; [97,98]).
In PISA, RC linking was employed until PISA 2012 for the 1PL model to handle model misspecifications (i.e., the data-generating IRT model deviates from the 1PL model) in linking that could artificially impact the estimated SDs [99]. RC linking in PIRLS and TIMSS operates on the 3PL model [98,100,101]. Notably, recalibration methods have also been proposed for determining linking errors in PIRLS/TIMSS [101], as well as in PISA ( [69], p. 176ff.), but the two approaches turned out to be different.
RC linking is based on estimated item parameters from the first and the second groups. Item parameters are obtained assuming µ_1 = µ_2 = 0 and σ_1 = σ_2 = 1 in separate calibrations. The group-specific parameter estimates are defined as:

(γ̂_0^(g), γ̂_g) = arg max_{γ_0, γ_g} l(0, 1, (γ_0, γ_g); X_g)  for g = 1, 2.    (35)

To obtain the mean and the SD of the second group, the data of the first group (i.e., X_1) are recalibrated using item parameters from the second group (i.e., γ̂_0^(2)):

(m̂_1, ŝ_1) = arg max_{µ, σ} l(µ, σ, γ̂_0^(2); X_1).    (36)

A recalibrated mean m̂_1 and a recalibrated SD ŝ_1 are obtained. These two parameters indicate differences between the two groups. Similarly, the data of the second group (i.e., X_2) can be recalibrated using item parameters from the first group (i.e., γ̂_0^(1)):

(m̂_2, ŝ_2) = arg max_{µ, σ} l(µ, σ, γ̂_0^(1); X_2).    (37)

Based on these estimates, the distribution parameters for the second group are defined as:

µ̂_2 = −s m̂_1,    (38)

σ̂_2 = s,    (39)

where the scaling factor s can take different forms. Note that µ̂_2 relies on the recalibrated mean m̂_1 for the first group and the scaling factor s. The three RC linking methods differ concerning the scaling factor used:

s = 1/ŝ_1 (RC1),  s = ŝ_2 (RC2),  or  s = (ŝ_2/ŝ_1)^{1/2} (RC3).    (40)

The linking methods RC1 and RC2 are asymmetric, while method RC3 relies on both recalibrated SDs. The scaling factor in RC3 linking is defined as the geometric mean of the scaling factors in RC1 and RC2. The definition is motivated by the fact that the impact on the recalibrated SDs is symmetrically treated. The linking method RC1 is currently used in PIRLS and TIMSS [96,98] and was used in PISA until 2012 [99]. To our knowledge, linking methods RC2 and RC3 have not yet been investigated in the literature.
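The back-transformation in RC linking can be sketched numerically. The following sketch plugs in idealized recalibrated values in place of the actual MML re-estimation (i.e., it assumes the recalibrations recover the population quantities m̂_1 = −µ_2/σ_2, ŝ_1 = 1/σ_2, m̂_2 = µ_2, ŝ_2 = σ_2 exactly):

```python
import numpy as np

mu2, sigma2 = 0.3, 1.2            # true group-2 parameters (hypothetical)

# Idealized recalibrated parameters: group 1 on the group-2 metric (m1, s1)
# and group 2 on the group-1 metric (m2, s2)
m1, s1 = -mu2 / sigma2, 1.0 / sigma2
m2, s2 = mu2, sigma2

s_rc1 = 1.0 / s1                  # asymmetric scaling factor (RC1)
s_rc2 = s2                        # asymmetric scaling factor (RC2)
s_rc3 = np.sqrt(s_rc1 * s_rc2)    # geometric mean of both (RC3)

# Distribution parameters of the second group, here with the RC3 factor
mu2_hat, sigma2_hat = -s_rc3 * m1, s_rc3
```

In this idealized setting, all three scaling factors coincide with σ_2, and the back-transformation returns the true group-2 parameters; with random DIF, the recalibrated quantities (and hence the three variants) differ.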
It should be noted that RC linking can be considered a variant of ANCH linking because, in RC linking, the distribution parameters of the first group are re-estimated using anchored item parameters from the second group. However, the distribution parameters of the second group are indirectly obtained by transforming the re-estimated distribution parameters of the first group (see Section 3.6). In contrast, ANCH provides an estimate of µ 2 and σ 2 directly.

Concurrent Calibration
Concurrent calibration (CC; [65,70,103]) is based on a multiple-group IRT model and presupposes invariant item parameters across groups. The distribution parameters of the second group are determined by maximizing the joint likelihood:

(µ̂_2, σ̂_2, γ̂_0, γ̂_1, γ̂_2) = arg max_{µ_2, σ_2, γ_0, γ_1, γ_2} [ l(0, 1, (γ_0, γ_1); X_1) + l(µ_2, σ_2, (γ_0, γ_2); X_2) ].    (43)

In the presence of random DIF, the log-likelihood function in (43) will typically be misspecified, and the estimated mean and SD can be biased due to this misspecification. It has frequently been pointed out that separate calibration with subsequent linking can be more robust to the presence of DIF than CC [16,54,102]. CC can only be expected to be more efficient than linking based on separate calibrations in small to moderate sample sizes and in the absence of DIF [65]. Notably, CC is also more computationally demanding than linking based on separate calibration [65,104].

Purpose
The purpose of this simulation study was to investigate the performance of linking methods in the two-group case for the 2PL model under different sample sizes, different numbers of items, and different amounts of uniform and nonuniform DIF. Most simulation studies either assume invariant item parameters (i.e., no DIF) or presuppose partial invariance in which only a few item parameters differ between groups (e.g., [63,[105][106][107][108][109][110][111][112]). There is a lack of research on the presence of random DIF, although there is some initial work for continuous items [88,89]. Moreover, although recalibration linking is in operational use in LSA studies, it has not yet been systematically compared to alternative linking methods.
We expected from the previous research and our analytical findings in Propositions 1 and 2 that moment-based linking methods could be competitive with the CC and HAE linking methods [54,65]. We did not have specific hypotheses regarding the performance of recalibration linking.

Design
We simulated a design with two groups and only common items (i.e., no group-specific unique items). Data were simulated according to the 2PL model with random DIF following Equations (3) and (4). Table A1 in Appendix F shows the item parameters a_i and b_i for the 20 items used in the simulation. The first group served as the reference group with µ_1 = 0 and σ_1 = 1, while the distribution parameters for the second group were µ_2 = 0.3 and σ_2 = 1.2.
In the simulation, four factors were manipulated. First, we chose sample sizes N = 500, 1000, and 5000 (3 factor levels). Second, the number of items was either I = 20 or I = 40 (2 factor levels). For 40 items, the item parameters from Table A1 were duplicated. Third, the random DIF SD τ_b for item difficulties b_i was 0, 0.1, 0.3, or 0.5 (4 factor levels). Fourth, the random DIF SD τ_a for item discriminations a_i was 0, 0.15, or 0.25 (3 factor levels). In total, there were 3 × 2 × 4 × 3 = 72 conditions in the simulation.
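The factorial design and the data-generating step for one group can be sketched as follows (illustrative Python; note that 3 × 2 × 4 × 3 = 72 crossed conditions):

```python
import numpy as np
from itertools import product

# The fully crossed simulation conditions: 3 x 2 x 4 x 3 = 72
conditions = list(product([500, 1000, 5000],      # sample size N
                          [20, 40],               # number of items I
                          [0.0, 0.1, 0.3, 0.5],   # tau_b (DIF SD, difficulties)
                          [0.0, 0.15, 0.25]))     # tau_a (DIF SD, discriminations)

def simulate_group(N, a_g, b_g, mu, sigma, rng):
    """Simulate dichotomous 2PL responses for one group with its
    group-specific (DIF-perturbed) item parameters."""
    theta = rng.normal(mu, sigma, N)
    p = 1.0 / (1.0 + np.exp(-a_g * (theta[:, None] - b_g)))
    return (rng.uniform(size=p.shape) < p).astype(int)
```

Each replication would draw DIF effects, perturb the Table A1 parameters for the two groups, and simulate both response matrices with `simulate_group`.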
In total, 1000 datasets were simulated and analyzed in each condition.
The parameters of interest were the estimated mean µ̂_2 and the estimated SD σ̂_2 for the second group. For both parameters, the bias and the root-mean-squared error (RMSE) were computed. To decrease the dependence of the RMSE on the sample size and the number of items, we computed a relative RMSE: the RMSE of a linking method was divided by the RMSE of the best-performing linking method and multiplied by 100. Hence, the relative RMSE of the best linking method in a condition equals 100, its lowest possible value.
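These evaluation criteria can be sketched as follows (illustrative Python):

```python
import numpy as np

def bias_rmse(estimates, true_value):
    """Bias and root-mean-squared error of a vector of estimates."""
    estimates = np.asarray(estimates, dtype=float)
    bias = np.mean(estimates) - true_value
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
    return bias, rmse

def relative_rmse(rmses):
    """Relative RMSE: each method's RMSE divided by the smallest RMSE
    across methods and multiplied by 100 (best method = 100)."""
    rmses = np.asarray(rmses, dtype=float)
    return 100.0 * rmses / rmses.min()
```

For example, a method whose RMSE is 20% larger than the best method's RMSE receives a relative RMSE of 120.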
To summarize the contribution of each of the manipulated factors in the simulation, we conducted an analysis of variance (ANOVA) and used a variance decomposition to assess the importance.
Moreover, we classified linking methods according to whether or not they showed satisfactory performance in a particular condition. We defined performance for the bias as satisfactory if the absolute bias of a parameter (i.e., the estimated mean µ̂_2 or the estimated SD σ̂_2) was smaller than 0.01. In LSA studies such as PISA, standard errors are about 0.02 or 0.03 for standardized ability variables θ. The bias should only be a fraction of the variability introduced by the sampling error, which motivated the cutoff value of 0.01. An estimator had satisfactory performance concerning the RMSE if the relative RMSE was smaller than 120; an alternative estimator to the best-performing one should not lose too much precision. With a relative RMSE of 120, the relative mean-squared error (MSE) is 144 (note that 1.2² = 1.44), which corresponds to a loss in precision of 44%. Estimators with worse performance might not be considered satisfactory in such a situation.
In all the analyses, the statistical software R [113] was used. The R package TAM [114] was used to estimate the 2PL model with marginal maximum likelihood as the estimation method. The linking methods were estimated using dedicated R functions or the existing functionality in the R packages sirt [115] and TAM [114]. The ANOVA model was estimated with the R package lme4 [116].

Results
In Table 1, the variance decomposition of the ANOVA is presented. All terms up to three-way interactions are included. From the size of the residual variance, it can be concluded that the first three orders are sufficient to capture the most important factors in the simulation.
It turned out that the sample size (N) and the number of items (I) were only of minor importance among the terms up to three-way interactions in the ANOVA. However, the linking method, as well as the size of random DIF, were important factors. Importantly, the random DIF SD τ_b in item difficulties b_i had a large influence on the estimated means, while the random DIF SD τ_a in item discriminations a_i strongly impacted the estimated SDs.
Due to these observations, we decided to provide aggregated results for three important groups of cells in the simulation. First, we aggregated the results for the six conditions with no DIF (NODIF; τ_b = 0 and τ_a = 0). Because there were two estimated parameters (mean and SD), this resulted in aggregation across twelve measures for the bias and RMSE, respectively. Second, we considered all 18 conditions with uniform DIF (UDIF; τ_b > 0 and τ_a = 0). Third, we provide summaries across all 48 conditions with nonuniform DIF (NUDIF; τ_a > 0).

In Table 2, the performance of the linking methods is summarized across these three groups of conditions by classifying all linking methods as either satisfactory or nonsatisfactory. Table 2 shows the proportion of conditions in which a linking method provided satisfactory results. In the absence of DIF (columns "NODIF"), all methods performed well in most of the conditions. Notably, IA2, the asymmetric recalibration methods RC1 and RC2, as well as ANCH produced biases in some instances. In the case of uniform DIF (columns "UDIF"), HAE-asymm, HAE-joint, and CC cannot be recommended in terms of the bias. Moreover, only moment-based methods (logMM, HAB, MM, HAB-nolog) can be recommended in terms of the RMSE. Finally, in the presence of nonuniform DIF (columns "NUDIF"), moment-based methods, as well as HAE-symm and RC3, were satisfactory in terms of the bias. If the RMSE is considered as an additional criterion, utilizing the linking methods MM, HAB-nolog, HAE-symm, and RC3 can be suggested.

Note. DIF = differential item functioning; NODIF = no DIF; UDIF = uniform DIF; NUDIF = nonuniform DIF; logMM = log-mean-mean linking; HAB = Haberman linking with logarithmized item discriminations; MM = mean-mean linking; HAB-nolog = Haberman linking with untransformed item discriminations; IA2 = invariance alignment with power p = 2; HAE-asymm = asymmetric Haebara linking; HAE-symm = symmetric Haebara linking; HAE-joint = Haebara linking with joint item parameters; RC = recalibration linking (see Equation (40)); ANCH = anchored item parameters; CC = concurrent calibration. Values smaller than 70 are printed in bold.
In Table 3, the bias and RMSE for the mean µ̂_2 and the SD σ̂_2 of the second group for N = 1000 students and I = 40 items are shown. All linking methods performed well in the absence of DIF (see columns "NODIF"). In the case of uniform DIF (columns "UDIF"), HAE-asymm, HAE-joint, and CC produced nonsatisfactory results in terms of the bias for the mean or the SD. Interestingly, the RMSE for the SD was much larger for HAE, RC, ANCH, and CC than for the moment-based methods logMM, HAB, MM, and HAB-nolog. In the case of nonuniform DIF (columns "NUDIF"), only the moment-based methods (except IA2), as well as HAE-symm and the newly proposed recalibration linking method RC3, produced satisfactory results. Furthermore, note that the RMSE for the SD was larger for logMM and HAB than for MM and HAB-nolog. This finding can be explained by the fact that the data-generating model for DIF was an additive model that operated on untransformed item discriminations and therefore favored MM and HAB-nolog. To conclude, linking methods that are not moment-based can only be recommended as a default linking method when the mean is of interest and random DIF is absent. Whether the additional variance introduced by moment-based methods is compensated by their smaller biases depends on the extent of UDIF (i.e., the size of τ_b).

Table 3. Bias and RMSE for the mean µ̂_2 and standard deviation σ̂_2 of the second group for a sample size of N = 1000 and I = 40 items as a function of the type of differential item functioning and linking method.

Note: DIF = differential item functioning; NODIF = no DIF; UDIF = uniform DIF; NUDIF = nonuniform DIF; τ_a = standard deviation of DIF effects in item discriminations a_i; τ_b = standard deviation of DIF effects in item difficulties b_i; logMM = log-mean-mean linking; HAB = Haberman linking with logarithmized item discriminations; MM = mean-mean linking; HAB-nolog = Haberman linking with untransformed item discriminations; IA2 = invariance alignment with power p = 2; HAE-asymm = asymmetric Haebara linking; HAE-symm = symmetric Haebara linking; HAE-joint = Haebara linking with joint item parameters; RC = recalibration linking (see Equation (40)); ANCH = anchored item parameters; CC = concurrent calibration. Absolute biases larger than 0.02 are printed in bold. RMSE values larger than 120 are printed in bold.
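The bias and RMSE criteria reported in Tables 2 and 3 are the usual Monte Carlo summaries of replicated estimates. As a minimal sketch (the toy estimator below and its built-in bias of 0.02 are invented for illustration, not taken from the simulation study):

```python
import numpy as np

def bias_rmse(estimates, true_value):
    """Monte Carlo bias and RMSE of a parameter estimate across replications."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - true_value
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
    return bias, rmse

# Toy example: 1000 replicated estimates of a group mean whose true value is 0.3,
# generated from an estimator with bias 0.02 and sampling SD 0.05
rng = np.random.default_rng(1)
est = 0.3 + 0.02 + rng.normal(0.0, 0.05, size=1000)
bias, rmse = bias_rmse(est, 0.3)
```

Because the RMSE combines squared bias and sampling variance, a method with a small bias but large variability (or vice versa) can still show a large RMSE, which is why both criteria are reported separately in the tables.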

Method
In order to illustrate the consequences of different scaling models (i.e., the 1PL and 2PL models), as well as different linking methods (see Section 3), we analyzed the data for Austrian students from the PISA study conducted in 2006 (PISA 2006; [95]) and 2009 (PISA 2009; [117]) (see Table 4). In both PISA studies, we included only those students who received a test booklet with at least one item in the respective domain. For simplicity, all polytomous items (i.e., items with maximum scores larger than one) were dichotomously recoded, with only the highest category scored as correct. The 1PL and the 2PL model were used as scaling models, and student weights were taken into account. In total, 13 linking methods (logMM, HAB, MM, HAB-nolog, IA2, HAE-asymm, HAE-symm, HAE-joint, RC1, RC2, RC3, ANCH, CC; see Section 3) were applied. In the linking procedure, the distribution parameters of the first study (PISA 2006) were fixed (i.e., µ_1 = 0, σ_1 = 1), and the distribution parameters of the second study (PISA 2009) were estimated. The two ability distributions for the three domains (and the distribution parameters) were linearly transformed such that the mean and the SD in PISA 2006 equaled the officially reported values in the respective domain. For example, for Mathematics, it holds that µ̂_1 = M = 506.8 and σ̂_1 = SD = 96.8 in PISA 2006 for all linking methods and for both the 1PL and the 2PL model.
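The linear transformation onto the PISA reporting metric can be sketched as follows. The function name and the logit-scale estimates for PISA 2009 are hypothetical; only the officially reported Mathematics values M = 506.8 and SD = 96.8 for PISA 2006 are taken from the text:

```python
def to_pisa_metric(mu2_logit, sigma2_logit, mu1_logit, sigma1_logit,
                   M1_official, SD1_official):
    """Linearly transform group-2 distribution parameters to the reporting
    scale on which group 1 (PISA 2006) has its officially reported mean/SD."""
    slope = SD1_official / sigma1_logit
    mu2 = M1_official + slope * (mu2_logit - mu1_logit)
    sigma2 = slope * sigma2_logit
    return mu2, sigma2

# With mu_1 = 0 and sigma_1 = 1 fixed in the linking, a hypothetical
# logit-scale estimate of (-0.15, 0.98) for PISA 2009 maps to:
mu2, sigma2 = to_pisa_metric(-0.15, 0.98, 0.0, 1.0, 506.8, 96.8)
# mu2 = 506.8 + 96.8 * (-0.15) = 492.28; sigma2 = 96.8 * 0.98 = 94.864
```

On this scale, the trend estimate of Table 5 is simply the difference of the transformed means of the two cycles.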

Results
In Table 5, the trend estimates for Austria from PISA 2006 to PISA 2009 are shown. For the linearly transformed scores, the trend estimate is given as ∆ = µ̂_2 − µ̂_1. There was a significant and large negative trend for Mathematics and Science, while the trend in Reading turned out to be smaller. Across both models (1PL and 2PL) and all linking methods, the average trend in Mathematics was M = −14.4 (SD = 1.1, Range = 3.6). There were slight differences between the 1PL and the 2PL models for the moment-based linking methods (i.e., logMM, HAB, MM, HAB-nolog). Notably, the variation of estimates across linking methods was larger for the 2PL model (SD = 1.3) than for the 1PL model (SD = 0.7). The trend in Reading was smaller (M = −5.4) and less variable (SD = 0.8, Range = 2.4) than in Mathematics. Finally, the trend in Science was also strongly negative (M = −14.5). The variability (SD = 1.3, Range = 5.2) was mainly caused by the variability across linking methods for the 2PL model. In contrast, the linking methods for the 1PL model yielded very similar estimates (SD = 0.3, Range = 0.8).
In Table 6, the estimated SD σ̂_2 in PISA 2009 is displayed for the 1PL and the 2PL models as a function of the linking method. Interestingly, the variability across linking methods turned out to be larger than for the mean estimate. These findings indicate that an adjustment for the average differences in item discriminations has to be conducted to avoid biased estimates of the means and SDs when a misspecified 1PL model is employed (see [99]). Note. logMM = log-mean-mean linking; HAB = Haberman linking with logarithmized item discriminations; MM = mean-mean linking; HAB-nolog = Haberman linking with untransformed item discriminations; IA2 = invariance alignment with power p = 2; HAE-asymm = asymmetric Haebara linking; HAE-symm = symmetric Haebara linking; HAE-joint = Haebara linking with joint item parameters; RC = recalibration linking (see Equation (40)); ANCH = anchored item parameters; CC = concurrent calibration.

Discussion
In this article, several linking methods for two groups were evaluated for the 2PL model in the case of normally distributed random DIF effects with zero means for item difficulties and item discriminations. Somewhat surprisingly, and contrary to the recommendations in the literature, moment-based linking methods (mean-mean and log-mean-mean linking, as well as Haberman linking) performed best in terms of the bias and RMSE. The unbiasedness of moment-based methods in the case of many items was expected due to our consistency results for the estimators presented in Section 3. When the primary criterion is the bias, symmetric Haebara linking (HAE-symm) and the newly proposed recalibration method RC3 are the only nonmoment-based methods that can be recommended. In contrast, the commonly used asymmetric Haebara linking (HAE-asymm) and the recalibration linking RC1 used in PIRLS and TIMSS showed substantially worse performance. Furthermore, note that concurrent calibration, which incorrectly assumes invariant item parameters across groups, and the anchored item parameters method provided biased estimates and cannot be recommended for operational use. Concurrent calibration achieves its promised maximal efficiency [67,70] only in small-to-moderate-sized samples, in the absence of random DIF, and with a correctly specified IRT model. If data in educational large-scale assessment studies indeed follow a random DIF distribution, the currently used linking methods (concurrent calibration, recalibration linking RC1) could be replaced by the better alternatives proposed in this study (moment-based linking, recalibration linking RC3). Of course, the extent of the bias and the loss in precision of the current methods depend on the variance of the DIF effects and can vary from study to study.
Our study assumed that the utilized scaling model (i.e., the 2PL model) was correctly specified. This assumption might be unrealistic in practice, and data could have been generated with much more complex item response functions [118][119][120][121][122] or multidimensional IRT models [123,124]. The performance of the linking methods for misspecified IRT models [125,126] in the presence of random DIF might be an exciting topic for future research [127,128].
In this article, we only considered random DIF effects with a normal distribution and zero means. To increase the relevance for practical applications, it might be interesting for future research to study DIF effects under different data-generating models. First, random DIF could be simulated in a partial invariance situation in which only a few items have DIF effects, while the majority of items have no DIF. Second, normally distributed random DIF and partial invariance could occur in tandem. DIF effects could then be simulated from a mixture distribution in which the first class includes normally distributed DIF effects with zero means, while the second class, of smaller proportion, includes outlying and large DIF effects. Due to the presence of such outliers, the average of the DIF effects will typically differ from zero. For this kind of DIF, robust linking methods must be employed to remove the outlying items from group comparisons [65,71,89,94,107].
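The mixture data-generating model described above can be sketched as follows. All parameter values (outlier proportion `pi_out`, outlier shift `delta`, and the function name) are illustrative assumptions, not specifications from the article:

```python
import numpy as np

def simulate_mixture_dif(n_items, tau, pi_out=0.1, delta=1.0, rng=None):
    """DIF effects from a two-class mixture: a large class of centered
    N(0, tau^2) effects and a small class shifted upward by delta."""
    rng = np.random.default_rng(rng)
    outlier = rng.random(n_items) < pi_out      # class membership
    e = rng.normal(0.0, tau, size=n_items)      # centered DIF effects
    e[outlier] += delta                         # outlying items: large positive DIF
    return e, outlier

e, out = simulate_mixture_dif(40, tau=0.25, pi_out=0.1, delta=1.0, rng=7)
# Because the outliers are one-sided, the mean of e typically exceeds zero,
# which is exactly the situation in which robust linking methods are needed.
```

With a symmetric outlier class (shifts of ±delta), the mean of the DIF effects would remain near zero, illustrating that it is the asymmetry of the outliers that biases nonrobust linking.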
In this article, we only studied linking in the two-group case. In the case of multiple groups, different linking methods can be employed [65,84,93,129,130]. However, the findings from two groups are likely to generalize to the multiple group case if the linking were to be performed based on sequences of pairwise linking approaches [131][132][133]. Comparing the performance of pairwise linking approaches and simultaneous linking of multiple groups would be an exciting topic for future research.
Our findings are likely to have even more impact on vertical scaling in which groups constitute time points or grades in the school career [134]. In this situation, differences in the means and SDs between groups are expected to be larger, and the consequences of choosing an inappropriate linking method can be much more pronounced [128,[135][136][137][138]. As pointed out by an anonymous reviewer, vertical linking relies on stronger assumptions than cross-sectional linking because younger test-takers could resort to guessing strategies when given the chance to respond to harder items designed for older test-takers (see also [25]).
It should be noted that we did not investigate the computation of standard errors for the linking methods. There is a rich literature that derives standard error formulas for linking due to the sampling of persons (e.g., [85,104,132,139,140,141]). In addition, variability in estimated group means and SDs due to the selection of items has been studied in the literature under the label of linking errors [66,72,99,142,143,144,145]. It might be interesting in future research to investigate standard errors that reflect both sources of uncertainty [72,84,132,146]. Procedures that rely on the resampling of persons and items will likely correctly reflect the uncertainty due to persons and items in the parameters of interest.

Appendix A.1. DIF Effects for Item Difficulties
First, we show that the mean E(e_i) must be set to zero for reasons of identification. Assume that there are additive DIF effects for item difficulties (see Equation (4)), and let the DIF effects e_i have a mean different from zero (i.e., E(e_i) = δ_e ≠ 0). Then, we can write e_i = δ_e + e*_i with E(e*_i) = 0. The IRF for item i for persons in the first group (i.e., g = 1) can be written as:

P(X_i = 1 | θ) = Ψ( a_i (θ − b_i) ) with θ ∼ N(0, 1) ,

where Ψ denotes the logistic function. For persons in the second group, the ability distribution is given by θ ∼ N(µ_2, σ_2²), and the IRF is:

P(X_i = 1 | θ) = Ψ( a_i (θ − b_i − e_i) ) = Ψ( a_i (θ − b_i − δ_e − e*_i) ) .

However, the IRF can be equivalently formulated as:

P(X_i = 1 | θ) = Ψ( a_i (θ̃ − b_i − e*_i) ) with θ̃ = θ − δ_e ∼ N(µ_2 − δ_e, σ_2²) .

Hence, the parameterization involving common item difficulties b_i, DIF effects e_i with E(e_i) = δ_e ≠ 0, and θ ∼ N(µ_2, σ_2²) for persons in the second group can be equivalently parameterized with common item difficulties b_i, DIF effects e*_i with E(e*_i) = 0, and abilities distributed as N(µ_2 − δ_e, σ_2²). This demonstrates that group mean differences for abilities cannot be identified if the average DIF effect E(e_i) is allowed to differ from zero.
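The equivalence of the two parameterizations can be checked numerically: shifting all DIF effects by δ_e while shifting the abilities by −δ_e leaves every IRF value, and hence the likelihood, unchanged. A small sketch (all numeric values are illustrative):

```python
import numpy as np

def irf_2pl(theta, a, b, e=0.0):
    """2PL item response function with an additive DIF effect e on difficulty."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b - e)))

rng = np.random.default_rng(0)
a, b = 1.2, 0.4
delta_e = 0.3                         # nonzero mean of the DIF effects
e_star = rng.normal(0.0, 0.2)         # centered DIF effect, E(e*) = 0
theta = rng.normal(0.5, 1.0, size=5)  # abilities of five group-2 persons

# Parameterization 1: DIF effect e = delta_e + e*, abilities theta
p1 = irf_2pl(theta, a, b, delta_e + e_star)
# Parameterization 2: centered DIF effect e*, abilities shifted by -delta_e
p2 = irf_2pl(theta - delta_e, a, b, e_star)
assert np.allclose(p1, p2)            # identical response probabilities
```

Since the data cannot distinguish the two parameterizations, the constraint E(e_i) = 0 must be imposed for the group mean to be identified.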

Appendix D. Estimates in Haberman Linking
For the HAB linking with logarithmized item loadings, the linking function for the standard deviation is given as:

H_1,log(σ_2, a_1, ..., a_I) = Σ_{i ∈ I_0} [ (log â_i1 − log a_i)² + (log â_i2 − log a_i − log σ_2)² ] .   (A25)

Taking the derivative of H_1,log with respect to log a_i for i ∈ I_0 and setting it to zero provides (up to multiplication with a constant):

(log â_i1 − log a_i) + (log â_i2 − log a_i − log σ_2) = 0 .   (A26)

Similarly, setting the derivative of H_1,log with respect to log σ_2 to zero provides:

Σ_{i ∈ I_0} (log â_i2 − log a_i − log σ_2) = 0 .   (A27)

Summing the equations of all |I_0| items in (A26) and subtracting (A27) provides:

Σ_{i ∈ I_0} (log â_i1 − log a_i) = 0 .   (A28)

With an estimate log σ̂_2 of log σ_2, we obtain from (A26):

log â_i = ( log â_i1 + log â_i2 − log σ̂_2 ) / 2 .   (A29)

Plugging (A29) into (A27) provides:

Σ_{i ∈ I_0} [ (log â_i2 − log â_i1)/2 − (log σ̂_2)/2 ] = 0 .   (A30)

Hence, it holds that:

log σ̂_2 = (1/|I_0|) Σ_{i ∈ I_0} ( log â_i2 − log â_i1 ) .   (A31)

For Haberman linking with untransformed item discriminations (HAB-nolog), the linking function for the standard deviation of the second group is given as:

H_1,nolog(σ_2, a_1, ..., a_I) = Σ_{i ∈ I_0} [ (â_i1 − a_i)² + (â_i2 − σ_2 a_i)² ] .   (A32)

Taking the derivative of H_1,nolog with respect to a_i for i ∈ I_0 and setting it to zero provides:

(â_i1 − a_i) + σ_2 (â_i2 − σ_2 a_i) = 0 .   (A33)

Estimates â_i are then given by:

â_i = ( â_i1 + σ_2 â_i2 ) / ( 1 + σ_2² ) .   (A34)

Setting the derivative of H_1,nolog with respect to σ_2 to zero provides:

Σ_{i ∈ I_0} a_i ( â_i2 − σ_2 a_i ) = 0 .   (A35)

Substituting (A34) into (A35) provides:

Σ_{i ∈ I_0} ( â_i1 + σ_2 â_i2 ) ( â_i2 − σ_2 â_i1 ) = 0 .   (A36)
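The two estimators of σ_2 can be checked numerically. In the sketch below (all item values are simulated), HAB uses the closed-form mean of log discrimination ratios, and, assuming the least-squares formulation of H_1,nolog, the estimating equation obtained after substituting (A34) into (A35) reduces to a quadratic in σ_2 that can be solved directly; in a DIF-free configuration, both estimators recover the true σ_2:

```python
import numpy as np

def hab_log_sigma2(a1, a2):
    """HAB: closed-form sigma_2 as the exp of the mean log discrimination ratio."""
    return np.exp(np.mean(np.log(a2) - np.log(a1)))

def hab_nolog_sigma2(a1, a2):
    """HAB-nolog: solve sum_i (a1_i + s*a2_i)(a2_i - s*a1_i) = 0 for s > 0."""
    # Expanding gives c + s*d - s^2*c = 0 with the sums below:
    c = np.sum(a1 * a2)
    d = np.sum(a2 ** 2 - a1 ** 2)
    # Positive root of the quadratic c*s^2 - d*s - c = 0:
    return (d + np.sqrt(d ** 2 + 4.0 * c ** 2)) / (2.0 * c)

rng = np.random.default_rng(3)
a = rng.uniform(0.5, 1.5, size=20)   # common item discriminations
sigma2_true = 1.2
a1, a2 = a, sigma2_true * a          # no DIF: group-2 discriminations scaled by sigma_2

s_log = hab_log_sigma2(a1, a2)       # -> 1.2
s_nolog = hab_nolog_sigma2(a1, a2)   # -> 1.2
```

With random DIF added to the discriminations, the two estimators differ, which is the source of the RMSE differences between HAB and HAB-nolog reported in the simulation study.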

Appendix E. Estimates in Invariance Alignment
We now derive closed-form expressions for the estimates from IA2 (see Section 3.4). To estimate σ_2, define: Then, it follows that: The derivative of f in (A38) is: To find the minimum, setting the derivative to zero provides: and we have: To estimate µ_2, define: We obtain: Setting the derivative of g with respect to µ_2 to zero provides: Then, we obtain using (A41):

Appendix F. Item Parameters Used in the Simulation Study
In Table A1, the item parameters used in the simulation study are shown. Item discriminations a_i had a mean of 1.00 (SD = 0.28, Min = 0.50, Max = 1.42), and item difficulties b_i had a mean of 0.00 (SD = 1.00, Min = −1.62, Max = 1.39). Note: a_i = item discrimination; b_i = item difficulty.