Article

Comparing Different Specifications of Mean–Geometric Mean Linking

by
Alexander Robitzsch
1,2
1
IPN—Leibniz Institute for Science and Mathematics Education, Olshausenstraße 62, 24118 Kiel, Germany
2
Centre for International Student Assessment (ZIB), Olshausenstraße 62, 24118 Kiel, Germany
Foundations 2025, 5(2), 20; https://doi.org/10.3390/foundations5020020
Submission received: 3 May 2025 / Revised: 27 May 2025 / Accepted: 4 June 2025 / Published: 6 June 2025
(This article belongs to the Section Mathematical Sciences)

Abstract

Mean–geometric mean (MGM) linking compares group differences on a latent variable θ within the two-parameter logistic (2PL) item response theory model. This article investigates three specifications of MGM linking that differ in the weighting of item difficulty differences: unweighted (UW), discrimination-weighted (DW), and precision-weighted (PW). These methods are evaluated under conditions where random DIF effects are present in either item difficulties or item intercepts. The three estimators are analyzed both analytically and through a simulation study. The PW method outperforms the other two only in the absence of random DIF or in small samples when DIF is present. In larger samples, the UW method performs best when random DIF with homogeneous variances affects item difficulties, while the DW method achieves superior performance when such DIF is present in item intercepts. The analytical results and simulation findings consistently show that the PW method introduces bias in the estimated group mean when random DIF is present. Given that the effectiveness of MGM methods depends on the type of random DIF, the distribution of DIF effects was further examined using PISA 2006 reading data. The model comparisons indicate that random DIF with homogeneous variances in item intercepts provides a better fit than random DIF in item difficulties in the PISA 2006 reading dataset.

1. Introduction

Item response theory (IRT) models [1,2,3] provide a statistical framework for modeling multivariate discrete outcomes. This work specifically addresses binary item responses and explores methods for comparing two populations using linking techniques. Consider a response vector $X = (X_1, \ldots, X_I)$, where each variable $X_i \in \{0, 1\}$ represents a dichotomously scored item. A unidimensional IRT model [4] specifies the joint probability distribution $P(X = x)$ for response patterns $x = (x_1, \ldots, x_I) \in \{0, 1\}^I$ through a parametric formulation:
$$P(X = x; \delta, \gamma) = \int \prod_{i=1}^{I} P_i(\theta; \gamma_i)^{x_i} \left[ 1 - P_i(\theta; \gamma_i) \right]^{1 - x_i} f(\theta; \mu, \sigma) \, d\theta , \tag{1}$$
where $f$ denotes the normal density function, parameterized by the mean $\mu$ and the standard deviation (SD) $\sigma$. The distribution parameters $\mu$ and $\sigma$ of the latent variable $\theta$, often referred to as a trait or ability variable, are collected in the vector $\delta = (\mu, \sigma)$. The vector $\gamma = (\gamma_1, \ldots, \gamma_I)$ collects the item parameters for the item response functions (IRFs) $P_i(\theta; \gamma_i) = P(X_i = 1 \mid \theta)$ for $i = 1, \ldots, I$. The IRF of the two-parameter logistic (2PL) model [5] is defined by
$$P_i(\theta; \gamma_i) = \Psi\left( a_i (\theta - b_i) \right), \tag{2}$$
where $a_i$ and $b_i$ represent the item discrimination and the item difficulty, respectively. The function $\Psi(x) = (1 + \exp(-x))^{-1}$ corresponds to the standard logistic distribution function. In this formulation, the item parameter vector is $\gamma_i = (a_i, b_i)$. Alternatively, the 2PL model can be reparametrized by replacing the difficulty parameter $b_i$ with the intercept $\nu_i$, resulting in
$$P_i(\theta; \gamma_i) = \Psi\left( a_i \theta + \nu_i \right). \tag{3}$$
The two 2PL parameterizations are related by the identity $\nu_i = -a_i b_i$ (see [6,7]). In this parametrization, the item parameter vector is $\gamma_i = (a_i, \nu_i)$.
Given a sample of $N$ individuals with independent and identically distributed response vectors $x_1, \ldots, x_N$ drawn from the distribution of $X$, the parameters of the IRT model specified in (1) can be consistently estimated through marginal maximum likelihood (MML) methods [8,9].
IRT models are widely employed to compare the test performance of two groups by assessing differences in the parameters of the latent variable θ , as defined in the IRT framework of (1). This article specifically examines linking methods [10] based on the 2PL model.
In the first step of the linking approach, the 2PL model is estimated separately for each group, allowing for the presence of differential item functioning (DIF), where item behavior may vary across groups [11,12,13]. More specifically, item parameters are permitted to differ between groups, indicating that the groups may respond differently to an item even after accounting for overall differences in the θ variable. In the second step, differences in item parameters are used to estimate group differences in θ through a linking procedure [10,14,15].
This article evaluates the performance of mean–geometric mean (MGM; [7,10,16,17,18,19]) linking in the presence of DIF [13] in either item difficulties or item intercepts. The standard MGM method is based on the mean difference of log-transformed item discriminations for estimating the group SD and the mean difference of untransformed item difficulties for estimating the group mean. This study considers three specifications of MGM linking under random DIF [20,21,22] in item difficulties or intercepts. Prior research has shown that random DIF contributes to increased variance in the estimated linking parameters [23,24]. This concept is also referred to as a linking error in educational large-scale assessment studies [23,25,26,27,28,29,30,31,32].
The three MGM specifications considered here differ in the weighting of item difficulty differences when computing the group mean. To the best of the authors’ knowledge, the performance of these MGM variants under random DIF has not yet been systematically examined. The performance of the MGM estimators is assessed analytically and through a simulation study.
The remainder of the article is organized as follows. Section 2 reviews the MGM specifications. Section 3 presents the results from a simulation study comparing their performance. Section 4 provides an empirical illustration using PISA data. Section 5 concludes with a discussion.

2. Mean–Geometric Mean Linking

2.1. Identified Item Parameters in Separate Scaling

The various specifications of the MGM method are based on item parameters from the 2PL model, estimated separately for each group. The following describes the identification of item parameters under the assumption that no DIF is present in item discriminations $a_i$ or item difficulties $b_i$. In both groups, the latent variable $\theta$ is standardized by fixing its mean and SD to 0 and 1, respectively, allowing all item parameters to be estimated within each group. In the first group, the identified item parameters are given by $\hat{a}_{i1} = a_i$, $\hat{b}_{i1} = b_i$, and $\hat{\nu}_{i1} = \nu_i = -a_i b_i$, where $a_i$ and $b_i$ represent the invariant item parameters in the 2PL model across groups.
In the second group, the latent variable θ is assumed to have a mean μ and SD σ . By fixing the θ mean to 0 and SD to 1, the identified item parameters from the separate 2PL model estimation in this group are given by
$$\hat{a}_{i2} = \sigma a_i \quad \text{or} \quad \hat{\alpha}_{i2} = \log \hat{a}_{i2} = \log \sigma + \log a_i , \tag{4}$$
$$\hat{b}_{i2} = \sigma^{-1} (b_i - \mu), \quad \text{and} \tag{5}$$
$$\hat{\nu}_{i2} = a_i \mu + \nu_i . \tag{6}$$
The MGM method aims to recover the parameters $\mu$ and $\sigma$ using the group-specific item parameters $\hat{a}_{ig}$ and $\hat{b}_{ig}$ ($i = 1, \ldots, I$; $g = 1, 2$), obtained from separate estimations under the 2PL model.

2.2. Weighted Means

The different specifications of MGM linking for estimating $\mu$ are essentially different weighted means of item difficulty differences. The following briefly reviews the statistical properties of such weighted means. Let $Y_i$ ($i = 1, \ldots, I$) denote normally distributed observations with mean $\mu$ and variances $\sigma_i^2$. A weighted mean with fixed weights $w_i$ is defined as
$$\bar{Y}_w = \frac{\frac{1}{I} \sum_{i=1}^{I} w_i Y_i}{\frac{1}{I} \sum_{i=1}^{I} w_i} . \tag{7}$$
Note that the multiplication factor 1 / I in (7) could be omitted; however, it is included to maintain consistency with later expressions in the various specifications of MGM linking. The expected value of Y ¯ w is μ , and its variance is given by
$$\mathrm{Var}(\bar{Y}_w) = \frac{\sum_{i=1}^{I} w_i^2 \sigma_i^2}{\left( \sum_{i=1}^{I} w_i \right)^2} . \tag{8}$$
The minimal variance in (8) is attained when the observations Y i are weighted by their precisions 1 / σ i 2 (i.e., the inverses of their variances) (see [33]). A weighted mean using these weights is commonly referred to as a precision-weighted mean.
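The variance advantage of precision weighting can be checked with a small Monte Carlo sketch. The sketch below is written in Python for illustration (the article's own analyses use R), and the number of observations, the variance range, and the mean are arbitrary assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
I, mu = 20, 0.3
sigma2 = rng.uniform(0.5, 2.0, I)       # heterogeneous variances sigma_i^2

# simulate R replications of Y_i ~ N(mu, sigma_i^2)
R = 50_000
Y = rng.normal(mu, np.sqrt(sigma2), size=(R, I))

# unweighted mean vs. precision-weighted mean (weights w_i = 1 / sigma_i^2)
w = 1.0 / sigma2
mean_uw = Y.mean(axis=1)
mean_pw = (Y * w).sum(axis=1) / w.sum()

# theoretical variances: Sum(sigma_i^2) / I^2 for equal weights,
# and the minimum 1 / Sum(w_i) for precision weights, cf. Equation (8)
var_uw_theory = sigma2.mean() / I
var_pw_theory = 1.0 / w.sum()
```

Both estimators are unbiased for $\mu$, but the precision-weighted mean attains the minimal variance $1/\sum_i w_i$ implied by (8), while the unweighted mean does not when the $\sigma_i^2$ differ.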

2.3. Random DIF in Item Difficulties or Item Intercepts

The occurrence of random DIF [20,34] can be characterized by whether DIF manifests in item difficulties [34] or item intercepts [35]. Assume that DIF arises only as deviations in the item parameters of the second group relative to the first group, and that item discriminations are invariant across groups. Let $e_i$ and $\epsilon_i$ denote random DIF effects in item difficulties and item intercepts, respectively, both with zero means, under the assumptions
$$b_{i2} = b_i + e_i \quad \text{and} \quad \nu_{i2} = \nu_i + \epsilon_i . \tag{9}$$
The two DIF effects are related by
$$\epsilon_i = -a_i e_i \quad \text{or} \quad e_i = -\epsilon_i / a_i . \tag{10}$$
If the DIF effects e i in item difficulties have variances τ i 2 , corresponding to random DIF, the variances of the DIF effects ϵ i in item intercepts are given by
$$\mathrm{Var}(\epsilon_i) = a_i^2 \, \mathrm{Var}(e_i) . \tag{11}$$
If the random DIF effects e i have homogeneous variances τ 2 , it follows from (11) that the corresponding DIF effects ϵ i in item intercepts exhibit heterogeneous variances a i 2 τ 2 . Conversely, if the DIF effects ϵ i in item intercepts have homogeneous variances τ 2 , then the DIF effects e i in item difficulties have heterogeneous variances τ 2 / a i 2 .
The formulation of random DIF in terms of item difficulties or item intercepts is statistically equivalent when allowing for heterogeneous variances τ i 2 . However, empirical analyses may test whether random DIF with homogeneous variance is more plausible in item difficulties or in item intercepts. As will be shown later, the performance of the different MGM specifications depends on whether random DIF with homogeneous variances occurs in item difficulties or in item intercepts.
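The correspondence between the two DIF parameterizations is easy to verify numerically. The following Python sketch uses arbitrary discriminations and an arbitrary DIF SD; it simulates homogeneous random DIF in item difficulties and recovers the heterogeneous intercept variances $a_i^2 \tau^2$ from (11). The sign convention in (10) does not affect the variances.

```python
import numpy as np

rng = np.random.default_rng(1)
a = np.array([0.6, 0.9, 1.2, 1.5])     # item discriminations (illustrative)
tau = 0.5                               # homogeneous DIF SD in item difficulties
R = 200_000

# random DIF in item difficulties with homogeneous variance tau^2
e = rng.normal(0.0, tau, size=(R, a.size))

# implied DIF in item intercepts, Equation (10); the sign is
# irrelevant for the variance comparison below
eps = -a * e

# Equation (11): Var(eps_i) = a_i^2 * Var(e_i) -> heterogeneous a_i^2 * tau^2
emp = eps.var(axis=0)
theory = a**2 * tau**2
```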

2.4. Estimation of σ in MGM Linking

In MGM linking, the SD σ is estimated using the means of log-transformed item discriminations. Specifically, the estimate σ ^ is computed as (see [10,16])
$$\hat{\sigma} = \exp\left( \frac{1}{I} \sum_{i=1}^{I} \log \hat{a}_{i2} - \frac{1}{I} \sum_{i=1}^{I} \log \hat{a}_{i1} \right) . \tag{12}$$
Since averages on the logarithmic scale are used, this method is referred to as log-mean-mean linking.
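A minimal sketch of Equation (12) (Python, with illustrative parameter values), assuming error-free identified discriminations from (4), shows that the difference of log-means recovers $\sigma$ exactly in the absence of sampling error and DIF:

```python
import numpy as np

rng = np.random.default_rng(2)
I, sigma = 20, 1.2
a = rng.uniform(0.6, 1.6, I)      # invariant item discriminations

# identified discriminations from separate scalings, Equation (4)
a_hat1 = a                         # group 1
a_hat2 = sigma * a                 # group 2

# Equation (12): exponentiated difference of mean log-discriminations
sigma_hat = np.exp(np.mean(np.log(a_hat2)) - np.mean(np.log(a_hat1)))
```

With sampling error, $\hat{\sigma}$ fluctuates around $\sigma$; the exact recovery here only reflects the noise-free setup.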

2.5. Estimation of μ in MGM Linking

The estimation of μ is now addressed. Three variants of weighted means of item difficulties are considered to derive the estimate μ ^ . In the formal treatment, assume that random DIF occurs in item difficulties, with E ( e i ) = 0 and Var ( e i ) = τ i 2 . If random DIF in item difficulties exhibits homogeneous variance, then τ i 2 = τ 2 . Alternatively, if random DIF with homogeneous variances occurs in item intercepts, then τ i 2 = τ 2 / a i 2 .
In addition to random DIF, sampling errors affect the item parameter estimates. Let $u_{ig}$ denote the sampling error in the estimated item difficulty ($i = 1, \ldots, I$; $g = 1, 2$). The estimated item difficulty in the first group is then given by
$$\hat{b}_{i1} = b_i + u_{i1} \quad \text{with} \quad E(u_{i1}) = 0 \quad \text{and} \quad \mathrm{Var}(u_{i1}) \approx \frac{C_0 + C_1 b_i^2}{N a_i^2} , \tag{13}$$
where C 0 and C 1 are constants that depend on the dataset. The variance expression in (13) is supported by empirical evidence from simulation studies [9,36].
The estimated item difficulty in the second group satisfies
$$\hat{b}_{i2} = \sigma^{-1} (b_i + e_i - \mu) + u_{i2} \quad \text{with} \quad E(u_{i2}) = 0 \quad \text{and} \quad \mathrm{Var}(u_{i2}) \approx \frac{C_0 \sigma^2 + C_1 (b_i + e_i - \mu)^2}{N a_i^2 \sigma^2} . \tag{14}$$
Here, u i 2 represents the sampling error, while e i denotes the random DIF effect. For a sufficiently large number of items, the estimated item difficulties can be treated as approximately independent across items [36].

2.5.1. Unweighted MGM Linking (UW)

The original variant of MGM linking for estimating μ is based on the difference in item difficulties, defined as
$$\hat{\mu} = -\frac{1}{I} \sum_{i=1}^{I} \hat{\sigma} \hat{b}_{i2} + \frac{1}{I} \sum_{i=1}^{I} \hat{b}_{i1} = -\frac{1}{I} \sum_{i=1}^{I} \left( \hat{\sigma} \hat{b}_{i2} - \hat{b}_{i1} \right) . \tag{15}$$
This estimator uses the previously calculated SD σ ^ from (12) and applies equal weights to the differences σ ^ b ^ i 2 b ^ i 1 in item difficulties. For this reason, the estimator in (15) is referred to as unweighted MGM linking (UW).
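Under the identification in Equation (5), the UW estimator can be verified in a few lines. The Python sketch below uses illustrative parameter values and follows the sign convention of Equation (15); without DIF and sampling error, it recovers $\mu$ exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
I, mu, sigma = 20, 0.3, 1.2
b = rng.uniform(-1.4, 1.4, I)           # invariant item difficulties

# identified difficulties from separate scalings, Equation (5)
b_hat1 = b                               # group 1
b_hat2 = (b - mu) / sigma                # group 2

sigma_hat = sigma                        # assume sigma was recovered via Equation (12)

# unweighted MGM estimate of mu, Equation (15)
mu_hat_uw = -np.mean(sigma_hat * b_hat2 - b_hat1)
```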
To examine the statistical properties of μ ^ , the expression in (15) is rewritten using (13) and (14) as
$$\hat{\mu} = \frac{\hat{\sigma}}{\sigma} \mu - \frac{\hat{\sigma} - \sigma}{\sigma} \cdot \frac{1}{I} \sum_{i=1}^{I} b_i - \frac{\hat{\sigma}}{\sigma} \cdot \frac{1}{I} \sum_{i=1}^{I} e_i - \frac{1}{I} \sum_{i=1}^{I} \left( \hat{\sigma} u_{i2} - u_{i1} \right) . \tag{16}$$
Since $E(\hat{\sigma}) \to \sigma$ as $I \to \infty$, it follows that $E(\hat{\mu}) \to \mu$ under the assumptions $E(e_i) = E(u_{ig}) = 0$. Thus, a simplified form of $\hat{\mu}$ is given by (16) as
$$\hat{\mu} = \mu - \frac{1}{I} \sum_{i=1}^{I} e_i - \frac{1}{I} \sum_{i=1}^{I} \left( \hat{\sigma} u_{i2} - u_{i1} \right) . \tag{17}$$
To derive the variance of $\hat{\mu}$, assume that $\mathrm{Var}(\hat{\sigma}) \approx 0$. Then, from (17),
$$\mathrm{Var}(\hat{\mu}) = \frac{1}{I^2} \sum_{i=1}^{I} \tau_i^2 + \frac{1}{I^2} \sum_{i=1}^{I} \left[ \sigma^2 \, \mathrm{Var}(u_{i2}) + \mathrm{Var}(u_{i1}) \right] . \tag{18}$$
Equation (18) shows that the variance of μ ^ consists of two components: the variance due to random DIF effects e i and the variance from sampling errors u i g . Using (13) and (14), the expression can be further rephrased as
$$\mathrm{Var}(\hat{\mu}) = \frac{1}{I^2} \sum_{i=1}^{I} \tau_i^2 + \frac{1}{N I^2} \sum_{i=1}^{I} \frac{C_0 (\sigma^2 + 1) + C_1 \left[ (b_i + e_i - \mu)^2 + b_i^2 \right]}{a_i^2} . \tag{19}$$
As the sample size increases, the contribution of the sampling error variance diminishes. However, the variance component due to random DIF remains nonzero even in the limit of infinite sample size.

2.5.2. Discrimination-Weighted MGM Linking (DW)

An alternative MGM linking estimate for μ relies on the identification of Equation (6). Following the rationale used in the invariance alignment [37,38] method, the absence of DIF effects yields the identity
$$\hat{\nu}_{i2} - \hat{\nu}_{i1} - \hat{a}_{i2} \sigma^{-1} \mu = 0 . \tag{20}$$
This identity motivates the estimation of μ as the minimizer of
$$H(\mu) = \sum_{i=1}^{I} \left( \hat{\nu}_{i2} - \hat{\nu}_{i1} - \hat{a}_{i2} \hat{\sigma}^{-1} \mu \right)^2 , \tag{21}$$
which leads to the estimator
$$\hat{\mu} = \hat{\sigma} \, \frac{\frac{1}{I} \sum_{i=1}^{I} \left( \hat{\nu}_{i2} - \hat{\nu}_{i1} \right) \hat{a}_{i2}}{\frac{1}{I} \sum_{i=1}^{I} \hat{a}_{i2}^2} . \tag{22}$$
The estimator μ ^ in (22) can be further rewritten as (see [19])
$$\hat{\mu} = -\hat{\sigma} \, \frac{\frac{1}{I} \sum_{i=1}^{I} \left( \hat{a}_{i2} \hat{b}_{i2} - \hat{a}_{i1} \hat{b}_{i1} \right) \hat{a}_{i2}}{\frac{1}{I} \sum_{i=1}^{I} \hat{a}_{i2}^2} . \tag{23}$$
To analyze the statistical properties of $\hat{\mu}$ as defined in (23), simplifying assumptions are applied: $\mathrm{Var}(\hat{\sigma}) \approx 0$, $\hat{a}_{i1} \approx a_i$, and $\hat{a}_{i2} \approx \sigma a_i$. Under these assumptions, the estimator simplifies to
$$\hat{\mu} = -\frac{\sum_{i=1}^{I} a_i^2 \left( \sigma \hat{b}_{i2} - \hat{b}_{i1} \right)}{\sum_{i=1}^{I} a_i^2} . \tag{24}$$
This expression reveals that $\hat{\mu}$ is a weighted average of item difficulty differences, where the weights are proportional to the squared item discriminations. Consequently, the estimator in (24) is referred to as discrimination-weighted MGM linking (DW).
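A noise-free sketch of Equation (24) (Python, with illustrative values) shows that DW is simply a discrimination-weighted version of the UW computation and likewise recovers $\mu$ when no DIF or sampling error is present.

```python
import numpy as np

rng = np.random.default_rng(4)
I, mu, sigma = 20, 0.3, 1.2
a = rng.uniform(0.6, 1.6, I)            # item discriminations
b = rng.uniform(-1.4, 1.4, I)           # item difficulties

# identified difficulties from separate scalings, Equation (5)
b_hat1 = b
b_hat2 = (b - mu) / sigma

# discrimination-weighted MGM estimate, Equation (24):
# weights proportional to squared item discriminations
w = a**2
mu_hat_dw = -np.sum(w * (sigma * b_hat2 - b_hat1)) / np.sum(w)
```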
The estimator μ ^ in (24) can be expressed in terms of the DIF effects e i and the sampling errors u i g as
$$\hat{\mu} = \mu - \frac{\sum_{i=1}^{I} a_i^2 e_i}{\sum_{i=1}^{I} a_i^2} - \frac{\sum_{i=1}^{I} a_i^2 \left( \sigma u_{i2} - u_{i1} \right)}{\sum_{i=1}^{I} a_i^2} . \tag{25}$$
As with the UW estimator, this formulation yields an asymptotically unbiased estimate of $\mu$, i.e., $E(\hat{\mu}) \to \mu$ as $I \to \infty$. The variance of $\hat{\mu}$ is given by
$$\mathrm{Var}(\hat{\mu}) = \frac{\sum_{i=1}^{I} a_i^4 \tau_i^2}{\left( \sum_{i=1}^{I} a_i^2 \right)^2} + \frac{\sum_{i=1}^{I} a_i^4 \left[ \sigma^2 \, \mathrm{Var}(u_{i2}) + \mathrm{Var}(u_{i1}) \right]}{\left( \sum_{i=1}^{I} a_i^2 \right)^2} , \tag{26}$$
which can be further simplified using the expressions for Var ( u i g ) from (19) as
$$\mathrm{Var}(\hat{\mu}) = \frac{\sum_{i=1}^{I} a_i^4 \tau_i^2}{\left( \sum_{i=1}^{I} a_i^2 \right)^2} + \frac{1}{N} \cdot \frac{\sum_{i=1}^{I} a_i^2 \left\{ C_0 (\sigma^2 + 1) + C_1 \left[ (b_i + e_i - \mu)^2 + b_i^2 \right] \right\}}{\left( \sum_{i=1}^{I} a_i^2 \right)^2} . \tag{27}$$
The first variance component in (27) indicates that an optimal estimate of $\mu$ is obtained when the DIF effects $e_i$ satisfy $\mathrm{Var}(e_i) = a_i^{-2} \tau^2$, corresponding to random DIF with homogeneous variances in item intercepts. In this case, the weighting by item discriminations enhances precision, as the sampling variance of estimated item difficulties is also proportional to $a_i^{-2}$. However, if random DIF with homogeneous variance occurs in item difficulties rather than in item intercepts, the discrimination-based weighting in DW may result in a higher variance compared to the equal weighting used in UW, particularly in large samples where the contribution from sampling error becomes negligible.

2.5.3. Precision-Weighted MGM Linking (PW)

The UW linking method assigns equal weights to the item difficulty differences σ ^ b ^ i 2 b ^ i 1 . As an alternative, these differences can be weighted by their precisions, that is, the inverse of their sampling variances [7,39]. This approach yields the estimator
$$\hat{\mu} = -\frac{\sum_{i=1}^{I} \omega_i \left( \hat{\sigma} \hat{b}_{i2} - \hat{b}_{i1} \right)}{\sum_{i=1}^{I} \omega_i} , \tag{28}$$
where the precision weights ω i must be estimated. The variances Var ( b ^ i 1 ) and Var ( b ^ i 2 ) are obtained from the observed information matrix in the group-wise scaling models. Based on these variances, the weights ω i are defined as
$$\omega_i = \left[ \hat{\sigma}^2 \, \mathrm{Var}(\hat{b}_{i2}) + \mathrm{Var}(\hat{b}_{i1}) \right]^{-1} . \tag{29}$$
Using (13) and (14), the precision weights can be approximately determined as
$$\omega_i = \left[ \hat{\sigma}^2 \, \mathrm{Var}(u_{i2}) + \mathrm{Var}(u_{i1}) \right]^{-1} = \frac{N a_i^2}{C_0 (\sigma^2 + 1) + C_1 \left[ (b_i + e_i - \mu)^2 + b_i^2 \right]} . \tag{30}$$
Importantly, (30) highlights that the estimated precision weights ω i depend on the random DIF effects e i . For small values of e i , a linear Taylor approximation of (30) yields
$$\omega_i \approx \omega_{i0} - \omega_{i1} e_i (b_i - \mu) , \quad \text{where} \tag{31}$$
$$\omega_{i0} = N a_i^2 h_i , \quad \omega_{i1} = 2 C_1 N a_i^2 h_i^2 , \quad \text{and} \quad h_i = \left\{ C_0 (\sigma^2 + 1) + C_1 \left[ (b_i - \mu)^2 + b_i^2 \right] \right\}^{-1} . \tag{32}$$
Note that ω i 0 and ω i 1 are independent of the random DIF effect e i .
The estimator μ ^ in (28) can be rephrased as
$$\hat{\mu} \approx \mu - \frac{\sum_{i=1}^{I} \left[ \omega_{i0} - \omega_{i1} e_i (b_i - \mu) \right] e_i}{\sum_{i=1}^{I} \left[ \omega_{i0} - \omega_{i1} e_i (b_i - \mu) \right]} - \frac{\sum_{i=1}^{I} \left[ \omega_{i0} - \omega_{i1} e_i (b_i - \mu) \right] \left( \hat{\sigma} u_{i2} - u_{i1} \right)}{\sum_{i=1}^{I} \left[ \omega_{i0} - \omega_{i1} e_i (b_i - \mu) \right]} . \tag{33}$$
Assuming independence between $e_i$ and $u_{ig}$, the expectation of $\hat{\mu}$ as $I \to \infty$ is given by
$$E(\hat{\mu}) \approx \mu - \frac{\sum_{i=1}^{I} \omega_{i1} \tau_i^2 (\mu - b_i)}{\sum_{i=1}^{I} \omega_{i0}} \quad \text{for} \quad I \to \infty . \tag{34}$$
According to (34), the PW linking method may produce a negatively biased estimate of μ when μ is, on average, greater than b i . However, in the absence of random DIF, the PW method does not exhibit bias in the estimation of μ .
The variance of the PW estimate can be derived analogously to that of the UW and DW estimators, although it offers limited additional insight. By construction, the PW linking method yields the smallest variance in the absence of DIF, as it employs optimal precision weights.
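The bias mechanism can be made concrete with a small simulation (Python sketch; the constants $C_0 = C_1 = 1$, unit discriminations, and all parameter values are illustrative assumptions). It mimics the infinite-sample case: sampling errors are set to zero, but precision weights of the form (30) still depend on the DIF effects $e_i$.

```python
import numpy as np

rng = np.random.default_rng(5)
I, R = 20, 5000
mu, sigma, tau = 0.5, 1.2, 0.5
b = np.full(I, -1.0)                   # all difficulties below mu
C0 = C1 = 1.0                          # assumed constants of the variance formula

mu_uw = np.empty(R)
mu_pw = np.empty(R)
for r in range(R):
    e = rng.normal(0.0, tau, I)        # random DIF in item difficulties
    # infinite-N case: the differences sigma * b_hat2 - b_hat1 reduce to -mu + e_i
    d = -mu + e
    # precision weights in the spirit of Equation (30); N and a_i^2 cancel
    w = 1.0 / (C0 * (sigma**2 + 1) + C1 * ((b + e - mu)**2 + b**2))
    mu_uw[r] = -d.mean()
    mu_pw[r] = -(w * d).sum() / w.sum()
```

Because all $b_i$ lie below $\mu$, the weights are positively correlated with $e_i$, and the PW mean is pulled below the true $\mu$, while the unweighted estimate remains unbiased, in line with (34).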

3. Simulation Study

In this simulation study, the performances of the three MGM linking specifications (i.e., UW, DW, and PW) outlined in Section 2.5 are compared.

3.1. Method

The data-generating model was based on the 2PL model applied to two groups. For the first group, the latent variable θ followed a standard normal distribution with a fixed mean of 0 and SD of 1. For the second group, θ was also normally distributed with a fixed mean μ = 0.3 and SD σ = 1.2 , which was consistent across all simulation conditions.
The simulation study used $I = 20$ or $I = 40$ items. Group-specific item parameters $a_{ig}$ and $b_{ig}$ for each item $i = 1, \ldots, I$ and for groups $g = 1, 2$ were derived from fixed base parameters and newly simulated random DIF effects in each replication. The item parameters were constructed using 10 base items. These base items were duplicated twice in the 20-item condition and four times in the 40-item condition. For the 10 base items, the base item discriminations $a_{i0}$ were set to 0.6 for the first five items and 1.2 for the remaining five. Base item difficulties were assigned values of $-1.4$, $-0.7$, 0.0, 0.7, and 1.4 for the first five items, with the same sequence repeated for the remaining items. The complete set of item parameters is available at https://osf.io/xa4qz (accessed on 3 May 2025).
For the first group, item discriminations and item difficulties were set to the base item parameters. In the second group, DIF effects with a homogeneous variance were introduced either in item difficulties or item intercepts. A normally distributed random DIF effect with DIF SD τ was added to the corresponding item difficulty or item intercept. The DIF SD τ was chosen as 0, 0.25, or 0.5. Combined with the type of DIF effects (i.e., in item difficulties b i or item intercepts ν i ), five different DIF conditions (i.e., no DIF, τ = 0.25 and DIF in b i , τ = 0.5 and DIF in b i , τ = 0.25 and DIF in ν i , and τ = 0.5 and DIF in ν i ) were simulated. Item discriminations in the second group were kept identical to the base values, ensuring no DIF in discrimination parameters.
Per-group sample sizes of N = 500 , 1000, 2000, and infinity (denoted as Inf) were selected to represent typical ranges encountered in medium- to large-scale testing scenarios involving the 2PL model [40]. For infinite sample sizes, no item responses were simulated. However, the item parameters used in MGM linking still included the random DIF effects in this case.
In each of the 4 (sample size N) × 2 (number of items I) × 5 (random DIF conditions) = 40 simulation conditions, 7500 replications were conducted. The three MGM specifications—UW, DW, and PW—were applied to the simulated datasets. The bias, SD, and root mean square error (RMSE) of the estimated mean μ ^ were computed. The relative RMSE of the μ ^ estimator was defined as the RMSE of a given method divided by the RMSE of the UW method, which served as the reference.
All analyses in this simulation study were performed using R (Version 4.4.1; [41]). The 2PL model was fitted using the sirt::xxirt() function from the R package sirt (Version 4.2-114; [42]). Dedicated functions were developed to estimate the different MGM models. Replication materials for this study can be accessed at https://osf.io/xa4qz (accessed on 3 May 2025).

3.2. Results

Table 1 presents the bias of the estimated group mean μ ^ as a function of the number of items I and the sample size N. In the absence of DIF ( τ = 0 ), all three MGM methods yielded unbiased estimates. When DIF was present in either item difficulties b i or item intercepts ν i , the UW and DW methods continued to produce unbiased estimates. Consistent with the analytical findings in Section 2.5.3, the PW method exhibited bias under these conditions. The magnitude of the bias increased with larger DIF SD τ . Notably, the bias of the PW method did not diminish with increasing sample size N.
Table 2 reports the SD of the estimated group mean μ ^ as a function of the number of items I and the sample size N. As expected, the SD decreased with increasing sample size and increased with higher DIF SD τ . The SD also declined with a larger number of items. In the no DIF condition ( τ = 0 ), the PW method produced estimates with the lowest SD, followed by the DW and UW methods. When random DIF was present in item difficulties b i , PW resulted in the smallest SD for smaller sample sizes, whereas UW became more efficient than DW and PW as the sample size increased. A comparable pattern emerged for DIF in item intercepts ν i , with the distinction that DW instead of UW yielded the smallest SD in larger samples.
Table 3 presents the relative RMSE of the estimated group mean μ ^ as a function of the number of items I and the sample size N. The PW method exhibited the lowest RMSE in the no DIF condition and in DIF conditions with small sample sizes. In large samples, the UW method yielded the smallest RMSE when DIF was present in item difficulties b i . In contrast, in conditions with DIF in item intercepts ν i , the DW and PW methods outperformed UW, with DW showing a slight efficiency advantage over PW at larger sample sizes.
Overall, the results of this simulation study showed that the performance of the UW, DW, and PW methods depended on the type of simulated DIF effects. The PW linking method yielded unbiased and efficient estimates only in the absence of random DIF. In DIF conditions with item difficulties affected, the UW method outperformed DW. Conversely, when DIF was simulated in item intercepts, the DW method was superior to UW.

4. Empirical Example: PISA 2006 Reading

The simulation study presented in Section 3 demonstrates that the performance of the three MGM methods depends on the presence and nature of random DIF in the data. To investigate whether random DIF occurs in item intercepts or item difficulties, the PISA 2006 dataset [43] for the reading domain was analyzed. This dataset includes participants from 26 selected countries (see Appendix A) that participated in the PISA 2006 study. The full PISA 2006 dataset is publicly accessible at https://www.oecd.org/en/data/datasets/pisa-2006-database.html (accessed on 3 May 2025).
Items in the reading domain were administered to a subset of students participating in the PISA 2006 study. The analysis included students who had been administered at least one item from the respective cognitive domain. In total, the analysis included 110,236 students, with sample sizes per country ranging from 2010 to 12,142 ( M = 4239.8 , S D = 3046.7 ).
A few of the 28 reading items were originally scored polytomously but were recoded into dichotomous scores for simplicity in this empirical example, with only the highest category scored as correct. The remaining items were handled as dichotomous, consistent with their original treatment in PISA.
Student (sampling) weights were applied in all analyses. To guarantee equal influence from each country, weights within each country were normalized to sum to 5000. It should be noted that the choice of 5000 is arbitrary; any constant value would serve equally well to balance contributions across countries.
In the first step, international item parameters were estimated by fitting the 2PL model to the weighted, combined dataset for each domain. These item parameters, along with other relevant information, are presented in Table 4. The average item discrimination was 1.402, suggesting a relatively well-discriminating test, while the average item difficulty was $-0.163$, indicating that the items were slightly easier relative to the ability of students in the total population.
In the second step, country means and country SDs were computed using the fixed international item parameters presented in Table 4. The means and SDs for the 26 countries, based on the original logit scale of the 2PL model, are reported in Table 5. The country means had an average of M = 0.000 (with S D = 0.228 ), while the country SDs had an average of M = 0.973 (with S D = 0.078 ).
In the third step, DIF effects e i were determined for each country. The country mean and country SD were fixed at μ ^ and σ ^ , as obtained from the second step, while the international item parameters a i and b i were used. Specifically, the IRT model
$$P(X_i = 1 \mid \theta) = \Psi\left( a_i (\theta - b_i - e_i) \right) \quad \text{with} \quad \theta \sim N(\hat{\mu}, \hat{\sigma}^2) \tag{35}$$
was applied in each country, where N denotes the normal distribution. It is important to note that in (35), only DIF effects and their sampling variances were computed.
Using the data on estimated DIF effects, the distribution of DIF effects within each country was examined. As an initial descriptive step, the empirical SD of DIF effects e i (i.e., τ ^ obs ) was calculated and is reported in Table 5. The average DIF SD was M = 0.367 with S D = 0.091 , indicating considerable heterogeneity in DIF effects across countries.
To account for the contribution of sampling variance in the observed DIF effect estimates e ^ i , maximum likelihood estimation was applied to the following model for DIF effects:
$$\hat{e}_i \sim N(\kappa, \tau^2 + v_i^2) \quad \text{for} \quad i = 1, \ldots, I . \tag{36}$$
Here, the parameters κ and the DIF SD τ were estimated, and v i 2 denotes the estimated sampling variance of e ^ i . Model (36) corresponds to a random-effects meta-analysis model with known error variances and was estimated using the stats::optim() function in R (Version: 4.4.1; [41]). The resulting τ estimates, referred to as τ ^ bc (i.e., bias-corrected estimates), are also reported in Table 5 and were, as expected, slightly lower than the empirical values: M = 0.359 , S D = 0.089 . Note that model (36) assumes random DIF with homogeneous variances for DIF in item difficulties.
The model for DIF effects in (36) is contrasted with the alternative model
$$\hat{e}_i \sim N\!\left( \kappa, \frac{\tau^2}{a_i^2} + v_i^2 \right) \quad \text{for} \quad i = 1, \ldots, I , \tag{37}$$
which represents DIF in item intercepts under the assumption of a homogeneous variance. Again, this model was fitted using the stats::optim() function in R [41]. The corresponding τ estimates, denoted as τ ^ ν , bc , are reported in Table 5. On average, τ ^ ν , bc was slightly larger than τ ^ bc , with M = 0.424 and S D = 0.083 . This result aligns with expectations, given that the average item discrimination clearly exceeded 1.
Because the competing models (36) and (37) were based on the same data (i.e., estimated DIF effects e ^ i ) and involved the same number of estimated parameters, their log-likelihood values can be directly compared to assess whether the assumption of DIF in item difficulties or item intercepts is more appropriate. The corresponding difference in log-likelihood, denoted as Δ L L , is reported in Table 5. Notably, the model assuming DIF in item intercepts provided a better fit in 23 out of 26 countries.
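The model comparison between (36) and (37) can be sketched with synthetic data (Python; the article fitted these models with stats::optim() in R). The item count, discriminations, sampling variances, and true DIF SD below are illustrative assumptions, with more items than the 28 reading items so that the comparison is stable; $\kappa$ is profiled out in closed form and $\tau$ is found by grid search.

```python
import numpy as np

rng = np.random.default_rng(6)
I = 200                                 # more items than PISA's 28, for a stable demo
a = rng.uniform(0.8, 2.0, I)            # item discriminations
v2 = rng.uniform(0.001, 0.01, I)        # known sampling variances v_i^2

# generate DIF effects under the intercept model: Var(e_i) = tau^2 / a_i^2 + v_i^2
tau_true = 0.4
e_hat = rng.normal(0.0, np.sqrt(tau_true**2 / a**2 + v2))

def profile_loglik(scale):
    """Maximize the log-likelihood of e_hat ~ N(kappa, tau^2 * scale_i + v_i^2)
    over kappa (closed form) and tau (grid search)."""
    best = -np.inf, None
    for tau in np.linspace(0.01, 1.0, 500):
        tot = tau**2 * scale + v2
        kappa = np.sum(e_hat / tot) / np.sum(1.0 / tot)   # precision-weighted mean
        ll = -0.5 * np.sum(np.log(2 * np.pi * tot) + (e_hat - kappa)**2 / tot)
        if ll > best[0]:
            best = ll, tau
    return best

ll_b, tau_b = profile_loglik(np.ones(I))      # Model (36): DIF in item difficulties
ll_nu, tau_nu = profile_loglik(1.0 / a**2)    # Model (37): DIF in item intercepts

delta_ll = ll_nu - ll_b                       # > 0 favors the intercept model
```

When the data are generated under the intercept model, the heteroscedastic specification (37) attains the higher log-likelihood, mirroring the direction of the $\Delta LL$ comparison reported in Table 5.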
Table 4 also includes DIF SD estimates, τ ^ obs and τ ^ bc , computed for individual items across countries to examine whether certain items were more susceptible to country-level DIF than others. Substantial variability in τ ^ bc estimates across items was observed, with M = 0.335 and S D = 0.165 . To evaluate the plausibility of the normality assumption for DIF effects, the Shapiro–Wilk test for normality was applied. The corresponding p values are listed as p(SW) in Table 4. A total of 8 out of 28 items showed statistically significant deviations from normality. Additionally, Figure 1 presents histograms of estimated DIF effects for nine selected items, along with the estimated DIF SD τ and the Shapiro–Wilk p value. While outliers in DIF effects were evident for items R104Q02 and R219Q01E, the overall pattern suggested unsystematic variation in DIF effects, supporting the plausibility of the normal distribution assumption for random DIF. In contrast, the commonly assumed partial invariance structure—where only a subset of items exhibit large absolute DIF effects while the majority show small or no DIF effects [44,45]—was clearly not supported by the data.

5. Discussion

This article examines three specifications of MGM linking that differ in the computation of the group mean μ ^ . The UW method assigns equal weight to all item difficulty differences; the DW method weights these differences by the squared item discrimination; and the PW method applies precision weights that account for the sampling error in item difficulty differences. The relative performance of the three methods depends on the data-generating model. When no random DIF is present, the PW method consistently outperforms the other two. In the presence of random DIF, the estimated group mean is influenced by both DIF and sampling error. Thus, the effectiveness of each method depends on the relative contribution of these two sources of variance.
When random DIF with homogeneous variances affects item difficulties, the UW method outperforms both DW and PW under large sample conditions. Conversely, if random DIF with homogeneous variances influences item intercepts, the DW method yields superior results among the three approaches.
Because the estimated precision weights in PW linking reflect the presence of DIF effects, a bias arises due to the covariance between these weights and the item difficulty differences since DIF affects both quantities. Therefore, if random DIF is suspected, the PW method is not recommended.
Given that the performance of the UW and DW methods depends on whether random DIF affects item difficulties or item intercepts, the PISA 2006 reading dataset was analyzed to investigate which assumption is more tenable. Model fit comparisons indicated that DIF effects were more prevalent in item intercepts than in item difficulties for the majority of countries. This empirical evidence suggests that the DW method may be preferable when selecting an MGM specification in applied settings. Furthermore, as DIF in item intercepts appears to be the more tenable assumption in practice, future simulation studies may benefit from focusing on DIF effects in item intercepts rather than in item difficulties. Nevertheless, future research should examine whether this specific result from the PISA 2006 reading dataset generalizes to other empirical contexts.
However, weighting items by their discriminations may not accurately reflect group differences, because group comparisons should ideally assign equal weight to all items to preserve the intended test composition [46,47]. Nonetheless, this critique may not fully hold: item weighting is already introduced when the 2PL model, which incorporates item discriminations, is chosen over the Rasch model [48], which applies equal item weighting within the IRT framework.
Future research could examine the comparison between the UW and PW methods within the Rasch model. In this setting, MGM linking is replaced by mean–mean linking, as only the group means are aligned, and the group SDs are freely estimated. As noted by an anonymous reviewer, the Rasch model may exhibit special measurement properties compared to the 2PL model [49,50,51] (but see [52,53,54]), which can lead to its preference among practitioners [55,56,57,58]. Several established tools for detecting DIF have been developed within the Rasch framework [59,60]. The relative efficiency of the UW and PW methods depends on the relative magnitude of random DIF SD and sampling error. Importantly, the PW method is also expected to introduce bias in the estimated group mean under the Rasch model when random DIF effects are present.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study because it is a secondary analysis of PISA data, for which approval had already been obtained.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Replication material for the Simulation Study in Section 3 can be found at https://osf.io/xa4qz (accessed on 3 May 2025). The PISA 2006 dataset used in Section 4 can be downloaded from https://www.oecd.org/en/data/datasets/pisa-2006-database.html (accessed on 3 May 2025).

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
2PL = two-parameter logistic
DIF = differential item functioning
DW = discrimination-weighted mean–geometric mean linking
IRF = item response function
IRT = item response theory
MGM = mean–geometric mean
MML = marginal maximum likelihood
PISA = Programme for International Student Assessment
PW = precision-weighted mean–geometric mean linking
RMSE = root mean square error
SD = standard deviation
UW = unweighted mean–geometric mean linking

Appendix A. Country Labels for the PISA 2006 Study

The country labels used in Table 5 are as follows: AUS = Australia; AUT = Austria; BEL = Belgium; CAN = Canada; CHE = Switzerland; CZE = Czech Republic; DEU = Germany; DNK = Denmark; ESP = Spain; EST = Estonia; FIN = Finland; FRA = France; GBR = United Kingdom; GRC = Greece; HUN = Hungary; IRL = Ireland; ISL = Iceland; ITA = Italy; JPN = Japan; KOR = Korea; LUX = Luxembourg; NLD = The Netherlands; NOR = Norway; POL = Poland; PRT = Portugal; SWE = Sweden.

References

  1. Bock, R.D.; Gibbons, R.D. Item Response Theory; Wiley: Hoboken, NJ, USA, 2021. [Google Scholar] [CrossRef]
  2. Reckase, M.D. Multidimensional Item Response Theory Models; Springer: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
  3. Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154. [Google Scholar]
  4. van der Linden, W.J. Unidimensional logistic response models. In Handbook of Item Response Theory, Volume 1: Models; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 11–30. [Google Scholar] [CrossRef]
  5. Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479. [Google Scholar]
  6. Kamata, A.; Bauer, D.J. A note on the relation between factor analytic and item response theory models. Struct. Equ. Model. 2008, 15, 136–153. [Google Scholar] [CrossRef]
  7. van der Linden, W.J.; Barrett, M.D. Linking item response model parameters. Psychometrika 2016, 81, 650–673. [Google Scholar] [CrossRef] [PubMed]
  8. Bock, R.D.; Aitkin, M. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika 1981, 46, 443–459. [Google Scholar] [CrossRef]
  9. Glas, C.A.W. Maximum-likelihood estimation. In Handbook of Item Response Theory, Volume 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 197–216. [Google Scholar] [CrossRef]
  10. Kolen, M.J.; Brennan, R.L. Test Equating, Scaling, and Linking; Springer: New York, NY, USA, 2014. [Google Scholar] [CrossRef]
  11. Holland, P.W.; Wainer, H. (Eds.) Differential Item Functioning: Theory and Practice; Lawrence Erlbaum: Hillsdale, NJ, USA, 1993. [Google Scholar] [CrossRef]
  12. Millsap, R.E. Statistical Approaches to Measurement Invariance; Routledge: New York, NY, USA, 2011. [Google Scholar] [CrossRef]
  13. Penfield, R.D.; Camilli, G. Differential item functioning and item bias. In Handbook of Statistics, Vol. 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Elsevier: Amsterdam, The Netherlands, 2007; pp. 125–167. [Google Scholar] [CrossRef]
  14. Lee, W.C.; Lee, G. IRT linking and equating. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 639–673. [Google Scholar] [CrossRef]
  15. Sansivieri, V.; Wiberg, M.; Matteucci, M. A review of test equating methods with a special focus on IRT-based approaches. Statistica 2017, 77, 329–352. [Google Scholar] [CrossRef]
  16. Mislevy, R.J.; Bock, R.D. BILOG 3. Item Analysis and Test Scoring with Binary Logistic Models; Software Manual; Scientific Software International: Chicago, IL, USA, 1990. [Google Scholar]
  17. Haberman, S.J. Linking Parameter Estimates Derived from an Item Response Model Through Separate Calibrations; (Research Report No. RR-09-40); Educational Testing Service: Princeton, NJ, USA, 2009. [Google Scholar] [CrossRef]
  18. Battauz, M. equateIRT: An R package for IRT test equating. J. Stat. Softw. 2015, 68, 1–22. [Google Scholar] [CrossRef]
  19. Robitzsch, A. Extensions to mean–geometric mean linking. Mathematics 2025, 13, 35. [Google Scholar] [CrossRef]
  20. De Boeck, P. Random item IRT models. Psychometrika 2008, 73, 533–559. [Google Scholar] [CrossRef]
  21. Fox, J.P.; Verhagen, A.J. Random item effects modeling for cross-national survey data. In Cross-Cultural Analysis: Methods and Applications; Davidov, E., Schmidt, P., Billiet, J., Eds.; Routledge: London, UK, 2010; pp. 461–482. [Google Scholar] [CrossRef]
  22. de Jong, M.G.; Steenkamp, J.B.E.M.; Fox, J.P. Relaxing measurement invariance in cross-national consumer research using a hierarchical IRT model. J. Consum. Res. 2007, 34, 260–278. [Google Scholar] [CrossRef]
  23. Monseur, C.; Berezner, A. The computation of equating errors in international surveys in education. J. Appl. Meas. 2007, 8, 323–335. [Google Scholar]
  24. Michaelides, M.P.; Haertel, E.H. Selection of common items as an unrecognized source of variability in test equating: A bootstrap approximation assuming random sampling of common items. Appl. Meas. Educ. 2014, 27, 46–57. [Google Scholar] [CrossRef]
  25. Monseur, C.; Sibberns, H.; Hastedt, D. Linking errors in trend estimation for international surveys in education. IERI Monogr. Ser. 2008, 1, 113–122. [Google Scholar]
  26. Robitzsch, A.; Lüdtke, O. Linking errors in international large-scale assessments: Calculation of standard errors for trend estimation. Assess. Educ. 2019, 26, 444–465. [Google Scholar] [CrossRef]
  27. Robitzsch, A. Linking error in the 2PL model. J 2023, 6, 58–84. [Google Scholar] [CrossRef]
  28. Robitzsch, A. Estimation of standard error, linking error, and total error for robust and nonrobust linking methods in the two-parameter logistic model. Stats 2024, 7, 592–612. [Google Scholar] [CrossRef]
  29. Robitzsch, A.; Lüdtke, O. An examination of the linking error currently used in PISA. Meas. Interdiscip. Res. Persp. 2024, 22, 61–77. [Google Scholar] [CrossRef]
  30. Sachse, K.A.; Roppelt, A.; Haag, N. A comparison of linking methods for estimating national trends in international comparative large-scale assessments in the presence of cross-national DIF. J. Educ. Meas. 2016, 53, 152–171. [Google Scholar] [CrossRef]
  31. Sachse, K.A.; Haag, N. Standard errors for national trends in international large-scale assessments in the case of cross-national differential item functioning. Appl. Meas. Educ. 2017, 30, 102–116. [Google Scholar] [CrossRef]
  32. Wu, M. Measurement, sampling, and equating errors in large-scale assessments. Educ. Meas. 2010, 29, 15–27. [Google Scholar] [CrossRef]
  33. Lohr, S.L. Sampling: Design and Analysis; Chapman and Hall/CRC: London, UK, 2021. [Google Scholar] [CrossRef]
  34. Robitzsch, A. A comparison of linking methods for two groups for the two-parameter logistic item response model in the presence and absence of random differential item functioning. Foundations 2021, 1, 116–144. [Google Scholar] [CrossRef]
  35. Chen, Y.; Li, C.; Ouyang, J.; Xu, G. DIF statistical inference without knowing anchoring items. Psychometrika 2023, 88, 1097–1122. [Google Scholar] [CrossRef]
  36. Yuan, K.H.; Cheng, Y.; Patton, J. Information matrices and standard errors for MLEs of item parameters in IRT. Psychometrika 2014, 79, 232–254. [Google Scholar] [CrossRef] [PubMed]
  37. Asparouhov, T.; Muthén, B. Multiple-group factor analysis alignment. Struct. Equ. Model. 2014, 21, 495–508. [Google Scholar] [CrossRef]
  38. Muthén, B.; Asparouhov, T. IRT studies of many groups: The alignment method. Front. Psychol. 2014, 5, 978. [Google Scholar] [CrossRef]
  39. Barrett, M.D.; van der Linden, W.J. Estimating linking functions for response model parameters. J. Educ. Behav. Stat. 2019, 44, 180–209. [Google Scholar] [CrossRef]
  40. Rutkowski, L.; von Davier, M.; Rutkowski, D. (Eds.) A Handbook of International Large-scale Assessment: Background, Technical Issues, and Methods of Data Analysis; Chapman Hall/CRC Press: London, UK, 2013. [Google Scholar] [CrossRef]
  41. R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2024; Available online: https://www.R-project.org (accessed on 15 June 2024).
  42. Robitzsch, A. sirt: Supplementary Item Response Theory Models, R Package Version 4.2-114. 2025. Available online: https://github.com/alexanderrobitzsch/sirt (accessed on 7 April 2025).
  43. OECD. PISA 2006 Technical Report; OECD: Paris, France, 2009; Available online: https://bit.ly/38jhdzp (accessed on 3 May 2025).
  44. von Davier, M.; Yamamoto, K.; Shin, H.J.; Chen, H.; Khorramdel, L.; Weeks, J.; Davis, S.; Kong, N.; Kandathil, M. Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assess. Educ. 2019, 26, 466–488. [Google Scholar] [CrossRef]
  45. Joo, S.H.; Khorramdel, L.; Yamamoto, K.; Shin, H.J.; Robin, F. Evaluating item fit statistic thresholds in PISA: Analysis of cross-country comparability of cognitive items. Educ. Meas. 2021, 40, 37–48. [Google Scholar] [CrossRef]
  46. Brennan, R.L. Misconceptions at the intersection of measurement theory and practice. Educ. Meas. 1998, 17, 5–9. [Google Scholar] [CrossRef]
  47. Camilli, G. IRT scoring and test blueprint fidelity. Appl. Psychol. Meas. 2018, 42, 393–400. [Google Scholar] [CrossRef]
  48. Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational Research: Copenhagen, Denmark, 1960. [Google Scholar]
  49. Heine, J.H.; Heene, M. Measurement and mind: Unveiling the self-delusion of metrification in psychology. Meas. Interdiscip. Res. Persp. 2024; Epub ahead of print. [Google Scholar] [CrossRef]
  50. Salzberger, T. The illusion of measurement: Rasch versus 2-PL. Rasch Meas. Trans. 2002, 16, 882. Available online: https://tinyurl.com/25wzmzb5 (accessed on 3 May 2025).
  51. Linacre, J.M. Understanding Rasch measurement: Estimation methods for Rasch measures. J. Outcome Meas. 1999, 3, 382–405. Available online: https://bit.ly/2UV6Eht (accessed on 3 May 2025).
  52. Ballou, D. Test scaling and value-added measurement. Educ. Financ. Policy 2009, 4, 351–383. [Google Scholar] [CrossRef]
  53. van der Linden, W.J. Fundamental measurement and the fundamentals of Rasch measurement. In Objective Measurement: Theory Into Practice (Vol. 2); Wilson, M., Ed.; Ablex Publishing Corporation: Hillsdale, NJ, USA, 1994; pp. 3–24. [Google Scholar]
  54. Robitzsch, A.; Lüdtke, O. Some thoughts on analytical choices in the scaling model for test scores in international large-scale assessment studies. Meas. Instrum. Soc. Sci. 2022, 4, 9. [Google Scholar] [CrossRef]
  55. Andrich, D.; Marais, I. A Course in Rasch Measurement Theory; Springer: New York, NY, USA, 2019. [Google Scholar] [CrossRef]
  56. Engelhard, G. Invariant Measurement; Routledge: New York, NY, USA, 2012. [Google Scholar] [CrossRef]
  57. Melin, J.; Bonn, S.E.; Pendrill, L.; Lagerros, Y.T. A questionnaire for assessing user satisfaction with mobile health apps: Development using Rasch measurement theory. JMIR mHealth uHealth 2020, 8, e15909. [Google Scholar] [CrossRef]
  58. Wu, M.; Tam, H.P.; Jen, T.H. Educational Measurement for Applied Researchers; Springer: Singapore, 2016. [Google Scholar] [CrossRef]
  59. Tennant, A.; Pallant, J.F. DIF matters: A practical approach to test if differential item functioning makes a difference. Rasch Meas. Trans. 2007, 20, 1082–1084. Available online: https://rb.gy/wbiku0 (accessed on 3 May 2025).
  60. Melin, J.; Cano, S.; Flöel, A.; Göschel, L.; Pendrill, L. The role of entropy in construct specification equations (CSE) to improve the validity of memory tests: Extension to word lists. Entropy 2022, 24, 934. [Google Scholar] [CrossRef]
Figure 1. Empirical Example, PISA 2006 Reading: Histograms of estimated DIF effects for nine selected items (R102Q07, R104Q01, R104Q02, R104Q05, R111Q01, R111Q02B, R111Q06B, R219Q01E, and R219Q01T), along with the estimated DIF SD τ and the Shapiro–Wilk test for normality (p(SW)). DIF effects of −0.4 and 0.4 are indicated by red vertical dashed lines.
Table 1. Simulation Study: Bias of estimated mean μ ^ as a function of the DIF SD τ , the type of DIF effects, the number of items I, and sample size N.
I | N | τ = 0 | τ = 0.25, DIF in b_i | τ = 0.5, DIF in b_i | τ = 0.25, DIF in ν_i | τ = 0.5, DIF in ν_i
(each cell reports UW / DW / PW)
20 | 500 | 0.003 / −0.003 / −0.007 | 0.002 / −0.003 / −0.014 | 0.009 / 0.001 / −0.026 | 0.004 / −0.002 / −0.013 | 0.005 / 0.000 / −0.029
20 | 1000 | 0.002 / −0.002 / −0.004 | 0.002 / −0.002 / −0.011 | 0.001 / −0.003 / −0.029 | 0.003 / −0.001 / −0.011 | 0.000 / −0.005 / −0.033
20 | 2000 | 0.001 / 0.000 / −0.001 | 0.000 / −0.001 / −0.009 | 0.000 / 0.000 / −0.027 | −0.001 / 0.000 / −0.011 | −0.001 / 0.000 / −0.029
20 | Inf | 0.000 / 0.000 / — | 0.000 / 0.000 / — | 0.000 / 0.000 / — | 0.000 / −0.001 / — | −0.003 / −0.001 / —
40 | 500 | 0.005 / 0.003 / −0.002 | 0.004 / 0.003 / −0.008 | 0.004 / 0.006 / −0.024 | 0.005 / 0.004 / −0.008 | 0.005 / 0.009 / −0.022
40 | 1000 | 0.003 / −0.001 / −0.002 | 0.001 / −0.002 / −0.010 | 0.004 / 0.001 / −0.025 | 0.001 / −0.001 / −0.011 | −0.001 / −0.004 / −0.031
40 | 2000 | −0.003 / 0.001 / −0.002 | −0.003 / 0.001 / −0.009 | −0.004 / 0.001 / −0.027 | −0.002 / 0.002 / −0.008 | −0.004 / 0.002 / −0.027
40 | Inf | 0.000 / 0.000 / — | 0.001 / 0.001 / — | 0.000 / 0.001 / — | −0.001 / −0.001 / — | 0.002 / 0.001 / —
Note. SD = standard deviation; Inf = infinite sample size; UW = unweighted mean–geometric mean linking; DW = discrimination-weighted mean–geometric mean linking; PW = precision-weighted mean–geometric mean linking; — = linking method not applied; Values of absolute bias larger than 0.010 are printed in bold font.
Table 2. Simulation Study: Standard deviation (SD) of estimated mean μ ^ as a function of the DIF SD τ , the type of DIF effects, the number of items I, and sample size N.
I | N | τ = 0 | τ = 0.25, DIF in b_i | τ = 0.5, DIF in b_i | τ = 0.25, DIF in ν_i | τ = 0.5, DIF in ν_i
(each cell reports UW / DW / PW)
20 | 500 | 0.095 / 0.085 / 0.082 | 0.110 / 0.107 / 0.103 | 0.150 / 0.160 / 0.149 | 0.120 / 0.102 / 0.101 | 0.179 / 0.147 / 0.142
20 | 1000 | 0.066 / 0.058 / 0.058 | 0.087 / 0.088 / 0.087 | 0.131 / 0.145 / 0.136 | 0.099 / 0.084 / 0.085 | 0.166 / 0.134 / 0.130
20 | 2000 | 0.046 / 0.042 / 0.042 | 0.074 / 0.079 / 0.078 | 0.122 / 0.139 / 0.131 | 0.087 / 0.073 / 0.075 | 0.156 / 0.127 / 0.125
20 | Inf | 0.000 / 0.000 / — | 0.056 / 0.066 / — | 0.112 / 0.130 / — | 0.072 / 0.058 / — | 0.147 / 0.118 / —
40 | 500 | 0.083 / 0.079 / 0.077 | 0.090 / 0.091 / 0.087 | 0.116 / 0.124 / 0.114 | 0.099 / 0.089 / 0.087 | 0.135 / 0.117 / 0.109
40 | 1000 | 0.058 / 0.054 / 0.054 | 0.071 / 0.071 / 0.069 | 0.098 / 0.106 / 0.099 | 0.078 / 0.069 / 0.069 | 0.120 / 0.100 / 0.097
40 | 2000 | 0.040 / 0.038 / 0.038 | 0.056 / 0.060 / 0.059 | 0.089 / 0.103 / 0.095 | 0.065 / 0.057 / 0.057 | 0.111 / 0.092 / 0.089
40 | Inf | 0.000 / 0.000 / — | 0.039 / 0.046 / — | 0.079 / 0.092 / — | 0.052 / 0.042 / — | 0.105 / 0.084 / —
Note. Inf = infinite sample size; UW = unweighted mean–geometric mean linking; DW = discrimination-weighted mean–geometric mean linking; PW = precision-weighted mean–geometric mean linking; — = linking method not applied.
Table 3. Simulation Study: Relative root mean square error (RMSE) of estimated mean μ ^ as a function of the DIF SD τ , the type of DIF effects, the number of items I, and sample size N.
I | N | τ = 0 | τ = 0.25, DIF in b_i | τ = 0.5, DIF in b_i | τ = 0.25, DIF in ν_i | τ = 0.5, DIF in ν_i
(each cell reports UW / DW / PW)
20 | 500 | 100 / 89.0 / 86.8 | 100 / 97.0 / 94.7 | 100 / 106.1 / 100.7 | 100 / 85.0 / 85.1 | 100 / 82.5 / 80.8
20 | 1000 | 100 / 87.8 / 87.5 | 100 / 101.0 / 99.8 | 100 / 111.0 / 106.4 | 100 / 84.2 / 85.9 | 100 / 80.8 / 81.1
20 | 2000 | 100 / 91.9 / 91.4 | 100 / 107.0 / 106.3 | 100 / 114.2 / 109.4 | 100 / 83.9 / 87.1 | 100 / 81.4 / 82.1
20 | Inf | — / — / — | 100 / 116.7 / — | 100 / 116.1 / — | 100 / 80.7 / — | 100 / 79.8 / —
40 | 500 | 100 / 95.3 / 92.9 | 100 / 100.3 / 96.3 | 100 / 106.7 / 99.9 | 100 / 90.2 / 87.8 | 100 / 86.2 / 82.2
40 | 1000 | 100 / 93.0 / 93.1 | 100 / 100.0 / 98.8 | 100 / 109.0 / 104.3 | 100 / 87.7 / 88.9 | 100 / 83.4 / 84.3
40 | 2000 | 100 / 96.0 / 94.5 | 100 / 107.6 / 106.1 | 100 / 115.2 / 111.4 | 100 / 86.7 / 87.7 | 100 / 83.2 / 84.0
40 | Inf | — / — / — | 100 / 116.2 / — | 100 / 116.4 / — | 100 / 79.5 / — | 100 / 79.9 / —
Note. SD = standard deviation; Inf = infinite sample size; — = linking method not applied; UW = unweighted mean–geometric mean linking; DW = discrimination-weighted mean–geometric mean linking; PW = precision-weighted mean–geometric mean linking; The UW method was used as the reference method to compute the relative RMSE.
Table 4. Empirical Example, PISA 2006 Reading: International item parameters and descriptive statistics of DIF effects for all 28 items.
Item | #CNT | a_i | b_i | τ̂_obs | τ̂_bc | Min | Max | p(SW)
R055Q01 | 26 | 1.395 | −1.486 | 0.218 | 0.210 | −0.447 | 0.654 | 0.124
R055Q02 | 26 | 1.379 | 0.043 | 0.214 | 0.207 | −0.394 | 0.411 | 0.522
R055Q03 | 26 | 1.620 | −0.335 | 0.279 | 0.272 | −0.445 | 0.496 | 0.095
R055Q05 | 26 | 2.117 | −0.777 | 0.188 | 0.182 | −0.353 | 0.644 | 0.002
R067Q01 | 26 | 1.228 | −2.069 | 0.350 | 0.339 | −0.664 | 1.022 | 0.033
R067Q04 | 26 | 0.832 | 0.723 | 0.710 | 0.694 | −2.041 | 0.976 | 0.017
R067Q05 | 26 | 1.088 | −0.307 | 0.526 | 0.513 | −1.258 | 1.164 | 0.508
R102Q04A | 25 | 1.460 | 0.669 | 0.383 | 0.373 | −0.604 | 0.783 | 0.266
R102Q05 | 26 | 1.330 | 0.244 | 0.298 | 0.290 | −0.611 | 0.435 | 0.073
R102Q07 | 24 | 1.418 | −1.493 | 0.427 | 0.416 | −0.680 | 0.821 | 0.083
R104Q01 | 26 | 1.628 | −1.321 | 0.185 | 0.178 | −0.267 | 0.445 | 0.122
R104Q02 | 26 | 0.583 | 1.337 | 0.685 | 0.664 | −0.873 | 2.194 | 0.008
R104Q05 | 26 | 1.206 | 2.974 | 0.449 | 0.428 | −0.674 | 1.066 | 0.530
R111Q01 | 26 | 1.365 | −0.604 | 0.259 | 0.251 | −0.400 | 0.558 | 0.366
R111Q02B | 26 | 1.044 | 1.917 | 0.500 | 0.486 | −0.858 | 1.027 | 0.738
R111Q06B | 26 | 1.589 | 0.542 | 0.224 | 0.217 | −0.589 | 0.307 | 0.124
R219Q01E | 26 | 1.633 | −0.250 | 0.295 | 0.287 | −1.042 | 0.541 | 0.006
R219Q01T | 26 | 1.861 | −0.667 | 0.242 | 0.235 | −0.522 | 0.478 | 0.986
R219Q02 | 26 | 1.534 | −1.179 | 0.229 | 0.221 | −0.423 | 0.346 | 0.451
R220Q01 | 26 | 1.762 | 0.308 | 0.211 | 0.205 | −0.317 | 0.460 | 0.377
R220Q02B | 25 | 1.521 | −0.376 | 0.159 | 0.152 | −0.221 | 0.338 | 0.143
R220Q04 | 26 | 1.302 | −0.312 | 0.320 | 0.312 | −0.546 | 0.373 | 0.003
R220Q05 | 26 | 1.977 | −1.145 | 0.165 | 0.157 | −0.370 | 0.286 | 0.297
R220Q06 | 26 | 1.167 | −0.675 | 0.393 | 0.383 | −0.500 | 0.688 | 0.014
R227Q01 | 26 | 0.778 | −0.151 | 0.671 | 0.655 | −1.550 | 1.275 | 0.827
R227Q02T | 26 | 0.994 | 0.792 | 0.629 | 0.614 | −0.995 | 1.437 | 0.738
R227Q03 | 26 | 1.665 | −0.183 | 0.235 | 0.227 | −0.650 | 0.484 | 0.557
R227Q06 | 26 | 1.766 | −0.777 | 0.225 | 0.218 | −0.314 | 0.555 | 0.021
Note. #CNT = number of countries per item; a i = item discrimination in the 2PL model; b i = item difficulty in the 2PL model; τ ^ obs = observed SD of DIF effects; τ ^ bc = bias-corrected estimate of SD of DIF effects; Min = smallest DIF effect per item across countries; Max = largest DIF effect per item across countries; p(SW) = p-value of Shapiro–Wilk test for normality of DIF effects.
Table 5. Empirical Example, PISA 2006 Reading: Descriptive statistics for countries and estimated SD of DIF effects.
CNT | N | I | M | SD | τ̂_obs | τ̂_bc | τ̂_ν,bc | ΔLL
AUS | 7562 | 28 | 0.170 | 0.960 | 0.249 | 0.246 | 0.350 | −1.46
AUT | 2646 | 27 | −0.037 | 1.033 | 0.272 | 0.265 | 0.310 | 3.86
BEL | 4840 | 28 | 0.059 | 1.071 | 0.266 | 0.255 | 0.307 | 3.21
CAN | 12,142 | 28 | 0.276 | 0.934 | 0.283 | 0.279 | 0.359 | 1.34
CHE | 6578 | 28 | 0.023 | 0.958 | 0.327 | 0.320 | 0.377 | 3.87
CZE | 3246 | 28 | −0.168 | 1.130 | 0.335 | 0.326 | 0.393 | 3.25
DEU | 2701 | 28 | −0.039 | 1.140 | 0.522 | 0.500 | 0.445 | 11.56
DNK | 2431 | 27 | 0.001 | 0.891 | 0.398 | 0.394 | 0.447 | 4.63
ESP | 10,506 | 28 | −0.351 | 0.815 | 0.413 | 0.408 | 0.479 | 3.90
EST | 2630 | 28 | −0.007 | 0.838 | 0.344 | 0.339 | 0.432 | 1.66
FIN | 2536 | 28 | 0.516 | 0.854 | 0.330 | 0.326 | 0.378 | 4.26
FRA | 2524 | 28 | −0.010 | 0.984 | 0.332 | 0.320 | 0.405 | 1.88
GBR | 7061 | 28 | −0.016 | 0.985 | 0.340 | 0.336 | 0.447 | 0.48
GRC | 2606 | 28 | −0.431 | 0.952 | 0.490 | 0.479 | 0.510 | 6.62
HUN | 2399 | 28 | −0.148 | 0.918 | 0.320 | 0.306 | 0.371 | 3.23
IRL | 2468 | 28 | 0.184 | 0.946 | 0.275 | 0.269 | 0.343 | 1.60
ISL | 2010 | 28 | −0.069 | 0.915 | 0.326 | 0.320 | 0.405 | 1.87
ITA | 11,629 | 28 | −0.285 | 0.984 | 0.350 | 0.340 | 0.422 | 2.55
JPN | 3203 | 28 | 0.028 | 1.034 | 0.438 | 0.435 | 0.598 | −0.51
KOR | 2790 | 27 | 0.561 | 0.959 | 0.589 | 0.576 | 0.628 | 5.71
LUX | 2443 | 27 | −0.180 | 1.012 | 0.333 | 0.315 | 0.358 | 4.65
NLD | 2666 | 28 | 0.092 | 1.017 | 0.429 | 0.425 | 0.516 | 2.98
NOR | 2504 | 28 | −0.107 | 1.018 | 0.453 | 0.439 | 0.461 | 7.10
POL | 2968 | 28 | 0.068 | 1.000 | 0.306 | 0.302 | 0.411 | −0.21
PRT | 2773 | 28 | −0.242 | 0.955 | 0.534 | 0.529 | 0.542 | 7.76
SWE | 2374 | 28 | 0.107 | 1.004 | 0.288 | 0.283 | 0.327 | 4.39
Note. CNT = country label (see Appendix A); N = sample size per country; I = number of items per country; M = country mean; SD = country SD; τ ^ obs = observed SD of DIF effects in item difficulties b i ; τ ^ bc = bias-corrected SD estimate of DIF effects in item difficulties b i ; τ ^ ν , bc = bias-corrected SD estimate of DIF effects in item intercepts ν i ; Δ L L = difference in log-likelihood values for models with DIF effects in item intercepts ν i or item difficulties b i . Positive Δ L L values indicate a better model fit for DIF effects in item intercepts.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
