Abstract
Mean–geometric mean (MGM) linking compares group differences on a latent variable within the two-parameter logistic (2PL) item response theory model. This article investigates three specifications of MGM linking that differ in the weighting of item difficulty differences: unweighted (UW), discrimination-weighted (DW), and precision-weighted (PW). These methods are evaluated under conditions where random DIF effects are present in either item difficulties or item intercepts. The three estimators are analyzed both analytically and through a simulation study. The PW method outperforms the other two only in the absence of random DIF or in small samples when DIF is present. In larger samples, the UW method performs best when random DIF with homogeneous variances affects item difficulties, while the DW method achieves superior performance when such DIF is present in item intercepts. The analytical results and simulation findings consistently show that the PW method introduces bias in the estimated group mean when random DIF is present. Given that the effectiveness of MGM methods depends on the type of random DIF, the distribution of DIF effects was further examined using PISA 2006 reading data. The model comparisons indicate that random DIF with homogeneous variances in item intercepts provides a better fit than random DIF in item difficulties in the PISA 2006 reading dataset.
Keywords: item response model; 2PL model; mean–geometric mean linking; differential item functioning

MSC: 62H10; 62H25
1. Introduction
Item response theory (IRT) models [1,2,3] provide a statistical framework for modeling multivariate discrete outcomes. This work specifically addresses binary item responses and explores methods for comparing two populations using linking techniques. Consider a response vector $\mathbf{X} = (X_1, \ldots, X_I)$, where each variable $X_i \in \{0, 1\}$ represents a dichotomously scored item. A unidimensional IRT model [4] specifies the joint probability distribution for response patterns $\mathbf{x} = (x_1, \ldots, x_I)$ through a parametric formulation:

$$P(\mathbf{X} = \mathbf{x}) = \int \prod_{i=1}^{I} P_i(\theta)^{x_i} \left[ 1 - P_i(\theta) \right]^{1 - x_i} f(\theta; \boldsymbol{\delta}) \, \mathrm{d}\theta, \quad (1)$$

where $f$ denotes the normal density function, parameterized by the mean $\mu$ and the standard deviation (SD) $\sigma$. The distribution parameters $\mu$ and $\sigma$ of the latent variable $\theta$, often referred to as a trait or ability variable, are collected in the vector $\boldsymbol{\delta} = (\mu, \sigma)$. The vector $\boldsymbol{\gamma} = (\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_I)$ collects the item parameters of the item response functions (IRFs) $P_i(\theta) = P(X_i = 1 \mid \theta)$ for $i = 1, \ldots, I$. The IRF of the two-parameter logistic (2PL) model [5] is defined by

$$P_i(\theta) = \Psi\bigl( a_i (\theta - b_i) \bigr), \quad (2)$$

where $a_i$ and $b_i$ represent the item discrimination and the item difficulty of item $i$, respectively. The function $\Psi(x) = (1 + \exp(-x))^{-1}$ corresponds to the standard logistic distribution function. In this formulation, the item parameter vector is $\boldsymbol{\gamma}_i = (a_i, b_i)$. Alternatively, the 2PL model can be reparametrized by replacing the difficulty parameter $b_i$ with the intercept $d_i$, resulting in

$$P_i(\theta) = \Psi\bigl( a_i \theta + d_i \bigr). \quad (3)$$

The two 2PL parameterizations are related by the identity $d_i = -a_i b_i$ (see [6,7]). In this parametrization, the item parameter vector is $\boldsymbol{\gamma}_i = (a_i, d_i)$.
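To make the two parameterizations concrete, the following minimal R sketch (helper names are illustrative, not from the original article) evaluates the 2PL IRF in both forms and verifies the identity $d_i = -a_i b_i$:

```r
# 2PL item response function in the difficulty parameterization, Equation (2)
irf_2pl_b <- function(theta, a, b) plogis(a * (theta - b))

# 2PL item response function in the intercept parameterization, Equation (3)
irf_2pl_d <- function(theta, a, d) plogis(a * theta + d)

a <- 1.2; b <- 0.5
d <- -a * b   # identity relating the two parameterizations

theta <- seq(-3, 3, by = 0.5)
all.equal(irf_2pl_b(theta, a, b), irf_2pl_d(theta, a, d))  # TRUE
```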
Given a sample of $N$ individuals with independent and identically distributed response vectors drawn from the distribution of $\mathbf{X}$, the parameters of the IRT model specified in (1) can be consistently estimated through marginal maximum likelihood (MML) methods [8,9].
IRT models are widely employed to compare the test performance of two groups by assessing differences in the parameters of the latent variable $\theta$, as defined in the IRT framework of (1). This article specifically examines linking methods [10] based on the 2PL model.
In the first step of the linking approach, the 2PL model is estimated separately for each group, allowing for the presence of differential item functioning (DIF), where item behavior may vary across groups [11,12,13]. More specifically, item parameters are permitted to differ between groups, indicating that the groups may respond differently to an item even after accounting for overall differences in the $\theta$ variable. In the second step, differences in item parameters are used to estimate group differences in $\theta$ through a linking procedure [10,14,15].
This article evaluates the performance of mean–geometric mean (MGM; [7,10,16,17,18,19]) linking in the presence of DIF [13] in either item difficulties or item intercepts. The standard MGM method uses the mean difference of log-transformed item discriminations to estimate the group SD $\sigma$ and the mean difference of untransformed item difficulties to estimate the group mean $\mu$. This study considers three specifications of MGM linking under random DIF [20,21,22] in item difficulties or intercepts. Prior research has shown that random DIF contributes to increased variance in the estimated linking parameters [23,24]. This source of variance is also referred to as a linking error in educational large-scale assessment studies [23,25,26,27,28,29,30,31,32].
The three MGM specifications considered here differ in the weighting of item difficulty differences when computing the group mean. To the best of the authors’ knowledge, the performance of these MGM variants under random DIF has not yet been systematically examined. The performance of the MGM estimators is assessed analytically and through a simulation study.
2. Mean–Geometric Mean Linking
2.1. Identified Item Parameters in Separate Scaling
The various specifications of the MGM method are based on item parameters from the 2PL model, estimated separately for each group. The following describes the identification of item parameters under the assumption that no DIF is present in item discriminations $a_i$ or item difficulties $b_i$. In both groups, the latent variable $\theta$ is standardized by fixing its mean and SD to 0 and 1, respectively, allowing all item parameters to be estimated within each group. In the first group, the identified item parameters are given by $a_{i1} = a_i$, $b_{i1} = b_i$, and $d_{i1} = d_i$, where $a_i$ and $b_i$ represent the invariant item parameters in the 2PL model across groups.
In the second group, the latent variable $\theta$ is assumed to have a mean $\mu$ and SD $\sigma$. By fixing the mean to 0 and SD to 1, the identified item parameters from the separate 2PL model estimation in this group are given by

$$a_{i2} = \sigma a_i, \quad (4)$$
$$b_{i2} = \sigma^{-1} (b_i - \mu), \quad (5)$$
$$d_{i2} = d_i + a_i \mu. \quad (6)$$

The MGM method aims to recover the parameters $\mu$ and $\sigma$ using the group-specific item parameters $\hat{a}_{ig}$ and $\hat{b}_{ig}$ ($i = 1, \ldots, I$; $g = 1, 2$), obtained from separate estimations under the 2PL model.
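A short numerical illustration of these identification formulas (all values hypothetical): standardizing a second group with latent mean $\mu$ and SD $\sigma$ rescales its item parameters as follows.

```r
# invariant item parameters (hypothetical values)
a <- c(0.6, 1.2); b <- c(-0.5, 1.0)
mu <- 0.3; sigma <- 1.2   # latent mean and SD in the second group

# identified parameters when the second group is standardized to N(0, 1)
a2 <- sigma * a            # Equation (4)
b2 <- (b - mu) / sigma     # Equation (5)
d2 <- -a2 * b2             # intercepts; equals d + a * mu, Equation (6)
all.equal(d2, -a * b + a * mu)  # TRUE
```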
2.2. Weighted Means
The different specifications of MGM linking for estimating $\mu$ are essentially different weighted means of item difficulty differences. The following briefly reviews the statistical properties of such weighted means. Let $x_i$ ($i = 1, \ldots, I$) denote normally distributed observations with mean $\mu$ and variances $\sigma_i^2$. A weighted mean with fixed weights $w_i$ satisfying $\sum_{i=1}^{I} w_i = I$ is defined as

$$\hat{\mu} = \frac{1}{I} \sum_{i=1}^{I} w_i x_i. \quad (7)$$

Note that the multiplication factor $1/I$ in (7) could be omitted by rescaling the weights; however, it is included to maintain consistency with later expressions in the various specifications of MGM linking. The expected value of $\hat{\mu}$ is $\mu$, and its variance is given by

$$\mathrm{Var}(\hat{\mu}) = \frac{1}{I^2} \sum_{i=1}^{I} w_i^2 \sigma_i^2. \quad (8)$$

The minimal variance in (8) is attained when the observations are weighted by their precisions, that is, by the inverses of their variances, $w_i \propto \sigma_i^{-2}$ (see [33]). A weighted mean using these weights is commonly referred to as a precision-weighted mean.
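As a brief illustration of Equations (7) and (8), the following R sketch (illustrative, not from the article) compares an equally weighted mean with a precision-weighted mean for heteroscedastic observations:

```r
set.seed(1)
I <- 10
sig2 <- runif(I, 0.2, 2)               # heterogeneous variances
x <- rnorm(I, mean = 0, sd = sqrt(sig2))

mean(x)                                # equally weighted mean

w <- 1 / sig2                          # precision weights
sum(w * x) / sum(w)                    # precision-weighted mean

# theoretical variances of the two estimators
sum(sig2) / I^2                        # equal weights
1 / sum(1 / sig2)                      # precision weights (never larger)
```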
2.3. Random DIF in Item Difficulties or Item Intercepts
The occurrence of random DIF [20,34] can be characterized by whether DIF manifests in item difficulties [34] or item intercepts [35]. Assume that DIF arises only as deviations in item parameters for the second group relative to the first group. Further, assume invariant item discriminations $a_i$. Let $e_i$ and $\tilde{e}_i$ denote random DIF effects in item difficulties and item intercepts with zero means, respectively, under the assumptions

$$b_{i2} = \sigma^{-1} (b_i + e_i - \mu) \quad \text{and} \quad d_{i2} = d_i + a_i \mu + \tilde{e}_i. \quad (9)$$

The two DIF effects are related by

$$\tilde{e}_i = -a_i e_i. \quad (10)$$

If the DIF effects $e_i$ in item difficulties have variances $\mathrm{Var}(e_i) = \omega_i^2$, corresponding to random DIF, the variances of the DIF effects in item intercepts are given by

$$\mathrm{Var}(\tilde{e}_i) = a_i^2 \omega_i^2. \quad (11)$$

If the random DIF effects $e_i$ have homogeneous variances $\omega_i^2 = \omega^2$, it follows from (11) that the corresponding DIF effects in item intercepts exhibit heterogeneous variances $a_i^2 \omega^2$. Conversely, if the DIF effects in item intercepts have homogeneous variances $\tilde{\omega}^2$, then the DIF effects in item difficulties have heterogeneous variances $\tilde{\omega}^2 / a_i^2$.
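The following snippet (hypothetical discrimination values) illustrates this correspondence between homogeneous and heterogeneous DIF variances:

```r
a <- c(0.6, 0.6, 1.2, 1.2)     # item discriminations (hypothetical)

# homogeneous DIF variance in item difficulties ...
omega2_b <- rep(0.25^2, 4)
# ... implies heterogeneous DIF variances in item intercepts, Equation (11)
a^2 * omega2_b

# conversely, homogeneous DIF variance in item intercepts ...
omega2_d <- rep(0.25^2, 4)
# ... implies heterogeneous DIF variances in item difficulties
omega2_d / a^2
```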
The formulation of random DIF in terms of item difficulties or item intercepts is statistically equivalent when allowing for heterogeneous variances $\omega_i^2$. However, empirical analyses may test whether random DIF with homogeneous variance is more plausible in item difficulties or in item intercepts. As will be shown later, the performance of the different MGM specifications depends on whether random DIF with homogeneous variances occurs in item difficulties or in item intercepts.
2.4. Estimation of σ in MGM Linking
In MGM linking, the SD $\sigma$ is estimated using the means of log-transformed item discriminations. Specifically, the estimate $\hat{\sigma}$ is computed as (see [10,16])

$$\hat{\sigma} = \exp\left( \frac{1}{I} \sum_{i=1}^{I} \left( \log \hat{a}_{i2} - \log \hat{a}_{i1} \right) \right). \quad (12)$$

Since averages on the logarithmic scale are used, this method is also referred to as log-mean-mean linking; the exponentiated average in (12) equals the geometric mean of the discrimination ratios $\hat{a}_{i2} / \hat{a}_{i1}$.
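Under the notation above, a minimal R implementation of the log-mean-mean estimate in (12) might look as follows (function name illustrative):

```r
# log-mean-mean estimate of the group SD sigma, Equation (12)
estimate_sigma_mgm <- function(a1, a2) {
  exp(mean(log(a2) - log(a1)))   # geometric mean of the ratios a2 / a1
}

# check with error-free parameters: a2 = sigma * a1 recovers sigma exactly
a1 <- c(0.6, 0.8, 1.2); sigma <- 1.2
estimate_sigma_mgm(a1, sigma * a1)  # 1.2
```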
2.5. Estimation of μ in MGM Linking
The estimation of $\mu$ is now addressed. Three variants of weighted means of item difficulties are considered to derive the estimate $\hat{\mu}$. In the formal treatment, assume that random DIF occurs in item difficulties, with $E(e_i) = 0$ and $\mathrm{Var}(e_i) = \omega_i^2$. If random DIF in item difficulties exhibits homogeneous variance, then $\omega_i^2 = \omega^2$. Alternatively, if random DIF with homogeneous variances occurs in item intercepts, then $\omega_i^2 = \tilde{\omega}^2 / a_i^2$.
In addition to random DIF, sampling errors affect the item parameter estimates. Let $\epsilon_{ig}$ denote the sampling error in the estimated item difficulty $\hat{b}_{ig}$ ($i = 1, \ldots, I$; $g = 1, 2$). The estimated item difficulty in the first group is then given by

$$\hat{b}_{i1} = b_i + \epsilon_{i1} \quad \text{with} \quad \mathrm{Var}(\epsilon_{i1}) = \frac{c_1}{N a_i^2}, \quad (13)$$

where $c_1$ and $c_2$ (the latter appearing in (14)) are constants that depend on the dataset. The variance expression in (13) is supported by empirical evidence from simulation studies [9,36].
The estimated item difficulty in the second group satisfies

$$\hat{b}_{i2} = \sigma^{-1} (b_i + e_i - \mu) + \epsilon_{i2} \quad \text{with} \quad \mathrm{Var}(\epsilon_{i2}) = \frac{c_2}{N \sigma^2 a_i^2}. \quad (14)$$

Here, $\epsilon_{i2}$ represents the sampling error, while $e_i$ denotes the random DIF effect. For a sufficiently large number of items, the estimated item difficulties can be treated as approximately independent across items [36].
2.5.1. Unweighted MGM Linking (UW)
The original variant of MGM linking for estimating $\mu$ is based on the unweighted mean difference in item difficulties, defined as

$$\hat{\mu}_{\mathrm{UW}} = \frac{1}{I} \sum_{i=1}^{I} \left( \hat{b}_{i1} - \hat{\sigma} \hat{b}_{i2} \right). \quad (15)$$

This estimator uses the previously calculated SD $\hat{\sigma}$ from (12) and applies equal weights to the differences in item difficulties. For this reason, the estimator in (15) is referred to as unweighted MGM linking (UW).
To examine the statistical properties of $\hat{\mu}_{\mathrm{UW}}$, the expression in (15) is rewritten using (13) and (14) as

$$\hat{\mu}_{\mathrm{UW}} = \frac{1}{I} \sum_{i=1}^{I} \left( b_i + \epsilon_{i1} - \hat{\sigma} \sigma^{-1} (b_i + e_i - \mu) - \hat{\sigma} \epsilon_{i2} \right). \quad (16)$$

Since $\hat{\sigma} \to \sigma$ as $N \to \infty$, it follows that $E(\hat{\mu}_{\mathrm{UW}}) \to \mu$ under the assumptions $E(e_i) = E(\epsilon_{i1}) = E(\epsilon_{i2}) = 0$. Thus, a simplified form of $\hat{\mu}_{\mathrm{UW}}$ is given by (16) as

$$\hat{\mu}_{\mathrm{UW}} = \mu + \frac{1}{I} \sum_{i=1}^{I} \left( \epsilon_{i1} - e_i - \sigma \epsilon_{i2} \right). \quad (17)$$

To derive the variance of $\hat{\mu}_{\mathrm{UW}}$, assume that $e_i$, $\epsilon_{i1}$, and $\epsilon_{i2}$ are mutually independent. Then, from (17),

$$\mathrm{Var}(\hat{\mu}_{\mathrm{UW}}) = \frac{1}{I^2} \sum_{i=1}^{I} \left( \omega_i^2 + \mathrm{Var}(\epsilon_{i1}) + \sigma^2 \mathrm{Var}(\epsilon_{i2}) \right). \quad (18)$$

Equation (18) shows that the variance of $\hat{\mu}_{\mathrm{UW}}$ consists of two components: the variance due to random DIF effects $e_i$ and the variance from sampling errors $\epsilon_{i1}$ and $\epsilon_{i2}$. Using (13) and (14), the expression can be further rephrased as

$$\mathrm{Var}(\hat{\mu}_{\mathrm{UW}}) = \frac{1}{I^2} \sum_{i=1}^{I} \omega_i^2 + \frac{c_1 + c_2}{N I^2} \sum_{i=1}^{I} \frac{1}{a_i^2}. \quad (19)$$
As the sample size increases, the contribution of the sampling error variance diminishes. However, the variance component due to random DIF remains nonzero even in the limit of infinite sample size.
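A compact R sketch of the UW estimator in (15), using the $\hat{\sigma}$ helper above (names illustrative):

```r
# unweighted MGM estimate of the group mean mu, Equation (15)
estimate_mu_uw <- function(b1, b2, sigma_hat) {
  mean(b1 - sigma_hat * b2)
}

# error-free check: parameters generated by Equations (4) and (5)
a1 <- c(0.6, 0.8, 1.2); b1 <- c(-0.5, 0.2, 1.0)
mu <- 0.3; sigma <- 1.2
b2 <- (b1 - mu) / sigma
estimate_mu_uw(b1, b2, sigma)  # 0.3
```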
2.5.2. Discrimination-Weighted MGM Linking (DW)
An alternative MGM linking estimate for $\mu$ relies on the identification given in Equation (6). Following the rationale used in the invariance alignment [37,38] method, the absence of DIF effects yields the identity

$$d_{i1} = d_{i2} - a_{i2} \sigma^{-1} \mu. \quad (20)$$

This identity motivates the estimation of $\mu$ as the minimizer of the least squares criterion

$$H(\mu) = \sum_{i=1}^{I} \left( \hat{d}_{i1} - \hat{d}_{i2} + \hat{a}_{i2} \hat{\sigma}^{-1} \mu \right)^2, \quad (21)$$

which leads to the estimator

$$\hat{\mu}_{\mathrm{DW}} = \hat{\sigma} \, \frac{\sum_{i=1}^{I} \hat{a}_{i2} \left( \hat{d}_{i2} - \hat{d}_{i1} \right)}{\sum_{i=1}^{I} \hat{a}_{i2}^2}. \quad (22)$$

The estimator in (22) can be further rewritten using $\hat{d}_{ig} = -\hat{a}_{ig} \hat{b}_{ig}$ as (see [19])

$$\hat{\mu}_{\mathrm{DW}} = \hat{\sigma} \, \frac{\sum_{i=1}^{I} \hat{a}_{i2} \left( \hat{a}_{i1} \hat{b}_{i1} - \hat{a}_{i2} \hat{b}_{i2} \right)}{\sum_{i=1}^{I} \hat{a}_{i2}^2}. \quad (23)$$

To analyze the statistical properties of $\hat{\mu}_{\mathrm{DW}}$ as defined in (23), simplifying assumptions are applied: $\hat{a}_{i1} = a_i$, $\hat{a}_{i2} = \sigma a_i$, and $\hat{\sigma} = \sigma$. Under these assumptions, the estimator simplifies to

$$\hat{\mu}_{\mathrm{DW}} = \frac{\sum_{i=1}^{I} a_i^2 \left( \hat{b}_{i1} - \sigma \hat{b}_{i2} \right)}{\sum_{i=1}^{I} a_i^2}. \quad (24)$$

This expression reveals that $\hat{\mu}_{\mathrm{DW}}$ is a weighted average of item difficulty differences, where the weights are proportional to the squared item discriminations. Consequently, the estimator in (24) is referred to as discrimination-weighted MGM linking (DW).
The estimator in (24) can be expressed in terms of the DIF effects $e_i$ and the sampling errors $\epsilon_{i1}$ and $\epsilon_{i2}$ as

$$\hat{\mu}_{\mathrm{DW}} = \mu + \frac{\sum_{i=1}^{I} a_i^2 \left( \epsilon_{i1} - e_i - \sigma \epsilon_{i2} \right)}{\sum_{i=1}^{I} a_i^2}. \quad (25)$$

As with the UW estimator, this formulation yields an asymptotically unbiased estimate of $\mu$, i.e., $E(\hat{\mu}_{\mathrm{DW}}) \to \mu$ as $N \to \infty$. The variance of $\hat{\mu}_{\mathrm{DW}}$ is given by

$$\mathrm{Var}(\hat{\mu}_{\mathrm{DW}}) = \frac{\sum_{i=1}^{I} a_i^4 \left( \omega_i^2 + \mathrm{Var}(\epsilon_{i1}) + \sigma^2 \mathrm{Var}(\epsilon_{i2}) \right)}{\left( \sum_{i=1}^{I} a_i^2 \right)^2}, \quad (26)$$

which can be further simplified using the variance expressions from (13) and (14) (cf. (19)) as

$$\mathrm{Var}(\hat{\mu}_{\mathrm{DW}}) = \frac{\sum_{i=1}^{I} a_i^4 \omega_i^2}{\left( \sum_{i=1}^{I} a_i^2 \right)^2} + \frac{c_1 + c_2}{N \sum_{i=1}^{I} a_i^2}. \quad (27)$$

The left-hand variance component in (27) indicates that an optimal estimate of $\mu$ is obtained when the DIF effects satisfy $\omega_i^2 \propto a_i^{-2}$, corresponding to random DIF with homogeneous variances in item intercepts. In this case, the weighting by squared item discriminations enhances precision, as the sampling variance of estimated item difficulties is also proportional to $a_i^{-2}$. However, if random DIF with homogeneous variance occurs in item difficulties rather than in item intercepts, the discrimination-based weighting in DW may result in a higher variance compared to the equal weighting used in UW, particularly in large samples where the contribution from sampling error becomes negligible.
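Analogously, a sketch of the DW estimator: the general form in (22) works on intercepts, and the check below confirms that it recovers $\mu$ exactly with error-free parameters (names illustrative):

```r
# discrimination-weighted MGM estimate, general form of Equation (22)
estimate_mu_dw <- function(a2, d1, d2, sigma_hat) {
  sigma_hat * sum(a2 * (d2 - d1)) / sum(a2^2)
}

# error-free check using Equations (4) to (6)
a1 <- c(0.6, 0.8, 1.2); b1 <- c(-0.5, 0.2, 1.0)
mu <- 0.3; sigma <- 1.2
a2 <- sigma * a1; b2 <- (b1 - mu) / sigma
d1 <- -a1 * b1; d2 <- -a2 * b2
estimate_mu_dw(a2, d1, d2, sigma)  # 0.3
```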
2.5.3. Precision-Weighted MGM Linking (PW)
The UW linking method assigns equal weights to the item difficulty differences $\hat{b}_{i1} - \hat{\sigma} \hat{b}_{i2}$. As an alternative, these differences can be weighted by their precisions, that is, the inverses of their sampling variances [7,39]. This approach yields the estimator

$$\hat{\mu}_{\mathrm{PW}} = \frac{\sum_{i=1}^{I} \hat{w}_i \left( \hat{b}_{i1} - \hat{\sigma} \hat{b}_{i2} \right)}{\sum_{i=1}^{I} \hat{w}_i}, \quad (28)$$

where the precision weights $\hat{w}_i$ must be estimated. The variances $\widehat{\mathrm{Var}}(\hat{b}_{i1})$ and $\widehat{\mathrm{Var}}(\hat{b}_{i2})$ are obtained from the observed information matrix in the group-wise scaling models. Based on these variances, the weights are defined as

$$\hat{w}_i = \left( \widehat{\mathrm{Var}}(\hat{b}_{i1}) + \hat{\sigma}^2 \, \widehat{\mathrm{Var}}(\hat{b}_{i2}) \right)^{-1}. \quad (29)$$
Using (13) and (14), the precision weights can be approximately expressed as a function of the item parameters and the random DIF effects $e_i$ (Equation (30)). Importantly, (30) highlights that the estimated precision weights depend on the random DIF effects $e_i$. For small values of $e_i$, a linear Taylor approximation of (30) yields

$$\hat{w}_i \approx A_i + B_i e_i. \quad (31)$$

Note that $A_i$ and $B_i$ are independent of the random DIF effect $e_i$.
The estimator in (28) can be rephrased as

$$\hat{\mu}_{\mathrm{PW}} = \mu + \frac{\sum_{i=1}^{I} \hat{w}_i \left( \epsilon_{i1} - e_i - \sigma \epsilon_{i2} \right)}{\sum_{i=1}^{I} \hat{w}_i}. \quad (32)$$

Assuming independence between the sampling errors and the random DIF effects $e_i$, and inserting the approximation (31) into (32), the expectation of $\hat{\mu}_{\mathrm{PW}}$ as $N \to \infty$ is given by

$$E(\hat{\mu}_{\mathrm{PW}}) \approx \mu - \frac{\sum_{i=1}^{I} B_i \omega_i^2}{\sum_{i=1}^{I} A_i}. \quad (34)$$

According to (34), the PW linking method may produce a negatively biased estimate of $\mu$ when the terms $B_i \omega_i^2$ are, on average, greater than zero, that is, when the estimated precision weights covary positively with the DIF effects. However, in the absence of random DIF, the PW method does not exhibit bias in the estimation of $\mu$.
The variance of the PW estimate $\hat{\mu}_{\mathrm{PW}}$ can be derived analogously to that of the UW and DW estimators, although it offers limited additional insight. By construction, the PW linking method yields the smallest variance in the absence of DIF, as it employs optimal precision weights.
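A sketch of the PW estimator in (28) and (29); here the item-wise sampling variances var_b1 and var_b2 stand in for the values extracted from the observed information matrix (all names illustrative):

```r
# precision-weighted MGM estimate, Equations (28) and (29)
estimate_mu_pw <- function(b1, b2, var_b1, var_b2, sigma_hat) {
  w <- 1 / (var_b1 + sigma_hat^2 * var_b2)  # precision weights
  sum(w * (b1 - sigma_hat * b2)) / sum(w)
}

# illustrative call with made-up sampling variances
a1 <- c(0.6, 0.8, 1.2); b1 <- c(-0.5, 0.2, 1.0)
mu <- 0.3; sigma <- 1.2
b2 <- (b1 - mu) / sigma
var_b1 <- 1 / (1000 * a1^2)          # variances proportional to 1 / (N a_i^2)
var_b2 <- 1 / (1000 * (sigma * a1)^2)
estimate_mu_pw(b1, b2, var_b1, var_b2, sigma)  # 0.3 (no DIF, no error)
```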
3. Simulation Study
In this Simulation Study, the performance of the three MGM linking specifications (i.e., UW, DW, and PW) outlined in Section 2.5 is compared.
3.1. Method
The data-generating model was based on the 2PL model applied to two groups. For the first group, the latent variable $\theta$ followed a standard normal distribution with a fixed mean of 0 and SD of 1. For the second group, $\theta$ was also normally distributed, with a fixed mean $\mu$ and SD $\sigma$ that were held constant across all simulation conditions.
The simulation study used $I = 20$ or $I = 40$ items. Group-specific item parameters $a_{ig}$ and $b_{ig}$ for each item $i$ and for groups $g = 1, 2$ were derived from fixed base parameters and newly simulated random DIF effects in each replication. The item parameters were constructed using 10 base items. These base items were duplicated twice in the 20-item condition and four times in the 40-item condition. For the 10 base items, the base item discriminations were set to 0.6 for the first five items and 1.2 for the remaining five. Base item difficulties were assigned five fixed values for the first five items, with the same sequence repeated for the remaining items. The complete set of item parameters is available at https://osf.io/xa4qz (accessed on 3 May 2025).
For the first group, item discriminations and item difficulties were set to the base item parameters. In the second group, DIF effects with a homogeneous variance were introduced either in item difficulties or item intercepts. A normally distributed random DIF effect with DIF SD $\omega$ was added to the corresponding item difficulty or item intercept. The DIF SD $\omega$ was chosen as 0, 0.25, or 0.5. Combined with the type of DIF effects (i.e., in item difficulties $b_i$ or item intercepts $d_i$), five different DIF conditions (i.e., no DIF; $\omega = 0.25$ and DIF in $b_i$; $\omega = 0.5$ and DIF in $b_i$; $\omega = 0.25$ and DIF in $d_i$; and $\omega = 0.5$ and DIF in $d_i$) were simulated. Item discriminations in the second group were kept identical to the base values, ensuring no DIF in discrimination parameters.
Per-group sample sizes of $N$ = 500, 1000, 2000, and infinity (denoted as Inf) were selected to represent typical ranges encountered in medium- to large-scale testing scenarios involving the 2PL model [40]. For infinite sample sizes, no item responses were simulated. However, the item parameters used in MGM linking still included the random DIF effects in this case.
In each of the 4 (sample size $N$) × 2 (number of items $I$) × 5 (random DIF conditions) simulation conditions, 7500 replications were conducted. The three MGM specifications—UW, DW, and PW—were applied to the simulated datasets. The bias, SD, and root mean square error (RMSE) of the estimated mean $\hat{\mu}$ were computed. The relative RMSE of an estimator was defined as the RMSE of a given method divided by the RMSE of the UW method, which served as the reference.
All analyses in this simulation study were performed using R (Version 4.4.1; [41]). The 2PL model was fitted using the sirt::xxirt() function from the R package sirt (Version 4.2-114; [42]). Dedicated functions were developed to estimate the different MGM models. Replication materials for this study can be accessed at https://osf.io/xa4qz (accessed on 3 May 2025).
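For orientation, the following condensed R sketch outlines a single replication of the linking step under random DIF in item difficulties. It works directly with the identified item parameters, as in the infinite-sample condition of the study, so no item responses are simulated; all parameter values and names are illustrative, not the study's exact settings:

```r
set.seed(42)
I <- 20; mu <- 0.3; sigma <- 1.2; omega <- 0.25

# base item parameters (hypothetical) and random DIF in item difficulties
a <- rep(c(0.6, 1.2), each = I / 2)
b <- rnorm(I)
e <- rnorm(I, sd = omega)            # random DIF effects

# identified parameters (infinite sample size: no sampling error)
a1 <- a;           b1 <- b
a2 <- sigma * a;   b2 <- (b + e - mu) / sigma

# MGM linking
sigma_hat <- exp(mean(log(a2) - log(a1)))               # Equation (12)
mu_uw <- mean(b1 - sigma_hat * b2)                      # UW, Equation (15)
mu_dw <- sum(a1^2 * (b1 - sigma_hat * b2)) / sum(a1^2)  # DW, Equation (24)
c(sigma_hat, mu_uw, mu_dw)
```

Repeating such a replication many times and aggregating the estimates yields the bias, SD, and RMSE summaries reported below.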
3.2. Results
Table 1 presents the bias of the estimated group mean $\hat{\mu}$ as a function of the number of items $I$ and the sample size $N$. In the absence of DIF ($\omega = 0$), all three MGM methods yielded unbiased estimates. When DIF was present in either item difficulties $b_i$ or item intercepts $d_i$, the UW and DW methods continued to produce unbiased estimates. Consistent with the analytical findings in Section 2.5.3, the PW method exhibited bias under these conditions. The magnitude of the bias increased with larger DIF SD $\omega$. Notably, the bias of the PW method did not diminish with increasing sample size $N$.
Table 1.
Simulation Study: Bias of the estimated mean $\hat{\mu}$ as a function of the DIF SD $\omega$, the type of DIF effects, the number of items $I$, and the sample size $N$.
Table 2 reports the SD of the estimated group mean $\hat{\mu}$ as a function of the number of items $I$ and the sample size $N$. As expected, the SD decreased with increasing sample size and increased with higher DIF SD $\omega$. The SD also declined with a larger number of items. In the no-DIF condition ($\omega = 0$), the PW method produced estimates with the lowest SD, followed by the DW and UW methods. When random DIF was present in item difficulties $b_i$, PW resulted in the smallest SD for smaller sample sizes, whereas UW became more efficient than DW and PW as the sample size increased. A comparable pattern emerged for DIF in item intercepts $d_i$, with the distinction that DW instead of UW yielded the smallest SD in larger samples.
Table 2.
Simulation Study: Standard deviation (SD) of the estimated mean $\hat{\mu}$ as a function of the DIF SD $\omega$, the type of DIF effects, the number of items $I$, and the sample size $N$.
Table 3 presents the relative RMSE of the estimated group mean $\hat{\mu}$ as a function of the number of items $I$ and the sample size $N$. The PW method exhibited the lowest RMSE in the no-DIF condition and in DIF conditions with small sample sizes. In large samples, the UW method yielded the smallest RMSE when DIF was present in item difficulties $b_i$. In contrast, in conditions with DIF in item intercepts $d_i$, the DW and PW methods outperformed UW, with DW showing a slight efficiency advantage over PW at larger sample sizes.
Table 3.
Simulation Study: Relative root mean square error (RMSE) of the estimated mean $\hat{\mu}$ as a function of the DIF SD $\omega$, the type of DIF effects, the number of items $I$, and the sample size $N$.
Overall, the results of this Simulation Study showed that the performance of the UW, DW, and PW methods depended on the type of simulated DIF effects. The PW linking method yielded unbiased and efficient estimates only in the absence of random DIF. In DIF conditions with item difficulties affected, the UW method outperformed DW. Conversely, when DIF was simulated in item intercepts, the DW method was superior to UW.
4. Empirical Example: PISA 2006 Reading
The Simulation Study presented in Section 3 demonstrates that the performance of the three MGM methods depends on the presence and nature of random DIF in the data. To investigate whether random DIF occurs in item intercepts or item difficulties, the PISA 2006 dataset [43] for the reading domain was analyzed. This dataset includes participants from 26 selected countries (see Appendix A) that participated in the PISA 2006 study. The full PISA 2006 dataset is publicly accessible at https://www.oecd.org/en/data/datasets/pisa-2006-database.html (accessed on 3 May 2025).
Items in the reading domain were administered to a subset of students participating in the PISA 2006 study. The analysis included students who had been administered at least one item from the respective cognitive domain. In total, the analysis included 110,236 students, with sample sizes per country ranging from 2010 to 12,142.
A few of the 28 reading items were originally scored polytomously but were recoded into dichotomous scores for simplicity in this empirical example, with only the highest category considered correct. The remaining items were handled as dichotomous, consistent with the original treatment in PISA.
Student (sampling) weights were applied in all analyses. To guarantee equal influence from each country, weights within each country were normalized to sum to 5000. It should be noted that the choice of 5000 is arbitrary; any constant value would serve equally well to balance contributions across countries.
In the first step, international item parameters were estimated by fitting the 2PL model to the weighted, combined dataset for each domain. These item parameters, along with other relevant information, are presented in Table 4. The average item discrimination was 1.402, suggesting a relatively well-discriminating test, while the average item difficulty was slightly negative, indicating that the items were somewhat easy relative to the ability of students in the total population.
Table 4.
Empirical Example, PISA 2006 Reading: International item parameters and descriptive statistics of DIF effects for all 28 items.
In the second step, country means $\hat{\mu}_c$ and country SDs $\hat{\sigma}_c$ were computed using the fixed international item parameters presented in Table 4. The means and SDs for the 26 countries, based on the original logit scale of the 2PL model, are reported in Table 5.
Table 5.
Empirical Example, PISA 2006 Reading: Descriptive statistics for countries and estimated SD of DIF effects.
In the third step, DIF effects were determined for each country. The country mean and country SD were fixed at $\hat{\mu}_c$ and $\hat{\sigma}_c$, as obtained from the second step, while the international item parameters $a_i$ and $b_i$ were used. Specifically, the IRT model

$$P(X_i = 1 \mid \theta) = \Psi\bigl( a_i (\theta - b_i - e_{ic}) \bigr), \quad \theta \sim N(\hat{\mu}_c, \hat{\sigma}_c^2), \quad (35)$$

was applied in each country $c$, where $N(\cdot, \cdot)$ denotes the normal distribution. It is important to note that in (35), only the DIF effects $e_{ic}$ and their sampling variances were computed.
Using the data on estimated DIF effects $\hat{e}_{ic}$, the distribution of DIF effects within each country was examined. As an initial descriptive step, the empirical SD of the DIF effects was calculated and is reported in Table 5. The average empirical DIF SD across countries indicated considerable heterogeneity in DIF effects.
To account for the contribution of sampling variance in the observed DIF effect estimates $\hat{e}_{ic}$, maximum likelihood estimation was applied to the following model for DIF effects:

$$\hat{e}_{ic} \sim N\left( \nu_c, \; \omega_c^2 + s_{ic}^2 \right). \quad (36)$$

Here, the parameters $\nu_c$ and the DIF SD $\omega_c$ were estimated, and $s_{ic}^2$ denotes the estimated sampling variance of $\hat{e}_{ic}$. Model (36) corresponds to a random-effects meta-analysis model with known error variances and was estimated using the stats::optim() function in R (Version 4.4.1; [41]). The resulting bias-corrected estimates are also reported in Table 5 and were, as expected, slightly lower than the empirical SDs. Note that model (36) assumes random DIF with homogeneous variances for DIF in item difficulties.
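A minimal R sketch of this random-effects model with known error variances, fitted by ML via stats::optim() as in the article (the simulated data and all variable names are illustrative):

```r
set.seed(123)
I <- 28
s2 <- runif(I, 0.002, 0.02)          # known sampling variances of DIF effects
e_hat <- rnorm(I, mean = 0, sd = sqrt(0.25^2 + s2))  # estimated DIF effects

# negative log-likelihood of Model (36): e_hat_i ~ N(nu, omega^2 + s2_i)
negll <- function(par, e_hat, s2) {
  nu <- par[1]; omega <- exp(par[2])  # log-parameterized SD keeps omega > 0
  -sum(dnorm(e_hat, mean = nu, sd = sqrt(omega^2 + s2), log = TRUE))
}

fit <- optim(c(0, log(0.2)), negll, e_hat = e_hat, s2 = s2, method = "BFGS")
c(nu = fit$par[1], omega = exp(fit$par[2]))  # bias-corrected DIF SD estimate
```

The competing intercept-based model (37) would replace omega^2 + s2 with omega^2 / a_i^2 + s2 in the same likelihood; since both models have the same number of parameters, their maximized log-likelihoods can be compared directly.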
The model for DIF effects in (36) is contrasted with the alternative model

$$\hat{e}_{ic} \sim N\left( \nu_c, \; \tilde{\omega}_c^2 / a_i^2 + s_{ic}^2 \right), \quad (37)$$

which represents DIF in item intercepts under the assumption of a homogeneous variance $\tilde{\omega}_c^2$. Again, this model was fitted using the stats::optim() function in R [41]. The corresponding estimates are reported in Table 5. On average, the intercept-based DIF SD estimates were slightly larger than the difficulty-based estimates. This result aligns with expectations, given that the average item discrimination clearly exceeded 1.
Because the competing models (36) and (37) were based on the same data (i.e., the estimated DIF effects $\hat{e}_{ic}$) and involved the same number of estimated parameters, their log-likelihood values can be directly compared to assess whether the assumption of DIF in item difficulties or item intercepts is more appropriate. The corresponding difference in log-likelihood is reported in Table 5. Notably, the model assuming DIF in item intercepts provided a better fit in 23 out of 26 countries.
Table 4 also includes item-level DIF SD estimates, computed for individual items across countries, to examine whether certain items were more susceptible to country-level DIF than others. Substantial variability in these estimates across items was observed. To evaluate the plausibility of the normality assumption for DIF effects, the Shapiro–Wilk test for normality was applied. The corresponding p values are listed as p(SW) in Table 4. A total of 8 out of 28 items showed statistically significant deviations from normality. Additionally, Figure 1 presents histograms of estimated DIF effects for nine selected items, along with the estimated DIF SD and the Shapiro–Wilk p value. While outliers in DIF effects were evident for items R104Q02 and R219Q01E, the overall pattern suggested unsystematic variation in DIF effects, supporting the plausibility of the normal distribution assumption for random DIF. In contrast, the commonly assumed partial invariance structure—where only a subset of items exhibits large absolute DIF effects while the majority shows small or no DIF effects [44,45]—was clearly not supported by the data.

Figure 1.
Empirical Example, PISA 2006 Reading: Histograms of estimated DIF effects for nine selected items (R102Q07, R104Q01, R104Q02, R104Q05, R111Q01, R111Q02B, R111Q06B, R219Q01E, and R219Q01T), along with the estimated DIF SD and the Shapiro–Wilk test for normality (p(SW)). DIF effects of −0.4 and 0.4 are marked by red vertical dashed lines.
5. Discussion
This article examines three specifications of MGM linking that differ in the computation of the group mean $\mu$. The UW method assigns equal weight to all item difficulty differences; the DW method weights these differences by the squared item discriminations; and the PW method applies precision weights that account for the sampling error in item difficulty differences. The relative performance of the three methods depends on the data-generating model. When no random DIF is present, the PW method consistently outperforms the other two. In the presence of random DIF, the estimated group mean is influenced by both DIF and sampling error. Thus, the effectiveness of each method depends on the relative contribution of these two sources of variance.
When random DIF with homogeneous variances affects item difficulties, the UW method outperforms both DW and PW under large sample conditions. Conversely, if random DIF with homogeneous variances influences item intercepts, the DW method yields superior results among the three approaches.
Because the estimated precision weights in PW linking reflect the presence of DIF effects, a bias arises from the covariance between these weights and the item difficulty differences, since DIF affects both quantities. Therefore, if random DIF is suspected, the PW method is not recommended.
Given that the performance of the UW and DW methods depends on whether random DIF affects item difficulties or item intercepts, the PISA 2006 reading dataset was analyzed to investigate which assumption is more tenable. Model fit comparisons indicated that DIF effects were more prevalent in item intercepts than in item difficulties for the majority of countries. This empirical evidence suggests that the DW method may be preferable when selecting an MGM specification in applied settings. Furthermore, as DIF in item intercepts appears to be the more tenable assumption in practice, future simulation studies may benefit from focusing on DIF effects in item intercepts rather than in item difficulties. Nevertheless, future research should examine whether this specific result from the PISA 2006 reading dataset generalizes to other empirical contexts.
However, weighting items based on item discrimination may not accurately reflect group differences, as the group difference should ideally assign equal weight to all items to preserve the intended test composition [46,47]. Nonetheless, this critique may not fully hold, as item weighting is already introduced when selecting the 2PL model—incorporating item discrimination—over the Rasch model [48], which applies equal weighting in the IRT framework.
Future research could examine the comparison between the UW and PW methods within the Rasch model. In this setting, MGM linking is replaced by mean–mean linking, as only the group means are aligned, and the group SDs are freely estimated. As noted by an anonymous reviewer, the Rasch model may exhibit special measurement properties compared to the 2PL model [49,50,51] (but see [52,53,54]), which can lead to its preference among practitioners [55,56,57,58]. Several established tools for detecting DIF have been developed within the Rasch framework [59,60]. The relative efficiency of the UW and PW methods depends on the relative magnitude of random DIF SD and sampling error. Importantly, the PW method is also expected to introduce bias in the estimated group mean under the Rasch model when random DIF effects are present.
Funding
This research received no external funding.
Institutional Review Board Statement
Ethical review and approval were waived for this study because it is a secondary analysis of PISA data, for which approval already exists.
Informed Consent Statement
Informed consent was obtained from all subjects involved in the study.
Data Availability Statement
Replication material for the Simulation Study in Section 3 can be found at https://osf.io/xa4qz (accessed on 3 May 2025). The PISA 2006 dataset used in Section 4 can be downloaded from https://www.oecd.org/en/data/datasets/pisa-2006-database.html (accessed on 3 May 2025).
Conflicts of Interest
The author declares no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| 2PL | two-parameter logistic |
| DIF | differential item functioning |
| DW | discrimination-weighted mean–geometric mean linking |
| IRF | item response function |
| IRT | item response theory |
| MGM | mean–geometric mean |
| MML | marginal maximum likelihood |
| PW | precision-weighted mean–geometric mean linking |
| PISA | Programme for International Student Assessment |
| RMSE | root mean square error |
| SD | standard deviation |
| UW | unweighted mean–geometric mean linking |
Appendix A. Country Labels for the PISA 2006 Study
The country labels used in Table 5 are as follows: AUS = Australia; AUT = Austria; BEL = Belgium; CAN = Canada; CHE = Switzerland; CZE = Czech Republic; DEU = Germany; DNK = Denmark; ESP = Spain; EST = Estonia; FIN = Finland; FRA = France; GBR = United Kingdom; GRC = Greece; HUN = Hungary; IRL = Ireland; ISL = Iceland; ITA = Italy; JPN = Japan; KOR = Korea; LUX = Luxembourg; NLD = The Netherlands; NOR = Norway; POL = Poland; PRT = Portugal; SWE = Sweden.
References
- Bock, R.D.; Gibbons, R.D. Item Response Theory; Wiley: Hoboken, NJ, USA, 2021. [Google Scholar] [CrossRef]
- Reckase, M.D. Multidimensional Item Response Theory Models; Springer: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
- Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154. [Google Scholar]
- van der Linden, W.J. Unidimensional logistic response models. In Handbook of Item Response Theory, Volume 1: Models; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 11–30. [Google Scholar] [CrossRef]
- Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479. [Google Scholar]
- Kamata, A.; Bauer, D.J. A note on the relation between factor analytic and item response theory models. Struct. Equ. Model. 2008, 15, 136–153. [Google Scholar] [CrossRef]
- van der Linden, W.J.; Barrett, M.D. Linking item response model parameters. Psychometrika 2016, 81, 650–673. [Google Scholar] [CrossRef] [PubMed]
- Bock, R.D.; Aitkin, M. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika 1981, 46, 443–459. [Google Scholar] [CrossRef]
- Glas, C.A.W. Maximum-likelihood estimation. In Handbook of Item Response Theory, Volume 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 197–216. [Google Scholar] [CrossRef]
- Kolen, M.J.; Brennan, R.L. Test Equating, Scaling, and Linking; Springer: New York, NY, USA, 2014. [Google Scholar] [CrossRef]
- Holland, P.W.; Wainer, H. (Eds.) Differential Item Functioning: Theory and Practice; Lawrence Erlbaum: Hillsdale, NJ, USA, 1993. [Google Scholar] [CrossRef]
- Millsap, R.E. Statistical Approaches to Measurement Invariance; Routledge: New York, NY, USA, 2011. [Google Scholar] [CrossRef]
- Penfield, R.D.; Camilli, G. Differential item functioning and item bias. In Handbook of Statistics, Vol. 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Elsevier: Amsterdam, The Netherlands, 2007; pp. 125–167. [Google Scholar] [CrossRef]
- Lee, W.C.; Lee, G. IRT linking and equating. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 639–673. [Google Scholar] [CrossRef]
- Sansivieri, V.; Wiberg, M.; Matteucci, M. A review of test equating methods with a special focus on IRT-based approaches. Statistica 2017, 77, 329–352. [Google Scholar] [CrossRef]
- Mislevy, R.J.; Bock, R.D. BILOG 3. Item Analysis and Test Scoring with Binary Logistic Models; Software Manual; Scientific Software International: Chicago, IL, USA, 1990. [Google Scholar]
- Haberman, S.J. Linking Parameter Estimates Derived from an Item Response Model Through Separate Calibrations; (Research Report No. RR-09-40); Educational Testing Service: Princeton, NJ, USA, 2009. [Google Scholar] [CrossRef]
- Battauz, M. equateIRT: An R package for IRT test equating. J. Stat. Softw. 2015, 68, 1–22. [Google Scholar] [CrossRef]
- Robitzsch, A. Extensions to mean–geometric mean linking. Mathematics 2025, 13, 35. [Google Scholar] [CrossRef]
- De Boeck, P. Random item IRT models. Psychometrika 2008, 73, 533–559. [Google Scholar] [CrossRef]
- Fox, J.P.; Verhagen, A.J. Random item effects modeling for cross-national survey data. In Cross-Cultural Analysis: Methods and Applications; Davidov, E., Schmidt, P., Billiet, J., Eds.; Routledge: London, UK, 2010; pp. 461–482. [Google Scholar] [CrossRef]
- de Jong, M.G.; Steenkamp, J.B.E.M.; Fox, J.P. Relaxing measurement invariance in cross-national consumer research using a hierarchical IRT model. J. Consum. Res. 2007, 34, 260–278. [Google Scholar] [CrossRef]
- Monseur, C.; Berezner, A. The computation of equating errors in international surveys in education. J. Appl. Meas. 2007, 8, 323–335. [Google Scholar]
- Michaelides, M.P.; Haertel, E.H. Selection of common items as an unrecognized source of variability in test equating: A bootstrap approximation assuming random sampling of common items. Appl. Meas. Educ. 2014, 27, 46–57. [Google Scholar] [CrossRef]
- Monseur, C.; Sibberns, H.; Hastedt, D. Linking errors in trend estimation for international surveys in education. IERI Monogr. Ser. 2008, 1, 113–122. [Google Scholar]
- Robitzsch, A.; Lüdtke, O. Linking errors in international large-scale assessments: Calculation of standard errors for trend estimation. Assess. Educ. 2019, 26, 444–465. [Google Scholar] [CrossRef]
- Robitzsch, A. Linking error in the 2PL model. J 2023, 6, 58–84. [Google Scholar] [CrossRef]
- Robitzsch, A. Estimation of standard error, linking error, and total error for robust and nonrobust linking methods in the two-parameter logistic model. Stats 2024, 7, 592–612. [Google Scholar] [CrossRef]
- Robitzsch, A.; Lüdtke, O. An examination of the linking error currently used in PISA. Meas. Interdiscip. Res. Persp. 2024, 22, 61–77. [Google Scholar] [CrossRef]
- Sachse, K.A.; Roppelt, A.; Haag, N. A comparison of linking methods for estimating national trends in international comparative large-scale assessments in the presence of cross-national DIF. J. Educ. Meas. 2016, 53, 152–171. [Google Scholar] [CrossRef]
- Sachse, K.A.; Haag, N. Standard errors for national trends in international large-scale assessments in the case of cross-national differential item functioning. Appl. Meas. Educ. 2017, 30, 102–116. [Google Scholar] [CrossRef]
- Wu, M. Measurement, sampling, and equating errors in large-scale assessments. Educ. Meas. 2010, 29, 15–27. [Google Scholar] [CrossRef]
- Lohr, S.L. Sampling: Design and Analysis; Chapman and Hall/CRC: London, UK, 2021. [Google Scholar] [CrossRef]
- Robitzsch, A. A comparison of linking methods for two groups for the two-parameter logistic item response model in the presence and absence of random differential item functioning. Foundations 2021, 1, 116–144. [Google Scholar] [CrossRef]
- Chen, Y.; Li, C.; Ouyang, J.; Xu, G. DIF statistical inference without knowing anchoring items. Psychometrika 2023, 88, 1097–1122. [Google Scholar] [CrossRef]
- Yuan, K.H.; Cheng, Y.; Patton, J. Information matrices and standard errors for MLEs of item parameters in IRT. Psychometrika 2014, 79, 232–254. [Google Scholar] [CrossRef] [PubMed]
- Asparouhov, T.; Muthén, B. Multiple-group factor analysis alignment. Struct. Equ. Model. 2014, 21, 495–508. [Google Scholar] [CrossRef]
- Muthén, B.; Asparouhov, T. IRT studies of many groups: The alignment method. Front. Psychol. 2014, 5, 978. [Google Scholar] [CrossRef]
- Barrett, M.D.; van der Linden, W.J. Estimating linking functions for response model parameters. J. Educ. Behav. Stat. 2019, 44, 180–209. [Google Scholar] [CrossRef]
- Rutkowski, L.; von Davier, M.; Rutkowski, D. (Eds.) A Handbook of International Large-scale Assessment: Background, Technical Issues, and Methods of Data Analysis; Chapman Hall/CRC Press: London, UK, 2013. [Google Scholar] [CrossRef]
- R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2024; Available online: https://www.R-project.org (accessed on 15 June 2024).
- Robitzsch, A. sirt: Supplementary Item Response Theory Models, R Package Version 4.2-114. 2025. Available online: https://github.com/alexanderrobitzsch/sirt (accessed on 7 April 2025).
- OECD. PISA 2006 Technical Report; OECD: Paris, France, 2009; Available online: https://bit.ly/38jhdzp (accessed on 3 May 2025).
- von Davier, M.; Yamamoto, K.; Shin, H.J.; Chen, H.; Khorramdel, L.; Weeks, J.; Davis, S.; Kong, N.; Kandathil, M. Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assess. Educ. 2019, 26, 466–488. [Google Scholar] [CrossRef]
- Joo, S.H.; Khorramdel, L.; Yamamoto, K.; Shin, H.J.; Robin, F. Evaluating item fit statistic thresholds in PISA: Analysis of cross-country comparability of cognitive items. Educ. Meas. 2021, 40, 37–48. [Google Scholar] [CrossRef]
- Brennan, R.L. Misconceptions at the intersection of measurement theory and practice. Educ. Meas. 1998, 17, 5–9. [Google Scholar] [CrossRef]
- Camilli, G. IRT scoring and test blueprint fidelity. Appl. Psychol. Meas. 2018, 42, 393–400. [Google Scholar] [CrossRef]
- Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational Research: Copenhagen, Denmark, 1960. [Google Scholar]
- Heine, J.H.; Heene, M. Measurement and mind: Unveiling the self-delusion of metrification in psychology. Meas. Interdiscip. Res. Persp. 2024; Epub ahead of print. [Google Scholar] [CrossRef]
- Salzberger, T. The illusion of measurement: Rasch versus 2-PL. Rasch Meas. Trans. 2002, 16, 882. Available online: https://tinyurl.com/25wzmzb5 (accessed on 3 May 2025).
- Linacre, J.M. Understanding Rasch measurement: Estimation methods for Rasch measures. J. Outcome Meas. 1999, 3, 382–405. Available online: https://bit.ly/2UV6Eht (accessed on 3 May 2025).
- Ballou, D. Test scaling and value-added measurement. Educ. Financ. Policy 2009, 4, 351–383. [Google Scholar] [CrossRef]
- van der Linden, W.J. Fundamental measurement and the fundamentals of Rasch measurement. In Objective Measurement: Theory Into Practice (Vol. 2); Wilson, M., Ed.; Ablex Publishing Corporation: Hillsdale, NJ, USA, 1994; pp. 3–24. [Google Scholar]
- Robitzsch, A.; Lüdtke, O. Some thoughts on analytical choices in the scaling model for test scores in international large-scale assessment studies. Meas. Instrum. Soc. Sci. 2022, 4, 9. [Google Scholar] [CrossRef]
- Andrich, D.; Marais, I. A Course in Rasch Measurement Theory; Springer: New York, NY, USA, 2019. [Google Scholar] [CrossRef]
- Engelhard, G. Invariant Measurement; Routledge: New York, NY, USA, 2012. [Google Scholar] [CrossRef]
- Melin, J.; Bonn, S.E.; Pendrill, L.; Lagerros, Y.T. A questionnaire for assessing user satisfaction with mobile health apps: Development using Rasch measurement theory. JMIR mHealth uHealth 2020, 8, e15909. [Google Scholar] [CrossRef]
- Wu, M.; Tam, H.P.; Jen, T.H. Educational Measurement for Applied Researchers; Springer: Singapore, 2016. [Google Scholar] [CrossRef]
- Tennant, A.; Pallant, J.F. DIF matters: A practical approach to test if differential item functioning makes a difference. Rasch Meas. Trans. 2007, 20, 1082–1084. Available online: https://rb.gy/wbiku0 (accessed on 3 May 2025).
- Melin, J.; Cano, S.; Flöel, A.; Göschel, L.; Pendrill, L. The role of entropy in construct specification equations (CSE) to improve the validity of memory tests: Extension to word lists. Entropy 2022, 24, 934. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).