Abstract
Robust mean-geometric mean (MGM) linking methods enable reliable group comparisons in item response theory models under fixed and sparse differential item functioning. This article evaluates six alternative standard error and confidence interval (CI) estimation methods across four MGM linking approaches. Our Simulation Study demonstrates that CIs based on the delta method or bootstrap procedures using the normal distribution or empirical quantiles exhibit highly inflated coverage rates. In contrast, CIs derived from a weighted least squares estimation problem, as well as basic and bias-corrected bootstrap methods, yield satisfactory coverage rates in most simulation conditions for robust MGM linking.
1. Introduction
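Item response theory (IRT) models [1,2,3,4] provide a statistical framework for modeling multivariate discrete data. This article centers on dichotomous (i.e., binary) item responses and the use of linking methods to compare two groups [5,6]. Let $\boldsymbol{X} = (X_1, \ldots, X_I)$ denote a vector of $I$ binary random variables, where each $X_i \in \{0, 1\}$ corresponds to an item or (scored) item response. In a unidimensional IRT model [7], the joint distribution is specified for item response patterns $\boldsymbol{x} = (x_1, \ldots, x_I) \in \{0, 1\}^I$:
$$P(\boldsymbol{X} = \boldsymbol{x}; \boldsymbol{\delta}, \boldsymbol{\gamma}) = \int \prod_{i=1}^{I} P_i(\theta; \boldsymbol{\gamma}_i)^{x_i} \left[ 1 - P_i(\theta; \boldsymbol{\gamma}_i) \right]^{1 - x_i} \phi(\theta; \mu, \sigma) \, \mathrm{d}\theta , \qquad (1)$$
where $\phi$ is the density of the normal distribution, parameterized by the mean $\mu$ and the standard deviation (SD) $\sigma$. The latent variable $\theta$, also referred to as a latent trait, is characterized by distribution parameters grouped in the vector $\boldsymbol{\delta} = (\mu, \sigma)$. Item-specific parameters are represented by the vector $\boldsymbol{\gamma} = (\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_I)$, where each parameter vector $\boldsymbol{\gamma}_i$ parameterizes the item response function (IRF) $P_i(\theta; \boldsymbol{\gamma}_i) = P(X_i = 1 \mid \theta)$ for item $i$. This article employs the two-parameter logistic (2PL) model [8] with the IRF defined as
$$P_i(\theta; \boldsymbol{\gamma}_i) = \Psi\left( a_i (\theta - b_i) \right) , \qquad (2)$$
where $a_i$ and $b_i$ are the item discrimination and the item difficulty of item $i$, respectively. The function $\Psi(x) = [1 + \exp(-x)]^{-1}$ is the logistic distribution function, and $\boldsymbol{\gamma}_i = (a_i, b_i)$.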
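As a minimal sketch of the 2PL item response function, the following Python snippet evaluates the success probability $\Psi(a_i(\theta - b_i))$ for a single item. The function names `logistic` and `irf_2pl` are illustrative assumptions; the item parameters are taken from the second base item of the simulation design in Section 4.1.

```python
import math


def logistic(x: float) -> float:
    """Logistic distribution function Psi(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def irf_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the 2PL model."""
    return logistic(a * (theta - b))


# When theta equals the item difficulty, the success probability is 0.5
# regardless of the item discrimination.
p_mid = irf_2pl(theta=0.411, a=1.129, b=0.411)

# A higher trait value yields a higher success probability on the same item.
p_low = irf_2pl(theta=-1.0, a=1.129, b=0.411)
p_high = irf_2pl(theta=1.0, a=1.129, b=0.411)
```

The discrimination $a_i$ controls how quickly the probability rises around $\theta = b_i$, which is why linking on discriminations recovers the SD and linking on difficulties recovers the mean.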
The unknown parameters of the IRT model in (1) can be consistently estimated using marginal maximum likelihood estimation (MML; [9,10]) for a sample of individuals.
IRT models are commonly employed to assess group differences on a test by examining variations in the latent trait $\theta$, assuming the parametric form specified in (1). This article concentrates on linking methods [5,11,12] based on the 2PL model. The linking process involves two stages: first, the 2PL model is separately estimated for each group, accommodating potential differential item functioning (DIF; [13,14,15,16,17]); second, the differences in item parameters are utilized to compute group differences in the latent trait via a linking method [5,18].
This article investigates the accuracy of confidence interval (CI) and standard error estimates in mean-geometric mean (MGM; [5,19,20,21,22,23]) linking under fixed sparse uniform DIF [16] in item difficulties [24,25]. The traditional MGM method computes linking constants using mean differences of log-transformed item discriminations and untransformed item difficulties, relying on means as the location measure, which corresponds to using the $L_2$ loss function. This study evaluates CI coverage rates for robust MGM linking methods that employ the $L_p$ and $L_0$ loss functions [25]. These robust loss functions can effectively down-weight the influence of items with large DIF effects in the group comparison [18,26,27,28,29,30,31,32,33,34,35]. Analytically derived standard errors based on the delta method for robust MGM linking are compared with various parametric bootstrap approaches. To the best of our knowledge, statistical inference for robust linking methods remains underexplored in the literature, with almost no studies specifically addressing this topic (see [26] for an exception).
The rest of the article is organized as follows. Section 2 reviews robust MGM linking. Section 3 presents alternative standard error and CI estimation methods. Findings from a simulation study are reported in Section 4. Section 5 presents empirical examples that illustrate the application of the different CI estimation methods. Finally, the article closes with a discussion in Section 6 and conclusions in Section 7.
2. Robust Mean-Geometric Mean Linking
This section reviews robust MGM linking [25]. As in standard linking procedures, item discriminations $\hat{a}_{ig}$ and item difficulties $\hat{b}_{ig}$ ($i = 1, \ldots, I$) are estimated by fitting the 2PL model separately in the two groups $g = 1, 2$. The outcome of MGM linking is the estimation of the mean $\mu$ and the SD $\sigma$ for the second group, representing the difference in the $\theta$ variable relative to the first group, which is defined to have a mean of 0 and an SD of 1.
2.1. $L_p$ and $L_0$ Loss Functions
Robust MGM linking can be interpreted as the computation of a robust location measure [36,37,38]. A flexible class of loss functions is given by the $L_p$ loss function [39,40,41] for positive values $p$, defined as
$$\rho(x) = |x|^p . \qquad (3)$$
The $L_p$ loss function is not differentiable at $x = 0$ when $p \le 1$. A commonly used differentiable approximation of $\rho$ is
$$\rho(x) = (x^2 + \varepsilon)^{p/2} , \qquad (4)$$
where $\varepsilon > 0$ is a tuning parameter that controls the approximation error (see [42,43]). In practice, a value of $\varepsilon = 0.001$ has proven effective [41,44,45].
The $L_2$ loss function, a special case of $L_p$ with $p = 2$, is the most widely used and is defined as $\rho(x) = x^2$. However, the $L_2$ loss function is known to be sensitive to outliers and, unlike $L_p$ loss functions with smaller powers $p$, lacks robustness.
An alternative robust loss function is the $L_0$ loss function [46,47,48,49], defined as
$$\rho(x) = 1(x \neq 0) , \qquad (5)$$
where $1(\cdot)$ denotes the indicator function, which takes the value 0 for $x = 0$ and 1 for $x \neq 0$. A differentiable approximation is given by (see [50])
$$\rho(x) = \frac{x^2}{x^2 + \varepsilon} , \qquad (6)$$
where is a tuning parameter. The choice has shown satisfactory performance in applications [44,45,51]. Using a smaller value of slightly reduces the bias but generally increases the variance of the estimator. Thus, the selection of involves a bias–variance trade-off that must be evaluated individually for each empirical application.
The loss function with is often preferred over the loss function when error distributions are asymmetric or contain outliers [52]. In practice, (i.e., ) is frequently selected. However, the loss function is preferable to in terms of bias, although this advantage comes at the cost of increased sampling variance [25,53].
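The two smooth loss functions discussed in this section can be sketched in a few lines of Python. The forms $(x^2 + \varepsilon)^{p/2}$ and $x^2 / (x^2 + \varepsilon)$ are the approximations assumed here; the function names and the default of 0.01 for the $L_0$ tuning parameter are illustrative assumptions rather than values taken from the article.

```python
def lp_loss(x: float, p: float, eps: float = 1e-3) -> float:
    """Smooth approximation (x^2 + eps)^(p/2) of the L_p loss |x|^p."""
    return (x * x + eps) ** (p / 2.0)


def l0_loss(x: float, eps: float = 1e-2) -> float:
    """Smooth approximation x^2 / (x^2 + eps) of the L_0 loss 1(x != 0).

    The default eps is a hypothetical choice for illustration only.
    """
    return x * x / (x * x + eps)


# Loss assigned to a small residual (0.1) and a large residual (2.0).
residuals = [0.1, 2.0]
table = {
    "p=2": [lp_loss(r, 2, eps=0.0) for r in residuals],
    "p=1": [lp_loss(r, 1) for r in residuals],
    "p=0.5": [lp_loss(r, 0.5) for r in residuals],
    "p=0": [l0_loss(r) for r in residuals],
}

# Ratio of the large-residual loss to the small-residual loss.
# It shrinks as p decreases, so a single outlying item carries less weight.
ratios = {name: big / small for name, (small, big) in table.items()}
```

The shrinking ratio is one way to see why smaller powers down-weight items with large DIF effects, while the bounded $L_0$ approximation treats every clearly nonzero residual almost alike.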
2.2. Estimation of $\sigma$
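The estimation of $\sigma$ in robust MGM linking is based on log-transformed item discriminations. The parameter estimate $\hat{\sigma}$ is obtained by minimizing
$$H_1(\sigma) = \sum_{i=1}^{I} \rho\left( \log \hat{a}_{i2} - \log \hat{a}_{i1} - \log \sigma \right) , \qquad (7)$$
where $\rho$ is a differentiable loss function, which may be a differentiable approximation of $L_p$ or $L_0$. The corresponding estimating equation for $\hat{\sigma}$ is given by
$$\sum_{i=1}^{I} \rho'\left( \log \hat{a}_{i2} - \log \hat{a}_{i1} - \log \hat{\sigma} \right) = 0 , \qquad (8)$$
where $\rho'$ denotes the first derivative of the loss function $\rho$. For the squared loss function $\rho(x) = x^2$, a closed-form solution is available as
$$\hat{\sigma} = \exp\left( \frac{1}{I} \sum_{i=1}^{I} \left( \log \hat{a}_{i2} - \log \hat{a}_{i1} \right) \right) . \qquad (9)$$
Note that (9) corresponds to the original estimation method proposed in MGM linking [5,19,20].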
2.3. Estimation of $\mu$
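The estimation of $\mu$ in MGM linking is based on the previously estimated SD $\hat{\sigma}$. The group mean estimate $\hat{\mu}$ is obtained by minimizing
$$H_2(\mu) = \sum_{i=1}^{I} \rho\left( \hat{b}_{i1} - \hat{\sigma} \hat{b}_{i2} - \mu \right) , \qquad (10)$$
where $\rho$ is a differentiable loss function. The estimate $\hat{\mu}$ satisfies the estimating equation
$$\sum_{i=1}^{I} \rho'\left( \hat{b}_{i1} - \hat{\sigma} \hat{b}_{i2} - \hat{\mu} \right) = 0 , \qquad (11)$$
with $\rho'$ denoting the first derivative of $\rho$. For the squared loss function $\rho(x) = x^2$, a closed-form solution is available as
$$\hat{\mu} = \frac{1}{I} \sum_{i=1}^{I} \left( \hat{b}_{i1} - \hat{\sigma} \hat{b}_{i2} \right) , \qquad (12)$$
which corresponds to the originally proposed MGM linking method [5,19,20].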
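The two-step estimation of $\sigma$ and $\mu$ can be illustrated with a short Python sketch. It assumes that group-2 item parameters relate to group-1 parameters through $a_{i2} = \sigma a_{i1}$ and $b_{i2} = (b_{i1} - \mu)/\sigma$ in the absence of DIF, and it replaces the general-purpose optimizer used in the article with a simple grid search followed by a golden-section refinement. The function names, the group-2 values $\mu = 0.3$ and $\sigma = 1.2$, and the DIF effect $d = 1$ are hypothetical choices for illustration.

```python
import math


def lp_loss(x, p, eps=1e-3):
    """Smooth approximation (x^2 + eps)^(p/2) of |x|^p (assumed form)."""
    return (x * x + eps) ** (p / 2.0)


def robust_location(values, loss, n_grid=2001, n_iter=60):
    """Minimize sum(loss(v - m)) over m by grid search plus golden section."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo

    def objective(m):
        return sum(loss(v - m) for v in values)

    # Coarse global search guards against local minima for p < 1.
    step = (hi - lo) / (n_grid - 1)
    best = min((lo + k * step for k in range(n_grid)), key=objective)

    # Local golden-section refinement around the best grid point.
    left, right = best - step, best + step
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(n_iter):
        c = right - inv_phi * (right - left)
        d = left + inv_phi * (right - left)
        if objective(c) < objective(d):
            right = d
        else:
            left = c
    return 0.5 * (left + right)


def robust_mgm(a1, b1, a2, b2, loss):
    """Two-step linking: sigma from log discriminations, then mu."""
    log_ratio = [math.log(y) - math.log(x) for x, y in zip(a1, a2)]
    sigma = math.exp(robust_location(log_ratio, loss))
    mu = robust_location([x - sigma * y for x, y in zip(b1, b2)], loss)
    return mu, sigma


# Population-level illustration with the 10 base items of Section 4.1.
a1 = [1.499, 1.129, 1.647, 1.014, 1.567, 0.800, 0.974, 0.913, 0.739, 0.717]
b1 = [-0.314, 0.411, -1.097, -0.542, -1.854, -0.403, -0.895, 0.715, 0.841, 0.139]
MU, SIGMA, D = 0.3, 1.2, 1.0  # hypothetical values
a2 = [SIGMA * a for a in a1]
b2 = [(b + (D if i < 3 else 0.0) - MU) / SIGMA for i, b in enumerate(b1)]

mu_l2, sigma_l2 = robust_mgm(a1, b1, a2, b2, lambda x: x * x)
mu_rob, sigma_rob = robust_mgm(a1, b1, a2, b2, lambda x: lp_loss(x, 0.5))
```

With 30% of the items affected, the squared loss shifts the estimated mean by the DIF proportion times $d$, whereas the $p = 0.5$ estimate stays close to the assumed group mean.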
3. Estimation of Standard Errors and Confidence Intervals
In this section, alternative methods to compute standard errors and CIs are presented. The first two methods, the delta method (Section 3.1) and weighted least squares (Section 3.2), rely on asymptotic theory, whereas the bootstrap methods in Section 3.3 rely only on resampling to compute a CI.
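For notational convenience, we define the vector $\boldsymbol{\delta} = (\mu, \sigma)$ and its estimate $\hat{\boldsymbol{\delta}} = (\hat{\mu}, \hat{\sigma})$, which jointly represent the linking parameter estimates. The vector $\hat{\boldsymbol{\gamma}}$ includes all estimated item parameters, and its corresponding estimated variance matrix is denoted by $\hat{\boldsymbol{V}}$. The variance matrix is computed as the inverse of the observed information matrix, which is based on the second-order derivatives of the log-likelihood function with respect to the item parameters $\boldsymbol{\gamma}$.
3.1. Delta Method (DM)
The standard error of $\hat{\boldsymbol{\delta}}$ and corresponding CIs for its components are computed using the delta method (DM; see [54,55,56,57,58,59,60,61,62,63,64,65,66]). In robust MGM linking, the estimate $\hat{\boldsymbol{\delta}}$ is obtained in a two-step estimation using the estimating Equations (8) and (11). Alternatively, these can be combined into a one-step solution of the estimating equation:
$$\boldsymbol{G}(\hat{\boldsymbol{\delta}}, \hat{\boldsymbol{\gamma}}) = \boldsymbol{0} . \qquad (13)$$
The DM approach applies a Taylor expansion of $\boldsymbol{G}(\hat{\boldsymbol{\delta}}, \hat{\boldsymbol{\gamma}})$ around the population values $(\boldsymbol{\delta}_0, \boldsymbol{\gamma}_0)$, yielding
$$\boldsymbol{G}(\hat{\boldsymbol{\delta}}, \hat{\boldsymbol{\gamma}}) \approx \boldsymbol{G}(\boldsymbol{\delta}_0, \boldsymbol{\gamma}_0) + \boldsymbol{G}_{\boldsymbol{\delta}} (\hat{\boldsymbol{\delta}} - \boldsymbol{\delta}_0) + \boldsymbol{G}_{\boldsymbol{\gamma}} (\hat{\boldsymbol{\gamma}} - \boldsymbol{\gamma}_0) , \qquad (14)$$
where $\boldsymbol{G}_{\boldsymbol{\delta}}$ and $\boldsymbol{G}_{\boldsymbol{\gamma}}$ denote the Jacobians of $\boldsymbol{G}$ with respect to $\boldsymbol{\delta}$ and $\boldsymbol{\gamma}$, respectively. Using (13) and the identity $\boldsymbol{G}(\boldsymbol{\delta}_0, \boldsymbol{\gamma}_0) = \boldsymbol{0}$, expression (14) simplifies to
$$\hat{\boldsymbol{\delta}} - \boldsymbol{\delta}_0 \approx - \boldsymbol{G}_{\boldsymbol{\delta}}^{-1} \boldsymbol{G}_{\boldsymbol{\gamma}} (\hat{\boldsymbol{\gamma}} - \boldsymbol{\gamma}_0) . \qquad (15)$$
Letting $\boldsymbol{A} = - \boldsymbol{G}_{\boldsymbol{\delta}}^{-1} \boldsymbol{G}_{\boldsymbol{\gamma}}$, the variance of $\hat{\boldsymbol{\delta}}$ becomes
$$\mathrm{Var}(\hat{\boldsymbol{\delta}}) = \boldsymbol{A} \boldsymbol{V} \boldsymbol{A}^\top . \qquad (16)$$
An estimate of $\boldsymbol{A}$ is given by
$$\hat{\boldsymbol{A}} = - \boldsymbol{G}_{\boldsymbol{\delta}}(\hat{\boldsymbol{\delta}}, \hat{\boldsymbol{\gamma}})^{-1} \boldsymbol{G}_{\boldsymbol{\gamma}}(\hat{\boldsymbol{\delta}}, \hat{\boldsymbol{\gamma}}) , \qquad (17)$$
resulting in the estimated variance matrix
$$\hat{\boldsymbol{V}}_{\boldsymbol{\delta}} = \hat{\boldsymbol{A}} \hat{\boldsymbol{V}} \hat{\boldsymbol{A}}^\top . \qquad (18)$$
Standard errors for $\hat{\mu}$ and $\hat{\sigma}$ are obtained as the square roots of the diagonal elements in $\hat{\boldsymbol{V}}_{\boldsymbol{\delta}}$. Confidence intervals can then be constructed assuming normality and using these standard errors.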
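For the squared loss, where the linking estimates have closed forms, the delta method can be sketched numerically in Python. By the implicit function theorem, the matrix $-\boldsymbol{G}_{\boldsymbol{\delta}}^{-1} \boldsymbol{G}_{\boldsymbol{\gamma}}$ equals the Jacobian of the linking estimates with respect to the item parameters, which is approximated here by central differences. The sketch assumes a diagonal variance matrix (independent item parameter estimates) with a hypothetical sampling variance of 0.01 for every item parameter; the function names and the three-item example are illustrative.

```python
import math


def mgm_link(gamma, n_items):
    """Closed-form MGM linking for the squared loss.

    gamma stacks [a_1g1.., b_1g1.., a_1g2.., b_1g2..] for groups 1 and 2.
    Returns [mu, sigma].
    """
    a1 = gamma[0:n_items]
    b1 = gamma[n_items:2 * n_items]
    a2 = gamma[2 * n_items:3 * n_items]
    b2 = gamma[3 * n_items:4 * n_items]
    sigma = math.exp(
        sum(math.log(y) - math.log(x) for x, y in zip(a1, a2)) / n_items
    )
    mu = sum(x - sigma * y for x, y in zip(b1, b2)) / n_items
    return [mu, sigma]


def jacobian(func, gamma, h=1e-6):
    """Central-difference Jacobian; rows index outputs, columns inputs."""
    n_out = len(func(gamma))
    rows = [[0.0] * len(gamma) for _ in range(n_out)]
    for j in range(len(gamma)):
        up, dn = list(gamma), list(gamma)
        up[j] += h
        dn[j] -= h
        f_up, f_dn = func(up), func(dn)
        for r in range(n_out):
            rows[r][j] = (f_up[r] - f_dn[r]) / (2.0 * h)
    return rows


def delta_method_se(func, gamma, var_diag):
    """Square roots of diag(A V A^T) for a diagonal variance matrix V."""
    A = jacobian(func, gamma)
    return [math.sqrt(sum(a * a * v for a, v in zip(row, var_diag))) for row in A]


# Hypothetical three-item example with mu = 0.3 and sigma = 1.2.
a1 = [1.0, 1.2, 0.8]
b1 = [-0.3, 0.4, 0.1]
a2 = [1.2 * a for a in a1]
b2 = [(b - 0.3) / 1.2 for b in b1]
gamma_hat = a1 + b1 + a2 + b2
var_diag = [0.01] * len(gamma_hat)

se_mu, se_sigma = delta_method_se(lambda g: mgm_link(g, 3), gamma_hat, var_diag)
```

For robust loss functions the same computation applies once the linking function is replaced by the numerical minimizer, although the Jacobian then depends on the smoothness of the approximated loss.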
3.2. Weighted Least Squares (WLS)
An alternative to the DM method for standard error estimation is derived from robust regression methodology [67,68]. In robust regression, estimation is typically carried out via iterative weighted least squares, where parameter estimates are obtained at each iteration by minimizing a weighted least squares criterion. After computing the regression parameters, the weights are updated based on the chosen robust loss function . Standard errors in this framework are computed using the ordinary WLS formula, treating the final weights as fixed. This approach is adapted here to estimate standard errors in robust MGM linking.
The estimating equation in (13), under the WLS approach with fixed weights, takes the form
The DM, as presented in Section 3.1, is then applied to the modified estimating Equation (19) to derive an alternative variance matrix for and to construct CIs for and .
It is noted that the DM and WLS methods yield identical standard errors when the squared loss function is used, because it implies and . However, for the loss () and the loss, the resulting standard error estimates will differ.
3.3. Bootstrap Methods
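This section applies parametric bootstrap methods to compute confidence intervals for the linking parameters $\boldsymbol{\delta}$. The estimate $\hat{\boldsymbol{\delta}}$ satisfies the estimating equation $\boldsymbol{G}(\hat{\boldsymbol{\delta}}, \hat{\boldsymbol{\gamma}}) = \boldsymbol{0}$. The item parameters $\hat{\boldsymbol{\gamma}}$ possess a variance matrix $\hat{\boldsymbol{V}}$, estimated separately for each group. The parametric bootstrap resamples item parameters based on $\hat{\boldsymbol{V}}$, generating a distribution of the linking parameter estimates (see also [69]).
To obtain bootstrap samples $\hat{\boldsymbol{\gamma}}_b$ ($b = 1, \ldots, B$), draw from a multivariate normal distribution with mean vector $\hat{\boldsymbol{\gamma}}$ and variance matrix $\hat{\boldsymbol{V}}$. For each $b$th bootstrap sample, the estimate $\hat{\boldsymbol{\delta}}_b$ satisfies $\boldsymbol{G}(\hat{\boldsymbol{\delta}}_b, \hat{\boldsymbol{\gamma}}_b) = \boldsymbol{0}$. Standard bootstrap techniques can then be applied to the resulting estimates $\hat{\boldsymbol{\delta}}_b$ for $b = 1, \ldots, B$ (see [70]).
Let $\hat{\delta}$ denote an entry of $\hat{\boldsymbol{\delta}}$, representing either $\hat{\mu}$ or $\hat{\sigma}$, and let $\hat{\delta}_b$ denote the corresponding parameter estimate in the $b$th bootstrap sample. Define $F_B$ as the empirical distribution function of $\hat{\delta}_b$ ($b = 1, \ldots, B$), obtained from the parametric bootstrap using $B$ samples. The associated inverse distribution function (i.e., the quantile function) is denoted by $F_B^{-1}$.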
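The resampling step can be sketched in Python as follows. For brevity, the sketch assumes independent item parameter estimates with a common hypothetical standard error (i.e., a diagonal variance matrix rather than the full matrix from the observed information) and uses the closed-form squared-loss linking; the function names, the group-2 values, and the number of bootstrap samples are illustrative assumptions.

```python
import math
import random
from statistics import mean


def mgm_link(a1, b1, a2, b2):
    """Closed-form MGM linking for the squared loss; returns (mu, sigma)."""
    sigma = math.exp(mean(math.log(y) - math.log(x) for x, y in zip(a1, a2)))
    mu = mean(x - sigma * y for x, y in zip(b1, b2))
    return mu, sigma


def parametric_bootstrap(a1, b1, a2, b2, se, n_boot=500, seed=1):
    """Redraw item parameters around their estimates and relink each draw."""
    rng = random.Random(seed)

    def perturb(values):
        # Independent normal draws: a simplifying assumption of this sketch.
        return [rng.gauss(v, se) for v in values]

    return [
        mgm_link(perturb(a1), perturb(b1), perturb(a2), perturb(b2))
        for _ in range(n_boot)
    ]


# Base item parameters from Section 4.1; mu = 0.3 and sigma = 1.2 are hypothetical.
a1 = [1.499, 1.129, 1.647, 1.014, 1.567, 0.800, 0.974, 0.913, 0.739, 0.717]
b1 = [-0.314, 0.411, -1.097, -0.542, -1.854, -0.403, -0.895, 0.715, 0.841, 0.139]
a2 = [1.2 * a for a in a1]
b2 = [(b - 0.3) / 1.2 for b in b1]

boot = parametric_bootstrap(a1, b1, a2, b2, se=0.05)
boot_mu = [m for m, _ in boot]
boot_sigma = [s for _, s in boot]
```

The lists `boot_mu` and `boot_sigma` play the role of the bootstrap estimates from which the empirical distribution function and its quantiles are computed.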
The following outlines alternative CI estimation methods at confidence level $1 - \alpha$ (e.g., $1 - \alpha = 0.95$), based on bootstrap procedures as described in [70,71].
3.3.1. Normal Distribution Bootstrap CI (BNO)
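The normal distribution bootstrap (BNO) CI assumes a normal distribution for the parameter estimate $\hat{\delta}$. Let $\Phi^{-1}$ denote the inverse distribution function (i.e., the quantile function) of the standard normal distribution. The quantiles $\Phi^{-1}(h)$ are denoted briefly as $z_h$. The BNO confidence interval is given by
$$\left[ \hat{\delta} - z_{1 - \alpha/2} \, \mathrm{sd}_{\mathrm{boot}} , \; \hat{\delta} + z_{1 - \alpha/2} \, \mathrm{sd}_{\mathrm{boot}} \right] , \qquad (21)$$
where $\mathrm{sd}_{\mathrm{boot}}$ denotes the empirical standard deviation of the bootstrap estimates $\hat{\delta}_b$, defined by
$$\mathrm{sd}_{\mathrm{boot}} = \sqrt{ \frac{1}{B - 1} \sum_{b=1}^{B} \left( \hat{\delta}_b - \bar{\delta} \right)^2 } \quad \text{with} \quad \bar{\delta} = \frac{1}{B} \sum_{b=1}^{B} \hat{\delta}_b . \qquad (22)$$
For a confidence level of $1 - \alpha = 0.95$, the quantile in (21) is $z_{0.975} = 1.96$.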
3.3.2. Percentile Bootstrap CI (BPE)
The percentile bootstrap (BPE) CI is based on the quantiles (i.e., percentiles) of the empirical distribution of the bootstrap parameter estimates $\hat{\delta}_b$. The CI is defined as
$$\left[ F_B^{-1}(\alpha/2) , \; F_B^{-1}(1 - \alpha/2) \right] . \qquad (23)$$
An advantage of the BPE method is its applicability to cases where the distribution of $\hat{\delta}$ is asymmetric.
3.3.3. Basic Bootstrap CI (BBB)
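The basic bootstrap (BBB) CI is constructed by forming a confidence interval for the deviations $\hat{\delta}_b - \hat{\delta}$. Following this approach, the CI is given by (see [69])
$$\left[ 2 \hat{\delta} - F_B^{-1}(1 - \alpha/2) , \; 2 \hat{\delta} - F_B^{-1}(\alpha/2) \right] . \qquad (24)$$
The rationale for using (24) is that the interval
$$\left[ F_B^{-1}(\alpha/2) - \hat{\delta} , \; F_B^{-1}(1 - \alpha/2) - \hat{\delta} \right] \qquad (25)$$
forms a CI for the deviations between the bootstrap estimate $\hat{\delta}_b$ and the original estimate $\hat{\delta}$.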
3.3.4. Bias-Corrected Bootstrap CI (BBC)
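Finally, the bias-corrected bootstrap (BBC) CI accounts for potential bias in the bootstrap samples and accommodates asymmetric distributions of the linking parameter estimate $\hat{\delta}$. It is defined as
$$\left[ F_B^{-1}\left( \Phi(2 z_0 + z_{\alpha/2}) \right) , \; F_B^{-1}\left( \Phi(2 z_0 + z_{1 - \alpha/2}) \right) \right] \quad \text{with} \quad z_0 = \Phi^{-1}\left( \frac{1}{B} \sum_{b=1}^{B} 1(\hat{\delta}_b < \hat{\delta}) \right) , \qquad (26)$$
where $\Phi$ is the standard normal distribution function and $1(\cdot)$ denotes the indicator function. If half of the bootstrap estimates fall below the original estimate (i.e., the original estimate equals the median of the bootstrap estimates), then $z_0 = 0$, and the BBC CI in (26) coincides with the BPE CI from (23).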
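The four bootstrap CIs can be computed from one set of bootstrap estimates, as the following Python sketch illustrates. It assumes the standard textbook forms of the normal, percentile, basic, and bias-corrected intervals, a linear-interpolation quantile rule, and a default level of $\alpha = 0.05$; the function names and the artificial bootstrap sample are illustrative.

```python
from statistics import NormalDist, stdev


def quantile(sorted_x, h):
    """Empirical quantile with linear interpolation; h is clipped to [0, 1]."""
    h = min(max(h, 0.0), 1.0)
    pos = h * (len(sorted_x) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (pos - lo) * (sorted_x[hi] - sorted_x[lo])


def bootstrap_cis(est, boot, alpha=0.05):
    """Return the BNO, BPE, BBB, and BBC intervals for one parameter."""
    nd = NormalDist()
    xs = sorted(boot)
    n = len(xs)
    z_lo, z_hi = nd.inv_cdf(alpha / 2.0), nd.inv_cdf(1.0 - alpha / 2.0)
    sd_boot = stdev(boot)  # uses the divisor B - 1
    q_lo, q_hi = quantile(xs, alpha / 2.0), quantile(xs, 1.0 - alpha / 2.0)

    # Share of bootstrap estimates below the original estimate,
    # kept strictly inside (0, 1) so that inv_cdf stays finite.
    share = sum(b < est for b in boot) / n
    share = min(max(share, 1.0 / (n + 1)), n / (n + 1.0))
    z0 = nd.inv_cdf(share)

    return {
        "BNO": (est + z_lo * sd_boot, est + z_hi * sd_boot),
        "BPE": (q_lo, q_hi),
        "BBB": (2.0 * est - q_hi, 2.0 * est - q_lo),
        "BBC": (
            quantile(xs, nd.cdf(2.0 * z0 + z_lo)),
            quantile(xs, nd.cdf(2.0 * z0 + z_hi)),
        ),
    }


# Artificial, symmetric bootstrap sample on an even grid from -1 to 1.
boot_demo = [i / 100.0 for i in range(-100, 101)]
ci_centered = bootstrap_cis(0.0, boot_demo)
ci_shifted = bootstrap_cis(0.1, boot_demo)
```

When the original estimate sits at the center of the bootstrap distribution, BPE and BBB agree and BBC is nearly identical to BPE. When the estimate is shifted relative to the bootstrap distribution, BBB reflects the interval around the estimate and BBC moves its quantile levels, which is the mechanism behind their different coverage behavior.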
4. Simulation Study
In this Simulation Study, the adequacy of CI estimates is evaluated for the six methods described in Section 3, comparing their performance in robust MGM linking.
4.1. Method
The 2PL model for two groups served as the data-generating model. In the first group, the latent variable followed a normal distribution with fixed mean 0 and SD 1. In the second group, also followed a normal distribution, but with a fixed mean of and SD of across all simulation conditions.
The Simulation Study used items. Group-specific item parameters and for each item and for groups were chosen based on fixed base item parameters. The item parameters were constructed using 10 base items that were duplicated to obtain a test involving 20 items. For the 10 base items, the base item discriminations were chosen as 1.499, 1.129, 1.647, 1.014, 1.567, 0.800, 0.974, 0.913, 0.739, 0.717. This yielded an average item discrimination of with an . The base item difficulties of the 10 base items were chosen as −0.314, 0.411, −1.097, −0.542, −1.854, −0.403, −0.895, 0.715, 0.841, and 0.139, yielding a mean with an . The complete set of item parameters is available at https://osf.io/tjngx (accessed on 4 May 2025).
For the first group, item discriminations and item difficulties matched the base item parameters exactly. In contrast, the second group included a uniform DIF effect d that shifted the base item difficulty for a subset of items with d, while no DIF was introduced for the remaining items. In the item condition with duplicated item parameters, the DIF items were 1, 2, 3, 11, 12, and 13, corresponding to 30% of the items. The DIF effect was varied as and , representing no DIF and strong DIF conditions, respectively. No DIF was imposed on item discriminations. The simulated uniform DIF can be characterized as fixed and sparse DIF.
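The item design described above can be reproduced with a short Python sketch. The base item parameters and the DIF item positions are taken from the text; the function name, the assumption that the DIF effect is added to the group-2 item difficulty, and the value $d = 1$ used in the example are illustrative.

```python
from statistics import mean

BASE_A = [1.499, 1.129, 1.647, 1.014, 1.567, 0.800, 0.974, 0.913, 0.739, 0.717]
BASE_B = [-0.314, 0.411, -1.097, -0.542, -1.854, -0.403, -0.895, 0.715, 0.841, 0.139]
DIF_ITEMS = {1, 2, 3, 11, 12, 13}  # 1-based item positions


def item_parameters(d):
    """Return (a, b_group1, b_group2) for the 20-item test.

    The 10 base items are duplicated; DIF items in group 2 are shifted by d
    (assumed to be additive on the difficulty scale).
    """
    a = BASE_A * 2
    b1 = BASE_B * 2
    b2 = [b + d if (i + 1) in DIF_ITEMS else b for i, b in enumerate(b1)]
    return a, b1, b2


a, b1, b2 = item_parameters(d=1.0)  # d = 1.0 is a hypothetical effect size
n_items = len(a)
dif_share = len(DIF_ITEMS) / n_items
mean_a, mean_b = mean(BASE_A), mean(BASE_B)
```

Because only 6 of the 20 items are shifted and all by the same amount, the generated DIF is both sparse and fixed, which is the setting in which robust loss functions are expected to help.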
Group sample sizes of , 1000, 2000, 4000, and 10,000 were selected to mimic sample sizes in medium-to-large-scale testing applications of the 2PL model [72] and to allow studying the asymptotic behavior of the CI estimation methods.
In each of the 5 (sample size N) × 2 (DIF effect size d) simulation conditions, 3000 replications were conducted. Robust MGM linking was applied for , 1, 0.5, and 0. The tuning parameter was set to 0.001 for and , and to for . For all four MGM linking methods, CIs at confidence level were computed using the six methods DM, WLS, BNO, BPE, BBB, and BBC (see Section 3). A total of bootstrap samples were used in the parametric bootstrap approaches.
Bias and root mean square error (RMSE) were evaluated for the estimated mean and SD . In addition, coverage rates were assessed for all four MGM linking methods crossed with the six CI estimation approaches. For each MGM method, a pseudo-true parameter was defined as the average parameter estimate across replications within a given simulation condition to isolate coverage performance from parameter bias. The coverage rate was defined as the proportion of replications in which the CI included the pseudo-true parameter. Coverage rates between 91% and 98% were considered acceptable [73]. A coverage rate below 91% is referred to as undercoverage, whereas a rate above 98% is considered overcoverage.
Moreover, the power rates for the statistical tests of $H_0\colon \mu = 0$ and $H_0\colon \sigma = 1$ were assessed. These tests evaluated whether significant differences existed between the two groups in terms of the mean $\mu$ and the SD $\sigma$. The null hypothesis was rejected if the test value fell outside the corresponding CI. The power rate was estimated as the proportion of replications in which the null hypothesis was rejected.
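The evaluation criteria can be written compactly in Python. The sketch below computes the coverage rate relative to the pseudo-true parameter, classifies it with the 91% and 98% thresholds, and estimates the power rate as the share of CIs that exclude the null value. The function names and the simulated replications (normally distributed estimates with a known standard error of 0.05 around a mean of 0.3) are hypothetical and serve only as a check of the logic.

```python
import random
from statistics import mean


def coverage_rate(cis, target):
    """Proportion of intervals (lower, upper) that contain the target."""
    return sum(lo <= target <= hi for lo, hi in cis) / len(cis)


def rejection_rate(cis, null_value):
    """Proportion of intervals that exclude the null value."""
    return sum(not (lo <= null_value <= hi) for lo, hi in cis) / len(cis)


def classify_coverage(rate, lower=0.91, upper=0.98):
    """Label a coverage rate using the thresholds of the simulation study."""
    if rate < lower:
        return "undercoverage"
    if rate > upper:
        return "overcoverage"
    return "acceptable"


# Hypothetical replications: estimates of mu with a known standard error.
rng = random.Random(1)
estimates = [rng.gauss(0.3, 0.05) for _ in range(3000)]
cis = [(e - 1.96 * 0.05, e + 1.96 * 0.05) for e in estimates]

pseudo_true = mean(estimates)  # average estimate across replications
cov = coverage_rate(cis, pseudo_true)
label = classify_coverage(cov)
power = rejection_rate(cis, null_value=0.0)
```

Using the pseudo-true parameter as the target separates CI calibration from estimator bias, so a biased but well-calibrated method is not penalized twice.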
All analyses for this simulation study were carried out using the statistical software R (Version 4.4.1; [74]). The 2PL model was fitted using the sirt::xxirt() function from the R package sirt (Version 4.2-114; [75]). Custom functions were developed to implement robust MGM linking. Optimization for MGM linking was performed using the stats::optim() function. Replication materials for this simulation study are available at https://osf.io/tjngx (accessed on 4 May 2025).
4.2. Results
Table 1 presents the bias, the SD, and the RMSE of the estimated mean and SD as a function of the DIF effect size d and sample size N. It can be seen that all four MGM specifications involving different powers p in the loss function produced unbiased estimates for and in the absence of DIF. In the presence of DIF (i.e., for ), in line with findings from the literature, the square loss function with had the largest bias, followed by and . Note that the bias for can approximately be determined as , where refers to the proportion of DIF items. Also, note that the bias for robust MGM linking for and vanished with increasing sample size, although the convergence to zero was particularly slow for . However, the best performance regarding bias had the loss function using the power , which also performed well in smaller samples. As DIF was not present in item discriminations, the SD was unbiased for all methods in all simulation conditions.
Table 1.
Simulation Study: Bias, standard deviation (SD), and root mean square error (RMSE) of the estimated mean $\hat{\mu}$ and SD $\hat{\sigma}$ as a function of the DIF effect size d and sample size N.
As expected, the SD decreased with increasing sample size for all powers p for both and . Moreover, the SD increased as the value of p decreased.
In the no DIF condition, the RMSE was smallest for and increased with decreasing power p, reaching its highest value for the loss function (). Although the loss produced unbiased estimates, this indicates that it resulted in the largest variance. In contrast, under the DIF condition with , the RMSE for was smallest for , as the minimal bias outweighed the increase in variance. Hence, under the simulated conditions, the robust loss function with emerged as the clear frontrunner in terms of both bias and RMSE.
Table 2 presents the coverage rates of the estimated mean $\hat{\mu}$ and SD $\hat{\sigma}$ as a function of the DIF effect size d and sample size N. All six CI estimation methods performed adequately for $p = 2$, but substantial differences emerged for the robust loss functions with $p = 1$, 0.5, and 0. Achieving adequate coverage was generally more challenging in the presence of DIF than in the condition without DIF. The DM method exhibited pronounced overcoverage for , a pattern that also appeared for BNO and BPE at and . For , the WLS method showed undercoverage in some conditions, whereas it performed acceptably for and . The coverage rates of WLS improved with increasing sample size. Across all simulation settings, the bootstrap methods BBB and BBC delivered the most consistent performance, although undercoverage was observed in a few cells with .
Table 2.
Simulation Study: Coverage rates of the estimated mean $\hat{\mu}$ and SD $\hat{\sigma}$ as a function of the DIF effect size d and sample size N.
Table 3 presents the power rates for detecting significant group differences in $\mu$ (i.e., $H_0\colon \mu = 0$) and $\sigma$ (i.e., $H_0\colon \sigma = 1$). As expected, the power rates increased with larger sample sizes. However, the DM method, which exhibited substantial overcoverage, showed markedly reduced power. This underscores the importance of using the better-performing bootstrap methods, BBB and BBC, which combine adequate coverage with sufficiently high power to detect group differences.
Table 3.
Simulation Study: Power rates for the statistical tests of the null hypotheses $H_0\colon \mu = 0$ and $H_0\colon \sigma = 1$ as a function of the DIF effect size d and sample size N.
5. Empirical Examples
This section illustrates the differences between the CI estimation methods for robust MGM linking with powers $p = 2$, 1, 0.5, and 0, based on two empirical examples that use publicly available datasets from R packages. The datasets involve two groups and contain dichotomous items without missing item responses.
5.1. Dataset dataDIF
The first example used the dataDIF dataset from the R package equateIRT (Version 1.0.0; [22,76]). The full dataset includes 20 dichotomous items and three groups of 1000 persons each. The dataset was originally simulated and applied in a research article devoted to the assessment of fixed DIF [77]. For illustration, only the first and the second groups were used for robust MGM linking in this example.
Table 4 presents the point and CI estimates for the dataDIF dataset. The non-robust MGM linking estimate with differed slightly from the robust MGM approaches using , , or 0. The CI estimates obtained by the DM method differed noticeably from those of the other CI estimation methods for , and the discrepancies became more pronounced for . In particular, significant group differences (i.e., statistically differed from 0) were detected by all CI methods except DM. For , the differences between the MGM methods were much smaller. However, as with , for , the SD in the second group was significantly larger than in the first group (i.e., statistically differed from 1) according to all the methods except DM. These results are consistent with the findings from the Simulation Study, indicating that the DM method has substantially reduced power for detecting group differences.
Table 4.
Empirical example, dataset dataDIF: Point estimates and confidence interval estimates for the estimated mean $\hat{\mu}$ and SD $\hat{\sigma}$.
5.2. Dataset MathExam14W
The second example used the MathExam14W dataset from the R package psychotools (Version 0.7-4; [78]). This dataset includes responses from 729 students to 13 dichotomous items from a written exam in introductory mathematics, along with several covariates. The grouping variable gender was used for analysis. The first group consisted of 403 male students, and the second group consisted of 326 female students.
Table 5 reports the point and CI estimates for the MathExam14W dataset. As in the first example, the robust MGM methods differed markedly from the non-robust MGM method for . The group differences in were statistically significant based on the BBC estimation method. For , the DM method yielded an excessively wide CI for . Notably, the estimates also showed slight discrepancies between the robust and non-robust MGM methods. Again, for , the CI based on the DM method was extremely wide, resulting in implausible values.
Table 5.
Empirical example, dataset MathExam14W: Point estimates and confidence interval estimates for the estimated mean $\hat{\mu}$ and SD $\hat{\sigma}$.
6. Discussion
This article compared the performance of alternative CI estimation methods across various robust MGM linking approaches. The robust $L_0$ loss function was preferred in practical applications involving fixed and sparse uniform DIF effects, as it outperformed the $L_p$ loss functions with $p = 2$, $p = 1$, and $p = 0.5$ in terms of bias and RMSE. CI estimates based on the delta method (DM), which relies on differentiable approximations of the loss functions, exhibited highly inflated coverage rates. A modified CI estimation approach, formulated by recasting the robust minimization problem as a weighted least squares (WLS) problem, yielded acceptable coverage rates for all $p$ values except for $p = 0$. In this case, the basic bootstrap (BBB) and bias-corrected bootstrap (BBC) methods performed satisfactorily and clearly outperformed the commonly used bootstrap approaches based on the normal approximation (BNO) and empirical percentiles (BPE).
The failure to obtain acceptable coverage rates for DM, as well as for the BNO and BPE bootstrap methods, aligns with earlier findings reported by the authors that also documented overcoverage in robust linking methods [33]. These results suggest that assessing linking errors is more challenging for robust methods compared to non-robust linking approaches. It could be speculated that the non-differentiability of the loss function presents specific challenges for the DM and WLS methods, as well as for the bootstrap methods that do not include a bias correction term. This observation may align with the presence of finite-sample bias in the robust MGM linking parameter estimates.
The parametric bootstrap method resimulates item parameters and repeatedly applies the robust MGM linking method to compute CIs. Although this approach is clearly more computationally demanding than the DM or WLS methods, it can generally be completed within minutes in typical empirical applications. Nevertheless, the substantially increased computation time of the bootstrap may become a concern in large-scale empirical studies or extensive simulation designs.
In the Simulation Study, the distribution parameters $\mu$ and $\sigma$ were held constant across simulation conditions. This is not considered a limitation, as the general patterns of bias, the RMSE, and the coverage rates are not expected to change with alternative choices of $\mu$ and $\sigma$.
The simulation study revealed that DM resulted in overcoverage, while WLS led to undercoverage for the loss function with . Preliminary evidence from additional simulation studies indicated that a weighted standard error combining DM and WLS—assigning slightly more weight to WLS than to DM—may yield improved coverage rates. In more detail, let and denote the matrices in (17) corresponding to the DM and WLS methods, respectively, which are used to compute the variance matrix . The proposed approach, which combines DM and WLS, constructs a weighted matrix with , and uses it to compute the variance matrix . This alternative CI estimation method based on could provide a viable option that avoids the need for the computationally intensive bootstrap procedure.
As noted by an anonymous reviewer, future research could compare different CI methods using both theoretical and simulated coverage probabilities, as presented in [79]. The resulting guidance on CI selection for small-to-moderate sample sizes would be of particular interest.
This study focused on standard error and confidence interval estimation for linking parameter estimates in robust MGM linking, accounting for the sampling variability of persons. Future research could additionally examine linking errors [80] in the robust MGM method that reflect uncertainty in linking parameter estimates arising from the randomness of DIF effects.
Future research could also examine CI assessment in the context of structural equation models (SEM; [81,82]) estimated using $L_p$ or $L_0$ loss functions [51,83,84]. In such applications, the parametric bootstrap approach may be applied to sufficient statistics involving estimated mean vectors and covariance matrices for SEM.
7. Conclusions
The main findings regarding CI estimates in robust MGM linking can be summarized as follows:
- The DM method exhibited highly inflated coverage rates for the linking parameter estimates, accompanied by substantially reduced power rates.
- The WLS method performed well with loss functions for or , but showed notable undercoverage in small-to-moderate sample sizes.
- Among the bootstrap methods, BBB and BBC—which include a bias correction term—achieved desirable coverage rates, unlike the BNO and BPE methods, which lack such correction.
Funding
This research received no external funding.
Data Availability Statement
Replication material for the Simulation Study in Section 4 can be found at https://osf.io/tjngx (accessed on 4 May 2025).
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| 2PL | two-parameter logistic |
| BBB | basic bootstrap |
| BBC | bias-corrected bootstrap |
| BNO | bootstrap based on normal distribution |
| BPE | percentile bootstrap |
| CI | confidence interval |
| DIF | differential item functioning |
| DM | delta method |
| IRF | item response function |
| IRT | item response theory |
| MGM | mean-geometric mean |
| MML | marginal maximum likelihood |
| RMSE | root mean square error |
| SD | standard deviation |
| SEM | structural equation model |
| WLS | weighted least squares |
References
- Bock, R.D.; Gibbons, R.D. Item Response Theory; Wiley: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
- Cai, L.; Moustaki, I. Estimation methods in latent variable models for categorical outcome variables. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 253–277. [Google Scholar] [CrossRef]
- Chen, Y.; Li, X.; Liu, J.; Ying, Z. Item response theory – A statistical framework for educational and psychological measurement. Stat. Sci. 2025, 40, 167–194. [Google Scholar] [CrossRef]
- Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154. [Google Scholar]
- Kolen, M.J.; Brennan, R.L. Test Equating, Scaling, and Linking; Springer: New York, NY, USA, 2014. [Google Scholar] [CrossRef]
- González, J.; Wiberg, M. Applying Test Equating Methods. Using R; Springer: New York, NY, USA, 2017. [Google Scholar] [CrossRef]
- van der Linden, W.J. Unidimensional logistic response models. In Handbook of Item Response Theory, Volume 1: Models; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 11–30. [Google Scholar] [CrossRef]
- Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479. [Google Scholar]
- Bock, R.D.; Aitkin, M. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika 1981, 46, 443–459. [Google Scholar] [CrossRef]
- Glas, C.A.W. Maximum-likelihood estimation. In Handbook of Item Response Theory, Volume 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 197–216. [Google Scholar] [CrossRef]
- Lee, W.C.; Lee, G. IRT linking and equating. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 639–673. [Google Scholar] [CrossRef]
- Sansivieri, V.; Wiberg, M.; Matteucci, M. A review of test equating methods with a special focus on IRT-based approaches. Statistica 2017, 77, 329–352. [Google Scholar] [CrossRef]
- Holland, P.W.; Wainer, H. (Eds.) Differential Item Functioning: Theory and Practice; Lawrence Erlbaum: Hillsdale, NJ, USA, 1993. [Google Scholar] [CrossRef]
- Mellenbergh, G.J. Item bias and item response theory. Int. J. Educ. Res. 1989, 13, 127–143. [Google Scholar] [CrossRef]
- Millsap, R.E. Statistical Approaches to Measurement Invariance; Routledge: New York, NY, USA, 2011. [Google Scholar] [CrossRef]
- Penfield, R.D.; Camilli, G. Differential item functioning and item bias. In Handbook of Statistics, Volume 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; 2007; pp. 125–167. [Google Scholar] [CrossRef]
- Wells, C.S. Assessing Measurement Invariance for Applied Research; Cambridge University Press: Cambridge, UK, 2021. [Google Scholar] [CrossRef]
- Robitzsch, A. A comparison of linking methods for two groups for the two-parameter logistic item response model in the presence and absence of random differential item functioning. Foundations 2021, 1, 116–144. [Google Scholar] [CrossRef]
- Mislevy, R.J.; Bock, R.D. BILOG 3. Item Analysis and Test Scoring with Binary Logistic Models; Software Manual; Scientific Software International: Chicago, IL, USA, 1990.
- Haberman, S.J. Linking Parameter Estimates Derived from an Item Response Model Through Separate Calibrations; Research Report No. RR-09-40; Educational Testing Service: Princeton, NJ, USA, 2009.
- Battauz, M. Multiple equating of separate IRT calibrations. Psychometrika 2017, 82, 610–636.
- Battauz, M. equateIRT: An R package for IRT test equating. J. Stat. Softw. 2015, 68, 1–22.
- van der Linden, W.J.; Barrett, M.D. Linking item response model parameters. Psychometrika 2016, 81, 650–673.
- Robitzsch, A. Comparing robust linking and regularized estimation for linking two groups in the 1PL and 2PL models in the presence of sparse uniform differential item functioning. Stats 2023, 6, 192–208.
- Robitzsch, A. Extensions to mean–geometric mean linking. Mathematics 2025, 13, 35.
- Halpin, P.F. Differential item functioning via robust scaling. Psychometrika 2024, 89, 796–821.
- He, Y.; Cui, Z.; Fang, Y.; Chen, H. Using a linear regression method to detect outliers in IRT common item equating. Appl. Psychol. Meas. 2013, 37, 522–540.
- Jurich, D.; Liu, C. Detecting item parameter drift in small sample Rasch equating. Appl. Meas. Educ. 2023, 36, 326–339.
- Liu, C.; Jurich, D. Outlier detection using t-test in Rasch IRT equating under NEAT design. Appl. Psychol. Meas. 2023, 47, 34–47.
- Magis, D.; De Boeck, P. Identification of differential item functioning in multiple-group settings: A multivariate outlier detection approach. Multivar. Behav. Res. 2011, 46, 733–755.
- Magis, D.; De Boeck, P. A robust outlier approach to prevent type I error inflation in differential item functioning. Educ. Psychol. Meas. 2012, 72, 291–311.
- Manna, V.F.; Gu, L. Different Methods of Adjusting for Form Difficulty Under the Rasch Model: Impact on Consistency of Assessment Results; Research Report No. RR-19-08; Educational Testing Service: Princeton, NJ, USA, 2019.
- Robitzsch, A. Robust and nonrobust linking of two groups for the Rasch model with balanced and unbalanced random DIF: A comparative simulation study and the simultaneous assessment of standard errors and linking errors with resampling techniques. Symmetry 2021, 13, 2198.
- Strobl, C.; Kopf, J.; Kohler, L.; von Oertzen, T.; Zeileis, A. Anchor point selection: Scale alignment based on an inequality criterion. Appl. Psychol. Meas. 2021, 45, 214–230.
- Wang, W.; Liu, Y.; Liu, H. Testing differential item functioning without predefined anchor items using robust regression. J. Educ. Behav. Stat. 2022, 47, 666–692.
- Huber, P.J.; Ronchetti, E.M. Robust Statistics; Wiley: New York, NY, USA, 2009.
- Maronna, R.A.; Martin, R.D.; Yohai, V.J. Robust Statistics: Theory and Methods; Wiley: New York, NY, USA, 2006.
- Wilcox, R. Modern Statistics for the Social and Behavioral Sciences: A Practical Introduction; CRC Press: Boca Raton, FL, USA, 2017.
- Lipovetsky, S. Optimal Lp-metric for minimizing powered deviations in regression. J. Mod. Appl. Stat. Methods 2007, 6, 20.
- Giacalone, M.; Panarello, D.; Mattera, R. Multicollinearity in regression: An efficiency comparison between Lp-norm and least squares estimators. Qual. Quant. 2018, 52, 1831–1859.
- Robitzsch, A. Lp loss functions in invariance alignment and Haberman linking with few or many groups. Stats 2020, 3, 246–283.
- Asparouhov, T.; Muthén, B. Multiple-group factor analysis alignment. Struct. Equ. Modeling 2014, 21, 495–508.
- Muthén, B.; Asparouhov, T. IRT studies of many groups: The alignment method. Front. Psychol. 2014, 5, 978.
- Robitzsch, A. Examining differences of invariance alignment in the Mplus software and the R package sirt. Mathematics 2024, 12, 770.
- Robitzsch, A. Comparing robust Haberman linking and invariance alignment. Stats 2025, 8, 3.
- Oelker, M.R.; Pößnecker, W.; Tutz, G. Selection and fusion of categorical predictors with L0-type penalties. Stat. Model. 2015, 15, 389–410.
- Oelker, M.R.; Tutz, G. A uniform framework for the combination of penalties in generalized structured models. Adv. Data Anal. Classif. 2017, 11, 97–120.
- Xiang, J.; Yue, H.; Yin, X.; Wang, L. A new smoothed l0 regularization approach for sparse signal recovery. Math. Probl. Eng. 2019, 2019, 1978154.
- Wang, L.; Yin, X.; Yue, H.; Xiang, J. A regularized weighted smoothed L0 norm minimization method for underdetermined blind source separation. Sensors 2018, 18, 4260.
- O’Neill, M.; Burke, K. Variable selection using a smooth information criterion for distributional regression models. Stat. Comput. 2023, 33, 71.
- Robitzsch, A. L0 and Lp loss functions in model-robust estimation of structural equation models. Psych 2023, 5, 1122–1139.
- Jaeckel, L.A. Robust estimates of location: Symmetry and asymmetric contamination. Ann. Math. Stat. 1971, 42, 1020–1034.
- Robitzsch, A. Computational aspects of L0 linking in the Rasch model. Algorithms 2025, 18, 213.
- Ogasawara, H. Standard errors of item response theory equating/linking by response function methods. Appl. Psychol. Meas. 2001, 25, 53–67.
- Ogasawara, H. Item response theory true score equatings and their standard errors. J. Educ. Behav. Stat. 2001, 26, 31–50.
- Ogasawara, H. Applications of asymptotic expansion in item response theory linking. In Statistical Models for Test Equating, Scaling, and Linking; von Davier, A., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 261–280.
- Battauz, M. IRT test equating in complex linkage plans. Psychometrika 2013, 78, 464–480.
- Battauz, M. Factors affecting the variability of IRT equating coefficients. Stat. Neerl. 2015, 69, 85–101.
- Andersson, B. Asymptotic variance of linking coefficient estimators for polytomous IRT models. Appl. Psychol. Meas. 2018, 42, 192–205.
- Zhang, Z. Asymptotic standard errors of equating coefficients using the characteristic curve methods for the graded response model. Appl. Meas. Educ. 2020, 33, 309–330.
- Zhang, Z. Asymptotic standard errors of parameter scale transformation coefficients in test equating under the nominal response model. Appl. Psychol. Meas. 2021, 45, 134–138.
- Zhang, Z. Asymptotic standard errors of generalized partial credit model true score equating using characteristic curve methods. Appl. Psychol. Meas. 2021, 45, 331–345.
- Jewsbury, P.A. Error Variance in Common Population Linking Bridge Studies; Research Report No. RR-19-42; Educational Testing Service: Princeton, NJ, USA, 2019.
- Jewsbury, P.A. Generally applicable variance estimation methods for common-population linking. J. Educ. Behav. Stat. 2024.
- Jewsbury, P.A. Linking error on achievement levels accounting for dependencies and complex sampling. J. Educ. Meas. 2025; epub ahead of print.
- Robitzsch, A. Estimation of standard error, linking error, and total error for robust and nonrobust linking methods in the two-parameter logistic model. Stats 2024, 7, 592–612.
- Fox, J. Applied Regression Analysis and Generalized Linear Models; Sage: Thousand Oaks, CA, USA, 2016; Available online: https://bit.ly/38XUSX1 (accessed on 4 May 2025).
- Fox, J.; Weisberg, S. Robust Regression in R: An Appendix to an R Companion to Applied Regression, 2nd ed.; Sage: Thousand Oaks, CA, USA, 2010; Available online: https://bit.ly/3canwcw (accessed on 4 May 2025).
- Chen, Y.; Li, C.; Ouyang, J.; Xu, G. DIF statistical inference without knowing anchoring items. Psychometrika 2023, 88, 1097–1122.
- Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; CRC Press: Boca Raton, FL, USA, 1994.
- Davison, A.C.; Hinkley, D.V. Bootstrap Methods and Their Application; Cambridge University Press: Cambridge, UK, 1997.
- Lietz, P.; Cresswell, J.C.; Rust, K.F.; Adams, R.J. (Eds.) Implementation of Large-Scale Education Assessments; Wiley: New York, NY, USA, 2017.
- Muthén, L.K.; Muthén, B.O. How to use a Monte Carlo study to decide on sample size and determine power. Struct. Equ. Modeling 2002, 9, 599–620.
- R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria, 2024. Available online: https://www.R-project.org (accessed on 15 June 2024).
- Robitzsch, A. sirt: Supplementary Item Response Theory Models. R Package Version 4.2-114. 2025. Available online: https://github.com/alexanderrobitzsch/sirt (accessed on 7 April 2025).
- Battauz, M. equateMultiple: Equating of Multiple Forms. R Package Version 1.0.0. 2024. Available online: https://cran.r-project.org/web/packages/equateMultiple/index.html (accessed on 7 April 2025).
- Battauz, M. On Wald tests for differential item functioning detection. Stat. Methods Appl. 2019, 28, 103–118.
- Zeileis, A.; Strobl, C.; Wickelmaier, F.; Komboz, B.; Kopf, J.; Schneider, L.; Debelak, R. psychotools: Psychometric Modeling Infrastructure. R Package Version 0.7-4. 2024. Available online: https://cran.r-project.org/web/packages/psychotools/index.html (accessed on 7 April 2025).
- Fitts, D.A. Expected and empirical coverages of different methods for generating noncentral t confidence intervals for a standardized mean difference. Behav. Res. Methods 2021, 53, 2412–2429.
- Robitzsch, A. Linking error in the 2PL model. J 2023, 6, 58–84.
- Bollen, K.A. Structural Equations with Latent Variables; Wiley: New York, NY, USA, 1989.
- Yuan, K.H.; Bentler, P.M. Structural Equation Modeling with Robust Covarianc. Available online: https://www3.nd.edu/~kyuan/courses/sem/readpapers/Yuan-Bentler-SM98.pdf (accessed on 7 April 2025).
- Siemsen, E.; Bollen, K.A. Least absolute deviation estimation in structural equation modeling. Sociol. Methods Res. 2007, 36, 227–265.
- van Kesteren, E.J.; Oberski, D.L. Flexible extensions to structural equation models using computation graphs. Struct. Equ. Modeling 2022, 29, 233–247.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).