Assessing Standard Error Estimation Approaches for Robust Mean-Geometric Mean Linking

Alexander Robitzsch

doi:10.3390/appliedmath5030086

¹

IPN—Leibniz Institute for Science and Mathematics Education (IPN—Leibniz-Institut für die Pädagogik der Naturwissenschaften und Mathematik), Olshausenstraße 62, 24118 Kiel, Germany

²

Centre for International Student Assessment (ZIB—Zentrum für Internationale Bildungsvergleichsstudien), Olshausenstraße 62, 24118 Kiel, Germany

AppliedMath 2025, 5(3), 86; https://doi.org/10.3390/appliedmath5030086


Abstract

Robust mean-geometric mean (MGM) linking methods enable reliable group comparisons in item response theory models under fixed and sparse differential item functioning. This article evaluates six alternative standard error and confidence interval (CI) estimation methods across four MGM linking approaches. Our Simulation Study demonstrates that CIs based on the delta method or bootstrap procedures using the normal distribution or empirical quantiles exhibit highly inflated coverage rates. In contrast, CIs derived from a weighted least squares estimation problem, as well as basic and bias-corrected bootstrap methods, yield satisfactory coverage rates in most simulation conditions for robust MGM linking.

Keywords:

item response model; 2PL model; mean-geometric mean linking; differential item functioning; standard error; bootstrap

1. Introduction

Item response theory (IRT) models [1,2,3,4] provide a statistical framework for modeling multivariate discrete data. This article centers on dichotomous (i.e., binary) item responses and the use of linking methods to compare two groups [5,6]. Let

X = (X_{1}, \dots, X_{I})

denote a vector of I binary random variables, where each

X_{i} \in \{0, 1\}

corresponds to a (scored) item response. In a unidimensional IRT model [7], the joint distribution

P (X = x)

is specified for item response patterns

x = (x_{1}, \dots, x_{I}) \in \{0, 1\}^{I}

:

P (X = x; δ, γ) = \int \prod_{i = 1}^{I} [{P_{i} (θ; γ_{i})}^{x_{i}} {(1 - P_{i} (θ; γ_{i}))}^{1 - x_{i}}] ϕ (θ; μ, σ) d θ,

(1)

where

ϕ

is the density of the normal distribution, parameterized by the mean

μ

and the standard deviation (SD)

σ

. The latent variable

θ

—also referred to as a latent trait—is characterized by distribution parameters grouped in the vector

δ = (μ, σ)

. Item-specific parameters are represented by the vector

γ = (γ_{1}, \dots, γ_{I})

, where each parameter vector

γ_{i}

parameterizes the item response function (IRF)

P_{i} (θ; γ_{i}) = P (X_{i} = 1 ∣ θ)

for item i. This article employs the two-parameter logistic (2PL) model [8] with the IRF defined as

P_{i} (θ; γ_{i}) = Ψ (a_{i} (θ - b_{i})),

(2)

where

a_{i}

and

b_{i}

are the item discrimination and the item difficulty of item i, respectively. The function

Ψ (x) = {(1 + exp (- x))}^{- 1}

is the logistic distribution function, and

γ_{i} = (a_{i}, b_{i})

.
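As a minimal sketch of Equations (1) and (2), the following Python snippet evaluates the 2PL IRF and approximates the marginal pattern probability with a rectangle rule on an equally spaced θ grid. The grid (201 nodes on μ ± 6σ), the three illustrative items, and all function names are assumptions made for this illustration; they are not taken from the article's replication materials.

```python
import math
from itertools import product


def logistic(x):
    """Logistic distribution function Psi(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def irf_2pl(theta, a, b):
    """2PL item response function, Equation (2)."""
    return logistic(a * (theta - b))


def normal_pdf(theta, mu, sigma):
    """Density of the normal distribution with mean mu and SD sigma."""
    z = (theta - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def pattern_prob(x, a, b, mu=0.0, sigma=1.0, n_nodes=201, width=6.0):
    """Approximate Equation (1) by a rectangle rule (assumed quadrature)."""
    lo = mu - width * sigma
    step = 2.0 * width * sigma / (n_nodes - 1)
    total = 0.0
    for k in range(n_nodes):
        theta = lo + k * step
        lik = 1.0
        for xi, ai, bi in zip(x, a, b):
            p = irf_2pl(theta, ai, bi)
            lik *= p if xi == 1 else 1.0 - p
        total += lik * normal_pdf(theta, mu, sigma) * step
    return total


# Hypothetical item parameters for three items
a = [1.2, 0.8, 1.5]
b = [-0.5, 0.0, 0.7]
probs = {x: pattern_prob(x, a, b) for x in product([0, 1], repeat=3)}
```

Summing `probs` over all eight response patterns should give a value very close to one, which serves as a quick sanity check of the quadrature.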

The unknown parameters of the IRT model in (1) can be consistently estimated using marginal maximum likelihood estimation (MML; [9,10]) for a sample of individuals.

IRT models are commonly employed to assess group differences on a test by examining variations in the latent trait

θ

, assuming the parametric form specified in (1). This article concentrates on linking methods [5,11,12] based on the 2PL model. The linking process involves two stages: first, the 2PL model is separately estimated for each group, accommodating potential differential item functioning (DIF; [13,14,15,16,17]); second, the differences in item parameters are utilized to compute group differences in the latent trait

θ

via a linking method [5,18].

This article investigates the accuracy of confidence interval (CI) and standard error estimates in mean-geometric mean (MGM; [5,19,20,21,22,23]) linking under fixed sparse uniform DIF [16] in item difficulties [24,25]. The traditional MGM method computes linking constants using mean differences of log-transformed item discriminations and untransformed item difficulties, relying on means as the location measure, which corresponds to using the

L_{2}

loss function. This study evaluates CI coverage rates for robust MGM linking methods that employ the

L_{p}

(

p > 0

) and

L_{0}

loss functions [25]. These robust loss functions can effectively down-weight the influence of items with large DIF effects in the group comparison in the linking method [18,26,27,28,29,30,31,32,33,34,35]. Analytically derived standard errors based on the delta method for robust MGM linking are compared with various parametric bootstrap approaches. To the best of our knowledge, the adequacy of statistical inference for robust linking methods remains an underexplored area in the literature, with almost no studies specifically addressing this topic (see [26] for an exception).

The rest of the article is organized as follows. Section 2 reviews robust MGM linking. Section 3 presents alternative standard error and CI estimation methods. Findings from a simulation study are reported in Section 4. Section 5 presents empirical examples that illustrate the application of the different CI estimation methods. Finally, the article closes with a discussion in Section 6 and conclusions in Section 7.

2. Robust Mean-Geometric Mean Linking

This section reviews robust MGM linking [25]. As in standard linking procedures, item discriminations

{\hat{a}}_{i g}

and item difficulties

{\hat{b}}_{i g}

(

i = 1, \dots, I

) are estimated by fitting the 2PL model separately in the two groups

g = 1, 2

. The outcome of MGM linking is the estimation of the mean

\hat{μ}

and the SD

\hat{σ}

for the second group, representing the difference in the

θ

variable relative to the first group, which is defined to have a mean of 0 and an SD of 1.

2.1. $L_{p}$ and $L_{0}$ Loss Functions

Robust MGM linking can be interpreted as the computation of a robust location measure [36,37,38]. A flexible class of loss functions is given by the

L_{p}

loss function [39,40,41]

ρ

for positive values p, defined as

ρ (x) = {| x |}^{p} for p > 0 .

(3)

The

L_{p}

loss function is not differentiable at

x = 0

when

p \leq 1

. A commonly used differentiable approximation of

ρ

is

ρ (x) ≃ {({| x |}^{2} + ε)}^{p / 2},

(4)

where

ε > 0

is a tuning parameter that controls the approximation error (see [42,43]). In practice, a value of

ε = 0.001

has proven effective [41,44,45].

The

L_{2}

loss function, a special case of

L_{p}

with

p = 2

, is the most widely used and is defined as

ρ (x) = x^{2}

. However, the

L_{2}

loss function is known to be sensitive to outliers and, unlike

L_{p}

with

p < 1

, lacks robustness.

An alternative robust loss function is the

L_{0}

loss function [46,47,48,49], defined as

ρ (x) = 1 (x \neq 0),

(5)

where

1

denotes the indicator function, which takes the value 0 for

x = 0

and 1 for

x \neq 0

. A differentiable approximation is given by (see [50])

ρ (x) ≃ \frac{x^{2}}{x^{2} + ε},

(6)

where

ε > 0

is a tuning parameter. The choice

ε = 0.01

has shown satisfactory performance in applications [44,45,51]. Using a smaller value of

ε = 0.001

slightly reduces the bias but generally increases the variance of the estimator. Thus, the selection of

ε

involves a bias–variance trade-off that must be evaluated individually for each empirical application.

The

L_{p}

loss function with

p < 1

is often preferred over the

L_{1}

loss function when error distributions are asymmetric or contain outliers [52]. In practice,

L_{0.5}

(i.e.,

p = 0.5

) is frequently selected. However, the

L_{0}

loss function is preferable to

L_{0.5}

in terms of bias, although this advantage comes at the cost of increased sampling variance [25,53].
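The differentiable approximations in (4) and (6), together with their first derivatives, can be sketched in Python as follows. The function names are illustrative assumptions; the default values of ε follow the values mentioned in the text.

```python
def lp_loss(x, p, eps=0.001):
    """Differentiable approximation (4) of the L_p loss |x|^p."""
    return (x * x + eps) ** (p / 2.0)


def lp_loss_deriv(x, p, eps=0.001):
    """First derivative of (4) with respect to x."""
    return p * x * (x * x + eps) ** (p / 2.0 - 1.0)


def l0_loss(x, eps=0.01):
    """Differentiable approximation (6) of the L_0 loss 1(x != 0)."""
    return x * x / (x * x + eps)


def l0_loss_deriv(x, eps=0.01):
    """First derivative of (6) with respect to x."""
    return 2.0 * x * eps / (x * x + eps) ** 2
```

Note that the approximation (6) is bounded by one, so a single large residual cannot dominate the criterion, whereas the squared loss grows without bound.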

2.2. Estimation of $σ$

The estimation of

σ

in robust MGM linking is based on log-transformed item discriminations. The parameter estimate

\hat{σ}

is obtained by minimizing

H_{σ} (σ) = \sum_{i = 1}^{I} ρ (log {\hat{a}}_{i 2} - log {\hat{a}}_{i 1} - log σ),

(7)

where

ρ

is a differentiable loss function, which may be a differentiable approximation of

L_{p}

or

L_{0}

. The corresponding estimating equation for

σ

is given by

h_{σ} (σ) = \sum_{i = 1}^{I} ρ_{1} (log {\hat{a}}_{i 2} - log {\hat{a}}_{i 1} - log σ) = 0,

(8)

where

ρ_{1}

denotes the first derivative of the loss function

ρ

. For the squared loss function

L_{2}

, a closed-form solution is available as

\hat{σ} = exp [\frac{1}{I} \sum_{i = 1}^{I} log {\hat{a}}_{i 2} - \frac{1}{I} \sum_{i = 1}^{I} log {\hat{a}}_{i 1}] .

(9)

Note that (9) corresponds to the original estimation method proposed in MGM linking [5,19,20].
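To illustrate (7) and (9), the sketch below computes the closed-form L_2 estimate and a robust estimate obtained by a plain grid search over σ. The grid search is a simple stand-in chosen for transparency and is an assumption of this sketch (the article reports using stats::optim() in R). The discriminations and the single outlying item are hypothetical.

```python
import math


def sigma_l2(a1, a2):
    """Closed-form L2 solution (9): ratio of geometric means."""
    n = len(a1)
    return math.exp(sum(math.log(y) - math.log(x) for x, y in zip(a1, a2)) / n)


def l0_loss(x, eps=0.01):
    """Differentiable approximation (6) of the L0 loss."""
    return x * x / (x * x + eps)


def sigma_robust(a1, a2, loss, lo=0.2, hi=5.0, n_grid=4801):
    """Minimize criterion (7) over an equally spaced grid of sigma values."""
    step = (hi - lo) / (n_grid - 1)
    grid = [lo + k * step for k in range(n_grid)]

    def criterion(sigma):
        return sum(
            loss(math.log(y) - math.log(x) - math.log(sigma))
            for x, y in zip(a1, a2)
        )

    return min(grid, key=criterion)


# Hypothetical error-free discriminations with true sigma = 1.2
a1 = [1.499, 1.129, 1.647, 1.014, 1.567]
a2 = [1.2 * a for a in a1]
a2[0] *= 2.0  # one hypothetical outlying item

sigma_hat_l2 = sigma_l2(a1, a2)               # pulled upward by the outlier
sigma_hat_l0 = sigma_robust(a1, a2, l0_loss)  # stays close to 1.2
```

In this constructed example, the L_2 estimate absorbs one fifth of the outlier on the log scale, while the L_0-type estimate remains near the value implied by the four unaffected items.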

2.3. Estimation of $μ$

The estimation of

μ

in MGM linking is based on the previously estimated SD

\hat{σ}

. The group mean estimate

\hat{μ}

is obtained by minimizing

H_{μ} (μ) = \sum_{i = 1}^{I} ρ (\hat{σ} {\hat{b}}_{i 2} - {\hat{b}}_{i 1} + μ),

(10)

where

ρ

is a differentiable loss function. The estimate

\hat{μ}

satisfies the estimating equation

h_{μ} (μ) = \sum_{i = 1}^{I} ρ_{1} (\hat{σ} {\hat{b}}_{i 2} - {\hat{b}}_{i 1} + μ) = 0,

(11)

with

ρ_{1}

denoting the first derivative of

ρ

. For the squared loss function

L_{2}

, a closed-form solution is available as

\hat{μ} = - (\hat{σ} \frac{1}{I} \sum_{i = 1}^{I} {\hat{b}}_{i 2} - \frac{1}{I} \sum_{i = 1}^{I} {\hat{b}}_{i 1}),

(12)

which corresponds to the originally proposed MGM linking method [5,19,20].
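The following sketch contrasts the closed-form estimate (12) with a robust minimizer of (10) in an error-free setting with uniform DIF on 30% of the items. It assumes that separate calibration standardizes θ within each group, so that the group-2 difficulties equal (b_i + d_i − μ)/σ; this parameterization, the grid search, and the function names are assumptions of the illustration, not the article's code.

```python
def mu_l2(b1, b2, sigma_hat):
    """Closed-form L2 solution (12)."""
    n = len(b1)
    return -(sigma_hat * sum(b2) / n - sum(b1) / n)


def l0_loss(x, eps=0.01):
    """Differentiable approximation (6) of the L0 loss."""
    return x * x / (x * x + eps)


def mu_robust(b1, b2, sigma_hat, loss, lo=-2.0, hi=2.0, n_grid=4001):
    """Minimize criterion (10) over an equally spaced grid of mu values."""
    step = (hi - lo) / (n_grid - 1)
    grid = [lo + k * step for k in range(n_grid)]

    def criterion(m):
        return sum(loss(sigma_hat * y2 - y1 + m) for y1, y2 in zip(b1, b2))

    return min(grid, key=criterion)


# Error-free illustration: mu = 0.3, sigma = 1.2, DIF d = 0.6 on 3 of 10 items
mu, sigma, d = 0.3, 1.2, 0.6
b_base = [-0.314, 0.411, -1.097, -0.542, -1.854,
          -0.403, -0.895, 0.715, 0.841, 0.139]
dif = [d if i < 3 else 0.0 for i in range(len(b_base))]
b1 = list(b_base)
b2 = [(b + e - mu) / sigma for b, e in zip(b_base, dif)]

mu_hat_l2 = mu_l2(b1, b2, sigma)               # mu minus the mean DIF effect
mu_hat_l0 = mu_robust(b1, b2, sigma, l0_loss)  # close to mu
```

Under these assumptions, the L_2 estimate is shifted by the average DIF effect (0.3 × 0.6 = 0.18), whereas the L_0-type estimate stays close to the generating value of 0.3.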

3. Estimation of Standard Errors and Confidence Intervals

This section presents alternative methods for computing standard errors and CIs. The first two methods, the delta method (Section 3.1) and weighted least squares (Section 3.2), rely on asymptotic theory, whereas the bootstrap methods in Section 3.3 rely on resampling to compute a CI.

For notational convenience, we define the vector

δ = (μ, σ)

and its estimate

\hat{δ} = (\hat{μ}, \hat{σ})

, which jointly represent the linking parameter estimates. The vector

\hat{γ}

includes all estimated item parameters, and its corresponding estimated variance matrix is denoted by

V_{\hat{γ}}

. The variance matrix is computed as the inverse of the observed information matrix, which is based on the second-order derivatives of the log-likelihood function with respect to the item parameters

γ

.

3.1. Delta Method (DM)

The standard error of

\hat{δ}

and corresponding CIs for its components are computed using the delta method (DM; see [54,55,56,57,58,59,60,61,62,63,64,65,66]). In robust MGM linking, the estimate

\hat{δ} = (\hat{μ}, \hat{σ})

is given in two-step estimation using the estimating Equations (8) and (11). Alternatively, these can also be combined as a one-step solution of the estimating equation:

h_{δ} (δ, \hat{γ}) = (\begin{matrix} \sum_{i = 1}^{I} ρ_{1} (log {\hat{a}}_{i 2} - log {\hat{a}}_{i 1} - log σ) \\ \sum_{i = 1}^{I} ρ_{1} (σ {\hat{b}}_{i 2} - {\hat{b}}_{i 1} + μ) \end{matrix}) = 0 .

(13)

The DM approach applies a Taylor expansion of

h_{δ}

around the population values

(δ, γ)

, yielding

h_{δ} (\hat{δ}, \hat{γ}) = h_{δ} (δ, γ) + h_{δ δ} (δ, γ) (\hat{δ} - δ) + h_{δ γ} (δ, γ) (\hat{γ} - γ),

(14)

where

h_{δ δ}

and

h_{δ γ}

denote the Jacobians of

h_{δ}

with respect to

δ

and

γ

, respectively. Using (13) and the identity

h_{δ} (δ, γ) = 0

, expression (14) simplifies to

\hat{δ} - δ = - h_{δ δ} {(δ, γ)}^{- 1} h_{δ γ} (δ, γ) (\hat{γ} - γ) .

(15)

Letting

A = h_{δ δ} {(δ, γ)}^{- 1} h_{δ γ} (δ, γ)

(the minus sign in (15) cancels in the quadratic form), the variance of

\hat{δ}

becomes

Var (\hat{δ}) = A V_{\hat{γ}} A^{⊤} .

(16)

An estimate of

A

is given by

\hat{A} = h_{δ δ} {(\hat{δ}, \hat{γ})}^{- 1} h_{δ γ} (\hat{δ}, \hat{γ}),

(17)

resulting in the estimated variance matrix

V_{\hat{δ}} = \hat{A} V_{\hat{γ}} {\hat{A}}^{⊤} .

(18)

Standard errors for

\hat{μ}

and

\hat{σ}

are obtained as the square roots of the diagonal elements in

V_{\hat{δ}}

. Confidence intervals can then be constructed assuming normality and using these standard errors.
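A minimal numerical sketch of the DM computation is given below. By the implicit function theorem, the Jacobian of the map from item parameter estimates to linking estimates coincides (up to sign) with the matrix A, so it can be approximated by central finite differences of the linking function. The sketch assumes the L_2 closed forms (9) and (12), a diagonal variance matrix for the item parameters, and hypothetical parameter values; an actual application would use the full matrix and possibly analytic derivatives.

```python
import math


def link_l2(gamma):
    """L2 MGM linking for gamma = a1 + b1 + a2 + b2 (flat list)."""
    n = len(gamma) // 4
    a1, b1 = gamma[:n], gamma[n:2 * n]
    a2, b2 = gamma[2 * n:3 * n], gamma[3 * n:]
    sigma = math.exp(sum(math.log(y) - math.log(x) for x, y in zip(a1, a2)) / n)
    mu = -(sigma * sum(b2) / n - sum(b1) / n)
    return [mu, sigma]


def jacobian(fun, gamma, step=1e-6):
    """Central finite-difference Jacobian of fun at gamma."""
    rows = len(fun(gamma))
    jac = [[0.0] * len(gamma) for _ in range(rows)]
    for j in range(len(gamma)):
        up, dn = list(gamma), list(gamma)
        up[j] += step
        dn[j] -= step
        f_up, f_dn = fun(up), fun(dn)
        for r in range(rows):
            jac[r][j] = (f_up[r] - f_dn[r]) / (2.0 * step)
    return jac


def delta_method_se(fun, gamma, var_diag):
    """Square roots of diag(A V A^T) for a diagonal V (assumption)."""
    jac = jacobian(fun, gamma)
    return [
        math.sqrt(sum(row[j] ** 2 * var_diag[j] for j in range(len(gamma))))
        for row in jac
    ]


# Hypothetical item parameter estimates and sampling variances
a1 = [1.0, 1.2, 0.8]
b1 = [-0.5, 0.0, 0.5]
a2 = [1.2 * a for a in a1]
b2 = [(b - 0.3) / 1.2 for b in b1]
gamma_hat = a1 + b1 + a2 + b2
se_mu, se_sigma = delta_method_se(link_l2, gamma_hat, [0.01] * len(gamma_hat))
```

A normal-theory 95% CI then follows as the estimate plus or minus 1.96 times the corresponding standard error.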

3.2. Weighted Least Squares (WLS)

An alternative to the DM method for standard error estimation is derived from robust regression methodology [67,68]. In robust regression, estimation is typically carried out via iterative weighted least squares, where parameter estimates are obtained at each iteration by minimizing a weighted least squares criterion. After computing the regression parameters, the weights are updated based on the chosen robust loss function

ρ

. Standard errors in this framework are computed using the ordinary WLS formula, treating the final weights as fixed. This approach is adapted here to estimate standard errors in robust MGM linking.

The estimating equation in (13), under the WLS approach with fixed weights, takes the form

h_{δ} (δ, \hat{γ}) = (\begin{matrix} \sum_{i = 1}^{I} w_{i, σ} (log {\hat{a}}_{i 2} - log {\hat{a}}_{i 1} - log σ) \\ \sum_{i = 1}^{I} w_{i, μ} (\hat{σ} {\hat{b}}_{i 2} - {\hat{b}}_{i 1} + μ) \end{matrix}) = 0, where

(19)

w_{i, σ} = \frac{ρ (log {\hat{a}}_{i 2} - log {\hat{a}}_{i 1} - log \hat{σ})}{{(log {\hat{a}}_{i 2} - log {\hat{a}}_{i 1} - log \hat{σ})}^{2}} and w_{i, μ} = \frac{ρ (\hat{σ} {\hat{b}}_{i 2} - {\hat{b}}_{i 1} + \hat{μ})}{{(\hat{σ} {\hat{b}}_{i 2} - {\hat{b}}_{i 1} + \hat{μ})}^{2}} .

(20)

The DM, as presented in Section 3.1, is then applied to the modified estimating Equation (19) to derive an alternative variance matrix

V_{\hat{δ}}

for

\hat{δ}

and to construct CIs for

\hat{μ}

and

\hat{σ}

.

It is noted that the DM and WLS methods yield identical standard errors when the squared loss function

ρ (x) = x^{2}

is used, because it implies

w_{i, σ} = 1

and

w_{i, μ} = 1

. However, for the

L_{p}

loss (

0 < p < 1

) and the

L_{0}

loss, the resulting standard error estimates will differ.
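The weights in (20) can be illustrated with a short Python sketch. The residual values are hypothetical, and the function names are assumptions of this example. For the squared loss, all weights equal one; for the L_0 approximation (6), the weight reduces to 1/(e² + ε), so items with large residuals receive small weights.

```python
def squared_loss(x):
    """L2 loss."""
    return x * x


def l0_loss(x, eps=0.01):
    """Differentiable approximation (6) of the L0 loss."""
    return x * x / (x * x + eps)


def wls_weight(resid, loss):
    """Weight from Equation (20): rho(e) / e^2 for a nonzero residual e."""
    if resid == 0.0:
        raise ValueError("weight is undefined for an exactly zero residual")
    return loss(resid) / (resid * resid)


# Hypothetical residuals: two items without DIF and one item with DIF
residuals = [0.05, -0.02, 0.6]
weights_l2 = [wls_weight(e, squared_loss) for e in residuals]
weights_l0 = [wls_weight(e, l0_loss) for e in residuals]
```

For the approximation (4) of the L_p loss, ρ(0) is strictly positive, so the ratio in (20) grows without bound as a residual approaches zero; an implementation would need to guard against this case.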

3.3. Bootstrap Methods

This section applies parametric bootstrap methods to compute confidence intervals for the linking parameters

\hat{δ}

. The estimate

\hat{δ}

satisfies the estimating equation

h_{δ} (δ, \hat{γ}) = 0

. The item parameters

\hat{γ}

possess a variance matrix

V_{\hat{γ}}

, estimated separately for each group. The parametric bootstrap resamples item parameters based on

V_{\hat{γ}}

, generating a distribution of the linking parameter estimates

\hat{δ}

(see also [69]).

To obtain bootstrap samples

b = 1, \dots, B

, draw

{\hat{γ}}^{(b)}

from a multivariate normal distribution with mean vector

\hat{γ}

and variance matrix

V_{\hat{γ}}

. For each bth bootstrap sample, the estimate

{\hat{δ}}^{(b)}

satisfies

h_{δ} (δ, {\hat{γ}}^{(b)}) = 0

. Standard bootstrap techniques can then be applied to the resulting estimates

{\hat{δ}}^{(b)}

for

b = 1, \dots, B

(see [70]).
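The resampling step can be sketched as follows. For brevity, the sketch draws each item parameter independently (i.e., it assumes a diagonal variance matrix instead of the full multivariate normal distribution described above) and relinks every draw with the L_2 closed forms; with a robust loss, the criterion would be re-minimized for each draw. All names and numerical values are hypothetical.

```python
import math
import random


def link_l2(gamma):
    """L2 MGM linking for gamma = a1 + b1 + a2 + b2 (flat list)."""
    n = len(gamma) // 4
    a1, b1 = gamma[:n], gamma[n:2 * n]
    a2, b2 = gamma[2 * n:3 * n], gamma[3 * n:]
    sigma = math.exp(sum(math.log(y) - math.log(x) for x, y in zip(a1, a2)) / n)
    mu = -(sigma * sum(b2) / n - sum(b1) / n)
    return mu, sigma


def parametric_bootstrap(gamma_hat, se_gamma, link, n_boot=999, seed=1):
    """Draw item parameters from independent normals and relink each draw."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        # Small standard errors keep the drawn discriminations positive here
        gamma_b = [rng.gauss(g, s) for g, s in zip(gamma_hat, se_gamma)]
        draws.append(link(gamma_b))
    return draws


# Hypothetical item parameter estimates and standard errors
a1 = [1.0, 1.2, 0.8]
b1 = [-0.5, 0.0, 0.5]
a2 = [1.2 * a for a in a1]
b2 = [(b - 0.3) / 1.2 for b in b1]
gamma_hat = a1 + b1 + a2 + b2
boot = parametric_bootstrap(gamma_hat, [0.05] * len(gamma_hat), link_l2)
mu_boot = [m for m, _ in boot]
sigma_boot = [s for _, s in boot]
```

The lists `mu_boot` and `sigma_boot` play the role of the bootstrap estimates to which the CI constructions below are applied.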

Let

\hat{κ}

denote an entry of

\hat{δ}

, representing either

\hat{μ}

or

\hat{σ}

, and let

{\hat{κ}}^{(b)}

denote the corresponding parameter estimate in the bth bootstrap sample. Define

{\hat{G}}_{\hat{κ}}

as the empirical distribution function of

\hat{κ}

, obtained from the parametric bootstrap using B samples. The associated inverse distribution function (i.e., the quantile function) is denoted by

{\hat{G}}_{\hat{κ}}^{- 1}

.

The following outlines alternative CI estimation methods at confidence level

1 - α

(e.g.,

1 - α = 0.95

), based on bootstrap procedures as described in [70,71].

3.3.1. Normal Distribution Bootstrap CI (BNO)

The normal distribution bootstrap (BNO) CI assumes a normal distribution for the parameter estimate

\hat{κ}

. Let

Φ^{- 1}

denote the inverse distribution function (i.e., the quantile function) of the standard normal distribution. The quantiles are denoted briefly as

z_{α} = Φ^{- 1} (α)

. The BNO confidence interval is given by

C I = [\hat{κ} - z_{1 - α / 2} s_{\hat{κ}}, \hat{κ} + z_{1 - α / 2} s_{\hat{κ}}],

(21)

where

s_{\hat{κ}}

denotes the empirical standard deviation of the bootstrap estimates

{\hat{κ}}^{(b)}

, defined by

s_{\hat{κ}} = \sqrt{\frac{1}{B - 1} \sum_{b = 1}^{B} {({\hat{κ}}^{(b)} - \bar{\hat{κ}})}^{2}} with \bar{\hat{κ}} = \frac{1}{B} \sum_{b = 1}^{B} {\hat{κ}}^{(b)} .

(22)

For a confidence level of

1 - α = 0.95

, the quantile in (21) is

z_{1 - α / 2} = 1.96

.
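A minimal sketch of the BNO interval in (21) and (22), using only the Python standard library, is shown next. The function name and the toy bootstrap values are assumptions for illustration.

```python
from statistics import NormalDist, stdev


def ci_bno(kappa_hat, boot, alpha=0.05):
    """Normal distribution bootstrap CI, Equations (21) and (22)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    s = stdev(boot)  # sample SD with divisor B - 1, as in (22)
    return kappa_hat - z * s, kappa_hat + z * s


# Toy bootstrap estimates (hypothetical)
boot = [0.1, 0.2, 0.3, 0.4, 0.5]
lower, upper = ci_bno(0.3, boot)
```

The interval is symmetric around the original estimate by construction, regardless of the shape of the bootstrap distribution.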

3.3.2. Percentile Bootstrap CI (BPE)

The percentile bootstrap (BPE) CI is based on the quantiles (i.e., percentiles) of the empirical distribution of the bootstrap parameter estimates

\hat{κ}

. The CI is defined as

C I = [{\hat{G}}_{\hat{κ}}^{- 1} (α / 2), {\hat{G}}_{\hat{κ}}^{- 1} (1 - α / 2)] .

(23)

An advantage of the BPE method is its applicability to cases where the distribution of

\hat{κ}

is asymmetric.

3.3.3. Basic Bootstrap CI (BBB)

The basic bootstrap (BBB) CI is constructed by forming a confidence interval for the deviations

{\hat{κ}}^{(b)} - \hat{κ}

. Following this approach, the CI is given by (see [69])

C I = [2 \hat{κ} - {\hat{G}}_{\hat{κ}}^{- 1} (1 - α / 2), 2 \hat{κ} - {\hat{G}}_{\hat{κ}}^{- 1} (α / 2)] .

(24)

The rationale for using (24) is that the interval

[{\hat{G}}_{\hat{κ}}^{- 1} (α / 2) - \hat{κ}, {\hat{G}}_{\hat{κ}}^{- 1} (1 - α / 2) - \hat{κ}]

(25)

forms a CI for the deviations

{\hat{κ}}^{(b)} - \hat{κ}

between the bootstrap estimate

{\hat{κ}}^{(b)}

and original estimate

\hat{κ}

.
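The BPE interval (23) and the BBB interval (24) differ only in how the bootstrap quantiles are used, which the following sketch makes explicit. The quantile rule (inverse of the empirical distribution function) is one of several common definitions and is an assumption here, as are the function names and the toy data.

```python
import math


def ecdf_quantile(boot, q):
    """Smallest bootstrap value whose empirical distribution function is >= q."""
    ordered = sorted(boot)
    idx = min(max(math.ceil(q * len(ordered)) - 1, 0), len(ordered) - 1)
    return ordered[idx]


def ci_bpe(boot, alpha=0.05):
    """Percentile bootstrap CI, Equation (23)."""
    return ecdf_quantile(boot, alpha / 2.0), ecdf_quantile(boot, 1.0 - alpha / 2.0)


def ci_bbb(kappa_hat, boot, alpha=0.05):
    """Basic bootstrap CI, Equation (24): percentile limits reflected at kappa_hat."""
    q_lo, q_hi = ci_bpe(boot, alpha)
    return 2.0 * kappa_hat - q_hi, 2.0 * kappa_hat - q_lo


# Toy bootstrap distribution (hypothetical) centered above the original estimate
boot = [i / 1000.0 for i in range(1, 1000)]
kappa_hat = 0.4
bpe = ci_bpe(boot)
bbb = ci_bbb(kappa_hat, boot)
```

When the bootstrap distribution is shifted relative to the original estimate, as in this toy example, the BBB interval moves in the opposite direction of the shift, whereas the BPE interval follows the bootstrap distribution.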

3.3.4. Bias-Corrected Bootstrap CI (BBC)

Finally, the bias-corrected bootstrap (BBC) CI accounts for potential bias in the bootstrap samples and accommodates asymmetric distributions of the linking parameter estimate

\hat{κ}

. It is defined as

C I = [{\hat{G}}_{\hat{κ}}^{- 1} (Φ \{2 b_{\hat{κ}} - z_{1 - α / 2}\}), {\hat{G}}_{\hat{κ}}^{- 1} (Φ \{2 b_{\hat{κ}} + z_{1 - α / 2}\})], where

(26)

b_{\hat{κ}} = Φ^{- 1} (\frac{1}{B} \sum_{b = 1}^{B} 1 ({\hat{κ}}^{(b)} < \hat{κ}))

(27)

and

1

denotes the indicator function. If exactly half of the bootstrap estimates

{\hat{κ}}^{(b)}

fall below the original estimate

\hat{κ}

(i.e., the original estimate is the bootstrap median), then

b_{\hat{κ}} = 0

, and the BBC CI in (26) coincides with the BPE CI from (23).
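The BBC interval in (26) and (27) can be sketched as follows. The quantile rule, the guard against proportions of 0 or 1 (for which the normal quantile is undefined), and the toy data are assumptions of this illustration.

```python
import math
from statistics import NormalDist


def ecdf_quantile(boot, q):
    """Smallest bootstrap value whose empirical distribution function is >= q."""
    ordered = sorted(boot)
    idx = min(max(math.ceil(q * len(ordered)) - 1, 0), len(ordered) - 1)
    return ordered[idx]


def ci_bbc(kappa_hat, boot, alpha=0.05):
    """Bias-corrected bootstrap CI, Equations (26) and (27)."""
    nd = NormalDist()
    prop = sum(1 for k in boot if k < kappa_hat) / len(boot)
    if not 0.0 < prop < 1.0:
        raise ValueError("original estimate lies outside the bootstrap range")
    b = nd.inv_cdf(prop)               # bias-correction constant (27)
    z = nd.inv_cdf(1.0 - alpha / 2.0)  # standard normal quantile
    return (
        ecdf_quantile(boot, nd.cdf(2.0 * b - z)),
        ecdf_quantile(boot, nd.cdf(2.0 * b + z)),
    )


# Toy bootstrap distribution (hypothetical)
boot = [i / 1000.0 for i in range(1, 1000)]
bbc_centered = ci_bbc(0.5, boot)  # estimate near the bootstrap median
bbc_shifted = ci_bbc(0.6, boot)   # estimate above the bootstrap median
```

With the estimate near the bootstrap median, the interval is practically the percentile interval; with the estimate above the median, both limits move upward.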

4. Simulation Study

In this Simulation Study, the adequacy of CI estimates is evaluated for the six methods described in Section 3, comparing their performance in robust MGM linking.

4.1. Method

The 2PL model for two groups served as the data-generating model. In the first group, the latent variable

θ

followed a normal distribution with fixed mean 0 and SD 1. In the second group,

θ

also followed a normal distribution, but with a fixed mean of

μ = 0.3

and SD of

σ = 1.2

across all simulation conditions.

The Simulation Study used

I = 20

items. Group-specific item parameters

a_{i g}

and

b_{i g}

for each item

i = 1, \dots, I

and for groups

g = 1, 2

were chosen based on fixed base item parameters. The item parameters were constructed using 10 base items that were duplicated to obtain a test involving 20 items. For the 10 base items, the base item discriminations

a_{i 0}

were chosen as 1.499, 1.129, 1.647, 1.014, 1.567, 0.800, 0.974, 0.913, 0.739, 0.717. This yielded an average item discrimination of

M = 1.100

with an

S D = 0.350

. The base item difficulties

b_{i 0}

of the 10 base items were chosen as −0.314, 0.411, −1.097, −0.542, −1.854, −0.403, −0.895, 0.715, 0.841, and 0.139, yielding a mean

- 0.300

with an

S D = 0.850

. The complete set of item parameters is available at https://osf.io/tjngx (accessed on 4 May 2025).
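The reported summary statistics of the base item parameters can be checked with a few lines of Python. The variable names are illustrative; the values are those listed above. The reported SDs are reproduced by the sample SD with divisor n − 1.

```python
from statistics import mean, stdev

# Base item parameters as listed in the text
a_base = [1.499, 1.129, 1.647, 1.014, 1.567, 0.800, 0.974, 0.913, 0.739, 0.717]
b_base = [-0.314, 0.411, -1.097, -0.542, -1.854, -0.403, -0.895, 0.715, 0.841, 0.139]

# Duplicate the 10 base items to obtain the 20-item test
a_items = a_base * 2
b_items = b_base * 2

summary = {
    "a_mean": round(mean(a_base), 3),
    "a_sd": round(stdev(a_base), 3),
    "b_mean": round(mean(b_base), 3),
    "b_sd": round(stdev(b_base), 3),
}
```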

For the first group, item discriminations and item difficulties matched the base item parameters exactly. In contrast, the second group included a uniform DIF effect d that shifted the base item difficulty

b_{i 0}

for a subset of items by d, while no DIF was introduced for the remaining items. In the

I = 20

item condition with duplicated item parameters, the DIF items were 1, 2, 3, 11, 12, and 13, corresponding to 30% of the items. The DIF effect was varied as

d = 0

and

d = 0.6

, representing no DIF and strong DIF conditions, respectively. No DIF was imposed on item discriminations. The simulated uniform DIF can be characterized as fixed and sparse DIF.

Group sample sizes of

N = 500

, 1000, 2000, 4000, and 10,000 were selected to mimic sample sizes in medium-to-large-scale testing applications of the 2PL model [72] and to allow studying the asymptotic behavior of the CI estimation methods.

In each of the 5 (sample size N) × 2 (DIF effect size d)

= 10

simulation conditions, 3000 replications were conducted. Robust MGM linking was applied for

p = 2

, 1, 0.5, and 0. The tuning parameter

ε

was set to 0.001 for

p = 1

and

p = 0.5

, and to

ε = 0.01

for

p = 0

. For all four MGM linking methods, CIs at confidence level

1 - α = 0.95

were computed using the six methods DM, WLS, BNO, BPE, BBB, and BBC (see Section 3). A total of

B = 999

bootstrap samples were used in the parametric bootstrap approaches.

Bias and root mean square error (RMSE) were evaluated for the estimated mean

\hat{μ}

and SD

\hat{σ}

. In addition, coverage rates were assessed for all four MGM linking methods crossed with the six CI estimation approaches. For each MGM method, a pseudo-true parameter was defined as the average parameter estimate across replications within a given simulation condition to isolate coverage performance from parameter bias. The coverage rate was defined as the proportion of replications in which the CI included the pseudo-true parameter. Coverage rates between 91% and 98% were considered acceptable [73]. A coverage rate below 91% is referred to as undercoverage, whereas a rate above 98% is considered overcoverage.

Moreover, the power rates for the statistical tests

H_{0} : μ = 0

and

H_{0} : σ = 1

were assessed. These tests evaluated whether significant differences existed between the two groups, in terms of the mean

μ

and the SD

σ

. The null hypothesis

H_{0}

was rejected if the test value fell outside the corresponding CI. The power rate was estimated as the proportion of replications in which the null hypothesis was rejected.
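The coverage and power criteria described above can be sketched as follows. The function names and the four toy replications are assumptions for illustration; the thresholds of 91% and 98% are those stated in the text.

```python
from statistics import mean


def coverage_rate(cis, target):
    """Proportion of CIs (lower, upper) that include the target value."""
    return sum(1 for lo, hi in cis if lo <= target <= hi) / len(cis)


def classify_coverage(rate):
    """Label a coverage rate using the 91% and 98% thresholds."""
    if rate < 0.91:
        return "undercoverage"
    if rate > 0.98:
        return "overcoverage"
    return "acceptable"


def power_rate(cis, null_value):
    """Proportion of CIs that exclude the value under the null hypothesis."""
    return sum(1 for lo, hi in cis if not lo <= null_value <= hi) / len(cis)


# Toy replications (hypothetical): estimates of mu with symmetric CIs
estimates = [0.28, 0.31, 0.30, 0.33]
cis = [(e - 0.05, e + 0.05) for e in estimates]
pseudo_true = mean(estimates)  # average estimate across replications

cov = coverage_rate(cis, pseudo_true)
power = power_rate(cis, 0.0)
```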

All analyses for this simulation study were carried out using the statistical software R (Version 4.4.1; [74]). The 2PL model was fitted using the sirt::xxirt() function from the R package sirt (Version 4.2-114; [75]). Custom functions were developed to implement robust MGM linking. Optimization for MGM linking was performed using the stats::optim() function. Replication materials for this simulation study are available at https://osf.io/tjngx (accessed on 4 May 2025).

4.2. Results

Table 1 presents the bias, the SD, and the RMSE of the estimated mean

\hat{μ}

and SD

\hat{σ}

as a function of the DIF effect size d and sample size N. It can be seen that all four MGM specifications involving different powers p in the loss function

ρ

produced unbiased estimates for

μ

and

σ

in the absence of DIF. In the presence of DIF (i.e., for

d = 0.6

), in line with findings from the literature, the square loss function

L_{2}

with

p = 2

had the largest bias, followed by

p = 1

and

p = 0.5

. Note that the bias for

p = 2

can approximately be determined as

- π d = - 0.3 \times 0.6 = - 0.18

, where

π

refers to the proportion of DIF items. Also, note that the bias for robust MGM linking for

p = 1

and

p = 0.5

vanished with increasing sample size, although the convergence to zero was particularly slow for

p = 1

. The smallest bias was obtained with the

L_{0}

loss function (

p = 0

), which also performed well in smaller samples. As DIF was not present in item discriminations, the SD

σ

was unbiased for all methods in all simulation conditions.
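The approximate bias of −πd for p = 2 can be traced with a short derivation. It is a sketch under two assumptions that go beyond the text: item parameters are treated as error-free, and separate calibration is taken to standardize θ within each group, so that the group-2 difficulties are expressed on the standardized scale. Here d_i equals d for DIF items and 0 otherwise.

```latex
\begin{align*}
\hat{b}_{i1} &= b_{i0}, \qquad
\hat{b}_{i2} = \frac{b_{i0} + d_i - \mu}{\sigma}, \qquad d_i \in \{0, d\}, \\
\hat{\mu}_{L_2} &= -\Bigl( \sigma \, \frac{1}{I} \sum_{i=1}^{I} \hat{b}_{i2}
  - \frac{1}{I} \sum_{i=1}^{I} \hat{b}_{i1} \Bigr)
  = \mu - \frac{1}{I} \sum_{i=1}^{I} d_i
  = \mu - \pi d, \\
\hat{\mu}_{L_2} - \mu &= -\pi d = -0.3 \times 0.6 = -0.18 .
\end{align*}
```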

Table 1. Simulation Study: Bias, standard deviation (SD) and root mean square error (RMSE) of the estimated mean

\hat{μ}

and SD

\hat{σ}

as a function of the DIF effect size d and sample size N.

As expected, the SD decreased with increasing sample size for all powers p for both

\hat{μ}

and

\hat{σ}

. Moreover, the SD increased as the value of p decreased.

In the no DIF condition, the RMSE was smallest for

p = 2

and increased with decreasing power p, reaching its highest value for the

L_{0}

loss function (

p = 0

). Although the

L_{0}

loss produced unbiased estimates, this indicates that it resulted in the largest variance. In contrast, under the DIF condition with

d = 0.6

, the RMSE for

\hat{μ}

was smallest for

p = 0

, as the minimal bias outweighed the increase in variance. Hence, when DIF was present, the robust loss function with

p = 0

emerged as the clear frontrunner in terms of both bias and RMSE.

Table 2 presents the coverage rates of the estimated mean

\hat{μ}

and SD

\hat{σ}

as a function of the DIF effect size d and sample size N. All six CI estimation methods performed adequately for

p = 2

, but substantial differences emerged for the robust loss functions with

p = 1

, 0.5, and 0. Achieving adequate coverage was generally more challenging in the presence of DIF than in the condition without DIF. The DM method exhibited pronounced overcoverage for

p \leq 1

, a pattern that also appeared for BNO and BPE at

p = 0.5

and

p = 0

. For

p = 0

, the WLS method showed undercoverage in some conditions, whereas it performed acceptably for

p = 1

and

p = 0.5

. The coverage rates of WLS improved with increasing sample size. Across all simulation settings, the bootstrap methods BBB and BBC delivered the most consistent performance, although undercoverage was observed in a few cells with

N = 500

.

Table 2. Simulation Study: Coverage rates of the estimated mean

\hat{μ}

and SD

\hat{σ}

as a function of the DIF effect size d and sample size N.

Table 3 presents the power rates for detecting significant group differences in

μ

(i.e.,

H_{0} : μ = 0

) and

σ

(i.e.,

H_{0} : σ = 1

). As expected, the power rates increased with larger sample sizes. However, the DM method, which exhibited substantial overcoverage, showed markedly reduced power. This underscores the importance of using the better-performing bootstrap methods, BBB and BBC, which combine adequate coverage with sufficiently high power to detect group differences.

Table 3. Simulation Study: Power rates for the statistical tests of the null hypotheses

μ = 0

and

σ = 1

as a function of the DIF effect size d and sample size N.

5. Empirical Examples

This section illustrates the differences between the CI estimation methods for robust MGM linking using powers

p = 2

, 1, 0.5, and 0, using two empirical examples based on publicly available datasets from R packages. The datasets involve two groups and contain dichotomous items without missing item responses.

5.1. Dataset `dataDIF`

The first example used the dataDIF dataset from the R package equateIRT (Version 1.0.0; [22,76]). The full dataset includes 20 dichotomous items and three groups of 1000 persons each. The dataset was originally simulated and applied in a research article devoted to the assessment of fixed DIF [77]. For illustration, only the first and the second groups were used for robust MGM linking in this example.

Table 4 presents the point and CI estimates for the dataDIF dataset. The non-robust MGM linking estimate

\hat{μ}

with

p = 2

differed slightly from the robust MGM approaches using

p = 1

,

0.5

, or 0. The CI estimates obtained by the DM method differed noticeably from those of the other CI estimation methods for

p = 0.5

, and the discrepancies became more pronounced for

p = 0

. In particular, significant group differences (i.e.,

μ

statistically differed from 0) were detected by all CI methods except DM. For

\hat{σ}

, the differences between the MGM methods were much smaller. However, as with

\hat{μ}

, for

p = 0

, the SD in the second group was significantly larger than in the first group (i.e.,

σ

statistically differed from 1) according to all the methods except DM. These results are consistent with the findings from the Simulation Study, indicating that the DM method has substantially reduced power for detecting group differences.

Table 4. Empirical example, dataset dataDIF: Point estimates and confidence interval estimates for estimated mean

\hat{μ}

and SD

\hat{σ}

.

5.2. Dataset `MathExam14W`

The second example used the MathExam14W dataset from the R package psychotools (Version 0.7-4; [78]). This dataset includes responses from 729 students to 13 dichotomous items from a written exam in introductory mathematics, along with several covariates. The grouping variable gender was used for analysis. The first group consisted of 403 male students, and the second group consisted of 326 female students.

Table 5 reports the point and CI estimates for the MathExam14W dataset. As in the first example, the robust MGM methods differed markedly from the non-robust MGM method for

\hat{μ}

. The group differences in

μ

were statistically significant based on the BBC estimation method. For

p = 0

, the DM method yielded an excessively wide CI for

\hat{μ}

. Notably, the

\hat{σ}

estimates also showed slight discrepancies between the robust and non-robust MGM methods. Again, for

p = 0

, the CI based on the DM method was extremely wide, resulting in implausible values.

Table 5. Empirical example, dataset MathExam14W: Point estimates and confidence interval estimates for estimated mean

\hat{μ}

and SD

\hat{σ}

.

6. Discussion

This article compared the performance of alternative CI estimation methods across various robust MGM linking approaches. The robust

L_{0}

loss function was preferred in practical applications involving fixed and sparse uniform DIF effects, as it outperformed the

L_{p}

loss functions with

p = 0.5

,

p = 1

, and

p = 2

in terms of bias and RMSE. CI estimates based on the delta method (DM), which relies on differentiable approximations of the

L_{p}

loss functions (

0 < p \leq 1

), exhibited highly inflated coverage rates. A modified CI estimation approach, formulated by recasting the robust minimization problem as a weighted least squares (WLS) problem, yielded acceptable coverage rates for all p values except for

p = 0

. In this case, the basic bootstrap (BBB) and bias-corrected bootstrap (BBC) methods performed satisfactorily and clearly outperformed the commonly used bootstrap approaches based on the normal approximation (BNO) and empirical percentiles (BPE).

The failure to obtain acceptable coverage rates for DM, as well as for the BNO and BPE bootstrap methods, aligns with earlier findings reported by the authors that also documented overcoverage in robust linking methods [33]. These results suggest that assessing linking errors is more challenging for robust methods compared to non-robust linking approaches. It could be speculated that the non-differentiability of the loss function

ρ

presents specific challenges for the DM and WLS methods, as well as for the bootstrap methods that do not include a bias correction term. This observation may align with the presence of finite-sample bias in the robust MGM linking parameter estimates.

The parametric bootstrap method resimulates item parameters and repeatedly applies the robust MGM linking method to compute CIs. Although this approach is clearly more computationally demanding than the DM or WLS methods, it can generally be completed within minutes in typical empirical applications. Nevertheless, the substantially increased computation time of the bootstrap may become a concern in large-scale empirical studies or extensive simulation designs.

In the Simulation Study, the distribution parameters

μ

and

σ

were held constant across simulation conditions. This is not considered a limitation, as the general patterns of bias, the RMSE, and the coverage rates are not expected to change with alternative choices of

μ

and

σ

.

The Simulation Study revealed that DM resulted in overcoverage, while WLS led to undercoverage for the loss function with $p = 0$. Preliminary evidence from additional simulation studies indicated that a weighted standard error combining DM and WLS, with slightly more weight assigned to WLS than to DM, may yield improved coverage rates. In more detail, let $A_{1}$ and $A_{2}$ denote the matrices in (17) corresponding to the DM and WLS methods, respectively, which are used to compute the variance matrix $V_{\hat{δ}}$. The proposed approach, which combines DM and WLS, constructs a weighted matrix $A_{w} = w A_{1} + (1 - w) A_{2}$ with $w \in [0, 1]$ and uses it to compute the variance matrix $V_{\hat{δ}}^{(w)} = A_{w} V_{\hat{γ}} A_{w}^{⊤}$. This alternative CI estimation method based on $V_{\hat{δ}}^{(w)}$ could provide a viable option that avoids the need for the computationally intensive bootstrap procedure.
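The weighted variance matrix can be computed in a few lines. The following sketch only illustrates the algebra $A_{w} V_{\hat{γ}} A_{w}^{⊤}$; the matrices, the weight $w = 0.4$, and the function names are hypothetical and are not taken from the article.

```python
from math import sqrt


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def weighted_variance(a_dm, a_wls, v_gamma, w):
    """Variance matrix A_w V A_w^T with A_w = w * A_1 + (1 - w) * A_2."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    a_w = [
        [w * x + (1.0 - w) * y for x, y in zip(row_dm, row_wls)]
        for row_dm, row_wls in zip(a_dm, a_wls)
    ]
    return matmul(matmul(a_w, v_gamma), transpose(a_w))


# Hypothetical inputs: two linking parameters, three item parameters.
a_dm = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]    # stands in for A_1 (DM)
a_wls = [[0.4, 0.4, 0.2], [0.2, 0.5, 0.3]]   # stands in for A_2 (WLS)
v_gamma = [[0.010, 0.0, 0.0], [0.0, 0.020, 0.0], [0.0, 0.0, 0.015]]

# w < 0.5 places slightly more weight on WLS than on DM.
v_w = weighted_variance(a_dm, a_wls, v_gamma, w=0.4)
se_w = [sqrt(v_w[i][i]) for i in range(len(v_w))]
print(se_w)
```

Because the weight enters through $A_{w}$, the resulting $V_{\hat{δ}}^{(w)}$ is a quadratic function of $w$ and is generally not the weighted average of the DM and WLS variance matrices. Setting $w = 1$ recovers the DM variance and $w = 0$ recovers the WLS variance.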

As noted by an anonymous reviewer, future research could compare different CI methods using both theoretical and simulated coverage probabilities, as presented in [79]. The resulting guidance on CI selection for small-to-moderate sample sizes would be of particular interest.

This study focused on standard error and confidence interval estimation for linking parameter estimates in robust MGM linking, accounting for the sampling variability of persons. Future research could additionally examine linking errors [80] in the robust MGM method that reflect uncertainty in linking parameter estimates arising from the randomness of DIF effects.

Future research could also examine CI assessment in the context of structural equation models (SEM; [81,82]) estimated using $L_{p}$ or $L_{0}$ loss functions [51,83,84]. In such applications, the parametric bootstrap approach may be applied to sufficient statistics involving estimated mean vectors and covariance matrices for SEM.

7. Conclusions

The main findings regarding CI estimates in robust MGM linking can be summarized as follows:

The DM method exhibited highly inflated coverage rates for the linking parameter estimates, accompanied by substantially reduced power rates.
The WLS method performed well with $L_{p}$ loss functions for $p = 1$ or $p = 0.5$, but showed notable undercoverage for $p = 0$ in small-to-moderate sample sizes.
Among the bootstrap methods, BBB and BBC—which include a bias correction term—achieved desirable coverage rates, unlike the BNO and BPE methods, which lack such correction.

Funding

This research received no external funding.

Data Availability Statement

Replication material for the Simulation Study in Section 4 can be found at https://osf.io/tjngx (accessed on 4 May 2025).

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

2PL	two-parameter logistic
BBB	basic bootstrap
BBC	bias-corrected bootstrap
BNO	bootstrap based on normal distribution
BPE	percentile bootstrap
CI	confidence interval
DIF	differential item functioning
DM	delta method
IRF	item response function
IRT	item response theory
MGM	mean-geometric mean
MML	marginal maximum likelihood
RMSE	root mean square error
SD	standard deviation
SEM	structural equation model
WLS	weighted least squares

References

1. Bock, R.D.; Gibbons, R.D. Item Response Theory; Wiley: New York, NY, USA, 2021.
2. Cai, L.; Moustaki, I. Estimation methods in latent variable models for categorical outcome variables. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 253–277.
3. Chen, Y.; Li, X.; Liu, J.; Ying, Z. Item response theory – A statistical framework for educational and psychological measurement. Stat. Sci. 2025, 40, 167–194.
4. Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154.
5. Kolen, M.J.; Brennan, R.L. Test Equating, Scaling, and Linking; Springer: New York, NY, USA, 2014.
6. González, J.; Wiberg, M. Applying Test Equating Methods. Using R; Springer: New York, NY, USA, 2017.
7. van der Linden, W.J. Unidimensional logistic response models. In Handbook of Item Response Theory, Volume 1: Models; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 11–30.
8. Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479.
9. Bock, R.D.; Aitkin, M. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika 1981, 46, 443–459.
10. Glas, C.A.W. Maximum-likelihood estimation. In Handbook of Item Response Theory, Volume 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 197–216.
11. Lee, W.C.; Lee, G. IRT linking and equating. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 639–673.
12. Sansivieri, V.; Wiberg, M.; Matteucci, M. A review of test equating methods with a special focus on IRT-based approaches. Statistica 2017, 77, 329–352.
13. Holland, P.W.; Wainer, H. (Eds.) Differential Item Functioning: Theory and Practice; Lawrence Erlbaum: Hillsdale, NJ, USA, 1993.
14. Mellenbergh, G.J. Item bias and item response theory. Int. J. Educ. Res. 1989, 13, 127–143.
15. Millsap, R.E. Statistical Approaches to Measurement Invariance; Routledge: New York, NY, USA, 2011.
16. Penfield, R.D.; Camilli, G. Differential item functioning and item bias. In Handbook of Statistics, Volume 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; 2007; pp. 125–167.
17. Wells, C.S. Assessing Measurement Invariance for Applied Research; Cambridge University Press: Cambridge, UK, 2021.
18. Robitzsch, A. A comparison of linking methods for two groups for the two-parameter logistic item response model in the presence and absence of random differential item functioning. Foundations 2021, 1, 116–144.
19. Mislevy, R.J.; Bock, R.D. BILOG 3. Item Analysis and Test Scoring with Binary Logistic Models; Software Manual; Scientific Software International: Chicago, IL, USA, 1990.
20. Haberman, S.J. Linking Parameter Estimates Derived from an Item Response Model Through Separate Calibrations; Research Report No. RR-09-40; Educational Testing Service: Princeton, NJ, USA, 2009.
21. Battauz, M. Multiple equating of separate IRT calibrations. Psychometrika 2017, 82, 610–636.
22. Battauz, M. equateIRT: An R package for IRT test equating. J. Stat. Softw. 2015, 68, 1–22.
23. van der Linden, W.J.; Barrett, M.D. Linking item response model parameters. Psychometrika 2016, 81, 650–673.
24. Robitzsch, A. Comparing robust linking and regularized estimation for linking two groups in the 1PL and 2PL models in the presence of sparse uniform differential item functioning. Stats 2023, 6, 192–208.
25. Robitzsch, A. Extensions to mean–geometric mean linking. Mathematics 2025, 13, 35.
26. Halpin, P.F. Differential item functioning via robust scaling. Psychometrika 2024, 89, 796–821.
27. He, Y.; Cui, Z.; Fang, Y.; Chen, H. Using a linear regression method to detect outliers in IRT common item equating. Appl. Psychol. Meas. 2013, 37, 522–540.
28. Jurich, D.; Liu, C. Detecting item parameter drift in small sample Rasch equating. Appl. Meas. Educ. 2023, 36, 326–339.
29. Liu, C.; Jurich, D. Outlier detection using t-test in Rasch IRT equating under NEAT design. Appl. Psychol. Meas. 2023, 47, 34–47.
30. Magis, D.; De Boeck, P. Identification of differential item functioning in multiple-group settings: A multivariate outlier detection approach. Multivar. Behav. Res. 2011, 46, 733–755.
31. Magis, D.; De Boeck, P. A robust outlier approach to prevent type I error inflation in differential item functioning. Educ. Psychol. Meas. 2012, 72, 291–311.
32. Manna, V.F.; Gu, L. Different Methods of Adjusting for form Difficulty Under the Rasch Model: Impact on Consistency of Assessment Results; Research Report No. RR-19-08; Educational Testing Service: Princeton, NJ, USA, 2019.
33. Robitzsch, A. Robust and nonrobust linking of two groups for the Rasch model with balanced and unbalanced random DIF: A comparative simulation study and the simultaneous assessment of standard errors and linking errors with resampling techniques. Symmetry 2021, 13, 2198.
34. Strobl, C.; Kopf, J.; Kohler, L.; von Oertzen, T.; Zeileis, A. Anchor point selection: Scale alignment based on an inequality criterion. Appl. Psychol. Meas. 2021, 45, 214–230.
35. Wang, W.; Liu, Y.; Liu, H. Testing differential item functioning without predefined anchor items using robust regression. J. Educ. Behav. Stat. 2022, 47, 666–692.
36. Huber, P.J.; Ronchetti, E.M. Robust Statistics; Wiley: New York, NY, USA, 2009.
37. Maronna, R.A.; Martin, R.D.; Yohai, V.J. Robust Statistics: Theory and Methods; Wiley: New York, NY, USA, 2006.
38. Wilcox, R. Modern Statistics for the Social and Behavioral Sciences: A Practical Introduction; CRC Press: Boca Raton, FL, USA, 2017.
39. Lipovetsky, S. Optimal L_p-metric for minimizing powered deviations in regression. J. Mod. Appl. Stat. Methods 2007, 6, 20.
40. Giacalone, M.; Panarello, D.; Mattera, R. Multicollinearity in regression: An efficiency comparison between L_p-norm and least squares estimators. Qual. Quant. 2018, 52, 1831–1859.
41. Robitzsch, A. L_p loss functions in invariance alignment and Haberman linking with few or many groups. Stats 2020, 3, 246–283.
42. Asparouhov, T.; Muthén, B. Multiple-group factor analysis alignment. Struct. Equ. Modeling 2014, 21, 495–508.
43. Muthén, B.; Asparouhov, T. IRT studies of many groups: The alignment method. Front. Psychol. 2014, 5, 978.
44. Robitzsch, A. Examining differences of invariance alignment in the Mplus software and the R package sirt. Mathematics 2024, 12, 770.
45. Robitzsch, A. Comparing robust Haberman linking and invariance alignment. Stats 2025, 8, 3.
46. Oelker, M.R.; Pößnecker, W.; Tutz, G. Selection and fusion of categorical predictors with L₀-type penalties. Stat. Model. 2015, 15, 389–410.
47. Oelker, M.R.; Tutz, G. A uniform framework for the combination of penalties in generalized structured models. Adv. Data Anal. Classif. 2017, 11, 97–120.
48. Xiang, J.; Yue, H.; Yin, X.; Wang, L. A new smoothed l₀ regularization approach for sparse signal recovery. Math. Probl. Eng. 2019, 2019, 1978154.
49. Wang, L.; Yin, X.; Yue, H.; Xiang, J. A regularized weighted smoothed L₀ norm minimization method for underdetermined blind source separation. Sensors 2018, 18, 4260.
50. O’Neill, M.; Burke, K. Variable selection using a smooth information criterion for distributional regression models. Stat. Comput. 2023, 33, 71.
51. Robitzsch, A. L₀ and L_p loss functions in model-robust estimation of structural equation models. Psych 2023, 5, 1122–1139.
52. Jaeckel, L.A. Robust estimates of location: Symmetry and asymmetric contamination. Ann. Math. Stat. 1971, 42, 1020–1034.
53. Robitzsch, A. Computational aspects of L₀ linking in the Rasch model. Algorithms 2025, 18, 213.
54. Ogasawara, H. Standard errors of item response theory equating/linking by response function methods. Appl. Psychol. Meas. 2001, 25, 53–67.
55. Ogasawara, H. Item response theory true score equatings and their standard errors. J. Educ. Behav. Stat. 2001, 26, 31–50.
56. Ogasawara, H. Applications of asymptotic expansion in item response theory linking. In Statistical Models for Test Equating, Scaling, and Linking; von Davier, A., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 261–280.
57. Battauz, M. IRT test equating in complex linkage plans. Psychometrika 2013, 78, 464–480.
58. Battauz, M. Factors affecting the variability of IRT equating coefficients. Stat. Neerl. 2015, 69, 85–101.
59. Andersson, B. Asymptotic variance of linking coefficient estimators for polytomous IRT models. Appl. Psychol. Meas. 2018, 42, 192–205.
60. Zhang, Z. Asymptotic standard errors of equating coefficients using the characteristic curve methods for the graded response model. Appl. Meas. Educ. 2020, 33, 309–330.
61. Zhang, Z. Asymptotic standard errors of parameter scale transformation coefficients in test equating under the nominal response model. Appl. Psychol. Meas. 2021, 45, 134–138.
62. Zhang, Z. Asymptotic standard errors of generalized partial credit model true score equating using characteristic curve methods. Appl. Psychol. Meas. 2021, 45, 331–345.
63. Jewsbury, P.A. Error Variance in Common Population Linking Bridge Studies; Research Report No. RR-19-42; Educational Testing Service: Princeton, NJ, USA, 2019.
64. Jewsbury, P.A. Generally applicable variance estimation methods for common-population linking. J. Educ. Behav. Stat. 2024.
65. Jewsbury, P.A. Linking error on achievement levels accounting for dependencies and complex sampling. J. Educ. Meas. 2025; epub ahead of print.
66. Robitzsch, A. Estimation of standard error, linking error, and total error for robust and nonrobust linking methods in the two-parameter logistic model. Stats 2024, 7, 592–612.
67. Fox, J. Applied Regression Analysis and Generalized Linear Models; Sage: Thousand Oaks, CA, USA, 2016; Available online: https://bit.ly/38XUSX1 (accessed on 4 May 2025).
68. Fox, J.; Weisberg, S. Robust Regression in R: An Appendix to an R Companion to Applied Regression, 2nd ed.; Sage: Thousand Oaks, CA, USA, 2010; Available online: https://bit.ly/3canwcw (accessed on 4 May 2025).
69. Chen, Y.; Li, C.; Ouyang, J.; Xu, G. DIF statistical inference without knowing anchoring items. Psychometrika 2023, 88, 1097–1122.
70. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; CRC Press: Boca Raton, FL, USA, 1994.
71. Davison, A.C.; Hinkley, D.V. Bootstrap Methods and Their Application; Cambridge University Press: Cambridge, UK, 1997.
72. Lietz, P.; Cresswell, J.C.; Rust, K.F.; Adams, R.J. (Eds.) Implementation of Large-Scale Education Assessments; Wiley: New York, NY, USA, 2017.
73. Muthén, L.K.; Muthén, B.O. How to use a Monte Carlo study to decide on sample size and determine power. Struct. Equ. Modeling 2002, 9, 599–620.
74. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria, 2024. Available online: https://www.R-project.org (accessed on 15 June 2024).
75. Robitzsch, A. sirt: Supplementary Item Response Theory Models. R Package Version 4.2-114. 2025. Available online: https://github.com/alexanderrobitzsch/sirt (accessed on 7 April 2025).
76. Battauz, M. equateMultiple: Equating of Multiple Forms. R Package Version 1.0.0. 2024. Available online: https://cran.r-project.org/web/packages/equateMultiple/index.html (accessed on 7 April 2025).
77. Battauz, M. On Wald tests for differential item functioning detection. Stat. Methods Appl. 2019, 28, 103–118.
78. Zeileis, A.; Strobl, C.; Wickelmaier, F.; Komboz, B.; Kopf, J.; Schneider, L.; Debelak, R. psychotools: Psychometric Modeling Infrastructure. R Package Version 0.7-4. 2024. Available online: https://cran.r-project.org/web/packages/psychotools/index.html (accessed on 7 April 2025).
79. Fitts, D.A. Expected and empirical coverages of different methods for generating noncentral t confidence intervals for a standardized mean difference. Behav. Res. Methods 2021, 53, 2412–2429.
80. Robitzsch, A. Linking error in the 2PL model. J 2023, 6, 58–84.
81. Bollen, K.A. Structural Equations with Latent Variables; Wiley: New York, NY, USA, 1989.
82. Yuan, K.H.; Bentler, P.M. Structural Equation Modeling with Robust Covarianc. Available online: https://www3.nd.edu/~kyuan/courses/sem/readpapers/Yuan-Bentler-SM98.pdf (accessed on 7 April 2025).
83. Siemsen, E.; Bollen, K.A. Least absolute deviation estimation in structural equation modeling. Sociol. Methods Res. 2007, 36, 227–265.
84. van Kesteren, E.J.; Oberski, D.L. Flexible extensions to structural equation models using computation graphs. Struct. Equ. Modeling 2022, 29, 233–247.

Table 1. Simulation Study: Bias, standard deviation (SD) and root mean square error (RMSE) of the estimated mean $\hat{μ}$ and SD $\hat{σ}$ as a function of the DIF effect size d and sample size N.

			Bias, $p =$				SD, $p =$				RMSE, $p =$
Par	d	N	2	1	0.5	0	2	1	0.5	0	2	1	0.5	0
$\hat{μ}$	0	500	0.005	0.002	0.001	−0.001	0.085	0.086	0.092	0.100	0.085	0.086	0.092	0.100
		1000	0.002	0.001	0.000	−0.001	0.059	0.059	0.064	0.070	0.059	0.059	0.064	0.070
		2000	0.002	0.002	0.002	0.002	0.043	0.043	0.045	0.047	0.043	0.043	0.045	0.047
		4000	0.000	0.000	0.000	0.000	0.030	0.031	0.031	0.032	0.030	0.031	0.031	0.032
		10,000	0.000	0.000	0.000	0.000	0.019	0.019	0.019	0.019	0.019	0.019	0.019	0.019
	0.6	500	−0.178	−0.115	−0.075	−0.015	0.084	0.099	0.113	0.123	0.197	0.152	0.136	0.124
		1000	−0.180	−0.083	−0.044	−0.005	0.058	0.069	0.077	0.083	0.189	0.108	0.089	0.083
		2000	−0.179	−0.059	−0.026	0.000	0.041	0.048	0.051	0.054	0.183	0.076	0.057	0.054
		4000	−0.180	−0.045	−0.016	−0.001	0.029	0.033	0.035	0.035	0.182	0.055	0.038	0.035
		10,000	−0.180	−0.032	−0.010	−0.001	0.018	0.021	0.021	0.021	0.181	0.038	0.024	0.021
$\hat{σ}$	0	500	0.005	0.004	0.004	0.003	0.078	0.084	0.094	0.108	0.078	0.084	0.094	0.108
		1000	0.003	0.002	0.002	0.002	0.054	0.058	0.065	0.075	0.054	0.058	0.065	0.075
		2000	0.002	0.002	0.003	0.003	0.038	0.040	0.044	0.048	0.038	0.040	0.044	0.048
		4000	0.001	0.001	0.001	0.001	0.026	0.027	0.029	0.030	0.026	0.027	0.029	0.030
		10,000	0.000	0.000	0.000	0.000	0.017	0.017	0.018	0.018	0.017	0.017	0.018	0.018
	0.6	500	0.004	0.003	0.004	0.004	0.076	0.082	0.092	0.107	0.076	0.082	0.092	0.107
		1000	0.002	0.001	0.001	0.001	0.053	0.057	0.063	0.073	0.053	0.057	0.063	0.073
		2000	0.002	0.002	0.002	0.002	0.037	0.039	0.042	0.046	0.037	0.039	0.042	0.046
		4000	0.001	0.001	0.001	0.001	0.027	0.028	0.029	0.030	0.027	0.028	0.029	0.030
		10,000	0.000	0.000	0.000	0.000	0.017	0.017	0.018	0.018	0.017	0.017	0.018	0.018

Note. Par = parameter; p = used power in the loss function $ρ$.
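As a reading aid for Table 1, the three reported quantities are linked by the usual decomposition of the mean squared error, assuming the standard definitions of bias, SD, and RMSE across replications. The check below uses two rows of the table for $\hat{μ}$; small discrepancies can arise in other rows because the table entries are rounded.

```latex
\mathrm{RMSE} = \sqrt{\mathrm{Bias}^{2} + \mathrm{SD}^{2}}

% p = 2, d = 0.6, N = 500
\sqrt{(-0.178)^{2} + 0.084^{2}} = \sqrt{0.031684 + 0.007056} \approx 0.197

% p = 1, d = 0.6, N = 1000
\sqrt{(-0.083)^{2} + 0.069^{2}} = \sqrt{0.006889 + 0.004761} \approx 0.108
```

Both values agree with the tabulated RMSE. For $p = 2$ and $d = 0.6$, the RMSE is dominated by the bias term, which is why it barely decreases with increasing N.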

Table 2. Simulation Study: Coverage rates of the estimated mean $\hat{μ}$ and SD $\hat{σ}$ as a function of the DIF effect size d and sample size N.

			$\hat{μ}$						$\hat{σ}$
p	d	N	DM	WLS	BNO	BPE	BBB	BBC	DM	WLS	BNO	BPE	BBB	BBC
2	0	500	95.1	95.1	95.2	94.9	95.3	95.0	94.2	94.2	94.9	94.5	94.6	95.1
		1000	95.5	95.5	95.7	95.5	95.4	95.8	94.7	94.7	95.1	94.8	94.7	95.0
		2000	94.4	94.4	94.3	94.2	94.1	94.0	95.1	95.1	95.4	95.0	95.1	95.1
		4000	94.5	94.5	94.5	94.4	94.3	94.4	95.3	95.3	95.2	94.9	95.2	95.1
		10,000	95.6	95.6	95.7	95.4	95.5	95.1	95.2	95.2	95.4	95.1	95.2	95.0
	0.6	500	94.5	94.5	94.5	94.1	94.5	94.2	94.9	94.9	95.4	95.1	95.4	95.6
		1000	94.9	94.9	94.9	94.9	95.0	94.6	95.1	95.1	95.4	95.0	95.0	95.4
		2000	95.2	95.2	95.2	95.0	95.1	94.9	95.6	95.6	95.6	95.4	95.4	95.4
		4000	95.5	95.5	95.5	95.3	95.5	95.3	94.9	94.9	95.0	95.0	94.6	95.0
		10,000	95.1	95.1	94.9	94.9	94.8	95.0	94.7	94.7	94.9	94.5	94.6	94.6
1	0	500	99.1	95.9	96.4	96.9	94.2	93.8	98.9	95.1	96.0	97.3	92.4	92.6
		1000	98.7	96.0	96.6	97.0	94.9	94.3	98.9	95.6	96.4	97.3	93.7	93.5
		2000	97.6	94.9	95.6	95.9	94.1	94.2	98.9	95.5	96.8	97.5	94.7	94.6
		4000	96.8	94.7	95.5	95.5	94.2	94.3	98.1	95.6	96.7	96.9	95.1	95.1
		10,000	96.5	95.6	96.0	96.3	95.0	95.1	97.3	95.2	96.4	96.7	95.0	94.9
	0.6	500	99.0	94.5	95.4	95.2	90.0	90.0	99.4	96.2	97.0	97.7	93.4	93.9
		1000	98.9	94.5	96.3	95.3	91.7	91.5	99.0	95.2	96.4	97.4	93.8	94.3
		2000	98.6	94.8	96.8	95.5	92.2	92.0	98.7	95.8	97.1	97.6	94.9	95.0
		4000	98.0	94.8	96.6	94.7	92.6	92.5	97.9	95.0	96.3	96.9	94.6	94.8
		10,000	96.5	94.2	96.0	94.1	92.8	92.6	96.4	94.5	95.5	95.8	94.5	94.7
0.5	0	500	99.8	96.4	97.2	98.7	92.9	92.9	99.8	95.3	97.5	99.2	91.1	92.4
		1000	99.9	95.9	97.8	98.9	93.4	93.5	99.6	94.5	97.6	99.2	91.5	92.1
		2000	99.3	94.6	97.2	97.9	93.3	93.5	99.6	94.8	98.4	99.3	93.4	94.0
		4000	98.6	94.8	96.8	97.3	94.2	94.3	99.3	94.7	98.1	98.8	94.9	95.0
		10,000	97.5	95.4	96.7	97.1	95.2	95.0	98.5	94.7	97.6	98.1	95.0	95.2
	0.6	500	99.7	94.7	97.2	96.9	88.3	91.9	99.8	95.8	98.2	99.3	91.5	92.9
		1000	99.7	94.5	98.1	97.9	91.3	92.3	99.8	95.0	97.6	99.3	92.6	93.4
		2000	99.3	94.7	97.9	98.2	93.7	93.9	99.5	94.9	98.3	98.9	94.1	94.7
		4000	99.0	94.1	97.6	97.7	94.3	94.1	99.4	94.8	97.8	98.7	94.7	94.4
		10,000	97.7	94.4	96.8	96.7	94.5	94.7	98.1	94.1	96.9	97.2	95.4	95.3
0	0	500	99.9	92.4	98.1	99.4	93.4	95.9	99.6	87.0	99.0	99.8	92.8	96.9
		1000	99.9	92.8	99.0	99.8	94.2	95.6	99.6	87.4	99.0	99.8	93.2	96.2
		2000	100	94.1	98.6	99.2	95.4	95.9	99.9	91.1	99.5	99.9	96.3	97.1
		4000	100	96.0	97.8	98.4	95.9	96.2	100	94.9	99.2	99.6	97.1	97.3
		10,000	99.9	97.6	96.7	97.0	96.0	95.9	100	97.6	97.9	98.1	97.1	97.1
	0.6	500	99.6	88.0	98.7	98.6	94.5	97.9	99.6	86.7	99.1	99.7	92.4	97.2
		1000	99.7	88.8	99.3	99.5	95.2	96.7	99.8	88.5	99.1	99.9	93.9	96.7
		2000	99.8	91.6	99.0	99.5	95.4	96.6	99.7	91.6	99.4	99.7	96.2	97.4
		4000	100	94.1	98.7	99.2	96.3	96.8	99.9	94.9	98.8	99.4	97.0	97.4
		10,000	100	96.3	97.0	97.2	95.9	95.9	99.9	97.0	97.2	97.5	96.3	96.2

Note. p = used power in the loss function ρ. DM = delta method; WLS = weighted least squares; BNO = bootstrap CI based on normal distribution; BPE = percentile bootstrap CI; BBB = basic bootstrap CI; BBC = bias-corrected bootstrap CI; Coverage rates smaller than 91.0 or larger than 98.0 are printed in bold font.

Table 3. Simulation Study: Power rates for the statistical tests of the null hypotheses $μ = 0$ and $σ = 1$ as a function of the DIF effect size d and sample size N.

			Test of H₀: μ = 0						Test of H₀: σ = 1
p	d	N	DM	WLS	BNO	BPE	BBB	BBC	DM	WLS	BNO	BPE	BBB	BBC
2	0	500	95.6	95.6	95.4	96.0	95.0	95.9	78.9	78.9	76.6	82.1	71.2	80.7
		1000	99.9	99.9	99.9	99.9	99.9	99.9	97.9	97.9	97.7	98.2	97.3	98.0
		2000	100	100	100	100	100	100	100	100	100	100	100	100
		4000	100	100	100	100	100	100	100	100	100	100	100	100
		10,000	100	100	100	100	100	100	100	100	100	100	100	100
	0.6	500	30.3	30.3	29.9	31.2	29.3	30.4	79.9	79.9	77.5	83.0	71.8	81.9
		1000	54.4	54.4	54.6	55.4	53.5	54.8	98.2	98.2	98.0	98.4	97.5	98.4
		2000	84.1	84.1	84.0	84.4	83.8	84.3	100	100	100	100	100	100
		4000	98.5	98.5	98.4	98.5	98.4	98.4	100	100	100	100	100	100
		10,000	100	100	100	100	100	100	100	100	100	100	100	100
1	0	500	80.3	94.1	93.7	95.1	91.6	93.0	30.9	68.2	64.6	73.6	56.5	70.1
		1000	99.5	100	100	100	99.9	99.9	82.7	96.2	94.6	96.9	89.9	94.4
		2000	100	100	100	100	100	100	99.9	100	100	100	99.9	100
		4000	100	100	100	100	100	100	100	100	100	100	100	100
		10,000	100	100	100	100	100	100	100	100	100	100	100	100
	0.6	500	19.9	48.4	44.0	25.8	59.8	60.5	30.5	68.4	64.9	73.1	57.4	69.7
		1000	71.1	89.0	86.0	74.2	91.3	91.0	83.8	96.1	94.6	97.0	90.5	94.1
		2000	99.1	99.9	99.7	99.6	99.9	99.9	99.9	99.9	99.9	100	99.9	99.9
		4000	100	100	100	100	100	100	100	100	100	100	100	100
		10,000	100	100	100	100	100	100	100	100	100	100	100	100
0.5	0	500	34.2	89.5	86.5	91.0	78.9	84.5	6.1	55.5	44.6	55.5	40.4	54.5
		1000	81.8	99.9	99.6	99.9	98.6	99.0	36.5	91.4	82.7	91.1	72.4	83.0
		2000	99.0	100	100	100	100	100	87.2	99.9	99.6	100	97.4	99.2
		4000	100	100	100	100	100	100	99.8	100	100	100	100	100
		10,000	100	100	100	100	100	100	100	100	100	100	100	100
	0.6	500	10.1	55.0	41.9	16.1	62.4	56.9	5.8	56.6	45.3	55.2	41.1	55.9
		1000	47.9	91.9	85.4	72.8	89.1	86.1	36.3	91.8	83.2	90.8	72.2	84.1
		2000	91.2	99.9	99.6	99.7	99.6	99.4	88.9	99.9	99.6	99.9	97.5	99.2
		4000	99.7	100	100	100	100	100	99.8	100	100	100	100	100
		10,000	100	100	100	100	100	100	100	100	100	100	100	100
0	0	500	8.5	90.8	73.3	81.2	65.6	75.0	4.3	66.9	27.4	33.0	29.0	41.8
		1000	21.0	99.7	97.3	99.2	91.3	96.1	11.7	92.2	59.5	72.9	52.6	67.5
		2000	35.3	100	100	100	99.9	100	23.8	99.9	95.8	99.3	89.4	96.4
		4000	54.7	100	100	100	100	100	45.9	100	100	100	99.9	100
		10,000	83.0	100	100	100	100	100	78.1	100	100	100	100	100
	0.6	500	7.3	83.0	30.6	9.3	60.8	31.8	3.8	67.8	26.8	33.9	29.3	42.7
		1000	17.3	97.9	85.4	80.0	83.1	78.0	11.3	92.4	60.0	73.7	52.8	68.3
		2000	33.1	100	99.7	99.9	98.7	99.0	24.8	99.9	96.0	99.5	90.4	96.5
		4000	48.5	100	100	100	100	100	48.7	100	100	100	99.9	100
		10,000	74.0	100	100	100	100	100	77.8	100	100	100	100	100

Note. p = used power in the loss function ρ. DM = delta method; WLS = weighted least squares; BNO = bootstrap CI based on normal distribution; BPE = percentile bootstrap CI; BBB = basic bootstrap CI; BBC = bias-corrected bootstrap CI; Power rates smaller than 80.0 are printed in bold font.
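The relation between the coverage rates in Table 2 and the power rates in Table 3 can be illustrated with a small sketch. It assumes that coverage is the percentage of replications whose CI contains the true parameter value and that a null hypothesis is rejected whenever the CI excludes the hypothesized value; these are common conventions and are not confirmed details of the Simulation Study. The intervals, the true value 0.3, and the function names are hypothetical.

```python
def coverage_rate(intervals, true_value):
    """Percentage of intervals that contain the true parameter value."""
    hits = sum(lo <= true_value <= hi for lo, hi in intervals)
    return 100.0 * hits / len(intervals)


def rejection_rate(intervals, null_value):
    """Percentage of intervals that exclude the hypothesized value."""
    rejections = sum(not (lo <= null_value <= hi) for lo, hi in intervals)
    return 100.0 * rejections / len(intervals)


# Hypothetical CIs for a mean parameter from five replications.
intervals = [(0.10, 0.50), (0.22, 0.61), (-0.05, 0.38), (0.31, 0.72), (0.15, 0.55)]

coverage = coverage_rate(intervals, true_value=0.3)   # assumed true value
power = rejection_rate(intervals, null_value=0.0)     # test of H0: mean = 0
print(coverage, power)
```

Under these conventions, widening every interval can only increase coverage and decrease the rejection rate, which matches the pattern of inflated coverage and reduced power reported for DM.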

Table 4. Empirical example, dataset dataDIF: Point estimates and confidence interval estimates for estimated mean $\hat{μ}$ and SD $\hat{σ}$.

Par	p	Est	DM	WLS	BNO	BPE	BBB	BBC
$\hat{μ}$	2	0.43	[0.30, 0.55]	[0.30, 0.55]	[0.31, 0.55]	[0.30, 0.54]	[0.31, 0.55]	[0.31, 0.55]
	1	0.49	[0.34, 0.64]	[0.36, 0.62]	[0.37, 0.61]	[0.34, 0.58]	[0.40, 0.64]	[0.40, 0.63]
	0.5	0.50	[0.34, 0.66]	[0.38, 0.63]	[0.38, 0.63]	[0.35, 0.60]	[0.41, 0.66]	[0.41, 0.66]
	0	0.51	[−0.05, 1.08]	[0.37, 0.65]	[0.36, 0.66]	[0.34, 0.65]	[0.37, 0.68]	[0.39, 0.68]
$\hat{σ}$	2	1.14	[1.04, 1.24]	[1.04, 1.24]	[1.04, 1.24]	[1.05, 1.25]	[1.03, 1.23]	[1.05, 1.24]
	1	1.17	[1.03, 1.30]	[1.06, 1.28]	[1.06, 1.28]	[1.05, 1.26]	[1.07, 1.28]	[1.08, 1.31]
	0.5	1.18	[0.99, 1.37]	[1.06, 1.30]	[1.06, 1.30]	[1.04, 1.27]	[1.08, 1.32]	[1.08, 1.34]
	0	1.18	[0.49, 1.87]	[1.06, 1.30]	[1.01, 1.35]	[1.00, 1.34]	[1.02, 1.36]	[1.02, 1.37]

Note. Par = parameter; p = used power in the loss function $ρ$. Est = point estimate; DM = delta method; WLS = weighted least squares; BNO = bootstrap CI based on normal distribution; BPE = percentile bootstrap CI; BBB = basic bootstrap CI; BBC = bias-corrected bootstrap CI.

Table 5. Empirical example, dataset MathExam14W: Point estimates and confidence interval estimates for estimated mean $\hat{μ}$ and SD $\hat{σ}$.

Par	p	Est	DM	WLS	BNO	BPE	BBB	BBC
$\hat{μ}$	2	0.44	[0.17, 0.72]	[0.17, 0.72]	[0.16, 0.73]	[0.16, 0.73]	[0.16, 0.73]	[0.17, 0.74]
	1	0.30	[0.04, 0.57]	[0.08, 0.52]	[0.07, 0.54]	[0.12, 0.58]	[0.03, 0.49]	[0.04, 0.50]
	0.5	0.28	[−0.10, 0.67]	[0.06, 0.50]	[0.03, 0.53]	[0.08, 0.59]	[−0.03, 0.48]	[0.00, 0.47]
	0	0.28	[−14.10, 14.65]	[0.06, 0.49]	[−0.01, 0.56]	[0.03, 0.58]	[−0.03, 0.52]	[0.01, 0.56]
$\hat{σ}$	2	1.28	[1.06, 1.51]	[1.06, 1.51]	[1.04, 1.52]	[1.08, 1.55]	[1.02, 1.49]	[1.07, 1.54]
	1	1.17	[0.77, 1.58]	[0.94, 1.41]	[0.94, 1.41]	[0.99, 1.47]	[0.88, 1.36]	[0.94, 1.36]
	0.5	1.14	[0.27, 2.01]	[0.89, 1.39]	[0.87, 1.41]	[0.95, 1.48]	[0.80, 1.33]	[0.85, 1.35]
	0	1.12	[−0.93, 3.17]	[0.92, 1.32]	[0.78, 1.46]	[0.89, 1.53]	[0.71, 1.36]	[0.87, 1.46]

Note. Par = parameter; p = used power in the loss function $ρ$. Est = point estimate; DM = delta method; WLS = weighted least squares; BNO = bootstrap CI based on normal distribution; BPE = percentile bootstrap CI; BBB = basic bootstrap CI; BBC = bias-corrected bootstrap CI.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
