Information-Weighted and Normal Density-Weighted Haebara Linking

Alexander Robitzsch

doi:10.3390/info16040273

Abstract

Linking methods based on item response theory aim to place item parameters from different groups, test forms, or test administrations onto a common scale. Information-weighted Haebara linking has been proposed as an alternative to the widely used standard Haebara linking method. This study compares its performance against Haebara linking with weights based on the normal distribution. Simulations using the two-parameter logistic model indicate that while information-weighted Haebara linking outperforms the uniformly weighted variant, it does not surpass normal density-weighted Haebara linking in terms of bias and root mean square error. The results suggest that normal density weights should be preferred over both uniform and information-based approaches. Additionally, standard errors were derived for both methods, yielding satisfactory coverage rates.

Keywords:

linking; Haebara linking; 2PL model; item information function

1. Introduction

Item response theory (IRT) models [1,2] provide a statistical framework for analyzing multivariate binary (dichotomous) [3], polytomous [4], or continuous [5,6] data, commonly used in educational and psychological assessments. However, this article is restricted to the case of dichotomous data. Let

X = (X_{1}, \dots, X_{I})

represent a vector of I binary random variables (

X_{i} \in \{0, 1\}

), typically referred to as items or (scored) item responses in psychometrics. A unidimensional IRT model [3,7] defines the probability distribution (

P (X = x)

) for

x = (x_{1}, \dots, x_{I}) \in \{0, 1\}^{I}

as follows:

P (X = x; δ, γ) = \int \prod_{i = 1}^{I} [{P_{i} (θ; γ_{i})}^{x_{i}} {(1 - P_{i} (θ; γ_{i}))}^{1 - x_{i}}] ϕ (θ; μ, σ) d θ,

(1)

where

ϕ

is the normal density function with a mean of

μ

and standard deviation (SD) of

σ

. The latent variable (

θ

), often termed ability, trait, or factor in the psychometric literature, follows a distribution characterized by parameters of

δ = (μ, σ)

. The vector expressed as

γ = (γ_{1}, \dots, γ_{I})

contains the item parameters, which define the item response functions (IRFs) as

P_{i} (θ; γ_{i}) = P (X_{i} = 1 | θ)

. The IRF characterizes the relationship between the item response (

X_{i}

) and the latent variable (

θ

). Typically, the IRF is assumed to be monotonic, meaning that the probability of correctly answering item i increases as the ability (

θ

) increases.

A widely used IRT model for dichotomous responses is the two-parameter logistic (2PL) model [8], which specifies the IRF as

P_{i} (θ; γ_{i}) = Ψ (a_{i} (θ - b_{i})),

(2)

where

a_{i}

and

b_{i}

represent the item discrimination and difficulty parameters, respectively, and

Ψ (x) = {(1 + exp (- x))}^{- 1}

denotes the logistic link function.

Given N independent and identically distributed observations (

x_{1}, \dots, x_{N}

) from the distribution of

X

, the model parameters in Equation (1) are estimated using marginal maximum likelihood estimation (MML; [9,10,11]).
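
The marginal probability in Equation (1) has no closed form, so it is evaluated numerically in practice. The following Python sketch approximates the integral for the 2PL model of Equation (2) on an equally spaced θ grid. The function names, the 61-node grid, and the three illustrative items are assumptions made for this example; they are not taken from the article or from any IRT package.

```python
import math
from itertools import product


def logistic(x):
    """Logistic function Psi(x) = 1 / (1 + exp(-x)), computed without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def marginal_prob(x, a, b, mu=0.0, sigma=1.0, n_nodes=61, half_width=6.0):
    """Approximate P(X = x) from Equation (1) under the 2PL model.

    The integral over theta is replaced by a weighted sum over an equally
    spaced grid covering mu +/- half_width * sigma (an assumed quadrature rule).
    """
    nodes = [mu + sigma * (-half_width + 2.0 * half_width * t / (n_nodes - 1))
             for t in range(n_nodes)]
    dens = [math.exp(-0.5 * ((th - mu) / sigma) ** 2) for th in nodes]
    total = sum(dens)
    prob = 0.0
    for th, d in zip(nodes, dens):
        like = 1.0
        for xi, ai, bi in zip(x, a, b):
            p = logistic(ai * (th - bi))        # IRF of Equation (2)
            like *= p if xi == 1 else 1.0 - p   # conditional independence of items
        prob += (d / total) * like              # normalized normal density weight
    return prob


# Three hypothetical items (values chosen only for illustration).
a = [0.6, 1.2, 0.6]
b = [-0.7, 0.0, 0.7]
pattern_probs = {x: marginal_prob(x, a, b) for x in product([0, 1], repeat=3)}
total_prob = sum(pattern_probs.values())  # probabilities of all 8 patterns
```

MML estimation would then maximize the sum of the logarithms of these marginal probabilities over the N observed response patterns with respect to the item parameters.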

As noted by an anonymous reviewer, the IRT model defined in Equation (1) is specifically formulated for MML estimation. The multivariate distribution (

P (X = x)

) is represented by the IRT model, which incorporates a latent variable (

θ

) and assumes the conditional independence of items. In Equation (1), the latent variable (

θ

) is treated as a random effect and integrated out. Alternatively,

θ

could be treated as a fixed effect (see also [12]), allowing the IRT model to be estimated using joint or conditional maximum likelihood techniques [2,13].

In educational assessments, IRT models are often used to compare the distributions of two groups in a test based on the latent variable (

θ

) in Equation (1). Linking methods [14,15,16,17] based on IRT are statistical techniques for placing item parameters from different groups, test forms, or administrations onto a common scale. Typically, the groups are nonequivalent, meaning their distributions may differ. However, a set of common items is administered across groups, enabling the linking of item parameters onto a common scale. To facilitate the comparison of the groups, we first estimate the 2PL model separately for each group, ensuring that all group differences are reflected in the item parameters. In a subsequent step, these estimated item parameters are transformed to quantify differences in the mean (

μ

) and SD (

σ

) between the two groups. Popular linking methods based on the 2PL model include mean-geometric mean linking [15], Haebara linking [18], and Stocking–Lord linking [19].

Haebara linking [18], also known as item characteristic curve linking [15], is widely used in the psychometric literature and has been applied in numerous empirical studies [20,21]. According to Google Scholar, the original reference by Haebara (1980, Jpn. Psychol. Res., [18]) has been cited 453 times (as of 24 March 2025). A variant of Haebara linking, information-weighted Haebara linking, was recently introduced by Wang et al. [22,23]. Their findings indicated that this approach outperforms ordinary Haebara linking in terms of bias and root mean square error. However, their comparison was limited to Haebara linking with uniform weights. Other implementations incorporate weights based on the standard normal density [18,24]. The present study examines whether the advantages reported for information-weighted Haebara linking persist when compared to alternative normal density-based Haebara linking specifications. The results indicate that information-weighted Haebara linking, in fact, exhibits inferior performance in these comparisons. Additionally, standard errors for information-weighted Haebara linking are derived, addressing a gap in the original work of Wang et al. [22,23].

The remainder of this article is structured as follows. Section 2 reviews the different specifications of Haebara linking. Section 3 outlines the research objectives addressed in the simulation study. Findings from the simulation study comparing these specifications are presented in Section 4. Finally, Section 5 provides a discussion of the results.

2. Haebara Linking

The general notation of linking methods [14,15] is introduced first, followed by its specialization for Haebara linking. Let

δ = (μ, σ)

be the linking parameter of interest. A linking method takes, as input, the group-wise estimated item parameters of the 2PL model—specifically, the vectors of item discriminations (

{\hat{a}}_{g} = ({\hat{a}}_{1 g}, \dots, {\hat{a}}_{I g})

) and the vectors of item difficulties (

{\hat{b}}_{g} = ({\hat{b}}_{1 g}, \dots, {\hat{b}}_{I g})

) for

g = 1, 2

. These item parameters are collected into a vector (

\hat{γ} = ({\hat{γ}}_{1}, \dots, {\hat{γ}}_{I})

, where

{\hat{γ}}_{i} = ({\hat{a}}_{i 1}, {\hat{b}}_{i 1}, {\hat{a}}_{i 2}, {\hat{b}}_{i 2})

contains the estimated parameters for item i).

A linking function (

H = H (δ, γ)

) is chosen such that the linking parameter estimate (

\hat{δ} = (\hat{μ}, \hat{σ})

) is obtained by minimizing this function based on estimated item parameters (

\hat{γ}

); that is,

\hat{δ} = \underset{δ}{arg min} H (δ, \hat{γ}) .

(3)

It is often required that H have second-order partial derivatives with respect to

δ

and

γ

. In this case, due to Equation (3), the linking parameter estimate (

\hat{δ}

) satisfies

H_{δ} (\hat{δ}, \hat{γ}) = 0,

(4)

where

H_{δ}

denotes the partial derivative of H with respect to

δ

. If the true item parameters (

γ

) are invariant across groups, the estimating Equation (4) in the case of an infinite sample size is given by

H_{δ} (δ, γ) = 0 .

(5)

This property guarantees that a linking method provides consistent linking parameter estimates.

The next sections discuss normal density-weighted and information-weighted linking methods as particular linking approaches.

2.1. Haebara Linking Weighted by Normal Density

Haebara linking defines the linking parameter estimate (

\hat{δ}

) by minimizing a weighted squared distance between the aligned item response functions (IRFs) of two groups. The original Haebara linking, as proposed in [18], uses the following linking function:

H (δ, \hat{γ}) = \sum_{i = 1}^{I} \sum_{t = 1}^{T} ω_{t} D_{i t} (δ, {\hat{γ}}_{i}) with D_{i t} (δ, {\hat{γ}}_{i}) = {[Ψ ({\hat{a}}_{i 1} (σ θ_{t} + μ - {\hat{b}}_{i 1})) - Ψ ({\hat{a}}_{i 2} (θ_{t} - {\hat{b}}_{i 2}))]}^{2},

(6)

where

θ_{1}, \dots, θ_{T}

is a discrete set of

θ

points (e.g.,

T = 101

equidistant points on the interval of

[- 6, 6]

), and

ω_{t}

represents the user-defined weights. The original reference [18] used empirical frequencies of

θ

estimates to define these weights. Later work, such as that reported in [24,25], utilized the density of the standard normal distribution to define the weights (

ω_{t}

) in Equation (6). This normal density approach closely resembles Haebara’s original proposal for long tests, though it differs in the case of short tests because the variance of individual

θ

estimates is larger than that of the true

θ

distribution. Formally, the normal density-based weights with a standard deviation of

σ_{0}

are defined as

ω_{t} = exp (- \frac{θ_{t}^{2}}{2 σ_{0}^{2}}),

(7)

where normalization constants for the normal density are omitted. The choice of

σ_{0} = 1

is grounded in the assumption that separate calibration of the 2PL model in both groups involves a standard normal distribution for the

θ

variable. Over time, researchers have adopted uniform weights (

ω_{t} = 1

), which have become the default choice in IRT linking software such as R packages equateIRT [26], plink [27], and sirt [28]. However, users can modify the default weights, and normal density weights can be easily implemented.

Because the weights (

ω_{t}

) are used to weight the squared deviations

D_{i t} (δ, {\hat{γ}}_{i})

in the IRFs, the choice of weights is expected to influence the variance of the linking parameter estimates. When uniform weights are used, all items contribute equally, leading to an unweighted definition of the overall discrepancy between the two IRFs. In contrast, when normal density weights are used, deviations are more heavily weighted for items whose difficulties are centered around zero in the

θ

distribution. Items with difficulties closer to zero typically exhibit smaller standard errors of item parameters, which may result in a lower variance of linking parameter estimates when normal density weights are applied, as compared to uniform weights in Haebara linking.

2.2. Haebara Linking Weighted by Item Information

The modification of Haebara linking proposed by Wang et al. [22,23] incorporates the item information function (IIF; [7]) into the weighting of squared IRF deviations (

D_{i t} (δ, {\hat{γ}}_{i})

). The IIF quantifies how much information a given item provides about the ability level (

θ

) of a subject. It tells us how well an item can discriminate between individuals of different ability levels. The IIF in the 2PL model, given by

I I F (θ, a_{i}, b_{i}) = a_{i}^{2} Ψ (a_{i} (θ - b_{i})) [1 - Ψ (a_{i} (θ - b_{i}))],

(8)

describes the amount of information provided by item i at ability level

θ

. Equation (8) shows that item information reaches its maximum at

θ = b_{i}

. Additionally, items with higher discrimination parameters (

a_{i}

) provide greater item information.
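
The two properties of Equation (8) stated above can be checked numerically. The following Python sketch uses a hypothetical function name and hypothetical item values; at θ = b_{i} the IRF equals 0.5, so the peak information is a_{i}^{2} / 4.

```python
import math


def logistic(x):
    """Logistic function, computed without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def iif_2pl(theta, a, b):
    """Item information function of the 2PL model, Equation (8)."""
    p = logistic(a * (theta - b))
    return a * a * p * (1.0 - p)


# Locate the maximum on a grid with step 0.1 for an item with b = 0.7.
grid = [i / 10 for i in range(-30, 31)]
theta_at_max = max(grid, key=lambda th: iif_2pl(th, a=1.2, b=0.7))

# Peak information for a low and a high discrimination (a^2 / 4 in both cases).
peak_low_a = iif_2pl(0.7, a=0.6, b=0.7)    # 0.36 / 4 = 0.09
peak_high_a = iif_2pl(0.7, a=1.2, b=0.7)   # 1.44 / 4 = 0.36
```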

The information-based Haebara linking introduces the information-based weight term, which is the sum of the IIFs for both groups. This term is defined as [22,23]

J (θ_{t}, δ, {\hat{γ}}_{i}) = I I F (μ + σ θ_{t}, {\hat{a}}_{i 1}, {\hat{b}}_{i 1}) + I I F (θ_{t}, {\hat{a}}_{i 2}, {\hat{b}}_{i 2})

(9)

and is used in the modified linking function [22,23]:

H (δ, \hat{γ}) = \sum_{i = 1}^{I} \sum_{t = 1}^{T} ω_{t} J (θ_{t}, δ, {\hat{γ}}_{i}) D_{i t} (δ, {\hat{γ}}_{i}) .

(10)

In the original formulation of information-based Haebara linking, the weights (

ω_{t}

) were uniform. In this updated version, however, the weights are determined by the information functions, and the linking parameter (

δ

) is incorporated into the weighting term. This means that the information-based weights are not fixed but are updated during the minimization process of the linking function.

Additionally, an approach was tested where fixed information weights (

J (θ_{t}, δ^{*}, {\hat{γ}}_{i})

) were used based on some preliminary estimate (

δ^{*}

). The resulting estimator showed statistical properties that were very similar to those of the original information-based Haebara linking approach in Equation (10).
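
A minimal Python sketch of Equations (9) and (10) is given below, including the variant with fixed information weights. The function names, the `weight_delta` argument, and the example items are hypothetical, and the group-2 parameters are again generated under the assumed rescaling a_{i2} = σ a_{i} and b_{i2} = (b_{i} - μ) / σ.

```python
import math


def logistic(x):
    """Logistic function, computed without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def iif(theta, a, b):
    """Item information function of the 2PL model, Equation (8)."""
    p = logistic(a * (theta - b))
    return a * a * p * (1.0 - p)


def haebara_info(mu, sigma, a1, b1, a2, b2, grid, w, weight_delta=None):
    """Information-weighted linking function of Equation (10).

    weight_delta=None updates J with (mu, sigma) as in Equation (9);
    weight_delta=(mu_star, sigma_star) keeps J fixed at a preliminary value.
    """
    w_mu, w_sigma = (mu, sigma) if weight_delta is None else weight_delta
    total = 0.0
    for ai1, bi1, ai2, bi2 in zip(a1, b1, a2, b2):
        for th, wt in zip(grid, w):
            d = logistic(ai1 * (sigma * th + mu - bi1)) - logistic(ai2 * (th - bi2))
            j = iif(w_mu + w_sigma * th, ai1, bi1) + iif(th, ai2, bi2)  # Equation (9)
            total += wt * j * d * d
    return total


# Hypothetical invariant items; group 2 has mean 0.3 and SD 1.2.
a1 = [0.6, 0.6, 1.2, 1.2]
b1 = [-1.4, 0.7, -0.7, 1.4]
mu_true, sigma_true = 0.3, 1.2
a2 = [a * sigma_true for a in a1]
b2 = [(b - mu_true) / sigma_true for b in b1]

grid = [-6.0 + 12.0 * t / 100 for t in range(101)]
w = [1.0 for _ in grid]   # uniform omega_t, as in the original IHA formulation

h_true = haebara_info(mu_true, sigma_true, a1, b1, a2, b2, grid, w)
h_updated = haebara_info(0.0, 1.0, a1, b1, a2, b2, grid, w)
h_fixed = haebara_info(0.0, 1.0, a1, b1, a2, b2, grid, w,
                       weight_delta=(mu_true, sigma_true))
```

Both variants are zero at the true linking parameters when items are invariant. Away from that point they generally take different values because the information term J is evaluated at different δ values.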

Since information-weighted Haebara linking was only recently introduced in the literature [22,23], it has not yet been incorporated into popular linking software packages such as equateIRT [26] or plink [27]. However, implementing the method in R is relatively straightforward, as the linking function can be defined by the user and numerically minimized using solvers like stats::optim() or stats::nlminb().
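
To illustrate the minimization step without external dependencies, the following Python sketch minimizes the normal density-weighted linking function of Equation (6) with a simple derivative-free compass search. This search is only a stand-in chosen for this example; it is not the optimizer used in the article, which relies on stats::optim() in R. All names and item values are hypothetical, and the group-2 parameters are generated under the assumed rescaling a_{i2} = σ a_{i} and b_{i2} = (b_{i} - μ) / σ.

```python
import math


def logistic(x):
    """Logistic function, computed without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


GRID = [-6.0 + 12.0 * t / 100 for t in range(101)]     # 101 points on [-6, 6]
WEIGHTS = [math.exp(-th ** 2 / 2.0) for th in GRID]    # Equation (7) with SD 1


def haebara(delta, a1, b1, a2, b2):
    """Normal density-weighted linking function of Equation (6)."""
    mu, sigma = delta
    total = 0.0
    for ai1, bi1, ai2, bi2 in zip(a1, b1, a2, b2):
        for th, w in zip(GRID, WEIGHTS):
            d = logistic(ai1 * (sigma * th + mu - bi1)) - logistic(ai2 * (th - bi2))
            total += w * d * d
    return total


def compass_search(f, start, step=0.5, tol=1e-6, max_iter=10000):
    """Try +/- step along each coordinate; halve the step when nothing improves."""
    x, fx = list(start), f(start)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for k in range(len(x)):
            for sign in (1.0, -1.0):
                trial = list(x)
                trial[k] += sign * step
                f_trial = f(trial)
                if f_trial < fx:
                    x, fx, improved = trial, f_trial, True
        if not improved:
            step *= 0.5
    return x, fx


# Hypothetical invariant items; group 2 has mean 0.3 and SD 1.2.
a1 = [0.6, 0.6, 0.6, 1.2, 1.2, 1.2]
b1 = [-1.4, 0.0, 1.4, -1.4, 0.0, 1.4]
mu_true, sigma_true = 0.3, 1.2
a2 = [a * sigma_true for a in a1]
b2 = [(b - mu_true) / sigma_true for b in b1]

(mu_hat, sigma_hat), h_min = compass_search(
    lambda d: haebara(d, a1, b1, a2, b2), start=[0.0, 1.0])
```

Because the example uses error-free item parameters, the minimizer should recover the generating values of μ and σ closely. With estimated item parameters, the minimum of H is positive and the estimate carries sampling error.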

2.3. Standard Error Estimation

The estimation of standard errors for the different specifications of Haebara linking can be formulated based on a general linking function (H). The following treatment can be applied for the linking functions defined in Equations (6) and (10). The linking parameter estimate (

\hat{δ}

) satisfies the estimating equation:

H_{δ} (\hat{δ}, \hat{γ}) = 0 .

(11)

The derivation of standard errors for

\hat{δ}

due to sampling of persons relies on the delta method (see [29,30,31,32,33,34,35]). A linear Taylor approximation of

H_{δ}

around

(δ, γ)

yields

H_{δ} (\hat{δ}, \hat{γ}) = H_{δ} (δ, γ) + H_{δ γ} (δ, γ) (\hat{γ} - γ) + H_{δ δ} (δ, γ) (\hat{δ} - δ) .

(12)

Due to

H_{δ} (\hat{δ}, \hat{γ}) = 0

and

H_{δ} (δ, γ) = 0

, the following holds:

\hat{δ} - δ = - H_{δ δ} {(δ, γ)}^{- 1} H_{δ γ} (δ, γ) (\hat{γ} - γ) .

(13)

Thus, the variance matrix (

V

) of

\hat{δ}

is given by

V = Var (\hat{δ}) = A V_{\hat{γ}} A^{⊤} with A = H_{δ δ} {(δ, γ)}^{- 1} H_{δ γ} (δ, γ)

(14)

where

V_{\hat{γ}}

is the variance matrix (

Var (\hat{γ})

) of the item parameters (

\hat{γ}

), which are typically obtained from IRT software. The matrix (

V_{\hat{γ}}

) is a block-diagonal matrix that contains the variance matrices of estimated item parameters in both groups. The variance matrix (

V

) in Equation (14) can be estimated by

\hat{V} = \hat{A} V_{\hat{γ}} {\hat{A}}^{⊤} with \hat{A} = H_{δ δ} {(\hat{δ}, \hat{γ})}^{- 1} H_{δ γ} (\hat{δ}, \hat{γ}) .

(15)

The necessary second-order partial derivatives for the variance matrix follow from the definition of weighted Haebara linking but can be quite cumbersome to evaluate analytically [22]. Alternatively, these derivatives can be obtained by numerical differentiation. In this case, users would define the linking function (

H (δ, γ)

) in statistical software like R, allowing the software to compute the necessary derivatives (

H_{δ δ}

and

H_{δ γ}

) automatically. Standard errors of the entries of the linking parameter estimate (

\hat{δ} = (\hat{μ}, \hat{σ})

) are computed as the square roots of the diagonal elements in

\hat{V}

.
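
The delta-method computation in Equation (15) can be sketched with central finite differences. The Python code below is a minimal illustration under several assumptions that are not part of the article: the function names are hypothetical, the variance matrix of the item parameters is taken to be diagonal with made-up entries of 0.01 (in practice it would be the block-diagonal matrix reported by IRT software), and the linking parameter estimate is set to the generating value because the example item parameters are error-free.

```python
import math


def logistic(x):
    """Logistic function, computed without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


GRID = [-6.0 + 12.0 * t / 100 for t in range(101)]
WEIGHTS = [math.exp(-th ** 2 / 2.0) for th in GRID]    # Equation (7) with SD 1


def H(delta, gamma):
    """Equation (6); gamma is the flat list a1 + b1 + a2 + b2."""
    mu, sigma = delta
    n = len(gamma) // 4
    a1, b1, a2, b2 = (gamma[k * n:(k + 1) * n] for k in range(4))
    total = 0.0
    for i in range(n):
        for th, w in zip(GRID, WEIGHTS):
            d = (logistic(a1[i] * (sigma * th + mu - b1[i]))
                 - logistic(a2[i] * (th - b2[i])))
            total += w * d * d
    return total


def grad_delta(delta, gamma, h=1e-4):
    """Central-difference approximation of H_delta."""
    g = []
    for k in range(2):
        up, dn = list(delta), list(delta)
        up[k] += h
        dn[k] -= h
        g.append((H(up, gamma) - H(dn, gamma)) / (2.0 * h))
    return g


def jacobian_of_grad(delta, gamma, wrt, h=1e-4):
    """Derivative of H_delta with respect to delta (H_dd) or gamma (H_dg)."""
    base = list(delta) if wrt == "delta" else list(gamma)
    cols = []
    for j in range(len(base)):
        up, dn = list(base), list(base)
        up[j] += h
        dn[j] -= h
        if wrt == "delta":
            gu, gd = grad_delta(up, gamma), grad_delta(dn, gamma)
        else:
            gu, gd = grad_delta(delta, up), grad_delta(delta, dn)
        cols.append([(gu[k] - gd[k]) / (2.0 * h) for k in range(2)])
    return [[cols[j][k] for j in range(len(base))] for k in range(2)]


def linking_variance(delta, gamma, var_gamma):
    """Equation (15) with a diagonal V_gamma given as a list of variances."""
    Hdd = jacobian_of_grad(delta, gamma, "delta")
    Hdg = jacobian_of_grad(delta, gamma, "gamma")
    det = Hdd[0][0] * Hdd[1][1] - Hdd[0][1] * Hdd[1][0]
    inv = [[Hdd[1][1] / det, -Hdd[0][1] / det],
           [-Hdd[1][0] / det, Hdd[0][0] / det]]
    J = len(gamma)
    A = [[sum(inv[k][m] * Hdg[m][j] for m in range(2)) for j in range(J)]
         for k in range(2)]
    return [[sum(A[k][j] * var_gamma[j] * A[l][j] for j in range(J))
             for l in range(2)] for k in range(2)]


# Hypothetical invariant items; group 2 has mean 0.3 and SD 1.2.
a1 = [0.6, 1.2, 0.6, 1.2]
b1 = [-0.7, -0.7, 0.7, 0.7]
delta_hat = [0.3, 1.2]
a2 = [a * delta_hat[1] for a in a1]
b2 = [(b - delta_hat[0]) / delta_hat[1] for b in b1]
gamma_hat = a1 + b1 + a2 + b2
var_gamma = [0.01] * len(gamma_hat)   # made-up sampling variances

V = linking_variance(delta_hat, gamma_hat, var_gamma)
se_mu, se_sigma = math.sqrt(V[0][0]), math.sqrt(V[1][1])
```

The sign of A does not matter for the variance because A enters Equation (15) as a quadratic form.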

3. Research Purpose

This simulation study examines whether the advantages of information-weighted Haebara linking, as claimed in recent work by Wang et al. [22,23], hold when compared to alternative normal density-weighted Haebara linking specifications. In normal density-weighted Haebara linking, the standard deviation (SD) of the normal density used for computing the weights is varied and compared with uniform weights in Haebara linking. The work of Wang et al. did not include standard errors for the information-based Haebara linking method. To address this, the accuracy of standard errors is also evaluated in the present simulation study and compared with that for normal density-weighted Haebara linking.

4. Simulation Study

4.1. Method

In this simulation study, item responses were simulated under the 2PL model in two groups. Item parameters were generated for

I = 20

and

I = 40

items, representing a short and a long test, respectively, in empirical applications. These parameters remained fixed across all replications of each condition in the simulation study. For

I = 10

items, the item discriminations (

a_{i}

) were set to 0.60, 0.60, 0.60, 0.60, 0.60, 1.20, 1.20, 1.20, 1.20, and 1.20, yielding a mean of

M = 0.900

and an

S D = 0.316

. The base item difficulties (

b_{i 0}

) were set to −1.40, −0.70, 0.00, 0.70, 1.40, −1.40, −0.70, 0.00, 0.70, and 1.40, resulting in

M = 0.000

and

S D = 1.043

. These parameters were also used in the simulation study conducted by Robitzsch [36]. For

I = 20

and

I = 40

items, the parameters were duplicated two and four times, respectively. Unlike other studies, this study assumed invariant item parameters across the two groups, with no fixed or random differential item functioning (see [37,38]).
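
The item parameter design can be reproduced in a few lines of Python. The sketch below assumes that the 20-item and 40-item tests repeat the 10 base items two and four times, respectively; the function name is hypothetical.

```python
import statistics

# Base parameters for I = 10 items, as listed in the text.
base_a = [0.60] * 5 + [1.20] * 5
base_b = [-1.40, -0.70, 0.00, 0.70, 1.40] * 2


def item_parameters(n_items):
    """Repeat the 10 base items to obtain n_items items (a multiple of 10)."""
    if n_items % 10 != 0:
        raise ValueError("n_items must be a multiple of 10")
    reps = n_items // 10
    return base_a * reps, base_b * reps


a20, b20 = item_parameters(20)
a40, b40 = item_parameters(40)

# Descriptive statistics of the base parameters (sample SD with divisor n - 1).
mean_a, sd_a = statistics.mean(base_a), statistics.stdev(base_a)
mean_b, sd_b = statistics.mean(base_b), statistics.stdev(base_b)
```

The sample standard deviations round to 0.316 and 1.043, which matches the values reported for the base items.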

The

θ

variable in the 2PL model was assumed to follow a normal distribution in both groups. For identification reasons, its mean and SD were fixed at 0 and 1, respectively, in the first group. In the second group,

θ

was simulated with a mean of

μ = 0.3

and an SD of

σ = 1.2

.

Sample sizes of

N = 500

, 1000, 2000, and 5000 per group were selected, representing moderate to large samples. Smaller sample sizes were avoided due to potential instability in estimating the 2PL model.
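
The data-generating step for one replication can be sketched as follows. This Python code is an illustration only: the function name and the random seed are arbitrary choices, and the article's own simulation was carried out in R.

```python
import math
import random


def logistic(x):
    """Logistic function, computed without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def simulate_group(n_persons, a, b, mu, sigma, rng):
    """Draw theta ~ N(mu, sigma^2) and binary responses under the 2PL model."""
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(mu, sigma)
        data.append([1 if rng.random() < logistic(ai * (theta - bi)) else 0
                     for ai, bi in zip(a, b)])
    return data


rng = random.Random(1)   # arbitrary seed for reproducibility

# I = 20 items: the 10 base items repeated twice (assumed reading of the design).
a = ([0.60] * 5 + [1.20] * 5) * 2
b = [-1.40, -0.70, 0.00, 0.70, 1.40] * 4

group1 = simulate_group(500, a, b, mu=0.0, sigma=1.0, rng=rng)
group2 = simulate_group(500, a, b, mu=0.3, sigma=1.2, rng=rng)

mean_score1 = sum(map(sum, group1)) / len(group1)
mean_score2 = sum(map(sum, group2)) / len(group2)
```

Each group's data set would then be calibrated separately with the 2PL model before the linking step is applied.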

Alternative specifications of Haebara linking were considered, weighted by either the normal density or item information. First, uniform weights (

ω_{t}

) were chosen in Equations (6) and (10) for Haebara linking (denoted as method HA) and information-weighted Haebara linking (denoted as IHA). Second, a normal density with zero mean and standard deviation (

σ_{0}

) was used for weighting in normal density-weighted and information-weighted Haebara linking. The specifications of

σ_{0} = 2

, 1, and 0.5 corresponded to methods HA2, HA1, and HA0.5, respectively, for normal density-weighted Haebara linking and IHA2, IHA1, and IHA0.5, respectively, for item information-weighted Haebara linking.

In each of the 4 (sample size N) × 2 (number of items I) = 8 cells of the simulation, 5000 replications were conducted. Bias and root mean square error (RMSE) were computed for the estimated mean (

\hat{μ}

) and the estimated standard deviation (

\hat{σ}

) for each linking method. Let

{\hat{μ}}_{r}

be the estimated mean in replication

r = 1, \dots, R

. The bias of

\hat{μ}

is estimated as

Bias (\hat{μ}) = \frac{1}{R} \sum_{r = 1}^{R} ({\hat{μ}}_{r} - μ),

(16)

where

μ

denotes the population mean in the data-generating model. The RMSE is estimated as

RMSE (\hat{μ}) = \sqrt{\frac{1}{R} \sum_{r = 1}^{R} {({\hat{μ}}_{r} - μ)}^{2}} .

(17)

Note that the square of the RMSE (which is the mean square error) quantifies bias and variance because

{[RMSE (\hat{μ})]}^{2} = {[Bias (\hat{μ})]}^{2} + Var (\hat{μ}), where Var (\hat{μ}) = \frac{1}{R} \sum_{r = 1}^{R} {({\hat{μ}}_{r} - \bar{μ})}^{2} and \bar{μ} = \frac{1}{R} \sum_{r = 1}^{R} {\hat{μ}}_{r} .

(18)

The bias and RMSE for

\hat{σ}

are defined in the same manner. A relative RMSE is calculated as the ratio of the RMSE values of a particular linking method to the HA method, chosen as the reference, then multiplied by 100. Haebara linking with uniform weights was selected as the reference method for the RMSE because prior studies comparing information-weighted Haebara linking to Haebara linking with uniform weights used this approach in their simulations [22,23].
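
The performance measures in Equations (16) to (18) and the relative RMSE can be written as short Python functions. The function names are hypothetical, and the two lists of estimates are made-up numbers used only to exercise the formulas; they are not results from the simulation study.

```python
import math


def bias(estimates, true_value):
    """Equation (16): average deviation from the data-generating value."""
    return sum(e - true_value for e in estimates) / len(estimates)


def rmse(estimates, true_value):
    """Equation (17): root mean square error."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))


def variance(estimates):
    """Variance around the average estimate, with divisor R as in Equation (18)."""
    m = sum(estimates) / len(estimates)
    return sum((e - m) ** 2 for e in estimates) / len(estimates)


def relative_rmse(estimates, reference_estimates, true_value):
    """RMSE of a method divided by the RMSE of the reference method, times 100."""
    return 100.0 * rmse(estimates, true_value) / rmse(reference_estimates, true_value)


mu_true = 0.3
mu_hat_ref = [0.24, 0.37, 0.31, 0.22, 0.40]   # made-up estimates, reference method
mu_hat_alt = [0.28, 0.33, 0.31, 0.27, 0.35]   # made-up estimates, alternative method

mse_alt = rmse(mu_hat_alt, mu_true) ** 2
decomposed = bias(mu_hat_alt, mu_true) ** 2 + variance(mu_hat_alt)
rel_alt = relative_rmse(mu_hat_alt, mu_hat_ref, mu_true)
rel_ref = relative_rmse(mu_hat_ref, mu_hat_ref, mu_true)
```

A relative RMSE below 100 indicates that a method is more precise than the reference method, and the reference method itself always has a relative RMSE of 100.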

The coverage rate for

\hat{μ}

and

\hat{σ}

at the 95% confidence level was also assessed based on the normal distribution and the standard error obtained in Section 2.3. This was calculated as the percentage of replications in which the estimated confidence interval contained the pseudotrue value of

μ

or

σ

, where these pseudotrue values were the average estimates for a particular linking method in each simulation condition. Assessing the coverage rates based on pseudotrue parameter values removes potential effects of bias in the linking methods.
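
The coverage computation with a pseudotrue value can be sketched as follows. The function name is hypothetical, and the estimates and standard errors are made-up numbers chosen so that the result can be checked by hand.

```python
import statistics


def coverage_rate(estimates, std_errors, target=None, level=0.95):
    """Percentage of replications whose normal-theory interval contains target.

    If target is None, the pseudotrue value (average estimate) is used.
    """
    if target is None:
        target = statistics.mean(estimates)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for 95%
    hits = sum(1 for est, se in zip(estimates, std_errors)
               if abs(est - target) <= z * se)
    return 100.0 * hits / len(estimates)


mu_hat = [0.28, 0.33, 0.31, 0.27, 0.35]   # made-up estimates, average 0.308
se_hat = [0.01, 0.03, 0.03, 0.03, 0.02]   # made-up standard errors

coverage = coverage_rate(mu_hat, se_hat)
```

In this toy example, the first and last intervals miss the pseudotrue value of 0.308, so three of five intervals cover it.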

This simulation study utilized R statistical software (Version 4.4.1; [39]). The 2PL model was fitted using the sirt::xxirt() function of R package sirt (Version 4.2-106; [28]). Optimization for Haebara linking was performed with the stats::optim() function, implemented in a custom function written specifically for this paper. Replication materials for this simulation study are available at https://osf.io/a2zef (accessed on 9 March 2025).

4.2. Results

4.2.1. Bias

Table 1 presents the bias of the estimated group mean (

\hat{μ}

) and the estimated SD (

\hat{σ}

) as a function of the number of items (I) and sample size (N). The bias of

\hat{μ}

decreased as the sample size increased for all methods. For both 20 and 40 items, the bias approached zero at larger sample sizes (

N = 2000

and

N = 5000

). The bias for

\hat{μ}

was generally small, with no significant differences between methods at the larger sample sizes. HA exhibited the largest bias at

N = 500

. IHA showed a slight negative bias at smaller sample sizes but performed similarly to HA at larger sample sizes. The HA2, HA1, and HA0.5 methods generally displayed bias comparable to that of HA for sample sizes of at least 1000. Information-weighted Haebara linking methods (IHA2, IHA1, and IHA0.5) followed a similar pattern, but the negative bias was slightly more pronounced compared to their normal density-weighted counterparts (HA2, HA1, and HA0.5). Overall, the bias was relatively similar for both the short (i.e.,

I = 20

) and long (i.e.,

I = 40

) tests.

Table 1. Simulation Study: Bias of estimated mean

\hat{μ}

and estimated SD

\hat{σ}

as a function of the number of items I and sample size N.

Similarly to the bias of

\hat{μ}

, the bias of the estimated SD (

\hat{σ}

) generally decreased as the sample size increased. At

N = 5000

, the bias became negligible across all methods and item sizes. At smaller sample sizes (

N = 500

and

N = 1000

), however, the bias for

\hat{σ}

was more pronounced, particularly for methods using normal density weights and information-based weights. HA and IHA showed the largest bias at smaller sample sizes. For instance, at

N = 500

and

I = 20

, the bias for HA was 0.025, while for IHA, it was −0.011. As the sample size increased, the bias for both methods decreased. As shown in [22,23], IHA consistently performed better, with smaller bias at larger sample sizes compared to HA. The information-weighted methods (IHA2, IHA1, and IHA0.5) exhibited larger biases than IHA, although these biases also decreased with increasing sample size. In contrast, the bias for Haebara linking with normal density-based weights (HA2 and HA1) was smaller than that of HA, but the bias for HA0.5 was larger.

In terms of bias, Haebara linking with normal density weights (HA2 and HA1) emerged as the most effective among the different Haebara specifications. While IHA reduced the bias of HA at smaller sample sizes, it performed worse than HA2 and HA1. However, the bias of information-weighted Haebara linking with normal density weights (IHA2, IHA1, and IHA0.5) was larger than that of IHA.

4.2.2. RMSE

Table 2 reports the relative RMSE for the estimated mean (

\hat{μ}

) and the estimated SD (

\hat{σ}

). The best and second-best methods are highlighted with a yellow background, while the third- and fourth-best methods are shown with a gray background.

Table 2. Simulation Study: Relative root mean square error (RMSE) of estimated mean

\hat{μ}

and estimated SD

\hat{σ}

as a function of the number of items I and sample size N.

First, we focus on analyzing the RMSE for

\hat{μ}

. IHA slightly outperformed HA, particularly at smaller sample sizes. However, Haebara linking methods employing normal density weights (i.e., HA2, HA1, and HA0.5) significantly outperformed both HA and IHA, with a strong preference for HA1 and HA0.5. The efficiency gains of HA1 and HA0.5 were more pronounced in the shorter test (

I = 20

). Interestingly, information-weighted Haebara linking with normal density weights (IHA1 and IHA0.5) achieved a lower RMSE than the information-weighted Haebara linking method (IHA) using uniform weights.

Next, we turn our attention to the RMSE for

\hat{σ}

. The efficiency gains of IHA compared to HA were more pronounced for

\hat{σ}

than for

\hat{μ}

. However, similar to the findings for

\hat{μ}

, the normal density-weighted Haebara methods (HA1 and HA0.5) were significantly better than IHA, with a slight advantage for HA1. In contrast to the results for

\hat{μ}

, information-weighted Haebara linking with normal density weights had a higher RMSE than the original IHA method, which relies on uniform weights.

In conclusion, similar to the bias results, normal density-weighted Haebara linking methods should be preferred over information-weighted Haebara linking methods.

4.2.3. Coverage Rates

Finally, Table 3 presents the coverage rates for

\hat{μ}

and

\hat{σ}

across the different linking methods. Overall, the coverage rates were highly satisfactory, ranging from 94.2% to 95.9%. These results indicate that reliable statistical inference for both normal density-weighted and information-weighted Haebara linking can be achieved using the standard error estimation method outlined in Section 2.3.

Table 3. Simulation Study: Coverage rate of estimated mean

\hat{μ}

and estimated SD

\hat{σ}

as a function of the number of items I and sample size N.

4.2.4. Summary

This simulation study demonstrated that bias in information-weighted Haebara linking and Haebara linking with uniform weights was observed for smaller sample sizes of

N = 500

. This bias was reduced when using normal density-weighted Haebara linking. However, bias generally decreased for all methods as sample size increased. The normal density-weighted Haebara linking approaches exhibited the highest precision in linking parameter estimates, as measured by RMSE, and generally outperformed information-weighted Haebara linking. The precision gains were primarily due to the lower bias in normal density-weighted Haebara linking. The coverage rates for the proposed standard errors were consistently satisfactory, supporting the reliable use of the proposed methodology in empirical research.

5. Discussion

This article compares different weighting strategies in Haebara linking. Information-weighted Haebara linking has recently been proposed as an alternative to the standard Haebara linking method [22,23]. These studies argue that information-weighted Haebara linking reduces bias and RMSE in linking parameter estimates. Our findings generally support this claim, but only when uniform weights are used in traditional Haebara linking. When normal density weights with a standard deviation of 1 or 0.5 are employed, the corresponding Haebara methods outperform information-based linking. Furthermore, information-weighted Haebara linking does not consistently benefit from the replacement of uniform weights with normal density weights. As a second contribution, we derived standard errors for both information-weighted Haebara linking and normal density-weighted Haebara that resulted in satisfactory coverage rates.

Our findings challenge the prevailing practice in IRT software, where uniform weights are commonly used as the default in Haebara linking. We recommend returning to a specification close to Haebara’s original proposal [18], namely normal density weights with a standard deviation of 1. Although information-weighted Haebara linking [22,23] generally improves upon the choice of uniform weights, it remains inferior to normal density-weighted Haebara linking. Given the superior performance of normal density weights, the practical advantages of using information-weighted Haebara linking remain unclear.

It should be emphasized that using normal density-weighted Haebara linking instead of Haebara linking with uniform weights does not increase computational effort. Typically, the separate scaling of the two groups through the 2PL IRT model is much more computationally demanding than the subsequent Haebara linking step, which can be conducted with standard numerical optimizers. Alternatively, users could apply a concurrent scaling approach in the 2PL model, assuming invariant item parameters across groups. However, using this multiple-group IRT model would increase computational effort [32], and it typically requires more iterations to converge than separate scaling within each group. Moreover, the literature suggests that linking approaches are more robust to model mis-specification than concurrent calibration [15].

An anonymous reviewer inquired about the potential impact of the findings on large-scale educational assessments or psychological testing in real-world settings. While we do not expect substantial differences to arise from applying different weighting techniques in Haebara linking, we believe that, like any statistical technique, Haebara linking should be examined in terms of optimizing the statistical properties of the estimator. Practitioners should select the estimator with the most desirable properties for their empirical studies. Therefore, we believe our study provides valuable recommendations for those looking to implement Haebara linking in their linking studies.

Our study focused on the 2PL model, which can be seen as a special case of the generalized partial credit model studied in [23]. It may be of interest to examine the performance of different weighting approaches in the context of the three-parameter logistic (3PL) model [22]. However, it seems reasonable to expect that normal density-weighted Haebara linking would also perform better in this case.

As noted by an anonymous reviewer, unequal sample sizes in the groups are common in empirical research. Such situations could be explored in future studies. It is anticipated that the main findings of this simulation study will remain unchanged, with the precision of the estimates primarily determined by the smaller sample size in the two groups.

Another avenue for future research could be to investigate the impact of weighting choices in Stocking–Lord linking [19]. The effects of normal density versus information weighting may be less pronounced here, while item information must be replaced by test information (see [22]). Given that Stocking–Lord linking generally outperforms Haebara linking [37], especially in the presence of differential item functioning (DIF; [40,41]), future studies should also explore whether information-weighted or normal density-weighted variants are preferable for this linking method. It is anticipated that, similar to Haebara linking, the information-weighted approaches may not outperform normal density-based approaches in Stocking–Lord linking.

However, this paper only simulated item responses without DIF effects. Other research has suggested that mean-geometric mean linking should be preferred over Haebara linking in the presence of DIF [38], particularly regarding bias in linking parameter estimates. Therefore, researchers should opt for normal density-weighted Haebara linking to obtain the most efficient estimates in situations without DIF, but this method is not recommended in cases with non-negligible DIF.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This article only uses simulated datasets. Replication material for the simulation study can be found at https://osf.io/a2zef (accessed on 9 March 2025).

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

2PL	Two-parameter logistic
3PL	Three-parameter logistic
DIF	Differential item functioning
IIF	Item information function
IRF	Item response function
IRT	Item response theory
MML	Marginal maximum likelihood
RMSE	Root mean square error
SD	Standard deviation

References

1. Bock, R.D.; Moustaki, I. Item response theory in a general framework. In Handbook of Statistics, Vol. 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Elsevier: Amsterdam, The Netherlands, 2007; pp. 469–513.
2. Chen, Y.; Li, X.; Liu, J.; Ying, Z. Item response theory–A statistical framework for educational and psychological measurement. Stat. Sci. 2024; Epub ahead of print. Available online: https://rb.gy/1yic0e (accessed on 9 March 2025).
3. van der Linden, W.J. Unidimensional logistic response models. In Handbook of Item Response Theory, Volume 1: Models; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 11–30.
4. Nering, M.L.; Ostini, R. Handbook of Polytomous Item Response Theory Models; Taylor & Francis: Boca Raton, FL, USA, 2011.
5. Mellenbergh, G.J. Models for continuous responses. In Handbook of Item Response Theory; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; Volume 1: Models, pp. 153–163.
6. Tutz, G.; Jordan, P. Latent trait item response models for continuous responses. J. Educ. Behav. Stat. 2024, 49, 499–532.
7. Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154.
8. Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479.
9. Aitkin, M. Expectation maximization algorithm and extensions. In Handbook of Item Response Theory; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; Volume 2: Statistical Tools, pp. 217–236.
10. Bock, R.D.; Aitkin, M. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika 1981, 46, 443–459.
11. Glas, C.A.W. Maximum-likelihood estimation. In Handbook of Item Response Theory; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; Volume 2: Statistical Tools, pp. 197–216.
12. Holland, P.W. On the sampling theory foundations of item response theory models. Psychometrika 1990, 55, 577–601.
13. Baker, F.B.; Kim, S.H. Item Response Theory: Parameter Estimation Techniques; CRC Press: Boca Raton, FL, USA, 2004.
14. Lee, W.C.; Lee, G. IRT linking and equating. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 639–673.
15. Kolen, M.J.; Brennan, R.L. Test Equating, Scaling, and Linking; Springer: New York, NY, USA, 2014.
16. Sansivieri, V.; Wiberg, M.; Matteucci, M. A review of test equating methods with a special focus on IRT-based approaches. Statistica 2017, 77, 329–352.
17. Robitzsch, A. A comparison of linking methods for two groups for the two-parameter logistic item response model in the presence and absence of random differential item functioning. Foundations 2021, 1, 116–144.
18. Haebara, T. Equating logistic ability scales by a weighted least squares method. Jpn. Psychol. Res. 1980, 22, 144–149.
19. Stocking, M.L.; Lord, F.M. Developing a common metric in item response theory. Appl. Psychol. Meas. 1983, 7, 201–210.
20. Liu, G.; Kim, H.J.; Lee, W.C.; Kim, Y. Comparison of Simultaneous Linking and Separate Calibration with Stocking-Lord Method; CASMA Research Report Number 57; Center for Advanced Studies in Measurement and Assessment, University of Iowa: Iowa City, IA, USA, 2024; Available online: https://tinyurl.com/2bj6pbwn (accessed on 9 March 2025).
21. Jianhua, X.; Shuliang, D. Model-based methods for test equating under item response theory. In Proceedings of the 2010 International Conference on E-Business and E-Government, IEEE, Guangzhou, China, 7–9 May 2010; pp. 5458–5461.
22. Wang, S.; Zhang, M.; Lee, W.C.; Huang, F.; Li, Z.; Li, Y.; Yu, S. Two IRT characteristic curve linking methods weighted by information. J. Educ. Meas. 2022, 59, 423–441.
23. Wang, S.; Lee, W.C.; Zhang, M.; Yuan, L. IRT characteristic curve linking methods weighted by information for mixed-format tests. Appl. Meas. Educ. 2024, 37, 377–390.
24. Kim, S.H.; Cohen, A.S. A comparison of linking and concurrent calibration under the graded response model. Appl. Psychol. Meas. 2002, 26, 25–41.
25. Kim, S.; Kolen, M.J. Effects on scale linking of different definitions of criterion functions for the IRT characteristic curve methods. J. Educ. Behav. Stat. 2007, 32, 371–397.
26. Battauz, M. equateIRT: An R package for IRT test equating. J. Stat. Softw. 2015, 68, 1–22.
27. Weeks, J.P. plink: An R package for linking mixed-format tests using IRT-based methods. J. Stat. Softw. 2010, 35, 1–33.
28. Robitzsch, A. sirt: Supplementary Item Response Theory Models. R Package Version 4.2-106. 2024. Available online: https://github.com/alexanderrobitzsch/sirt (accessed on 31 December 2024).
29. Boos, D.D.; Stefanski, L.A. Essential Statistical Inference; Springer: New York, NY, USA, 2013.
30. Ogasawara, H. Standard errors of item response theory equating/linking by response function methods. Appl. Psychol. Meas. 2001, 25, 53–67.
31. Ogasawara, H. Item response theory true score equatings and their standard errors. J. Educ. Behav. Stat. 2001, 26, 31–50.
32. Andersson, B. Asymptotic variance of linking coefficient estimators for polytomous IRT models. Appl. Psychol. Meas. 2018, 42, 192–205.
33. Battauz, M. IRT test equating in complex linkage plans. Psychometrika 2013, 78, 464–480.
34. Robitzsch, A. Estimation of standard error, linking error, and total error for robust and nonrobust linking methods in the two-parameter logistic model. Stats 2024, 7, 592–612.
35. Zhang, Z. Asymptotic standard errors of generalized partial credit model true score equating using characteristic curve methods. Appl. Psychol. Meas. 2021, 45, 331–345.
36. Robitzsch, A. Extensions to mean–geometric mean linking. Mathematics 2025, 13, 35.
37. Robitzsch, A. Bias-reduced Haebara and Stocking-Lord linking. J 2024, 7, 373–384.
38. Robitzsch, A. Does random differential item functioning occur in one or two groups? Implications for bias and variance in asymmetric and symmetric Haebara and Stocking-Lord linking. Asymmetry 2024, 1, 0005.
39. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria. 2024. Available online: https://www.R-project.org (accessed on 15 June 2024).
40. Penfield, R.D.; Camilli, G. Differential item functioning and item bias. In Handbook of Statistics; Rao, C.R., Sinharay, S., Eds.; Elsevier: Amsterdam, The Netherlands, 2007; Volume 26: Psychometrics, pp. 125–167.
41. Bauer, D.J. Enhancing measurement validity in diverse populations: Modern approaches to evaluating differential item functioning. Brit. J. Math. Stat. Psychol. 2023, 76, 435–461.

Table 1. Simulation Study: Bias of estimated mean $\hat{μ}$ and estimated SD $\hat{σ}$ as a function of the number of items I and sample size N.

Par	I	N	HA	IHA	HA2	IHA2	HA1	IHA1	HA0.5	IHA0.5
$\hat{μ}$	20	500	0.008	−0.004	0.003	−0.005	0.000	−0.005	−0.002	−0.005
		1000	0.003	−0.004	0.000	−0.004	−0.002	−0.004	−0.003	−0.004
		2000	0.001	−0.002	0.000	−0.002	0.000	−0.002	−0.001	−0.001
		5000	0.001	−0.001	0.000	−0.001	0.000	−0.001	−0.001	−0.001
	40	500	0.008	−0.001	0.005	−0.002	0.003	−0.001	0.002	−0.001
		1000	0.003	−0.003	0.000	−0.004	−0.001	−0.004	−0.002	−0.004
		2000	−0.001	−0.002	−0.001	−0.002	−0.001	−0.001	−0.001	−0.001
		5000	0.000	−0.002	−0.001	−0.002	−0.001	−0.002	−0.002	−0.002
$\hat{σ}$	20	500	0.025	−0.011	0.006	−0.020	−0.006	−0.032	−0.014	−0.061
		1000	0.013	−0.006	0.003	−0.010	−0.004	−0.017	−0.008	−0.032
		2000	0.006	−0.003	0.002	−0.005	−0.002	−0.008	−0.003	−0.016
		5000	0.002	−0.001	0.001	−0.002	−0.001	−0.003	−0.002	−0.006
	40	500	0.019	−0.012	0.004	−0.019	−0.007	−0.031	−0.013	−0.060
		1000	0.013	−0.004	0.005	−0.008	−0.001	−0.014	−0.004	−0.029
		2000	0.005	−0.003	0.001	−0.005	−0.002	−0.008	−0.003	−0.016
		5000	0.003	0.000	0.001	−0.001	0.000	−0.003	−0.001	−0.006

Note. Par = parameter; HA = Haebara linking with uniform weights; IHA = information-weighted Haebara linking with uniform weights; HA2, HA1, HA0.5 = normal-density-based Haebara linking based on weights with normal density of zero mean and $σ_{0} = 2$, 1, and 0.5, respectively; IHA2, IHA1, IHA0.5 = information-based Haebara linking based on weights with normal density of zero mean and $σ_{0} = 2$, 1, and 0.5, respectively. Values of absolute bias larger than 0.005 are printed in bold font.

Table 2. Simulation Study: Relative root mean square error (RMSE) of estimated mean $\hat{μ}$ and estimated SD $\hat{σ}$ as a function of the number of items I and sample size N.

Par	I	N	HA	IHA	HA2	IHA2	HA1	IHA1	HA0.5	IHA0.5
$\hat{μ}$	20	500	100	98.0	93.6	95.5	89.7	92.3	89.0	91.5
		1000	100	98.9	94.4	96.4	91.0	93.2	90.6	92.5
		2000	100	100.2	95.1	97.7	91.9	94.4	91.7	93.6
		5000	100	100.3	94.6	97.6	91.0	93.9	90.2	92.5
	40	500	100	97.1	96.3	95.8	94.3	94.4	94.1	94.6
		1000	100	98.2	96.4	96.8	94.3	95.1	94.1	94.8
		2000	100	100.0	97.6	98.8	96.3	97.5	96.5	97.6
		5000	100	99.1	97.2	97.8	95.6	96.5	95.7	96.7
$\hat{σ}$	20	500	100	88.6	87.3	89.9	83.5	93.9	86.2	115.1
		1000	100	92.4	90.2	93.0	86.8	94.9	89.0	109.0
		2000	100	96.4	92.7	96.9	90.1	97.9	92.5	107.9
		5000	100	95.3	93.1	95.3	90.1	95.4	92.5	102.9
	40	500	100	90.8	90.4	92.5	87.7	97.6	89.5	122.3
		1000	100	92.1	91.9	92.8	89.0	95.1	90.2	109.8
		2000	100	97.1	95.4	97.9	94.0	99.7	95.8	110.6
		5000	100	96.3	95.0	96.2	92.8	96.5	93.8	102.3

Note. Par = parameter; HA = Haebara linking with uniform weights; IHA = information-weighted Haebara linking with uniform weights; HA2, HA1, HA0.5 = normal-density-based Haebara linking based on weights with normal density of zero mean and $σ_{0} = 2$, 1, and 0.5, respectively; IHA2, IHA1, IHA0.5 = information-based Haebara linking based on weights with normal density of zero mean and $σ_{0} = 2$, 1, and 0.5, respectively. The HA method was used as the reference method to compute the relative RMSE. The best and second-best methods are highlighted in yellow, while the third and fourth best are shown with a gray background.
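
As a reading aid for Tables 1 and 2, the sketch below shows how bias, RMSE, and a relative RMSE with HA as the reference could be computed from replicate estimates. The function names and the handful of replicate values are made up for illustration; they are not taken from the simulation study.

```python
import math


def bias(estimates, truth):
    """Average estimate across replications minus the true value."""
    return sum(estimates) / len(estimates) - truth


def rmse(estimates, truth):
    """Root mean square error of the estimates around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


def relative_rmse(estimates, reference_estimates, truth):
    """RMSE of a method as a percentage of the RMSE of the reference method."""
    return 100.0 * rmse(estimates, truth) / rmse(reference_estimates, truth)


# Hypothetical replicate estimates of mu (true value 0.3) for two methods.
truth = 0.3
deviations = [-1.5, -0.5, 0.0, 0.5, 1.5]
ha_estimates = [truth + 0.10 * d for d in deviations]     # reference method (HA)
other_estimates = [truth + 0.09 * d for d in deviations]  # a competing method

print(bias(ha_estimates, truth))
print(relative_rmse(ha_estimates, ha_estimates, truth))
print(relative_rmse(other_estimates, ha_estimates, truth))
```

By construction, the reference method has a relative RMSE of 100, and values below 100 indicate a method that is more accurate than uniformly weighted Haebara linking.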

Table 3. Simulation Study: Coverage rate of estimated mean $\hat{μ}$ and estimated SD $\hat{σ}$ as a function of the number of items I and sample size N.

Par	I	N	HA	IHA	HA2	IHA2	HA1	IHA1	HA0.5	IHA0.5
$\hat{μ}$	20	500	95.6	95.4	95.3	95.5	95.5	95.6	95.6	95.5
		1000	94.4	94.8	94.5	94.8	94.8	95.1	94.9	95.1
		2000	95.3	95.2	95.2	95.2	95.1	95.1	95.1	95.1
		5000	95.3	95.6	95.6	95.5	95.7	95.6	95.7	95.5
	40	500	94.8	95.2	94.8	95.1	94.8	95.0	94.7	95.0
		1000	95.0	95.3	95.1	95.3	95.3	95.6	95.5	95.7
		2000	95.4	95.3	95.3	95.3	95.2	95.2	95.1	95.1
		5000	95.2	95.5	95.3	95.5	95.3	95.4	95.3	95.1
$\hat{σ}$	20	500	94.9	95.1	94.8	95.2	95.0	95.3	95.2	95.7
		1000	95.0	95.0	94.9	95.2	95.0	95.3	95.0	95.6
		2000	95.2	94.9	95.3	95.0	95.3	95.3	95.2	95.5
		5000	94.8	94.9	95.1	95.3	95.0	95.2	95.0	95.3
	40	500	94.2	94.7	94.4	94.9	94.6	95.2	95.1	95.9
		1000	94.7	95.0	95.0	95.0	95.1	95.1	95.0	95.5
		2000	94.9	94.8	94.7	94.8	94.8	94.9	94.9	95.1
		5000	94.7	94.8	94.7	94.8	94.9	94.7	94.8	95.0

Note. Par = parameter; HA = Haebara linking with uniform weights; IHA = information-weighted Haebara linking with uniform weights; HA2, HA1, HA0.5 = normal-density-based Haebara linking based on weights with normal density of zero mean and $σ_{0} = 2$, 1, and 0.5, respectively; IHA2, IHA1, IHA0.5 = information-based Haebara linking based on weights with normal density of zero mean and $σ_{0} = 2$, 1, and 0.5, respectively.
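
The coverage rates in Table 3 can be read against the following sketch, which assumes the usual definition: the percentage of replications in which a normal-theory confidence interval around the estimate contains the true parameter value. The nominal level of 95%, the function name, and the example numbers are assumptions made for illustration.

```python
from statistics import NormalDist


def coverage_rate(estimates, standard_errors, truth, level=0.95):
    """Percentage of normal-theory confidence intervals that contain the truth."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for level = 0.95
    hits = sum(
        1
        for est, se in zip(estimates, standard_errors)
        if est - z * se <= truth <= est + z * se
    )
    return 100.0 * hits / len(estimates)


# Hypothetical estimates and standard errors from five replications.
estimates = [0.28, 0.35, 0.31, 0.52, 0.30]
standard_errors = [0.05, 0.05, 0.05, 0.05, 0.05]
print(coverage_rate(estimates, standard_errors, truth=0.3))
```

In this toy example, four of the five intervals contain the true value of 0.3, so the coverage rate is 80%. With valid standard errors and many replications, the rate should be close to the nominal level.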

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
