1. Introduction
Item response theory (IRT) models [1,2,3] are multivariate statistical models for multivariate binary random variables. These models are frequently used to model cognitive testing data from educational or psychological applications. For example, IRT models are operationally utilized in educational large-scale assessments [4,5], like the Programme for International Student Assessment (PISA; [6]) study.
In this article, we only treat unidimensional IRT models [7]. Let $X = (X_1, \ldots, X_I)$ be the vector of $I$ dichotomous random variables $X_i \in \{0, 1\}$ (also referred to as items or (scored) item responses). A unidimensional IRT model [8] is a statistical model for the probability distribution $P(X = x)$ for $x \in \{0, 1\}^I$, where

$$ P(X = x; \delta, \gamma) = \int \prod_{i=1}^{I} P_i(\theta)^{x_i} \big( 1 - P_i(\theta) \big)^{1 - x_i} \, \phi(\theta; \mu, \sigma) \, d\theta , \quad (1) $$

where $\phi$ denotes the density of the normal distribution, with the mean $\mu$ and the standard deviation $\sigma$. The distribution parameters of the latent variable $\theta$ (also referred to as the factor variable, trait, or ability) are contained in the vector $\delta = (\mu, \sigma)$. The vector $\gamma = (\gamma_1, \ldots, \gamma_I)$ contains all the estimated item parameters of the item response functions (IRFs) $P_i(\theta) = P(X_i = 1 \mid \theta)$ ($i = 1, \ldots, I$). The two-parameter logistic (2PL) model [9] possesses the following IRF:

$$ P_i(\theta) = \Psi\big( a_i (\theta - b_i) \big) , \quad (2) $$

using the item discrimination $a_i$ and the item difficulty $b_i$, where $\Psi$ denotes the logistic distribution function. The 2PL model could also be estimated for non-normal distributions [10,11,12,13,14,15]. In this case, an identification constraint is typically applied to a reference item $i_0$ such that $a_{i_0} = 1$ and $b_{i_0} = 0$.
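As an illustrative sketch (not code from the article, whose analyses were carried out in R), the 2PL IRF can be evaluated as follows; the item parameter values are hypothetical:

```python
import numpy as np

def irf_2pl(theta, a, b):
    """2PL item response function: P(X_i = 1 | theta) = Psi(a_i * (theta - b_i)),
    where Psi is the logistic distribution function."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item parameters for a three-item test
a = np.array([1.0, 1.5, 0.8])   # item discriminations a_i
b = np.array([-0.5, 0.0, 1.0])  # item difficulties b_i

p = irf_2pl(0.0, a, b)  # response probabilities at theta = 0
```

Setting all discriminations $a_i$ to 1 yields the IRF of the Rasch model discussed below.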
The Rasch model [16] is obtained from the 2PL model as a particular case in which all item discriminations equal 1. Some researchers believe that the Rasch model offers particular measurement (i.e., metrological) properties in contrast to the 2PL model (e.g., [17,18,19]). However, in our view, the Rasch model has only one advantage over the 2PL model, which is that conditional maximum likelihood estimation is applicable [20]. Moreover, the Rasch model possesses the unweighted sum score as a sufficient statistic for $\theta$, which offers many interpretational advantages [21,22,23,24,25,26]. There is a belief that group comparisons can only be conducted with the Rasch model because it has a so-called property of separability, which entails specific objective comparisons [27,28]. However, this reasoning is incorrect and can be disproved with empirical data [29]. In fact, any IRT model with invariant item parameters across groups allows for invariant group comparisons [7,30], although proponents of the Rasch model frequently claim otherwise [31,32].
If independent and identically distributed observations $x_1, \ldots, x_N$ of $N$ persons from the distribution of the random variable $X$ are available, the unknown model parameters of the IRT model in (1) can be estimated using marginal maximum likelihood (MML) estimation with an expectation–maximization algorithm [33,34].
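To make the estimation target concrete, the marginal probability in (1) can be approximated by numerical integration over $\theta$. The following sketch uses a simple normalized rectangle rule (an actual MML/EM implementation would typically use Gaussian quadrature); all parameter values are hypothetical:

```python
import numpy as np

def marginal_prob(x, a, b, mu=0.0, sigma=1.0, n_nodes=61):
    """Approximate P(X = x) in (1) by integrating the conditional likelihood
    of response pattern x against a normal density for theta."""
    theta = np.linspace(mu - 6 * sigma, mu + 6 * sigma, n_nodes)
    w = np.exp(-0.5 * ((theta - mu) / sigma) ** 2)
    w /= w.sum()  # normalized weights approximating the normal density
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))  # nodes x items
    lik = np.prod(np.where(x == 1, p, 1.0 - p), axis=1)  # conditional likelihood
    return float(w @ lik)

# Probability of the response pattern (1, 0) on a hypothetical two-item test
pr = marginal_prob(np.array([1, 0]), np.array([1.0, 1.2]), np.array([0.0, 0.5]))
```

MML estimation maximizes the product of such marginal probabilities over the $N$ observed response patterns.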
IRT models are frequently used to compare the performance of two groups in a test (i.e., on a set of items) regarding the factor variable $\theta$ in the IRT model (1). In the following, we only discuss the 2PL model. Two primary approaches can be distinguished [35]. First, concurrent calibration can be applied, in which a joint IRT model is estimated in the two groups by assuming common (i.e., invariant) item discriminations and item difficulties in the two groups. While the mean and the standard deviation of $\theta$ are fixed in the first group for identification reasons, the mean $\mu$ and the standard deviation $\sigma$ can be identified for the second group. Hence, these two parameters summarize the group difference regarding the factor variable $\theta$. Second, the 2PL model can be separately estimated in each of the two groups. This approach allows items to function differently across groups, which is a property that is referred to as differential item functioning (DIF; see [36,37,38]). In a second step, the differences in the item parameters are used to determine the group difference regarding the $\theta$ variable by means of a linking method [39,40,41]. The occurrence of DIF causes additional variability in the estimated mean $\hat{\mu}$ and standard deviation $\hat{\sigma}$ [42,43,44,45,46]. Therefore, the estimated distribution parameters $\hat{\mu}$ and $\hat{\sigma}$ depend on the choice of the selected items, even for infinite sample sizes of persons. This variability is quantified by the linking error [47,48,49,50,51,52].
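To illustrate the second, two-step approach, the following is a minimal sketch of mean–mean (MM) linking. It assumes (as a convention of this sketch, not a statement from the article) that, in the absence of DIF, the group-2 item parameters estimated on the standardized within-group metric satisfy $a_{i2} = \sigma a_{i1}$ and $b_{i2} = (b_{i1} - \mu)/\sigma$:

```python
import numpy as np

def mean_mean_linking(a1, b1, a2, b2):
    """Mean-mean (MM) linking sketch: recover the mean mu and standard
    deviation sigma of group 2 from separately estimated 2PL item
    parameters (a1, b1) of group 1 and (a2, b2) of group 2."""
    sigma = np.mean(a2) / np.mean(a1)        # ratio of mean discriminations
    mu = np.mean(b1) - sigma * np.mean(b2)   # shift of mean difficulties
    return mu, sigma
```

Under DIF, the item-wise deviations from these identities no longer vanish, and that residual item-level variability is exactly what the linking error quantifies.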
An anonymous reviewer pointed out that a simultaneous estimation that allows for group-specific DIF effects would also be possible [53,54,55,56]. In fact, the two-step procedure of separate scaling with subsequent linking can be equivalently formulated as a one-step simultaneous estimation (i.e., concurrent calibration) with nonlinear constraints on group-specific item parameters [57].
This article investigates the computation of the total uncertainty of linking methods in a general treatment based on M-estimation theory [
58]. A new bias-corrected linking error is derived, which results in the better performance of the coverage rates of constructed confidence intervals for linking estimates. In particular, it turned out that the newly proposed bias-corrected linking error estimate has a smaller bias for the linking error than the estimators currently employed in the literature.
The rest of the article is organized as follows. Section 2 formalizes the linking methods in the statistical language of estimating equations (i.e., M-estimation). Section 3 presents examples of linking methods that are subsequently investigated in two simulation studies. In Section 4, M-estimation theory is applied to compute the linking error, standard error, total error, and the newly proposed bias-corrected linking error. Then, Section 5 and Section 6 present the findings from two simulation studies. Finally, the article closes with a discussion in Section 7.
4. Estimation of Standard Error, Linking Error, and Total Error
In this section, we derive the standard error, linking error, and total error for linking estimates in the framework of M-estimation theory [58,59,76]. The treatment in this section is an extension of the material presented in [65]. Assume the item parameter estimate $\hat{\gamma}_i$ of item $i$ with an estimated variance matrix $\hat{V}_i$. Moreover, let $\hat{\gamma} = (\hat{\gamma}_1, \ldots, \hat{\gamma}_I)$ be the vector of all the item parameters with an estimated variance matrix $\hat{V}$. The corresponding population analogs of the estimators are denoted by $\gamma_i$ and $V_i$ and are effectively the estimates for an infinite sample size. As described in Section 2, the linking method provides an estimate $\hat{\delta} = (\hat{\mu}, \hat{\sigma})$ for the population parameter $\delta_0$ as a root of the estimating equation

$$ \sum_{i=1}^{I} h_i(\delta, \hat{\gamma}_i) = 0 . \quad (16) $$
4.1. Linking Error
First, we derive the linking error of the estimate $\hat{\delta}$ of the estimating equation (16) that quantifies the uncertainty in the estimate due to the selection (or randomness) of items. As is usual in M-estimation theory, we carry out a Taylor approximation of $\sum_{i=1}^{I} h_i(\hat{\delta}, \hat{\gamma}_i)$ around the true parameter $\delta_0$ (sometimes also referred to as the pseudotrue parameter; [58]):

$$ 0 = \sum_{i=1}^{I} h_i(\hat{\delta}, \hat{\gamma}_i) \approx \sum_{i=1}^{I} h_i(\delta_0, \hat{\gamma}_i) + \Big( \sum_{i=1}^{I} H_i(\delta_0, \hat{\gamma}_i) \Big) (\hat{\delta} - \delta_0) , \quad (17) $$

where $H_i = \partial h_i / \partial \delta$ denotes the matrix of partial derivatives of $h_i$ with respect to $\delta$. Moreover, it holds that $\mathbb{E} \sum_{i=1}^{I} h_i(\delta_0, \gamma_i) = 0$ because of the definition of the true parameter [58]. Hence, we obtain from (17) the following:

$$ \hat{\delta} - \delta_0 \approx - \Big( \sum_{i=1}^{I} H_i(\delta_0, \hat{\gamma}_i) \Big)^{-1} \sum_{i=1}^{I} h_i(\delta_0, \hat{\gamma}_i) . $$

M-estimation theory provides the variance matrix of $\hat{\delta}$ as the sandwich variance estimate:

$$ V_{\mathrm{LE}} = \mathrm{Var}(\hat{\delta}) \approx B^{-1} M B^{-\top} , \quad (19) $$

where

$$ B = \sum_{i=1}^{I} H_i(\delta_0, \gamma_i) \quad \text{and} \quad M = \sum_{i=1}^{I} h_i(\delta_0, \gamma_i) \, h_i(\delta_0, \gamma_i)^\top . \quad (21) $$

In (21), we used the approximate independence of the item parameters across items. In M-estimation, the matrix $B$ is called the bread matrix, and $M$ is the meat matrix. The unknown quantities in (19) can be estimated by

$$ \hat{B} = \sum_{i=1}^{I} H_i(\hat{\delta}, \hat{\gamma}_i) \quad (22) \quad \text{and} \quad \hat{M} = \sum_{i=1}^{I} h_i(\hat{\delta}, \hat{\gamma}_i) \, h_i(\hat{\delta}, \hat{\gamma}_i)^\top . $$

Hence, an estimate of the variance matrix $V_{\mathrm{LE}}$ is given by

$$ \hat{V}_{\mathrm{LE}} = \frac{I}{I-1} \, \hat{B}^{-1} \hat{M} \hat{B}^{-\top} . \quad (24) $$

The factor $I/(I-1)$ in (24) is included to correct for finite-sample bias [65,77,78,79].
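The linking error variance in (24) is a plain sandwich computation once the per-item estimating-function values and their derivatives are available. The following sketch (with generic arrays standing in for $h_i(\hat{\delta}, \hat{\gamma}_i)$ and $H_i(\hat{\delta}, \hat{\gamma}_i)$; it is not the article's R implementation) shows the computation:

```python
import numpy as np

def linking_error_variance(h, H):
    """Estimated linking error variance I/(I-1) * B^{-1} M B^{-T} as in (24).

    h: (I, p) array; row i holds h_i(delta_hat, gamma_hat_i)
    H: (I, p, p) array; H[i] holds the derivative of h_i with respect to delta
    """
    n_items = h.shape[0]
    B = H.sum(axis=0)          # bread matrix
    M = h.T @ h                # meat matrix, sum_i h_i h_i^T
    B_inv = np.linalg.inv(B)
    return n_items / (n_items - 1) * B_inv @ M @ B_inv.T
```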
4.2. Standard Error
We now compute the standard error of $\hat{\delta}$ due to the sampling of persons (see [80,81,82,83,84,85]). A Taylor approximation of $h_i(\delta_0, \hat{\gamma}_i)$ around $\gamma_i$ is carried out and results in

$$ h_i(\delta_0, \hat{\gamma}_i) \approx h_i(\delta_0, \gamma_i) + G_i(\delta_0, \gamma_i) \, (\hat{\gamma}_i - \gamma_i) \quad \text{with} \quad G_i = \partial h_i / \partial \gamma_i . $$

This allows us to compute the variance matrix in $\hat{\delta}$ due to the sampling error as follows:

$$ V_{\mathrm{SE}} = B^{-1} \Big( \sum_{i=1}^{I} G_i(\delta_0, \gamma_i) \, V_i \, G_i(\delta_0, \gamma_i)^\top \Big) B^{-\top} . \quad (29) $$

The unknown quantities in (29) can be estimated using $\hat{B}$ in (22), and we obtain the following:

$$ \hat{V}_{\mathrm{SE}} = \hat{B}^{-1} \Big( \sum_{i=1}^{I} \hat{G}_i \, \hat{V}_i \, \hat{G}_i^\top \Big) \hat{B}^{-\top} \quad \text{with} \quad \hat{G}_i = G_i(\hat{\delta}, \hat{\gamma}_i) . $$
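Analogously, the sampling (standard error) variance in (29) can be sketched as a delta-method computation; the arrays below are generic stand-ins for the derivatives and the estimated item parameter variance matrices, not output from the article's analyses:

```python
import numpy as np

def sampling_error_variance(H, G, V_items):
    """Estimated sampling error variance B^{-1} (sum_i G_i V_i G_i^T) B^{-T}
    as in (29).

    H: (I, p, p) derivatives of h_i with respect to delta
    G: (I, p, q) derivatives of h_i with respect to the item parameters gamma_i
    V_items: (I, q, q) estimated variance matrices of the gamma_hat_i
    """
    B = H.sum(axis=0)
    inner = sum(Gi @ Vi @ Gi.T for Gi, Vi in zip(G, V_items))
    B_inv = np.linalg.inv(B)
    return B_inv @ inner @ B_inv.T
```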
4.3. Total Error and Bias-Corrected Linking Error
We now compute the total uncertainty in $\hat{\delta}$ (i.e., the total error). The variance as the total error has been defined as the sum of the variances due to the sampling error and linking error, and it is written as follows (see [50,65]):

$$ \hat{V}_{\mathrm{TE}} = \hat{V}_{\mathrm{SE}} + \hat{V}_{\mathrm{LE}} . \quad (31) $$

We now derive a bias-corrected estimate of the linking error variance matrix $V_{\mathrm{LE}}$, which, in turn, allows us to compute a bias-corrected variance matrix for the linking error. The estimated meat matrix in the variance matrix for the linking error is given as

$$ \hat{M} = \sum_{i=1}^{I} h_i(\hat{\delta}, \hat{\gamma}_i) \, h_i(\hat{\delta}, \hat{\gamma}_i)^\top . \quad (32) $$

However, the linking error should only be computed based on the true item parameters $\gamma_i$ instead of the estimated item parameters $\hat{\gamma}_i$, which appear in (32). A Taylor approximation provides

$$ h_i(\hat{\delta}, \hat{\gamma}_i) \approx h_i(\hat{\delta}, \gamma_i) + \hat{G}_i \, (\hat{\gamma}_i - \gamma_i) . $$

Hence, the inflated variance contribution in $\hat{M}$ due to the sampling error can be determined as

$$ \sum_{i=1}^{I} \hat{G}_i \, \hat{V}_i \, \hat{G}_i^\top , $$

where we used the approximate independence of the item parameters across items. As a result, we compute a bias-corrected meat matrix as follows:

$$ \hat{M}_{\mathrm{bc}} = \hat{M} - \sum_{i=1}^{I} \hat{G}_i \, \hat{V}_i \, \hat{G}_i^\top . \quad (35) $$

Note that the correction term $\sum_{i=1}^{I} \hat{G}_i \hat{V}_i \hat{G}_i^\top$ in (35) corresponds to the matrix $\sum_{i=1}^{I} G_i V_i G_i^\top$ in (29) in the variance due to standard errors if the item parameters $\hat{\gamma}_i$ were uncorrelated across items. Next, a bias-corrected variance matrix due to the linking error is given as

$$ \hat{V}_{\mathrm{bcLE}} = \frac{I}{I-1} \, \hat{B}^{-1} \hat{M}_{\mathrm{bc}} \hat{B}^{-\top} , \quad (36) $$

and the variance matrix for the total error is given by

$$ \hat{V}_{\mathrm{bcTE}} = \hat{V}_{\mathrm{SE}} + \hat{V}_{\mathrm{bcLE}} . \quad (37) $$

To sum up, the variance matrix referring to the total error can be written as

$$ \hat{V}_{\mathrm{bcTE}} = \hat{B}^{-1} \Big( \sum_{i=1}^{I} \hat{G}_i \hat{V}_i \hat{G}_i^\top \Big) \hat{B}^{-\top} + \frac{I}{I-1} \, \hat{B}^{-1} \Big( \hat{M} - \sum_{i=1}^{I} \hat{G}_i \hat{V}_i \hat{G}_i^\top \Big) \hat{B}^{-\top} . $$

To obtain standard errors, linking errors, bias-corrected linking errors, total errors, and bias-corrected total errors, the square root of the diagonal elements of the corresponding matrices can be taken. In cases of negative variances for bias-corrected estimates, the corresponding linking error estimate is set to zero.
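Putting the pieces together, the bias-corrected total error in (37) subtracts the sampling contribution from the meat matrix before forming the linking error sandwich and truncates negative bias-corrected variances at zero as described above. A numerical sketch (again with generic stand-in arrays, not the article's R code):

```python
import numpy as np

def bias_corrected_total_error(h, H, G, V_items):
    """Bias-corrected total error variance as in (37):
    V_SE + I/(I-1) * B^{-1} (M - sum_i G_i V_i G_i^T) B^{-T}."""
    n_items = h.shape[0]
    B = H.sum(axis=0)
    B_inv = np.linalg.inv(B)
    M = h.T @ h                                               # uncorrected meat
    corr = sum(Gi @ Vi @ Gi.T for Gi, Vi in zip(G, V_items))  # correction term in (35)
    V_se = B_inv @ corr @ B_inv.T                             # sampling part, cf. (29)
    V_le_bc = n_items / (n_items - 1) * B_inv @ (M - corr) @ B_inv.T
    # set negative bias-corrected variances to zero, as described in the text
    np.fill_diagonal(V_le_bc, np.maximum(np.diag(V_le_bc), 0.0))
    return V_se + V_le_bc
```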
6. Simulation Study 2: Assessing Total Error in Finite Sample Sizes
In Simulation Study 2, we investigated the statistical performance of the linking estimation methods in finite samples. In this case, there was uncertainty due to the sampling of persons, which is reflected in the standard error, as well as randomness in the DIF effects, which is reflected in the linking error. Both sources of error can be summarized in the total error.
6.1. Method
The item responses were simulated according to the 2PL model for a test with $I = 10$, 20, or 40 items. The same item parameters as in Simulation Study 1 (see Section 5.1) were used. The factor variable $\theta$ was assumed to be normally distributed in both groups. As in Simulation Study 1, the mean and the standard deviation for the factor variable $\theta$ in the first group were fixed at 0 and 1, respectively. The variable $\theta$ in the second group had a mean $\mu$ and a standard deviation $\sigma$. We used the normal distribution and the scaled $t$ distribution with three degrees of freedom for the DIF effects and varied the DIF standard deviation $\tau$ as 0.25 and 0.5. Moreover, we simulated a condition of no DIF effects (i.e., $\tau = 0$). The sample sizes $N = 500$, 1000, and 2000 were chosen in order to mimic sample sizes that are typically encountered in applications of the 2PL model [8].
In contrast to Simulation Study 1, the item parameters of the 2PL model were separately estimated for the two groups in the first step using MML estimation. In the second step, the performance of the five linking methods MM, MGM, RMGM, HAE, and RHAE was studied. The estimated mean and the estimated standard deviation for the five methods were compared regarding the bias, RMSE, and relative RMSE. As in Simulation Study 1, MGM linking was used as the reference method for computing the RMSE.
In total, 5000 replications were conducted in each of the 5 (type of distribution for DIF effects combined with DIF standard deviation $\tau$) × 3 (number of items $I$) × 3 (sample size $N$) = 45 cells of the simulation.
In the analysis, we computed the median of the linking error estimate based on (24) and the median of the bias-corrected linking error estimate based on (36). Moreover, we compared the coverage rates for the estimates $\hat{\mu}$ and $\hat{\sigma}$ based on the standard error, the (uncorrected) total error based on (31), and the bias-corrected total error based on (37).
The R software [86] was used for the entire analysis in this simulation study. The 2PL model was estimated with the sirt::xxirt() function in the R package sirt [87]. As in Simulation Study 1, the R function linking_2groups_dich() was used for computing the estimates $\hat{\mu}$ and $\hat{\sigma}$ and their standard errors for the five linking methods. Replication material for this Simulation Study 2 can be found at https://osf.io/6bp3t (accessed on 29 April 2024).
6.2. Results
All five linking methods were approximately unbiased in all conditions of the simulation study.
Table 3 presents the relative RMSE as a function of the different DIF distribution types, the DIF effect standard deviation $\tau$, the number of items $I$, and the sample size $N$. As expected from the literature, the HAE method was the most efficient linking method in the condition of no DIF (i.e., $\tau = 0$). Across all conditions, the MM method had comparable performance to the MGM method regarding the estimated mean, but it was slightly more efficient for the estimated standard deviation. Efficiency gains of the RMGM method were only realized for the heavy-tailed $t$ distribution in large sample sizes. In these situations, the RHAE method outperformed the RMGM method for the estimated mean but not for the estimated standard deviation.
Table 4 presents the median of the estimated linking error and the bias-corrected linking error for the estimated mean $\hat{\mu}$. The results for part of the conditions with scaled $t$-distributed DIF effects were omitted for space reasons. In Table 4, we have also reported the estimated linking error for an infinite sample size (i.e., $N$ = Inf) that was obtained from Simulation Study 1. It can be seen that the estimated linking errors (almost always) converged to the linking error for an infinite sample size with an increasing sample size.
It turned out that the estimated linking error was positively biased, while the bias-corrected linking error was negatively biased (to a lesser extent). In particular, the median estimated linking error of 0.061 for the MM method with 10 items and a small sample size was substantially larger than the true value of 0 in the condition of no DIF (i.e., $\tau = 0$). On the other hand, the median of the estimated bias-corrected linking error was 0 in all situations in which no DIF was simulated in the item parameters. Overall, one could conclude that the bias in both linking error types can be reduced with an increasing sample size and an increasing number of items. For all linking methods except for the RMGM method, the uncorrected linking error had worse performance compared to the bias-corrected linking error. Hence, the bias-corrected linking error could be the preferred choice for a reported linking error.
Table 5 reports the median values of the estimated linking error and the bias-corrected linking error for the estimated standard deviation $\hat{\sigma}$. Overall, we observed a similar pattern of findings as in the case of the estimated mean $\hat{\mu}$. Again, the bias-corrected linking error estimates for the RMGM method were unsatisfactory. The bias in the uncorrected linking error estimates was slightly larger for the estimated standard deviation than for the estimated mean. A simple idea might be to use the mean of the two linking error estimates as another linking error in order to improve the performance of the linking error estimate.
In Table 6, the coverage rates for the estimated mean $\hat{\mu}$ are displayed. In the no-DIF condition ($\tau = 0$), the coverage rates based on the standard error performed satisfactorily. In the presence of DIF, the uncertainty in the estimated mean $\hat{\mu}$ was underestimated when the standard error was used for computing confidence intervals, thus resulting in substantial undercoverage. The confidence intervals based on the (uncorrected) total error tended to have slightly increased coverage rates. In such situations, the coverage rates for the confidence intervals based on the bias-corrected linking error were slightly better. Generally, the RMGM linking method did not have adequate coverage rates in many situations.
Finally, Table 7 reports the coverage rates for the estimated standard deviation $\hat{\sigma}$. The bias-corrected total error outperformed the uncorrected total error regarding coverage rates. In many situations, the coverage rates based on the uncorrected total error were too high. However, the RMGM method had substantial overcoverage in many conditions for confidence intervals based on both the uncorrected and the bias-corrected total error, particularly for fewer items or smaller sample sizes.
7. Discussion
In this article, we simultaneously treated standard errors and linking errors for linking methods in the 2PL model. We proposed a bias-corrected linking error estimate, which, in turn, delivers a bias-corrected total error estimate. This bias-corrected total error outperformed the usually employed total error, which is given as the sum of the variances due to the standard error and the usual uncorrected linking error. In a simulation study, it turned out that the confidence intervals for the linking parameters based on the bias-corrected total error outperformed those based on the usual total error regarding coverage rates. Moreover, the bias-corrected linking error estimate was less biased than the uncorrected linking error estimate.
As with any simulation study, our study had several limitations. First, our study only treated the 2PL model for dichotomous item responses. However, the performance of the linking estimators and their variance estimates could also be investigated for the simpler Rasch model for dichotomous item responses [16] or the generalized partial credit model for polytomous item responses [88]. Furthermore, the theory in this article could also be adapted to the chain linking of multiple groups [83,84]. In addition, the distribution types of the DIF effects in the simulation studies were restricted to the symmetric normal distribution and the scaled $t$ distribution with three degrees of freedom. Future research could focus on alternative and asymmetric distributions, such as mixture, uniform, or discrete distributions. Moreover, the factor variable $\theta$ was assumed to be normally distributed in both simulation studies. The 2PL model could also be estimated with non-normal $\theta$ distributions [10,13], which could be investigated in future studies. Next, follow-up research could focus on linking with smaller sample sizes, as well as the case of unbalanced group sizes. Furthermore, we only employed 10, 20, or 40 items in the two simulation studies. Future research could also investigate a larger number of items. We do not think that linking should be conducted with an even smaller number of items, because the group comparisons will likely become unstable in the presence of DIF, and the representativity of the link items might be questioned (but see [89]). Also, the extent of nonuniform DIF was not manipulated independently of the extent of uniform DIF in the two simulation studies. Finally, the performance of our proposed error estimates could also be applied to misspecified IRT models. For example, the 2PL model could be employed for linking if the item response data were generated from the logistic positive exponential model [90,91] or the monotonic polynomial IRT model [92,93]. All of these limitations could be addressed in future research.
As a final side note, I would like to add that a comparison of two groups regarding the distribution of the factor variable $\theta$ could also be conducted using concurrent calibration by assuming invariant (i.e., the same) item parameters across the groups. Some researchers argue that linking uncertainty is reduced by assuming invariant item parameters (see [94,95]). I think that this belief is unjustified. The variability due to item selection does not disappear simply because the variability in the model parameters is not represented in the statistical model. The computation of linking errors under the assumption of invariant item parameters in the statistical model has been worked out in Ref. [96].