Abstract
The two-parameter logistic (2PL) item response theory model is a statistical model for analyzing multivariate binary data. In this article, linking methods based on the 2PL model are used to bring two groups onto a common metric. The linking methods of mean–mean linking, mean–geometric–mean linking, and Haebara linking are investigated in nonrobust and robust specifications in the presence of differential item functioning (DIF). M-estimation theory is applied to derive linking errors for the studied linking methods. However, estimated linking errors are prone to sampling error in the estimated item parameters, thus resulting in artificially inflated linking error estimates in finite samples. For this reason, a bias-corrected linking error estimate is proposed. The usefulness of the modified linking error estimate is demonstrated in a simulation study. It is shown that a simultaneous assessment of the standard error and linking error in a total error must be conducted to obtain valid statistical inference. In the computation of the total error, using the bias-corrected linking error estimate instead of the usually employed linking error provides more accurate coverage rates.
1. Introduction
Item response theory (IRT) models [1,2,3] are multivariate statistical models for multivariate binary random variables. These models are frequently used to model cognitive testing data from educational or psychological applications. For example, IRT models are operationally utilized in educational large-scale assessments [4,5], like the Programme for International Student Assessment (PISA; [6]) study.
In this article, we only treat unidimensional IRT models [7]. Let X = (X_1, …, X_I) be the vector of I dichotomous random variables (also referred to as items or (scored) item responses). A unidimensional IRT model [8] is a statistical model for the probability distribution of X, where
where ϕ denotes the density of the normal distribution with mean μ and standard deviation σ. The distribution parameters of the latent variable θ (also referred to as the factor variable, trait, or ability) are contained in the vector δ = (μ, σ). The vector γ contains all the estimated item parameters of the item response functions (IRFs) P(X_i = 1 | θ) (i = 1, …, I). The two-parameter logistic (2PL) model [9] possesses the following IRF:
Here, a_i denotes the item discrimination, b_i the item difficulty, and Ψ the logistic distribution function. The 2PL model can also be estimated for non-normal distributions of θ [10,11,12,13,14,15]. In this case, an identification constraint is typically applied to one item (e.g., fixing its discrimination and difficulty).
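To make the IRF concrete, the following Python sketch (illustrative only; the article's own analyses use R) evaluates the 2PL response probability:

```python
import math

def irf_2pl(theta, a, b):
    """2PL item response function: P(X_i = 1 | theta) = Psi(a_i * (theta - b_i)),
    with Psi the logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

At theta = b, the response probability is exactly 0.5, and a larger discrimination a makes the curve steeper around b.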
The Rasch model [16] is obtained from the 2PL model as a particular case in which all item discriminations equal 1. Some researchers believe that the Rasch model offers particular measurement (i.e., metrological) properties in contrast to the 2PL model (e.g., [17,18,19]). However, in our view, the Rasch model has only one advantage over the 2PL model, which is that conditional maximum likelihood estimation is applicable [20]. Moreover, the Rasch model possesses the unweighted sum score as a sufficient statistic for θ, which offers many interpretational advantages [21,22,23,24,25,26]. There is a widespread belief that group comparisons can only be conducted with the Rasch model because it has a so-called property of separability, which entails specific objective comparisons [27,28]. However, this reasoning is incorrect and can be disproved by empirical data [29]. In fact, any IRT model with invariant item parameters across groups allows for invariant group comparisons [7,30], although proponents of the Rasch model frequently claim otherwise [31,32].
If independent and identically distributed observations of N persons from the distribution of the random variable X are available, the unknown model parameters of the IRT model in (1) can be estimated by marginal maximum likelihood (MML) using an expectation–maximization algorithm [33,34].
IRT models are frequently used to compare the performance of two groups on a test (i.e., on a set of items) regarding the factor variable θ in the IRT model (1). In the following, we only discuss the 2PL model. Two primary approaches can be distinguished [35]. First, concurrent calibration can be applied, in which a joint IRT model is estimated in the two groups by assuming common (i.e., invariant) item discriminations and item difficulties in the two groups. While the mean and the standard deviation of θ are fixed in the first group for identification reasons, the mean μ and the standard deviation σ can be identified for the second group. Hence, these two parameters summarize the group difference regarding the factor variable θ. Second, the 2PL model can be separately estimated in each of the two groups. This approach allows items to function differently across groups, a property that is referred to as differential item functioning (DIF; see [36,37,38]). In the second step, the differences in item parameters are used to determine the group difference regarding θ by means of a linking method [39,40,41]. The occurrence of DIF causes additional variability in the estimated mean and standard deviation [42,43,44,45,46]. Therefore, the estimated distribution parameters depend on the choice of selected items, even for infinite sample sizes of persons. This variability is quantified in the linking error [47,48,49,50,51,52].
An anonymous reviewer pointed out that a simultaneous estimation, while allowing for group-specific DIF effects, would also be possible [53,54,55,56]. In fact, the two-step procedure of separate scaling with subsequent linking can be equivalently formulated as a one-step simultaneous estimation (i.e., concurrent calibration) with nonlinear constraints on group-specific item parameters [57].
This article investigates the computation of the total uncertainty of linking methods in a general treatment based on M-estimation theory [58]. A new bias-corrected linking error is derived, which results in the better performance of the coverage rates of constructed confidence intervals for linking estimates. In particular, it turned out that the newly proposed bias-corrected linking error estimate has a smaller bias for the linking error than the estimators currently employed in the literature.
The rest of the article is organized as follows. Section 2 formalizes the linking methods in the statistical language of estimating equations (i.e., M-estimation). Section 3 presents examples of linking methods that are subsequently investigated in two simulation studies. In Section 4, M-estimation theory is applied to compute the linking error, standard error, total error, and the newly proposed bias-corrected linking error. Then, Section 5 and Section 6 present the findings from two simulation studies. Finally, the article closes with a discussion in Section 7.
2. Linking Method
In this section, we formally define linking methods as a particular M-estimation [58,59] problem. We refer to Section 3 for examples of linking methods. Throughout this paper, it is assumed that γ̂_i contains the estimated item parameters from the 2PL model for item i in both groups. One can assume that there exists a true item parameter γ_i in the population (i.e., for an infinite sample size). The quantity γ̂_i is a consistent estimate of γ_i (under the scheme N → ∞). The goal of a linking method consists of estimating the mean μ and the standard deviation σ in the second group based on all the estimated item parameters γ̂ = (γ̂_1, …, γ̂_I).
2.1. One-Step Linking Method
In a one-step linking method, the parameter estimate α̂ = (μ̂, σ̂) is obtained as the minimizer of a nonlinear optimization function H that is a sum of terms, in which each term refers to a single item. Formally, we define
Now, for brevity, we define the partial derivatives of H with respect to the components of α = (μ, σ). The parameter estimate α̂ solves the following estimating equation:
M-estimation theory is applied for computing a variance estimate for α̂ under the assumption that the item-wise terms are independent realizations from a common distribution [58]. In this sense, the uncertainty of α̂ regarding the selected set of items is quantified.
2.2. Two-Step Linking Method
In a two-step linking method, the standard deviation σ is estimated in the first step. Afterward, the mean μ is estimated in the second step. The estimate σ̂ of the standard deviation is determined as a root of the following nonlinear equation (that is, one that is additive in items).
In the second step, the estimate is obtained as the root of the estimating equation
Note that the two-step method can be alternatively considered as a one-step linking method, because the estimate solves the stacked estimating equation (see [58]).
Hence, one-step and two-step linking methods can be simultaneously analyzed using M-estimation theory.
2.3. Statistical Inference
This article is concerned with estimating the uncertainty of the estimated mean μ̂ and standard deviation σ̂ that are contained in the vector α̂. The uncertainty is due to the sampling (or selection) of persons (i.e., under the scheme N → ∞) and the selection (or sampling/modeling of the randomness in group comparisons) of items (i.e., under the scheme I → ∞).
Most of the linking literature treats the uncertainty in α̂ by computing a standard error (SE) for a fixed number of items [60,61,62]. In this case, the variability in α̂ exists because there is sampling variability in the estimated item parameters γ̂.
We have argued that the estimated item parameters have a population analogue for an infinite sample size. If there is a model error (i.e., DIF), the estimated linking parameter depends on the chosen set of items, even for an infinite sample size. This variability due to item selection is referred to as the linking error (LE; see [48,63]), and the corresponding variance estimation operates under the scheme I → ∞.
The total error (TE) includes both sources of uncertainty: the standard error due to the randomness in persons and the linking error due to the randomness in items [47,50,51,64]. However, it has been argued that the ordinary linking error estimate is partly affected by sampling error [65]. In this article, a bias-corrected linking error, which results in a bias-corrected total error estimate, is examined in order to reduce the portion of the estimated linking error variance that is due to sampling error. We outline the statistical underpinnings of the estimators in Section 4.
3. Robust and Nonrobust Linking Methods
In this section, we discuss the most frequently employed linking methods for the 2PL model.
3.1. Mean–Mean Linking (MM)
The mean–mean (MM) linking method is a two-step linking method [40]. The standard deviation is estimated in the first step as the ratio of the means of the item discriminations in the two groups.
In the second step, the mean is estimated by
The estimate can be written as the solution of the estimating equations
The MM linking method is considered nonrobust to outliers in the item parameter differences between groups, because the mean is an outlier-sensitive location measure.
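As an illustration, the two steps of MM linking can be sketched in a few lines of Python (not the authors' R implementation; the identification convention a2_i = a_i·σ and b2_i = (b_i − μ)/σ for the separately scaled second group is an assumption of this sketch):

```python
def mean_mean_linking(a1, b1, a2, b2):
    """Mean-mean linking of group 2 onto the metric of group 1.
    Assumes the identification a2_i = a_i * sigma and b2_i = (b_i - mu) / sigma
    for the separately scaled group-2 parameters (one common convention).
    Step 1: sigma as the ratio of the mean discriminations.
    Step 2: mu from the mean difference of rescaled difficulties."""
    n_items = len(a1)
    sigma = sum(a2) / sum(a1)                  # ratio of means (equal item count cancels)
    mu = sum(b1) / n_items - sigma * (sum(b2) / n_items)
    return mu, sigma
```

In the absence of DIF, the two steps recover μ and σ exactly.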
3.2. Mean–Geometric–Mean Linking (MGM and RMGM)
The mean–geometric–mean (MGM) linking [40] (also referred to as log–mean–mean linking in [66]) is another two-step linking method that estimates σ as the ratio of the geometric means of the item discriminations. Using a general loss function ρ, the standard deviation is estimated in the first step as
where ρ_ε is a differentiable approximation of the loss function ρ(x) = |x|^p for a sufficiently small ε > 0 (see [67]). The ordinary mean is obtained using the loss function with p = 2. The median as a location measure can be approximately obtained for p = 1. The choice p = 0.5 is advantageous for asymmetrically distributed error distributions for DIF effects [69] and appears in invariance alignment [68,70]. In the second step, the mean μ is estimated by
By defining , the solution in MGM linking is given as the root of the estimating equations as follows:
MGM linking in a nonrobust variant is defined with p = 2, while robust MGM (abbreviated as RMGM) linking is defined by choosing p = 0.5 in the loss function ρ_ε.
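The smoothed loss and the first (σ) step of MGM/RMGM linking can be sketched as follows (illustrative Python; the ε-smoothed loss ρ_ε(x) = (x² + ε)^(p/2) and the plain grid-search minimizer are choices of this sketch, not the authors' implementation):

```python
import math

def rho(x, p=0.5, eps=1e-4):
    """Differentiable approximation of |x|^p: rho_eps(x) = (x*x + eps)**(p/2)."""
    return (x * x + eps) ** (p / 2.0)

def mgm_sigma(a1, a2, p=2.0, eps=1e-4):
    """Generalized MGM first step: log(sigma) minimizes
    sum_i rho(log a2_i - log a1_i - log sigma).
    p = 2 gives ordinary MGM (a geometric-mean ratio); p = 0.5 gives robust RMGM.
    A plain grid search over log(sigma) is used purely for illustration."""
    d = [math.log(x2) - math.log(x1) for x1, x2 in zip(a1, a2)]
    lo, hi = min(d), max(d) + 1e-9
    grid = [lo + (hi - lo) * k / 10000 for k in range(10001)]
    best = min(grid, key=lambda s: sum(rho(di - s, p, eps) for di in d))
    return math.exp(best)
```

With one outlying discrimination ratio, the robust p = 0.5 estimate stays close to the majority ratio, while the p = 2 estimate is pulled toward the outlier.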
3.3. Symmetric Haebara Linking (HAE and RHAE)
Haebara linking is a one-step linking method [71] that relies on aligning IRFs to determine μ and σ. Using a discrete grid of points θ_t (t = 1, …, T) and weights w_t, the mean μ and the standard deviation σ are estimated by minimizing a weighted distance between the IRFs, which is shown as follows:
where ρ_ε again defines a differentiable approximation of the loss function. For example, the grid can be equidistantly chosen between −6 and 6, and the weights w_t could be proportional to the values of the density of a normal distribution with a standard deviation of 2. Some researchers alternatively prefer to set all weights equal to 1. The nonrobust loss function was originally proposed by Haebara [71] and is widely used. The robust loss function was proposed in Refs. [72,73]. The general robust loss function in Haebara linking with power p was utilized in [74]. The estimating equations corresponding to (14) are obtained by computing the partial derivatives of H with respect to μ and σ.
The linking function H in (14) is referred to as asymmetric Haebara linking, because it aligns the IRF of the first group with the IRF of the second group. This kind of asymmetry induces non-negligibly biased estimates for the standard deviation and, to a smaller extent, for the mean [66]. To this end, symmetric Haebara linking [75] has been proposed to simultaneously align the IRFs of both groups and to reduce the bias [66]. The linking function of symmetric Haebara linking is defined by
The estimate α̂ is obtained by minimizing H in (15) with respect to α. Nonrobust symmetric Haebara (HAE) linking is obtained with the nonrobust choice of the loss function (p = 2), while robust symmetric Haebara (RHAE) linking is obtained with the robust choice (p = 0.5).
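A sketch of the asymmetric Haebara criterion follows (the symmetric version adds the analogous term with the roles of the two groups exchanged). The quadrature grid, the weights, and the metric transformation θ = σθ₂ + μ are assumptions of this illustration, not the authors' code:

```python
import math

def psi(x):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))

def haebara_loss(mu, sigma, items, theta_grid, weights, p=2.0, eps=1e-4):
    """Asymmetric Haebara criterion: weighted rho_eps-distance between the
    group-1 IRF and the group-2 IRF mapped onto the group-1 metric via
    theta = sigma * theta_2 + mu.  items holds (a1, b1, a2, b2) tuples."""
    total = 0.0
    for a1, b1, a2, b2 in items:
        for t, w in zip(theta_grid, weights):
            p1 = psi(a1 * (t - b1))
            p2 = psi(a2 * ((t - mu) / sigma - b2))  # group-2 IRF on the group-1 metric
            total += w * ((p1 - p2) ** 2 + eps) ** (p / 2.0)
    return total
```

When the group-2 parameters are exact transformations of the group-1 parameters (no DIF), the loss attains its minimum at the true (μ, σ).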
4. Estimation of Standard Error, Linking Error, and Total Error
In this section, we derive the standard error, linking error, and total error for linking estimates in the framework of M-estimation theory [58,59,76]. The treatment in this section is an extension of the material presented in [65]. Assume the item parameter estimate γ̂_i with an estimated variance matrix V̂_i. Moreover, let γ̂ be the vector of all the item parameters with an estimated variance matrix V̂. The corresponding population analogs of the estimators are denoted by γ_i and γ and are effectively the estimates for an infinite sample size. As described in Section 2, the linking method provides an estimate α̂ for the population parameter α as a root of the estimating equation.
4.1. Linking Error
First, we derive the linking error of the estimate α̂ of the estimating equation (16), which quantifies the uncertainty in the estimate due to the selection (or randomness) of items. As is usual in M-estimation theory, we carry out a Taylor approximation of the estimating function around the true parameter α (sometimes also referred to as the pseudotrue parameter; [58])
where the gradient term denotes the vector of partial derivatives of the estimating function with respect to α. Moreover, the estimating function has expectation zero at α because of the definition of the true parameter [58]. Hence, we obtain from (17) the following:
M-estimation theory provides the variance matrix of α̂ as the sandwich variance estimate:
In (21), we used the approximate independence of the item parameters across items. In M-estimation, the outer matrix is called the bread matrix, and the inner matrix is the meat matrix. The unknown quantities in (19) can be estimated by
Hence, an estimate of the variance matrix is given by
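The sandwich formula V = A⁻¹MA⁻ᵀ can be illustrated with a dependency-free 2×2 implementation (a generic sketch only; the per-item score vectors and the bread matrix would come from the chosen linking method):

```python
def sandwich_variance(bread, scores):
    """Sandwich variance V = A^{-1} M A^{-T} for an M-estimator solving
    sum_i psi_i(alpha) = 0: the bread A sums the derivatives of the item-wise
    estimating functions, and the meat M = sum_i psi_i psi_i^T uses the
    approximate independence across items.  Written out for the 2x2 case
    corresponding to alpha = (mu, sigma)."""
    (a11, a12), (a21, a22) = bread
    det = a11 * a22 - a12 * a21
    ainv = [[a22 / det, -a12 / det], [-a21 / det, a11 / det]]
    meat = [[sum(s[r] * s[c] for s in scores) for c in range(2)] for r in range(2)]
    am = [[sum(ainv[r][k] * meat[k][c] for k in range(2)) for c in range(2)]
          for r in range(2)]
    # multiply by the transpose of A^{-1} on the right
    return [[sum(am[r][k] * ainv[c][k] for k in range(2)) for c in range(2)]
            for r in range(2)]
```

With an identity bread matrix, the sandwich reduces to the meat matrix; scaling the bread by 2 scales the variance by 1/4.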
4.2. Standard Error
We now compute the standard error of α̂ due to the sampling of persons (see [80,81,82,83,84,85]). A Taylor approximation of the estimating function around the true item parameters γ is carried out and results in
We can now use
This allows us to compute the variance matrix of α̂ due to the sampling error as follows:
4.3. Total Error and Bias-Corrected Linking Error
We now compute the total uncertainty in (i.e., the total error). The variance as the total error has been defined as the sum of the variances due to the sampling error and linking error, and it is written as follows (see [50,65]):
We now derive a bias-corrected estimate of the linking error variance matrix , which, in turn, allows us to compute a bias-corrected variance matrix for the linking error. The estimated meat matrix in the variance matrix for the linking error is given as
However, the linking error should only be computed based on the true item parameters γ instead of the estimated item parameters γ̂ that appear in (32). A Taylor approximation provides
Hence, the inflated variance contribution in the meat matrix due to the sampling error can be determined as
where we used the approximate independence of the item parameters across items. As a result, we compute a bias-corrected meat matrix as follows:
Note that the correction term in (35) corresponds to the matrix in (29) in the variance due to standard errors if the item parameters were uncorrelated across items. Next, a bias-corrected variance matrix due to linking error is given as
and the variance matrix for the total error is given by
To sum up, the variance matrix referring to the total error can be written as
To obtain standard errors, linking errors, bias-corrected linking errors, total errors and bias-corrected total errors, the square root of the diagonal elements of the corresponding matrices can be taken. In cases of negative variances for bias-corrected estimates, a corresponding linking error estimate is set to zero.
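The error composition of this section can be summarized in a small helper (a sketch; the variance components would come from the estimators above, and the truncation at zero implements the rule for negative bias-corrected variances):

```python
import math

def total_errors(se_var, le_var, correction_var):
    """Combine variance components: TE^2 = SE^2 + LE^2, and a bias-corrected
    linking error variance LE_bc^2 = max(LE^2 - c, 0), where c is the estimated
    sampling-error contribution to the linking error variance.
    Returns (LE_bc, TE, TE_bc) on the standard-deviation scale."""
    le_bc_var = max(le_var - correction_var, 0.0)  # negative variances truncated at zero
    return (math.sqrt(le_bc_var),
            math.sqrt(se_var + le_var),
            math.sqrt(se_var + le_bc_var))
```

The bias-corrected total error is never larger than the uncorrected one, and it collapses to the plain standard error when the whole estimated linking error variance is attributed to sampling error.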
5. Simulation Study 1: Assessing Linking Error in Infinite Sample Size
In Simulation Study 1, the validity of the variance estimates based on M-estimation (see Section 4) for the estimated mean μ̂ and the estimated standard deviation σ̂ was investigated for an infinite sample size of persons.
5.1. Method
In this study, only item parameters were simulated in each replication. No item responses were simulated, because the case of an infinite sample size N was investigated. The 2PL model was used as the IRF in the IRT model. There were two groups. For identification reasons, the mean and the standard deviation of the factor variable in the first group were set to 0 and 1, respectively. The mean μ and the standard deviation σ of the second group parametrized the group differences. Throughout the simulation, the values of μ and σ were held fixed.
We simulated item parameters for I = 10, 20, and 40 items. The group-specific item parameters relied on base item parameters that were fixed in the simulation and a random DIF effect that was simulated in each replication of the simulation study. The base item discriminations in the case of I = 10 items were chosen as 0.73, 1.25, 1.20, 1.47, 0.97, 1.38, 1.05, 1.14, 1.15, and 0.67. The base item difficulties were chosen as −1.31, 1.44, −1.20, 0.10, 0.10, −0.74, 1.48, −0.61, 0.82, and −0.07. The same item parameters were also chosen in [65]. For item numbers that are multiples of 10, we duplicated the item parameters of the 10 items accordingly. The group-specific item difficulties were simulated as
where e_i is a random DIF effect. Note that e_i parametrizes a uniform DIF effect as the difference in group-specific item difficulties. Group-specific item discriminations were simulated as
where f_i is another random DIF effect. The nonuniform DIF effect can be computed as the difference of the logarithms of the item discriminations. In the simulation, we assumed that e_i and f_i were uncorrelated. Both DIF effects had zero means, with standard deviation τ_b for the effects on item difficulties and τ_a for the effects on (logarithmized) item discriminations. Two distributions of e_i and f_i were specified: a normal distribution or a scaled t distribution with three degrees of freedom (denoted as t3). In this simulation study, we varied the DIF standard deviation τ_b for item difficulties as 0.25 and 0.50. According to the definition, the respective DIF standard deviations τ_a for (logarithmized) item discriminations were 0.075 and 0.15, respectively.
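The DIF-generation scheme of this study can be sketched as follows (Python rather than the authors' R; the identification convention for the group-2 parameters and the rescaling of the t variate to a target standard deviation are assumptions of this sketch):

```python
import math
import random

def simulate_group2_params(base_a, base_b, mu, sigma, tau_b, tau_a, df=None, rng=random):
    """Simulate identified group-2 item parameters with random DIF effects.
    Convention assumed here: b2_i = (b_i + e_i - mu) / sigma and
    a2_i = a_i * exp(f_i) * sigma, with uniform DIF e_i (SD tau_b) and
    nonuniform DIF f_i (SD tau_a).  df=None draws normal DIF effects; an
    integer df > 2 draws from a t distribution scaled to the same SD."""
    def dif(tau):
        if df is None:
            return rng.gauss(0.0, tau)
        z = rng.gauss(0.0, tau) * math.sqrt((df - 2) / df)  # rescale so the SD stays tau
        chi = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
        return z / math.sqrt(chi / df)
    b2 = [(b + dif(tau_b) - mu) / sigma for b in base_b]
    a2 = [a * math.exp(dif(tau_a)) * sigma for a in base_a]
    return a2, b2
```

With both DIF standard deviations set to zero, the group-2 parameters are exact transformations of the base parameters, which is the no-DIF condition.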
Five different linking methods were utilized to estimate the mean μ and the standard deviation σ: mean–mean (MM) linking, mean–geometric–mean (MGM) linking, robust mean–geometric–mean (RMGM) linking, symmetric Haebara (HAE) linking, and robust symmetric Haebara (RHAE) linking. The linking methods rely on the estimated item discriminations and item difficulties. For an infinite sample size, these identified item parameters are given as
In each of the 2 (type of distribution) × 2 (DIF standard deviation τ_b) × 3 (number of items I) = 12 cells of the simulation, 5000 replications were conducted. We computed the bias and root mean square error (RMSE) for the estimated mean μ̂ and the estimated standard deviation σ̂. A relative percentage RMSE was computed as the ratio of the RMSE values of a particular linking method and those of the chosen reference method of MGM linking. We also assessed the coverage rate for μ̂ and σ̂ at the 95% confidence level based on the normal distribution as the percentage of events in which an estimated confidence interval contained the true value μ or σ, respectively.
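The coverage criterion can be computed from the replication results as follows (an illustrative helper; the function name is hypothetical and not taken from the replication material):

```python
def coverage_rate(estimates, errors, true_value, z=1.96):
    """Percentage of replications whose normal-approximation confidence
    interval [est - z*err, est + z*err] contains the true value."""
    hits = sum(1 for est, err in zip(estimates, errors)
               if est - z * err <= true_value <= est + z * err)
    return 100.0 * hits / len(estimates)
```

A well-calibrated error estimate yields a coverage rate close to the nominal 95%.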
The R software (Version 4.3.0; [86]) was used for the entire analysis in this simulation study. We wrote an R function linking_2groups_dich() that allows for the computation of estimates and their standard errors of any user-defined linking method. The function and replication material for this Simulation Study 1 can be found at https://osf.io/6bp3t (accessed on 29 April 2024).
5.2. Results
It turned out that all five linking methods resulted in approximately unbiased estimates for the mean μ and the standard deviation σ. Table 1 displays the relative RMSE for the estimates as a function of the DIF effect distribution type, the DIF standard deviation τ_b, and the number of items I. The MM linking method performed similarly to the MGM method with respect to the mean μ (i.e., the MM method had relative RMSE values close to 100; that is, close to those of the MGM method) but was slightly less efficient with respect to the standard deviation σ, particularly for DIF effects that followed the t3 distribution. The RMGM linking method had substantial efficiency losses for normally distributed DIF effects (i.e., it had RMSE values much larger than 100). On the other hand, the RMGM method provided large efficiency gains compared to the MGM method for the heavy-tailed t3 distribution (i.e., it had RMSE values much smaller than 100). Note that the RHAE method outperformed the HAE method for DIF effects with a t3 distribution when analyzing the RMSE for the estimated mean μ̂. However, RHAE (or HAE) should not be preferred over RMGM (or MGM) with respect to the RMSE of the estimated standard deviation σ̂, because it had RMSE values much larger than 100.
Table 1.
Simulation Study 1: Relative RMSE for the estimated mean and for the estimated standard deviation as a function of different DIF distributions, DIF standard deviation , and number of items I for an infinite sample size.
Table 2 reports the coverage rates for μ̂ and σ̂ for the five linking methods. Coverage rates within the interval [92.5, 97.5] indicate acceptable performance. Overall, the coverage rates based on M-estimation theory performed satisfactorily for at least 20 items, except for the RMGM linking method, which resulted in undercoverage in some conditions (i.e., coverage rates much lower than 92.5). In these cases, the estimated linking errors were, on average, too small compared to the standard deviation of the estimates across replications in this simulation study. As expected, the coverage rates improved with an increasing number of items I.
Table 2.
Simulation Study 1: Coverage rate at 95% confidence level for the estimated mean and for the estimated standard deviation as a function of different DIF distributions, DIF standard deviation , and number of items I for an infinite sample size.
6. Simulation Study 2: Assessing Total Error in Finite Sample Sizes
In Simulation Study 2, we investigated the statistical performance of the linking estimation methods in finite samples. In this case, there was uncertainty due to the sampling of persons, which is reflected in the standard error, and randomness in the DIF effects, which is reflected in the linking error. Both sources of error can be summarized in the total error.
6.1. Method
The item responses were simulated according to the 2PL model for a test with 10, 20, or 40 items. The same item parameters as in Simulation Study 1 (see Section 5.1) were used. The factor variable was assumed to be normally distributed in both groups. Like in Simulation Study 1, the mean and the standard deviation of the factor variable in the first group were fixed at 0 and 1, respectively. The variable in the second group had a mean μ and a standard deviation σ. We used the normal distribution and the scaled t3 distribution for DIF effects and varied the DIF standard deviation τ_b as 0.25 and 0.5. Moreover, we simulated a condition of no DIF effects (i.e., τ_b = 0). Three sample sizes N (including 1000 and 2000) were chosen in order to mimic sample sizes that are typical in applications of the 2PL model [8].
In contrast to Simulation Study 1, the item parameters of the 2PL model were separately estimated for the two groups in the first step using MML estimation. In the second step, the performance of the five linking methods MM, MGM, RMGM, HAE, and RHAE was studied. The estimated mean and the estimated standard deviation for the five methods were compared regarding the bias, RMSE, and relative RMSE. As in Simulation Study 1, MGM linking was used as the reference method for computing the RMSE.
In total, 5000 replications were conducted in each of the 5 (type of distribution for DIF effects combined with DIF standard deviation ) × 3 (number of items I) × 3 (sample size N) = 45 cells of the simulation.
In the analysis, we computed the median of the linking error estimate based on (24) and the median of the bias-corrected linking error estimate based on (36). Moreover, we compared the coverage rates for the estimates μ̂ and σ̂ based on the standard error, the (uncorrected) total error based on (31), and the bias-corrected total error based on (37).
The R software [86] was used for the entire analysis in this simulation study. The 2PL model was estimated with the sirt::xxirt() function in the R package sirt [87]. As in Simulation Study 1, the R function linking_2groups_dich() was used for computing the estimates and and their standard errors for the five linking methods. Replication material for this Simulation Study 2 can be found at https://osf.io/6bp3t (accessed on 29 April 2024).
6.2. Results
All five linking methods were approximately unbiased in all conditions of the simulation study. Table 3 presents the relative RMSE as a function of the DIF distribution type, the DIF effect standard deviation τ_b, the number of items I, and the sample size N. As expected from the literature, the HAE method was the most efficient linking method in the condition of no DIF (i.e., τ_b = 0). Across all conditions, the MM method had comparable performance to the MGM method regarding the mean, but it was slightly more efficient for the standard deviation. Efficiency gains of the RMGM method were only realized for the heavy-tailed t3 distribution in large sample sizes. In these situations, the RHAE method outperformed the RMGM method for the estimated mean but not for the estimated standard deviation.
Table 3.
Simulation Study 2: Relative RMSE for the estimated mean and for the estimated standard deviation as a function of different DIF distributions, DIF standard deviation , number of items I, and sample size N.
Table 4 presents the median of the estimated linking error and the bias-corrected linking error for the estimated mean μ̂. The results for the DIF effects with a t3 distribution and one of the two DIF standard deviations were omitted for space reasons. In Table 4, we also report the estimated linking error for an infinite sample size (i.e., Inf) that was obtained from Simulation Study 1. It can be seen that the estimated linking errors (almost always) converged to the linking error for an infinite sample size with an increasing sample size.
Table 4.
Simulation Study 2: Median of the estimated linking error and the estimated bias-corrected linking error for the estimated mean μ̂ as a function of different DIF distributions, DIF standard deviation τ_b, number of items I, and sample size N.
It turned out that the estimated linking error was positively biased, while the bias-corrected linking error was negatively biased (to a lesser extent). In particular, the median estimated linking error of 0.061 for the MM method for 10 items and the smallest sample size was substantially larger than the true value of 0 in the condition of no DIF (i.e., τ_b = 0). On the other hand, the median of the estimated bias-corrected linking error was 0 in all situations in which no DIF was simulated in the item parameters. Overall, one can conclude that the bias in both linking error estimates is reduced with an increasing sample size and an increasing number of items. For all linking methods except the RMGM method, the uncorrected linking error performed worse than the bias-corrected linking error. Hence, the bias-corrected linking error could be the preferred choice for a reported linking error.
Table 5 reports the median values of the estimated linking error and the bias-corrected linking error for the estimated standard deviation σ̂. Overall, we observed a similar pattern of findings as in the case of the estimated mean μ̂. Again, the bias-corrected linking error estimates for the RMGM method were unsatisfactory. The bias in the uncorrected linking error estimates was slightly larger for σ̂ than for μ̂. A simple idea to improve performance might be to use the mean of the uncorrected and bias-corrected linking error estimates as yet another linking error estimate.
Table 5.
Simulation Study 2: Median of the estimated linking error and the estimated bias-corrected linking error for the estimated standard deviation σ̂ as a function of different DIF distributions, DIF standard deviation τ_b, number of items I, and sample size N.
In Table 6, the coverage rates for the estimated mean μ̂ are displayed. In the no-DIF condition (τ_b = 0), the coverage rates based on the standard error performed satisfactorily. In the presence of DIF, the uncertainty in the estimated mean was underestimated when the standard error was used for computing confidence intervals, thus resulting in substantial undercoverage. The confidence intervals based on the total error tended to have slightly inflated coverage rates. In such situations, the coverage rates for the confidence intervals based on the bias-corrected total error were slightly better. Generally, the RMGM linking method did not have adequate coverage rates in many situations.
Table 6.
Simulation Study 2: Coverage rate at 95% confidence level based on the standard error , total error , and bias-corrected total error for the estimated mean as a function of different DIF distributions, DIF standard deviation , number of items I, and sample size N.
Finally, Table 7 reports the coverage rates for the estimated standard deviation σ̂. The bias-corrected total error outperformed the uncorrected total error regarding coverage rates. In many situations, the coverage rates based on the uncorrected total error were too high. However, the RMGM method had substantial overcoverage in many conditions for confidence intervals based on both total error variants, particularly for fewer items or smaller sample sizes.
Table 7.
Simulation Study 2: Coverage rate at 95% confidence level based on the standard error , total error , and bias-corrected total error for the estimated standard deviation as a function of different DIF distributions, DIF standard deviation , number of items I, and sample size N.
7. Discussion
In this article, we simultaneously treated standard errors and linking errors for linking methods in the 2PL model. We proposed a bias-corrected linking error estimate, which delivers a bias-corrected total error estimate. This bias-corrected total error outperformed the usually employed total error, which is given as the simple sum of the variances due to the standard error and the usual uncorrected linking error. In a simulation study, it turned out that the confidence intervals for the linking parameters based on the bias-corrected total error outperformed those based on the usual total error regarding coverage rates. Moreover, the bias-corrected linking error estimate was less biased than the uncorrected linking error estimate.
As with any simulation study, our study had several limitations. First, it only treated the 2PL model for dichotomous item responses. The performance of the linking estimators and their variance estimates could also be investigated for the simpler Rasch model for dichotomous item responses [16] or the generalized partial credit model for polytomous item responses [88]. Furthermore, the theory in this article could be adapted to the chain linking of multiple groups [83,84]. In addition, the distributions of the DIF effects in the simulation studies were restricted to the symmetric normal distribution and the t distribution with three degrees of freedom. Future research could focus on alternative and asymmetric distributions such as mixture, uniform, or discrete distributions. Moreover, the factor variable was assumed to be normally distributed in both simulation studies. The 2PL model can also be estimated with non-normal distributions [10,13], which could be investigated in future studies. Next, follow-up research could address linking with smaller sample sizes, as well as unbalanced group sizes. Furthermore, we only employed 10, 20, or 40 items in the two simulation studies. Future research could also investigate a larger number of items. We do not think that linking should be conducted with an even smaller number of items, because the group comparisons would likely become unstable in the presence of DIF, and the representativeness of the link items might be questioned (but, see [89]). Also, the extent of nonuniform DIF was not manipulated independently of the extent of uniform DIF in the two simulation studies. Finally, the performance of our proposed error estimates could also be assessed under misspecified IRT models.
For example, the 2PL model could be employed for linking if the item response data were generated from the logistic positive exponent model [90,91] or a monotonic polynomial IRT model [92,93]. All of these limitations could be addressed in future research.
As a final side note, a comparison of two groups regarding the distribution of the factor variable could also be conducted using concurrent calibration, which assumes invariant (i.e., identical) item parameters across the groups. Some researchers argue that linking uncertainty is reduced by assuming invariant item parameters (see [94,95]). I think this belief is unjustified: the variability due to item selection does not disappear merely because it is not represented in the statistical model. The computation of linking errors under the assumption of invariant item parameters has been worked out in Ref. [96].
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Conflicts of Interest
The author declares no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| 2PL | two-parameter logistic |
| HAE | Haebara |
| IRF | item response function |
| IRT | item response theory |
| LE | linking error |
| MGM | mean–geometric–mean |
| MM | mean–mean |
| MML | marginal maximum likelihood |
| RHAE | robust Haebara |
| RMGM | robust mean–geometric–mean |
| SE | standard error |
| TE | total error |
References
- Chen, Y.; Li, X.; Liu, J.; Ying, Z. Item response theory—A statistical framework for educational and psychological measurement. Stat. Sci. 2023; epub ahead of print. Available online: https://imstat.org/journals-and-publications/statistical-science/statistical-science-future-papers/ (accessed on 29 April 2024).
- Bock, R.D.; Moustaki, I. Item response theory in a general framework. In Handbook of Statistics, Vol. 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Elsevier: Amsterdam, The Netherlands, 2007; pp. 469–513. [Google Scholar] [CrossRef]
- Bock, R.D.; Gibbons, R.D. Item Response Theory; Wiley: Hoboken, NJ, USA, 2021. [Google Scholar] [CrossRef]
- Rutkowski, L.; von Davier, M.; Rutkowski, D. (Eds.) A Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis; Chapman Hall/CRC Press: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
- Berezner, A.; Adams, R.J. Why large-scale assessments use scaling and item response theory. In Implementation of Large-Scale Education Assessments; Lietz, P., Cresswell, J.C., Rust, K.F., Adams, R.J., Eds.; Wiley: New York, NY, USA, 2017; pp. 323–356. [Google Scholar] [CrossRef]
- OECD. PISA 2018. Technical Report; OECD: Paris, France, 2020; Available online: https://bit.ly/3zWbidA (accessed on 29 April 2024).
- van der Linden, W.J. Unidimensional logistic response models. In Handbook of Item Response Theory, Volume 1: Models; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 11–30. [Google Scholar] [CrossRef]
- Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154. [Google Scholar]
- Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479. [Google Scholar]
- Bartolucci, F. A class of multidimensional IRT models for testing unidimensionality and clustering items. Psychometrika 2007, 72, 141–157. [Google Scholar] [CrossRef]
- Casabianca, J.M.; Lewis, C. IRT item parameter recovery with marginal maximum likelihood estimation using loglinear smoothing models. J. Educ. Behav. Stat. 2015, 40, 547–578. [Google Scholar] [CrossRef]
- von Davier, M. A general diagnostic model applied to language testing data. Br. J. Math. Stat. Psychol. 2008, 61, 287–307. [Google Scholar] [CrossRef] [PubMed]
- Xu, X.; von Davier, M. Fitting the Structured General Diagnostic Model to NAEP Data; Research Report No. RR-08-28; Educational Testing Service: Princeton, NJ, USA, 2008. [Google Scholar] [CrossRef]
- Woods, C.M.; Lin, N. Item response theory with estimation of the latent density using Davidian curves. Appl. Psychol. Meas. 2009, 33, 102–117. [Google Scholar] [CrossRef]
- Woods, C.M. Estimating the latent density in unidimensional IRT to permit non-normality. In Handbook of Item Response Theory Modeling; Reise, S.P., Revicki, D.A., Eds.; Routledge: New York, NY, USA, 2014; pp. 78–102. [Google Scholar] [CrossRef]
- Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational Research: Copenhagen, Denmark, 1960. [Google Scholar]
- Bond, T.; Yan, Z.; Heene, M. Applying the Rasch Model; Routledge: New York, NY, USA, 2020. [Google Scholar] [CrossRef]
- Linacre, J.M. Understanding Rasch measurement: Estimation methods for Rasch measures. J. Outcome Meas. 1999, 3, 382–405. Available online: https://bit.ly/2UV6Eht (accessed on 29 April 2024).
- Salzberger, T. The illusion of measurement: Rasch versus 2-PL. Rasch Meas. Trans. 2002, 16, 882. Available online: https://tinyurl.com/25wzmzb5 (accessed on 29 April 2024).
- van der Linden, W.J. Fundamental measurement and the fundamentals of Rasch measurement. In Objective Measurement: Theory Into Practice. Vol. 2; Wilson, M., Ed.; Ablex Publishing Corporation: Hillsdale, NJ, USA, 1994; pp. 3–24. [Google Scholar]
- Camilli, G. IRT scoring and test blueprint fidelity. Appl. Psychol. Meas. 2018, 42, 393–400. [Google Scholar] [CrossRef]
- Edelsbrunner, P.A. A model and its fit lie in the eye of the beholder: Long live the sum score. Front. Psychol. 2022, 13, 986767. [Google Scholar] [CrossRef] [PubMed]
- Hemker, B.T. To a or not to a: On the use of the total score. In Essays on Contemporary Psychometrics; van der Ark, L.A., Emons, W.H.M., Meijer, R.R., Eds.; Springer: Cham, Switzerland, 2023; pp. 251–270. [Google Scholar] [CrossRef]
- Robitzsch, A. On the choice of the item response model for scaling PISA data: Model selection based on information criteria and quantifying model uncertainty. Entropy 2022, 24, 760. [Google Scholar] [CrossRef]
- Robitzsch, A.; Lüdtke, O. Some thoughts on analytical choices in the scaling model for test scores in international large-scale assessment studies. Meas. Instrum. Soc. Sci. 2022, 4, 9. [Google Scholar] [CrossRef]
- White, M. A peculiarity in educational measurement practices. PsyArXiv 2024. [Google Scholar]
- Engelhard, G. Invariant Measurement; Routledge: New York, NY, USA, 2012. [Google Scholar] [CrossRef]
- Wind, S.A.; Engelhard, G. How invariant and accurate are domain ratings in writing assessment? Assess. Writ. 2013, 18, 278–299. [Google Scholar] [CrossRef]
- Heene, M.; Bollmann, S.; Bühner, M. Much ado about nothing, or much to do about something? J. Individ. Differ. 2014, 35, 245–249. [Google Scholar] [CrossRef]
- Ballou, D. Test scaling and value-added measurement. Educ. Financ. Policy 2009, 4, 351–383. [Google Scholar] [CrossRef]
- Briggs, D.; Maul, A.; McGrane, J. On the nature of measurement. PsyArXiv 2023. [Google Scholar] [CrossRef]
- Heine, J.H.; Heene, M. Measurement and mind: Unveiling the self-delusion of metrification in psychology. Meas. Interdiscip. Res. Persp. 2024; epub ahead of print. [Google Scholar] [CrossRef]
- Aitkin, M. Expectation maximization algorithm and extensions. In Handbook of Item Response Theory, Vol. 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 217–236. [Google Scholar] [CrossRef]
- Bock, R.D.; Aitkin, M. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika 1981, 46, 443–459. [Google Scholar] [CrossRef]
- Robitzsch, A.; Lüdtke, O. A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psychol. Test Assess. Model. 2020, 62, 233–279. [Google Scholar]
- Holland, P.W.; Wainer, H. (Eds.) Differential Item Functioning: Theory and Practice; Lawrence Erlbaum: Hillsdale, NJ, USA, 1993. [Google Scholar] [CrossRef]
- Millsap, R.E. Statistical Approaches to Measurement Invariance; Routledge: New York, NY, USA, 2011. [Google Scholar] [CrossRef]
- Penfield, R.D.; Camilli, G. Differential item functioning and item bias. In Handbook of Statistics, Vol. 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Elsevier: Amsterdam, The Netherlands, 2007; pp. 125–167. [Google Scholar] [CrossRef]
- Lee, W.C.; Lee, G. IRT linking and equating. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 639–673. [Google Scholar] [CrossRef]
- Kolen, M.J.; Brennan, R.L. Test Equating, Scaling, and Linking; Springer: New York, NY, USA, 2014. [Google Scholar] [CrossRef]
- Sansivieri, V.; Wiberg, M.; Matteucci, M. A review of test equating methods with a special focus on IRT-based approaches. Statistica 2017, 77, 329–352. [Google Scholar] [CrossRef]
- Brennan, R.L. Generalizability Theory; Springer: New York, NY, USA, 2001. [Google Scholar] [CrossRef]
- Michaelides, M.P. A review of the effects on IRT item parameter estimates with a focus on misbehaving common items in test equating. Front. Psychol. 2010, 1, 167. [Google Scholar] [CrossRef]
- Michaelides, M.P.; Haertel, E.H. Selection of common items as an unrecognized source of variability in test equating: A bootstrap approximation assuming random sampling of common items. Appl. Meas. Educ. 2014, 27, 46–57. [Google Scholar] [CrossRef]
- Sachse, K.A.; Roppelt, A.; Haag, N. A comparison of linking methods for estimating national trends in international comparative large-scale assessments in the presence of cross-national DIF. J. Educ. Meas. 2016, 53, 152–171. [Google Scholar] [CrossRef]
- Sachse, K.A.; Haag, N. Standard errors for national trends in international large-scale assessments in the case of cross-national differential item functioning. Appl. Meas. Educ. 2017, 30, 102–116. [Google Scholar] [CrossRef]
- Battauz, M. Multiple equating of separate IRT calibrations. Psychometrika 2017, 82, 610–636. [Google Scholar] [CrossRef]
- Monseur, C.; Berezner, A. The computation of equating errors in international surveys in education. J. Appl. Meas. 2007, 8, 323–335. [Google Scholar]
- OECD. PISA 2012. Technical Report; OECD: Paris, France, 2014; Available online: https://bit.ly/2YLG24g (accessed on 29 April 2024).
- Robitzsch, A.; Lüdtke, O. Linking errors in international large-scale assessments: Calculation of standard errors for trend estimation. Assess. Educ. 2019, 26, 444–465. [Google Scholar] [CrossRef]
- Robitzsch, A. Robust and nonrobust linking of two groups for the Rasch model with balanced and unbalanced random DIF: A comparative simulation study and the simultaneous assessment of standard errors and linking errors with resampling techniques. Symmetry 2021, 13, 2198. [Google Scholar] [CrossRef]
- Wu, M. Measurement, sampling, and equating errors in large-scale assessments. Educ. Meas. 2010, 29, 15–27. [Google Scholar] [CrossRef]
- Melin, J.; Cano, S.; Pendrill, L. The role of entropy in construct specification equations (CSE) to improve the validity of memory tests. Entropy 2021, 23, 212. [Google Scholar] [CrossRef]
- Melin, J.; Cano, S.; Flöel, A.; Göschel, L.; Pendrill, L. The role of entropy in construct specification equations (CSE) to improve the validity of memory tests: Extension to word lists. Entropy 2022, 24, 934. [Google Scholar] [CrossRef]
- Melin, J.; Fridberg, H.; Hansson, E.E.; Smedberg, D.; Pendrill, L. Exploring a new application of construct specification equations (CSEs) and entropy: A pilot study with balance measurements. Entropy 2023, 25, 940. [Google Scholar] [CrossRef]
- Tennant, A.; Pallant, J.F. DIF matters: A practical approach to test if differential item functioning makes a difference. Rasch Meas. Trans. 2007, 20, 1082–1084. [Google Scholar]
- von Davier, M.; von Davier, A.A. A unified approach to IRT scale linking and scale transformations. Methodology 2007, 3, 115–124. [Google Scholar] [CrossRef]
- Boos, D.D.; Stefanski, L.A. Essential Statistical Inference; Springer: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
- Stefanski, L.A.; Boos, D.D. The calculus of M-estimation. Am. Stat. 2002, 56, 29–38. [Google Scholar] [CrossRef]
- Andersson, B. Asymptotic variance of linking coefficient estimators for polytomous IRT models. Appl. Psychol. Meas. 2018, 42, 192–205. [Google Scholar] [CrossRef]
- Battauz, M. equateIRT: An R package for IRT test equating. J. Stat. Softw. 2015, 68, 1–22. [Google Scholar] [CrossRef]
- Jewsbury, P.A. Error Variance in Common Population Linking Bridge Studies; Research Report No. RR-19-42; Educational Testing Service: Princeton, NJ, USA, 2019. [Google Scholar] [CrossRef]
- Monseur, C.; Sibberns, H.; Hastedt, D. Linking errors in trend estimation for international surveys in education. IERI Monogr. Ser. 2008, 1, 113–122. [Google Scholar]
- Haberman, S.J.; Lee, Y.H.; Qian, J. Jackknifing Techniques for Evaluation of Equating Accuracy; Research Report No. RR-09-02; Educational Testing Service: Princeton, NJ, USA, 2009. [Google Scholar] [CrossRef]
- Robitzsch, A. Linking error in the 2PL model. J 2023, 6, 58–84. [Google Scholar] [CrossRef]
- Robitzsch, A. A comparison of linking methods for two groups for the two-parameter logistic item response model in the presence and absence of random differential item functioning. Foundations 2021, 1, 116–144. [Google Scholar] [CrossRef]
- Robitzsch, A. Lp loss functions in invariance alignment and Haberman linking with few or many groups. Stats 2020, 3, 246–283. [Google Scholar] [CrossRef]
- Asparouhov, T.; Muthén, B. Multiple-group factor analysis alignment. Struct. Equ. Model. 2014, 21, 495–508. [Google Scholar] [CrossRef]
- Robitzsch, A. Comparing robust linking and regularized estimation for linking two groups in the 1PL and 2PL models in the presence of sparse uniform differential item functioning. Stats 2023, 6, 192–208. [Google Scholar] [CrossRef]
- Robitzsch, A. Examining differences of invariance alignment in the Mplus software and the R package sirt. Mathematics 2024, 12, 770. [Google Scholar] [CrossRef]
- Haebara, T. Equating logistic ability scales by a weighted least squares method. Jpn. Psychol. Res. 1980, 22, 144–149. [Google Scholar] [CrossRef]
- He, Y.; Cui, Z.; Osterlind, S.J. New robust scale transformation methods in the presence of outlying common items. Appl. Psychol. Meas. 2015, 39, 613–626. [Google Scholar] [CrossRef]
- He, Y.; Cui, Z. Evaluating robust scale transformation methods with multiple outlying common items under IRT true score equating. Appl. Psychol. Meas. 2020, 44, 296–310. [Google Scholar] [CrossRef]
- Robitzsch, A. Robust Haebara linking for many groups: Performance in the case of uniform DIF. Psych 2020, 2, 155–173. [Google Scholar] [CrossRef]
- Weeks, J.P. plink: An R package for linking mixed-format tests using IRT-based methods. J. Stat. Softw. 2010, 35, 1–33. [Google Scholar] [CrossRef]
- Zeileis, A. Object-oriented computation of sandwich estimators. J. Stat. Softw. 2006, 16, 1–16. [Google Scholar] [CrossRef]
- Fay, M.P.; Graubard, B.I. Small-sample adjustments for Wald-type tests using sandwich estimators. Biometrics 2001, 57, 1198–1206. [Google Scholar] [CrossRef]
- Li, P.; Redden, D.T. Small sample performance of bias-corrected sandwich estimators for cluster-randomized trials with binary outcomes. Stat. Med. 2015, 34, 281–296. [Google Scholar] [CrossRef] [PubMed]
- Zeileis, A.; Köll, S.; Graham, N. Various versatile variances: An object-oriented implementation of clustered covariances in R. J. Stat. Softw. 2020, 95, 1–36. [Google Scholar] [CrossRef]
- Ogasawara, H. Standard errors of item response theory equating/linking by response function methods. Appl. Psychol. Meas. 2001, 25, 53–67. [Google Scholar] [CrossRef]
- Ogasawara, H. Item response theory true score equatings and their standard errors. J. Educ. Behav. Stat. 2001, 26, 31–50. [Google Scholar] [CrossRef]
- Ogasawara, H. Applications of asymptotic expansion in item response theory linking. In Statistical Models for Test Equating, Scaling, and Linking; von Davier, A., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 261–280. [Google Scholar] [CrossRef]
- Battauz, M. IRT test equating in complex linkage plans. Psychometrika 2013, 78, 464–480. [Google Scholar] [CrossRef] [PubMed]
- Battauz, M. Factors affecting the variability of IRT equating coefficients. Stat. Neerl. 2015, 69, 85–101. [Google Scholar] [CrossRef]
- Zhang, Z. Asymptotic standard errors of generalized partial credit model true score equating using characteristic curve methods. Appl. Psychol. Meas. 2021, 45, 331–345. [Google Scholar] [CrossRef] [PubMed]
- R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2023; Available online: https://www.R-project.org/ (accessed on 15 March 2023).
- Robitzsch, A. sirt: Supplementary Item Response Theory Models. R Package Version 4.1-15. 2024. Available online: https://cran.r-project.org/web/packages/sirt/index.html (accessed on 29 April 2024).
- Muraki, E. A generalized partial credit model: Application of an EM algorithm. Appl. Psychol. Meas. 1992, 16, 159–176. [Google Scholar] [CrossRef]
- Pibal, F.; Cesnik, H.S. Evaluating the quantity-quality trade-off in the selection of anchor items: A vertical scaling approach. Pract. Assess. Res. Eval. 2011, 16, 6. [Google Scholar]
- Samejima, F. Logistic positive exponent family of models: Virtue of asymmetric item characteristic curves. Psychometrika 2000, 65, 319–335. [Google Scholar] [CrossRef]
- Huang, Q.; Bolt, D.M.; Lyu, W. Investigating item complexity as a source of cross-national DIF in TIMSS math and science. Large-scale Assess. Educ. 2024, 12, 12. [Google Scholar] [CrossRef]
- Falk, C.F.; Cai, L. Semiparametric item response functions in the context of guessing. J. Educ. Meas. 2016, 53, 229–247. [Google Scholar] [CrossRef]
- Feuerstahler, L. Flexible item response modeling in R with the flexmet package. Psych 2021, 3, 447–478. [Google Scholar] [CrossRef]
- OECD. PISA 2015. Technical Report; OECD: Paris, France, 2017; Available online: https://bit.ly/32buWnZ (accessed on 29 April 2024).
- Robitzsch, A.; Lüdtke, O. An examination of the linking error currently used in PISA. Meas. Interdiscip. Res. Persp. 2024, 22, 61–77. [Google Scholar] [CrossRef]
- Robitzsch, A. Analytical approximation of the jackknife linking error in item response models utilizing a Taylor expansion of the log-likelihood function. AppliedMath 2023, 3, 49–59. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).