Abstract
In this article, the Rasch model is used for assessing a mean difference between two groups for a test of dichotomous items. It is assumed that random differential item functioning (DIF) exists that can bias group differences. The case of balanced DIF is distinguished from the case of unbalanced DIF. In balanced DIF, DIF effects cancel out on average. In contrast, in unbalanced DIF, the expected value of DIF effects can differ from zero and on average favor a particular group. Robust linking methods (e.g., invariance alignment) aim at determining group mean differences that are robust to the presence of DIF. In contrast, group differences obtained from nonrobust linking methods (e.g., Haebara linking) can be affected by the presence of a few DIF effects. Alternative robust and nonrobust linking methods are compared in a simulation study under various simulation conditions. The simulation results indicate that robust linking methods are preferable to nonrobust alternatives in the case of unbalanced DIF effects. Moreover, the theory of M-estimation, an important approach to robust statistical estimation suitable for data with asymmetric errors, is used to study the asymptotic behavior of linking estimators as the number of items tends to infinity. These results give insights into the asymptotic bias and the estimation of linking errors that represent the variability in estimates due to the selection of items in a test. M-estimation is also used in an analytical treatment to assess standard errors and linking errors simultaneously. Finally, double jackknife and double half sampling methods are introduced and evaluated in a simulation study for the simultaneous assessment of standard errors and linking errors. Half sampling outperformed jackknife estimators in assessing the variability of estimates from robust linking methods.
1. Introduction
The analysis of psychological or educational tests is an important field in the social sciences. The test items (i.e., tasks presented in these tests) are often analyzed using item response theory (IRT, [,]) models. In this article, the Rasch model (RM; [,]) is used for comparing two groups on test items. For example, groups could be demographic groups, countries, studies, or time points. The group comparisons are carried out using linking methods [,]. An important impediment in applying linking methods is that the items could behave differently in the two groups (i.e., differential item functioning, DIF; []); that is, it cannot be expected that the Rasch model holds in the two groups with item parameters that are independent of a group membership.
In this article, we study the performance of linking methods in the presence of DIF that can bias group differences. In contrast to habitually used (i.e., nonrobust) linking methods, robust linking methods aim at deriving estimates of group differences that are robust to the presence of DIF. Importantly, DIF effects can be considered as asymmetric error distributions, and robust statistical methods for location measures are applied for determining a group difference.
This article systematically compares alternative linking methods in the RM. Furthermore, linking errors that quantify the uncertainty in group differences due to the randomness associated with DIF are analytically treated using M-estimation theory and computationally assessed using single and double jackknife and (balanced) half sampling, respectively.
The paper is structured as follows. In Section 2, the RM with random DIF is introduced. In Section 3, several nonrobust and robust linking methods are discussed. In Section 4, M-estimation theory is applied to the study of linking methods for the statistical inference of linking errors. In Section 5, M-estimation theory is applied for the simultaneous assessment of standard errors and linking errors. The resampling techniques double jackknife and double half sampling are introduced in Section 6 for empirically assessing standard errors and linking errors. In Section 7, we present the results of a simulation study in which different robust and nonrobust linking methods are systematically compared across various data-generating models for DIF effects. Section 8 presents a simulation study that investigates the empirical performance of the proposed resampling estimators from Section 6. Finally, the article concludes with a discussion in Section 9.
2. Differential Item Functioning in the Rasch Model
2.1. Rasch Model
The RM [,,,,,,,] is a statistical model for dichotomous item responses $X_i \in \{0, 1\}$ for items $i = 1, \ldots, I$. A latent variable $\theta$ (the so-called ability) accounts for the dependence among item responses. The item response function (IRF) for item $i$ in the RM is defined as
$$P(X_i = 1 \mid \theta) = \Psi(\theta - b_i),$$
where $b_i$ is the item difficulty, $\theta$ is the latent ability, and $\Psi(x) = [1 + \exp(-x)]^{-1}$ denotes the logistic link function.
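As a concrete illustration, the IRF can be evaluated numerically. This is a minimal sketch in Python; the function name `irf` is ours and does not come from any IRT package:

```python
import numpy as np

def irf(theta, b):
    """Rasch IRF: P(X_i = 1 | theta) for an item with difficulty b (logistic link)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

p_mid = irf(theta=1.2, b=1.2)    # ability equal to difficulty -> probability 0.5
p_easy = irf(theta=1.0, b=-1.0)  # easy item: probability above 0.5
p_hard = irf(theta=1.0, b=3.0)   # hard item: probability below 0.5
```

At an ability equal to the item difficulty, the solving probability is exactly one half; the probability increases in ability and decreases in difficulty.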
The RM in a random sampling perspective [,] also relies on a local independence assumption and poses a parametric distribution function on the latent ability :
where , , and a finite-dimensional parameter . In many applications, the distribution is assumed to be normal with a mean of zero. In this case, the parameter only contains the standard deviation that must be estimated in addition to item parameters . It has been empirically shown that distributional misspecifications of might not strongly bias estimates of item difficulties if many items are available [,,]. However, in (very) short tests and with a strong deviation from normality, the bias in item parameter estimates can be non-negligible.
Note that the parameters of the RM are identified up to a constant. Hence, either the mean of the abilities or the mean of item difficulties has to be fixed to zero for reasons of identification [,]. If a normal distribution for is assumed, the mean is set to zero, and only the standard deviation is estimated. The RM is typically estimated with marginal maximum likelihood estimation [].
In practice, it is unlikely that the distribution of item responses can be adequately represented by the RM. In real datasets, more complex IRFs might be necessary, such as the family of logistic IRT models with two, three, or four parameters per item [] or even more flexible nonparametric monotone IRFs [,,,]. In large-scale educational datasets, items could have different discriminations [], and guessing and slipping behavior have been reported [,]. However, fitting a misspecified RM to data might be justified if the latent ability $\theta$ should be defined such that all items contribute equally to $\theta$ [,]. By fitting more complex IRT models, the meaning of $\theta$ might change, which raises validity concerns. Note that the RM has been used in the Programme for International Student Assessment (PISA) study in the past [] and in many other national [,] and international large-scale assessment studies [].
The estimated item parameters with marginal maximum likelihood estimation [] can be interpreted as a pseudo-true parameter (see []) that maximizes the Kullback–Leibler information [,,] between the true distribution Q and its parametric approximation :
where the sum is defined over the different item response patterns for . The distribution Q is defined as the true data-generating distribution , which is a multinomial distribution on the S item response patterns. The RM is the best approximation to Q with respect to the Kullback–Leibler information. By using different loss functions (i.e., estimation methods, for example, unweighted least squares estimation, []), different pseudo-true parameters of the RM will be obtained.
2.2. Differential Item Functioning
Now assume that I items are administered in two groups $g = 1, 2$. The estimation of group differences is of interest. Abilities in the first group follow a normal distribution with zero mean, that is, $\theta \sim N(0, \sigma_1^2)$. In the second group, we also assume a normal distribution, i.e., $\theta \sim N(\mu, \sigma_2^2)$. The parameter $\mu$ can be interpreted as the average ability difference between the two groups.
In practical applications, it is unlikely that item difficulties are equal across groups (i.e., that they are measurement invariant). In this case, DIF occurs, and there exist group-specific item difficulties $b_{ig}$ for groups $g = 1, 2$. Item-specific DIF effects are defined as the difference $e_i = b_{i1} - b_{i2}$ of the group-specific item difficulties. In the absence of DIF, all DIF effects would be equal to zero. Identification constraints on DIF effects must be posed to disentangle group mean differences from average DIF effects [,,].
In this paper, we distinguish the case of random items from fixed items with random DIF effects. In the first case of random items, it is assumed that the bivariate vector of item difficulties follows a bivariate distribution G. In the second case, it is assumed that with random effects , but item difficulties are regarded as fixed. This means that items are fixed, but DIF effects represent a random variable. DIF effects follow a univariate distribution G in this case.
To identify the group difference , identification constraints on G have to be imposed in both cases. The main idea is that the set of items can be partitioned into two distinct sets and . The set of items in (also denoted as reference items; []) is deemed valid for obtaining unbiased group differences. The set refers to approximate measurement invariant (AMI; []) items. These items are allowed to have DIF effects that on average cancel out. A special case is a set of anchor items in which all items in this set have zero DIF effects []. Items in the set (also denoted as biased items; []) have the potential to bias group differences (see []). The partitioning is modeled with a mixture distribution for G [,,]:
where is the proportion of items in the set . In the fixed items case, it is assumed that the expected value for DIF effects of items from is zero, while it can be different from zero for items from . More formally, it holds that
In the random items case, define by the univariate distribution of DIF effects. Based on the mixture representation of the bivariate distribution G, one can decompose the distribution of DIF effects accordingly. The condition for DIF effects in the random items case is the same as in Equation (5).
The test is said to have balanced DIF if the expected DIF effect is zero (i.e., $\mathrm{E}(e_i) = 0$), and it has unbalanced DIF if $\mathrm{E}(e_i) \neq 0$ (see [,,]). It is important to emphasize that the definition of the mixture distribution allows the identification of group differences. The total DIF impact on the test containing all items can be calculated as (for notational simplicity, only in the fixed items case)
With a low proportion of biased items, the presence of DIF effects is not expected to have a large impact on estimated group differences.
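The mixture of reference and biased items can be sketched in a short simulation. All numeric choices below (30% biased items, a DIF standard deviation of 0.3, a bias shift of 1.0) are illustrative assumptions, and the function name is ours:

```python
import numpy as np

rng = np.random.default_rng(2021)

def simulate_dif(n_items, prop_biased, bias_mean, sd=0.3):
    """Draw random DIF effects from the two-component mixture: reference
    items have mean-zero effects; biased items are shifted by bias_mean."""
    biased = rng.random(n_items) < prop_biased
    e = rng.normal(0.0, sd, n_items)
    e[biased] += bias_mean
    return e

# Balanced DIF: the biased component also has zero mean -> effects cancel out.
impact_bal = simulate_dif(10000, prop_biased=0.3, bias_mean=0.0).mean()
# Unbalanced DIF: biased items favor one group; expected impact is 0.3 * 1.0.
impact_unb = simulate_dif(10000, prop_biased=0.3, bias_mean=1.0).mean()
```

Under unbalanced DIF, the average DIF effect approaches the proportion of biased items times their mean shift, which is exactly the DIF impact discussed above.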
In this article, we only consider random DIF effects. In probably the largest part of the literature, DIF effects are considered fixed (e.g., []). In this case, the condition for balanced DIF replaces the expected value by the mean associated with the fixed item parameters []. There is no additional uncertainty introduced in the estimation of group differences with fixed DIF effects because the item parameters are held fixed in repeated sampling. In contrast, with random DIF effects, the group mean difference is affected by the sampled DIF effects even for an infinite sample size of persons. This kind of uncertainty is explicitly addressed in this article.
In many applications, the estimation of group differences involves a previous step in which DIF is detected by applying statistical techniques [,,,]. DIF detection statistics aim to classify items into a set of items that possess DIF, which should be optimally equal to the set . However, DIF detection techniques rely on previous knowledge about DIF-free items or a known group difference [,]. Hence, the decision of whether an item has DIF or not requires additional assumptions that cannot be statistically tested (see also [,,]). In this paper, we do not thoroughly investigate DIF detection techniques, but rather study the performance of linking methods to estimate group differences. We distinguish robust from nonrobust linking methods. Robust linking methods adequately handle the presence of biased items (i.e., items in the set of ) that lead to unbalanced DIF, while nonrobust linking approaches result in biased estimates of group differences in unbalanced DIF situations.
If the RM does not hold, DIF between groups means that IRFs can differ across the two groups. If the misspecified RM is fitted to data, DIF in item difficulties can be interpreted as a summary of DIF between IRFs. It is acknowledged that more complex DIF, such as nonuniform DIF in item discriminations [] or DIF in guessing parameters, might occur. However, if these model aspects are intentionally ignored by fitting the RM, DIF effects in other aspects of the IRFs only indirectly enter the DIF assessment through item difficulties. Moreover, DIF effects in item difficulties are more frequently found in empirical applications than in item discriminations [,,]. In the rest of the paper, statistical inference regarding the population of persons and the population of items is discussed that is even valid if the fitted IRT model is misspecified.
2.3. Identified Item Parameters in Group-Specific Scaling Models
Linking methods rely on group-specific item parameters estimated in separate scaling models in each group. By doing so, the group-specific scaling models are not misspecified by falsely assuming invariance.
In the first group, the ability variable in the data-generating model follows a normal distribution with zero mean, that is, $\theta \sim N(0, \sigma_1^2)$. In a separate estimation for the first group with an infinite sample size of persons, the estimated item difficulties equal the data-generating parameters (i.e., $\hat{b}_{i1} = b_{i1}$). In the second group, the distribution of the ability variable is $N(\mu, \sigma_2^2)$. In the estimation, the mean of the ability variable is fixed to zero for reasons of identification. Hence, estimated item difficulties also include the group difference parameter. We obtain
$$\Psi(\theta - b_{i2}) = \Psi\big(\tilde{\theta} - (b_{i2} - \mu)\big) \quad \text{with} \quad \tilde{\theta} = \theta - \mu,$$
where the standardized ability $\tilde{\theta}$ is normally distributed with zero mean (i.e., $\tilde{\theta} \sim N(0, \sigma_2^2)$). Consequently, it follows that $\hat{b}_{i2} = b_{i2} - \mu$.
3. Linking Methods
In this section, we review several linking methods [,,,,] that allow the estimation of the group difference $\mu$. We assume that estimated identified item parameters $\hat{b}_{i1}$ and $\hat{b}_{i2}$ ($i = 1, \ldots, I$) are available (see Section 2.3). We define differences $d_i = \hat{b}_{i1} - \hat{b}_{i2}$.
3.1. Mean-Mean Linking (MM)
Mean-mean linking (MM; [,]) is one of the most popular linking methods. The group difference is estimated by
$$\hat{\mu}_{\mathrm{MM}} = \frac{1}{I} \sum_{i=1}^{I} d_i .$$
Note that $\hat{\mu}_{\mathrm{MM}}$ is determined as the least-squares estimate of the item difficulty differences $d_i$:
$$\hat{\mu}_{\mathrm{MM}} = \arg\min_{\mu} \sum_{i=1}^{I} (d_i - \mu)^2 .$$
We now derive the bias of MM and assume fixed items with random DIF effects e that follow a distribution G. The distribution is given by the mixture representation (see Equation (4)). It holds that and (see Equation (5)). Then, we obtain for the bias under MM
The bias coincides with the DIF impact on the test (see Equation (7)). The bias vanishes in the case of balanced DIF (i.e., ) or in the absence of biased items (). MM can be considered as a nonrobust linking method because biased items can affect the estimated group difference. As an alternative to such a nonrobust approach, it may be recommended to use linking methods based on robust statistical methodology [] designed for resistant estimation under contamination (especially for data contaminated by outlying values). The following linking methods realize some kind of robustness against the presence of biased DIF items.
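Before turning to the robust alternatives, the bias of MM linking under unbalanced DIF can be illustrated numerically. The sketch assumes the identification of Section 2.3, under which the item-wise differences of identified difficulties equal the true group difference plus the DIF effects; all numeric values (20% biased items, a shift of 1.0) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

I, mu_true = 2000, 0.5              # number of items, true group difference
e = rng.normal(0.0, 0.2, I)         # random DIF effects of reference items
biased = rng.random(I) < 0.2        # 20% biased items ...
e[biased] += 1.0                    # ... with DIF effects favoring one group
d = mu_true + e                     # differences of identified difficulties
mm_est = d.mean()                   # MM linking estimate (least squares)
bias = mm_est - mu_true             # close to 0.2 * 1.0, the DIF impact
```

The empirical bias approaches the proportion of biased items times their mean DIF shift, matching the DIF impact derived above.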
3.2. Asymmetrically Trimmed Mean (ATR)
An intuitive idea borrowed from robust statistics [,,] is to consider biased DIF items as outliers [,] in estimating the location measure that is given as the group difference. Hence, robust alternatives to the mean (i.e., MM linking) can be established.
The asymmetrically trimmed mean (ATR) removes items with large differences from the estimation. By defining a trimming proportion , the ATR linking estimate is defined as the average of values for which the absolute differences are below the -quantile of these discrepancies. The main idea is that large discrepancies can be regarded as biased items and should be removed from group comparisons. The ATR estimate is formally defined as
where denotes the -quantile, the indicator function, and the median. The median instead of the mean is used because the median is typically more robust concerning outliers (i.e., biased DIF items). ATR linking has the potential to properly handle the situation of unbalanced DIF because it explicitly allows that there could be only biased items with unidirectional signs. The ATR estimator is related to the least trimmed absolute estimator [,], which is especially suitable for asymmetric contamination in the data. A similar idea to the ATR estimator is used in robust structural equation modeling for defining case weights used for downweighting outlying cases (see [,]). As an alternative to the ATR estimator, the least weighted squares estimator may be applied as an estimator of location with high robustness as well as high efficiency [].
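A minimal implementation sketch of the ATR estimator; the function name, the trimming proportion, and the example data are ours:

```python
import numpy as np

def atr(d, trim=0.2):
    """Asymmetrically trimmed mean: average the d_i whose absolute deviation
    from the median lies below the (1 - trim)-quantile of these deviations."""
    d = np.asarray(d, dtype=float)
    dev = np.abs(d - np.median(d))
    keep = dev < np.quantile(dev, 1.0 - trim)
    return d[keep].mean()

# Two of the ten differences are outlying (biased items, unidirectional DIF).
d = np.array([0.48, 0.52, 0.50, 0.47, 0.55, 0.51, 0.49, 0.53, 1.9, 2.1])
est = atr(d, trim=0.2)   # trims the two outliers, unlike the plain mean
```

The plain mean of these differences is 0.805, whereas the ATR estimate stays close to the bulk of the items near 0.5.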
3.3. Elimination of DIF Items with Subsequent Mean-Mean Linking (EL)
Another popular approach is to remove DIF items from the group comparison. The identification of DIF items in the first step requires the definition of an appropriate statistic. In the simulation study, we assume that a preliminary group difference is estimated by the median of all differences $d_i$. An item is declared to have DIF if its absolute deviation from this preliminary estimate exceeds a prespecified cutoff K. In many studies, the mean instead of the median is used, and the corresponding condition is referred to as the equal-mean anchor []. However, the median might be a more robust location estimate than the mean in the presence of DIF effects. The items with detected DIF are removed for the subsequent computation of MM linking []. More formally, the EL estimate can be written as
The EL linking method by eliminating DIF items can be interpreted as another variant of a trimmed mean.
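The two-step EL procedure can be sketched in a few lines; the cutoff K = 0.5 and the example data are illustrative:

```python
import numpy as np

def el_linking(d, K=0.5):
    """EL sketch: preliminary median-based estimate, removal of items whose
    deviation exceeds the cutoff K, then mean-mean linking on the rest."""
    d = np.asarray(d, dtype=float)
    mu0 = np.median(d)               # robust preliminary group difference
    keep = np.abs(d - mu0) <= K      # items not flagged as DIF items
    return d[keep].mean()

d = np.array([0.48, 0.52, 0.50, 0.47, 0.55, 1.9, 2.1])
est = el_linking(d)                  # the two large-DIF items are eliminated
```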
3.4. Bisquare Linking (BSQ)
Another robust estimate of the location parameter is based on the bisquare loss function (see []) that is defined by
$$\rho(x) = \begin{cases} \dfrac{K^2}{6}\left[1 - \left(1 - (x/K)^2\right)^3\right] & \text{for } |x| \le K, \\[4pt] \dfrac{K^2}{6} & \text{for } |x| > K, \end{cases}$$
where K is a prespecified threshold value. The group difference is estimated by
$$\hat{\mu}_{\mathrm{BSQ}} = \arg\min_{\mu} \sum_{i=1}^{I} \rho(d_i - \mu) .$$
Note that the bisquare loss function is also known as the Tukey biweight function [].
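BSQ linking can be computed by iteratively reweighted averaging, a standard way to minimize the bisquare loss; in this sketch (function name, tuning constant K, and data are ours), residuals beyond K receive zero weight:

```python
import numpy as np

def bsq_linking(d, K=1.0, n_iter=50):
    """Minimize the bisquare (Tukey biweight) loss by iteratively reweighted
    averaging; residuals with |d_i - mu| >= K receive zero weight."""
    d = np.asarray(d, dtype=float)
    mu = np.median(d)                       # robust starting value
    for _ in range(n_iter):
        u = (d - mu) / K
        w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)
        mu = np.sum(w * d) / np.sum(w)      # weighted-mean update
    return mu

d = np.array([0.48, 0.52, 0.50, 0.47, 0.55, 1.9, 2.1])
est = bsq_linking(d)                        # the two outliers get zero weight
```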
3.5. Invariance Alignment (IA)
The bisquare loss function in Equation (15) can be replaced by any robust (or nonrobust) loss function. In invariance alignment (IA; [,,,]), the power loss function $\rho(x) = |x|^p$ ($p > 0$) is employed. The group mean estimate is given by
$$\hat{\mu}_{\mathrm{IA}} = \arg\min_{\mu} \sum_{i=1}^{I} |d_i - \mu|^p .$$
By choosing a small power p, the extent of noninvariance is minimized. Hence, the group mean difference relies on items that have small DIF effects while removing items with large DIF effects from the comparison []. IA was originally proposed with the power $p = 0.5$ []. IA with $p = 2$ is equivalent to MM. Note that IA with small p is particularly suited to the situation of partial invariance in which the distribution of DIF effects concentrates at zero (i.e., all DIF effects of invariant items are zero or close to zero) and fails for symmetrically distributed DIF effects [,].
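A sketch of IA linking via a simple grid search over candidate group differences; the epsilon smoothing of the nondifferentiable power loss and the grid search are our illustrative choices, standing in for a proper optimizer:

```python
import numpy as np

def ia_linking(d, p=0.5, eps=1e-4):
    """IA sketch: minimize sum_i (|d_i - mu|^2 + eps)^(p/2) over a grid of
    candidate mu values; eps smooths the nondifferentiable power loss."""
    d = np.asarray(d, dtype=float)
    grid = np.linspace(d.min(), d.max(), 20001)
    loss = ((d[None, :] - grid[:, None]) ** 2 + eps) ** (p / 2.0)
    return grid[np.argmin(loss.sum(axis=1))]

d = np.array([0.48, 0.52, 0.50, 0.47, 0.55, 1.9, 2.1])
est_robust = ia_linking(d, p=0.5)   # concentrates on the small-DIF items
est_mm = ia_linking(d, p=2.0)       # p = 2 reproduces the mean (MM linking)
```

With p = 0.5, the estimate stays near the cluster of invariant items, while p = 2 returns the nonrobust mean.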
3.6. Haebara Linking (HAE)
In contrast to MM, ATR, BSQ and IA linking methods, Haebara (HAE) linking [] aligns IRFs instead of aligning item parameters. The linking function is defined as
with a power $p > 0$ and a weight function $\omega$ that integrates to one. The originally proposed HAE linking uses $p = 2$ []. The robust alternative $p = 1$ was treated in [,,], while cases with $p < 1$ were studied in [].
To get more insight into the relation of IA and HAE, we apply a Taylor approximation of the second IRF in Equation (17) under the assumption of small effects . We obtain
where . Using the approximation (18), Equation (17) can be rewritten as
where item-specific weights are given by . Hence, HAE linking can be interpreted as IA with item-specific weights, and a similar performance of HAE and IA can be expected.
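The alignment of IRFs in HAE linking can be sketched by numerically integrating the loss over a theta grid; the standard-normal weight, the grids, and the sign conventions in this sketch are illustrative assumptions:

```python
import numpy as np

def psi(x):
    """Logistic link function."""
    return 1.0 / (1.0 + np.exp(-x))

def haebara(b1, b2, p=2.0):
    """Haebara-type linking sketch: choose mu minimizing the power-p distance
    between group-1 IRFs and shifted group-2 IRFs, integrated with a
    standard-normal weight over a theta grid (grid search over mu)."""
    theta = np.linspace(-4.0, 4.0, 61)
    w = np.exp(-0.5 * theta**2)
    w /= w.sum()                         # normalized integration weights
    best_mu, best_loss = 0.0, np.inf
    for mu in np.linspace(-2.0, 2.0, 801):
        diff = psi(theta[:, None] - b1[None, :]) - psi(theta[:, None] + mu - b2[None, :])
        loss = np.sum(w[:, None] * np.abs(diff) ** p)
        if loss < best_loss:
            best_mu, best_loss = mu, loss
    return best_mu

b1 = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
b2 = b1 + 0.3                            # DIF-free shift by a difference of 0.3
mu_hat = haebara(b1, b2)
```

In the DIF-free case, the shifted group-2 IRFs coincide with the group-1 IRFs exactly at the true group difference, so the grid search recovers 0.3.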
3.7. Gini Linking (GI)
Recently, a linking procedure based on the Gini index (GI; []) has been proposed. The linking function is very similar to IA linking and tries to define a group difference that is primarily based on items with small DIF effects. The group mean difference is determined by
where the power can be chosen by the user. The original proposal used []. Previous experience of the authors indicates that GI also works with , but it does not perform satisfactorily with . It has been shown that IA and GI provided similar results in small case studies [], but GI linking has not yet been systematically compared with other linking methods.
3.8. Robustness of the Different Linking Methods
The linking methods mean-mean linking (MM) and Haebara linking (HAE) with can be considered as nonrobust. The linking methods based on the asymmetrically trimmed mean (ATR), elimination of DIF items with subsequent mean-mean linking (EL), bisquare linking (BSQ), invariance alignment (IA) with , Haebara linking (HAE) with and Gini linking (GI) can be considered as robust linking methods that ensure some protection to the presence of biased DIF items.
4. An Analytical Treatment for Assessing Linking Errors
In this section, the computation of linking errors is investigated. Linking errors refer to the uncertainty of the randomness associated with items [,,,,,,]. The estimated group difference is affected by random DIF. The linking error quantifies this source of variance. In this section, an analytical treatment for assessing linking errors is presented. For this section, we assume an infinite sample size of persons. That means that identified item parameters are estimated without a sampling error. This assumption is dropped in the next Section 5.
In Section 2.2, we assume that random DIF (or item parameters) follow(s) a mixture distribution , where denotes the distribution associated with reference items for which DIF effects on average cancel out and denotes the distribution of biased items that can impact estimated group differences. The estimation of the group difference can be interpreted as an estimation problem of a location parameter in robust statistics where the location parameter (i.e., the group difference ) should be based on . However, the observed mixture distribution G contains a contaminated asymmetric error distribution [,,] that might bias the estimate . As discussed in Section 2.2, two cases of random DIF can be distinguished. First, items can be considered random, and the bivariate vector of group-specific item difficulties is modeled with a distribution (see Section 4.1). Second, items can be regarded as fixed, but DIF effects are modeled as a random variable (see Section 4.2). Although these cases are very different, their consequences lead to similar estimates of variances. Hence, estimated errors (i.e., linking errors) due to the randomness associated with items are practically identical.
4.1. Random Item Parameters
In this subsection, we discuss the estimation for random item parameters. We introduce a slightly more general notation to cover the linking methods (except for GI linking) from Section 3. The “data” for item i is given by the vector . The linking method must be additive with respect to functions of this data. More formally, let H be a linking function that is defined by
The linking parameter (e.g., a group difference) of interest is estimated by
Assuming differentiability of h implies that can be obtained by solving the equation
where .
Equation (24) provides an estimating equation for the parameter. The corresponding estimator is labeled as an M-estimator []. It is evident that the estimated group mean differences in MM, IA and HAE linking are M-estimators by defining a univariate parameter. The linking methods EL and ATR are so-called two-step estimators because their computation relies on the median computed in a first step. Because the estimating equation for the median is well defined, Equation (24) also applies to these two-step estimators: the pair consisting of the median and the group difference can be interpreted as a bivariate one-step M-estimator with stacked estimating equations (see [], chp. 7).
We now apply the theory of M-estimators ([], chp. 7; [,]) to study the asymptotic behavior of . Because we are concerned with linking errors, asymptotic behavior is meant with respect to the number of items. By letting the number of items tend to infinity, the left side in Equation (24) converges to
where and () denote the random variables associated with estimated item parameters. As already mentioned, the distribution of item parameters G follows the mixture distribution . Assume that densities for the involved distributions exist (i.e., continuous or count densities): , , and . Equation (25) can be written as
The parameter obtained from a linking method with an infinite number of items is given as the root of the equation
Note that this root is a function of the linking function h, the distribution G, and the mixture proportion. For a given dataset, the mixture proportion and G are fixed but unknown. However, the linking function h is chosen by the user.
The pseudo-true parameter is defined as the estimate if all items would be reference items. That is, the linking parameter would only be determined by the mixture component :
Ideally, a component of the solution (in the bivariate case) or the solution itself (in the univariate case) should provide an asymptotically unbiased estimate of the group difference by choosing an appropriate linking function h (or its derivative). In the following, we assume that h is differentiable, although the main propositions for M-estimators do not require differentiability []. However, one could always approximate a nondifferentiable linking function h by a differentiable approximation. For example, the nondifferentiable and nonnegative linking function $h(x) = |x|$ can be approximated by $\tilde{h}(x) = \sqrt{x^2 + \varepsilon}$ for a sufficiently small $\varepsilon > 0$ [,,,].
4.1.1. Asymptotic Behavior
We now study the asymptotic behavior of the estimator . For a large number of items, converges to . The derivation of relies on a Taylor approximation of and closely follows []. Due to (26), we get
We now apply a first-order Taylor approximation of around :
where is the matrix of partial derivatives. From (29) and (30), we get
Hence, we obtain from (31),
If we assume that allows the unbiased estimation of , Equation (32) provides an expression of the asymptotic bias of . Of crucial importance is that the linking function downweights observations from the distribution of biased items (i.e., ). The linking function has to be chosen so that biased items are automatically removed for group comparison. The next subsection discusses how the linking function should be chosen to enable an unbiased estimation of .
4.1.2. Choosing an Optimal Linking Function m
Again, the derivation of the choice of the linking function follows the exposition in []. Assume that the true parameter is determined by the distribution (with density ) of reference items. Hence, is given as the maximizer of the log-likelihood function and fulfills
Based on (33), the linking function can be chosen in order to obtain unbiased estimates of group mean differences (see []):
with the weight function w defined as
Note that and the weighting function w weighs observations according to their closeness to the distribution . Observations with large density values are downweighted in w. Using (33), it can be shown that
4.1.3. Asymptotic Normal Distribution
We now show that the M-estimator follows an asymptotic normal (AN) distribution (see [], chp. 7). The same Taylor expansion as in (30) provides
The approximation (37) can be substituted into the estimating Equation (24):
Hence, we obtain from (38)
Therefore, we obtain the asymptotic normal distribution of as
The involved matrices and can be estimated from sample data by
Notably, the distribution stated in Equation (42) only holds for a sufficiently large number of items I.
4.1.4. Scalar Linking Parameter
We now specialize our results if the estimated parameter coincides with the estimated group difference . In this case, m is a univariate linking function. Assume that and are the roots of the following equations, respectively:
The asymptotic behavior of can be described as (see Equation (32))
where is the derivative of m with respect to . Furthermore, is asymptotically normally distributed (see Equation (42)):
Again, the involved integrals for the variance estimate in (49) can be estimated using sample data (see Equations (44) and (45)).
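For the scalar case, the sandwich variance can be computed directly from the psi function of the chosen linking method. The sketch below uses a Huber psi for concreteness (the bisquare or power losses discussed above would be handled analogously); all names and the tuning constant are ours:

```python
import numpy as np

def huber_mu_and_se(d, K=1.0, n_iter=50):
    """Huber M-estimate of a scalar location plus its sandwich standard
    error sqrt(mean(psi^2) / mean(psi')^2 / I), following the asymptotic
    normality result for scalar M-estimators."""
    d = np.asarray(d, dtype=float)
    mu = np.median(d)                              # robust starting value
    for _ in range(n_iter):
        psi = np.clip(d - mu, -K, K)               # Huber psi function
        dpsi = (np.abs(d - mu) <= K).astype(float) # its derivative (0 or 1)
        mu = mu + psi.mean() / dpsi.mean()         # Newton-type update
    psi = np.clip(d - mu, -K, K)
    dpsi = (np.abs(d - mu) <= K).astype(float)
    var = np.mean(psi**2) / np.mean(dpsi) ** 2 / len(d)
    return mu, np.sqrt(var)

rng = np.random.default_rng(7)
d = 0.5 + rng.normal(0.0, 0.3, 500)   # item differences without biased items
mu_hat, se = huber_mu_and_se(d)
```

With DIF-free normal differences, the sandwich standard error is close to the classical standard error of the mean.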
4.2. Fixed Item Parameters , Random DIF Effects
In this subsection, we consider the case of fixed item parameters , but DIF effects are random. The “data” in Section 4.1 was given by , and DIF effects follow a distribution G. Now, only is random and we define the data as and .
The estimating Equation (24) for the linking parameter can be rewritten as
The term in (50) converges to
One has to assume that exists. Define by
Then, we can derive an asymptotic normal distribution for :
The involved matrices and can be estimated by
Interestingly, these estimators coincide with estimated standard errors in the case of random item parameters (see Equations (44) and (45)). Hence, no practical differences regarding the estimated linking parameters and their estimated standard errors can be expected. Only conceptual differences emerge for the two treatments of DIF effects.
5. An Analytical Treatment for the Simultaneous Assessment of Standard Errors and Linking Errors
In practice, the variance in the group mean difference is affected by the sampling of persons (i.e., standard error) and the randomness associated with items (i.e., linking error). There have been attempts at an analytical treatment of simultaneous inference with respect to the two modes [,,]. In this section, we apply M-estimation theory to the simultaneous assessment of standard errors and linking errors. The general idea in this kind of inference is to investigate the asymptotic behavior of the M-estimator if the number of persons P and the number of items I tend to infinity. We only consider the case of random items, but the treatment of the case with fixed items and random DIF effects is similar.
In the notation of Section 4, denotes the vector of (true) identified item parameters. In finite samples of size P, only estimates are available. For , it holds that . In long tests, the estimated item parameters are approximately independent between items []. Hence, we can assume that are approximately independent of each other. M-estimation theory applied to the person side guarantees an asymptotic normal distribution:
where is a function of true item parameters . We now use a Taylor expansion with respect to and
Using the same approach as in Section 4.1.3, we get an approximation of the estimating equation as
Then, we obtain
By definition, we have for
Moreover, the following limit exists as in Section 4.1.3:
Because the estimated item parameters converge to the true identified item parameters, the second term in the right bracket in (61) vanishes asymptotically for $P \to \infty$
For the computation of the covariance matrix, we have
This shows the asymptotic normal distribution when the simultaneous inference with respect to persons and items is conducted:
It is evident from (67) that the number of persons and the number of items are both part of the statistical inference. The involved matrices can be estimated from sample data. However, for example, in (65), the true identified item parameters in the left term have to be replaced by the estimated item parameters, which can cause slight biases in estimated variance matrices. Because of this disadvantage, we propose resampling techniques for the simultaneous inference of standard errors and linking errors in the next section.
6. Resampling Methods for the Simultaneous Assessment of Standard Errors and Linking Errors
We now derive estimation formulas for resampling methods [,] for persons and items. The derivation is motivated by assuming the following data-generating model
$$x_{pi} = \mu + u_p + v_i + e_{pi},$$
where $x_{pi}$ is the observed data for person p (or person groups) and item i (or item groups). The random variables $u_p$, $v_i$, and $e_{pi}$ are all independent of each other, with variances $\sigma_u^2$, $\sigma_v^2$, and $\sigma_e^2$, respectively. We now derive the variance for the mean estimate $\hat{\mu} = (PI)^{-1} \sum_{p=1}^{P} \sum_{i=1}^{I} x_{pi}$. Its variance is given by
$$\mathrm{Var}(\hat{\mu}) = \frac{\sigma_u^2}{P} + \frac{\sigma_v^2}{I} + \frac{\sigma_e^2}{PI} .$$
The variance in (70) contains error sources for persons and items. Hence, it allows a simultaneous inference for both error facets. Following the terminology of errors in item response modeling for the large-scale assessment of students [], the variance $\sigma_u^2 / P$ quantifies the sampling error due to sampling persons, the variance $\sigma_v^2 / I$ the linking error due to sampling items, and the variance $\sigma_e^2 / (PI)$ can be interpreted as measurement error.
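Numerically, the decomposition of the variance of the mean into person, item, and interaction components can be sketched as follows; the variance components and sample sizes are illustrative assumptions, chosen so that the item (linking error) component dominates, as is typical for short tests:

```python
# Variance of the mean under the two-way decomposition: persons contribute
# a person variance / P, items an item variance / I, and the person-by-item
# interaction its variance / (P * I). All component values are illustrative.
P, I = 1000, 25
sigma2_u, sigma2_v, sigma2_e = 1.0, 0.04, 0.25

var_persons = sigma2_u / P            # sampling error component
var_items = sigma2_v / I              # linking error component
var_interaction = sigma2_e / (P * I)  # measurement error component
var_total = var_persons + var_items + var_interaction
```

With many persons but only 25 items, the linking error component (0.0016) exceeds the person sampling component (0.001), illustrating why the item facet cannot be ignored.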
6.1. Single Jackknife (SJK)
The classical single jackknife (SJK; [,,,]) approach removes one unit (e.g., a (group of) person(s) or a (group of) item(s)) from an analysis for computing standard errors. First, we investigate the jackknife estimate in which only persons are removed. Let be the mean estimate in which person p is removed:
We now derive the expected value of the square in Equation (72):
Now, we define
By using (72), we now obtain
Equation (75) allows the computation of the standard error associated with person sampling. From Equation (70), we can attribute the variance to person sampling. From (75), we get by replacing the expected value with the observed value
In the single jackknife, the person-by-item interaction variance component is typically ignored and the variance due to person sampling is, hence, estimated by .
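The leave-one-person-out mechanics can be sketched under the same hypothetical components model; the scaling is the standard jackknife factor, and the exact expected values derived in (72)–(76) are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
P, I = 200, 40
# Toy data following y_pi = u_p + v_i + e_pi (all SDs hypothetical)
y = (rng.normal(0, 0.5, P)[:, None] + rng.normal(0, 0.3, I)[None, :]
     + rng.normal(0, 0.2, (P, I)))

mu_hat = y.mean()
# Leave-one-person-out estimates of the grand mean
mu_loo = np.array([np.delete(y, p, axis=0).mean() for p in range(P)])

# Standard jackknife scaling: v_P = (P - 1)/P * sum_p (mu_(-p) - mu_hat)^2
v_person = (P - 1) / P * np.sum((mu_loo - mu_hat) ** 2)
print(v_person)
```

As noted above, this estimate absorbs a small share of the person-by-item interaction variance in addition to the person component.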
Similarly, we can derive the properties of the SJK estimate in which a single item i (or an item group) is removed from the analysis:
The SJK variance estimate for item sampling utilizes the sum of squares term
By replacing the expected value with the observed value , the quantity is used as the variance estimate concerning the item facet. For the joint inference of persons and items, the variance terms for persons and items are added:
Note that this variance estimate is biased because
Consequently, so-called double jackknife resampling should be employed to remove the bias from the estimated variance.
6.2. Double Jackknife (DJK)
The double jackknife (DJK; [,,]) removes a person (or a group of persons) and an item (or a group of items) from an analysis for the determination of the standard error. The elimination and repeated analysis are carried out for all persons and items. Let be the mean estimate in which person p and item i are removed. In more detail, it is
The estimate only removes person p, and the estimate only removes item i. The corresponding estimates have already been studied as SJK estimates in Section 6.1.
We now consider an analysis in which one person and one item are removed. One obtains
It follows that
Now, define
We then obtain by using (84)
One can use Equations (75), (79), and (86) as estimating equations by equating the expected values of the sums of squares with their observed counterparts. We have three equations for three unknowns
We further simplify (87) to
Now substitute the first and second equation in (88) in the third equation. We obtain
Hence, we get from (89)
Further, the variance components for persons and items can be computed as
The quantities in (90) and (91) can be used to estimate the population variance defined in (70). The crucial issue is how to handle negative variance estimates in estimation. Based on experience from preliminary simulation studies, the following variance estimate turned out to be most satisfactory:
where is nonnegative, and is defined in Equation (90).
6.3. Single Half Sampling (SHS)
In single half sampling (SHS; []), half of the sample is used to reanalyze the data to compute standard errors. Let be the h-th half sample for persons in which half of the persons are sampled. Without loss of generality, let P be even. The h-th half sample consists of persons. We define half sample h in which the first persons are sampled and compute the mean estimate
Then, we obtain
Hence, we get from (95)
Now, there are H (potentially balanced) half samples (see []) with estimates . Define the variance
Using (95), it follows that
Similarly, one can consider half samples of items. Assume that in half sample k, the first items are sampled. Let
One can define the variance in estimates due to different half samples of items. Define the variance
Using the same derivations, we get
Based on the expected values in (97) and (100), one can define a variance estimate of by adding the variance components regarding persons and items as
Notably, this estimate is positively biased because
As in the case of SJK, SHS also results in a biased variance estimate. In the next section, we investigate double half sampling that removes the bias component.
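The mechanics of SHS for the person facet can be sketched with the same toy model. The half samples here are drawn at random; balanced designs would use a Hadamard matrix instead. For the simple mean, the squared deviation of a half-sample estimate from the full-sample estimate directly estimates the person variance component, mirroring the expected values above.

```python
import numpy as np

rng = np.random.default_rng(2)
P, I, H = 200, 40, 20          # persons, items, number of half samples
y = (rng.normal(0, 0.5, P)[:, None] + rng.normal(0, 0.3, I)[None, :]
     + rng.normal(0, 0.2, (P, I)))
mu_hat = y.mean()

# H random half samples of persons; each keeps all items
est = []
for _ in range(H):
    half = rng.choice(P, size=P // 2, replace=False)
    est.append(y[half, :].mean())
est = np.asarray(est)

# A half sample doubles the person sampling variance, so the squared
# deviation from the full estimate needs no rescaling for the mean:
# E[(mu_half - mu_hat)^2] = sigma_u^2 / P (plus a small interaction part)
v_half = np.mean((est - mu_hat) ** 2)
print(v_half)
```

Because only persons are halved, the item effects cancel exactly in each half-sample deviation; the analogous computation over item half samples isolates the linking error component.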
6.4. Double Half Sampling (DHS)
In double half sampling (DHS), half samples of persons and items are created and the analysis is replicated for these half samples. Let h be a half sample of persons, and k be a half sample of items for this dataset of persons. Let be the mean estimate for the half sample for persons and items and be the estimate for the half sample of persons.
Define the variance
Using the same derivation as in (100), one obtains
Hence, an unbiased estimate of the variance for using DHS is obtained by
where .
In practice, one can use balanced half samples based on Hadamard matrices to obtain the most efficient variance estimates, minimizing the Monte Carlo error induced by the choice of half samples []. In the simulation study (see Section 8), only balanced half samples are considered.
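A minimal sketch of DHS with approximately balanced half samples from a Sylvester Hadamard matrix is given below. The final combination of the three replicate variances is derived for this simple components model only, where it cancels the person-by-item interaction bias; the exact combination for the linking estimators is the one given in Equation (105) of the text, which is not reproduced here.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(3)
P, I, Z = 200, 40, 20                # persons, items, number of zones
y = (rng.normal(0, 0.5, P)[:, None] + rng.normal(0, 0.3, I)[None, :]
     + rng.normal(0, 0.2, (P, I)))
mu_hat = y.mean()

# Persons and items are grouped into Z zones each; a +1 entry in a
# Hadamard row selects the corresponding zone for the half sample
Hmat = hadamard(32)                  # smallest power of two above Z
zones_p = np.array_split(np.arange(P), Z)
zones_i = np.array_split(np.arange(I), Z)

def select(zones, signs):
    return np.concatenate([z for z, s in zip(zones, signs) if s > 0])

S2 = Sp = Si = 0.0
reps = Hmat[1:Z + 1, :Z]             # Z approximately balanced sign patterns
for signs in reps:
    p_half = select(zones_p, signs)
    i_half = select(zones_i, signs)
    Sp += (y[p_half, :].mean() - mu_hat) ** 2               # persons only
    Si += (y[:, i_half].mean() - mu_hat) ** 2               # items only
    S2 += (y[np.ix_(p_half, i_half)].mean() - mu_hat) ** 2  # both facets
Sp, Si, S2 = Sp / Z, Si / Z, S2 / Z

# Under this components model, the single-facet terms each carry one
# interaction share and the double term carries three, so the combination
# 2*(Sp + Si) - S2 is unbiased for the target variance in (70)
v_dhs = max(2 * (Sp + Si) - S2, 0.0)
print(v_dhs)
```

Truncating at zero mirrors the handling of negative variance estimates discussed for DJK above.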
6.5. Double Bootstrap
It might be tempting to consider a double bootstrap resampling approach for persons and items as an alternative to DJK and DHS [,,,]. We believe that bootstrapping items cannot be recommended because duplicating items introduces additional local dependence into IRT models, which, in turn, biases the estimated item parameters and linking parameters. Hence, the variability obtained from a double bootstrap will also include portions of this bias.
7. Simulation Study 1: Comparing the Performance of Different Linking Methods
In Simulation Study 1, we compare the performance of robust and nonrobust linking methods for the RM in the presence and absence of random DIF. This study systematically compares several robust linking methods. In particular, the recently proposed GI method is compared with alternative methods.
7.1. Design
Data were simulated according to the RM with random DIF in two groups. In the first group, the ability distribution was simulated as . In the second group, we simulated (i.e., ). Item difficulties were fixed in the simulation and were chosen equidistant in the interval . Hence, in this study, we assumed fixed item difficulties , but simulated random DIF effects according to a mixture distribution (see Section 2.2). The distribution of DIF effects of reference items was chosen as a centered normal distribution; that is, . For the distribution of DIF effects of biased items , we chose a two-point distribution for balanced DIF with values and and corresponding probabilities . For unbalanced DIF, we simulated a one-point distribution at with probability , which favored the first group. In the simulation, we fixed to 0.60. The bias for MM linking is expected to be (see Equation (11)). It vanishes for balanced DIF and is a function of in the case of unbalanced DIF.
In the simulation, five factors were varied. First, we chose the sample size N of persons as 250, 500, 1000, and 5000. Second, we varied the number of items by and . Third, we chose the proportion of biased items . With , no biased DIF items were simulated. Fourth, we varied the standard deviation (SD) of DIF effects of reference items as 0, 0.1, 0.2, and 0.3. Fifth, we simulated three different distributions of DIF effects if : a normal distribution, a uniform distribution, and a t-distribution with four degrees of freedom. With , reference items do not have DIF effects. The distributions of DIF effects were appropriately scaled in order to match the SD . In total, 1000 datasets were simulated and analyzed in each condition.
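The mixture distribution of random DIF effects can be generated as in the following sketch. The interval endpoints for the item difficulties and the equal probability split for balanced DIF are assumptions for illustration; the DIF effect size of biased items is set to 0.60 as in the text.

```python
import numpy as np

rng = np.random.default_rng(4)
I = 20            # number of items
prop_bias = 0.2   # proportion of biased items
tau = 0.60        # DIF effect size of biased items (fixed at 0.60)
sd_ref = 0.3      # SD of DIF effects of reference items

# Item difficulties equidistant; the interval [-2, 2] is an assumption here
b = np.linspace(-2, 2, I)

# Mixture distribution of random DIF effects:
#   reference item (prob. 1 - prop_bias): e_i ~ N(0, sd_ref^2)
#   biased item    (prob. prop_bias):     balanced DIF draws -tau or +tau
#   (equal probabilities assumed); unbalanced DIF would use +tau only
biased = rng.random(I) < prop_bias
e = np.where(biased,
             rng.choice([-tau, tau], size=I),
             rng.normal(0, sd_ref, I))

# Group 1 uses difficulties b; group 2 uses b + e
b1, b2 = b, b + e
print(np.round(e, 3))
```

Replacing the two-point draw with a constant +tau reproduces the unbalanced condition, in which the DIF effects no longer cancel on average and mean-based linking becomes biased.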
7.2. Analysis
The RM was estimated separately in the two groups. The linking methods introduced in Section 3 were applied. We chose a cutoff value of 0.4 for DIF detection in the EL method. In ATR linking, we chose trimming proportions of 0.20 and 0.40. In BSQ linking, we chose 0.4 as the threshold parameter K. IA was estimated using the powers , and 0.1. GI linking was utilized with powers 1 and 2. HAE linking was specified with powers , 1, 0.5, 0.25, and 0.1.
The parameter of interest was the estimated group mean difference . For this parameter, the bias and the root mean square error (RMSE) were computed. To reduce the dependence of the RMSE on the sample size and the number of items, we computed a relative RMSE in which the RMSE of a linking method is divided by the RMSE of the best-performing linking method. Hence, the relative RMSE takes its lowest value of 100 for the best linking method.
To summarize the contribution of each manipulated factor in the simulation, we conducted an analysis of variance (ANOVA). We used a variance decomposition to assess the importance of each factor in the presence and absence of DIF.
Moreover, we classified linking methods on whether they showed satisfactory performance in a particular condition. We defined satisfactory performance for the bias if the absolute bias in the estimated mean was smaller than 0.01. An estimator had satisfactory performance concerning the RMSE if the relative RMSE was smaller than 125.
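The relative RMSE and the satisfactory-performance classification can be computed as in this small sketch; the method names and RMSE values are invented for illustration.

```python
# Hypothetical RMSE values for four linking methods in one condition
rmse = {"MM": 0.062, "HAE(2)": 0.060, "IA(0.5)": 0.066, "BSQ": 0.071}

best = min(rmse.values())
# Relative RMSE: 100 for the best method; values of 125 or more
# count as unsatisfactory precision
rel_rmse = {m: round(100 * v / best, 1) for m, v in rmse.items()}
satisfactory = {m: r < 125 for m, r in rel_rmse.items()}
print(rel_rmse, satisfactory)
```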
In all analyses, the statistical software R [] was used. The R package sirt [] was employed for estimating the RM with marginal maximum likelihood as the estimation method. The linking methods were estimated using R functions written specifically for this paper.
7.3. Results
In Table 1, the variance decomposition of the ANOVA summarized across conditions of no DIF is presented. For the bias, sample size N, the number of items I, as well as the linking method (Meth in Table 1) have an impact. However, as we will see later, the bias is of negligible size in the situation of no DIF. For the RMSE, the linking method constitutes the major source of differences. In contrast, sample size and the number of items only have small effects on the RMSE.

Table 1.
Variance proportions of different factors in the simulation study for bias and RMSE in the condition of no differential item functioning (DIF).
In Table 2, the variance decomposition of the ANOVA summarized across conditions of balanced and unbalanced DIF, respectively, is presented. All terms up to three-way interactions were included. For balanced DIF (column BAL), RMSE is more important than bias. It is evident that linking methods produced the largest variability in estimates, followed by the SD of DIF effects of reference items, the proportion of biased items , sample size N, and the type of distribution (column Dist) of DIF effects. For unbalanced DIF, the bias is primarily affected by and and their interaction. As for balanced DIF, the linking method substantially explains the variability in the RMSE of group mean differences.

Table 2.
Variance proportions of different factors in the simulation study for bias and RMSE for balanced and unbalanced differential item functioning (DIF).
Table 3 summarizes the performance of the different linking methods across all conditions with no DIF, balanced DIF, and unbalanced DIF. In the absence of DIF, all linking methods produced unbiased estimates. However, IA with small powers p of 0.25 and 0.1 as well as HAE with resulted in less precise estimates. Interestingly, GI linking always resulted in a substantially increased variability in estimated group mean differences compared to all other linking methods.

Table 3.
Summary of satisfactory performance of linking methods for bias and RMSE for no, balanced and unbalanced differential item functioning (DIF).
In the conditions with balanced DIF (column “BAL”), all linking methods (except for GI in a few conditions) produced unbiased estimates. However, using robust linking methods (i.e., EL, ATR, BSQ, IA, GI, HAE(p) with ) resulted in an efficiency loss in the RMSE compared to nonrobust linking methods (i.e., MM, HAE(2)). Among the robust linking methods, MM linking with the elimination of DIF items (i.e., EL) as well as IA and HAE with performed best.
Finally, the situation of unbalanced DIF (column “UNBAL”) is most challenging because linking methods have to handle the presence of biased items. Notably, robust linking methods are preferred over nonrobust linking in such a situation. In particular, MM and HAE(2) always resulted in biased estimates. Among the robust linking methods, BSQ and IA with and 0.1 resulted in the fewest simulation conditions with biased estimates. Concerning the RMSE, EL and ATR with a trimming proportion of 0.4 performed best, followed by IA with , HAE with , ATR with a trimming proportion of 0.2, and BSQ linking.
Table 4 shows the RMSE for balanced DIF for items as a function of sample sizes (N), proportion of biased items (), and standard deviation of DIF effects of reference items (). For balanced DIF, all linking methods produced unbiased estimates (not shown in the table). However, there were slight differences between the linking methods with respect to the RMSE. In the situation of partial invariance (i.e., ), the efficiency loss of robust linking methods compared to the nonrobust linking methods MM and HAE(2) was acceptable. However, GI resulted in more variable estimates. Moreover, note that GI linking with outperformed GI with in most conditions. The robust linking methods IA and HAE with very small power values p (e.g., or 0.1) also caused a non-negligible RMSE increase.

Table 4.
RMSE for balanced differential item functioning (DIF) for items as a function of sample sizes (N), proportion of biased items (), and standard deviation of DIF effects of reference items ().
The efficiency loss of robust linking methods is much larger if the reference items also possess DIF (i.e., ). Only IA with can to some extent compete with MM and HAE(2) linking. The variance increase in the robust linking methods IA and HAE with very small powers is apparent. It also has to be stated that GI linking produced large RMSE values in balanced DIF conditions.
Table 5 shows the bias and the RMSE for unbalanced DIF for items as a function of sample sizes (N), proportion of biased items (), and standard deviation of DIF effects of reference items (). All linking methods show biases in at least one condition. Notably, nonrobust linking methods MM and HAE(2) showed the largest bias. Robust linking methods reduce the bias in all conditions. The most critical condition is and . In this condition, BSQ linking has the least bias, followed by IA with small powers 0.25 and 0.1. In this condition, it is also interesting to note that biases for a large sample size of are smaller than for .

Table 5.
Bias and RMSE for unbalanced differential item functioning (DIF) for items as a function of sample sizes (N), proportion of biased items (), and standard deviation of DIF effects of reference items ().
With respect to the RMSE, EL, ATR, BSQ, and IA with powers 0.5 and 0.25 can be recommended. It is important to emphasize that GI linking with performed well in the case of partial invariance (i.e., ) and outperformed the recently proposed GI linking using . Interestingly, DIF detection with subsequent MM linking (method EL) was also relatively effective as long as the proportion of biased items is not too large.
8. Simulation Study 2: Performance of Resampling Methods for Computing Standard Errors and Linking Errors
In Simulation Study 2, we investigate the performance of resampling methods for estimating the variability of group mean differences. DJK and DHS have not yet been systematically studied for linking methods in the literature. In particular, there is a lack of research on resampling methods for robust linking methods.
8.1. Design
The data-generating model closely follows that of Simulation Study 1 (see Section 7.1). Only a selected number of conditions was simulated because resampling methods are computationally demanding. In contrast to Simulation Study 1, we set . Only balanced DIF was simulated because the assessment of variability (and not bias) was the focus of this simulation. The proportion of biased items was chosen as or . The SD of DIF effects for reference items was set to 0.3. We considered sample sizes and and fixed the number of items to . In total, 2000 replications were conducted in each condition of the simulation study.
8.2. Analysis
To further reduce computation time, we chose only a subset of the linking methods that provided unbiased estimates in Simulation Study 1; that is, MM, ATR, IA, and HAE. We assessed the variability in estimated group mean differences with the resampling methods SJK (Equation (80)), DJK (Equation (92)), SHS (Equation (101)), and DHS (Equation (105)). We applied the resampling methods with 20 replication zones (containing 500/20 = 25 or 2000/20 = 100 persons and 40/20 = 2 items in each zone). Approximately balanced half sampling was used; the half samples were constructed from the upper part of a Hadamard matrix with a minimum dimension larger than 20. We computed confidence intervals based on the standard errors estimated by the respective methods as . The proportion of replications in which the true difference is contained in is defined as the coverage rate. Coverage rates are classified as satisfactory if they fall within the interval for a condition in the simulation. As in Simulation Study 1, we used R [] and the R package sirt [].
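The coverage computation can be sketched as follows; the group difference, its sampling SD, and the noise model for the resampling-based standard error are hypothetical stand-ins, not the estimators evaluated in the study.

```python
import numpy as np

rng = np.random.default_rng(5)
true_diff = 0.3   # hypothetical true group mean difference
R = 2000          # replications
z = 1.96          # normal quantile for a 95% confidence interval

covered = 0
for _ in range(R):
    # Stand-ins for the linking estimate and its resampling-based SE:
    # the estimate is unbiased with SD 0.05; the SE estimate is noisy
    est = rng.normal(true_diff, 0.05)
    se = 0.05 * np.sqrt(rng.chisquare(40) / 40)
    if est - z * se <= true_diff <= est + z * se:
        covered += 1

coverage = covered / R   # proportion of intervals covering the truth
print(coverage)
```

Biased SE estimators shift this proportion away from the nominal level, which is exactly what the coverage criterion in the simulation detects.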
8.3. Results
In Table 6, coverage rates for the resampling methods are displayed. By construction, the single resampling methods (SJK and SHS) result in slightly wider confidence intervals than the double resampling methods (DJK and DHS) and, in turn, produce higher coverage rates. It can be seen that SJK and DJK failed to produce acceptable coverage rates. In particular, jackknife error estimates performed worse for robust linking methods. This is in line with results in robust statistics showing that the jackknife does not work for nondifferentiable statistics. However, SJK can be used for the nonrobust linking method HAE(2). In contrast, the half sampling methods outperformed the jackknife. As expected, SHS produced a slight overcoverage, but DHS produced acceptable coverage in all conditions. Particularly noteworthy is the fact that DHS also performed successfully for robust linking methods. Overall, these findings indicate that half sampling methods should be preferred over jackknife resampling.

Table 6.
Coverage rates for linking methods for balanced differential item functioning for items as a function of sample size (N) and the proportion of biased items ().
9. Discussion
In this article, we investigated the performance of robust and nonrobust linking methods as well as the assessment of standard error and linking error estimates of group mean differences. We assumed random DIF with a mixture distribution model. Items are implicitly classified into a set of reference items (that are valid for group comparisons) and biased items that potentially bias group mean differences. We studied the nonrobust linking methods mean-mean linking (MM) and Haebara linking (HAE) with , as well as the robust linking methods based on the asymmetrically trimmed mean (ATR), elimination of DIF items with subsequent mean-mean linking (EL), bisquare linking (BSQ), invariance alignment (IA) with , Haebara linking (HAE) with and Gini linking (GI).
We found that robust linking methods can be very effective in reducing biases in the presence of biased items in unbalanced DIF situations. However, in the presence of DIF on reference items (i.e., in the absence of partial invariance), robust linking methods can result in reduced efficiency of estimates compared to nonrobust methods such as mean-mean linking or Haebara linking, in particular in the situation of balanced DIF. Our study also compared the recently proposed Gini linking with alternative linking methods. Surprisingly, GI performed worse than its competitors and only showed acceptable performance with a modified GI version using a power . In our view, it is hard to recommend a particular linking estimator in the unbalanced DIF situation. It is only evident that mean-mean linking and Haebara linking with are prone to bias and should not be used. Moreover, the recently proposed Gini linking produced much more variable estimates than competing linking estimators. The usual practice in psychometrics (linking method EL), which eliminates DIF items in the first step of the analysis and computes group differences based on the DIF-free items in the second step, provides results comparable to those of robust linking methods (see also [,]). Note that we used the median as the preliminary location estimate in the first step of the EL method, which differs from the common practice that employs the equal mean difficulty assumption (i.e., uses the mean instead of the median; see []).
We also studied the variability of group mean difference estimates due to random DIF. The randomness of DIF introduces an additional source of error (i.e., the linking error) beyond the standard error associated with the sampling of persons. We analytically derived the distribution of the group difference through M-estimation theory. These results are primarily relevant for a (very) large number of items. Because we used a relatively small number of items in the simulation, and large item pools are not often available in applications, we investigated (single and double) jackknife and (single and double) half sampling methods for persons and items for assessing the variability of the estimates from the linking methods. We found that our proposed double half sampling outperformed jackknife-based error estimates. In contrast to the jackknife, half sampling can also be applied satisfactorily to nondifferentiable robust linking methods. These findings indicate that half sampling methods could find their way into the assessment of linking errors in empirical applications.
In this article, we focused on the estimation of group differences. In the investigation of DIF in applied research, how to choose the correct anchor is always crucial [,,,,,]. The studied robust linking estimators can be used to transform estimated item difficulties onto the same scale. Differences in transformed item difficulties can be investigated for DIF effects. Resampling procedures (single jackknife or single half sampling) can be employed for assessing the statistical significance of DIF effects.
As an alternative to separate scaling with subsequent robust or nonrobust linking, concurrent scaling assuming invariant item parameters can be utilized. Although such a one-step approach might be preferred from the practitioner’s point of view, the presence of DIF effects likely introduces some bias in estimated group differences [,]. Surprisingly, the bias is even present for balanced DIF []. Robust linking methods have the advantage that a few outlying DIF effects are automatically removed from group comparisons []. Moreover, concurrent calibration might have computational disadvantages [,]. As a further alternative, concurrent calibration assuming partial invariance can be pursued [,,]. In this approach, DIF for items is investigated in a first step, and items that showed DIF receive group-specific item parameters in the concurrent calibration approach, while invariance is assumed for the remaining items.
Furthermore, the precision of linking estimates can be improved by including additional person covariates in the analysis [,]. This could be particularly true if DIF effects also exist for person covariates. There is a lack of research on including person covariates in robust linking methods.
Finally, we assumed that the Rasch model was correctly specified. This assumption might be unrealistic in practice, and much more complex item response functions could have generated the item responses [,]. It would be interesting to study the performance of the different linking methods and the assessment of standard errors and linking errors for misspecified models. We would like to emphasize that M-estimation theory and resampling techniques also provide valid inference in the case of misspecified models. It can always be debated whether estimates from a misspecified Rasch model are practically relevant or should be interpreted. We tend to argue that parameter estimates of misspecified models summarize a population distribution, and model fitting is not always (or maybe should not be) targeted at estimating the model that has generated the data. In this sense, we think that approaches that include model error as an additional component in statistical inference [] might be beneficial.
Funding
This research received no external funding.
Acknowledgments
We would like to thank four anonymous reviewers, the academic editor and the assistant editor for valuable comments that helped to improve the article.
Conflicts of Interest
The author declares no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
1PL | one-parameter logistic model |
AN | asymptotic normal distribution |
ATR | asymmetrically trimmed mean linking |
BSQ | bisquare kernel linking |
DHS | double half sampling |
DIF | differential item functioning |
DJK | double jackknife |
EL | elimination of DIF items with subsequent mean-mean linking |
HAE | Haebara linking |
IA | invariance alignment |
IRF | item response function |
IRT | item response theory |
MM | mean-mean linking |
PISA | program for international student assessment |
RM | Rasch model |
RMSE | root mean square error |
SD | standard deviation |
SHS | single half sampling |
SJK | single jackknife |
References
- Van der Linden, W.J.; Hambleton, R.K. (Eds.) Handbook of Modern Item Response Theory; Springer: New York, NY, USA, 1997. [Google Scholar] [CrossRef]
- Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154. [Google Scholar]
- Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational Research: Copenhagen, Denmark, 1960. [Google Scholar]
- Fischer, G.H.; Molenaar, I.W. (Eds.) Rasch Models. Foundations, Recent Developments, and Applications; Springer: New York, NY, USA, 1995. [Google Scholar] [CrossRef]
- Kolen, M.J.; Brennan, R.L. Test Equating, Scaling, and Linking; Springer: New York, NY, USA, 2014. [Google Scholar] [CrossRef]
- Lee, W.C.; Lee, G. IRT linking and equating. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 639–667. [Google Scholar] [CrossRef]
- Penfield, R.D.; Camilli, G. Differential item functioning and item bias. In Handbook of Statistics, Volume 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Routledge: Oxford, UK, 2007; pp. 125–167. [Google Scholar] [CrossRef]
- Andrich, D.; Marais, I. A Course in Rasch Measurement Theory; Springer: New York, NY, USA, 2019. [Google Scholar] [CrossRef]
- Kubinger, K.D. Psychological test calibration using the Rasch model—Some critical suggestions on traditional approaches. Int. J. Test. 2005, 5, 377–394. [Google Scholar] [CrossRef]
- Linacre, J.M. Understanding Rasch measurement: Estimation methods for Rasch measures. J. Outcome Meas. 1999, 3, 382–405. [Google Scholar]
- Linacre, J.M. Rasch model estimation: Further topics. J. Appl. Meas. 2004, 5, 95–110. [Google Scholar]
- Rost, J. Was ist aus dem Rasch-Modell geworden? [Where has the Rasch model gone?]. Psychol. Rundsch. 1999, 50, 140–156. [Google Scholar] [CrossRef]
- Von Davier, M. The Rasch model. In Handbook of Item Response Theory, Volume 1: Models; CRC Press: Boca Raton, FL, USA, 2016; pp. 31–48. [Google Scholar] [CrossRef]
- Holland, P.W. On the sampling theory foundations of item response theory models. Psychometrika 1990, 55, 577–601. [Google Scholar] [CrossRef]
- San Martin, E. Identification of item response theory models. In Handbook of Item Response Theory, Volume 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 127–150. [Google Scholar] [CrossRef]
- Robitzsch, A. A comprehensive simulation study of estimation methods for the Rasch model. Stats 2021, 4, 48. [Google Scholar] [CrossRef]
- Xu, X.; Jia, Y. The Sensitivity of Parameter Estimates to the Latent Ability Distribution; (Research Report No. RR-11-40); Educational Testing Service: Princeton, NJ, USA, 2011. [Google Scholar] [CrossRef]
- Zwinderman, A.H.; Van den Wollenberg, A.L. Robustness of marginal maximum likelihood estimation in the Rasch model. Appl. Psychol. Meas. 1990, 14, 73–81. [Google Scholar] [CrossRef]
- Fischer, G.H. Rasch models. In Handbook of Statistics, Volume 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Routledge: Oxford, UK, 2007; pp. 515–585. [Google Scholar] [CrossRef]
- San Martin, E.; Rolin, J. Identification of parametric Rasch-type models. J. Stat. Plan. Inference 2013, 143, 116–130. [Google Scholar] [CrossRef]
- Glas, C.A.W. Maximum-likelihood estimation. In Handbook of Item Response Theory, Vol. 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 197–216. [Google Scholar] [CrossRef]
- Loken, E.; Rulison, K.L. Estimation of a four-parameter item response theory model. Brit. J. Math. Stat. Psychol. 2010, 63, 509–525. [Google Scholar] [CrossRef]
- Falk, C.F.; Cai, L. Semiparametric item response functions in the context of guessing. J. Educ. Meas. 2016, 53, 229–247. [Google Scholar] [CrossRef]
- Feuerstahler, L. Flexible item response modeling in R with the flexmet package. Psych 2021, 3, 31. [Google Scholar] [CrossRef]
- Ramsay, J.O.; Winsberg, S. Maximum marginal likelihood estimation for semiparametric item analysis. Psychometrika 1991, 56, 365–379. [Google Scholar] [CrossRef]
- Rossi, N.; Wang, X.; Ramsay, J.O. Nonparametric item response function estimates with the EM algorithm. J. Educ. Behav. Stat. 2002, 27, 291–317. [Google Scholar] [CrossRef]
- Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479. [Google Scholar]
- Battauz, M. Regularized estimation of the four-parameter logistic model. Psych 2020, 2, 20. [Google Scholar] [CrossRef]
- Culpepper, S.A. The prevalence and implications of slipping on low-stakes, large-scale assessments. J. Educ. Behav. Stat. 2017, 42, 706–725. [Google Scholar] [CrossRef]
- Camilli, G. IRT scoring and test blueprint fidelity. Appl. Psychol. Meas. 2018, 42, 393–400. [Google Scholar] [CrossRef]
- Robitzsch, A.; Lüdtke, O. Reflections on analytical choices in the scaling model for test scores in international large-scale assessment studies. PsyArXiv 2021. [Google Scholar] [CrossRef]
- OECD. PISA 2012. Technical Report; OECD: Paris, France, 2014; Available online: https://bit.ly/2YLG24g (accessed on 30 June 2021).
- Becker, B.; Weirich, S.; Mahler, N.; Sachse, K.A. Testdesign und Auswertung des IQB-Bildungstrends 2018: Technische Grundlagen [Test design and analysis of the IQB education trend 2018: Technical foundations]. In IQB-Bildungstrend 2018. Mathematische und naturwissenschaftliche Kompetenzen am Ende der Sekundarstufe I im zweiten Ländervergleich; Stanat, P., Schipolowski, S., Mahler, N., Weirich, S., Henschel, S., Eds.; Waxmann: Münster, Germany, 2019; pp. 411–425. Available online: https://bit.ly/3mTvgRX (accessed on 30 June 2021).
- Pohl, S.; Carstensen, C. NEPS Technical Report–Scaling the Data of the Competence Tests; (NEPS Working Paper No. 14); Otto-Friedrich-Universität, Nationales Bildungspanel: Bamberg, Germany, 2012; Available online: https://bit.ly/2XThQww (accessed on 30 June 2021).
- Wendt, H.; Bos, W.; Goy, M. On applications of Rasch models in international comparative large-scale assessments: A historical review. Educ. Res. Eval. 2011, 17, 419–446. [Google Scholar] [CrossRef]
- Hoff, P.; Wakefield, J. Bayesian sandwich posteriors for pseudo-true parameters. J. Stat. Plan. Inference 2013, 10, 1638–1642. [Google Scholar] [CrossRef]
- Boos, D.D.; Stefanski, L.A. Essential Statistical Inference; Springer: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
- Sun, Y. Constructing a Misspecified Item Response Model That Yields a Specified Estimate and a Specified Model Misfit Value. Ph.D. Thesis, The Ohio State University, Columbus, OH, USA, 2015. Available online: https://bit.ly/3AGJPgm (accessed on 30 June 2021).
- White, H. Maximum likelihood estimation of misspecified models. Econometrica 1982, 50, 1–25. [Google Scholar] [CrossRef]
- Forero, C.G.; Maydeu-Olivares, A. Estimation of IRT graded response models: Limited versus full information methods. Psychol. Methods 2009, 14, 275–299. [Google Scholar] [CrossRef]
- Bechger, T.M.; Maris, G. A statistical test for differential item pair functioning. Psychometrika 2015, 80, 317–340. [Google Scholar] [CrossRef]
- Cho, S.J.; Suh, Y.; Lee, W.Y. After differential item functioning is detected: IRT item calibration and scoring in the presence of DIF. Appl. Psychol. Meas. 2016, 40, 573–591. [Google Scholar] [CrossRef]
- Doebler, A. Looking at DIF from a new perspective: A structure-based approach acknowledging inherent indefinability. Appl. Psychol. Meas. 2019, 43, 303–321. [Google Scholar] [CrossRef]
- Robitzsch, A.; Lüdtke, O. Mean comparisons of many groups in the presence of DIF: An evaluation of linking and concurrent scaling approaches. J. Educ. Behav. Stat. 2021. Epub ahead of print. [Google Scholar] [CrossRef]
- Van de Schoot, R.; Kluytmans, A.; Tummers, L.; Lugtig, P.; Hox, J.; Muthén, B. Facing off with scylla and charybdis: A comparison of scalar, partial, and the novel possibility of approximate measurement invariance. Front. Psychol. 2013, 4, 770. [Google Scholar] [CrossRef]
- Frederickx, S.; Tuerlinckx, F.; De Boeck, P.; Magis, D. RIM: A random item mixture model to detect differential item functioning. J. Educ. Meas. 2010, 47, 432–457. [Google Scholar] [CrossRef]
- Robitzsch, A.; Lüdtke, O. A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psych. Test Assess. Model. 2020, 62, 233–279. [Google Scholar]
- De Boeck, P. Random item IRT models. Psychometrika 2008, 73, 533–559. [Google Scholar] [CrossRef]
- Soares, T.M.; Gonçalves, F.B.; Gamerman, D. An integrated Bayesian model for DIF analysis. J. Educ. Behav. Stat. 2009, 34, 348–377. [Google Scholar] [CrossRef]
- Pohl, S.; Schulze, D. Assessing group comparisons or change over time under measurement non-invariance: The cluster approach for nonuniform DIF. Psych. Test Assess. Model. 2020, 62, 281–303. [Google Scholar]
- Pohl, S.; Schulze, D.; Stets, E. Partial measurement invariance: Extending and evaluating the cluster approach for identifying anchor items. Appl. Psychol. Meas. 2021. Epub ahead of print. [Google Scholar] [CrossRef]
- Kopf, J.; Zeileis, A.; Strobl, C. Anchor selection strategies for DIF analysis: Review, assessment, and new approaches. Educ. Psychol. Meas. 2015, 75, 22–56. [Google Scholar] [CrossRef]
- Magis, D.; Béland, S.; Tuerlinckx, F.; De Boeck, P. A general framework and an R package for the detection of dichotomous differential item functioning. Behav. Res. Methods 2010, 42, 847–862. [Google Scholar] [CrossRef]
- Millsap, R.E. Statistical Approaches to Measurement Invariance; Routledge: New York, NY, USA, 2011. [Google Scholar] [CrossRef]
- Camilli, G. The case against item bias detection techniques based on internal criteria: Do item bias procedures obscure test fairness issues? In Differential Item Functioning: Theory and Practice; Holland, P.W., Wainer, H., Eds.; Erlbaum: Hillsdale, NJ, USA, 1993; pp. 397–417. [Google Scholar]
- Welzel, C.; Inglehart, R.F. Misconceptions of measurement equivalence: Time for a paradigm shift. Comp. Political Stud. 2016, 49, 1068–1094. [Google Scholar] [CrossRef]
- Welzel, C.; Brunkert, L.; Kruse, S.; Inglehart, R.F. Non-invariance? An overstated problem with misconceived causes. Sociol. Methods Res. 2021. Epub ahead of print. [Google Scholar] [CrossRef]
- Oliveri, M.E.; von Davier, M. Investigation of model fit and score scale comparability in international assessments. Psych. Test Assess. Model. 2011, 53, 315–333. Available online: https://bit.ly/3k4K9kt (accessed on 30 June 2021).
- Rutkowski, L.; Svetina, D. Measurement invariance in international surveys: Categorical indicators and fit measure performance. Appl. Meas. Educ. 2017, 30, 39–51. [Google Scholar] [CrossRef]
- Von Davier, M.; Khorramdel, L.; He, Q.; Shin, H.J.; Chen, H. Developments in psychometric population models for technology-based large-scale assessments: An overview of challenges and opportunities. J. Educ. Behav. Stat. 2019, 44, 671–705. [Google Scholar] [CrossRef]
- González, J.; Wiberg, M. Applying Test Equating Methods. Using R; Springer: New York, NY, USA, 2017. [Google Scholar] [CrossRef]
- Sansivieri, V.; Wiberg, M.; Matteucci, M. A review of test equating methods with a special focus on IRT-based approaches. Statistica 2017, 77, 329–352. [Google Scholar] [CrossRef]
- Von Davier, A.A.; Carstensen, C.H.; von Davier, M. Linking Competencies in Educational Settings and Measuring Growth; (Research Report No. RR-06-12); Educational Testing Service: Princeton, NJ, USA, 2006. [Google Scholar] [CrossRef]
- Manna, V.F.; Gu, L. Different Methods of Adjusting for Form Difficulty under the Rasch Model: Impact on Consistency of Assessment Results; (Research Report No. RR-19-08); Educational Testing Service: Princeton, NJ, USA, 2019. [Google Scholar] [CrossRef]
- Jureckova, J.; Picek, J. Robust Statistical Methods with R; CRC Press: Boca Raton, FL, USA, 2019. [Google Scholar] [CrossRef]
- Huber, P.J.; Ronchetti, E.M. Robust Statistics; Wiley: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
- Maronna, R.A.; Martin, R.D.; Yohai, V.J. Robust Statistics: Theory and Methods; Wiley: New York, NY, USA, 2006. [Google Scholar] [CrossRef]
- Ronchetti, E. The main contributions of robust statistics to statistical science and a new challenge. Metron 2021, 79, 127–135. [Google Scholar] [CrossRef]
- Magis, D.; De Boeck, P. Identification of differential item functioning in multiple-group settings: A multivariate outlier detection approach. Multivar. Behav. Res. 2011, 46, 733–755. [Google Scholar] [CrossRef]
- Magis, D.; De Boeck, P. A robust outlier approach to prevent type I error inflation in differential item functioning. Educ. Psychol. Meas. 2012, 72, 291–311. [Google Scholar] [CrossRef]
- Rusiecki, A. Robust learning algorithm based on LTA estimator. Neurocomputing 2013, 120, 624–632. [Google Scholar] [CrossRef]
- Wilcox, R. Modern Statistics for the Social and Behavioral Sciences: A Practical Introduction; CRC Press: Boca Raton, FL, USA, 2017. [Google Scholar] [CrossRef]
- Yuan, K.H.; Bentler, P.M.; Chan, W. Structural equation modeling with heavy tailed distributions. Psychometrika 2004, 69, 421–436. [Google Scholar] [CrossRef]
- Yuan, K.H.; Zhang, Z. Structural equation modeling diagnostics using R package semdiag and EQS. Struct. Equ. Model. 2012, 19, 683–702. [Google Scholar] [CrossRef]
- Kalina, J. Implicitly weighted methods in robust image analysis. J. Math. Imaging Vis. 2012, 44, 449–462. [Google Scholar] [CrossRef]
- Fox, J. Applied Regression Analysis and Generalized Linear Models; Sage: Thousand Oaks, CA, USA, 2016. [Google Scholar]
- Hampel, F.R.; Ronchetti, E.M.; Rousseeuw, P.J.; Stahel, W.A. Robust Statistics: The Approach Based on Influence Functions; Wiley: New York, NY, USA, 1986. [Google Scholar] [CrossRef]
- Asparouhov, T.; Muthén, B. Multiple-group factor analysis alignment. Struct. Equ. Model. 2014, 21, 495–508. [Google Scholar] [CrossRef]
- Muthén, B.; Asparouhov, T. IRT studies of many groups: The alignment method. Front. Psychol. 2014, 5, 978. [Google Scholar] [CrossRef]
- Pokropek, A.; Lüdtke, O.; Robitzsch, A. An extension of the invariance alignment method for scale linking. Psych. Test Assess. Model. 2020, 62, 303–334. [Google Scholar]
- Robitzsch, A. Lp loss functions in invariance alignment and Haberman linking with few or many groups. Stats 2020, 3, 19. [Google Scholar] [CrossRef]
- Muthén, B.; Asparouhov, T. Recent methods for the study of measurement invariance with many groups: Alignment and random effects. Sociol. Methods Res. 2018, 47, 637–664. [Google Scholar] [CrossRef]
- Pokropek, A.; Davidov, E.; Schmidt, P. A Monte Carlo simulation study to assess the appropriateness of traditional and newer approaches to test for measurement invariance. Struct. Equ. Model. 2019, 26, 724–744. [Google Scholar] [CrossRef]
- Haebara, T. Equating logistic ability scales by a weighted least squares method. Jpn. Psychol. Res. 1980, 22, 144–149. [Google Scholar] [CrossRef]
- He, Y.; Cui, Z.; Osterlind, S.J. New robust scale transformation methods in the presence of outlying common items. Appl. Psychol. Meas. 2015, 39, 613–626. [Google Scholar] [CrossRef]
- He, Y.; Cui, Z. Evaluating robust scale transformation methods with multiple outlying common items under IRT true score equating. Appl. Psychol. Meas. 2020, 44, 296–310. [Google Scholar] [CrossRef]
- Robitzsch, A. Robust Haebara linking for many groups: Performance in the case of uniform DIF. Psych 2020, 2, 14. [Google Scholar] [CrossRef]
- Strobl, C.; Kopf, J.; Kohler, L.; von Oertzen, T.; Zeileis, A. Anchor point selection: Scale alignment based on an inequality criterion. Appl. Psychol. Meas. 2021, 45, 214–230. [Google Scholar] [CrossRef]
- Monseur, C.; Berezner, A. The computation of equating errors in international surveys in education. J. Appl. Meas. 2007, 8, 323–335. [Google Scholar]
- Monseur, C.; Sibberns, H.; Hastedt, D. Linking errors in trend estimation for international surveys in education. IERI Monogr. Ser. 2008, 1, 113–122. [Google Scholar]
- Robitzsch, A.; Lüdtke, O. Linking errors in international large-scale assessments: Calculation of standard errors for trend estimation. Assess. Educ. 2019, 26, 444–465. [Google Scholar] [CrossRef]
- Sachse, K.A.; Haag, N. Standard errors for national trends in international large-scale assessments in the case of cross-national differential item functioning. Appl. Meas. Educ. 2017, 30, 102–116. [Google Scholar] [CrossRef]
- Wu, M. Measurement, sampling, and equating errors in large-scale assessments. Educ. Meas. 2010, 29, 15–27. [Google Scholar] [CrossRef]
- Jaeckel, L.A. Robust estimates of location: Symmetry and asymmetric contamination. Ann. Math. Stat. 1971, 42, 1020–1034. [Google Scholar] [CrossRef]
- Xu, X.; Chen, X. A practical method of robust estimation in case of asymmetry. J. Stat. Theory Pract. 2018, 12, 370–396. [Google Scholar] [CrossRef]
- Stefanski, L.A.; Boos, D.D. The calculus of M-estimation. Am. Stat. 2002, 56, 29–38. [Google Scholar] [CrossRef]
- Huber, P.J. Robust estimation of a location parameter. Ann. Math. Stat. 1964, 35, 73–101. [Google Scholar] [CrossRef]
- Simakhin, V.A.; Shamanaeva, L.G.; Avdyushina, A.E. Robust parametric estimates of heterogeneous experimental data. Russ. Phys. J. 2021, 63, 1510–1518. [Google Scholar] [CrossRef]
- Hunter, J.E. Probabilistic foundations for coefficients of generalizability. Psychometrika 1968, 33, 1–18. [Google Scholar] [CrossRef]
- Husek, T.R.; Sirotnik, K. Item Sampling in Educational Research; CSEIP Occasional Report No. 2.; University of California: Los Angeles, CA, USA, 1967; Available online: https://bit.ly/3k47t1s (accessed on 30 June 2021).
- Yuan, K.H.; Cheng, Y.; Patton, J. Information matrices and standard errors for MLEs of item parameters in IRT. Psychometrika 2014, 79, 232–254. [Google Scholar] [CrossRef]
- Kolenikov, S. Resampling variance estimation for complex survey data. Stata J. 2010, 10, 165–199. [Google Scholar] [CrossRef]
- Rao, J.N.K.; Wu, C.F.J. Resampling inference with complex survey data. J. Am. Stat. Assoc. 1988, 83, 231–241. [Google Scholar] [CrossRef]
- Brennan, R.L. Generalizability Theory; Springer: New York, NY, USA, 2001. [Google Scholar] [CrossRef]
- Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; CRC Press: Boca Raton, FL, USA, 1994. [Google Scholar] [CrossRef]
- Haberman, S.J.; Lee, Y.H.; Qian, J. Jackknifing Techniques for Evaluation of Equating Accuracy; (Research Report No. RR-09-02); Educational Testing Service: Princeton, NJ, USA, 2009. [Google Scholar] [CrossRef]
- Rao, J.N.K.; Wu, C.F.J. Inference from stratified samples: Second-order analysis of three methods for nonlinear statistics. J. Am. Stat. Assoc. 1985, 80, 620–630. [Google Scholar] [CrossRef]
- Xu, X.; von Davier, M. Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study; (Research Report No. RR-10-10); Educational Testing Service: Princeton, NJ, USA, 2010. [Google Scholar] [CrossRef]
- Battauz, M. Multiple equating of separate IRT calibrations. Psychometrika 2017, 82, 610–636. [Google Scholar] [CrossRef] [PubMed]
- Michaelides, M.P.; Haertel, E.H. Selection of common items as an unrecognized source of variability in test equating: A bootstrap approximation assuming random sampling of common items. Appl. Meas. Educ. 2014, 27, 46–57. [Google Scholar] [CrossRef]
- Tong, Y.; Brennan, R.L. Bootstrap estimates of standard errors in generalizability theory. Educ. Psychol. Meas. 2007, 67, 804–817. [Google Scholar] [CrossRef]
- R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2020; Available online: https://www.R-project.org/ (accessed on 20 August 2020).
- Robitzsch, A. sirt: Supplementary Item Response Theory Models. R package version 3.10-111; R Core Team: Vienna, Austria, 2021; Available online: https://github.com/alexanderrobitzsch/sirt (accessed on 25 June 2021).
- DeMars, C.E. Alignment as an alternative to anchor purification in DIF analyses. Struct. Equ. Model. 2020, 27, 56–72. [Google Scholar] [CrossRef]
- Chen, Y.; Li, C.; Xu, G. DIF statistical inference and detection without knowing anchoring items. arXiv 2021, arXiv:2110.11112. Available online: https://arxiv.org/abs/2110.11112 (accessed on 21 October 2021).
- Kopf, J.; Zeileis, A.; Strobl, C. A framework for anchor methods and an iterative forward approach for DIF detection. Appl. Psychol. Meas. 2015, 39, 83–103. [Google Scholar] [CrossRef]
- Tutz, G.; Schauberger, G. A penalty approach to differential item functioning in Rasch models. Psychometrika 2015, 80, 21–43. [Google Scholar] [CrossRef]
- Yuan, K.H.; Liu, H.; Han, Y. Differential item functioning analysis without a priori information on anchor items: QQ plots and graphical test. Psychometrika 2021, 86, 345–377. [Google Scholar] [CrossRef]
- Robitzsch, A. A comparison of linking methods for two groups for the two-parameter logistic item response model in the presence and absence of random differential item functioning. Foundations 2021, 1, 9. [Google Scholar] [CrossRef]
- Andersson, B. Asymptotic variance of linking coefficient estimators for polytomous IRT models. Appl. Psychol. Meas. 2018, 42, 192–205. [Google Scholar] [CrossRef]
- Von Davier, M.; Yamamoto, K.; Shin, H.J.; Chen, H.; Khorramdel, L.; Weeks, J.; Davis, S.; Kong, N.; Kandathil, M. Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assess. Educ. 2019, 26, 466–488. [Google Scholar] [CrossRef]
- Glas, C.A.W.; Jehangir, M. Modeling country-specific differential functioning. In A Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis; Rutkowski, L., von Davier, M., Rutkowski, D., Eds.; Chapman Hall/CRC Press: London, UK, 2013; pp. 97–115. [Google Scholar] [CrossRef]
- Albano, A.D.; Wiberg, M. Linking with external covariates: Examining accuracy by anchor type, test length, ability difference, and sample size. Appl. Psychol. Meas. 2019, 43, 597–610. [Google Scholar] [CrossRef]
- Sansivieri, V.; Wiberg, M. Linking scales in item response theory with covariates. J. Res. Educ. Scie. Technol. 2018, 3, 12–32. Available online: https://bit.ly/3ze7qEF (accessed on 30 June 2021).
- Wu, H.; Browne, M.W. Quantifying adventitious error in a covariance structure as a random effect. Psychometrika 2015, 80, 571–600. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).