Article

On the Treatment of Missing Item Responses in Educational Large-Scale Assessment Data: An Illustrative Simulation Study and a Case Study Using PISA 2018 Mathematics Data

by Alexander Robitzsch 1,2
1 IPN – Leibniz Institute for Science and Mathematics Education, University of Kiel, Olshausenstraße 62, 24118 Kiel, Germany
2 Centre for International Student Assessment (ZIB), University of Kiel, Olshausenstraße 62, 24118 Kiel, Germany
Eur. J. Investig. Health Psychol. Educ. 2021, 11(4), 1653-1687; https://doi.org/10.3390/ejihpe11040117
Submission received: 3 October 2021 / Revised: 26 November 2021 / Accepted: 10 December 2021 / Published: 14 December 2021

Abstract

Missing item responses are prevalent in educational large-scale assessment studies such as the Programme for International Student Assessment (PISA). The current operational practice scores missing item responses as wrong, but several psychometricians have advocated for a model-based treatment based on the latent ignorability assumption. In this approach, item responses and response indicators are jointly modeled conditional on a latent ability and a latent response propensity variable. Alternatively, imputation-based approaches can be used. The latent ignorability assumption is weakened in the Mislevy-Wu model, which characterizes a nonignorable missingness mechanism and allows the missingness of an item to depend on the item itself. The scoring of missing item responses as wrong and the latent ignorable model are submodels of the Mislevy-Wu model. In an illustrative simulation study, it is shown that the Mislevy-Wu model provides unbiased model parameters. Moreover, the simulation replicates the finding from various simulation studies in the literature that scoring missing item responses as wrong provides biased estimates if the latent ignorability assumption holds in the data-generating model. However, if missing item responses can only arise from incorrect item responses, applying an item response model that relies on latent ignorability results in biased estimates. The Mislevy-Wu model guarantees unbiased parameter estimates if the more general Mislevy-Wu model holds in the data-generating model. In addition, this article uses the PISA 2018 mathematics dataset as a case study to investigate the consequences of different missing data treatments on country means and country standard deviations. The obtained country means and country standard deviations can differ substantially across the scaling models.
In contrast to previous statements in the literature, the scoring of missing item responses as incorrect provided a better model fit than a latent ignorable model for most countries. Furthermore, the dependence of the missingness of an item on the item itself after conditioning on the latent response propensity was much more pronounced for constructed-response items than for multiple-choice items. As a consequence, scaling models that presuppose latent ignorability should be rejected from two perspectives. First, the Mislevy-Wu model is preferred over the latent ignorable model for reasons of model fit. Second, in the discussion section, we argue that model fit should only play a minor role in choosing psychometric models in large-scale assessment studies because validity aspects are most relevant. Missing data treatments that countries (and, hence, their students) can simply manipulate result in unfair country comparisons.

1. Introduction

It has frequently been argued that measured student performance in educational large-scale assessment (LSA; [1,2,3]) studies is affected by test-taking strategies. In a recent paper that was published in the highly ranked Science journal, researchers Steffi Pohl, Esther Ulitzsch and Matthias von Davier [4] argue that “current reporting practices, however, they confound differences in test-taking behavior (such as working speed and item nonresponse) with differences in competencies (ability). Furthermore, they do so in a different way for different examinees, threatening the fairness of country comparisons” [4]. Hence, the reported student performance (or, equivalently, student ability) is regarded by the authors as a conflated composite of a “true” ability and test-taking strategies. Importantly, Pohl et al. [4] question the validity of country comparisons that are currently reported in LSA studies and argue for an approach that separates test-taking behavior (i.e., item response propensity and working speed) from a purified ability measure. The core idea of the Pohl et al. [4] approach concerns how to model missing item responses in educational large-scale assessment studies. In this article, we systematically investigate the consequences of different treatments of missing item responses in the Programme for International Student Assessment (PISA) study conducted in 2018. Note that we do not focus on exploring or modeling test-taking strategies in this article.
While methods for the treatment of missing data are now widely used in statistical analyses in the social sciences [5,6,7,8], the recent literature also contains recommendations for treating missing item responses in item response theory (IRT; [9]) models in LSA studies [10,11]. Typically, the treatment of missing item responses can differ between calibration (computation of item parameters) and scaling (computation of ability distributions).
It is essential to distinguish the type of missing item responses. Missing item responses at the end of the test are referred to as not reached items, while missing items within the test are denoted as omitted items [12]. Since the PISA 2015 study, not reached items are no longer scored as wrong and the proportion of not reached items is used as a predictor in the latent background model [13]. Items that are not administered to students in test booklets in a multiple-matrix design [13,14,15] lead to missingness completely at random (except in multi-stage adaptive testing; see [16]). This kind of missingness is not the topic of this article and typically does not cause issues in estimating population and item parameters.
Several psychometricians have repeatedly argued that missing item responses should never be scored as wrong because such a treatment would produce biased item parameter estimates and unfair country rankings [4,10,11,17,18]. Instead, model-based treatments of missing item responses that rely on latent ignorability [4,10,11,19] are advocated. Missing item responses can be ignored in this approach when including response indicators and a latent response propensity [20,21]. Importantly, the missingness process is summarized by the latent response variable. As an alternative, multiple imputation at the level of items can be employed to handle missing item responses properly [22,23]. However, scoring missing item responses as wrong could be defended for validity reasons [24,25,26]. Moreover, it has occasionally been argued that simulation studies cannot provide information on the proper treatment of missing item responses in a concrete empirical application because the true data-generating mechanism is unknown [25,27]. Nevertheless, simulation studies can be tremendously helpful in understanding and comparing competing statistical modeling approaches.
Our findings might only be generalizable to other low-stakes assessment studies like PISA [28,29,30] because the underlying mechanisms for missing item responses can differ strongly in high-stakes assessment studies [31].
Although several proposals of using alternative scaling models for abilities in LSA studies like PISA have been made, previous work either did not report country means in the metric of interest [10], such that consequences cannot be interpreted, or constituted only a toy analysis consisting of only a few countries [4] that did not enable a generalization to operational practice. Therefore, this article compares different scaling models that rely on different treatments of missing item responses. We use the PISA 2018 mathematics dataset as a showcase. We particularly contrast the scoring of missing item responses as wrong with model-based approaches that rely on latent ignorability [4,10,11] and a more flexible Mislevy-Wu model [32,33] containing the former two models as submodels. In the framework of the Mislevy-Wu model, it is tested whether the scoring of missing item responses as wrong or treating them as latent ignorable is preferred in terms of model fit. Moreover, it is studied whether the probability of responding to an item depends on the item response itself (i.e., nonignorable missingness, [7]). In the most general model, the missingness process is assumed to be item format-specific. Finally, we investigate the variability across means from different models for a country.
The rest of the article is structured as follows. In Section 2, an overview of different statistical modeling approaches for handling missing item responses is presented. Section 3 contains an illustrative simulation study that demonstrates the distinguishing features of the different modeling approaches. In Section 4, the sample of persons and items and the analysis strategy for the PISA 2018 mathematics case study are described. In Section 5, the results of PISA 2018 mathematics are presented. Finally, the paper closes with a discussion in Section 6.

2. Statistical Models for Handling Missing Item Responses

In this section, different statistical approaches for handling missing item responses are discussed. These different approaches are utilized in the illustrative simulation study (see Section 3) and the empirical case study involving PISA 2018 mathematics data (see Section 4).
For simplicity, we only consider the case of dichotomous items. The case of polytomous items only requires more notation for the description of models but does not change the general reasoning elaborated for dichotomous items. Let $X_{pi}$ denote the dichotomous item response and $R_{pi}$ the response indicator for person $p$ and item $i$. The response indicator $R_{pi}$ takes the value one if $X_{pi}$ is observed and zero if $X_{pi}$ is missing. Consistent with the operational practice since PISA 2015, the two-parameter logistic (2PL) model [34] is used for scaling item responses [13,16]. The item response function is given as
$$P(X_{pi} = 1 \mid \theta_p) = \Psi(a_i(\theta_p - b_i)), \tag{1}$$
where $\Psi$ denotes the logistic distribution function. The item parameters $a_i$ and $b_i$ are item discriminations and difficulties, respectively. It holds that $1 - \Psi(x) = \Psi(-x)$. Local independence of item responses is assumed; that is, item responses $X_{pi}$ are conditionally independent of each other given the ability variable $\theta_p$. The latent ability $\theta_p$ follows a standard normal distribution. If all item parameters are estimated, the mean of the ability distribution is fixed to zero and the standard deviation is fixed to one. The one-parameter logistic (1PL, [35]) model is obtained if all item discriminations are set equal to each other.
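The item response function in Equation (1) can be illustrated with a minimal Python sketch. The function names and the item parameter values are hypothetical and chosen only for illustration; the operational PISA scaling is not reproduced here.

```python
import math


def logistic(x: float) -> float:
    """Logistic distribution function Psi(x), evaluated in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """P(X = 1 | theta) under the 2PL model of Equation (1)."""
    return logistic(a * (theta - b))


# Hypothetical item with discrimination 1.2 and difficulty 0.5.
# A person whose ability equals the item difficulty solves it with probability 0.5.
p_equal = p_correct_2pl(theta=0.5, a=1.2, b=0.5)
# Setting all discriminations to the same value yields the 1PL model.
p_1pl = p_correct_2pl(theta=1.0, a=1.0, b=0.0)
```

The identity $1 - \Psi(x) = \Psi(-x)$ can be checked numerically with `logistic`, for example at `x = 0.7`.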
In Figure 1, the main distinctive features of the different missing data treatments are shown. Three primary strategies can be distinguished [36,37]. These strategies differ in how to include information from the response indicator variables.
First, response indicators $R_p$ are unmodelled (using model labels starting with “U”), and missing entries in item responses $X_p$ are scored using some a priori defined rule, resulting in item responses $X_{\mathrm{sco},p}$ without missing entries. For example, missing item responses can be scored as wrong or can be omitted in the estimation of the scaling model. In a second step, the 2PL scaling model is applied to the dataset containing scored item responses $X_{\mathrm{sco},p}$.
Second, model-based approaches (using model labels starting with “M”) pose a joint IRT model for item responses $X_p$ and response indicators $R_p$ [19]. The 2PL scaling model for the one-dimensional ability variable $\theta_p$ is part of this model. In addition, a further latent variable $\xi_p$ (i.e., the so-called response propensity) is included that describes the correlational structure underlying the response indicators $R_p$. In most approaches discussed in the literature, there is no path from $X_{pi}$ to $R_{pi}$. After controlling for ability $\theta_p$ and response propensity $\xi_p$, there is no modeled effect of the item response on the response indicator. In this paper, we allow for this additional relation by using the Mislevy-Wu model and empirically demonstrate that missingness on items depends on the item response itself.
Third, imputation-based approaches (using model labels starting with “I”) first generate multiply imputed datasets and fit the 2PL scaling model to the imputed datasets in a second step [37,38]. Different imputation models can be employed. One can either use only the item responses $X_p$ or use the item responses $X_p$ and the response indicators $R_p$ in the imputation model. As an alternative, imputations can be generated based on an IRT model that contains item responses $X_p$ and missing indicators $R_p$. These imputation models can coincide with IRT models that are employed as model-based approaches in our overview. After fitting the IRT models for $(X_p, R_p)$, the output contains a posterior distribution $P(\theta_p, \xi_p \mid X_p, R_p)$ for each subject $p$. For each imputed dataset, one first simulates latent variables $\theta_p$ and $\xi_p$ from the posterior distribution [39]. For items with missing item responses (i.e., $R_{pi} = 0$), one can simulate scores for $X_{pi}$ according to the conditional distribution $P(X_{pi} = x \mid R_{pi} = 0, \theta_p, \xi_p)$ ($x = 0, 1$). It holds that
$$P(X_{pi} = 1 \mid R_{pi} = 0, \theta_p, \xi_p) = \frac{P(R_{pi} = 0 \mid X_{pi} = 1, \xi_p)\, P(X_{pi} = 1 \mid \theta_p)}{\sum_{x=0}^{1} P(R_{pi} = 0 \mid X_{pi} = x, \xi_p)\, P(X_{pi} = x \mid \theta_p)}. \tag{2}$$
The 2PL scaling model is applied to the imputed datasets $X_{\mathrm{imp},p}$ in a second step. In the analyses of this paper, we always created 5 imputed datasets to reduce the simulation error associated with the imputation. We stacked the 5 multiply imputed datasets into one long dataset and applied the 2PL scaling model to the stacked dataset (see [40,41,42]). The stacking approach does not result in biased item parameter estimates [41], but resampling procedures are required for obtaining correct standard errors [40]. This article mainly focuses on differences between results from different models and does not investigate the accuracy of standard error computation methods based on resampling procedures.
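The conditional imputation probability obtained from Bayes' theorem and the stacking of imputed datasets can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the three input probabilities are taken as already evaluated at values of $\theta_p$ and $\xi_p$ drawn from the posterior, and missing responses are coded as `None`.

```python
import random


def p_correct_given_missing(p_x1: float, p_miss_x0: float, p_miss_x1: float) -> float:
    """P(X = 1 | R = 0, theta, xi) by Bayes' theorem.

    p_x1      : P(X = 1 | theta)
    p_miss_x0 : P(R = 0 | X = 0, xi)
    p_miss_x1 : P(R = 0 | X = 1, xi)
    """
    numerator = p_miss_x1 * p_x1
    denominator = p_miss_x1 * p_x1 + p_miss_x0 * (1.0 - p_x1)
    return numerator / denominator


def impute_item(x, p_x1, p_miss_x0, p_miss_x1, rng):
    """Return the observed response, or a random draw if the response is missing."""
    if x is not None:
        return x
    return 1 if rng.random() < p_correct_given_missing(p_x1, p_miss_x0, p_miss_x1) else 0


rng = random.Random(2021)
observed = [1, None, 0, None]            # hypothetical responses of one person
# Five imputed copies are stacked into one long dataset (one row per copy).
stacked = [
    [impute_item(x, p_x1=0.6, p_miss_x0=0.3, p_miss_x1=0.1, rng=rng) for x in observed]
    for _ in range(5)
]
```

If the probability of a missing response does not depend on the item response (`p_miss_x0 == p_miss_x1`), the imputation probability reduces to `p_x1`, which corresponds to imputing from the 2PL model alone.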
In the next subsections, we describe the different models for treating missing item responses. These models differ with regard to the assumptions about the missingness mechanism of missing item responses. Some of the model abbreviations in Figure 1 are already mentioned in this section. Models that only appear in the PISA 2018 mathematics case study are described in Section 4.1.

2.1. Scoring Missing Item Responses as Wrong

In a reference model, we scored all missing item responses (omitted and not reached items) as wrong (model UW). The literature frequently argues that missing item responses should never be scored as wrong [4,10,17,43]. However, we think that the arguments against the scoring as wrong are flawed because these studies simulate missing item responses based on response probabilities that do not depend on the item itself. We think that these data-generating models are not plausible in applications (but see also [44] for a more complex missing model; [25,26]). On the other hand, one can simulate missing item responses such that missing item responses can only occur for incorrectly solved items (i.e., for items with $X_{pi} = 0$). In this situation, all missing data treatments that do not score missing item responses as wrong will provide biased estimates [27].

2.2. Scoring Missing Item Responses as Partially Correct

Missing responses for MC items can be scored as partially correct (also known as fractional correct item responses; see [45]). The main idea is that a student could guess the MC item if he or she does not know the answer. If an item $i$ has $K_i$ alternatives, a random guess of an item option would provide a correct response with probability $1/K_i$. In IRT estimation, one can weight the probabilities $P(X_{pi} = 1)$ with $1/K_i$ and $P(X_{pi} = 0)$ with $1 - 1/K_i$ [45]. This weighting implements a scoring of a missing MC item as partially correct (model UP). The maximum likelihood estimation is replaced by a pseudo-likelihood estimation that allows non-integer item responses [45]. More formally, the log-likelihood function $l$ for estimating item parameters $\boldsymbol{a} = (a_1, \ldots, a_I)$ and $\boldsymbol{b} = (b_1, \ldots, b_I)$ can be written as
$$l(\boldsymbol{a}, \boldsymbol{b}; X_{\mathrm{sco}}) = \sum_{p=1}^{N} \log \int \prod_{i=1}^{I} \Psi(a_i(\theta - b_i))^{x_{pi}} \left[ 1 - \Psi(a_i(\theta - b_i)) \right]^{1 - x_{pi}} f(\theta)\, d\theta, \tag{3}$$
where $f$ denotes the density of the standard normal distribution, and $N$ denotes the sample size. The entries $x_{pi}$ in the vector of scored item responses $X_{\mathrm{sco},p}$ can generally take values between 0 and 1. The EM algorithm typically used in estimating IRT models [46,47] only needs to be slightly modified for handling fractionally correct item responses. In the M-step for computing expected counts, one must utilize the fractional item responses instead of using only zero or one values. The estimation can be carried out in the R [48] package sirt [49] (i.e., using the function rasch.mml2()).
It should be mentioned that pseudo-likelihood estimation of IRT models that allows non-integer item responses is not widely implemented in IRT software. However, the partially correct scoring can alternatively be implemented by employing a multiple imputation approach for item responses. For every missing item response of item $i$, a correct item response is imputed with probability $1/K_i$. No imputation algorithm is required because only random guessing is assumed. This means that the guessing probability of $1/K_i$ is constant for persons and items.
Missing item responses for CR items are scored as wrong in the partially correct scoring approach because students in this situation cannot simply guess unknown answers.
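The scoring rule of this subsection can be sketched in a few lines of Python. The function names and the coding of the data are assumptions made for illustration: missing responses are coded as `None`, and the number of options is `None` for CR items.

```python
import random


def score_partially_correct(x, n_options):
    """Fractional score: observed responses are kept, a missing MC response
    receives 1 / K_i, and a missing CR response is scored as wrong."""
    if x is not None:
        return float(x)
    if n_options is None:          # CR item: guessing is not possible
        return 0.0
    return 1.0 / n_options         # MC item with K_i = n_options alternatives


def impute_random_guess(x, n_options, rng):
    """Imputation variant: a missing MC response is replaced by a correct
    response with probability 1 / K_i; a missing CR response is set to 0."""
    if x is not None:
        return x
    if n_options is None:
        return 0
    return 1 if rng.random() < 1.0 / n_options else 0


responses = [1, None, None, 0]     # hypothetical responses of one person
options = [4, 4, None, 5]          # items 1, 2, 4 are MC; item 3 is CR
scored = [score_partially_correct(x, k) for x, k in zip(responses, options)]

rng = random.Random(1)
imputed = [impute_random_guess(x, k, rng) for x, k in zip(responses, options)]
```

The fractional scores in `scored` would enter a pseudo-likelihood estimation, whereas repeated calls of `impute_random_guess` would produce the multiply imputed datasets of the imputation variant.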

2.3. Treating Missing Item Responses as Ignorable

As an alternative to scoring missing item responses as wrong, missing item responses can be ignored in likelihood estimation. In model UO1, all missing item responses are ignored in the scaling model. The student ability $\theta_p$ is extracted based on the observed item responses only. The log-likelihood function $l$ for this model can be written as
$$l(\boldsymbol{a}, \boldsymbol{b}; X, R) = \sum_{p=1}^{N} \log \int \prod_{i=1}^{I} \Psi(a_i(\theta - b_i))^{r_{pi} x_{pi}} \left[ 1 - \Psi(a_i(\theta - b_i)) \right]^{r_{pi}(1 - x_{pi})} f(\theta)\, d\theta. \tag{4}$$
It can be seen from Equation (4) that only observations with observed item responses (i.e., $r_{pi} = 1$) contribute to the likelihood function.
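A minimal numerical sketch of the log-likelihood in Equation (4) is given below. It is an illustration only: the integral over $\theta$ is approximated on an equally spaced grid with normalized normal-density weights (a simple rectangle rule chosen here for brevity, not the estimation routine of any particular software), missing responses are coded as `None`, and all function names are hypothetical.

```python
import math


def logistic(x: float) -> float:
    """Numerically stable logistic distribution function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def normal_grid(n_nodes: int = 61, lo: float = -6.0, hi: float = 6.0):
    """Equally spaced nodes with standard normal weights that sum to one."""
    step = (hi - lo) / (n_nodes - 1)
    nodes = [lo + k * step for k in range(n_nodes)]
    dens = [math.exp(-0.5 * t * t) for t in nodes]
    total = sum(dens)
    return nodes, [d / total for d in dens]


def loglik_ignoring_missing(data, a, b, n_nodes: int = 61) -> float:
    """Marginal 2PL log-likelihood in which missing entries (None) are skipped."""
    nodes, weights = normal_grid(n_nodes)
    loglik = 0.0
    for person in data:
        marginal = 0.0
        for theta, w in zip(nodes, weights):
            lik = 1.0
            for x, a_i, b_i in zip(person, a, b):
                if x is None:              # r_pi = 0: the item does not contribute
                    continue
                p = logistic(a_i * (theta - b_i))
                lik *= p ** x * (1.0 - p) ** (1.0 - x)
            marginal += w * lik
        loglik += math.log(marginal)
    return loglik


# Hypothetical data for two persons and three items.
data = [[1, None, 0], [None, None, 1]]
ll = loglik_ignoring_missing(data, a=[1.0, 1.0, 1.0], b=[-1.0, 0.0, 1.0])
```

Because the exponents may be non-integer, the same function also evaluates the pseudo-log-likelihood for fractionally scored responses when no entry is missing.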
The method UO1 is valid if missing item responses can be regarded as ignorable [18]. If $X_{\mathrm{com},p} = (X_{\mathrm{obs},p}, X_{\mathrm{mis},p})$ is a partitioning of the vector of complete item responses into the observed and the missing part, the assumption that item responses are missing at random (MAR; [7]) is given as
$$P(R_p \mid X_{\mathrm{obs},p}, X_{\mathrm{mis},p}) = P(R_p \mid X_{\mathrm{obs},p}). \tag{5}$$
This means that the probability of omitting items only depends on observed item responses and not on the unobserved item responses. By integrating out the missing item responses $X_{\mathrm{mis},p}$ from the joint distribution of $(X_{\mathrm{com},p}, R_p)$ and using the MAR assumption (5), one obtains
$$\int P(X_{\mathrm{obs},p}, X_{\mathrm{mis},p}, R_p)\, dX_{\mathrm{mis},p} = P(R_p \mid X_{\mathrm{obs},p})\, P(X_{\mathrm{obs},p}). \tag{6}$$
Hence, Equation (6) shows that likelihood inference for MAR data can entirely rely on the probability distribution $P(X_{\mathrm{obs},p})$ of observed item responses. The notion of (manifest) ignorability means that the model parameters of the distributions $P(X_{\mathrm{obs},p})$ and $P(R_p \mid X_{\mathrm{obs},p})$ are distinct. This means that these distributions can be modeled independently.
It should be emphasized that the MAR assumption (5) does not involve the latent ability $\theta_p$. The probability of missingness must be inferred from (summaries of) observed item responses only. This kind of missingness process might be violated in practice. In the following subsection, a weakened version of ignorability is discussed.

2.4. Treating Missing Item Responses as Latent Ignorable

Latent ignorability [19,50,51,52,53,54,55,56,57,58,59,60] is one of the weakest nonignorable missingness mechanisms. Latent ignorability weakens the assumption of ignorability for MAR data. In this case, the existence of a latent variable $\eta_p$ is assumed. The dimension of $\eta_p$ is typically much lower than the dimension of $X_p$. Latent ignorability is defined as (see [19])
$$P(R_p \mid X_{\mathrm{obs},p}, X_{\mathrm{mis},p}, \eta_p) = P(R_p \mid X_{\mathrm{obs},p}, \eta_p). \tag{7}$$
That is, the probability of missing item responses depends on observed item responses and the latent variable $\eta_p$, but not on the unknown missing item responses $X_{\mathrm{mis},p}$ themselves. By integrating out $X_{\mathrm{mis},p}$, we obtain
$$\int P(R_p, X_{\mathrm{obs},p}, X_{\mathrm{mis},p} \mid \eta_p)\, dX_{\mathrm{mis},p} = P(R_p \mid X_{\mathrm{obs},p}, \eta_p)\, P(X_{\mathrm{obs},p} \mid \eta_p). \tag{8}$$
The specification (7) is also known as a shared-parameter model [61,62]. In most applications, conditional independence of item responses $X_{pi}$ and response indicators $R_{pi}$ conditional on $\eta_p$ is assumed [19]. In this case, Equation (8) simplifies to
$$\int P(R_p = r_p, X_{\mathrm{obs},p} = x_{\mathrm{obs},p}, X_{\mathrm{mis},p} \mid \eta_p)\, dX_{\mathrm{mis},p} = \prod_{i=1}^{I} P(R_{pi} = r_{pi} \mid \eta_p)\, P(X_{pi} = x_{pi} \mid \eta_p)^{r_{pi}}. \tag{9}$$
In the rest of this paper, it is assumed that the latent variable $\eta_p$ consists of a latent ability $\theta_p$ and a latent response propensity $\xi_p$. The latent response propensity $\xi_p$ is a unidimensional latent variable that represents the dimensional structure of the response indicators $R_p$. The probability of responding to an item is given by (model MO2; [10,20,44,63,64,65,66])
$$P(R_{pi} = 1 \mid X_{pi} = x_{pi}, \theta_p, \xi_p) = P(R_{pi} = 1 \mid \xi_p) = \Psi(\xi_p - \beta_i). \tag{10}$$
Note that the probability of responding to item $i$ only depends on $\xi_p$ and is independent of $X_{pi}$ and $\theta_p$. The 2PL model is assumed for item responses $X_{pi}$ (see Equation (1)):
$$P(X_{pi} = 1 \mid \theta_p, \xi_p) = P(X_{pi} = 1 \mid \theta_p) = \Psi(a_i(\theta_p - b_i)). \tag{11}$$
The model defined by Equations (10) and (11) is also referred to as the Holman–Glas model [20,37]. In this article, a bivariate normal distribution for $(\theta_p, \xi_p)$ is assumed, where $\mathrm{SD}(\theta_p)$ is fixed to one, and $\mathrm{SD}(\xi_p)$, as well as $\mathrm{Cor}(\theta_p, \xi_p)$, are estimated (see [67,68] for more complex distributions).
The model UO1 (see Section 2.3) that presupposes ignorability (instead of latent ignorability) can be tested as a nested model within model MO2 by setting $\mathrm{Cor}(\theta_p, \xi_p) = 0$. This model is referred to as model MO1.
Note that the joint measurement model for item responses $X_{pi}$ and response indicators $R_{pi}$ can be written as
$$P(X_{pi} = x, R_{pi} = r \mid \theta_p, \xi_p) = \begin{cases} \left[ 1 - \Psi(a_i(\theta_p - b_i)) \right] \Psi(\xi_p - \beta_i) & \text{if } x = 0 \text{ and } r = 1, \\ \Psi(a_i(\theta_p - b_i))\, \Psi(\xi_p - \beta_i) & \text{if } x = 1 \text{ and } r = 1, \\ 1 - \Psi(\xi_p - \beta_i) & \text{if } x = \mathrm{NA} \text{ and } r = 0. \end{cases} \tag{12}$$
Hence, the model defined in Equation (12) can be interpreted as an IRT model for a variable $V_{pi}$ that has three categories: Category 0 (observed incorrect): $X_{pi} = 0$, $R_{pi} = 1$; Category 1 (observed correct): $X_{pi} = 1$, $R_{pi} = 1$; and Category 2 (missing item response): $X_{pi} = \mathrm{NA}$, $R_{pi} = 0$ (see [43,69,70]).

2.4.1. Generating Imputations from IRT Models Assuming Latent Ignorability

The IRT models MO1 and MO2 are also used for generating multiply imputed datasets. Conditional on $\theta_p$, missing item responses are imputed according to the response probability from the 2PL model (see Equation (11)). The stacked imputed dataset is scaled with the unidimensional 2PL model. If models MO1 or MO2 were the true data-generating models, the results from multiple imputation (i.e., IO1 and IO2) would coincide with model-based treatments (i.e., MO1 and MO2). However, results can differ in the case of misspecified models [71,72].

2.4.2. Including Summaries of Response Indicators in the Latent Background Model

The IRT model for response indicators $R_{pi}$ in Equation (10) is a 1PL model. Hence, the sum score $R_{p\bullet} = \sum_{i=1}^{I} R_{pi}$ is a sufficient statistic for the response propensity $\xi_p$ [73]. Then, the joint distribution can be written as
$$P(R_p, X_{\mathrm{obs},p} \mid \theta_p, \xi_p)\, P(\theta_p \mid \xi_p)\, P(\xi_p) = P(X_{\mathrm{obs},p} \mid \theta_p)\, P(\theta_p \mid \xi_p)\, P(R_p \mid \xi_p)\, P(\xi_p). \tag{13}$$
Instead of estimating a joint distribution of $(\theta_p, \xi_p)$, a conditional distribution $\theta_p \mid R_{p\bullet}$ can be specified in a latent background model (LBM; [74,75]). That is, one uses the proportion of missing item responses $Z_p = 1 - R_{p\bullet}/I$ as a predictor for $\theta_p$ [11,12] and employs a conditional normal distribution $\theta_p \mid Z_p \sim N(\gamma_0 + \gamma_1 Z_p, \sigma_e^2)$. This manifest variable $Z_p$ can be regarded as a proxy variable for the latent variable $\xi_p$. The resulting model is referred to as model UO2.
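The manifest predictor used in model UO2 is simple to compute. The following sketch assumes, for illustration only, that each row contains all items administered to a person and that missing responses are coded as `None`; the function name is hypothetical.

```python
def proportion_missing(responses) -> float:
    """Proportion of missing item responses Z_p for one person."""
    n_items = len(responses)
    n_observed = sum(1 for x in responses if x is not None)
    return 1.0 - n_observed / n_items


# Hypothetical data: the first person omits two of four items.
data = [[1, 0, None, None], [1, 1, 0, 1]]
z = [proportion_missing(row) for row in data]
```

The values in `z` would then serve as the predictor in the latent regression of the ability on the proportion of missing item responses.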

2.5. Mislevy-Wu Model for Nonignorable Item Responses

Latent ignorability characterizes only a weak deviation from an ignorable missing data process. It might be more plausible that the probability $P(R_{pi} = 1 \mid X_{pi}, \theta_p, \xi_p)$ of responding to an item depends on the observed or unobserved item response $X_{pi}$ itself [76,77,78,79,80]. The so-called Mislevy-Wu model [32,33,81,82] extends the model MO2 (see Equation (10)) that assumes latent ignorability to
$$P(R_{pi} = 1 \mid X_{pi}, \theta_p, \xi_p) = \Psi(\xi_p - \beta_i - \delta_i X_{pi}). \tag{14}$$
In this model, the probability of responding to an item depends on the latent response propensity $\xi_p$ and the item response $X_{pi}$ itself (see [24,25,49,81,83,84]). The parameter $\beta_i$ governs the missingness proportion for $X_{pi}$ in the subgroup of persons with $X_{pi} = 0$, while the sum $\beta_i + \delta_i$ represents the missingness proportion for persons with $X_{pi} = 1$. The unique feature of the Mislevy-Wu model is that the missingness proportion is allowed to depend on the item response. If a strongly negative value for the missingness parameter $\delta_i$ is chosen (e.g., $\delta_i = -10$), the response probability $P(R_{pi} = 1 \mid X_{pi} = 1, \theta_p, \xi_p)$ in Equation (14) is close to one, meaning that persons with $X_{pi} = 1$ always provide an item response (i.e., they have a missing proportion of zero). By applying Bayes' theorem, it follows in this case that persons with a missing item response must possess an incorrectly solved item; that is, it holds that $X_{pi} = 0$. It should be emphasized that the Mislevy-Wu model is a special case of models discussed in [85].
Model MM1 is defined by assuming a common parameter $\delta_i = \delta$ for all items. In model MM2, two $\delta$ parameters are estimated for item formats CR and MC in the PISA 2018 mathematics case study (see Section 5 for results).
Note that the Mislevy-Wu model for item responses $X_{pi}$ and response indicators $R_{pi}$ can also be formulated as a joint measurement model for a polytomous item with three categories 0 (observed incorrect), 1 (observed correct), and 2 (missing; see also Equation (12)):
$$P(X_{pi} = x, R_{pi} = r \mid \theta_p, \xi_p) = \begin{cases} \left[ 1 - \Psi(a_i(\theta_p - b_i)) \right] \Psi(\xi_p - \beta_i) & \text{if } x = 0 \text{ and } r = 1, \\ \Psi(a_i(\theta_p - b_i))\, \Psi(\xi_p - \beta_i - \delta_i) & \text{if } x = 1 \text{ and } r = 1, \\ \Psi(a_i(\theta_p - b_i)) \left[ 1 - \Psi(\xi_p - \beta_i - \delta_i) \right] + \left[ 1 - \Psi(a_i(\theta_p - b_i)) \right] \left[ 1 - \Psi(\xi_p - \beta_i) \right] & \text{if } x = \mathrm{NA} \text{ and } r = 0. \end{cases} \tag{15}$$
The most salient property of the models MM1 and MM2 is that the model treating missing item responses as wrong (model UW) can be tested by setting $\delta_i = -10$ in Equation (14) (see [33]). This model is referred to as model MW and the corresponding scaling model based on multiply imputed datasets from MW as model IW. Moreover, the model MO2 assuming latent ignorability is obtained by setting $\delta_i = 0$ for all items $i$ (see Equation (10)). It has been shown that parameter estimation in the Mislevy-Wu model and model selection among models MW, MO2, and MM1 based on information criteria have satisfactory performance [33].
For both models, multiply imputed datasets were also created based on conditional distributions $P(X_{pi} \mid R_{pi}, \theta_p, \xi_p)$. The scaling models based on stacked imputed datasets are referred to as IM1 and IM2.
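The two limiting cases of the Mislevy-Wu model can be checked numerically with a short sketch. The code below is an illustration under assumed, hypothetical parameter values; it computes the probabilities of the three categories (observed incorrect, observed correct, missing) and the probability that a missing response conceals a correct answer.

```python
import math


def logistic(x: float) -> float:
    """Numerically stable logistic distribution function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def category_probs(theta, xi, a, b, beta, delta):
    """Probabilities of the three categories in the Mislevy-Wu model."""
    p1 = logistic(a * (theta - b))          # P(X = 1 | theta)
    r0 = logistic(xi - beta)                # P(R = 1 | X = 0, xi)
    r1 = logistic(xi - beta - delta)        # P(R = 1 | X = 1, xi)
    return {
        "incorrect": (1.0 - p1) * r0,
        "correct": p1 * r1,
        "missing": p1 * (1.0 - r1) + (1.0 - p1) * (1.0 - r0),
    }


def p_correct_given_missing(theta, xi, a, b, beta, delta):
    """P(X = 1 | R = 0, theta, xi) by Bayes' theorem."""
    p1 = logistic(a * (theta - b))
    r1 = logistic(xi - beta - delta)
    return p1 * (1.0 - r1) / category_probs(theta, xi, a, b, beta, delta)["missing"]


args = dict(theta=0.0, xi=0.0, a=1.0, b=0.0, beta=-1.0)   # hypothetical values
latent_ignorable = category_probs(delta=0.0, **args)       # reduces to model MO2
scored_wrong = p_correct_given_missing(delta=-10.0, **args)  # close to zero
```

With `delta = 0.0`, the missing category has probability $1 - \Psi(\xi_p - \beta_i)$ as under latent ignorability; with `delta = -10.0`, a missing response is almost surely an incorrect one, which mirrors the scoring of missing item responses as wrong.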

2.6. Imputation Models Based on Fully Conditional Specification

The imputation models discussed in previous subsections are based on unidimensional or two-dimensional IRT models (see [36,86,87,88,89] for more imputation approaches relying on strong assumptions). Posing such a strict dimensionality assumption might result in invalid imputations because almost all IRT models in educational large-scale assessment studies are likely to be misspecified [26]. Hence, alternative imputation models for missing item responses were considered that relied on fully conditional specification (FCS; [41]) implemented in the R package mice [90].
The FCS imputation algorithm operates as follows (see [41,91,92,93]). Let $W_p$ denote the vector of variables that can have missing values. FCS cycles through all variables in $W_p$ (see [37,94,95,96]). For variable $W_{pv}$, all remaining variables in $W_p$ except $W_{pv}$ (denoted as $W_{p,(-v)}$) are used as predictors for $W_{pv}$ in the imputation model. More formally, a linear regression model
$$W_{pv} = \gamma_0 + \boldsymbol{\gamma}^\top W_{p,(-v)} + \varepsilon_{pv}, \quad \varepsilon_{pv} \sim N(0, \sigma_v^2) \tag{16}$$
is specified. For dichotomous variables $W_{pv}$, (16) might be replaced by a logistic regression model. Our experiences correspond with those from the literature that using a linear regression with predictive mean matching (PMM; [41,97,98,99]) provides more stable estimates of the conditional imputation models. PMM guarantees that imputed values only take values that are present in the observed data (i.e., values of 0 or 1 for dichotomous item responses).
In situations with many items, $W_{p,(-v)}$ is a high-dimensional vector of covariates in the imputation model (16). To provide a stable and efficient estimation of the imputation model, a dimension reduction method can be applied to the vector of covariates. For example, principal component analysis [100] or sufficient dimension reduction [101] can be applied in each imputation model for reducing the dimensionality of $W_{p,(-v)}$. In this paper, partial least squares (PLS) regression [102] is used for transforming the vector of covariates to a low-dimensional vector of PLS factors that successively maximize the covariance with the criterion variable (i.e., maximize the covariance $\mathrm{Cov}(\boldsymbol{\alpha}_f^\top W_{p,(-v)}, W_{pv})$ with factor loading vectors $\boldsymbol{\alpha}_f$ for uncorrelated factors $\boldsymbol{\alpha}_f^\top W_{p,(-v)}$ with $f = 1, \ldots, F$; see [103]). In the simulation study and the empirical case study, we use 10 PLS factors to avoid the curse of dimensionality due to estimating too many parameters in the regression models [103,104].
In the imputation model IF1, only item responses $X_p$ are included. This specification will provide approximately unbiased estimates if the MAR assumption (i.e., manifest ignorability) holds. In model IF2, response indicators $R_p$ are additionally included [105]. This approach is close to the assumption of latent ignorability in which summaries of the response indicators are also required for predicting the missingness of an item response. Hence, it can be expected that the model IF2 outperforms IF1 and provides similar results to the model MO2 relying on latent ignorability. In contrast to the Mislevy-Wu model, for imputing item response $X_{pi}$ in model IF2, the predictors $X_{p,(-i)}$ and $R_{p,(-i)}$ are used. Hence, the probability of responding to an item is not allowed to depend on the item itself. This assumption might be less plausible than assuming the response model in Equation (14).
As for all imputation-based approaches in this paper, 5 multiply imputed datasets were created, and the 2PL scaling model was applied to the stacked dataset involving all imputed datasets.
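The idea of predictive mean matching can be conveyed with a deliberately simplified Python sketch. It is not the algorithm of the mice package: it uses a single hypothetical predictor (for example, a score on the remaining items) instead of PLS factors, fits the regression by ordinary least squares without drawing regression parameters at random, and performs a single pass rather than cycling through all variables. Missing values are coded as `None`.

```python
import random


def pmm_impute(y, x, k=3, rng=None):
    """Impute missing entries of y by predictive mean matching on predictor x."""
    rng = rng or random.Random(0)
    obs = [i for i, v in enumerate(y) if v is not None]
    n = len(obs)
    mean_x = sum(x[i] for i in obs) / n
    mean_y = sum(y[i] for i in obs) / n
    sxx = sum((x[i] - mean_x) ** 2 for i in obs)
    sxy = sum((x[i] - mean_x) * (y[i] - mean_y) for i in obs)
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    pred = [intercept + slope * xi for xi in x]    # predicted means for all cases
    completed = list(y)
    for i, v in enumerate(y):
        if v is None:
            # The k observed cases with the closest predicted means act as donors.
            donors = sorted(obs, key=lambda j: abs(pred[j] - pred[i]))[:k]
            completed[i] = y[rng.choice(donors)]   # borrow an observed value
    return completed


# Hypothetical item responses and predictor values for eight persons.
y = [1, 0, 1, None, 0, None, 1, 0]
x = [5, 1, 4, 5, 2, 1, 6, 0]
completed = pmm_impute(y, x, k=2, rng=random.Random(0))
```

Because imputed values are copied from donors, `completed` contains only the values 0 and 1, which is the property of PMM emphasized above.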

3. Illustrative Simulation Study

In order to better understand the relations between different models for the treatment of missing item responses, we performed a small illustrative simulation study to provide insights into the behavior of the most important models under a variety of data-generating models.

3.1. Method

We restrict ourselves to the analysis of only one group. This does not imply interpretational issues because the main motivation of this study is to provide a better insight into the behavior of the models and not to mimic the PISA application involving 45 countries. We only employed a fixed number of I = 20 items in a linear fixed test design. Hence, we did not utilize a multi-matrix design with random allocation of students to test booklets as implemented in PISA. In our experience, we have not (yet) seen any simulation study whose results with a multi-matrix test design substantially differ from a linear fixed test design. We chose a sample size of N = 1500 , which corresponds to a typical sample size at the item level in the PISA application.
Item responses were generated based on the Mislevy-Wu model (see Equation (10)). Item responses were simulated according to the 2PL model. We fixed the correlation of the latent ability θ and the latent response propensity ξ to 0.5. We assumed item difficulties that were equidistantly chosen on the interval [−2, 2] (i.e., −2.000, −1.789, −1.579, …, 1.789, 2.000), and we used item discriminations of 1 when simulating data. The ability variable θ was assumed to be standard normally distributed. For the response mechanism in the Mislevy-Wu model in Equation (10), we varied a common missingness parameter δ in five factor levels: −10, −3, −2, −1, and 0. The case δ = −10 effectively corresponds to the situation in which missing item responses can only be produced by incorrect item responses. This simulation condition refers to the situation in which missing item responses must be scored as wrong for obtaining unbiased statistical inference. The situation δ = 0 corresponds to the situation of latent ignorability. The cases δ = −3, −2, −1 correspond to situations in which both the scoring as wrong and the latent ignorability missing data treatment are not consistent with the data-generating model, and biased estimation can be expected. For the model for response indicators, we used a common β parameter across items in the simulation. As our motivation was to vary the average proportion of missing item responses (i.e., the factor levels were 5%, 10%, 20%, and 30%), the common β parameter is a function of the δ parameter. Prior to the main illustrative simulation, we numerically determined the β parameter to obtain the desired missing data proportion rate (see Table A1 in Appendix A for the specific values used).
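The data-generating design described above can be sketched in a few lines. The following Python code is only an illustration under stated assumptions and is not the code used in the study (which relied on R): the response model is assumed to have the form P(R = 1 | X = x, ξ) = Ψ(ξ − β − δx) with a logistic Ψ, ξ is assumed to have unit variance, and the value β = −2 is an arbitrary placeholder rather than a value from Table A1. The function names are hypothetical.

```python
import math
import random


def logistic(x):
    """Logistic link, written Psi in the text."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_mislevy_wu(n_persons=1500, n_items=20, delta=-10.0, beta=-2.0,
                        rho=0.5, seed=1):
    """Simulate complete responses X and response indicators R (1 = observed)."""
    rng = random.Random(seed)
    # Equidistant item difficulties on [-2, 2]; all discriminations equal 1.
    b = [-2.0 + 4.0 * i / (n_items - 1) for i in range(n_items)]
    X, R = [], []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)
        # Assumption: xi has unit variance and correlates rho with theta.
        xi = rho * theta + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        x_row = [int(rng.random() < logistic(theta - b_i)) for b_i in b]
        # Assumed response model: P(R = 1 | X = x, xi) = logistic(xi - beta - delta * x).
        r_row = [int(rng.random() < logistic(xi - beta - delta * x)) for x in x_row]
        X.append(x_row)
        R.append(r_row)
    return X, R


def missing_summary(X, R):
    """Return (proportion missing, share of missing responses that were correct)."""
    n_total = sum(len(row) for row in R)
    missing = [x for xr, rr in zip(X, R) for x, r in zip(xr, rr) if r == 0]
    n_missing = len(missing)
    share_correct = sum(missing) / n_missing if n_missing else 0.0
    return n_missing / n_total, share_correct


# delta = -10: missingness is (almost) only produced by incorrect responses.
prop_wrong_only, share_wrong_only = missing_summary(*simulate_mislevy_wu(delta=-10.0))
# delta = 0: latent ignorability, missingness depends on xi only.
prop_latent_ign, share_latent_ign = missing_summary(*simulate_mislevy_wu(delta=0.0))
print(prop_wrong_only, share_wrong_only)
print(prop_latent_ign, share_latent_ign)
```

Under this assumed parameterization, the share of omitted responses that would have been correct is close to zero for δ = −10 and clearly positive for δ = 0, which mirrors the two boundary cases described in the text.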
Seven analysis models were utilized in this simulation study. First, we evaluated the performance of the 2PL model for complete data (model CD). Second, we estimated the Mislevy-Wu model assuming a common missingness parameter δ (model MM1; Section 2.5). Third, we applied the method of scoring of missing items as wrong in model UW. Fourth, in contrast to UW, missing item responses were ignored in the estimation in model UO (Section 2.3). Fifth, we estimated the model with response propensity ξ relying on latent ignorability (model MO2, Section 2.4). Furthermore, two imputation-based approaches were used that rely on the fully conditional specification approach implemented in the R package mice [90]. For both approaches, five multiply imputed datasets were utilized, and the 2PL models were estimated by using a stacked dataset containing all five imputed datasets. Sixth, the model IF1 uses item responses in the imputation approach that employs PMM. Seventh, the model IF2 uses item responses and response indicators in the imputation model. To avoid multicollinearity issues, PLS imputation with 10 PLS factors was applied for models IF1 and IF2.
The 2PL analysis models provided item difficulties and item discriminations and fixed the ability distribution to the standard normal distribution. To enable a comparison of the estimated mean and the standard deviation with the mean and the standard deviation of the data-generating model, estimated item parameters were linked to the true item parameters used in the data-generating model. The mean and the standard deviation resulting from the linking procedure were then compared to the true mean (i.e., M = 0) and the true standard deviation (SD = 1). In this simulation, we applied Haberman linking [106,107], which is equivalent to log-mean-mean linking for two groups [108]. Note that we use Haberman linking for multiple groups (i.e., multiple countries) in the case study in Section 4.
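For the two-group case, the log-mean-mean form of the linking step can be written down directly. The sketch below is an illustration, not the routine from the sirt package. It assumes the 2PL parameterization a(θ − b): if the true ability is N(μ, σ²) but the model is identified with N(0, 1), the estimated parameters are a·σ and (b − μ)/σ, so μ and σ can be recovered from averages of log discriminations and difficulties. The function name and the example values are hypothetical.

```python
import math


def log_mean_mean_link(a_est, b_est, a_true, b_true):
    """Recover the group mean and SD from item parameters estimated under N(0, 1)."""
    n = len(a_est)
    # SD: ratio of geometric means of estimated and reference discriminations.
    sd = math.exp(sum(math.log(a) for a in a_est) / n
                  - sum(math.log(a) for a in a_true) / n)
    # Mean: shift between reference difficulties and rescaled estimated difficulties.
    mean = sum(b_true) / n - sd * sum(b_est) / n
    return mean, sd


# Error-free toy check: generate "estimates" implied by mu = 0.3 and sigma = 1.2.
a_true = [1.0] * 20
b_true = [-2.0 + 4.0 * i / 19 for i in range(20)]
mu, sigma = 0.3, 1.2
a_est = [a * sigma for a in a_true]
b_est = [(b - mu) / sigma for b in b_true]
linked_mean, linked_sd = log_mean_mean_link(a_est, b_est, a_true, b_true)
print(linked_mean, linked_sd)
```

With estimated rather than exact item parameters, the recovered values carry sampling error, which is what the bias and RMSE in the simulation quantify.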
A total of 500 replications was carried out for each cell of the design. We evaluated bias and root mean square error (RMSE) for the estimated mean and standard deviation. Monte Carlo standard errors for bias and RMSE were calculated based on the jackknife procedure [109,110]. Twenty jackknife zones were defined for computing the Monte Carlo standard errors.
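The evaluation criteria can be illustrated with a short Python sketch. It assumes a delete-one-zone jackknife in which the replications are split into 20 equally sized zones; how the zones were formed in the study is not stated, so the interleaved assignment used here is an assumption, as are the function names and the artificial estimates.

```python
import math
import random


def bias(estimates, truth):
    """Average estimate minus the true value."""
    return sum(estimates) / len(estimates) - truth


def rmse(estimates, truth):
    """Root mean square error around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


def jackknife_mc_se(estimates, truth, stat, n_zones=20):
    """Monte Carlo standard error of stat via a delete-one-zone jackknife."""
    # Assumption: replication k belongs to zone k mod n_zones.
    zones = [estimates[z::n_zones] for z in range(n_zones)]
    leave_one_out = []
    for z in range(n_zones):
        kept = [e for j, zone in enumerate(zones) if j != z for e in zone]
        leave_one_out.append(stat(kept, truth))
    center = sum(leave_one_out) / n_zones
    scale = (n_zones - 1) / n_zones
    return math.sqrt(scale * sum((v - center) ** 2 for v in leave_one_out))


# Artificial estimates of a mean whose true value is 0 (500 replications).
rng = random.Random(0)
est = [rng.gauss(0.05, 0.04) for _ in range(500)]
se_bias = jackknife_mc_se(est, 0.0, bias)
se_rmse = jackknife_mc_se(est, 0.0, rmse)
print(bias(est, 0.0), se_bias)
print(rmse(est, 0.0), se_rmse)
```

For the bias, which is a simple average, this jackknife reduces to the standard deviation of the zone means divided by the square root of the number of zones.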
In this illustrative simulation study, the statistical software R [48] along with the packages mice [90] and sirt [49] are used.

3.2. Results

In Table 1, the bias for the mean and the standard deviation for different missing data treatments is shown as a function of the missing proportion and the missingness parameter δ. In the case of complete data (CD), no biases exist. Except for the situation of a large proportion of missing item responses of 30% and an extreme δ parameter of −10 (bias = 0.054), the Mislevy-Wu model (model MM1), which is consistent with the data-generating model, performed very well in terms of bias for the mean and the standard deviation. If missing data were only caused by wrong items (i.e., δ = −10), models that rely on ignorability (UO, IF1) or latent ignorability (MO2, IF2) produced large biases (e.g., for the mean in the condition of 10% missing data: UO 0.159, MO2 0.149, IF1 0.160, IF2 0.152). As was to be expected in this case, scoring missing item responses as wrong provided unbiased results. In contrast, if the data-generating model relied on latent ignorability (i.e., δ = 0), scoring missing item responses as wrong provided biased estimates (e.g., for the mean for 10% missing data, the bias was −0.139). Note that in this condition, MO2 and IF2 provided unbiased estimates, while the models that did not take response indicators into account provided biased estimates (e.g., for the mean for 10% missing data: UO 0.037, IF1 0.038).
For values of the missingness parameter δ between −10 and 0, both treating missing responses as wrong and treating them as latent ignorable provided biased estimates for the mean. The biases were much more pronounced for higher missing data proportions. Moreover, the standard deviation was substantially underestimated when relying on a model for latent ignorability if latent ignorability was not used for simulating item responses. Interestingly, the imputation model IF2 that uses both item responses and response indicators showed similar behavior to the model MO2 that involves the latent response propensity ξ, while the imputation model IF1 only using item responses performed similarly to UO. The standard deviation was underestimated in many conditions for the models assuming latent ignorability if the Mislevy-Wu model holds.
The Monte Carlo standard errors for the bias of the mean (M = 0.0023, SD = 0.0005, Max = 0.0044) were similar to those of the standard deviation (M = 0.0022, SD = 0.0005, Max = 0.0038). The uncertainty in the bias estimates is negligible relative to the variation across different missing data treatments. Hence, the conclusions obtained from this simulation study can be considered trustworthy.
In Table A2 in Appendix A, the RMSE for the mean and the standard deviation for the different missing data treatments are shown as a function of the missing data proportion and the missingness parameter δ. In situations where the models UW or MO2 provided unbiased estimates, the Mislevy-Wu model MM1 had slightly more variable estimates. However, only in these particular situations was the RMSE of the simpler restrictive models smaller than that of MM1. In general situations, the increase in variability was outweighed by the lower bias of model MM1. The Monte Carlo standard error for the RMSE of the mean was on average 0.0023 (SD = 0.0006, Max = 0.0044). The corresponding Monte Carlo error for the RMSE of the standard deviation turned out to be quite similar (M = 0.0023, SD = 0.0007, Max = 0.0042).

3.3. Summary

In this illustrative simulation study, we showed that one cannot generally conclude that missing items must never be scored as wrong. Moreover, models that treat missing item responses as latent ignorable do not guarantee a smaller bias compared to the scoring as wrong. In general, the scoring as wrong can provide negatively biased mean estimates, while the treatment as latent ignorable will typically provide positively biased estimates.
As with any simulation study, the data-generating truth must be known in advance which is not the case in any empirical application. The Mislevy-Wu model is a general model for treating nonignorable missing item responses. It certainly has the potential to provide less biased estimates than alternatives recently discussed in the literature.

4. PISA 2018 Mathematics Case Study: Method

4.1. Sample

The mathematics test in PISA 2018 [16] was used to investigate different treatments of missing item responses. We included 45 countries that did receive the main test in a computer-based administration. These countries did not receive test booklets with items of lower difficulty that were included for low-performing countries.
In total, 72 test booklets were administered in the computer-based assessment in PISA 2018 [16]. Test booklets were compiled from four clusters of items of the same ability domain (i.e., mathematics, reading, science). We selected only students who worked on booklets with two clusters of mathematics items, that is, booklets 1 to 12. The clusters of mathematics items appeared either at the first and second (booklets 7 to 12) or the third and fourth positions (booklets 1 to 6) in the test.
As a consequence, 70 mathematics items were included in our analysis. In each of the selected booklets, 22, 23, or 24 mathematics items were administered. Seven of the 70 items were polytomous and were dichotomously recoded, with only the highest category being recoded as correct. In total, 27 out of 70 items had the complex multiple-choice (MC) format, and 43 items had constructed-response (CR) format. For 18 MC items, there were 4 response alternatives, 4 MC items had 8 response alternatives, and 5 MC items had 16 response alternatives.
In Table 2, descriptive statistics for the sample used in our analysis are presented. In total, 167,092 students from these 45 countries were included in the analysis. On average, M = 3713.2 students were available in each country. The average number of students per item within each country ranged between 415.8 (MLT, Malta) and 4408.3 (ESP, Spain). On average, M = 1120.3 students per item were available at the country level.
The average proportion of missing item responses in the dataset was 8.4% ( SD = 3.3 % ) and ranged between 1.2% (MYS, Malaysia) and 18.8% (BIH; Bosnia and Herzegovina). The proportion of not reached item responses was on average 2.4% ( SD = 1.0 % ) with the maximum of 5.9% (SWE, Sweden). Interestingly, the missing data proportions and the country means were only moderately correlated ( Cor = 0.48 ). Missing proportions for CR items were substantially larger ( M = 12.3 % , SD = 4.8 % , Min = 1.5 % , Max = 27.9 % ) than for MC items ( M = 2.3 % , SD = 1.0 % , Min = 0.7 % , Max = 5.4 % ). Figure 2 shows the distribution of the proportion of missing and not reached items at the student level aggregated across countries. Most students produced no missing items (i.e., 61.9 % ) or no not reached items (i.e., 90.2 % ).

4.2. Scaling Models

The different scaling models for treating missing item responses are compared for the PISA 2018 mathematics data for country means and country standard deviations. To compare the parameters of ability distributions across countries, different strategies are considered viable in the literature. These strategies will typically provide different results in the presence of differential item functioning between countries (country DIF; [111,112,113,114]). In this situation, item parameters vary across countries; that is, they are not invariant. First, the noninvariance can be ignored in the scaling model. A misspecified model assuming invariant item parameters is purposely specified [114,115,116,117,118]. Second, scaling is conducted under partial invariance in which only a portion of item parameters is allowed to differ across countries [13,16,119,120,121,122]. Third, a hierarchical model is utilized as the scaling model in which country-specific item parameters are modeled as random effects [111,123,124]. Fourth, the scaling models are separately applied for each country in the first step. In a second step, a common metric is established by applying a linking procedure that transforms item parameters and the ability distribution [108,118,125].
In our analysis, we use the linking approach relying on separate scalings for comparing the ability distribution across countries. We opted for this strategy for the following reasons. First, it is likely that the missingness mechanisms differ across countries [126]. Hence, in a model-based approach to treating missing item responses, it does not seem justified to assume invariant model parameters for the missingness mechanism across countries. Second, it has been shown in the presence of country DIF that a misspecified scaling model assuming invariant item parameters provides more biased parameter estimates than those obtained from the linking approach [127]. Third, large models that concurrently scale all countries (assuming full invariance or partial invariance) are less robust to model deviations. Fourth, we argued elsewhere that the partial invariance approach currently used in PISA results in invalid country comparisons because the comparisons of each pair of countries essentially rely on different sets of items [26,114,118]. Fifth, the linking approach is computationally much less demanding than concurrent scaling approaches (assuming invariance or partial invariance; see [118,125,128]).
As argued above, the scalings in our PISA 2018 mathematics case study are carried out separately for each country c. That is, one obtains country-specific item parameters $a_{ic}$ and $b_{ic}$:
$$P(X_{pci} = 1 \mid \theta_{pc}) = \Psi\bigl(a_{ic}(\theta_{pc} - b_{ic})\bigr), \quad \theta_{pc} \sim N(0, 1). \tag{17}$$
Sampling weights were always used when applying the scaling model (17) to the PISA 2018 dataset. To enable the comparability of the ability distribution across countries, the obtained item discriminations $a_{ic}$ and item difficulties $b_{ic}$ are transformed onto a common metric in a subsequent linking step (see Section 4.3 for details).
For the PISA 2018 mathematics data, the scaling models discussed in Section 2 are applied. An overview of the specified models with brief explanations is given in Table 3. Some of the models required particular adaptations that are described in the two following subsections.

4.2.1. Treating Not Reached Items as Ignorable or in the Latent Background Model

Since PISA 2015, not reached items are no longer scored as wrong [13]. To investigate this scaling method, we ignored not reached items in the scaling model but scored omitted items as wrong (model UN1). We also implemented the operational practice since PISA 2015 [13] that includes the proportion of not reached item responses as a predictor in the latent background model (model UN2; [12,129]). This second model is similar to assuming latent ignorability when the response indicators for not reached items follow a 1PL model.

4.2.2. Imputation Models Based on Fully Conditional Specification

In Section 2.6, we introduced the FCS imputation models IF1 and IF2 that used X p and ( X p , R p ) in the imputation, respectively. Previous research indicated that item parameters are affected by position effects [130,131,132,133,134,135,136,137]. Hence, in our analysis, the FCS imputation models IF1 and IF2 are separately applied for each test booklet. In general, missing item responses at the end of a test booklet will be less likely imputed with a correct scoring (i.e., X p i = 1 ) than missing item responses at the beginning of a test booklet. As the sample size for each country in each test booklet can be quite low, using PLS regression for dimension reduction of the covariates in the imputation models is vital.

4.3. Linking Procedure

The scaling models described above resulted in country-specific item discriminations $a_{ic}$ and item difficulties $b_{ic}$. To enable a comparison of country means and country standard deviations, the corresponding ability distributions can be obtained by linking approaches that establish a common ability metric [108,138]. In this article, Haberman linking [107] in its original proposal is used. The linking procedure produces country means and standard deviations as its outcome. To enable comparisons across the 19 specified scaling models, the ability distributions were linearly transformed such that the total population involving all students in all countries in our study has a mean M = 500 and a standard deviation SD = 100 (i.e., the so-called PISA metric). More formally, for each model m and each country c, there is a linear transformation $\theta \mapsto t_{mc}(\theta) = \nu_{0mc} + \nu_{1mc}\theta$ that transforms the country-specific ability distributions obtained from separate scaling to the PISA metric.
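The final rescaling to the PISA metric can be illustrated as follows. This Python sketch starts from already linked country means and standard deviations and assumes that each country enters the total population with a given weight (for example, its sum of student weights); the weighting actually used in the study is not specified in this section, and the function name and toy values are hypothetical.

```python
import math


def to_pisa_metric(means, sds, weights, target_mean=500.0, target_sd=100.0):
    """Linearly rescale country means/SDs so the pooled population has M=500, SD=100."""
    w_total = sum(weights)
    pooled_mean = sum(w * m for w, m in zip(weights, means)) / w_total
    # Pooled second moment combines within-country and between-country variation.
    pooled_second = sum(w * (s ** 2 + m ** 2)
                        for w, m, s in zip(weights, means, sds)) / w_total
    pooled_sd = math.sqrt(pooled_second - pooled_mean ** 2)
    slope = target_sd / pooled_sd
    new_means = [target_mean + slope * (m - pooled_mean) for m in means]
    new_sds = [slope * s for s in sds]
    return new_means, new_sds


# Toy example with two equally weighted countries on the linked logit metric.
pisa_means, pisa_sds = to_pisa_metric([-1.0, 1.0], [1.0, 1.0], [1.0, 1.0])
print(pisa_means, pisa_sds)
```

Because the same linear transformation is applied to every country within a model, country differences and rankings are unaffected; only the reporting scale changes.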

4.4. Model Comparisons

It is of particular interest whether the Mislevy-Wu model (MM1 and MM2) outperforms other treatments of missing item responses such as the scoring as wrong (model MW) and latent ignorable (models MO1 and MO2). The Bayesian information criterion (BIC) is used for conducting model comparisons ([33]; see also [16,120,121,139] for similar model comparisons in PISA, but [140,141,142] for improved information criteria in complex surveys). Moreover, the Gilula–Haberman penalty (GHP; [143,144,145]) is used as an effect size that is relatively independent of the sample size and the number of items. The GHP is defined as $\mathrm{GHP} = \mathrm{AIC} / (2 \sum_{p=1}^{N} I_p)$, where $I_p$ is the number of estimated model parameters for person p and AIC is the Akaike information criterion. For example, if 20 out of 70 items were administered to person p in a test, $I_p$ would be 40 in the 2PL model. If a student worked on all 70 items in the test, $I_p$ would be 140. Note that the GHP can be considered a normalized variant of the AIC. A difference in GHP larger than 0.001 is declared a notable difference in model fit [145,146].
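The GHP formula above amounts to simple arithmetic, as the following Python sketch shows. It follows the definition in the text with two item parameters per administered item in the 2PL model; the function names and the AIC values in the example are made up for illustration only.

```python
def ghp(aic, items_per_person, params_per_item=2):
    """Gilula-Haberman penalty: AIC divided by twice the summed I_p."""
    # I_p = params_per_item * number of items administered to person p.
    total_params = sum(params_per_item * n_items for n_items in items_per_person)
    return aic / (2.0 * total_params)


def notable_difference(ghp_a, ghp_b, threshold=0.001):
    """Flag a GHP difference larger than the threshold used in the text."""
    return abs(ghp_a - ghp_b) > threshold


# Hypothetical example: 1000 students with 20 items each, two competing models.
n_items = [20] * 1000
ghp_model_a = ghp(24000.0, n_items)
ghp_model_b = ghp(23900.0, n_items)
print(ghp_model_a, ghp_model_b, notable_difference(ghp_model_a, ghp_model_b))
```

Dividing by the summed number of person-level parameters is what makes the criterion comparable across countries with different sample sizes and booklet lengths.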
It might be questioned whether information criteria AIC (for the GHP criterion) and BIC might be appropriate for datasets ( X p i , R p i ) consisting of item responses and response indicators with missing data on item responses X p i (see [147,148,149,150]). As was argued in Section 1, there are two types of missing item responses in large-scale assessment datasets. First, item responses can be missing for a student because only a portion of items was administered in a test booklet in the multi-matrix test design [16]. Second, missing item responses appear due to item omissions to administered items. The latter type of missingness is the main topic of this article.
It has been demonstrated in Section 2.5 (see Equation (15)) that for each item i, observations ( X p i , R p i ) can be regarded as a random variable V p i with three categories: Category 0 ( V p i = 0 ): X p i = 0 , R p i = 1 , Category 1 ( V p i = 1 ): X p i = 1 , R p i = 1 , and Category 2 ( V p i = 2 ): X p i = NA , R p i = 0 . The dataset with observations V p i does not contain missing values, and the Mislevy-Wu model can be formulated as a function of V p i instead of ( X p i , R p i ) . As the former dataset does not contain missing values, model selection based on information criteria might be justified for item omissions because no missing data occurs for the redefined variables. However, it might still be questioned whether information criteria AIC and BIC remain valid when applied to multi-matrix designs. In this case, the number of effectively estimated item parameters per student is lower than those obtained when all items would be administered in a test booklet. In our opinion and our limited experience obtained in an unpublished simulation study, it could be that AIC and BIC show inferior performance for multi-matrix designs compared to the complete-data case. Note also that most educational large-scale assessment studies also apply the conventional information criteria without adaptations (e.g., [121,139,151,152,153,154]).
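The recoding of each pair of item response and response indicator into a single three-category variable can be made concrete with a small Python sketch. The function name is hypothetical, and None stands in for NA.

```python
def recode_to_v(x, r):
    """Map (X, R) to V: 0 = observed incorrect, 1 = observed correct, 2 = omitted."""
    if r == 0:
        # The item response is unobserved, whatever its latent value.
        return 2
    if x not in (0, 1):
        raise ValueError("an observed item response must be 0 or 1")
    return x


# One student with four administered items; None marks an omitted response.
x_row = [0, 1, None, 1]
r_row = [1, 1, 0, 1]
v_row = [recode_to_v(x, r) for x, r in zip(x_row, r_row)]
print(v_row)
```

The recoded row contains no missing values for administered items, which is the property the argument about information criteria relies on; items not administered in the booklet remain missing by design.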
We would like to point out that BIC and GHP are only applied for the model-based treatment scaling models and not to the scaling models that rely on multiply imputed datasets (see [155]).

4.5. Computation of Standard Errors

In the PISA study, statistical inference is typically conducted with the balanced repeated replication methodology to account for stratified clustered sampling within countries [16,156]. The rth replication sample uses a modified set of person sampling weights w p ( r ) . Using R = 80 replication samples in PISA, a parameter of interest is computed for the original sample (i.e., γ ^ ) based on student weights w p . Moreover, the analysis is repeated in each replication sample using sampling weights w p ( r ) , resulting in parameter estimates γ ^ ( r ) . The standard error for γ ^ is then calculated as [16]
$$\mathrm{SE}(\hat{\gamma}) = \sqrt{A \sum_{r=1}^{R} \bigl(\hat{\gamma}^{(r)} - \hat{\gamma}\bigr)^2}, \tag{18}$$
where the scaling factor A equals 0.05 in the PISA replication design. In our analysis, we are interested in standard errors for country means. The standard error is first determined for the country mean obtained in country-specific scaling models. Each scaling model provides a person-specific individual posterior distribution $h_p(\theta_t \mid X_p, R_p)$ for a discrete grid $\theta_t$ ($t = 1, \ldots, T$) of θ points (e.g., for T = 21 integration points, a discrete θ grid $\theta_1 = -5, \ldots, \theta_{21} = 5$ can be chosen). These posterior distributions reflect the subject-specific uncertainty with respect to the estimated ability. The country means have to be computed in the transformed metric (see Section 4.3). Hence, one uses the transformed grid $\nu_{0mc} + \nu_{1mc}\theta_t$ ($t = 1, \ldots, T$) for determining the country mean. For the rth replication sample, the mean $\hat{\gamma}^{(r)}$ is determined as
$$\hat{\gamma}^{(r)} = \frac{\sum_{p=1}^{N} w_p^{(r)} \sum_{t=1}^{T} h_p(\theta_t \mid X_p, R_p)\,(\nu_{0mc} + \nu_{1mc}\theta_t)}{\sum_{p=1}^{N} w_p^{(r)}}. \tag{19}$$
Note that this approach is a numerical approximation technique that coincides with the plausible value technique [129] when a large number of plausible values would be used. The standard error for γ ^ can be computed using (18). In our analysis, we are also interested in determining the statistical inference of a difference in means for a particular country resulting from different models. It is not appropriate to compute the standard errors for the means of the different models and to apply the t-test for a mean difference relying on independent samples because two models are applied to the same dataset resulting in highly dependent parameter estimates. However, the replication technique in Equation (18) can also be applied for the difference in means. One must only compute a mean difference in each replication sample in this case.
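The replication-based computations described in this subsection can be sketched in Python. The code assumes that the person-specific posterior distributions on the θ grid and the replicate weights are already available as plain lists; all function names and the toy inputs are hypothetical.

```python
import math


def weighted_posterior_mean(weights, posteriors, grid, nu0, nu1):
    """Weighted country mean on the transformed metric from individual posteriors."""
    numerator = sum(
        w * sum(h * (nu0 + nu1 * theta) for h, theta in zip(post, grid))
        for w, post in zip(weights, posteriors)
    )
    return numerator / sum(weights)


def replication_se(full_estimate, replicate_estimates, scaling=0.05):
    """Standard error from replicate estimates; scaling is the factor A."""
    return math.sqrt(scaling * sum((g - full_estimate) ** 2
                                   for g in replicate_estimates))


def replication_se_difference(full_a, reps_a, full_b, reps_b, scaling=0.05):
    """Standard error of a difference of two models' means on the same sample."""
    # The difference is formed within each replication sample first, so the
    # dependence between the two model estimates is taken into account.
    diffs = [a - b for a, b in zip(reps_a, reps_b)]
    return replication_se(full_a - full_b, diffs, scaling)


# Toy posterior example: two students, three grid points.
toy_mean = weighted_posterior_mean(
    weights=[1.0, 3.0],
    posteriors=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    grid=[-1.0, 0.0, 1.0], nu0=500.0, nu1=100.0)
# Toy replication example with R = 80 replicate estimates.
toy_se = replication_se(500.0, [501.0] * 80)
toy_se_diff = replication_se_difference(500.0, [501.0] * 80, 498.0, [499.0] * 80)
print(toy_mean, toy_se, toy_se_diff)
```

In the last call both models shift by the same amount in every replication sample, so the standard error of their difference is zero even though each mean has a standard error of 2. This is the reason why an independent-samples t-test would be inappropriate here.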

5. PISA 2018 Mathematics Case Study: Results

5.1. Similarity of Scaling Models

Each of the 19 scaling models provided a set of country means. For each country, the absolute difference of two means of a country stemming from a pair of two models can be computed. Table 4 summarizes the average absolute differences. Scaling models that resulted in an average absolute difference of at most 1.0 can be considered similar. In Table 4, groups of models are grayed in the rectangles containing the absolute differences classified as similar. Table 4 indicates that the methods that treat missing item responses as wrong (UW, MW, IW) or treat MC items as partially correct (UP, IP) resulted in similar country mean estimates. Both methods that did not score not reached item responses as wrong (UN1, UN2) resulted in relatively similar estimates. The models that rely on ignorability (UO1, MO1, IO1) or latent ignorability (MO2, UO2, IO2) provided similar estimates. In line with previous research [18], the inclusion of the latent response propensity ξ did not result in strongly different estimates of country means compared to models that ignore missing item responses. The specifications of the Mislevy-Wu model (MM1, IM1, MM2, IM2) resulted in similar country means. Interestingly, country means from the Mislevy-Wu model were more similar to the treatment of missing item responses as wrong than to those that relied on ignorability or latent ignorability. Finally, the scaling model based on FCS imputation involving only item responses (IF1) was similar to the models assuming (latent) ignorability (UO1, MO1, IO1, MO2, UO2, IO2). FCS imputation involving item responses and response indicators different from the imputed item (IF2) was similar neither to the ignorability-based treatment nor to the scoring as wrong or the Mislevy-Wu model. This finding could be explained by the fact that the imputation method IF2 is based on assumptions about the missingness mechanism that strongly oppose those of the Mislevy-Wu model.
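The similarity measure used for Table 4 can be sketched in Python. For illustration, the example uses only the country means of Ireland and Sweden reported in Section 5.4 for three models, so the resulting numbers are not the entries of Table 4, which average over all 45 countries. The function names are hypothetical.

```python
def mean_abs_diff(means_a, means_b):
    """Average absolute difference between two models' country means."""
    return sum(abs(a - b) for a, b in zip(means_a, means_b)) / len(means_a)


def similarity_table(country_means, threshold=1.0):
    """For each pair of models, return (average absolute difference, similar?)."""
    models = sorted(country_means)
    table = {}
    for i, m1 in enumerate(models):
        for m2 in models[i + 1:]:
            d = mean_abs_diff(country_means[m1], country_means[m2])
            table[(m1, m2)] = (d, d <= threshold)
    return table


# Country means for (IRL, SWE) as reported in the text for three models.
country_means = {
    "UW": [505.2, 491.8],
    "UO1": [499.9, 501.3],
    "MM2": [505.1, 497.9],
}
table = similarity_table(country_means)
for pair, (diff, similar) in table.items():
    print(pair, round(diff, 2), similar)
```

Even in this two-country excerpt, the Mislevy-Wu model MM2 lies closer to the scoring as wrong (UW) than the ignorability-based model UO1 does, in line with the pattern described above.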

5.2. Model Comparisons

From Table 5, we can see that for the majority of countries (35 out of 45), the IRT model treating missing item responses as wrong (model MW) provided a better model fit in terms of BIC than modeling it with a latent propensity (model MO2). For 39 out of 45 countries, the Mislevy-Wu model with item-format specific ρ parameters (model MM2) was preferred. In 5 out of 45 countries, the Mislevy-Wu model with one common ρ parameter (MM1) was the best-fitting model. Only in one country (MYS), the model treating missing item responses as wrong had the best model fit.
For 29 out of 45 countries, the proposed Mislevy-Wu model outperformed the suggested model with a latent response propensity in terms of a GHP difference of at least 0.001. Overall, these findings indicated that the models assuming ignorability or latent ignorability performed worse in terms of model fit compared to scaling models that acknowledge the dependence of responding to an item from the true but occasionally unobserved item response.

5.3. Country-Specific Model Parameters for Latent Ignorable Model and Mislevy-Wu Model

Now, we present findings of model parameters characterizing the missingness mechanism from the model MO2 relying on latent ignorability and the Mislevy-Wu model MM2. The parameters are shown in Table 6. The SD of the latent response propensity SD(ξ) was somewhat lower in the Mislevy-Wu model (MM2, with a median Med = 1.98) than the model assuming latent ignorability (MO2, Med = 1.93). Moreover, by additionally including the latent item response as a predictor for the response indicator, the correlation Cor(θ, ξ) between the latent ability θ and response propensity ξ was slightly lower in model MM2 (Med = 0.43) than MO2 (Med = 0.46). Most importantly, the missingness mechanism strongly differed between CR and MC items. The median δ parameter in model MM2 for CR items was −2.61, indicating that students who did not know the item had a higher probability of omitting the item even after controlling for the latent response propensity ξ. In contrast, the median δ parameter for MC items was −0.48. Hence, (latently) knowing the item had a smaller influence on the response indicators for MC items. However, the parameter was different from zero for most countries, indicating that the model MO2 assuming latent ignorability did not adequately explain the missingness mechanism. Overall, it can be seen that those model parameters strongly vary across countries. Hence, it can be concluded that assuming different missingness mechanisms for countries could have non-negligible consequences for country rankings (see [126]).

5.4. Country Means and Country Standard Deviations Obtained from Different Scaling Models

For comparing country means, 11 out of 19 specified scaling models were selected to contrast the dissimilarity of country mean and standard deviation estimates. Based on the findings of the similarity of models in Section 5.1 (see Table 4), 8 out of 19 models were omitted in the reporting of the comparisons because they provided very similar findings to at least one of the 11 reported models. Table 7 shows the country means of these 11 different treatments of missing item responses. The country rank (column “rk_UW”) serves as the reference for the comparison among methods. Moreover, the interval of country ranks obtained from the different methods is shown in column “rk_Int”. The average maximum difference in country ranks was 2.4 (SD = 1.8) and ranged between 0 (SGP, HKG, EST, DEU, LUX, BIH) and 8 (IRL). The range in country means (i.e., the difference of the largest and smallest country mean of the 11 methods) was noticeable (M = 5.0) and showed strong variability between countries (SD = 2.8, Min = 1.5, Max = 12.5). Interestingly, large range values were obtained for countries with missing proportions that were strongly below and above the average missing proportion. For example, Ireland (IRL) had a relatively low missing rate of 5.8% and reached rank 15 with method UW (M = 505.2) that treated missing item responses as wrong. Methods that ignore missing item responses resulted in a lower country mean (UO1: M = 499.9; MO2: M = 500.7; IO2: M = 500.0). In contrast, the Mislevy-Wu model (MM2 and IM2), which also takes the relation of the response indicator and the true item response into account, resulted in higher country means (MM2: M = 505.1; IM2: M = 504.9). Across the 11 estimation methods, Ireland reached ranks between 15 and 23, which can be considered a large variability. Moreover, the range of country means for Ireland was 8.2, which is two to three times higher than standard errors for country means due to the sampling of students in PISA.
Italy (ITA, rank 26; M = 492.0), which had a relatively high missing rate of 12.4%, profited from ignoring missing item responses or assuming latent ignorability (UO1: M = 494.7; MO2: M = 494.4; IO2: M = 494.0). However, the Mislevy-Wu model produced considerably lower scores (MM2: M = 490.1; IM2: M = 489.9). An interesting case is Sweden (SWE, rank 25), which had a high missing proportion rate of 12.7%, but almost half of the missing item responses (i.e., 5.9%) stemmed from not reached responses. This not reached proportion was the highest among all countries in our study. Sweden had rank 25 when treating missing item responses as wrong (UW: M = 491.8), but strongly profited in models that ignored the not reached items (UN1: M = 499.1) or treated the proportion of not reached items as a predictor in the latent background model (UN2: M = 499.7). If omitted items were also treated as (latent) ignorable, the country mean for Sweden further increased (UO1: M = 501.3; MO2: M = 501.1; IO2: M = 501.3). In contrast to many other countries, the country means obtained from the Mislevy-Wu model (MM2: M = 497.9; IM2: M = 498.0) were also much larger than the country mean obtained by treating missing items as wrong (UW: M = 491.8).
In Table A3 in Appendix C, standard errors for country means are shown. Across different models and countries, the average standard error was 2.20 (SD = 0.47, Min = 1.21, Max = 3.65). Within a country, the variability (i.e., standard deviation (column SD) in Table A3) of standard errors for the mean was small (M = 0.05, SD = 0.05, Min = 0.01, Max = 0.21).
In Table A4 in Appendix C, standard errors for differences in means stemming from two different models are displayed. We consider differences between the models UW, MO2 and MM2. It turned out that the standard error for mean differences between two models was very small compared to the standard error for the mean for a single model. The largest average standard error was obtained for the mean difference between models UW and MO2 (see column “UW–MO2” in Table A4; M = 0.037, SD = 0.036, Min = 0.000, Max = 0.149). These two models represent the two extreme missing data treatments, which explains why the largest standard errors were obtained for this comparison. The smallest standard errors were obtained for the model difference between UW and MM2 (column “UW–MM2”; M = 0.021, SD = 0.019, Min = 0.000, Max = 0.096). The average standard error for the mean difference between the models MO2 and MM2 was 0.027 (column “MO2–MM2”; SD = 0.022, Min = 0.001, Max = 0.093).
The estimates of country standard deviations stemming from different models for the missing data treatment are shown in Table A5 in Appendix C. As in the case of the country mean, it turned out that model choice also impacted standard deviations. Within a country, the standard deviation of the different standard deviation estimates showed nonnegligible variability (column “SD” in Table A5; M = 1.25, SD = 0.96, Min = 0.3, Max = 5.4). The within-country ranges of country standard deviations across models were even larger than for country means.

6. Discussion

In this paper, competing approaches for handling missing item responses in educational large-scale assessment studies like PISA are investigated. We compared the Mislevy-Wu model, which allows the probability of item missingness to depend on the item itself, with the more frequently discussed approaches of scoring items as wrong or models assuming latent ignorability. In an illustrative simulation study, we demonstrated that neither of the two latter approaches provides unbiased parameter estimates if the more general Mislevy-Wu model holds (see also [44]). In realistic data constellations in which the Mislevy-Wu model holds, it is likely that scoring missing item responses as wrong results in underestimated (country) means, while models relying on latent ignorability provide overestimated means. Based on these findings, we are convinced that the often-taken view in the psychometric literature that strongly advocates latent ignorability and rejects scoring as wrong [4,11,12,18] is unjustified (see also [24,25,27]).
In our reanalysis of the PISA 2018 mathematics data, different scaling models with different treatments of missing item responses were specified. It has been shown that differences in country means and country standard deviations across models can be substantial. The present study sheds some light on the ongoing debate about properly handling missing item responses in educational large-scale assessment studies. Ignoring missing item responses and treating them as wrong can be seen as opposing strategies. Other scaling models can be interpreted to provide results somewhere between these two extreme poles of handling missingness. We argued that the Mislevy-Wu model contains the strategy of scoring as wrong and the latent ignorable model as submodels. Hence, these missing data treatments can be tested. In our analysis, it turned out that the Mislevy-Wu model fitted the PISA data best. More importantly, the treatment of missing item responses as wrong provided a better model fit than ignoring them or modeling them by the latent ignorable model that has been strongly advocated in the past [10,11]. It also turned out that the missingness mechanism strongly differed between CR and MC items.
We believe that the call for controlling for test-taking behavior, such as response propensity [4], in the reporting of large-scale assessment studies using models that also include response times [157,158] poses a threat to validity [159,160,161,162,163,164] because results can be simply manipulated by instructing students to omit items they do not know [26]. Notably, missing item responses are mostly omissions for CR items. Response times might be useful for detecting rapid guessing or noneffortful responses [81,165,166,167,168,169,170,171]. However, it seems likely that students who do not know the solution to CR items do not respond to these items. In this case, the latent ignorability assumption is unlikely to hold, and scaling models that rely on it (see [4,12]) will result in biased and unfair country comparisons. We are skeptical that the decision of whether a missing item response is scored as wrong should be based on a particular response time threshold [166,172,173]. Students can also be simply instructed to quickly skip items that they are probably not able to solve.
In our PISA analysis, we restricted the analysis to 45 countries that received booklets of average item difficulty. A number of low-performing countries that received booklets of lower difficulty also participated in recent PISA cycles [174,175,176]. We did not include these low-performing countries for the following reasons. First, the proportion of correctly solved items for low-performing countries is lower. This implies that it is more difficult for these countries to disentangle the parameters of the model for response indicators and item parameters. Second, the meaning of missingness on item responses differs across countries if different booklets are administered in different countries. Hence, it is difficult to compare outcomes of different scaling models for the missing data treatment if the prerequisite of the same administered test design is not met. To some extent, the issue also appears in the recently implemented multi-stage testing (MST; [177,178]) design in PISA, which also results in different proportions of test booklets of different average difficulty across countries. We think that there is no defensible strategy of properly treating missing item responses from MST designs that enables a fair and valid comparison of countries [26].
In this article, we only investigated the impact of missing item responses on country means and country standard deviations. In LSA studies, missing data is also a prevalent issue for student covariates (e.g., sociodemographic status; see [104,179,180,181,182,183,184]). As covariates also enter the plausible value imputation of latent abilities through the latent background model [75,129] or relationships of abilities and covariates are often of interest in reporting, missing data on covariates is also a crucial issue that needs to be adequately addressed [104].
It could be argued that there is not a unique, scientifically sound, or widely publicly accepted scaling model in PISA (see [185]). The uncertainty in choosing a psychometric model can be reflected by explicitly acknowledging the variability of country means and standard deviations obtained by different model assumptions. This additional source of variance associated with model uncertainty [186,187,188,189,190,191] can be added to the standard error due to the sampling of students and linking error due to the selection of items [192]. The assessment of specification uncertainty has been discussed in sensitivity analysis [193] and has recently become popular as multiverse analysis [194,195] or specification curve analysis [196,197]. As educational LSA studies are policy-relevant [198,199], we think that model uncertainty should be included in statistical inference [200,201].
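As a minimal sketch of how model uncertainty could be combined with sampling error, the following example adds the components in quadrature. This rests on assumptions that are not taken from the article: the error components are treated as independent, the model-uncertainty component is approximated by the sample standard deviation of the country means across models, and the linking error is set to zero because no value is reported here. The six means are a subset of the 11 model-specific means quoted in the text for Ireland, and the sampling standard error is Ireland's average standard error from Table A3. The function name is hypothetical.

```python
import math
import statistics

# Six of the model-specific country means quoted in the text for Ireland
# (a subset of the 11 models, used for illustration only).
irl_means = [505.2, 499.9, 500.7, 500.0, 505.1, 504.9]

# Average sampling standard error for Ireland (Table A3, column Aver).
se_sampling = 2.01


def total_error(se_sampling, model_means, linking_error=0.0):
    """Combine sampling, linking, and model uncertainty in quadrature.

    Assumes independent error components; the model component is the
    sample SD of the means across models (one possible choice).
    """
    model_sd = statistics.stdev(model_means)
    total = math.sqrt(se_sampling**2 + linking_error**2 + model_sd**2)
    return model_sd, total


model_sd, total = total_error(se_sampling, irl_means)
print(f"model SD = {model_sd:.2f}, total error = {total:.2f}")
```

Under these assumptions, the model component is of the same order as the sampling standard error, so ignoring it would noticeably understate the uncertainty of the country mean.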

Funding

This research received no external funding.

Informed Consent Statement

This article uses publicly available PISA 2018 data.

Data Availability Statement

The PISA 2018 dataset is available from https://www.oecd.org/pisa/data/2018database/ (accessed on 15 April 2021).

Acknowledgments

I sincerely thank four anonymous reviewers for their valuable comments that substantially improved this article.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
1PL: one-parameter logistic model
2PL: two-parameter logistic model
AIC: Akaike information criterion
BIC: Bayesian information criterion
CR: constructed-response
DIF: differential item functioning
FCS: fully conditional specification
GHP: Gilula–Haberman penalty
IRT: item response theory
LBM: latent background model
LSA: large-scale assessment
M: mean
MAR: missing at random
MC: multiple-choice
MST: multi-stage testing
PISA: programme for international student assessment
PLS: partial least squares
PMM: predictive mean matching
RMSE: root mean square error
SD: standard deviation

Appendix A. Additional Information for the Illustrative Simulation Study

In Table A1, the computed β parameter used in the illustrative simulation study as a function of the proportion of missing data and the missingness parameter δ is shown.
Table A1. Computed β parameter in the Mislevy-Wu model as a function of the proportion of missing data and the missingness parameter δ .
Miss %   δ = −10   δ = −3   δ = −2   δ = −1   δ = 0
5%       −3.728    −3.805   −3.906   −4.100   −4.422
10%      −2.548    −2.678   −2.821   −3.061   −3.417
20%      −1.001    −1.288   −1.513   −1.827   −2.228
30%       0.322    −0.255   −0.568   −0.947   −1.384
Note. Miss% = proportion of missing data.
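The idea of computing β from a target missing proportion and a given δ can be sketched as a one-dimensional root search. The sketch below is not the procedure used for Table A1 and will not reproduce its values. It assumes a single item of difficulty 0, standard normal ability θ and response propensity ξ with correlation 0.5, and a missingness probability of the form Ψ(β + kδ − ξ) for an item response X = k, where Ψ is the logistic function. This functional form, all parameter values, and all function names are assumptions made for illustration.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def normal_grid(n=41, lim=5.0):
    """Equally spaced nodes with normalized standard-normal weights."""
    nodes = [-lim + 2 * lim * i / (n - 1) for i in range(n)]
    weights = [math.exp(-0.5 * z * z) for z in nodes]
    total = sum(weights)
    return nodes, [w / total for w in weights]


def missing_rate(beta, delta, b=0.0, rho=0.5, sd_xi=1.0):
    """Marginal probability of a missing response for one item.

    Assumed form: P(missing | X = k, xi) = logistic(beta + k*delta - xi),
    with P(X = 1 | theta) = logistic(theta - b).
    """
    nodes, weights = normal_grid()
    total = 0.0
    for z1, w1 in zip(nodes, weights):
        p_correct = logistic(z1 - b)  # theta = z1
        for z2, w2 in zip(nodes, weights):
            # xi is correlated with theta through rho
            xi = sd_xi * (rho * z1 + math.sqrt(1 - rho**2) * z2)
            miss_wrong = logistic(beta - xi)
            miss_right = logistic(beta + delta - xi)
            total += w1 * w2 * (
                (1 - p_correct) * miss_wrong + p_correct * miss_right
            )
    return total


def solve_beta(target, delta, lo=-15.0, hi=15.0, tol=1e-8):
    """Bisection; missing_rate is increasing in beta for fixed delta."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if missing_rate(mid, delta) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


beta_ignorable = solve_beta(0.10, delta=0.0)
beta_extreme = solve_beta(0.10, delta=-10.0)
print(beta_ignorable, beta_extreme)
```

Even under these simplified assumptions, the solved β is larger for δ = −10 than for δ = 0 at the same missing proportion, which is the same ordering as in the rows of Table A1.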
In Table A2, the RMSE for the estimated mean and the standard deviation is shown for the different missing data treatments as a function of the proportion of missing data and the missingness parameter δ .
Table A2. Root mean square error (RMSE) for the mean and the standard deviation for different missing data treatments as a function of the missing proportion and the missingness parameter δ .
         Mean                                    Standard Deviation
Model    δ = −10  −3      −2      −1      0      δ = −10  −3      −2      −1      0
5% missing data
CD       0.049    0.054   0.054   0.055   0.049  0.047    0.053   0.054   0.053   0.049
MM1      0.051    0.049   0.055   0.054   0.061  0.048    0.049   0.053   0.050   0.056
UW       0.049    0.052   0.053   0.067   0.087  0.047    0.050   0.046   0.050   0.054
UO       0.110    0.105   0.102   0.082   0.058  0.077    0.075   0.073   0.065   0.055
MO2      0.106    0.099   0.091   0.071   0.053  0.077    0.072   0.067   0.059   0.052
IF1      0.109    0.106   0.102   0.082   0.058  0.075    0.074   0.074   0.063   0.054
IF2      0.108    0.103   0.097   0.078   0.053  0.074    0.074   0.070   0.063   0.052
10% missing data
CD       0.052    0.052   0.050   0.051   0.056  0.050    0.053   0.049   0.050   0.055
MM1      0.053    0.056   0.060   0.059   0.063  0.051    0.055   0.056   0.053   0.054
UW       0.052    0.058   0.076   0.100   0.149  0.050    0.053   0.054   0.052   0.048
UO       0.168    0.144   0.125   0.111   0.066  0.107    0.082   0.073   0.079   0.060
MO2      0.159    0.132   0.119   0.096   0.054  0.097    0.072   0.074   0.071   0.053
IF1      0.169    0.148   0.128   0.112   0.067  0.106    0.085   0.074   0.078   0.060
IF2      0.161    0.142   0.125   0.104   0.057  0.097    0.080   0.076   0.076   0.056
20% missing data
CD       0.052    0.052   0.051   0.049   0.051  0.052    0.050   0.051   0.048   0.050
MM1      0.077    0.058   0.064   0.069   0.081  0.070    0.053   0.058   0.058   0.061
UW       0.052    0.093   0.139   0.208   0.278  0.051    0.056   0.046   0.056   0.061
UO       0.211    0.219   0.193   0.154   0.091  0.158    0.144   0.118   0.092   0.075
MO2      0.211    0.218   0.187   0.129   0.053  0.158    0.141   0.112   0.080   0.050
IF1      0.216    0.223   0.192   0.157   0.086  0.159    0.145   0.112   0.092   0.068
IF2      0.222    0.220   0.194   0.140   0.056  0.166    0.138   0.114   0.087   0.054
30% missing data
CD       0.053    0.051   0.052   0.053   0.055  0.053    0.051   0.051   0.052   0.054
MM1      0.128    0.063   0.072   0.083   0.092  0.159    0.057   0.059   0.064   0.064
UW       0.051    0.167   0.232   0.306   0.372  0.051    0.049   0.049   0.056   0.061
UO       0.211    0.247   0.234   0.190   0.095  0.222    0.179   0.148   0.117   0.076
MO2      0.207    0.247   0.238   0.171   0.060  0.225    0.179   0.152   0.107   0.057
IF1      0.220    0.253   0.240   0.193   0.098  0.221    0.180   0.149   0.115   0.075
IF2      0.216    0.256   0.242   0.179   0.064  0.224    0.180   0.149   0.109   0.061
Note. CD = complete-data analysis; UW = scoring as wrong (Section 2.1); MM1 = Mislevy-Wu model with common d parameter (Section 2.5, Equation (14)); UO = ignoring missing item responses (Section 2.3); MO2 = model-based latent ignorability (Section 2.4, Equations (10) and (11)); IF1 = FCS imputation based on item responses (Section 2.6); IF2 = FCS imputation based on item responses and response indicators (Section 2.6).
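For readers less familiar with the criterion reported in Table A2, the following sketch shows how an RMSE is computed across simulation replications and how it splits into squared bias and variance. The five estimates and the true value are hypothetical numbers, not results from the simulation study, and the function names are illustrative.

```python
import math


def rmse(estimates, true_value):
    """Root mean square error of estimates around the true value."""
    n = len(estimates)
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)


def bias_and_sd(estimates, true_value):
    """Bias and (population) standard deviation of the estimates."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - true_value
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    return bias, math.sqrt(variance)


# Hypothetical estimated means from five replications; true mean is 0.
estimates = [0.08, 0.12, 0.05, 0.15, 0.10]
true_mean = 0.0

value = rmse(estimates, true_mean)
bias, sd = bias_and_sd(estimates, true_mean)

# RMSE^2 equals bias^2 + variance, so a large RMSE can stem from
# systematic bias even when the sampling variability is small.
print(value, bias, sd)
```

This decomposition is one way to read Table A2: where a treatment's RMSE rises well above that of the complete-data analysis (CD), the increase is plausibly driven by bias rather than by additional sampling variability, although the table itself reports only the combined RMSE.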

Appendix B. Country Labels Used in the PISA 2018 Mathematics Case Study

The country labels used in Tables 2, 5, 6, 7, A3, A4 and A5 are as follows: ALB = Albania; AUS = Australia; AUT = Austria; BEL = Belgium; BIH = Bosnia and Herzegovina; BLR = Belarus; BRN = Brunei Darussalam; CAN = Canada; CHE = Switzerland; CZE = Czech Republic; DEU = Germany; DNK = Denmark; ESP = Spain; EST = Estonia; FIN = Finland; FRA = France; GBR = United Kingdom; GRC = Greece; HKG = Hong Kong; HRV = Croatia; HUN = Hungary; IRL = Ireland; ISL = Iceland; ISR = Israel; ITA = Italy; JPN = Japan; KOR = Korea; LTU = Lithuania; LUX = Luxembourg; LVA = Latvia; MLT = Malta; MNE = Montenegro; MYS = Malaysia; NLD = Netherlands; NOR = Norway; NZL = New Zealand; POL = Poland; PRT = Portugal; RUS = Russian Federation; SGP = Singapore; SVK = Slovak Republic; SVN = Slovenia; SWE = Sweden; TUR = Turkey; USA = United States.

Appendix C. Further Results of the PISA 2018 Mathematics Case Study

In Table A3, standard errors for country means for the PISA 2018 mathematics case study resulting from 11 different scaling models are shown.
In Table A4, standard errors for country mean differences between the models UW, MO2 and MM2 are presented.
In Table A5, country standard deviations for the PISA 2018 mathematics case study resulting from 11 different scaling models are reported.
Table A3. Standard errors for country means for PISA 2018 mathematics from 11 different scaling models for missing item responses.
Country  Aver  SD    rg    UW    UP    UN1   UN2   UO1   MO2   IO2   MM2   IM2   IF1   IF2
ALB      2.02  0.06  0.24  2.07  2.04  2.03  2.16  1.99  2.02  2.02  2.02  2.02  1.99  1.91
AUS      1.69  0.03  0.11  1.73  1.71  1.67  1.75  1.64  1.68  1.68  1.70  1.69  1.68  1.69
AUT      2.53  0.05  0.14  2.56  2.55  2.54  2.60  2.47  2.53  2.49  2.61  2.55  2.47  2.47
BEL      1.79  0.06  0.16  1.85  1.84  1.84  1.83  1.71  1.78  1.71  1.86  1.82  1.70  1.75
BIH      2.37  0.19  0.56  2.53  2.53  2.54  2.52  2.26  2.32  2.27  2.51  2.48  2.13  1.99
BLR      2.29  0.02  0.06  2.31  2.32  2.31  2.28  2.26  2.27  2.27  2.30  2.29  2.27  2.26
BRN      1.56  0.02  0.06  1.58  1.57  1.52  1.53  1.55  1.56  1.58  1.56  1.55  1.57  1.56
CAN      2.05  0.03  0.12  2.05  2.05  2.06  2.05  2.02  2.03  2.03  2.04  2.03  2.05  2.14
CHE      2.28  0.02  0.05  2.28  2.29  2.27  2.26  2.25  2.27  2.26  2.30  2.29  2.27  2.28
CZE      2.15  0.02  0.05  2.16  2.17  2.17  2.16  2.13  2.15  2.13  2.17  2.14  2.13  2.17
DEU      2.51  0.07  0.22  2.56  2.56  2.58  2.59  2.46  2.49  2.48  2.56  2.56  2.43  2.38
DNK      1.82  0.02  0.08  1.86  1.83  1.81  1.84  1.78  1.81  1.81  1.83  1.83  1.84  1.79
ESP      1.21  0.03  0.10  1.23  1.23  1.24  1.24  1.15  1.19  1.17  1.24  1.21  1.17  1.20
EST      1.80  0.02  0.06  1.78  1.78  1.82  1.80  1.82  1.80  1.80  1.78  1.79  1.83  1.81
FIN      1.97  0.02  0.07  1.98  1.98  1.96  1.95  1.96  2.00  1.98  1.98  1.95  1.95  1.93
FRA      2.17  0.03  0.09  2.20  2.19  2.21  2.20  2.14  2.15  2.14  2.20  2.21  2.12  2.13
GBR      2.53  0.05  0.18  2.58  2.57  2.52  2.65  2.47  2.48  2.48  2.52  2.53  2.51  2.53
GRC      2.68  0.07  0.26  2.69  2.70  2.75  2.73  2.68  2.69  2.68  2.72  2.72  2.61  2.49
HKG      2.83  0.02  0.05  2.83  2.85  2.82  2.84  2.81  2.82  2.80  2.85  2.84  2.82  2.85
HRV      2.36  0.09  0.25  2.41  2.45  2.46  2.42  2.29  2.33  2.29  2.43  2.45  2.26  2.21
HUN      2.26  0.06  0.16  2.33  2.32  2.26  2.31  2.19  2.26  2.23  2.32  2.28  2.21  2.17
IRL      2.01  0.01  0.05  2.00  2.01  2.04  2.02  2.01  2.03  2.01  2.00  2.00  1.99  2.01
ISL      2.06  0.04  0.11  2.03  2.01  2.10  2.10  2.11  2.07  2.11  2.00  2.02  2.08  2.01
ISR      3.65  0.18  0.56  3.84  3.73  3.66  3.80  3.52  3.65  3.62  3.82  3.80  3.40  3.27
ITA      2.47  0.06  0.14  2.51  2.53  2.52  2.50  2.39  2.43  2.40  2.53  2.51  2.42  2.39
JPN      2.42  0.01  0.04  2.42  2.44  2.42  2.41  2.40  2.43  2.41  2.43  2.41  2.40  2.42
KOR      2.89  0.09  0.33  2.94  2.93  2.85  3.12  2.79  2.86  2.86  2.90  2.91  2.79  2.88
LTU      1.91  0.02  0.07  1.89  1.91  1.89  1.88  1.93  1.93  1.92  1.91  1.90  1.93  1.95
LUX      1.72  0.03  0.09  1.72  1.72  1.76  1.74  1.69  1.71  1.70  1.75  1.75  1.67  1.70
LVA      1.90  0.02  0.05  1.92  1.93  1.89  1.90  1.88  1.89  1.89  1.89  1.91  1.88  1.89
MLT      2.75  0.21  0.82  2.91  2.81  2.69  3.22  2.63  2.74  2.71  2.81  2.78  2.51  2.40
MNE      1.31  0.07  0.23  1.36  1.35  1.32  1.40  1.27  1.33  1.27  1.37  1.34  1.21  1.17
MYS      2.66  0.09  0.30  2.57  2.62  2.59  2.57  2.71  2.72  2.69  2.63  2.59  2.73  2.87
NLD      2.22  0.03  0.10  2.26  2.25  2.16  2.26  2.17  2.21  2.22  2.23  2.23  2.19  2.26
NOR      1.65  0.06  0.17  1.71  1.68  1.65  1.68  1.61  1.65  1.64  1.69  1.69  1.56  1.54
NZL      1.81  0.05  0.16  1.86  1.84  1.79  1.90  1.78  1.81  1.81  1.85  1.82  1.77  1.73
POL      2.66  0.04  0.12  2.66  2.67  2.71  2.71  2.65  2.68  2.68  2.69  2.66  2.61  2.58
PRT      2.27  0.04  0.13  2.29  2.30  2.34  2.31  2.22  2.25  2.23  2.28  2.28  2.22  2.21
RUS      2.63  0.02  0.07  2.62  2.62  2.68  2.64  2.61  2.61  2.63  2.62  2.63  2.65  2.61
SGP      1.56  0.05  0.14  1.53  1.54  1.52  1.52  1.60  1.55  1.56  1.52  1.53  1.63  1.66
SVK      2.40  0.03  0.08  2.37  2.39  2.40  2.38  2.44  2.41  2.45  2.38  2.40  2.38  2.38
SVN      1.99  0.03  0.08  2.00  2.00  1.98  1.98  1.95  1.98  1.97  2.03  2.01  1.95  2.01
SWE      2.52  0.05  0.16  2.50  2.48  2.57  2.55  2.55  2.55  2.57  2.54  2.55  2.48  2.41
TUR      1.93  0.03  0.10  1.89  1.90  1.91  1.89  1.95  1.94  1.95  1.91  1.92  1.96  1.98
USA      2.77  0.04  0.14  2.83  2.82  2.71  2.69  2.77  2.80  2.78  2.78  2.76  2.82  2.74
Note. Aver = average of standard errors of country means across 11 models; SD = standard deviation of standard errors of country means across 11 models; rg = range of standard errors of country means across 11 models; UW = scoring as wrong (Section 2.1); UP = MC items scored as partially correct (Section 2.2); UN1 = ignoring not reached items (Section 4.2.1); UN2 = including proportion of not reached items in background model (Section 4.2.1); UO1 = ignoring missing item responses (Section 2.3); MO2 = model-based latent ignorability (Section 2.4, Equations (10) and (11)); IO2 = imputed under latent ignorability (Section 2.4.1, Equations (10) and (11)); MM2 = Mislevy-Wu model with item format specific d parameters (Section 2.5, Equation (14)); IM2 = imputed under Mislevy-Wu model with item format specific d parameters (Section 2.5, Equation (14)); IF1 = FCS imputation based on item responses (Section 2.6 and Section 4.2.2); IF2 = FCS imputation based on item responses and response indicators (Section 2.6 and Section 4.2.2); See Appendix B for country labels.
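The summary columns Aver, SD and rg can be recomputed from the model-specific entries of a row. The sketch below does this for Australia (AUS) using the rounded values printed in Table A3. Whether the published SD column uses the sample or the population formula is not stated, so the sample version is an assumption here. Because the table was presumably built from unrounded values, recomputing other rows from the printed entries may differ in the last digit.

```python
import statistics

models = ["UW", "UP", "UN1", "UN2", "UO1", "MO2", "IO2", "MM2", "IM2", "IF1", "IF2"]

# Standard errors of the country mean for Australia, as printed in Table A3.
aus_se = [1.73, 1.71, 1.67, 1.75, 1.64, 1.68, 1.68, 1.70, 1.69, 1.68, 1.69]

aver = statistics.mean(aus_se)   # column Aver
sd = statistics.stdev(aus_se)    # column SD (sample SD assumed)
rg = max(aus_se) - min(aus_se)   # column rg

summary = (round(aver, 2), round(sd, 2), round(rg, 2))
print(dict(zip(["Aver", "SD", "rg"], summary)))
```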
Table A4. Standard errors for country mean differences between three different models UW, MO2 and MM2 for PISA 2018 mathematics.
Country  UW     MO2    MM2    UW–MO2  UW–MM2  MO2–MM2
ALB      2.068  2.025  2.022  0.044   0.058   0.014
AUS      1.732  1.678  1.701  0.031   0.030   0.001
AUT      2.562  2.530  2.607  0.028   0.009   0.036
BEL      1.847  1.782  1.863  0.013   0.007   0.019
BIH      2.533  2.316  2.513  0.117   0.019   0.093
BLR      2.311  2.274  2.303  0.028   0.010   0.037
BRN      1.580  1.562  1.560  0.014   0.004   0.018
CAN      2.046  2.032  2.044  0.002   0.020   0.018
CHE      2.283  2.270  2.299  0.004   0.017   0.013
CZE      2.159  2.148  2.173  0.039   0.024   0.014
DEU      2.564  2.489  2.560  0.046   0.011   0.035
DNK      1.856  1.806  1.827  0.008   0.021   0.013
ESP      1.232  1.185  1.238  0.016   0.003   0.013
EST      1.776  1.799  1.783  0.026   0.008   0.018
FIN      1.984  2.001  1.981  0.051   0.018   0.032
FRA      2.197  2.154  2.201  0.024   0.009   0.033
GBR      2.580  2.484  2.522  0.079   0.056   0.023
GRC      2.686  2.686  2.720  0.046   0.035   0.011
HKG      2.829  2.820  2.851  0.004   0.012   0.008
HRV      2.413  2.333  2.430  0.015   0.031   0.045
HUN      2.328  2.263  2.316  0.017   0.015   0.001
IRL      1.997  2.030  2.001  0.036   0.004   0.040
ISL      2.028  2.070  2.002  0.000   0.036   0.037
ISR      3.835  3.652  3.816  0.106   0.019   0.084
ITA      2.512  2.434  2.527  0.037   0.014   0.050
JPN      2.416  2.427  2.430  0.037   0.002   0.034
KOR      2.944  2.865  2.896  0.075   0.045   0.030
LTU      1.891  1.926  1.910  0.039   0.027   0.013
LUX      1.722  1.706  1.746  0.004   0.016   0.012
LVA      1.920  1.893  1.892  0.021   0.003   0.018
MLT      2.912  2.740  2.811  0.145   0.096   0.048
MNE      1.364  1.334  1.375  0.056   0.018   0.038
MYS      2.570  2.717  2.632  0.149   0.061   0.088
NLD      2.257  2.213  2.226  0.009   0.014   0.005
NOR      1.711  1.650  1.692  0.034   0.010   0.023
NZL      1.858  1.814  1.853  0.007   0.013   0.005
POL      2.664  2.681  2.685  0.023   0.000   0.022
PRT      2.288  2.247  2.285  0.018   0.011   0.006
RUS      2.617  2.607  2.620  0.011   0.007   0.004
SGP      1.530  1.549  1.522  0.003   0.011   0.008
SVK      2.374  2.415  2.380  0.022   0.011   0.011
SVN      1.999  1.982  2.031  0.007   0.018   0.025
SWE      2.505  2.551  2.541  0.042   0.026   0.016
TUR      1.890  1.940  1.907  0.064   0.030   0.034
USA      2.834  2.801  2.780  0.089   0.015   0.071
Note. UW = scoring as wrong (Section 2.1); MO2 = model-based latent ignorability (Section 2.4, Equations (10) and (11)); MM2 = Mislevy-Wu model with item format specific d parameters (Section 2.5, Equation (14)); UW–MO2, UW–MM2 and MO2–MM2 = standard errors of the difference in country means between the two respective models; See Appendix B for country labels.
Table A5. Country standard deviations for PISA 2018 mathematics from 11 different scaling models for missing item responses.
Country  Aver   SD   rg    UW     UP     UN1    UN2    UO1    MO2    IO2    MM2    IM2    IF1    IF2
ALB      69.6   1.0  3.7   70.9   69.8   69.1   71.5   69.4   69.4   69.3   68.9   69.5   70.0   67.9
AUS      91.4   0.9  2.3   92.4   91.9   90.5   91.0   90.7   90.8   90.9   90.9   90.9   92.6   92.9
AUT      90.5   0.6  1.6   90.9   90.8   89.9   90.9   89.7   89.9   90.3   91.2   91.1   90.0   91.1
BEL      88.4   0.7  2.4   88.6   88.6   88.5   87.8   87.5   87.9   87.8   88.9   88.7   88.5   89.9
BIH      81.7   3.5  11.4  84.9   84.4   83.9   84.2   80.5   81.0   80.9   84.2   83.2   78.0   73.5
BLR      90.6   1.0  2.9   90.1   90.4   90.3   89.1   91.6   91.1   91.2   89.7   89.8   91.9   92.0
BRN      90.0   1.0  3.0   89.5   89.3   89.5   88.5   90.9   90.3   90.8   89.2   89.5   91.5   91.1
CAN      86.7   1.6  5.9   86.5   86.3   85.5   85.1   86.5   86.4   86.5   85.6   85.8   88.2   91.0
CHE      88.1   0.3  1.0   87.7   88.2   88.7   87.7   88.0   87.8   87.7   88.3   88.0   88.6   88.2
CZE      90.1   0.8  2.7   88.7   89.3   90.4   89.6   90.9   90.3   90.1   89.7   89.8   90.6   91.4
DEU      91.3   1.2  3.6   92.6   92.2   92.0   92.4   90.5   90.9   90.9   92.2   92.3   89.4   89.0
DNK      83.8   0.8  2.4   83.9   83.7   83.1   82.9   83.8   83.6   83.9   83.0   83.2   85.0   85.3
ESP      83.1   0.8  2.7   83.2   83.1   83.9   83.0   82.1   82.1   82.3   83.0   83.3   83.1   84.7
EST      83.5   1.0  3.0   82.6   82.6   83.4   82.5   84.1   83.8   83.6   83.0   82.6   85.1   85.5
FIN      84.4   0.9  2.6   83.0   83.2   85.4   84.6   85.6   85.2   85.2   83.8   83.8   85.0   83.7
FRA      91.1   0.7  1.8   91.6   91.5   91.8   91.1   90.5   90.6   90.2   92.0   92.1   90.9   90.3
GBR      92.2   1.3  3.9   94.1   93.3   91.1   94.0   90.2   91.2   91.0   92.1   92.2   91.7   92.8
GRC      83.0   1.3  4.8   82.2   82.6   84.2   83.5   84.1   83.6   83.6   83.3   83.4   83.5   79.4
HKG      89.6   1.1  4.0   89.1   89.5   88.3   88.9   89.2   89.2   89.3   89.4   89.5   90.5   92.3
HRV      82.5   1.6  5.3   82.8   83.9   84.6   82.8   81.5   82.3   81.2   83.9   84.1   81.1   79.3
HUN      92.0   1.1  4.1   93.1   92.9   92.4   92.2   92.0   92.4   92.0   92.5   92.4   91.0   88.9
IRL      77.8   0.7  1.7   77.1   77.4   78.3   77.6   78.4   78.4   78.3   76.9   76.9   78.2   78.6
ISL      86.2   1.2  4.1   87.0   86.1   87.1   86.4   87.6   87.1   87.3   85.5   85.7   85.2   83.6
ISR      109.2  2.7  9.4   112.0  109.7  110.0  110.4  108.4  109.0  109.2  111.5  111.8  106.2  102.6
ITA      92.1   0.9  2.4   92.3   92.7   93.2   92.6   90.8   91.0   91.2   92.8   93.1   90.9   92.0
JPN      85.2   0.8  2.2   84.4   85.0   84.8   84.2   85.8   85.7   85.9   84.5   84.6   86.2   86.4
KOR      91.2   1.4  4.9   92.8   92.2   89.9   94.0   89.1   90.4   90.6   91.3   91.5   89.7   91.4
LTU      90.3   0.8  2.8   88.8   89.7   90.4   89.5   91.2   90.6   90.7   90.0   90.2   91.6   91.1
LUX      92.9   0.6  2.0   92.6   92.6   93.7   92.7   92.9   92.8   93.3   93.5   93.8   92.4   91.7
LVA      80.4   0.7  2.2   79.8   80.4   80.2   79.4   80.9   80.7   80.4   79.9   80.2   81.6   81.3
MLT      99.0   5.4  18.2  105.2  101.4  98.2   105.8  96.7   100.0  99.4   101.7  101.0  92.1   87.6
MNE      79.7   2.8  9.3   82.4   82.3   80.9   81.8   78.3   79.1   78.7   81.3   81.0   77.3   73.1
MYS      89.0   3.1  10.2  85.8   87.6   86.5   85.9   90.5   90.7   89.7   87.8   86.7   91.6   95.9
NLD      89.8   0.9  3.0   89.9   90.2   88.9   88.9   89.7   89.6   89.4   89.4   89.1   90.8   91.8
NOR      89.7   1.2  4.1   91.0   90.0   90.8   90.4   89.1   89.2   89.3   90.4   90.7   88.6   86.8
NZL      93.2   0.4  1.4   93.1   92.7   93.0   93.9   93.1   93.4   93.4   93.7   92.5   93.7   93.0
POL      88.6   0.3  0.9   88.3   88.5   88.6   88.4   88.9   89.0   89.2   88.3   88.4   88.6   88.7
PRT      95.2   0.5  2.1   94.4   95.1   96.4   95.3   95.5   95.1   95.5   94.8   95.2   95.4   94.7
RUS      79.9   0.4  1.6   79.4   79.7   80.3   79.2   80.2   79.8   80.1   79.6   79.6   80.8   80.1
SGP      91.7   1.9  6.7   91.5   91.8   89.7   89.9   91.6   91.3   91.3   90.8   90.8   93.8   96.4
SVK      91.9   0.5  1.4   91.4   91.7   92.2   91.1   92.5   92.3   92.4   91.9   92.0   92.1   91.2
SVN      86.0   0.5  1.6   85.7   86.1   86.3   85.5   85.5   85.3   85.7   86.4   86.3   86.4   87.0
SWE      88.5   1.2  4.2   87.6   86.8   90.3   89.3   89.5   89.1   89.4   88.6   88.8   88.5   86.1
TUR      90.9   2.0  5.7   88.5   89.3   89.8   88.7   92.4   91.5   91.7   89.9   90.0   94.0   94.2
USA      95.3   2.0  5.9   93.5   93.9   94.0   92.7   97.2   96.4   96.7   94.0   94.0   98.6   97.4
Note. Aver = average of country standard deviations across 11 models; SD = standard deviation of country standard deviations across 11 models; rg = range of country standard deviations across 11 models; UW = scoring as wrong (Section 2.1); UP = MC items scored as partially correct (Section 2.2); UN1 = ignoring not reached items (Section 4.2.1); UN2 = including proportion of not reached items in background model (Section 4.2.1); UO1 = ignoring missing item responses (Section 2.3); MO2 = model-based latent ignorability (Section 2.4, Equations (10) and (11)); IO2 = imputed under latent ignorability (Section 2.4.1, Equations (10) and (11)); MM2 = Mislevy-Wu model with item format specific d parameters (Section 2.5, Equation (14)); IM2 = imputed under Mislevy-Wu model with item format specific d parameters (Section 2.5, Equation (14)); IF1 = FCS imputation based on item responses (Section 2.6 and Section 4.2.2); IF2 = FCS imputation based on item responses and response indicators (Section 2.6 and Section 4.2.2); See Appendix B for country labels.

References

  1. Lietz, P.; Cresswell, J.C.; Rust, K.F.; Adams, R.J. (Eds.) Implementation of Large-Scale Education Assessments; Wiley: New York, NY, USA, 2017.
  2. Maehler, D.B.; Rammstedt, B. (Eds.) Large-Scale Cognitive Assessment; Springer: Cham, Switzerland, 2020.
  3. Simon, M.; Ercikan, K.; Rousseau, M. (Eds.) Improving Large-Scale Assessment in Education; Routledge: New York, NY, USA, 2012.
  4. Pohl, S.; Ulitzsch, E.; von Davier, M. Reframing rankings in educational assessments. Science 2021, 372, 338–340.
  5. Carpenter, J.; Kenward, M. Multiple Imputation and Its Application; Wiley: New York, NY, USA, 2012.
  6. Graham, J.W. Missing data analysis: Making it work in the real world. Annu. Rev. Psychol. 2009, 60, 549–576.
  7. Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data; Wiley: New York, NY, USA, 2002.
  8. Schafer, J.L.; Graham, J.W. Missing data: Our view of the state of the art. Psychol. Methods 2002, 7, 147–177.
  9. Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154.
  10. Rose, N.; von Davier, M.; Xu, X. Modeling Nonignorable Missing Data with Item Response Theory (IRT); (Research Report No. RR-10-11); Educational Testing Service: Princeton, NJ, USA, 2010.
  11. Pohl, S.; Gräfe, L.; Rose, N. Dealing with omitted and not-reached items in competence tests: Evaluating approaches accounting for missing responses in item response theory models. Educ. Psychol. Meas. 2014, 74, 423–452.
  12. Rose, N.; von Davier, M.; Nagengast, B. Modeling omitted and not-reached items in IRT models. Psychometrika 2017, 82, 795–819.
  13. OECD. PISA 2015. Technical Report; OECD: Paris, France, 2017; Available online: https://bit.ly/32buWnZ (accessed on 3 October 2021).
  14. Frey, A.; Hartig, J.; Rupp, A.A. An NCME instructional module on booklet designs in large-scale assessments of student achievement: Theory and practice. Educ. Meas. 2009, 28, 39–53.
  15. Weeks, J.; von Davier, M.; Yamamoto, K. Design considerations for the program for international student assessment. In A Handbook of International Large-Scale Assessment: Background, Technical Issues, and methods of Data Analysis; Rutkowski, L., von Davier, M., Rutkowski, D., Eds.; Chapman Hall/CRC Press: London, UK, 2013; pp. 259–276.
  16. OECD. PISA 2018. Technical Report; OECD: Paris, France, 2020; Available online: https://bit.ly/3zWbidA (accessed on 3 October 2021).
  17. Pohl, S.; Carstensen, C. NEPS Technical Report—Scaling the Data of the Competence Tests; (NEPS Working Paper No. 14); Otto-Friedrich-Universität, Nationales Bildungspanel: Bamberg, Germany, 2012; Available online: https://bit.ly/2XThQww (accessed on 3 October 2021).
  18. Pohl, S.; Carstensen, C.H. Scaling of competence tests in the national educational panel study—Many questions, some answers, and further challenges. J. Educ. Res. Online 2013, 5, 189–216. Available online: https://bit.ly/39AETyE (accessed on 3 October 2021).
  19. Kuha, J.; Katsikatsou, M.; Moustaki, I. Latent variable modelling with non-ignorable item nonresponse: Multigroup response propensity models for cross-national analysis. J. R. Stat. Soc. Ser. A Stat. Soc. 2018, 181, 1169–1192.
  20. Holman, R.; Glas, C.A.W. Modelling non-ignorable missing-data mechanisms with item response theory models. Br. J. Math. Stat. Psychol. 2005, 58, 1–17.
  21. Knott, M.; Tzamourani, P. Fitting a latent trait model for missing observations to racial prejudice data. In Applications of Latent Trait and Latent Class Models in the Social Sciences; Rost, J., Langeheine, R., Eds.; Waxmann: New York, NY, USA, 1997; pp. 244–252. Available online: https://bit.ly/3CMEJ3K (accessed on 3 October 2021).
  22. Finch, H. Estimation of item response theory parameters in the presence of missing data. J. Educ. Meas. 2008, 45, 225–245.
  23. Sinharay, S. Score reporting for examinees with incomplete data on large-scale educational assessments. Educ. Meas. 2021, 40, 79–91.
  24. Robitzsch, A. Zu nichtignorierbaren Konsequenzen des (partiellen) Ignorierens fehlender Item Responses im Large-Scale Assessment [On nonignorable consequences of (partial) ignoring of missing item responses in large-scale assessments]. In PIRLS & TIMSS 2011. Die Kompetenzen in Lesen, Mathematik und Naturwissenschaften am Ende der Volksschule. Österreichischer Expertenbericht; Suchan, B., Wallner-Paschon, C., Schreiner, C., Eds.; Leykam: Graz, Austria, 2016; pp. 55–64. Available online: https://bit.ly/2ZnEYDP (accessed on 3 October 2021).
  25. Robitzsch, A. About Still Nonignorable Consequences of (Partially) Ignoring Missing Item Responses in Large-Scale Assessment. 2020. OSF Preprints. Available online: https://osf.io/hmy45 (accessed on 3 October 2021).
  26. Robitzsch, A.; Lüdtke, O. Reflections on analytical choices in the scaling model for test scores in international large-scale assessment studies. PsyArXiv 2021.
  27. Rohwer, G. Making Sense of Missing Answers in Competence Tests; (NEPS Working Paper No. 30); Otto-Friedrich-Universität, Nationales Bildungspanel: Bamberg, Germany, 2013; Available online: https://bit.ly/3AGfsr5 (accessed on 3 October 2021).
  28. Gorgun, G.; Bulut, O. A polytomous scoring approach to handle not-reached items in low-stakes assessments. Educ. Psychol. Meas. 2021, 81, 847–871.
  29. Pools, E.; Monseur, C. Student test-taking effort in low-stakes assessments: Evidence from the English version of the PISA 2015 science test. Large-Scale Assess. Educ. 2021, 9, 10.
  30. Wise, S.L.; DeMars, C.E. Low examinee effort in low-stakes assessment: Problems and potential solutions. Educ. Assess. 2005, 10, 1–17.
  31. Becker, B.; van Rijn, P.; Molenaar, D.; Debeer, D. Item order and speededness: Implications for test fairness in higher educational high-stakes testing. Assess. Eval. High. Educ. 2021, 1–13.
  32. Mislevy, R.J.; Wu, P.K. Missing Responses and IRT Ability Estimation: Omits, Choice, Time Limits, and Adaptive Testing; (Research Report No. RR-96-30); Educational Testing Service: Princeton, NJ, USA, 1996.
  33. Robitzsch, A.; Lüdtke, O. An Item Response Model for Omitted Responses in Performance Tests. Talk Held at IMPS 2017, Zurich, July 2017. Available online: https://bit.ly/3u8rgjy (accessed on 3 October 2021).
  34. Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479.
  35. Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational Research: Copenhagen, Denmark, 1960.
  36. Dai, S. Handling missing responses in psychometrics: Methods and software. Psych 2021, 3, 673–693.
  37. Sinharay, S. Reporting proficiency levels for examinees with incomplete data. J. Educ. Behav. Stat. 2021, 10769986211051379.
  38. Sinharay, S.; Stern, H.S.; Russell, D. The use of multiple imputation for the analysis of missing data. Psychol. Methods 2001, 6, 317–329.
  39. Chalmers, R.P. mirt: A multidimensional item response theory package for the R environment. J. Stat. Softw. 2012, 48, 1–29.
  40. Beesley, L.J.; Taylor, J.M.G. A stacked approach for chained equations multiple imputation incorporating the substantive model. Biometrics 2020.
  41. van Buuren, S. Flexible Imputation of Missing Data; CRC Press: Boca Raton, FL, USA, 2018.
  42. Chan, K.W.; Meng, X.L. Multiple improvements of multiple imputation likelihood ratio tests. arXiv 2017, arXiv:1711.08822.
  43. Rose, N.; von Davier, M.; Nagengast, B. Commonalities and differences in IRT-based methods for nonignorable item nonresponses. Psych. Test Assess. Model. 2015, 57, 472–498.
  44. Pohl, S.; Becker, B. Performance of missing data approaches under nonignorable missing data conditions. Methodology 2020, 16, 147–165.
  45. Lord, F.M. Estimation of latent ability and item parameters when there are omitted responses. Psychometrika 1974, 39, 247–264.
  46. Aitkin, M. Expectation maximization algorithm and extensions. In Handbook of Item Response Theory, Vol. 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 217–236.
  47. Bock, R.D.; Aitkin, M. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika 1981, 46, 443–459.
  48. R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2020; Available online: https://www.R-project.org/ (accessed on 20 August 2020).
  49. Robitzsch, A. sirt: Supplementary Item Response Theory Models. R Package Version 3.10-118. 2021. Available online: https://CRAN.R-project.org/package=sirt (accessed on 3 September 2021).
  50. Frangakis, C.E.; Rubin, D.B. Addressing complications of intention-to-treat analysis in the combined presence of all-or-none treatment-noncompliance and subsequent missing outcomes. Biometrika 1999, 86, 365–379.
  51. Harel, O.; Schafer, J.L. Partial and latent ignorability in missing-data problems. Biometrika 2009, 96, 37–50.
  52. Beesley, L.J.; Taylor, J.M.G.; Little, R.J.A. Sequential imputation for models with latent variables assuming latent ignorability. Aust. N. Z. J. Stat. 2019, 61, 213–233.
  53. Debeer, D.; Janssen, R.; De Boeck, P. Modeling skipped and not-reached items using IRTrees. J. Educ. Meas. 2017, 54, 333–363. [Google Scholar] [CrossRef]
  54. Glas, C.A.W.; Pimentel, J.L.; Lamers, S.M.A. Nonignorable data in IRT models: Polytomous responses and response propensity models with covariates. Psych. Test Assess. Model. 2015, 57, 523–541. [Google Scholar]
  55. Jung, H.; Schafer, J.L.; Seo, B. A latent class selection model for nonignorably missing data. Comp. Stat. Data An. 2011, 55, 802–812. [Google Scholar] [CrossRef]
  56. Bacci, S.; Bartolucci, F. A multidimensional finite mixture structural equation model for nonignorable missing responses to test items. Struct. Equ. Model. 2015, 22, 352–365. [Google Scholar] [CrossRef]
  57. Bartolucci, F.; Montanari, G.E.; Pandolfi, S. Latent ignorability and item selection for nursing home case-mix evaluation. J. Classif. 2018, 35, 172–193. [Google Scholar] [CrossRef]
  58. Fu, Z.H.; Tao, J.; Shi, N.Z. Bayesian estimation of the multidimensional graded response model with nonignorable missing data. J. Stat. Comput. Simul. 2010, 80, 1237–1252. [Google Scholar] [CrossRef]
  59. Huang, H.Y. A mixture IRTree model for performance decline and nonignorable missing data. Educ. Psychol. Meas. 2020, 80, 1168–1195. [Google Scholar] [CrossRef] [PubMed]
  60. Okumura, T. Empirical differences in omission tendency and reading ability in PISA: An application of tree-based item response models. Educ. Psychol. Meas. 2014, 74, 611–626. [Google Scholar] [CrossRef]
  61. Albert, P.S.; Follmann, D.A. Shared-parameter models. In Longitudinal Data Analysis; Fitzmaurice, G., Davidian, M., Verbeke, G., Molenberghs, G., Eds.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2008; pp. 447–466. [Google Scholar] [CrossRef] [Green Version]
  62. Little, R.J. Selection and pattern-mixture models. In Longitudinal Data Analysis; Fitzmaurice, G., Davidian, M., Verbeke, G., Molenberghs, G., Eds.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2008; pp. 409–431. [Google Scholar] [CrossRef] [Green Version]
  63. Bertoli-Barsotti, L.; Punzo, A. Rasch analysis for binary data with nonignorable nonresponses. Psicologica 2013, 34, 97–123. [Google Scholar]
  64. Glas, C.A.W.; Pimentel, J.L. Modeling nonignorable missing data in speeded tests. Educ. Psychol. Meas. 2008, 68, 907–922. [Google Scholar] [CrossRef]
  65. Korobko, O.B.; Glas, C.A.; Bosker, R.J.; Luyten, J.W. Comparing the difficulty of examination subjects with item response theory. J. Educ. Meas. 2008, 45, 139–157. [Google Scholar] [CrossRef]
  66. Rosas, G.; Shomer, Y. Models of nonresponse in legislative politics. Legis. Stud. Q. 2008, 33, 573–601. [Google Scholar] [CrossRef]
  67. Köhler, C.; Pohl, S.; Carstensen, C.H. Taking the missing propensity into account when estimating competence scores: Evaluation of item response theory models for nonignorable omissions. Educ. Psychol. Meas. 2015, 75, 850–874. [Google Scholar] [CrossRef] [Green Version]
  68. Xu, X.; von Davier, M. Fitting the Structured General Diagnostic Model to NAEP Data; (Research Report No. RR-08-28); Educational Testing Service: Princeton, NJ, USA, 2008. [Google Scholar] [CrossRef]
  69. Kreitchmann, R.S.; Abad, F.J.; Ponsoda, V. A two-dimensional multiple-choice model accounting for omissions. Front. Psychol. 2018, 9, 2540. [Google Scholar] [CrossRef]
  70. Zhou, S.; Huggins-Manley, A.C. The performance of the semigeneralized partial credit model for handling item-level missingness. Educ. Psychol. Meas. 2019, 80, 1196–1215. [Google Scholar] [CrossRef]
  71. Hughes, R.A.; White, I.R.; Seaman, S.R.; Carpenter, J.R.; Tilling, K.; Sterne, J.A.C. Joint modelling rationale for chained equations. BMC Med. Res. Methodol. 2014, 14, 28. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  72. Yuan, K.H. Normal distribution based pseudo ML for missing data: With applications to mean and covariance structure analysis. J. Multivar. Anal. 2009, 100, 1900–1918. [Google Scholar] [CrossRef] [Green Version]
  73. Fischer, G.H. Rasch models. In Handbook of Statistics, Volume 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Elsevier: Amsterdam, The Netherlands, 2007; pp. 515–585. [Google Scholar] [CrossRef]
  74. Adams, R.J.; Wilson, M.; Wang, W. The multidimensional random coefficients multinomial logit model. Appl. Psychol. Meas. 1997, 21, 1–23. [Google Scholar] [CrossRef]
  75. Mislevy, R.J. Randomization-based inference about latent variables from complex samples. Psychometrika 1991, 56, 177–196. [Google Scholar] [CrossRef]
  76. Finch, H.W. A comparison of the Heckman selection model, Ibrahim, and Lipsitz methods for dealing with nonignorable missing data. J. Psychiatry Behav. Sci. 2021, 4, 1045. [Google Scholar]
  77. Galimard, J.E.; Chevret, S.; Protopopescu, C.; Resche-Rigon, M. A multiple imputation approach for MNAR mechanisms compatible with Heckman’s model. Stat. Med. 2016, 35, 2907–2920. [Google Scholar] [CrossRef] [PubMed]
  78. Galimard, J.E.; Chevret, S.; Curis, E.; Resche-Rigon, M. Heckman imputation models for binary or continuous MNAR outcomes and MAR predictors. BMC Med. Res. Methodol. 2018, 18, 90. [Google Scholar] [CrossRef] [PubMed]
  79. Heckman, J. Sample selection bias as a specification error. Econometrica 1979, 47, 153–161. [Google Scholar] [CrossRef]
  80. Sportisse, A.; Boyer, C.; Josse, J. Imputation and low-rank estimation with missing not at random data. Stat. Comput. 2020, 30, 1629–1643. [Google Scholar] [CrossRef]
  81. Deribo, T.; Kroehne, U.; Goldhammer, F. Model-based treatment of rapid guessing. J. Educ. Meas. 2021, 58, 281–303. [Google Scholar] [CrossRef]
  82. Mislevy, R.J. Missing responses in Item response modeling. In Handbook of Item Response Theory, Volume 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 171–194. [Google Scholar] [CrossRef]
  83. Guo, J.; Xu, X. An IRT-based model for omitted and not-reached items. arXiv 2019, arXiv:1904.03767. [Google Scholar]
  84. Rosas, G.; Shomer, Y.; Haptonstahl, S.R. No news is news: Nonignorable nonresponse in roll-call data analysis. Am. J. Pol. Sc. 2015, 59, 511–528. [Google Scholar] [CrossRef]
  85. Gomes, H.; Matsushita, R.; Da Silva, S. Item tesponse theory modeling of high school students’ behavior in a high-stakes exam. Open Access Libr. J. 2019, 6, e5242. [Google Scholar] [CrossRef]
  86. Huisman, M. Imputation of missing item responses: Some simple techniques. Qual. Quant. 2000, 34, 331–351. [Google Scholar] [CrossRef]
  87. Huisman, M.; Molenaar, I.W. Imputation of missing scale data with item response models. In Essays on Item Response Theory; Boomsma, A., van Duijn, M.A.J., Snijders, T.A.B., Eds.; Springer: New York, NY, USA, 2001; pp. 221–244. [Google Scholar] [CrossRef]
  88. Sijtsma, K.; van der Ark, L.A. Investigation and treatment of missing item scores in test and questionnaire data. Multivar. Behav. Res. 2003, 38, 505–528. [Google Scholar] [CrossRef]
  89. van Ginkel, J.R.; van der Ark, L.A.; Sijtsma, K. Multiple imputation of item scores in test and questionnaire data, and influence on psychometric results. Multivar. Behav. Res. 2007, 42, 387–414. [Google Scholar] [CrossRef] [Green Version]
  90. van Buuren, S.; Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 2011, 45, 1–67. [Google Scholar] [CrossRef] [Green Version]
  91. van Buuren, S.; Brand, J.P.L.; Groothuis-Oudshoorn, C.G.M.; Rubin, D.B. Fully conditional specification in multivariate imputation. J. Stat. Comput. Simul. 2006, 76, 1049–1064. [Google Scholar] [CrossRef]
  92. van Buuren, S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat. Methods Med. Res. 2007, 16, 219–242. [Google Scholar] [CrossRef] [PubMed]
  93. Raghunathan, T.E.; Lepkowski, J.M.; Van Hoewyk, J.; Solenberger, P. A multivariate technique for multiply imputing missing values using a sequence of regression models. Surv. Methodol. 2001, 27, 85–96. [Google Scholar]
  94. Bulut, O.; Kim, D. The use of data imputation when investigating dimensionality in sparse data from computerized adaptive tests. J. Appl. Test. Technol. 2021. Available online: https://bit.ly/3oC2dTR (accessed on 3 October 2021).
  95. Edwards, J.M.; Finch, W.H. Recursive partitioning methods for data imputation in the context of item response theory: A Monte Carlo simulation. Psicológica 2018, 39. [Google Scholar] [CrossRef] [Green Version]
  96. Xiao, J.; Bulut, O. Evaluating the performances of missing data handling methods in ability estimation from sparse data. Educ. Psychol. Meas. 2020, 80, 932–954. [Google Scholar] [CrossRef]
  97. Horton, N.J.; Kleinman, K.P. Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. Am. Stat. 2007, 61, 79–90. [Google Scholar] [CrossRef]
  98. Morris, T.P.; White, I.R.; Royston, P. Tuning multiple imputation by predictive mean matching and local residual draws. BMC Med. Res. Methodol. 2014, 14, 75. [Google Scholar] [CrossRef] [Green Version]
  99. Münnich, R.; Rässler, S. PRIMA: A new multiple imputation procedure for binary variables. J. Off. Stat. 2005, 21, 325–341. [Google Scholar]
  100. Howard, W.J.; Rhemtulla, M.; Little, T.D. Using principal components as auxiliary variables in missing data estimation. Multivar. Behav. Res. 2015, 50, 285–299. [Google Scholar] [CrossRef]
  101. Hodge, D.W.; Safo, S.E.; Long, Q. Multiple imputation using dimension reduction techniques for high-dimensional data. arXiv 2019, arXiv:1905.05274. [Google Scholar]
  102. Wehrens, R.; Mevik, B.H. The pls package: Principal component and partial least squares regression in R. J. Stat. Softw. 2007, 18. [Google Scholar] [CrossRef] [Green Version]
  103. Robitzsch, A.; Pham, G.; Yanagida, T. Fehlende Daten und Plausible Values [Missing data and plausible values]. In Large-Scale Assessment mit R: Methodische Grundlagen der Österreichischen Bildungsstandardüberprüfung; Breit, S., Schreiner, C., Eds.; Facultas: Vienna, Austria, 2016; pp. 259–293. Available online: https://bit.ly/2YaZQ0G (accessed on 3 October 2021).
  104. Grund, S.; Lüdtke, O.; Robitzsch, A. On the treatment of missing data in background questionnaires in educational large-scale assessments: An evaluation of different procedures. J. Educ. Behav. Stat. 2021, 46, 430–465. [Google Scholar] [CrossRef]
  105. Beesley, L.J.; Bondarenko, I.; Elliott, M.R.; Kurian, A.W.; Katz, S.J.; Taylor, J.M.G. Multiple imputation with missing data indicators. arXiv 2021, arXiv:2103.02033. [Google Scholar] [CrossRef]
  106. Battauz, M. Multiple equating of separate IRT calibrations. Psychometrika 2017, 82, 610–636. [Google Scholar] [CrossRef]
  107. Haberman, S.J. Linking Parameter Estimates Derived from an Item Response Model through Separate Calibrations; (Research Report No. RR-09-40); Educational Testing Service: Princeton, NJ, USA, 2009. [Google Scholar] [CrossRef]
  108. Kolen, M.J.; Brennan, R.L. Test Equating, Scaling, and Linking; Springer: New York, NY, USA, 2014. [Google Scholar] [CrossRef]
  109. Morris, T.P.; White, I.R.; Crowther, M.J. Using simulation studies to evaluate statistical methods. Stat. Med. 2019, 38, 2074–2102. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  110. White, I.R. simsum: Analyses of simulation studies including Monte Carlo error. Stata J. 2010, 10, 369–385. [Google Scholar] [CrossRef] [Green Version]
  111. de Jong, M.G.; Steenkamp, J.B.E.M.; Fox, J.P. Relaxing measurement invariance in cross-national consumer research using a hierarchical IRT model. J. Consum. Res. 2007, 34, 260–278. [Google Scholar] [CrossRef]
  112. Kankaraš, M.; Moors, G. Analysis of cross-cultural comparability of PISA 2009 scores. J. Cross-Cult. Psychol. 2014, 45, 381–399. [Google Scholar] [CrossRef] [Green Version]
  113. Robitzsch, A.; Lüdtke, O. Linking errors in international large-scale assessments: Calculation of standard errors for trend estimation. Assess. Educ. 2019, 26, 444–465. [Google Scholar] [CrossRef]
  114. Robitzsch, A.; Lüdtke, O. A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psych. Test Assess. Model. 2020, 62, 233–279. [Google Scholar]
  115. OECD. PISA 2006. Technical Report; OECD: Paris, France, 2009; Available online: https://bit.ly/38jhdzp (accessed on 3 October 2021).
  116. OECD. PISA 2009. Technical Report; OECD: Paris, France, 2012; Available online: https://bit.ly/3xfxdwD (accessed on 3 October 2021).
  117. OECD. PISA 2012. Technical Report; OECD: Paris, France, 2014; Available online: https://bit.ly/2YLG24g (accessed on 3 October 2021).
  118. Robitzsch, A.; Lüdtke, O. Mean comparisons of many groups in the presence of DIF: An evaluation of linking and concurrent scaling approaches. J. Educ. Behav. Stat. 2021. [Google Scholar] [CrossRef]
  119. Byrne, B.M.; Shavelson, R.J.; Muthén, B. Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychol. Bull. 1989, 105, 456–466. [Google Scholar] [CrossRef]
  120. Oliveri, M.E.; von Davier, M. Investigation of model fit and score scale comparability in international assessments. Psych. Test Assess. Model. 2011, 53, 315–333. [Google Scholar]
  121. von Davier, M.; Yamamoto, K.; Shin, H.J.; Chen, H.; Khorramdel, L.; Weeks, J.; Davis, S.; Kong, N.; Kandathil, M. Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assess. Educ. 2019, 26, 466–488. [Google Scholar] [CrossRef]
  122. von Davier, M.; Khorramdel, L.; He, Q.; Shin, H.J.; Chen, H. Developments in psychometric population models for technology-based large-scale assessments: An overview of challenges and opportunities. J. Educ. Behav. Stat. 2019, 44, 671–705. [Google Scholar] [CrossRef]
  123. Fox, J.P. Bayesian Item Response Modeling; Springer: New York, NY, USA, 2010. [Google Scholar] [CrossRef]
  124. Fox, J.P.; Verhagen, A.J. Random item effects modeling for cross-national survey data. In Cross-Cultural Analysis: Methods and Applications; Davidov, E., Schmidt, P., Billiet, J., Eds.; Routledge: London, UK, 2010; pp. 461–482. [Google Scholar] [CrossRef]
  125. Robitzsch, A. Lp loss functions in invariance alignment and Haberman linking with few or many groups. Stats 2020, 3, 246–283. [Google Scholar] [CrossRef]
  126. Sachse, K.A.; Mahler, N.; Pohl, S. When nonresponse mechanisms change: Effects on trends and group comparisons in international large-scale assessments. Educ. Psychol. Meas. 2019, 79, 699–726. [Google Scholar] [CrossRef]
  127. Robitzsch, A. A comparison of linking methods for two groups for the two-parameter logistic item response model in the presence and absence of random differential item functioning. Foundations 2021, 1, 116–144. [Google Scholar] [CrossRef]
  128. Andersson, B. Asymptotic variance of linking coefficient estimators for polytomous IRT models. Appl. Psychol. Meas. 2018, 42, 192–205. [Google Scholar] [CrossRef]
  129. von Davier, M.; Sinharay, S. Analytics in international large-scale assessments: Item response theory and population models. In A Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis; Rutkowski, L., von Davier, M., Rutkowski, D., Eds.; Chapman Hall/CRC Press: London, UK, 2013; pp. 155–174. [Google Scholar] [CrossRef]
  130. Bulut, O.; Quo, Q.; Gierl, M.J. A structural equation modeling approach for examining position effects in large-scale assessments. Large-Scale Assess. Educ. 2017, 5, 8. [Google Scholar] [CrossRef] [Green Version]
  131. Debeer, D.; Janssen, R. Modeling item-position effects within an IRT framework. J. Educ. Meas. 2013, 50, 164–185. [Google Scholar] [CrossRef]
  132. Hartig, J.; Buchholz, J. A multilevel item response model for item position effects and individual persistence. Psych. Test Assess. Model. 2012, 54, 418–431. [Google Scholar]
  133. Nagy, G.; Nagengast, B.; Becker, M.; Rose, N.; Frey, A. Item position effects in a reading comprehension test: An IRT study of individual differences and individual correlates. Psych. Test Assess. Model. 2018, 60, 165–187. [Google Scholar]
  134. Robitzsch, A. Methodische Herausforderungen bei der Kalibrierung von Leistungstests [Methodological challenges in calibrating performance tests]. In Bildungsstandards Deutsch und Mathematik; Bremerich-Vos, A., Granzer, D., Köller, O., Eds.; Beltz Pädagogik: Weinheim, Germany, 2009; pp. 42–106. [Google Scholar]
  135. Rose, N.; Nagy, G.; Nagengast, B.; Frey, A.; Becker, M. Modeling multiple item context effects with generalized linear mixed models. Front. Psychol. 2019, 10, 248. [Google Scholar] [CrossRef]
  136. Trendtel, M.; Robitzsch, A. Modeling item position effects with a Bayesian item response model applied to PISA 2009–2015 data. Psych. Test Assess. Model. 2018, 60, 241–263. [Google Scholar]
  137. Weirich, S.; Hecht, M.; Böhme, K. Modeling item position effects using generalized linear mixed models. Appl. Psychol. Meas. 2014, 38, 535–548. [Google Scholar] [CrossRef]
  138. Lee, W.C.; Lee, G. IRT linking and equating. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 639–673. [Google Scholar] [CrossRef]
  139. Joo, S.H.; Khorramdel, L.; Yamamoto, K.; Shin, H.J.; Robin, F. Evaluating item fit statistic thresholds in PISA: Analysis of cross-country comparability of cognitive items. Educ. Meas. 2021, 40, 37–48. [Google Scholar] [CrossRef]
  140. Lumley, T.; Scott, A. Tests for regression models fitted to survey data. Aust. N. Z. J. Stat. 2014, 56, 1–14. [Google Scholar] [CrossRef]
  141. Lumley, T.; Scott, A. AIC and BIC for modeling with complex survey data. J. Surv. Stat. Methodol. 2015, 3, 1–18. [Google Scholar] [CrossRef]
  142. Trendtel, M.; Robitzsch, A. A Bayesian item response model for examining item position effects in complex survey data. J. Educ. Behav. Stat. 2021, 46, 34–57. [Google Scholar] [CrossRef]
  143. Gilula, Z.; Haberman, S.J. Prediction functions for categorical panel data. Ann. Stat. 1995, 23, 1130–1142. [Google Scholar] [CrossRef]
  144. Haberman, S.J. The Information a Test Provides on an Ability Parameter; (Research Report No. RR-07-18); Educational Testing Service: Princeton, NJ, USA, 2007. [Google Scholar] [CrossRef]
  145. van Rijn, P.W.; Sinharay, S.; Haberman, S.J.; Johnson, M.S. Assessment of fit of item response theory models used in large-scale educational survey assessments. Large-Scale Assess. Educ. 2016, 4, 10. [Google Scholar] [CrossRef] [Green Version]
  146. George, A.C.; Robitzsch, A. Validating theoretical assumptions about reading with cognitive diagnosis models. Int. J. Test. 2021, 21, 105–129. [Google Scholar] [CrossRef]
  147. Ibrahim, J.G.; Zhu, H.; Tang, N. Model selection criteria for missing-data problems using the EM algorithm. J. Am. Stat. Assoc. 2008, 103, 1648–1658. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  148. Kuiper, R.M.; Hoijtink, H. How to handle missing data in regression models using information criteria. Stat. Neerl. 2011, 65, 489–506. [Google Scholar] [CrossRef]
  149. Lai, K. Using information criteria under missing data: Full information maximum likelihood versus two-stage estimation. Struct. Equ. Model. 2021, 28, 278–291. [Google Scholar] [CrossRef]
  150. Shimodaira, H.; Maeda, H. An information criterion for model selection with missing data via complete-data divergence. Ann. Inst. Stat. Math. 2018, 70, 421–438. [Google Scholar] [CrossRef] [Green Version]
  151. Carstensen, C.H.; Prenzel, M.; Baumert, J. Trendanalysen in PISA: Wie haben sich die Kompetenzen in Deutschland zwischen PISA 2000 und PISA 2006 entwickelt? [Trend analyses in PISA: How did competencies in Germany develop between PISA 2000 and PISA 2006?]. In Vertiefende Analysen zu PISA 2006; Prenzel, M., Baumert, J., Eds.; VS Verlag für Sozialwissenschaften: Wiesbaden, Germany, 2008; pp. 11–34. [Google Scholar] [CrossRef]
  152. Carstensen, C.H. Linking PISA Competencies over Three Cycles—Results from Germany. In Research on PISA; Prenzel, M., Kobarg, M., Schöps, K., Rönnebeck, S., Eds.; Springer: Dordrecht, The Netherlands, 2013; pp. 199–213. [Google Scholar] [CrossRef]
  153. Oliveri, M.E.; von Davier, M. Toward increasing fairness in score scale calibrations employed in international large-scale assessments. Int. J. Test. 2014, 14, 1–21. [Google Scholar] [CrossRef]
  154. Wetzel, E.; Carstensen, C.H. Linking PISA 2000 and PISA 2009: Implications of instrument design on measurement invariance. Psych. Test Assess. Model. 2013, 55, 181–206. [Google Scholar]
  155. Schomaker, M.; Heumann, C. Model selection and model averaging after multiple imputation. Comp. Stat. Data An. 2014, 71, 758–770. [Google Scholar] [CrossRef]
  156. Kolenikov, S. Resampling variance estimation for complex survey data. Stata J. 2010, 10, 165–199. [Google Scholar] [CrossRef] [Green Version]
  157. Lu, J.; Wang, C. A response time process model for not-reached and omitted items. J. Educ. Meas. 2020, 57, 584–620. [Google Scholar] [CrossRef]
  158. Ulitzsch, E.; von Davier, M.; Pohl, S. Using response times for joint modeling of response and omission behavior. Multivar. Behav. Res. 2020, 55, 425–453. [Google Scholar] [CrossRef]
  159. Kane, M.T. A sampling model for validity. Appl. Psychol. Meas. 1982, 6, 125–160. [Google Scholar] [CrossRef]
  160. Kane, M.T. Validating the interpretations and uses of test scores. J. Educ. Meas. 2013, 50, 1–73. [Google Scholar] [CrossRef]
  161. Brennan, R.L. Generalizabilty Theory; Springer: New York, NY, USA, 2001. [Google Scholar] [CrossRef]
  162. Frey, A.; Hartig, J. Methodological challenges of international student assessment. In Monitoring Student Achievement in the 21st Century; Harju-Luukkainen, H., McElvany, N., Stang, J., Eds.; Springer: Cham, Switzerland, 2020; pp. 39–49. [Google Scholar] [CrossRef]
  163. Hartig, J.; Frey, A.; Jude, N. Validität von Testwertinterpretationen [Validity of test score interpretations]. In Testtheorie und Fragebogenkonstruktion; Moosbrugger, H., Kelava, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar] [CrossRef]
  164. Leutner, D.; Hartig, J.; Jude, N. Measuring competencies: Introduction to concepts and questions of assessment in education. In Assessment of Competencies in Educational Contexts; Hartig, J., Klieme, E., Leutner, D., Eds.; Hogrefe: Göttingen, Germany, 2008; pp. 177–192. [Google Scholar]
  165. Frey, A.; Spoden, C.; Goldhammer, F.; Wenzel, S.F.C. Response time-based treatment of omitted responses in computer-based testing. Behaviormetrika 2018, 45, 505–526. [Google Scholar] [CrossRef] [Green Version]
  166. Goldhammer, F.; Martens, T.; Lüdtke, O. Conditioning factors of test-taking engagement in PIAAC: An exploratory IRT modelling approach considering person and item characteristics. Large-Scale Assess. Educ. 2017, 5, 18. [Google Scholar] [CrossRef] [Green Version]
  167. Pokropek, A. Grade of membership response time model for detecting guessing behaviors. J. Educ. Behav. Stat. 2016, 41, 300–325. [Google Scholar] [CrossRef]
  168. Schweizer, K.; Krampen, D.; French, B.F. Does rapid guessing prevent the detection of the effect of a time limit in testing? Methodology 2021, 17, 168–188. [Google Scholar] [CrossRef]
  169. Ulitzsch, E.; von Davier, M.; Pohl, S. A hierarchical latent response model for inferences about examinee engagement in terms of guessing and item-level non-response. Brit. J. Math. Stat. Psychol. 2020, 73, 83–112. [Google Scholar] [CrossRef] [Green Version]
  170. Weeks, J.P.; von Davier, M.; Yamamoto, K. Using response time data to inform the coding of omitted responses. Psych. Test Assess. Model. 2016, 58, 671–701. [Google Scholar]
  171. Wise, S.L.; Im, S.; Lee, J. The impact of disengaged test taking on a state’s accountability test results. Educ. Assess. 2021, 26, 163–174. [Google Scholar] [CrossRef]
  172. Rios, J.A.; Deng, J. Does the choice of response time threshold procedure substantially affect inferences concerning the identification and exclusion of rapid guessing responses? A meta-analysis. Large-Scale Assess. Educ. 2021, 9, 18. [Google Scholar] [CrossRef]
  173. Soland, J.; Kuhfeld, M.; Rios, J. Comparing different response time threshold setting methods to detect low effort on a large-scale assessment. Large-Scale Assess. Educ. 2021, 9, 8. [Google Scholar] [CrossRef]
  174. Rutkowski, L.; Rutkowski, D.; Liaw, Y.L. The existence and impact of floor effects for low-performing PISA participants. Assess. Educ. 2019, 26, 643–664. [Google Scholar] [CrossRef]
  175. Rutkowski, D.; Rutkowski, L. Running the wrong race? The case of PISA for development. Comp. Educ. Rev. 2021, 65, 147–165. [Google Scholar] [CrossRef]
  176. Tijmstra, J.; Bolsinova, M.; Liaw, Y.L.; Rutkowski, L.; Rutkowski, D. Sensitivity of the RMSD for detecting item-level misfit in low-performing countries. J. Educ. Meas. 2020, 57, 566–583. [Google Scholar] [CrossRef]
  177. Yamamoto, K.; Khorramdel, L.; Shin, H.J. Introducing multistage adaptive testing into international large-scale assessments designs using the example of PIAAC. Psych. Test Assess. Model. 2018, 60, 347–368. [Google Scholar]
  178. Yamamoto, K.; Shin, H.J.; Khorramdel, L. Multistage adaptive testing design in international large-scale assessments. Educ. Meas. 2018, 37, 16–27. [Google Scholar] [CrossRef]
  179. Adams, R.J.; Lietz, P.; Berezner, A. On the use of rotated context questionnaires in conjunction with multilevel item response models. Large-Scale Assess. Educ. 2013, 1, 5. [Google Scholar] [CrossRef]
  180. Aßmann, C.; Gaasch, C.; Pohl, S.; Carstensen, C.H. Bayesian estimation in IRT models with missing values in background variables. Psych. Test Assess. Model. 2015, 57, 595–618. [Google Scholar]
  181. Bouhlila, D.S.; Sellaouti, F. Multiple imputation using chained equations for missing data in TIMSS: A case study. Large-Scale Assess. Educ. 2013, 1, 4. [Google Scholar] [CrossRef] [Green Version]
  182. Kaplan, D.; Su, D. On imputation for planned missing data in context questionnaires using plausible values: A comparison of three designs. Large-Scale Assess. Educ. 2018, 6, 6. [Google Scholar] [CrossRef] [Green Version]
  183. Rutkowski, L. The impact of missing background data on subpopulation estimation. J. Educ. Meas. 2011, 48, 293–312. [Google Scholar] [CrossRef]
  184. von Davier, M. Imputing proficiency data under planned missingness in population models. In A Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis; Rutkowski, L., von Davier, M., Rutkowski, D., Eds.; Chapman Hall/CRC Press: London, UK, 2013; pp. 175–201. [Google Scholar] [CrossRef]
  185. Reckase, M. A Tale of Two Models: Sources of Confusion in Achievement Testing; (Research Report No. RR-17-44); Educational Testing Service: Princeton, NJ, USA, 2017. [Google Scholar] [CrossRef] [Green Version]
  186. Athey, S.; Imbens, G. A measure of robustness to misspecification. Am. Econ. Rev. 2015, 105, 476–480. [Google Scholar] [CrossRef] [Green Version]
  187. Buckland, S.T.; Burnham, K.P.; Augustin, N.H. Model selection: An integral part of inference. Biometrics 1997, 53, 603–618. [Google Scholar] [CrossRef]
  188. Longford, N.T. ‘Which model?’ is the wrong question. Stat. Neerl. 2012, 66, 237–252. [Google Scholar] [CrossRef]
  189. Siddique, J.; Harel, O.; Crespi, C.M. Addressing missing data mechanism uncertainty using multiple-model multiple imputation: Application to a longitudinal clinical trial. Ann. Appl. Stat. 2012, 6, 1814–1837. [Google Scholar] [CrossRef] [Green Version]
  190. Young, C. Model uncertainty in sociological research: An application to religion and economic growth. Am. Sociol. Rev. 2009, 74, 380–397. [Google Scholar] [CrossRef] [Green Version]
  191. Young, C.; Holsteen, K. Model uncertainty and robustness: A computational framework for multimodel analysis. Sociol. Methods Res. 2017, 46, 3–40.
  192. Robitzsch, A.; Dörfler, T.; Pfost, M.; Artelt, C. Die Bedeutung der Itemauswahl und der Modellwahl für die längsschnittliche Erfassung von Kompetenzen [Relevance of item selection and model selection for assessing the development of competencies: The development in reading competence in primary school students]. Z. Entwicklungspsychol. Pädagog. Psychol. 2011, 43, 213–227.
  193. Saltelli, A.; Ratto, M.; Andres, T.; Campolongo, F.; Cariboni, J.; Gatelli, D.; Saisana, M.; Tarantola, S. Global Sensitivity Analysis: The Primer; Wiley: New York, NY, USA, 2008.
  194. Harder, J.A. The multiverse of methods: Extending the multiverse analysis to address data-collection decisions. Perspect. Psychol. Sci. 2020, 15, 1158–1177.
  195. Steegen, S.; Tuerlinckx, F.; Gelman, A.; Vanpaemel, W. Increasing transparency through a multiverse analysis. Perspect. Psychol. Sci. 2016, 11, 702–712.
  196. Simonsohn, U.; Simmons, J.P.; Nelson, L.D. Specification curve: Descriptive and inferential statistics on all reasonable specifications. SSRN Electron. J. 2015, 2694998.
  197. Simonsohn, U.; Simmons, J.P.; Nelson, L.D. Specification curve analysis. Nat. Hum. Behav. 2020, 4, 1208–1214.
  198. Rutkowski, D.; Delandshere, G. Causal inferences with large scale assessment data: Using a validity framework. Large-Scale Assess. Educ. 2016, 4, 6.
  199. Rutkowski, D.; Thompson, G.; Rutkowski, L. Understanding the policy influence of international large-scale assessments in education. In Reliability and Validity of International Large-Scale Assessment: Understanding IEA’s Comparative Studies of Student Achievement; Wagemaker, H., Ed.; Springer: Cham, Switzerland, 2020.
  200. Brock, W.A.; Durlauf, S.N.; West, K.D. Model uncertainty and policy evaluation: Some theory and empirics. J. Econom. 2007, 136, 629–664.
  201. Brock, W.A.; Durlauf, S.N. On sturdy policy evaluation. J. Leg. Stud. 2015, 44, S447–S473.
Figure 1. Overview of different statistical models for the treatment of missing item responses. The abbreviations of the different modeling strategies (“U”, “M” and “I”) are printed in red.
Figure 2. Frequency distribution of missing item responses (left panel) and not reached items at the student level (right panel).
Table 1. Bias for the mean and the standard deviation for different missing data treatments as a function of the missing proportion and the missingness parameter δ .
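| Missing data | Model | Mean, δ = −10 | Mean, δ = −3 | Mean, δ = −2 | Mean, δ = −1 | Mean, δ = 0 | SD, δ = −10 | SD, δ = −3 | SD, δ = −2 | SD, δ = −1 | SD, δ = 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5% | CD | 0.002 | 0.006 | 0.006 | 0.007 | 0.004 | −0.005 | −0.008 | −0.009 | −0.010 | −0.007 |
| 5% | MM1 | 0.005 | 0.005 | 0.006 | 0.008 | 0.007 | −0.007 | −0.007 | −0.009 | −0.010 | −0.010 |
| 5% | UW | 0.002 | −0.005 | −0.022 | −0.041 | −0.065 | −0.005 | −0.004 | 0.003 | 0.006 | 0.005 |
| 5% | UO | 0.090 | 0.084 | 0.081 | 0.058 | 0.021 | −0.040 | −0.036 | −0.039 | −0.027 | −0.015 |
| 5% | MO2 | 0.085 | 0.077 | 0.071 | 0.044 | 0.005 | −0.037 | −0.032 | −0.033 | −0.021 | −0.009 |
| 5% | IF1 | 0.090 | 0.086 | 0.082 | 0.058 | 0.022 | −0.039 | −0.036 | −0.039 | −0.026 | −0.014 |
| 5% | IF2 | 0.088 | 0.082 | 0.078 | 0.052 | 0.008 | −0.037 | −0.034 | −0.037 | −0.025 | −0.009 |
| 10% | CD | 0.005 | 0.006 | 0.005 | 0.005 | 0.008 | −0.009 | −0.009 | −0.008 | −0.008 | −0.011 |
| 10% | MM1 | 0.009 | 0.008 | 0.009 | 0.006 | 0.002 | −0.011 | −0.010 | −0.012 | −0.008 | −0.007 |
| 10% | UW | 0.005 | −0.022 | −0.049 | −0.083 | −0.139 | −0.009 | 0.000 | 0.004 | 0.005 | 0.015 |
| 10% | UO | 0.159 | 0.136 | 0.113 | 0.090 | 0.037 | −0.090 | −0.064 | −0.047 | −0.039 | −0.023 |
| 10% | MO2 | 0.149 | 0.123 | 0.103 | 0.075 | 0.006 | −0.079 | −0.052 | −0.040 | −0.035 | −0.010 |
| 10% | IF1 | 0.160 | 0.139 | 0.116 | 0.092 | 0.038 | −0.089 | −0.065 | −0.047 | −0.040 | −0.022 |
| 10% | IF2 | 0.152 | 0.132 | 0.109 | 0.083 | 0.012 | −0.080 | −0.057 | −0.042 | −0.038 | −0.014 |
| 20% | CD | 0.004 | 0.005 | 0.002 | 0.004 | 0.004 | −0.007 | −0.009 | −0.005 | −0.008 | −0.006 |
| 20% | MM1 | 0.018 | 0.005 | 0.005 | 0.008 | 0.007 | −0.017 | −0.009 | −0.009 | −0.012 | −0.011 |
| 20% | UW | 0.004 | −0.072 | −0.129 | −0.198 | −0.268 | −0.006 | 0.005 | 0.014 | 0.019 | 0.022 |
| 20% | UO | 0.203 | 0.211 | 0.183 | 0.144 | 0.064 | −0.148 | −0.129 | −0.095 | −0.073 | −0.038 |
| 20% | MO2 | 0.203 | 0.210 | 0.175 | 0.115 | 0.005 | −0.146 | −0.126 | −0.088 | −0.053 | −0.007 |
| 20% | IF1 | 0.208 | 0.214 | 0.183 | 0.148 | 0.063 | −0.147 | −0.129 | −0.091 | −0.073 | −0.033 |
| 20% | IF2 | 0.212 | 0.211 | 0.183 | 0.126 | 0.010 | −0.152 | −0.121 | −0.089 | −0.059 | −0.011 |
| 30% | CD | 0.008 | 0.006 | 0.004 | 0.004 | 0.006 | −0.010 | −0.008 | −0.008 | −0.009 | −0.011 |
| 30% | MM1 | 0.054 | 0.008 | 0.008 | 0.010 | −0.005 | −0.122 | −0.012 | −0.011 | −0.013 | −0.005 |
| 30% | UW | 0.006 | −0.159 | −0.225 | −0.298 | −0.363 | −0.009 | 0.021 | 0.018 | 0.014 | 0.002 |
| 30% | UO | 0.198 | 0.238 | 0.226 | 0.179 | 0.070 | −0.211 | −0.165 | −0.132 | −0.094 | −0.042 |
| 30% | MO2 | 0.192 | 0.239 | 0.228 | 0.159 | 0.001 | −0.213 | −0.165 | −0.133 | −0.083 | −0.008 |
| 30% | IF1 | 0.208 | 0.244 | 0.231 | 0.183 | 0.074 | −0.210 | −0.165 | −0.134 | −0.092 | −0.039 |
| 30% | IF2 | 0.202 | 0.247 | 0.233 | 0.168 | 0.010 | −0.211 | −0.166 | −0.130 | −0.086 | −0.013 |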
Note. CD = complete-data analysis; UW = scoring as wrong (Section 2.1); MM1 = Mislevy-Wu model with common δ parameter (Section 2.5, Equation (14)); UO = ignoring missing item responses (Section 2.3); MO2 = model-based latent ignorability (Section 2.4, Equations (10) and (11)); IF1 = FCS imputation based on item responses (Section 2.6); IF2 = FCS imputation based on item responses and response indicators (Section 2.6); Absolute bias values larger than 0.03 are printed in bold.
Table 2. Descriptive statistics of the PISA 2018 mathematics sample.
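| Country | N | I | N_item | M_OECD | SD_OECD | M_stand | %NA | %NR | %NA_CR | %NA_MC |
|---|---|---|---|---|---|---|---|---|---|---|
| ALB | 2609 | 69 | 787.0 | 438.0 | 83.4 | 446.0 | 8.0 | 1.9 | 11.4 | 2.6 |
| AUS | 7705 | 70 | 2367.1 | 491.7 | 92.9 | 501.8 | 7.3 | 2.4 | 10.3 | 2.5 |
| AUT | 3731 | 70 | 1133.7 | 499.1 | 92.7 | 509.6 | 8.4 | 1.8 | 12.5 | 2.0 |
| BEL | 4696 | 70 | 1393.0 | 507.8 | 95.6 | 518.6 | 8.3 | 2.6 | 11.9 | 2.5 |
| BIH | 3512 | 70 | 1071.0 | 406.5 | 82.0 | 413.1 | 18.8 | 3.9 | 27.9 | 4.2 |
| BLR | 3141 | 70 | 967.8 | 470.7 | 92.4 | 480.0 | 7.8 | 2.4 | 11.4 | 2.1 |
| BRN | 2812 | 69 | 845.0 | 430.6 | 91.3 | 438.2 | 6.1 | 1.7 | 8.8 | 1.8 |
| CAN | 9782 | 70 | 2786.3 | 511.7 | 92.4 | 522.7 | 5.8 | 2.2 | 8.2 | 2.1 |
| CHE | 3141 | 70 | 964.5 | 514.5 | 93.4 | 525.6 | 8.2 | 2.5 | 11.9 | 2.2 |
| CZE | 3798 | 70 | 1164.0 | 498.5 | 93.4 | 509.0 | 9.2 | 2.0 | 13.9 | 1.9 |
| DEU | 3000 | 70 | 908.6 | 499.0 | 95.9 | 509.5 | 9.6 | 2.5 | 14.0 | 2.4 |
| DNK | 4354 | 70 | 1250.1 | 510.7 | 81.3 | 521.7 | 5.9 | 2.0 | 8.6 | 1.7 |
| ESP | 14,768 | 70 | 4408.3 | 481.7 | 88.3 | 491.5 | 10.6 | 2.9 | 15.5 | 2.7 |
| EST | 2880 | 70 | 890.9 | 523.8 | 81.6 | 535.3 | 6.6 | 2.0 | 9.6 | 1.8 |
| FIN | 3056 | 70 | 935.0 | 505.7 | 83.3 | 516.4 | 8.9 | 3.0 | 12.8 | 2.7 |
| FRA | 3405 | 70 | 1046.8 | 495.5 | 92.2 | 505.8 | 10.1 | 3.0 | 14.8 | 2.6 |
| GBR | 7063 | 70 | 2174.0 | 502.1 | 92.9 | 512.7 | 8.2 | 2.5 | 11.8 | 2.5 |
| GRC | 2634 | 70 | 790.4 | 451.1 | 89.5 | 459.6 | 10.7 | 2.7 | 15.7 | 2.6 |
| HKG | 2484 | 70 | 748.0 | 551.0 | 92.5 | 563.5 | 3.9 | 0.8 | 5.8 | 0.8 |
| HRV | 2683 | 70 | 805.2 | 464.5 | 87.1 | 473.6 | 11.8 | 2.7 | 17.6 | 2.5 |
| HUN | 2785 | 70 | 857.4 | 482.3 | 91.2 | 492.0 | 8.6 | 2.0 | 13.0 | 1.7 |
| IRL | 3031 | 70 | 935.5 | 500.2 | 78.1 | 510.7 | 5.8 | 1.3 | 8.7 | 1.2 |
| ISL | 1807 | 70 | 545.1 | 493.8 | 90.8 | 504.1 | 9.7 | 4.4 | 12.9 | 4.5 |
| ISR | 2825 | 70 | 846.6 | 464.0 | 107.5 | 473.0 | 12.1 | 4.5 | 16.9 | 4.5 |
| ITA | 6401 | 70 | 1978.9 | 485.9 | 94.0 | 495.8 | 12.4 | 2.8 | 18.9 | 2.1 |
| JPN | 3302 | 70 | 1018.6 | 527.4 | 87.1 | 539.1 | 8.4 | 1.9 | 12.9 | 1.4 |
| KOR | 2741 | 70 | 823.1 | 525.9 | 100.4 | 537.5 | 6.4 | 1.7 | 9.4 | 1.6 |
| LTU | 2824 | 70 | 846.3 | 480.1 | 90.0 | 489.8 | 7.4 | 1.5 | 11.4 | 1.1 |
| LUX | 2827 | 70 | 872.0 | 481.3 | 98.6 | 491.0 | 10.4 | 2.8 | 15.3 | 2.7 |
| LVA | 2190 | 70 | 656.4 | 498.5 | 80.5 | 509.0 | 6.4 | 1.7 | 9.7 | 1.1 |
| MLT | 1383 | 69 | 415.8 | 469.5 | 101.6 | 478.8 | 9.8 | 3.9 | 13.5 | 3.6 |
| MNE | 3595 | 70 | 1109.7 | 430.8 | 83.0 | 438.4 | 17.3 | 3.8 | 25.9 | 3.5 |
| MYS | 3284 | 70 | 1000.8 | 440.2 | 82.0 | 448.2 | 1.2 | 0.6 | 1.5 | 0.7 |
| NLD | 2939 | 70 | 742.6 | 518.4 | 92.9 | 529.7 | 4.4 | 1.1 | 6.7 | 0.9 |
| NOR | 3141 | 70 | 969.5 | 502.4 | 90.3 | 513.0 | 10.7 | 3.7 | 15.1 | 3.7 |
| NZL | 3309 | 70 | 1021.2 | 495.6 | 93.0 | 506.0 | 8.1 | 2.2 | 11.7 | 2.3 |
| POL | 3022 | 70 | 932.6 | 515.8 | 90.5 | 526.9 | 7.1 | 1.9 | 10.7 | 1.3 |
| PRT | 3202 | 70 | 987.6 | 493.1 | 96.2 | 503.3 | 10.6 | 2.8 | 15.8 | 2.3 |
| RUS | 3131 | 70 | 939.3 | 487.8 | 87.4 | 497.8 | 7.9 | 2.2 | 11.6 | 2.1 |
| SGP | 2732 | 70 | 822.3 | 570.3 | 93.3 | 583.6 | 2.7 | 0.8 | 3.8 | 0.8 |
| SVK | 2514 | 70 | 727.9 | 484.6 | 100.1 | 494.5 | 8.0 | 1.8 | 11.9 | 1.7 |
| SVN | 3519 | 70 | 1054.7 | 509.5 | 88.7 | 520.4 | 7.1 | 1.5 | 10.7 | 1.4 |
| SWE | 2982 | 70 | 918.7 | 502.8 | 90.3 | 513.4 | 12.7 | 5.9 | 17.3 | 5.4 |
| TUR | 3723 | 70 | 1147.8 | 453.4 | 87.4 | 462.0 | 6.7 | 1.6 | 9.7 | 1.8 |
| USA | 2629 | 70 | 804.9 | 478.0 | 92.4 | 487.6 | 4.0 | 2.0 | 5.2 | 1.9 |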
Note. N = number of students; I = number of items; N_item = average number of students per item; M_OECD = officially reported country mean by OECD [16]; SD_OECD = officially reported country standard deviation by OECD [16]; M_stand = standardized country mean (M = 500 and SD = 100 in total population); %NA = proportion of item responses with missing data; %NR = proportion of item responses that are not reached; %NA_CR = proportion of constructed-response item responses with missing data; %NA_MC = proportion of multiple-choice item responses with missing data; Missing item response rates larger than 10.0% and smaller than 5.0% are printed in bold. Missing rates for not reached responses larger than 3.0% are printed in bold. See Appendix B for country labels.
Table 3. Overview of 19 specified scaling models for the treatment of missing item responses in the PISA 2018 mathematics case study.
| Model | Ref. | Description |
|---|---|---|
| UW | Section 2.1 | response indicators unmodeled: scoring as wrong |
| MW | Section 2.5 | model-based treatment: treatment as wrong in the Mislevy-Wu model by setting δ_i = −10 in Equation (14) |
| IW | Section 2.5 | imputation-based treatment on the IRT model MW: imputation as wrong based on the Mislevy-Wu model and setting δ_i = −10 in Equation (14) |
| UP | Section 2.2 | response indicators unmodeled: multiple-choice items scored as partially correct |
| IP | Section 2.2 | imputation-based treatment: missing multiple-choice items imputed as correct with probability 1/K_i, where K_i is the number of response alternatives |
| UN1 | Section 4.2.1 | response indicators unmodeled: not reached items ignored in the scaling model |
| UN2 | Section 4.2.1 | response indicators unmodeled: proportion of not reached items included as a predictor in the latent background model |
| UO1 | Section 2.3 | response indicators unmodeled: missing item responses ignored in the scaling model |
| MO1 | Section 2.4 | model-based treatment: model-based ignorability specified as the Mislevy-Wu model with δ_i = 0 and Cor(θ, ξ) = 0 |
| IO1 | Section 2.4.1 | imputation-based treatment on the IRT model MO1 |
| UO2 | Section 2.3 and Section 4.2.1 | response indicators unmodeled: including proportion of missing item responses in the latent background model |
| MO2 | Section 2.4 | model-based treatment: model-based latent ignorability specified as the Mislevy-Wu model with δ_i = 0 |
| IO2 | Section 2.4.1 | imputation-based treatment on the IRT model MO2 |
| MM1 | Section 2.5 | model-based treatment: Mislevy-Wu model with common δ_i parameter |
| IM1 | Section 2.4.1 | imputation-based treatment on the IRT model MM1 |
| MM2 | Section 2.5 | model-based treatment: Mislevy-Wu model with item-format-specific δ_i parameter |
| IM2 | Section 2.4.1 | imputation-based treatment on the IRT model MM2 |
| IF1 | Section 2.6 and Section 4.2.2 | imputation-based treatment on fully conditional specification: using predictive mean matching for item responses X_p separately for each test booklet |
| IF2 | Section 2.6 and Section 4.2.2 | imputation-based treatment on fully conditional specification: using predictive mean matching for item responses X_p and response indicators R_p separately for each test booklet |
Note. Ref. = reference in this article.
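The two guessing-based treatments of multiple-choice items in Table 3 (UP and IP) can be illustrated with a minimal Python sketch. It assumes responses are coded 0/1 with NaN for missing values, and that the partial score in UP equals the chance level 1/K_i (Table 3 states the 1/K_i probability only for IP). The function names `score_partially_correct` and `impute_by_guessing` are hypothetical and are not taken from the article or any package.

```python
import numpy as np


def score_partially_correct(resp, n_alternatives):
    """UP-style scoring (assumed): a missing MC response receives the chance score 1/K_i."""
    resp = np.asarray(resp, dtype=float)
    chance = 1.0 / np.asarray(n_alternatives, dtype=float)
    # The per-item chance score is broadcast across persons (rows).
    return np.where(np.isnan(resp), chance, resp)


def impute_by_guessing(resp, n_alternatives, rng):
    """IP-style imputation: a missing MC response is drawn as correct with probability 1/K_i."""
    resp = np.asarray(resp, dtype=float)
    chance = 1.0 / np.asarray(n_alternatives, dtype=float)
    draws = (rng.random(resp.shape) < chance).astype(float)
    # Observed responses are kept; only missing entries are replaced by the draws.
    return np.where(np.isnan(resp), draws, resp)


# Toy data: 3 persons, 2 multiple-choice items with K = 4 and K = 5 alternatives.
rng = np.random.default_rng(seed=1)
resp = np.array([[1.0, np.nan], [np.nan, 0.0], [0.0, 1.0]])
k = [4, 5]

scored = score_partially_correct(resp, k)
imputed = impute_by_guessing(resp, k, rng)
print(scored)
print(imputed)
```

For an imputation-based analysis, the random draw would typically be repeated to obtain several imputed datasets; the sketch shows a single draw only.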
Table 4. Average absolute differences in country means of different treatments of missing item responses.
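| Model | UW | MW | IW | UP | IP | UN1 | UN2 | UO1 | MO1 | IO1 | MO2 | UO2 | IO2 | MM1 | IM1 | MM2 | IM2 | IF1 | IF2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UW | – | 0.3 | 0.0 | 0.7 | 0.8 | 1.9 | 1.7 | 3.0 | 3.0 | 3.0 | 2.6 | 2.8 | 2.6 | 1.4 | 1.5 | 1.6 | 1.5 | 3.0 | 2.8 |
| MW | 0.3 | – | 0.3 | 0.9 | 0.9 | 2.0 | 1.7 | 3.0 | 3.0 | 3.0 | 2.7 | 2.8 | 2.6 | 1.4 | 1.6 | 1.6 | 1.6 | 3.1 | 2.8 |
| IW | 0.0 | 0.3 | – | 0.7 | 0.8 | 1.9 | 1.7 | 3.0 | 3.0 | 3.0 | 2.6 | 2.8 | 2.6 | 1.4 | 1.5 | 1.6 | 1.5 | 3.0 | 2.8 |
| UP | 0.7 | 0.9 | 0.7 | – | 0.3 | 1.4 | 1.5 | 2.7 | 2.7 | 2.7 | 2.4 | 2.4 | 2.3 | 1.1 | 1.2 | 1.3 | 1.2 | 2.7 | 2.6 |
| IP | 0.8 | 0.9 | 0.8 | 0.3 | – | 1.5 | 1.5 | 2.7 | 2.7 | 2.7 | 2.4 | 2.4 | 2.3 | 1.2 | 1.2 | 1.4 | 1.3 | 2.7 | 2.6 |
| UN1 | 1.9 | 2.0 | 1.9 | 1.4 | 1.5 | – | 1.0 | 2.1 | 2.1 | 2.1 | 2.0 | 1.8 | 1.9 | 1.0 | 1.0 | 0.9 | 0.9 | 2.2 | 2.6 |
| UN2 | 1.7 | 1.7 | 1.7 | 1.5 | 1.5 | 1.0 | – | 2.4 | 2.5 | 2.5 | 2.0 | 2.2 | 2.0 | 1.4 | 1.4 | 1.2 | 1.3 | 2.6 | 2.7 |
| UO1 | 3.0 | 3.0 | 3.0 | 2.7 | 2.7 | 2.1 | 2.4 | – | 0.0 | 0.2 | 0.7 | 0.3 | 0.6 | 2.5 | 2.5 | 2.2 | 2.3 | 0.7 | 1.9 |
| MO1 | 3.0 | 3.0 | 3.0 | 2.7 | 2.7 | 2.1 | 2.5 | 0.0 | – | 0.2 | 0.7 | 0.4 | 0.7 | 2.5 | 2.5 | 2.2 | 2.3 | 0.7 | 1.9 |
| IO1 | 3.0 | 3.0 | 3.0 | 2.7 | 2.7 | 2.1 | 2.5 | 0.2 | 0.2 | – | 0.7 | 0.4 | 0.7 | 2.6 | 2.5 | 2.3 | 2.4 | 0.8 | 1.9 |
| MO2 | 2.6 | 2.7 | 2.6 | 2.4 | 2.4 | 2.0 | 2.0 | 0.7 | 0.7 | 0.7 | – | 0.6 | 0.4 | 2.2 | 2.3 | 1.8 | 2.0 | 1.0 | 1.8 |
| UO2 | 2.8 | 2.8 | 2.8 | 2.4 | 2.4 | 1.8 | 2.2 | 0.3 | 0.4 | 0.4 | 0.6 | – | 0.5 | 2.3 | 2.2 | 2.0 | 2.1 | 0.8 | 1.8 |
| IO2 | 2.6 | 2.6 | 2.6 | 2.3 | 2.3 | 1.9 | 2.0 | 0.6 | 0.7 | 0.7 | 0.4 | 0.5 | – | 2.2 | 2.2 | 1.8 | 2.0 | 1.0 | 1.8 |
| MM1 | 1.4 | 1.4 | 1.4 | 1.1 | 1.2 | 1.0 | 1.4 | 2.5 | 2.5 | 2.6 | 2.2 | 2.3 | 2.2 | – | 0.4 | 0.6 | 0.5 | 2.6 | 2.7 |
| IM1 | 1.5 | 1.6 | 1.5 | 1.2 | 1.2 | 1.0 | 1.4 | 2.5 | 2.5 | 2.5 | 2.3 | 2.2 | 2.2 | 0.4 | – | 0.8 | 0.7 | 2.6 | 2.6 |
| MM2 | 1.6 | 1.6 | 1.6 | 1.3 | 1.4 | 0.9 | 1.2 | 2.2 | 2.2 | 2.3 | 1.8 | 2.0 | 1.8 | 0.6 | 0.8 | – | 0.4 | 2.3 | 2.5 |
| IM2 | 1.5 | 1.6 | 1.5 | 1.2 | 1.3 | 0.9 | 1.3 | 2.3 | 2.3 | 2.4 | 2.0 | 2.1 | 2.0 | 0.5 | 0.7 | 0.4 | – | 2.4 | 2.5 |
| IF1 | 3.0 | 3.1 | 3.0 | 2.7 | 2.7 | 2.2 | 2.6 | 0.7 | 0.7 | 0.8 | 1.0 | 0.8 | 1.0 | 2.6 | 2.6 | 2.3 | 2.4 | – | 1.9 |
| IF2 | 2.8 | 2.8 | 2.8 | 2.6 | 2.6 | 2.6 | 2.7 | 1.9 | 1.9 | 1.9 | 1.8 | 1.8 | 1.8 | 2.7 | 2.6 | 2.5 | 2.5 | 1.9 | – |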
Note. Mean absolute differences smaller than or equal to 1.0 are printed in bold.
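The entries of Table 4 are averages, across countries, of the absolute difference between the country means of two scaling models. A minimal Python sketch of this computation follows. It uses only three countries and three models, with values copied from Table 7, so its output is illustrative and will not reproduce the Table 4 entries, which average over all 45 countries. The variable names are hypothetical.

```python
import numpy as np

models = ["UW", "UO1", "MM2"]
countries = ["SGP", "HKG", "NLD"]

# Country means copied from Table 7 (rows: countries, columns: models).
means = np.array([
    [568.0, 567.7, 567.3],
    [550.1, 548.2, 548.3],
    [531.6, 530.9, 531.2],
])

# Absolute difference for every pair of models within each country,
# giving an array of shape (countries, models, models).
abs_diff = np.abs(means[:, :, None] - means[:, None, :])

# Average over countries to obtain the symmetric model-by-model matrix.
mad = abs_diff.mean(axis=0)

for name, row in zip(models, mad):
    print(name, np.round(row, 2))
```

The resulting matrix is symmetric with zeros on the diagonal, which is why Table 4 leaves the diagonal empty.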
Table 5. Model comparisons based on the Bayesian information criterion (BIC) and the Gilula–Haberman penalty (GHP).
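| Country | BIC MW | BIC MO1 | BIC MO2 | BIC MM1 | BIC MM2 | GHP MW | GHP MO1 | GHP MO2 | GHP MM1 | GHP MM2 | GHP Diff |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ALB | 63663 | 63754 | 63600 | 63579 | 63586 | 0.6423 | 0.6433 | 0.6416 | 0.6414 | 0.6414 | 0.0003 |
| AUS | 193304 | 194008 | 193316 | 193145 | 193105 | 0.6321 | 0.6344 | 0.6321 | 0.6315 | 0.6314 | 0.0007 |
| AUT | 97019 | 97685 | 97174 | 97007 | 96993 | 0.6618 | 0.6664 | 0.6628 | 0.6616 | 0.6615 | 0.0013 |
| BEL | 118264 | 119131 | 118426 | 118236 | 118186 | 0.6665 | 0.6715 | 0.6675 | 0.6664 | 0.6660 | 0.0014 |
| BIH | 98447 | 98779 | 98534 | 98371 | 98359 | 0.7101 | 0.7125 | 0.7107 | 0.7095 | 0.7093 | 0.0014 |
| BLR | 82460 | 82729 | 82564 | 82455 | 82396 | 0.6509 | 0.6531 | 0.6517 | 0.6508 | 0.6503 | 0.0014 |
| BRN | 62751 | 62864 | 62756 | 62715 | 62719 | 0.5925 | 0.5936 | 0.5925 | 0.5921 | 0.5921 | 0.0005 |
| CAN | 213551 | 214215 | 213549 | 213382 | 213268 | 0.6316 | 0.6336 | 0.6316 | 0.6311 | 0.6307 | 0.0009 |
| CHE | 84792 | 85329 | 84940 | 84777 | 84743 | 0.6724 | 0.6768 | 0.6736 | 0.6722 | 0.6719 | 0.0017 |
| CZE | 102441 | 102838 | 102508 | 102382 | 102301 | 0.6780 | 0.6807 | 0.6784 | 0.6776 | 0.6770 | 0.0015 |
| DEU | 79134 | 79714 | 79219 | 79118 | 79102 | 0.6729 | 0.6779 | 0.6736 | 0.6727 | 0.6725 | 0.0011 |
| DNK | 97368 | 97632 | 97328 | 97270 | 97277 | 0.6232 | 0.6249 | 0.6229 | 0.6225 | 0.6225 | 0.0004 |
| ESP | 377203 | 378998 | 377528 | 377027 | 376832 | 0.6844 | 0.6877 | 0.6850 | 0.6841 | 0.6837 | 0.0013 |
| EST | 74697 | 74921 | 74716 | 74639 | 74623 | 0.6384 | 0.6404 | 0.6386 | 0.6379 | 0.6377 | 0.0009 |
| FIN | 80421 | 80504 | 80386 | 80315 | 80228 | 0.6602 | 0.6609 | 0.6599 | 0.6592 | 0.6585 | 0.0014 |
| FRA | 92877 | 93593 | 93019 | 92868 | 92833 | 0.6820 | 0.6874 | 0.6830 | 0.6819 | 0.6816 | 0.0015 |
| GBR | 181680 | 182770 | 181704 | 181518 | 181471 | 0.6457 | 0.6496 | 0.6458 | 0.6451 | 0.6449 | 0.0009 |
| GRC | 68339 | 68606 | 68485 | 68317 | 68269 | 0.6814 | 0.6841 | 0.6829 | 0.6811 | 0.6805 | 0.0023 |
| HKG | 57050 | 57459 | 57113 | 57054 | 57048 | 0.5965 | 0.6009 | 0.5972 | 0.5965 | 0.5964 | 0.0008 |
| HRV | 70685 | 71044 | 70791 | 70679 | 70669 | 0.6927 | 0.6963 | 0.6937 | 0.6926 | 0.6924 | 0.0013 |
| HUN | 72125 | 72492 | 72187 | 72080 | 72060 | 0.6437 | 0.6470 | 0.6442 | 0.6432 | 0.6430 | 0.0013 |
| IRL | 77409 | 77712 | 77432 | 77381 | 77369 | 0.6323 | 0.6349 | 0.6325 | 0.6320 | 0.6319 | 0.0006 |
| ISL | 48098 | 48071 | 48043 | 48006 | 47965 | 0.6782 | 0.6779 | 0.6774 | 0.6768 | 0.6761 | 0.0013 |
| ISR | 62551 | 62964 | 62675 | 62531 | 62520 | 0.6771 | 0.6817 | 0.6785 | 0.6768 | 0.6766 | 0.0018 |
| ITA | 179041 | 180275 | 179253 | 178956 | 178914 | 0.6951 | 0.6999 | 0.6959 | 0.6947 | 0.6945 | 0.0014 |
| JPN | 87938 | 88375 | 87998 | 87917 | 87858 | 0.6606 | 0.6639 | 0.6610 | 0.6604 | 0.6599 | 0.0012 |
| KOR | 65114 | 65613 | 65110 | 65067 | 65066 | 0.6229 | 0.6278 | 0.6229 | 0.6224 | 0.6223 | 0.0005 |
| LTU | 68816 | 69098 | 68893 | 68797 | 68788 | 0.6411 | 0.6439 | 0.6419 | 0.6409 | 0.6408 | 0.0011 |
| LUX | 79066 | 79552 | 79236 | 79051 | 79033 | 0.6933 | 0.6976 | 0.6948 | 0.6931 | 0.6929 | 0.0019 |
| LVA | 53764 | 53922 | 53754 | 53731 | 53728 | 0.6441 | 0.6461 | 0.6439 | 0.6436 | 0.6435 | 0.0005 |
| MLT | 33418 | 33625 | 33404 | 33370 | 33371 | 0.6325 | 0.6367 | 0.6323 | 0.6315 | 0.6314 | 0.0008 |
| MNE | 103907 | 104412 | 104044 | 103857 | 103833 | 0.7174 | 0.7210 | 0.7183 | 0.7170 | 0.7168 | 0.0016 |
| MYS | 66244 | 66271 | 66256 | 66246 | 66253 | 0.5042 | 0.5045 | 0.5043 | 0.5042 | 0.5042 | 0.0001 |
| NLD | 50077 | 50286 | 50125 | 50063 | 50055 | 0.5869 | 0.5895 | 0.5875 | 0.5867 | 0.5865 | 0.0010 |
| NOR | 86955 | 87260 | 87005 | 86842 | 86802 | 0.6859 | 0.6884 | 0.6863 | 0.6850 | 0.6846 | 0.0017 |
| NZL | 87003 | 87519 | 87077 | 86965 | 86951 | 0.6514 | 0.6554 | 0.6520 | 0.6511 | 0.6509 | 0.0010 |
| POL | 78675 | 78987 | 78675 | 78616 | 78599 | 0.6441 | 0.6468 | 0.6441 | 0.6436 | 0.6434 | 0.0007 |
| PRT | 89473 | 89900 | 89627 | 89457 | 89322 | 0.6933 | 0.6967 | 0.6945 | 0.6931 | 0.6920 | 0.0025 |
| RUS | 78318 | 78563 | 78384 | 78290 | 78262 | 0.6588 | 0.6610 | 0.6594 | 0.6586 | 0.6583 | 0.0011 |
| SGP | 58480 | 58724 | 58515 | 58466 | 58466 | 0.5576 | 0.5600 | 0.5579 | 0.5574 | 0.5573 | 0.0006 |
| SVK | 59699 | 59958 | 59788 | 59692 | 59671 | 0.6593 | 0.6622 | 0.6602 | 0.6591 | 0.6588 | 0.0014 |
| SVN | 88287 | 88818 | 88451 | 88292 | 88245 | 0.6518 | 0.6558 | 0.6530 | 0.6518 | 0.6514 | 0.0016 |
| SWE | 86292 | 86416 | 86272 | 86145 | 86037 | 0.7188 | 0.7199 | 0.7187 | 0.7175 | 0.7166 | 0.0021 |
| TUR | 96064 | 96326 | 96230 | 96041 | 96032 | 0.6412 | 0.6430 | 0.6423 | 0.6410 | 0.6409 | 0.0014 |
| USA | 61234 | 61223 | 61167 | 61154 | 61147 | 0.5806 | 0.5806 | 0.5800 | 0.5798 | 0.5797 | 0.0003 |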
Note. BIC values for best-performing model printed in bold. GHP differences (column Diff) between models MO2 and MM2 larger than 0.001 printed in bold. See Appendix B for country labels.
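The two decision rules in the note to Table 5 (smallest BIC per country, and a GHP difference between MO2 and MM2 above 0.001) can be sketched in Python as follows. The values are copied from the AUS and PRT rows of Table 5; the dictionary layout and variable names are hypothetical. Differences computed from the rounded GHP entries may deviate in the last digit from the Diff column, which was presumably computed from unrounded values.

```python
# BIC and GHP values copied from Table 5 for two countries.
bic = {
    "AUS": {"MW": 193304, "MO1": 194008, "MO2": 193316, "MM1": 193145, "MM2": 193105},
    "PRT": {"MW": 89473, "MO1": 89900, "MO2": 89627, "MM1": 89457, "MM2": 89322},
}
ghp = {
    "AUS": {"MO2": 0.6321, "MM2": 0.6314},
    "PRT": {"MO2": 0.6945, "MM2": 0.6920},
}

THRESHOLD = 0.001  # cutoff named in the note to Table 5

# Best-performing model per country: the one with the smallest BIC.
best_model = {country: min(values, key=values.get) for country, values in bic.items()}

# GHP difference between the latent ignorable model (MO2) and the Mislevy-Wu model (MM2).
ghp_diff = {country: v["MO2"] - v["MM2"] for country, v in ghp.items()}
flagged = {country: diff > THRESHOLD for country, diff in ghp_diff.items()}

for country in bic:
    print(country, best_model[country], round(ghp_diff[country], 4), flagged[country])
```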
Table 6. Model parameters from the latent ignorable model (MO2) and the Mislevy-Wu Model (MM2).
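| Country | MO2: SD(ξ) | MO2: Cor(θ, ξ) | MM2: SD(ξ) | MM2: Cor(θ, ξ) | MM2: δ_CR | MM2: δ_MC |
|---|---|---|---|---|---|---|
| ALB | 2.50 | 0.42 | 2.47 | 0.44 | −1.23 | −0.91 |
| AUS | 2.59 | 0.46 | 2.52 | 0.46 | −2.31 | −0.71 |
| AUT | 1.90 | 0.54 | 1.79 | 0.49 | −3.42 | −1.01 |
| BEL | 1.92 | 0.56 | 1.83 | 0.51 | −3.10 | −0.43 |
| BIH | 1.87 | 0.40 | 1.82 | 0.43 | −2.12 | −0.53 |
| BLR | 1.81 | 0.35 | 1.79 | 0.29 | −2.95 | 0.43 |
| BRN | 2.21 | 0.33 | 2.17 | 0.33 | −2.08 | −1.08 |
| CAN | 2.30 | 0.44 | 2.26 | 0.41 | −2.37 | −0.09 |
| CHE | 1.91 | 0.50 | 1.83 | 0.44 | −3.12 | −0.46 |
| CZE | 1.73 | 0.43 | 1.68 | 0.35 | −2.46 | 0.46 |
| DEU | 1.91 | 0.57 | 1.80 | 0.53 | −2.63 | −0.48 |
| DNK | 2.25 | 0.43 | 2.19 | 0.43 | −1.73 | −1.32 |
| ESP | 1.83 | 0.47 | 1.77 | 0.45 | −2.45 | −0.01 |
| EST | 2.10 | 0.41 | 2.06 | 0.36 | −2.43 | −0.35 |
| FIN | 1.99 | 0.31 | 2.00 | 0.28 | −2.22 | 0.57 |
| FRA | 1.85 | 0.57 | 1.74 | 0.52 | −3.19 | −0.52 |
| GBR | 2.48 | 0.57 | 2.38 | 0.56 | −2.26 | −0.41 |
| GRC | 1.80 | 0.33 | 1.78 | 0.30 | −3.59 | −0.24 |
| HKG | 2.34 | 0.60 | 2.22 | 0.52 | −4.07 | −0.67 |
| HRV | 1.89 | 0.46 | 1.83 | 0.45 | −2.99 | −0.64 |
| HUN | 2.17 | 0.49 | 2.11 | 0.45 | −2.48 | −0.15 |
| IRL | 1.97 | 0.47 | 1.91 | 0.44 | −2.23 | −0.01 |
| ISL | 2.35 | 0.22 | 2.36 | 0.23 | −2.00 | 0.06 |
| ISR | 2.36 | 0.50 | 2.26 | 0.49 | −3.04 | −1.28 |
| ITA | 1.75 | 0.54 | 1.65 | 0.49 | −2.69 | −0.49 |
| JPN | 1.92 | 0.49 | 1.84 | 0.43 | −2.67 | 0.45 |
| KOR | 2.61 | 0.64 | 2.49 | 0.62 | −2.15 | −0.80 |
| LTU | 1.89 | 0.42 | 1.84 | 0.36 | −3.20 | −0.69 |
| LUX | 1.76 | 0.47 | 1.68 | 0.41 | −3.01 | −0.73 |
| LVA | 1.98 | 0.44 | 1.93 | 0.41 | −1.86 | −0.08 |
| MLT | 2.94 | 0.61 | 2.86 | 0.62 | −2.03 | −0.82 |
| MNE | 1.86 | 0.47 | 1.81 | 0.49 | −2.61 | −0.57 |
| MYS | 2.42 | 0.18 | 2.43 | 0.15 | −1.94 | −2.76 |
| NLD | 2.37 | 0.45 | 2.32 | 0.40 | −3.07 | −0.61 |
| NOR | 2.11 | 0.42 | 2.05 | 0.41 | −2.64 | −0.64 |
| NZL | 2.19 | 0.53 | 2.09 | 0.50 | −2.56 | −0.60 |
| POL | 2.05 | 0.48 | 1.99 | 0.42 | −2.12 | −0.08 |
| PRT | 1.76 | 0.42 | 1.72 | 0.34 | −2.72 | 1.20 |
| RUS | 2.00 | 0.38 | 1.97 | 0.35 | −2.79 | −0.28 |
| SGP | 2.51 | 0.50 | 2.43 | 0.44 | −2.80 | −1.11 |
| SVK | 1.93 | 0.41 | 1.88 | 0.36 | −3.15 | −0.23 |
| SVN | 1.85 | 0.49 | 1.77 | 0.42 | −9.99 | −0.34 |
| SWE | 1.90 | 0.32 | 1.89 | 0.30 | −2.24 | 0.01 |
| TUR | 1.71 | 0.26 | 1.68 | 0.18 | −4.07 | −1.40 |
| USA | 2.72 | 0.26 | 2.70 | 0.26 | −1.54 | −0.28 |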
Note. SD(ξ) = standard deviation of latent propensity variable ξ; Cor(θ, ξ) = correlation of latent ability θ with latent propensity variable ξ; δ_CR = common δ parameter for constructed-response items; δ_MC = common δ parameter for multiple-choice items. See Appendix B for country labels.
Table 7. Country means for PISA 2018 mathematics from 11 different scaling models for missing item responses.
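| Country | %NA | %NR | rk UW | rk Int | Aver | SD | rg | UW | UP | UN1 | UN2 | UO1 | MO2 | IO2 | MM2 | IM2 | IF1 | IF2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SGP | 2.7 | 0.8 | 1 | 1–1 | 568.1 | 1.5 | 5.3 | 568.0 | 567.8 | 567.6 | 567.4 | 567.7 | 567.0 | 567.7 | 567.3 | 567.8 | 568.7 | 572.4 |
| HKG | 3.9 | 0.8 | 2 | 2–2 | 548.9 | 1.3 | 4.1 | 550.1 | 550.0 | 548.3 | 548.3 | 548.2 | 548.0 | 547.9 | 548.3 | 548.4 | 548.8 | 552.0 |
| NLD | 4.4 | 1.1 | 3 | 3–4 | 531.4 | 0.6 | 2.1 | 531.6 | 531.5 | 531.7 | 531.6 | 530.9 | 530.7 | 531.1 | 531.2 | 531.5 | 530.9 | 532.9 |
| JPN | 8.4 | 1.9 | 4 | 3–4 | 532.1 | 1.8 | 4.6 | 530.8 | 530.6 | 530.0 | 530.3 | 533.8 | 533.9 | 533.9 | 531.1 | 531.0 | 533.5 | 534.6 |
| EST | 6.6 | 2.0 | 5 | 5–5 | 526.7 | 1.0 | 3.4 | 527.9 | 529.2 | 526.5 | 526.8 | 525.7 | 526.2 | 526.4 | 526.8 | 525.9 | 526.1 | 526.1 |
| KOR | 6.4 | 1.7 | 6 | 6–7 | 522.5 | 1.3 | 4.4 | 523.7 | 523.6 | 523.1 | 521.7 | 522.1 | 520.4 | 520.9 | 522.2 | 522.8 | 522.6 | 524.8 |
| POL | 7.1 | 1.9 | 7 | 6–8 | 521.5 | 0.7 | 2.5 | 521.4 | 521.2 | 520.9 | 521.0 | 521.2 | 521.1 | 520.9 | 521.7 | 521.8 | 521.5 | 523.4 |
| CAN | 5.8 | 2.2 | 8 | 7–8 | 520.6 | 0.8 | 2.8 | 519.5 | 519.9 | 521.4 | 521.4 | 520.1 | 519.9 | 519.9 | 521.0 | 521.1 | 520.5 | 522.2 |
| DNK | 5.9 | 2.0 | 9 | 9–10 | 518.4 | 0.8 | 2.3 | 518.1 | 518.2 | 519.4 | 519.4 | 517.6 | 518.1 | 517.5 | 519.4 | 518.9 | 517.1 | 518.6 |
| SVN | 7.1 | 1.5 | 10 | 10–12 | 515.2 | 0.8 | 2.3 | 516.4 | 516.0 | 514.3 | 514.9 | 514.7 | 515.3 | 515.7 | 514.3 | 514.1 | 515.2 | 515.9 |
| BEL | 8.3 | 2.6 | 11 | 9–11 | 517.2 | 0.7 | 2.3 | 516.1 | 516.7 | 516.9 | 517.2 | 517.4 | 518.1 | 517.0 | 516.7 | 516.7 | 517.6 | 518.4 |
| CHE | 8.2 | 2.5 | 12 | 11–12 | 514.5 | 0.5 | 1.5 | 514.2 | 514.8 | 514.0 | 514.4 | 514.9 | 515.2 | 514.8 | 513.9 | 513.7 | 515.2 | 514.0 |
| DEU | 9.6 | 2.5 | 13 | 13–13 | 509.8 | 1.0 | 3.1 | 509.1 | 509.1 | 509.2 | 509.2 | 511.4 | 511.5 | 510.4 | 509.8 | 509.4 | 509.8 | 508.4 |
| FIN | 8.9 | 3.0 | 14 | 14–16 | 506.7 | 1.0 | 3.7 | 506.9 | 506.5 | 506.7 | 507.3 | 506.5 | 506.9 | 506.8 | 508.0 | 507.9 | 506.0 | 504.3 |
| IRL | 5.8 | 1.3 | 15 | 15–23 | 502.2 | 2.6 | 8.2 | 505.2 | 504.8 | 501.6 | 502.1 | 499.9 | 500.7 | 500.0 | 505.1 | 504.9 | 497.1 | 502.5 |
| CZE | 9.2 | 2.0 | 16 | 14–17 | 505.1 | 1.0 | 3.7 | 504.9 | 504.3 | 503.4 | 504.2 | 505.3 | 505.8 | 505.8 | 505.1 | 505.0 | 505.5 | 507.1 |
| GBR | 8.2 | 2.5 | 17 | 14–17 | 505.6 | 1.0 | 3.3 | 503.9 | 504.9 | 506.6 | 504.6 | 507.2 | 505.7 | 505.0 | 505.8 | 505.8 | 506.6 | 505.4 |
| NZL | 8.1 | 2.2 | 18 | 18–22 | 502.4 | 1.2 | 4.0 | 503.3 | 504.3 | 502.5 | 502.0 | 501.8 | 501.6 | 501.6 | 502.4 | 504.8 | 501.9 | 500.8 |
| FRA | 10.1 | 3.0 | 19 | 17–20 | 502.8 | 0.9 | 2.4 | 502.1 | 502.2 | 502.5 | 503.0 | 503.9 | 503.9 | 503.9 | 501.8 | 501.5 | 503.1 | 503.3 |
| AUT | 8.4 | 1.8 | 20 | 20–23 | 500.8 | 0.9 | 2.7 | 500.9 | 501.7 | 500.1 | 499.4 | 501.9 | 500.7 | 501.4 | 499.6 | 500.4 | 501.0 | 502.1 |
| PRT | 10.6 | 2.8 | 21 | 17–21 | 501.9 | 1.2 | 3.8 | 500.1 | 500.1 | 500.4 | 501.2 | 502.4 | 502.7 | 502.5 | 502.4 | 502.0 | 502.6 | 503.9 |
| LVA | 6.4 | 1.7 | 22 | 22–27 | 496.8 | 1.7 | 5.1 | 499.7 | 498.6 | 497.2 | 497.7 | 494.6 | 495.3 | 495.1 | 497.9 | 497.9 | 494.7 | 496.5 |
| NOR | 10.7 | 3.7 | 23 | 18–23 | 501.8 | 1.5 | 4.2 | 499.4 | 499.8 | 502.0 | 502.0 | 503.7 | 503.4 | 503.4 | 500.6 | 500.3 | 502.7 | 502.0 |
| AUS | 7.3 | 2.4 | 24 | 24–26 | 495.7 | 0.9 | 3.7 | 495.3 | 496.3 | 497.8 | 495.6 | 496.0 | 495.5 | 495.5 | 495.6 | 495.7 | 495.6 | 494.0 |
| SWE | 12.7 | 5.9 | 25 | 21–25 | 498.4 | 3.3 | 10.1 | 491.8 | 493.4 | 499.1 | 499.7 | 501.3 | 501.1 | 501.3 | 497.9 | 498.0 | 501.9 | 496.8 |
| ITA | 12.4 | 2.8 | 26 | 25–27 | 492.0 | 2.3 | 5.4 | 490.4 | 490.1 | 489.4 | 490.1 | 494.7 | 494.4 | 494.0 | 490.1 | 489.9 | 494.6 | 494.1 |
| ISL | 9.7 | 4.4 | 27 | 24–27 | 494.2 | 2.8 | 8.9 | 489.1 | 491.3 | 496.5 | 498.0 | 495.1 | 495.6 | 495.8 | 495.9 | 495.2 | 493.3 | 490.5 |
| LUX | 10.4 | 2.8 | 28 | 28–28 | 486.5 | 0.9 | 2.6 | 486.8 | 486.5 | 485.5 | 486.3 | 487.2 | 487.4 | 486.6 | 485.3 | 485.2 | 487.7 | 487.5 |
| LTU | 7.4 | 1.5 | 29 | 29–34 | 482.0 | 1.7 | 5.5 | 485.5 | 484.4 | 482.1 | 483.0 | 480.1 | 480.9 | 480.6 | 482.0 | 481.9 | 480.6 | 480.9 |
| RUS | 7.9 | 2.2 | 30 | 29–31 | 483.7 | 0.7 | 2.1 | 484.6 | 484.2 | 483.6 | 484.6 | 482.9 | 483.7 | 483.9 | 484.0 | 483.7 | 482.5 | 482.5 |
| SVK | 8.0 | 1.8 | 31 | 29–32 | 483.2 | 0.6 | 2.4 | 484.5 | 483.9 | 482.8 | 483.4 | 482.8 | 483.3 | 483.0 | 483.2 | 483.1 | 482.1 | 482.7 |
| HUN | 8.6 | 2.0 | 32 | 29–32 | 483.2 | 0.7 | 2.4 | 484.1 | 483.8 | 483.7 | 483.4 | 482.9 | 482.7 | 482.9 | 484.0 | 483.7 | 482.6 | 481.6 |
| ESP | 10.6 | 2.9 | 33 | 32–35 | 481.5 | 1.7 | 5.8 | 482.4 | 482.4 | 481.7 | 482.5 | 481.6 | 482.1 | 482.3 | 482.4 | 481.9 | 480.4 | 476.7 |
| USA | 4.0 | 2.0 | 34 | 29–36 | 482.2 | 2.4 | 6.6 | 481.6 | 483.1 | 484.9 | 485.7 | 479.5 | 480.4 | 480.5 | 484.5 | 484.5 | 479.1 | 480.7 |
| BLR | 7.8 | 2.4 | 35 | 32–36 | 480.3 | 1.7 | 5.4 | 477.7 | 477.2 | 481.4 | 482.6 | 479.5 | 480.3 | 480.5 | 481.6 | 481.6 | 480.1 | 481.1 |
| MLT | 9.8 | 3.9 | 36 | 33–37 | 476.6 | 2.8 | 9.2 | 474.2 | 476.0 | 479.8 | 471.2 | 480.3 | 475.0 | 476.8 | 475.6 | 476.6 | 480.5 | 476.5 |
| HRV | 11.8 | 2.7 | 37 | 36–37 | 470.8 | 1.8 | 5.0 | 471.8 | 469.0 | 468.3 | 471.7 | 472.9 | 471.1 | 473.3 | 468.8 | 468.5 | 471.1 | 472.3 |
| TUR | 6.7 | 1.6 | 38 | 38–39 | 460.3 | 2.0 | 6.2 | 464.0 | 462.9 | 460.8 | 462.2 | 458.0 | 459.3 | 459.4 | 460.0 | 460.0 | 457.8 | 458.7 |
| ISR | 12.1 | 4.5 | 39 | 38–39 | 461.7 | 1.6 | 4.3 | 459.9 | 461.4 | 462.4 | 461.4 | 463.5 | 462.6 | 462.3 | 459.3 | 459.2 | 463.3 | 463.0 |
| GRC | 10.7 | 2.7 | 40 | 40–41 | 439.1 | 2.5 | 9.2 | 440.9 | 440.1 | 440.0 | 441.3 | 438.2 | 439.8 | 438.8 | 439.3 | 439.2 | 440.0 | 432.2 |
| MYS | 1.2 | 0.6 | 41 | 41–44 | 429.1 | 4.8 | 12.5 | 435.8 | 433.4 | 432.2 | 433.5 | 423.3 | 424.6 | 425.4 | 431.5 | 432.4 | 424.4 | 423.6 |
| ALB | 8.0 | 1.9 | 42 | 41–44 | 429.6 | 2.2 | 6.3 | 432.0 | 431.1 | 430.5 | 426.5 | 427.8 | 427.8 | 427.8 | 432.2 | 432.8 | 428.5 | 428.2 |
| BRN | 6.1 | 1.7 | 43 | 42–44 | 427.7 | 3.2 | 8.1 | 430.9 | 430.0 | 428.8 | 430.0 | 423.2 | 424.4 | 424.2 | 428.8 | 429.4 | 423.7 | 431.3 |
| MNE | 17.3 | 3.8 | 44 | 40–44 | 433.1 | 3.5 | 10.7 | 430.5 | 431.3 | 429.8 | 428.1 | 436.8 | 436.3 | 436.1 | 431.1 | 430.4 | 438.8 | 434.8 |
| BIH | 18.8 | 3.9 | 45 | 45–45 | 413.9 | 3.8 | 10.5 | 410.7 | 410.4 | 410.3 | 409.8 | 417.3 | 417.3 | 417.1 | 412.2 | 411.2 | 420.3 | 416.6 |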
Note. %NA = proportion of item responses with missing data; %NR = proportion of item responses that are not reached; rk UW = country rank from model UW; rk Int = interval of country ranks obtained from 11 different scaling models; Aver = average of country means across 11 models; SD = standard deviation of country means across 11 models; rg = range of country means across 11 models; UW = scoring as wrong (Section 2.1); UP = MC items scored as partially correct (Section 2.2); UN1 = ignoring not reached items (Section 4.2.1); UN2 = including proportion of not reached items in background model (Section 4.2.1); UO1 = ignoring missing item responses (Section 2.3); MO2 = model-based latent ignorability (Section 2.4, Equations (10) and (11)); IO2 = imputed under latent ignorability (Section 2.4.1, Equations (10) and (11)); MM2 = Mislevy-Wu model with item-format-specific δ parameters (Section 2.5, Equation (14)); IM2 = imputed under Mislevy-Wu model with item-format-specific δ parameters (Section 2.5, Equation (14)); IF1 = FCS imputation based on item responses (Section 2.6 and Section 4.2.2); IF2 = FCS imputation based on item responses and response indicators (Section 2.6 and Section 4.2.2); The following entries in the table are printed in bold: Missing proportions (%NA) larger than 10.0% and smaller than 5.0%, not reached proportions larger than 3.0%, country rank differences larger than 2, ranges in country means larger than 5.0. See Appendix B for country labels.
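The summary columns Aver, SD and rg of Table 7 describe how much a country mean varies across the 11 scaling models. A minimal Python sketch follows, using the SGP and MYS rows copied from Table 7. The helper `summarize` is hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, although it is consistent with the rounded SD entries for these two rows. Because the inputs are rounded to one decimal, the computed range can differ slightly from the printed rg (for SGP, 5.4 from the rounded means versus 5.3 in the table).

```python
import numpy as np

model_names = ["UW", "UP", "UN1", "UN2", "UO1", "MO2", "IO2", "MM2", "IM2", "IF1", "IF2"]

# Country means for the 11 scaling models, copied from Table 7.
country_means = {
    "SGP": [568.0, 567.8, 567.6, 567.4, 567.7, 567.0, 567.7, 567.3, 567.8, 568.7, 572.4],
    "MYS": [435.8, 433.4, 432.2, 433.5, 423.3, 424.6, 425.4, 431.5, 432.4, 424.4, 423.6],
}


def summarize(values, ddof=1):
    """Average, standard deviation and range of one country's means across models."""
    v = np.asarray(values, dtype=float)
    return {
        "aver": v.mean(),
        "sd": v.std(ddof=ddof),  # ddof=1 (sample SD) is an assumption
        "rg": v.max() - v.min(),
    }


summary = {country: summarize(values) for country, values in country_means.items()}

for country, stats in summary.items():
    print(country, {name: round(float(value), 1) for name, value in stats.items()})
```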
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Robitzsch, A. On the Treatment of Missing Item Responses in Educational Large-Scale Assessment Data: An Illustrative Simulation Study and a Case Study Using PISA 2018 Mathematics Data. Eur. J. Investig. Health Psychol. Educ. 2021, 11, 1653-1687. https://doi.org/10.3390/ejihpe11040117
