On the Treatment of Missing Item Responses in Educational Large-Scale Assessment Data: An Illustrative Simulation Study and a Case Study Using PISA 2018 Mathematics Data

Missing item responses are prevalent in educational large-scale assessment studies such as the Programme for International Student Assessment (PISA). The current operational practice scores missing item responses as wrong, but several psychometricians have advocated for a model-based treatment based on the latent ignorability assumption. In this approach, item responses and response indicators are jointly modeled conditional on a latent ability and a latent response propensity variable. Alternatively, imputation-based approaches can be used. The latent ignorability assumption is weakened in the Mislevy-Wu model, which characterizes a nonignorable missingness mechanism and allows the missingness of an item to depend on the item response itself. The scoring of missing item responses as wrong and the latent ignorable model are submodels of the Mislevy-Wu model. In an illustrative simulation study, it is shown that the Mislevy-Wu model provides unbiased parameter estimates. Moreover, the simulation replicates the finding of various simulation studies in the literature that scoring missing item responses as wrong provides biased estimates if the latent ignorability assumption holds in the data-generating model. However, if missing item responses can only stem from incorrect item responses, applying an item response model that relies on latent ignorability results in biased estimates. The Mislevy-Wu model, in contrast, guarantees unbiased parameter estimates whenever this more general model holds in the data-generating process. In addition, this article uses the PISA 2018 mathematics dataset as a case study to investigate the consequences of different missing data treatments on country means and country standard deviations. Obtained country means and country standard deviations can substantially differ for the different scaling models.
In contrast to previous statements in the literature, the scoring of missing item responses as incorrect provided a better model fit than a latent ignorable model for most countries. Furthermore, the dependence of the missingness of an item on the item itself after conditioning on the latent response propensity was much more pronounced for constructed-response items than for multiple-choice items. As a consequence, scaling models that presuppose latent ignorability should be rejected for two reasons. First, the Mislevy-Wu model is preferred over the latent ignorable model for reasons of model fit. Second, in the discussion section, we argue that model fit should only play a minor role in choosing psychometric models in large-scale assessment studies because validity aspects are most relevant. Missing data treatments that countries (and, hence, their students) can simply manipulate result in unfair country comparisons.


Introduction
It has frequently been argued that measured student performance in educational large-scale assessment (LSA; [1][2][3]) studies is affected by test-taking strategies. In a recent paper that was published in the highly ranked Science journal, researchers Steffi Pohl, Esther Ulitzsch and Matthias von Davier [4] argue that "current reporting practices, however, confound differences in test-taking behavior (such as working speed and item nonresponse) with differences in competencies (ability). Furthermore, they do so in a different way for different examinees, threatening the fairness of country comparisons" [4]. Hence, the reported student performance (or, equivalently, student ability) is regarded by the authors as a conflated composite of a "true" ability and test-taking strategies. Importantly, Pohl et al. [4] question the validity of country comparisons that are currently reported in LSA studies and argue for an approach that separates test-taking behavior (i.e., item response propensity and working speed) from a purified ability measure. The core idea of the Pohl et al. [4] approach concerns how to model missing item responses in educational large-scale assessment studies. In this article, we systematically investigate the consequences of different treatments of missing item responses in the Programme for International Student Assessment (PISA) study conducted in 2018. Note that we do not focus on exploring or modeling test-taking strategies in this article.
While methods for the treatment of missing data are now widely used in statistical analyses in the social sciences [5][6][7][8], the recent literature also offers recommendations for treating missing item responses in item response theory (IRT; [9]) models in LSA studies [10,11]. Typically, a distinction is made between the treatment of missing item responses in calibration (computation of item parameters) and in scaling (computation of ability distributions).
It is essential to distinguish the types of missing item responses. Missing item responses at the end of the test are referred to as not reached items, while missing items within the test are denoted as omitted items [12]. Since the PISA 2015 study, not reached items are no longer scored as wrong, and the proportion of not reached items is used as a predictor in the latent background model [13]. Items that are not administered to students in test booklets in a multiple-matrix design [13][14][15] lead to missingness completely at random (except in multi-stage adaptive testing; see [16]). This kind of missingness is not the topic of this article and typically does not cause issues in estimating population and item parameters.
Several psychometricians have repeatedly argued that missing item responses should never be scored as wrong because such a treatment would produce biased item parameter estimates and unfair country rankings [4,10,11,17,18]. In contrast, model-based treatments of missing item responses that rely on latent ignorability [4,10,11,19] are advocated. Missing item responses can be ignored in this approach when including response indicators and a latent response propensity [20,21]. Importantly, the missingness process is summarized by the latent response propensity variable. As an alternative, multiple imputation at the level of items can be employed to handle missing item responses properly [22,23]. However, scoring missing item responses as wrong could be defended for validity reasons [24][25][26]. Moreover, it has occasionally been argued that simulation studies cannot provide information on the proper treatment of missing item responses in a concrete empirical application because the true mechanism that generated the data is unknown [25,27]. Nevertheless, simulation studies can be tremendously helpful in understanding and comparing competing statistical modeling approaches.
Our findings might only be generalizable to other low-stakes assessment studies like PISA [28][29][30], because the mechanisms underlying missing item responses can strongly differ in high-stakes assessment studies [31].
Although several proposals for alternative scaling models for abilities in LSA studies like PISA have been made, previous work either did not report country means in the metric of interest [10], such that consequences cannot be interpreted, or constituted only a toy analysis comprising a few countries [4] that did not enable generalization to operational practice. Therefore, this article compares different scaling models that rely on different treatments of missing item responses. We use the PISA 2018 mathematics dataset as a showcase. We particularly contrast the scoring of missing item responses as wrong with model-based approaches that rely on latent ignorability [4,10,11] and the more flexible Mislevy-Wu model [32,33], which contains the former two models as submodels. In the framework of the Mislevy-Wu model, it is tested whether the scoring of missing item responses as wrong or treating them as latent ignorable is preferred in terms of model fit. Moreover, it is studied whether the probability of responding to an item depends on the item response itself (i.e., nonignorable missingness; [7]). In the most general model, the missingness process is assumed to be item format-specific. Finally, we investigate the variability across the means obtained from the different models for a country.
The rest of the article is structured as follows. In Section 2, an overview of different statistical modeling approaches for handling missing item responses is presented. Section 3 contains an illustrative simulation study that demonstrates the distinguishing features of the different modeling approaches. In Section 4, the sample of persons and items and the analysis strategy for the PISA 2018 mathematics case study are described. In Section 5, the results of PISA 2018 mathematics are presented. Finally, the paper closes with a discussion in Section 6.

Statistical Models for Handling Missing Item Responses
In this section, different statistical approaches for handling missing item responses are discussed. These different approaches are utilized in the illustrative simulation study (see Section 3) and the empirical case study involving PISA 2018 mathematics data (see Section 4).
For simplicity, we only consider the case of dichotomous items. The case of polytomous items only requires more notation for the description of the models but does not change the general reasoning elaborated for dichotomous items. Let X_pi denote the dichotomous item responses and R_pi the response indicators for person p and item i. The response indicator R_pi takes the value one if X_pi is observed and zero if X_pi is missing. Consistent with the operational practice since PISA 2015, the two-parameter logistic (2PL) model [34] is used for scaling item responses [13,16]. The item response function is given as

P(X_pi = 1 | θ_p) = Ψ(a_i(θ_p − b_i)), (1)

where Ψ denotes the logistic distribution function. The item parameters a_i and b_i are item discriminations and difficulties, respectively. It holds that 1 − Ψ(x) = Ψ(−x). Local independence of item responses is posed; that is, item responses X_pi are conditionally independent from each other given the ability variable θ_p. The latent ability θ_p follows a standard normal distribution. If all item parameters are estimated, the mean of the ability distribution is fixed to zero and the standard deviation is fixed to one. The one-parameter logistic (1PL, [35]) model is obtained if all item discriminations are set equal to each other. In Figure 1, the main distinctive features of the different missing data treatments are shown. Three primary strategies can be distinguished [36,37]. These strategies differ in how information from the response indicator variables is included.
First, response indicators R_p are unmodelled (using model labels starting with "U"), and missing entries in item responses X_p are scored using some a priori defined rule, resulting in item responses X_sco,p without missing entries. For example, missing item responses can be scored as wrong or can be omitted in the estimation of the scaling model. In a second step, the 2PL scaling model is applied to the dataset containing the scored item responses X_sco,p.
Second, model-based approaches (using model labels starting with "M") pose a joint IRT model for item responses X p and response indicators R p [19]. The 2PL scaling model for the one-dimensional ability variable θ p is part of this model. In addition, a further latent variable ξ p (i.e., the so-called response propensity) is included that describes the correlational structure underlying the response indicators R p . In most approaches discussed in the literature, there is no path from X pi to R pi . After controlling for ability θ p and response propensity ξ p , there is no modeled effect of the item response on the response indicator. In this paper, we allow for this additional relation by using the Mislevy-Wu model and empirically demonstrate that missingness on items depends on the item response itself.
Third, imputation-based approaches (using model labels starting with "I") first generate multiply imputed datasets and fit the 2PL scaling model to the imputed datasets in a second step [37,38]. Different imputation models can be employed. One can either use only the item responses X_p or use both the item responses X_p and the response indicators R_p in the imputation model. As an alternative, imputations can be generated based on an IRT model that contains item responses X_p and response indicators R_p. These imputation models can coincide with the IRT models that are employed as model-based approaches in our overview. After fitting the IRT models for (X_p, R_p), the output contains a posterior distribution P(θ_p, ξ_p | X_p, R_p) for each subject p. For each imputed dataset, one first simulates latent variables θ*_p and ξ*_p from the posterior distribution [39]. For items with missing item responses (i.e., R_pi = 0), one can then simulate scores for X_pi according to the conditional distribution P(X_pi = x | R_pi = 0, θ*_p, ξ*_p) (x = 0, 1). The 2PL scaling model is applied to the imputed datasets X_imp,p in a second step. In the analyses of this paper, we always created 5 imputed datasets to reduce the simulation error associated with the imputation. We stacked the 5 multiply imputed datasets into one long dataset and applied the 2PL scaling model to the stacked dataset (see [40][41][42]). The stacking approach does not result in biased item parameter estimates [41], but resampling procedures are required for obtaining correct standard errors [40]. This article mainly focuses on differences between results from different models and does not investigate the accuracy of standard error computation methods based on resampling procedures. In the next subsections, we describe the different models for treating missing item responses.
These models differ with regard to their assumptions about the missingness mechanism of missing item responses. Some of the model abbreviations in Figure 1 were already mentioned in this section. Models that only appear in the PISA 2018 mathematics case study are described in Section 4.1.
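The 2PL response function defined above can be sketched in a few lines (an illustrative sketch; the function names are our own, not part of any operational software):

```python
import numpy as np

def psi(x):
    """Logistic distribution function Psi(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def p_correct(theta, a, b):
    """2PL item response function: P(X = 1 | theta) = Psi(a * (theta - b))."""
    return psi(a * (theta - b))

# A person of average ability answering an item of average difficulty
# has a success probability of 0.5, regardless of the discrimination a.
print(p_correct(theta=0.0, a=1.5, b=0.0))  # 0.5
```

The symmetry 1 − Ψ(x) = Ψ(−x) used in the text holds for this function by construction.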

Scoring Missing Item Responses as Wrong
In a reference model, we scored all missing item responses (omitted and not reached items) as wrong (model UW). The literature frequently argues that missing item responses should never be scored as wrong [4,10,17,43]. However, we think that the arguments against the scoring as wrong are flawed because these studies simulate missing item responses based on response probabilities that do not depend on the item itself. We think that these data-generating models are not plausible in applications (but see also [44] for a more complex missingness model; [25,26]). On the other hand, one can simulate missing item responses such that they can only occur for incorrectly solved items (i.e., for items with X_pi = 0). In this situation, all missing data treatments that do not score missing item responses as wrong will provide biased estimates [27].

Scoring Missing Item Responses as Partially Correct
Missing responses for MC items can be scored as partially correct (also known as fractionally correct item responses; see [45]). The main idea is that a student could guess the MC item if he or she does not know the answer. If an item i has K_i alternatives, a random guess of an item option would provide a correct response with probability 1/K_i. In IRT estimation, one can weight the probabilities P(X_pi = 1) with 1/K_i and P(X_pi = 0) with 1 − 1/K_i [45]. This weighting implements a scoring of a missing MC item as partially correct (model UP). The maximum likelihood estimation is replaced by a pseudo-likelihood estimation that allows non-integer item responses [45]. More formally, the log-likelihood function l for estimating the item parameters a = (a_1, ..., a_I) and b = (b_1, ..., b_I) can be written as

l(a, b) = ∑_{p=1}^{N} log ∫ ∏_{i=1}^{I} P(X_pi = 1 | θ)^{x_pi} [1 − P(X_pi = 1 | θ)]^{1 − x_pi} f(θ) dθ, (3)

where f denotes the density of the standard normal distribution, and N denotes the sample size. The entries x_pi in the vector of scored item responses X_p can generally take values between 0 and 1. The EM algorithm typically used in estimating IRT models [46,47] only needs to be slightly modified for handling fractionally correct item responses. In the E-step for computing expected counts, one must utilize the fractional item responses instead of using only zero or one values. The estimation can be carried out in the R [48] package sirt [49] (i.e., using the function rasch.mml2()).
It should be mentioned that pseudo-likelihood estimation of IRT models that allow non-integer item responses is not widely implemented in IRT software. However, the partially correct scoring can alternatively be implemented by employing a multiple imputation approach for item responses. For every missing item response of item i, a correct item response is imputed with probability 1/K_i. No imputation algorithm is required because only random guessing is assumed. This means that the guessing probability 1/K_i is constant across persons and items.
Missing item responses for CR items are scored as wrong in the partially correct scoring approach because students in this situation cannot simply guess unknown answers.
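The imputation variant of the partially correct scoring described above can be sketched as follows (an illustrative sketch; the array layout and function names are our own). Missing MC responses are imputed as correct with the constant guessing probability 1/K_i, and missing CR responses are scored as wrong:

```python
import numpy as np

rng = np.random.default_rng(1)

def score_partially_correct(X, is_mc, n_options, rng):
    """Score missing responses: MC items are imputed as correct with
    probability 1/K_i (random guessing); CR items are scored as wrong.
    X is an (N, I) array with np.nan marking missing entries."""
    X = X.copy()
    N, I = X.shape
    for i in range(I):
        miss = np.isnan(X[:, i])
        if is_mc[i]:
            X[miss, i] = (rng.random(miss.sum()) < 1.0 / n_options[i]).astype(float)
        else:
            X[miss, i] = 0.0
    return X

# Item 0: MC with 4 options; item 1: CR (n_options unused for CR items).
X = np.array([[1.0, np.nan], [np.nan, 0.0]])
X_sco = score_partially_correct(X, is_mc=[True, False], n_options=[4, 1], rng=rng)
print(X_sco)  # no missing entries remain
```

Repeating this step for each of the multiply imputed datasets yields the imputation-based implementation of model UP.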

Treating Missing Item Responses as Ignorable
As an alternative to scoring missing item responses as wrong, missing item responses can be ignored in likelihood estimation. In model UO1, all missing item responses are ignored in the scaling model. The student ability θ_p is extracted based on the observed item responses only. The log-likelihood function l for this model can be written as

l(a, b) = ∑_{p=1}^{N} log ∫ ∏_{i=1}^{I} [P(X_pi = 1 | θ)^{x_pi} (1 − P(X_pi = 1 | θ))^{1 − x_pi}]^{r_pi} f(θ) dθ. (4)

It can be seen from Equation (4) that only observations with observed item responses (i.e., r_pi = 1) contribute to the likelihood function.
The method UO1 is valid if missing item responses can be regarded as ignorable [18]. If X_com,p = (X_obs,p, X_mis,p) is a partitioning of the vector of complete item responses into the observed and the missing part, the assumption that item responses are missing at random (MAR; [7]) is given as

P(R_p | X_obs,p, X_mis,p) = P(R_p | X_obs,p). (5)

This means that the probability of omitting items only depends on observed item responses and not on the unobserved item responses. By integrating out the missing item responses X_mis,p, the joint distribution of (X_com,p, R_p) can, using the MAR assumption (5), be written as

∫ P(X_obs,p, X_mis,p, R_p) dX_mis,p = P(R_p | X_obs,p) P(X_obs,p). (6)
Hence, Equation (6) shows that likelihood inference for MAR data can entirely rely on the probability distribution P(X obs,p ) of observed item responses. The notion of (manifest) ignorability means that model parameters of the distributions P(X obs,p ) and P(R p |X obs,p ) are distinctive. This means that these distributions can be modeled independently.
It should be emphasized that the MAR assumption (5) does not involve the latent ability θ p . The probability of missingness must be inferred by (summaries of) observed item responses only. This kind of missingness process might be violated in practice. In the following subsection, a weakened version of ignorability is discussed.
The latent ignorability assumption [19] weakens the MAR assumption (5) by conditioning on a latent variable η_p:

P(R_p | X_obs,p, X_mis,p, η_p) = P(R_p | X_obs,p, η_p). (7)

That is, the probability of missing item responses depends on observed item responses and the latent variable η_p, but not on the unknown missing item responses X_mis,p themselves. By integrating out X_mis,p, we obtain

∫ P(R_p, X_obs,p, X_mis,p | η_p) dX_mis,p = P(R_p | X_obs,p, η_p) P(X_obs,p | η_p). (8)
The specification (7) is also known as a shared-parameter model [61,62]. In most applications, conditional independence of item responses X_pi and response indicators R_pi conditional on η_p is assumed [19]. In this case, Equation (8) simplifies to

∫ P(R_p = r_p, X_obs,p = x_obs,p, X_mis,p | η_p) dX_mis,p = P(R_p = r_p | η_p) P(X_obs,p = x_obs,p | η_p). (9)

In the rest of this paper, it is assumed that the latent variable η_p consists of a latent ability θ_p and a latent response propensity ξ_p. The latent response propensity ξ_p is a unidimensional latent variable that represents the dimensional structure of the response indicators R_p. The probability of responding to an item is given by (model MO2; [10,20,44,63-66])

P(R_pi = 1 | X_pi, θ_p, ξ_p) = Ψ(ξ_p − β_i). (10)

Note that the probability of responding to item i only depends on ξ_p and is independent of X_pi and θ_p. The 2PL model is assumed for item responses X_pi (see Equation (1)):

P(X_pi = 1 | θ_p) = Ψ(a_i(θ_p − b_i)). (11)

The model defined by Equations (10) and (11) is also referred to as the Holman-Glas model [20,37]. In this article, a bivariate normal distribution for (θ_p, ξ_p) is assumed, where SD(θ_p) is fixed to one, and SD(ξ_p), as well as Cor(θ_p, ξ_p), are estimated (see [67,68] for more complex distributions).
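A data-generating sketch may clarify the structure of the Holman-Glas model: item responses depend only on θ, response indicators only on ξ, and the two latent variables are correlated. This is an illustrative sketch with assumed values for β_i and Cor(θ, ξ), and with all discriminations set to one for brevity:

```python
import numpy as np

def psi(x):
    """Logistic distribution function."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
N, I = 2000, 20
corr = 0.5                                   # assumed Cor(theta, xi)
cov = np.array([[1.0, corr], [corr, 1.0]])
theta, xi = rng.multivariate_normal([0.0, 0.0], cov, size=N).T

b = np.linspace(-2.0, 2.0, I)                # item difficulties
beta = np.full(I, -1.0)                      # assumed response-indicator parameters

# Item responses from the 2PL model with all discriminations a_i = 1.
X = (rng.random((N, I)) < psi(theta[:, None] - b)).astype(float)
# Response indicators depend only on xi (latent ignorability).
R = (rng.random((N, I)) < psi(xi[:, None] - beta)).astype(int)
X[R == 0] = np.nan                           # missing item responses

print(round(float(np.isnan(X).mean()), 3))   # overall missing proportion
```

Because R is generated without any path from X, missingness here is latent ignorable by construction.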
The model UO1 (see Section 2.3) that presupposes ignorability (instead of latent ignorability) can be tested as a nested model within model MO2 by setting Cor(θ p , ξ p ) = 0. This model is referred to as model MO1.
Note that the joint measurement model for item responses X_pi and response indicators R_pi can be written as

P(V_pi = 0 | θ_p, ξ_p) = Ψ(ξ_p − β_i) (1 − Ψ(a_i(θ_p − b_i))),
P(V_pi = 1 | θ_p, ξ_p) = Ψ(ξ_p − β_i) Ψ(a_i(θ_p − b_i)), (12)
P(V_pi = 2 | θ_p, ξ_p) = 1 − Ψ(ξ_p − β_i).

Hence, the model defined in Equation (12) can be interpreted as an IRT model for a variable V_pi that has three categories: Category 0 (observed incorrect): X_pi = 0, R_pi = 1; Category 1 (observed correct): X_pi = 1, R_pi = 1; and Category 2 (missing item response): X_pi = NA, R_pi = 0 (see [43,69,70]).

Generating Imputations from IRT Models Assuming Latent Ignorability
The IRT models MO1 and MO2 are also used for generating multiply imputed datasets. Conditional on θ_p, missing item responses are imputed according to the response probability from the 2PL model (see Equation (11)). The stacked imputed dataset is scaled with the unidimensional 2PL model. If models MO1 or MO2 were the true data-generating models, the results from multiple imputation (i.e., IO1 and IO2) would coincide with those of the model-based treatments (i.e., MO1 and MO2). However, results can differ in the case of misspecified models [71,72].

Including Summaries of Response Indicators in the Latent Background Model
The IRT model for response indicators R_pi in Equation (10) is a 1PL model. Hence, the sum score R_p• = ∑_{i=1}^{I} R_pi is a sufficient statistic for the response propensity ξ_p [73]. Then, the joint distribution can be written as

P(R_p, X_obs,p, θ_p, ξ_p) = P(X_obs,p | θ_p) P(θ_p | ξ_p) P(R_p• | ξ_p) P(ξ_p). (13)
Instead of estimating a joint distribution of (θ_p, ξ_p), a conditional distribution θ_p | R_p• can be specified in a latent background model (LBM; [74,75]). That is, one uses the proportion of missing item responses Z_p = 1 − R_p•/I as a predictor for θ_p [11,12] and employs a conditional normal distribution θ_p | Z_p ~ N(γ_0 + γ_1 Z_p, σ_e^2). This manifest variable Z_p can be regarded as a proxy variable for the latent variable ξ_p. The resulting model is referred to as model UO2.
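The construction of the predictor Z_p and the conditional normal regression can be sketched as follows. In operational practice, the regression enters the latent background model during scaling; here, a manifest ordinary-least-squares stand-in with toy data (all names and values are our own) illustrates the idea:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: response indicators R (N x I) and provisional ability estimates.
N, I = 1000, 20
R = (rng.random((N, I)) < 0.9).astype(int)
theta_hat = rng.normal(size=N)

# Proportion of missing item responses as a proxy for the response propensity.
Z = 1.0 - R.sum(axis=1) / I

# Conditional normal theta | Z ~ N(gamma0 + gamma1 * Z, sigma_e^2),
# estimated here by ordinary least squares as a simple stand-in.
A = np.column_stack([np.ones(N), Z])
gamma, *_ = np.linalg.lstsq(A, theta_hat, rcond=None)
resid = theta_hat - A @ gamma
sigma_e = resid.std(ddof=2)
print(gamma, sigma_e)
```

With purely random toy data, γ_1 is close to zero; in real data, a negative γ_1 would indicate that students with more missing responses tend to have lower abilities.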

Mislevy-Wu Model for Nonignorable Item Responses
Latent ignorability characterizes only a weak deviation from an ignorable missing data process. It might be more plausible that the probability P(R_pi = 1 | X_pi, θ_p, ξ_p) of responding to an item depends on the observed or unobserved item response X_pi itself [76][77][78][79][80]. The so-called Mislevy-Wu model [32,33,81,82] extends the model MO2 (see Equation (10)) that assumes latent ignorability to

P(R_pi = 1 | X_pi, θ_p, ξ_p) = Ψ(ξ_p − β_i − δ_i X_pi). (14)

In this model, the probability of responding to an item depends on the latent response propensity ξ_p and the item response X_pi itself (see [24,25,49,81,83,84]). The parameter β_i governs the missingness proportion in the subgroup of persons with X_pi = 0, while the sum β_i + δ_i governs the missingness proportion for persons with X_pi = 1. The unique feature of the Mislevy-Wu model is that the missingness proportion is allowed to depend on the item response. If a very small negative value for the missingness parameter δ_i is chosen (e.g., δ_i = −10), the response probability P(R_pi = 1 | X_pi = 1, θ_p, ξ_p) in Equation (14) is close to one, meaning that persons with X_pi = 1 always provide an item response (i.e., they have a missingness proportion of zero). By applying the Bayes theorem, it follows in this case that persons with a missing item response must have solved the item incorrectly; that is, it holds that X_pi = 0. It should be emphasized that the Mislevy-Wu model is a special case of the models discussed in [85].
Model MM1 is defined by assuming a common δ i parameter for all items. In model MM2, two δ parameters are estimated for item formats CR and MC in the PISA 2018 mathematics case study (see Section 5 for results).
Note that the Mislevy-Wu model for item responses X_pi and response indicators R_pi can also be formulated as a joint measurement model for a polytomous item with three categories 0 (observed incorrect), 1 (observed correct), and 2 (missing; see also Equation (12)):

P(V_pi = 0 | θ_p, ξ_p) = Ψ(ξ_p − β_i) (1 − P_i(θ_p)),
P(V_pi = 1 | θ_p, ξ_p) = Ψ(ξ_p − β_i − δ_i) P_i(θ_p), (15)
P(V_pi = 2 | θ_p, ξ_p) = 1 − P(V_pi = 0 | θ_p, ξ_p) − P(V_pi = 1 | θ_p, ξ_p),

where P_i(θ_p) = Ψ(a_i(θ_p − b_i)). The most salient property of the models MM1 and MM2 is that the model treating missing item responses as wrong (model UW) can be tested by setting δ_i = −10 in Equation (14) (see [33]). This model is referred to as model MW, and the corresponding scaling model based on multiply imputed datasets from MW as model IW. Moreover, the model MO2 assuming latent ignorability is obtained by setting δ_i = 0 for all items i (see Equation (10)). It has been shown that parameter estimation in the Mislevy-Wu model and model selection among models MW, MO2, and MM1 based on information criteria have satisfactory performance [33].
For both models MM1 and MM2, multiply imputed datasets were also created based on the conditional distributions P(X_pi | R_pi, θ_p, ξ_p). The scaling models based on the stacked imputed datasets are referred to as IM1 and IM2.
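The role of the δ parameter can be sketched numerically. The parameterization P(R = 1 | X, ξ) = Ψ(ξ − β − δX) used below is our reading of Equation (14), chosen so that δ = 0 reproduces the latent ignorable model MO2 and a large negative δ reproduces the scoring of missings as wrong (model MW); the specific values are illustrative:

```python
import numpy as np

def psi(x):
    """Logistic distribution function."""
    return 1.0 / (1.0 + np.exp(-x))

def p_respond(x, xi, beta, delta):
    """Mislevy-Wu response-indicator model:
    P(R = 1 | X = x, xi) = Psi(xi - beta - delta * x)."""
    return psi(xi - beta - delta * x)

xi, beta = 0.0, 0.0
# delta = 0: missingness does not depend on the item response (model MO2).
print(p_respond(0, xi, beta, 0.0), p_respond(1, xi, beta, 0.0))
# delta = -10: correct responses are (almost) never missing, so a missing
# response implies an incorrect one (model MW, scoring missings as wrong).
print(p_respond(1, xi, beta, -10.0))  # close to 1
```

Intermediate values such as δ = −2 interpolate between the two submodels: correct responses go missing less often than incorrect ones, but not never.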

Imputation Models Based on Fully Conditional Specification
The imputation models discussed in previous subsections are based on unidimensional or two-dimensional IRT models (see [36,[86][87][88][89] for more imputation approaches relying on strong assumptions). Posing such a strict dimensionality assumption might result in invalid imputations because almost all IRT models in educational large-scale assessment studies are likely to be misspecified [26]. Hence, alternative imputation models for missing item responses were considered that relied on fully conditional specification (FCS; [41]) implemented in the R package mice [90].
The FCS imputation algorithm operates as follows (see [41,91-93]). Let W_p denote the vector of variables that can have missing values. FCS cycles through all variables in W_p (see [37,94-96]). For a variable W_pv, all remaining variables in W_p except W_pv (denoted as W_p,(−v)) are used as predictors for W_pv in the imputation model. More formally, a linear regression model

W_pv = γ_0 + γ' W_p,(−v) + ε_pv (16)

is specified. For dichotomous variables W_pv, the linear model (16) might be replaced by a logistic regression model. Our experiences correspond with those from the literature that using a linear regression with predictive mean matching (PMM; [41,97-99]) provides more stable estimates of the conditional imputation models. PMM guarantees that imputed values only take values that are present in the observed data (i.e., values of 0 or 1 for dichotomous item responses).
In situations with many items, W p,(−v) is a high-dimensional vector of covariates in the imputation model (16). To provide a stable and efficient estimation of the imputation model, a dimension reduction method for the vector of covariates can be applied to enable a feasible estimation. For example, principal component analysis [100] or sufficient dimension reduction [101] can be applied in each imputation model for reducing the dimensionality of W p, (−v) . In this paper, partial least squares (PLS) regression [102] is used for transforming the vector of covariates to a low-dimensional vector of PLS factors that successively maximize the covariance with the criterion variable (i.e., maximize the covariance Cov(α f W p,(−v) , W pv ) with factor loading vectors α f for uncorrelated factors [103]). In the simulation study and the empirical case study, we use 10 PLS factors to avoid the curse of dimensionality due to estimating too many parameters in the regression models [103,104].
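One step of this scheme, regression on a low-dimensional set of factor scores followed by predictive mean matching, can be sketched as follows. To keep the sketch dependency-free, principal components are used in place of PLS factors (both are mentioned above as dimension reduction options); all names are our own, a single nearest donor is used instead of sampling among several donors, and the remaining columns are assumed complete within the current FCS cycle:

```python
import numpy as np

rng = np.random.default_rng(7)

def pmm_impute_column(W, v, n_factors=10):
    """One FCS step for column v: linear regression on principal-component
    scores of the remaining columns, then predictive mean matching."""
    y = W[:, v].copy()
    Xc = np.delete(W, v, axis=1)          # assumed complete predictors
    obs = ~np.isnan(y)
    # Dimension reduction: principal-component scores of the predictors.
    Xc_c = Xc - Xc.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc_c, full_matrices=False)
    k = min(n_factors, len(s))
    F = Xc_c @ Vt[:k].T
    # Linear regression of y on the factor scores (observed cases only).
    A = np.column_stack([np.ones(obs.sum()), F[obs]])
    coef, *_ = np.linalg.lstsq(A, y[obs], rcond=None)
    pred = np.column_stack([np.ones(len(y)), F]) @ coef
    # PMM: each missing case borrows the observed value of the case with
    # the closest predicted value, so imputations stay in {0, 1}.
    donors = np.abs(pred[~obs][:, None] - pred[obs][None, :]).argmin(axis=1)
    y[~obs] = y[obs][donors]
    out = W.copy()
    out[:, v] = y
    return out

# Toy binary data with missing entries in column 0.
W = (rng.random((200, 6)) < 0.6).astype(float)
W[rng.random(200) < 0.2, 0] = np.nan
W_imp = pmm_impute_column(W, v=0, n_factors=3)
print(np.isnan(W_imp).sum())  # 0
```

A full FCS cycle would apply such a step to every incomplete column in turn and iterate.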
In the imputation model IF1, only item responses X_p are included. This specification will provide approximately unbiased estimates if the MAR assumption (i.e., manifest ignorability) holds. In model IF2, response indicators R_p are additionally included [105]. This approach is close to the assumption of latent ignorability, in which summaries of the response indicators are also required for predicting the missingness of an item response. Hence, it can be expected that model IF2 outperforms IF1 and provides similar results to the model MO2 relying on latent ignorability. In contrast to the Mislevy-Wu model, for imputing item response X_pi in model IF2, the predictors X_p,(−i) and R_p,(−i) are used. Hence, the probability of responding to an item is not allowed to depend on the item itself. This assumption might be less plausible than assuming the response model in Equation (14).
As for all imputation-based approaches in this paper, 5 multiply imputed datasets were created, and the 2PL scaling model was applied to the stacked dataset involving all imputed datasets.

Illustrative Simulation Study
In order to better understand the relations between the different models for the treatment of missing item responses, we performed a small illustrative simulation study to provide insights into the behavior of the most important models under a variety of data-generating models.

Method
We restrict ourselves to the analysis of only one group. This does not cause interpretational issues because the main motivation of this study is to provide better insight into the behavior of the models and not to mimic the PISA application involving 45 countries. We only employed a fixed number of I = 20 items in a linear fixed test design. Hence, we did not utilize a multiple-matrix design with random allocation of students to test booklets as implemented in PISA. In our experience, we have not (yet) seen any simulation study whose results with a multiple-matrix test design substantially differ from those with a linear fixed test design. We chose a sample size of N = 1500, which corresponds to a typical sample size at the item level in the PISA application.
Item responses were generated based on the Mislevy-Wu model (see Equation (14)). Item responses were simulated according to the 2PL model. We fixed the correlation of the latent ability θ and the latent response propensity ξ to 0.5. We assumed item difficulties that were equidistantly chosen on the interval [−2, 2] (i.e., −2.000, −1.789, −1.579, ..., 1.789, 2.000), and we used item discriminations of 1 when simulating data. The ability variable θ was assumed to be standard normally distributed. For the response mechanism in the Mislevy-Wu model in Equation (14), we varied a common missingness parameter δ in five factor levels: −10, −3, −2, −1, and 0. The case δ = −10 effectively corresponds to the situation in which missing item responses can only be produced by incorrect item responses. This simulation condition refers to the situation in which missing item responses must be scored as wrong to obtain unbiased statistical inference. The situation δ = 0 corresponds to the situation of latent ignorability. The cases δ = −3, −2, −1 correspond to situations in which neither the scoring as wrong nor the latent ignorable missing data treatment is consistent with the data-generating model, and biased estimation can be expected. For the model for response indicators, we used a common β parameter across items in the simulation. As our motivation was to vary the average proportion of missing item responses (i.e., the factor levels were 5%, 10%, 20%, and 30%), the common β parameter is a function of the δ parameter. Prior to the main illustrative simulation, we numerically determined the β parameter to obtain the desired missing data proportion (see Table A1 in Appendix A for the specific values used).
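The data-generating step described above can be sketched as follows. This is an illustrative sketch: the β value shown is not a numerically calibrated value from Table A1, and the parameterization of the response model follows our reading of Equation (14):

```python
import numpy as np

def psi(x):
    """Logistic distribution function."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(123)
N, I = 1500, 20
b = np.linspace(-2.0, 2.0, I)                  # equidistant item difficulties
a = np.ones(I)                                 # item discriminations of 1
cov = np.array([[1.0, 0.5], [0.5, 1.0]])       # Cor(theta, xi) = 0.5
theta, xi = rng.multivariate_normal([0.0, 0.0], cov, size=N).T

# Complete item responses from the 2PL model.
X = (rng.random((N, I)) < psi(a * (theta[:, None] - b))).astype(float)

# Response indicators from the Mislevy-Wu mechanism; the beta value is
# illustrative, not a calibrated value from the article's Appendix A.
delta, beta = -2.0, -1.5
R = (rng.random((N, I)) < psi(xi[:, None] - beta - delta * X)).astype(int)

X_mis = X.copy()
X_mis[R == 0] = np.nan
print(round(float(np.isnan(X_mis).mean()), 3))  # realized missing proportion
```

With a negative δ, incorrect responses go missing more often than correct ones; in the study, β was tuned numerically so that each (δ, missing proportion) cell hit its target rate.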
Seven analysis models were utilized in this simulation study. First, we evaluated the performance of the 2PL model for complete data (model CD). Second, we estimated the Mislevy-Wu model assuming a common missingness parameter δ (model MM1; Section 2.5). Third, we applied the method of scoring of missing items as wrong in model UW. Fourth, in contrast to UW, missing item responses were ignored in the estimation in model UO (Section 2.3). Fifth, we estimated the model with response propensity ξ relying on latent ignorability (model MO2, Section 2.4). Furthermore, two imputation-based approaches were used that rely on the fully conditional specification approach implemented in the R package mice [90]. For both approaches, five multiply imputed datasets were utilized, and the 2PL models were estimated by using a stacked dataset containing all five imputed datasets. Sixth, the model IF1 uses item responses in the imputation approach that employs PMM. Seventh, the model IF2 uses item responses and response indicators in the imputation model. To avoid multicollinearity issues, PLS imputation with 10 PLS factors was applied for models IF1 and IF2.
The 2PL analysis models provided item difficulties and item discriminations and fixed the ability distribution to the standard normal distribution. To enable a comparison of the estimated mean and standard deviation with those of the data-generating model, the estimated item parameters were linked to the true item parameters used in the data-generating model. The mean and standard deviation resulting from the linking procedure were compared to the true mean (i.e., M = 0) and the true standard deviation (SD = 1). In this simulation, we applied Haberman linking [106,107], which is equivalent to log-mean-mean linking for two groups [108]. Note that we use Haberman linking for multiple groups (i.e., multiple countries) in the case study in Section 4.
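The idea behind this linking step can be sketched as follows. Under a 2PL scaling that fixes the ability distribution to N(0, 1), the estimated item parameters of a group with true distribution N(μ, σ²) satisfy a_est = σ·a_true and b_est = (b_true − μ)/σ; inverting these relations recovers μ and σ. The function name and the log-mean-mean style averaging are our illustrative choices for the two-group case.

```python
import numpy as np

def link_to_true(a_est, b_est, a_true, b_true):
    """Recover a group's mean and SD by linking estimated 2PL item
    parameters to the true (data-generating) parameters."""
    a_est, b_est = np.asarray(a_est, float), np.asarray(b_est, float)
    a_true, b_true = np.asarray(a_true, float), np.asarray(b_true, float)
    # If ability is N(mu, sigma^2) but the scaling fixes theta ~ N(0, 1),
    # then a_est = sigma * a_true and b_est = (b_true - mu) / sigma.
    sigma = np.exp(np.mean(np.log(a_est) - np.log(a_true)))
    mu = np.mean(b_true) - sigma * np.mean(b_est)
    return mu, sigma
```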
A total of 500 replications was carried out for each cell of the design. We evaluated the bias and root mean square error (RMSE) for the estimated mean and standard deviation. Monte Carlo standard errors for the bias and the RMSE were calculated based on a jackknife procedure [109,110]. Twenty jackknife zones were defined for computing the Monte Carlo standard errors.
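The grouped jackknife computation can be sketched as follows; the function name and the assignment of replications to zones by consecutive blocks are illustrative assumptions.

```python
import numpy as np

def jackknife_mc_se(estimates, n_zones=20):
    """Monte Carlo standard error of a simulation summary (e.g., bias)
    via a grouped jackknife over the replications."""
    est = np.asarray(estimates, float)
    zones = np.array_split(est, n_zones)  # e.g., 20 jackknife zones
    # leave-one-zone-out estimates of the summary statistic
    loo = np.array([(est.sum() - z.sum()) / (est.size - z.size)
                    for z in zones])
    g = len(zones)
    return np.sqrt((g - 1) / g * np.sum((loo - loo.mean()) ** 2))
```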
In this illustrative simulation study, the statistical software R [48] along with the packages mice [90] and sirt [49] were used.

Results
In Table 1, the bias for the mean and the standard deviation for the different missing data treatments is shown as a function of the missing proportion and the missingness parameter δ. In the case of complete data (CD), no biases exist. Except for the situation of a large proportion of missing item responses of 30% and an extreme δ parameter of −10 (bias = 0.054), the Mislevy-Wu model (model MM1), which is consistent with the data-generating model, performed very well in terms of bias for the mean and the standard deviation. If missing data could only be produced by incorrect item responses (i.e., δ = −10), models that rely on ignorability (UO, IF1) or latent ignorability (MO2, IF2) produced large biases (e.g., for the mean in the condition of 10% missing data: UO 0.159, MO2 0.149, IF1 0.160, IF2 0.152). As was to be expected in this case, scoring missing item responses as wrong provided unbiased results. In contrast, if the data-generating model relied on latent ignorability (i.e., δ = 0), scoring missing item responses as wrong provided biased estimates (e.g., for the mean for 10% missing data, the bias was −0.139). Note that in this condition, MO2 and IF2 provided unbiased estimates, while the models that did not take response indicators into account provided biased estimates (e.g., for the mean for 10% missing data: UO 0.037, IF1 0.038). For values of the missingness parameter δ between −10 and 0, both the scoring as wrong and the latent ignorable treatment provided biased estimates for the mean. The biases were much more pronounced for higher missing data proportions. Moreover, the standard deviation is substantially underestimated when relying on a model assuming latent ignorability if latent ignorability did not hold in the data-generating model.
Interestingly, the imputation model IF2 that uses both item responses and response indicators showed similar behavior to the model MO2 that involves the latent response propensity ξ, while the imputation model IF1 only using item responses performed similarly to UO. The standard deviation was underestimated in many conditions for the models assuming latent ignorability if the Mislevy-Wu model holds.
The Monte Carlo standard errors for the bias of the mean (M = 0.0023, SD = 0.0005, Max = 0.0044) were similar to those for the standard deviation (M = 0.0022, SD = 0.0005, Max = 0.0038). The uncertainty in the bias estimates is negligible relative to the variation across the different missing data treatments. Hence, the conclusions obtained from this simulation study can be considered trustworthy.
In Table A2 in Appendix A, the RMSE for the mean and the standard deviation for the different missing data treatments is shown as a function of the missing data proportion and the missingness parameter δ. In situations where the models UW or MO2 provided unbiased estimates, the Mislevy-Wu model MM1 produced slightly more variable estimates.
However, only in these particular situations was the RMSE of the simpler, more restrictive models smaller than that of MM1. In all other situations, the increase in variability was outweighed by the lower bias of model MM1. The Monte Carlo standard error for the RMSE of the mean was on average 0.0023 (SD = 0.0006, Max = 0.0044). The corresponding Monte Carlo error for the RMSE of the standard deviation turned out to be quite similar (M = 0.0023, SD = 0.0007, Max = 0.0042).

Summary
In this illustrative simulation study, we showed that one cannot generally conclude that missing item responses must never be scored as wrong. Moreover, models that treat missing item responses as latent ignorable do not guarantee a smaller bias than the scoring as wrong. In general, the scoring as wrong can provide negatively biased mean estimates, while the treatment as latent ignorable will typically provide positively biased estimates.
As with any simulation study, the data-generating truth must be known in advance, which is not the case in empirical applications. The Mislevy-Wu model is a general model for treating nonignorable missing item responses. It certainly has the potential to provide less biased estimates than the alternatives recently discussed in the literature.

Sample
The mathematics test in PISA 2018 [16] was used to investigate different treatments of missing item responses. We included 45 countries that received the main test in a computer-based administration. These countries did not receive the test booklets with items of lower difficulty that were included for low-performing countries.
In total, 72 test booklets were administered in the computer-based assessment in PISA 2018 [16]. Test booklets were compiled from four clusters of items from the ability domains (i.e., mathematics, reading, science). We selected only those booklets that contained two clusters of mathematics items; that is, students who received booklets 1 to 12 were selected. The clusters of mathematics items appeared either at the first and second positions (booklets 7 to 12) or at the third and fourth positions (booklets 1 to 6) in the test.
As a consequence, 70 mathematics items were included in our analysis. In each of the selected booklets, 22, 23, or 24 mathematics items were administered. Seven of the 70 items were polytomous and were dichotomously recoded, with only the highest category being recoded as correct. In total, 27 out of 70 items had the complex multiple-choice (MC) format, and 43 items had constructed-response (CR) format. For 18 MC items, there were 4 response alternatives, 4 MC items had 8 response alternatives, and 5 MC items had 16 response alternatives.
In Table 2, descriptive statistics for the sample used in our analysis are presented. In total, 167,092 students from these 45 countries were included in the analysis. On average, M = 3713.2 students were available in each country. The average number of students per item within each country ranged between 415.8 (MLT, Malta) and 4408.3 (ESP, Spain). On average, M = 1120.3 students per item were available at the country level.
The average proportion of missing item responses in the dataset was 8.4% (SD = 3.3%) and ranged between 1.2% (MYS, Malaysia) and 18.8% (BIH, Bosnia and Herzegovina). The proportion of not reached item responses was on average 2.4% (SD = 1.0%), with a maximum of 5.9% (SWE, Sweden). Interestingly, the missing data proportions and the country means were only moderately correlated (Cor = −0.48). Missing proportions for CR items were substantially larger (M = 12.3%, SD = 4.8%, Min = 1.5%, Max = 27.9%) than for MC items (M = 2.3%, SD = 1.0%, Min = 0.7%, Max = 5.4%). Figure 2 shows the distribution of the proportion of missing and not reached items at the student level aggregated across countries. Most students produced no missing items (i.e., 61.9%) or no not reached items (i.e., 90.2%). Note. N = number of students; I = number of items; N item = average number of students per item; M OECD = officially reported country mean by OECD [16]; SD OECD = officially reported country standard deviation by OECD [16].

Scaling Models
The different scaling models for treating missing item responses are compared for the PISA 2018 mathematics data with respect to country means and country standard deviations. To compare the parameters of ability distributions across countries, different strategies are considered viable in the literature. These strategies will typically provide different results in the presence of differential item functioning between countries (country DIF; [111-114]). In this situation, item parameters vary across countries; that is, they are not invariant. First, the noninvariance can be ignored in the scaling model; a misspecified model assuming invariant item parameters is purposely specified [114-118]. Second, scaling is conducted under partial invariance, in which only a portion of item parameters is allowed to differ across countries [13,16,119-122]. Third, a hierarchical model is utilized as the scaling model in which country-specific item parameters are modeled as random effects [111,123,124]. Fourth, the scaling models are separately applied for each country in a first step. In a second step, a common metric is established by applying a linking procedure that transforms item parameters and the ability distribution [108,118,125].
In our analysis, we use the linking approach relying on separate scalings for comparing the ability distribution across countries. We opted for this strategy for the following reasons. First, it is likely that the missingness mechanisms differ across countries [126]. Hence, in a model-based approach to treating missing item responses, it does not seem justified to assume invariant model parameters for the missingness mechanism across countries. Second, it has been shown in the presence of country DIF that a misspecified scaling model assuming invariant item parameters provides more biased parameter estimates than those obtained from the linking approach [127]. Third, large models that concurrently scale all countries (assuming full invariance or partial invariance) are less robust to model deviations. Fourth, we argued elsewhere that the partial invariance approach currently used in PISA results in invalid country comparisons because the comparisons of each pair of countries essentially rely on different sets of items [26,114,118]. Fifth, the linking approach is computationally much less demanding than concurrent scaling approaches (assuming invariance or partial invariance; see [118,125,128]).
As argued above, the scalings in the analysis of our PISA 2018 mathematics case study are carried out separately for each country c. That is, one obtains country-specific item parameters a_ic and b_ic from the country-wise 2PL model

P(X_pi = 1 | θ_p) = Ψ(a_ic (θ_p − b_ic)), (17)

where Ψ denotes the logistic distribution function. Sampling weights were always used when applying the scaling model (17) to the PISA 2018 dataset. To enable the comparability of the ability distributions across countries, the obtained item discriminations a_ic and item difficulties b_ic are transformed onto a common metric in a subsequent linking step (see Section 4.3 for details).
For the PISA 2018 mathematics data, the scaling models discussed in Section 2 are applied. An overview of the specified models with brief explanations is given in Table 3. Some of the models required particular adaptations, which are described in the two following subsections. Since PISA 2015, not reached items are no longer scored as wrong [13]. To investigate this scaling method, we ignored not reached items in the scaling model but scored omitted items as wrong (model UN1). We also implemented the operational practice since PISA 2015 [13] that includes the proportion of not reached item responses as a predictor in the latent background model (model UN2; [12,129]). This second model is similar to assuming latent ignorability when the response indicators for not reached items follow a 1PL model.

Imputation Models Based on Fully Conditional Specification
In Section 2.6, we introduced the FCS imputation models IF1 and IF2 that use X_p and (X_p, R_p) in the imputation, respectively. Previous research indicated that item parameters are affected by position effects [130-137]. Hence, in our analysis, the FCS imputation models IF1 and IF2 were separately applied for each test booklet. In general, missing item responses at the end of a test booklet will be less likely to be imputed as correct (i.e., X_pi = 1) than missing item responses at the beginning of a test booklet. As the sample size for each country in each test booklet can be quite low, using PLS regression for dimension reduction of the covariates in the imputation models is vital.

Linking Procedure
The scaling models described above resulted in country-specific item discriminations a_ic and item difficulties b_ic. To enable a comparison of country means and country standard deviations, the corresponding ability distributions can be brought onto a common ability metric by linking approaches [108,138]. In this article, Haberman linking [107] in its original proposal is used. The linking procedure produces country means and standard deviations as its outcome. To enable comparisons across the 19 different scaling models, the ability distributions were linearly transformed such that the total population of all students in all countries in our study has a mean of M = 500 and a standard deviation of SD = 100 (i.e., the so-called PISA metric). More formally, for each model m and each country c, there is a linear transformation θ → t_mc(θ) = ν_0mc + ν_1mc θ that transforms the country-specific ability distributions obtained from separate scaling to the PISA metric.
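Assuming the total population is the student-weighted mixture of the country-specific normal ability distributions (the function name and weighting by country sample sizes are our illustrative assumptions), the transformation to the PISA metric can be sketched as:

```python
import numpy as np

def to_pisa_metric(country_means, country_sds, country_ns):
    """Linearly transform country ability distributions so that the pooled
    student population has M = 500 and SD = 100 (the PISA metric)."""
    m = np.asarray(country_means, float)
    s = np.asarray(country_sds, float)
    w = np.asarray(country_ns, float) / np.sum(country_ns)
    tot_m = np.sum(w * m)                                      # pooled mean
    tot_sd = np.sqrt(np.sum(w * (s ** 2 + (m - tot_m) ** 2)))  # pooled SD
    nu1 = 100.0 / tot_sd               # slope:  theta -> nu0 + nu1 * theta
    nu0 = 500.0 - nu1 * tot_m          # intercept
    return nu0 + nu1 * m, nu1 * s
```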

Model Comparisons
It is of particular interest whether the Mislevy-Wu model (MM1 and MM2) outperforms other treatments of missing item responses such as the scoring as wrong (model MW) and latent ignorability (models MO1 and MO2). The Bayesian information criterion (BIC) is used for conducting model comparisons ([33]; see also [16,120,121,139] for similar model comparisons in PISA, and [140-142] for improved information criteria in complex surveys). Moreover, the Gilula-Haberman penalty (GHP; [143-145]) is used as an effect size that is relatively independent of the sample size and the number of items. The GHP is defined as GHP = AIC/(2 ∑_{p=1}^{N} I_p), where I_p is the number of estimated item parameters involved for person p and AIC is the Akaike information criterion. For example, if 20 out of 70 items were administered to person p in a test, I_p would be 40 in the 2PL model. If a student worked on all 70 items in the test, I_p would be 140. Note that the GHP can be considered a normalized variant of the AIC. A difference in GHP larger than 0.001 is declared a notable difference in model fit [145,146].
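This definition translates directly into code; the function name is an illustrative choice, and the default of two parameters per item corresponds to the 2PL example given above.

```python
def ghp(aic, items_administered, params_per_item=2):
    """Gilula-Haberman penalty: AIC normalized by twice the total number
    of item parameters involved for the sampled persons.
    items_administered: number of items administered to each person p,
    so that I_p = params_per_item * items (e.g., 40 for 20 items, 2PL)."""
    total_I = sum(params_per_item * k for k in items_administered)
    return aic / (2 * total_I)
```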
It might be questioned whether the information criteria AIC (underlying the GHP criterion) and BIC are appropriate for datasets (X_pi, R_pi) consisting of item responses and response indicators with missing data on the item responses X_pi (see [147-150]). As was argued in Section 1, there are two types of missing item responses in large-scale assessment datasets. First, item responses can be missing for a student because only a portion of the items is administered in a test booklet in the multi-matrix test design [16]. Second, missing item responses appear due to omissions of administered items. The latter type of missingness is the main topic of this article.
It has been demonstrated in Section 2.5 (see Equation (15)) that for each item i, the observations (X_pi, R_pi) can be regarded as a random variable V_pi with three categories: V_pi = 0 if X_pi = 0 and R_pi = 1; V_pi = 1 if X_pi = 1 and R_pi = 1; and V_pi = 2 if X_pi = NA and R_pi = 0. The dataset with observations V_pi does not contain missing values, and the Mislevy-Wu model can be formulated as a function of V_pi instead of (X_pi, R_pi). As the recoded dataset does not contain missing values, model selection based on information criteria might be justified for item omissions because no missing data occur for the redefined variables. However, it might still be questioned whether the information criteria AIC and BIC remain valid when applied to multi-matrix designs. In this case, the number of effectively estimated item parameters per student is lower than if all items were administered in a test booklet. In our opinion, and based on our limited experience from an unpublished simulation study, AIC and BIC may show inferior performance for multi-matrix designs compared to the complete-data case. Note also that most educational large-scale assessment studies apply the conventional information criteria without adaptations (e.g., [121,139,151-154]).
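The recoding can be sketched as follows, assuming missing item responses are coded as −1 in the observed data matrix (a hypothetical coding choice; the function name is likewise illustrative):

```python
import numpy as np

def recode_to_v(X_obs, R):
    """Recode item responses and response indicators into a single
    three-category variable without missing values:
    V = 0 (observed wrong), V = 1 (observed correct), V = 2 (omitted)."""
    X_obs, R = np.asarray(X_obs), np.asarray(R)
    # where the response indicator is 0, the item was omitted -> category 2
    return np.where(R == 0, 2, X_obs)
```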
We would like to point out that BIC and GHP are only applied to the model-based scaling models and not to the scaling models that rely on multiply imputed datasets (see [155]).

Computation of Standard Errors
In the PISA study, statistical inference is typically conducted with the balanced repeated replication (BRR) methodology to account for stratified clustered sampling within countries [16,156]. The rth replication sample uses a modified set of person sampling weights w_p^(r). Using R = 80 replication samples in PISA, a parameter of interest γ̂ is first computed for the original sample based on the student weights w_p. The analysis is then repeated in each replication sample using the sampling weights w_p^(r), resulting in parameter estimates γ̂^(r). The standard error for γ̂ is then calculated as [16]

SE(γ̂) = sqrt( A ∑_{r=1}^{R} (γ̂^(r) − γ̂)² ), (18)

where the scaling factor A equals 0.05 in the PISA replication design. In our analysis, we are interested in standard errors for country means. The standard error is first determined for the country mean obtained in the country-specific scaling models. Each scaling model provides a person-specific individual posterior distribution h_p(θ_t | X_p, R_p) for a discrete grid θ_t (t = 1, ..., T) of θ points (e.g., for T = 21 integration points, a discrete θ grid θ_1 = −5, ..., θ_21 = 5 can be chosen). These posterior distributions reflect the subject-specific uncertainty with respect to the estimated ability. The country means have to be computed in the transformed metric (see Section 4.3). Hence, one uses the transformed grid ν_0mc + ν_1mc θ_t (t = 1, ..., T) for determining the country mean. For the rth replication sample, the mean γ̂^(r) is determined as

γ̂^(r) = [ ∑_p w_p^(r) ∑_{t=1}^{T} (ν_0mc + ν_1mc θ_t) h_p(θ_t | X_p, R_p) ] / [ ∑_p w_p^(r) ].

Note that this approach is a numerical approximation technique that coincides with the plausible value technique [129] if a large number of plausible values were used. The standard error for γ̂ can be computed using Equation (18). In our analysis, we are also interested in the statistical inference of a difference in means for a particular country resulting from different models.
It is not appropriate to compute the standard errors for the means of the different models and to apply the t-test for a mean difference relying on independent samples because two models are applied to the same dataset resulting in highly dependent parameter estimates. However, the replication technique in Equation (18) can also be applied for the difference in means. One must only compute a mean difference in each replication sample in this case.
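The replication formula in Equation (18) can be sketched directly; the function name is an illustrative choice.

```python
import numpy as np

def brr_se(est_full, est_replicates, a=0.05):
    """Balanced repeated replication standard error:
    SE = sqrt(A * sum_r (gamma_r - gamma)^2), with A = 0.05 for the
    R = 80 replicate samples used in the PISA replication design."""
    reps = np.asarray(est_replicates, float)
    return np.sqrt(a * np.sum((reps - est_full) ** 2))
```

The same function applies to a difference of country means from two models: one passes the full-sample mean difference and the replicate-wise mean differences instead of a single model's estimates.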

Similarity of Scaling Models
Each of the 19 scaling models provided a set of country means. For each country, the absolute difference between the two means stemming from a pair of models can be computed. Table 4 summarizes the average absolute differences. Scaling models that resulted in an average absolute difference of at most 1.0 can be considered similar. In Table 4, groups of similar models are shaded in gray rectangles containing the absolute differences classified as similar. Table 4 indicates that the methods that treat missing item responses as wrong (UW, MW, IW) or treat MC items as partially correct (UP, IP) resulted in similar country mean estimates. Both methods that did not score not reached item responses as wrong (UN1, UN2) resulted in relatively similar estimates. The models that rely on ignorability (UO1, MO1, IO1) or latent ignorability (MO2, UO2, IO2) provided similar estimates. In line with previous research [18], the inclusion of the latent response propensity ξ did not result in strongly different estimates of country means compared to models that ignore missing item responses. The specifications of the Mislevy-Wu model (MM1, IM1, MM2, IM2) resulted in similar country means. Interestingly, country means from the Mislevy-Wu model were more similar to those from the treatment of missing item responses as wrong than to those that relied on ignorability or latent ignorability. Finally, the scaling model based on FCS imputation involving only item responses (IF1) was similar to the models assuming (latent) ignorability (UO1, MO1, IO1, MO2, UO2, IO2). FCS imputation involving item responses and response indicators different from the imputed item (IF2) was neither similar to the ignorability-based treatments nor to the scoring as wrong or the Mislevy-Wu model. This finding could be explained by the fact that the imputation method IF2 is based on strongly different assumptions about the missingness mechanism than the Mislevy-Wu model.

Model Comparisons
From Table 5, we can see that for the majority of countries (35 out of 45), the IRT model treating missing item responses as wrong (model MW) provided a better model fit in terms of BIC than modeling them with a latent response propensity (model MO2). For 39 out of 45 countries, the Mislevy-Wu model with item format-specific δ parameters (model MM2) was preferred. In 5 out of 45 countries, the Mislevy-Wu model with one common δ parameter (MM1) was the best-fitting model. In only one country (MYS) did the model treating missing item responses as wrong have the best model fit.
For 29 out of 45 countries, the Mislevy-Wu model outperformed the model with a latent response propensity in terms of a GHP difference of at least 0.001. Overall, these findings indicate that the models assuming ignorability or latent ignorability performed worse in terms of model fit than scaling models that acknowledge the dependence of the response indicator on the true but occasionally unobserved item response.

Country-Specific Model Parameters for Latent Ignorable Model and Mislevy-Wu Model
Now, we present findings on the model parameters characterizing the missingness mechanism from the model MO2 relying on latent ignorability and from the Mislevy-Wu model MM2. The parameters are shown in Table 6. The SD of the latent response propensity, SD(ξ), was somewhat higher in the Mislevy-Wu model (MM2, median Med = 1.98) than in the model assuming latent ignorability (MO2, Med = 1.93). Moreover, by additionally including the latent item response as a predictor for the response indicator, the correlation Cor(θ, ξ) between the latent ability θ and the response propensity ξ was slightly lower in model MM2 (Med = 0.43) than in MO2 (Med = 0.46). Most importantly, the missingness mechanism strongly differed between CR and MC items. The median δ parameter in model MM2 for CR items was −2.61, indicating that students who did not know the item had a higher probability of omitting it, even after controlling for the latent response propensity ξ. In contrast, the median δ parameter for MC items was −0.48. Hence, there was a smaller influence of (latently) knowing the item on the response indicators. However, the δ parameter was different from zero for most countries, indicating that model MO2 assuming latent ignorability did not adequately explain the missingness mechanism. Overall, it can be seen that these model parameters strongly vary across countries. Hence, it can be concluded that assuming different missingness mechanisms for countries could have non-negligible consequences for country rankings (see [126]). Note. SD(ξ) = standard deviation of latent propensity variable ξ; Cor(θ, ξ) = correlation of latent ability θ with latent propensity variable ξ; δ CR = common δ parameter for constructed response items; δ MC = common δ parameter for multiple-choice items. See Appendix B for country labels.

Country Means and Country Standard Deviations Obtained From Different Scaling Models
For comparing country means, 11 out of the 19 specified scaling models were selected to contrast the dissimilarity of country mean and standard deviation estimates. Based on the findings on the similarity of models in Section 5.1 (see Table 4), 8 of the 19 models were omitted from the reported comparisons because they provided very similar findings to at least one of the 11 reported models. Table 7 shows the country means of these 11 different treatments of missing item responses. The country rank (column "rk UW") serves as the reference for the comparison among methods. Moreover, the interval of country ranks obtained from the different methods is shown in column "rk Int". The average maximum difference in country ranks was 2.4 (SD = 1.8) and ranged between 0 (SGP, HKG, EST, DEU, LUX, BIH) and 8 (IRL). The range in country means (i.e., the difference between the largest and smallest country mean across the 11 methods) was noticeable (M = 5.0) and showed strong variability between countries (SD = 2.8, Min = 1.5, Max = 12.5). Interestingly, large range values were obtained for countries with missing proportions well below or above the average missing proportion. For example, Ireland (IRL) had a relatively low missing rate of 5.8% and reached rank 15 with method UW (M = 505.2), which treated missing item responses as wrong. Methods that ignore missing item responses resulted in a lower country mean (UO1: M = 499.9; MO2: M = 500.7; IO2: M = 500.0). In contrast, the Mislevy-Wu model (MM2 and IM2), which also takes the relation between the response indicator and the true item response into account, resulted in higher country means (MM2: M = 505.1; IM2: M = 504.9). Across the 11 estimation methods, Ireland reached ranks between 15 and 23, which can be considered a large variability. Moreover, the range of country means for Ireland was 8.2, which is two to three times larger than the standard errors for country means due to the sampling of students in PISA.
In Table A3 in Appendix C, standard errors for country means are shown. Across different models and countries, the average standard error was 2.20 (SD = 0.47, Min = 1.21, Max = 3.65). Within a country, the variability (i.e., the standard deviation, column "SD" in Table A3) of the standard errors for the mean was small (M = 0.05, SD = 0.05, Min = 0.01, Max = 0.21).
In Table A4 in Appendix C, standard errors for differences in means stemming from two different models are displayed. We consider differences between the models UW, MO2, and MM2. It turned out that the standard error for a mean difference between two models was very small compared to the standard error for the mean of a single model. The largest average standard errors were obtained for the mean difference between models UW and MO2 (see column "UW-MO2" in Table A4; M = 0.037, SD = 0.036, Min = 0, Max = 0.149). These two models represent the two extreme missing data treatments, which explains why the largest standard errors were obtained for this pair. The smallest standard errors were obtained for the model difference between UW and MM2 (column "UW-MM2"; M = 0.021, SD = 0.019, Min = 0.000, Max = 0.096). The standard errors for the mean difference between the models MO2 and MM2 were of a similar magnitude (column "MO2-MM2"; M = 0.022, SD = 0.019, Min = 0.001, Max = 0.093).
The estimates of country standard deviations stemming from different models for the missing data treatment are shown in Table A5 in Appendix C. As in the case of the country means, it turned out that model choice also impacted standard deviations. Within a country, the standard deviation of the different standard deviation estimates showed non-negligible variability (column "SD" in Table A5; M = 1.25, SD = 0.96, Min = 0.3, Max = 5.4). The within-country ranges of country standard deviations across models were even larger than those for country means. Note. %NA = proportion of item responses with missing data; %NR = proportion of item responses that are not reached; rk UW = country rank from model UW; rk Int = interval of country ranks obtained from 11 different scaling models; Aver = average of country means across 11 models; SD = standard deviation of country means across 11 models; rg = range of country means across 11 models; UW = scoring as wrong (Section 2.1); UP = MC items scored as partially correct (Section 2.2); UN1 = ignoring not reached items (Section 4.2.1); UN2 = including proportion of not reached items in the background model (Section 4.2.1); UO1 = ignoring missing item responses (Section 2.3); MO2 = model-based latent ignorability (Section 2.4, Equations (10) and (11)); IO2 = imputation under latent ignorability (Section 2.4.1, Equations (10) and (11)); MM2 = Mislevy-Wu model with item format-specific δ parameters (Section 2.5, Equation (14)); IM2 = imputation under the Mislevy-Wu model with item format-specific δ parameters (Section 2.5, Equation (14)); IF1 = FCS imputation based on item responses (Sections 2.6 and 4.2.2); IF2 = FCS imputation based on item responses and response indicators (Sections 2.6 and 4.2.2). The following entries in the table are printed in bold: missing proportions (%NA) larger than 10.0% or smaller than 5.0%, not reached proportions larger than 3.0%, country rank differences larger than 2, and ranges in country means larger than 5.0.
See Appendix B for country labels.

Discussion
In this paper, competing approaches for handling missing item responses in educational large-scale assessment studies like PISA are investigated. We compared the Mislevy-Wu model, which allows the missingness of an item to depend on the item response itself, with the more frequently discussed approaches of scoring missing items as wrong or models assuming latent ignorability. In an illustrative simulation study, we demonstrated that neither of these two approaches provides unbiased parameter estimates if the more general Mislevy-Wu model holds (see also [44]). In realistic data constellations in which the Mislevy-Wu model holds, it is likely that the method of scoring missing item responses as wrong results in underestimated (country) means, while models relying on latent ignorability provide overestimated means. Based on these findings, we are convinced that the often-taken view in the psychometric literature that strongly advocates latent ignorability and rejects the scoring as wrong [4,11,12,18] is unjustified (see also [24,25,27]).
In our reanalysis of the PISA 2018 mathematics data, different scaling models with different treatments of missing item responses were specified. It has been shown that differences in country means and country standard deviations across models can be substantial. The present study sheds some light on the ongoing debate about properly handling missing item responses in educational large-scale assessment studies. Ignoring missing item responses and treating them as wrong can be seen as opposing strategies. Other scaling models can be interpreted to provide results somewhere between these two extreme poles of handling missingness. We argued that the Mislevy-Wu model contains the strategy of scoring as wrong and the latent ignorable model as submodels. Hence, these missing data treatments can be tested. In our analysis, it turned out that the Mislevy-Wu model fitted the PISA data best. More importantly, the treatment of missing item responses as wrong provided a better model fit than ignoring them or modeling them by the latent ignorable model that has been strongly advocated in the past [10,11]. It also turned out that the missingness mechanism strongly differed between CR and MC items.
We believe that the call to control for test-taking behavior, such as response propensity [4], in the reporting of large-scale assessment studies, using models that also include response times [157,158], poses a threat to validity [159-164], because results can simply be manipulated by instructing students to omit items they do not know [26]. Notably, missing item responses are mostly omissions for CR items. Response times might be useful for detecting rapid guessing or noneffortful responses [81,165-171]. However, it seems likely that students who do not know the solution to a CR item simply do not respond to it. In this case, the latent ignorability assumption is unlikely to hold, and scaling models that rely on it (see [4,12]) will result in biased and unfair country comparisons. We are also skeptical that the decision of whether a missing item response is scored as wrong should be based on a particular response time threshold [166,172,173]: students can likewise be instructed to quickly skip items that they are probably unable to solve.
In our PISA analysis, we restricted the sample to 45 countries that received booklets of average item difficulty. A number of low-performing countries have participated in recent PISA cycles and received booklets of lower difficulty [174-176]. We did not include these countries for the following reasons. First, the proportion of correctly solved items is lower in low-performing countries, which makes it more difficult to disentangle the parameters of the model for the response indicators from the item parameters. Second, the meaning of missing item responses differs across countries if different booklets are administered. Hence, outcomes of different scaling models for the missing data treatment are difficult to compare without the prerequisite of an identical test design. To some extent, the same issue arises in the recently implemented multi-stage testing (MST; [177,178]) design in PISA, which also results in countries receiving different proportions of test booklets of different average difficulty. We think that there is no defensible strategy for properly treating missing item responses from MST designs that enables a fair and valid comparison of countries [26].
In this article, we only investigated the impact of missing item responses on country means and country standard deviations. In LSA studies, missing data are also a prevalent issue for student covariates (e.g., sociodemographic status; see [104,179-184]). Because covariates enter the plausible value imputation of latent abilities through the latent background model [75,129], and because relationships between abilities and covariates are often of interest in reporting, missing data on covariates are a crucial issue that needs to be adequately addressed [104].
It could be argued that there is no unique, scientifically sound, or widely accepted scaling model in PISA (see [185]). The uncertainty in choosing a psychometric model can be reflected by explicitly acknowledging the variability of country means and standard deviations obtained under different model assumptions. This additional source of variance associated with model uncertainty [186-191] can be added to the standard error due to the sampling of students and the linking error due to the selection of items [192]. The assessment of specification uncertainty has been discussed in sensitivity analysis [193] and has recently become popular as multiverse analysis [194,195] or specification curve analysis [196,197]. As educational LSA studies are policy-relevant [198,199], we think that model uncertainty should be included in statistical inference [200,201].

Acknowledgments: I sincerely thank four anonymous reviewers for their valuable comments that substantially improved this article.

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Additional Information for the Illustrative Simulation Study
Table A1 shows the computed β parameter used in the illustrative simulation study as a function of the proportion of missing data and the missingness parameter δ. Table A2 shows the RMSE for the estimated mean and the estimated standard deviation for the different missing data treatments, again as a function of the proportion of missing data and the missingness parameter δ. Note. CD = complete-data analysis; UW = scoring as wrong (Section 2.1); MM1 = Mislevy-Wu model with common δ parameter (Section 2.5, Equation (14)); UO = ignoring missing item responses (Section 2.3); MO2 = model-based latent ignorability (Section 2.4, Equations (10) and (11)); IF1 = FCS imputation based on item responses (Section 2.6); IF2 = FCS imputation based on item responses and response indicators (Section 2.6).
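To illustrate how such a β value can be obtained, the following sketch solves for β so that a simplified logistic missingness model, with response propensity ξ ~ N(0, 1), attains a target marginal missingness proportion. This is a schematic illustration that omits the δ shift for correct responses (which would additionally require integrating over the item response); the study's actual computation may differ.

```python
import math

def p_missing(beta, n_grid=2001, lim=8.0):
    """Marginal missingness proportion E_xi[1 - logistic(xi - beta)]
    for xi ~ N(0, 1), computed by trapezoidal integration."""
    total = 0.0
    step = 2 * lim / (n_grid - 1)
    for i in range(n_grid):
        xi = -lim + i * step
        dens = math.exp(-0.5 * xi * xi) / math.sqrt(2 * math.pi)
        p_miss = 1.0 - 1.0 / (1.0 + math.exp(-(xi - beta)))
        weight = 0.5 if i in (0, n_grid - 1) else 1.0
        total += weight * dens * p_miss * step
    return total

def solve_beta(target, lo=-10.0, hi=10.0, tol=1e-10):
    """Find beta such that the marginal missingness proportion equals
    `target`; bisection works because p_missing is increasing in beta."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if p_missing(mid) < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

For example, `solve_beta(0.10)` returns the β that produces 10% missing responses in this simplified model; by symmetry, a target of 0.5 yields β ≈ 0.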

Appendix B. Country Labels Used in the PISA 2018 Mathematics Case Study
The country labels used in Tables 2, 5

Appendix C. Further Results of the PISA 2018 Mathematics Case Study
In Table A3, standard errors for country means for the PISA 2018 mathematics case study resulting from 11 different scaling models are shown.
In Table A4, standard errors for country mean differences between the models UW, MO2 and MM2 are presented.