On the Choice of the Item Response Model for Scaling PISA Data: Model Selection Based on Information Criteria and Quantifying Model Uncertainty

In educational large-scale assessment studies such as PISA, item response theory (IRT) models are used to summarize students’ performance on cognitive test items across countries. In this article, the impact of the choice of the IRT model on the distribution parameters of countries (i.e., mean, standard deviation, percentiles) is investigated. Eleven different IRT models are compared using information criteria. Moreover, model uncertainty is quantified by estimating model error, which can be compared with the sampling error associated with the sampling of students. The PISA 2009 dataset for the cognitive domains mathematics, reading, and science is used as an example of the choice of the IRT model. It turned out that the three-parameter logistic IRT model with residual heterogeneity and a three-parameter IRT model with a quadratic effect of the ability θ provided the best model fit. Furthermore, model uncertainty was relatively small compared to sampling error regarding country means in most cases but was substantial for country standard deviations and percentiles. Consequently, it can be argued that model error should be included in the statistical inference of educational large-scale assessment studies.



Introduction
Item response theory (IRT) models [1] are central to analyzing dichotomous random variables. IRT models can be regarded as a factor-analytic multivariate technique to summarize a high-dimensional contingency table by a few latent factor variables of interest. Of particular interest is the application of an IRT model in educational large-scale assessment (LSA; [2]), such as the programme for international student assessment (PISA; [3]), which summarizes the ability of students on test items in different cognitive domains.
In the official reporting of outcomes of LSA studies such as PISA, the set of test items is represented by a unidimensional summary measure extracted by applying a unidimensional IRT model. Across different LSA studies, there is no consensus on which particular IRT model should be utilized [4][5][6]. In previous research, there are a few attempts that quantify the impact of IRT model choice on distribution parameters of interest such as country means, standard deviations, or percentiles. However, previous research did not systematically study a large number of competing IRT models [7][8][9]. Our research fills this gap by conducting an empirical comparison involving 11 different IRT models for scaling PISA 2009 data in three ability domains. Moreover, we compare the model fit of these different IRT models and quantify the variability in model uncertainty using the model error. We compare the model error with the standard error associated with the uncertainty due to the sampling of students.
The rest of the article is structured as follows. In Section 2, we discuss different IRT models used for scaling. Section 3 introduces the concepts of model selection and model uncertainty. Section 4 describes the method used to analyze the PISA 2009 data, and Section 5 presents the results.

The Unidimensional IRT Model

For a vector X = (X_1, . . . , X_I) of I dichotomous items, the unidimensional IRT model represents the probability of observing an item response pattern x = (x_1, . . . , x_I) as

P(X = x; γ) = ∫ ∏_{i=1}^{I} P_i(θ; γ_i)^{x_i} [1 − P_i(θ; γ_i)]^{1−x_i} f(θ) dθ . (1)

In Equation (1), a latent variable θ is involved that can be interpreted as a unidimensional summary of the test items X. The distribution of θ is modeled using a (semi)parametric distribution F with density function f. In the rest of the article, we fix this distribution to be standard normal, but this can be weakened [13][14][15]. The item response functions (IRF) P_i(θ; γ_i) model the relationship of the dichotomous item with the latent variable, and we collect all item parameters in the vector γ. In most cases, a parametric model is utilized in the estimation of the IRF (but see [16] for a nonparametric identification), which is indicated by the item parameter γ_i in Equation (1). Note that in (1), item responses X_i are conditionally independent given θ; that is, after controlling for the latent ability θ, pairs of items X_i and X_j are conditionally uncorrelated. This property is also known as the local independence assumption, which can be statistically tested [12,17]. The item parameters γ_i of the IRFs in Equation (1) can be estimated by (marginal) maximum likelihood (ML) using an EM algorithm [18][19][20]. The estimation can involve sampling weights for students [21] and a multi-matrix design in which only a subset of items is administered to each student [22]. In the likelihood formulation of (1), non-administered items are skipped in the multiplication term.
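To make the marginal probability in Equation (1) concrete, it can be approximated by numerical integration over θ. The following Python sketch (the paper's own analyses use R; the three item parameters here are hypothetical) evaluates a 2PL-type version of (1) on a θ grid and checks that the probabilities of all response patterns sum to one:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def marginal_prob(pattern, item_params, nodes=61):
    """Approximate the marginal pattern probability of Equation (1) for
    2PL-type IRFs and a standard normal ability distribution, using a
    simple Riemann sum over a theta grid (illustrative, not optimized)."""
    lo, hi = -6.0, 6.0
    step = (hi - lo) / (nodes - 1)
    total = 0.0
    for k in range(nodes):
        theta = lo + k * step
        density = math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
        like = 1.0
        for x_i, (a_i, b_i) in zip(pattern, item_params):
            p = logistic(a_i * (theta - b_i))
            like *= p if x_i == 1 else 1.0 - p
        total += like * density * step
    return total

# Hypothetical three-item test: the probabilities of all 2^3 patterns sum to 1
items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.7)]
patterns = [(i, j, k) for i in (0, 1) for j in (0, 1) for k in (0, 1)]
total = sum(marginal_prob(p, items) for p in patterns)
```

Marginal ML estimation maximizes the product of such pattern probabilities over students; the EM algorithm referenced above performs this maximization efficiently.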
In practice, the IRT model (1) is likely to be misspecified because the unidimensionality assumption is implausible. Moreover, the parametric assumption P_i(θ; γ_i) of the IRF might be incorrect. In addition, in educational LSA studies involving a large number of countries, there will typically be country differential item functioning [23][24][25]; that is, item parameters will vary across countries. In this case, applying ML using country-invariant item parameters defines the best approximation of the true distribution by a model-implied distribution with respect to the Kullback-Leibler distance. In this sense, an IRT model is selected by purpose and not for reasons of model fit because it will not even approximately fit the data (see also [26]). If country means are computed based on a particular IRT model, the parameter of interest should rather be interpreted as a descriptive statistic [27]. Using a particular model does not mean that we believe that the model (approximately) fits the data. In contrast, we think that a vector of country means µ and item parameters γ summarize a high-dimensional contingency table P(X = x).
Locally optimal weights [28] can be used to discuss the consequences for scoring when using a particular IRT model. A local scoring rule for the ability θ can be defined by a weighted sum ∑_{i=1}^{I} ν_i(θ)X_i for abilities near θ = θ_0. The ability θ is determined by ML estimation using previously estimated item parameters. The locally optimal weights can be derived as (see [27][28][29])

ν_i(θ) = P_i′(θ) / { P_i(θ) [1 − P_i(θ)] } . (2)

If the local weight ν_i(θ) (also referred to as the local item score) varies across different θ values, the impact of single items on the ability estimate differs across the θ range. This property can be viewed critically, particularly for country comparisons in LSA studies [29]. Subsequently, we will discuss the properties of different IRT models regarding the optimal weights ν_i(θ). In this article, several competing functional forms of the IRF are compared, and their consequences for distribution parameters (e.g., means, standard deviations, and percentiles) for the prominent LSA study PISA are discussed. Performing such a fit index contest [30,31] does not necessarily mean that we favor model selection based on model fit. In the next Section 2.1, we discuss several IRFs later utilized for model comparisons. In Section 2.2, we investigate the behavior of the estimated ability distribution under misspecified IRFs. Finally, we conclude this section with some thoughts on the choice of the IRT model (see Section 2.3).
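The locally optimal weight can be checked numerically for any IRF by differentiating it. The Python sketch below (hypothetical item parameters) confirms that a 2PL item has the constant weight a_i, and that adding a lower asymptote makes the local item score increase with θ:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def local_weight(irf, theta, h=1e-5):
    """Locally optimal weight nu(theta) = P'(theta) / (P(theta) * (1 - P(theta)))
    with a central-difference numerical derivative (illustrative sketch)."""
    p = irf(theta)
    dp = (irf(theta + h) - irf(theta - h)) / (2.0 * h)
    return dp / (p * (1.0 - p))

a, b = 1.4, 0.3                                   # hypothetical item parameters
irf_2pl = lambda t: logistic(a * (t - b))
irf_3pl = lambda t: 0.25 + 0.75 * logistic(a * (t - b))

# 2PL: the weight equals the discrimination a at any theta
w_low, w_high = local_weight(irf_2pl, -2.0), local_weight(irf_2pl, 2.0)
# 3PL-type IRF: the lower asymptote makes the local item score increase with theta
w3_low, w3_high = local_weight(irf_3pl, -2.0), local_weight(irf_3pl, 2.0)
```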

Different Functional Forms for IRT Models
In this section, we discuss several parametric specifications of the IRF P i (θ) that appear in the unidimensional IRT model defined in Equation (1).
The one-parameter logistic model (1PL; also known as the Rasch model; [32,33]) employs a logistic link function and parametrizes an item with a single parameter b_i that is called item difficulty. The model is defined by

P_i(θ; b_i) = Ψ(a(θ − b_i)) with Ψ(x) = [1 + exp(−x)]^{−1} , (3)

where a is the common item discrimination parameter. Alternatively, one can fix the parameter a to 1 and estimate the standard deviation of the latent variable θ. Notably, the sum score ∑_{i=1}^{I} x_i is a sufficient statistic for θ in the 1PL model. The 1PL model has wide applicability in educational assessment [34,35].
The 1PL model uses a symmetric link function. However, asymmetric link functions could also be used for choosing an IRF. The cloglog link function is used in the one-parameter cloglog (1PCL) model [36,37]:

P_i(θ; b_i) = 1 − exp(−exp(θ − b_i)) . (4)

Consequently, items are differentially weighted in the estimation of θ at each θ location, and the sum score is not a sufficient statistic. The cloglog link function has similar behavior to the logistic link function in the 1PL model in the lower tail (i.e., for negative values of θ), but differs from it in the upper tail.
The one-parameter loglog (1PLL) IRT model is defined by

P_i(θ; b_i) = exp(−exp(−(θ − b_i))) . (5)

In contrast to the cloglog link function, the loglog function is similar to the logistic link function in the upper tail (i.e., for positive θ values), but different from it in the lower tail. Figure 1 compares the 1PL, 1PCL, and 1PLL models regarding the IRF P_i and the locally optimal weight ν_i. The loglog IRF (1PLL) is stretched more in the lower θ tail than the logistic link function. The converse is true for the cloglog IRT model (1PCL), which is significantly stretched in the upper θ tail. In the right panel of Figure 1, locally optimal weights are displayed. The 1PL model has a constant weight of 1, while the local contribution of the item score to θ differs across the θ range for the 1PCL and the 1PLL model. The 1PCL model provides a higher local item score for higher θ values than for lower θ values. Hence, more difficult items receive lower local item scores than easier items. In contrast, the 1PLL model results in higher local item scores for difficult items compared to easier items. This idea is reflected in the D-scoring method [38,39].
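The tail behavior of the three link functions can be verified directly. In the sketch below (plain Python, with b_i = 0), the cloglog IRF tracks the logistic IRF in the lower tail but saturates far faster in the upper tail, while the loglog IRF behaves conversely:

```python
import math

logistic = lambda x: 1.0 / (1.0 + math.exp(-x))
cloglog  = lambda x: 1.0 - math.exp(-math.exp(x))    # 1PCL-type IRF with b_i = 0
loglog   = lambda x: math.exp(-math.exp(-x))         # 1PLL-type IRF with b_i = 0

# Lower tail (theta = -4): cloglog tracks the logistic IRF, loglog vanishes far faster
p_log, p_cll, p_ll = logistic(-4.0), cloglog(-4.0), loglog(-4.0)
# Upper tail (theta = 4): loglog tracks the logistic IRF, cloglog saturates far faster
q_log, q_cll, q_ll = logistic(4.0), cloglog(4.0), loglog(4.0)
```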
Notably, the 1PCL and 1PLL models use asymmetric IRFs. One can try to estimate the extent of asymmetry in IRFs by using a generalized logistic link function (also called the Stukel link function; [40]):

P_i(θ; b_i) = Ψ_{α1,α2}(θ − b_i) , (6)

where the generalized logistic function is given as Ψ_{α1,α2}(x) = Ψ(h_{α1,α2}(x)), and the transformation h_{α1,α2} is defined for x ≥ 0 as

h_{α1,α2}(x) = α1^{−1}[exp(α1 x) − 1] for α1 > 0, h_{α1,α2}(x) = x for α1 = 0, and h_{α1,α2}(x) = −α1^{−1} log(1 − α1 x) for α1 < 0, (7)

and for x < 0 by the symmetric expressions with the shape parameter α2 in place of α1 (i.e., h_{α1,α2}(x) = −h_{α2,α1}(−x)). In this 1PGL model, common shape parameters α1 and α2 for the IRFs are additionally estimated. The 1PL, 1PCL, and 1PLL models can be obtained as special cases of (6).
The four models 1PL, 1PCL, 1PLL, and 1PGL have in common that they only estimate one parameter per item. The assumption of a common item discrimination is weakened in the two-parameter logistic (2PL) IRT model [28], a generalization of the 1PL model in which the discriminations a_i are made item-specific:

P_i(θ; a_i, b_i) = Ψ(a_i(θ − b_i)) . (8)

Note that ∑_{i=1}^{I} a_i x_i is a sufficient statistic for θ. Hence, items X_i are differentially weighted by the weight a_i, which is determined within the statistical model.
Further, the assumption of a symmetric logistic link function might be weakened, and a four-parameter generalized logistic (4PGL) model can be estimated:

P_i(θ) = Ψ_{α1i,α2i}(a_i(θ − b_i)) . (9)

In the IRT model (9), the shape parameters α1i and α2i are made item-specific. Hence, the extent of asymmetry of the IRF is estimated for each item.
The 2PL model (8) can be generalized to the three-parameter logistic (3PL; [41]) IRT model that assumes an item-specific lower asymptote c_i larger than 0 for the IRF:

P_i(θ) = c_i + (1 − c_i)Ψ(a_i(θ − b_i)) . (10)

The parameter c_i is often referred to as a (pseudo-)guessing parameter [42,43]. The 3PL model might be reasonable if multiple-choice items are used in the test.
The 3PL model can be generalized to the four-parameter logistic (4PL; [44][45][46]) model such that it also contains an item-specific upper asymptote 1 − d_i smaller than 1 for the IRF:

P_i(θ) = c_i + (1 − c_i − d_i)Ψ(a_i(θ − b_i)) . (11)

The d_i parameter is often referred to as a slipping parameter, which characterizes careless (incorrect) item responses [47]. In contrast to the 1PL, 2PL, or the 3PL model, the 4PL model has not yet been applied in the operational practice of LSA studies. However, there are a few research papers that apply the 4PL model to LSA data [48,49].
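Because the 2PL and 3PL arise from the 4PL by fixing parameters, the whole family can be expressed with one function. The sketch below uses the slipping parameterization with upper asymptote 1 − d (one common convention; operational software may parameterize the asymptotes differently), evaluated at hypothetical item parameters:

```python
import math

def irf_4pl(theta, a, b, c=0.0, d=0.0):
    """4PL-type IRF with guessing parameter c and slipping parameter d
    (upper asymptote 1 - d); the 3PL (d = 0), 2PL (c = d = 0), and
    1PL (additionally a common a) arise as special cases."""
    return c + (1.0 - c - d) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item evaluated at theta = 0.8
p2 = irf_4pl(0.8, a=1.3, b=-0.2)                  # 2PL
p3 = irf_4pl(0.8, a=1.3, b=-0.2, c=0.2)           # 3PL: bounded below by c
p4 = irf_4pl(0.8, a=1.3, b=-0.2, c=0.2, d=0.1)    # 4PL: bounded above by 1 - d
```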
It should be mentioned that the 3PL or the 4PL model might suffer from empirical nonidentifiability [45,[50][51][52]. This is why prior distributions for guessing (3PL and 4PL) and slipping (4PL) parameters are required for stabilizing model estimation. As pointed out by an anonymous reviewer, the use of prior distributions changes the meaning of the IRT model. However, we think that identifiability issues are of less concern in the large-sample-size situations that are present in educational LSA studies. If item parameters are obtained in a pooled sample of students comprising all countries, sample sizes are typically above 10,000. In this case, the empirical data will typically dominate the prior distributions, and prior distributions are therefore not needed.
In Figure 2, IRFs and locally optimal weights for the 4PL, 3PL, and 2PL models are displayed. The item parameters for the 4PL model were a_i = 1, b_i = 0, c_i = 0.25, and d_i = 0.1. The parameters of the displayed 2PL and 3PL models were obtained by minimizing the weighted squared distance between the IRF of the 4PL model and the simpler model under the constraint that the model-implied item means coincide under the normal distribution assumption of θ. Importantly, it can be seen in the right panel that the 2PL model has a constant local item score, while the local item score is increasing for the 3PL model and inversely U-shaped for the 4PL model. Hence, when using the 4PL model, an item must be neither too easy nor too difficult for the student in order for a correct response to receive a high local item score.
A different strand of model extensions also starts from the 2PL model but introduces more item parameters to model asymmetry or nonlinearity while retaining the logistic link function. The three-parameter logistic model with quadratic effects (3PLQ) extends the 2PL model by including quadratic effects of θ [42,50]:

P_i(θ) = Ψ(a_{1i}θ + a_{2i}θ² − b_i) . (12)

Due to the presence of the a_{2i} parameter, asymmetric IRFs can be modeled. As a disadvantage, the IRF in (12) need not be monotone, although this constraint can be incorporated in the estimation [53,54].
The three-parameter model with residual heterogeneity (3PLRH) extends the 2PL model by an item-specific asymmetry parameter δ_i through which the residual variance of the latent response underlying the IRF depends on θ [55,56]. The 3PLRH model has been successfully applied to LSA data and often resulted in a superior model fit compared to the 3PL model [57,58]. In Figure 3, IRFs and locally optimal weights are displayed for three parameter specifications in the 3PLRH model (i.e., a_i = 1, b_i = 0, and δ_i = −0.5, 0, or 0.5). One can see that the introduced asymmetry parameter δ_i governs the behavior of the IRF in the lower or upper tails. The displayed IRFs mimic the 1PL, 1PCL, and 1PLL models. Moreover, with δ_i parameters different from zero, different locally optimal weights across the θ range are introduced. Notably, a positive δ_i parameter is associated with a larger local item score in the lower θ tail. The opposite is true for a negative δ_i parameter.

Finally, the 3PL model is extended in the four-parameter logistic model with quadratic effects (4PLQ), in which additional item-specific quadratic effects of θ are included [50]:

P_i(θ) = c_i + (1 − c_i)Ψ(a_{1i}θ + a_{2i}θ² − b_i) . (14)

Ability Estimation under Model Misspecification
In this section, we study the estimation of θ when working with a misspecified IRT model. In the treatment, we assume that there is a true IRT model with unknown IRFs. We study the bias in estimated abilities for a fixed value of θ if misspecified IRFs are utilized. This situation refers to the empirical application in an LSA study, in which a misspecified IRF is estimated based on data comprising all countries, and the distribution of θ is evaluated at the level of countries. The misspecification emerges due to incorrectly assumed functional forms of the IRF or the presence of differential item functioning at the level of countries [24,59].
We assume that there are true but unknown IRFs P_i(θ) = Ψ(α_i(θ)) with a continuously differentiable function α_i, where Ψ(x) = [1 + exp(−x)]^{−1} denotes the logistic link function. We assume that the local independence assumption holds in the IRT model. For estimation, we use a misspecified IRT model with IRFs P_i(θ) = Ψ(a_i(θ)) with a continuously differentiable function a_i. Notably, there exists a misspecification if α_i ≠ a_i. In Appendix A, we derive the estimate θ_1 under the misspecified IRT model if θ_0 is the data-generating ability value under the true IRT model. Hence, we derive a transformation function m with θ_1 = m(θ_0) = θ_0 + B(θ_0), where B(θ) is the bias function that indicates the bias in the estimated ability due to the application of the misspecified IRT model. We assume that the item parameters under the misspecified IRT model are known (i.e., the IRFs a_i(θ) are known). Then, the ML estimate is determined based on the misspecified IRT model, taking into account that θ_0 solves the maximum likelihood equation under the true IRT model. It is assumed that the number of items I is large. Moreover, we apply two Taylor approximations that rely on the assumption that |α_i(θ) − a_i(θ)| is sufficiently small.
The derivation in Appendix A (see Equation (A10)) provides

θ_1 = m(θ_0) = θ_0 + B(θ_0) , where B(θ) = A^{−1} ∑_{i=1}^{I} a_i′(θ) [Ψ(α_i(θ)) − Ψ(a_i(θ))] , (15)

and the term A is determined by item information functions (see Appendix A). Equation (15) clarifies how the misspecified IRFs enter the computation of θ: the extent of misspecification enters as a weighted sum of the item-wise deviations Ψ(α_i(θ)) − Ψ(a_i(θ)).

Equation (15) has practical consequences when applying misspecified IRT models. For instance, θ_0 might be the true country percentile, referring to a true IRT model. If the transformation θ_1 = m(θ_0) is monotone, the percentile under the misspecified model is θ_1, and Equation (15) quantifies the bias of the estimated percentile. Moreover, let f_c be the density of the ability under the true IRT model for country c; then, one can determine the bias in the country mean by using (15). The true country mean of country c is given by

µ_c = ∫ θ f_c(θ) dθ . (16)

The estimated country mean µ*_c under the misspecified model is given by

µ*_c = ∫ m(θ) f_c(θ) dθ = µ_c + ∫ B(θ) f_c(θ) dθ . (17)

Note that the bias term B(θ) will typically be country-specific because the true IRFs P_i(θ) = Ψ(α_i(θ)) are country-specific due to differential item functioning at the level of countries. Hence, an IRT model in which the item-specific relative country effects on the IRFs are uniformly weighted in Equation (15) can be considered desirable.
In the case of a fitted 2PL model, it holds that a_i(θ) = a_i(θ − b_i), and the deviations Ψ(α_i(θ)) − Ψ(a_i(θ)) are weighted by a_i′(θ) = a_i in the derived bias in (15). For the 1PL model, the deviations are equally weighted because a_i′(θ) = 1. This property might legitimate the use of the often ill-fitting 1PL model because model deviations are equally weighted across items (see [27]). We elaborate on this discussion in the following Section 2.3.
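The bias function B(θ) can also be illustrated numerically without the Taylor approximation: given true 2PL IRFs and a misspecified 1PL model, the expected ML score equation of the 1PL model can be solved for θ_1 at a given θ_0. The item parameters in this Python sketch are hypothetical:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical true IRFs: 2PL with item-specific slopes, alpha_i(theta) = a_i * (theta - b_i)
true_items = [(0.6, -1.0), (1.0, 0.0), (1.8, 1.0)]

def theta_misspecified(theta0):
    """Solve the expected ML score equation of a misspecified 1PL model
    (common slope 1) when responses follow the true 2PL IRFs; the result
    is theta_1 = m(theta_0), so theta_1 - theta_0 is the bias B(theta_0)."""
    expected = [logistic(a * (theta0 - b)) for a, b in true_items]
    def score(theta):  # derivative of the 1PL log-likelihood at the expected responses
        return sum(e - logistic(theta - b) for e, (_, b) in zip(expected, true_items))
    lo, hi = -10.0, 10.0
    for _ in range(100):  # bisection; the score function is strictly decreasing in theta
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

bias_low = theta_misspecified(-1.5) - (-1.5)
bias_mid = theta_misspecified(0.0) - 0.0
```

In this particular configuration, the bias is modest at θ_0 = −1.5 but sizable at θ_0 = 0, so the transformation m is markedly nonlinear and comparisons of groups with different ability distributions can be affected.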

A Few Remarks on the Choice of the IRT Model
In Section 2.1, we introduced several IRT models and it might be asked which criteria should be used for selecting one among these models. We think that model-choice principles depend on the purpose of the scaling models. Pure research purposes (e.g., understanding cognitive processes underlying item response behavior; modeling item complexity) must be distinguished from policy-relevant reporting practice (e.g., country rankings in educational LSA studies). Several researchers have argued that model choice should be primarily a matter of validity and not based on purely statistical criteria [27,[60][61][62][63][64].
Myung et al. [63] discussed several criteria for model selection with a focus on cognitive science. We would like to emphasize that these criteria might be weighted differently if applied to educational LSA studies that are not primarily conducted for research purposes. The criterion of interpretability of a selected IRT model means that the model parameters must be linked to psychological processes and constructs. We think that simple unidimensional IRT models in LSA studies are not used because one believes a unidimensional underlying (causal) variable exists. The chosen IRT model is used for summarizing item response patterns and for providing simple and interpretable descriptive statistics. In this sense, we have argued elsewhere [27] that model fit should not have any relevance for model selection in LSA studies. However, official LSA publications such as those from PISA suggest that information criteria are also used for justifying the use of scaling models [5]. We would like to note that these model comparisons are often biased in the sense that the personally preferred model is often the winner of this fit contest, and other plausible IRT models are excluded from these contests because they could potentially provide a better model fit. Information-criteria-based model selection falls under the criterion of generalizability according to Myung et al. [63]. These criteria are briefly discussed in Section 3.1.
Notably, different IRT models imply a differential weighting of items in the summary variable θ [29,65]. This characteristic is quantified with locally optimal weights (see Section 2.1). The differential item weighting might impair the comparison of subgroups. More critically, the weighting of items is, in most applications, determined by the statistical model and might, hence, have undesirable consequences because practitioners have an implicitly defined, different weighting of items in mind when composing a test from a fixed set of items. Nevertheless, our study investigates the consequences of using different IRT models for LSA data. To sum up, which of the models should be chosen in operational practice is a difficult question that should not be (entirely) determined by statistical criteria.

Model Selection
It is of particular interest to conduct model comparisons of the different scaling models that involve different IRFs (see Section 2.1). The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are used for conducting model comparisons in this article (see [66][67][68][69]). Moreover, the Gilula-Haberman penalty (GHP; [70][71][72]) is used as an effect size that is relatively independent of the sample size and the number of items. The GHP is defined as GHP = AIC/(2 ∑_{p=1}^{N} I_p), where I_p is the number of items administered to person p; the denominator is, hence, twice the total number of person-item observations. The GHP can be seen as a normalized variant of the AIC. A difference in GHP larger than 0.001 is a notable difference regarding global model fit [72,73].
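These indices can be computed from a model's log-likelihood in a few lines. The numbers in this Python sketch are hypothetical and only illustrate that AIC and BIC can disagree and how the GHP normalizes the AIC:

```python
import math

def fit_indices(loglike, n_params, n_persons, n_items):
    """AIC, BIC, and Gilula-Haberman penalty GHP = AIC / (2 * N_obs),
    assuming a complete design with N_obs = n_persons * n_items
    person-item observations (a simplification of the multi-matrix case)."""
    aic = -2.0 * loglike + 2.0 * n_params
    bic = -2.0 * loglike + math.log(n_persons) * n_params
    ghp = aic / (2.0 * n_persons * n_items)
    return aic, bic, ghp

# Hypothetical contest: a more flexible model with a better likelihood
aic1, bic1, ghp1 = fit_indices(loglike=-81000.0, n_params=101, n_persons=13000, n_items=35)
aic2, bic2, ghp2 = fit_indices(loglike=-80600.0, n_params=202, n_persons=13000, n_items=35)
delta_ghp = ghp1 - ghp2   # differences above 0.001 are considered notable
```

In this constructed example, AIC prefers the more flexible model while BIC, with its larger per-parameter penalty at 13,000 persons, prefers the sparser one.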

Model Uncertainty
Country comparisons in LSA studies such as PISA can depend on the chosen IRT model. In this case, choosing a single best-fitting model might be questionable [74,75]. To investigate the impact of model dependency, we discuss the framework of model uncertainty [76][77][78][79][80][81][82][83][84][85][86] in this section and quantify it by a statistic that characterizes model error.
To quantify model uncertainty, each model m is associated with a weight w_m ≥ 0, and we assume ∑_{m=1}^{M} w_m = 1 [87]. An equal weighting of models has been criticized because it may not adequately represent the diversity of findings from different models [88]. Instead, particular models in the set of all models are downweighted if they are highly dependent and produce similar results [89][90][91]. We believe that model fit should not influence model weights [92]. The goal is to represent differences between models in the model error. If the model weights were determined by model fit, plausible but non-fitting models such as the 1PL model would receive a model weight of zero, which is not preferred because the 1PL model should not be excluded from the set of specified models. Moreover, if model weights are computed based on information criteria [80], only one or a few models receive weights that differ from zero, and all other models do not impact the statistical inference. This property is why we do not prefer Bayesian model averaging in our application [82,93,94].
Let γ = (γ_1, . . . , γ_M) be the vector of a statistical parameter of interest across all M models. We can define a composite parameter γ_comp as

γ_comp = ∑_{m=1}^{M} w_m γ_m . (18)

We can also define a population-level model error (ME) as

ME = [ ∑_{m=1}^{M} w_m (γ_m − γ_comp)² ]^{1/2} . (19)

Now, assume that data are available and γ̂ = (γ̂_1, . . . , γ̂_M) is estimated. The estimate γ̂ is multivariate normally distributed with mean γ and a covariance matrix V. Typically, estimates of different models using the same dataset will be (strongly) positively correlated. An estimate of the composite parameter γ_comp is given as

γ̂_comp = ∑_{m=1}^{M} w_m γ̂_m . (20)

Due to E(γ̂_m) = γ_m, we obtain that γ̂_comp is an unbiased estimate of γ_comp. The empirical model error MÊ is defined as

MÊ = [ ∑_{m=1}^{M} w_m (γ̂_m − γ̂_comp)² ]^{1/2} . (21)

Now, it can be shown that MÊ² is a positively biased estimate of ME² because the former also contains sampling variability. Define γ_comp = w⊤γ, where w = (w_1, . . . , w_M). Similarly, we can write γ̂_comp = w⊤γ̂. Let e_m be the m-th unit vector of length M that has an entry of 1 at the m-th position and 0 otherwise. This notation enables the representation γ̂_m − γ̂_comp = (e_m − w)⊤γ̂. Using (20) and (21), we can then rewrite the expected value E(MÊ²) as

E(MÊ²) = ME² + B with B = ∑_{m=1}^{M} w_m (e_m − w)⊤ V (e_m − w) , (22)

where the second term B is a biasing term that reflects the estimated variation across models due to sampling error. This term can be estimated if an estimate of the covariance matrix V of the vector of model estimates γ̂ is available. As an alternative, the bias in E(MÊ²) can be removed by estimating B in (22) with resampling techniques such as the bootstrap, jackknife, or (balanced) half sampling [21,95]. Let B̂ be an estimate of the bias; a bias-corrected model error can then be estimated by

ME_bc = ( MÊ² − B̂ )^{1/2} . (23)

One can define a total error TE that includes the sampling error SE due to person sampling and the model error estimate ME_bc:

TE = ( SE² + ME_bc² )^{1/2} . (24)

This total error also takes the variability in the model choice into account and allows for broader inference. Confidence intervals relying on TE will be wider than ordinary confidence intervals that are only based on the SE.
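The model error computations above reduce to weighted means and variances. The following Python sketch uses hypothetical country means together with the model weights reported in the Analysis section; the bias estimate and sampling standard error are likewise hypothetical placeholders:

```python
import math

def model_error(estimates, weights):
    """Composite parameter (weighted mean) and uncorrected model error ME
    (weighted SD across models); weights are normalized to sum to 1."""
    s = sum(weights)
    w = [x / s for x in weights]
    comp = sum(wi * g for wi, g in zip(w, estimates))
    me2 = sum(wi * (g - comp) ** 2 for wi, g in zip(w, estimates))
    return comp, math.sqrt(me2)

# Hypothetical country means under the 11 scaling models, ordered as
# 1PL, 2PL, 3PL, 1PCL, 1PLL, 1PGL, 3PLQ, 3PLRH, 4PGL, 4PL, 4PLQ
means = [501.2, 499.8, 500.1, 503.0, 502.4, 501.9, 499.5, 499.9, 500.6, 500.3, 500.0]
weights = [0.273, 0.136, 0.136, 0.061, 0.061, 0.061, 0.068, 0.068, 0.045, 0.045, 0.045]

comp, me = model_error(means, weights)
bias_hat = 0.4 ** 2                               # hypothetical resampling-based bias estimate
me_bc = math.sqrt(max(me ** 2 - bias_hat, 0.0))   # bias-corrected model error
se = 2.5                                          # hypothetical sampling standard error
te = math.sqrt(se ** 2 + me_bc ** 2)              # total error
```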

Method
In our empirical application, we used data from PISA 2009 to assess the influence of the choice of different scaling models. Similar research with substantially fewer IRT modeling alternatives was conducted in [8,96,97].

Data
PISA 2009 data was used in this empirical application [3]. The impact of the choice of the scaling model was investigated for the three cognitive domains mathematics, reading, and science. In total, 35, 101, and 53 items were included in our analysis for the domains mathematics, reading, and science, respectively. All polytomous items were dichotomously recoded, with only the highest category being recoded as correct.
For all analyses at the country level, student weights were taken into account. Within a country, student weights were normalized to a sum of 5000 so that all countries contributed equally to the analyses.

Analysis
We compared the fit of 11 different scaling models (see Section 2.1) in an international calibration sample [98]. To this end, 500 students were randomly sampled from each of the 26 countries and each of the three cognitive domains. Model comparisons were conducted based on the resulting samples involving 13,000 students.
In the next step, the item parameters obtained from the international calibration sample were fixed in the country-specific scaling models. In this step, plausible values for the θ distribution in each of the countries were drawn [99,100]. We did not include student covariates when drawing plausible values. Note that sampling weights were taken into account in this scaling step. The resulting plausible values were subsequently linearly transformed such that a weighted mean of 500 and a weighted standard deviation of 100 hold in the total sample of students comprising all countries. Weighted descriptive statistics and their standard errors for the θ distribution were computed according to the Rubin rules of multiple imputation [3]. The only difference to the original PISA approach is that we apply balanced half sampling instead of balanced repeated replication for computing standard errors (see [21,101]). Balanced half sampling has the advantage of an easy computation of the bias term for the model error (see Equation (23)).
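The Rubin pooling step can be sketched as follows: the point estimate is the mean of the statistic across plausible values, and the total variance adds the between-imputation variance, inflated by (1 + 1/M), to the average within-imputation variance. All numbers in this Python sketch are hypothetical:

```python
import math

def rubin_pool(estimates, variances):
    """Rubin rules: the pooled estimate is the mean over M plausible values;
    total variance = mean within-variance + (1 + 1/M) * between-variance."""
    M = len(estimates)
    est = sum(estimates) / M
    within = sum(variances) / M
    between = sum((e - est) ** 2 for e in estimates) / (M - 1)
    total = within + (1.0 + 1.0 / M) * between
    return est, math.sqrt(total)

# Hypothetical country mean computed on each of five plausible values,
# together with the sampling variance (squared standard error) of each analysis
est, se = rubin_pool(
    estimates=[502.3, 501.8, 502.9, 502.1, 502.6],
    variances=[6.4, 6.1, 6.6, 6.3, 6.2],
)
```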
For quantifying model uncertainty, model weights were assigned prior to the analysis based on the principles discussed in Section 3.2. First, because the 1PL, 2PL, and the 3PL are the most frequently used models in LSA studies, we decided that the sum of their model weights should exceed 0.50. Second, the weights of models with similar behavior (i.e., models that result in similar country means) should be decreased. These considerations resulted in the following weights: 1PL: 0.273, 2PL: 0.136, 3PL: 0.136, 1PCL: 0.061, 1PLL: 0.061, 1PGL: 0.061, 3PLQ: 0.068, 3PLRH: 0.068, 4PGL: 0.045, 4PL: 0.045, 4PLQ: 0.045. It is evident that a different choice of model weights would change the composite parameter of interest and the associated model error. We did not opt for a sensitivity analysis employing an alternative set of model weights in order to ease the presentation of results in this paper. In order to study the importance of the sampling error (SE) relative to the bias-corrected model error (ME_bc), we computed an error ratio (ER) that is defined by ER = ME_bc/SE. Moreover, we computed the total error as TE = (SE² + ME²_bc)^{1/2}. All analyses were carried out with the statistical software R [102]. The different IRT models were fitted using the xxirt() function in the R package sirt [103]. Plausible value imputation was conducted using the R package TAM [104].

Model Comparisons Based on Information Criteria
The 11 different scaling models were compared for the three cognitive domains mathematics, reading, and science for the PISA 2009 dataset. Table 1 displays model comparisons based on AIC, BIC, and ∆GHP, which is defined as the difference between the GHP values of a particular model and the best-fitting model.
Based on the AIC or ∆GHP, one of the models 4PGL, 3PLQ, 3PLRH, 3PL, 4PL, or 4PLQ was preferred, depending on the domain. If the BIC is used as the selection criterion, the 3PLQ or the 3PLRH model is always chosen. Notably, the operationally used 2PL model had a satisfactory model fit only for the reading domain. By inspecting ∆GHP, it is evident that the largest gain in model fit is obtained by switching from one-parameter to two-, three-, or four-parameter models. However, the gain in model fit from the 2PL to the 3PL model is not noteworthy.
In contrast, the gains in fit for the 3PLQ or 3PLRH models can be significant. Among the one-parameter models, it is interesting that the loglog link function resulted in a better model fit for mathematics compared to the logistic or the cloglog link functions. This was not the case for reading or science. Overall, the model comparison for PISA 2009 demonstrated that the 3PLQ or 3PLRH should be preferred over the 2PL model for reasons of model fit.

Note for Table 1: the scaling models refer to Equations (3) to (14). For AIC and BIC, the best-fitting model and models whose information criteria did not deviate from the minimum value by more than 100 are printed in bold. For ∆GHP, the model with the smallest value and models with ∆GHP values smaller than 0.0005 are printed in bold.

Model Uncertainty for Distribution Parameters
To obtain visual insight into the similarity of the different scaling models, we computed pairwise absolute differences in the country means. The average of these differences served as a distance matrix, which was used as input for a hierarchical cluster analysis based on the Ward method. Figure 4 shows the dendrogram of this cluster analysis. It can be seen that the 2PL and 3PL provided similar results. Another cluster of models was formed by the more complex models 3PLQ, 3PLRH, 4PGL, 4PL, and 4PLQ. Finally, the different one-parameter models 1PL, 1PCL, 1PLL, and 1PGL provided relatively distinct findings.
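Such a dendrogram can be replicated with standard tooling. The sketch below uses fabricated country means for a handful of models (the real input would be the 11 models' country means for all countries) and SciPy's hierarchical clustering; it is a minimal illustration, not the article's analysis code:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Fabricated country means (one row per model, one column per country).
rng = np.random.default_rng(0)
base = rng.normal(500.0, 30.0, size=8)  # eight fictitious countries
means = {
    "1PL": base + 1.0,
    "2PL": base,
    "3PL": base + 0.2,
    "3PLQ": base - 2.0,
    "3PLRH": base - 2.1,
}
models = list(means)
M = np.array([means[m] for m in models])

# Distance between two models: average absolute difference in country means.
D = np.abs(M[:, None, :] - M[None, :, :]).mean(axis=2)

# Ward clustering on the condensed distance matrix, then cut into two clusters.
Z = linkage(squareform(D, checks=False), method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
```

By construction, the 2PL/3PL-type models end up in one cluster and the more flexible 3PLQ/3PLRH in the other, which is the qualitative pattern of Figure 4.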

In Table 2, detailed results for the 11 different scaling models for country means in PISA 2009 reading are shown. The largest numbers of substantial deviations of country means from the weighted mean (i.e., the composite parameter) of at least 1 were obtained for the 1PCL model (10), the 1PLL (9), and the 4PLQ (9). At the level of countries, there were 11 countries in which none of the scaling models substantially differed from the weighted mean. In contrast, there were many deviations for Denmark (DNK; 9) and South Korea (KOR; 10). The ranges in country means across the different scaling models varied between 0.3 (SWE; Sweden) and 7.7 (JPN; Japan), with a mean of 2.4.
Note. For model descriptions, see Section 2.1 and Equations (3) to (14). Country means that differ from the weighted mean of the country means of the 11 different models by more than 1 are printed in bold.
In Table A1 in Appendix C, detailed results for 11 different scaling models for country means in PISA 2009 mathematics are shown. The largest number of substantial deviations from the weighted mean was obtained for the 1PCL (12), the 1PLL (11), and the 1PGL (9) model. The ranges of the country means across models ranged between 0.5 and 7.9, with a mean of 2.8.
In Table A2 in Appendix C, detailed results for 11 different scaling models for country means in PISA 2009 science are shown. For science, many models showed a large number of deviations. This demonstrates large model uncertainty. The ranges of the country means across models varied between 0.6 and 7.8, with a mean of 2.8.
In Table 3, results and model uncertainty of the 11 different scaling models for country means and standard deviations in PISA 2009 reading are shown. The unadjusted model error had an average of M = 0.66. The bias-corrected model error ME_bc was slightly smaller, with M = 0.62. On average, the error ratio was 0.24, indicating that the larger portion of uncertainty was due to sampling error rather than model error.
The estimated country standard deviations for reading were much more model-dependent. The bias-corrected model error had an average of 0.96 (ranging between 0.00 and 2.68). This was also pronounced in the error ratio, which had an average of 0.60. The maximum error ratio was 2.05 for Finland (FIN), indicating that the model error was about twice as large as the sampling error. Overall, model error turned out to be much more important for the standard deviation than for the mean.
Note. CNT = country label (see Appendix B); N = sample size; M = weighted mean across the different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME = estimated model error (see Equation (20)); ME_bc = bias-corrected estimate of the model error based on balanced half sampling (see Equation (23)); ER = error ratio defined as ME_bc/SE; TE = total error computed as TE = √(SE² + ME_bc²) (see Equation (24)).
In Table 4, results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 reading are shown. For the 10th percentile Q10, the error ratio was on average 0.60, with a range between 0.13 and 2.61. The average error ratio was even larger for the 90th percentile Q90 (M = 0.84, Min = 0.23, Max = 2.16). Hence, quantile comparisons across countries can be sensitive to the choice of the IRT scaling model.
In Table A3 in Appendix C, results and model uncertainty of the 11 different scaling models for country means and standard deviations in PISA 2009 mathematics are shown. As for reading, the error ratio was on average smaller for country means (M = 0.24, Max = 0.66) than for country standard deviations (M = 0.77, Max = 1.58). Nevertheless, the additional uncertainty associated with model error is too large to be ignored in statistical inference. For example, South Korea (KOR) had a range of 15.7 for the standard deviation across models, which corresponds to a bias-corrected model error of 3.75 and an error ratio of 1.58.
In Table A4 in Appendix C, results and model uncertainty of 11 different scaling models for country 10th and 90th percentiles in PISA 2009 mathematics are shown. The error ratios for the 10th and the 90th percentiles were similar (Q10: M = 0.66; Q90: M = 0.65). In general, the relative increase in uncertainty due to model error for percentiles was similar to the standard deviation.
In Table A5 in Appendix C, results and model uncertainty of 11 different scaling models for country means and standard deviations in PISA 2009 science are shown. As for reading and mathematics, the importance of model error was relatively small for country means (M = 0.27 for the error ratio). However, it reached 0.72 for Denmark with a bias-corrected model error of 1.89. For country standard deviations, the error ratio was larger (M = 0.53, Min = 0.00, Max = 1.50).
In Table A6 in Appendix C, results and model uncertainty of the 11 different scaling models for country 10th and 90th percentiles in PISA 2009 science are shown. The influence of model error on percentiles was slightly smaller in science than in reading or mathematics. The average error ratios were M = 0.44 (Q10) and M = 0.57 (Q90), but the maximum error ratios of 1.53 (Q10) and 2.04 (Q90) indicated that model error was more important than sampling error for some countries.
Note. CNT = country label (see Appendix B); N = sample size; M = weighted mean across the different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME = estimated model error (see Equation (20)); ME_bc = bias-corrected estimate of the model error based on balanced half sampling (see Equation (23)); ER = error ratio defined as ME_bc/SE; TE = total error computed as TE = √(SE² + ME_bc²) (see Equation (24)).
To investigate the impact of the choice of model weights in our analysis (see Section 4.2), we additionally conducted a sensitivity analysis for the reading domain by using uniform model weights (weighting scheme W2). That is, we weighted each of the 11 scaling models by w_m = 1/11 = 0.091 (m = 1, . . ., 11). We studied changes in country means and country standard deviations regarding the composite mean, standard errors (SE), and model errors (ME_bc). The results are displayed in Table 5.
Note. CNT = country label (see Appendix B); M = weighted mean across the different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME_bc = bias-corrected estimate of the model error based on balanced half sampling (see Equation (23)); W1 = model weighting used in the main analysis (see Section 4.2 and results in other tables); W2 = uniform weighting of models.
For the composite estimate of the country mean, we observed only tiny differences between the proposed model weighting W1 and the uniform weighting W2. The absolute difference in country means was 0.14 on average (SD = 0.11) and ranged between 0.01 and 0.36 (South Korea, KOR). The average absolute difference in country standard deviations was also small (M = 0.26; SD = 0.20). Notably, there were almost no changes in the standard errors of the country means and country standard deviations between the two weighting schemes. However, the model error slightly increased under uniform weighting, from M = 0.62 to M = 0.68 for country means and from 0.96 to 1.12 for country standard deviations. In conclusion, employing a different weighting scheme might not strongly change the composite estimate or the standard error but can matter for the quantified model uncertainty in the model error ME_bc.
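The sensitivity of the composite estimate and the model error to the weighting scheme can be illustrated directly. The sketch below assumes, for illustration, that the (non-bias-corrected) model error is the weighted standard deviation of the model-specific estimates around the composite; all model-specific country means are hypothetical:

```python
# Hypothetical model-specific country means for a single country.
means = {"1PL": 493.1, "2PL": 494.0, "3PL": 494.2, "1PCL": 491.8,
         "1PLL": 492.0, "1PGL": 492.3, "3PLQ": 495.0, "3PLRH": 495.1,
         "4PGL": 494.8, "4PL": 494.7, "4PLQ": 494.6}

w1 = {"1PL": 0.273, "2PL": 0.136, "3PL": 0.136, "1PCL": 0.061,
      "1PLL": 0.061, "1PGL": 0.061, "3PLQ": 0.068, "3PLRH": 0.068,
      "4PGL": 0.045, "4PL": 0.045, "4PLQ": 0.045}  # main analysis (W1)
w2 = {m: 1.0 / 11.0 for m in means}                 # uniform weighting (W2)

def composite(estimates, w):
    """Weighted composite parameter; weights are renormalized to sum to one."""
    s = sum(w.values())
    return sum(w[m] * estimates[m] for m in estimates) / s

def model_error(estimates, w):
    """Weighted standard deviation of model-specific estimates around the composite."""
    mu = composite(estimates, w)
    s = sum(w.values())
    return (sum(w[m] * (estimates[m] - mu) ** 2 for m in estimates) / s) ** 0.5

mu_w1, mu_w2 = composite(means, w1), composite(means, w2)
me_w1, me_w2 = model_error(means, w1), model_error(means, w2)
```

Comparing `mu_w1` with `mu_w2` and `me_w1` with `me_w2` mimics the W1-versus-W2 comparison reported in Table 5.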

Discussion
Overall, our findings demonstrate that uncertainty regarding the choice of the IRT scaling model influences country means. This kind of uncertainty is too large to be neglected in reporting. For some of the countries, the model error exceeded the sampling error. In these cases, confidence intervals based on standard errors that only reflect the sampling of students might be overly narrow.
A different picture emerged for standard deviations and percentiles. In this case, the choice of the IRT model turned out to be much more important. Estimated error ratios were, on average, between 0.40 and 0.80, indicating that the model error introduced a non-negligible amount of uncertainty in parameters of interest. However, the importance of model error compared to sampling error was even larger for some of the countries. In particular, distribution parameters for high-and low-performing countries were substantially affected by the choice of the IRT model.
In our analysis, we only focused on 11 scaling models studied in the literature. However, semi- or nonparametric IRT models could alternatively be utilized [16,53,105-107], and their impact on distribution parameters could be an exciting topic for future research. If IRT models with even more parameters were included, we would expect an even larger impact of model choice on distribution parameters.
In our analysis, we did not use student covariates for drawing plausible values [100,108]. It could be that the impact of the choice of the IRT model would be smaller if relevant student covariates were included [109]. Future research can provide answers to this important question. As a summary of our research (see also Section 2.3), we would like to argue that model uncertainty should also be reported in educational LSA studies. This could be particularly relevant because only the 1PL, 2PL, or 3PL models are currently applied in these studies. In the model comparisons, we have shown that the 3PL with residual heterogeneity (3PLRH) and the 3PL with a quadratic effect of θ (3PLQ) were superior to the alternatives. If the 2PL model is preferred over the 1PL model for reasons of model fit, three-parameter models must be preferred for the same reason. However, a central question might be whether the 3PLRH should be implemented in the operational practice of LSA. Technically, it would certainly be feasible, and there is hardly any practical added complexity compared to the 2PL or the 3PL model.
Interestingly, some specified IRT models have the same number of item parameters but a different ability to fit the item response data. For example, the 3PL and the 3PLRH models have the same number of parameters, but the 3PLRH is often preferred in terms of model fit. This underlines that the choice of the functional form is also relevant, not only the number of item parameters [30].
Frequently, the assumed IRT models will be grossly misspecified for educational LSA data. The misspecification could lie in the functional form of the IRFs or in the assumption of invariant item parameters across countries. The reliance of ML estimation on misspecified IRT models might therefore be questioned. As an alternative, (robust) limited-information (LI) estimation methods [110] can be used. Notably, ML and LI methods result in a different weighting of model errors [111]. If differential item functioning (DIF) across countries is critical, IRT models can also be estimated separately in each country, and the results brought onto a common international metric through linking methods [112,113]. In the case of small sample sizes at the country level, regularization approaches for more complex IRT models can be employed to stabilize estimation [114,115]. Linking methods have the advantage of a clear definition of model loss regarding country DIF [116-118] compared to joint estimation with ML or LI estimation [119].
As pointed out by an anonymous reviewer, applied psychometric researchers seem to have a tendency to choose the best-fitting model with little care for whether that choice is appropriate in the particular research context. We have argued elsewhere that the 1PL model is more valid than IRT models with more parameters because of its equal weighting of items [27]. If Pandora's box is opened via the argument of choosing a more complex IRT model due to improved model fit, we argue for the specification of different IRT models and an integrated assessment of model uncertainty, as proposed in this article. In this approach, however, the a priori choice of model weights has to be conducted carefully.

Acknowledgments: I sincerely thank three anonymous reviewers for their valuable comments that improved this article.

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

AIC    Akaike information criterion
BIC    Bayesian information criterion
DIF    differential item functioning
GHP    Gilula-Haberman penalty

Appendix A. Ability Estimation in Misspecified IRT Models
Let P_i(θ) = Ψ(a_i(θ)) denote the true but unknown IRF, where Ψ is the logistic link function and a_i is a differentiable function. If the IRFs are known, the latent ability θ can be obtained by maximizing the following log-likelihood function

l(θ) = Σ_i [ x_i log Ψ(a_i(θ)) + (1 − x_i) log(1 − Ψ(a_i(θ))) ].   (A1)

The maximization of Equation (A1) provides the estimating equation

l'(θ) = Σ_i a_i'(θ) [ x_i − Ψ(a_i(θ)) ] = 0,   (A2)

where a_i' denotes the first derivative of a_i. Note that, because E(x_i) = Ψ(a_i(θ_0)) at the true ability θ_0,

E[ l'(θ_0) ] = 0.   (A3)

Now assume that misspecified IRFs P*_i(θ) = Ψ(α_i(θ)) instead of P_i(θ) are used. The following estimating equation provides an ability estimate θ_1:

l*'(θ_1) = Σ_i α_i'(θ_1) [ x_i − Ψ(α_i(θ_1)) ] = 0.   (A4)

We make use of the following Taylor approximations

Ψ(α_i(θ_1)) ≈ Ψ(α_i(θ_0)) + I(α_i(θ_0)) α_i'(θ_0) ∆θ   (A5)
α_i'(θ_1) ≈ α_i'(θ_0) + α_i''(θ_0) ∆θ,   (A6)

where I(x) = Ψ(x)(1 − Ψ(x)). Set ∆θ = θ_1 − θ_0. We obtain by inserting (A5) and (A6) in (A4)

l*'(θ_1) ≈ Σ_i [ α_i'(θ_0) + α_i''(θ_0) ∆θ ] [ x_i − Ψ(α_i(θ_0)) − I(α_i(θ_0)) α_i'(θ_0) ∆θ ].   (A7)

We can now determine the bias ∆θ by solving E[l*'(θ_1)] = 0 for θ_1, taking E[l'(θ_0)] = 0 into account. Moreover, we take the expectation and ignore the squared term in ∆θ (i.e., (∆θ)² ≈ 0). Then, we compute from (A7)

0 ≈ Σ_i α_i'(θ_0) [ Ψ(a_i(θ_0)) − Ψ(α_i(θ_0)) ] − ∆θ Σ_i { α_i'(θ_0)² I(α_i(θ_0)) − α_i''(θ_0) [ Ψ(a_i(θ_0)) − Ψ(α_i(θ_0)) ] }.   (A8)

Finally, we obtain from (A8)

∆θ ≈ Σ_i α_i'(θ_0) [ Ψ(a_i(θ_0)) − Ψ(α_i(θ_0)) ] / Σ_i { α_i'(θ_0)² I(α_i(θ_0)) − α_i''(θ_0) [ Ψ(a_i(θ_0)) − Ψ(α_i(θ_0)) ] }.   (A9)

Because the differences Ψ(a_i(θ_0)) − Ψ(α_i(θ_0)) are typically small, we can further approximate the term in (A9) to

∆θ ≈ Σ_i α_i'(θ_0) [ Ψ(a_i(θ_0)) − Ψ(α_i(θ_0)) ] / Σ_i α_i'(θ_0)² I(α_i(θ_0)).   (A10)
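The quality of this first-order bias approximation can be checked numerically. The sketch below takes a 2PL model as the true IRFs and a misspecified 1PL (all slopes fixed at 1, so the second derivatives of the α_i vanish) as the fitted model, solves the expected misspecified score equation for θ_1 by bisection, and compares ∆θ = θ_1 − θ_0 with the closed-form approximation; all item parameters are fabricated:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def info(x):
    # I(x) = Psi(x) * (1 - Psi(x)) for the logistic link
    return sigmoid(x) * (1.0 - sigmoid(x))

# True IRFs: 2PL, a_i(theta) = slope_i * (theta - diff_i); fabricated item parameters.
slopes = [0.8, 1.2, 1.5, 0.6, 1.0]
diffs = [-1.0, 0.0, 0.5, 1.0, -0.5]
theta0 = 0.3

def true_p(i, th):
    return sigmoid(slopes[i] * (th - diffs[i]))

def mis_a(i, th):
    return th - diffs[i]  # misspecified IRFs: 1PL, all slopes fixed at 1

def exp_score(th):
    # Expected misspecified score: sum_i alpha_i' * (P_i(theta0) - Psi(alpha_i(th)))
    return sum(true_p(i, theta0) - sigmoid(mis_a(i, th)) for i in range(5))

# Solve E[l*'(theta1)] = 0 by bisection (exp_score is decreasing in theta).
lo, hi = -5.0, 5.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if exp_score(mid) > 0.0:
        lo = mid
    else:
        hi = mid
theta1 = 0.5 * (lo + hi)

# First-order approximation of the bias (alpha_i'' = 0 for the 1PL).
num = sum(true_p(i, theta0) - sigmoid(mis_a(i, theta0)) for i in range(5))
den = sum(info(mis_a(i, theta0)) for i in range(5))
approx = num / den
```

For mildly misspecified slopes like these, the exact bias `theta1 - theta0` and the approximation `approx` agree closely, as the Taylor argument predicts.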

Appendix B. Country Labels for PISA 2009 Study
The following country labels were used in the PISA 2009 study.

Appendix C. Additional Results for PISA 2009 Mathematics and Science
In Table A3, results and model uncertainty of the 11 different scaling models for country means and standard deviations in PISA 2009 mathematics are shown. In Table A4, results and model uncertainty of the 11 different scaling models for country 10th and 90th percentiles in PISA 2009 mathematics are shown. In Table A5, results and model uncertainty of the 11 different scaling models for country means and standard deviations in PISA 2009 science are shown. In Table A6, results and model uncertainty of the 11 different scaling models for country 10th and 90th percentiles in PISA 2009 science are shown.
Note (Tables A1 and A2). CNT = country label (see Appendix B); M = weighted mean across the different scaling models; rg = range of estimates across models; ME_bc = bias-corrected estimate of the model error based on balanced half sampling (see Equation (23)). For model descriptions, see Section 2.1 and Equations (3) to (14). Country means that differ from the weighted mean of the country means of the 11 different models by more than 1 are printed in bold.
Note (Tables A3 to A6). CNT = country label (see Appendix B); N = sample size; M = weighted mean across the different scaling models; rg = range of estimates across models; SE = standard error (computed with balanced half sampling); ME = estimated model error (see Equation (20)); ME_bc = bias-corrected estimate of the model error based on balanced half sampling (see Equation (23)); ER = error ratio defined as ME_bc/SE; TE = total error computed as TE = √(SE² + ME_bc²) (see Equation (24)).