Exploring the Multiverse of Analytical Decisions in Scaling Educational Large-Scale Assessment Data: A Specification Curve Analysis for PISA 2018 Mathematics Data

In educational large-scale assessment (LSA) studies such as PISA, item response theory (IRT) scaling models summarize students' performance on cognitive test items across countries. This article investigates the impact of different factors in model specifications for the PISA 2018 mathematics study. The diverse options in model specification are also known under the labels multiverse analysis or specification curve analysis in the social sciences. In this article, we investigate the following five factors of model specification in the PISA scaling model for obtaining the two country distribution parameters: country means and country standard deviations: (1) the choice of the functional form of the IRT model, (2) the treatment of differential item functioning at the country level, (3) the treatment of missing item responses, (4) the impact of item selection in the PISA test, and (5) the impact of test position effects. In our multiverse analysis, it turned out that model uncertainty had almost the same impact on variability in the country means as sampling errors due to the sampling of students. For country standard deviations, model uncertainty had an even larger impact than standard errors. Overall, each of the five specification factors in the multiverse analysis had at least a moderate effect on either country means or standard deviations. In the discussion section, we critically evaluate the current practice of model specification decisions in LSA studies. It is argued that one should either report the variability due to model uncertainty or choose the particular model specification that is deemed most valid. It is emphasized that model fit should not play a role in selecting a scaling strategy for LSA applications.


Introduction
Item response theory (IRT) models [1,2] are central to analyzing item response datasets that emerge in educational large-scale assessment (LSA; [3]) studies such as the Programme for International Student Assessment (PISA; [4]), the Programme for the International Assessment of Adult Competencies (PIAAC; [5]), or the Trends in International Mathematics and Science Study (TIMSS; [6]). IRT models provide a unidimensional summary of the performance of students on test items in different cognitive test domains. The process of extracting a single summary variable from multivariate item responses is labeled scaling in LSA.
Interestingly, there is no consensus on which IRT modeling approach should be employed in LSA studies [6][7][8]. This article simultaneously and systematically analyzes the impact of analytical decisions in the scaling model in LSA studies. We use the PISA 2018 mathematics dataset [9] as an example. We follow an approach that integrates results from multiple models because findings from a single model chosen by a particular criterion might not be scientifically sound [10,11]. Moreover, because LSA studies are primarily policy-relevant and less relevant for research, it is vital to investigate whether particular findings are robust regarding different modeling assumptions.
The statistical theory of model uncertainty (or multi-model inference) quantifies the variability in statistical parameters of interest that can be traced back to different model specifications [12][13][14][15]. At its core, the parameter of interest is estimated as a weighted (or unweighted) average of results from multiple models [16][17][18][19][20][21][22]. Many applications can be found in climate research in which researchers have to deal with uncertainty in assumptions about their substantive models [23,24]. This uncertainty is reflected in the variability of findings obtained from different models [25]. A simple example might be reporting uncertainty in weather forecasting of temperature three days or one week ahead.
In the social sciences, the diverse possibilities of model specification have been addressed with the concepts of multiverse analysis [26–28] and specification curve analysis [29,30]. The main idea is to study the variability of findings under a set of plausible modeling alternatives. This variability should also be reported as an integral part of statistical inference.
In this article, we investigate five important analytical decisions for the scaling model of educational LSA data. First, we consider the choice of the functional form of the IRT model. This choice defines the weighting of each item in the unidimensional summary ability variable [31]. Second, we investigate the treatment of differential item functioning at the country level in the scaling models. Different treatments effectively define, at the country level, which items are used for linking a country to an international reference value [32]. Third, the impact of different treatments of missing item responses is investigated. In LSA studies, it is occasionally recommended not to score all missing items as incorrect because missingness might reflect low motivation, which should not be part of the ability variable [33]. Fourth, we discuss the impact of the choice of particular items in the test on the findings. It has been shown that results at the country level can depend on the selected items [34]. Fifth, we investigate the impact of test position effects. It has often been empirically shown that items administered at later test positions are more difficult than those presented at earlier test positions. Critically, the impact of test positions also varies across countries, which illustrates the dependence of country comparisons on the choice of a particular test design [35].
The rest of the article is structured as follows. In Section 2, we discuss the dataset, the different factors in our multiverse analysis, and the analysis strategy. Section 3 presents the results for the PISA 2018 mathematics dataset. Finally, the paper closes with a discussion in Section 4.

Data
The mathematics test in PISA 2018 [9] was used to conduct the multiverse analysis. We included 45 countries that received the PISA test in a computer-based test administration. These countries did not receive test booklets with lower-difficulty items that were specifically targeted at low-performing countries.
In total, 72 test booklets were administered in the computer-based assessment in PISA 2018 [9]. Each test booklet was compiled from four clusters of items from the cognitive domains (i.e., mathematics, reading, and science). In our analysis, we selected test booklets that contained two clusters of mathematics items. As a consequence, students who received booklets 1 to 12 were selected. The clusters of mathematics items appeared either at the first and second positions (booklets 7 to 12) or at the third and fourth positions (booklets 1 to 6) in the test.
In total, 70 mathematics items were included in our multiverse analysis. In each of the 12 selected booklets, 22, 23 or 24 mathematics items were administered. Seven out of the seventy items were polytomous and were dichotomously recoded, with only the highest category being recoded as correct. In total, 27 out of 70 items had the complex multiple-choice (MC) format, and 43 items had the constructed-response (CR) format.
In our analysis, 167,092 students from 45 countries were included in the analysis. The sample sizes per country are presented in Table 1 (p. 8). The average sample size of students per country was M = 3713.2. The average number of students per item within each country ranged between 415.8 (MLT, Malta) and 4408.3 (ESP, Spain) and had an average of M = 1120.3.
The IRT scaling models were first fitted on an international calibration sample [36] consisting of N = 44,820 students (see Section 2.3). In each of the 45 countries, 996 students were randomly chosen for inclusion in this calibration sample. In a second step, all students within a country were used in the country-wise scaling models to obtain country means and standard deviations.

Analytical Choices in Specification Curve Analysis
In the following five subsections, the five model specification factors of our multiverse analysis are defined.

Functional Form of the Item Response Model (Factor "Model")
An IRT model is a statistical representation of the multivariate item response vector X = (X_1, . . . , X_I) that takes values in {0, 1}^I, where I denotes the number of items [37,38]. Hence, there are 2^I different item response patterns. The IRT model assumes the existence of a unidimensional latent variable θ such that the item responses X_i are conditionally independent given θ. Formally, the IRT model is defined as

P(X = x; γ) = ∫ ∏_{i=1}^{I} P_i(θ; γ_i)^{x_i} [1 − P_i(θ; γ_i)]^{1−x_i} f(θ) dθ ,

where f denotes the density of θ, the item response functions (IRF) are defined as P_i(θ; γ_i) = P(X_i = 1 | θ; γ_i), and γ_i denote item parameters. We define γ = (γ_1, . . . , γ_I). In principle, IRFs can be nonparametrically identified [39–42]. Notably, one can view the unidimensional IRT model as an approximation of a true multidimensional IRT model with (possibly strongly) correlated dimensions [43–46].

In our multiverse analysis, we specify three functional forms of the IRF. First, the one-parameter logistic (1PL) model (also referred to as the Rasch model; [47]) is defined as

1PL model: P_i(θ; γ_i) = Ψ(a(θ − b_i)) ,

where Ψ denotes the logistic distribution function, b_i is the item difficulty, and a is the common item discrimination parameter. Second, in the two-parameter logistic (2PL) model [48], the item discriminations are allowed to be item-specific:

2PL model: P_i(θ; γ_i) = Ψ(a_i(θ − b_i)) .

Third, the three-parameter model with residual heterogeneity (3PLRH) extends the 2PL model by including an item-specific asymmetry parameter δ_i that renders the IRF asymmetric [49,50]. The 3PLRH model has been successfully applied to LSA data and often resulted in superior model fit compared to the three-parameter logistic (3PL; [48]) model that includes a guessing parameter instead of an asymmetry parameter [51–54]. In this study, we did not include the 3PL model, even though the PISA test contains multiple-choice items, for two reasons. First, it has been argued that the guessing parameter in the 3PL model is not necessarily related to the probability of randomly guessing an item for students who do not attempt to solve an item based on their knowledge [55,56].
Alternative models might be preferable if the goal is to adequately adjust for guessing effects [55,57]. Second, in a previous study, we demonstrated that the 3PL model did not substantially improve model fit compared to the 2PL model [54]. In contrast, the 3PLRH model significantly improved model fit in terms of information criteria [54]. The 3PLRH model is able to account for guessing and slipping effects, as well as for asymmetry in item response functions [53].
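To make the three functional forms concrete, the following Python sketch implements the logistic IRFs. The 1PL and 2PL forms follow directly from the definitions above; the 3PLRH line is only one common parameterization of a residual heterogeneity model and is an assumption on our part (the operational parameterization may differ). It is shown mainly to illustrate that setting δ_i = 0 recovers the 2PL.

```python
import math

def psi(x):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))

def irf_1pl(theta, b, a=1.0):
    """1PL: common discrimination a, item difficulty b."""
    return psi(a * (theta - b))

def irf_2pl(theta, a, b):
    """2PL: item-specific discrimination a_i."""
    return psi(a * (theta - b))

def irf_3plrh(theta, a, b, delta):
    """3PLRH sketch: asymmetry via a theta-dependent residual scale
    (one possible parameterization of residual heterogeneity; the
    paper's exact form may differ). delta = 0 reduces to the 2PL."""
    return psi(a * (theta - b) * math.exp(-delta * theta / 2.0))

print(round(irf_2pl(0.0, 1.5, -0.5), 3))  # 0.679
# With delta = 0, the 3PLRH IRF coincides with the 2PL IRF.
```

The code is a didactic sketch only; operational scaling estimates these curves jointly with the ability distribution via marginal maximum likelihood.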
In total, the three IRT models, 1PL (factor level "1PL"), 2PL (factor level "2PL"), and 3PLRH (factor level "3PLRH"), are utilized in our multiverse analysis. The 1PL model was used in the PISA study until PISA 2012 [7], while the 2PL model has been employed since PISA 2015 [8,9]. To our knowledge, the 3PLRH model has not yet been implemented in the operational practice of any major educational LSA study. The choice of the IRT model in LSA studies has been investigated in [54,58–…].

Treatment of Differential Item Functioning (Factor "RMSD")
Educational LSA studies compare ability performances across multiple countries. In applications, IRFs are often not invariant across countries. That is, there may exist country-specific item parameters γ_ig for item i in country g [34]. This property is also labeled (country) differential item functioning (DIF; [62,63]). Some restrictions on the parameters must be imposed for identification. A popular identification assumption is the partial invariance (PI; [64,65]) model, in which most of the item parameters of an item i are assumed to be equal across countries, while they can differ from a common international item parameter γ_i for a few countries [5,66–70].
In the operational practice of scaling in educational LSA studies, for each item i and each country g, a decision is made whether the item parameters are fixed to a common international parameter or are freely estimated for that country. In practice, the computation of country means and country standard deviations relies only on the invariant items because the linking to the international metric is conducted only on those items. In the PIAAC [5] and PISA [8] studies, the root mean square deviation (RMSD) item fit statistic [70–72] is used, which is defined as

RMSD_ig = √( ∫ (P_ig(θ) − P_i(θ))² f_g(θ) dθ ) ,

where f_g is the density of the ability variable θ in country g, P_ig is the country-specific IRF, and P_i is the model-implied IRF based on the international item parameters. It has been shown that the RMSD statistic can be effectively used for detecting DIF [5]. Several studies have demonstrated that the RMSD statistic depends on the proportion of misfitting items and the sample size [73–75]. Moreover, the distribution of the RMSD statistic for a country depends on the average of uniform DIF effects (i.e., whether DIF is unbalanced or balanced; see [74]).
If the RMSD statistic exceeds a chosen cutoff value, an item is declared noninvariant because the country-specific IRF P_ig substantially deviates from the model-implied IRF P_i. In LSA studies, the cutoff of 0.12 is frequently chosen [5,76]. However, it has been pointed out in the literature that lower cutoff values should be selected to handle country DIF efficiently [72,77–79]. In our multiverse analysis, we explore the choice of the three RMSD cutoff values 1.00 (factor level "RMSD100"), 0.08 (factor level "RMSD008"), and 0.05 (factor level "RMSD005"). A rationale for this choice can be found in [78,79]. The cutoff of 1.00 means that all item parameters are assumed to be invariant because the RMSD statistic is always smaller than 1. The RMSD values were obtained from the 2PL scaling in which all item parameters were invariant across countries. In principle, the set of DIF items depends on the chosen IRT model. However, to disentangle the definition of DIF items from the other model specification factors in the multiverse analysis, we decided to keep the DIF item sets the same across specifications. Note that the PI approach is practically equivalent to a robust linking approach in which the impact of some items is downweighted (or entirely removed) for a particular country [75,78,80].
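The RMSD integral can be approximated by weighting squared IRF deviations with the country's ability density on a grid of θ values. A minimal Python sketch, assuming a normal ability density and illustrative item parameters:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def rmsd(p_country, p_international, mu_g=0.0, sigma_g=1.0,
         lo=-6.0, hi=6.0, n=601):
    """RMSD item fit statistic: square root of the f_g-weighted mean
    squared deviation between the country-specific IRF and the
    international IRF, approximated on an equally spaced theta grid."""
    step = (hi - lo) / (n - 1)
    num = 0.0
    den = 0.0
    for k in range(n):
        theta = lo + k * step
        w = normal_pdf(theta, mu_g, sigma_g)
        diff = p_country(theta) - p_international(theta)
        num += w * diff * diff
        den += w
    return math.sqrt(num / den)

def psi(x):
    return 1.0 / (1.0 + math.exp(-x))

# A uniform DIF shift in item difficulty yields a nonzero RMSD:
p_int = lambda t: psi(1.0 * (t - 0.2))   # international IRF
p_cnt = lambda t: psi(1.0 * (t - 0.7))   # item is harder in country g
print(round(rmsd(p_cnt, p_int), 3))
```

In operational practice, P_ig is estimated from observed country data rather than specified in closed form; the sketch only illustrates the definition of the statistic.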

Treatment of Missing Item Responses (Factor "Score0")
In LSA studies, students often do not respond to all administered items [81–87]. Two different types of missing item responses can be distinguished [88]. First, not reached items [89] are missing item responses at the end of a test booklet (or an item cluster). Second, omitted items are missing item responses within the test booklet (or an item cluster) that do not belong to the not reached items.
Until PISA 2012, all missing item responses were scored as incorrect. Since PISA 2015, not reached items have been treated as non-administered items (i.e., treated as "NA" in the scaling model), while omitted items are scored as incorrect. Several psychometricians argue that missing item responses should never be scored as incorrect [33,90–96], while others argue that the treatment of missing item responses is not an empirical question because it should be framed as an issue of scoring, not an issue of missing data modeling [45,88,97,98].
Likely, the treatment of missing item responses impacts country rankings if the proportion of missing item responses and the missingness mechanisms differ between countries [99]. Relatively large differences for some countries have been reported for the PISA study in [88].
In our multiverse analysis, we use three different scoring methods for the treatment of missing item responses. First, all missing item responses are scored as incorrect (factor level "S960"). Second, we scored omitted item responses as incorrect and treated not reached items as non-administered (factor level "S90"). Third, we treat omitted and not reached items as non-administered (factor level "S0"). We have to admit that other proposals in the literature [33,95] will typically lead to results that lie between those from the second and the third approach. However, our three specifications are helpful in deriving bounds for different possible missing data treatments.
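The three scoring rules can be illustrated with a small Python sketch; the response codes "O" (omitted) and "R" (not reached) are hypothetical labels chosen for this example, and `None` stands for a non-administered ("NA") item.

```python
def score_responses(resp, rule):
    """Apply one of the three missing-data treatments.
    'S960': all missings scored incorrect; 'S90': omitted scored
    incorrect, not reached treated as non-administered (None);
    'S0': all missings treated as non-administered.
    Rule labels follow the factor levels in the text."""
    out = []
    for x in resp:
        if x in (0, 1):
            out.append(x)
        elif x == "O":                      # omitted item
            out.append(0 if rule in ("S960", "S90") else None)
        elif x == "R":                      # not reached item
            out.append(0 if rule == "S960" else None)
        else:
            raise ValueError(f"unknown response code: {x!r}")
    return out

resp = [1, 0, "O", 1, "R", "R"]
print(score_responses(resp, "S960"))  # [1, 0, 0, 1, 0, 0]
print(score_responses(resp, "S90"))   # [1, 0, 0, 1, None, None]
print(score_responses(resp, "S0"))    # [1, 0, None, 1, None, None]
```

The example makes visible why the three rules bracket the plausible treatments: "S960" and "S0" are the extreme cases, and "S90" lies between them.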

Impact of Item Choice (Factor "Items")
It has been emphasized in generalizability theory that the choice of items should also be included as part of statistical inference, like the sampling of persons [100][101][102][103][104][105][106][107][108]. The uncertainty with respect to items has been quantified as linking errors for trend estimates [109][110][111]. However, a similar error can also be computed for cross-sectional country means [34,112,113]. The reason for the variability in country means with different item sets is the presence of country DIF. That is, performance differences between countries appear to be item-specific. Hence, the country mean is also influenced by the average of country DIF effects for a particular set of chosen items. The variability in country means and standard deviations due to the choice of items can be investigated by using subsamples of items in the multiverse analysis. The half sampling method is a particular subsampling method [80,114] that uses resampling based on half of the sample sizes for determining the variability in estimates. It has been shown that half sampling has superior statistical properties compared to the widely used jackknife method [109].
In our multiverse analysis, we use two item sets. First, we consider the full item set administered in the PISA 2018 mathematics assessment (factor level "All"). Second, we used half of the items in the test (factor level "Part"). In more detail, we used every second testlet (i.e., a group of items with a common item stimulus; see [115]). In the presence of country DIF, we expect that the estimated country means and standard deviations will differ in the two factor levels.
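Selecting every second testlet for the "Part" factor level can be sketched as follows; item and testlet labels are purely illustrative.

```python
def every_second_testlet(items, testlet_of):
    """Select the items belonging to every second testlet (factor
    level 'Part'). `testlet_of` maps an item label to its testlet
    label; all names here are illustrative."""
    testlets = sorted({testlet_of[i] for i in items})
    kept = set(testlets[::2])   # every second testlet
    return [i for i in items if testlet_of[i] in kept]

items = ["i1", "i2", "i3", "i4", "i5", "i6"]
testlet_of = {"i1": "T1", "i2": "T1", "i3": "T2",
              "i4": "T2", "i5": "T3", "i6": "T3"}
print(every_second_testlet(items, testlet_of))  # ['i1', 'i2', 'i5', 'i6']
```

Sampling whole testlets rather than single items respects the local dependence among items sharing a stimulus.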
We now formally derive the expected variability due to item choice for our two specifications. Let µ̂_0 be the country mean estimate based on the full item set with I items and µ̂_1 be the estimated country mean based on half of the items (i.e., I/2 items). The variances of µ̂_0 and µ̂_1 due to DIF effects are given by

Var(µ̂_0) = σ²_DIF / I  and  Var(µ̂_1) = 2σ²_DIF / I ,   (6)

respectively. The DIF variance is denoted by σ²_DIF = Var(e_ig) for DIF effects e_ig of item i in country g, and Var(µ̂_0) = σ²_DIF / I is the square of the cross-sectional linking error [112]. In a multiverse analysis, we average across all model specifications. We compute the composite mean µ̄ = (µ̂_0 + µ̂_1)/2 based on the two specifications. Then, we can evaluate the variance component associated with the factor "Items" as

V_Items = (1/2) [ (µ̂_0 − µ̄)² + (µ̂_1 − µ̄)² ] = (µ̂_0 − µ̂_1)² / 4 , with E(V_Items) = σ²_DIF / (4I) .   (7)

By comparing (7) with (6), we see that the variance associated with the factor item choice in our multiverse analysis is smaller than the error component Var(µ̂_0). The linking error is σ_DIF / √I, while the square root of the expected variance component in our multiverse analysis is given by σ_DIF / (2√I) (see Equation (7)). Because we report the square roots of variance components in the Results section, we have to multiply the result regarding the multiverse analysis factor "Items" by two to obtain the linking error. It can be shown that considering only half samples of items would result in an unbiased variance component [80,114]. However, in such an approach, the original scaling model that includes all items would not be part of the multiverse analysis, which might be considered a disadvantage.
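The derivation can be checked with a small Monte Carlo simulation: the square root of the variance component for the factor "Items" should approach σ_DIF / (2√I). All quantities below are simulated, not taken from the PISA data.

```python
import math
import random

random.seed(1)
I = 40                 # number of items (illustrative)
sigma_dif = 0.5        # SD of DIF effects e_ig (illustrative)
reps = 20000

sq_dev = 0.0
for _ in range(reps):
    e = [random.gauss(0.0, sigma_dif) for _ in range(I)]
    mu0 = sum(e) / I                     # full item set
    mu1 = sum(e[: I // 2]) / (I // 2)    # half item set
    bar = 0.5 * (mu0 + mu1)
    # between-specification variance of the two-level factor "Items"
    sq_dev += 0.5 * ((mu0 - bar) ** 2 + (mu1 - bar) ** 2)

srvc = math.sqrt(sq_dev / reps)
print(round(srvc, 4), round(sigma_dif / (2 * math.sqrt(I)), 4))
```

The two printed numbers agree up to Monte Carlo error, confirming that the "Items" factor understates the linking error by a factor of two.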

Impact of Position Effects (Factor "Pos")
The PISA test involves testing students with a test booklet that comprises two times 60 min of testing time. It is conceivable that students' test performance fluctuates in the course of a test. Most likely, performance declines will be observed during the test [116–120]. Items administered at later test positions will typically be more difficult than if they had been administered earlier in the test [121–123]. Moreover, position effects often differ between persons and, hence, across countries in LSA studies [124–128].
The investigation of position effects in LSA studies is often conducted by including additional latent variables [126,129,130]. In such an approach, the ability variable of interest is defined as the performance at the first test position [35,131–133]. If students only received items at the third or fourth test position, their abilities are adjusted and extrapolated to the first test position. Hence, the country means of the ability variable are model dependent.
Consequently, in our multiverse analysis, we study the impact of position effects in a design-based approach. We use three test specifications. First, we considered all students and items at all test positions (factor level "Pos1234"). Second, we used students and items at the first and second test positions in the scaling models (factor level "Pos12"). Third, we used all students and all items at the first test position (factor level "Pos1"). Obviously, the sample size was reduced in the second and the third specification. However, the definition of the ability variable is entirely defined by the test design and, in contrast to the approaches in the literature, is not dependent on a particular scaling model.

Analysis
In total, 3 (scaling models) × 3 (RMSD cutoff values) × 3 (missing data treatments) × 2 (item choices) × 3 (position effects) = 162 models were specified in our multiverse analysis. We declared the reference model to be the 2PL model with an RMSD cutoff value of 0.08, scoring only omitted items as incorrect (while treating not reached items as non-administered), using all items for scaling, and including students and items at all four test positions. This approach follows the one employed in PISA 2018 [9].
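The 162 specifications arise as the full factorial crossing of the five factors, which can be enumerated directly (factor labels taken from the preceding subsections):

```python
from itertools import product

factors = {
    "Model":  ["1PL", "2PL", "3PLRH"],
    "RMSD":   ["RMSD100", "RMSD008", "RMSD005"],
    "Score0": ["S960", "S90", "S0"],
    "Items":  ["All", "Part"],
    "Pos":    ["Pos1234", "Pos12", "Pos1"],
}

# Cartesian product over all factor levels: 3 * 3 * 3 * 2 * 3 = 162.
specs = [dict(zip(factors, combo)) for combo in product(*factors.values())]
print(len(specs))  # 162

# The reference specification described in the text:
reference = {"Model": "2PL", "RMSD": "RMSD008", "Score0": "S90",
             "Items": "All", "Pos": "Pos1234"}
print(reference in specs)  # True
```

Each entry of `specs` would then be passed to the scaling pipeline, with the outcome (country mean or standard deviation) stored alongside its factor levels.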
In each model specification, we scaled the international calibration sample of N = 44,820 students to obtain international item parameters. In the next step, the country mean and country standard deviation were obtained in a separate scaling model for each country, in which item parameters were fixed to the international item parameters from the first step, except for items whose RMSD values exceeded the pre-specified cutoff value. For the country-wise scaling models, student weights were used in marginal maximum likelihood estimation. To enable comparisons across the different model specifications, the ability distributions were linearly transformed such that the total population involving all students in all countries in our study has a mean of 500 and a standard deviation of 100. Following the official PISA approach, standard errors were computed based on the balanced repeated replication (BRR) method [9,114].
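The linear transformation to a mean of 500 and a standard deviation of 100 can be sketched as follows. The toy values are unweighted for brevity; in the actual analysis, student weights over the pooled international sample would enter the mean and variance.

```python
import math

def rescale_to_500_100(values, weights=None):
    """Linearly transform ability estimates so the pooled (weighted)
    distribution has mean 500 and standard deviation 100."""
    n = len(values)
    w = weights if weights is not None else [1.0] * n
    wsum = sum(w)
    mean = sum(wi * vi for wi, vi in zip(w, values)) / wsum
    var = sum(wi * (vi - mean) ** 2 for wi, vi in zip(w, values)) / wsum
    sd = math.sqrt(var)
    return [500.0 + 100.0 * (vi - mean) / sd for vi in values]

scaled = rescale_to_500_100([-1.0, 0.0, 1.0])
print([round(s, 1) for s in scaled])  # [377.5, 500.0, 622.5]
```

Because the same transformation is applied within each specification, country parameters become comparable across the 162 models.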
For each country, M = 162 distribution parameters γ̂_m (m = 1, . . . , M) for means and standard deviations are obtained in the multiverse analysis. These parameters are summarized in a multi-model inference [12]. A composite estimate γ̂_comp based on all model specifications is defined as the equally weighted average

γ̂_comp = (1/M) Σ_{m=1}^{M} γ̂_m .   (8)

Model uncertainty is quantified as the model error (ME), which is computed as the square root of the average squared parameter deviations (see [12,54]):

ME = √( (1/M) Σ_{m=1}^{M} (γ̂_m − γ̂_comp)² ) .   (9)

It is interesting to compare the influence of the model error (i.e., uncertainty due to different model specifications) with the uncertainty due to the sampling of students, which is reflected in the error ratio (ER; [54]). The error ratio is defined by

ER = ME / SE ,   (10)

where SE is the standard error of the composite estimate γ̂_comp. This standard error is also easily computed with the BRR method because the estimated model parameters for each model specification are available in each replication sample.
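The composite estimate, model error, and error ratio amount to a few lines of code. A sketch with toy numbers; the standard error is hypothetical here (in the study it is obtained via BRR):

```python
import math

def multi_model_summary(estimates, se):
    """Composite estimate (equally weighted mean over the M model
    specifications), model error ME (root mean squared deviation
    from the composite), and error ratio ER = ME / SE."""
    m = len(estimates)
    comp = sum(estimates) / m
    me = math.sqrt(sum((g - comp) ** 2 for g in estimates) / m)
    return comp, me, me / se

# Toy example: five country-mean estimates from five specifications,
# with a hypothetical sampling standard error of 2.5.
comp, me, er = multi_model_summary([503.0, 507.0, 505.0, 501.0, 509.0], 2.5)
print(round(comp, 2), round(me, 3), round(er, 3))  # 505.0 2.828 1.131
```

An ER above 1 indicates that model uncertainty exceeds sampling uncertainty, which is the comparison reported in the Results section.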
It should be noted that we weigh all models equally in the computation of the composite estimator (Equation (8)) and in the quantification of variability (Equation (9)). However, such a choice assumes that all model specifications are considered equally plausible, which has been criticized in the literature [54,134,135]. It might be more legitimate to downweight similar models and upweight models that provide very different results with respect to a target criterion [136–138]. Because, to our knowledge, almost all applications of multiverse and specification curve analysis have used equal weights, we also follow this strategy in this article.
Our multiverse analysis varies 5 model specification factors, each having 2 or 3 factor levels. To analyze the importance of each factor for the model outcomes, we specified a two-way analysis of variance (ANOVA) and computed the proportion of variance explained by each of the one-way and two-way factors (see also [139]). In a preliminary analysis, it turned out that no interactions of order higher than two were required because no nonnegligible amount of variance was explained by additional higher-order factors. For ease of comparability with standard errors due to the sampling of students, we report the square root of the variance component (SRVC; i.e., a standard deviation) for each factor (see also [140,141]). Note that we computed the ANOVA model separately for each country and averaged the variance components across countries before taking the square root to obtain the standard deviations for each factor.
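The variance decomposition can be illustrated in deliberately simplified form: for a single country, the main-effect variance component of a factor is the variance of the mean outcome across that factor's levels. This toy sketch ignores the two-way interaction terms and the cross-country averaging of the full analysis, and all outcome values are invented for illustration.

```python
from collections import defaultdict

def main_effect_srvc(outcomes, factor):
    """Square root of the main-effect variance component for one
    factor: the variance, over factor levels, of the mean outcome
    per level. A simplified stand-in for the two-way ANOVA
    decomposition described in the text."""
    by_level = defaultdict(list)
    for spec, y in outcomes:
        by_level[spec[factor]].append(y)
    level_means = [sum(v) / len(v) for v in by_level.values()]
    grand = sum(level_means) / len(level_means)
    var = sum((m - grand) ** 2 for m in level_means) / len(level_means)
    return var ** 0.5

# Toy outcomes: the "Pos" factor shifts the mean, "Model" does not.
outcomes = [
    ({"Model": "1PL", "Pos": "Pos1"}, 500.0),
    ({"Model": "2PL", "Pos": "Pos1"}, 500.0),
    ({"Model": "1PL", "Pos": "Pos1234"}, 504.0),
    ({"Model": "2PL", "Pos": "Pos1234"}, 504.0),
]
print(main_effect_srvc(outcomes, "Pos"))    # 2.0
print(main_effect_srvc(outcomes, "Model"))  # 0.0
```

A factor with no influence on the outcome yields an SRVC of zero, which is the pattern inspected in Tables 1 to 3.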
We used the statistical software R [142] for all computations. The R package TAM [143] was used for determining the RMSD statistic from the 2PL model, assuming international item parameters obtained from the calibration sample. The xxirt() function in the R package sirt [144] was used for estimating all scaling models. Graphical visualizations of the multiverse analyses were produced using the default plot from specification curve analysis [29] in the specr package [145].

The estimated item parameters are presented in Table A1 in Appendix B. The estimated common item discrimination a in the 1PL model was 1.273. The average of the item difficulties b_i was 0.43 (SD = 1.47). In the 2PL model, the item discriminations a_i had an average of 1.43 (SD = 0.54). The harmonic mean of the item discriminations was slightly lower at 1.32. The item difficulties b_i had a mean of 0.60 (SD = 1.73). Interestingly, the correlation between item discrimination and item difficulty in the 2PL model was relatively large with r = 0.60. The descriptive statistics of the estimated item parameters in the 3PLRH model are as follows: for item discriminations a_i: M = 1.00, harmonic mean of 0.93, SD = 0.38; for item difficulties b_i: M = 0.40, SD = 1.26; and for the asymmetry parameter δ_i: M = 0.31, SD = 0.78. As in the 2PL model, item discriminations and item difficulties were strongly correlated (r = 0.57), while the other two correlations were less substantial (r(a, δ) = 0.33; r(b, δ) = −0.12).

Results
In Table 1, the results of the ANOVA of the multiverse analysis for country means and country standard deviations in PISA 2018 are presented. Square roots of variance components (SRVC) of the factors are displayed in Table 1. For both the country mean and the standard deviation, it turned out that the position effect factor ("Pos") explained most of the total variance in the multiverse analysis. For the country mean, the next most important factors were the DIF treatment ("RMSD"), which is based on the chosen RMSD cutoff value, and the missing data handling ("Score0"). While the chosen IRT scaling model ("Model") had the least influence on country means, its impact on country standard deviations was much larger. The two-way interactions in the ANOVA model were less important. Hence, only square roots of variance components for main effects in the ANOVA are reported at the level of countries in the subsequent tables.
In Table 2, the results of the multiverse analysis of PISA 2018 mathematics for the country means are presented. Note. cnt = country label (see Appendix A); N = sample size; M = composite estimator for multi-model inference (see (8)); ME = model error (see (9)); ER = error ratio defined as ME/SE (see (10)).

In Figure 1, the country means for the four countries Austria (AUT), Spain (ESP), the Netherlands (NLD), and the USA are displayed as a function of the factors in the multiverse analysis. These four countries were intentionally chosen to illustrate that the factors in the multiverse analysis have country-specific impacts on the means. Country means that differ from the reference value by at least 0.5 times the standard deviation of the corresponding model are displayed as red or blue lines, respectively, depending on the direction of the deviation. We do not use confidence intervals for inference in Figure 1 because the estimates are strongly dependent across models, and model error is practically uncorrelated with sampling error. That is, model uncertainty constitutes an additional source of uncertainty that is, at least in large samples, unrelated to sampling uncertainty.
For Austria (AUT; ME = 2.97, ER = 0.93; upper left panel in Figure 1), Table 2 indicates that the position factor ("Pos"; SRVC = 1.50) and the RMSD cutoff ("RMSD"; SRVC = 2.34) were the most important factors for the country mean in the multiverse analysis. It can be seen that low country means are obtained for model specifications that involve "RMSD100". This specification corresponds to the scaling model in which all items were assumed to be invariant. In contrast, specifications with RMSD cutoff values of 0.08 ("RMSD008") or 0.05 ("RMSD005") resulted in higher country means for Austria. These specifications allow for some noninvariant items. Critically, the noninvariant items do not contribute to the linking of Austria to the common international metric, which possibly explains the difference between the factor levels of "RMSD". Moreover, if only students and items at the first test position ("Pos1") were included in the analysis, country means were lower on average compared with the overall mean of M = 509.7 across all model specifications in the multiverse analysis.
For Spain (ESP; ME = 1.91, ER = 0.93; upper right panel in Figure 1), position effects (SRVC = 1.40) were the most important factor. Model specifications that included all four test positions resulted in lower country means ("Pos1234") than those that included only the first ("Pos1") or the first and the second test position ("Pos12"). Interestingly, the lowest country mean was obtained if all items were used in combination with RMSD cutoff values of 0.08 and 0.05, resulting in an elimination of some items from linking for Spain.
For the Netherlands (NLD; ME = 3.50, ER = 1.29; lower left panel in Figure 1), the RMSD cutoff value for the treatment of DIF ("RMSD") had the largest impact (SRVC = 2.61), followed by test position ("Pos"; SRVC = 1.36) and missing data treatment ("Score0"; SRVC = 1.23). The country means for the Netherlands were lowest when the strictest RMSD cutoff value of 0.05 was applied ("RMSD005"). Moreover, if only the first ("Pos1") or the first and second ("Pos12") test positions were used in the analysis, country means in the different model specifications were larger on average than the country means based on all four test positions ("Pos1234"). Finally, country means were larger on average if all missing item responses were scored as incorrect (factor level "S960" for the factor "Score0").
For the USA (USA; ME = 0.90, ER = 0.90; lower right panel in Figure 1), the missing data treatment ("Score0") had the largest impact on country means (SRVC = 2.32). Country means were lower on average if all missing items were treated as non-administered ("S0"). In contrast, country means for the USA were larger if all missing items were scored as incorrect ("S960") or if only omitted items were scored as incorrect ("S90").

In Table 3, the results of the multiverse analysis of PISA 2018 mathematics for the country standard deviations σ are presented. The average model error (ME) across countries was 2.98 (SD = 1.13) and ranged between 1.27 (Spain, ESP) and 5.55 (the Netherlands, NLD). The error ratio (ER) for country standard deviations was 1.45 on average (SD = 0.50; Min = 0.74, Max = 3.05) and slightly larger than the ER for country means. This means that model uncertainty induced more variability in standard deviations than sampling uncertainty due to the sampling of students (see also findings in [54]). Note. cnt = country label (see Appendix A); N = sample size; M = composite estimator for multi-model inference (see (8)); ME = model error (see (9)); ER = error ratio defined as ME/SE (see (10)).

In Figure 2, the country standard deviations for the four countries Austria (AUT), Spain (ESP), the Netherlands (NLD), and the USA are displayed as a function of the factors in the multiverse analysis. The model errors for Austria (ME = 1.60) and Spain (ME = 1.27) were smaller than those for the Netherlands (ME = 5.55) and the USA (ME = 2.60).
The variability in standard deviations for the Netherlands (NLD; lower left panel in Figure 2) was particularly large (M = 90.2, Min = 78.7, Max = 101.5). Test position ("Pos"; SRVC = 3.75), choice of the IRT model ("Model"; SRVC = 2.96), and item choice ("Items"; SRVC = 1.80) had the largest impact. The country standard deviations computed on all four test positions ("Pos1234") were larger than those obtained from the first ("Pos1") or the first and the second ("Pos12") test positions. The standard deviations based on the 1PL model were larger on average than those obtained with the 2PL or the 3PLRH models.

Discussion
Our study illustrates that model uncertainty (i.e., model error) cannot be neglected in outcomes of educational LSA studies such as PISA. It was shown that model error was more pronounced in country standard deviations than in country means. Discussions about model specifications in the literature often focus on their influence on country means or country rankings. This might have created the false impression that particular modeling choices are less consequential.
It turned out that all five specification factors considered in our multiverse analysis had an impact on country means, country standard deviations, or both. Test position affected both the mean and the standard deviation. Interestingly, the DIF treatment and the missing item response treatment affected the country mean more than the standard deviation, while the choice of the IRT model strongly influenced the standard deviation (see also [54]).
Particular model specification choices differentially impact the mean or the standard deviation of a country. For example, the impact of different RMSD cutoff values depends on the proportion of DIF items in a country. Moreover, the missing item response treatment will mainly affect countries with relatively low or high missing proportions compared to the average proportion across all countries. We studied the model error and the error ratio for quantifying country-specific model uncertainty in our multiverse analysis.
If all model specifications are plausible, model uncertainty cannot be ignored and should be considered part of the statistical inference in country comparisons in educational LSA studies. By varying different model specifications, different assumptions about model generalization are made. This perspective was taken in a sampling model of validity [146,147].
In [45], we argued that the computation of statistics for the latent variable θ (i.e., the ability variable) should be mainly motivated by design-based considerations. We think that particular specification choices are preferable for the five considered factors in our multiverse analysis. We will discuss our preferences in the following.
First, regarding test position, we think that the test design should be defined a priori. We do not regard it as a threat to validity that country rankings can change depending on whether the first two or all test positions are used in an analysis. The ability computed in a longer test with 120 min of testing time represents a different test situation than in a test that involves only 60 min of testing time. A researcher must define how ability should be assessed. Some researchers argue that test position must be disentangled from performance decline that could be due to lower test motivation at later test positions [131]. We do not think that it is useful to define ability independent of test motivation. One could push this argument to the other extreme: average performance should be computed only on one administered item per student at the beginning of the test, because the performance on subsequently administered items also depends on test persistence.
Second, we think that the mechanistic inclusion of country-specific item parameters for DIF items based on certain RMSD cutoff values decreases validity, because country comparisons then effectively rely only on the items that are declared to be non-DIF items [45,79]. If substantial DIF is detected for an item, researchers must judge whether the DIF truly reflects a bias in measurement for a country. That is, it must be decided whether the DIF is construct-relevant or construct-irrelevant [32,63,78]. In the PISA studies until PISA 2012, DIF items were only removed from the analysis if technical reasons or explanations for the DIF were found [148,149]. Hence, DIF items retained international item parameters that were assumed to be invariant across countries, even though there was misfit in some countries. We argued elsewhere [45] that model misfit should be of no concern in LSA studies because all IRT models are intentionally misspecified. The model parameters in a selected IRT model receive their sole meaning from their definition in the likelihood function for deriving summaries of the multivariate item response dataset. Consequently, derived parameters such as country means and standard deviations can be unbiased even if the IRT model is grossly misspecified. Hence, conclusions in the literature that country means or standard deviations might be biased due to the presence of DIF [70,150] are misplaced.
Third, we believe that missing item responses should always be treated as incorrect in educational LSA studies [45,98]. Otherwise, countries can simply manipulate their performance by instructing students to omit items they do not know [88]. We are also unconvinced that response times are beneficial for obtaining more valid ability measures by downweighting very fast item responses (see [33] for such arguments). Moreover, proponents of model-based treatments of missing item responses assume that the probability of omitting an item depends on latent variables but not on the particular item itself (i.e., they pose a latent ignorability assumption; see [33,85]). It has been shown that this modeling assumption is refuted when assessed by model fit [88]. Interestingly, analyses for PISA have shown that the missingness of constructed-response items can be statistically traced back to the fact that students do not know the item. We are also less convinced of the scoring of not-reached items as non-administered, a practice in place since PISA 2015. We think that not-reached items should always be scored as incorrect, because ability should be defined by a student's performance on a test of fixed length, not on a test length chosen by the test taker.
Fourth, we have shown that the choice of items can impact country means and standard deviations. We think the uncertainty due to item choice should be included in statistical inference. For cross-sectional and trend estimates [45,112], this concept is labeled as linking error and can be determined simply by resampling items [54,80]. In this sense, all items should be included in a cross-sectional analysis. With a larger number of items representative of a larger item domain [151,152], the linking error will be smaller. The situation is a bit more intricate for trend estimation in LSA studies (i.e., the trend in country means for PISA mathematics between PISA 2015 and PISA 2018) if the item sets in the two studies differ. Typically, there will be link items that appear in both assessments and unique items that are administered in only one study. In this case, trend estimates computed only on link items might be more efficient than those computed on all items [112] if DIF between countries exists. If the same items are used for trend estimation, stable country DIF effects cancel out because only changes in item performance are effectively quantified in trend estimation. In contrast, the average of the DIF effects of unique items and of link items impacts trend estimates if all items are used in the analysis [45].
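The resampling idea behind the linking error can be sketched as follows. The per-item "effect" input and the bootstrap variant are assumptions for illustration; the article only states that linking errors can be determined by resampling items [54,80].

```python
import random
import statistics

def linking_error(item_effects, n_boot=1000, seed=42):
    """Bootstrap over items: resample items with replacement,
    recompute the country statistic (here: a simple mean over
    per-item effects), and take the spread of the replicates
    as the linking error."""
    rng = random.Random(seed)
    n = len(item_effects)
    replicates = []
    for _ in range(n_boot):
        sample = [item_effects[rng.randrange(n)] for _ in range(n)]
        replicates.append(sum(sample) / n)
    return statistics.pstdev(replicates)
```

With a larger pool of homogeneous items the resampled means vary less, which mirrors the point above that a larger and more representative item set shrinks the linking error.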
Fifth, the choice of the IRT model is crucial for defining the impact of items on the ability variable [31,45,54]. Until PISA 2012, the 1PL model was used, which weighs items equally in the ability variable. Since PISA 2015, the 2PL model has been utilized, which weighs items by the item discriminations estimated in the IRT model. We concur with Brennan ([153]; see also [154]) that it is questionable to let a statistical model decide how items should be weighed in the ability variable. The resulting weighing of items might contradict the intended test blueprint composition [31]. Some researchers argue that one should not fit IRT models more complex than the 2PL model, such as the three-parameter logistic (3PL) IRT model. They argue that at most two item parameters can be identified from multivariate data [75] and base their argument on a result of the Dutch identity of Holland [155]. However, Zhang and Stout [156] disproved this finding. Hence, using the 2PL model instead of the 3PL or the alternative 3PLRH model in LSA studies might be a personal preference rather than a choice based on model fit or validity reasons. In typical LSA datasets, item responses are multidimensional, and violations of local independence are likely found [157,158,159]. We argued above that the chosen unidimensional IRT model must not (and will typically not) hold (see also [160]). However, we have shown that, in terms of model fit, the 2PL model must be refuted in the PISA study [54].
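The difference between the model families can be made concrete through their item response functions. A minimal sketch follows; the 3PLRH residual-heterogeneity variant is omitted because its exact parameterization is not given in this section.

```python
import math

def irf(theta, b, a=1.0, c=0.0):
    """Probability of a correct response P(X = 1 | theta).
    a = 1, c = 0 -> 1PL (Rasch): all items weigh equally
    a free       -> 2PL: estimated discriminations a weigh the items
    c > 0        -> 3PL: adds a lower asymptote (guessing) parameter"""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

Under the 1PL model every item contributes with the same slope; under the 2PL model a highly discriminating item dominates the likelihood, which is exactly the model-driven weighing criticized above.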
Finally, we would like to emphasize our belief that decisions about model specifications in LSA studies cannot be primarily justified by research findings; rather, they are selected by purpose. We doubt that model fit should play a role in reaching a decision. It might be more honest to state that the model specifications of a particular test scaling contractor in LSA studies are part of its role as a player in the testing industry, and every company has its own brands (i.e., IRT models and model specifications). Choices are almost always made by convention and historical or recent preferences, but the underlying motivations should be transparently disclosed [161]. We doubt that discussions about analytical choices can be resolved by relying on empirical findings.

Acknowledgments: I would like to thank the academic editor, two anonymous reviewers and Ulrich Schroeders for helpful comments that helped to improve the paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Table A1 presents estimated item parameters for the international calibration sample, including 70 items, if all missing item responses were treated as incorrect. Note. 1PL = one-parameter logistic model; 2PL = two-parameter logistic model; 3PLRH = three-parameter logistic model with residual heterogeneity.