Article

Exploring the Multiverse of Analytical Decisions in Scaling Educational Large-Scale Assessment Data: A Specification Curve Analysis for PISA 2018 Mathematics Data

by Alexander Robitzsch 1,2,†
1 IPN—Leibniz Institute for Science and Mathematics Education, Olshausenstraße 62, 24118 Kiel, Germany
2 Centre for International Student Assessment (ZIB), Olshausenstraße 62, 24118 Kiel, Germany
† Current address: Olshausenstraße 62, 24118 Kiel, Germany.
Eur. J. Investig. Health Psychol. Educ. 2022, 12(7), 731-753; https://doi.org/10.3390/ejihpe12070054
Submission received: 28 May 2022 / Revised: 29 June 2022 / Accepted: 4 July 2022 / Published: 7 July 2022

Abstract

In educational large-scale assessment (LSA) studies such as PISA, item response theory (IRT) scaling models summarize students’ performance on cognitive test items across countries. This article investigates the impact of different model specification factors for the PISA 2018 mathematics study. The diverse options for specifying the scaling model are also discussed under the labels multiverse analysis or specification curve analysis in the social sciences. We investigate the following five specification factors of the PISA scaling model for obtaining the two country distribution parameters, the country mean and the country standard deviation: (1) the choice of the functional form of the IRT model, (2) the treatment of differential item functioning at the country level, (3) the treatment of missing item responses, (4) the impact of item selection in the PISA test, and (5) the impact of test position effects. In our multiverse analysis, model uncertainty had almost the same impact on the variability of country means as the sampling errors due to the sampling of students; for country standard deviations, model uncertainty had an even larger impact than standard errors. Overall, each of the five specification factors had at least a moderate effect on either country means or standard deviations. In the discussion section, we critically evaluate the current practice of model specification decisions in LSA studies. We argue that one should either report the variability due to model uncertainty or choose the particular model specification that is judged most valid. It is emphasized that model fit should not play a role in selecting a scaling strategy for LSA applications.

1. Introduction

Item response theory (IRT) models [1,2] are central to analyzing item response datasets that emerge in educational large-scale assessment (LSA; [3]) studies such as the Programme for International Student Assessment (PISA; [4]), the Programme for the International Assessment of Adult Competencies (PIAAC; [5]), or the Trends in International Mathematics and Science Study (TIMSS; [6]). IRT models provide a unidimensional summary of the performance of students on test items in different cognitive test domains. The process of extracting a single summary variable from multivariate item responses is labeled scaling in LSA.
Interestingly, there is no consensus on which IRT modeling approach should be employed in LSA studies [6,7,8]. This article simultaneously and systematically analyzes the impact of analytical decisions in the scaling model in LSA studies. We use the PISA 2018 mathematics dataset [9] as an example. We follow an approach that integrates results from multiple models because findings from a single model chosen by a particular criterion might not be scientifically sound [10,11]. Moreover, because LSA studies are primarily policy-relevant and less relevant for research, it is vital to investigate whether particular findings are robust regarding different modeling assumptions.
The statistical theory of model uncertainty (or multi-model inference) quantifies the variability in statistical parameters of interest that can be traced back to different model specifications [12,13,14,15]. At its core, the parameter of interest is estimated as a weighted (or unweighted) average of results from multiple models [16,17,18,19,20,21,22]. Many applications can be found in climate research in which researchers have to deal with uncertainty in assumptions about their substantive models [23,24]. This uncertainty is reflected in the variability of findings obtained from different models [25]. A simple example might be reporting uncertainty in weather forecasting of temperature three days or one week ahead.
In the social sciences, the diverse possibilities of model specifications have been addressed with the concepts of multiverse analysis [26,27,28] and specification curve analysis [29,30]. The main idea is to study the variability of findings under a set of plausible modeling alternatives. This variability should also be reported as an integral part of statistical inference.
In this article, we investigate five important analytical decisions for the scaling model in educational LSA data. First, we consider the choice of the functional form of the IRT model. This choice defines the weighing of each item in the unidimensional summary ability variable [31]. Second, we investigate the treatment of differential item functioning at the country level in the scaling models. Different treatments effectively define, at the country level, which items are used for linking a country to an international reference value [32]. Third, the impact of different treatments of missing item responses is investigated. In LSA studies, it is occasionally recommended not to score all missing items as incorrect because missingness might reflect low motivation, which should not be part of the ability variable [33]. Fourth, we discuss the impact of the choice of particular items in the test on the findings. It has been shown that results at the country level can depend on the selected items [34]. Fifth, we investigate the impact of test position effects. It has often been shown empirically that items administered at later test positions are more difficult than those presented at earlier test positions. Critically, the impact of test position also varies across countries, which illustrates the dependence of country comparisons on the choice of a particular test design [35].
The rest of the article is structured as follows. In Section 2, we discuss the dataset, the different factors in our multiverse analysis, and the analysis strategy. Section 3 presents the results for the PISA 2018 mathematics dataset. Finally, the paper closes with a discussion in Section 4.

2. Method

2.1. Data

The mathematics test in PISA 2018 [9] was used to conduct the multiverse analysis. We included 45 countries that received the PISA test in a computer-based test administration. These countries did not receive test booklets with lower-difficulty items that were specifically targeted at low-performing countries.
In total, 72 test booklets were administered in the computer-based assessment in PISA 2018 [9]. Test booklets were compiled from four item clusters, where each cluster contained items from a single ability domain (i.e., mathematics, reading, or science). In our analysis, we selected test booklets that contained two clusters of mathematics items. As a consequence, students from booklets 1 to 12 were selected. The clusters of mathematics items appeared either at the first and second positions (booklets 7 to 12) or at the third and fourth positions (booklets 1 to 6) of the test.
In total, 70 mathematics items were included in our multiverse analysis. In each of the 12 selected booklets, 22, 23 or 24 mathematics items were administered. Seven out of the seventy items were polytomous and were dichotomously recoded, with only the highest category being recoded as correct. In total, 27 out of 70 items had the complex multiple-choice (MC) format, and 43 items had the constructed-response (CR) format.
In total, 167,092 students from the 45 countries were included in our analysis. The sample sizes per country are presented in Table 1. The average sample size of students per country was M = 3713.2. The average number of students per item within each country ranged between 415.8 (MLT, Malta) and 4408.3 (ESP, Spain), with an average of M = 1120.3.
The IRT scaling models were first fitted on an international calibration sample [36] consisting of N = 44,820 students (see Section 2.3). In each of the 45 countries, 996 students were randomly chosen for inclusion in this calibration sample. In a second step, all students within a country were used in the country-wise scaling models to obtain country means and standard deviations.

2.2. Analytical Choices in Specification Curve Analysis

In the following five subsections, the five model specification factors of our multiverse analysis are defined.

2.2.1. Functional Form of the Item Response Model (Factor “Model”)

An IRT model is a representation of the multivariate item response vector $\boldsymbol{X} = (X_1, \ldots, X_I)$ that takes values in $\{0,1\}^I$, where $I$ denotes the number of items [37,38]. Hence, there are $2^I$ different item response patterns. The IRT model assumes the existence of a unidimensional latent variable $\theta$, and item responses $X_i$ are conditionally independent given $\theta$. Formally, the IRT model is defined as
$$P(\boldsymbol{X} = \boldsymbol{x}; \boldsymbol{\gamma}) = \int \prod_{i=1}^{I} P_i(\theta; \boldsymbol{\gamma}_i)^{x_i} \left[ 1 - P_i(\theta; \boldsymbol{\gamma}_i) \right]^{1-x_i} f(\theta)\, \mathrm{d}\theta \quad \text{for } \boldsymbol{x} = (x_1, \ldots, x_I), \qquad (1)$$
where the item response functions (IRF) are defined as $P_i(\theta; \boldsymbol{\gamma}_i) = P(X_i = 1 \mid \theta; \boldsymbol{\gamma}_i)$, and $\boldsymbol{\gamma}_i$ denotes the item parameters of item $i$; we define $\boldsymbol{\gamma} = (\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_I)$. In principle, IRFs can be nonparametrically identified [39,40,41,42]. Notably, one can view the unidimensional IRT model as an approximation of a true multidimensional IRT model with (possibly strongly) correlated dimensions [43,44,45,46].
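To make Equation (1) concrete, the following base R sketch evaluates the marginal probability of one response pattern under a logistic IRF and a standard normal ability density via numerical quadrature. It is only an illustration of the formula; the item parameters and the response pattern are hypothetical, and the code is not the operational PISA scaling software.

```r
# Illustration of Equation (1): marginal probability of a response pattern x.
# Item parameters (a, b) and the pattern are hypothetical.
irf <- function(theta, a, b) plogis(a * (theta - b))  # logistic item response function

marginal_prob <- function(x, a, b, nodes = seq(-6, 6, length.out = 121)) {
  w <- dnorm(nodes)
  w <- w / sum(w)                                     # discretized ability density f(theta)
  lik <- sapply(nodes, function(th) {                 # likelihood of pattern x at each node
    p <- irf(th, a, b)
    prod(p^x * (1 - p)^(1 - x))
  })
  sum(lik * w)                                        # numerical integration over theta
}

a <- c(1.0, 1.2, 0.8)
b <- c(-0.5, 0.3, 1.1)
marginal_prob(x = c(1, 1, 0), a = a, b = b)
```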
In our multiverse analysis, we specify three functional forms of the IRF. First, the one-parameter logistic (1PL) IRT model (also referred to as the Rasch model; [47]) is defined as
$$\text{1PL model:} \quad P_i(\theta; a, b_i) = \frac{1}{1 + \exp\left(-a(\theta - b_i)\right)}, \qquad (2)$$
where $b_i$ is the item difficulty and $a$ is the common item discrimination parameter. Second, in the two-parameter logistic (2PL) model [48], the item discriminations are allowed to be item-specific:
$$\text{2PL model:} \quad P_i(\theta; a_i, b_i) = \frac{1}{1 + \exp\left(-a_i(\theta - b_i)\right)}. \qquad (3)$$
Third, the three-parameter model with residual heterogeneity (3PLRH) extends the 2PL model by including an asymmetry parameter $\delta_i$ [49,50]:
$$\text{3PLRH model:} \quad P_i(\theta; a_i, b_i, \delta_i) = \frac{1}{1 + \exp\left(-\left[1 + \exp(\delta_i \theta)\right]^{-1/2} (a_i \theta + b_i)\right)}. \qquad (4)$$
The 3PLRH model has been successfully applied to LSA data and often resulted in a superior model fit compared to the three-parameter logistic model (3PL; [48]), which includes a guessing parameter instead of an asymmetry parameter [51,52,53,54]. In this study, we did not include the 3PL model for two reasons, even though the PISA test includes multiple-choice items. It has been argued that the guessing parameter in the 3PL model is not necessarily related to the probability of randomly guessing an item for students who do not attempt to solve an item based on their knowledge [55,56]. Alternative models might be preferable if the goal is to adjust adequately for guessing effects [55,57]. In a previous study, we demonstrated that the 3PL model did not substantially improve the model fit compared to the 2PL model [54]. In contrast, the 3PLRH model significantly improved the model fit in terms of information criteria [54]. The 3PLRH model is able to account for guessing and slipping effects, as well as for asymmetry in item response functions [53].
In total, the three IRT models, 1PL (factor level “1PL”), 2PL (factor level “2PL”), and 3PLRH (factor level “3PLRH”), are utilized in our multiverse analysis. The 1PL model was used in the PISA study until PISA 2012 [7], while the 2PL model has been employed since PISA 2015 [8,9]. To our knowledge, the 3PLRH has not yet been implemented in the operational practice of any important educational LSA study. The choice of the IRT model in LSA studies has been investigated in [54,58,59,60,61].
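The three functional forms can be written as short R functions, which may help to see how the asymmetry parameter modifies the 2PL shape. The parameter values below are hypothetical, and the 3PLRH implementation follows the parameterization of Equation (4) as reconstructed above.

```r
# Item response functions of Equations (2)-(4); parameter values are hypothetical.
irf_1pl <- function(theta, a, b) plogis(a * (theta - b))  # common discrimination a
irf_2pl <- function(theta, a, b) plogis(a * (theta - b))  # a is item-specific here
irf_3plrh <- function(theta, a, b, delta) {
  # the residual heterogeneity term (1 + exp(delta * theta))^(-1/2) induces asymmetry
  plogis((1 + exp(delta * theta))^(-1/2) * (a * theta + b))
}

theta <- seq(-4, 4, by = 0.1)
plot(theta, irf_1pl(theta, a = 1.273, b = 0.4), type = "l",
     xlab = expression(theta), ylab = "P(X = 1)", ylim = c(0, 1))
lines(theta, irf_2pl(theta, a = 1.8, b = 0.4), lty = 2)
lines(theta, irf_3plrh(theta, a = 1.4, b = -0.5, delta = 0.8), lty = 3)
legend("topleft", c("1PL", "2PL", "3PLRH"), lty = 1:3, bty = "n")
```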

2.2.2. Treatment of Differential Item Functioning Based on the RMSD Item Fit Statistic (Factor “RMSD”)

Educational LSA studies compare ability performances across multiple countries. In applications, IRFs are often not invariant across countries. That is, there could exist country-specific item parameters $\boldsymbol{\gamma}_{ig}$ for item $i$ in country $g$ [34]. This property is also labeled as (country) differential item functioning (DIF; [62,63]). Some restrictions on the parameters must be imposed for identification. A popular identification assumption is the partial invariance (PI; [64,65]) model, in which most of the item parameters of an item $i$ are assumed to be equal across countries, while they can differ from a common international item parameter $\boldsymbol{\gamma}_i$ for a few countries [5,66,67,68,69,70].
In the operational practice of scaling in educational LSA studies, a decision is made for each item $i$ and each country $g$ whether the item parameters are fixed to a common international parameter or freely estimated for that country. In practice, the computation of country means and country standard deviations relies only on the invariant items because the linking to the international metric is conducted only with those items. In the PIAAC [5] and PISA [8] studies, the root mean square deviation (RMSD) item fit statistic [70,71,72] is used, which is defined as
$$\mathrm{RMSD}_{ig} = \sqrt{ \int \left( P_{ig}(\theta) - P_i(\theta; \boldsymbol{\gamma}_i) \right)^2 f_g(\theta)\, \mathrm{d}\theta }\,, \qquad (5)$$
where $P_{ig}$ denotes the country-specific IRF of item $i$ in country $g$, and $f_g$ is the density of the ability variable $\theta$ in country $g$.
It has been shown that the RMSD statistic can be effectively used for detecting DIF [5]. Several studies have demonstrated that the RMSD statistic depends on the proportion of misfitting items and the sample size [73,74,75]. Moreover, the distribution of the RMSD statistic for a country depends on the average of uniform DIF effects (i.e., whether DIF is unbalanced or balanced; see [74]).
If the RMSD statistic exceeds a chosen cutoff value, an item is declared noninvariant because the country-specific IRF $P_{ig}$ substantially deviates from the model-implied IRF $P_i$. In LSA studies, the cutoff of 0.12 is frequently chosen [5,76]. However, it has been pointed out in the literature that lower cutoff values must be selected to handle country DIF efficiently [72,77,78,79]. In our multiverse analysis, we explore three RMSD cutoff values: 1.00 (factor level “RMSD100”), 0.08 (factor level “RMSD008”), and 0.05 (factor level “RMSD005”). A rationale for this choice can be found in [78,79]. The cutoff of 1.00 means that all item parameters are assumed to be invariant because the RMSD statistic is always smaller than 1. The RMSD values were obtained from the 2PL scaling in which all item parameters were invariant across countries. In principle, the choice of DIF items will depend on the chosen IRT model. However, to disentangle the definition of DIF items from the other model specification factors in the multiverse analysis, we decided to keep the DIF item sets the same across specifications. Note that the PI approach is practically equivalent to a robust linking approach in which the impact of some items is downweighted (or entirely removed) for a particular country [75,78,80].
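As an illustration of Equation (5), the following base R sketch computes the RMSD statistic for a single item in a single country by numerically integrating the squared difference between a hypothetical country-specific IRF and the international IRF over a normal country ability density; the resulting value would then be compared against the chosen cutoff (e.g., 0.08 or 0.05).

```r
# Sketch of the RMSD item fit statistic in Equation (5); all inputs are hypothetical.
irf <- function(theta, a, b) plogis(a * (theta - b))

rmsd_item <- function(a_int, b_int, a_g, b_g, mu_g = 0, sd_g = 1) {
  integrand <- function(theta) {
    (irf(theta, a_g, b_g) - irf(theta, a_int, b_int))^2 * dnorm(theta, mu_g, sd_g)
  }
  sqrt(integrate(integrand, lower = -6, upper = 6)$value)
}

# Item with country DIF in the difficulty parameter
rmsd_item(a_int = 1.3, b_int = 0.2, a_g = 1.3, b_g = 0.5)
# The item would be flagged as noninvariant if this value exceeds the chosen cutoff.
```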

2.2.3. Treatment of Missing Item Responses (Factor “Score0”)

In LSA studies, students often do not respond to all administered items [81,82,83,84,85,86,87]. Two different types of missing item responses can be distinguished [88]. First, not-reached items [89] are missing item responses at the end of a test booklet (or an item cluster). Second, omitted items are missing item responses within the test booklet (or an item cluster) that are not not-reached items.
Until PISA 2012, all missing item responses were scored as incorrect. Since PISA 2015, not-reached items have been treated as non-administered (i.e., treated as “NA” in the scaling model), while omitted items are scored as incorrect. Several psychometricians argue that missing item responses should never be scored as incorrect [33,90,91,92,93,94,95,96], while others argue that the treatment of missing item responses is not an empirical question because it should be framed as an issue of scoring, not an issue of missing data modeling [45,88,97,98].
The choice of the treatment of missing item responses will likely impact country rankings if the proportion of missing item responses and the missingness mechanisms differ between countries [99]. Relatively large differences for some countries have been reported for the PISA study in [88].
In our multiverse analysis, we use three different scoring methods for the treatment of missing item responses. First, all missing item responses are scored as incorrect (factor level “S960”). Second, we scored omitted item responses as incorrect and treated not reached items as non-administered (factor level “S90”). Third, we treat omitted and not reached items as non-administered (factor level “S0”). We have to admit that other proposals in the literature [33,95] will typically lead to results that lie between those from the second and the third approach. However, our three specifications are helpful in deriving bounds for different possible missing data treatments.
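The three factor levels correspond to simple scoring rules applied to the raw response data. A minimal sketch follows, with a hypothetical coding of omitted and not-reached responses.

```r
# Sketch of the three missing data treatments; the response coding is hypothetical.
# 1 = correct, 0 = incorrect, "O" = omitted, "R" = not reached.
resp <- matrix(c(1, 0, "O", "R",
                 0, "O", 1, "R"), nrow = 2, byrow = TRUE)

score_missing <- function(resp, rule = c("S960", "S90", "S0")) {
  rule <- match.arg(rule)
  out <- matrix(NA_real_, nrow(resp), ncol(resp))
  out[resp == "1"] <- 1
  out[resp == "0"] <- 0
  if (rule == "S960") out[resp %in% c("O", "R")] <- 0  # all missing scored as incorrect
  if (rule == "S90")  out[resp == "O"] <- 0            # omitted incorrect, not reached NA
  # rule "S0": omitted and not reached remain NA (treated as non-administered)
  out
}

score_missing(resp, rule = "S90")
```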

2.2.4. Impact of Item Choice (Factor “Items”)

It has been emphasized in generalizability theory that the choice of items should also be included as part of statistical inference, like the sampling of persons [100,101,102,103,104,105,106,107,108]. The uncertainty with respect to items has been quantified as linking errors for trend estimates [109,110,111]. However, a similar error can also be computed for cross-sectional country means [34,112,113]. The reason for the variability in country means with different item sets is the presence of country DIF. That is, performance differences between countries appear to be item-specific. Hence, the country mean is also influenced by the average of country DIF effects for a particular set of chosen items. The variability in country means and standard deviations due to the choice of items can be investigated by using subsamples of items in the multiverse analysis. The half sampling method is a particular subsampling method [80,114] that uses resampling based on half of the sample sizes for determining the variability in estimates. It has been shown that half sampling has superior statistical properties compared to the widely used jackknife method [109].
In our multiverse analysis, we use two item sets. First, we consider the full item set administered in the PISA 2018 mathematics assessment (factor level “All”). Second, we used half of the items in the test (factor level “Part”). In more detail, we used every second testlet (i.e., a group of items with a common item stimulus; see [115]). In the presence of country DIF, we expect that the estimated country means and standard deviations will differ in the two factor levels.
We now formally derive the expected variability due to item choice for our two specifications. Let $\mu_0$ be the country mean estimate based on the full item set with $I$ items and $\mu_1$ be the estimated country mean based on half of the items (i.e., $I/2$ items). The variances of $\mu_0$ and $\mu_1$ due to DIF effects are given by
$$\mathrm{Var}(\mu_0) = \frac{\sigma_{\mathrm{DIF}}^2}{I} \quad \text{and} \quad \mathrm{Var}(\mu_1) = \frac{\sigma_{\mathrm{DIF}}^2}{I/2}\,, \qquad (6)$$
respectively. The DIF variance is denoted by $\sigma_{\mathrm{DIF}}^2 = \mathrm{Var}(e_{ig})$ for DIF effects $e_{ig}$ of item $i$ in country $g$, and $\mathrm{Var}(\mu_0) = \sigma_{\mathrm{DIF}}^2 / I$ is the square of the cross-sectional linking error [112]. In a multiverse analysis, we average across all model specifications. We compute the composite mean $\mu = (\mu_0 + \mu_1)/2$ based on the two specifications. Then, we can evaluate the total variance as
$$\mathrm{E}\left[ \tfrac{1}{2} (\mu_0 - \mu)^2 + \tfrac{1}{2} (\mu_1 - \mu)^2 \right] = \tfrac{1}{4}\, \mathrm{E}\left[ (\mu_0 - \mu_1)^2 \right] = \tfrac{1}{4}\, \mathrm{E}\left[ \left( -\frac{1}{I} \sum_{i=1}^{I/2} e_{ig} + \frac{1}{I} \sum_{i=I/2+1}^{I} e_{ig} \right)^{\!2}\, \right] = \frac{\sigma_{\mathrm{DIF}}^2}{4I}\,. \qquad (7)$$
By comparing (7) with (6), we see that the variance associated with the factor item choice in our multiverse analysis is smaller than the error component associated with $\mathrm{Var}(\mu_0)$. The linking error is $\sigma_{\mathrm{DIF}} / \sqrt{I}$, while the square root of the associated variance component in our multiverse analysis is given by $\sigma_{\mathrm{DIF}} / (2\sqrt{I})$ (see Equation (7)). Because we report the square roots of variance components in the Results section, we have to multiply the result for the multiverse analysis factor “Items” by two to obtain the linking error. It can be shown that considering only half samples of items would result in an unbiased variance component [80,114]. However, in such an approach, the original scaling model that includes all items would not be part of the multiverse analysis, which might be considered a disadvantage.
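The result in Equation (7) can be checked with a short simulation that draws DIF effects, forms the full-item and half-item country means, and averages the squared deviations around their composite; the DIF standard deviation and the number of replications below are hypothetical.

```r
# Simulation check of Equation (7): variance component due to item choice.
set.seed(1)
I <- 70            # number of items (as in the PISA 2018 mathematics test)
sigma_dif <- 0.3   # hypothetical DIF standard deviation
R <- 50000
res <- replicate(R, {
  e   <- rnorm(I, mean = 0, sd = sigma_dif)  # DIF effects e_ig
  mu0 <- mean(e)                             # country mean based on all I items
  mu1 <- mean(e[1:(I / 2)])                  # country mean based on half of the items
  mu  <- (mu0 + mu1) / 2                     # composite of the two specifications
  0.5 * (mu0 - mu)^2 + 0.5 * (mu1 - mu)^2
})
mean(res)               # empirical variance component
sigma_dif^2 / (4 * I)   # theoretical value sigma_DIF^2 / (4 I)
```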

2.2.5. Impact of Position Effects (Factor “Pos”)

In the PISA test, students work on a test booklet that comprises two parts of 60 min of testing time each. It is conceivable that students’ test performance fluctuates in the course of a test. Most likely, performance declines will be observed during the test [116,117,118,119,120]. Items administered at later test positions will typically be more difficult than if they had been administered earlier in the test [121,122,123]. Moreover, position effects often differ between persons and, hence, across countries in LSA studies [124,125,126,127,128].
The investigation of position effects in LSA studies is often conducted by including additional latent variables [126,129,130]. In such an approach, the ability variable of interest is defined as the performance at the first test position [35,131,132,133]. If students only received items at the third or fourth test position, the abilities of those students are adjusted and extrapolated to the first test position. Hence, the country means of the ability variable are model-dependent.
Consequently, in our multiverse analysis, we study the impact of position effects in a design-based approach. We use three test specifications. First, we considered all students and items at all test positions (factor level “Pos1234”). Second, we used students and items at the first and second test positions in the scaling models (factor level “Pos12”). Third, we used all students and all items at the first test position (factor level “Pos1”). Obviously, the sample size was reduced in the second and the third specification. However, the definition of the ability variable is entirely defined by the test design and, in contrast to the approaches in the literature, is not dependent on a particular scaling model.
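In this design-based approach, the three factor levels amount to subsetting the student-by-item data according to the test position at which the mathematics clusters were administered. A minimal sketch with a hypothetical long-format dataset:

```r
# Sketch of the design-based position factor; the data frame is hypothetical.
resp_long <- data.frame(
  student  = c(1, 1, 2, 2, 3, 3),
  item     = c("M1", "M2", "M1", "M3", "M2", "M3"),
  position = c(1, 2, 3, 4, 1, 2),   # test position at which the item was administered
  score    = c(1, 0, 1, 1, 0, 1)
)

subset_by_position <- function(dat, level = c("Pos1234", "Pos12", "Pos1")) {
  level <- match.arg(level)
  keep <- switch(level,
                 Pos1234 = 1:4,  # all test positions
                 Pos12   = 1:2,  # first and second test positions
                 Pos1    = 1)    # first test position only
  dat[dat$position %in% keep, ]
}

subset_by_position(resp_long, "Pos12")
```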

2.3. Analysis

In total, 3 (scaling models) × 3 (RMSD cutoff values) × 3 (missing data treatments) × 2 (item choices) × 3 (position specifications) = 162 models were specified in our multiverse analysis. The reference model was defined as the 2PL model with an RMSD cutoff value of 0.08 that scores only omitted items as incorrect (while treating not-reached items as non-administered), uses all items for scaling, and includes students and items at all four test positions. This specification follows the approach employed in PISA 2018 [9].
In each model specification, we first scaled the international calibration sample of N = 44,820 students to obtain international item parameters. In the next step, the country mean and country standard deviation were obtained in a separate scaling model for each country, in which the item parameters were fixed to the international item parameters from the first step, except for items whose RMSD values exceeded the pre-specified cutoff value. For the country-wise scaling models, student weights were used in marginal maximum likelihood estimation. To enable comparisons across the different model specifications, the ability distributions were linearly transformed such that the total population of all students in all countries in our study had a mean of 500 and a standard deviation of 100. In line with the official PISA approach, standard errors were computed with the balanced repeated replication (BRR) method [9,114].
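As an illustration of the final transformation step, the following sketch linearly rescales weighted ability values so that the pooled population has a mean of 500 and a standard deviation of 100; the ability values and student weights are simulated.

```r
# Sketch of the linear transformation to the reporting metric; inputs are simulated.
set.seed(3)
theta <- rnorm(1000, mean = 0.2, sd = 1.1)  # pooled ability values over all countries
w     <- runif(1000, min = 0.5, max = 2)    # student weights

m <- weighted.mean(theta, w)
s <- sqrt(weighted.mean((theta - m)^2, w))
theta_500 <- 500 + 100 * (theta - m) / s    # weighted mean 500 and SD 100

c(mean = weighted.mean(theta_500, w),
  sd   = sqrt(weighted.mean((theta_500 - 500)^2, w)))
```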
For each country, $M = 162$ estimates $\hat{\gamma}_m$ ($m = 1, \ldots, M$) of the distribution parameters (means and standard deviations) are obtained in the multiverse analysis. These parameters are summarized using multi-model inference [12]. A composite estimate $\hat{\gamma}_{\mathrm{comp}}$ based on all model specifications is defined as the equally weighted average
$$\hat{\gamma}_{\mathrm{comp}} = \frac{1}{M} \sum_{m=1}^{M} \hat{\gamma}_m\,. \qquad (8)$$
Model uncertainty is quantified by the model error (ME), which is computed as the square root of the average squared deviation of the parameter estimates from the composite (see [12,54]):
$$\mathrm{ME} = \sqrt{ \frac{1}{M} \sum_{m=1}^{M} \left( \hat{\gamma}_m - \hat{\gamma}_{\mathrm{comp}} \right)^2 }\,. \qquad (9)$$
It is interesting to compare the influence of the model error (i.e., uncertainty due to different model specifications) with the uncertainty due to the sampling of students; this comparison is reflected in the error ratio (ER; [54]). The error ratio is defined by
$$\mathrm{ER} = \frac{\mathrm{ME}}{\mathrm{SE}}\,, \qquad (10)$$
where SE is the standard error of the composite estimate $\hat{\gamma}_{\mathrm{comp}}$. This standard error is also easily computed with the BRR method because the estimated model parameters for each model specification are available in each replication sample.
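Given the vector of M = 162 estimates of a distribution parameter for one country and the BRR standard error of the composite, Equations (8)–(10) reduce to a few lines of R; the numerical values below are hypothetical.

```r
# Multi-model summary following Equations (8)-(10); estimates and SE are hypothetical.
set.seed(4)
gamma_hat <- rnorm(162, mean = 509.7, sd = 3)   # 162 country means of one country
se_comp   <- 3.20                               # BRR standard error of the composite

gamma_comp <- mean(gamma_hat)                   # composite estimate, Equation (8)
me <- sqrt(mean((gamma_hat - gamma_comp)^2))    # model error, Equation (9)
er <- me / se_comp                              # error ratio, Equation (10)
round(c(composite = gamma_comp, ME = me, ER = er), 2)
```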
It should be noted that we weigh all models equally in the computation of the composite estimator (Equation (8)) and the quantification of variability (Equation (9)). However, such a choice assumes that all model specifications are considered equally plausible, which has been criticized in the literature [54,134,135]. It might be more legitimate to downweight similar models and upweight models that provide very different results with respect to a target criterion [136,137,138]. Because, to our knowledge, almost all applications of multiverse and specification curve analyses have used equal weights, we also follow this strategy in this article.
Our multiverse analysis varies five model specification factors, each having two or three factor levels. To analyze the importance of each factor for the model outcomes, we specified an analysis of variance (ANOVA) with main effects and two-way interactions and computed the proportion of explained variance for each of these terms (see also [139]). In a preliminary analysis, it turned out that interactions of an order higher than two were not required because they did not explain a non-negligible amount of additional variance. For ease of comparability with standard errors due to the sampling of students, we report the square root of the variance component (SRVC; i.e., a standard deviation) for each factor (see also [140,141]). Note that we computed the ANOVA model separately for each country and averaged the variance components across countries before taking the square root to obtain the standard deviation for each factor.
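The factor-wise summary can be sketched with a standard ANOVA on the 162 estimates of a single country. The design below reproduces the five specification factors; the outcome values are simulated, and expressing each term’s explained variance share as a variance component is a simplified stand-in for the exact procedure used in the paper.

```r
# Simplified sketch of the ANOVA-based variance decomposition (SRVC) for one country.
design <- expand.grid(
  Model  = c("1PL", "2PL", "3PLRH"),
  RMSD   = c("RMSD100", "RMSD008", "RMSD005"),
  Score0 = c("S960", "S90", "S0"),
  Items  = c("All", "Part"),
  Pos    = c("Pos1234", "Pos12", "Pos1")
)                                              # 3 * 3 * 3 * 2 * 3 = 162 specifications
set.seed(5)
design$mu <- 510 + 2 * (design$RMSD != "RMSD100") - 1.5 * (design$Pos == "Pos1") +
  rnorm(nrow(design), sd = 0.5)                # hypothetical country means

fit  <- aov(mu ~ (Model + RMSD + Score0 + Items + Pos)^2, data = design)
tab  <- summary(fit)[[1]]
eta2 <- tab[, "Sum Sq"] / sum(tab[, "Sum Sq"]) # explained variance share per term
srvc <- sqrt(eta2 * var(design$mu))            # square root of the variance component
round(setNames(srvc, trimws(rownames(tab))), 2)
```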
We used the statistical software R [142] for all computations. The R package TAM [143] was used to determine the RMSD statistic from the 2PL model, assuming the international item parameters obtained from the calibration sample. The xxirt() function in the R package sirt [144] was used for estimating all scaling models. Graphical visualizations of the multiverse analyses were produced with the default specification curve plot [29] from the specr package [145].

3. Results

Table A1 in Appendix B presents the estimated international item parameters of the three IRT models. The estimated common item discrimination $a$ in the 1PL model was 1.273. The average of the item difficulties $b_i$ was 0.43 (SD = 1.47). In the 2PL model, the item discriminations $a_i$ had an average of 1.43 (SD = 0.54); the harmonic mean of the item discriminations was slightly lower at 1.32. The item difficulties $b_i$ had a mean of 0.60 (SD = 1.73). Interestingly, the correlation between the item discriminations and the item difficulties in the 2PL model was relatively large, with r = 0.60. The descriptive statistics of the estimated item parameters in the 3PLRH model are as follows: item discriminations $a_i$: M = 1.00 (harmonic mean = 0.93, SD = 0.38); item difficulties $b_i$: M = 0.40 (SD = 1.26); asymmetry parameters $\delta_i$: M = 0.31 (SD = 0.78). As in the 2PL model, item discriminations and item difficulties were strongly correlated (r = 0.57), while the other two correlations were less substantial (r(a, δ) = 0.33; r(b, δ) = −0.12).
In Table 1, the results of the ANOVA of the multiverse analysis for country means and country standard deviations in PISA 2018 are presented. Square roots of variance components (SRVC) of factors are displayed in Table 1.
For both the country mean and the country standard deviation, it turned out that the position effect factor (“Pos”) explains most of the total variance in the multiverse analysis. For the country mean, the next most important factors were the DIF treatment (“RMSD”) based on the chosen RMSD cutoff value and the missing data handling (“Score0”). While the chosen IRT scaling model (“Model”) had the least influence on country means, its impact on country standard deviations was much larger. The two-way interactions in the ANOVA model were less important. Hence, only square roots of variance components for main effects in the ANOVA are reported at the level of countries in the next tables.
In Table 2, the results of the multiverse analysis of PISA 2018 mathematics for the country means $\mu$ are presented. For example, Austria (AUT) had a country mean of 508.7 (SE = 3.20) in the reference scaling model. The country means for Austria in the 162 model specifications ranged between 503.6 and 514.8, with an average of M = 509.7. This variability is reflected in the computed model error of ME = 2.97. Hence, model uncertainty has almost the same importance as the sampling error, which is reflected in the error ratio ER = 0.93. Interestingly, most of the variability in Austria’s country means can be attributed to the DIF treatment based on different RMSD cutoff values (SRVC = 2.34), followed by position effects (SRVC = 1.50).
The variability in the country means across countries was very similar for the reference model (M = 500, SD = 33.37) and the composite estimator across models (M = 500, SD = 33.34). At the level of countries, the model error ranged between 1.22 (FIN) and 5.74 (BRN), with an average value of 3.05 (SD = 1.05). The distribution of the error ratio ER across countries indicated that model uncertainty was, on average, of similar importance as standard errors (M = 1.12), while it varied substantially across countries (SD = 0.47, Min = 0.51, Max = 5.74). These findings imply that there could be good reasons to include the component of model uncertainty in statistical inference.
In Figure 1, the country means for the four countries Austria (AUT), Spain (ESP), the Netherlands (NLD), and the USA are displayed as a function of the factors in the multiverse analysis. These four countries were intentionally chosen to illustrate that the factors in the multiverse analysis have country-specific impacts on the country means. Country means that differ from the reference value by at least 0.5 standard deviations of the corresponding model are displayed as red or blue lines, respectively. We do not use confidence intervals for inference in Figure 1 because the estimates are strongly dependent across models, and model error is practically uncorrelated with sampling error. That is, model uncertainty constitutes an additional source of uncertainty that is, at least in large sample sizes, unrelated to sampling uncertainty.
For Austria (AUT; ME = 2.97, ER = 0.93; upper left panel in Figure 1), Table 2 indicated that test position (“Pos”: SRVC = 1.50) and the RMSD cutoff (“RMSD”: SRVC = 2.34) were the most important factors for the country mean in the multiverse analysis. It can be seen that low country means are obtained for model specifications that involve “RMSD100”. This specification corresponds to the scaling model in which all items were assumed to be invariant. In contrast, specifications with RMSD cutoff values of 0.08 (“RMSD008”) or 0.05 (“RMSD005”) resulted in higher country means for Austria. These specifications allow for some noninvariant items. Critically, the noninvariant items do not contribute to the linking of Austria to the common international metric, which possibly explains the difference between the factor levels of “RMSD”. Moreover, if only students and items at the first test position (“Pos1”) were included in the analysis, country means were lower on average compared with the overall mean of M = 509.7 across all model specifications in the multiverse analysis.
For Spain (ESP; ME = 1.91, ER = 0.93; upper right panel in Figure 1), position effects (SRVC = 1.40) were the most important factor. Model specifications that included all four test positions resulted in lower country means (“Pos1234”) than those that included only the first (“Pos1”) or the first and the second test position (“Pos12”). Interestingly, the lowest country mean was obtained if all items were used in combination with RMSD cutoff values of 0.08 and 0.05, resulting in an elimination of some items from linking for Spain.
For the Netherlands (NLD; ME = 3.50, ER = 1.29; lower left panel in Figure 1), the RMSD cutoff value for the treatment of DIF (“RMSD”) had the largest impact (SRVC = 2.61), followed by test position (“Pos”; SRVC = 1.36) and the missing data treatment (“Score0”; SRVC = 1.23). The country means for the Netherlands were lowest when the strictest RMSD cutoff value of 0.05 was applied (“RMSD005”). Moreover, if only the first (“Pos1”) or the first and second (“Pos12”) test positions were used in the analysis, country means in the different model specifications were larger on average than the country means based on all four test positions (“Pos1234”). Finally, country means were larger on average if all missing item responses were scored as incorrect (factor level “S960” of the factor “Score0”).
For the USA (USA; ME = 0.90, ER = 0.90; lower right panel in Figure 1), the missing data treatment (“Score0”) had the largest impact on country means (SRVC = 2.32). Country means were lower on average if all missing items were scored as non-administered (“S0”). In contrast, country means for the USA were larger if all missing items were scored as incorrect (“S960”) or only omitted items were scored as incorrect (“S90”).
In Table 3, the results of the multiverse analysis of PISA 2018 mathematics for the country standard deviations $\sigma$ are presented. The average model error (ME) across countries was 2.98 (SD = 1.13) and ranged between 1.27 (Spain, ESP) and 5.55 (the Netherlands, NLD). The error ratio (ER) for country standard deviations was 1.45 on average (SD = 0.50; Min = 0.74, Max = 3.05) and thus slightly larger than the ER for country means. This means that model uncertainty induced more variability in standard deviations than sampling uncertainty due to the sampling of students (see also the findings in [54]).
In Figure 2, the country standard deviations for the four countries Austria (AUT), Spain (ESP), the Netherlands (NLD), and the USA are displayed as a function of the factors in the multiverse analysis. The model errors for Austria (ME = 1.60) and Spain (ME = 1.27) were smaller than those for the Netherlands (ME = 5.55) and the USA (ME = 2.60).
The variability in standard deviations for the Netherlands (NLD; lower left panel in Figure 2) was particularly large (M = 90.2, Min = 78.7, Max = 101.5). Test position (“Pos”; SRVC = 3.75), the choice of the IRT model (“Model”; SRVC = 2.96), and item choice (“Items”; SRVC = 1.80) had the largest impact. The country standard deviations computed on all four test positions (“Pos1234”) were larger than those obtained from the first (“Pos1”) or the first and the second (“Pos12”) test positions. The standard deviations based on the 1PL model were larger on average than those obtained with the 2PL or the 3PLRH models.

4. Discussion

Our study illustrates that model uncertainty (i.e., model error) cannot be neglected in the outcomes of educational LSA studies such as PISA. It was shown that model error was more pronounced in country standard deviations than in country means. Discussions about model specifications in the literature often focus on their influence on country means or country rankings. This might have led to the false impression that particular modeling choices are less consequential.
It turned out that all five specification factors considered in our multiverse analysis had an impact on country means, on standard deviations, or on both statistics. Test position affected both the mean and the standard deviation. Interestingly, the DIF treatment and the missing item response treatment affected the country mean more than the standard deviation. At the same time, the choice of the IRT model strongly influenced the standard deviation (see also [54]).
Particular model specification choices differentially impact the mean or the standard deviation of a country. For example, the impact of choosing different RMSD cutoff values depends on the proportion of DIF items in a country. Moreover, the missing item response treatment will mainly affect countries with relatively low or high missing proportions compared to the average proportion across all countries. We used the model error and the error ratio to quantify the country-specific model uncertainty in our multiverse analysis.
If all model specifications are plausible, model uncertainty cannot be ignored and should be considered part of the statistical inference in country comparisons in educational LSA studies. By varying different model specifications, different assumptions about model generalization are made. This perspective was taken in a sampling model of validity [146,147].
In [45], we argued that the computation of statistics for the latent variable θ (i.e., the ability variable) should be mainly motivated by design-based considerations. We think that particular specification choices are preferable for the five considered factors in our multiverse analysis. We will discuss our preferences in the following.
First, for the test position factor, we think that the test design should be defined a priori. We do not consider it a threat to validity that country rankings can change depending on whether the first two or all test positions are used in an analysis. The ability computed in a longer test of 120 min of testing time represents a different test situation than that in a test that involves only 60 min of testing time. A researcher must define how ability should be assessed. Some researchers argue that test position effects must be disentangled from performance decline that could be due to lower test motivation at later test positions [131]. We do not think that it is useful to define ability independently of test motivation. Taking the argument to the other extreme, one could argue that average performance should be computed only for one administered item per student at the beginning of the test because the performance on subsequently administered items also depends on test persistence.
Second, we think that the mechanistic inclusion of country-specific item parameters for DIF items based on certain RMSD cutoff values decreases validity because country comparisons then effectively rely only on the items declared to be non-DIF items [45,79]. If substantial DIF is detected for an item, researchers must judge whether the DIF truly reflects a bias in measurement for a country. That is, it must be decided whether the DIF is construct-relevant or construct-irrelevant [32,63,78]. In the PISA studies until PISA 2012, DIF items were only removed from the analysis if technical reasons or explanations for the DIF were found [148,149]. Hence, items showing DIF in some countries retained international item parameters that were assumed to be invariant across countries, although there was misfit in those countries. We argued elsewhere [45] that model misfit should be of no concern in LSA studies because all IRT models are intentionally misspecified. The model parameters in a selected IRT model receive their sole meaning from their definition in the likelihood function for deriving summaries of the multivariate item response dataset. Hence, item and model parameters such as country means and standard deviations can be unbiased even if the IRT model is grossly misspecified. Consequently, conclusions in the literature that country means or standard deviations might be biased due to the presence of DIF [70,150] are misplaced.
Third, we believe that missing item responses should always be treated as incorrect in educational LSA studies [45,98]. Otherwise, countries could simply manipulate their performance by instructing students to omit items they do not know [88]. We are also unconvinced that response times are beneficial for obtaining more valid ability measures by downweighting item responses given with very fast response times (see [33] for such arguments). Moreover, proponents of model-based treatments of missing item responses assume that the probability of omitting an item depends on latent variables but not on the particular item itself (i.e., they pose a latent ignorability assumption; see [33,85]). It has been shown that this modeling assumption must be refuted on the basis of model fit [88]. Interestingly, analyses for PISA have shown that the missingness of constructed-response items can be statistically traced back to the fact that students do not know the answer to the item. We are also less convinced by the scoring of not-reached items as non-administered that has been used since PISA 2015. We think that not-reached items should always be scored as incorrect because ability should be defined based on students’ performance on a test of fixed length, not of a length chosen by the test taker.
Fourth, we have shown that the choice of items can impact country means and standard deviations. We think the uncertainty due to item choice should be included in statistical inference. For cross-sectional and trend estimates [45,112], this concept is labeled as linking error and can simply be determined by resampling items [54,80]. In this sense, all items should be included in a cross-sectional analysis. With a larger number of representative items for a larger item domain [151,152], the linking error will be smaller. The situation is a bit more intricate for trend estimation in LSA studies (e.g., the trend in country means for PISA mathematics between PISA 2015 and PISA 2018) if the item sets in the two studies differ. Typically, there will be link items that appear in both assessments and unique items that are only administered in one study. In this case, trend estimates computed only on link items might be more efficient than those computed on all items [112] if DIF between countries exists. If the same items are used for trend estimation, stable country DIF effects are blocked because only changes in item performance are effectively quantified in trend estimation. In contrast, the average of the DIF effects of unique and link items impacts trend estimates if all items are used in the analysis [45].
Fifth, the choice of the IRT model is crucial because it defines the weight of each item in the ability variable [31,45,54]. Until PISA 2012, the 1PL model was used, which weighs all items equally in the ability variable. Since PISA 2015, the 2PL model has been utilized, which weighs items by the item discriminations estimated in the IRT model. We concur with Brennan ([153]; see also [154]) that it is questionable to let a statistical model decide how items should be weighed in the ability variable. The resulting weighing of items might contradict the intended test blueprint composition [31]. Some researchers argue that one should not fit IRT models more complex than the 2PL model, such as the three-parameter logistic (3PL) IRT model. They argue that at most two item parameters can be identified from multivariate data [75] and base their argument on a result on the Dutch identity by Holland [155]. However, Zhang and Stout [156] disproved this finding. Hence, using the 2PL model instead of the 3PL or the alternative 3PLRH model in LSA studies might be more a personal preference than a decision based on model fit or validity. In typical LSA datasets, item responses are multidimensional, and violations of local independence are likely to be found [157,158,159]. We argued above that the chosen unidimensional IRT model does not need to (and typically will not) hold (see also [160]). However, we have shown that, for reasons of model fit, the 2PL model must be refuted in the PISA study [54].
Finally, we would like to emphasize that, in our view, decisions about model specifications in LSA studies are not primarily justified by research findings but are selected by purpose. We doubt that model fit should play a role in reaching such a decision. It might be more honest to state that the model specifications of a particular test scaling contractor in LSA studies are part of its role as a player in the testing industry, and every company has its own brands (i.e., IRT models and model specifications). Choices are almost always made by convention and historical or recent preferences, but the underlying motivations should be transparently disclosed [161]. We doubt that discussions about analytical choices can be resolved by relying on empirical findings.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The PISA 2018 dataset is available from https://www.oecd.org/pisa/data/2018database/ (accessed on 16 May 2022).

Acknowledgments

I would like to thank the academic editor, two anonymous reviewers and Ulrich Schroeders for helpful comments that helped to improve the paper.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
1PL     one-parameter logistic model
2PL     two-parameter logistic model
3PL     three-parameter logistic model
3PLRH   three-parameter logistic model with residual heterogeneity
ANOVA   analysis of variance
DIF     differential item functioning
ER      error ratio
IRF     item response function
IRT     item response theory
LSA     large-scale assessment
ME      model error
MML     marginal maximum likelihood
PIAAC   Programme for the International Assessment of Adult Competencies
PISA    Programme for International Student Assessment
SE      standard error
SRVC    square root of variance component
TIMSS   Trends in International Mathematics and Science Study

Appendix A. Country Labels for PISA 2018 Mathematics Study

The country labels used in the tables of the Results Section 3 are as follows:
ALB = Albania; AUS = Australia; AUT = Austria; BEL = Belgium; BIH = Bosnia and Herzegovina; BLR = Belarus; BRN = Brunei Darussalam; CAN = Canada; CHE = Switzerland; CZE = Czech Republic; DEU = Germany; DNK = Denmark; ESP = Spain; EST = Estonia; FIN = Finland; FRA = France; GBR = United Kingdom; GRC = Greece; HKG = Hong Kong; HRV = Croatia; HUN = Hungary; IRL = Ireland; ISL = Iceland; ISR = Israel; ITA = Italy; JPN = Japan; KOR = Korea; LTU = Lithuania; LUX = Luxembourg; LVA = Latvia; MLT = Malta; MNE = Montenegro; MYS = Malaysia; NLD = Netherlands; NOR = Norway; NZL = New Zealand; POL = Poland; PRT = Portugal; RUS = Russian Federation; SGP = Singapore; SVK = Slovak Republic; SVN = Slovenia; SWE = Sweden; TUR = Turkey; USA = United States.

Appendix B. International Item Parameters for PISA 2018 Mathematics Study

Table A1 presents the estimated item parameters for the international calibration sample, including all 70 items, when all missing item responses were treated as incorrect.
Table A1. Estimated item parameters for the 1PL, 2PL, and the 3PLRH model.
                 1PL             2PL              3PLRH
Item          a      b_i     a_i     b_i     a_i     b_i     δ_i
CM033Q01S   1.273  −1.818   0.903  −1.615   0.656  −1.149   0.026
CM474Q01S   1.273  −0.951   0.924  −0.834   0.690  −0.668   0.763
DM155Q02C   1.273  −0.133   1.594  −0.113   1.045  −0.271   1.454
CM155Q01S   1.273  −0.999   1.482  −1.040   1.059  −0.864   1.142
DM155Q03C   1.273   2.259   1.357   2.318   0.962   1.643  −0.283
CM155Q04S   1.273  −0.176   0.995  −0.138   0.652  −0.220   1.091
CM411Q01S   1.273   0.072   1.683   0.119   1.090  −0.122   1.524
CM411Q02S   1.273   0.394   0.912   0.375   0.639   0.349  −0.832
CM803Q01S   1.273   1.482   1.918   1.824   1.282   1.205   0.626
CM442Q02S   1.273   1.223   1.940   1.528   1.296   0.971   0.769
DM462Q01C   1.273   3.612   1.413   3.726   1.010   2.623  −0.189
CM034Q01S   1.273   0.706   1.331   0.744   0.845   0.396   0.907
CM305Q01S   1.273   0.505   0.314   0.414   0.226   0.300  −0.155
CM496Q01S   1.273   0.240   1.500   0.287   1.025   0.125   0.506
CM496Q02S   1.273  −0.782   1.240  −0.771   0.881  −0.651   0.884
CM423Q01S   1.273  −1.489   0.833  −1.324   0.633  −0.974   0.393
CM192Q01S   1.273   0.541   1.428   0.601   0.991   0.459  −0.303
DM406Q01C   1.273   1.653   1.810   1.997   1.267   1.457  −0.824
DM406Q02C   1.273   2.575   2.595   3.802   1.865   2.743  −0.088
CM603Q01S   1.273   0.799   0.916   0.746   0.658   0.569  −0.416
CM571Q01S   1.273   0.374   1.376   0.416   0.955   0.342  −0.395
CM564Q01S   1.273   0.194   0.737   0.184   0.489   0.219  −0.988
CM564Q02S   1.273   0.275   0.718   0.253   0.455   0.295  −1.489
CM447Q01S   1.273  −0.638   1.440  −0.653   0.979  −0.392  −0.554
CM273Q01S   1.273   0.379   0.997   0.364   0.700   0.259  −0.036
CM408Q01S   1.273   0.885   1.290   0.921   0.850   0.557   0.680
CM420Q01S   1.273   0.118   1.041   0.125   0.715   0.023   0.481
CM446Q01S   1.273  −0.779   1.775  −0.886   1.264  −0.728   0.678
DM446Q02C   1.273   3.121   2.280   4.190   1.595   3.060   0.544
CM559Q01S   1.273  −0.458   0.876  −0.401   0.591  −0.241  −0.371
DM828Q02C   1.273  −0.498   1.082  −0.459   0.755  −0.446   1.053
CM828Q03S   1.273   1.154   1.271   1.185   0.768   0.699   1.038
CM464Q01S   1.273   1.545   2.006   2.001   1.389   1.379   0.280
CM800Q01S   1.273  −2.329   0.639  −1.988   0.711  −1.450   1.417
CM982Q01S   1.273  −2.075   0.922  −1.889   0.829  −1.407   1.387
CM982Q02S   1.273   0.995   0.977   0.912   0.603   0.552   0.725
CM982Q03S   1.273  −0.718   1.082  −0.673   0.772  −0.514   0.272
CM982Q04S   1.273   0.188   1.463   0.219   1.007   0.206  −0.426
CM992Q01S   1.273  −1.188   1.207  −1.164   0.792  −0.759  −0.530
CM992Q02S   1.273   2.333   1.846   2.779   1.291   1.961  −0.064
DM992Q03C   1.273   3.310   2.817   5.055   2.141   3.942   0.802
CM915Q01S   1.273   0.548   0.938   0.499   0.654   0.426  −0.718
CM915Q02S   1.273  −0.976   1.215  −0.956   0.889  −0.819   1.427
CM906Q01S   1.273  −0.485   1.233  −0.470   0.830  −0.283  −0.391
DM906Q02C   1.273   0.888   1.824   1.086   1.201   0.598   1.370
DM00KQ02C   1.273   2.551   1.166   2.464   0.883   1.763  −0.426
CM909Q01S   1.273  −2.383   1.710  −2.707   1.263  −1.941   0.322
CM909Q02S   1.273  −0.429   1.595  −0.455   1.110  −0.266  −0.520
CM909Q03S   1.273   1.024   2.379   1.445   1.677   0.927   0.760
CM949Q01S   1.273  −1.072   1.639  −1.183   1.177  −0.899   0.418
CM949Q02S   1.273   0.876   1.353   0.905   0.951   0.682  −0.447
DM949Q03C   1.273   1.093   1.456   1.160   1.000   0.785   0.177
CM00GQ01S   1.273   3.207   1.839   3.700   1.310   2.582  −0.430
DM955Q01C   1.273  −1.083   0.977  −0.978   0.735  −0.785   1.012
DM955Q02C   1.273   0.914   1.414   0.961   0.957   0.621   0.349
CM955Q03S   1.273   2.982   2.255   3.876   1.543   2.809   0.818
DM998Q02C   1.273  −0.854   1.185  −0.817   0.857  −0.655   0.614
CM998Q04S   1.273   0.690   0.236   0.529   0.264   0.414  −1.939
CM905Q01S   1.273  −1.436   1.020  −1.300   0.709  −0.908  −0.123
DM905Q02C   1.273   0.611   1.965   0.778   1.335   0.413   0.865
CM919Q01S   1.273  −1.781   1.672  −1.980   1.250  −1.490   1.185
CM919Q02S   1.273   0.391   1.106   0.384   0.654   0.110   1.327
CM954Q01S   1.273  −0.966   2.022  −1.177   1.456  −0.901   0.343
DM954Q02C   1.273   0.947   1.636   1.066   1.096   0.668   0.508
CM954Q04S   1.273   1.406   2.065   1.782   1.305   1.070   2.059
CM943Q01S   1.273  −0.053   0.855  −0.029   0.559   0.074  −0.930
CM943Q02S   1.273   3.979   2.474   5.277   1.723   3.909   0.478
DM953Q02C   1.273   0.690   1.435   0.735   0.982   0.469   0.273
CM953Q03S   1.273   0.052   2.007   0.098   1.394  −0.060   0.760
DM953Q04C   1.273   2.727   2.707   3.968   1.882   2.894   1.052
Note. 1PL = one-parameter logistic model; 2PL = two-parameter logistic model; 3PLRH = three-parameter logistic model with residual heterogeneity.

References

  1. Holland, P.W. On the sampling theory foundations of item response theory models. Psychometrika 1990, 55, 577–601. [Google Scholar] [CrossRef]
  2. Van der Linden, W.J.; Hambleton, R.K. (Eds.) Handbook of Modern Item Response Theory; Springer: New York, NY, USA, 1997. [Google Scholar] [CrossRef]
  3. Rutkowski, L.; von Davier, M.; Rutkowski, D. (Eds.) A Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis; Chapman Hall/CRC Press: London, UK, 2013. [Google Scholar] [CrossRef]
  4. OECD. PISA 2009; Technical Report; OECD: Paris, France, 2012; Available online: https://bit.ly/3xfxdwD (accessed on 28 May 2022).
  5. Yamamoto, K.; Khorramdel, L.; von Davier, M. Scaling PIAAC cognitive data. In Technical Report of the Survey of Adult Skills (PIAAC); OECD, Ed.; OECD Publishing: Paris, France, 2013; pp. 408–440. Available online: https://bit.ly/32Y1TVt (accessed on 28 May 2022).
  6. Foy, P.; Yin, L. Scaling the TIMSS 2015 achievement data. In Methods and Procedures in TIMSS 2015; Martin, M.O., Mullis, I.V., Hooper, M., Eds.; IEA: Boston, MA, USA, 2016. [Google Scholar]
  7. OECD. PISA 2012; Technical Report; OECD: Paris, France, 2014; Available online: https://bit.ly/2YLG24g. (accessed on 28 May 2022).
  8. OECD. PISA 2015; Technical Report; OECD: Paris, France, 2017; Available online: https://bit.ly/32buWnZ (accessed on 28 May 2022).
  9. OECD. PISA 2018; Technical Report; OECD: Paris, France, 2020; Available online: https://bit.ly/3zWbidA (accessed on 28 May 2022).
  10. Longford, N.T. An alternative to model selection in ordinary regression. Stat. Comput. 2003, 13, 67–80. [Google Scholar] [CrossRef]
  11. Longford, N.T. ’Which model?’ is the wrong question. Stat. Neerl. 2012, 66, 237–252. [Google Scholar] [CrossRef]
  12. Buckland, S.T.; Burnham, K.P.; Augustin, N.H. Model selection: An integral part of inference. Biometrics 1997, 53, 603–618. [Google Scholar] [CrossRef]
  13. Burnham, K.P.; Anderson, D.R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach; Springer: New York, NY, USA, 2002. [Google Scholar] [CrossRef] [Green Version]
  14. Chatfield, C. Model uncertainty, data mining and statistical inference. J. R. Stat. Soc. Series A Stat. Soc. 1995, 158, 419–444. [Google Scholar] [CrossRef]
  15. Clyde, M.; George, E.I. Model uncertainty. Stat. Sci. 2004, 19, 81–94. [Google Scholar] [CrossRef]
  16. Athey, S.; Imbens, G. A measure of robustness to misspecification. Am. Econ. Rev. 2015, 105, 476–480. [Google Scholar] [CrossRef] [Green Version]
  17. Brock, W.A.; Durlauf, S.N.; West, K.D. Model uncertainty and policy evaluation: Some theory and empirics. J. Econom. 2007, 136, 629–664. [Google Scholar] [CrossRef] [Green Version]
  18. Brock, W.A.; Durlauf, S.N. On sturdy policy evaluation. J. Leg. Stud. 2015, 44, S447–S473. [Google Scholar] [CrossRef] [Green Version]
  19. Muñoz, J.; Young, C. We ran 9 billion regressions: Eliminating false positives through computational model robustness. Sociol. Methodol. 2018, 48, 1–33. [Google Scholar] [CrossRef] [Green Version]
  20. Young, C. Model uncertainty in sociological research: An application to religion and economic growth. Am. Sociol. Rev. 2009, 74, 380–397. [Google Scholar] [CrossRef] [Green Version]
  21. Young, C.; Holsteen, K. Model uncertainty and robustness: A computational framework for multimodel analysis. Sociol. Methods Res. 2017, 46, 3–40. [Google Scholar] [CrossRef] [Green Version]
  22. Young, C. Model uncertainty and the crisis in science. Socius 2018, 4, 1–7. [Google Scholar] [CrossRef] [Green Version]
  23. Knutti, R.; Baumberger, C.; Hadorn, G.H. Uncertainty quantification using multiple models—Prospects and challenges. In Computer Simulation Validation; Beisbart, C., Saam, N., Eds.; Springer: Berlin/Heidelberg, Germany, 2019; pp. 835–855. [Google Scholar] [CrossRef]
  24. Baumberger, C.; Knutti, R.; Hirsch Hadorn, G. Building confidence in climate model projections: An analysis of inferences from fit. WIREs Clim. Chang. 2017, 8, e454. [Google Scholar] [CrossRef] [Green Version]
  25. Dormann, C.F.; Calabrese, J.M.; Guillera-Arroita, G.; Matechou, E.; Bahn, V.; Bartoń, K.; Beale, C.M.; Ciuti, S.; Elith, J.; Gerstner, K.; et al. Model averaging in ecology: A review of Bayesian, information-theoretic, and tactical approaches for predictive inference. Ecol. Monogr. 2018, 88, 485–504. [Google Scholar] [CrossRef] [Green Version]
  26. Hoffmann, S.; Schönbrodt, F.D.; Elsas, R.; Wilson, R.; Strasser, U.; Boulesteix, A.L. The multiplicity of analysis strategies jeopardizes replicability: Lessons learned across disciplines. MetaArXiv 2020. [Google Scholar] [CrossRef]
  27. Steegen, S.; Tuerlinckx, F.; Gelman, A.; Vanpaemel, W. Increasing transparency through a multiverse analysis. Perspect. Psychol. Sci. 2016, 11, 702–712. [Google Scholar] [CrossRef]
  28. Harder, J.A. The multiverse of methods: Extending the multiverse analysis to address data-collection decisions. Perspect. Psychol. Sci. 2020, 15, 1158–1177. [Google Scholar] [CrossRef]
  29. Simonsohn, U.; Simmons, J.P.; Nelson, L.D. Specification curve: Descriptive and inferential statistics on all reasonable specifications. SSRN 2015. [Google Scholar] [CrossRef] [Green Version]
  30. Simonsohn, U.; Simmons, J.P.; Nelson, L.D. Specification curve analysis. Nat. Hum. Behav. 2020, 4, 1208–1214. [Google Scholar] [CrossRef]
  31. Camilli, G. IRT scoring and test blueprint fidelity. Appl. Psychol. Meas. 2018, 42, 393–400. [Google Scholar] [CrossRef] [PubMed]
  32. Camilli, G. The case against item bias detection techniques based on internal criteria: Do item bias procedures obscure test fairness issues? In Differential Item Functioning: Theory and Practice; Holland, P.W., Wainer, H., Eds.; Erlbaum: Hillsdale, NJ, USA, 1993; pp. 397–417. [Google Scholar]
  33. Pohl, S.; Ulitzsch, E.; von Davier, M. Reframing rankings in educational assessments. Science 2021, 372, 338–340. [Google Scholar] [CrossRef] [PubMed]
  34. Wu, M. Measurement, sampling, and equating errors in large-scale assessments. Educ. Meas. 2010, 29, 15–27. [Google Scholar] [CrossRef]
  35. Hartig, J.; Buchholz, J. A multilevel item response model for item position effects and individual persistence. Psych. Test Assess. Model. 2012, 54, 418–431. [Google Scholar]
  36. Rutkowski, L.; Rutkowski, D.; Zhou, Y. Item calibration samples and the stability of achievement estimates and system rankings: Another look at the PISA model. Int. J. Test. 2016, 16, 1–20. [Google Scholar] [CrossRef]
  37. Van der Linden, W.J. Unidimensional logistic response models. In Handbook of Item Response Theory, Volume 1: Models; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 11–30. [Google Scholar] [CrossRef]
  38. Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154. [Google Scholar]
  39. Douglas, J.A. Asymptotic identifiability of nonparametric item response models. Psychometrika 2001, 66, 531–540. [Google Scholar] [CrossRef]
  40. Levine, M.V. Dimension in latent variable models. J. Math. Psychol. 2003, 47, 450–466. [Google Scholar] [CrossRef]
  41. Peress, M. Identification of a semiparametric item response model. Psychometrika 2012, 77, 223–243. [Google Scholar] [CrossRef]
  42. Stout, W. A nonparametric approach for assessing latent trait unidimensionality. Psychometrika 1987, 52, 589–617. [Google Scholar] [CrossRef]
  43. Ip, E.H.; Molenberghs, G.; Chen, S.H.; Goegebeur, Y.; De Boeck, P. Functionally unidimensional item response models for multivariate binary data. Multivar. Behav. Res. 2013, 48, 534–562. [Google Scholar] [CrossRef]
44. Kirisci, L.; Hsu, T.C.; Yu, L. Robustness of item parameter estimation programs to assumptions of unidimensionality and normality. Appl. Psychol. Meas. 2001, 25, 146–162. [Google Scholar] [CrossRef]
  45. Robitzsch, A.; Lüdtke, O. Reflections on analytical choices in the scaling model for test scores in international large-scale assessment studies. PsyArXiv 2021. [Google Scholar] [CrossRef]
  46. Zhang, B. Application of unidimensional item response models to tests with items sensitive to secondary dimensions. J. Exp. Educ. 2008, 77, 147–166. [Google Scholar] [CrossRef]
  47. Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational Research: Copenhagen, Denmark, 1960. [Google Scholar]
  48. Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; MIT Press: Reading, MA, USA, 1968; pp. 397–479. [Google Scholar]
  49. Molenaar, D.; Dolan, C.V.; De Boeck, P. The heteroscedastic graded response model with a skewed latent trait: Testing statistical and substantive hypotheses related to skewed item category functions. Psychometrika 2012, 77, 455–478. [Google Scholar] [CrossRef] [PubMed]
  50. Molenaar, D. Heteroscedastic latent trait models for dichotomous data. Psychometrika 2015, 80, 625–644. [Google Scholar] [CrossRef]
  51. Lee, S.; Bolt, D.M. An alternative to the 3PL: Using asymmetric item characteristic curves to address guessing effects. J. Educ. Meas. 2018, 55, 90–111. [Google Scholar] [CrossRef]
  52. Lee, S.; Bolt, D.M. Asymmetric item characteristic curves and item complexity: Insights from simulation and real data analyses. Psychometrika 2018, 83, 453–475. [Google Scholar] [CrossRef]
  53. Liao, X.; Bolt, D.M. Item characteristic curve asymmetry: A better way to accommodate slips and guesses than a four-parameter model? J. Educ. Behav. Stat. 2021, 46, 753–775. [Google Scholar] [CrossRef]
  54. Robitzsch, A. On the choice of the item response model for scaling PISA data: Model selection based on information criteria and quantifying model uncertainty. Entropy 2022, 24, 760. [Google Scholar] [CrossRef]
  55. Aitkin, M.; Aitkin, I. Investigation of the Identifiability of the 3PL Model in the NAEP 1986 Math Survey; Technical Report; US Department of Education, Office of Educational Research and Improvement National Center for Education Statistics: Washington, DC, USA, 2006. Available online: https://bit.ly/35b79X0 (accessed on 28 May 2022).
  56. von Davier, M. Is there need for the 3PL model? Guess what? Meas. Interdiscip. Res. Persp. 2009, 7, 110–114. [Google Scholar] [CrossRef]
  57. San Martín, E.; Del Pino, G.; De Boeck, P. IRT models for ability-based guessing. Appl. Psychol. Meas. 2006, 30, 183–203. [Google Scholar] [CrossRef] [Green Version]
  58. Brown, G.; Micklewright, J.; Schnepf, S.V.; Waldmann, R. International surveys of educational achievement: How robust are the findings? J. R. Stat. Soc. Series A Stat. Soc. 2007, 170, 623–646. [Google Scholar] [CrossRef] [Green Version]
  59. Jerrim, J.; Parker, P.; Choi, A.; Chmielewski, A.K.; Sälzer, C.; Shure, N. How robust are cross-country comparisons of PISA scores to the scaling model used? Educ. Meas. 2018, 37, 28–39. [Google Scholar] [CrossRef] [Green Version]
  60. Macaskill, G. Alternative scaling models and dependencies in PISA. In Proceedings of the TAG(0809)6a, TAG Meeting, Sydney, Australia, 7–11 July 2008; Available online: https://bit.ly/35WwBPg (accessed on 28 May 2022).
  61. Schnepf, S.V. Insights into Survey Errors of Large Scale Educational Achievement Surveys; JRC Working Papers in Economics and Finance, No. 2018/5; Publications Office of the European Union: Luxembourg, 2018. [Google Scholar] [CrossRef]
  62. Holland, P.W.; Wainer, H. (Eds.) Differential Item Functioning: Theory and Practice; Lawrence Erlbaum: Hillsdale, NJ, USA, 1993. [Google Scholar] [CrossRef]
  63. Penfield, R.D.; Camilli, G. Differential item functioning and item bias. In Handbook of Statistics, Vol. 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Elsevier: Amsterdam, The Netherlands, 2007; pp. 125–167. [Google Scholar] [CrossRef]
  64. Byrne, B.M.; Shavelson, R.J.; Muthén, B. Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychol. Bull. 1989, 105, 456–466. [Google Scholar] [CrossRef]
  65. Van de Schoot, R.; Kluytmans, A.; Tummers, L.; Lugtig, P.; Hox, J.; Muthén, B. Facing off with scylla and charybdis: A comparison of scalar, partial, and the novel possibility of approximate measurement invariance. Front. Psychol. 2013, 4, 770. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  66. Kunina-Habenicht, O.; Rupp, A.A.; Wilhelm, O. A practical illustration of multidimensional diagnostic skills profiling: Comparing results from confirmatory factor analysis and diagnostic classification models. Stud. Educ. Eval. 2009, 35, 64–70. [Google Scholar] [CrossRef]
  67. Oliveri, M.E.; von Davier, M. Investigation of model fit and score scale comparability in international assessments. Psych. Test Assess. Model. 2011, 53, 315–333. Available online: https://bit.ly/3k4K9kt (accessed on 28 May 2022).
  68. Oliveri, M.E.; von Davier, M. Toward increasing fairness in score scale calibrations employed in international large-scale assessments. Int. J. Test. 2014, 14, 1–21. [Google Scholar] [CrossRef]
  69. Von Davier, M.; Khorramdel, L.; He, Q.; Shin, H.J.; Chen, H. Developments in psychometric population models for technology-based large-scale assessments: An overview of challenges and opportunities. J. Educ. Behav. Stat. 2019, 44, 671–705. [Google Scholar] [CrossRef]
  70. Von Davier, M.; Yamamoto, K.; Shin, H.J.; Chen, H.; Khorramdel, L.; Weeks, J.; Davis, S.; Kong, N.; Kandathil, M. Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assess. Educ. 2019, 26, 466–488. [Google Scholar] [CrossRef]
  71. Khorramdel, L.; Shin, H.J.; von Davier, M. GDM software mdltm including parallel EM algorithm. In Handbook of Diagnostic Classification Models; von Davier, M., Lee, Y.S., Eds.; Springer: Cham, Switzerland, 2019; pp. 603–628. [Google Scholar] [CrossRef]
  72. Tijmstra, J.; Bolsinova, M.; Liaw, Y.L.; Rutkowski, L.; Rutkowski, D. Sensitivity of the RMSD for detecting item-level misfit in low-performing countries. J. Educ. Meas. 2020, 57, 566–583. [Google Scholar] [CrossRef]
  73. Köhler, C.; Robitzsch, A.; Hartig, J. A bias-corrected RMSD item fit statistic: An evaluation and comparison to alternatives. J. Educ. Behav. Stat. 2020, 45, 251–273. [Google Scholar] [CrossRef]
  74. Robitzsch, A. Statistical properties of estimators of the RMSD item fit statistic. Foundations 2022, 2, 488–503. [Google Scholar] [CrossRef]
  75. Von Davier, M.; Bezirhan, U. A robust method for detecting item misfit in large scale assessments. Educ. Psychol. Meas. 2022. Epub ahead of print. [Google Scholar] [CrossRef]
  76. Joo, S.H.; Khorramdel, L.; Yamamoto, K.; Shin, H.J.; Robin, F. Evaluating item fit statistic thresholds in PISA: Analysis of cross-country comparability of cognitive items. Educ. Meas. 2021, 40, 37–48. [Google Scholar] [CrossRef]
  77. Buchholz, J.; Hartig, J. Comparing attitudes across groups: An IRT-based item-fit statistic for the analysis of measurement invariance. Appl. Psychol. Meas. 2019, 43, 241–250. [Google Scholar] [CrossRef]
  78. Robitzsch, A.; Lüdtke, O. A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psych. Test Assess. Model. 2020, 62, 233–279. Available online: https://bit.ly/3ezBB05 (accessed on 28 May 2022).
  79. Robitzsch, A.; Lüdtke, O. Mean comparisons of many groups in the presence of DIF: An evaluation of linking and concurrent scaling approaches. J. Educ. Behav. Stat. 2022, 47, 36–68. [Google Scholar] [CrossRef]
  80. Robitzsch, A. Robust and nonrobust linking of two groups for the Rasch model with balanced and unbalanced random DIF: A comparative simulation study and the simultaneous assessment of standard errors and linking errors with resampling techniques. Symmetry 2021, 13, 2198. [Google Scholar] [CrossRef]
  81. Dai, S. Handling missing responses in psychometrics: Methods and software. Psych 2021, 3, 673–693. [Google Scholar] [CrossRef]
  82. Finch, H. Estimation of item response theory parameters in the presence of missing data. J. Educ. Meas. 2008, 45, 225–245. [Google Scholar] [CrossRef]
  83. Frey, A.; Spoden, C.; Goldhammer, F.; Wenzel, S.F.C. Response time-based treatment of omitted responses in computer-based testing. Behaviormetrika 2018, 45, 505–526. [Google Scholar] [CrossRef] [Green Version]
  84. Kalkan, Ö.K.; Kara, Y.; Kelecioğlu, H. Evaluating performance of missing data imputation methods in IRT analyses. Int. J. Assess. Tool. Educ. 2018, 5, 403–416. [Google Scholar] [CrossRef]
  85. Pohl, S.; Becker, B. Performance of missing data approaches under nonignorable missing data conditions. Methodology 2020, 16, 147–165. [Google Scholar] [CrossRef]
  86. Rose, N.; von Davier, M.; Nagengast, B. Commonalities and differences in IRT-based methods for nonignorable item nonresponses. Psych. Test Assess. Model. 2015, 57, 472–498. Available online: https://bit.ly/3kD3t89 (accessed on 28 May 2022).
  87. Rose, N.; von Davier, M.; Nagengast, B. Modeling omitted and not-reached items in IRT models. Psychometrika 2017, 82, 795–819. [Google Scholar] [CrossRef]
  88. Robitzsch, A. On the treatment of missing item responses in educational large-scale assessment data: An illustrative simulation study and a case study using PISA 2018 mathematics data. Eur. J. Investig. Health Psychol. Educ. 2021, 11, 1653–1687. [Google Scholar] [CrossRef]
  89. Gorgun, G.; Bulut, O. A polytomous scoring approach to handle not-reached items in low-stakes assessments. Educ. Psychol. Meas. 2021, 81, 847–871. [Google Scholar] [CrossRef]
  90. Debeer, D.; Janssen, R.; De Boeck, P. Modeling skipped and not-reached items using IRTrees. J. Educ. Meas. 2017, 54, 333–363. [Google Scholar] [CrossRef]
  91. Köhler, C.; Pohl, S.; Carstensen, C.H. Taking the missing propensity into account when estimating competence scores: Evaluation of item response theory models for nonignorable omissions. Educ. Psychol. Meas. 2015, 75, 850–874. [Google Scholar] [CrossRef] [Green Version]
  92. Köhler, C.; Pohl, S.; Carstensen, C.H. Dealing with item nonresponse in large-scale cognitive assessments: The impact of missing data methods on estimated explanatory relationships. J. Educ. Meas. 2017, 54, 397–419. [Google Scholar] [CrossRef] [Green Version]
  93. Pohl, S.; Carstensen, C.H. NEPS Technical Report—Scaling the Data of the Competence Tests; (NEPS Working Paper No. 14); Otto-Friedrich-Universität, Nationales Bildungspanel: Bamberg, Germany, 2012; Available online: https://bit.ly/2XThQww (accessed on 28 May 2022).
  94. Pohl, S.; Carstensen, C.H. Scaling of competence tests in the national educational panel study – Many questions, some answers, and further challenges. J. Educ. Res. Online 2013, 5, 189–216. [Google Scholar]
  95. Pohl, S.; Gräfe, L.; Rose, N. Dealing with omitted and not-reached items in competence tests: Evaluating approaches accounting for missing responses in item response theory models. Educ. Psychol. Meas. 2014, 74, 423–452. [Google Scholar] [CrossRef]
  96. Rose, N.; von Davier, M.; Xu, X. Modeling Nonignorable Missing Data with Item Response Theory (IRT); Research Report No. RR-10-11; Educational Testing Service: Princeton, NJ, USA, 2010. [Google Scholar] [CrossRef]
  97. Rohwer, G. Making Sense of Missing Answers in Competence Tests; (NEPS Working Paper No. 30); Otto-Friedrich-Universität, Nationales Bildungspanel: Bamberg, Germany, 2013; Available online: https://bit.ly/3AGfsr5 (accessed on 28 May 2022).
  98. Robitzsch, A. About still nonignorable consequences of (partially) ignoring missing item responses in large-scale assessment. OSF Preprints 2020. [Google Scholar] [CrossRef]
  99. Sachse, K.A.; Mahler, N.; Pohl, S. When nonresponse mechanisms change: Effects on trends and group comparisons in international large-scale assessments. Educ. Psychol. Meas. 2019, 79, 699–726. [Google Scholar] [CrossRef]
  100. Brennan, R.L. Generalizability theory. Educ. Meas. 1992, 11, 27–34. [Google Scholar] [CrossRef]
101. Brennan, R.L. Generalizability Theory; Springer: New York, NY, USA, 2001. [Google Scholar] [CrossRef]
  102. Brennan, R.L. Perspectives on the evolution and future of educational measurement. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 1–16. [Google Scholar]
  103. Cronbach, L.J.; Rajaratnam, N.; Gleser, G.C. Theory of generalizability: A liberalization of reliability theory. Brit. J. Stat. Psychol. 1963, 16, 137–163. [Google Scholar] [CrossRef]
  104. Cronbach, L.J.; Gleser, G.C.; Nanda, H.; Rajaratnam, N. The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles; John Wiley: New York, NY, USA, 1972. [Google Scholar]
  105. Hunter, J.E. Probabilistic foundations for coefficients of generalizability. Psychometrika 1968, 33, 1–18. [Google Scholar] [CrossRef]
  106. Husek, T.R.; Sirotnik, K. Item Sampling in Educational Research; CSEIP Occasional Report No. 2; University of California: Los Angeles, CA, USA, 1967; Available online: https://bit.ly/3k47t1s (accessed on 28 May 2022).
  107. Kane, M.T.; Brennan, R.L. The generalizability of class means. Rev. Educ. Res. 1977, 47, 267–292. [Google Scholar] [CrossRef]
  108. Robitzsch, A.; Dörfler, T.; Pfost, M.; Artelt, C. Die Bedeutung der Itemauswahl und der Modellwahl für die längsschnittliche Erfassung von Kompetenzen [Relevance of item selection and model selection for assessing the development of competencies: The development in reading competence in primary school students]. Z. Entwicklungspsychol. Pädagog. Psychol. 2011, 43, 213–227. [Google Scholar] [CrossRef]
  109. Monseur, C.; Berezner, A. The computation of equating errors in international surveys in education. J. Appl. Meas. 2007, 8, 323–335. Available online: https://bit.ly/2WDPeqD (accessed on 28 May 2022). [PubMed]
  110. Sachse, K.A.; Roppelt, A.; Haag, N. A comparison of linking methods for estimating national trends in international comparative large-scale assessments in the presence of cross-national DIF. J. Educ. Meas. 2016, 53, 152–171. [Google Scholar] [CrossRef]
  111. Sachse, K.A.; Haag, N. Standard errors for national trends in international large-scale assessments in the case of cross-national differential item functioning. Appl. Meas. Educ. 2017, 30, 102–116. [Google Scholar] [CrossRef]
  112. Robitzsch, A.; Lüdtke, O. Linking errors in international large-scale assessments: Calculation of standard errors for trend estimation. Assess. Educ. 2019, 26, 444–465. [Google Scholar] [CrossRef]
  113. Robitzsch, A.; Lüdtke, O.; Goldhammer, F.; Kroehne, U.; Köller, O. Reanalysis of the German PISA data: A comparison of different approaches for trend estimation with a particular emphasis on mode effects. Front. Psychol. 2020, 11, 884. [Google Scholar] [CrossRef]
  114. Kolenikov, S. Resampling variance estimation for complex survey data. Stata J. 2010, 10, 165–199. [Google Scholar] [CrossRef] [Green Version]
  115. Sireci, S.G.; Thissen, D.; Wainer, H. On the reliability of testlet-based tests. J. Educ. Meas. 1991, 28, 237–247. [Google Scholar] [CrossRef]
  116. Bolt, D.M.; Cohen, A.S.; Wollack, J.A. Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. J. Educ. Meas. 2002, 39, 331–348. [Google Scholar] [CrossRef]
  117. Jin, K.Y.; Wang, W.C. Item response theory models for performance decline during testing. J. Educ. Meas. 2014, 51, 178–200. [Google Scholar] [CrossRef]
  118. Kanopka, K.; Domingue, B. A position sensitive IRT mixture model. PsyArXiv 2022. [Google Scholar] [CrossRef]
  119. List, M.K.; Robitzsch, A.; Lüdtke, O.; Köller, O.; Nagy, G. Performance decline in low-stakes educational assessments: Different mixture modeling approaches. Large-Scale Assess. Educ. 2017, 5, 15. [Google Scholar] [CrossRef] [Green Version]
  120. Nagy, G.; Robitzsch, A. A continuous HYBRID IRT model for modeling changes in guessing behavior in proficiency tests. Psych. Test Assess. Model. 2021, 63, 361–395. Available online: https://bit.ly/3FHtA6l (accessed on 28 May 2022).
  121. Alexandrowicz, R.; Matschinger, H. Estimation of item location effects by means of the generalized logistic regression model: A simulation study and an application. Psychol. Sci. 2008, 50, 64–74. Available online: https://bit.ly/3MEHM3n (accessed on 28 May 2022).
  122. Hecht, M.; Weirich, S.; Siegle, T.; Frey, A. Effects of design properties on parameter estimation in large-scale assessments. Educ. Psychol. Meas. 2015, 75, 1021–1044. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  123. Robitzsch, A. Methodische Herausforderungen bei der Kalibrierung von Leistungstests [Methodological challenges in calibrating performance tests]. In Bildungsstandards Deutsch und Mathematik; Bremerich-Vos, A., Granzer, D., Köller, O., Eds.; Beltz Pädagogik: Weinheim, Germany, 2009; pp. 42–106. [Google Scholar]
  124. Bulut, O.; Quo, Q.; Gierl, M.J. A structural equation modeling approach for examining position effects in large-scale assessments. Large-Scale Assess. Educ. 2017, 5, 8. [Google Scholar] [CrossRef] [Green Version]
  125. Debeer, D.; Janssen, R. Modeling item-position effects within an IRT framework. J. Educ. Meas. 2013, 50, 164–185. [Google Scholar] [CrossRef]
  126. Debeer, D.; Buchholz, J.; Hartig, J.; Janssen, R. Student, school, and country differences in sustained test-taking effort in the 2009 PISA reading assessment. J. Educ. Behav. Stat. 2014, 39, 502–523. [Google Scholar] [CrossRef] [Green Version]
  127. Demirkol, S.; Kelecioğlu, H. Investigating the effect of item position on person and item parameters: PISA 2015 Turkey sample. J. Meas. Eval. Educ. Psychol. 2022, 13, 69–85. [Google Scholar] [CrossRef]
  128. Rose, N.; Nagy, G.; Nagengast, B.; Frey, A.; Becker, M. Modeling multiple item context effects with generalized linear mixed models. Front. Psychol. 2019, 10, 248. [Google Scholar] [CrossRef]
  129. Trendtel, M.; Robitzsch, A. Modeling item position effects with a Bayesian item response model applied to PISA 2009–2015 data. Psych. Test Assess. Model. 2018, 60, 241–263. Available online: https://bit.ly/3l4Zi5u (accessed on 28 May 2022).
  130. Weirich, S.; Hecht, M.; Böhme, K. Modeling item position effects using generalized linear mixed models. Appl. Psychol. Meas. 2014, 38, 535–548. [Google Scholar] [CrossRef]
  131. Nagy, G.; Lüdtke, O.; Köller, O. Modeling test context effects in longitudinal achievement data: Examining position effects in the longitudinal German PISA 2012 assessment. Psych. Test Assess. Model. 2016, 58, 641–670. Available online: https://bit.ly/39Z4iFw (accessed on 28 May 2022). [CrossRef]
132. Nagy, G.; Nagengast, B.; Becker, M.; Rose, N.; Frey, A. Item position effects in a reading comprehension test: An IRT study of individual differences and individual correlates. Psych. Test Assess. Model. 2018, 60, 165–187. Available online: https://bit.ly/3Biw74g (accessed on 28 May 2022).
  133. Nagy, G.; Nagengast, B.; Frey, A.; Becker, M.; Rose, N. A multilevel study of position effects in PISA achievement tests: Student-and school-level predictors in the German tracked school system. Assess. Educ. 2019, 26, 422–443. [Google Scholar] [CrossRef] [Green Version]
  134. Garthwaite, P.H.; Mubwandarikwa, E. Selection of weights for weighted model averaging. Aust. N. Z. J. Stat. 2010, 52, 363–382. [Google Scholar] [CrossRef]
  135. Knutti, R. The end of model democracy? Clim. Chang. 2010, 102, 395–404. [Google Scholar] [CrossRef]
  136. Lorenz, R.; Herger, N.; Sedláček, J.; Eyring, V.; Fischer, E.M.; Knutti, R. Prospects and caveats of weighting climate models for summer maximum temperature projections over North America. J. Geophys. Res. Atmosph. 2018, 123, 4509–4526. [Google Scholar] [CrossRef]
  137. Sanderson, B.M.; Knutti, R.; Caldwell, P. A representative democracy to reduce interdependency in a multimodel ensemble. J. Clim. 2015, 28, 5171–5194. [Google Scholar] [CrossRef] [Green Version]
  138. Sanderson, B.M.; Wehner, M.; Knutti, R. Skill and independence weighting for multi-model assessments. Geosci. Model Dev. 2017, 10, 2379–2395. [Google Scholar] [CrossRef] [Green Version]
  139. Scharkow, M. Getting More Information Out of the Specification Curve. 15 January 2019. Available online: https://bit.ly/3z9ebLz (accessed on 28 May 2022).
  140. Gelman, A. Analysis of variance—Why it is more important than ever. Ann. Stat. 2005, 33, 1–53. [Google Scholar] [CrossRef] [Green Version]
  141. Gelman, A.; Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models; Cambridge University Press: Cambridge, MA, USA, 2006. [Google Scholar] [CrossRef]
  142. R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2022; Available online: https://www.R-project.org/ (accessed on 11 January 2022).
  143. Robitzsch, A.; Kiefer, T.; Wu, M. TAM: Test Analysis Modules. R Package Version 4.0-16. 2022. Available online: https://CRAN.R-project.org/package=TAM (accessed on 14 May 2022).
  144. Robitzsch, A. sirt: Supplementary Item Response Theory Models. R Package Version 3.12-66. 2022. Available online: https://CRAN.R-project.org/package=sirt (accessed on 17 May 2022).
  145. Masur, P.K.; Scharkow, M. specr: Conducting and Visualizing Specification Curve Analyses. R Package Version 0.2.1. 2020. Available online: https://CRAN.R-project.org/package=specr (accessed on 26 March 2020).
  146. Kane, M.T. A sampling model for validity. Appl. Psychol. Meas. 1982, 6, 125–160. [Google Scholar] [CrossRef]
  147. Kane, M.T. Validating the interpretations and uses of test scores. J. Educ. Meas. 2013, 50, 1–73. [Google Scholar] [CrossRef]
  148. Adams, R.J. Response to ’Cautions on OECD’s recent educational survey (PISA)’. Oxf. Rev. Educ. 2003, 29, 379–389. [Google Scholar] [CrossRef]
  149. Adams, R.J. Comments on Kreiner 2011: Is the Foundation under PISA Solid? A Critical Look at the Scaling Model Underlying International Comparisons of Student Attainment; Technical Report; OECD: Paris, France, 2011; Available online: https://bit.ly/3wVUKo0 (accessed on 28 May 2022).
  150. Kreiner, S.; Christensen, K.B. Analyses of model fit and robustness. A new look at the PISA scaling model underlying ranking of countries according to reading literacy. Psychometrika 2014, 79, 210–231. [Google Scholar] [CrossRef]
  151. McDonald, R.P. Generalizability in factorable domains: “Domain validity and generalizability”. Educ. Psychol. Meas. 1978, 38, 75–79. [Google Scholar] [CrossRef]
  152. McDonald, R.P. Behavior domains in theory and in practice. Alta. J. Educ. Res. 2003, 49, 212–230. Available online: https://bit.ly/3O4s2I5 (accessed on 28 May 2022).
  153. Brennan, R.L. Misconceptions at the intersection of measurement theory and practice. Educ. Meas. 1998, 17, 5–9. [Google Scholar] [CrossRef]
  154. Leutner, D.; Hartig, J.; Jude, N. Measuring competencies: Introduction to concepts and questions of assessment in education. In Assessment of Competencies in Educational Contexts; Hartig, J., Klieme, E., Leutner, D., Eds.; Hogrefe: Göttingen, Germany, 2008; pp. 177–192. [Google Scholar]
  155. Holland, P.W. The Dutch identity: A new tool for the study of item response models. Psychometrika 1990, 55, 5–18. [Google Scholar] [CrossRef]
  156. Zhang, J.; Stout, W. On Holland’s Dutch identity conjecture. Psychometrika 1997, 62, 375–392. [Google Scholar] [CrossRef]
  157. Frey, A.; Seitz, N.N.; Kröhne, U. Reporting differentiated literacy results in PISA by using multidimensional adaptive testing. In Research on PISA; Prenzel, M., Kobarg, M., Schöps, K., Rönnebeck, S., Eds.; Springer: Dordrecht, The Netherlands, 2013; pp. 103–120. [Google Scholar] [CrossRef]
  158. Goldstein, H. International comparisons of student attainment: Some issues arising from the PISA study. Assess. Educ. 2004, 11, 319–330. [Google Scholar] [CrossRef]
  159. Goldstein, H.; Bonnet, G.; Rocher, T. Multilevel structural equation models for the analysis of comparative data on educational performance. J. Educ. Behav. Stat. 2007, 32, 252–286. [Google Scholar] [CrossRef]
  160. VanderWeele, T.J. Constructed measures and causal inference: Towards a new model of measurement for psychosocial constructs. Epidemiology 2022, 33, 141–151. [Google Scholar] [CrossRef] [PubMed]
  161. Frey, A.; Hartig, J. Methodological challenges of international student assessment. In Monitoring Student Achievement in the 21st Century; Harju-Luukkainen, H., McElvany, N., Stang, J., Eds.; Springer: Cham, Switzerland, 2020; pp. 39–49. [Google Scholar] [CrossRef]
Figure 1. Graphical visualization of multiverse analysis involving M = 162 models for country means μ for countries Austria (AUT; upper left panel), Spain (ESP; upper right panel), Netherlands (NLD; lower left panel), and USA (lower right panel). The dashed line corresponds to the value from the reference model. Country means colored in blue, gray, or red indicate that they are larger, similar, or smaller than the reference value, respectively.
Figure 2. Graphical visualization of multiverse analysis involving M = 162 models for country standard deviations σ for countries Austria (AUT; upper left panel), Spain (ESP; upper right panel), Netherlands (NLD; lower left panel), and USA (lower right panel). The dashed line corresponds to the value from the reference model. Country standard deviations colored in blue, gray, or red indicate that they are larger, similar, or smaller than the reference value, respectively.
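For readers who want to reproduce a display in the style of Figures 1 and 2, the following R sketch (not the code used for the article) sorts the estimates of one country across all specifications, draws the reference-model value as a dashed line, and colors the points by their deviation from it. The data frame spec, its columns, and the tolerance tol are hypothetical placeholders with simulated values, not the actual PISA 2018 estimates.

```r
# Minimal sketch, not the article's original code: a specification-curve style
# plot of country means against the reference model. 'spec' is a hypothetical
# data frame with one row per model specification; the values are simulated.
library(ggplot2)

set.seed(1)
spec <- data.frame(model_id = 1:162,
                   mean_aut = rnorm(162, mean = 508.7, sd = 3))
ref_value <- 508.7   # country mean under the reference model
tol <- 1             # assumed tolerance for coloring an estimate as "similar"

spec <- spec[order(spec$mean_aut), ]
spec$order <- seq_len(nrow(spec))
spec$comp  <- cut(spec$mean_aut - ref_value,
                  breaks = c(-Inf, -tol, tol, Inf),
                  labels = c("smaller", "similar", "larger"))

ggplot(spec, aes(x = order, y = mean_aut, color = comp)) +
  geom_point(size = 1) +
  geom_hline(yintercept = ref_value, linetype = "dashed") +
  scale_color_manual(values = c(smaller = "red", similar = "gray50", larger = "blue")) +
  labs(x = "Model specification (sorted)", y = "Country mean (AUT)", color = NULL)
```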
Table 1. Square roots of variance components (SRVCs) associated with factors of the multiverse analysis in a two-way analysis of variance for country mean μ and country standard deviation σ.

| Factor | μ | σ |
|---|---|---|
| Total | 3.05 | 2.98 |
| Items | 0.89 | 1.13 |
| Model | 0.60 | 1.48 |
| Pos | 1.83 | 1.76 |
| RMSD | 1.52 | 0.91 |
| Score | 1.37 | 0.84 |
| Model × Items | 0.20 | 0.35 |
| Model × Pos | 0.20 | 0.42 |
| Model × RMSD | 0.36 | 0.54 |
| Model × Score | 0.09 | 0.19 |
| Pos × Items | 0.41 | 0.69 |
| Pos × RMSD | 0.43 | 0.44 |
| Pos × Score | 0.41 | 0.29 |
| RMSD × Items | 0.89 | 0.55 |
| Score × Items | 0.22 | 0.15 |
| Score × RMSD | 0.14 | 0.10 |
Note. Total = standard deviation associated with total variability across models; Items = item choice (see Section 2.2.4); Model = specified IRT model (see Section 2.2.1); Pos = choice for handling position effects (see Section 2.2.5); RMSD = used cutoff value for RMSD item fit statistic for handling DIF (see Section 2.2.2); Score0 = scoring of missing item responses (see Section 2.2.3); Square roots of variance components larger than 0.50 are printed in bold.
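Variance components such as those in Table 1 can in principle be obtained by treating each specification factor and each two-way interaction as a random effect in an ANOVA-type decomposition of the country estimates [140,141]. The R sketch below illustrates this idea for the country means of a single country; the data frame res and its column names are hypothetical, and the article's actual decomposition (for instance, whether countries enter as an additional factor) may differ.

```r
# Minimal sketch under stated assumptions, not the article's code: variance
# components for the five specification factors and their two-way interactions,
# estimated from the 162 country-mean estimates of one country.
library(lme4)

# 'res' is a hypothetical data frame with one row per model specification:
# factor columns Model, RMSD, Score, Items, Pos and the estimated mean 'mu'.
fit <- lmer(mu ~ 1 + (1 | Model) + (1 | RMSD) + (1 | Score) + (1 | Items) + (1 | Pos) +
              (1 | Model:RMSD) + (1 | Model:Score) + (1 | Model:Items) + (1 | Model:Pos) +
              (1 | RMSD:Score) + (1 | RMSD:Items) + (1 | RMSD:Pos) +
              (1 | Score:Items) + (1 | Score:Pos) + (1 | Items:Pos),
            data = res)

vc   <- as.data.frame(VarCorr(fit))        # one row per variance component
srvc <- setNames(sqrt(vc$vcov), vc$grp)    # square roots, analogous to Table 1
round(srvc, 2)
```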
Table 2. Results of a multiverse analysis for PISA 2018 mathematics for country means.

Columns Est and SE refer to the reference model; M, Min, Max, ME, and ER to multi-model inference; and Pos, RMSD, Score0, Items, and Model report square roots of variance components (SRVCs).

| cnt | N | Est | SE | M | Min | Max | ME | ER | Pos | RMSD | Score0 | Items | Model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALB | 2116 | 439.7 | 3.39 | 442.8 | 434.7 | 450.1 | 3.38 | 1.00 | 2.44 | 0.46 | 1.12 | 0.00 | 1.00 |
| AUS | 6508 | 504.4 | 2.17 | 505.8 | 499.6 | 510.3 | 2.80 | 1.29 | 2.37 | 0.82 | 0.60 | 0.82 | 0.15 |
| AUT | 3104 | 508.7 | 3.20 | 509.7 | 503.6 | 514.8 | 2.97 | 0.93 | 1.50 | 2.34 | 0.36 | 0.44 | 0.38 |
| BEL | 3763 | 523.6 | 2.39 | 525.3 | 522.4 | 529.4 | 1.63 | 0.68 | 1.08 | 0.23 | 0.61 | 0.78 | 0.34 |
| BIH | 2934 | 415.4 | 3.21 | 418.0 | 405.4 | 426.8 | 4.18 | 1.30 | 0.73 | 1.78 | 2.72 | 0.46 | 1.52 |
| BLR | 2681 | 482.5 | 2.88 | 478.0 | 472.9 | 483.6 | 2.52 | 0.88 | 1.95 | 0.65 | 0.97 | 0.35 | 0.04 |
| BRN | 2259 | 439.0 | 2.08 | 430.1 | 420.0 | 446.7 | 5.74 | 2.75 | 3.57 | 2.99 | 2.09 | 0.08 | 1.16 |
| CAN | 7200 | 530.4 | 2.54 | 527.7 | 522.8 | 531.4 | 1.96 | 0.77 | 0.88 | 0.93 | 0.59 | 1.17 | 0.32 |
| CHE | 2679 | 522.7 | 2.96 | 524.3 | 519.5 | 530.4 | 2.59 | 0.88 | 1.93 | 0.82 | 0.58 | 1.24 | 0.33 |
| CZE | 3199 | 510.8 | 2.70 | 512.5 | 507.0 | 518.6 | 2.31 | 0.86 | 1.41 | 0.85 | 1.03 | 0.91 | 0.35 |
| DEU | 2482 | 514.6 | 3.18 | 514.1 | 508.0 | 518.9 | 2.39 | 0.75 | 1.23 | 1.11 | 1.25 | 0.69 | 0.35 |
| DNK | 3304 | 522.5 | 2.30 | 522.3 | 515.9 | 527.8 | 3.06 | 1.33 | 0.81 | 2.18 | 0.79 | 1.53 | 0.36 |
| ESP | 11855 | 491.3 | 1.63 | 492.7 | 488.6 | 497.3 | 1.91 | 1.17 | 1.40 | 0.06 | 0.45 | 0.77 | 0.20 |
| EST | 2467 | 532.7 | 2.36 | 534.4 | 529.7 | 539.7 | 1.95 | 0.83 | 1.21 | 1.15 | 0.23 | 0.50 | 0.22 |
| FIN | 2573 | 514.2 | 2.40 | 515.1 | 512.1 | 517.4 | 1.22 | 0.51 | 0.25 | 0.43 | 0.55 | 0.08 | 0.70 |
| FRA | 2880 | 506.0 | 2.64 | 506.5 | 502.4 | 511.1 | 2.24 | 0.85 | 0.58 | 1.49 | 0.67 | 0.99 | 0.26 |
| GBR | 5979 | 513.3 | 3.16 | 516.4 | 511.7 | 521.6 | 1.96 | 0.62 | 1.32 | 0.57 | 0.42 | 1.04 | 0.17 |
| GRC | 2114 | 458.9 | 3.74 | 456.0 | 450.2 | 459.7 | 2.15 | 0.58 | 1.56 | 0.83 | 0.34 | 0.13 | 0.23 |
| HKG | 2008 | 564.2 | 3.74 | 560.5 | 546.0 | 571.9 | 4.85 | 1.30 | 2.44 | 2.82 | 1.21 | 0.80 | 0.70 |
| HRV | 2150 | 471.1 | 3.08 | 470.9 | 464.0 | 476.7 | 3.16 | 1.03 | 2.46 | 0.48 | 0.69 | 1.65 | 0.19 |
| HUN | 2361 | 492.1 | 2.77 | 486.3 | 476.6 | 494.9 | 3.97 | 1.43 | 2.90 | 1.73 | 0.13 | 1.12 | 0.27 |
| IRL | 2581 | 510.4 | 2.54 | 502.7 | 493.7 | 510.4 | 3.59 | 1.41 | 2.87 | 1.17 | 1.41 | 0.38 | 0.56 |
| ISL | 1485 | 501.3 | 2.64 | 506.6 | 494.8 | 517.6 | 4.83 | 1.83 | 3.68 | 1.35 | 1.60 | 0.71 | 1.04 |
| ISR | 1944 | 465.5 | 4.85 | 470.0 | 462.2 | 478.2 | 3.57 | 0.74 | 2.20 | 1.38 | 1.88 | 0.20 | 0.94 |
| ITA | 5475 | 496.8 | 3.00 | 499.6 | 494.0 | 507.8 | 3.03 | 1.01 | 1.17 | 1.72 | 1.51 | 1.28 | 0.29 |
| JPN | 2814 | 539.5 | 3.08 | 542.2 | 537.0 | 549.1 | 2.63 | 0.85 | 0.09 | 1.48 | 1.62 | 0.21 | 0.23 |
| KOR | 2200 | 535.2 | 3.76 | 534.3 | 530.0 | 541.6 | 2.66 | 0.71 | 0.28 | 1.94 | 0.26 | 0.12 | 0.06 |
| LTU | 2265 | 491.1 | 2.33 | 488.7 | 481.5 | 495.5 | 2.99 | 1.28 | 1.87 | 1.16 | 1.12 | 1.31 | 0.89 |
| LUX | 2407 | 491.8 | 2.23 | 493.6 | 489.3 | 499.4 | 1.89 | 0.85 | 1.28 | 0.57 | 0.79 | 0.47 | 0.25 |
| LVA | 1751 | 503.9 | 2.46 | 500.5 | 491.4 | 508.7 | 3.34 | 1.36 | 2.23 | 1.81 | 1.23 | 0.11 | 0.69 |
| MLT | 1113 | 481.3 | 3.77 | 486.1 | 480.4 | 495.9 | 3.34 | 0.89 | 2.08 | 1.10 | 1.34 | 0.99 | 0.31 |
| MNE | 3066 | 435.6 | 1.84 | 441.8 | 434.4 | 449.6 | 3.40 | 1.84 | 0.92 | 1.17 | 2.29 | 1.33 | 1.10 |
| MYS | 2797 | 445.4 | 3.17 | 441.3 | 430.2 | 453.5 | 5.05 | 1.60 | 2.37 | 0.97 | 3.76 | 0.73 | 0.56 |
| NLD | 1787 | 542.6 | 2.71 | 541.5 | 532.4 | 549.1 | 3.50 | 1.29 | 1.36 | 2.61 | 1.23 | 0.52 | 0.31 |
| NOR | 2679 | 507.5 | 2.07 | 511.1 | 502.5 | 519.1 | 3.41 | 1.64 | 1.79 | 0.91 | 1.58 | 1.82 | 0.68 |
| NZL | 2821 | 508.0 | 2.29 | 505.3 | 501.9 | 509.1 | 1.60 | 0.70 | 0.34 | 0.93 | 0.29 | 0.38 | 0.31 |
| POL | 2577 | 524.4 | 3.32 | 521.6 | 516.3 | 526.0 | 2.29 | 0.69 | 2.04 | 0.35 | 0.20 | 0.15 | 0.68 |
| PRT | 2730 | 501.1 | 2.74 | 503.3 | 497.8 | 513.5 | 3.46 | 1.26 | 0.38 | 2.03 | 0.95 | 2.30 | 0.48 |
| RUS | 2510 | 495.4 | 3.46 | 497.1 | 488.9 | 504.0 | 3.21 | 0.93 | 1.93 | 1.73 | 0.66 | 1.15 | 0.78 |
| SGP | 2201 | 584.2 | 2.03 | 580.3 | 567.8 | 592.8 | 5.21 | 2.57 | 3.01 | 2.95 | 1.31 | 0.29 | 1.07 |
| SVK | 1904 | 496.4 | 3.00 | 498.9 | 493.7 | 506.6 | 2.90 | 0.97 | 1.54 | 2.04 | 0.42 | 0.58 | 0.76 |
| SVN | 2863 | 522.0 | 2.49 | 523.6 | 520.0 | 527.6 | 1.82 | 0.73 | 1.08 | 0.89 | 0.34 | 0.14 | 0.50 |
| SWE | 2539 | 503.4 | 3.20 | 511.4 | 498.9 | 519.6 | 4.83 | 1.51 | 2.21 | 2.10 | 2.93 | 1.08 | 0.34 |
| TUR | 3172 | 469.1 | 2.42 | 462.7 | 456.1 | 469.5 | 2.86 | 1.18 | 0.86 | 1.63 | 1.90 | 0.31 | 0.28 |
| USA | 2218 | 490.0 | 3.43 | 486.3 | 479.1 | 492.3 | 3.08 | 0.90 | 0.90 | 1.29 | 2.32 | 0.37 | 0.28 |
Note. cnt = country label (see Appendix A); N = sample size; M = composite estimator for multi-model inference (see (8)); ME = model error (see (9)); ER = error ratio defined as ME/SE (see (10)); Items = item choice (see Section 2.2.4); Model = specified IRT model (see Section 2.2.1); Pos = choice for handling position effects (see Section 2.2.5); RMSD = used cutoff value for RMSD item fit statistic for handling DIF (see Section 2.2.2); Score0 = scoring of missing item responses (see Section 2.2.3); Square roots of variance components larger than 1.00 are printed in bold.
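The multi-model columns of Tables 2 and 3 summarize the spread of a country's estimates across all 162 specifications. The R sketch below illustrates such a summary; it assumes an unweighted mean as the composite estimator and the standard deviation across specifications as the model error, which is only an approximation of Equations (8) to (10), and the input values are simulated placeholders rather than actual estimates.

```r
# Minimal sketch under stated assumptions, not the article's implementation:
# multi-model summary of a country's estimates across all specifications.
multi_model_summary <- function(est, se_ref) {
  M  <- mean(est)      # assumed composite estimate: unweighted mean over models
  ME <- sd(est)        # assumed model error: spread across specifications
  ER <- ME / se_ref    # error ratio: model error relative to the sampling standard error
  c(M = M, Min = min(est), Max = max(est), ME = ME, ER = ER)
}

# Example with simulated placeholder values (not the actual PISA 2018 results)
set.seed(1)
est_example <- rnorm(162, mean = 442.8, sd = 3.4)
round(multi_model_summary(est_example, se_ref = 3.39), 2)
```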
Table 3. Results of a multiverse analysis for PISA 2018 mathematics for country standard deviations.

Columns Est and SE refer to the reference model; M, Min, Max, ME, and ER to multi-model inference; and Pos, RMSD, Score0, Items, and Model report square roots of variance components (SRVCs).

| cnt | N | Est | SE | M | Min | Max | ME | ER | Pos | RMSD | Score0 | Items | Model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALB | 2116 | 87.9 | 2.03 | 84.9 | 75.9 | 96.2 | 5.09 | 2.50 | 3.12 | 1.27 | 0.67 | 0.18 | 3.11 |
| AUS | 6508 | 98.2 | 1.56 | 95.7 | 90.3 | 100.8 | 2.23 | 1.43 | 1.91 | 0.54 | 0.62 | 0.18 | 0.17 |
| AUT | 3104 | 95.5 | 2.16 | 94.3 | 90.7 | 98.9 | 1.60 | 0.74 | 0.13 | 0.25 | 0.24 | 0.43 | 0.43 |
| BEL | 3763 | 95.2 | 1.89 | 96.4 | 92.2 | 100.1 | 1.73 | 0.91 | 0.90 | 0.22 | 0.28 | 0.56 | 0.89 |
| BIH | 2934 | 87.1 | 1.78 | 84.7 | 74.3 | 104.0 | 5.43 | 3.05 | 3.07 | 1.00 | 0.96 | 0.12 | 3.72 |
| BLR | 2681 | 95.0 | 2.33 | 100.1 | 92.7 | 108.5 | 3.63 | 1.56 | 3.05 | 0.13 | 0.86 | 1.20 | 0.41 |
| BRN | 2259 | 96.5 | 1.73 | 94.3 | 88.8 | 102.5 | 3.12 | 1.81 | 1.15 | 0.55 | 0.30 | 0.29 | 2.19 |
| CAN | 7200 | 92.8 | 1.43 | 93.2 | 88.9 | 97.5 | 1.88 | 1.32 | 0.56 | 0.24 | 0.42 | 0.84 | 1.02 |
| CHE | 2679 | 97.8 | 2.00 | 97.3 | 90.9 | 101.0 | 2.00 | 1.00 | 1.24 | 0.32 | 0.55 | 0.75 | 0.64 |
| CZE | 3199 | 94.3 | 1.94 | 98.0 | 94.3 | 103.5 | 1.75 | 0.90 | 0.69 | 0.70 | 0.87 | 0.66 | 0.56 |
| DEU | 2482 | 97.6 | 1.73 | 98.1 | 93.0 | 104.0 | 2.30 | 1.33 | 0.63 | 0.60 | 0.24 | 1.47 | 0.43 |
| DNK | 3304 | 86.1 | 1.78 | 84.9 | 77.8 | 90.3 | 2.89 | 1.62 | 2.45 | 0.72 | 0.37 | 0.38 | 1.03 |
| ESP | 11855 | 87.8 | 1.31 | 87.4 | 84.0 | 91.0 | 1.27 | 0.97 | 0.60 | 0.48 | 0.35 | 0.21 | 0.25 |
| EST | 2467 | 85.4 | 1.70 | 87.6 | 79.0 | 95.1 | 3.49 | 2.05 | 0.64 | 0.31 | 0.68 | 1.96 | 2.30 |
| FIN | 2573 | 83.2 | 1.84 | 85.4 | 81.0 | 90.2 | 2.12 | 1.15 | 0.67 | 0.90 | 0.68 | 0.40 | 0.87 |
| FRA | 2880 | 95.4 | 2.10 | 93.1 | 86.1 | 96.2 | 1.87 | 0.89 | 1.17 | 0.53 | 0.44 | 0.48 | 0.74 |
| GBR | 5979 | 100.4 | 1.90 | 98.7 | 91.8 | 105.0 | 2.83 | 1.49 | 1.69 | 0.17 | 1.59 | 1.04 | 0.42 |
| GRC | 2114 | 91.8 | 2.45 | 92.8 | 86.5 | 103.4 | 3.87 | 1.58 | 2.76 | 0.83 | 0.70 | 0.66 | 1.90 |
| HKG | 2008 | 98.9 | 2.79 | 96.8 | 85.7 | 107.0 | 5.03 | 1.80 | 3.45 | 1.92 | 0.57 | 0.02 | 2.62 |
| HRV | 2150 | 86.8 | 2.54 | 87.8 | 82.1 | 94.8 | 2.71 | 1.07 | 1.70 | 0.56 | 0.38 | 0.46 | 1.58 |
| HUN | 2361 | 94.7 | 2.15 | 98.7 | 92.8 | 106.9 | 3.52 | 1.64 | 1.36 | 1.35 | 0.26 | 2.47 | 1.21 |
| IRL | 2581 | 80.0 | 1.42 | 80.1 | 76.5 | 84.3 | 2.11 | 1.49 | 1.31 | 0.53 | 0.28 | 1.11 | 0.29 |
| ISL | 1485 | 93.5 | 2.33 | 93.4 | 88.2 | 97.5 | 2.03 | 0.87 | 0.51 | 0.51 | 0.43 | 0.29 | 0.47 |
| ISR | 1944 | 119.8 | 3.15 | 117.9 | 109.8 | 128.8 | 3.97 | 1.26 | 2.05 | 0.69 | 1.38 | 1.29 | 1.81 |
| ITA | 5475 | 94.6 | 2.49 | 93.9 | 87.6 | 97.1 | 2.11 | 0.85 | 0.92 | 0.34 | 0.30 | 1.59 | 0.19 |
| JPN | 2814 | 91.4 | 2.33 | 89.1 | 79.0 | 97.8 | 4.33 | 1.86 | 3.08 | 0.90 | 0.39 | 1.68 | 1.64 |
| KOR | 2200 | 103.4 | 2.48 | 98.0 | 86.3 | 107.8 | 3.99 | 1.61 | 1.36 | 1.47 | 1.30 | 1.16 | 1.71 |
| LTU | 2265 | 93.3 | 2.07 | 95.6 | 90.8 | 101.5 | 2.29 | 1.11 | 0.65 | 0.35 | 1.11 | 1.52 | 0.02 |
| LUX | 2407 | 101.2 | 1.64 | 101.0 | 95.7 | 106.1 | 2.05 | 1.25 | 0.33 | 0.28 | 0.67 | 1.33 | 0.78 |
| LVA | 1751 | 84.1 | 2.08 | 83.0 | 73.3 | 88.5 | 3.33 | 1.60 | 0.85 | 1.10 | 0.23 | 2.51 | 0.87 |
| MLT | 1113 | 112.8 | 3.17 | 104.2 | 95.3 | 114.7 | 4.27 | 1.35 | 2.16 | 1.35 | 2.96 | 0.23 | 0.45 |
| MNE | 3066 | 89.2 | 1.57 | 84.3 | 78.2 | 92.4 | 2.84 | 1.81 | 0.97 | 0.35 | 1.03 | 1.23 | 1.61 |
| MYS | 2797 | 88.2 | 1.90 | 88.5 | 80.0 | 96.9 | 3.72 | 1.95 | 1.44 | 1.42 | 1.04 | 0.05 | 2.19 |
| NLD | 1787 | 90.0 | 2.54 | 90.2 | 78.7 | 101.5 | 5.55 | 2.19 | 3.75 | 0.31 | 0.22 | 1.80 | 2.96 |
| NOR | 2679 | 95.2 | 1.78 | 91.7 | 86.2 | 96.5 | 2.08 | 1.17 | 0.71 | 1.10 | 0.33 | 0.99 | 0.59 |
| NZL | 2821 | 97.9 | 1.64 | 99.4 | 95.9 | 103.4 | 1.79 | 1.09 | 0.36 | 0.05 | 0.37 | 1.33 | 0.43 |
| POL | 2577 | 94.2 | 2.12 | 95.4 | 89.7 | 99.3 | 1.94 | 0.92 | 1.18 | 0.87 | 0.14 | 0.70 | 0.75 |
| PRT | 2730 | 97.6 | 2.17 | 103.5 | 94.9 | 113.1 | 4.13 | 1.90 | 3.23 | 0.48 | 1.21 | 1.90 | 0.26 |
| RUS | 2510 | 84.6 | 2.16 | 85.7 | 81.0 | 93.0 | 2.59 | 1.20 | 2.01 | 1.01 | 0.24 | 0.38 | 0.30 |
| SGP | 2201 | 101.5 | 1.90 | 102.2 | 89.6 | 111.6 | 4.73 | 2.49 | 0.23 | 1.78 | 0.81 | 1.09 | 3.92 |
| SVK | 1904 | 97.8 | 2.26 | 99.2 | 92.0 | 109.8 | 3.06 | 1.35 | 0.71 | 1.28 | 0.73 | 1.41 | 0.96 |
| SVN | 2863 | 91.1 | 1.97 | 92.9 | 89.0 | 96.6 | 1.79 | 0.91 | 0.91 | 0.63 | 0.22 | 0.06 | 0.72 |
| SWE | 2539 | 95.1 | 1.89 | 97.0 | 89.3 | 103.3 | 3.23 | 1.71 | 2.25 | 1.23 | 0.84 | 0.88 | 0.20 |
| TUR | 3172 | 94.2 | 2.37 | 96.9 | 89.1 | 107.6 | 3.44 | 1.45 | 0.87 | 2.14 | 1.27 | 1.21 | 0.45 |
| USA | 2218 | 97.1 | 2.34 | 98.9 | 93.1 | 106.2 | 2.60 | 1.11 | 0.94 | 0.76 | 0.91 | 1.39 | 0.38 |
Note. cnt = country label (see Appendix A); N = sample size; M = composite estimator for multi-model inference (see (8)); ME = model error (see (9)); ER = error ratio defined as ME/SE (see (10)); Items = item choice (see Section 2.2.4); Model = specified IRT model (see Section 2.2.1); Pos = choice for handling position effects (see Section 2.2.5); RMSD = used cutoff value for RMSD item fit statistic for handling DIF (see Section 2.2.2); Score0 = scoring of missing item responses (see Section 2.2.3); Square roots of variance components larger than 1.00 are printed in bold.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
