Examining Position Effects on Students’ Ability and Test-Taking Speed in the TIMSS 2019 Problem-Solving and Inquiry Tasks: A Structural Equation Modeling Approach

Abstract: Position effects occur when changes in item positions on a test impact the test outcomes (e.g., …)


Introduction
Digital assessments are on the rise, with many countries around the world making the transition from paper-based to computer-based assessments for at least some of their school- or national-level examinations. International large-scale assessments in education, such as the Programme for International Student Assessment (PISA) and the Trends in International Mathematics and Science Study (TIMSS), made the transition to a digital format in 2015 and 2019, respectively [1,2]. To take full advantage of the digital platform, test developers usually incorporate innovative item types (e.g., technology-rich items) in these assessments to enhance test-taking engagement and potentially improve the measurement quality of the intended constructs. In addition, various types of process data are often captured in the background (e.g., item response times and event log data) to help uncover greater insights into students' test-taking process [3].
In eTIMSS 2019, the digital version of TIMSS 2019, two additional booklets (Booklets 15 and 16) comprising innovative problem-solving and inquiry (PSI) tasks were developed in addition to the usual 14 student booklets included in the paper-based version of TIMSS. These tasks were designed around real-life scenarios and incorporated various interactive elements to engage the students and capture their responses [3]. In each of the two booklets, the tasks were identical but placed in different orders to counterbalance potential position effects on item statistics and achievement [4]. Upon analysis of data from the PSI tasks, Mullis et al. [3] noted differences between students' completion rates for each block of tasks in the two booklets. For example, the completion rate was generally higher when a task was presented earlier in a test session. Further analysis revealed that, among the students who did not complete all the items, a higher proportion stopped responding rather than ran out of time on the test [3]. This finding suggests that items' positions on a test might have impacted students' use of time during the test, their test-taking motivation (or effort), and their performance.
Previous studies on position effects in large-scale assessments have mainly focused on their impact on item parameters, such as item difficulty, to address concerns of fairness (e.g., [5][6][7][8][9]). Several more recent studies have also examined how position effects could vary across subject domains (e.g., [10,11]), item types (e.g., [11,12]), or student characteristics such as ability level (e.g., [11,13]) or gender (e.g., [14]). Other studies have explored the relationship between position effects and test-taking effort (e.g., [15,16]) or the relationship between ability and speed, including potential applications of response time in measuring or predicting achievement [17][18][19][20]. However, only a few studies have examined the effects of item position on test-taking speed. Given the increasing adoption of digital assessments involving innovative item types, it is also essential to study position effects within this context. In this study, we use response data from the eTIMSS 2019 Grade 4 Mathematics and Science PSI tasks to examine the associations between block positions, students' test-taking speed, and their ability. Findings from this study could offer insight into the interplay of these variables in a computer-based test with technology-enhanced items and potentially help inform future test development practices.

Theoretical Framework
In large-scale educational assessments such as PISA and TIMSS, booklet designs are typically used for test assembly and administration [21]. As such, each student is administered a particular booklet that contains a subset of all items used in the assessment, organized into item blocks. The same block of items usually appears in more than one booklet so that items can be linked and calibrated on a common scale [8]. Item blocks are intentionally distributed so that the same item block appears at different positions in different booklets. This approach helps enhance test security [13] and counterbalance position effects on item statistics [21,22]. The eTIMSS 2019 PSI booklets used a similar counterbalancing booklet design, but in this case, there were only two booklets, each containing all five PSI tasks (see Table 1).

Note: M1 and M2 are mathematics item blocks. S1 and S2 are science item blocks. There were 5 PSI tasks in total: 3 for mathematics (2 in M1, 1 in M2) and 2 for science (1 each in S1 and S2). Table adapted from [3].
Researchers have shown significant interest in item position effects, driven by the prevalent use of test designs in which students encounter the same items at different points during the assessment. This phenomenon applies to booklet designs as well as computerized adaptive tests and multistage adaptive tests, where item and testlet positions cannot be fully controlled [6,23]. Numerous studies have explored how item position influences item parameters, particularly item difficulty, employing various modeling approaches.
Researchers have often advocated for the review and potential removal of items displaying substantial position effects to enhance test fairness [6,23].
Generally, two types of position effects have been reported in the literature [24]: a positive position effect (i.e., when an item becomes easier when administered at later positions; see, for example, [10]) and, more frequently, a negative position effect (i.e., when an item becomes more difficult when administered at later positions; see, for example, [11]). Kingston and Dorans [23] and Ong et al. [12] found that susceptibility to position effects appears to be item-type-specific. In particular, they found that longer items with higher reading demands were more susceptible to item position effects. Demirkol and Kelecioğlu [11] found stronger negative position effects in reading items compared to mathematics items using PISA 2015 data from Turkey. On the other hand, Hohensinn et al. [8] did not find any significant position effects for mathematical or quantitative items under unspeeded conditions (i.e., when sufficient time was given to complete all items). This supported Kingston and Dorans' [23] earlier findings and led the researchers to suggest that "position effects should be examined for every newly constructed assessment which deals with booklet designs" (p. 508). Debeer and Janssen [13] conducted an empirical study using PISA 2006 data and found that position effects could differ for individuals with different latent abilities (students with higher ability tend to be less susceptible to position effects). Weirich et al.'s [16] study partly supported this finding and further demonstrated that changes in test-taking effort may also moderate position effects throughout a test.
In the context of eTIMSS 2019, Fishbein et al. [22] acknowledged the presence of position effects in the PSI booklets, especially for mathematics. PSI item blocks appearing in the second half of a test session were more difficult and had more not-reached responses than item blocks appearing in the first half [22]. The actual completion rates for each task also varied based on block position [3]. These findings suggest that there could have been a booklet effect on students' overall achievement and their performance on individual items. In this case, the availability of response time data also presents a unique opportunity to examine the booklet effect on students' use of time during the test as an indicator of their test-taking speed.
Figure 1 shows a theoretical model demonstrating the relationship between items, booklets, and response times. The model defines two latent variables: ability, with item-level scores as its indicators, and speed, with screen-level response times as its indicators (item-level response times were not available for the PSI tasks in TIMSS 2019). Booklet is a binary variable in this context, and its effect on ability and speed will be examined. In the model, it is also possible to examine the booklet effect on ability and speed across individual items and screens throughout the test. This addition could offer greater insight, especially when viewed in conjunction with individual item characteristics.
Ability and speed are commonly associated with each other (e.g., [18,19,25,26]). There are generally two perspectives on the relationship between speed and ability. One perspective is that spending more time on an item (i.e., working more slowly) increases the probability of answering the item correctly, whereas speeding up reduces the expected response accuracy. This phenomenon is commonly referred to as the within-person "speed-ability trade-off" [19,27]. On the other hand, a person with stronger ability in a domain could exhibit faster speed due to greater skill and fluency [28]. Goldhammer [19] pointed out that most assessments are a mixture of speed and ability tests, as they typically have a time limit and include items of varying difficulty, so it can be very difficult to separate these measures. Goldhammer et al. [28] closely examined the relationship between the time spent on a task and task success using large-scale assessment data from the computer-based Programme for the International Assessment of Adult Competencies (PIAAC) and found that the time-on-task effect is moderated by task difficulty and skill. Notably, the researchers found that task success is positively related to time spent on task for more difficult tasks, such as problem-solving, and negatively related for more routine or easier tasks. These findings suggest that the relationship between speed and ability is complex and could vary across contexts. In Figure 1, the relationship between speed and ability is left as a correlation, as there is no theoretical basis to say that either one causes the other.
Position, ability, and speed have all been modeled in different ways in various studies that examined different combinations of these ideas. For speed, a well-known approach to modeling response times is the lognormal model introduced by van der Linden [29]. This model is based on item response theory (IRT) and has been extended in various ways to incorporate other variables, such as with a multivariate multilevel regression structure [30] and with structural equation modeling (SEM) [31]. For a detailed overview of modeling techniques involving response times, see De Boeck and Jeon's [32] recent review. For position effects, researchers have often employed IRT-based methodologies such as Rasch or 2PL models incorporating random or fixed position effects (e.g., [9,33]), or explanatory IRT approaches based on generalized linear mixed models (e.g., [8,11,12,16]). Bulut et al. [6] introduced a factor analytic approach using the SEM framework, which allows for the examination of linear position effects and interaction effects in the same model and provides added flexibility for assessments with more complex designs. In this study, an SEM approach was employed to model position, ability, and test-taking speed within the same model. Because response times were captured at the screen level rather than at the item level, an IRT-based approach was not appropriate.
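To make the lognormal response time model concrete, the sketch below simulates data under its core assumption: the log response time of person i on item j is normally distributed around the item's time intensity minus the person's latent speed. All parameter values are illustrative assumptions, not estimates from the PSI data.

```python
import numpy as np

rng = np.random.default_rng(42)

n_persons, n_items = 1000, 10
tau = rng.normal(0.0, 0.3, size=n_persons)    # latent speed: higher = faster
beta = rng.normal(4.0, 0.4, size=n_items)     # item time intensity (log-seconds)
alpha = rng.uniform(1.5, 2.5, size=n_items)   # item time discrimination

# Lognormal RT model: ln(T_ij) ~ Normal(beta_j - tau_i, 1 / alpha_j)
log_rt = (beta[None, :] - tau[:, None]
          + rng.normal(0.0, 1.0 / alpha[None, :], size=(n_persons, n_items)))
rt = np.exp(log_rt)  # simulated response times in seconds
```

Under this model, persons with larger tau produce systematically shorter response times across items, which is the sense in which screen or item times can serve as indicators of a latent speed factor.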
The following hypotheses, derived from a thorough literature review, can offer insights into the PSI tasks in TIMSS 2019. First, a negative correlation is anticipated between speed and ability, owing to the problem-solving nature of the PSI tasks, implying that greater speed may correspond to lower ability. Second, a shift from Booklet 15 to Booklet 16 is predicted to be associated with an increase in science ability but a reduction in mathematics ability. This expectation arises from the change in subject sequencing. Third, the impact of the booklet change is expected to manifest across all four item blocks, with a potentially stronger influence on items in blocks M1 and S2 due to the more substantial positional change between Block Position 1 and Block Position 4.
The current study aims to contribute to the existing literature in several ways. First, previous research examining position effects typically used item data from more traditional forms of assessment (e.g., multiple-choice items). In this study, position effects are studied in the context of a computer-based assessment with technology-rich items, which could offer valuable insights, especially as more PSI-type items are planned for future cycles of eTIMSS [34]. Second, few studies have incorporated response times into research on position effects (e.g., [35]). Since response times are routinely captured in digital assessments, tapping into this data source would add value to current discussions.

Data Source
This study used response data from the eTIMSS 2019 Grade 4 PSI booklets. eTIMSS, the digital version of TIMSS, was taken by students from 36 participating countries in 2019. PSI tasks were placed in Booklets 15 and 16 and administered to approximately 12% of all students who participated in eTIMSS 2019. In the eTIMSS 2019 administration, each student was randomly assigned one booklet to complete, followed by a 30 min questionnaire [36]. At the Grade 4 level, five PSI tasks (three mathematics and two science tasks, each comprising between six and twelve items) were grouped into two mathematics and two science blocks and presented to students in two separately timed sessions of 36 min each, with a 15 min break in between (see Table 1) [3]. The two PSI booklets consisted of the same tasks and item blocks, arranged in different orders.
The Grade 4 PSI dataset included responses from 27,682 students from 36 countries. The students had a mean age of 10.14 years (SD = 0.57 years) and were evenly split between males (50.6%) and females (49.4%). Half (50%; 13,829) of the students completed Booklet 15, and the rest completed Booklet 16. The two booklets were similar regarding students' demographic characteristics (see Table 2), which is expected, given that all booklets used in eTIMSS were administered according to a rotated design [22]. A separate check confirmed that the representation by country was also similar across the two booklets.

Measures
Two sets of measures were derived from the PSI dataset: one for scores and another for response times on each of the five PSI tasks. The TIMSS International Database [4] contained students' responses to all the individual PSI items, coded as fully correct, partially correct, incorrect, not reached, or omitted/invalid. For this study, all items were scored using the same methodology as that used by TIMSS for achievement scaling. Omitted items were given a score of zero, and not-reached items were treated as missing. Furthermore, some items were excluded from the data, as not all PSI items were included in achievement scaling for TIMSS (e.g., items exhibiting poor psychometric properties and science items with post-clue scores [22]). These data preparation procedures yielded a total of 29 mathematics and 18 science PSI items. Table 3 shows the complete list of items and the maximum possible score for each item.
Response times for each task were derived using the screen times captured in the original dataset. Screen time refers to the total time a student spends on a particular screen, and each screen could contain between one and three items. There were a total of 17 screens containing mathematics items and 17 screens containing science items (see Table 4). It was observed in the data that some students spent a disproportionate amount of time on specific screens, which could have resulted from disengaged behavior (i.e., the student stopped responding midway through the test) or from completing the test early and staying on the same screen until the test ended. The screen time would not accurately represent the student's speed in these cases. Thus, it was necessary to determine a reasonable threshold for removing outliers from the data. As the number of items on each screen was not the same, and items may vary in difficulty and demand, the outlier threshold for each screen should not be the same.
In this study, the transformation approach suggested by Cousineau and Chartier [37] was adopted to identify response time outliers for each screen. This method was found to work well for response time data, yielding low bias in the data cleaning process [38]. To identify the outliers (i.e., responses with very high or low response times), the following transformation was first applied to the response times for each screen:

x* = (x − X_min)/(X_max − X_min),

where x is the untransformed response time, X_min is the minimum response time (out of all students) on a given screen, and X_max is the maximum response time on that screen. This transformation normalizes the data and bounds it to the range [0, 1]. Following this step, z scores were computed. In this study, screen response times associated with a z score larger than 3 or smaller than −3 were removed. This application of Cousineau and Chartier's [37] method removed between 0.4% and 1.7% of the response time data for each screen.
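The cleaning step above can be sketched as follows; the function name and the simulated screen times are ours, but the logic mirrors the min-max rescaling and the |z| > 3 removal rule described in the text.

```python
import numpy as np

def clean_screen_times(times, z_cut=3.0):
    """Remove response-time outliers on one screen via min-max rescaling + z scores.

    Mirrors the transformation described above: rescale times to [0, 1],
    standardize, and drop observations with |z| > z_cut.
    """
    t = np.asarray(times, dtype=float)
    scaled = (t - t.min()) / (t.max() - t.min())   # bound into [0, 1]
    z = (scaled - scaled.mean()) / scaled.std()    # standardize
    return t[np.abs(z) <= z_cut]

rng = np.random.default_rng(0)
# 500 plausible screen times (seconds) plus two students idling on the screen.
times = np.concatenate([rng.normal(60, 10, 500), [600.0, 650.0]])
cleaned = clean_screen_times(times)
```

Because the rescaling is done per screen, the effective threshold adapts to each screen's own time distribution, which is why a single fixed cutoff in seconds is not needed.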

Data Analysis
This study followed an SEM approach to examine booklet effects on students' ability and speed in the context of a PSI assessment. Descriptive and correlation analyses were first conducted on all the observed variables (booklet, 47 score indicators, and 34 speed indicators) to check that distributional assumptions were met and that there were no multicollinearity issues. For the SEM analysis, ability indicators (item scores) were treated as categorical (ordinal) variables due to the way in which they were scored (i.e., correct, partially correct, incorrect). In contrast, speed indicators (screen response times) were treated as continuous variables. All analyses were conducted using Mplus 8.10 [39]. The weighted least-squares mean- and variance-adjusted (WLSMV) estimator was used to handle the categorical indicators for ability, and the rest of the model was estimated using the default maximum likelihood (ML) estimator.
The theoretical SEM model shown in Figure 1 was first fitted to the data and assessed for model fit. Alternative models were tested for global and local fit before arriving at the final structural regression models. The final models were analyzed in two stages. In the first stage, model parameters were estimated without the dashed paths from the booklet variable to the individual items or screens. In the second stage, these paths were added to examine the booklet effect on individual items and screens. To evaluate model fit, aside from the chi-square test, the following indices were used: root-mean-square error of approximation (RMSEA), comparative fit index (CFI), Tucker-Lewis index (TLI), and standardized root-mean-square residual (SRMR). The cutoff values suggested by Hu and Bentler [40] were referenced; namely, CFI and TLI greater than 0.95, RMSEA smaller than 0.06, and SRMR smaller than 0.08 indicate a relatively good fit for models analyzed using ML. Xia and Yang [41] cautioned against using a universal set of cutoff values for analyses conducted with ordered categorical variables. In particular, they noted that fit indices under WLSMV estimation tended to show better model-data fit than ML fit indices for the same misspecified model. Hence, in this study, the suggested cutoff values were used to diagnose model fit, but not as the sole justification for accepting a model.
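For reference, RMSEA, CFI, and TLI are simple functions of the model and baseline (independence-model) chi-square statistics. The sketch below uses hypothetical chi-square values, not the actual Mplus output, and illustrates why the exact-fit test is nearly always significant at this sample size even when RMSEA looks good.

```python
import math

def rmsea(chi2, df, n):
    """Root-mean-square error of approximation from the model chi-square."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative fit index: model vs. baseline (independence) model."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m)
    return 1.0 - d_m / d_b

def tli(chi2_m, df_m, chi2_b, df_b):
    """Tucker-Lewis index (non-normed fit index)."""
    return ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1.0)

# Hypothetical values for illustration only.
n = 27_682
fit = {
    "rmsea": rmsea(2500, 1000, n),
    "cfi": cfi(2500, 1000, 60000, 1081),
    "tli": tli(2500, 1000, 60000, 1081),
}
```

With these illustrative numbers, the chi-square far exceeds its degrees of freedom (so the exact-fit test rejects), yet RMSEA remains well under 0.06 because the discrepancy is divided by N − 1.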
Regarding missing data, the proportion of missing scores across the PSI items ranged from 0.7% to 19.4%, while the proportion of missing response times ranged from 2.2% to 15.2%. Greater missingness typically occurred in the last few items of a task due to students running out of time (see Table 3). For the speed part of the model, missing data were handled through full-information ML estimation in Mplus, which estimates model parameters directly from the available data without deleting cases or imputing missing values [42]. For the ability part of the model, missing data were handled using pairwise deletion through the WLSMV estimator [39].

Descriptive and Correlation Analyses
A preliminary data screening was conducted on all observed variables to check that the assumptions of SEM had been met. Descriptive statistics for the ability and speed indicators are presented in Tables 3 and 4, respectively. For the part of the model estimated using the default ML method, it was important to screen the data for multivariate normality. Here, we adopted the approach suggested by Kline [42] to assess quantitative measures of skewness and kurtosis in the observed variables. After removing outliers, the distribution of each screen response time variable was found to be approximately normal, with skewness and kurtosis values below 2 and 7, respectively [42]; hence, further transformation was not necessary.
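The screening rule can be sketched as follows. The moment-based indices and the |skewness| < 2 and |kurtosis| < 7 cutoffs follow the guidelines cited above; the simulated data (one well-behaved variable, one heavy-tailed one) are purely illustrative.

```python
import numpy as np

def skew_kurtosis(x):
    """Moment-based sample skewness and excess kurtosis."""
    z = (np.asarray(x, dtype=float) - np.mean(x)) / np.std(x)
    return float((z ** 3).mean()), float((z ** 4).mean() - 3.0)

def passes_normality_screen(x, skew_cut=2.0, kurt_cut=7.0):
    """Approximate-normality screen using the cutoffs described above."""
    s, k = skew_kurtosis(x)
    return abs(s) < skew_cut and abs(k) < kurt_cut

rng = np.random.default_rng(1)
normal_like = rng.normal(3.0, 0.8, size=5000)    # e.g., cleaned screen times (minutes)
heavy_tail = rng.lognormal(1.0, 1.0, size=5000)  # e.g., raw, skewed times
```

The heavy-tailed variable fails the screen, which is the situation that would call for outlier removal or transformation before ML estimation.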
Bivariate correlations were computed for each pair of observed variables. The correlations between item score variables generally ranged from r = 0.1 to r = 0.3. The correlations between screen response time variables appeared to vary distinctly by PSI task, with the screen response times between some pairs of tasks correlating more strongly than others. The response times for screens of the same task were more closely related (generally ranging from r = 0.2 to r = 0.4) than those for different tasks. The correlations between item scores and screen response times were mainly close to 0. Overall, the maximum absolute correlation between any two variables was 0.51, and the variance inflation factors for all observed variables ranged between 1.1 and 2.2 (less than 10), which indicated that multicollinearity was not a concern [43].
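The multicollinearity check can be reproduced in sketch form: the variance inflation factors are the diagonal elements of the inverse correlation matrix. The data below are simulated with modest shared variance, roughly matching the magnitudes reported above; the actual PSI variables are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 2000, 6

# Simulated observed variables with a common latent component.
latent = rng.normal(size=(n, 1))
X = 0.5 * latent + rng.normal(size=(n, p))

# VIF_j = 1 / (1 - R_j^2), obtainable as the j-th diagonal element
# of the inverse of the correlation matrix.
R = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(R))
```

A common rule of thumb flags VIF values of 10 or more as signaling multicollinearity; the values reported in the study (1.1 to 2.2) are well below that threshold.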

Model Specification and Fit
As a first step in the analysis, the initial theoretical model in Figure 1 was fitted to the data to assess the model fit. It was necessary to rescale the screen response time variables to units of minutes instead of seconds to prevent the problem of an ill-scaled covariance matrix and allow the model to converge [42]. The conceptual structural regression model demonstrated a poor fit to the data. To identify the source of the misfit, the measurement components of the structural regression model were analyzed separately before being combined with the booklet variable. Table 5 shows the model fit indices of the models tested in this study. It was noted that all tested models yielded a significant exact-fit (χ²) test, which could be due to the large sample size in this study. For the ability component, a one-factor model with a single ability construct, indicated by all items on the test, was found to fit the data well. However, as TIMSS typically treats mathematics and science achievement as two separate constructs and reports these results separately, it is more appropriate to reflect this in the model using two separate latent constructs. The two-factor ability model showed a good fit to the data, as indicated by the global fit indices. As Mplus does not display standardized or normalized residuals for analyses conducted using the WLSMV estimator, modification indices were consulted for an indication of local fit. No inter-item error correlations were suggested that would result in a significant improvement in χ². Thus, the measurement model for ability was retained as such, which also fits well with theory. TIMSS uses item response theory for achievement scaling, which assumes local independence of item responses given ability. Since the items used in this study were those included in the eTIMSS achievement scaling, they can be assumed to be high-quality items, and there is thus no basis for correlating errors between any pair of items.
For the speed component, a one-factor model with a single speed construct, indicated by all screen response times on the test, did not fit the data well. It is conceivable that the response time patterns for mathematics items could differ from those for science items, and thus a two-factor model was also tested. However, the model fit was still poor (see Table 5). The earlier correlation analysis suggested that the response time patterns could be task-specific. A five-factor model with separate latent speed variables (one for each task), each indicated by the screen response times for the specific task, yielded a substantial improvement in the global model fit. The absolute fit indices (RMSEA and SRMR) indicated a good fit, although the relative fit indices (CFI and TLI) still indicated an insufficient model fit. An inspection of the normalized residuals showed that the local misfit was scattered throughout the model rather than confined to several pairs of observed variables (this is discussed further in the next section). One possible explanation is that response time data are inherently prone to fluctuations and are difficult to capture accurately in a way that truly represents a student's test-taking speed. One modification was made to the model by correlating the errors on two specific screens (SF01_S and SF02_S), as the normalized residual between these two screens was substantially larger than the others. A close review of the specific items on these two screens (available in [3]) revealed that the items were more open-ended, with very similar wording and structure, suggesting that the unique variances of these screen response times could be related. Adding this modification improved the model fit slightly, and no other modifications were made, as they would not have been theoretically justifiable.
The original theoretical model (see Figure 1) required combining the ability and speed models and assumed that ability and speed could be uniquely measured by item scores and screen response times, respectively (i.e., no cross-loadings). The five-factor speed model yielded the best possible model fit in this study. While not ideal, it could give a reasonable representation of test-taking speed for the purpose of this study. Other attempts to form a single latent speed variable (e.g., using a higher-order latent variable to draw shared variance from the five tasks, or specifying indicators at the task level instead of at the screen level) also did not yield a sufficient model fit. Due to the challenges of modeling speed in this context, a combined ability and speed model was not feasible and would not fit the data well. Hence, subsequent analyses of the booklet effect were carried out separately for the ability and speed models.

Booklet Effect on Ability
The final structural regression model for the booklet effect on ability is shown in Figure 2, and the parameter estimates for the overall model are reported in Table 6. The model was also re-run once for each item on the test (including the dashed path) to examine the booklet effect on each item throughout the test; the parameter estimates for these dashed paths are reported in Table 7. Because of the large sample size in this study, most of the parameter estimates were statistically significant, so it was essential to consider effect sizes. For the factor loadings, the average variance extracted (the average of all squared standardized loadings) was 0.40 for the mathematics ability factor and 0.30 for the science ability factor. These results indicated that the item score variables were good indicators of their factors (based on criteria from [44], who also noted that factor loadings for categorical indicators tend to be lower).

The first part of the analysis focused on the direct effects of the booklet on mathematics and science ability, respectively. The results showed that a change from Booklet 15 to Booklet 16 predicted a slight decrease in both mathematics and science ability. However, the differences were only about 0.04 in the score, so they may not be of practical significance. At the item level, a consistent pattern emerged: the booklet variable was associated with decreased performance on items in blocks M1 and S1 but predicted increased performance on items in blocks M2 and S2. In the assessment, blocks M1 and S1 were administered in the first half of each session in Booklet 15 but in the second half in Booklet 16, whereas blocks M2 and S2 were administered in the second half of each session in Booklet 15 but in the first half in Booklet 16. Thus, students performed better when the same item was placed earlier in a test session. The booklet effect also appeared stronger for some items than others; in particular, it generally seemed stronger for items appearing in the second half of a block.
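The average variance extracted (AVE) referred to above is simply the mean of the squared standardized loadings. As a minimal sketch (the loading values below are hypothetical illustrations, not the estimates reported in Table 6):

```python
# Hypothetical standardized factor loadings for a five-indicator ability factor;
# these are illustrative values, not the study's reported estimates.
loadings = [0.70, 0.58, 0.66, 0.55, 0.64]

# Average variance extracted = mean of the squared standardized loadings
ave = sum(l ** 2 for l in loadings) / len(loadings)
print(round(ave, 2))  # → 0.39
```

With loadings in this range, the AVE lands close to the 0.40 reported for the mathematics factor, which illustrates why loadings around 0.6-0.7 can still yield AVE values below the conventional 0.50 threshold for categorical indicators.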

Booklet Effect on Speed
The final structural regression model for the booklet effect on speed is shown in Figure 3, and the parameter estimates for the overall model are reported in Table 8. As with the ability model, the speed model was re-run once for each screen on the test (including the dashed path) to examine the booklet effect on the response time for each screen. The parameter estimates for these dashed paths are reported in Table 9.

Regarding the factor loadings, the average variance extracted for the speed factors ranged between 0.24 and 0.33, suggesting that the indicators of speed were fairly good [44]. For the direct effects, the results showed that a change from Booklet 15 to Booklet 16 predicted an increase in speed on the tasks in blocks M1 and S1 and a decrease in speed on the tasks in blocks M2 and S2. This finding suggests that students tended to spend more time responding to the same task when it was placed in the first half of a test session. The standardized estimates, indicating effect size, suggested that the booklet effect on task-level speed was non-trivial. The analysis of the booklet effect on individual screen response times showed a more mixed picture within each task, but larger effects tended to appear on the last screens of each task.
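Because the booklet variable is a 0/1 dummy, its regression slope on a speed factor equals the mean difference between the two booklet groups, and the standardized estimate rescales that slope by the predictor and outcome standard deviations. A minimal sketch with fabricated factor scores (all values below are hypothetical):

```python
import statistics

# Hypothetical speed factor scores; booklet coded 0 = Booklet 15, 1 = Booklet 16
booklet = [0, 0, 0, 0, 1, 1, 1, 1]
speed = [-0.2, -0.1, 0.0, -0.3, 0.3, 0.2, 0.4, 0.1]

# For a 0/1 predictor, the OLS slope is the difference in group means
g15 = [s for b, s in zip(booklet, speed) if b == 0]
g16 = [s for b, s in zip(booklet, speed) if b == 1]
slope = statistics.mean(g16) - statistics.mean(g15)  # ≈ 0.40 here

# Standardized estimate: slope scaled by SD(predictor) / SD(outcome)
slope_std = slope * statistics.pstdev(booklet) / statistics.pstdev(speed)
```

A positive slope in this toy setup would correspond to faster responding in Booklet 16, matching the direction reported for the tasks in blocks M1 and S1.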

Discussion
This study examined booklet effects on students' ability and test-taking speed in a digital problem-solving and inquiry assessment in eTIMSS 2019. The two booklets contained the same tasks and items but differed in the positions of the item blocks. The analysis of overall ability suggested a small but statistically significant booklet effect on overall mathematics and science ability, with both being slightly lower for Booklet 16. In the booklet design, the order of the subjects and the order of appearance of the item blocks in each test session were switched in Booklet 16. Referring to the IRT item parameters published by TIMSS [4], the average difficulty (b) parameters for the four item blocks were 0.317 (M1), 0.861 (M2), 0.227 (S1), and 0.463 (S2), meaning that the items in M2 and S2 were generally more difficult than those in M1 and S1. In Booklet 16, students were first presented with the more difficult blocks in both test sessions. This is a possible explanation for the observed booklet effect and is consistent with previous research (e.g., [45-47]) finding that hard-to-easy item arrangements on a test tended to predict lower test performance than easy-to-hard or random arrangements, particularly under an imposed time limit. Notably, those studies were typically conducted using traditional pen-and-paper multiple-choice tests.
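Under a two-parameter logistic (2PL) IRT model, the higher average b parameters for M2 and S2 translate directly into lower expected success rates at a given ability level. A small sketch using the block averages quoted above (the discrimination value a = 1.0 is a placeholder assumption, not a published estimate):

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Average block difficulties reported by TIMSS [4]; a = 1.0 is a placeholder
p_m1 = p_correct(0.0, 1.0, 0.317)  # block M1, average b = 0.317
p_m2 = p_correct(0.0, 1.0, 0.861)  # block M2, average b = 0.861
```

For an average student (theta = 0), the expected probability of a correct response is noticeably higher for M1 items than for M2 items, which is the sense in which Booklet 16 led with the harder material.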
The item-level analysis suggested a booklet effect on both ability and speed for items appearing in the same block. When item blocks were placed in the first half of a test session, students' speed on those items was slower and their performance was better. This points to a negative position effect, which is consistent with numerous other studies (e.g., [9,11,13,24]). An intuitive explanation is that students tended to work through items more carefully and slowly at the start of each test session but may have felt more tired, less motivated, or rushed for time toward the end of the test. Previous research on item position effects has often discussed fatigue effects and practice effects (e.g., [8,10,23,48]), suggesting that performance can decrease as a test progresses due to fatigue or increase due to practice if students become more familiar with the test material [49]. Given the problem-solving nature of the PSI tasks, a fatigue effect seems more likely than a practice effect, as each item was crafted to be unique. However, as each test session was only 36 minutes long, another plausible explanation is that students felt more rushed for time when they attempted the second item block, affecting their performance. This finding echoes Albano's [5] argument that items with more complex content or wording may be more susceptible to position effects (i.e., perceived as more difficult) when testing time is limited. In a more recent study, Demirkol and Kelecioğlu [11] found negative position effects in the reading and mathematics domains in PISA 2015, with stronger position effects for reading and for open-ended items in mathematics, which are more complex than multiple-choice items in the same domain. Weirich et al. [16] further found that position effects were more pronounced for students whose test-taking effort decreased more throughout a test, but also noted that position effects remained even in students with persistently high test-taking effort. These findings suggest that there could be multiple causes of position effects, and further research could help uncover when and why they occur.
Interestingly, all the key findings in this study pointed towards booklet effects that were unique to each item block. The swapped order of mathematics and science between the two booklets did not seem to have impacted students' performance or speed as much as the ordering of blocks within each test session. This finding suggests that the short 15-minute break between the two test sessions acted almost like a "reset button", which mitigated the position effect and gave students equal time and opportunity to perform in both portions of the assessment. In a study by Rose et al. [50], item position and domain order effects were examined concurrently in a computer-based assessment with mathematics, science, and reading items and were found to interact substantially; however, that assessment did not incorporate any breaks between the domains. When discussing the speed-ability trade-off, Goldhammer [19] recommended that item-level speed limits be set on assessments to estimate ability levels more accurately: by ensuring that students have the same amount of time to work on each item, the confounding effect of speed would be removed. This controlled-speed idea was later tested in a more recent study [51]. In practice, it may be challenging to implement this condition due to various technical and logistical issues. However, the results of this study suggest that administering a long assessment in separately timed sessions could be a feasible alternative for improving measurement, especially if each portion is aimed at a different construct.

Limitations and Future Research
It is necessary to acknowledge the limitations of this study. First, even though the results hinted at a possible relationship between students' ability and speed in this context (e.g., a slower speed may be related to better performance), it was not possible to test this directly in the SEM framework due to poor model fit in the combined model. In eTIMSS 2019, the total response time on each screen was captured throughout the assessment. This measured the total time that students spent on each screen, which may not be the best measure of the actual response time (i.e., the amount of time that students spent engaging with the items on each screen). For example, some students may have finished the test early or decided to take a break halfway through and lingered on some screens for longer. It was also unclear whether the screen times included overhead time (e.g., screen loading time), which could vary across devices and contribute to inflated screen times if students visited the same screen multiple times. In this study, response time outliers were removed as far as possible from the two ends of the distribution, but it was still a challenge to model speed with the existing data. More fine-grained response time data, such as those available in PISA 2018 [52], may be helpful for researchers looking to use response time data to model test-taking speed.
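The outlier removal described above can be sketched as a simple percentile trim on each screen's response time distribution. The 1st/99th percentile cutoffs below are illustrative assumptions; the study does not report the exact rule used:

```python
def trim_response_times(times, lower_pct=1.0, upper_pct=99.0):
    """Drop screen response times outside the given percentile bounds.

    The percentile cutoffs are illustrative assumptions; the study does not
    specify the exact rule used to remove outliers from the two tails.
    """
    s = sorted(times)
    lo = s[int(len(s) * lower_pct / 100)]
    hi = s[min(int(len(s) * upper_pct / 100), len(s) - 1)]
    return [t for t in times if lo <= t <= hi]
```

Applied to raw screen times, a trim like this removes implausibly short click-throughs from the lower tail and long idle periods (e.g., a student lingering on a screen after finishing) from the upper tail before the speed indicators are modeled.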
Second, the dataset used in this study consisted of students from all the countries that took the eTIMSS 2019 PSI booklets. While this approach provided insights into booklet effects across all students, there may be country-specific differences that could be analyzed within each country's context. Student motivation, engagement, and exposure to PSI-like items, in addition to ability levels, could vary widely across countries. As eTIMSS is a low-stakes assessment, the results from this study may not apply to high-stakes assessments, where speed and ability may be more tightly related. As pointed out by Ong et al. [12], results from position effect studies that incorporate examinee variables (e.g., gender, effort, anxiety) tend to vary depending on the features of the testing context (e.g., the content, format, and stakes associated with the test). More research is thus needed to reveal how different groups of students may be impacted by position effects in different testing contexts.
Digital assessments incorporating elements of authentic assessment (e.g., scenario-based assessment) and interactive item types are increasingly used to evaluate students' learning. As such, contextual item blocks resembling those in the PSI assessment may increasingly replace the typical discrete items used in mathematics and science assessments. This study showed that students tended to spend more time and perform better on item blocks placed earlier in a test session. Test developers should be mindful of the potential effects of different orderings of item blocks on students' test-taking process. In practice, the relative difficulty of item blocks and position effects due to blocks appearing earlier or later in a test session should be considered when assembling multiple test forms.
In the PSI section of eTIMSS 2019, each task consists of a set of items that follow a narrative or theme surrounding a real-life context. Even though the items themselves are independent of each other [3], students' response and response time patterns could still be related to the specific tasks. Our findings suggested that, in this context, response time patterns could be task specific. More research could be carried out to examine these patterns within a task and between tasks, alongside item-specific features such as the inclusion of interactive elements, to provide insights into students' use of time and performance in such innovative digital assessments. Future research could also examine position effects alongside item-specific and examinee-specific features to better inform test development. In this study, we analyzed data from all the countries that participated in the PSI assessment. A future study could explore country-level variations in the observed position effects and their underlying causes. Lastly, it is also worthwhile to explore how speed could be better modeled using response time data, and how response time could be better captured in digital assessments, which may allow researchers to draw a link between ability and speed in this context.

Figure 1.
Theoretical model for examining booklet effects on ability and test-taking speed. Si represents item-level scores, and RTj represents screen-level response times.
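The measurement and structural parts of the theoretical model in Figure 1 can be written in lavaan-style model syntax (which the Python package semopy also accepts). The indicator names below are placeholders for the actual item score and screen response time variables, and only two factors are shown for brevity:

```python
# lavaan-style model description; S1..S3 and RT1..RT3 are placeholder names
# for the actual item scores and screen response times in the PSI data.
MODEL_DESC = """
# Measurement part: latent ability and speed factors
ability =~ S1 + S2 + S3
speed   =~ RT1 + RT2 + RT3

# Structural part: booklet (0/1 dummy) predicting both factors
ability ~ booklet
speed   ~ booklet
"""
# With real data, a sketch like this could be fit via, e.g.,
# semopy.Model(MODEL_DESC).fit(df), though the study's full model
# includes separate factors per subject and per item block.
```

This is a sketch of the general model structure only; the study's actual specification, estimators (WLSMV for ability, maximum likelihood for speed), and per-block factors are described in the text and Tables 6-9.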

Table 2.
Demographic summary of students across two PSI booklets.
Booklet | N of Students | N of Countries | Age | Gender

Table 3.
Descriptive statistics for ability indicators (mathematics and science item scores).
Note: The first letter of the item name refers to the subject (M-mathematics; S-science). The second letter is an abbreviation of the task name. If an item has multiple parts (e.g., A, B, C), the parts appeared on the same screen. M: mean; SD: standard deviation.

Table 4.
Descriptive statistics for speed indicators (response time in seconds for each screen).
Note: The first letter of the item name refers to the subject (M-mathematics; S-science). The second letter is an abbreviation of the task name. M: mean; SD: standard deviation.

Table 5.
Model fit indices for different measurement models and structural regression models.

Table 6.
WLSMV estimates for the structural regression model of the booklet effect in the mathematics (M) and science (S) tasks.

Table 7.
WLSMV estimates for the structural regression model of the booklet effect on performance in individual mathematics (M) and science (S) tasks.

Table 8.
Maximum likelihood estimates for the structural regression model of the booklet effect on speed at the mathematics (M) and science (S) task levels.

Table 9.
Maximum likelihood estimates for the structural regression model of the booklet effect on speed at the individual screen level in mathematics (M) and science (S) tasks.