Factors Associated with the Equivalence of the Scores of Computer-Based Test and Paper-and-Pencil Test: Presentation Type, Item Difficulty and Administration Order

Abstract: Since schools could not use face-to-face tests to evaluate students' learning effectiveness during the COVID-19 pandemic, many schools implemented computer-based tests (CBT) for this evaluation. From the perspective of Sustainable Development Goal 4, whether this type of test conversion affects students' performance in answering questions is an issue worthy of attention. However, studies have not yielded consistent findings on the equivalence of examinees' scores on CBT and paper-and-pencil tests (PPT) when taking the same multiple-choice tests. Some studies have revealed no significant differences, whereas others have found significant differences between the two formats. This study adopted a counterbalanced experimental design to investigate the effects of test format, computerised presentation type, difficulty of item group, and administration order of item groups of different difficulty levels on examinees' answering performance. In this study, 381 primary school fifth graders in northern Taiwan completed an achievement test on the topic of Structure and Functions of Plants, which is part of the primary school Natural Science course. The achievement test included 16 multiple-choice items. After data collection and analysis, no significant differences in the answering performance of examinees were identified among the PPT, CBT with single-item presentation, and CBT with multiple-item presentation. However, further analysis indicated that the difficulty of item group and the administration order of item groups of different difficulty levels had significant influences on answering performance. The findings suggest that, compared with a PPT, examinees exhibit better answering performance when taking multiple-choice tests in a CBT with multiple-item presentation.


Introduction
Since the outbreak of the COVID-19 pandemic in 2019, many countries have curbed the spread of the virus by reducing crowd movement and close interaction and by temporarily closing certain places, such as educational institutions and public recreational places. In keeping with the equitable quality education and lifelong learning opportunities promoted by the United Nations under Sustainable Development Goal 4, many schools adopted online education to compensate for the impossibility of learning at school during the pandemic and to mitigate the impact of school closures [1]. Because the information and communication technology (ICT) resources available to each student differ, which might affect both the opportunities to learn and the fairness of evaluation, schools have needed not only to provide online courses to continue educational activities but also to modify their original evaluation standards and methods to ensure fair evaluation during the school closures [2,3]. As students are unable to take face-to-face paper-and-pencil tests, index of the same test administered as a PPT and CBT. The test items included lengthy passages, and the topics included reading and science reasoning. The CBT was divided into paging and scrolling groups. The results revealed that in the reading test, the average scores of the PPT group were higher than those of the CBT group, but the difference was not significant. On the science reasoning test, the average scores of the CBT group were higher than those of the PPT group, but only the paging group achieved statistically significant differences, and the correlation coefficient between the difficulty of the CBT and PPT groups was greater than 0.9. Examinees' attitudes and familiarity with computers also influence their answering performance in CBT.
Russell and Plati [17] indicated that if examinees are familiar with typing on a computer, and a CBT is used to evaluate their writing ability, the students may exhibit better performance than in a PPT. Wang et al. [18] also found that compared with PPT, examinees' attitudes regarding CBT were more positive, so they exhibited better answering performance. In addition, the subject is another important factor. Kingston [19] conducted a meta-analysis of 81 studies about the differences in scores in subjects such as mathematics, reading, English language arts, science, and social studies when tests were administered using CBT and PPT among American students in grades 1-12 from 1997 to 2007. The results showed that grade level had no influence on the results, but the subjects exhibited significant influences. When CBT was adopted for English language arts and social studies, the examinees had better answering performance, but when PPT was adopted in mathematics, answering performance was better. Besides, Hensley [20] also identified a significant difference between CBT and PPT in the test performance of 142 college students in the mathematics courses.
However, some studies have indicated that the scores of CBT and PPT have equivalence and interchangeability. For example, Wang et al. [13] and Logan [21] compared mathematics tests administered using CBT and PPT through a literature review and meta-analysis and found no statistically significant differences. Several studies also found no significant difference between CBT and PPT in examinees' answering performance in reading comprehension tests [22], mathematics tests [23], and language learning tests [24,25]. The findings of relevant studies are summarised in Table 1. factors and the content factors. Presentation factors refer to the means used to present test information, including screen size, font size, resolution of graphics, the nature of the display (multiscreen, graphical, or complex), the amount of test information that can be presented on a screen at one time, and interactive assessment strategies. Content factors refer to the content of test items, including test subjects such as English, social studies, culture, mathematics, reading, and scientific reasoning. Table 1 shows that the influence of the presentation factors is larger than that of content factors. In traditional PPT, several content factors may affect performance outcomes, including item difficulty and the distribution of items at different difficulty levels. The sequence of items at different difficulty levels may affect examinees' test anxiety, which in turn may affect their confidence and ultimately have an impact on their performance outcomes [26–28]. However, studies (see Table 1) have not offered further investigation on the influence of these content factors. In addition, Leeson [12] stated that some early studies pointed out that in the CBT, whether the test items are displayed in single-item presentation or multiple-item presentation can influence an examinee's answering performance.
However, Leeson also pointed out that the forms of test item design in these studies are all affected by the limitation of technologies. Therefore, it is necessary to further explore this factor and other factors that may affect the answering performance in CBT, so as to further understand the potential impacts of various CBT designs on the answering performance. The ways to address these effects should also be further investigated.
In addition, the PPT originally used at school were converted into CBT via the internet during the school closures caused by the COVID-19 pandemic [29]. From the perspective of Sustainable Development Goal 4, whether different test formats affect students' performance in answering questions is an issue worthy of attention [29]. This is because fairness problems arise if students' performance under different test formats is not equivalent [8]. To address this research gap, in this study, a counterbalanced experimental design was used to explore how the answering performance of examinees differs when they answer multiple-choice items in CBT and PPT. This study also explored how the computerised presentation type (single-item presentation vs. multiple-item presentation), difficulty of item group, and administration order of item groups of different difficulty levels influence examinees' answering performance in PPT and CBT. Thus, the study addressed the following research question:

• What are the effects of test format (CBT and PPT), computerised presentation type, difficulty of item group, and administration order of item groups of different difficulty levels on students' answering performance in CBT and PPT?

Participants
The participants were fifth-grade students from 16 classes of two primary schools in northern Taiwan, including 199 boys (average age: 11.1 years) and 182 girls (average age: 10.9 years) for a total of 381 students (average age: 11.0 years). In order to avoid the impact of participants' unfamiliarity with the computer, Internet, and CBT environment on the research results [30], all students had taken basic courses related to computers and the Internet. Before participating in this study, they had practised using the CBT environment of this study. After completing this study, each participant received stationery as a reward for joining the experiment.

Achievement Test
Based on the Structure and Functions of Plants topic from the primary school fifth-grade Natural Science course, 20 multiple-choice items were designed. All 20 items were tested by 136 fifth graders who had learned about the topic, and the difficulty and discrimination indexes of each item were evaluated. The higher the difficulty index, the less difficult the item was; the lower the difficulty index, the more difficult the item was. After the difficulty index of each item was evaluated, the eight simplest items were chosen to form the simple-item group (difficulty index mean = 0.772, SD = 0.099), and the eight most difficult items were chosen to form the difficult-item group (difficulty index mean = 0.348, SD = 0.076). There was a significant difference between the difficulty index means of the difficult-item group and the simple-item group (t(14) = 9.621, SE = 0.441, p < 0.001). The achievement test adopted in this study consisted of the simple-item group and the difficult-item group. Each item group was administered in both the CBT and PPT formats. The simple-item and difficult-item groups were used to understand the effects of their administration order and item difficulty on students' answering performance in CBT and PPT.
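The item selection described above can be illustrated with a short sketch. It assumes that the difficulty index is the proportion of examinees answering an item correctly (which matches the statement that a higher index means an easier item) and that the reported comparison is a pooled-variance independent-samples t-test with eight items per group. The function names and the small pilot matrix are hypothetical and are not taken from the study's materials.

```python
from math import sqrt


def difficulty_index(responses):
    """Proportion of examinees answering each item correctly.

    responses: one row per examinee, each row a list of 0/1 item scores.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_examinees
            for j in range(n_items)]


def split_by_difficulty(index, k):
    """Return positions of the k easiest and the k hardest items."""
    order = sorted(range(len(index)), key=lambda j: index[j], reverse=True)
    return order[:k], order[-k:]


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance t statistic computed from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df, se


# Hypothetical pilot data: 4 examinees x 4 items (1 = correct, 0 = wrong).
pilot = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
]
index = difficulty_index(pilot)
simple, difficult = split_by_difficulty(index, k=2)

# Summary statistics reported for the two item groups (8 items each).
t_value, df, se = pooled_t(0.772, 0.099, 8, 0.348, 0.076, 8)
print(index, simple, difficult, round(t_value, 2), df)
```

Under these assumptions the recomputed statistic has 14 degrees of freedom and is close to the reported t value; small discrepancies are expected because the published means and SDs are rounded.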

Computer-Based Test Environment
The CBT in this study was presented in the Web browser. The computerised presentation types of the CBT were divided into single-item presentation (Figure 1a) and multiple-item presentation (Figure 1b), where Zone A is the item number area, showing the items already answered (green background), the item now being answered (blue background), and the items yet to be answered (white background). Zone B is the item and option presentation area, where the examinee can click on the box of an option to choose it as the correct answer; Zone C is pressed to present the next item after the answer is confirmed; and Zone D is used to submit all answers.
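As a rough illustration of the Zone A behaviour described above, the following sketch maps each item number to a background colour. The function name and the rule that the current item is shown in blue even if it already has an answer are assumptions; the study does not describe how its interface was implemented.

```python
def item_status_colours(n_items, answered, current):
    """Return the Zone A background colour for items 1..n_items.

    answered: set of item numbers that already have an answer.
    current:  item number now being answered.
    Assumption: the current item is blue even if it was answered before.
    """
    colours = []
    for item in range(1, n_items + 1):
        if item == current:
            colours.append("blue")    # item now being answered
        elif item in answered:
            colours.append("green")   # item already answered
        else:
            colours.append("white")   # item yet to be answered
    return colours


# Example: items 1 and 2 answered, examinee is working on item 3 of 4.
print(item_status_colours(4, {1, 2}, 3))
```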



Research Design
To address the research questions, this study investigated the effects of test format, computerised presentation type, difficulty of item group, and administration order of item groups of different difficulty levels on students' answering performance in the achievement test. Regarding test format, the achievement test was presented in the form of PPT and CBT. In the PPT, all items of the achievement test were presented on one test paper. The CBT had two computerised presentation types, namely, single-item presentation (CS) and multiple-item presentation (CM). The CS version presented one item at a time, and the examinee would press the 'NEXT' button after answering to present the next item. The CM version presented all the test items on the computer screen at a time. In terms of the difficulty of item group, the items of the achievement test were divided into the simple-item and difficult-item groups according to the difficulty index, with eight items in each group. Regarding the administration order of item groups of different difficulty levels, the test was divided into two types: the simple-item group followed by the difficult-item group, or the difficult-item group followed by the simple-item group. Finally, a counterbalanced experimental design was adopted to form eight treatments (see Appendix A). In the within-subject design, the factors of test format (PPT and CBT) and difficulty of item group (simple-item and difficult-item) were adopted. In the between-subject design, the computerised presentation types (CS and CM) and administration order of item groups of different difficulty levels (the administration order of simple-item group and difficult-item group) were adopted. Taking one class as a unit, participants from 16 classes were randomly assigned to a treatment which comprised two classes of students.
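The class-level assignment described above can be sketched as follows. The eight treatments are listed in Appendix A and are not reproduced here, so the factor crossing below is an assumption: presentation type (CS or CM) by difficulty order (simple first or difficult first) by which format is used for the first part (PPT first or CBT first). The class identifiers, function names, and random seed are likewise hypothetical.

```python
import random
from itertools import product

PRESENTATION = ("CS", "CM")
DIFFICULTY_ORDER = ("simple-first", "difficult-first")
FORMAT_ORDER = ("PPT-first", "CBT-first")  # assumed counterbalancing factor


def build_treatments():
    """Cross the three two-level factors to obtain eight treatments."""
    return list(product(PRESENTATION, DIFFICULTY_ORDER, FORMAT_ORDER))


def assign_classes(class_ids, treatments, classes_per_treatment=2, seed=0):
    """Randomly assign whole classes to treatments, a fixed number per treatment."""
    if len(class_ids) != len(treatments) * classes_per_treatment:
        raise ValueError("number of classes does not match the design")
    rng = random.Random(seed)
    shuffled = list(class_ids)
    rng.shuffle(shuffled)
    k = classes_per_treatment
    return {t: shuffled[i * k:(i + 1) * k] for i, t in enumerate(treatments)}


treatments = build_treatments()
assignment = assign_classes([f"class-{i:02d}" for i in range(1, 17)], treatments)
for treatment, classes in assignment.items():
    print(treatment, classes)
```

Randomising at the class level rather than the student level mirrors the statement that one class was taken as a unit.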

Data Collection and Analysis
The data collected in this study were all quantitative (i.e., examinees' correct answering rates on the achievement test in the eight treatment types). To address the research questions, this study first used independent-samples t-tests to compare the average correct answering rates under the different test formats. In addition, this study adopted two-way ANOVA, taking test format, administration order of item groups of different difficulty levels, difficulty of item group, and computerised presentation type as independent variables and the correct answering rate on the achievement test as the dependent variable.

Test Item Analysis
This study first compared the average correct answering rate of items on the achievement test under different test formats and item groups of different difficulty levels, as shown in Table 2. The results of the independent-samples t-tests reveal that the average correct answering rates did not differ significantly between the PPT and CS (t(30) = −0.220, SE = 0.089, p = 0.827, ES = 0.076) or between the PPT and CM (t(30) = −0.139, SE = 0.091, p = 0.891, ES = 0.051). This result suggests that without considering the factors of difficulty of item group and administration order of item groups of different difficulty levels, the correct answering rates on the achievement test were not significantly different based on whether the test items were presented in a PPT and CS or in a PPT and CM.

Analysis of Answering Performance of Simple-Item Group and Difficult-Item Group by Test Format and Administration Order of Item Groups of Different Difficulty Levels
This study tested the effects of test format and administration order of item groups of different difficulty levels on the answering performance of the achievement test. The following is an analysis of whether the effect of the independent variables on the dependent variables was different when the test items were presented in the form of a PPT and CS from when the test items were presented in the form of a PPT and CM. The effects of different test formats and different administration orders of item groups of different difficulty levels on examinees' answering performance in the simple-item group and the difficult-item group were analysed as follows.

Simple-Item Group
First, the simple-item group was examined using two-way ANOVA to determine the correct answering rate of all examinees under the different test formats and different administration orders of item groups of different difficulty levels (see Tables 3 and 4). For the PPT and CS as well as the PPT and CM, in the simple-item group, the main effect

Difficult-Item Group
Next, we used two-way ANOVA to analyse the difficult-item group, examining the correct answering rate of all examinees under different test formats and different administration orders of item groups of different difficulty levels (see Tables 4 and 5). We observed that in the PPT and CS, the main effect of test format was significant (F(1, 191) = 4.917, MSe = 0.032, p < 0.05, η² = 0.025), suggesting that test format had a significant influence on the correct answering rate. The correct answering rate in the CS was significantly higher than that in the PPT, but the effect of administration order was not significant (F(1, 191) = 1.924, MSe = 0.032, p = 0.167, η² = 0.010), suggesting that administration order had no significant influence on the correct answering rate. The two variables exhibited a significant interaction effect (F(1, 191) = 5.705, MSe = 0.032, p < 0.05, η² = 0.029). We conducted post hoc analyses and found that in the difficult-item group administered in PPT, the effect of administration order was significant; the correct answering rate when the difficult-item group was presented in the first part was significantly higher than when the difficult-item group was presented in the second part (t(95) = 2.932, SE = 0.033, p = 0.004, ES = 0.288). In the difficult-item group administered in CS, the effect of administration order was not significant; the correct answering rate when the difficult-item group was presented in the first part was not significantly different from when the difficult-item group was presented in the second part (t(96) = −0.665, SE = 0.039, p = 0.514, ES = 0.067). When the difficult-item group was presented in the first part, the effect of the test format was not significant; no significant difference was identified between the PPT and CS (t(95) = 0.116, SE = 0.037, p = 0.908, ES = 0.012). However, when the difficult-item group was presented in the second part, the effect of the test format was significant; the correct answering rate of the CS was significantly higher than that of the PPT (t(96) = −3.393, SE = 0.035, p = 0.001, ES = 0.327). For the PPT and CM, the main effect of the test format was significant (F(1, 187) = 4.633, MSe = 0.034, p < 0.05, η² = 0.024), suggesting that test format had a significant influence on the correct answering rate; the correct answering rate in the CM was significantly higher than in the PPT, but the effect of the administration order was not significant (F(1, 187) = 0.036, MSe = 0.034, p = 0.850, η² = 0.000); the two variables did not exhibit a significant interaction effect (F(1, 187) = 0.004, MSe = 0.034, p = 0.950, η² = 0.000).
Based on the findings above, the different test formats (PPT, CS, and CM) and different administration orders of the simple-item group (first part and second part) had no significant influence on the correct answering rate in the achievement test. In the difficult-item group, however, the test format and administration order had significant influences on the correct answering rate of the achievement test: (1) PPT and CS: in the PPT, the correct answering rate was significantly higher when the difficult-item group was presented in the first part than when it was presented in the second part; when the difficult-item group was presented in the second part, the correct answering rate in the CS was significantly higher than that in the PPT; (2) PPT and CM: the correct answering rate of the difficult-item group in the CM was significantly higher than that in the PPT, and administration order had no significant influence on the correct answering rate. In other words, for the difficult-item group, significant influences of the different test formats (PPT, CS, CM) on the correct answering rate in the achievement test could be observed.
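The effect sizes reported with the F tests in this section can be cross-checked with a small sketch. It assumes that the reported η² values are partial eta squared, which can be recovered from an F statistic and its degrees of freedom as F·df_effect / (F·df_effect + df_error); the source does not state which variant was used, and the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F statistics reported for the difficult-item group (PPT and CS).
reported = {
    "test format": (4.917, 1, 191),
    "administration order": (1.924, 1, 191),
    "interaction": (5.705, 1, 191),
}
for effect, (f_value, df1, df2) in reported.items():
    print(effect, round(partial_eta_squared(f_value, df1, df2), 3))
```

Under this assumption the recomputed values agree with the reported ones to three decimal places, which suggests, but does not confirm, that partial eta squared is the reported effect size.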

Analysis of Answering Performance of the Achievement Test by Difficulty of Item Group and Computerised Presentation Type
Another research purpose of this study was to examine the influences of the computerised presentation type (CS and CM) and the difficulty of item group (simple-item group, difficult-item group) on answering performance in the achievement test. This study analysed whether the effects of the independent variables on the dependent variables were different in different computerised presentation types.
Two-way ANOVA was used to examine the correct answering rate of all examinees in different computerised presentation types and item groups of different difficulty levels (see Tables 6 and 7). The results suggest that the main effect of computerised presentation type was not significant (F(1, 187) = 0.931, MSe = 0.038, p = 0.336, η² = 0.005). The computerised presentation type did not have a significant influence on the correct answering rate in the achievement test. However, the main effect of the difficulty of item group reached a significant level (F(1, 187) = 263.023, MSe = 0.050, p < 0.01, η² = 0.584), suggesting that the difficulty of item group had a significant influence on the correct answering rate on the achievement test, with the correct answering rate of the simple-item group significantly higher than that of the difficult-item group; no significant interaction effect between the two variables was observed (F(1, 187) = 0.327, MSe = 0.050, p = 0.568, η² = 0.002). On the basis of these findings, the computerised presentation type (CS and CM) demonstrated no significant influence on the correct answering rate in the achievement test; only the difficulty of item group exhibited a significant influence on the correct answering rate. The correct answering rate of the simple-item group was greater than that of the difficult-item group.

Concluding Remarks
The purpose of this study was to explore how the formats of multiple-choice tests influence the examinees' answering performance when they are administered as CBT and PPT. It is anticipated that several suggestions can be given for schools that are forced to change the test format for the evaluation of students' learning effectiveness, from PPT to CBT, during the COVID-19 pandemic. The conversion of test format might affect students' performance in answering questions, thereby affecting the accuracy and fairness of evaluation on students' learning effectiveness [8]. This study analysed four factors: test format, computerised presentation type, difficulty of item group, and administration order of item groups of different difficulty levels. The four factors were investigated using a counterbalanced experimental design. Regarding test format, the findings showed that comparing PPT, CBT with single-item presentation, and CBT with multiple-item presentation revealed no significant difference in answering performance of the examinees; this is consistent with previous research results [12,13,16,21–23,25]. However, this study further examined the factors of difficulty of item group and administration order of item groups of different difficulty levels and revealed that the examinees' performance in the CBT and PPT was significantly different only for the difficult-item group. The CBT with multiple-item presentation (CM) was the most favourable in terms of examinee performance. This result can be explained by Leeson's viewpoint. Leeson points out that when a CBT adopts the multiple-item presentation method, examinees have a chance to preview test items, which improves their answering performance. This is called the "facilitating effect" [12]. In the CM, regardless of whether the difficult-item group was arranged as the first or second part of the test, the performance of the examinees was better than that in the PPT.
In the CBT with single-item presentation (CS), only when the difficult-item group was arranged as the second part of the test did the examinees perform better than in the PPT. For the difficult-item group, whether the CS or CM was used, the examinees' performance was better than that in the PPT, and the answering performance of examinees was not significantly different in the CS or CM. Based on the findings, the present study suggests that test administrators may consider adopting the form of CM on multiple-choice tests so that examinees can achieve better answering performance.
Additionally, the findings indicate that the difficulty of test items may influence the equivalence of examinees' answering performance in the CBT and PPT. This may explain the inconsistent findings on the equivalence of examinees' answering performance in the CBT and PPT in previous studies. The difficulty of test items can influence examinees' answering performance in CBT and PPT, as explained by cognitive load theory [31]. The efficiency of learners' cognitive operation can be influenced by cognitive load, and the sources of cognitive load include intrinsic cognitive load, extraneous cognitive load, and germane cognitive load [31,32]. In this study, cognitive load theory suggests that item difficulty influences intrinsic cognitive load, whereas the test format, computerised presentation type, and administration order of item groups of different difficulty levels influence the extraneous cognitive load and germane cognitive load of examinees during the testing process. In other words, if the same test implemented in the traditional PPT format is changed to the CBT format, it may affect extraneous cognitive load and germane cognitive load. Therefore, with a limited response time, to enable the examinees to perform at their highest ability and achieve a successful answering performance, we must pay attention to how cognitive load influences examinees' responses on the test. If the examinees' answering performance is influenced by cognitive load, then the test itself cannot effectively measure the real abilities of the examinees and achieve the test goals.
According to this concept, because the traditional exam is generally administered in PPT, if CBT is adopted, the extraneous cognitive load of the examinees is increased because the test requires additional operation of the mouse and keyboard as well as searching for the items, options, and answer area on the computer screen. If the examinees are not familiar with computer operation, the extraneous cognitive load is more severe. If the load is within the range that the examinees can bear, it will not influence their answering performance. However, if the test items themselves are difficult, meaning the intrinsic cognitive load of examinees is high, then the extraneous cognitive load, which increases due to the change to CBT, will cause overall cognitive overload that influences the examinees' answering performance. This concept may explain the findings of this study; that is, the item difficulty will affect the equivalence of students' performance in the CBT and PPT, especially for the most difficult items. Additionally, this study finds that during the exam, whether the difficult-item group is presented by the CM or by the CS (and the difficult items are arranged in the second part of the test), the examinees' performance is better than that in the PPT. This can be explained by the suggestions of Miller et al. [26] on test preparation, which indicate that the distribution of test items should be based on the principle of "from simple to difficult" to improve the performance of anxious examinees. In other words, it is beneficial to answering performance when examinees answer the simple-item group first and then the difficult-item group. In addition, according to Mayer's [33] suggestions on multimedia learning environment design, when managing the cognitive load in a multimedia environment, the "segmenting principle" is helpful to learners. It helps provide learners with germane cognitive load and effectively manage their cognitive load [33,34].
In the present study, in both the CS and CM, only one or a few items were presented on the computer screen at a time, allowing the examinees to answer items in CBT more attentively than in PPT, as they were not being affected by other items in the test. This aligns with the "segmenting principle", helping examinees focus on answering the difficult-item group with its higher intrinsic cognitive load. In other words, for the difficult-item group, if the design of the answering environment can be oriented to reduce the extraneous cognitive load and increase the germane cognitive load, the answering performance may improve.
This study has some limitations. During the school closures caused by the COVID-19 pandemic, schools implemented "take-home exams" to limit the spread of the epidemic and ensure the equitable quality education advocated by Sustainable Development Goal 4. However, technical issues, students' academic integrity, and the testing environment may affect evaluation results. That is, if students do not follow the honesty principles of traditional face-to-face paper-and-pencil tests, encounter technology problems, or stay in an uncomfortable testing environment, the results of "take-home exams" have no reference value for the evaluation of students' learning effectiveness [2,3,29,35]. This study mainly aimed to understand the equivalence of scores across different test formats. To prevent the aforementioned factors from affecting the results of the study, the CBT and PPT were not administered as "take-home exams" in this study. As a consequence, the findings might have limited applicability in the context of the COVID-19 pandemic. In addition, because this study mainly concerned fifth graders of primary school and focused on investigating multiple-choice items for a specific topic of the primary school Natural Science course, the research results cannot be directly extended to examinees of other ages, other subjects, or other types of test items. It is suggested that further research should be conducted with other age groups and subjects. This study also suggests follow-up investigation on other types of test items that can be implemented in both CBT and PPT, such as fill-in-the-blank, short-answer, and essay questions, to further explore the difference in examinees' answering performance on the same test in the CBT and PPT. In addition, the CBT is a new test format for fifth graders in primary schools.
Although all the examinees had taken basic courses related to computers and the Internet and had practised operating in a CBT environment before this study, they may still have experienced a novelty effect [7]. Their self-efficacy and perceptions of CBT may have influenced the research results [17,18,30]. Therefore, future researchers should give examinees ample CBT experience before conducting the research, thereby reducing the influence of the novelty effect and of familiarity with computer operation on the results.
This study also revealed that, overall, the examinees' answering performance did not differ significantly between the CBT and PPT. However, when answering more difficult test items, examinees exhibited better answering performance in the CBT than in the PPT. In the CBT with multiple-item presentation, the administration order of difficult and simple items had no significant effect. In the CBT with single-item presentation, however, it is suggested that the difficult items be placed in the later part of the test so that examinees can perform better. In other words, when the same test items are administered in the CBT and PPT, examinees' answering performance in the CBT may be better than that in the PPT. This finding suggests that a CBT can encourage examinees to leverage their abilities to answer multiple-choice items. It also indicates that the real abilities of examinees may be measured more effectively in a CBT. During the COVID-19 pandemic, as distance education becomes an important strategy for maintaining educational equity, remote CBT is likely to become an important means of evaluating learning effectiveness. Based on the findings of this research, if a teacher wants to use the CBT format and the test consists of multiple-choice items with standard answers, the CBT is suitable whether it is administered with multiple-item or single-item presentation. Special attention should be paid to item difficulty and the administration order of items in the test. If the items differ significantly in difficulty, it is suggested that the CBT be administered with multiple-item presentation, that is, presenting all items at once so that students are free to choose which to answer first.
If the CBT uses single-item presentation, students have to answer items in a sequence arranged by the computer, without autonomy in choosing the order of answering. In this case, the difficult items should be placed in the later part of the test. In addition, this study suggests that follow-up studies refer to Mayer's [33] suggestions and make full use of computer functions to design various methods of presenting test items and various answering strategies that reduce examinees' cognitive load or help them manage it during the answering process. Through this approach, a CBT can become a more effective and appropriate test format in the era of e-Learning.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Appendix A
2  Simple-item group in PPT; Difficult-item group in CM
3  Difficult-item group in PPT; Simple-item group in CS
4  Difficult-item group in PPT; Simple-item group in CM
5  Simple-item group in CS; Difficult-item group in PPT
6  Simple-item group in CM; Difficult-item group in PPT
7  Difficult-item group in CS; Simple-item group in PPT
8  Difficult-item group in CM; Simple-item group in PPT
Note. PPT: paper-and-pencil test; CS: computerised single-item presentation; CM: computerised multiple-item presentation.
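The assignment scheme in Appendix A can be read as a full crossing of three two-level factors: whether the PPT part comes first or second, which difficulty group is administered first, and which computerised presentation type (CS or CM) is paired with the PPT. The following Python sketch enumerates that crossing. It is an illustration only: it assumes that the two entries in each row are the first and second item groups administered, and the function name `counterbalanced_conditions` is hypothetical. Under these assumptions the crossing yields eight combinations, and the seven entries legible in Appendix A correspond to the last seven of them.

```python
from itertools import product

CBT_TYPES = ("CS", "CM")                 # single-item / multiple-item presentation
DIFFICULTIES = ("Simple", "Difficult")   # difficulty of the item group


def counterbalanced_conditions():
    """Enumerate (first, second) item-group pairs for the crossed design.

    Assumption: each pair lists the item group administered first and the
    one administered second; one of the two is always taken on paper (PPT).
    """
    conditions = []
    for ppt_first, first_difficulty, cbt in product(
        (True, False), DIFFICULTIES, CBT_TYPES
    ):
        # The second item group always has the other difficulty level.
        second_difficulty = (
            "Difficult" if first_difficulty == "Simple" else "Simple"
        )
        first_format, second_format = ("PPT", cbt) if ppt_first else (cbt, "PPT")
        conditions.append(
            (
                f"{first_difficulty}-item group in {first_format}",
                f"{second_difficulty}-item group in {second_format}",
            )
        )
    return conditions


if __name__ == "__main__":
    for number, (first, second) in enumerate(counterbalanced_conditions(), 1):
        print(number, first, "|", second)
```

Crossing the factors with `itertools.product` rather than listing the rows by hand makes it easy to check that every format, difficulty, and order combination occurs exactly once.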