Proceeding Paper

Multimedia-Based Assessment of Scientific Inquiry Skills: Evaluating High School Students’ Scientific Inquiry Abilities Using Cloud Classroom Software †

Shih-Chao Yeh, Chun-Yen Chang and Van T. Hoang Ngo
1 Graduate Institute of Science Education, National Taiwan Normal University, Taipei 116, Taiwan
2 Institute for Research Excellence in Learning Sciences and Graduate Institute of Science Education, National Taiwan Normal University, Taipei 116, Taiwan
3 Department of Earth Sciences, National Taiwan Normal University, Taipei 116, Taiwan
4 Department of Biology, Universitas Negeri Malang, Malang 65145, Indonesia
5 Graduate School of Education, Chung Yuan Christian University, Taoyuan 320314, Taiwan
* Author to whom correspondence should be addressed.
† Presented at the 8th Eurasian Conference on Educational Innovation 2025, Bali, Indonesia, 7–9 February 2025.
Eng. Proc. 2025, 103(1), 16; https://doi.org/10.3390/engproc2025103016
Published: 13 August 2025

Abstract

We developed and validated an animation-based assessment (ABA) method for evaluating high school students’ inquiry competencies under Taiwan’s 12-Year Curriculum. Contextualized in atmospheric chemistry involving methane and hydroxyl radicals, the ABA integrated dynamic simulations, tiered multiple-choice and open-ended tasks, and process tracking on the CloudClassRoom platform. The assessment focused on measuring two inquiry skills: causal reasoning and critical thinking. Results from 26,823 students revealed that the ABA effectively differentiated student performance across ability levels and academic disciplines, with open-ended items sensitive to higher-order reasoning. No gender difference was observed, indicating a gender-fair design. While the ABA supports diagnostic insights, limitations remain, including the underassessment of modeling and creative experimentation skills; future versions should therefore include open modeling tasks and AI-powered semantic scoring. The developed ABA contributes a scalable, competency-aligned framework for inquiry-based science assessments.

1. Introduction

Scientific inquiry skills are foundational to modern science education, emphasizing evidence-based reasoning, problem-solving, and critical thinking to decipher natural phenomena [1]. Globally, frameworks such as the Organisation for Economic Co-operation and Development (OECD) Programme for International Student Assessment (PISA) and the U.S. Next Generation Science Standards (NGSS) define inquiry as a core competency. PISA 2025 expands “scientific literacy” to require students to integrate data analysis, variable control, and causal reasoning to address challenges like climate change [2]. NGSS delineate eight science and engineering practices (SEPs), including modeling and interdisciplinary argumentation, reflecting a shift from rote learning to competency-based education [3]. These trends align with Taiwan’s emphasis on cognitive intelligence (e.g., modeling) and problem-solving dimensions in its 12-Year Curriculum, though gaps remain in assessing ethical reasoning and collaborative inquiry [4].
Traditional assessments (e.g., multiple-choice tests) prioritize factual recall over inquiry processes. For example, a greenhouse effect question may assess recognition of “methane’s heat-trapping efficiency” but fail to evaluate experimental design or data interpretation [5]. While digital tools offer online interfaces, they often lack interactive simulations and real-time feedback, rendering them inadequate for capturing dynamic scientific practices. Such tools poorly discriminate higher-order reasoning (e.g., mechanistic explanations) and exacerbate urban–rural inequities due to resource disparities [6]. This limitation echoes critiques of the NGSS’s underemphasis on metacognitive strategies [3].
Based on Cognitive Load Theory and Multimedia Learning Theory [7,8], interactive animations reduce the cognitive load through dual-channel processing (visual–textual integration) and enable the manipulation of variables (e.g., adjusting ultraviolet intensity) to observe real-time trends [9]. For instance, simulating molecular interactions between methane and hydroxyl radicals helps students link macroscopic climate effects to microscopic mechanisms. Such an approach enhances motivation and differentiates competencies like variable control [10]. Recent advances in learning analytics further allow diagnostic feedback on subskills (e.g., hypothesis formulation) through models such as the deterministic input, noisy, and gate (DINA) [11].
Despite their promise, existing ABAs often lack diagnostic structures for inquiry skills. Considering Taiwan’s need to address interdisciplinary and ethical gaps in inquiry frameworks [12], we developed an ABA contextualized in methane’s greenhouse effect to assess multidimensional skills such as modeling and causal reasoning through tiered tasks (e.g., data matching and variable identification), to distinguish basic cognition (experimental design) from advanced reasoning (mechanistic explanations), and to mitigate disparities between urban–rural and disciplinary groups (science vs. humanities students) [5].

2. Background Knowledge

2.1. Theoretical Framework of Inquiry Competence

2.1.1. Scientific Inquiry Competencies Across Research Institutions

Scientific inquiry frameworks are shaped by cultural and educational priorities. As shown in Table 1, the Taiwan Curriculum, OECD PISA, and NGSS frameworks can be compared in terms of their core dimensions, strengths, and limitations. Taiwan’s 12-Year Basic Education Curriculum (2018) organizes inquiry into two dimensions: cognitive intelligence (reasoning, critical thinking, creativity, and modeling) and problem-solving (hypothesis testing and experimental design). While this model promotes systematic skill integration, it lacks explicit guidance on sociocultural ethics and collaborative practices [4]. In contrast, the OECD PISA framework emphasizes three competencies, namely explaining phenomena scientifically, evaluating and designing inquiry, and interpreting data, with a strong focus on socioscientific reasoning (e.g., debating renewable energy policies) [13]. Similarly, the U.S. Next Generation Science Standards (NGSS) prioritize crosscutting concepts (e.g., patterns and cause-and-effect relationships) and “developing models,” though critics note insufficient attention to metacognition [3].

2.1.2. Assessment Approaches and Disciplinary Variations in Scientific Inquiry Competencies

Modern assessments must capture both conceptual mastery and procedural fluency. Traditional methods (e.g., multiple-choice tests) emphasize factual recall over higher-order skills such as error analysis [14]. Interactive simulations (e.g., PhET’s virtual labs) have revolutionized assessment methods by enabling real-time evaluation of competencies such as variable manipulation and data interpretation [15]. These tools align with evidence-centered design (ECD), which links task performance to latent skills (e.g., “identifying confounding variables”) [16].
Disciplinary differences between arts and science students necessitate tailored approaches. Drawing on a study of Taiwanese high school students’ science learning self-efficacy and learning strategies, Lin and Tsai (2013) found that science-track students reported stronger confidence in conceptual understanding and practical skills, while arts-track students showed greater efficacy in science communication and real-world applications. These findings highlight the need for differentiated assessments: modeling-based tasks for science students and socioscientific reasoning tasks for arts students [5], such as science-focused tasks with simulated experiments (e.g., optimizing chemical reactions) and arts-oriented tasks with ethical debates (e.g., clustered regularly interspaced short palindromic repeats (CRISPR) gene-editing policies).
Advanced analytics enable personalized feedback through cognitive diagnostic models. The DINA model identifies subskill mastery (e.g., “hypothesis formulation”) by analyzing response patterns [11]. Similarly, multimodal analytics track creativity in problem-solving. For example, arts students employ analogical reasoning to generate unconventional hypotheses in physics [17].
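As a rough illustration of how such a cognitive diagnostic model scores an item, the sketch below implements the DINA item-response rule [11]; the skill profiles, Q-vector, and slip/guess values are hypothetical examples, not parameters from any cited study.

```python
# Minimal sketch of the DINA item-response rule; all values below are
# hypothetical illustrations, not estimates from the cited literature.

def dina_p_correct(alpha, q, slip, guess):
    """Probability of a correct response to one item under DINA.

    alpha: 0/1 skill-mastery indicators for one student
    q:     0/1 flags for the skills the item requires
    slip:  probability of answering incorrectly despite mastery
    guess: probability of answering correctly without mastery
    """
    # eta = 1 only if every required skill is mastered
    eta = all(a == 1 for a, req in zip(alpha, q) if req == 1)
    return (1 - slip) if eta else guess

# Item requiring "hypothesis formulation" (skill 0) and "variable control" (skill 1)
print(dina_p_correct(alpha=[1, 1], q=[1, 1], slip=0.1, guess=0.2))  # 0.9
print(dina_p_correct(alpha=[1, 0], q=[1, 1], slip=0.1, guess=0.2))  # 0.2
```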

2.2. Role of Interactive Animations in Science Education

With the widespread adoption of digital technology and multimedia tools, interactive animation has become a promising tool for teaching and assessments in science education. Through visualization, dynamic presentation, and interactive design, students intuitively observe complex processes and actively manipulate variables in simulated environments, thereby enhancing learning outcomes and problem-solving skills. Therefore, it is necessary to consider the role and value of interactive animations in inquiry-based learning from theoretical and empirical perspectives.

2.2.1. Cognitive Load Theory (CLT) and Multimedia Learning Theory (MLT)

CLT posits that learning outcomes are constrained by the limited capacity of working memory. Poorly designed learning materials result in excessive cognitive load that disrupts learning processes and decreases effectiveness [7]. MLT emphasizes that instructional materials designed according to principles such as dual-channel processing (text and visuals), segmentation, and coherence reduce extraneous cognitive load and improve learning efficiency [8].
When well-aligned with these theories, interactive animations can support inquiry learning by using progressive presentation, timely prompting, and learner-controlled features. These approaches help students effectively integrate information, form robust mental models, and enhance higher-order thinking skills.

2.2.2. Supportive Effects of Animation and Interaction Design on Inquiry Learning

Animations possess temporal and continuous characteristics, enabling students to comprehend dynamic processes that are difficult to convey with static images. For example, in teaching chemical reactions or physical motions, animations intuitively depict particle movement or force interactions, making it easier for students to form accurate concepts [18]. Furthermore, interaction design transforms students from passive viewers into active participants. By dragging, adjusting parameters, or clicking to observe changes, learners actively construct knowledge.
Animations and interactive simulations significantly enhance students’ reasoning skills, critical thinking, and model-building abilities [19]. Through repeated trials and self-directed exploration, students revise their thinking based on errors and progressively build conceptual models. In addition, interactivity increases learning motivation and engagement, encouraging students to immerse themselves more deeply in the inquiry process [10].
Well-designed animations and interactive features not only support deep learning through inquiry but also serve as effective assessment tools for collecting process data to evaluate students’ inquiry competencies.

2.3. Current Development of Online Inquiry Assessments

As educational assessments become digitized, online inquiry assessments have emerged as an important tool in education. Compared with traditional paper-based tests, online assessments offer high interactivity, a process for data collection, and the potential for automated scoring, making them well-suited to assess the dynamic processes and cognitive strategies involved in scientific inquiry.

2.3.1. Development of International Online Inquiry Platforms

Several global platforms provide inquiry-based online assessment tools. For instance, Tufts University’s Inquiry Intelligent Tutoring System (Inq-ITS) assigns simulated inquiry tasks to guide students through hypothesis generation, experimental design, and data interpretation. It also employs natural language processing to automatically score student responses based on the claim, evidence, and reasoning (CER) framework [20]. Additionally, PISA’s computer-based assessments integrate virtual inquiry scenarios that emphasize data analysis, model building, and collaborative problem-solving. These developments demonstrate the growing theoretical and technical feasibility of digital inquiry assessments.

2.3.2. Current Status and Challenges in Taiwan

Although Taiwan’s 2019 Curriculum Guidelines (108 Curriculum) explicitly emphasize the development of students’ inquiry competencies, most schools still rely on traditional paper-based tests and lack assessment tools that fully capture the inquiry process. While interactive assessment systems such as virtual labs and guided animations have been developed, the automated diagnosis of inquiry abilities and analysis of students’ process behaviors remain in an early stage of development [21]. Moreover, teachers face challenges such as limited technological familiarity, time constraints, and inconsistent evaluation criteria, which hinder the widespread adoption and practical application of online inquiry assessments in classrooms.

3. Materials and Methods

3.1. Implementation of the Inquiry-Based Assessment

To authentically capture students’ inquiry competencies, we designed and implemented an ABA method focusing on the atmospheric interactions between methane and hydroxyl radicals. Based on inquiry-based learning (IBL) and MLT, the assessment contextualizes real-world environmental problems and integrates animations with data-driven tasks to guide students through the iterative cycle of observation, hypothesis generation, experimentation, and explanation.
We employed a three-stage design in this study: (1) task design and expert validation, (2) platform development and item implementation, and (3) assessment administration and data analysis (Figure 1). To ensure the scientific accuracy and pedagogical appropriateness of the animation-based inquiry assessment, four experts who were university-level scholars or experienced high school teachers were invited to co-develop and review the tasks. Professor Chu-Ting Chen, Department of Chemistry, National Taiwan University, reviewed reaction mechanisms, free radical behavior, and chemical data presentation. Professor Pay-Liam Lin at the Department of Atmospheric Sciences, National Central University, reviewed the realism and accuracy of climate-related experimental scenarios. Mr. Hsing-Chung Ho, a physics teacher at Tainan First Senior High School, advised on students’ cognitive difficulties and animation authenticity. Mr. Yi-Wen Hung, a teacher at the National Taiwan Normal University Affiliated Senior High School, assisted with item scripting, language refinement, and contextual alignment with high school teaching. The experts provided iterative feedback throughout the design process to ensure that the final assessment was scientifically valid, instructionally relevant, and adaptable across disciplines and grade levels.

Inquiry Competency Framework and Items

To align with curriculum objectives, we adopted a two-dimensional inquiry framework covering cognitive intelligence and problem-solving and designed six tasks to assess reasoning, critical thinking, creativity, and modeling. The assessment included animated simulations, multiple-choice questions, and open-ended responses. Items were tiered in complexity to scaffold students’ inquiry process and are presented in Table 2.

3.2. Participants and Data Collection

To ensure representativeness and external validity, we recruited 26,823 11th-grade students from 29 secondary schools in Sichuan Province, China, including participants from both science and humanities streams across urban and rural regions. They completed the assessment independently on the CloudClassRoom (CCR) platform using either school computers or mobile devices. Collaboration or external assistance was not permitted during the task. The ABA method comprised two animation segments and six assessment items (Table 2), with an average completion time of 35 min. The scenario centered on atmospheric reactions between methane and hydroxyl radicals. Students were required to observe animated experiments, interpret graphical data, and apply logical reasoning.
We assessed two inquiry competencies outlined in Taiwan’s curriculum: causal reasoning and critical thinking. The former evaluates students’ ability to derive scientific explanations from data and recognize causal mechanisms, while the latter assesses their capacity to critique experimental design and reason about the necessity of control conditions. AE1 and AE2 served as cognitive scaffolds to support students in mentally modeling gas interactions and environmental changes. However, model construction was not explicitly assessed as an independent competency in this study.
Data collected included students’ responses, interaction logs, and time-on-task, all automatically captured by the CCR platform. Open-ended responses were manually scored by trained raters using a standardized rubric. Subsequent analyses examined differences across disciplinary backgrounds and regions, as well as the ability of each item to discriminate between levels of reasoning and critical thinking.

3.3. Evaluation Framework and Scoring Method

The evaluation framework of this study was used to analyze students’ performance on the ABA delivered through the CCR platform. It assessed students’ inquiry-based learning outcomes, including data interpretation, scientific reasoning, and the ability to control variables. By aligning with the principles of inquiry-based learning, the evaluation emphasized understanding over rote memorization. To ensure the validity and reliability of the inquiry skill assessment, we implemented a dual-track scoring system aligned with task formats and targeted competencies. Among the six tasks, four were multiple-choice items (T1–T4) and two were open-ended (T5 and T6). Multiple-choice items were scored automatically, while open-ended responses were manually coded by trained raters using standardized rubrics.

Scoring Process

Students’ inquiry-based learning outcomes were scored using a combination of automated and manual methods to ensure reliability and validity. The CCR platform analyzed objective responses, such as multiple-choice answers. Answers for open-ended questions, which required subjective evaluation, were scored by trained educators using a standardized rubric. This mixed-method approach minimized bias and provided a comprehensive assessment of students’ abilities.
1. Scoring for multiple-choice questions
Scoring the data analysis (DA) section in this study was performed using an expected average score-based approach to ensure fairness across different types of multiple-choice questions. For questions with a single correct answer, full points were awarded for the correct response, while all other options received zero points, resulting in an expected average score of 0.2. For questions with multiple correct answers, the score depended on the number of incorrect options selected: one incorrect option reduced the score to 60% of the item points, two incorrect options reduced it to 20%, and all other responses received zero points. If the student failed to include all correct options, the answer was treated as fully incorrect, yielding an expected average score of 0.03. This difference in the expected average score compared with single-answer questions highlighted the stricter evaluation standard for multi-answer questions.
To address this discrepancy and ensure fairness, the expected average score for multiple-answer questions was calibrated to 0.19, closely aligned with the single-answer expected average score of 0.2. This scoring framework ensured that both question types contributed comparably to the overall evaluation, maintaining fairness and consistency. Additionally, the method aligned with the principles of the inquiry-based assessment by rewarding accurate data interpretation and minimizing potential biases in question design.
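A minimal sketch of this dual multiple-choice scoring rule is shown below; the partial-credit percentages follow the description above, while the option sets, item points, and function names are illustrative assumptions, and the calibration of expected averages to 0.19 would be an additional weighting step not shown here.

```python
# Sketch of the multiple-choice scoring rule described above; option sets and
# item points are illustrative, and the expected-average calibration (0.19 vs.
# 0.2) would be applied as a separate weighting step not shown here.

def score_single_answer(selected, correct, item_points=1.0):
    """Full credit for the correct option, zero otherwise."""
    return item_points if selected == correct else 0.0

def score_multi_answer(selected, correct_options, item_points=1.0):
    """Partial credit shrinks with each incorrect option added;
    missing any correct option voids the response."""
    selected, correct_options = set(selected), set(correct_options)
    if not correct_options.issubset(selected):
        return 0.0                      # not all correct options chosen
    wrong = len(selected - correct_options)
    if wrong == 0:
        return item_points              # exactly the correct set
    if wrong == 1:
        return 0.6 * item_points        # one incorrect option added
    if wrong == 2:
        return 0.2 * item_points        # two incorrect options added
    return 0.0

# Usage: correct options are {"A", "C"}
print(score_multi_answer({"A", "C"}, {"A", "C"}))       # 1.0
print(score_multi_answer({"A", "C", "D"}, {"A", "C"}))  # 0.6
print(score_multi_answer({"A"}, {"A", "C"}))            # 0.0
```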
2. Scoring rubric for reasoning and argumentation (open-ended questions)
The evaluation of open-ended scientific reasoning questions in this study was carried out using a structured rubric to assess the depth and accuracy of students’ reasoning, as well as the coherence of their explanations. The scoring rubric consisted of three levels: full credit, partial credit, and no credit. Each level was defined by using specific criteria and supported by examples to ensure consistent application across responses.
  • Full credit: Responses in this category demonstrate the ability to provide logical and comprehensive reasoning supported by coherent explanations. Students accurately interpret scientific phenomena, establish clear connections between variables, and justify their conclusions with evidence. For example, a full-credit response might explain how hydroxyl radicals react with methane under ultraviolet light to produce carbon dioxide, linking the reaction mechanism to observed changes in experimental conditions.
  • Partial credit: Responses in this category provide reasonable but incomplete reasoning. Students may correctly identify some aspects of the scientific phenomena but fail to fully explain the mechanisms or relationships between variables. For example, a partial-credit response might note that hydroxyl radicals react with methane but omit details about the resulting chemical products or their effects on the experimental outcomes.
  • No credit: Responses in this category are irrelevant, incorrect, or absent. Students fail to address the scientific phenomena or provide reasoning that aligns with the question’s context. For example, a no-credit response might present unrelated information or omit an explanation entirely.
The structured rubric ensured fairness and reliability in assessing open-ended responses by providing clear criteria and examples for each score level. This method emphasized the importance of logical reasoning and evidence-based explanations in scientific inquiry, aligning with the principles of inquiry-based learning and fostering critical thinking skills among students [22].
3. Scoring rubric for critical thinking (open-ended questions)
The evaluation of open-ended questions related to the control of variables was conducted by using a structured rubric grounded in principles of experimental design and inquiry-based learning. This rubric was used to assess students’ ability to identify and articulate the relationship between variables, justify their experimental choices, and demonstrate alignment with the research objectives. Responses were categorized into three levels: full credit, partial credit, and no credit, each with specific criteria and examples.
  • Full credit: Responses in this category exhibit a comprehensive understanding of the experimental objectives, providing accurate reasoning and clear justification for the inclusion of specific variables. Students demonstrate the ability to align experimental choices with research goals and support their conclusions with logical explanations. For example, a full-credit response may explain why both experiments are necessary to compare the effects of hydroxyl radicals on methane oxidation under different conditions.
  • Partial credit: Responses in this category demonstrate a basic understanding of the experimental objectives but lack depth or completeness. Students may correctly identify the need for certain variables but fail to fully justify their choices or provide detailed reasoning. For example, a partial-credit response may state that both experiments are necessary but omit specific references to the variables or their interactions.
  • No credit: Responses in this category fail to demonstrate an understanding of the experimental objectives or provide reasoning that aligns with the question’s requirements. Students may offer unrelated, incorrect, or incomplete answers, such as suggesting the omission of one experiment without justification.
This scoring rubric was created based on the “control of variables strategy” (CVS), a well-established framework in science education that emphasizes the importance of isolating and systematically testing variables in experimental designs [23]. The rubric aligned with the principles of CVS by assessing students’ ability to recognize independent, dependent, and controlled variables and their role in achieving valid experimental results. By integrating these principles, the rubric fosters a deeper understanding of experimental rigor and scientific reasoning.

3.4. Scoring Method

The scoring method in this study was designed to evaluate students’ understanding of inquiry-based activities and their ability to apply scientific reasoning and control variables. The assessment was divided into three components: DA, scientific reasoning, and control of variables. DA accounted for the majority of the score with a maximum of eight points, as it primarily assessed students’ comprehension of inquiry activities through logical reasoning. Scientific reasoning and control of variables were each assigned two points, focusing on the accuracy and depth of reasoning in explaining experimental phenomena and variable relationships. Responses for scientific reasoning and control of variables were scored by two independent raters who were trained in applying the coding scheme. To ensure inter-rater reliability, an initial subset of 200 responses was jointly scored. The Cohen’s Kappa coefficient was 0.87, indicating strong agreement. The remaining responses were independently scored, with periodic cross-checks to maintain consistency.
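For reference, the inter-rater agreement statistic can be computed as in the brief sketch below; the rater labels are hypothetical and do not reproduce the study’s data.

```python
# Minimal Cohen's Kappa sketch for two raters; labels are hypothetical.
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n            # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum((c1[l] / n) * (c2[l] / n) for l in set(c1) | set(c2))  # chance agreement
    return (p_o - p_e) / (1 - p_e)

r1 = ["full", "partial", "none", "full", "partial"]
r2 = ["full", "partial", "none", "partial", "partial"]
print(round(cohens_kappa(r1, r2), 2))  # 0.69 for this toy sample
```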
The total score for each student was calculated on a 12-point scale, and all responses were subsequently grouped based on score rankings. Using the pass rate (Equation (1)) and item discrimination index (Equation (2)), comparisons of performance between groups were conducted to evaluate differentiation and the effectiveness of the assessment in identifying variations in students’ inquiry skills.
Pass rate: P = (number of students who passed the item) / (total number of students), (1)
Item discrimination index: D = Ph − Pl, (2)
where Ph and Pl are the pass rates of the top 20% and bottom 20% score groups, respectively.
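Equations (1) and (2) can be applied per item as in the following sketch; the score vectors are hypothetical, and the 20% cutoffs follow the grouping described above.

```python
# Sketch of the pass-rate (Equation (1)) and item-discrimination (Equation (2))
# computations; the scores below are hypothetical.

def pass_rate(passed):
    """P = students who passed the item / total number of students."""
    return sum(passed) / len(passed)

def discrimination_index(total_scores, passed, frac=0.2):
    """D = Ph - Pl, comparing the top and bottom 20% by total score."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i], reverse=True)
    k = max(1, int(len(total_scores) * frac))
    p_h = sum(passed[i] for i in order[:k]) / k   # pass rate of top group
    p_l = sum(passed[i] for i in order[-k:]) / k  # pass rate of bottom group
    return p_h - p_l

totals = [11, 9, 8, 7, 6, 5, 4, 3, 2, 1]   # total scores on the 12-point scale
item_ok = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]   # pass/fail flags for one item
print(pass_rate(item_ok))                     # 0.5
print(discrimination_index(totals, item_ok))  # 1.0 for this toy sample
```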

4. Results

ABA results were analyzed, focusing on students’ inquiry performance across item types and demographic groups. Quantitative analyses were conducted on item difficulty, discrimination indices, and score distributions by gender and academic background.

4.1. Inquiry Performance

The inquiry assessment items effectively distinguished students across different score groups, confirming the validity of the test design. Multiple-choice items demonstrated a strong capability to differentiate low-scoring students, particularly through questions requiring basic knowledge application and recognition of patterns. In contrast, free-response items excelled at distinguishing high-performing students by assessing their ability to provide detailed reasoning and synthesize data. The combined inquiry item scores reflected an effective evaluation of students’ inquiry skills, ensuring a balanced assessment framework.
The pass rates (P) across scoring groups illustrated a progression from low- to high-score groups (Table 3). For instance, the pass rates for the top 20% (Ph) of students consistently exceeded 80%, while the lowest 20% (Pl) achieved pass rates below 35%, especially on items requiring higher-order reasoning. The item discrimination index (D) highlighted the ability of different items to distinguish between student groups. For example, the discrimination index for free-response items ranged from 0.34 to 0.53, reflecting their strong differentiation power, particularly in tasks involving data interpretation and justification of experimental choices. Multiple-choice items also contributed, with D values ranging from 0.11 to 0.35, indicating their utility in assessing foundational knowledge.
There was no significant difference in total scores by gender, confirming the absence of gender bias in the assessment design. However, differences emerged in academic backgrounds. Students majoring in science and engineering outperformed their peers majoring in arts and humanities, particularly on tasks requiring detailed data analysis and experimental reasoning. For instance, the average score for students majoring in science and engineering on free-response items was 15% higher than that of students majoring in arts and humanities. This result indicates that disciplinary training influences students’ inquiry skills, particularly in areas requiring logical reasoning and quantitative analysis.
These findings confirm that the inquiry assessment successfully distinguishes students’ inquiry skills and scientific competencies. The combination of multiple-choice and free-response items ensured a robust evaluation framework, capturing both basic knowledge and advanced reasoning skills. The assessment provides baseline data on students’ performance and highlights areas for targeted instructional improvement.

4.2. Visual Analysis of Item Discrimination

To investigate item-level discrimination between student proficiency groups, we employed visual representations of response distributions and compared the top-performing (top 20%) and bottom-performing (bottom 20%) groups. Figure 2a, b, and c represent the item discrimination of multiple-choice questions, free-response tasks, and overall inquiry performance, respectively.
Figure 2a illustrates that while multiple-choice items (e.g., T1–T4) target foundational knowledge, they effectively distinguished low-performing students. For example, in T1-1, 99.8% of high performers chose the correct answer, while only 37.5% of the low group did. Over 40% of low performers left the item unanswered, indicating challenges in data–visual correspondence. T4, which required a comparison across experiments, showed a pronounced gap between groups (95% vs. 53.3%). This suggests that even among multiple-choice formats, cross-contextual tasks can elicit significant variation in student performance.
Figure 2b highlights the sensitivity of free-response items in identifying advanced reasoning. In T5, students were asked to explain the trend of curve shifts. High performers consistently referenced the methane–hydroxyl radical reaction and inferred the resulting formation of CO2. In contrast, low performers tended to provide superficial or causally disconnected responses. T6 assessed students’ understanding of experimental design. High performers articulated the need for sequential comparison and control logic. Many low performers reduced their response to “only one experiment is needed,” lacking scientific justification.
Figure 2c presents overall test discrimination. High performers achieved pass rates above 80% across all tasks, while low performers consistently scored below 30%, demonstrating the assessment’s capacity to differentiate inquiry skill levels effectively. The combined use of multiple-choice and free-response items, supported by animated simulations, successfully captured performance differences in data interpretation, logical reasoning, and design critique. These results confirm the ABA’s utility and validity in assessing multidimensional inquiry competencies.

5. Discussion

5.1. Educational and Assessment Implications

The results of this study highlight the practical utility and discriminative validity of ABA in identifying students’ inquiry competencies, particularly in the domains of causal reasoning and critical thinking. From pedagogical and assessment perspectives, the following implications were identified:
1. Scaffolding inquiry skills through multistage task design:
Although multiple-choice items are often considered low-level tasks, the results of this study demonstrate that, when designed with structural complexity (e.g., requiring cross-animation integration, variable analysis, or temporal comparisons), they can stimulate higher-order thinking. Teachers should therefore sequence inquiry tasks from observation to data matching and then to reasoning and critique, guiding students in constructing coherent scientific explanations.
2. Aligning assessment tools with competency-based curricula:
Taiwan’s 12-Year Curriculum emphasizes that students must form claims, explain mechanisms, and critique designs based on data. However, traditional paper-based tests are not appropriate to assess such process-oriented skills. The animation-based simulations and open-ended items in this study offer a feasible model for such assessments. They capture not just correctness but also students’ reasoning pathways and misconceptions, supporting formative diagnosis in the classroom.
3. Promoting inquiry culture through assessment innovation:
The assessment method of this study encourages students to reflect, compare, and generate creative explanations. When prompted with questions such as “Why are two experiments needed?”, students must support their answers by considering evidence credibility, variable control, and experimental logic. This shift in student thinking encourages teachers to focus more on reasoning processes rather than fixed answers in science instruction.
4. Balancing equity and scalability through interactive assessments:
While academic background and contextual disparities influence performance, the open-ended animation tasks enable the effective identification of high-level thinkers across demographic groups. It is necessary to integrate semantic scoring technologies to reduce grading costs, making this assessment model scalable for large-scale implementation in diagnostics and formative feedback.

5.2. Differences and Challenges of Student Performance

The results of this study reveal notable differences in students’ inquiry performance across academic disciplines and geographic regions. These disparities reflect the actual distribution of competencies and systemic and pedagogical challenges in implementing inquiry-based instruction and assessment.
1. Disciplinary background
Students in the science track significantly outperformed their humanities counterparts in causal reasoning tasks, especially open-ended ones such as T5. Those students articulated molecular mechanisms and linked data trends to theoretical explanations. In contrast, students majoring in humanities often described surface-level phenomena without grounding their reasoning in scientific principles. This reflects how disciplinary training shapes students’ cognitive tendencies: students majoring in science are trained for systematic analysis and variable control, while those majoring in humanities focus on semantic expression and the sociocultural context [5]. Cross-disciplinary curriculum design must intentionally include targeted scaffolds to address these divergent thinking profiles.
2. Gender difference
There was no significant difference observed between male and female students in overall scores or performance on individual item types. Whether in data interpretation on multiple-choice items or logical reasoning and experimental critique in open-ended responses, female and male students showed comparable performance. This indicates that the animation-based assessment demonstrated strong gender fairness.
This trend diverges from earlier studies suggesting that “males excel in visual-kinesthetic tasks while females prefer verbal-contextual reasoning.” The result is attributed to the use of contextualized animation, which integrates visual dynamics and narrative elements, allowing students of different genders to access the tasks through diverse modalities.
3. Further alignment of items with inquiry competencies
Although the assessment results demonstrated solid discriminatory power, a noticeable disconnect was observed in students’ performance between multiple-choice and open-ended items. Several students performed well on selected-response tasks, indicating proficiency in basic data recognition, but failed to exhibit equivalent depth of understanding or reasoning when confronted with argumentation or design-based tasks. Selected-response items could therefore be expanded with justification selection or multi-step tiered structures to better reflect inquiry processes. These strategies move items beyond memory checks, transforming them into valid entry points for logical reasoning.

5.3. Limitations and Future Research

While the animation-based assessment developed in this study showed strong validity in evaluating causal reasoning and critical thinking, limitations remain in terms of assessment scope, item design, and technological integration, which need future research and development.
1. Limited coverage of inquiry competency dimensions
The assessment in this study targeted two core competencies, causal reasoning and critical thinking, while model construction and creative experimentation, also emphasized in the curriculum, were not directly assessed. Although the animations provided cognitive scaffolding for students to visualize gas interactions, no task required students to actively construct representations or generate novel experimental designs. Future assessments should incorporate tasks involving diagrammatic modeling, mechanism sketching, or alternative solution design, such as drawing reaction processes or proposing novel variable combinations. These approaches would enable the evaluation of students’ skills in visual modeling and hypothesis generation.
2. Need for advanced automated scoring tools
The open-ended tasks were scored manually with high inter-rater reliability in this study. However, for large-scale deployment or real-time classroom feedback, this method faces constraints in terms of labor and time. Therefore, the development of AI-powered scoring systems using natural language processing (NLP) and semantic analysis is required to identify reasoning levels based on linguistic features, including causal connectors or variable labels, and deliver instant formative feedback or error-specific hints.
Moreover, AI scoring tools must prioritize interpretability and teacher customizability so that automated graders can serve as integral components of instructional decision-making.
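As a very rough, hypothetical sketch of the kind of feature-based semantic scoring proposed here, the snippet below counts causal connectors and domain variable terms to estimate a reasoning level; the word lists and thresholds are illustrative assumptions, not a validated scoring model.

```python
# Hypothetical feature-based sketch of semantic scoring for open-ended responses;
# connector/variable lists and thresholds are illustrative assumptions only.

CAUSAL_CONNECTORS = ["because", "therefore", "so that", "leads to", "results in"]
VARIABLE_TERMS = ["methane", "hydroxyl", "ultraviolet", "carbon dioxide", "co2"]

def estimate_reasoning_level(response: str) -> str:
    text = response.lower()
    causal = sum(text.count(c) for c in CAUSAL_CONNECTORS)   # causal links stated
    variables = sum(1 for v in VARIABLE_TERMS if v in text)  # named variables
    if causal >= 1 and variables >= 2:
        return "full credit"     # mechanism linking named variables
    if causal >= 1 or variables >= 1:
        return "partial credit"  # mechanism or variables, but not both
    return "no credit"

print(estimate_reasoning_level(
    "The C line rises because hydroxyl radicals oxidize methane, which results in CO2."))
# -> "full credit"
```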

6. Conclusion

We developed and validated an interactive animation-based assessment tool for scientific inquiry aligned with Taiwan’s 12-Year Basic Education Curriculum. By contextualizing tasks in real-world phenomena (i.e., methane and hydroxyl radical reactions) and combining animated simulations, multiple-choice and open-ended items, and process tracking via the CloudClassRoom platform, we evaluated students’ performance in key inquiry competencies such as causal reasoning and critical thinking. The results demonstrated strong discriminatory power and gender neutrality: the assessment effectively distinguished students of varying proficiency and disciplinary backgrounds. Open-ended tasks were sensitive in identifying high-level reasoning, while the animations supported students in contextualizing and justifying their responses. The task design was inclusive of diverse learning styles and genders.
Nonetheless, limitations in assessing competencies such as model construction and creative experimentation must be addressed, as the predominantly closed item formats cannot capture students’ full inquiry processes. Incorporating diagram-based modeling tasks, open simulation platforms, and AI-driven semantic scoring will be essential to establish diagnostic, scalable, and competency-aligned assessment systems.
The model developed in this study offers a feasible approach to assessing inquiry-based competencies and provides both theoretical grounding and practical implications for the future development of multimodal, multi-competency instructional assessments.

Author Contributions

Conceptualization, S.-C.Y., V.T.H.N. and C.-Y.C.; methodology, S.-C.Y.; software, S.-C.Y.; validation, S.-C.Y. and C.-Y.C.; formal analysis, S.-C.Y.; investigation, S.-C.Y.; resources, C.-Y.C.; data curation, S.-C.Y.; writing—original draft preparation, S.-C.Y.; writing—review and editing, C.-Y.C.; visualization, S.-C.Y.; supervision, C.-Y.C.; project administration, C.-Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. National Research Council. National Science Education Standards; National Academies Press: Washington, DC, USA, 1996.
2. OECD. Agency in the Anthropocene: Supporting document to the PISA 2025 Science Framework; OECD Publishing: Paris, France, 2023.
3. NGSS Lead States. Next Generation Science Standards: For States, by States; National Academies Press: Washington, DC, USA, 2013.
4. Ministry of Education. Curriculum Guidelines of 12-Year Basic Education: General Guidelines (English Version); National Academy for Educational Research: New Taipei City, Taiwan, 2014.
5. Lin, T.-J.; Tsai, C.-C. A multi-dimensional instrument for evaluating Taiwanese high school students’ science learning self-efficacy in relation to their approaches to learning science. Int. J. Sci. Educ. 2013, 35, 1525–1549.
6. Hwang, G.-J.; Tsai, C.-C. Research trends in mobile and ubiquitous learning: A review of publications in selected journals from 2001 to 2010. Br. J. Educ. Technol. 2011, 42, E65–E70.
7. Sweller, J. Cognitive load theory, learning difficulty, and instructional design. Learn. Instr. 1994, 4, 295–312.
8. Mayer, R.E. Multimedia Learning, 2nd ed.; Cambridge University Press: Cambridge, UK, 2009.
9. de Jong, T.; Linn, M.C.; Zacharia, Z.C. Physical and virtual laboratories in science and engineering education. Science 2013, 340, 305–308.
10. Zacharia, Z.C.; Olympiou, G. Physical versus virtual manipulative experimentation in physics learning. Learn. Instr. 2011, 21, 317–331.
11. de la Torre, J. DINA model and parameter estimation: A didactic. J. Educ. Behav. Stat. 2009, 34, 115–130.
12. Chang, T.-C.; Lyu, Y.-M.; Wu, H.-C.; Min, K.-W. Introduction of Taiwanese literacy-oriented science curriculum and development of an aligned scientific literacy assessment. Eurasia J. Math. Sci. Technol. Educ. 2024, 20, em2380.
13. OECD. PISA 2018 Assessment and Analytical Framework; OECD Publishing: Paris, France, 2019.
14. Pellegrino, J.W.; Hilton, M.L. Education for Life and Work: Developing Transferable Knowledge and Skills in the 21st Century; National Academies Press: Washington, DC, USA, 2012.
15. Wieman, C.E.; Adams, W.K.; Perkins, K.K. PhET: Simulations that enhance learning. Science 2008, 322, 682–683.
16. Mislevy, R.J.; Steinberg, L.S.; Almond, R.G. On the structure of educational assessments. Meas. Interdiscip. Res. Perspect. 2003, 1, 3–62.
17. Tuveri, M.; Steri, A.; Fadda, D. Using storytelling to foster the teaching and learning of gravitational waves physics at high-school. Phys. Educ. 2024, 59, 045031.
18. Rieber, L.P. Animation in computer-based instruction. Educ. Technol. Res. Dev. 1990, 38, 77–86.
19. Rutten, N.; van Joolingen, W.R.; van der Veen, J.T. The learning effects of computer simulations in science education. Comput. Educ. 2012, 58, 136–153.
20. Gobert, J.D.; Sao Pedro, M.; Baker, R.; Toto, E.; Montalvo, O. Leveraging educational data mining for real-time performance assessment of scientific inquiry skills within microworlds. J. Educ. Data Min. 2013, 5, 111–143.
21. Elmoazen, R.; Saqr, M.; Khalil, M. Learning analytics in virtual laboratories: A systematic literature review of empirical research. Smart Learn. Environ. 2023, 10, 23.
22. Anderson, L.W.; Krathwohl, D.R. (Eds.) A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives; Longman: New York, NY, USA, 2001.
23. Chen, Z.; Klahr, D. All other things being equal: Children’s acquisition of the control of variables strategy. Child Dev. 1999, 70, 1098–1120.
Figure 1. Research design and implementation.
Figure 2. CDA results of inquiry assessment items, evaluating the discrimination capabilities of (a) multiple-choice items, (b) free-response items, and (c) the overall inquiry assessment.
Table 1. Comparative analysis of inquiry competency frameworks.

| Institution | Core Dimensions | Strengths | Limitations |
| Taiwan Curriculum | Cognitive intelligence and problem-solving | Integrates creativity with logical rigor | Neglects ethics and interdisciplinary links |
| OECD PISA | Explanation, evaluation, and data-driven | Real-world relevance | Limited hypothesis-generation tasks |
| NGSS (U.S.) | Practices and crosscutting concepts | Aligns with STEM careers | Overlooks metacognitive strategies |
Table 2. Summary of experimental steps and analysis tasks (animation experiment (AE) and data tasks).

| No. | Type | Description | Objective |
| AE1 | Animation experiment 1 (visual representation) | Students observe an animated experiment showing the temperature changes in three bottles containing different gases. | To provide a foundational understanding of how gas composition influences temperature in a controlled environment. |
| Task 1 | Data analysis (multiple-choice) | Based on the table, Kevin drew the graph; however, he forgot to describe the symbols. Help Kevin match the description to each line. | To develop the ability to interpret data and link graphical representations to descriptive elements. |
| Task 2 | Data analysis (multiple-choice) | Based on the table, what role does Group A play in the experiment? The experimental group or the control group? | To understand the experimental design and distinguish between the experimental and control groups. |
| AE2 | Animation experiment 2 (understanding chemical reactions) | Students watch a second animation where the bottles are exposed to ultraviolet light, simulating chemical reactions involving hydroxyl radicals. | To visualize and understand the dynamic interactions between ultraviolet light, hydroxyl radicals, and methane in atmospheric processes. |
| Task 3 | Data analysis (multiple-choice) | Kevin paired the curves wrong again. Please help Kevin match the correct description to each line. | To refine students’ skills in analyzing experimental data and interpreting graphical trends accurately. |
| Task 4 | Data analysis (multiple-choice) | Compare (Experiment 1) and (Experiment 2) just drawn. What are the changes in lines A, B, and C? | To assess the ability to compare and interpret changes in experimental outcomes across multiple experimental setups. |
| Task 5 | Reasoning and argumentation (open-ended) | If the data you collected are correct, in the change line of Experiment 2, the A and B lines have hardly changed, but the C line has become closer to the B line. Why? | To encourage critical thinking and reasoning about experimental outcomes based on observed data. |
| Task 6 | Critical thinking (open-ended) | Was only Experiment 2 required for experimental purposes, or were both experiments required? Why? | To evaluate students’ understanding of experimental design, the necessity of controls, and the role of multiple trials in validating scientific findings. |
Table 3. Inquiry assessment pass rates and item discrimination index by demographic groups.

| Variable | Number of Students | P | Ph | Pl | Pa | Pb | Pc | Pd | Pe | D | D1 | D2 | D3 | D4 |
| All | 26,823 | 61 | 83 | 30 | 88 | 76 | 68 | 50 | 16 | 0.53 | 0.11 | 0.08 | 0.19 | 0.34 |
| Male | 12,747 | 60 | 83 | 28 | 87 | 76 | 67 | 44 | 15 | 0.54 | 0.11 | 0.09 | 0.23 | 0.30 |
| Female | 13,989 | 63 | 87 | 35 | 88 | 76 | 68 | 50 | 16 | 0.52 | 0.12 | 0.08 | 0.18 | 0.34 |
| Science and Engineering | 19,776 | 65 | 87 | 42 | 88 | 76 | 68 | 56 | 21 | 0.45 | 0.12 | 0.08 | 0.12 | 0.35 |
| Arts and Humanities | 7,047 | 51 | 76 | 18 | 81 | 69 | 58 | 31 | 11 | 0.57 | 0.12 | 0.11 | 0.27 | 0.20 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
