Systematic Review

The Effectiveness and Sustainability of Tier Diagnostic Technologies for Misconception Detection in Science Education: A Systematic Review

1 School of Teacher Development, Shaanxi Normal University, Xi’an 710000, China
2 Shaanxi Institute of Teacher Development, Xi’an 710000, China
3 Faculty of Education, Shaanxi Normal University, Xi’an 710000, China
4 Chongqing Academy of Education Science, Chongqing 400000, China
5 College of Life Sciences, Ningxia University, Yinchuan 750000, China
6 College of Life Sciences, Shaanxi Normal University, Xi’an 710000, China
* Author to whom correspondence should be addressed.
Sustainability 2025, 17(7), 3145; https://doi.org/10.3390/su17073145
Submission received: 14 January 2025 / Revised: 28 March 2025 / Accepted: 29 March 2025 / Published: 2 April 2025

Abstract

The rapid development of science and technology has made scientific literacy a key focus in education, with scientific concepts considered central to the development of citizens’ literacy. However, misconceptions hinder the improvement of students’ scientific literacy. Misconception tier diagnostic technologies (MTDTs) provide an effective means for assessing the depth of students’ understanding of concepts. This paper provides a systematic review of research on MTDTs from 1985 to 2024. Following PRISMA standards, a comprehensive literature search was conducted across multiple databases. The screening process is described in detail, and the selection steps are visually presented. A total of 28 studies were selected, analyzing the development history, effectiveness, and practical application of this technology across various scientific domains. The results show that MTDTs have undergone rigorous reliability and validity assessments, demonstrating high quality. Four-tier diagnostic technology is regarded as the most accurate method for identifying misconceptions. This technology is most widely applied in physics and biology, with relatively fewer applications in geography. Moreover, MTDTs are more commonly used in high school and higher education but are underused in primary education. With ongoing technological advancements, MTDTs are transitioning toward large-scale online applications. The study also reveals that misconceptions are prevalent among students across various educational stages and subjects. However, interventions targeting these misconceptions lack effective validation and systematic empirical research. Future research should focus on these findings to provide theoretical support and practical guidance for the sustainable development of education and student literacy.

1. Introduction

1.1. Core Role of Scientific Concepts in Scientific Literacy

Since 2010, international organizations and countries, including the United Nations, the European Union, China, and the United States, have implemented policies and initiatives to enhance global scientific literacy in response to challenges such as climate change [1,2,3,4]. These efforts have strengthened public understanding of science and advanced the achievement of sustainable development goals, particularly in environmental protection and green technologies. Improving scientific literacy equips global citizens to better address environmental issues and support long-term sustainability [5,6]. In this context, scientific concepts are seen as core elements of scientific literacy, with growing attention in global educational policies. For instance, the United Nations’ 2030 Agenda for Sustainable Development [3] and China’s National Science Literacy Action Plan (2021–2035) [7] emphasize the mastery and application of scientific concepts as crucial to improving scientific literacy.
Despite increasing policy emphasis, misconceptions about scientific concepts persist among students. These misconceptions are prevalent in students’ understanding of basic concepts, such as photosynthesis, respiration, and the nature of force, and hinder deeper comprehension and effective application of subject knowledge [8,9,10,11].
Conceptual learning is the foundation of science education. Scientific concepts are an essential part of students’ cognitive frameworks, directly influencing their understanding and interpretation of scientific phenomena [12]. However, current educational practices, especially in science education, often prioritize operational skills and problem-solving over a deep understanding of scientific concepts [13,14]. Scholars note that while students may quickly acquire scientific skills, such as conducting experiments and analyzing data, they struggle to apply these skills effectively in real-world contexts without a deep understanding of underlying scientific concepts [15]. As a result, students may fail to make scientifically sound judgments when faced with unfamiliar problems.
The issue stems from an overemphasis on skill development and factual memorization in science education, often at the expense of systematic concept learning. This oversight results in misconceptions about complex scientific phenomena, hindering students’ ability to apply their knowledge effectively [16,17]. In the context of global efforts to promote sustainable development, improving scientific literacy is crucial for addressing both personal academic challenges and global issues such as climate change, resource scarcity, and environmental degradation [18,19]. Reexamining the teaching of scientific concepts and using targeted technological methods to identify students’ misconceptions is crucial for enhancing scientific literacy. Strengthening concept learning and helping students grasp the systemic and holistic nature of science allows them to convert scientific knowledge into problem-solving abilities, providing the cognitive support needed for sustainable development [20]. Focusing on concept learning is both the foundation for improving scientific literacy and the key to developing citizens with innovative and critical thinking skills for sustainable development [21].

1.2. Contributions of Existing Research in Revealing Misconceptions

Recent studies have made significant progress in examining misconceptions within education. Researchers have employed various assessment methods to explore students’ cognitive errors across disciplines, enhancing our understanding of how students construct and sustain misconceptions. Methods such as interviews, concept mapping, and open-ended or free-response questionnaires have been widely used [22,23,24,25]. Among these, MTDTs are viewed as the most effective.
Early research focused on identifying misconceptions through traditional methods, such as multiple-choice questions, which revealed students’ misunderstandings of core concepts and laid the foundation for exploring the cognitive mechanisms behind these errors [26]. As research advanced, the development of MTDTs introduced more precise assessment technologies. Early single-tier tests did not fully capture students’ cognitive processes; however, the introduction of a reasoning tier in two-tier diagnostic approaches allowed for deeper insights into the root causes of misconceptions [27,28].
Further developments, such as three-tier and four-tier technologies, incorporated a confidence tier, enabling researchers to distinguish between deep understanding, surface-level memorization, and lack of knowledge, thus improving the diagnosis of misconceptions [29,30,31]. This evolution signifies a shift from merely detecting cognitive errors to a more nuanced analysis of cognitive processes. These advancements have enabled more effective interventions in teaching and contributed significantly to the improvement of students’ scientific literacy.

1.3. Necessity of Conducting a Systematic Review on Multi-Tier Diagnostic Technologies

This systematic review seeks to explore the effectiveness and application of MTDTs in identifying students’ misconceptions. Misconceptions are erroneous understandings that arise because of cognitive biases, knowledge misunderstandings, and other factors during the learning process [32]. These misconceptions not only hinder students’ mastery of subject content but can also negatively affect their academic performance and interest in the subject [33]. In recent years, research on misconceptions has become a prominent topic in educational psychology and instructional design, particularly in science education, where identifying and correcting misconceptions has become a pivotal task for enhancing the sustainable development of students’ scientific literacy [34,35].
As a sustainable assessment method, MTDTs overcome the limitations of traditional diagnostic technologies by incorporating multiple diagnostic tiers (such as the reasoning and confidence tiers), offering a more comprehensive insight into students’ cognitive processes. Existing studies indicate that MTDTs demonstrate high accuracy and effectiveness in diagnosing students’ misconceptions, particularly in uncovering deeper-level errors in their understanding [36]. However, despite the growing interest in the application of MTDTs for identifying misconceptions, no systematic review evaluating the role and effectiveness of these technologies has been found in the SSCI citation database. This gap in the literature points to the need for a comprehensive synthesis of the existing research, which is the contribution this systematic review aims to make.
Based on the research context outlined above, the following research questions are presented: (i) How scientifically grounded is the development of MTDTs, and what are their overall reliability and validity? (ii) How are MTDTs applied across various scientific disciplines? (iii) How are MTDTs applied across various educational stages (K-12 and higher education)? (iv) Can MTDTs be applied on a large scale in online settings? (v) What common misconceptions do MTDTs reveal in science education, and are the teaching strategies targeting these misconceptions effective? (vi) Which type of MTDTs is most effective in revealing students’ misconceptions?
To answer these questions, the following hypotheses are proposed: (i) The development of MTDTs follows a scientific process, and instruments developed through rigorous validation procedures possess good reliability and validity. (ii) The effectiveness of MTDTs varies across different scientific disciplines, with certain disciplines more likely to use MTDTs to reveal students’ conceptual misconceptions. (iii) The application of MTDTs differs between K-12 and higher education, requiring adjustments and optimizations based on educational stages. (iv) MTDTs have the potential for large-scale online application and are expected to become the preferred method for future use. (v) MTDTs can reveal cross-disciplinary common misconceptions, and targeted teaching strategies can effectively facilitate students’ conceptual change. (vi) Different types of MTDTs have varying effectiveness in assessing students’ conceptual understanding, with some types being more effective than others in revealing students’ misconceptions.

2. Methodology

This study employs a systematic review method to examine the current research on MTDTs. The process follows a four-stage procedure based on the models of Miller et al., Scott et al., and Díaz-Burgos [37,38,39], as detailed below: (i) The literature search began with a diagram outlining relevant terms and thematic axes, utilizing the Web of Science (WOS) and Google Scholar platforms for database searches. (ii) A multi-step screening process was carried out based on predefined inclusion and exclusion criteria. (iii) The researchers reached a consensus on the coding method and carried out the coding process in an organized manner. (iv) The coded data were then subjected to qualitative and quantitative analysis.

2.1. Literature Search

The literature review was systematically initiated with a diagram outlining relevant terms and thematic axes, as shown in Figure 1. Relevant literature was searched for using the Web of Science (WOS) and Google Scholar platforms. The last retrieval date was June 2024. The selected papers were required to be Social Science Citation Index (SSCI) education-related articles published between 1985 and 2024 within the fields of humanities and social sciences. These studies had to be peer-reviewed and written in English. A predefined search strategy was applied to guide the identification of all relevant studies, with Boolean operators (AND) used to combine search terms, such as misconception AND diagnostic technology/instrument.

2.2. Inclusion and Exclusion Criteria

It is necessary to note that the research process was documented and is visually represented in accordance with the PRISMA statement [40]. The inclusion and exclusion criteria were clearly defined based on the research theme, search range, and publication timeframe to identify the contributions of MTDTs in science education.
The selection of studies for this review was guided by clearly defined inclusion and exclusion criteria to ensure relevance, quality, and alignment with the research objectives. Studies were included if they were published in peer-reviewed, SSCI-indexed journals between 1985 and 2024, focusing on the development, validation, or application of tier diagnostic technologies in science education. Only empirical studies conducted in English and involving participants from K-12 or higher education settings were considered. The decision to include only English-language studies was made to maintain consistency in data interpretation and avoid potential biases arising from translations. Additionally, the inclusion of empirical studies ensures that the review draws on robust and data-driven insights.
Conversely, studies were excluded if they did not explicitly focus on tier diagnostic technologies. For example, research that employed other diagnostic methods, such as interviews or concept mapping, was excluded, as these approaches do not align with the core focus of this review. Non-empirical studies, including book reviews, theoretical commentaries, and reports, were also excluded to maintain a focus on evidence-based research. Furthermore, articles outside the domain of science education, such as those addressing misconceptions in mathematics or linguistics, were omitted to maintain the review’s specificity. Finally, studies that lacked sufficient methodological detail or were not written in English were excluded to ensure the quality and interpretability of the findings.
The PRISMA method was adopted to evaluate the quality of the studies. As this systematic review focuses on diagnostic technology, the reliability and validity testing of the diagnostic instruments developed based on these technologies were crucial for ensuring the studies’ quality. Specific metrics reflecting these aspects are presented in the corresponding results tables, along with detailed discussions and analyses. This approach was therefore used to assess study quality.

2.3. Types of Studies Reviewed

This systematic review encompasses a variety of study types to provide a comprehensive understanding of the development, validation, and application of tier diagnostic technologies. These include descriptive studies that identify and characterize misconceptions across different scientific disciplines, experimental studies that test the effectiveness and reliability of these technologies, validation studies (both qualitative and quantitative) that assess the reliability and accuracy of these technologies, and unvalidated proposals that highlight innovative approaches but lack empirical testing. By including both validated and unvalidated studies, this review aims to present a holistic perspective on the current state of tier diagnostic technology research, highlighting both established practices and emerging criteria.

2.4. Literature Screening Process

Figure 2 illustrates the entire process of systematic screening. A total of 158 articles were retrieved. After excluding 107 non-MTDT and non-SSCI papers, a further 11 were removed because of language issues or duplication. Further analysis of the remaining 40 articles led to the exclusion of studies on topics such as mathematics and information literacy. Ultimately, 28 articles met the inclusion criteria.

2.5. Content Analysis Process

Based on the research objectives and relevant literature, an overarching categorical system was developed, as shown in Table 1. It included primary and secondary classifications, together with their definitions, to ensure systematic analysis and reproducibility. Tertiary classifications were subsequently developed according to the focal research questions. The unit of analysis in this study was a single document, treated as an independent object of analysis.
The thematic classification of the selected literature followed the categorical system. Both inductive and deductive approaches were applied. Studies were categorized using predefined criteria while also allowing emerging themes to shape iterative refinements in the coding system. For instance, the application stages of MTDTs were initially classified into elementary school, junior high school, senior high school, and university. Further analysis showed that some studies examined cross-stage applications.
To ensure coding consistency, two coders independently coded the data and recorded their results in an Excel spreadsheet. After completing the coding for each core category, they compared their results, discussed discrepancies, and reached a consensus, making corrections as needed. For instance, one researcher classified a misconception with a 10% prevalence rate as common, while another labeled it uncommon. After discussion, a clearer definition was established: misconceptions with a prevalence of at least 10% were considered common.
To ensure data consistency and stability, 20% of the reviewed studies (a total of six) were randomly selected for re-coding. The approach followed a method similar to that of Risko [41]. The first, second, and fourth authors collaboratively re-coded these six studies, with each author handling two studies [42,43]. A comparison between the initial and re-coded results showed an agreement rate exceeding 90%, confirming the reliability and consistency of the coding process.
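To make the agreement check described above concrete, the following is a minimal sketch of how an inter-coder agreement rate can be computed; the coding categories and sample labels are illustrative only and are not taken from the reviewed studies.

```python
# Minimal sketch of a percent-agreement check between two coders.
# Categories and sample labels are illustrative, not drawn from the actual coding data.

def agreement_rate(coder_a: list[str], coder_b: list[str]) -> float:
    """Return the proportion of items on which two coders assigned the same code."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must code the same set of items.")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# Hypothetical codes for six re-coded studies (e.g., application stage).
initial = ["senior_high", "university", "junior_high", "university", "senior_high", "cross_stage"]
recoded = ["senior_high", "university", "junior_high", "university", "senior_high", "cross_stage"]

print(f"Agreement: {agreement_rate(initial, recoded):.0%}")
```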
After verifying the coding quality, researchers synthesized and visualized data across different categories to identify key trends. Frequency and co-occurrence analyses were conducted to determine high-frequency concepts and examine their distribution across studies.
In Section 3, the selected studies are analyzed from the following two main perspectives: descriptive characteristics and synthesis of the studies. The primary focus is on the synthesis of the studies, guided by the overall categorical system, which organizes the analysis into a three-level classification framework across the following five key domains: (1) development of MTDTs; (2) reliability and validity validation; (3) application of MTDTs across disciplines and educational stages; (4) misconceptions identified by MTDTs and proposed interventions; and (5) representational effectiveness of MTDTs in assessing conceptual understanding.

3. Results

3.1. Descriptive Characteristics

This section provides an overview of the key descriptive characteristics of the selected studies, including the analyzed literature, division of education stages, geographical distribution, scientific disciplines covered, and research methods employed. Examining these aspects helps define the scope and focus of the reviewed studies, ensuring consistency in data interpretation and enabling meaningful cross-context comparisons.

3.1.1. Analyzed Documents and Their Locations

The studies reviewed span diverse participant demographics, including students and educators in K-12 and higher educational settings. Participants’ ages ranged from 10 to over 25 years old, covering elementary, middle, and high school students, as well as preservice teachers. All included studies were conducted in English to ensure consistency in data interpretation. Because of slight variations in how educational stages are divided across different countries, the following terminology was established for consistency: Primary school (Grades 1–6, ages 6–12), Junior high school (Grades 7–9, ages 13–15), and Senior high school (Grades 10–12, ages 16–18). Some studies ambiguously classified participants as “high school”, with ages spanning two educational stages (14–16 years). For consistency, these participants were categorized as junior high school students based on the predominant characteristics of the group. Among the 28 studies reviewed, none focused exclusively on primary school students. Of the remaining studies, 10.71% involved only junior high school students, 39.29% involved only senior high school students, and 35.71% involved only university students. Additionally, 7.14% of the studies included primary, junior, and senior high school students, while another 7.14% focused on both junior and senior high school students.
The inclusion criteria for this study did not impose geographical limitations, but language requirements influenced the geographical distribution of the included studies. Among the studies analyzed, the highest proportion were from Turkey, reflecting the region’s significant research activity in this field. Chinese Taiwan and Singapore followed with a relatively high number of studies from these regions. The research also included studies from Australia, the United Kingdom, the United States, and other European and Asian countries, such as Spain, Malaysia, Serbia, and China. Overall, the distribution of the studies suggests that tier diagnostic technologies in science education are widely applied, with Turkey and East Asia emerging as research hotspots.

3.1.2. Content Areas

This study covers multiple scientific disciplines, including physics, chemistry, biology, and geography. Most of the studies focus on a single scientific discipline, reflecting a trend of exploring individual subjects in depth rather than integrating across disciplines. However, some studies explored both chemistry and biology.

3.1.3. Research Methods and Data Analyses

All studies used tiered diagnostic technologies, such as traditional multiple-choice tests and two-tier, three-tier, and four-tier methods, to assess students’ conceptual understanding. The research methods combined both quantitative and qualitative analyses in a mixed-methods approach, ensuring a comprehensive exploration of students’ understanding and misconceptions. The quantitative analysis focused on evaluating test reliability and validity, categorizing students’ concept comprehension, and performing descriptive statistical analysis to assess the reliability and effectiveness of the instruments. Qualitative analysis was used to explore students’ potential misconceptions in greater detail. Some studies used interviews during the instrument development process to identify misconceptions, which were then used as distractors in the assessment. Researchers also used content analysis to organize and examine students’ open-ended responses, identifying potential misconceptions in the process.

3.2. Synthesis of the Studies

This study followed the standard methodology for systematic literature reviews [24,25,26,27] to analyze the collected literature, aiming to synthesize research questions and findings and identify core themes and research trends. The selected studies were thoroughly evaluated to identify focal topics and further synthesize the findings, deepening the understanding of the development of MTDTs. Based on the research questions and the proposed coding classification system, the study ultimately identified the following five central focus areas: (i) development of MTDTs, (ii) quality verification of MTDTs, (iii) application of MTDTs, (iv) misconceptions revealed by MTDTs with suggested interventions, and (v) representational effectiveness of MTDTs in assessing conceptual understanding.

3.2.1. Development of MTDTs

Table 2 provides a systematic classification of the development procedures for MTDTs. This classification organizes the development procedures of various MTDTs in the existing literature, highlights their commonalities and differences, and offers a reference for improvement in future research.
Table 2 shows that most researchers followed Treagust’s framework for developing diagnostic instruments. However, not all studies strictly followed every step, as adaptations were made based on technological advancements and evolving research needs. For example, Treagust [11,28] suggested defining the conceptual scope by comparing a list of propositional knowledge statements. However, 42.86% of the later studies utilized concept mapping to define the conceptual scope [44,45,46]. While a list of propositional knowledge statements systematically outlines concepts, concept maps better illustrate the relationships among sub-concepts, offering a clearer definition of conceptual boundaries. This improvement has made the development framework more comprehensive.
Confidence ratings were first integrated into three-tier diagnostic technology [47]. Seven studies measured confidence ratings using a binary scale [29,48,49], where 1 indicated certainty and 2 indicated uncertainty. The other nine studies used Likert scales ranging from 4 to 7 points [8,31,50]. The choice of scale depends on the researchers’ specific needs. While confidence ratings typically distinguish certainty from uncertainty, more refined scales enable a deeper analysis of misconceptions. To illustrate, one study used a six-point Likert scale: misconceptions with an average confidence level above 3.50 were classified as true misconceptions, and within this group, those between 3.50 and 4.00 were labeled moderate-strength and those above 4.00 high-strength [50]. The strength of misconceptions depends on the frequency of reinforcement students receive; frequent exposure to misinformation and a lack of correction can entrench them. Notably, moderate-strength misconceptions are easier to address than high-strength ones, offering insights for instructional adjustments.
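As an illustration of how such strength banding might be operationalized, the sketch below applies the boundaries described above (above 3.50 for true misconceptions, 3.50–4.00 for moderate strength, above 4.00 for high strength); the function name and sample data are hypothetical.

```python
# Sketch of the confidence-based strength bands described above (six-point Likert scale).
# Band boundaries follow the text; the examples are hypothetical.

def misconception_strength(mean_confidence: float) -> str:
    """Classify a misconception by the mean confidence students report in it."""
    if mean_confidence > 4.0:
        return "high-strength misconception"
    if mean_confidence > 3.5:
        return "moderate-strength misconception"
    return "not classified as a true misconception"

for label, conf in [("plants respire only at night", 4.6), ("boiling point is fixed", 3.7)]:
    print(f"{label}: {misconception_strength(conf)}")
```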
The sources of proposition concept statements mainly include academic literature, curriculum standards, textbooks, expert opinions, and teaching experience. In 64.29% of the studies, literature reviews served as a primary source for developing propositional concept statements [9,27,51]. Academic literature is critical for defining concept statements, as it is timely and reflects the latest scientific concepts, offering direct theoretical support. Researchers also tend to rely on indexed literature when addressing research questions. Similarly, 64.29% of the studies relied on curriculum standards or syllabi to develop propositional concept statements [31,52,53]. As authoritative documents that guide instructional practices, curriculum standards and syllabi offer concept statements that are accurate, scientifically validated, and broadly accepted. Consequently, their role in developing propositional concept statements is considered on par with that of scholarly literature. Textbooks account for 42.86%, less than academic literature and curriculum standards [30,45,54]. Although textbooks are rigorously reviewed, authoritative, and aligned with educational standards, their content is constrained by curriculum standards, updated more slowly, and focused on teaching practice, leading to lower citation rates in research. In practice, scholars generally tend to choose either a single or a combined approach based on their specific needs and preferences to determine propositional concept statements. Specifically, Zhao [53] derived concept statements about human blood circulation from both curriculum standards and textbooks.
Identifying potential misconceptions is critical in the development of MTDTs, as it directly impacts the quality of distractors. Table 2 indicates that the three most frequently used methods for identifying potential misconceptions are literature review (25 instances) [10,26,55], interviews (21 instances) [25,30,56], and open-ended tests (15 instances) [31,46,49]. The literature review is the most frequently used method, as it extracts misconceptions from existing studies. However, as diagnostic technology advances, these misconceptions must be further validated. Given their potential to become true misconceptions, they remain viable as distractors. Interviews are the second most common method. Through in-depth discussions with students, researchers gain insights into cognitive processes and identify potential biases in their understanding. Although these biases represent a smaller population, they can still serve as effective distractors. Open-ended tests are used less frequently than literature reviews and interviews. This method allows students to freely express ideas without predefined options, offering a more authentic reflection of misunderstandings. However, the complexity of data analysis, requiring categorization of numerous responses, results in its lower adoption rate. As shown in Table 2, identifying potential misconceptions requires using multiple complementary methods. By integrating and comparing these potential misconceptions, high-quality distractors can be obtained. For example, Kiray and Simsek [31] combined literature review, student interviews, and open-ended tests to identify potential misconceptions related to density.

3.2.2. Validation of the Reliability and Validity

To comprehensively evaluate the reliability and validity of different MTDTs, we have summarized the validation processes for each instrument. The validation methods employed varied across the studies, but they were generally based on classical test theory (CTT). Reliability was assessed using methods such as Cronbach’s alpha coefficient, split-half reliability, and test–retest reliability, while validity was verified through expert reviews and statistical analyses. Table 3 outlines the classification system used to focus on the reliability and validity of the instruments in the reviewed studies, providing essential guidance for their future improvement and implementation.
Table 3 identifies three primary approaches for evaluating the instruments’ reliability. The first approach consists of nine studies that rely exclusively on Cronbach’s alpha coefficient, representing 32.14% of the total [29,46,54]. In these studies, the Cronbach’s alpha values range from just above 0.60 to as high as 0.90. In diagnostic tests, these variations can be affected by the number of items and the granularity of the confidence rating scales. Generally, more items or finer-grained confidence scales lead to higher Cronbach’s alpha values, while fewer items or coarser scales result in lower coefficients. Such variations can cause the instrument’s reliability to be overestimated or underestimated. Therefore, relying solely on Cronbach’s alpha may not offer a comprehensive reliability assessment. The second approach, found in 14 studies [8,52,56] (50% of the total), incorporates Cronbach’s alpha as part of a broader reliability assessment. These studies employed multiple reliability measures, such as Cronbach’s alpha and test–retest reliability. This approach evaluates both internal consistency and the instrument’s stability across different time points or samples: test–retest reliability assesses the instrument’s stability over repeated use, while Cronbach’s alpha measures internal consistency. Since Cronbach’s alpha increases with the number of items, using multiple reliability indicators helps reduce potential biases, ensuring a more thorough and objective assessment. The third approach, found in only three studies [11,27,57] (10.71% of the total), does not use Cronbach’s alpha for the reliability assessment. These diagnostic tests aim to differentiate test-takers’ understanding rather than measure a single underlying trait; consequently, their internal consistency may be relatively low, making Cronbach’s alpha less suitable for this type of test. For example, Gönen [27] used split-half reliability together with the Spearman–Brown formula to assess the instrument’s stability: the test items are divided into two halves, the correlation between the halves is calculated, and the Spearman–Brown formula then adjusts this correlation to estimate the reliability of the full-length test. This combination evaluates consistency across item subsets, making it particularly useful for classification-oriented tests.
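For readers who want to reproduce these statistics, the following is a minimal sketch of Cronbach’s alpha and a split-half estimate adjusted with the Spearman–Brown formula, implemented directly from their standard definitions; the 0/1 response matrix is fabricated for illustration.

```python
# Sketch of two reliability statistics discussed above, computed from their standard formulas.
# The 0/1 response matrix below is fabricated for illustration.
from statistics import mean, pvariance

def cronbach_alpha(scores):
    """scores: one list of item scores (0/1) per respondent."""
    k = len(scores[0])
    item_vars = [pvariance([resp[i] for resp in scores]) for i in range(k)]
    total_var = pvariance([sum(resp) for resp in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def split_half_spearman_brown(scores):
    """Correlate odd/even item halves, then step up with the Spearman-Brown formula."""
    odd = [sum(resp[0::2]) for resp in scores]
    even = [sum(resp[1::2]) for resp in scores]
    mo, me = mean(odd), mean(even)
    cov = sum((o - mo) * (e - me) for o, e in zip(odd, even))
    r = cov / ((sum((o - mo) ** 2 for o in odd) * sum((e - me) ** 2 for e in even)) ** 0.5)
    return 2 * r / (1 + r)

# Fabricated responses: 5 students x 4 items, 1 = correct, 0 = incorrect.
responses = [[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1], [1, 0, 1, 1]]
print(f"alpha = {cronbach_alpha(responses):.2f}, "
      f"split-half (SB-corrected) = {split_half_spearman_brown(responses):.2f}")
```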
Validity assessment can also be categorized into three distinct approaches. The first approach relies exclusively on expert review, as seen in nine studies [27,54,58]. Expert review is a traditional evaluation method valued for its simplicity, efficiency, and suitability for preliminary MTDT validity assessment. It offers timely, authoritative feedback but is influenced by individual expertise and subjective biases, lacking systematic data support. However, its reliability depends on the expert panel’s composition, including academic alignment with the research domain and diversity of perspectives. The second approach incorporates expert review as part of a broader validity assessment. This approach, used in 16 studies [48,55,58], accounts for the highest proportion (57.14%) and is the most common. The use of multiple methods helps mitigate the limitations of expert review alone. Integrating data-driven statistical analyses and quantitative indicators enhances precision and reliability, reducing biases of single-method validation. For example, factor analysis identifies underlying data structures, Pearson correlation tests measure variable relationships, and expert judgment adds nuanced insights. This comprehensive approach significantly strengthens the validity assessment. The third approach excludes expert review, with only two studies identified [49,57]. One study utilized a combination of a comparison with propositional knowledge statements and comparison with a specification grid, while the other employed factor analysis alongside false positives and false negatives. This approach reduces subjective biases and enhances objectivity and reliability. However, lacking expert insights may hinder the identification of nuanced, real-world details. Calculating false positive and false negative ratios in participant responses offers an innovative validity assessment method. A ratio below 10% typically indicates good validity. However, setting this threshold based solely on experience lacks theoretical justification. Identifying the optimal tolerance level requires extensive simulations or data analysis. Comparing different test types, such as psychometric instruments, medical screenings, and educational assessments, can help in assessing performance across varying false positive and false negative levels. Instead of using a fixed 10% threshold, empirical evidence should guide its refinement.
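The false positive/false negative check mentioned above can be sketched as follows. One common reading in the two-tier literature, adopted here as an assumption, treats a correct content answer paired with an incorrect reason as a false positive and the reverse as a false negative; the sample data are fabricated.

```python
# Sketch of a false-positive / false-negative validity check for two-tier items.
# Assumed definitions: false positive = correct answer with wrong reason;
# false negative = wrong answer with correct reason. Sample data are fabricated.

def fp_fn_rates(responses):
    """responses: list of (content_correct, reason_correct) pairs, one per student-item."""
    n = len(responses)
    fp = sum(c and not r for c, r in responses) / n
    fn = sum(r and not c for c, r in responses) / n
    return fp, fn

sample = [(True, True), (True, False), (False, False), (True, True), (False, True)]
fp, fn = fp_fn_rates(sample)
print(f"false positives: {fp:.0%}, false negatives: {fn:.0%}")  # compare with the ~10% heuristic
```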

3.2.3. Application of MTDTs Across Disciplines and Educational Stages

Table 4 systematically categorizes each study by its implementation method, educational stages, subject area, sample size, and target conceptual themes. According to the literature review, nearly all studies used paper-based testing, covering educational stages from junior high to university and spanning various scientific disciplines, including biology, chemistry, and physics. Each study also focused on specific subject concepts, which varied across different fields.
By comparing the implementation methods, educational stages, participant sizes, subjects, and key research areas of these instruments, we gain a comprehensive understanding of their practical operation and scope. A total of 96% of the instruments use paper-based testing [30,50,52], except for Saat et al. [8], who employed online testing with a large sample size of 526 participants. This contrasts sharply with other studies. Online testing not only expands the sample size but also overcomes geographic and time constraints. However, it remains unclear whether this format can fully capture students’ deeper understanding [59]. The implementation of MTDTs involves transitioning from traditional paper-and-pencil tests to digital formats, a process that is still in the exploratory phase and requires ongoing refinement. With the rapid advancement of modern information technology, digital assessment methods have great potential for future development.
Table 4 reveals a significant imbalance in misconception studies across educational stages. No studies specifically targeted primary school students as their research subjects. At the junior high school level, 3 studies exist [29,52,53], highlighting the lack of research at both the junior high and primary school levels. This may be due to students in these stages being in the early phases of cognitive development, where misconceptions are less complex and widespread. At the high school level, there are 11 studies [45,49,55,56], and 10 at the university level [27,48,54,57]. Misconception research is concentrated at the high school and university levels, especially in high school, where students face more complex knowledge and are more prone to misconceptions. These misconceptions significantly impact learning outcomes and academic development. In contrast, cross-stage studies are scarce. Only two studies span primary, junior high, and high school [26,44], and two others include junior high and high school [10,28]. This is likely due to significant differences in students’ cognitive abilities, educational content, and teaching methods across stages, prompting researchers to focus on a single stage for deeper analysis. Overall, misconception research is concentrated at the high school and university levels, with limited focus on the junior high and primary school stages. Cross-stage studies remain rare.
The sample data in Table 4 reveal significant variation in the sample sizes across the studies. Seven studies with sample sizes exceeding 400 participants account for 25% of the total [49,51,60]. Large-scale research provides higher statistical power, ensuring the stability and generalizability of the results, especially in identifying common student performance patterns. Studies with large samples offer more reliable inferences, minimizing biases caused by insufficient sample sizes. For example, Haslam and Treagust [10] assessed misconceptions about photosynthesis and respiration among 438 middle school students in Western Australia. A large sample size enhances the representativeness of the findings. However, recruiting a large number of participants to diagnose misconceptions presents significant challenges, requiring extensive human and material resources. The second category includes 11 studies [9,30,46,57] with sample sizes between 200 and 400 participants, representing 39.29% of the total. Although smaller than the previous category, these studies still provide strong data support. For diverse educational groups, especially when studying misconceptions across various educational stages and subjects, these sample sizes provide moderate statistical inference, though with potential errors in highly heterogeneous studies. For example, Wang [44] studied 329 participants from primary, middle, and high schools, showing how this sample size supports empirical research across educational stages while maintaining statistical validity. The third category consists of seven studies [45,53,56] with fewer than 200 participants, accounting for 25% of the total. These studies are more prone to selection bias, and their findings may be less stable, limiting their generalizability. However, they remain valuable in specific contexts, particularly for studying individual phenomena or conducting qualitative analyses. For example, Milenković et al. [58] studied 42 second- and third-year pharmacy students at Bjelina University, analyzing misconceptions about carbohydrates. Their findings confirmed the strong validity of the research instrument and identified several misconceptions unique to pharmacy students at Bjelina University. Overall, while studies with over 400 or 200–400 participants dominate the field, offering higher statistical power and generalizability, those with fewer than 200 participants remain crucial in specific research areas. Future research should balance sample sizes, ensuring large samples for generalizability while allowing smaller studies for in-depth exploration.
Table 4 shows that biology and physics are the most prominent disciplines in misconception studies, comprising 75% of the total. This suggests that core concepts in these fields are particularly prone to cognitive biases among students. In biology, key concepts include photosynthesis and respiration, plant growth and development, the circulatory system, and osmosis and diffusion [8,10,51,53,57]. These topics span molecular to systemic levels, and because of the complexity of biological processes, students frequently develop misconceptions. For instance, some misunderstand the relationship between photosynthesis and respiration or underestimate environmental factors in plant growth. Accordingly, MTDTs in biology aim to clarify fundamental life processes and accurately identify and address students’ misconceptions. Ten of the reviewed studies focused on misconceptions in physics, particularly in core areas such as thermodynamics, electricity, mass, inertia, and gravity [27,28,29,44]. The high level of abstraction in physics and its reliance on mathematical reasoning make students more susceptible to misconceptions, potentially hindering their understanding of physical phenomena. Conversely, chemistry research accounts for 14.29% of the total, focusing on the following four key concepts: reaction kinetics, carbohydrates, chemical reaction rates, and boiling phenomena [50,54,58,61]. While the proportion of misconceptions in chemistry is lower than in biology and physics, it still presents challenges in understanding certain core concepts. These four themes, each with its own complexity and abstraction, are prone to generating cognitive errors among students. For example, when learning about boiling, many students mistakenly believe the boiling point depends solely on the type of liquid, overlooking the influence of atmospheric pressure [54]. In geography, relevant studies account for only 3.57%, with a primary focus on environmental issues, such as global warming, the greenhouse effect, and acid rain [48]. The relatively low proportion of research in this field may be due to the fact that science education research often prioritizes core disciplines like physics, chemistry, and biology, while geography tends to receive less attention. Moreover, geography is an inherently interdisciplinary field, combining elements of both natural sciences and social sciences, which may contribute to its underrepresentation in science education classifications. Another key observation, as shown in Table 4, is that Treagust [11,28] conducted two interdisciplinary studies using MTDTs in biology and chemistry. However, no further interdisciplinary studies have been recorded since then. This may be attributed to the high complexity of interdisciplinary research, which requires the integration of distinct knowledge systems and teaching methodologies across different fields. As a result, researchers often prefer to conduct in-depth studies within a single discipline rather than face the challenges posed by interdisciplinary research.

3.2.4. Misconceptions Identified by MTDTs and Proposed Interventions

Table 5 categorizes the identified misconceptions by their quantity, characteristics, use of psychological variables, and the intervention strategies proposed by the authors. It also analyzes whether psychological variables were utilized to assess these misconceptions. Additionally, the table consolidates the intervention strategies proposed for concept change, along with their implementation status, aiming to reveal the effectiveness of these interventions and the specifics of their application process.
Nineteen studies identified between one and five significant misconceptions. On average, each conceptual theme yielded 5.82 significant misconceptions. For instance, Putica [56] identified three misconceptions related to the circulatory system. While these misconceptions may seem simple, they are widespread among students. Common misconceptions are defined as those held by more than 10% of students, a threshold initially proposed by Caleon [47] and widely adopted by scholars. However, the determination of this threshold lacks empirical support, primarily because once a standard such as the 10% threshold becomes accepted within the academic community, researchers tend to rely on it rather than revisiting and revalidating it.
The results indicate that six studies incorporated psychological variables, representing 21.43% of the total [47,50,55]. These variables include overall confidence (CF), confidence when correct (CFC), confidence when wrong (CFW), and the confidence discrimination quotient (CDQ). Scholars have introduced psychological variables to assess misconceptions primarily because traditional answer-combination methods struggle to fully capture students’ conceptual understanding, particularly in terms of depth, intensity, and cognitive biases. Psychological variables, especially confidence-related metrics, provide a more nuanced instrument for cognitive assessment. For example, CDQ is calculated as (CFC − CFW)/SD, where CFC represents the confidence level when students provide correct answers, CFW denotes the confidence level when they provide incorrect answers, and SD refers to the standard deviation of confidence ratings [56]. CDQ measures an individual’s ability to accurately distinguish between what they know and what they do not know. Almost all of these studies employed a refined six-point Likert scale rather than a binary judgment approach, enhancing measurement sensitivity and allowing subtle changes in confidence ratings to be quantified. However, these variables may be influenced by factors such as individual cognition, task characteristics, knowledge background, learning environment, and emotional motivation. For instance, high anxiety levels may lower overall confidence, while individuals with stronger learning motivation may assess their knowledge more cautiously, reducing extreme confidence biases. Therefore, accurately assessing misconceptions with psychological variables requires taking these factors into account; doing so enhances the accuracy and applicability of the measurements.
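The CDQ calculation above can be illustrated with a short sketch; the confidence ratings and correctness flags are fabricated, and the function and variable names are hypothetical.

```python
# Sketch of the CDQ formula described above: CDQ = (CFC - CFW) / SD,
# where SD is the standard deviation of all confidence ratings. Data are fabricated.
from statistics import mean, stdev

def cdq(confidences, correct_flags):
    """confidences: per-item confidence ratings; correct_flags: whether each answer was correct."""
    cfc = mean(c for c, ok in zip(confidences, correct_flags) if ok)      # confidence when correct
    cfw = mean(c for c, ok in zip(confidences, correct_flags) if not ok)  # confidence when wrong
    return (cfc - cfw) / stdev(confidences)

ratings = [6, 5, 4, 2, 5, 3]                       # six-point Likert confidence ratings
correct = [True, True, False, False, True, False]  # correctness of the corresponding answers
print(f"CDQ = {cdq(ratings, correct):.2f}")        # positive values indicate good calibration
```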
The intervention strategies proposed by the authors were categorized into 11 distinct types. Among them, recommendations involving hands-on and experimental activities were mentioned most frequently (n = 22) [28,51,61], followed by those emphasizing real-life connections (n = 11) [27,48,52] and the use of concept maps or visual models (n = 10) [47,50,57]. Experimental and practical activities were the most prevalent, likely due to their ability to engage students directly and facilitate connections between abstract concepts and real-world phenomena. This is especially effective in natural sciences, where experiments provide tangible validation of knowledge, enhancing comprehension and retention. For instance, Irmak [60] suggested using experiments and real-life examples to improve students’ understanding of energy conversion and conservation. While researchers have proposed targeted strategies for conceptual change, empirical validation remains limited. For example, aside from Coştu et al. [54], who examined misconceptions related to boiling, most studies have not implemented practical interventions. This is likely due to the fact that conceptual change often requires long-term monitoring and intervention, which can be difficult to carry out within typical study durations. Additionally, designing rigorous experimental controls, collecting data, and analyzing intervention effects are complex processes, which may further impede research execution. Moreover, educational research often relies on real classroom environments, where teachers and institutions may be reluctant to alter established teaching methods or accommodate experimental interventions. This reluctance further limits large-scale practical validation.

3.2.5. Representational Effectiveness of MTDTs in Assessing Conceptual Understanding

Figure 3 shows the distribution of the 28 selected studies across various MTDTs. The majority of the studies used two-tier diagnostic technologies, while only one study employed traditional multiple-choice tests.
Table 6 compares how two-tier diagnostic technology and traditional multiple-choice tests categorize responses to assess conceptual understanding. Two-tier diagnostic technology categorizes responses based on both content and reason tiers. When both tiers are incorrect, the response is categorized as a misconception [45,61]. In contrast, traditional multiple-choice tests identify misconceptions solely based on the correctness of the answer, disregarding reasoning; a wrong answer, regardless of reasoning, is treated as a misconception [26].
Table 7 illustrates the three-tier diagnostic technology, which adds a confidence tier (the third tier) [47] to assess certainty on top of the existing two-tier system. For example, when both the content and reason tiers are incorrect and the confidence tier indicates certainty, the response is classified as a misconception [29,48]. This tier further refines the diagnostic process by incorporating students’ confidence ratings, thereby deepening the understanding of their misconceptions and yielding more precise results.
Table 8 presents the four-tier diagnostic technology, which expands on the previous framework by adding a confidence rating for both the content and reason tiers, so that responses are classified after evaluating the answer and the confidence rating at each tier [55]. This modification enhances the specificity and clarity of the results. When both the content and reason tiers are incorrect, and both confidence tiers are marked as certain, the response is classified as a misconception [25,50].
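To make the tier logic concrete, the sketch below encodes the four-tier decision rule described above. Only the misconception rule follows the text directly; the remaining labels are illustrative placeholders, since the full set of 16 combinations is categorized differently across studies.

```python
# Sketch of a four-tier classification rule. The misconception branch follows the text;
# the other labels are illustrative placeholders for the remaining combinations.

def classify_four_tier(content_ok: bool, content_sure: bool,
                       reason_ok: bool, reason_sure: bool) -> str:
    if not content_ok and not reason_ok and content_sure and reason_sure:
        return "misconception"                     # wrong on both tiers, confident on both
    if content_ok and reason_ok and content_sure and reason_sure:
        return "scientific understanding"          # illustrative label
    if not (content_sure and reason_sure):
        return "lack of knowledge / lucky guess"   # illustrative label
    return "partial understanding"                 # illustrative label

print(classify_four_tier(False, True, False, True))  # -> misconception
print(classify_four_tier(True, True, True, True))    # -> scientific understanding
```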
Table 6, Table 7 and Table 8 present the performances of the various MTDTs in assessing students’ conceptual understanding. The four-tier diagnostic technology uses 16 classification combinations to accurately reflect the depth of students’ understanding of scientific concepts [62]. Each combination reveals both whether students have misconceptions and the specific level of their understanding. In contrast, the three-tier diagnostic technology is limited to just eight classification combinations. While it captures students’ core understanding, it lacks the detailed, multi-dimensional analysis provided by the four-tier system. The two-tier diagnostic technology offers a simplified evaluation, classifying results into categories like understanding, partial understanding, and misconceptions [54]. This classification is similar to traditional multiple-choice tests, limiting the in-depth exploration of students’ cognitive processes. These differences highlight the varying precision levels of diagnostic technologies in uncovering students’ cognitive biases and misconceptions. Although these quantitative indicators may obscure the complex cognitive activities students exhibit during the answering process [63], MTDTs offer a fast and effective technological approach, particularly in the context of China’s large student population with varying cognitive levels.

4. Discussion

The following section will discuss the findings from the practical application of MTDTs, compare them with the existing literature, identify the limitations of current studies, and offer recommendations for future research.

4.1. Are the Development and Overall Reliability and Validity of MTDTs Scientifically Sound?

The theoretical framework, design of distractors, acquisition of conceptual statements, definition of the research scope, examination of Cronbach’s alpha, expert reviews, and factor analysis, along with other supporting evidence, collectively demonstrate the scientific rigor in the development, reliability, and validity testing of MTDTs. The development of MTDTs generally follows Treagust’s procedural framework and relies on CTT to assess the instruments’ reliability and validity using scientific methods. Reliability tests, based on CTT, commonly use Cronbach’s alpha coefficient, reflecting the reliance on this traditional method to assess consistency. However, the literature suggests that Cronbach’s alpha does not perform well for criterion-referenced tests (CRTs), especially in diagnostic testing [64], where the main goal is to accurately differentiate respondents’ abilities. Therefore, relying solely on Cronbach’s alpha to test internal consistency does not effectively reflect the diagnostic power of the instrument [65]. While Cronbach’s alpha is important in general psychometrics, its application in diagnostic tests is limited and may not offer a comprehensive assessment of reliability and validity. Some studies have applied item response theory (IRT) and the Rasch model to assess the reliability of hierarchical diagnostic instruments, demonstrating better performance than traditional CTT-based methods [62]. The Rasch model provides high precision in evaluating reliability by separately assessing respondents’ abilities and item difficulty, yielding more reliable results through consistency checks and error analysis. Furthermore, the Rasch model helps assess the fairness and adaptability of the instruments, ensuring their applicability across various populations [66].
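As a brief illustration of the Rasch approach referenced here, the sketch below shows the dichotomous Rasch model’s response probability, in which person ability and item difficulty are modeled separately; the parameter values are illustrative, and real studies estimate them from response data, typically with dedicated software.

```python
# Sketch of the dichotomous Rasch model: the probability of a correct response depends
# only on the gap between person ability (theta) and item difficulty (b). Values are illustrative.
import math

def rasch_p_correct(theta: float, difficulty: float) -> float:
    """P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

for theta in (-1.0, 0.0, 1.5):
    print(f"ability {theta:+.1f}: P(correct | item difficulty 0.5) = {rasch_p_correct(theta, 0.5):.2f}")
```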
Regarding validity testing, researchers most commonly rely on expert reviews, factor analysis, and Pearson correlation methods. In comparison to these traditional methods, the Rasch model provides a more systematic, data-driven approach to validity testing. Studies show that the Rasch model can precisely reveal the effectiveness of measurement instruments by quantifying the interaction between each item and respondent [67]. Within the Rasch framework, validity testing incorporates both expert theoretical evaluations and analysis of large sample data to assess whether the instrument can effectively distinguish ability levels across groups, thus reducing subjective biases inherent in expert reviews. In contrast, expert review relies heavily on domain experts’ judgment, which can be influenced by personal biases, potentially distorting the validity assessment [68]. The Rasch model mitigates these biases by offering a standardized evaluation system for item and ability estimation through statistical methods and model calibration, ensuring a more objective, transparent, and consistent validity test [69].

4.2. Which Scientific Discipline Has the Widest Application of MTDTs?

The results show that the MTDTs primarily focus on physics and biology concepts, with less emphasis on geography. This aligns with findings from other studies that also highlight differences in the occurrence of misconceptions across scientific disciplines. Kaltakci Gurel et al. [70] found that physics and biology are the most common subjects for misconceptions, particularly in concepts like optics, momentum, and biological adaptation, which supports our findings. This is because physics and biology involve more complex and abstract concepts, which lead to a higher frequency of misconceptions. Consequently, MTDTs have been more widely applied in these fields. However, Soeharto et al. [71] observed that while geography has fewer misconceptions, certain concepts, such as global warming and acid rain, still present challenges [48]. Nevertheless, the use of MTDTs in geography is less frequent than in physics and biology. This finding does not fully align with our conclusions, indicating a lower demand for diagnosing misconceptions in geography. Another possible reason is that geography concepts are often tied to real-world phenomena like terrain, climate, and resources. This connection helps students deepen their understanding through observation and hands-on experience, which reduces misconceptions.

4.3. Application of MTDTs in Different Educational Stages (K12 vs. Higher Education)

The systematic review shows that MTDTs are predominantly applied in senior high school and higher education, with minimal use at the elementary school level. According to Soeharto et al. [71], senior high school and university students are more likely to develop misconceptions when learning complex and abstract scientific concepts, requiring more accurate diagnostic instruments for identification and correction. MTDTs, with their multi-layered diagnostic approach, effectively identify students’ misconceptions and assist educators in providing personalized guidance, leading to greater demand at these educational stages.
Additionally, Kaltakci Gurel et al. [70] found that MTDTs are particularly effective in uncovering deep misconceptions among senior high school and university students, especially in subjects like physics and biology. These students require more detailed feedback and guidance when dealing with complex scientific concepts. In contrast, elementary school students primarily learn more intuitive basic science concepts, and their misconceptions are generally less complex [72]. Consequently, although MTDTs offer strong diagnostic capabilities, elementary school teachers often rely on traditional assessment methods, such as multiple-choice questions and oral quizzes, resulting in less frequent use of MTDTs [73].

4.4. Can MTDTs Be Applied on a Large Scale in Online Settings?

The results suggest that MTDTs are transitioning to large-scale online applications. With ongoing technological advancements, education research is likely to adopt online platforms for large-scale diagnostic testing more widely [74]. Online MTDTs can reach a broader student population, and online platforms, compared with traditional paper-based tests, enable more efficient data collection and analysis, thus improving diagnostic accuracy and efficiency [8]. Moreover, online delivery not only overcomes the limitations of traditional diagnostic instruments in large-scale implementation but also facilitates their global application. This is especially evident in physics and biology, where the proliferation of online diagnostic instruments has enhanced their availability and usability [75]. Support for this shift stems from the ability of online platforms to deliver multi-layered diagnostic tests while overcoming geographical and resource limitations, making large-scale use feasible. By leveraging cloud computing and data analytics, online systems can process large volumes of data, quickly identify students’ misconceptions, and offer substantial advantages in real-time feedback and personalized teaching [8]. Furthermore, online platforms can dynamically adjust the difficulty and content of questions based on real-time student performance, enhancing the accuracy and relevance of the tests [76].
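The adaptive behavior described above can be sketched very simply: after each response, the platform updates a running ability estimate and serves the unanswered item whose difficulty is closest to that estimate. The sketch below is a minimal illustration under assumed item parameters (the item bank, step size, and function names are hypothetical), not a description of any existing platform.

```python
def next_item(ability: float, item_difficulties: dict, answered: set) -> str:
    """Pick the unanswered item whose difficulty best matches the current ability."""
    candidates = {k: d for k, d in item_difficulties.items() if k not in answered}
    return min(candidates, key=lambda k: abs(candidates[k] - ability))

def update_ability(ability: float, correct: bool, step: float = 0.4) -> float:
    """Nudge the ability estimate up after a correct answer, down after an incorrect one."""
    return ability + step if correct else ability - step

# Hypothetical item bank (difficulties on a logit-like scale)
bank = {"Q1": -1.0, "Q2": -0.3, "Q3": 0.2, "Q4": 0.9, "Q5": 1.5}
ability, answered = 0.0, set()
for was_correct in [True, True, False]:          # simulated responses
    item = next_item(ability, bank, answered)
    answered.add(item)
    ability = update_ability(ability, was_correct)
    print(item, round(ability, 2))
```

Real adaptive systems typically replace the fixed step with an IRT-based ability update, but the selection principle is the same.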
However, this viewpoint is not universally accepted. Despite the theoretical advantages of MTDTs’ multi-layered diagnostic capabilities, their large-scale online application faces several challenges. MTDTs require complex item design and detailed analysis, which can encounter technical limitations and resource constraints during large-scale online use. Specifically, automatic scoring systems on online platforms may fail to fully understand students’ reasoning behind their answers, which can affect the accuracy of diagnosing misconceptions [77]. Moreover, the absence of immediate teacher feedback and personalized interventions on online platforms may prevent timely and effective correction of some misconceptions.

4.5. What Common Misconceptions Do MTDTs Reveal in Science Education, and Are the Teaching Strategies Targeting These Misconceptions Effective?

Research findings indicate that students across various educational stages exhibit numerous common misconceptions. However, the proposed interventions to address these misconceptions frequently lack effective validation. Pacaci [78] highlighted that although the identification of misconceptions has advanced, most interventions still lack scientific and systematic validation, meaning their effectiveness remains unverified. Similarly, Briggs [79] observed that although teachers and curriculum designers widely adopt targeted intervention methods, these interventions often lack long-term effectiveness tracking and empirical validation, preventing effective assessment of their sustainability and universality.
The insufficient validation of intervention effectiveness may arise from many strategies being confined to theoretical construction and practical suggestions, rather than undergoing rigorous evaluations of their actual application. In education, much research on misconceptions focuses more on diagnosing misconceptions than on verifying the effectiveness of intervention strategies. This leads to a lack of empirical evidence on how to eliminate these misconceptions [71]. Consequently, although research shows that misconceptions are widespread among students, interventions to address these issues often lack robust validation of their effectiveness.

4.6. Which MTDT Is the Most Effective in Assessing Student Conceptual Understanding?

This systematic review finds that four-tier diagnostic technology is the most accurate instrument for identifying misconceptions, primarily because of its additional tiers. This conclusion is consistent with existing research. Kaltakci-Gurel [25] noted that four-tier diagnostic technology, which incorporates content, reason, and confidence tiers, offers a more comprehensive picture of students’ cognitive misunderstandings and provides more precise results than single- or two-tier instruments. Kiray [31] highlighted that four-tier diagnostic instruments are better at capturing the depth of students’ understanding of complex concepts, especially when addressing higher-order cognitive skills, thus helping teachers identify misconceptions effectively. However, some studies have raised concerns about the universal applicability of four-tier diagnostic technology, pointing out that, despite its theoretical advantages, its high technical requirements and complexity may hinder effective implementation in resource-limited educational environments. In such contexts, they argue, simpler two-tier or single-tier diagnostic instruments can still be effective [71]. Therefore, while four-tier diagnostic technology provides more precise results, its practical application faces technical and resource-related challenges.
However, although four-tier diagnostic technology performs well, it does not identify the origin of misconceptions, limiting its ability to explain their formation mechanisms. Studies indicate that while multiple-choice questions can assess students’ knowledge and cognitive skills, they may not provide sufficient evidence to fully understand students’ cognitive structures [9]. In other words, many technologies, such as the four-tier diagnostic instrument, can assess conceptual understanding but cannot explain how these concepts are formed. A new and promising technology, five-tier diagnostic technology, is being explored to address this issue. This technology has been discussed at international academic conferences and provides direct evidence of the origins of misconceptions [62].
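The diagnostic logic that makes the four-tier format precise can be stated compactly: uncertainty in either confidence tier signals a knowledge gap, and only confident answer–reason combinations are classified as scientific concept, false positive, false negative, or misconception (see Table 8). The Python sketch below restates that decision rule; the function name and inputs are illustrative rather than part of any published instrument.

```python
def classify_four_tier(content_correct: bool, content_confident: bool,
                       reason_correct: bool, reason_confident: bool) -> str:
    """Classify one four-tier response following the scheme summarized in Table 8."""
    if not (content_confident and reason_confident):
        return "Knowledge gap"          # any uncertainty -> lack of knowledge, not misconception
    if content_correct and reason_correct:
        return "Scientific concept"
    if content_correct and not reason_correct:
        return "False positive"
    if not content_correct and reason_correct:
        return "False negative"
    return "Misconception"              # confidently wrong on both the answer and the reason

# Example: a confident wrong answer justified by a confident wrong reason
print(classify_four_tier(False, True, False, True))  # -> Misconception
```

The same rule, with the confidence checks removed, yields the simpler two- and three-tier classifications summarized in Tables 6 and 7.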

4.7. Future Research Directions and Recommendations

Future research should adopt a sustainable development perspective to optimize and expand the use of MTDTs. While Cronbach’s alpha has been used in traditional reliability testing, its limitations in distinguishing ability levels highlight the need for more precise assessment instruments. The Rasch model, based on item response theory (IRT), provides more accurate and objective measurements, reducing subjective biases in traditional methods. Thus, future MTDT development should prioritize adopting more precise methods, such as the Rasch model, to address the growing diversity of educational needs. Additionally, the rapid growth in online education presents an opportunity to use these instruments in large-scale assessments, enhancing their scientific rigor and promoting global educational equity and quality. This would optimize the sustainable allocation of educational resources, aligning with global sustainable education goals [80].
Although the four-tier diagnostic technology is one of the most accurate instruments for assessing students’ conceptual understanding, its high technical demands and complexity limit its use in resource-constrained environments. To promote its sustainable development, strategies should be explored to reduce implementation costs and technical challenges, ensuring its applicability in various educational settings. While the four-tier diagnostic technology comprehensively assesses misconceptions, it does not fully uncover their origins, limiting our understanding of students’ cognitive structures and the mechanisms behind misconceptions. Future research should focus on exploring the potential of five-tier diagnostic technology, especially how it can analyze the origins and formation of misconceptions, offering more precise diagnostic support.
In science education, MTDTs should be more extensively applied in physics and biology, as these subjects involve more complex misconceptions. MTDTs effectively identify and correct misconceptions, providing data-driven insights to optimize teaching strategies. Though geography has fewer misconceptions, key concepts like global warming still require attention. To achieve sustainable development in science education, efforts should focus on promoting MTDTs across disciplines, particularly in subjects like geography. This will improve students’ scientific understanding, enhance education quality, and promote equity.
As education shifts toward more personalized approaches, we should explore how to integrate MTDTs across various educational stages to align with sustainable development goals. In high school and higher education, the complexity of scientific learning and the increased likelihood of misconceptions make MTDTs especially valuable. These instruments help educators identify and correct misconceptions, improving education quality, fostering critical thinking, and enhancing innovation, supporting the sustainable development goal of “quality education”. Efforts should promote the widespread use of MTDTs, integrating them with digital resources to enhance accessibility and equity, particularly in remote areas with limited resources. For elementary education, simplified versions of MTDTs should be gradually introduced to foster scientific thinking as students engage with foundational concepts.
MTDTs have great potential for large-scale online applications, improving the efficiency of diagnosing misconceptions and reaching a broader student base. However, challenges in technology and implementation require sustainable development strategies. Strengthening technical support for online platforms is crucial for handling large-scale data and improving automated scoring accuracy, addressing the limitations in understanding students’ reasoning. While online platforms efficiently collect and analyze data, the lack of immediate teacher feedback and personalized interventions remains a major barrier. Future development should bridge this gap with intelligent tutoring systems and personalized learning paths, offering individualized support to transform students’ conceptual understanding. To further promote MTDTs globally, especially in resource-limited areas, international cooperation must be strengthened. Leveraging open educational resources and cross-border cloud platforms for sharing technology and data will ensure equitable access to education and sustainable quality improvement. These measures will enhance MTDTs’ feasibility in large-scale online applications and provide technical support for the sustainable development of global education.
The ultimate goal of concept research is the transformation of misconceptions, and sustainable development should prioritize the scientific and systematic validation of intervention strategies for misconceptions. While existing research reveals widespread misconceptions across various educational stages, most intervention strategies lack empirical evidence and long-term tracking, hindering the assessment of their sustainability and generalizability. Future research should focus more on evaluating the effectiveness of intervention strategies, particularly their application in real-world teaching environments, to promote sustainable development in education [81]. Long-term longitudinal studies, combined with dynamic feedback from educational practice, are needed to assess the impact of interventions on misconceptions and refine teaching strategies. Educators and textbook designers should collaborate to create scientifically grounded, practical intervention plans that align with real teaching processes and are applicable across diverse educational contexts. Only continuous and systematic evaluation can ensure the sustainability of intervention measures and provide long-term support for students’ scientific learning.

5. Conclusions

This systematic review examines the development and application of various MTDTs in science education at both the K-12 and higher education levels. Key issues are summarized in the tables, including validation of development, reliability, and validity; representation of conceptual understanding; implementation methods; subject areas; and the misconceptions revealed by MTDTs, along with intervention recommendations. The findings underscore the high quality of the reviewed MTDTs, their crucial role in uncovering misconceptions, and their practical applications and outcomes across educational stages and subjects. These findings are particularly relevant for educators, researchers, and policymakers seeking reliable instruments for diagnosing and addressing misconceptions.
The future directions aim to promote the sustainable development of MTDTs in the new era, as improper development and implementation could threaten sustainable growth in students’ scientific literacy.
An additional academic and practical contribution of this work is the provision of a theoretical foundation for transforming misconceptions. It compiles nearly 40 years of common misconceptions, creating a repository. Frontline teachers and policymakers can easily extract misconceptions from this repository, offering valuable insights for developing teaching strategies and shaping educational policies.
In conclusion, this review effectively meets its goals, answers the proposed research questions, and provides valuable insights, but it also exposes certain limitations. These limitations should be viewed as a foundation for future growth and sustainable development.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/su17073145/s1, Table S1: Raw data for the classification of MTDTs development; Table S2: Original data for quality evaluation of MTDTs; Table S3: Original data for applicant of MTDTs; Table S4: Original data on misconceptions identified by MTDTs and corresponding intervention strategies.

Author Contributions

H.M. wrote the first draft of the manuscript. H.Y., C.L., S.M. and G.L. reviewed and revised the manuscript, providing valuable feedback and contributing to the refinement of the content. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the 2022 Key Project of the Ministry of Education under the National Education Sciences “14th Five-Year Plan” Research Program, “Evaluation Research on Middle School Students’ Scientific Reasoning Ability” (Grant No. DHA220396).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

  1. European Commission. Europe 2020: A Strategy for Smart, Sustainable and Inclusive Growth: Communication from the Commission; Publications Office of the European Union: Luxembourg, 2010. [Google Scholar]
  2. The State Council of the People’s Republic of China. Outline of Action Plan for National Scientific Literacy (2021–2035); The State Council of the People’s Republic of China: Beijing, China, 2021.
  3. United Nations. Transforming Our World: The 2030 Agenda for Sustainable Development; United Nations: New York, NY, USA, 2015; Available online: https://sdgs.un.org/2030agenda (accessed on 10 July 2024).
  4. NGSS Lead States. Next Generation Science Standards: For States, by States; The National Academy Press: Washington, DC, USA, 2013. [Google Scholar] [CrossRef]
  5. Giner-Baixauli, A.; Corbí, H.; Mayoral, O. Exploring the Intersection of Paleontology and Sustainability: Enhancing Scientific Literacy in Spanish Secondary School Students. Sustainability 2024, 16, 5890. [Google Scholar] [CrossRef]
  6. Guevara-Herrero, I.; Bravo-Torija, B.; Pérez-Martín, J.M. Educational Practice in Education for Environmental Justice: A Systematic Review of the Literature. Sustainability 2024, 16, 2805. [Google Scholar] [CrossRef]
  7. The State Council of the People’s Republic of China. China’s National Science Literacy Action Plan (2021–2035). 2021. Available online: https://www.gov.cn/zhengce/content/2021-06/25/content_5620813.htm (accessed on 22 June 2024).
  8. Mohd Saat, R.; Mohd Fadzil, H.; Aziz, N.A.A.; Haron, K.; Rashid, K.; Shamsuar, N. Development of an online three-tier diagnostic test to assess pre-university students’ understanding of cellular respiration. J. Balt. Sci. Educ. 2016, 15, 532–546. [Google Scholar] [CrossRef]
  9. Kao, H.L. A Study of Aboriginal and Urban Junior High School Students’ Alternative Conceptions on the Definition of Respiration. Int. J. Sci. Educ. 2007, 29, 517–533. [Google Scholar] [CrossRef]
  10. Haslam, F.; Treagust, D.F. Diagnosing secondary students’ misconceptions of photosynthesis and respiration in plants using a two-tier multiple choice instrument. J. Biol. Educ. 1987, 21, 203–211. [Google Scholar] [CrossRef]
  11. Treagust, D.F. Development and use of diagnostic tests to evaluate students’ misconceptions in science. Int. J. Sci. Educ. 1988, 10, 159–169. [Google Scholar] [CrossRef]
  12. Viehmann, C.; Fernández Cárdenas, J.M.; Reynaga Peña, C.G. The Use of Socioscientific Issues in Science Lessons: A Scoping Review. Sustainability 2024, 16, 5827. [Google Scholar] [CrossRef]
  13. Cairns, D.; Areepattamannil, S. Exploring the Relations of Inquiry-Based Teaching to Science Achievement and Dispositions in 54 Countries. Res. Sci. Educ. 2017, 49, 1–23. [Google Scholar] [CrossRef]
  14. Bao, L.; Koenig, K. Physics education research for 21st century learning. Discip. Interdiscip. Sci. Educ. Res. 2019, 1, 2. [Google Scholar] [CrossRef]
  15. Zhou, S.-N.; Han, J.; Koenig, K.; Raplinger, A.; Pi, Y.; Li, D.; Xiao, H.; Fu, Z.; Bao, L. Assessment of Scientific Reasoning: The Effects of Task Context, Data, and Design on Student Reasoning in Control of Variables. Think. Ski. Creat. 2016, 19, 175–187. [Google Scholar] [CrossRef]
  16. Omer, L. Successful Scientific Instruction Involves More Than Just Discovering Concepts through Inquiry-Based Activities. Education 2002, 123, 318. [Google Scholar]
  17. Marx, R.; Blumenfeld, P.; Krajcik, J.; Fishman, B.; Soloway, E.; Geier, R.; Tal, R. Inquiry-based science in the middle grades: Assessment of learning in urban systemic reform. J. Res. Sci. Teach. 2004, 41, 1063–1080. [Google Scholar] [CrossRef]
  18. Fernández-Huetos, N.; Pérez-Martín, J.M.; Guevara-Herrero, I.; Esquivel-Martín, T. Primary-Education Students’ Performance in Arguing About a Socioscientific Issue: The Case of Pharmaceuticals in Surface Water. Sustainability 2025, 17, 1618. [Google Scholar] [CrossRef]
  19. Maillard, O.; Michme, G.; Azurduy, H.; Vides-Almonacid, R. Citizen Science for Environmental Monitoring in the Eastern Region of Bolivia. Sustainability 2024, 16, 2333. [Google Scholar] [CrossRef]
  20. Kioupi, V.; Voulvoulis, N. Education for Sustainable Development: A Systemic Framework for Connecting the SDGs to Educational Outcomes. Sustainability 2019, 11, 6104. [Google Scholar] [CrossRef]
  21. Pozuelo-Muñoz, J.; de Echave Sanz, A.; Cascarosa Salillas, E. Inquiring in the Science Classroom by PBL: A Design-Based Research Study. Educ. Sci. 2025, 15, 53. [Google Scholar] [CrossRef]
  22. Chen, S.M. Shadows: Young Taiwanese children’s views and understanding. Int. J. Sci. Educ. 2009, 31, 59–79. [Google Scholar] [CrossRef]
  23. Novak, J.D. Concept mapping: A tool for improving science teaching and learning. Improv. Teach. Learn. Sci. Math. 1996, 32–43. [Google Scholar]
  24. Langley, D.; Ronen, M.; Eylon, B.-S. Light propagation and visual patterns: Preinstruction learners’ conceptions. J. Res. Sci. Teach. 1997, 34, 399–424. [Google Scholar] [CrossRef]
  25. Kaltakci-Gurel, D.; Eryilmaz, A.; McDermott, L.C. Development and application of a four-tier test to assess pre-service physics teachers’ misconceptions about geometrical optics. Res. Sci. Technol. Educ. 2017, 35, 238–260. [Google Scholar] [CrossRef]
  26. Canal, P. Photosynthesis and ’inverse respiration’ in plants: An inevitable misconception? Int. J. Sci. Educ. 1999, 21, 363–371. [Google Scholar] [CrossRef]
  27. Gönen, S. A Study on Student Teachers’ Misconceptions and Scientifically Acceptable Conceptions About Mass and Gravity. J. Sci. Educ. Technol. 2008, 17, 70–81. [Google Scholar] [CrossRef]
  28. Treagust, D. Evaluating students’ misconceptions by means of diagnostic multiple choice items. Res. Sci. Educ. 1986, 16, 199–207. [Google Scholar] [CrossRef]
  29. Peşman, H.; Eryılmaz, A. Development of a Three-Tier Test to Assess Misconceptions About Simple Electric Circuits. J. Educ. Res. 2010, 103, 208–222. [Google Scholar] [CrossRef]
  30. Aydeniz, M.; Bilican, K.; Kirbulut, Z.D. Exploring Pre-Service Elementary Science Teachers’ Conceptual Understanding of Particulate Nature of Matter through Three-Tier Diagnostic Test. Int. J. Educ. Math. Sci. Technol. 2017, 5, 221–234. [Google Scholar] [CrossRef]
  31. Kiray, S.A.; Simsek, S. Determination and Evaluation of the Science Teacher Candidates’ Misconceptions About Density by Using Four-Tier Diagnostic Test. Int. J. Sci. Math. Educ. 2021, 19, 935–955. [Google Scholar] [CrossRef]
  32. Guerra-Reyes, F.; Guerra-Dávila, E.; Naranjo-Toro, M.; Basantes-Andrade, A.; Guevara-Betancourt, S. Misconceptions in the Learning of Natural Sciences: A Systematic Review. Educ. Sci. 2024, 14, 497. [Google Scholar] [CrossRef]
  33. Mwangi, S.W. Effects of the Use of Computer Animated Loci Teaching Technique on Secondary School Students’ Achievement and Misconceptions in Mathematics Within Kitui County, Kenya. Ph.D. Thesis, Egerton University, Njoro, Kenya, 2019. Available online: http://41.89.96.81:4000/items/4fe69378-ff6d-4d8e-973d-cf44e39611a7 (accessed on 13 July 2024).
  34. Menz, C.; Spinath, B.; Seifried, E. Misconceptions die hard: Prevalence and reduction of wrong beliefs in topics from educational psychology among preservice teachers. Eur. J. Psychol. Educ. 2020, 36, 477–494. [Google Scholar] [CrossRef]
  35. Ruiz-Gallardo, J.; Reavey, D. Learning Science Concepts by Teaching Peers in a Cooperative Environment: A Longitudinal Study of Preservice Teachers. J. Learn. Sci. 2018, 28, 107–173. [Google Scholar] [CrossRef]
  36. Diani, R.; Alfin, J.; Anggraeni, Y.M.; Mustari, M.; Fujiani, D. Four-Tier Diagnostic Test With Certainty of Response Index on The Concepts of Fluid. J. Phys. Conf. Ser. 2019, 1155, 012078. [Google Scholar] [CrossRef]
  37. Miller, D.M.; Scott, C.E.; McTigue, E.M. Writing in the Secondary-Level Disciplines: A Systematic Review of Context, Cognition, and Content. Educ. Psychol. Rev. 2018, 30, 83–120. [Google Scholar] [CrossRef]
  38. Scott, C.E.; McTigue, E.M.; Miller, D.M.; Washburn, E.K. The what, when, and how of preservice teachers and literacy across the disciplines: A systematic literature review of nearly 50 years of research. Teach. Teach. Educ. 2018, 73, 1–13. [Google Scholar] [CrossRef]
  39. Díaz-Burgos, A.; García-Sánchez, J.-N.; Álvarez-Fernández, M.L.; de Brito-Costa, S.M. Psychological and Educational Factors of Digital Competence Optimization Interventions Pre- and Post-COVID-19 Lockdown: A Systematic Review. Sustainability 2024, 16, 51. [Google Scholar] [CrossRef]
  40. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. Declaración PRISMA 2020: Una guía actualizada para la publicación de revisiones sistemáticas. Rev. Esp. Cardiol. 2021, 74, 790–799. [Google Scholar] [CrossRef] [PubMed]
  41. Risko, V.J.; Roller, C.M.; Cummins, C.; Bean, R.M.; Block, C.C.; Anders, P.L.; Flood, J. A Critical Analysis of Research on Reading Teacher Education. Read. Res. Q. 2008, 43, 252–288. [Google Scholar] [CrossRef]
  42. Bean, T.W. Preservice Teachers’ Selection and Use of Content Area Literacy Strategies. J. Educ. Res. 1997, 90, 154–163. [Google Scholar] [CrossRef]
  43. Nourie, B.L.; Lenski, S.D. The (In)Effectiveness of Content Area Literacy Instruction for Secondary Preservice Teachers. Clear. House A J. Educ. Strateg. Issues Ideas 1998, 71, 372–374. [Google Scholar] [CrossRef]
  44. Wang, J.-R. Development and Validation of a Two-Tier Instrument to Examine Understanding of Internal Transport in Plants and the Human Circulatory System. Int. J. Sci. Math. Educ. 2004, 2, 131–157. [Google Scholar] [CrossRef]
  45. Sesli, E.; Kara, Y. Development and application of a two-tier multiple-choice diagnostic test for high school students’ understanding of cell division and reproduction. J. Biol. Educ. 2012, 46, 214–225. [Google Scholar] [CrossRef]
  46. Taslidere, E. Development and use of a three-tier diagnostic test to assess high school students’ misconceptions about the photoelectric effect. Res. Sci. Technol. Educ. 2016, 34, 164–186. [Google Scholar] [CrossRef]
  47. Caleon, I.S.; Subramaniam, R. Development and Application of a Three-Tier Diagnostic Test to Assess Secondary Students’ Understanding of Waves. Int. J. Sci. Educ. 2010, 32, 939–961. [Google Scholar] [CrossRef]
  48. Arslan, H.O.; Cigdemoglu, C.; Moseley, C. A Three-Tier Diagnostic Test to Assess Pre-Service Teachers’ Misconceptions about Global Warming, Greenhouse Effect, Ozone Layer Depletion, and Acid Rain. Int. J. Sci. Educ. 2012, 34, 1667–1686. [Google Scholar] [CrossRef]
  49. Gurcay, D.; Gulbas, E. Development of three-tier heat, temperature and internal energy diagnostic test. Res. Sci. Technol. Educ. 2015, 33, 197–217. [Google Scholar] [CrossRef]
  50. Yan, Y.K.; Subramaniam, R. Using a multi-tier diagnostic test to explore the nature of students’ alternative conceptions on reaction kinetics. Chem. Educ. Res. Pract. 2018, 19, 213–226. [Google Scholar] [CrossRef]
  51. Lin, S.-W. Development and Application of a Two-Tier Diagnostic Test for High School Students’ Understanding of Flowering Plant Growth and Development. Int. J. Sci. Math. Educ. 2004, 2, 175–199. [Google Scholar] [CrossRef]
  52. Yeo, J.-H.; Yang, H.-H.; Cho, I. Using a Three-Tier Multiple-Choice Diagnostic Instrument toward Alternative Conceptions among Lower-Secondary School Students in Taiwan: Taking Ecosystems Unit as an Example. J. Balt. Sci. Educ. 2022, 21, 69–83. [Google Scholar] [CrossRef]
  53. Zhao, C.; Zhang, S.; Cui, H.; Hu, W.; Dai, G. Middle school students’ alternative conceptions about the human blood circulatory system using four-tier multiple-choice tests. J. Biol. Educ. 2021, 57, 51–67. [Google Scholar] [CrossRef]
  54. Coştu, B.; Ayas, A.; Niaz, M.; Ünal, S.; Çalik, M. Facilitating Conceptual Change in Students’ Understanding of Boiling Concept. J. Sci. Educ. Technol. 2007, 16, 524–536. [Google Scholar] [CrossRef]
  55. Caleon, I.S.; Subramaniam, R. Do Students Know What They Know and What They Don’t Know? Using a Four-Tier Diagnostic Test to Assess the Nature of Students’ Alternative Conceptions. Res. Sci. Educ. 2010, 40, 313–337. [Google Scholar] [CrossRef]
  56. Putica, K.B. Development and Validation of a Four-Tier Test for the Assessment of Secondary School Students’ Conceptual Understanding of Amino Acids, Proteins, and Enzymes. Res. Sci. Educ. 2023, 53, 651–668. [Google Scholar] [CrossRef]
  57. Odom, A.L.; Barrow, L.H. Development and application of a two-tier diagnostic test measuring college biology students’ understanding of diffusion and osmosis after a course of instruction. J. Res. Sci. Teach. 1995, 32, 45–61. [Google Scholar] [CrossRef]
  58. Milenković, D.D.; Hrin, T.N.; Segedinac, M.D.; Horvat, S. Development of a Three-Tier Test as a Valid Diagnostic Tool for Identification of Misconceptions Related to Carbohydrates. J. Chem. Educ. 2016, 93, 1514–1520. [Google Scholar] [CrossRef]
  59. Yang, W.; Chan, A.; Gagarina, N. Editorial: Remote online language assessment: Eliciting discourse from children and adults. Front. Commun. 2024, 9, 1508448. [Google Scholar] [CrossRef]
  60. Irmak, M.; İnaltun, H.; Ercan-Dursun, J.; Yaniş-Kelleci, H.; Yürük, N. Development and Application of a Three-Tier Diagnostic Test to Assess Pre-service Science Teachers’ Understanding on Work-Power and Energy Concepts. Int. J. Sci. Math. Educ. 2023, 21, 159–185. [Google Scholar] [CrossRef]
  61. Milenković, D.D.; Segedinac, M.D.; Hrin, T.N. Increasing High School Students’ Chemistry Performance and Reducing Cognitive Load through an Instructional Strategy Based on the Interaction of Multiple Levels of Knowledge Representation. J. Chem. Educ. 2014, 91, 1409–1416. [Google Scholar] [CrossRef]
  62. Putra, A.S.U.; Hamidah, I.; Nahadi. The development of five-tier diagnostic test to identify misconceptions and causes of students’ misconceptions in waves and optics materials. J. Phys. Conf. Ser. 2020, 1521, 022020. [Google Scholar] [CrossRef]
  63. Rodrigues, H.; Jesús, A.; Lamb, R.; Choi, I.; Owens, T. Unravelling Student Learning: Exploring Nonlinear Dynamics in Science Education. Int. J. Psychol. Neurosci. 2023, 9, 118–137. [Google Scholar] [CrossRef]
  64. Sawaki, Y. Norm-referenced vs. criterion-referenced approach to assessment. In Handbook of Second Language Assessment; Tsagari, D., Banerjee, J., Eds.; De Gruyter Mouton: Berlin, Germany, 2016; pp. 45–60. [Google Scholar]
  65. Kumar, R.V. Cronbach’s Alpha: Genesis, Issues and Alternatives. IMIB J. Innov. Manag. 2024, 1, 17. [Google Scholar] [CrossRef]
  66. Avinç, E.; Doğan, F. Digital literacy scale: Validity and reliability study with the rasch model. Educ. Inf. Technol. 2024, 29, 22895–22941. [Google Scholar] [CrossRef]
  67. Grau-Gonzalez, I.A.; Villalba-Garzon, J.A.; Torres-Cuellar, L.; Puerto-Rojas, E.M.; Ortega, L.A. A psychometric analysis of the Early Trauma Inventory-Short Form in Colombia: CTT and Rasch model. Child Abus. Negl. 2024, 149, 106689. [Google Scholar] [CrossRef]
  68. Nayak, A.; Khuntia, R. Development and Content Validation of a Measure to Assess the Parent-Child Social-emotional Reciprocity of Children with ASD. Indian J. Psychol. Med. 2024, 46, 66–71. [Google Scholar] [CrossRef] [PubMed]
  69. Ardianto, D.; Rubini, B.; Pursitasari, I. Assessing STEM career interest among secondary students: A Rasch model measurement analysis. Eurasia J. Math. Sci. Technol. Educ. 2023, 19, em2213. [Google Scholar] [CrossRef] [PubMed]
  70. Gurel, D.K.; Eryılmaz, A.; McDermott, L.C. A review and comparison of diagnostic instruments to identify students’ misconceptions in science. Eurasia J. Math. Sci. Technol. Educ. 2015, 11, 989–1008. [Google Scholar] [CrossRef]
  71. Soeharto, S.; Csapó, B.; Sarimanah, E.; Dewi, F.I.; Sabri, T. A Review of Students’ Common Misconceptions in Science and Their Diagnostic Assessment Tools. J. Pendidik. IPA Indones. 2019, 8, 247–266. [Google Scholar] [CrossRef]
  72. Özmen, K. Health Science Students’ Conceptual Understanding of Electricity: Misconception or Lack of Knowledge? Res. Sci. Educ. 2024, 54, 225–243. [Google Scholar] [CrossRef]
  73. Niaoustas, G. Primary School Teacher’s Views on the Purpose and Forms of Student Performance Assessment. Int. J. Elem. Educ. 2024, 8, 132–140. [Google Scholar] [CrossRef]
  74. Anwyl-Irvine, A.; Massonnié, J.; Flitton, A.; Kirkham, N.; Evershed, J. Gorilla in our midst: An online behavioral experiment builder. Behav. Res. Methods 2018, 52, 388–407. [Google Scholar] [CrossRef]
  75. Permatasari, G.A.; Ellianawati, E.; Hardyanto, W. Online web-based learning and assessment tool in vocational high school for physics. J. Penelit. Pengemb. Pendidik. Fis. 2019, 5, 1–8. [Google Scholar] [CrossRef]
  76. Das, A.; Malaviya, S. AI-Enabled Online Adaptive Learning Platform and Learner’s Performance: A Review of Literature. Empir. Econ. Lett. 2024, 23, 234. [Google Scholar] [CrossRef]
  77. Erickson, J.A.; Botelho, A.F.; McAteer, S.; Varatharaj, A.; Heffernan, N.T. The automated grading of student open responses in mathematics. In Proceedings of the Tenth International Conference on Learning Analytics & Knowledge, Frankfurt, Germany, 23–27 March 2020. [Google Scholar]
  78. Pacaci, C.; Ustun, U.; Ozdemir, O.F. Effectiveness of conceptual change strategies in science education: A meta-analysis. J. Res. Sci. Teach. 2024, 61, 1263–1325. [Google Scholar] [CrossRef]
  79. Briggs, A.G.; Morgan, S.K.; Sanderson, S.K.; Schulting, M.C.; Wieseman, L.J. Tracking the Resolution of Student Misconceptions about the Central Dogma of Molecular Biology. J. Microbiol. Biol. Educ. 2016, 17, 339–350. [Google Scholar] [CrossRef]
  80. Alismaiel, O.A. Using Structural Equation Modeling to Assess Online Learning Systems’ Educational Sustainability for University Students. Sustainability 2021, 13, 13565. [Google Scholar] [CrossRef]
  81. Okada, A.; Gray, P. A Climate Change and Sustainability Education Movement: Networks, Open Schooling, and the ‘CARE-KNOW-DO’ Framework. Sustainability 2023, 15, 2356. [Google Scholar] [CrossRef]
Figure 1. Diagram of search term clusters.
Figure 2. PRISMA flow diagram: details on the inclusion and exclusion of studies and the criteria at each stage are provided.
Figure 3. Distribution of the 28 included studies across different MTDTs. The different colors represent the various diagnostic technologies. The studies comprise 1 article on traditional multiple-choice tests (3.57%), 11 on two-tier diagnostic technology (39.29%), 10 on three-tier diagnostic technology (35.71%), and 6 on four-tier diagnostic technology (21.43%).
Table 1. Overall categorical system summary table.
First Classification | Secondary Classification | Definitions and Explanations
Basic Information | Author(s), publication year, region, participant scale | -
Development Process | Theoretical framework, confidence ratings, concept statement, concept scope, distractors, instrument proofreading, instrument number of items, type of instrument | Theoretical framework: the foundational theoretical principles guiding the development of diagnostic instruments. Concept statement: a scientific articulation of the main concept and its sub-concepts. Concept scope: the boundaries defining the main concept and its sub-concepts. Distractors: erroneous options based on potential misconceptions.
Instrument Quality Assessment | Reliability testing, validity testing, conclusion on effectiveness | -
Application of MTDTs | Implementation methods, educational stages, sample sizes, subject areas, theme concept | Theme concept: specific disciplines encompass key theme concepts that students need to understand, such as photosynthesis.
Misconceptions and Interventions | Identified misconceptions, types, psychological variables, intervention recommendations, validation of intervention recommendations | Psychological variables: these variables include the mean confidence (CF), mean confidence of correct responses (CFC), mean confidence of wrong responses (CFW), mean confidence of discrimination quotient (CDQ), and confidence bias (CB).
Representational Effectiveness of MTDTs | Representational effectiveness of traditional multiple-choice tests, two-tier diagnostic instruments, three-tier diagnostic instruments, and four-tier diagnostic instruments | Representational effectiveness: the degree of precision with which different tiers of MTDTs evaluate participants’ levels of conceptual understanding.
Table 2. Classification system for the development of MTDTs (the original tabular data can be referenced in Table S1).
First Classification | Secondary Classification | Tertiary Code (Specific Categories) | Frequency of Occurrence
Development Process | TF (Theoretical Framework) | DP1 Fully follows Treagust’s model | 2
 | | DP2 Partially follows Treagust’s model | 1
 | | DP3 Does not fully follow Treagust’s model | 24
 | | DP4 Not described | 1
 | SP (Source of Proposition Concepts Statements) | SP1 Science curriculum standards/syllabus | 18
 | | SP2 Textbooks | 12
 | | SP3 Literature review | 18
 | | SP4 Expert opinions or reviews | 8
 | | SP5 Teaching experience | 3
 | | SP6 Others | 4
 | CT (Scope of Conceptual Themes) | CT1 Defined by propositional knowledge statements | 4
 | | CT2 Concept map construction | 12
 | | CT3 Focus on core concepts | 9
 | | CT4 Hierarchical conceptual framework | 1
 | | CT5 Following curriculum requirements | 2
 | | CT6 Others | 3
 | AM (Acquisition of Potential Misconceptions) | AM1 Literature review | 25
 | | AM2 Classroom observation | 6
 | | AM3 Student interviews | 21
 | | AM4 Open-ended questions/tests | 15
 | | AM5 Classroom discussion and feedback | 4
 | | AM6 Expert feedback | 5
 | | AM7 Others | 2
 | CRs (Confidence Ratings) | CR1 No confidence scale involved | 12
 | | CR2 2-point scale | 7
 | | CR3 4-point scale | 2
 | | CR4 5-point scale | 1
 | | CR5 6-point scale | 5
 | | CR6 7-point scale | 1
 | IP (Instrument Proofreading) | IP1 Expert review feedback | 25
 | | IP2 Pilot testing feedback | 24
 | | IP3 Validation not involved | 1
 | IN1 (Instrument Number of Items) | IN1-1 8–10 items | 2
 | | IN1-2 11–13 items | 13
 | | IN1-3 14–16 items | 8
 | | IN1-4 More than 16 items | 4
 | IN2 (Type of Instrument) | IN2-1 Two-tier diagnostic instrument | 11
 | | IN2-2 Three-tier diagnostic instrument | 10
 | | IN2-3 Four-tier diagnostic instrument | 6
 | | IN2-4 Traditional multiple-choice testing | 1
Table 3. Classification system for the quality assessment of MTDTs (the original tabular data can be referenced in Table S2).
First Classification | Secondary Classification | Tertiary Code (Specific Categories) | Frequency of Occurrence
Instrument Quality Assessment | RV (Reliability Validation) | RV1 Cronbach’s alpha coefficient only | 9
 | | RV2 Cronbach’s alpha combined | 14
 | | RV3 No Cronbach’s alpha | 3
 | | RV4 Not involved | 2
 | VV (Validity Validation) | VV1 Expert review only | 24
 | | VV2 Expert review combined | 9
 | | VV3 No expert review | 2
 | | VV4 Not involved | 1
 | CC (Conclusion on Effectiveness) | CC1 Good reliability and validity | 12
 | | CC2 High reliability and validity | 4
 | | CC3 Excellent reliability and validity | 3
 | | CC4 Effective and reliable instrument | 4
 | | CC5 Demonstrates good reliability and validity | 2
 | | CC6 Fair reliability and validity | 2
 | | CC7 Not involved or described | 1
Table 4. Classification system for applicants of MTDTs (the original tabular data can be referenced in Table S3).
First Classification | Secondary Classification | Tertiary Code (Specific Categories) | Frequency of Occurrence
Applicant of MTDTs | IMs (Implementation of Methods) | IM1 Paper-based test | 26
 | | IM2 Online test | 1
 | | IM3 Not involved | 1
 | ELs (Educational Stages) | EL1 Primary school | 0
 | | EL2 Junior high school | 3
 | | EL3 Senior high school | 11
 | | EL4 University | 10
 | | EL5 Cross-school segments | 4
 | PS (Participant Scale) | PS1 < 200 participants | 7
 | | PS2 200–400 participants | 11
 | | PS4 > 400 participants | 7
 | | PS5 Not specified | 3
 | AS (Applicable Subjects) | AS1 Biology | 11
 | | AS2 Chemistry | 4
 | | AS3 Physics | 10
 | | AS4 Geography | 1
 | | AS5 Interdisciplinary | 2
 | FTs (Focus Topics) | FT1 Respiration | 4
 | | FT2 Photosynthesis | 3
 | | FT3 Covalent bond structure | 2
 | | FT4 Diffusion and osmosis | 1
 | | FT5 Internal transport and circulation | 1
 | | FT6 Plant growth and development | 1
 | | FT7 Boiling phenomena | 1
 | | FT8 Mass and gravity | 1
 | | FT9 Cell division and reproduction | 1
 | | FT10 Chemical reactions | 1
 | | FT11 Waves (mechanical and propagation) | 2
 | | FT12 Electric circuits | 1
 | | FT13 Environmental issues (global warming, acid rain, ozone depletion) | 1
 | | FT14 Heat and temperature | 1
 | | FT15 Cellular respiration | 1
 | | FT16 Photoelectric effect | 1
 | | FT17 Carbohydrates | 1
 | | FT18 Work, power, energy | 1
 | | FT19 Ecosystems | 1
 | | FT20 Particle nature of matter | 1
 | | FT21 Reaction kinetics | 1
 | | FT22 Geometrical optics | 1
 | | FT23 Density | 1
 | | FT24 Amino acids, proteins, enzymes | 1
Table 5. Classification system of misconceptions and intervention (the original tabular data can be referenced in Table S4).
First Classification | Secondary Classification | Tertiary Code (Specific Categories) | Frequency of Occurrence
Misconceptions and Intervention | RMs (Revealed Misconceptions) | RM1 (1–5 misconceptions) | 19
 | | RM2 (6–10 misconceptions) | 6
 | | RM3 (11 or more misconceptions) | 3
 | CMs (Categories of Misconceptions) | CM1 Common misconceptions | 27
 | | CM2 Not involved | 1
 | PVs (Use of Psychological Variables) | PV1 Involved (CF, CFW, CFC, CDQ, CB) | 6
 | | PV2 Not involved | 22
 | IRs (Intervention Recommendations) | IR1 Teaching sequence optimization | 2
 | | IR2 Hands-on experiments and simulations | 22
 | | IR3 Concept maps and visual models | 10
 | | IR4 Conceptual change texts | 1
 | | IR5 Cognitive conflict approach | 1
 | | IR6 Real-life analogies and examples | 11
 | | IR7 Formative assessments and continuous feedback | 4
 | | IR8 Multimodal and inquiry-based teaching | 2
 | | IR9 Personalized/tiered teaching | 5
 | | IR10 Clarification of terminology and concept distinctions | 8
 | | IR11 Student-centered discussions and group activities | 3
 | EIs (Examination of Interventions) | EI1 Examined | 1
 | | EI2 Unexamined | 27
Table 6. Representation of conceptual understanding by two-tier diagnostic technology and traditional multiple-choice tests.
Two-Tier Diagnostic Technology
Content Tier (First Tier) | Reason Tier (Second Tier) | Conceptual Understanding Performance
True | True | Understanding
True | False | -
False | True | Partial understanding
False | False | Misconception
Traditional Multiple-Choice Tests
Content Tier | Conceptual Understanding Performance
True | Understanding
False | Misconception
Note. “True” denotes a correct response, while “False” denotes an incorrect response.
Table 7. Representation of conceptual understanding using three-tier diagnostic technology.
Content Tier (First Tier) | Reason Tier (Second Tier) | Confidence Tier (Third Tier) | Conceptual Understanding Performance
True | True | Certain | Scientific concept
True | True | Uncertain | Knowledge gap
True | False | Certain | FP
True | False | Uncertain | Knowledge gap
False | True | Certain | FN
False | True | Uncertain | Knowledge gap
False | False | Certain | Misconception
False | False | Uncertain | Knowledge gap
Note. “True” represents a correct answer, “False” represents an incorrect answer, “FN” represents a false negative, and “FP” represents a false positive.
Table 8. Four-tier diagnostic technology for representing conceptual understanding.
Content Tier (First Tier) | Confidence Tier (Second Tier) | Reason Tier (Third Tier) | Confidence Tier (Fourth Tier) | Conceptual Understanding Performance
True | Certain | True | Certain | Scientific concept
True | Certain | True | Uncertain | Knowledge gap
True | Uncertain | True | Certain | Knowledge gap
True | Uncertain | True | Uncertain | Knowledge gap
True | Certain | False | Certain | FP
True | Certain | False | Uncertain | Knowledge gap
True | Uncertain | False | Certain | Knowledge gap
True | Uncertain | False | Uncertain | Knowledge gap
False | Certain | True | Certain | FN
False | Certain | True | Uncertain | Knowledge gap
False | Uncertain | True | Certain | Knowledge gap
False | Uncertain | True | Uncertain | Knowledge gap
False | Certain | False | Certain | Misconception
False | Certain | False | Uncertain | Knowledge gap
False | Uncertain | False | Certain | Knowledge gap
False | Uncertain | False | Uncertain | Knowledge gap
Note. “True” represents a correct answer, “False” represents an incorrect answer, “FN” represents a false negative, and “FP” represents a false positive.
