Article

Fostering Critical Thinking in STEM Education

1 International Centre for STEM Education, University of Education Freiburg, 79115 Freiburg, Germany
2 Faculty of Science, Department of Mathematics, University of Zagreb, 10000 Zagreb, Croatia
3 Freudenthal Institute, Utrecht University, 3584 CC Utrecht, The Netherlands
4 Department of Subject-Specific Education, University of Innsbruck, 6020 Innsbruck, Austria
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Educ. Sci. 2026, 16(3), 461; https://doi.org/10.3390/educsci16030461
Submission received: 15 January 2026 / Revised: 10 March 2026 / Accepted: 12 March 2026 / Published: 17 March 2026

Abstract

Critical thinking is widely regarded as a key competency in STEM education, particularly in light of 21st-century challenges such as digitalisation, climate change, and technological transformation. Although critical thinking is included in educational policies, its implementation in classroom practice remains limited, partly resulting from a lack of a common understanding that is both theoretically grounded and usable for teachers. In this paper, we introduce a rubric that aims to support the integration of critical thinking into STEM education. The rubric is based on an epistemic understanding of critical thinking rooted in the scientific process of discovery. It was developed through an iterative design process grounded in the Synergy Model of Critical Thinking and piloted with pre- and in-service teachers in four European countries. Their feedback was collected using qualitative questionnaires and focus groups and was analysed using a comparative analysis of the pilot implementations. Results suggest that the rubric captures the central aspects of critical thinking from a scientific perspective and provides a useful reference point for STEM teaching and reflection. However, its use as an assessment tool for critical thinking in all its manifestations is limited, due to its reliance on subject-specific knowledge. Overall, the findings indicate that this rubric could be used to flexibly support instructional design and professional reflection, rather than as a standardised instrument for assessing student performance.

1. Introduction

1.1. Societal Relevance of Critical Thinking in STEM Education

Contemporary global challenges, such as climate change, energy transitions, and the proliferation of polarised information sources, necessitate equipping young people with specific epistemic and cognitive competencies (Thornhill-Miller et al., 2023). To respond appropriately to these challenges and to enable social change, educational opportunities are needed that offer interdisciplinary approaches to support deep learning. Socio-Scientific issues (SSIs) instruction is increasingly used in education as a multidisciplinary context for teaching science as well as technology, engineering, and mathematics (STEM), among other subjects, to enable learners to experience multiple visions of scientific literacy. SSIs are issues that are grounded in science but that cannot be effectively resolved without also addressing their societal dimensions (Owens & Sadler, 2023). In this context, critical thinking (CT) is crucial, as it enables students to analyse complex problems and devise sustainable solutions (Singer-Brodowski, 2023). Most importantly, all citizens should be supported and motivated to adopt critical attitudes that enable them to question their own knowledge and translate this reflection into informed decision making. This necessity is particularly evident in the context of the urgent global climate change crisis and the emergence of pandemics, both of which demand carefully considered societal and ethical decision making. Fleming et al. (2021), for example, state that climate change is a discipline-overarching topic, often accompanied by widespread misconceptions and deliberate misinformation (van der Linden et al., 2017). Other topics frequently negotiated in society, in social media, and in the science classroom include vaccination, energy transition, embryonic stem cell research, genetically modified organisms, biodiversity loss, and the return of wolves to areas close to human settlements. Dealing with these Socio-Scientific challenges in particular demands the ability to think critically and to reflect on the sources and reliability of information (Machete & Turpin, 2020; Facione, 1990). Consequently, the responsibility falls on educators to foster these high-level competencies alongside standard curriculum content. However, facilitating the acquisition of these skills poses a significant challenge, as the concept of CT is often abstract and multidimensional, and teachers and students alike have different ideas about what CT actually is. It becomes even more complex when individual progress in acquiring CT skills is to be evaluated (Kuhn, 1999).

1.2. STEM Education: Rationale, Definitions, and Goals

Contemporary societal challenges such as climate change, digitalisation, and sustainable development require competencies that extend beyond single academic disciplines. This has contributed to a growing emphasis on STEM education—the coordinated teaching of science, technology, engineering, and mathematics—as a framework for developing transversal skills relevant to complex, real-world problems (Bybee, 2013; Bacovic et al., 2022). The acronym STEM was formalised in the early 2000s within the US National Science Foundation, though the underlying rationale of connecting scientific and mathematical disciplines for applied purposes has a longer history in educational policy (Daugherty, 2013).
Definitions of STEM education vary considerably in the literature. A narrow interpretation treats STEM as the sum of its constituent disciplines taught in parallel. Broader definitions emphasise interdisciplinary or transdisciplinary integration, anchored in authentic problem contexts and methodological approaches such as inquiry-based or project-based learning (Breiner et al., 2012; Kelley & Knowles, 2016; Ortiz-Revilla et al., 2022). Systematic analyses of the field confirm that this definitional plurality reflects substantively different assumptions about the purpose and organisation of STEM learning (Li et al., 2020; Halawa et al., 2024). For the purposes of this study, STEM education is understood in an integrative sense, in which disciplinary content serves as a vehicle for developing transversal competencies, with CT as a central example.
Research on STEM education has expanded substantially over the past two decades. Bibliometric analyses document a shift in focus from participation and workforce preparation toward competency-oriented goals including problem solving, scientific reasoning, and higher-order thinking (Hsu et al., 2023; Zhan et al., 2022). A recent umbrella review of 22 meta-analyses reported moderate positive effects of STEM interventions on higher-order thinking outcomes, with considerable variability across contexts and implementation models (Lu et al., 2025). At the same time, CT is rarely operationalised through dedicated assessment instruments in STEM contexts and is explicitly addressed in only a minority of documented STEM initiatives (Reynders et al., 2020). This gap between the stated goals of STEM education and the available tools for supporting and monitoring CT in practice provides the direct motivation for the present study.

1.3. Challenges of Implementing and Assessing Critical Thinking

Addressing this gap, this article presents a rubric developed within the Erasmus+ project STEMkey, designed to make CT visible in STEM education and integrate it into teaching practices. We report on a design study conducted across four countries to evaluate the rubric’s potential to communicate, monitor, and support the development of CT in diverse educational settings.
Despite decades of development, available instruments for assessing CT, often designed for recruitment rather than education, remain limited in their applicability to school contexts. These tests typically focus on analysing and interpreting written information, failing to account for complex social settings or multiple representations. As Ennis (1993) argues, valid assessment requires a defensible conception of CT, comprehensiveness, and appropriateness for the target group. Thus, the challenges in assessment are fundamentally conceptual rather than merely technical, a distinction central to this manuscript.

1.4. Research Gap

A study conducted by Sermeus et al. (2021) to test students’ critical thinking skills in physics concluded that students show very limited domain-specific CT skills and do not learn to think critically in class. Infusing lessons with CT without making it explicit does not result in increased domain-specific CT skills.
Furthermore, teachers face significant hurdles in translating abstract CT goals into practice. Educators often hold fragmented conceptions of the construct (Choy & Cheah, 2009) and report low self-efficacy. This uncertainty is reinforced by learners’ resistance to complex tasks (Aliakbari & Sadeghdaghighi, 2013) and by the “assessment dilemma,” in which evaluating reasoning is perceived as more subjective than grading content knowledge. Consequently, CT is often marginalised in classroom assessments.
To address this gap, the present study builds on the Synergy Model of Critical Thinking (Rafolt et al., 2019), which conceptualises CT as a dynamic interaction of cognitive abilities, dispositions, knowledge, values, and self-regulation embedded within domain-specific contexts. This model reconciles generalist and specialist perspectives by integrating general reasoning skills within disciplinary knowledge structures.
The study underlying this manuscript is based on a rubric developed within the Erasmus+ funded project STEMkey to support and monitor the development of CT across various STEM educational materials. By translating the abstract dimensions of the Synergy Model into observable criteria, such as quality checking, perspective taking, and justification, the rubric supports teachers in designing tasks, diagnosing students’ reasoning, and providing formative feedback. CT is contextual, requiring attention to tasks, problems, and topics embedded in the science curriculum (Bailin, 2002).
This study makes two contributions. First, it offers a STEM-specific conceptualisation of CT centred on epistemic trustworthiness under conditions of scientific, societal, and technological uncertainty. Second, it examines the use of a rubric-based approach to support and monitor CT across diverse STEM contexts. Based on this framework, the study addresses the following research questions:
RQ1 (Design and Communication): How can rubrics communicate key features of CT development in STEM education to (student) teachers?
RQ2 (Adaptability and Usage): To what extent is a rubric a flexible and adaptable tool for teachers to support and monitor CT across diverse STEM contexts?

2. Theoretical Background

2.1. Societal Relevance and Educational Policy

Contemporary society is increasingly shaped by what scholars term a poly-crisis: a convergence of rapid technological advancement, pressing global challenges such as climate change and pandemics, and an unprecedented overflow of information (Thornhill-Miller et al., 2023). While the abundance of information has democratised access to knowledge claims, it has simultaneously weakened traditional gatekeeping mechanisms. In today’s digital information ecosystem, content is frequently prioritised by engagement-optimised ranking systems, which can disadvantage uncertainty- or accuracy-oriented signals and facilitate the rapid spread of polarised misinformation.
These challenges create the need to develop critical attitudes towards source evaluation and reasoning under uncertainty. A specific focus on being “critical” in the context of STEM education does not, however, address the normative and socio-political dimensions of STEM education referred to by scholars such as Freire and Skovsmose (Freire, 2020; Skovsmose, 2020). We therefore acknowledge the presence of mechanisms by which educational systems (implicitly) sustain inequities and the marginalisation of large groups of people. Many STEM practices are still guided by predominant cultural views based on so-called Western democratic values, which suggest that everybody is equal and free but which are also driven by capitalist and colonialist mechanisms. Attention to critical pedagogies has stressed the importance of including local practices to give students opportunities to bring in social and cultural practices as critical funds of knowledge (Solomon et al., 2022). Ethnomodelling, which adds cultural perspectives to the modelling process, stresses the importance of including local cultural information and respecting local social issues to open schools to their communities (Rosa et al., 2023). We acknowledge the importance of these approaches, but in this paper, we limit our focus to fostering the critical attitudes in STEM practices that need attention in the current data-driven society with its polarising environmental challenges.
Despite the ubiquity of information, empirical evidence indicates a widening gap between the availability of data and individuals’ ability to evaluate its validity critically. It cannot be assumed that access to information equates to the competence to judge its validity. For example, Wineburg et al. (2016) showed that so-called digital natives, i.e., persons who grew up in the information age and have been exposed to digital technologies from early childhood, experience considerable challenges in distinguishing reliable news from sponsored content or in identifying the sources of online information. This lack of epistemic vigilance extends beyond students to the general population (OECD, 2021). As a result, scientific knowledge claims regarding issues such as public health and climate change are increasingly perceived as matters of opinion rather than as evidence-based consensus. The ability to evaluate evidence is therefore a fundamental requirement for informed decision making in democratic societies.
In response to these developments, educational policies across Europe have increasingly shifted their focus from the acquisition of factual knowledge to the development of higher-order reasoning competencies (Voogt & Roblin, 2012). For example, in Germany, Austria, and Switzerland, promoting CT and helping young people learn how to think critically is a systematic educational goal (Rafolt et al., 2019). Competence frameworks such as PISA (OECD, 2023) and the EU GreenComp Framework (Bianchi et al., 2022) describe CT as a key competence that must be developed not only in school education but also as a lifelong learning principle. The new PISA 2025 Science Framework puts emphasis on student agency in the Anthropocene and calls for competencies that enable systemic change (Balán, 2025). The ability to critique standard flaws in science-related arguments is explicitly required (OECD, 2023). CT is considered fundamental for learners to cope with uncertainty, complexity, and change (Bacigalupo, 2022). Taken together, these developments, along with the recognition that thinking critically and reviewing peers’ claims are characteristic of science-, technology-, and mathematics-related knowledge acquisition, place a clear mandate on STEM education to foster learners’ epistemic vigilance when evaluating uncertain and contested knowledge claims, as well as when questioning one’s own cognitive process and the level of knowledge needed to draw reliable and robust conclusions. CT has thus evolved from an overarching educational objective into a key competence, not only in STEM education but for democratic participation in the 21st century. For example, an umbrella review of 22 meta-analyses found moderate positive effects of STEM education on higher-order thinking skills, though effects varied considerably across educational levels, subject areas, and instructional methods (Lu et al., 2025). This variability underscores that STEM education does not automatically produce gains in higher-order reasoning; rather, such gains depend on the deliberate and explicit integration of competencies such as CT into instructional design.
Despite this strong societal and policy-level emphasis, translating CT into classroom practice remains challenging. A central obstacle is the lack of a unified, operationalisable definition of CT, often described as a conceptual jungle (Lai, 2011). The following section addresses this challenge by reviewing and structuring dominant approaches to defining CT in educational research.

2.2. The Challenge of Defining Critical Thinking

Since the early 1980s, researchers have increasingly emphasised the importance of shaping and fostering CT in education. However, this increased attention has produced various approaches, highlighting the difficulty of establishing a shared understanding. Although the intellectual discourse on CT can be traced back more than 2500 years to Socrates’ vision of teaching (R. W. Paul et al., 1997), CT has long been regarded as an overarching educational goal, and an explicit, shared understanding of its components remains rare. Elder and Paul (2022) suggest that when deconstructing and examining reasoning, it is efficacious to focus on elements such as purpose, questions, information, inferences, concepts, implications, assumptions, and point of view.
Various approaches to defining CT have emerged across different disciplines, reflecting the difficulty of establishing a shared understanding of the concept. Moore (2013) identified seven definitional strands coming from three disciplines: history, philosophy, and cultural studies. Ennis (1993) defined CT as “reasonable reflective thinking focused on deciding what to believe or do” (p. 180) and additionally provided a list of characteristic skills that a critical thinker should demonstrate independently (Ennis, 1993). In contrast, Bailin et al. (1999) declined to present such a list of abilities and skills in order to avoid misunderstanding. However, they argue that CT is a “normative term” and that becoming a critical thinker requires intellectual resources such as “background knowledge, operational knowledge of key concepts, possession of effective heuristics and of certain vital habits of mind” (p. 385).
These conceptual debates have direct consequences for assessment practice. Saxton et al. (2012) identify four persistent challenges in CT assessment at the secondary level: insufficient teacher preparation for CT instruction, the absence of evidence-based instructional methods, curricula that fail to target higher-order thinking, and the lack of valid and reliable measurement instruments. Crucially, they argue that analytic rubrics, in which CT is assessed across multiple sub-dimensions independently, are superior to holistic scoring approaches precisely because they preserve diagnostic information at the level of individual competencies, information that holistic scores collapse into a single undifferentiated rating. This argument is directly relevant to the present study, which adopts an analogous, multidimensional rubric structure.
Other academics who have tried to define CT describe a critical thinker as someone moved by reasons (Siegel, 1988) or as a liberating force in education, important for one’s personal life and vital to society in general (Facione, 1990). Halpern (2003) integrates cognitive skills and strategies in her definition of CT in use. While correlating CT competencies with specific concepts of CT, R. Paul and Elder (2013) describe twenty-five competencies that constitute a master rubric for scoring students’ achievements and providing an overall performance score for CT in educational settings.
Although definitions of CT vary, they share common elements such as reflection on one’s thinking and purposeful decision making or action. Ongoing debates about whether CT is a skill or a disposition, and whether it is domain-specific or domain-general, have led to diverse teaching and assessment approaches, with research suggesting that explicitly integrating CT into domain-based instruction enhances domain-specific CT more effectively than implicit learning (Sermeus et al., 2021). Critical thinking is a multidimensional phenomenon that unites more or less similar attitudes, strategies, processes, and goals. However, this diversity of perspectives makes it difficult to arrive at a comprehensive definition of the competencies involved in CT, which in turn makes it challenging to develop practical reference frameworks that help educators design lessons promoting effective learning.
Consequently, it remains challenging for educators to evaluate whether a student is capable of CT because no universally accepted operational definition guides assessment. The Synergy Model by Rafolt et al. (2019) distils areas of broad consensus within the scientific discourse on CT and offers educators a clear framework for the diverse competencies that together constitute an effective critical thinker. An explicit understanding of one dimension of CT, in our case the scientific one, offers a starting point for engaging in conversation with students on how scientific thinking strategies contribute to dealing with social challenges critically and on what this can and cannot achieve.

2.3. Connecting Critical Thinking to STEM Content

CT is widely regarded as one of the key competencies in STEM education and is considered a domain-specific skill. Contemporary conceptualisations commonly emphasise skills such as clarifying meaning, analysing arguments, evaluating evidence, and drawing warranted conclusions (Hitchcock, 2017). As STEM disciplines aim to transform data into scientific models that underpin reliable knowledge (Osborne, 2014), CT inherently entails evaluating the trustworthiness of assertions and identifying remaining uncertainties.
Within STEM, these reasoning processes are not generic; instead, they rely on domain-specific knowledge that enables learners to navigate disciplinary representations, inferential structures, and epistemic norms (Tricot & Sweller, 2014; Willingham, 2007). Consequently, effective CT in STEM presupposes a conceptual understanding of how evidence is generated, validated, and limited within specific scientific domains.
Socio-Scientific issues, real-world problems situated at the intersection of science and society, constitute a well-established pedagogical context in which such epistemically grounded CT becomes highly salient. SSIs are characterised by uncertainty, contested evidence, and the absence of definitive answers (Sadler, 2004; Zeidler, 2014), often due to limited scientific knowledge. Engaging with SSIs therefore requires learners to interpret the problem context, connect it to scientific concepts, critically evaluate potentially conflicting data, and justify decisions under uncertainty while considering diverse viewpoints and evaluating the limitations of their conclusions. The objective shifts from merely identifying the answer to determining how well the answer is warranted. Thus, SSIs provide a context in which CT extends to assessing the validity and limits of knowledge claims when standard disciplinary procedures no longer suffice.
Recent advances in Generative AI introduce an additional layer of complexity. Such AI systems are easily accessible and offer well-articulated answers that can be occasionally inaccurate and not anchored in transparent methodological reasoning (Baidoo-Anu & Ansah, 2023; Bender et al., 2021; Long & Magerko, 2020; Ng et al., 2021). This challenges traditional cues for evaluating credibility and further increases the importance of equipping learners with competencies for scrutinising the trustworthiness of data-driven and AI-mediated claims. Consequently, the ability to interpret evidence, assess measurement quality, and identify algorithmic biases has become an integral part of modern CT.
The challenge of operationalising CT in STEM is compounded by persistent disciplinary unevenness in the research literature. Reynders et al. (2020), in a study developing and validating CT rubrics for undergraduate STEM courses across multiple institutions and disciplines, find that CT and information processing are widely stated as intended learning outcomes in STEM programs but are rarely assessed explicitly by instructors. Their work demonstrates that rubrics can create the constructive alignment between learning goals, tasks, and assessment tools that is necessary for CT to be systematically developed but also that such rubrics require iterative refinement through faculty collaboration and student feedback to achieve disciplinary fit. This finding resonates with the present study’s design process and confirms that the challenge of translating CT frameworks into usable classroom tools is not unique to secondary education but extends across educational levels and national contexts.
Taken together, these developments suggest that CT in STEM is best understood as a competence for evaluating the epistemic trustworthiness of scientific, data-driven, and AI-generated knowledge claims. To operationalise this in STEM education, our framework consolidates these demands into three key dimensions: the quality checking of information and sources, perspective taking that accounts for diverse viewpoints and norms, and the rigorous justification of findings. Each component supports the others in enabling learners to critically scrutinise how knowledge is produced and legitimised in science and society.

3. Methodology

In this exploratory study aimed at a better understanding of CT in the STEM domain, we employed qualitative methods to elicit how (student) teachers can come to understand and use CT features in STEM education. The study is grounded in a social constructivist theoretical framework (Vygotsky, 1978), which holds that knowledge is not acquired individually but emerges through social interaction, collaborative meaning making, and the use of cultural tools within specific contexts. From this perspective, a rubric is not merely an assessment instrument but a cultural tool that mediates how teachers think about, discuss, and develop CT in their practice. The framework positions learning, including professional learning among teachers, as inherently situated, context-dependent, and shaped by the interpretive resources participants bring to their encounters with new tools and concepts.
Following Koro-Ljungberg et al. (2009), we explicitly ground our study in a constructivist epistemology, ensuring that methodological decisions remain consistent with the study’s knowledge claims. Accordingly, the choice of focus groups, written questionnaires, and semi-structured interviews reflects the aim of capturing how participants socially construct meaning around a novel tool rather than measuring predefined outcomes. This theoretical orientation led us to adopt a design research approach (McKenney & Reeves, 2019; Bakker, 2018), in which tools are iteratively developed and refined through systematic empirical inquiry in authentic practice settings.
The social constructivist framework has direct implications for how the research questions were formulated, how data were collected and analysed, and what kinds of claims the findings can support. RQ1, asking how rubrics can communicate key features of CT, presupposes that communication is not a transmission of fixed meaning but a socially mediated process in which participants actively interpret and negotiate the relevance of a tool for their own practice. RQ2, asking to what extent the rubric is flexible and adaptable, similarly assumes that usability is not an intrinsic property of the instrument but emerges from the interaction between the tool and the specific institutional, disciplinary, and professional contexts in which it is used. Both questions therefore call for methods that preserve contextual specificity while enabling cross-case pattern recognition, which is precisely the rationale for the comparative multi-site design employed here.
The framework also delimits the scope of the findings. Consistent with the formative phase of design research (McKenney & Reeves, 2019), the study does not seek to establish the rubric’s effectiveness in improving student CT outcomes. Rather, it generates theoretically grounded, context-sensitive evidence about the rubric’s communicative validity and practical usability from a teacher perspective. This serves as a necessary precondition before causal claims about student-level impact can responsibly be investigated.

3.1. Design Process

The EU Erasmus+ project STEMkey (teaching standard STEM topics with a key competency approach, 2020–2023) focused on the development of specific teaching and learning modules addressing standard STEM topics, with emphasis on interdisciplinary approaches. The STEMkey consortium gathered universities from 12 European countries, covering all STEM disciplines and featuring strong expertise in competence-based and student-centred STEM education research and practice (https://icse.eu/international-projects/stemkey/, accessed on 3 March 2026).
The goal was to support the development of key competences for lifelong learning at the secondary school level. Competence development was supported by designing context-specific tasks and learning activities that foster CT within the problem-solving process. The decision that STEMkey teaching materials and activities should encourage CT was the result of a developmental process involving all partners. It was coordinated by the authors of this paper in three stages: theoretical, exploratory, and summarising. In these stages, we developed and validated the instrument by using the notions of construct validity and content validity (Drost, 2011). Construct validity was established by drawing on existing ideas about CT to develop a tool for educational practice.
In the first stage, the multiple perspectives of CT were shared among all partners, based on an extensive literature review and the fourth and fifth authors’ research experience with CT (described in the previous section). The structure of the rubric, its dimensions and levels, was theoretically underpinned based on this overview and a dedicated workshop with a small group of project members, including all authors of this paper. Within the scope of the project, the workshop established a common understanding of the Synergy Model and highlighted the need to develop practical tools to educate educators about the concept of CT and to assess students’ work on tasks in the designed teaching and learning modules.
In the second stage, the second and third authors designed and presented the first example of the rubric, formulated with four levels. The design was based on the Synergy Model and on specific questions fostering CT in the context of the measurement task, drawn from one of these authors’ experience in teacher education. The emerging model was elaborated into example rubrics for several illustrative tasks covering a variety of STEM topics and was evaluated by peers at a consortium meeting, which included another workshop based on the example. The workshop participants were representatives of all project partner countries, and they discussed the extent to which the example covers various aspects of CT.
In the third stage, feedback was processed, and the final rubric was extended with auxiliary tools (e.g., a set of questions covering the CT dimensions in cases where the rubric is not available for a specific task).

3.2. Procedure and Data Collection

The feasibility and practical validity of our instruments were studied by piloting the rubric and the accompanying tasks in sessions with prospective teachers and teacher educators. First, we organised an external expert consultation during a workshop at an international conference (15 participants). The workshop included group work on the measurement task, an introduction to the CT rubric, and an open discussion. Based on the feedback, carefully designed trials were implemented in four of the project’s partner countries. Three different target groups were identified: science students, prospective student teachers, and teacher educators. This variation was partly due to convenience sampling by the countries’ representatives, but it made it possible to cover the perspectives of teachers at different stages of their careers and across different school subjects, such as mathematics or biology. In each of the countries, we used the same try-out procedure, combining experiencing the tasks from the perspective of secondary students with reflecting on written results by applying the rubrics as teachers. This was followed by a discussion of the rubric in general and by the presentation of its implementation. To facilitate uniformity across case studies, five categories were identified to be discussed in each country: concept validity, usefulness and purpose, practicality and barriers, adaptability, and willingness to use. These five categories were formulated as questions to be answered by participants.
In each of the countries, the focus group discussion included a series of questions:
  • Does the rubric capture CT in STEM?
  • Is it useful? Why and why not?
  • Is it practical? Why and why not?
  • Is it adaptable? Example?
  • Are you willing to use it? Why and why not?
The same categories were used to analyse the participants’ answers, as well as to produce and compare the case study reports. In some cases, the data and the analysis have been extended with further interviews that provided additional clarification. Details of the procedure, sample, and the context for each country are described in the following paragraphs.
In Austria, the session took place at the University of Innsbruck, at the Faculty for Teacher Education, and lasted 90 min. It involved a group of seven pre-service biology education teachers who were enrolled in a master’s programme and were taking the course “Inquiry-based Learning” as part of the regular teacher education curriculum. The session was conducted in the middle of the semester, after the students had learned about the five instructional phases described by Bybee et al. (2006) for designing fruitful teaching units. The session started with a wrap-up and an introduction to CT using the visualisation of the Synergy Model of CT (Rafolt et al., 2019). The students used STEMkey Module 5: Material Cycles. They were given the task of identifying carbon sinks and sources while manipulating a figure model of a garden environment, which included humans, animals, plants, soil, technical equipment, etc. This task led to discussions among the students about whether the depicted objects function as sinks, sources, or both. Based on this discussion, the students were asked to reflect on their future work as teachers and on how they could observe CT in group discussions in class. After a brief exchange, the students received the rubric and were asked to discuss its potential use in three groups: two groups with two participants and one group with three participants. The lecturer asked for permission to record the discussions, which were transcribed and analysed afterwards.
In Croatia, the session was conducted as part of the course “Didactics of Mathematics 3” in the final year of the graduate programme for pre-service teachers of mathematics, mathematics and physics, and mathematics and computer science at the Mathematics Department, Faculty of Science, University of Zagreb. There were 22 participants in total. In the session, which lasted 90 min, the author asked students to consider the measurement task and the racing car task. After the first problem was solved, there was a brief discussion about the notion of CT. The participants mostly struggled to define CT, and only a few students joined the discussion by relating CT to an individual’s use of prior experience. Following the work on the second problem, the participants were asked to look at the general rubric. They were invited to read the rubric and comment on its structure (levels and dimensions). Finally, they were asked to study the rubrics adapted to the two given problems and to fill in the questionnaire. The written answers were coded and analysed according to the categories to obtain an overview and to extract explanatory answers used to prepare the Croatian case study.
In Germany, the data collection consisted of two sessions, both conducted with second- and third-year pre-service mathematics teachers specialising in lower secondary education (i.e., those preparing to teach students in grades 5 to 10) at the University of Education Freiburg, Institute for Mathematics Education. The first session involved 44 participants. After a 45 min introductory input on CT and the structure of the rubric, the students worked in small groups of two to four and provided written responses to the guiding questions. The second session was carried out with four third-year bachelor students from the same programme. Instead of written work, these students participated in semi-structured interviews after the introductory phase. Each session lasted about 90 min in total, with 30 to 45 min of task engagement. While the participants had not yet been explicitly confronted with the concept of CT in their studies, they were able to easily grasp the mathematical background of the presented tasks. The interviews were audio recorded, transcribed using NoScribe version 0.7, and later edited and linguistically smoothed following the Dresing and Pehl (2018) guidelines.
In the Netherlands, the session involved a group of ten science and mathematics teacher educators at Utrecht University. The session lasted for one hour, which was feasible given the familiarity of the participants with the need for addressing transversal skills in STEM education. The participants worked for 15 min on the measurement task and the carbon cycle task, followed by an evaluation of their answers with the rubrics (15 min of group work). After a plenary discussion on CT and the potential of the rubrics, the participants had 10 min to provide written answers to the questionnaire. The written answers, together with notes from the facilitator (member of the researcher team), constituted the Dutch data collection. Data were analysed according to categories.

3.3. Data Analysis

The four national implementations are termed “cases” strictly for structural organisation, denoting distinct pilot settings for the instrument rather than a classical case study methodology. The underlying research design aligns with the formative evaluation phase of educational design research, in which tools are piloted in practice to explore their feasibility and usability (McKenney & Reeves, 2019). Each country representative analysed the collected data, taking into account the background of the participants. For each country, this led to a summary of the main findings, organised under the questions asked. Findings are supported with quantifications and enriched with quotations that represent particular patterns or illustrate a specific voice bearing on the potential use of the rubrics and the tasks. These specific voices were discussed within the research team to reach a shared understanding of the selection of the quotations and of the importance of giving space to them.
The general patterns emerging in the try-outs have resulted in country case studies. These case studies are discussed from an international and overarching perspective in the discussion section. Five categories used to design the questionnaire were also used in the analysis of each country case and in the tabular synthesis. An overview of all cases in the form of a table is provided, and it was used to highlight similarities and differences in the country cases.
Consistent with the social constructivist epistemology outlined in the theoretical framework and following the principle of epistemological consistency advocated by Koro-Ljungberg et al. (2009), the analysis was conducted using Qualitative Content Analysis (Mayring, 2014), applying a deductive–inductive coding strategy. The five predefined categories—concept validity, usefulness, practicality, adaptability, and willingness—served as deductive coding dimensions derived a priori from the conceptual structure of the rubric and the overarching research questions. During the coding process, these categories were inductively expanded and refined to capture context-specific barriers or adaptations that emerged from participants’ social interaction with the tool. Cross-case synthesis was conducted by comparing category summaries across cases and identifying convergent and divergent patterns. Disagreements in interpretation were discussed within the research team until consensus was reached.
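For illustration only, the deductive backbone of this coding strategy can be sketched as a simple tally of coded segments per category and country case. The category names below come from the questionnaire described in Section 3.2; the data structure, example segments, and counts are hypothetical and were not part of the project’s actual analysis workflow, which was carried out manually on transcripts and written answers.

```python
from collections import Counter

# The five deductive categories from the questionnaire (Section 3.2).
CATEGORIES = ["concept validity", "usefulness", "practicality",
              "adaptability", "willingness"]

# Hypothetical coded segments: (country case, category) pairs as a
# researcher might assign them during Qualitative Content Analysis.
coded_segments = [
    ("Austria", "usefulness"), ("Austria", "practicality"),
    ("Croatia", "concept validity"), ("Croatia", "usefulness"),
    ("Germany", "practicality"), ("Netherlands", "adaptability"),
]

def category_summary(segments):
    """Tally coded segments per (case, category), producing the kind of
    category summary that supports cross-case comparison."""
    counts = Counter(segments)
    cases = sorted({case for case, _ in segments})
    return {case: {cat: counts[(case, cat)] for cat in CATEGORIES}
            for case in cases}

# Each case maps to counts over the five categories; zeros flag
# categories without coded evidence in that case.
print(category_summary(coded_segments))
```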

3.4. Ethical Considerations

The study was conducted in accordance with the ethical guidelines of the participating universities and national regulations regarding research with human participants. Prior to data collection, all participants, comprising pre-service teachers and teacher educators, were informed about the study’s objectives, the nature of their involvement, and the intended use of the data. Participation was strictly voluntary, and informed consent was obtained from all individuals. In addition, participants were informed that they could terminate their participation at any time. To ensure confidentiality and privacy, all data collected through questionnaires, interviews, and focus group discussions were anonymised or pseudonymised during the analysis process and stored securely in compliance with the General Data Protection Regulation (GDPR).

4. Results

We first present results from the design process, followed by case studies from try-outs in the four participating countries.

4.1. Result of the Design Process

As a result of the three-stage design process described in the methods section (Section 3), the content of the rubric has been systematised through the identification of eight dimensions, and the rubric has been simplified to three levels. These three levels—basic, intermediate, and advanced—are intended to establish a shared reference for CT and to capture varying degrees of critical engagement across tasks and contexts rather than to imply a fixed or linear developmental progression. The questions enable students, pre-service teachers, and in-service teachers to engage in a discussion about what makes a critical thinker. The basic level is characterised by modest use of CT while solving a problem; the intermediate level by solving a problem with certain elements of CT without using its full potential; and the advanced level by solving a problem based on extensive experience, context-specific knowledge, and a professional approach, exhibiting CT throughout (e.g., Table 1). Each level is described according to the following dimensions:
- Quality check of resources;
- Variety of methods;
- Argumentation and coherence of conclusions;
- Use of tools and data;
- Understanding and consideration of different norms and values;
- Communication of the results;
- Presence of goal-orientation and perseverance;
- Richness of reflection.
The connection of CT with STEM content is illustrated by adapting the rubric for evaluating students’ answers to specific STEM tasks. This supports teachers in understanding how to address CT when teaching STEM topics and in implementing practices that support students’ CT development (e.g., Table 2).
Examples in other contexts and further variations of the rubric were designed to accommodate the varied needs of users. One such auxiliary tool is a set of questions that a student or a teacher can use in situations where the rubric is not adapted to the specific task. By answering the questions, the user engages in reflection about their work. To achieve more flexibility during assessment, teachers may select the dimensions they find relevant at a given moment and add levels according to their aims and needs. The dimensions can be evaluated jointly or separately, depending on the aims and the scope of the activities in which the students are engaged. This leads to the design of a table with separated dimensions. This variation has the advantage of addressing the criticism that the levels of the original rubric are too rigid, since a person may perform at different levels across dimensions. The full rubric and its adaptations are given in Appendix A.
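To make the rubric’s structure concrete, the following minimal sketch represents dimensions and levels as a plain data structure and shows how a teacher might select only the dimensions relevant at a given moment, mirroring the separated-dimensions adaptation described above. The dimension keys and descriptor texts are abbreviated, invented placeholders; the authoritative wording is given in Appendix A.

```python
# Minimal sketch of the rubric as a data structure; descriptor texts
# are invented placeholders, not the wording of the actual rubric.
LEVELS = ["basic", "intermediate", "advanced"]

RUBRIC = {
    "quality check of resources": {
        "basic": "accepts sources at face value",
        "intermediate": "checks some sources for reliability",
        "advanced": "systematically evaluates source trustworthiness",
    },
    "argumentation and coherence of conclusions": {
        "basic": "states conclusions without justification",
        "intermediate": "gives partial justification",
        "advanced": "builds a coherent, well-justified argument",
    },
    # ... the remaining six dimensions would follow the same pattern.
}

def select_dimensions(rubric, chosen):
    """Reduce the rubric to the dimensions a teacher finds relevant,
    as the flexible adaptation described above allows."""
    return {dim: rubric[dim] for dim in chosen if dim in rubric}

focused = select_dimensions(RUBRIC, ["quality check of resources"])
for dim, descriptors in focused.items():
    for level in LEVELS:
        print(f"{dim} [{level}]: {descriptors[level]}")
```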

4.2. Preliminary Case: Workshop at the International Conference

The rubric was presented to a group of 15 external experts at an international conference for STEM educators. The workshop showed the feasibility of the measurement task for facilitating discussion on the topic of CT. Participants discussed the definition of CT and, drawing on their own teaching experience, the necessity of developing a more structured approach to teaching CT. The presentation of the rubric raised interest and led to the feedback that these experts find the rubric general, that in their opinion CT is content-specific, and that the auxiliary tools (such as the list of questions, a rubric based on a concrete example, and a table with free choice of dimensions) are more aligned with their teaching needs and beliefs.

4.3. Case: Austria

Six out of seven students participating in the session expressed positive opinions on the use of the rubric as a supporting tool in their classrooms. From the transcripts of the interviews, it becomes clear that the students identify the tool as beneficial, as it invites teachers to think about the importance of CT:
“[…], one thinks critically about the whole thing, and I do believe that it will definitely improve the quality of one’s teaching.”
Students consider the overall design and the aspects included in the rubric to be a beneficial framework for guiding teachers through the most important aspects of CT, but they do not view it as a standalone document. They also noted that some elements are missing and pointed to limitations in efficiency, as the rubric does not capture students’ learning success:
“If I just take that into account, I don’t know if it will be effective. Because there is nothing in there about how you assess learning success or whether there is any learning success.”
In general, some students emphasised the additional workload for teachers, as supervising tasks aimed at developing CT skills is time-consuming. This is especially the case for group discussions and group work, which need to be carefully designed. Reviewing and applying the rubric was also described as an additional burden for teachers. Furthermore, educators using the rubric need to be aware of students who may not feel confident enough to contribute their own ideas to discussions. The students reported that this approach only works if the teacher constantly observes the discussions carefully, and even then, it is difficult to assess whether aspects such as goal orientation or reflection are actually missing, particularly in secondary school settings. Future teachers described the observation of individual students within group discussions as especially challenging, as high-performing students tend to receive most of the attention:
“[…] but even so, in a discussion like this, I can only observe the group dynamics as a whole and not respond to the individual. And the problem is that when I reflect on the discussion, I always focus on the good students, and perhaps many students simply get lost in the shuffle.”
However, most students stated that they would use the rubric in their own teaching, as they attributed general benefits to rubrics in providing structure and a good overview for designing teaching activities. At the same time, they described the rubric as too general and therefore suggested adding specific, though not overly detailed, examples of how to apply it in class across different social forms and learning settings. In addition, they reported difficulties in clearly distinguishing between the different performance levels:
“So yes, I would probably use it, but I find it a bit difficult to draw a clear line between what is basic level, what is intermediate level, and what is advanced level.”
Adapting the rubric for transdisciplinary use was described as an additional workload that teachers would need to invest, but the participants considered this investment worthwhile and found it helpful to have such rubrics as an overview of the goals to be achieved.

4.4. Case: Croatia

All participants wrote that the rubric captures CT in STEM, while some added that it “synthesises” individual situations into three clear categories. A few students mentioned that this was the first time they had encountered such a rubric, and some said they do not have enough experience as teachers and would therefore need more time and more details to feel confident using the rubric as a tool. One participant wrote:
“Yes, it triggered my own critical thinking, so I would expect it to also do that in others. [The workshop] solidified my knowledge about critical thinking and forced me to think about its importance.”
Most of the participants think that the rubric is useful. They explain that they would use it as an assessment tool to track students’ development of CT and of higher cognitive skills such as problem solving (or those from Bloom’s taxonomy), or for students’ self-evaluation. Others say they would use it to help them prepare lessons: to pose better questions and to reflect on their teaching activity. One student says that the rubric could be used in educating teachers about CT. Some participants claim that the styles of teaching that would foster CT are missing in the Croatian school system and that teachers are not prepared to teach CT.
A few participants claim that the rubric is not useful under the conditions of limited time often encountered in the school system. One student compared the levels from the rubric to the grades they would give to students, writing that even “the basic level” represents competences for a good grade (2–4 out of 5), “the intermediate level” represents higher grades (4–5 out of 5), and “the advanced level” represents competences above expectations for high school.
In terms of practicality, the participants’ opinions are divided. While about half of them write that the rubric is practical because it is systematic, concise, and detailed, and because they were able to understand it, some participants claim that there is not enough time for similar workshops in high school, while others argue that students’ competences are too specific and diverse to be classified in the presented “levels”:
“Students are too unique, so we can’t apply the same criteria to everyone.”
More specifically, when considering the use of a rubric in their own practice, the participants mention many different potential uses and constraints. About half of them see the rubric as a self-reflection tool for teachers or as a tool to prepare and evaluate complex tasks. Other participants, however, say that the rubric is not suited for students, fearing that students are not mature enough to use it, that it could be used only when working with talented students, or that it is suited mainly for lessons in which the homeroom teacher discusses classroom issues. One participant argues that the presented method might increase students’ motivation, while another warns that the rubric should be used carefully, as it might influence students’ self-confidence.
As the participants report, this workshop presented their first encounter with CT, rubrics, and this style of teaching, so we may speculate that their answers reflect their own experience from the perspective of a student (as opposed to that of an experienced teacher), in particular when discussing aspects of the affective domain. Furthermore, the mention of institutional constraints (such as time) might also reflect slight insecurities due to a lack of experience.

4.5. Case: Germany

Based on the analysis of the written student answers and the interview data, the German participants (pre-service STEM teachers) acknowledged that the rubric addresses relevant skills for CT. However, they expressed strong reservations about its specificity for MINT (German for STEM) contexts. A primary criticism was the lack of concrete definitions to apply the concepts within the STEM domain.
“… I am missing some specific definitions to be able to apply this concretely in the MINT (STEM) domain.”
A deeper and frequently mentioned concern was the rubric’s strong connection to substantial subject knowledge. Participants argued that criteria like source evaluation are difficult to apply without this foundation. More specifically, they criticised that a meaningful engagement with scientific sources (e.g., academic publications) requires a level of expertise that is far beyond what can be expected in a school context. The rubric was thus seen as setting an unrealistic standard, as students often lack the necessary access (e.g., due to paywalls or language barriers) and the specialised knowledge to assess such sources appropriately. This argument implies, from the participants’ perspective, that the rubric’s applicability in secondary education is relatively limited.
“… if you don’t have the subject knowledge, it’s difficult to check sources to see if they are right or wrong.”
When reflecting on usefulness, views were mixed. Some participants saw potential for the rubric as a diagnostic tool to determine where students stand, particularly in specific contexts. However, many found the rubric “less useful” in its current form, citing several barriers. Beyond the aforementioned dependency on subject knowledge, participants criticised the perceived subjectivity of key criteria, such as “efficiency” and “creativity,” which they felt were difficult to assess objectively.
“… ‘efficiency’ and ‘creativity’ are such subjective assessments.”
This student observation touches on a practical barrier: constructs like creativity or collaboration are notoriously difficult to operationalise and measure reliably. It implicitly questions whether teachers in general are adequately trained to assess such complex constructs, even if the rubric were more specific. While this criticism [regarding access to sources] may partly reflect the pre-service teachers’ own limited experience with academic search strategies (e.g., using Google Scholar or finding Open-Access articles), it indirectly points to a more fundamental barrier: the question of whether in-service teachers possess the necessary skills to research, read, and evaluate scientific literature themselves. This teacher competency is a crucial, unstated prerequisite for the rubric’s meaningful application, thus significantly affecting its overall usefulness.
Regarding practicality, the consensus was “not yet”. Participants explicitly stated the rubric felt “not finished enough” and required significant clarification of its definitions before it could be applied. While participants struggled to see its application in mathematics, they did identify potential for specific, open-ended tasks, such as lab reports or student presentations:
“… in science lessons, something like conducting an experiment and then writing a report.”
Another significant practical barrier identified was the structure of the proficiency levels. Participants raised two distinct concerns:
“… for me, ‘Intermediate’ should basically be the minimum standard; what falls below that is simply not critical thinking …”
First, the levels were seen as not optimally calibrated. The “basic level” was criticised as being set too low, essentially representing an absence of CT rather than its starting point. Also, the gap between “intermediate” and “advanced” was perceived as very large, raising doubts about whether the highest level is realistically attainable in a school setting. This implied a concern among participants that they themselves, and potentially even in-service teachers, might not meet these advanced criteria. This, in turn, poses a fundamental challenge to the rubric’s utility: it questions whether teachers can design instruction that aims beyond their own perceived limitations, or if the highest level simply becomes irrelevant in practice.
Second, participants questioned the practical applicability of the discrete levels, worrying that students might meet criteria from different levels simultaneously (e.g., basic and advanced), making a consistent classification “difficult”.
Participants largely agreed that the rubric could be made usable through adaptation. Key suggestions included providing clearer definitions for subjective terms and perhaps adding an internal weighting system to clarify which criteria are most important. The primary goal of adaptation would be to shift its function away from formal assessment.
“… not necessarily as a scoring sheet, but … to have a rubric where one could theoretically categorise critical thinking …”
Willingness to use the rubric was therefore cautious. While some were “ready to use it”, others stated they would only use the current version as a “rough orientation” for themselves and would not “put too much value on it” for student evaluation.
“… as it is now, I would perhaps keep it in the back of my mind for myself … but wouldn’t put too much value on it yet, rather just use it as a rough orientation.”
The group’s consensus was that the rubric’s main value lies not in student assessment, but as a reflective tool for teachers during lesson planning.
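To make the adaptation suggestions from this case concrete, the following minimal sketch (in Python) illustrates what an internal weighting system of the kind participants proposed could look like. It is purely illustrative: the dimension names are taken from the rubric in Appendix A, while the weights and the numeric scoring of the levels are our own assumptions and are not part of the published rubric.

```python
# Illustrative sketch only: the published rubric contains no weighting system.
# Dimension names follow Appendix A; weights and level scores are assumptions.

LEVEL_SCORES = {"basic": 1, "intermediate": 2, "advanced": 3}

WEIGHTS = {
    "quality check of resources": 0.25,
    "argumentation and coherence of conclusions": 0.25,
    "use of tools and data": 0.20,
    "communication of results": 0.15,
    "richness of reflection": 0.15,
}

def weighted_profile(ratings):
    """Aggregate per-dimension level ratings into one weighted value (1-3 scale)."""
    total = sum(WEIGHTS[dim] * LEVEL_SCORES[level] for dim, level in ratings.items())
    return total / sum(WEIGHTS[dim] for dim in ratings)

# Example: a partial rating of one student product
rating = {
    "quality check of resources": "intermediate",
    "use of tools and data": "advanced",
    "richness of reflection": "basic",
}
print(round(weighted_profile(rating), 2))  # -> 2.08
```

Consistent with the participants’ caution, such an aggregate is best read as a rough orientation for reflection and planning rather than as a grade.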

4.6. Case: The Netherlands

In the Netherlands, all participants (science and mathematics teacher educators) agreed that the rubric covers aspects of CT, although one participant was unsure whether all aspects were covered, and another questioned whether the accompanying tasks required students to reflect critically on their answers.
When reflecting on the question of usefulness, many participants also touched on practical issues. Most mentioned that such a rubric helps to articulate and monitor progress across levels of CT in the STEM disciplines. One raised the concern of how a teacher should proceed upon noticing that a student shows no willingness to reflect critically. Remarks were also made about the specificity of the descriptions in the rubric, which sometimes made it difficult to relate them to varying student answers and made the rubric tedious to use.
The participants’ reflections on practicality were quite diverse. Most teacher educators found the rubrics helpful but also raised some concerns. They questioned whether all aspects were needed for each task and implementation, and advised using topics to cluster aspects and to connect them better to the general rubrics (e.g., with numbers or by using a grid). Some participants would also have appreciated more background information and connections to theory. The concrete examples for implementation were much appreciated by most participants.
When asked about adaptability to their own practice, some participants were unsure how to include the rubric in their teacher education practice. Many answered that they could make use of it, and some even provided examples:
“It is adaptable for the practice, especially for the biology context.”
“Yes. In a mathematical modelling task, students estimate the number of tiles needed to cover a floor. Neutral: just give a number without reasoning. Basic: calculate area, but don’t check assumptions. Proficient: reflect on measuring errors, consider tile gaps, etc.”
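This example lends itself to a small worked sketch contrasting the “basic” and “proficient” responses. All numbers below are hypothetical (the quoted task fixes none of them), and the code is our illustration, not the participant’s:

```python
import math

# Hypothetical dimensions; the quoted task does not specify any of them.
floor_w, floor_l = 6.1, 4.3   # floor size in metres
tile = 0.30                   # edge of a square tile in metres
gap = 0.003                   # grout gap in metres

# Basic level: pure area ratio, ignoring gaps and cutting losses
naive = math.ceil((floor_w * floor_l) / tile**2)   # 292 tiles

# Proficient level: count tiles per row/column including gaps;
# a cut tile at the edge still consumes a whole tile
per_row = math.ceil(floor_w / (tile + gap))        # 21
per_col = math.ceil(floor_l / (tile + gap))        # 15
print(naive, per_row * per_col)                    # 292 vs. 315
```

The gap between the two counts (cutting losses at the edges) is exactly the kind of unchecked assumption that separates the levels in the quote.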
Almost all participants noted that they are willing to use the rubric in their future practice. Typical answers referred to how it helps to elicit aspects of CT and possible progressions. Some acknowledged the importance of creating ownership of such a tool:
“I think for it to be useful for your lessons, you might need to make your own version for that specific assignment/project. Maybe more as a starting point for designing/development of a lesson/assignment.”
Some participants raised the concern that using a tool like this is complex and demands considerable effort from the teacher. A final remark was that users also need to stay critical of a CT tool itself, for instance by leaving space in the rubrics for extensions or other dimensions that users can add or change.

4.7. Cross-Case Analysis

The results are summarised in Table 3 and form the basis for the discussion.

5. Discussion

The rubric developed in this study was not intended as a direct instructional drill for students to simply “do exercises” in CT. Instead, it served primarily as an instrument for conceptual clarification and self-reflection for participants learning about CT and for us as teacher educators teaching it. This approach is based on our underlying strategy for developing critical thinkers: (1) understanding the multiple components of CT and (2) becoming aware of the specific questions one has to ask oneself during problem solving (Halpern, 2003; Rafolt et al., 2019). We conceptualise CT here as a process of leading an “inner dialogue.” This involves posing specific questions to ensure that decisions are determined through logical reasoning and valid data rather than intuition. This conceptualisation aligns with the established literature on metacognition and scientific reasoning (Facione, 1990; Halpern, 2003) as well as digital literacy (Ng et al., 2021).
At the same time, the rubric was designed to support teachers in interpreting and discussing students’ learning development in CT. Its dual function, supporting instructional design and the diagnosis of learning processes, reflects the assumption that CT cannot be reduced to a single performance indicator but unfolds as a complex, context-dependent competence over time (Bailin et al., 1999; Rafolt et al., 2019).
To address our research questions, we structure the discussion along RQ1 (Design and Communication) and RQ2 (Adaptability and Usage), drawing on the cross-case synthesis (Table 3).
Across all country cases, participants confirmed the conceptual plausibility of the rubric and generally agreed that it captures relevant aspects of CT from a STEM perspective. The rubric was perceived as a useful reference point for lesson preparation and professional reflection, indicating that it can communicate core CT dimensions as a shared language for discussing reasoning in STEM contexts. Importantly, this confirmation of construct validity refers primarily to communicative and conceptual value rather than to immediate applicability as a standardised assessment tool.
However, the pilots also show that communicative validity is contingent on operational clarity. Participants repeatedly requested clearer definitions to apply abstract terms concretely in STEM contexts and emphasised the need for task-based examples to make rubric descriptors actionable. Criticism of the level structure suggests that the current calibration may invite misinterpretation: the “basic” level was sometimes perceived as an absence of CT rather than a starting point, while the distance between “intermediate” and “advanced” was perceived as large and potentially unrealistic for school contexts.
Taken together, with respect to RQ1, the rubric appears to function most effectively as a tool for conceptual clarification and reflective planning: it helps (student) teachers articulate what counts as CT in STEM, while its role as a uniform scoring instrument remains limited by ambiguous constructs and interpretation demands.
Across cases, participants consistently emphasised that practical use requires adaptation to specific disciplines, tasks, and classroom settings. Rather than viewing the rubric as a fixed instrument, many participants framed it as a starting point that teachers would need to tailor by selecting dimensions, clustering criteria, adding task-specific descriptors, and, where intended, introducing weightings. In this sense, flexibility emerged as a precondition for usability across diverse STEM contexts, not as an optional feature.
At the same time, participants described several barriers to adaptation and routine use: the perceived time and workload burden, difficulties in applying the rubric to group work and diverse student profiles, and challenges in observing internal processes such as goal orientation or reflection. These constraints point to a need for task-derived rubric variants that reduce interpretive burden, yet teachers may lack the time and experience to develop such adaptations independently.
A central, cross-cutting limitation for RQ2 concerns domain knowledge prerequisites. Participants argued that practices such as checking sources, validating data, or critically comparing information are difficult to enact without substantial subject knowledge and epistemological competence. This finding is consistent with prior research showing that effective engagement with CT in STEM presupposes disciplinary knowledge structures that enable learners and teachers alike to navigate inferential norms and evaluate evidence within specific domains (Osborne, 2014; Reynders et al., 2020). It implies that teachers’ own content knowledge and epistemic beliefs are key boundary conditions for implementing the rubric meaningfully across STEM disciplines.
Overall, regarding RQ2, the findings suggest that the rubric is adaptable in principle but practically viable mainly as a flexible framework for instructional design and reflection, whose successful use depends on supported local tailoring and sufficient epistemic and disciplinary resources.
We observed a significant shift in function from grading to instructional design. Participants frequently rejected the rubric as a rigid scoring sheet due to perceived subjectivity and complexity yet expressed willingness to use it as a scaffolding tool for structuring lessons, visualising CT dimensions, and planning interventions. This shift reflects a broader theoretical insight: CT is not a linear, additive skill that can be reliably captured through standardised scoring but a qualitative form of reasoning whose evaluation requires interpretative judgment.
Criticism directed at specific rubric dimensions also indicates that several constructs commonly invoked in contemporary STEM education remain only vaguely operationalised. Similar observations have been made in the broader literature on CT assessment, where rubric-based approaches consistently reveal the gap between normative expectations and the practical capacity of teachers to apply evaluative criteria consistently (Saxton et al., 2012). In this sense, the rubric exposes rather than creates conceptual ambiguities, highlighting areas where clearer definitions and exemplars are required for classroom use.
These findings have direct implications for professional development. The discrepancy between high willingness to use the tool and the barriers identified suggests that simply providing the rubric is insufficient. Future programmes should enable teachers to apply CT themselves, integrate CT explicitly into subject-matter training, use the rubric as a self-assessment tool during training to identify reasoning gaps, and strengthen teachers’ capacity to diagnose thinking processes rather than solely grade outcomes. This is particularly relevant given the documented gap between policy-level CT mandates and actual classroom implementation across STEM disciplines, which underscores that rubric-based tools can only unfold their potential within a broader and sustained professional development framework.

6. Limitations

Several limitations of the present study warrant explicit discussion. First, the sampling strategy relies on convenience sampling across all four national implementations. Participants were recruited through project partner networks, meaning the sample is neither random nor representative of the broader population of STEM teachers or teacher educators in the respective countries. However, this limitation is inherent to the formative phase of educational design research, where the goal is not statistical representativeness but the generation of rich, context-sensitive feedback from participants with sufficient domain expertise to evaluate the tool meaningfully (McKenney & Reeves, 2019). The deliberate variation across countries, target groups, and educational levels partially compensates for this by introducing a breadth of perspectives that would not be achievable through a homogeneous sample.
Second, the study’s empirical basis rests exclusively on teacher and educator perceptions of the rubric. No direct assessment of student CT performance was conducted. This is a genuine scope limitation: claims regarding the rubric’s capacity to foster CT in learners cannot be derived from the present data. At the same time, this limitation is consistent with the study’s explicit purpose, which was to evaluate the rubric’s communicative validity and practical usability from a teacher perspective, a necessary precondition before any claims about student-level impact can responsibly be investigated.
Third, the data collection took place in workshop settings facilitated by members of the research team, who were simultaneously the designers of the rubric under evaluation. This dual role introduces a potential source of social desirability bias. This risk was partially mitigated by the use of written questionnaires alongside oral discussion, which allowed participants to respond independently and anonymously, and by the cross-national structure of the study, in which data collection was carried out by different local facilitators rather than a single centralised team.
Fourth, the study does not address inter-rater reliability in the application of the rubric. Given the concerns raised by participants about the subjectivity of certain criteria, this is an important direction for future work. The present study was not designed to establish the psychometric properties of the rubric; its aim was formative and exploratory. Systematic reliability testing constitutes a clear and necessary next step toward the rubric’s broader validation.

7. Conclusions

The rubric for CT in STEM developed in this study was primarily described by participants as most useful when not treated as a static, standalone artefact or a standardised grading sheet. Instead, it serves as a flexible framework for reflection and instructional design. Its successful implementation requires embedding within a broader professionalisation framework in which teachers are supported in developing the very competencies, both in content knowledge and critical reasoning, that they are expected to foster in their students. In this respect, the rubric not only has the potential to support the teaching of CT but also indicates that CT is perceived by participants as a demanding instructional objective that often requires explicit support.
Notwithstanding the limitations discussed above, particularly the reliance on convenience sampling and the exclusive focus on teacher perceptions rather than direct measures of student CT development, the findings provide theoretically grounded and empirically supported evidence for the rubric’s value as a tool for professional reflection and instructional design. The consistency of the key findings across four nationally and institutionally distinct settings lends credibility to the conclusions drawn, even in the absence of statistical representativeness.
Finally, this study opens a discussion of further efforts to implement CT in STEM education. CT is commonly described in the literature as an essential transversal skill required to navigate complex Socio-Scientific and technological developments, and this study provides suggestions for conceptualising, addressing, and monitoring this skill. Further work is needed to disseminate the rubrics, their dimensions and levels, and the accompanying tasks in the different STEM disciplines, and to support their implementation. Detailed, lesson-based examples can help illustrate how CT might be fostered in daily STEM practice and clarify its role in conducting science and interpreting scientific results. Moreover, practical tools for fostering CT are needed to move beyond general pleas for students to become resilient and more critical of the diverse and sometimes contradictory information they are confronted with.
In addition, it is crucial not to relate CT solely to scientific reasoning, understood as a specific form of CT applied to scientific contexts. CT also encompasses more general everyday situations as well as broader aspects of civic engagement, including fact-checking, argumentation, cognitive biases, attitude formation, decision making, ethics, values, freedom, and responsibility. In this respect, the proposed model represents a deliberate simplification, and the selection of its dimensions remains open to further empirical examination and revision.

Author Contributions

Conceptualization, M.B., M.D., S.K., K.M. and L.W.; methodology, M.B., M.D., S.K., O.S. and L.W.; formal analysis, M.B., M.D., O.S. and L.W.; writing—original draft preparation, M.B., M.D., S.K., O.S. and L.W.; writing—review and editing, M.B., M.D., S.K., O.S., L.W.; funding acquisition, K.M. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by the European Union—NextGenerationEU through the National Recovery and Resilience Plan 2021–2026 (institutional grant of the University of Zagreb Faculty of Science, IK IA 1.1.3. Impact4Math).

Institutional Review Board Statement

The study was anonymised in such a way that it complied with our ethics protocols; therefore, no formal approval was necessary.

Informed Consent Statement

Informed consent was obtained from all participants involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

We would like to thank STEMkey project members Elena Köck and Andrea Šoporová Chochoľáková for their contributions to developing initial versions of the framework for the rubric, and the prospective student teachers and teacher educators involved in the workshops for providing us with valuable feedback.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CT: Critical Thinking
STEM: Science, Technology, Engineering, Mathematics

Appendix A

Table A1. Full rubric.

Basic level. The basic level is characterised by modest use of critical thinking while solving a problem:
- No quality check of external sources of information
- Almost no appearance of contextual knowledge
- Alternative methods are not taken into consideration
- Tools are used with systematic errors
- No normative considerations
- Explicit arguments for a decision not given
- Visible lack of motivation or perseverance
- Presentation is sloppy and incoherent
- No reflection about the answer

Intermediate level. The intermediate level is characterised by solving a problem with certain elements of critical thinking without using its full potential:
- Only limited sources of information used, and sources are poorly checked
- The variation of methods is limited to procedures shown by others in similar contexts
- Arguments are given, but with limited knowledge and potential to be generalised
- Use of tools and data processing follows standard procedures, but it is flawed or misinterpreted
- All steps of the process presented, but the structure, coherence, or attractivity of the presentation could be improved
- Occasional understanding of norms and values
- Limited perseverance, based on external motivation
- In the reflection, the solution is checked, and the answer is evaluated inside the context, but the metacognitive relation to the whole process is rather weak

Advanced level. The advanced level is characterised by solving a problem based on extensive experience and a professional approach, exhibiting critical thinking in every aspect:
- Multiple sources are considered and selected based on quality checks
- A variety of methods is used or even invented for the purpose of analysing and solving the problem
- Conclusions are coherent, logical, and supported by theoretical and empirical arguments, based on sources and considering norms and values relevant for the problem
- Tools are used efficiently and in original ways
- Multiple perspectives considered with care
- Clear, structured, coherent, attractive, and comprehensive presentation
- Goal orientation is continuously present, and capability to act reasonably and rationally is exhibited throughout the process
- Reflection is rich, includes the higher aims of the activity and the possibility to evaluate findings in a wider context
Table A2. Example of the rubric adapted to the task on measurement skills.

Basic level. The basic level is characterised by modest use of critical thinking and creativity while solving a problem:
- In the calculation, no reference to different dimensions or to one number being more precise than possible
- No use of knowledge about school yards (or personal reference points)
- In the advice, no reference to other measuring tools/strategies
- No underpinning of the advice
- No reflection on a circumference that differs significantly from 100 m

Intermediate level. The intermediate level is characterised by solving a problem with certain elements of critical thinking and creativity, without using its full potential:
- A reference to different dimensions or to one number being more precise than possible
- Calculation is basically adding results with the same dimension and dividing by the number of results
- The advice includes a reflection on the limitation of step counting as a measuring strategy
- The answer also includes a reference to what is possible for a real school yard (e.g., why 1 km needs to be excluded)

Advanced level. The advanced level is characterised by solving a problem based on extensive experience and a professional approach, exhibiting critical thinking and creativity in every aspect:
- Explicit reflection on different dimensions and on one number being more precise than possible
- Calculation selects only steps (underpinned) and includes a reference to possible variation in the lengths of a step
- The advice includes a reflection on the limitation of step counting as a measuring strategy and suggestions for alternative measuring strategies
- The answer includes a reference to what is possible for a real school yard and positions the size in a wider context (e.g., by referring to other yards or the size of the school)

The task: Mobile phones have an activity meter that measures the number of steps during the day and the distance covered. The collected results of measuring the circumference of a school yard were: 100 steps, 98 steps, 93 steps, 103 steps, 100 m, 97.45 m, and about 1 km.
What would you do with these results to determine the circumference?
What could be the cause of the differences between these results? How would you improve the measurements?
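The data in this task also allow a minimal worked sketch of the calculations the adapted rubric describes. The sketch below is our illustration, not part of the original task materials; in particular, the step length of 1.0 m is a hypothetical assumption (roughly calibrated against the metre readings), and questioning its variation is precisely the reflection the advanced level asks for.

```python
steps = [100, 98, 93, 103]   # results reported in steps
metres = [100.0, 97.45]      # results reported in metres
# "about 1 km" is excluded as implausible for a real school yard

# Intermediate level: average only the results that share a dimension
mean_steps = sum(steps) / len(steps)       # 98.5 steps
mean_metres = sum(metres) / len(metres)    # 98.725 m

# Advanced level (as in the rubric): select only the step counts, convert
# them with an assumed step length, and keep its variation in mind.
# Comparing mean_steps with mean_metres suggests a step of roughly 1 m.
step_length = 1.0                          # metres per step (assumption)
circumference = mean_steps * step_length
print(f"Estimated circumference: {circumference:.1f} m")  # 98.5 m
```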
Table A3. Example of the rubric adapted to a “car racing task”.

Neutral level. The neutral level is characterised by the absence of a critical approach to solving a problem:
- The naïve solution that the graph is curved at the places where the track is curved
- Misinterpreted variables on the axes of the graph
- No perspective of the driver of the car
- Sloppy or obviously incorrect graphs

Intermediate level. The intermediate level is characterised by solving a problem with certain elements of critical thinking and creativity, without using its full potential:
- Applying a strategy that works only for some tracks that are similar to a shown example
- Explicitly linking the curvature of the graph with the curvature of the track
- Neat graphs, although not always correct
- Description of the reasoning in written text
- Evident change of a graph during the solving process
- Comparing the graph and concluding that it is or is not correct

Advanced level. The advanced level is characterised by solving a problem based on extensive experience and a professional approach, exhibiting critical thinking and creativity in every aspect:
- Evidence of at least two ways of explaining the shape of the graph
- Making up and discussing different types of tracks and graphs, not given by any source
- Confidence in the graphs and the reasoning behind them
- Identifying general patterns, such as constant curvature of the track leading to constant speed
- Providing explanations for the efficiency of the method that directly connects the curvature of the track and the graph
- Expressing the value of functional thinking
- Providing explanations considering practical and psychological aspects of a race
- Discussing the difference between time dependence and space dependence of the speed
- Discussing possible misconceptions or pitfalls that beginners might encounter

The task: The students have the task to study different racing tracks and associate each with the graph that describes the dependency of the speed along the track. How would you evaluate students’ approach and reasoning on this task?
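For readers interested in the physical reasoning behind the “constant curvature of the track leads to constant speed” pattern named above, one common simplification (our illustration, not part of the task materials) assumes the driver keeps the lateral acceleration below a grip limit \(a_{\max}\). At a point of the track with curvature \(\kappa = 1/R\), this bounds the speed:

\[
a_{\text{lat}} = \frac{v^{2}}{R} = \kappa v^{2} \le a_{\max}
\quad\Longrightarrow\quad
v \le \sqrt{\frac{a_{\max}}{\kappa}},
\]

so sharper curves force lower speeds, and a section of constant curvature admits a constant maximal speed. This deliberately ignores the acceleration and braking phases between curves.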
Table A4. (Self-)evaluation table that can be adapted for different dimensions. Each dimension is rated at one of three levels (Basic, Intermediate, or Advanced); the cells are left empty to be filled in during (self-)evaluation.

Dimensions:
- Quality check of resources
- Variety of methods
- Originality of ideas and approaches
- Argumentation and coherence of conclusions
- Use of tools and data
- Understanding and consideration of norms and values
- Communication of results
- Presence of goal orientation
- Richness of reflection
Table A5. Description of each dimension according to level.

Quality check of resources. Basic: no quality check of external sources of information. Intermediate: only limited sources of information used, and sources are poorly checked. Advanced: multiple sources are considered and selected based on quality checks.

Variety of methods. Basic: alternative methods are not taken into consideration. Intermediate: the variation of methods is limited to procedures shown by others in similar contexts. Advanced: a variety of methods is used or even invented for the purpose of analysing and solving the problem.

Argumentation and coherence of conclusions. Basic: explicit arguments for a decision not given. Intermediate: arguments are given, but with limited knowledge and potential to be generalised. Advanced: conclusions are coherent, logical, and supported by theoretical and empirical arguments, based on sources and considering norms and values relevant for the problem.

Use of tools and data. Basic: tools are used with systematic errors. Intermediate: use of tools and data processing follows standard procedures, but it is flawed or misinterpreted. Advanced: tools are used efficiently and in original ways.

Understanding and consideration of norms and values. Basic: no normative considerations. Intermediate: occasional understanding of norms and values. Advanced: multiple perspectives considered with care.

Communication of results. Basic: presentation is sloppy and incoherent. Intermediate: all steps of the process presented, but the structure, coherence, or attractivity of the presentation could be improved. Advanced: clear, structured, coherent, attractive, and comprehensive presentation.

Presence of goal orientation. Basic: visible lack of motivation or perseverance. Intermediate: limited perseverance, based on external motivation. Advanced: goal orientation is continuously present, and the capability to act reasonably and rationally is exhibited throughout the process.

Richness of reflection. Basic: no reflection about the answer. Intermediate: in the reflection, the solution is checked, and the answer is evaluated inside the context, but the metacognitive relation to the whole process is rather weak. Advanced: reflection is rich and includes the higher aims of the activity and the possibility to evaluate findings in a wider context.
Based on the rubric, teachers can use the following (non-exhaustive) list of questions while preparing lessons, or pose them to students to support their problem-solving process and improve their CT skills:
· Do we understand the problem and its context?
· Which competences might help us in investigating the problem further?
· Do we know similar tasks and methods to solve them?
· How many different approaches can we come up with to solve the given problem?
· Have we allowed ourselves to think “outside of the box”?
· Which sources of information do we consider using?
· Have we checked the information against multiple sources and checked their reliability?
· Have we included all the key information that is available to us?
· Do we have enough measurements, or might our sample be biased?
· How precise are our measuring tools?
· Do we respect the prescribed procedures for using the tools and processing data?
· What kind of answer do we accept, and which solutions do we consider to be “good”?
· Which values and norms condition our reasoning?
· Do we understand social values, norms, and rules well enough to act properly?
· Do we understand the motivation of the subject or object we are confronted with?
· Are we motivated to complete the task? What is our motivation?
· Do we have enough evidence for our conclusions?
· How do we make sure that our reasoning is correct?
· How much time did we invest in the solving process?
· How do we present our solution to others?
· Do we respect the norms and principles that our community uses in communication?
· Is the solution that we reached understandable and meaningful to us?
· Does the solution make sense with respect to the context of the problem?
· Are we able to reflect on, interpret, and evaluate the solution?
· Could a more general viewpoint provide us with a deeper understanding of the solution?
· Are there more aspects/dimensions that we can consider in our reflection?
· What have we learned from the interaction with this problem?

References

1. Aliakbari, M., & Sadeghdaghighi, A. (2013). Teachers’ perception of the barriers to critical thinking. Procedia-Social and Behavioral Sciences, 70, 1–5.
2. Bacigalupo, M. (2022). Competence frameworks as orienteering tools. RiiTE Revista Interuniversitaria de Investigación en Tecnología Educativa, 12, 20–33.
3. Bacovic, M., Andrijasevic, Z., & Pejovic, B. (2022). STEM education and growth in Europe. Journal of the Knowledge Economy, 13(3), 2348–2371.
4. Baidoo-Anu, D., & Ansah, L. O. (2023). Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of ChatGPT in promoting teaching and learning. Journal of AI, 7(1), 52–62.
5. Bailin, S. (2002). Critical thinking and science education. Science & Education, 11(4), 361–375.
6. Bailin, S., Case, R., Coombs, J. R., & Daniels, L. B. (1999). Conceptualizing critical thinking. Journal of Curriculum Studies, 31(3), 285–302.
7. Bakker, A. (2018). Design research in education: A practical guide for early career researchers. Routledge.
8. Balán, L. (2025). Educational quality in strategic negotiation: A critical discourse analysis of the rebranding of the PISA 2025 science test rationale. Research Papers in Education, 40(6), 871–894.
9. Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency (pp. 610–623). Association for Computing Machinery.
10. Bianchi, G., Pisiotis, U., & Cabrera Giraldez, M. (2022). GreenComp: The European sustainability competence framework (pp. 1–40). EUR 30955. Publications Office of the European Union.
11. Breiner, J. M., Harkness, S. S., Johnson, C. C., & Koehler, C. M. (2012). What is STEM? A discussion about conceptions of STEM in education and partnerships. School Science and Mathematics, 112(1), 3–11.
12. Bybee, R. W. (2013). The case for STEM education: Challenges and opportunities. NSTA Press.
13. Bybee, R. W., Taylor, J. A., Gardner, A., Van Scotter, P., Powell, J. C., Westbrook, A., & Landes, N. (2006). The BSCS 5E instructional model: Origins and effectiveness (Vol. 5). BSCS.
14. Choy, S. C., & Cheah, P. K. (2009). Teacher perceptions of critical thinking among students and its influence on higher education. International Journal of Teaching and Learning in Higher Education, 20(2), 198–206.
15. Daugherty, M. K. (2013). The prospect of an “A” in STEM education. Journal of STEM Education: Innovations and Research, 14(2), 10–15.
16. Dresing, T., & Pehl, T. (2018). Praxisbuch Interview, Transkription & Analyse: Anleitungen und Regelsysteme für qualitativ Forschende. Dr. Dresing & Pehl GmbH.
17. Drost, E. A. (2011). Validity and reliability in social science research. Education Research and Perspectives, 38(1), 114–121.
18. Elder, L., & Paul, R. (2022). Critical thinking. Routledge.
19. Ennis, R. H. (1993). Critical thinking assessment. Theory into Practice, 32(3), 179–186.
20. Facione, P. (1990). Critical thinking: A statement of expert consensus for purposes of educational assessment and instruction (The Delphi Report). The California Academic Press.
21. Fleming, W., Hayes, A. L., Crosman, K. M., & Bostrom, A. (2021). Indiscriminate, irrelevant, and sometimes wrong: Causal misconceptions about climate change. Risk Analysis, 41(1), 157–178.
22. Freire, P. (2020). Pedagogy of the oppressed. In Toward a sociology of education (pp. 374–386). Routledge.
23. Halawa, S., Lin, T. C., & Hsu, Y. S. (2024). Exploring instructional design in K-12 STEM education: A systematic literature review. International Journal of STEM Education, 11(1), 43.
24. Halpern, D. F. (2003). Thought & knowledge: An introduction to critical thinking. Lawrence Erlbaum Associates.
25. Hitchcock, D. (2017). Critical thinking as an educational ideal. In On reasoning and argument: Essays in informal logic and on critical thinking (pp. 477–497). Springer International Publishing.
26. Hsu, Y.-S., Tang, K.-Y., & Lin, T.-C. (2023). Trends and hot topics of STEM and STEM education: A co-word analysis of literature published in 2011–2020. Science & Education, 33, 1069–1092.
27. Kelley, T. R., & Knowles, J. G. (2016). A conceptual framework for integrated STEM education. International Journal of STEM Education, 3(1), 11.
28. Koro-Ljungberg, M., Yendol-Hoppey, D., Smith, J. J., & Hayes, S. B. (2009). (E)pistemological awareness, instantiation of methods, and uninformed methodological ambiguity in qualitative research projects. Educational Researcher, 38(9), 687–699.
29. Kuhn, D. (1999). A developmental model of critical thinking. Educational Researcher, 28(2), 16–26.
30. Lai, E. R. (2011). Critical thinking: A literature review. Pearson’s Research Reports, 6(1), 40–41.
31. Li, Y., Wang, K., Xiao, Y., & Froyd, J. E. (2020). Research and trends in STEM education: A systematic review of journal publications. International Journal of STEM Education, 7(1), 11.
32. Long, D., & Magerko, B. (2020). What is AI literacy? Competencies and design considerations. In Proceedings of the 2020 CHI conference on human factors in computing systems (pp. 1–16). Association for Computing Machinery (ACM).
33. Lu, J., Si, H., Xu, J., & Xu, T. (2025). An overview of applications and trends of STEM for learning effectiveness: An umbrella review based on 22 meta-analyses. Educational Research Review, 48, 100712.
34. Machete, P., & Turpin, M. (2020). The use of critical thinking to identify fake news: A systematic literature review. In Conference on e-business, e-services and e-society (pp. 235–246). Springer.
35. Mayring, P. (2014). Qualitative content analysis: Theoretical foundation, basic procedures and software solution. Social Science Open Access Repository. Available online: https://nbn-resolving.org/urn:nbn:de:0168-ssoar-395173 (accessed on 3 March 2026).
36. McKenney, S., & Reeves, T. (2019). Conducting educational design research. Routledge.
37. Moore, T. (2013). Critical thinking: Seven definitions in search of a concept. Studies in Higher Education, 38(4), 506–522.
38. Ng, D. T. K., Leung, J. K. L., Chu, S. K. W., & Qiao, M. S. (2021). Conceptualizing AI literacy: An exploratory review. Computers and Education: Artificial Intelligence, 2, 100041.
39. OECD. (2021). 21st-century readers: Developing literacy skills in a digital world. OECD Publishing.
40. OECD. (2023). PISA 2025 science framework (draft). Available online: https://pisa-framework.oecd.org/science-2025/assets/docs/PISA_2025_Science_Framework.pdf (accessed on 3 March 2026).
41. Ortiz-Revilla, J., Greca, I. M., & Arriassecq, I. (2022). A theoretical framework for integrated STEM education. Science & Education, 31(2), 383–404.
42. Osborne, J. (2014). Teaching scientific practices: Meeting the challenge of change. Journal of Science Teacher Education, 25(2), 177–196.
43. Owens, D. C., & Sadler, T. D. (2023). Socio-scientific issues instruction for scientific literacy: 5E framing to enhance teaching practice. School Science and Mathematics, 124(3), 203–210.
44. Paul, R., & Elder, L. (2013). Critical thinking: Tools for taking charge of your professional and personal life. Pearson Education.
45. Paul, R. W., Elder, L., & Bartell, T. (1997). California teacher preparation for instruction in critical thinking: Research findings and policy recommendations. California Commission on Teacher Credentialing.
46. Rafolt, S., Kapelari, S., & Kremer, K. (2019). Kritisches Denken im naturwissenschaftlichen Unterricht – Synergiemodell, Problemlage und Desiderata. Zeitschrift für Didaktik der Naturwissenschaften, 25(1), 63–75.
47. Reynders, G., Lantz, J., Ruder, S. M., Stanford, C. L., & Cole, R. S. (2020). Rubrics to assess critical thinking and information processing in undergraduate STEM courses. International Journal of STEM Education, 7(1), 9.
48. Rosa, M., Orey, D. C., & de Sousa Mesquita, A. P. S. (2023). An ethnomodelling perspective for the development of a citizenship education. ZDM Mathematics Education, 55(5), 953–965.
49. Sadler, T. D. (2004). Informal reasoning regarding socioscientific issues: A critical review of research. Journal of Research in Science Teaching, 41(5), 513–536.
50. Saxton, E., Belanger, S., & Becker, W. (2012). The Critical Thinking Analytic Rubric (CTAR): Investigating intra-rater and inter-rater reliability of a scoring mechanism for critical thinking performance assessments. Assessing Writing, 17(4), 251–270.
51. Sermeus, J., De Cock, M., & Elen, J. (2021). Critical thinking in electricity and magnetism: Assessing and stimulating secondary school students. International Journal of Science Education, 43(16), 2597–2617.
52. Siegel, H. (1988). Educating reason: Rationality, critical thinking and education. Routledge.
53. Singer-Brodowski, M. (2023). The potential of transformative learning for sustainability transitions: Moving beyond formal learning environments. Environment, Development and Sustainability, 27, 20621–20639.
54. Skovsmose, O. (2020). Critical mathematics education. In Encyclopedia of mathematics education (pp. 154–159). Springer International Publishing.
55. Solomon, F., Champion, D., Steele, M., & Wright, T. (2022). Embodied physics: Utilizing dance resources for learning and engagement in STEM. Journal of the Learning Sciences, 31(1), 73–106.
56. Thornhill-Miller, B., Camarda, A., Mercier, M., Burkhardt, J.-M., Morisseau, T., Bourgeois-Bougrine, S., Vinchon, F., El Hayek, S., Augereau-Landais, M., Mourey, F., Feybesse, C., Sundquist, D., & Lubart, T. (2023). Creativity, critical thinking, communication, and collaboration: Assessment, certification, and promotion of 21st century skills for the future of work and education. Journal of Intelligence, 11(3), 54.
57. Tricot, A., & Sweller, J. (2014). Domain-specific knowledge and why teaching generic skills does not work. Educational Psychology Review, 26(2), 265–283.
58. van der Linden, S., Leiserowitz, A., Rosenthal, S., & Maibach, E. (2017). Inoculating the public against misinformation about climate change. Global Challenges, 1(2), 1600008.
59. Voogt, J., & Roblin, N. P. (2012). A comparative analysis of international frameworks for 21st century competences: Implications for national curriculum policies. Journal of Curriculum Studies, 44(3), 299–321.
60. Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes (Vol. 86). Harvard University Press.
61. Willingham, D. T. (2007). Critical thinking: Why is it so hard to teach? American Educator, 31(2), 8–19.
62. Wineburg, S., McGrew, S., Breakstone, J., & Ortega, T. (2016). Evaluating information: The cornerstone of civic online reasoning. Stanford Digital Repository.
63. Zeidler, D. L. (2014). Socioscientific issues as a curriculum emphasis: Theory, research, and practice. In Handbook of research on science education (Vol. II, pp. 697–726). Routledge.
64. Zhan, Z., Shen, W., Xu, Z., Niu, S., & You, G. (2022). A bibliometric analysis of the global landscape on STEM education (2004–2021): Towards global distribution, subject integration, and research trends. Asia Pacific Journal of Innovation and Entrepreneurship, 16(2), 171–203.
Table 1. Example of the levels in the rubric for the dimension “quality check of resources”.

Basic: No quality check of external sources of information.
Intermediate: Only limited sources of information are used, and sources are poorly checked.
Advanced: Multiple sources are considered and selected based on quality checks.
Table 2. Example of the levels in the rubric adapted for the measuring task for the dimension “variety of methods”.

Basic: Only the step-counting strategy, with no reference to other measuring tools/strategies.
Intermediate: Includes a reflection on the limitation of step counting as a measuring strategy.
Advanced: Includes a reflection on the limitation of step counting as a measuring strategy and suggestions for alternative measuring strategies.
Table 3. Summary of findings.

Concept validity (dimensions and levels)
- Croatia: Synthesises CT well; served as an introduction to CT. Some concerns relate to the levels (descriptions are more advanced than the labels suggest).
- Austria: Beneficial framework, but does not capture students’ learning; too general.
- Germany: Levels are seen as not perfectly calibrated. Adaptation needed: inclusion of specific definitions and a weighting system.
- The Netherlands: Broad agreement that the rubric captures key aspects of CT in STEM, but uncertainty about full coverage of all CT aspects. CT progression across levels seen as meaningful. Difficulty mapping diverse student responses to fixed level descriptions; some require a better connection to theory.

Practicality and barriers
- Croatia: Half of the students see practicality, while some concerns relate to complexity and the time needed for classroom use.
- Austria: Positive for structure and overview, but use requires additional workload. Limitations for group work or introverted students. Goal orientation and reflection are difficult to observe; levels are hard to distinguish.
- Germany: Knowledge barriers due to a lack of subject-specific knowledge. Subjectivity as a barrier: practicability is not given yet. Several constructs, including efficiency and creativity, were considered not operationalisable. Potential for specific, open-ended tasks.
- The Netherlands: Rubric perceived as helpful but demanding. Complexity and time effort seen as major barriers, with a risk of overburdening assessment practices. Better connection to theory required. Not all CT aspects considered necessary for every task. The rubric alone does not develop critical thinkers.

Willingness
- Croatia: Most students are positive about using the tool in a variety of situations (classroom, self-reflection, and teacher education). A concern is raised about levels becoming labels for diverse students.
- Austria: Positive about use as a supporting tool for teachers.
- Germany: Willing to use it as orientation and for reflection.
- The Netherlands: Almost all declared a high intention to use the rubric if able to adapt it to personal needs.

Usefulness/purpose
- Croatia: Positive about using it as a tool for learning, teaching, and assessment.
- Austria: Positive as a supporting tool in teaching, but not for assessing student learning.
- Germany: Viewed positively as a diagnostic instrument, but several barriers were identified, leading to recommendations against its use in its current form due to the perceived subjectivity of key criteria.
- The Netherlands: Positive as a tool to monitor students’ progress; could work as a starting point for designing/developing a lesson or assignment.

Adaptability
- Croatia: Perceived as a fixed tool.
- Austria: Positive, but adaptation requires additional workload; needs adaptation for social forms.
- Germany: Potential is recognised, but adaptation is needed.
- The Netherlands: Adaptable to specific disciplines (biology and mathematics). Strict adherence is not always possible, and there is a need for personalisation.