1. Introduction
The study of scientific thinking dates back nearly a century and has been of interest to psychologists and science educators alike [1,2]. Its origins lie in Piaget’s [3] theory of stages of cognitive development, which posits that adolescents are able to evaluate evidence and reason in hypothetical terms. One main focus of research covers the development of domain-general strategies of reasoning and problem-solving [2]. Terminology varies by discipline as well as by focus on research or teaching, yielding terms such as scientific reasoning, critical thinking (with regard to a science context), scientific discovery, scientific inquiry, or inquiry learning [2,4]. For the purpose of this article, scientific inquiry is used to describe teaching methods and activities aimed at the process of gaining scientific knowledge [5,6], whereas scientific reasoning (SR) refers to the cognitive skills required during a scientific inquiry activity [1,6,7]. Due to its cognitive nature, SR is viewed as a competency consisting of a complex set of skills [8,9].
SR is needed for acquiring scientific knowledge [10], making it an essential competency in science and a key competency for societies striving for technical progress. Therefore, developing SR is a central goal of school science as well as of science-teacher preparation worldwide [11,12,13,14,15,16,17,18,19]. In the context of teaching, SR can be fostered by inquiry teaching activities [20]. Teachers, however, perceive the integration of inquiry into their lessons to be a difficult task [21,22], particularly if they have never conducted inquiry experiments before [5]. SR and methods of inquiry teaching should therefore already be introduced in preservice science-teacher education [5,15,23].
1.1. Models for Scientific Reasoning Competency
SR can be studied in problem-solving situations, making it accessible to assessment [1,24]. Using a “simulated discovery context” [1] (p. 3) gives detailed insight into the underlying processes while maintaining control over situational boundaries and the prior knowledge needed by the participants. So far, several models of scientific reasoning ranging from global measures to more fine-grained descriptions have been proposed, yielding various instruments for measuring SR competency [25]. Models can be differentiated by target group, i.e., school and/or university context or other target groups [25], and by their purposes, such as assessment [10], description of learners’ activities in the process of SR and problem-solving [26,27], or description of important aspects of inquiry teaching [28]. Following Klahr and Dunbar’s [29] “Dual Space Search,” SR models usually cover the domain “conducting scientific investigations” [24] and exhibit at least a three-fold structure involving the facets “hypothesis,” “experiment,” and “conclusion.” However, the respective models differ in their degree of sub-division into up to nine facets (see Figure 1) [10,24,26,28,29,30,31,32,33,34,35,36,37]. For example, the model developed by Fischer and colleagues [30] also covers problem identification, that is, the analysis of the underlying phenomenon before the generation of hypotheses. The skill “generating hypotheses” is divided into “questioning” and “hypothesis generation.” Moreover, “testing hypotheses” is subdivided into a construction phase (cf. the different notions of planning in Figure 1) and “evidence generation,” which is comparable to experimentation and observation [26,28,31]. The last facet of Klahr and Dunbar’s [29] model can be divided into “evidence evaluation,” such as the preparation and interpretation of experimental findings [26,32], and “drawing conclusions.” “Communicating and scrutinizing” completes the facets [30], whereas, for example, in Kambach’s model [26], communication is viewed as a skill relevant to all processes involved in SR. Some models also specify documentation [28] as a separate facet, while other models see it as a cross-process skill relevant to all other facets [26]. Recently, modelling skills have also been proposed to complement skills in conducting investigations, covering a broader notion of SR [10,33,34].
For this study, parts of Kambach’s [26] and Nawrath et al.’s [28] models were combined to form a fine-grained model for measuring the SR competency of preservice chemistry teachers. The skills formulating research questions, generating hypotheses, planning experiments, observing and measuring, preparing data for analysis, and drawing conclusions were included. In contrast to Kambach, the ability to specify the phenomenon is not assessed; nevertheless, a phenomenon underlies the test instrument: it is presented to the participants in an introduction and thereafter worked on. Furthermore, setting up experiments (cf. [28]) is not applied in our work, as this skill does not play a central role in scientific reasoning in chemistry and partly overlaps with practical work skills [24]. Documentation is understood as an overarching skill following Kambach [26] and is thus not part of the skill observing and measuring, contrary to the model by Nawrath et al. [28].
1.2. Assessment of Scientific Reasoning
SR competency can be assessed with domain-specific as well as domain-general instruments in different test formats, such as paper-pencil tests or experimental tests. For the latter, real or virtual/mental experiments can be used [25]. Both types can employ multiple-choice items [10,34] or open formats [27] and assess theoretical notions of SR, performance-related measures [26,27], or self-assessments [35]. Some studies also use mixed formats [25]. The different methods for eliciting these skills have different properties regarding time consumption, practicability, individual diagnostics, external influences, congruence, and simultaneity [27]. Performance tests have shown significantly larger effect sizes than multiple-choice tests [38]. Moreover, as Shavelson [9] pointed out, multiple-choice assessment can hardly be seen as a situation closely representing real life. Furthermore, multiple-choice questions may measure knowledge that participants are able to state but not to put into practice (inert knowledge, see for instance [39]). Hence, a performance-based assessment format seems to be more closely tied to the competency to be inferred. However, performance-based assessment can be time-consuming for participants and researchers alike. In addition, there are differences in measuring skills in individual and group work: while group work enhances communication and therefore makes thoughts accessible to the researcher [40], groups have been demonstrated to outperform individuals in problem solving [41], which may limit the reliability of assessing individual skills in group-work situations.
Adults, and therefore also preservice teachers, experience difficulties in dealing with SR tasks: they tend to design confounded experiments or to misinterpret evidence in order to verify their beliefs [1]. Kunz [36], Khan and Krell [42], as well as Hartmann et al. [10] found higher SR competency in preservice science teachers studying two natural science subjects, while this made no difference in a study conducted by Hilfert-Rüppell et al. [43]. Furthermore, students’ SR competency differs by school type and progression in university studies [10,34,42]. Kambach’s [26] findings suggest that preservice biology teachers either are very apt in describing phenomena, generating hypotheses, and interpreting results or do not show these processes at all; the other skills show more variation across the sample. However, students also lack experimental precision and demonstrate deficient reasoning for their choice of material when planning investigations. While conducting experiments, they tend not to consider blanks and hardly ever plan intervals or end points while measuring. Finally, they tend not to prepare their data for analysis or refer back to their hypotheses while interpreting. Overall, Kambach’s sample demonstrates variation in SR competency across the entire scale [26]. Hilfert-Rüppell et al. [43] demonstrated that preservice science teachers’ skills in generating hypotheses and planning investigations are deficient. However, they found that students’ skill in planning investigations is moderated by their skill to generate hypotheses.
1.3. Learning Activities Supporting the Formation of Scientific Reasoning Competency
While the empirical and pedagogical literature offers various ideas and propositions for incorporating scientific inquiry into learning environments in schools and universities [44,45,46], preservice science-teacher education still lacks inquiry learning activities [5,32]. For instance, lab courses mainly employ cook-book experiments [26,32,47]. If at all, prospective teachers come into contact with forms of inquiry learning only in teaching methods courses in graduate education [10] or, if offered, while working or training in school laboratories [26]. Khan and Krell [42] therefore suggested a combination of contextualized, authentic scientific problem solving and its application to new contexts with tasks to reflect on problem solving and scientific reasoning on a meta level.
Still, the laboratory already “is the place of information overload” [48] (p. 266). Traditional cook-book experiments demand that students conduct, observe, and note (and, hopefully, also interpret and understand) an enormous number of elements [48]. However, most of these demands are clearly stated in the laboratory instructions. Working on a problem-solving experiment, students need to perform a similar number of tasks as well as additional cognitive activities, such as understanding the problem and devising their own strategy for solving it. Since these demands occupy working memory capacity not directed at learning, open inquiry in particular is seen as ineffective due to cognitive overload [49]. Furthermore, unfamiliarity with the method of problem-solving experiments from previous laboratories might add to these strains. The number of cognitive demands students have to face simultaneously can be reduced by scaffolding the problem-solving process [50], providing learners with worked examples [51,52] or examples before the problem-solving process [53,54], or structuring tasks [55,56]. For instance, Yanto et al. [32] found that structuring three subsequent experimental classes using the three main types of inquiry (structured, guided, and open inquiry, cf. [6]) in a stepped sequence fosters preservice biology teachers’ SR skills better than a traditional cook-book approach.
While the use of problem-solving and inquiry activities is widely seen as important, time-consuming learning activities like these do not always fit into tight schedules in schools and universities [44]. However, students may benefit from instruction before inquiry activities [57]. In a meta-analysis on the control-of-variables strategy, Schwichow et al. [38] showed that larger effects were achieved when learners were given a demonstration. Regarding chemistry laboratories, implementing instruction such as demonstrations or examples as prelab learning activities seems to be a promising approach [48]. This may be achieved by using educational videos [58,59,60].
1.4. Educational Videos as Prelaboratory Activities
Educational videos are seen as a suitable medium to enhance students’ preparation in undergraduate chemistry, for instance, regarding content learning in organic chemistry [61], calculations for laboratory courses [62], and the use of laboratory equipment and procedures [62,63,64,65,66]. Methods for the development of effective videos are summarized in [67]. Cognitive load theory (CLT) [68] and the cognitive theory of multimedia learning (CTML) [69] inform the design of effective videos. For instance, excluding unnecessary details helps keep students’ working memory from overloading (CLT), and using the visual and auditory channels in a way that avoids redundancy contributes to the effective use of both channels in educational videos (CTML). In terms of learning outcomes, Pulukuri et al. and Stieff et al. demonstrated that students preparing with videos significantly outperform a control group without any preparation [61] or an alternative treatment group preparing with a lecture [66]. However, many studies only report significant effects regarding the affective domain: students perceive videos to be helpful for preparation of and participation in the laboratory [60,62,63,64], even if no evidence can be found for their effectiveness on student performance [62,63]. Moreover, videos are still seen as rather new and motivating media in university education [70] and may therefore, like other newly advancing educational technologies, enjoy a novelty effect when used over a short period of time [61,70,71,72,73]. This may lead to an overestimation of their impact on student performance [61].
1.5. Self-Concept of Ability and Performance
“The relationship between self and performance is associated with an improvement in ability” [74] (p. 132). Self-concept is not a unidimensional construct but consists of various facets, such as academic self-concept [75]. Regarding students in an introductory chemistry course at university, House [76] showed that students’ academic self-concept is a better predictor of first-year achievement in chemistry than, for example, college admission test grades. Moreover, facets of self-concept can be broken down further; i.e., academic self-concept can be differentiated by subject [75]. For example, Atzert and colleagues demonstrated that self-concept of ability can be measured with regard to science experimentation [77]. Sudria and colleagues [78] compared self-assessments and objective assessments of preservice chemistry teacher students’ practical skills in a chemistry laboratory. Their findings suggest that students’ self-assessed skills both at the beginning of and during the course correlate with the lecturer’s objective assessment of their performance. Self-concept of ability is usually assessed with regard to three different norms: individual (i.e., development of abilities over time), social (i.e., own ability in relation to others), and criterial (i.e., own ability with regard to an objective measure) [79]. However, in agreement with the criterial rubric used by Sudria et al., Atzert et al. showed that only the criterial norm informs school students’ self-concept of ability regarding science experimentation [77,78].
The aim of the project underlying this paper is both to foster preservice teachers’ SR competency by implementing a small number of problem-solving experiments and explanatory videos into an already-existing lab course and to measure a potential increase in SR competency. This paper first describes an instrument for objectively measuring SR skills as well as a self-assessment questionnaire in which students rate their SR skills with regard to the criterial norm before and after the intervention. Using data from the pilot study, a first insight is given into the development of students’ SR skills.
4. Discussion and Limitations of the Study
In this study, an already validated, performance-based instrument for describing the SR processes of school students was adapted to measure the SR competency of preservice chemistry teachers. Accompanying variables adopted from Kraeva [27] as well as tasks measuring SR skills were found to be suitable for preservice chemistry teachers regarding difficulty and comparability of test booklets (Hypothesis 2). Kraeva’s performance test originally involved only generating hypotheses, planning experiments, and drawing conclusions in a mixed format of individual and pair work. Because, in contrast to Kraeva [27], no significant correlation between accompanying variables A4 and A5 could be identified, these variables were not combined into a measure of methodological knowledge but treated as separate items. Even though Hypothesis 1 therefore had to be rejected in parts, the data from A5 could now be used to assess the SR skills generating hypotheses II and planning experiments II from individual work. Furthermore, the test was extended to measure observing and measuring in pair work in the pilot study. Factor analysis indicated a two-factorial structure of SR skills, separating skills assessed in individual work from those assessed in pair work. This is in accordance with findings from other studies comparing individual and group performance [40,41]. Even though reliabilities were low, tasks assessing skills individually yielded slightly more reliable data. Still, excellent interrater reliabilities were found, indicating the reliability of the method for collecting data on SR skills. Hence, for use in the main study, new tasks assessing the skills formulating research questions and preparing data for analysis were added to the test. Factor analysis of the SR self-assessment items indicated that the skills using blanks and observing and measuring load on a different factor than the other SR skills, such as formulating research questions or generating hypotheses. The former two skills seem to be relevant not only to inquiry experiments but also to cook-book experimentation. For example, Sudria and colleagues included observing in a set of practical laboratory skills [78].
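For readers wishing to retrace these steps, the following Python sketch illustrates the kind of analyses reported above: a two-factor exploratory factor analysis with varimax rotation and the computation of Cohen’s kappa for interrater agreement. It is a minimal illustration rather than the analysis code used in this study; the file name, column names, and the choice of scikit-learn routines are assumptions.

```python
# Illustrative sketch, not the study's analysis code: exploratory factor
# analysis of SR skill scores and interrater agreement for a double-coded task.
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from sklearn.metrics import cohen_kappa_score

# Hypothetical data: one row per participant, one column per SR skill score.
scores = pd.read_csv("sr_skill_scores.csv")  # placeholder file name
skills = ["hypotheses_I", "planning_I", "observing",      # assessed in pair work
          "hypotheses_II", "planning_II", "conclusions"]  # assessed individually

# Two-factor solution with varimax rotation; the loadings show whether
# pair-assessed and individually assessed skills separate onto two factors.
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(scores[skills])
loadings = pd.DataFrame(fa.components_.T, index=skills, columns=["F1", "F2"])
print(loadings.round(2))

# Interrater reliability: Cohen's kappa between two raters' codings of one task.
kappa = cohen_kappa_score(scores["rater1_hyp_I"], scores["rater2_hyp_I"])
print(f"Cohen's kappa: {kappa:.2f}")  # values above 0.80 are often read as excellent
```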
The second aim of this project was to enhance preservice chemistry teachers’ SR competency through experimental problem solving and explanatory videos in an organic chemistry lab course. First insights can be inferred from the comparison of control and treatment groups in the pilot study. Even though 60 students in total participated in the pilot study, the rather small sample sizes in each group still limit the generalizability of the findings. Both the SR self-assessment and the objective assessment data show that preservice chemistry teachers in their second year at university already demonstrate substantial skill before attending the laboratory, that is, without having received any explicit instruction on inquiry learning or scientific reasoning so far. In comparison to instruments used with secondary preservice science teachers in other studies [10,34,42], the instrument presented here seems to be less difficult. This is in accordance with the origin of the instrument, which was originally developed for school students [27]. Nevertheless, increases in several skills were measurable (see below).
Students rated their own abilities in using blanks, drawing conclusions, and observing and measuring as particularly highly developed, while lower self-assessments were found for formulating research questions and using the control-of-variables strategy. The control-of-variables strategy in particular may be unknown to some preservice teacher students in their second year of the bachelor’s program, which might explain why students more frequently chose the alternative answer “I don’t know” for this item. A similar pattern was found in the SR skills assessed from students’ performance: students’ individual performance was found to be relatively high in drawing conclusions and moderate in generating hypotheses and planning experiments. This is in accordance with findings from Krell and colleagues as well as Khan and Krell that students’ performances are lower in formulating research questions and generating hypotheses than in planning investigations, analyzing data, and drawing conclusions [34,42]. As was also demonstrated before [41], students scored more points in skills assessed in pair work than in individual work: performance for generating hypotheses I and planning experiments I (assessed in pair work) tended to be higher than for generating hypotheses II and planning experiments II (assessed in individual work). This is in accordance with findings that groups have higher success in problem solving than individuals because they engage more actively in explanatory activities [41,99]. However, it cannot be said with certainty that the higher score is exclusively due to the work in pairs. Stiller and colleagues identified several features rendering test items difficult [100], such as text length, use of specialist terms, and cognitive demands, i.e., the use of abstract concepts (see, for instance, also [101] in this special issue). A comparison of the experimental task and task 5 (see Figure A1) indicates no difference in text length or use of specialist terms. Tasks involving abstract concepts require participants to build “hypothetical assumptions […] not open to direct investigation” [100] (p. 725). This only holds true for planning experiments II (the planning of a hypothetical experiment) but not for the difference between generating hypotheses I and II, as these are both hypothetical tasks. Moreover, students were not observed to change their answers in the task generating hypotheses I after writing up their answers to the experimental task. Kraeva’s test construction followed the Model of Hierarchical Complexity in chemistry (MHC) [102], which describes task complexity with regard to the number of elements to be processed and their level of interrelatedness. Both the experimental task and task 5 were constructed on the highest level (“multivariate interdependencies”) [102] (p. 168); therefore, both tasks place the same cognitive demands on students. Additionally, the students were videotaped while solving the experimental task, which may have led to greater care and effort in solving it. Therefore, it cannot be conclusively clarified whether the higher scores in pair work are an effect of the pair-work format.
Regarding the increase in skill achieved through participation in the lab course, some learning gains were found in the control and treatment groups. As expected for the control group, an increase was only found for observing and measuring but not for generating hypotheses, planning experiments, or drawing conclusions (Hypothesis 3). This may be attributed to the fact that in the traditional laboratory, students were not asked to generate their own research questions or hypotheses. Hence, there was also no need for them to reason with respect to a question or hypothesis, consequently yielding no increase in these skills [6,86,87]. Hypothesis 3 was therefore provisionally accepted. Increases in SR skills in the treatment groups were not as clear-cut as hypothesized. Both treatment groups showed an increase in generating hypotheses and planning experiments, whereas no increase was found for observing and measuring and drawing conclusions. Thus, Hypothesis 4 could be provisionally accepted for the respective skills. As the control group’s skill in observing and measuring increased, an increase would have been expected in the treatment groups as well. This may be attributed to several possible reasons: on the one hand, the cognitive demands placed on the treatment groups by the additional and new learning objectives in the intervention (such as generating hypotheses) could have been too high, reducing the cognitive capacity directed at skills students might have perceived as already familiar. On the other hand, since observing and measuring showed a negative factor loading on SR skills in pair work, there might also be an issue with the assessment of this skill, either in the manual or in the task. Hence, this skill should undergo revision before the start of the main study. So far, Hypothesis 5 had to be rejected, since the SR group only showed a significantly larger learning gain than the alternative treatment group in one skill, planning experiments I, and no difference in learning gain compared to the control group. Since the data analyzed here belong to the pilot study and therefore only give a first indication of the effectiveness of the intervention, both Hypotheses 4 and 5 will have to be tested again in the main study.
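The within-group and between-group comparisons described above can be sketched as follows. The paper does not state the specific statistical tests, so the use of nonparametric tests (Wilcoxon signed-rank within groups, Mann-Whitney U on gain scores between groups) is an assumption motivated by the small sample sizes; all file and column names are hypothetical.

```python
# Illustrative sketch (assumed tests, not the study's code): pre/post learning
# gains for one SR skill, using nonparametric tests suited to small samples.
import pandas as pd
from scipy.stats import wilcoxon, mannwhitneyu

df = pd.read_csv("pilot_scores.csv")  # hypothetical pre/post scores per student

# Within-group comparison: Wilcoxon signed-rank test on paired pre/post scores.
for group, sub in df.groupby("group"):  # e.g., "control", "SR", "alternative"
    stat, p = wilcoxon(sub["planning_pre"], sub["planning_post"])
    print(f"{group}: W = {stat:.1f}, p = {p:.3f}")

# Between-group comparison of simple gain scores (post minus pre).
df["gain"] = df["planning_post"] - df["planning_pre"]
u, p = mannwhitneyu(df.loc[df["group"] == "SR", "gain"],
                    df.loc[df["group"] == "alternative", "gain"])
print(f"SR vs. alternative treatment: U = {u:.1f}, p = {p:.3f}")
```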
Qualitative assessment was chosen to allow for a more individualized view of students’ skills; yet, the quantitative analyses show that small sample sizes are a serious limitation of the presented investigation. This applies particularly to the very small treatment groups resulting from the piloting, which caused issues with the ratio of items to participants in the factor analyses as well as the low reliabilities of the instruments. Furthermore, conducting parts of the performance tasks in pair work led to reliability issues in comparison to individually assessed SR skills, negating the advantage of the pair-work format in enhancing communication and hence the accessibility of participants’ thoughts [40]. In addition, since the original test instrument was constructed for school students, we expected preservice teachers to achieve moderate to high scores. In some variables, however, this produced ceiling effects [92], such as in the control group’s posttest for the performance-based SR skill observing and measuring or in the self-assessment of the respective skill. This may cause the instruments to fail to differentiate potential gains in these skills due to the treatment. Furthermore, the negative factor loading of the SR skill observing and measuring demands that the respective task and manual undergo revision before the main study is conducted. Because the self-assessment questionnaire was introduced late in the design of the study, a comparison of self-assessed skills in the control group was impossible. Moreover, no gains in self-assessment of skills were found for the treatment groups. It cannot be ruled out that the pandemic had an influence on student motivation in the 2020 cohort, as lab activities had to be conducted under strict pandemic restrictions, for example, prohibiting pair work in the lab. Furthermore, the pandemic may also have had an impact on the researchers’ and assistants’ performance in the laboratory due to uncertainties in the planning process. Since the 2020 cohort was part of the SR group, this limits the scope of the findings from the comparison of treatment groups even more. New pandemic regulations may also hinder the further conduct of this study.
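As a closing illustration of the ceiling-effect issue noted above, a simple screening is to compute the share of participants reaching the maximum score per scale. The 15% cut-off used below is a common rule of thumb rather than a value taken from this study, and all names in the sketch are hypothetical.

```python
# Illustrative sketch (assumed procedure): screening posttest scales for
# ceiling effects via the share of participants at the maximum score.
import pandas as pd

df = pd.read_csv("posttest_scores.csv")   # hypothetical posttest data
MAX_SCORE = 4      # hypothetical maximum of the skill scale
THRESHOLD = 0.15   # common rule of thumb: >15% at maximum suggests a ceiling

for skill in ["observing_measuring", "drawing_conclusions"]:  # hypothetical columns
    share_at_max = (df[skill] >= MAX_SCORE).mean()
    verdict = "ceiling effect likely" if share_at_max > THRESHOLD else "ok"
    print(f"{skill}: {share_at_max:.0%} at maximum -> {verdict}")
```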