Assessing Biology Pre-Service Teachers’ Professional Vision of Teaching Scientific Inquiry

Professional vision is a key ability in the professional development of pre- and in-service teachers, as it determines how professionals perceive and interpret situations. The aim of this study was to conceptualize an instrument for assessing professional vision with a focus on formative assessment in the context of scientific inquiry. This focus is highly valuable, since formative assessment contributes to the quality of science teaching and learning. The four-dimensional structure of professional vision, comprising the abilities perception, description, explanation, and prediction, was confirmed by means of our text-vignette-based instrument. The professional vision of pre-service teachers (N = 80) was fostered in a training program involving a seminar phase and a teaching phase in an out-of-school laboratory. In a pre-post design, significant interaction effects of group (training vs. comparison group (N = 39)) and time were found for the abilities description (F(1,117) = 29.14, p < 0.001) and prediction (F(1,117) = 14.81, p < 0.001), indicating the sensitivity of the instrument. Our instrument allows the assessment of the abilities description and prediction; the scales for the abilities perception and explanation need further refinement. Nonetheless, our instrument could be a starting point for further investigating professional vision in science contexts, as it incorporates essential key features such as a situated approach.


Introduction
Among other aspects, previous research on science teachers' professional development has focused on teachers' knowledge and beliefs. Researchers drew on Shulman's [1] conceptualization of knowledge relevant for teaching, investigating the development and interrelation of pre- and in-service biology teachers' content knowledge (CK), pedagogical content knowledge (PCK), and pedagogical knowledge (PK) [2][3][4][5][6]. These types of knowledge are considered to affect teaching and student learning [7,8]. In addition, teachers' beliefs are considered a major factor affecting how science subjects are taught and an important aspect to be addressed in professional development programs [9][10][11][12][13][14]. The other main area of research on science teachers' professional development has focused on the classroom practices of pre- and in-service teachers [15][16][17][18]. To develop a more comprehensive understanding of teachers' professional competence, Blömeke, Gustafsson, and Shavelson [19] conceptualized a model that bridges the gap between knowledge, beliefs, and classroom practice.

Professional Vision as a Part of Professional Development
Blömeke et al. [19] presented a model of professional development that describes competence as a continuum stretching from dispositions (e.g., self-efficacy and beliefs) and teachers' knowledge (e.g., CK, PCK, PK) via situation-specific skills to actual teaching performance in the classroom. Professional vision does not seem to be a general ability but has to be reconstructed and reconsidered in the particular teaching situation. Empirical studies have demonstrated that professional vision of classroom management and professional vision of content-specific learning support in science lessons are two separate constructs [33]. Further, findings have shown that professional vision is not only content-specific but also topic-specific [34,35]. Sunder et al. [35] demonstrated this topic specificity by means of an intervention study: the professional vision of the intervention group increased regarding the fostered ability (in this case, specific learning support on the topic of floating and sinking), in contrast to specific learning support for the non-fostered topic.

Assessment of Professional Vision
Primarily qualitative studies have been conducted in the field of professional vision [21,22,36,37,38], giving valuable insights into the structure and processes of professional vision. The Observer Tool [23] and the instrument developed by Möller, Steffensky, Meschede, and Wolters [39] to test the professional vision of learning support in the context of science education in primary schools are two of the few instruments that measure professional vision quantitatively. A relatively new approach to assessing teachers' professional vision is to measure their eye movements directly with eye-tracking technologies [40][41][42].
Educ. Sci. 2020, 10, 332
When measuring professional vision, a situated assessment approach is needed because professional vision is the adaptation of knowledge in a specific teaching and learning situation. Therefore, open questions or closed-rating items are combined with video vignettes [23,39] or text vignettes [43,44] that serve as the key stimuli for the analyses of teaching and learning situations. As professional vision has to be reconstructed and reconsidered in the particular teaching situation, instruments used for its measurement must be aligned with the particular context that is of interest.

Training Fostering Professional Vision
Empirical studies indicate a positive relation between the quality of teachers' professional vision and students' performance in mathematics [45] and science [46]. Additionally, teachers' ability to register interactions is related to the quality of their actions in the corresponding situations [47]. It is therefore important to foster the professional vision of pre- and in-service teachers. Training programs focusing on the professional vision of pre-service [35,48,49] and in-service teachers [22,46] have already been shown to be successful. Some of these programs focus on general pedagogical aspects of teaching [36,48,50] such as goal clarity, teacher support, and learning climate [31]. The majority of studies, however, have centered on subject-specific aspects of teaching [22,38,46], such as analyzing the teaching of mathematical procedures [51] or providing learning support for physical phenomena [35].
An overarching feature of these training programs is the use of videotaped teaching and learning sessions. The success of video-based professional development programs such as the STeLLA program highlights the value of using videos and their effectiveness for science teaching and learning [52,53]. How videos are implemented in professional development programs is key, as participant-centered discussions can foster the professional vision of the participants [54].
In particular, pre-service teachers benefit from realistic yet less complex representations of practical teaching that enable situated and multifaceted analyses of teaching and learning processes without the immediate pressure to act [36]. Since these videos offer a broad spectrum of opportunities to reflect on and discuss pedagogical and subject-educational aspects, they can support the process of interlinking declarative, case-related, and strategic knowledge [55]. The types of videos available for analysis range from best-practice examples to everyday teaching. They further differ in their authenticity (e.g., staged or real teaching) [56] and in the people acting in the videos, for example, an unknown teacher, peers, or the participants themselves filmed during their own teaching [57]. Through repeated use of these videos, the capabilities to observe, identify, and interpret can be improved [34,55,58]. Pre-service and in-service teachers highlight the importance of being able to observe effective STEM lessons; in-service teachers additionally stress the relevance of analyzing videos of experienced teachers teaching STEM lessons [59]. Current research indicates that a combination of analyzing videos of one's own teaching, of peers, and of unknown teachers fosters the professional vision of pre-service teachers best [60]. Additionally, pre-service teachers' professional vision is fostered by video-based feedback [61].

Challenges in the Teaching and Learning of Scientific Inquiry
Professional vision can only be considered a part of teachers' professional expertise if the aspect in question is relevant for the students' learning process [21]. Therefore, we focus on professional vision of formative assessment in the specific context of scientific inquiry. Learning about scientific inquiry as well as acquiring the necessary inquiry skills form part of the educational standards in many countries [62][63][64][65]. However, students have difficulties in understanding and conducting scientific inquiry, as they find the required processes and their logic challenging [66]. Typical difficulties of students conducting experiments concern all steps of scientific inquiry [67]; for example, students have difficulties formulating a relevant hypothesis, or any hypothesis at all [68,69], use weak strategies for controlling variables [70], and have poor data analysis skills [71].
Arnold, Kremer, and Mayer [72] argue that procedural knowledge and procedural understanding will improve students' understanding of scientific inquiry. Inquiry-based learning is a promising approach to fostering the procedural understanding of scientific inquiry [73]. A suitable method for implementing inquiry-based learning is guided-inquiry teaching [74,75], as learning inquiry skills in a guided setting helps students overcome the high cognitive demands of open inquiry learning [76]. When practicing guided inquiry, teachers are confronted with various decision-making processes [19], such as when and how to support their students [74]. To address this challenge, formative assessment can help to diagnose students' prior knowledge and understanding and to facilitate their learning processes [77,78].

The Role of Formative Assessment for Science Teaching
Formative assessment is the continuous diagnosis of individual learning progress combined with a continuous response that promotes learning [78][79][80]. Meta-analyses have indicated the general importance of formative assessment for student learning [81]. Formative assessment is also an important prerequisite for successful learning in science education [82][83][84]. The practice of formative assessment depends on the specific subject [78,85,86]. Performing formative assessment in class is challenging for teachers in mathematics and science education [86,87]. To address this challenge, formative assessment can be fostered in pre- and in-service teachers [88]. By investigating the quality of formative assessment, Furtak, Ruiz-Primo, and Bakeman [89] identified four categories of teacher response quality in science teaching: (1) evaluative responses, such as judging students' contributions and providing longer content-specific remarks; (2) neutral responses, including reactions that do not help students evaluate their own contributions; (3) leading responses, for example, prompts that elicit very short and obvious student answers; and (4) pushing responses, comprising impulses that activate students' own thinking.

Aim of the Study
Following the call to model teachers' professional competence as a continuum [19], we focused on the professional vision of pre-service teachers. As professional vision is topic-specific [34,35,48] and thus has to be reconstructed and reconsidered in the particular teaching situation, training must also focus on the particular teaching situation to foster the specific aspects of professional vision that are of interest. We decided to combine formative assessment, an important yet challenging field for teachers [78,82], with scientific inquiry, which is particularly challenging for students in science education [68][69][70][71], as a valuable focus for the professional development of pre-service teachers.
Our major objective was to conceptualize a test instrument that enables us to measure changes in pre-service teachers' professional vision. To evaluate the instrument, we focused on three aspects: (1) dimensionality and reliability, (2) scoring of the participants' answers, and (3) sensitivity. These aspects are reflected in our research questions (RQs):

1. Dimensionality: Based on the theoretical background, we assumed a four-dimensional structure of professional vision. Hence, we explored in RQ 1: To what extent do the empirical data collected with our instrument fit this theoretically described structure of professional vision?
2. Scoring: Different expert reference norms have been used to score participants' answers in previous research. Hence, we aimed to answer the following as RQ 2: Is the use of a strict (dichotomous) or a less strict (partial credit) expert reference norm more suitable?
3. Sensitivity: As any suitable measurement instrument should be able to detect changes, RQ 3 asked whether our instrument is sensitive enough to measure changes in professional vision.

Designing a Test Instrument
We developed a test instrument to assess professional vision regarding formative assessment in the context of scientific inquiry. It is composed of items associated with text vignettes, containing statements that focus on teacher response qualities within the context of students conducting experiments. A Likert-scale response format was predominantly used.

Development of the Text Vignettes
To comply with the situated assessment approach of measuring professional vision, authentic video materials presenting microteaching situations [90], in which biology pre-service teachers supported the learning process of students conducting experiments, were recorded. These videos served as a basis for the development of the text vignettes. They were screened for passages showing students' difficulties when working on the hypothesis as a part of scientific inquiry. Three different video vignettes were selected. They differed in the type of students' difficulty when working on the hypothesis as well as in the level of response qualities of the pre-service teachers' responses. The video vignettes were transcribed and supplemented with notes about physical actions in the video, for example, when a pre-service teacher pointed at something. Thus, the final text vignettes were coherent for the reader.

Development of the Items
Items were developed based on the theoretical structure of professional vision (perception and interpretation, the latter comprising description, explanation, and prediction). All items were grounded in formative assessment regarding scientific inquiry: they systematically interlinked the four levels of teacher response quality [89] with ways to support procedural knowledge and procedural understanding [72]. This approach resulted in a set of 36 rating items per vignette. Perception was assessed with a dichotomous item format (yes/no), whereas a four-point Likert scale (1 (disagree) to 4 (agree)) was used for items concerning interpretation. Table 1 gives exemplary items for each ability of professional vision.

Scoring
Following the procedure of quantitative research in the field of professional vision, an expert rating was conducted to establish criterion-referenced norms for analyzing participants' responses objectively [23,39,91]. Three expert researchers independently rated all items in connection with the text vignettes according to their own professional vision. The experts were all educational researchers in biology education and had teaching experience at secondary schools. This rating showed excellent consistency (unadjusted ICC = 0.86 [92,93]), indicating that the responses to the items were unambiguous and discernible. In cases of disagreement, consensus validation was performed. The participants' responses were compared to the expert rating. As previous studies differ in how strictly points are allocated when expert ratings serve as a criterion-referenced norm [23,35], two procedures were used to calculate agreement with the experts. In the strict approach, 1 point (hit expert rating) or 0 points (missed expert rating) were allocated, whereas in the less strict approach, 2 points (hit expert rating), 1 point (correct direction on the scale), or 0 points (missed expert rating) were allocated. The items interlink the level of teacher response quality (pushing: e.g., explain) with a certain method of facilitating the learning of scientific inquiry (procedural knowledge: e.g., what).
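The two scoring rules can be made concrete with a short Python sketch. It assumes that interpretation items are coded 1 to 4 and reads "correct direction on the scale" as lying on the same side of the scale midpoint as the expert rating (1-2 vs. 3-4); this interpretation and the function names are hypothetical illustrations, not the study's own scoring script.

```python
# Hypothetical sketch of the strict and less strict expert-referenced norms.
# Assumption: interpretation items use a 4-point Likert scale coded 1..4, and
# "correct direction" means the same side of the midpoint (1-2 vs. 3-4).

LIKERT_MIN, LIKERT_MAX = 1, 4


def _check(value: int) -> None:
    """Reject ratings outside the 4-point scale."""
    if not LIKERT_MIN <= value <= LIKERT_MAX:
        raise ValueError(f"rating {value} is outside {LIKERT_MIN}..{LIKERT_MAX}")


def score_strict(response: int, expert: int) -> int:
    """Strict (dichotomous) norm: 1 = hit expert rating, 0 = miss."""
    _check(response)
    _check(expert)
    return 1 if response == expert else 0


def score_partial(response: int, expert: int) -> int:
    """Less strict (partial credit) norm: 2 = hit, 1 = correct direction, 0 = miss."""
    _check(response)
    _check(expert)
    if response == expert:
        return 2
    # Same side of the midpoint counts as the correct direction (assumption).
    if (response <= 2) == (expert <= 2):
        return 1
    return 0


if __name__ == "__main__":
    expert_rating = 4  # hypothetical expert rating for one item
    for answer in range(LIKERT_MIN, LIKERT_MAX + 1):
        print(answer, score_strict(answer, expert_rating), score_partial(answer, expert_rating))
```

Under the strict rule only an exact match earns credit, whereas the partial-credit rule also rewards answers that agree with the expert on whether a statement applies, which is why it tends to compress differences between participants.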

Structure of the Training
To assess the sensitivity of our instrument (RQ 3), we needed a training program that would increase professional vision regarding formative assessment in the context of scientific inquiry, so that this change could be measured by our test. The training consisted of two major phases, a seminar phase and a teaching phase. The core activities for fostering professional vision in the seminar phase were analyses of teacher-student interactions by means of both text and video vignettes, which focus on typical student difficulties in the context of scientific inquiry [68,69] and on teacher responses that support the learning process (Figure 2). In the teaching phase, each pre-service teacher supported the learning process of the same two to three students experimenting together. During the full-day course in the out-of-school laboratory, the students carried out three experiments on the topic "adaptation of animals to their habitats" [94] (p. 59) that were presented as "learning kit experiments" [95,96] (p. 57). The text vignettes used in the test instrument were not identical to the vignettes used during the training, to avoid memorization effects. Fostering professional vision by means of video vignettes has been shown to be an effective approach [25,34,35,46,48,49,50,51]. Practical teaching was incorporated in the teaching phase, in which participants supported the learning processes of students in our out-of-school laboratory (for more details regarding the training, see [97]).

Participants and Research Design
We used a quasi-experimental research design with a training group and a comparison group. The training group comprised three cohorts (cohort 1: n = 29; cohort 2: n = 30; cohort 3: n = 21), resulting in a total of N = 80 biology pre-service teachers (68% female). They had a mean age of 22.5 years (SD = 2.2) and were on average in the fifth semester of the university teacher education program (M = 5.0; SD = 1.2). The training was divided into a seminar phase (duration: 7 consecutive days with 5 h per day) and a teaching phase in our out-of-school laboratory (duration: 5 consecutive days with 7 h per day) (Figure 3). The teaching phase was held approximately three weeks after completion of the seminar phase. The participants completed the pre- and post-test before and after the training program, respectively. The comparison group comprised four cohorts (cohort 1: n = 11; cohort 2: n = 7; cohort 3: n = 8; cohort 4: n = 13), resulting in a total of N = 39 biology pre-service teachers (82% female). Their mean age was 23.2 years (SD = 2.6). On average, they were in the sixth semester of their university teacher education program (M = 6.3; SD = 1.8). They completed the pre- and post-test before and after, respectively, a biology education seminar or the university-based theory-practice term, with an interval of approximately three months. All pre-service teachers were enrolled in a program at our university to become secondary school teachers and participated on a voluntary basis. The requirements for passing the university courses were independent of participation in this study. All participants received the identical digital introduction to the study and assessment. For reasons of time economy, no data regarding motivational aspects were collected.

Analyses of Data
We used item response theory (IRT) models to scale our data. IRT models are often used for data analysis in empirical studies based on performance tests [23,100,101,102]. For the analyses of our data, models from the Rasch tradition were used [103,104]. We used RStudio (version 1.0.153) with the TAM package [105] as well as IBM SPSS Statistics (version 24) for further analyses (for details, see below).
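As an illustration of the Rasch-family models referred to here, the following Python sketch computes response probabilities under the dichotomous Rasch model (matching 0/1 scores from the strict norm) and Masters' partial credit model (matching 0/1/2 scores from the less strict norm). The ability value theta and the item parameters are hypothetical inputs; the sketch only writes out the model equations and is not how TAM estimates them.

```python
import math
from typing import List, Sequence


def rasch_prob(theta: float, b: float) -> float:
    """P(X = 1) under the dichotomous Rasch model for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def pcm_probs(theta: float, deltas: Sequence[float]) -> List[float]:
    """Category probabilities P(X = 0..m) under Masters' partial credit model.

    deltas[j] is the step parameter for moving from category j to j + 1.
    """
    # Cumulative sums of (theta - delta_j); category 0 has an empty sum (= 0).
    logits = [0.0]
    for delta in deltas:
        logits.append(logits[-1] + (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical person and item parameters (in logits).
    print(round(rasch_prob(theta=0.5, b=0.5), 3))
    print([round(p, 3) for p in pcm_probs(theta=0.5, deltas=[-0.4, 1.1])])
```

With a single step parameter the partial credit model reduces to the Rasch model, which is why both norms can be analyzed within the same model family.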

Assessing Preconditions for Using the Test to Assess Professional Vision
To assess the structure of the data originating from the test, multidimensional models were tested, because professional vision seems to consist of four separate processes, namely perception and the three processes of interpretation: description, explanation, and prediction (RQ 1). We tested the structure of professional vision by comparing a four-dimensional model, which presumes that perception, description, explanation, and prediction can be measured as distinct dimensions, with more restricted models (Table 2). The more restricted models were a one-dimensional model (pooling all items on one dimension), a two-dimensional model (differentiating the two processes of perception and interpretation), and a three-dimensional model (distinguishing perception, description and explanation combined, and prediction). Thus, four different models were tested by contrasting the global model fit of the four-dimensional model with the global model fit of the more restricted models, each based on both criterion-referenced norms (dichotomous and partial credit). The lower values of the penalized information criteria AIC and BIC for the four-dimensional models indicate that the professional vision assessed with our instrument is described best by the four-dimensional models (strict and less strict expert-referenced norms), which fit the data better than the more restricted models. The likelihood ratio tests confirmed that this better fit was significant.
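The model comparison logic can be sketched in Python as follows. AIC and BIC add a penalty per parameter to the deviance (-2 log-likelihood), and nested models are compared with a likelihood ratio test whose statistic follows a chi-square distribution with degrees of freedom equal to the difference in parameter counts. The log-likelihoods and parameter counts in the usage lines are hypothetical, and the chi-square tail probability uses the closed form for integer degrees of freedom so that the sketch needs no third-party library.

```python
import math
from typing import Tuple


def aic(loglik: float, n_params: int) -> float:
    """Akaike information criterion: deviance plus 2 per parameter."""
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik: float, n_params: int, n_persons: int) -> float:
    """Bayesian information criterion: deviance plus log(N) per parameter."""
    return -2.0 * loglik + n_params * math.log(n_persons)


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{k=1}^{(df-1)/2} (x/2)^(k-1/2) / Gamma(k+1/2)
    total = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (k - 0.5) / math.gamma(k + 0.5)
    return total


def lr_test(loglik_restricted: float, p_restricted: int,
            loglik_full: float, p_full: int) -> Tuple[float, int, float]:
    """Likelihood ratio test of a restricted model nested in a fuller model."""
    statistic = 2.0 * (loglik_full - loglik_restricted)
    df = p_full - p_restricted
    return statistic, df, chi2_sf(statistic, df)


if __name__ == "__main__":
    # Hypothetical log-likelihoods and parameter counts, not results of this study.
    stat, df, p = lr_test(-5000.0, 93, -4980.0, 102)
    print(f"LR = {stat:.1f}, df = {df}, p = {p:.2e}")
```

For example, if item parameters are identical across models and latent means are fixed (an assumption), moving from a one-dimensional to a four-dimensional model replaces one latent variance with four variances and six covariances, that is, nine additional parameters.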
The four-dimensional models (strict and less strict expert-referenced norms) were used as the basis for the following investigations. To determine the change in participants' abilities in the repeated-measures design, the virtual persons procedure was used [107]. Items were analyzed according to the mean square fit index (0.75 ≤ MNSQ ≤ 1.30 [108]). As a result, the item pool was reduced from the original i = 108 to i = 92 items that fit both the strict and the less strict four-dimensional models. No meaningful pattern was identifiable among the eliminated items, and all relevant aspects of the test were still covered by the remaining items. No items were eliminated in the dimensions perception (i = 12) and explanation (i = 21). Due to poor fit, 10 items were eliminated in the dimension description, resulting in i = 41, and six items were eliminated in the dimension prediction, resulting in i = 18. Hence, we assume that the final test is adequate to assess professional vision.
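The virtual persons procedure and the item-fit screening can be sketched in Python as follows. Each participant's pre- and post-test responses are entered as two separate rows (virtual persons), so that both occasions are scaled on one common metric, and items are kept only if their mean square fit lies within 0.75 to 1.30. The participant IDs, item names, and MNSQ values are hypothetical; the last lines only check the arithmetic of the reported item counts.

```python
from typing import Dict, List

Responses = Dict[str, int]  # item name -> score


def stack_virtual_persons(pre: Dict[str, Responses],
                          post: Dict[str, Responses]) -> Dict[str, Responses]:
    """Enter each participant twice, once per occasion, as separate 'virtual persons'.

    All rows are then calibrated together, so pre- and post-test abilities
    are expressed on one common item scale.
    """
    stacked: Dict[str, Responses] = {}
    for occasion, data in (("pre", pre), ("post", post)):
        for person_id, answers in data.items():
            stacked[f"{person_id}_{occasion}"] = dict(answers)
    return stacked


def keep_fitting_items(mnsq: Dict[str, float],
                       lower: float = 0.75, upper: float = 1.30) -> List[str]:
    """Return the items whose mean square fit index lies within [lower, upper]."""
    return [item for item, value in mnsq.items() if lower <= value <= upper]


if __name__ == "__main__":
    # Hypothetical participants, items, and fit values.
    pre = {"P01": {"D1": 1, "D2": 0}, "P02": {"D1": 0, "D2": 0}}
    post = {"P01": {"D1": 1, "D2": 1}, "P02": {"D1": 1, "D2": 0}}
    print(sorted(stack_virtual_persons(pre, post)))
    print(keep_fitting_items({"D1": 0.92, "D2": 1.41, "D3": 0.70}))

    # Arithmetic check of the retained item counts per dimension reported in the text.
    retained = [12, 21, 41, 18]
    assert sum(retained) == 92
    assert 108 - sum(retained) == 10 + 6
```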

Test Scoring
To determine which of the two expert-referenced norms is more suitable, the indices of the four-dimensional models were compared (RQ 2). The strict expert-referenced norm (dichotomous model) resulted in better indices, with excellent to good EAP reliabilities for the scales description (0.91) and prediction (0.85) as well as good item discrimination, with up to σ² = 2.13 explained variance (Table 3). The less strict expert-referenced norm (partial credit model) showed reliabilities similar to those of the model with the strict norm but low discrimination of the scales (explained variance). Due to the unacceptable to poor reliability of the scales for perception and explanation under both norms, these scales need further improvement. Hence, no further results regarding these scales are reported here, and all interpretations and conclusions are limited to the scales description and prediction.
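The EAP reliabilities reported here can be illustrated with one commonly used definition, assumed for this sketch: the variance of the persons' EAP ability estimates divided by that variance plus the mean posterior variance. The inputs below are hypothetical and are not data from this study; software packages may weight persons or use the sample variance instead.

```python
from statistics import mean, pvariance
from typing import Sequence


def eap_reliability(eap: Sequence[float], posterior_sd: Sequence[float]) -> float:
    """EAP reliability = Var(EAP) / (Var(EAP) + mean posterior variance).

    eap: persons' EAP ability estimates; posterior_sd: their posterior standard
    deviations. The population variance is used here (an assumption).
    """
    if len(eap) != len(posterior_sd) or len(eap) < 2:
        raise ValueError("need at least two persons with matching inputs")
    var_eap = pvariance(eap)
    mean_posterior_var = mean([sd ** 2 for sd in posterior_sd])
    return var_eap / (var_eap + mean_posterior_var)


if __name__ == "__main__":
    # Hypothetical estimates for five persons, not data from this study.
    estimates = [-1.2, -0.3, 0.1, 0.6, 1.4]
    sds = [0.35, 0.30, 0.30, 0.32, 0.38]
    print(round(eap_reliability(estimates, sds), 2))
```

Reliability rises when persons are spread out relative to the uncertainty of their individual estimates, which is consistent with the observation that low response variance depresses the reliability of a scale.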

Demonstrating Sensitivity
In order to test the sensitivity of the test instrument (RQ 3), a repeated-measures ANOVA was conducted, and the pairwise comparisons were Bonferroni-corrected (Figure 4). Regarding the ability description, we found a significant interaction effect of group (training vs. comparison group) and time (F(1,117) = 29.14, p < 0.001) with a large effect size (partial η² = 0.20). The pairwise comparisons indicated no significant difference between the two groups in the pre-test (p = 0.602); however, there was a significant increase in the ability description in the training group in contrast to the comparison group (p < 0.001).
For the ability prediction, a significant interaction effect was detected between group and time (F(1,117) = 14.81, p < 0.001) with a medium effect size (partial η² = 0.11). The pairwise comparisons revealed no difference in the participants' ability of prediction in the pre-test (p = 0.377). However, in the post-test, the ability to predict was significantly higher in the training group than in the comparison group (p = 0.003).
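The reported effect sizes can be checked from the F statistics alone, since for an effect with df_effect and df_error degrees of freedom, partial η² = F·df_effect / (F·df_effect + df_error). In a 2 (group) × 2 (time) mixed design, the error degrees of freedom of the interaction are N - 2 = 119 - 2 = 117, which matches F(1,117). The Bonferroni helper only illustrates the kind of correction applied to the pairwise comparisons; the p-values passed to it are hypothetical.

```python
from typing import List


def partial_eta_squared(f: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f * df_effect) / (f * df_effect + df_error)


def bonferroni(p_values: List[float]) -> List[float]:
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    n_training, n_comparison = 80, 39
    # Mixed 2 (group) x 2 (time) design: error df of the interaction = N - number of groups.
    df_error = n_training + n_comparison - 2
    for ability, f in (("description", 29.14), ("prediction", 14.81)):
        print(ability, df_error, round(partial_eta_squared(f, 1, df_error), 2))
    print(bonferroni([0.01, 0.03, 0.40]))  # hypothetical unadjusted p-values
```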

Discussion
The aim of the present study was to assess biology pre-service teachers' professional vision of formative assessment in the context of scientific inquiry. Adhering to the situated assessment approach of professional vision, we used authentic text vignettes as stimuli. Text vignettes can serve as appropriate stimuli particularly in subject-educational research, as they are less complex than video vignettes in terms of simultaneously occurring processes, whereas video vignettes are mainly of interest in pedagogical or general educational research [111]. Additionally, Friesen et al. [112] showed that the format of the vignettes (video, text, or comic) did not affect teachers' perception in subject-didactic contexts. Text and video vignettes are both perceived as authentic representations of classroom settings [113]. The authenticity of the vignettes is crucial for initiating professional vision [114]; however, staged videos can also be perceived as authentic [115].
The development of our test instrument was successful in the sense that a four-dimensional construct of professional vision was detected, which is in line with the already described abilities of professional vision (perception, description, explanation, prediction) [25][26][27]. This finding is an indication of construct validity; however, further support for validity is needed.
Regarding the comparison of expert-referenced norms (strict vs. less strict), the strict model is to be preferred over the less strict model, since it explained more variance. In contrast to the description and prediction scales, the scales for the abilities perception and explanation did not reach satisfactory reliability. The low reliability of the perception scale may be due to low variance caused by the dichotomous item format. For content-related reasons, we preferred a dichotomous answer format over a four-point Likert scale, as participants either perceive or do not perceive a certain aspect in the vignettes. However, using a four-point Likert scale format [33] or assessing perception indirectly seems more promising, since perceiving certain aspects forms the basis for interpretation [23]. The unsatisfactory reliability of the explanation scale may likewise stem from low variance in the participants' responses, possibly because most items were too difficult relative to the participants' abilities. Therefore, the perception and explanation items need to be revised.
In our training, we considered the features of successful training in professional vision (such as analyzing teaching videos, role plays, and active teaching in microteaching situations) and incorporated the important aspect of theory-practice integration. Hence, we expected that the pre-service teachers' professional vision could be strengthened [34,35,48,116]. The significant changes in the abilities description and prediction show that the conceptualized instrument is sensitive enough to detect changes in these aspects of participants' professional vision.
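The argument that low response variance depresses reliability can be made concrete with the Kuder-Richardson 20 (KR-20) coefficient for dichotomous items: KR-20 = k/(k − 1) · (1 − Σ p_j(1 − p_j) / σ²_total). When items are too difficult, total scores cluster near zero. Their variance then shrinks relative to the summed item variances, and the coefficient drops. The minimal sketch below assumes 0/1-scored items. Both response matrices are hypothetical toy data, not data from this study. The reliability coefficients reported here may have been computed differently (for example, from the IRT scaling), and KR-20 is used only because it makes the role of variance explicit.

```python
from statistics import pvariance


def kr20(responses: list[list[int]]) -> float:
    """Kuder-Richardson 20 reliability for dichotomously scored (0/1) items.

    responses: one row per participant, one column per item.
    """
    n_persons = len(responses)
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)  # population variance of total scores
    if total_variance == 0:
        raise ValueError("total scores have no variance; KR-20 is undefined")
    # Sum of item variances p * (1 - p), p = proportion of participants scoring 1
    item_variance_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_persons
        item_variance_sum += p * (1 - p)
    return n_items / (n_items - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical data: items spread across difficulty levels, total scores vary widely
discriminating = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# Hypothetical data: most items too difficult, total scores cluster near zero
too_hard = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

print(round(kr20(discriminating), 2))  # 0.8
print(round(kr20(too_hard), 2))        # 0.27
```

In the second matrix, two items are never solved and contribute no variance. The remaining responses barely separate participants, so the same number of items yields a much lower coefficient.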

Limitations and Outlook
The validity evidence for the instrument is not yet comprehensive. Further validity evidence could be gathered in the future, for example, by cross-validating the results in a replication study. Concerning the sensitivity of the instrument, the increase in professional vision of pre- and in-service teachers needs to be verified, similar to Gold and Holodynski [91]. Validity evidence based on response processes could help to clarify the fit between the construct of professional vision and the processes test takers actually engage in. Examples are eye tracking or think-aloud methods, which provide insight into test takers' response strategies. Only two of the four scales showed satisfactory reliability, as the scales for the abilities perception and explanation need further refinement. Thus, a holistic assessment of professional vision with the instrument is not yet possible. Furthermore, the content specificity of the instrument has to be taken into account in future studies; generic statements about professional vision cannot be derived from data collected with the present instrument.

Conclusions
This paper reports on the investigation of a text-vignette-based test instrument focusing on professional vision of formative assessment in scientific inquiry. Data assessed with this instrument fit the theoretically assumed four-dimensional structure of professional vision (RQ 1). We found that the strict expert-referenced norm is to be preferred over the partial credit model (RQ 2). Furthermore, the instrument was sensitive enough to detect changes in the abilities description and prediction (RQ 3). These changes occurred in pre-service teachers who participated in training focused on professional vision of formative assessment in scientific inquiry. Incorporating the instrument into existing out-of-school laboratory courses could improve understanding of their effects on the development of professional vision. This would enable insightful comparisons with other kinds of courses.
Funding: This research was funded by the "Qualitätsoffensive Lehrerbildung", a joint initiative of the Federal Government and the Länder that aims to improve the quality of teacher training. The program is funded by the Federal Ministry of Education and Research (BMBF) (grant number 01 JA 1610). The authors are responsible for the content of this publication. We acknowledge support by the Open Access Publication Fund of the University of Duisburg-Essen.