A Gender Bias in Curriculum-Based Measurement across Content Domains: Insights from a German Study

Abstract: By systematically responding to achievement progress data obtained through curriculum-based measurement, teachers can improve students' performance. However, studies show that teachers are prone to making biased judgments about the students who provide the data. The present investigation experimentally examined whether pre-service teachers in Germany were biased by gender stereotypes when judging students' achievement derived from progress data. N = 100 pre-service teachers received graphs that depicted the development of either oral reading fluency or math achievement of girls and boys over a time interval of 11 weeks. The results partially confirmed the hypotheses. The participants did not favor girls over boys on average. However, they judged achievement in reading to be higher for girls than for boys, and math achievement to be higher for boys than for girls. The results suggest that gender stereotypes (boys are good at math, girls are good at reading) are still prevalent among pre-service teachers.


Introduction
Teachers are increasingly turning to curriculum-based measurement (CBM) as a tool for monitoring student development in fundamental academic domains like arithmetic, reading, spelling, and writing. In most cases, it comprises employing brief, routinely administered standardized tests to gauge pupils' advancement toward a long-term objective [1]. However, the effectiveness of CBM for raising student achievement appears to be mixed [2][3][4][5], despite the fact that it provides teachers with a strong framework for making evidence-based judgments regarding whether students need help, instruction needs to be revised, or teaching objectives need to be adjusted [6]. Research has shown that teachers have difficulty using progress data to inform and guide their instruction [7,8]. The interpretation of progress data is impacted by a number of factors, which is one reason why CBM alone does not result in better teaching [9][10][11][12]. These factors may be related to the lack of attention teachers devote to relevant aspects of the graph [13], may be connected with characteristics of the progress data itself or its presentation, or may belong to characteristics of the to-be-judged students. For example, refs. [9,11,12] found that high data variability results in relative overestimation of current trends. Ref. [14] showed that the presence of a trend line, which visually depicts the linear component of progress data points, reduces judgment errors. Moreover, ref. [11] demonstrated that pre-service teachers judged progress data of reading fluency obtained from girls more positively than the same data obtained from boys.
The aim of the present study was to contribute to the growing literature showing that in-service and pre-service teachers have difficulty interpreting and making adequate decisions based on progress data. In particular, an attempt was made to replicate the gender bias obtained by [11] with a different task. Furthermore, it was investigated whether a gender bias would depend on the content domain in which progress data were obtained. Specifically, it was assumed that although pre-service teachers should on average favor girls over boys, boys should benefit in the math domain, whereas girls should be at an advantage in the reading domain.

Curriculum-Based Measurement
Curriculum-based measurement is a general term for assessment systems that track a student's progress in learning within a particular academic subject. In order to determine whether students have achieved a learning goal or instead require extra support, it is necessary to evaluate their abilities frequently [15].
CBM entails the frequent administration of brief measures of performance in a chosen academic domain of interest (such as reading, writing, or mathematics). A graph showing the student's learning trajectory over a set time period is typically used to depict the student's achievement. The graph can be used by teachers to assess the efficacy of a lesson plan, a student's mastery of a subject, or whether a student is expected to perform in accordance with pre-set learning objectives. CBM can be a useful tool for teachers to raise students' performance by systematically responding to achievement data with instructional adjustments. However, when employing CBM, teachers frequently struggle to enhance their instruction [2][3][4][5][8]. Although CBM graphs are constructed to facilitate teachers' understanding of their students' progress, their comprehension appears to be challenging [8,16]. One probable explanation for teachers' lack of adequate response to the presentation of progress data is their inability to read and understand data accurately [17,18]. Even using computer software designed to help teachers analyze graphs by presenting statistics like the graph's linear trend does not produce an adequate grasp of the progress data. Moreover, teachers frequently do not use these statistics [16]. Instead, they rely more often on visual assessment of the data [14]. Visual inspection, on the other hand, is prone to inaccuracy [19,20], and as a result, teachers make errors when evaluating visible progress data. For instance, ref. [12] presented teachers with CBM graphs and assessed their ability to grasp information from them. They found that teachers were prone to ignore the relevant information and to focus on rather marginal details. Similar results were revealed by [13], who examined teachers' eye movements when judging CBM graphs. The quality of data interpretation also seems to depend on the intensive support of teachers by researchers. Without such support, teachers are likely to use CBM data inconsistently and inappropriately [21]. According to [22], merely providing teachers with students' data will not necessarily result in their using it, as long as they do not believe in the importance of these data.

Origins of CBM
CBM was invented in the 1970s by Deno and Mirkin [23]. It was developed within the field of special education with the aim of allowing teachers to formatively evaluate their instructional programs by successively using probes to test their students' basic academic skills, so that they were able to describe the students' growth in academic achievement. The invention of CBM was followed by a 5-year program of research conducted at the Institute for Research on Learning Disabilities at the University of Minnesota [24]. Since CBM offers a system for monitoring students' attainment of academic goals and evaluating instructional programs, its use has been formalized among school districts in the United States. Today, U.S.-based norms and data management are available on internet platforms, e.g., www.aimsweb.com (accessed on 10 December 2023).

CBM in Germany
In Germany, where the present study was run, formative evaluation of student achievement has been debated and applied since the 1970s, but the specific use of CBM began only after the seminal paper by [25] on the history and development of CBM. Since then, research on CBM in Germany has flourished, leading to internet platforms that provide teachers with diagnostic instruments to monitor students' learning progress [26]. However, the relevant tests must be obtained by the respective school or teacher via providers; they are therefore not made available to the teacher as part of the regular teaching material. Instruction in diagnostic practice is usually provided via videos or handouts available online. Training programs for teachers do not seem to exist so far. Hence, some authors, e.g., [5,8], call for teacher training programs that cover working with assessment data and using these data to modify instruction. Similar to the U.S., where they have been established for a long time, there are also approaches to the standardization of learning progress data in Germany (e.g., [27]).

Different Content Domains
CBM was developed initially to help teachers at the primary school level increase the achievement of students struggling to learn basic skills in reading, writing, and arithmetic [15]. Since many students struggle with reading [28], reading was one of the first domains in which CBM was applied. Reading CBM often consists of oral reading fluency [6], where students read aloud from a passage for a limited time (e.g., 1 min). To do this, students need to use a variety of different literacy skills, for instance, decoding, vocabulary, and comprehension [29]. Teachers score reading CBM by first counting the total number of words attempted in 1 min, then counting the total number of errors, and finally subtracting the number of errors from the number of words attempted, yielding the words read correctly (WRC) score [6].
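The scoring rule described above can be sketched as a small function (a minimal illustration only; the function and parameter names are chosen here for readability and are not part of any CBM scoring manual):

```python
def words_read_correctly(words_attempted: int, errors: int) -> int:
    """Score a 1-minute oral reading fluency probe.

    WRC = total words attempted minus total errors,
    following the scoring rule described in the text [6].
    """
    if words_attempted < 0 or errors < 0 or errors > words_attempted:
        raise ValueError("invalid counts")
    return words_attempted - errors

# Example: a student attempts 58 words and makes 4 errors.
print(words_read_correctly(58, 4))  # 54 words read correctly
```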
As with reading, math is a skill that is essential for success in life. In parallel to reading CBM, math CBM has been developed to assess computation, e.g., [30], with the majority of research and development focused on the primary school level [31]. Math CBM is conducted by having students answer computational problems for a certain amount of time (e.g., 2 min; [32]). When scoring math CBM, usually the number of correct digits is used [6].
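Conventions for counting correct digits vary across scoring manuals; one simple convention, comparing digits position by position from the ones place, could be sketched as follows (an illustrative assumption, not the specific rule used in [6]):

```python
def correct_digits(student_answer: str, correct_answer: str) -> int:
    """Count digits in the student's answer that match the answer key.

    Digits are compared position by position from the right (ones place
    first), which is one common convention for math CBM scoring.
    Actual scoring manuals differ in details; this is only a sketch.
    """
    score = 0
    for s, c in zip(reversed(student_answer), reversed(correct_answer)):
        if s == c and s.isdigit():
            score += 1
    return score

# Example: correct answer is 418, the student writes 428:
# the ones and hundreds digits match, the tens digit does not.
print(correct_digits("428", "418"))  # 2 correct digits
```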

Biases in Interpreting Progress Data
When teachers use CBM data to judge student achievement, several causes of bias have been discovered. For instance, when progress data are highly variable, teachers find it challenging to predict the rate of progress accurately [9,12,33]. Ref. [33] showed that teachers tended to overestimate student progress when data variability was high. Similar results were obtained by [9,12], who showed that pre-service teachers tended to overestimate the current trend. This result could be explained by the participants' proclivity to identify trends in random patterns. Peaks in progress data with a high level of random variability may imply that those children will perform better than students with the same trend but a lower amount of random variability and hence lower peaks.

A Gender Bias in CBM
Girls typically outperform boys in reading competency across countries and languages, e.g., [34][35][36][37][38]. In math, gender differences are also likely to occur. Boys continue to outperform girls in math, with a wider disparity among the highest achievers, despite gender gaps in job market involvement and educational attainment narrowing [39,40].
Despite the relative advantage of boys in math, several studies in different countries have shown that, on average and across domains, girls outperform boys, e.g., [40]. Compared to boys, girls are more likely to display high-achieving developmental patterns, e.g., [41].
Differences between boys and girls in achievement are usually reflected in differences in teachers' assessment of their achievement [42]. However, gender-related differences in assessment might also arise from bias that is not based on achievement or skills. Boys' lower reading proficiency levels and their relatively higher math successes are discussed as being partially the product of a bias due to teachers' gender stereotypes, which hold that reading is more appropriate for girls than for boys, e.g., [43], whereas math is better suited for boys than for girls [38,44]. Gender stereotypes among teachers conform to stereotypes about student motivation and working habits [45,46].

Gender Stereotypes as a Source of Gender Bias
Stereotypes can be defined as "shared [...] beliefs about traits that are characteristic of members of a social category" [47] (p. 14). Thus, they are the result of categorizing individuals into groups based on supposed commonality. Stereotypes can serve as norms, affecting expectations and behavior toward members of a particular social group, and as schemas, enhancing social interactions with strangers [48]. These expectations are activated when a target is classified as belonging to a specific group [49,50].
According to dual process theories of social judgment, e.g., [51], people's evaluations of other people take place along a continuum between two simultaneous processes. On one end of the spectrum, judgments are quick, effortless, automatic, and based on social categories (e.g., "girl", "boy", "immigrant"); on the other end, a slow, laborious, voluntarily initiated process is assumed to outweigh and enrich the automatic process by incorporating all pertinent information about the subject of the judgment.
If a person exhibits salient characteristics that are consistent with a certain stereotype, or if the judging person is unsure about the proper interpretation of the other person's behavior, the use of stereotypical categories becomes more likely [52,53].
Gender stereotypes in particular cause female students to be seen as less talented than male students in all areas of science, whereas male students are considered inferior to female students in the domain of languages [54].

Pre-Service Teacher Education in Germany
The German teacher education system has traditionally been divided into two phases: After graduating from a university with a first state examination (or a master's degree), prospective teachers enter the second phase, a traineeship that may take up to two years. During this time, the student teacher observes lessons, takes classes on general topics related to teaching in schools, and works as a teacher in one or several schools [55]. There is ample evidence that stereotypical beliefs about students regarding their ethnicity or gender exist even among pre-service teachers [56,57]. Once established, these stereotypes are likely to affect the judgment of students later, during teaching in school, e.g., [58].

Research Questions and Hypotheses
The rationale for the current study was as follows. Because pre-service teachers may have different stereotypical expectations about the achievement development of boys and girls in school, cf. [53], they would presumably judge the achievement trajectories of girls to be higher than those of boys, even when they are actually identical. Beyond that general gender achievement bias, pre-service teachers should in particular overestimate girls' oral reading fluency compared to that of boys, whereas the opposite should be true in math. Consequently, the following four hypotheses were tested:

1. It was assumed that estimates of achievement progress, depicted as CBM graphs, would in general be higher for girls than for boys, irrespective of the content domain.

2. In addition, it was hypothesized that estimates of achievement progress in oral reading fluency would be higher for girls than for boys, even when both exhibit identical learning progress.

3. On the contrary, it was supposed that estimates of achievement progress in math would be higher for boys than for girls, even when both exhibit the same learning progress.

4. Finally, it was assumed that the participants would estimate achievement progress of both girls and boys to be higher when the linear trend of the data is steep rather than flat, and when data variability is high rather than low.

Participants
Based on previous investigations [11], an average medium-sized effect (f = 0.30, corresponding to η² = 0.08) of the independent variables on the participants' judgments was assumed. An a priori power analysis for an ANOVA with repeated measures and two groups was conducted using G*Power 3.1 [59]. Specifying f = 0.30, α = 0.05, 1 − β = 0.90, and a correlation among repeated measures of r = 0.50, the power analysis yielded a minimum sample size of N = 22.
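The correspondence between Cohen's f and η² stated above follows from the standard conversion η² = f²/(1 + f²); a quick check (illustration only, not part of the original analysis):

```python
def f_to_eta_squared(f: float) -> float:
    """Convert Cohen's f effect size to eta squared: η² = f² / (1 + f²)."""
    return f ** 2 / (1 + f ** 2)

# f = 0.30 corresponds to η² ≈ 0.08, as stated in the text.
print(round(f_to_eta_squared(0.30), 2))  # 0.08
```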
Pre-service teachers enrolled in primary or secondary school teacher education programs at several German universities were recruited via social media announcements. Students of both primary and secondary teacher study programs were recruited because CBM is applied at both levels of schooling in Germany. A total of 128 pre-service teachers volunteered to participate in the experiment. However, 28 individuals were omitted from further analyses because they left the study either before receiving any student vignettes or after receiving the first vignette. The data of the remaining participants were complete.
Thus, a total of N = 100 pre-service teachers (M age = 24.9 years, SD = 2.0) participated in the study; 85 participants were female, 15 were male. The majority of the students were in the third or fourth year of their study program and had, on average, studied teaching for 7.1 semesters (SD = 3.3). Most participants (n = 87) reported having no previous experience with CBM. The remaining participants stated that they had heard about CBM in lectures or seminars, but without ever practicing it.

Materials and Procedure
The experiment was run via www.soscisurvey.de (accessed on 12 April 2022). The participants could complete the experiment's tasks on a computer or any other electronic device that was linked to the Internet. The study was accessible for 22 days.
The participants were welcomed and requested to consent to participate before the experiment began. After that, they were given a brief introduction to curriculum-based measurement. In the first part of the introduction, the participants received a short text (264 words) containing information about the aim of CBM and how teachers can benefit from it. In the second part, the participants learned how learning progress data are visually represented (including an explanation of the different parts of the graphs, i.e., the learning curve, the trend line, and the goal line). Moreover, they were provided with the trend line rule, according to which a student is making adequate progress if the trend line and the goal line are similar [6]. The introduction finished with an example of the graphical representation of a primary school student's learning progress. The participants were told to judge the student's learning progress by estimating whether the student should receive further support and whether the goal line should be raised.
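The trend line rule can be sketched as a comparison of the fitted trend slope with the goal line slope. Note that the numeric tolerance below is an illustrative assumption; neither [6] nor the study's materials specify a numeric criterion:

```python
def adequate_progress(trend_slope: float, goal_slope: float,
                      tolerance: float = 0.5) -> bool:
    """Trend line rule (sketch): progress counts as adequate when the
    slope of the fitted trend line is at least close to the slope of
    the goal line. The tolerance is an assumption for illustration.
    """
    return trend_slope >= goal_slope - tolerance

# The goal line slope in this study was 3 units per week (see below).
print(adequate_progress(trend_slope=5.0, goal_slope=3.0))  # True (steep trend)
print(adequate_progress(trend_slope=2.0, goal_slope=3.0))  # False (flat trend)
```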
After the introduction, the participants were randomly assigned in equal numbers to one of two conditions. However, due to a programming error, n = 51 participants were enrolled in Condition 1, whereas 49 participants were enrolled in Condition 2.
In Condition 1, the participants were presented with progress data obtained from an oral reading fluency assessment. In Condition 2, the participants were presented with progress data obtained from arithmetic tasks. In either condition, each participant received eight experimental student vignettes in random order. These vignettes illustrated the learning progress of four boys and four girls over a time period of 11 weeks. Six arbitrary distractor vignettes were presented in addition to the experimental vignettes to mask the study's independent variables, because knowledge of these variables could influence the participants' responses [60].
Each vignette showed a graph of the progress data of one student. In Condition 1, the y-axis represented the number of words read correctly (WRC), ranging from 0 to 140, cf. [6]. In Condition 2, the y-axis represented the number of correct digits, cf. [6], also ranging from 0 to 140. In both conditions, the x-axis represented the school week in which the test was administered, ranging from week 1 to week 11. The progress data were accompanied by a trend line and a goal line. The linear trend was estimated by ordinary least squares linear regression analysis and was represented by a thin (1 mm) dotted black line from week 1 to week 11. The goal line was displayed as a continuous thin (1 mm) black line from week 1 to week 11, which served as an aid to facilitate judgments about the progress each student made. The steepness of the goal line was equal for all student vignettes and was given by y = 3x + 25, with x being the week of assessment.
The participants were asked to examine the development of each student's achievement over the period of 11 weeks and to judge the student's progress. In particular, they were instructed to rate how strongly they agreed with the following statements: (1) "The student needs further assistance". (2) "The student's goal line should be raised". Each rating was made on a six-point Likert scale ranging from 1 ("totally disagree") to 6 ("totally agree"). Note that agreement with the first statement indicated that the participants judged the student's progress to be rather low, whereas agreement with the second statement reflected the participant's opinion that the student was doing quite well. Two items instead of a single item were developed to cover different facets of the assessment of students' achievement progress.
The independent within-subjects variables were the slope of the linear trend of the data, the amount of data variability, and the students' gender. Students' gender was indicated by their names. The names chosen for this study were typical of German children, both boys and girls. Typical names were used in order to prevent the activation of stereotypes associated with certain social and economic milieus, which can be triggered by rarely used names that are frequently chosen in these milieus [61].
The slope of the linear trend of the data was either low or high. The linear trend in both conditions was given by the following function: y = bx + 9, with b representing the rate of improvement and x representing the school week. A steep linear trend was operationalized as b = 5, whereas a flat trend was given by b = 2.
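As a sketch, weekly scores for one vignette can be generated from the trend function above and the slope recovered by ordinary least squares. The Gaussian noise model here is an illustrative assumption; the study constructed its residuals to match fixed SEE values rather than sampling them:

```python
import random

def make_vignette(b: float, see: float, weeks: int = 11,
                  seed: int = 0) -> list[float]:
    """Generate weekly scores around the trend y = b*x + 9 used in the
    study, adding Gaussian residuals (an illustrative assumption)."""
    rng = random.Random(seed)
    return [b * x + 9 + rng.gauss(0, see) for x in range(1, weeks + 1)]

def ols_slope(ys: list[float]) -> float:
    """Ordinary least squares slope of scores on week numbers 1..n."""
    n = len(ys)
    xs = range(1, n + 1)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

steep = make_vignette(b=5, see=0)  # noise-free: slope is recovered exactly
print(round(ols_slope(steep), 2))  # 5.0
```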
Data variability was either low or high. The standard error of the estimate (SEE), defined as the average magnitude of residuals around the trend line derived from linear regression, is frequently used in the literature to quantify the variability of learning progress graphs. SEE values typically vary between 5 and 20, with 5 meaning very low and 20 very high variability, e.g., [62,63]. In the present study, the SEE of high-variability progress data was 10.0, and that of low-variability data was 5.0.
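The SEE can be computed from the residuals around the fitted trend line; a minimal sketch (the divisor n − 2 reflects the two estimated regression parameters, slope and intercept):

```python
import math

def standard_error_of_estimate(ys: list[float], slope: float,
                               intercept: float) -> float:
    """SEE = sqrt(sum of squared residuals / (n - 2)) around a fitted
    trend line y = slope*x + intercept, with x = 1..n (school weeks)."""
    n = len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(range(1, n + 1), ys))
    return math.sqrt(ss_res / (n - 2))

# A noise-free trajectory on the trend y = 2x + 9 has SEE = 0.
flat = [2 * x + 9 for x in range(1, 12)]
print(standard_error_of_estimate(flat, slope=2, intercept=9))  # 0.0
```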
The between-subjects independent variable was the content domain of the progress data, which represented learning progress either in reading or in math.
Figure 1 shows the experimental vignettes of boys of the reading condition for illustration purposes.

Data Analyses
A repeated-measures ANOVA, with content domain (reading vs. math) as the between-subjects factor, and student gender, slope, and data variability as the within-subjects factors, was run for each dependent variable.


Results
Tables 1 and 2 display the means, standard deviations, and 95% confidence intervals of the dependent variables of each condition.

Regarding the dependent variable "need for assistance", the analysis of variance yielded one significant main effect and three significant interaction effects. All results are shown in Table 3. First, there was a significant main effect of the slope of the linear trend: the participants estimated the need for further assistance to be higher when the slope was flat (M = 3.67, SD = 0.30) rather than steep (M = 3.08, SD = 0.44).
In addition to this main effect, the ANOVA yielded a significant data variability × content domain interaction. In the math domain, participants judged students showing low variability (M = 3.47, SD = 0.40) to need more assistance than students showing high variability (M = 3.26, SD = 0.47), whereas in the reading domain, students showing low variability (M = 3.36, SD = 0.40) were judged to need less support than students with high variability (M = 3.41, SD = 0.46). The difference in judgments between low- and high-variability students was significant only in the math domain, F(1, 98) = 5.25, p = 0.024, but not in the reading domain, F(1, 98) = 0.30, p = 0.585. All simple effects were Bonferroni-adjusted.
Finally, the ANOVA produced a significant data variability × student gender interaction. For boys, the participants judged the need for assistance to be higher when the variability of the data was high (M = 3.84, SD = 0.85) rather than low (M = 2.94, SD = 0.91), F(1, 98) = 21.11, p < 0.001. However, for girls, the participants did the reverse and judged the need for assistance to be higher when the variability of the data was low (M = 3.89, SD = 0.83) rather than high (M = 2.83, SD = 0.86), F(1, 98) = 32.06, p < 0.001.
Concerning the dependent variable "raising goal", the ANOVA produced three significant main effects and four significant interactions. The results are shown in Table 4.
As with the dependent variable "need for assistance", there was a significant main effect of slope, meaning that a steep slope resulted in higher ratings (M = 3.86, SD = 0.45) than a flat slope (M = 3.22, SD = 0.48). Furthermore, there was a significant main effect of data variability. The participants preferred raising the goal for students showing high data variability (M = 3.60, SD = 0.31) over students showing low data variability (M = 3.48, SD = 0.28). In addition, a main effect of the content domain was obtained. The ratings for raising the goal were higher in the math domain (M = 3.61, SD = 0.30) than in the reading domain (M = 3.47, SD = 0.29).
To determine whether the participants' own gender had an impact on their ratings, we also conducted correlational analyses between teacher gender (male vs. female) and each dependent variable obtained from all realized combinations of the variables used to construct the student vignettes. With all |r| < 0.16 and all ps > 0.127, only weak and negligible relationships were obtained.
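A correlation between a binary variable such as teacher gender (coded 0/1) and a rating is simply the Pearson correlation on the coded values, i.e., the point-biserial correlation; a minimal sketch with made-up numbers (the example data are purely hypothetical):

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; with 0/1-coded x this equals
    the point-biserial correlation used for the gender analyses."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical ratings, gender coded 0 = female, 1 = male.
gender = [0, 0, 0, 1, 1]
ratings = [3.2, 3.5, 3.1, 3.4, 3.3]
print(abs(pearson_r(gender, ratings)) < 0.5)  # True: only a weak relationship
```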

Discussion
This study demonstrated that pre-service teachers' judgments of students' learning progress were biased by the gender of the students. Judgments of reading fluency were higher for girls than for boys. In particular, the participants judged the need for assistance to be lower for girls than for boys, and they opted more strongly for raising the goal line for girls than for boys. The contrary was the case when the participants judged a graph presenting the trajectory of math achievement. In math, boys were judged to be superior to girls on both dependent measures. Strikingly, the difference between boys and girls was large in either content domain. On average, the participants' ratings for girls differed from those for boys by 2.7 times the standard deviation of the distribution of the dependent variables used in this study. A similar result was obtained for boys in the math domain (d = 2.32). These effects were comparable to, albeit somewhat larger than, those obtained by [11]. Therefore, the hypothesis was confirmed that pre-service teachers stereotype boys and girls when judging the progress of their reading fluency and math achievement.
We also expected the participants to judge the achievement development of girls on average to be higher than that of boys. The rationale behind this assumption was the evidence obtained from several studies [40,41] showing that girls outperform boys on average in both reading and math. Corresponding to this achievement difference, teachers should expect girls to perform better and develop faster than boys [53]. However, the overall difference between the participants' judgments of boys and girls was not significant in this study; hence, the hypothesis had to be rejected. One reason for the absence of the student gender main effect was the disordinal interaction between student gender and content domain, which impressively showed that the advantage of one gender over the other was strongly dependent on the content domain.
It was further assumed that steep achievement trajectories would correspond with a higher likelihood of high judgments of achievement (i.e., low ratings of needed support and high ratings of raising the goal line) compared to flat achievement trajectories. This hypothesis was confirmed. With both dependent variables, a significant main effect of steepness occurred, which accounted for 21.5% of the variance on average. Although this result might seem trivial, as it appears to reflect merely an expectable tendency to judge fast development higher than slow development, it demonstrates that the participants actually perceived and responded to the manipulation of the graphs, indicating that the results obtained in this study were internally valid.
In the final hypothesis, it was assumed that highly variable trajectories would be more likely to indicate faster development than trajectories with low data variability. This hypothesis was informed by empirical results showing that high levels of variability in progress data often correspond with a higher probability of detecting a trend in the data than low levels of variability [9,33]. Moreover, theoretical assumptions guided this hypothesis, as high data variability produces higher peaks in progress data than low data variability, and high peaks could be perceived as a zone of achievement a student is potentially capable of, cf. [64]. In line with this hypothesis, the participants judged highly variable progress data as indicating faster progress than low-variability progress data, but only with respect to raising the goal line.
The latter result shows that the findings obtained from the dependent variable "need for assistance" did not exactly mirror those obtained from the dependent variable "raising the goal line". With the latter, the independent variables produced more significant effects. On a descriptive level, all effects that were significant with "raising the goal line" were also observed with "need for assistance", but only some of them reached statistical significance. Obviously, "raising the goal line" was more sensitive to the manipulation of the graph characteristics and the gender of the students.
The judged superiority of students showing highly variable development depended on gender and content domain. Data variability affected the participants' judgments in opposite directions depending on student gender: with boys, highly variable achievement trajectories resulted in lower ratings of performance, whereas with girls, the reverse was the case, with highly variable trajectories yielding higher ratings of performance. In the math domain, the participants judged the achievement trajectories to be higher when data variability was high than when it was low, whereas in the reading domain, data variability did not significantly affect the participants' judgments. Neither interaction effect can be explained by the visual characteristics of the achievement trajectory; both have to be ascribed to characteristics of the students to be judged and of the content domain in which the achievement had been demonstrated.
When data variability was substantial, higher ratings for girls and lower ratings for boys may have resulted from the impressions about the students that the participants derived from studying each trajectory. Since the source of data variability is difficult to interpret, it offers educators in general, and the participants of the present study in particular, a range of possible reasons why the data might vary. When straightforward interpretations of given data cannot be applied, the judgment itself becomes uncertain [65]. Uncertainty in judgments has been shown to produce judgments that rely on stereotypical assumptions about the individuals to be judged [52,66,67]. Hence, high variability in progress data could have elicited more stereotype-based judgments than low variability, because the latter appeared easier to attribute to plain causes than the former. Consequently, the likelihood that the participants used their stereotypes about boys and girls in their judgments was higher with highly variable than with low-variability data. In accordance with gender stereotypes, the achievement of boys should be judged lower than that of girls. This assumption fits the data well and replicates the gender × data variability effect shown by [11].
There is evidence that variability in reading time predicts reading comprehension [68], as variability has been shown to be an important component of response preparation involving executive control functions [69]. Regarding the interaction between data variability and content domain, it is therefore possible that the participants interpreted high variability in reading as a cue for a rather low level of achievement, whereas in math high variability does not seem to be associated with low performance.
This study also yielded a significant student gender × slope interaction. Only with steep trends was there a significant difference between boys and girls, with participants associating lower progress with boys than with girls, irrespective of the content domain. Again, gender stereotypes held by the participants may account for this result. Since stereotypes about girls correspond to high achievement, whereas those about boys imply rather low achievement [45], steep achievement trajectories conform to the stereotype of a girl rather than to that of a boy. Because stereotypes are more likely to be activated when they match the description of an individual than when they contradict it [70], we deem it likely that steep trajectories were automatically associated with girls rather than with boys.

Limitations
People rely on their stereotyped ideas about social groups as a basis for their judgments whenever they lack the motivation or capacity to engage in a more thorough analysis of information [71]. Although such a lack of motivation or capacity may be typical in most real-world situations [51], it is not necessarily the case when teachers make decisions regarding students' education. The fact that the participants in this study judged virtual rather than actual students may therefore be considered a weakness of the study. Participants may not have felt strongly responsible for their judgments [72] and may consequently have exerted less effort to analyze the information than they would in school.
Additionally, teachers in classrooms can draw on various sources of data to support their judgments of students' performance. Hence, judgments of student achievement in the classroom may be less influenced by stereotypes than in this study.
The participants' lack of CBM expertise may also have influenced the effects of the independent variables employed in this study. Little is known about the strategies teachers use to interpret learning progress data presented in graphs [73], and there is evidence that teachers have difficulty interpreting CBM data [17]. Training teachers in graph literacy might therefore improve their understanding of learning progress and even their judgments of learning progress graphs, e.g., [12]. However, to the authors' knowledge, no study has shown that a lack of expertise in interpreting CBM data is related to more biased judgments. According to the literature, teachers' expertise is only weakly correlated with their judgment accuracy [74,75]. Further studies should use teachers' expertise as an independent variable when investigating sources of biased judgments.
The unequal gender distribution among the participants might also have biased the results, as women and men might differ in their judgments of students' achievement in reading and math [76]. However, the gender distribution in the present sample matched the current gender distribution of teachers in Germany [77].

Conclusions
The evident prevalence of gender stereotypes in pre-service teachers' judgments of CBM graphs coincides with an increase in actual gender-related differences in students' math and reading achievement in some countries, particularly in Germany, where the present study was conducted. Compared to a decade ago, girls lag behind boys in math, but outperform boys in reading, to an even larger degree [34,78]. The mere existence of gender stereotypes may contribute to achievement differences between boys and girls. According to the theory of stereotype threat [79], girls' math performance decreases because girls feel threatened by the possibility that their performance will confirm the negative stereotype associated with their social group. Stereotype threat may also explain boys' underperformance in reading [80]. Once gender-related differences have been established and appear stable, a vicious cycle of mutually reinforcing stereotypes and achievement may take hold. As a consequence, the gender gap in reading and math may even widen in the long term. To disrupt this cycle, we recommend teacher training along two paths. First, focusing on curriculum-based measurement, in-service and pre-service teachers should be trained in graph literacy [81], the competence to read and understand visualized progress data, which is apparently lacking in many pre-service and in-service teachers, e.g., [12,82]. Second, sensitivity to gender stereotypes should be increased, both within teacher study programs and in in-service teacher training. One way to do this is to raise teachers' awareness of their own stereotypes and biases (see, for example, [83] for an intervention).

Figure 1.
Figure 1. Experimental vignettes of boys in the reading condition. Note. Upper panel: steep linear trend of the data; lower panel: flat linear trend of the data; left panel: high variability of the data; right panel: low variability of the data. The students' names were shown on the vignettes to indicate student gender. The meaning of the three different lines (solid: goal line; dotted: trend line; solid with data points: student progress data) was explained to the participants. The original axis labels were in German.

Table 1.
Means, standard deviations (in parentheses), and 95% confidence intervals [in brackets] of the dependent variable "need for assistance".

Table 2.
Means, standard deviations (in parentheses), and 95% confidence intervals [in brackets] of the dependent variable "raising goal".

Table 3.
Results of the analysis of variance for the dependent variable "need for assistance".

Table 4.
Results of the analysis of variance for the dependent variable "raising goal".