Article

Measuring Mathematics Teaching Quality: The State of the Field and a Call for the Future

by Melissa A. Gallagher 1,*, Timothy D. Folger 2, Temple A. Walkowiak 3, Anne Garrison Wilhelm 4 and Jeremy Zelkowski 5

1 US Math Recovery Council, Eagan, MN 55121, USA
2 College of Community and Public Affairs, Binghamton University, Binghamton, NY 13902, USA
3 College of Education, North Carolina State University, Raleigh, NC 27695, USA
4 Simmons School of Education and Human Development, Southern Methodist University, Dallas, TX 75205, USA
5 College of Education, The University of Alabama, Tuscaloosa, AL 35401, USA
* Author to whom correspondence should be addressed.
Educ. Sci. 2025, 15(9), 1158; https://doi.org/10.3390/educsci15091158
Submission received: 28 June 2025 / Revised: 25 August 2025 / Accepted: 26 August 2025 / Published: 4 September 2025
(This article belongs to the Special Issue Recent Advances in Measuring Teaching Quality)

Abstract

To better understand how teaching quality has been conceptualized and measured within the sub-field of mathematics education, we conducted a systematic review of 24 journals to identify instruments that have been used to measure mathematics teaching quality; which instruments have interpretation and use statements; and the validity, reliability, and fairness evidence for each instrument. We found 47 instruments with validity, reliability, and fairness evidence. These instruments primarily captured teachers’ enactment of specific teaching practices through classroom observations or student questionnaires. Some instruments captured approximations of practice through teacher questionnaires or interviews. Only two instruments presented an integrated interpretation and use argument (IUA) framework, although eleven included at least one component of an IUA framework. We found that measure developers were most likely to present reliability evidence and evidence related to test content, internal structure, and relations to other variables. They were least likely to present evidence related to response processes, consequences of testing, or fairness. These findings suggest that although there are many instruments of mathematics teaching quality, instrument developers still have considerable work to do in collecting and presenting validity and fairness evidence for these instruments.

1. Introduction

Teaching is the most important in-school factor in predicting student achievement (Nye et al., 2004). Therefore, one of the biggest levers that schools can use to impact student achievement is improving teaching quality. However, to date there is no consensus in teacher education (including its subfields, such as mathematics teacher education) or in the teacher effectiveness literature on what practices constitute teaching quality or on how to measure it. Berliner (2005) describes high-quality teaching as including both good teaching (doing what is expected) and effective teaching (impacting student achievement). Determining whether teaching is effective requires measuring the behaviors that a teacher enacts in their classroom (i.e., their teaching) and relating this measure of teaching to student achievement. Measuring teaching, therefore, requires instruments—“device[s] or procedure[s] in which a sample of an examinee’s behavior in a specified domain is obtained and subsequently evaluated and scored using a standardized process” (AERA et al., 2014, p. 2)—that capture teacher behavior. Further, the practices that are measured and taken to be high quality are determined by a particular community of practice rather than agreed upon more widely (Gitomer & Bell, 2013). In addition to this lack of agreement in the field on what constitutes quality teaching, there is the challenge that many developers of instruments measuring teacher behavior have not yet taken up more current approaches to instrument development.
Traditional views of instrument development have described an instrument as having reliability and validity and have ignored fairness. However, more recently, instrument development experts have noted that reliability, validity, and fairness are not inherent to an instrument, but rather are directly related to the evidence collected for interpretations of the instrument’s scores for specific uses (AERA et al., 2014; M. Kane, 2016; Krupa et al., 2019). An argument-based approach to validation—such as the Interpretation/Use Argument (IUA; M. Kane, 2013, 2016)—requires the researcher to make explicit the claims inherent in the interpretation of instrument scores for specific uses and to critically evaluate and provide evidence for the validity of those claims (M. Kane, 2016). The intended score interpretation and use, and the claims entailed, then provide a “framework for test development and validation” (M. Kane, 2016, p. 65) wherein the researcher highlights the evidence they propose to collect to justify the interpretation and use of instrument scores (AERA et al., 2014). This framework is known as an integrated IUA framework.
The purpose of this review was to determine how mathematics teaching has been measured in the twenty-first century in research studies and how researchers approached the development of those instruments. This review was guided by the following research questions:
  • What approaches have been used to measure mathematics teaching?
  • To what extent are there components of an integrated IUA framework (i.e., interpretation and use statements and claims) for mathematics teaching instruments?
  • What types of validity, reliability, and fairness evidence have been described for mathematics teaching instruments? Do these types vary based on how mathematics teaching has been measured?

2. Literature Review

This review focuses on how mathematics teaching has been measured and the approaches taken by instrument developers. We first provide an overview of mathematics teaching quality and then provide a framing of approaches to instrument development.

2.1. Mathematics Teaching Quality

Because research has shown repeatedly that the teacher matters for student learning (e.g., Lynch et al., 2017; Nye et al., 2004), educational stakeholders have become increasingly interested in teaching quality and how to measure it. However, teaching is a complex endeavor that involves coordinating, facilitating, and responding to many activities in the classroom to support learning goals, and therefore, measuring teaching quality is an equally complex undertaking. In the field of mathematics education, there has been increasing interest among researchers in how to measure the dimensions or behaviors that represent high-quality mathematics teaching. These dimensions or behaviors have been largely anchored in documents from organizations in the United States like the National Council of Teachers of Mathematics (2000, 2014) and the National Research Council (2001) and around the world (e.g., Australian Association of Mathematics Teachers, 2006; Eurydice, 2011). Consequently, conceptualizations of mathematics teaching in research have largely reflected these recommendations (Praetorius & Charalambous, 2018), including behaviors such as facilitating student discourse, utilizing representations to promote conceptual understanding, and responding intentionally to students’ ideas to support them in their understanding. Moreover, some mathematics education researchers have examined teaching practices specific to sub-domains of mathematics, such as algebra (Schoenfeld, 2020); however, research into teaching practices in specific sub-domains of mathematics is still a fairly nascent field, so we focus our review on mathematics teaching more broadly but include sub-domain-specific instruments that met our review criteria.
Measuring these teaching behaviors allows researchers to examine relationships between the specific behaviors and other variables. For example, researchers have examined relationships between mathematics teaching quality and teacher knowledge to provide insight on the role of teachers’ knowledge in the practices they implement in the mathematics classroom (Desimone et al., 2016; Hill et al., 2008). Other researchers have investigated the relationship between mathematics teaching and students’ learning outcomes (Bishop, 2021; Ottmar et al., 2015). These studies provide information about which teaching behaviors seem to be optimal for students’ opportunities to learn (i.e., those which constitute mathematics teaching quality), thereby informing teacher education and professional development programs. These implications speak to the importance of having instruments that measure mathematics teaching.

Measuring Mathematics Teaching Quality

The question of how to measure teaching quality has been of interest to a variety of different audiences in the last 20 years. In particular, educational researchers, policy makers, and education practitioners have all taken an interest in this issue, from a variety of perspectives. The Measures of Effective Teaching (MET; T. Kane et al., 2014) study set out to address this issue by collecting a variety of data from six large school districts across the United States. They were interested in understanding how three common school-district approaches to studying teaching quality—classroom observation data, student surveys, and value-added scores (i.e., a statistical estimate of students’ academic growth attributed to the teacher)—could be utilized to describe the strengths and weaknesses of teachers’ classroom instruction for the purposes of teacher evaluation. They found that using multiple data sources and measures, rather than just one, improved the information provided to district leaders about teachers’ classroom instruction.
In mathematics education research, the most common approach for measuring teaching quality is to use classroom observation tools. In a review of classroom observation in mathematics education, J. Bostic et al. (2021) found 114 manuscripts that used classroom observation over a 15-year period (2000–2015); however, the majority of these studies did not use a formalized classroom observation protocol. Of the formalized classroom observation protocols in use, only 44% of them had validity evidence published in peer-reviewed journals. The authors identified this as an area of concern for the field of mathematics education.
While less common in mathematics education, researchers in educational assessment have advocated for using student surveys to allow students to report on their perspectives, as daily participants, with respect to how interactions unfold in the classroom (Downer et al., 2015). Prior studies have found that students can provide unique information (e.g., Feldlaufer et al., 1988), and their ratings of classroom interactions have been related to student outcomes (e.g., Fauth et al., 2014). The MET study in conjunction with this body of research suggests that it is reasonable to look for and include student surveys as a measure of mathematics teacher behavior.
When measuring teaching quality, researchers can observe some behaviors while they are being enacted, whereas other behaviors are harder to observe. Researchers who study teaching practices that are harder to observe often use teacher interviews or questionnaires. For example, teacher noticing is a practice of teaching that involves attending to instances of teaching and learning in the context of classroom instruction, interpreting those instances, and making decisions about how to respond to those instances (Amador et al., 2021; Jacobs & Spangler, 2017). Teachers engage in noticing continually, but much of it is difficult to observe because it happens in teachers’ heads. By asking teachers to watch a video and attend to instances of teaching and learning, interpret those instances, and decide how to respond, researchers can approximate the practice of noticing. However, this approximation is distinct from what teachers actually notice as they enact their teaching practice. For the purposes of this study, we were interested in capturing all approaches used by researchers to study mathematics teaching.

2.2. Advancement of Validity Theory

Validity is a fundamental concern of measurement, and it is a concept that has evolved over the past century. Notably, validity was once referenced using many names (e.g., content validity, criterion validity) and evaluated through a diversity of procedures (Anastasi, 1986). Criterion validity was considered the “gold standard” for validity and validation between 1920 and 1950 (M. Kane, 2013). Using a criterion measure that was assumed to estimate the true value of a trait, researchers evaluated validity by examining the relationship between a test and its criterion. Prior to 1950, validity was often defined as the correlation between test scores and criterion scores (e.g., Cureton, 1951; Thurstone, 1931). Criterion-validation approaches could further be classified as (a) concurrent validity when test scores and criterion measures were determined at approximately the same time, or (b) predictive validity when criterion measures were collected significantly later than test scores (Anastasi, 1986; Cronbach & Meehl, 1955; M. Kane, 2013). Defining validity based on a test–criterion relationship, however, became problematic if a true criterion measure did not exist, such as with investigations of aptitude, attitudes, and beliefs. Thurstone, who had previously defined validity based on the test–criterion relationship (Thurstone, 1931), later determined that “we have reached a stage of sophistication where the test–criterion correlation is too coarse. It is obsolete” (Thurstone, 1955, p. 360).
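To make the historical definition concrete, the validity coefficient under this view was simply the Pearson correlation between test scores and criterion scores (a sketch in our own notation, not drawn from the cited sources):

r_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}

where X denotes test scores and Y denotes criterion scores collected at approximately the same time (concurrent validity) or at a later time (predictive validity).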
The Technical Recommendations for Psychological Tests and Diagnostic Techniques (APA, 1954), a precursor to the Standards for Educational and Psychological Testing, was published amid growing concerns about the development and validation of psychological tests. One of the most significant contributions of the Technical Recommendations was the introduction of the concept of construct validity, which involved evaluating a test through both theoretical reasoning and empirical evidence related to the construct it intended to measure (APA, 1954). As Cronbach and Meehl (1955) explained, “Construct validation is involved whenever a test is to be interpreted as a measure of some attribute or quality which is not ‘operationally defined’” (p. 282). This framework addressed validation challenges in the absence of clear criterion measures. Over time, the measurement community moved towards conceptualizing validity as a unitary concept partly because the construct validation framework encompassed previous validation procedures related to distinct types of validity (e.g., AERA et al., 2014; Anastasi, 1986; Cronbach, 1988; Cronbach & Meehl, 1955; M. T. Kane, 1992; M. Kane, 2013; Loevinger, 1957; Messick, 1989, 1995).
Currently, validity is defined as “the degree to which evidence and theory support the interpretation of test scores for proposed uses of tests” (AERA et al., 2014, p. 11). Although this definition represents a consensus in the measurement community, differing perspectives in defining validity continue to exist (e.g., whether or not validity includes consequential considerations of test use has been widely debated; Borsboom & Wijsen, 2016; P. E. Newton & Shaw, 2016; Sireci, 2016). For this study, validity is an attribute of how scores from an instrument are interpreted and used; validity is not an attribute of a measurement instrument (AERA et al., 2014; M. Kane, 2013, 2016; Messick, 1989; Shepard, 1993, 2016; Sireci, 2016; Zumbo & Hubley, 2016).
In this study, we adopt a broad conceptualization of both tests and measurement. Specifically, we define tests as instruments used to assess the quality of mathematics teaching—encompassing tools such as observation protocols, student and teacher questionnaires, and interview guides (AERA et al., 2014). Accordingly, our conception of measurement is intentionally expansive: “the assignment of numerals to objects or events according to rules” (S. S. Stevens, 1946, p. 677). While this definition is sometimes viewed as overly permissive, it allowed our team to account for the wide range of instrument types and methodological approaches used in prior research to measure teaching quality.

2.3. A Modern Approach to Validation

Validation refers to the procedures and processes employed to evaluate validity. Current best practices for validation involve (a) describing the intended interpretation and use of test scores, (b) identifying claims underlying test-score interpretation and use, and (c) collecting evidence concerning those claims, often categorized using the five sources of validity evidence described in the Standards (See Table 1; AERA et al., 2014; M. Kane, 2013; Sireci & Benítez, 2023). It is important to highlight how the Standards’ definition of validity encompasses previous validation procedures associated with distinct types of validity. For example, notions of content validity are reflected in the Standards’ validity evidence based on test content. The difference between content validity and evidence based on test content is more theoretical than practical. Validity theorists (e.g., Messick, 1989) may argue that content validity is an incorrect term “because validity refers to interpretation of test score and not to the content of an assessment” (Sireci & Faulkner-Bond, 2014, p. 101). The practical importance is that evidence of test content can be identified in validation work framed as content validity.
It is generally accepted that collecting evidence from all five sources generates a stronger argument supporting validity. The evidence sources needed to establish some degree of validity depend on the complexity of how scores are interpreted and used. Interpretation statements involve ascribing meaning to the scores (Folger et al., 2023; Messick, 1995). Ideally, this involves substantive qualitative descriptions of an operationalized construct along a score continuum (Carney et al., 2022; Shepard, 2018). Use statements are a natural extension of score interpretation because they refer to the actions or decisions extending from score interpretation (Messick, 1995; Sireci, 2016). Sireci (2016) further illustrates the practical, sequential link between interpretation and use: “Theoretically, a physician could interpret an X-ray and not act upon the information in subsequent treatment of a patient…[But] would you want this physician as your doctor?” (p. 231). Put simply, interpretation and use statements describe (a) what the scores mean, and (b) what should happen based on that meaning (Carney et al., 2022; Folger et al., 2023; Messick, 1995; Sireci, 2016).
Furthermore, issues of reliability—the consistency and/or precision of scores over repeated applications of a measurement procedure for a given group—and fairness—the extent to which the assessment provides equal opportunities to all users—contribute to the overall evaluation of tests and testing procedures (AERA et al., 2014) but may not fall under validation procedures. To be clear, reliability and fairness, both of which become more important as the stakes of a test increase, are not in and of themselves evidence of validity (AERA et al., 2014; M. Kane, 2013). Moreover, the Standards for fairness are often not explicitly addressed by instrument developers (Jonson & Geisinger, 2022a).
A central concern for validity is fairness (AERA et al., 2014; Shepard, 2016). The Standards highlight two principal ideas integral to this concept: bias in measurement and accessibility. Bias in measurement materializes when aspects of an assessment tool or the methods for collecting data “result in different meanings for scores earned by members of different identifiable subgroups” (AERA et al., 2014, p. 51). Accessibility represents the principle that “all test takers should have an unobstructed opportunity to demonstrate their standing on the construct(s) being measured” (AERA et al., 2014, p. 49). The Standards further categorize their guidance for collecting evidence of fairness of an instrument into four clusters (see Table 2), which focus on ensuring fairness in design, development, administration, and scoring (cluster 1); fairness in score interpretations for intended uses (cluster 2); fairness of accommodations to remove construct-irrelevant variance (cluster 3); and safeguards against inappropriate interpretations and uses (cluster 4). Anything that jeopardizes fairness, such as performance variations among users caused by elements unrelated to the measured construct (a phenomenon known as construct irrelevance; AERA et al., 2014), likewise compromises the valid interpretation and application of scores. Consequently, assessments of validity ought to be paired with fairness considerations, guaranteeing that tools provide equitable treatment for all respondents (AERA et al., 2014). Nevertheless, fairness evidence is frequently excluded within studies on instrument development (Jonson & Geisinger, 2022a). Quantitative researchers are responsible for confirming that their selected instruments possess supporting evidence for validity, reliability, and also for fairness; however, a comprehensive review of available instruments and their corresponding evidence has not yet been undertaken.
Integrated validation approaches that aim to assess issues of validity, reliability, and fairness for an intended interpretation and use of test scores are often most effective in defending test use (AERA et al., 2014; M. Kane, 2013, 2016; Sireci & Benítez, 2023). For example, two widely used observational instruments for evaluating mathematics teaching quality are the Mathematics Classroom Observation Protocol for Practices (MCOP2) and the Mathematics Scan (M-Scan). The MCOP2 provides a holistic framework emphasizing teacher facilitation and student engagement with mathematical practices during instruction. The development team, who claimed that the psychometric factor structure of the instrument matched the instrument’s theoretical construct, published validity evidence with expert reviews and a systematic revision process, exploratory factor analysis, and interrater reliability (Gleason et al., 2017). Similarly, the M-Scan captures standards-based practices across nine dimensions, including discourse, cognitive demand, and multiple representations. The M-Scan validity evidence incorporates expert content reviews, rater response analyses, and convergent comparisons with established instruments (Walkowiak et al., 2014). To further illustrate the M-Scan’s integrated validity argument, the development team examined response processes of instrument raters through careful analyses of their scoring rationales to warrant the claim that raters’ “interpretations of the rubrics aligned with what the rubrics intended to measure” (Walkowiak et al., 2014, p. 465). Both instruments demonstrate a modern, multi-source validation process aimed at generating validity evidence across the sources described in the Standards (AERA et al., 2014).

3. Methods

We are part of a broader team conducting a content analysis (Krippendorff, 2004) to catalog quantitative instruments utilized in mathematics education (Krupa et al., 2024). Our team focused on primary and secondary mathematics teachers’ affect and behavior (Wilhelm et al., 2024), but for the purposes of this study, we subsetted the instruments focused on teacher behavior and engaged in a deeper analysis of those. We searched articles in 24 mathematics education journals published between 2000 and 2020. Building on Thunder and Berry’s (2016) framework for qualitative reviews, we systematically identified quantitative instruments. This process involved the following steps: (a) defining research questions, (b) identifying search terms, (c) searching databases, (d) selecting relevant studies, (e) evaluating study quality, and (f) synthesizing and reporting findings. Throughout each phase, we engaged in group coding to align our interpretations, reach consensus on coding procedures, and refine our approach to maintain inter-rater reliability (Krippendorff, 2004).

3.1. Data Collection and Analysis

We started by focusing on quantitative instruments that assess primary or secondary mathematics teachers’ behavior or affect and brainstorming relevant search terms to identify all potentially applicable studies. As part of a larger project, we limited our search to mathematics education instruments published in any of 24 English-language mathematics education journals identified by Williams and Leatham (2017) and Nivens and Otten (2017). We prioritized mathematics education journals based on the assumption that most instruments used in mathematics teacher education research would appear in studies published within this set. However, we acknowledge the possibility of missing some instruments featured in key journals outside this list (e.g., American Educational Research Journal, Educational Sciences).

3.1.1. Identifying Relevant Instruments

We conducted our search across all 24 selected journals using the EBSCO Host journal database. This decision was informed by prior experience (e.g., J. Bostic et al., 2021) and consultations with academic librarians from multiple institutions. To refine our search, we tested various Boolean search strings before finalizing one that combined journal names with two key categories: instrumentation (e.g., *observ* OR *instrum* OR *survey* OR *log* OR *assess* OR *protocol* OR *rubric* OR *tool*) and population (e.g., *teach* OR *instruct* OR *class*). For journals with a broader disciplinary focus (e.g., Journal of Computers in Mathematics and Science Teaching), we added a requirement that abstracts include the term *math* to ensure relevance. This search yielded 2286 unique articles (see Figure 1).
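The exact EBSCO syntax is not reported verbatim above, so the sketch below reconstructs the logic of the search string from the example terms given; the SO (source/journal) and AB (abstract) field tags are our assumption about how the journal and abstract constraints would be expressed in EBSCO Host.

# Illustrative reconstruction of the Boolean search-string logic described above.
INSTRUMENT_TERMS = ["*observ*", "*instrum*", "*survey*", "*log*",
                    "*assess*", "*protocol*", "*rubric*", "*tool*"]
POPULATION_TERMS = ["*teach*", "*instruct*", "*class*"]

def build_query(journal: str, require_math_in_abstract: bool = False) -> str:
    """Combine a journal name with the instrumentation and population term lists."""
    instrument = " OR ".join(INSTRUMENT_TERMS)
    population = " OR ".join(POPULATION_TERMS)
    query = f'SO "{journal}" AND ({instrument}) AND ({population})'
    if require_math_in_abstract:  # applied only to broader-scope journals
        query += ' AND AB *math*'
    return query

print(build_query("Journal of Computers in Mathematics and Science Teaching",
                  require_math_in_abstract=True))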
After identifying these articles, we manually reviewed each abstract to determine whether it included a quantitative instrument assessing primary or secondary mathematics teacher behavior or affect. We defined quantitative instruments as those generating numerical data for statistical analysis and excluded studies that only reported frequency counts of qualitative codes. However, if frequency counts were analyzed statistically, we included the instrument. We focused on instruments used to assess the mathematics teaching quality of primary or secondary pre-service or in-service teachers and thus excluded instruments that had only been used to measure the teaching quality of university mathematics faculty. Throughout the review process, our team took an inclusive approach—if there was uncertainty about whether an instrument measured teacher behavior, it was initially included, and the full team later reviewed and reached a consensus on its inclusion. This abstract review resulted in 717 articles.
Next, we conducted a full-text review of each identified article to exclude those that did not contain a quantitative instrument measuring teacher behavior or affect. For each relevant study, we recorded details in a shared spreadsheet, including the instrument’s name (or author and date if unnamed), characteristics of the studied population (e.g., grade level, pre-service, or in-service teachers), a description of the measured construct, and the type of instrument (e.g., interview, observation, questionnaire). If a study used multiple relevant instruments, each was logged separately. This process resulted in the identification of 252 unique instruments.
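As an illustration of this logging step, the following is a minimal sketch of a per-instrument record built from the fields named above; the field names and example values are ours, not taken from the actual spreadsheet.

# A sketch of one row in the shared spreadsheet of instruments.
from dataclasses import dataclass, field

@dataclass
class InstrumentRecord:
    name: str                    # instrument name, or author and date if unnamed
    population: str              # e.g., grade band, pre-service or in-service status
    construct: str               # description of the measured construct
    instrument_type: str         # e.g., interview, observation, questionnaire
    source_articles: list = field(default_factory=list)  # citations/DOIs for the studies

record = InstrumentRecord(
    name="Example Observation Protocol",
    population="in-service secondary mathematics teachers",
    construct="enactment of discourse-focused teaching practices",
    instrument_type="observation",
    source_articles=["doi:10.0000/placeholder"],  # hypothetical placeholder, not a real DOI
)
print(record)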

3.1.2. Identifying IUAs and Claims

Our next step was to identify interpretation and use statements, claims, and the associated validity evidence. Explicit interpretation statements ascribe meaning to instrument scores; explicit use statements detail (a) the actions or decisions extending from the score interpretation, or (b) the purpose and conditions under which the instrument is to be administered. Where explicit IUAs were not reported, we looked for evidence of validity, reliability, and fairness. To gather this evidence, we conducted a secondary search for each instrument using Google Scholar, expanding beyond the initial 24 journals and 20-year timeframe. This broader search included peer-reviewed journal articles, conference proceedings, dissertations, and white papers, acknowledging that validation studies are often not published in mathematics education journals.
Our team followed a structured approach: (a) locating the original study that utilized the instrument, and (b) searching for the terms “valid*” and “reliab*” within the set of articles citing the original study. Additionally, we employed berrypicking techniques (Bates, 1989), such as footnote chasing (following citations within studies) and citation searching (examining studies that cited the original work), to locate relevant validity evidence.
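A minimal sketch of the screening logic in step (b), assuming the citing documents are available locally as plain text (the actual searches were conducted through Google Scholar):

import re

# Flag documents whose text matches the truncation terms "valid*" or "reliab*".
EVIDENCE_PATTERN = re.compile(r"\b(valid|reliab)\w*", re.IGNORECASE)

def cites_evidence_terms(text: str) -> bool:
    return EVIDENCE_PATTERN.search(text) is not None

print(cites_evidence_terms("We examined the reliability of the observation scores."))  # True
print(cites_evidence_terms("Students solved fraction tasks in small groups."))         # False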
When a study or report provided validity or reliability evidence, we cataloged the source (e.g., test content validity evidence), the type of evidence (e.g., expert evaluation), and study details (e.g., citation, DOI). A guidebook (Bentley et al., 2024) was used to operationally define different types of validity evidence. Our aim was to document any reported evidence rather than assess its quality or the framework used for validity reporting. For example, while some studies explicitly named validity types (e.g., content validity), others did not. However, if an author described a process in which experts evaluated an instrument’s alignment with a construct, we categorized that as test content validity evidence based on expert review.
To maintain coding consistency and prevent rater drift, instruments were assigned to researchers in pairs, with pairs rotating midway through the process. Additionally, whole-group meetings provided opportunities to check for consistency, discuss coding challenges, and ensure a shared understanding of what qualified as validity evidence. Once this process was complete, we excluded instruments with no evidence of validity or reliability. We also excluded 16 instruments that either measured something other than primary/secondary teachers’ affect or behavior (n = 8) or whose measured construct was unclear (n = 8).
We examined the remaining instruments (n = 163) for fairness evidence. To determine if an instrument explicitly addressed the fairness standards (AERA et al., 2014), we followed a systematic process: (1) reviewed all articles associated with the instrument, (2) searched for the term “fair” to identify direct references to fairness standards, and (3) examined the instrument development section to assess whether fairness considerations were included. If validation procedures related to fairness standards were present, we recorded the corresponding fairness cluster(s) (i.e., 1, 2, 3, or 4; see Table 2) in our spreadsheet; if none were addressed, we marked it as 0. Throughout the process, we made qualitative notes and flagged instruments for discussion during our regular team meetings.
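To illustrate the recording scheme, the following is a small sketch of the fairness-coding step, under the assumption that cluster judgments are made by a human coder and merely logged by the script; names and labels are ours.

import re

# Cluster labels paraphrased from the Standards discussion above (see Table 2).
FAIRNESS_CLUSTERS = {
    1: "fairness in design, development, administration, and scoring",
    2: "fairness in score interpretations for intended uses",
    3: "accommodations to remove construct-irrelevant variance",
    4: "safeguards against inappropriate interpretations and uses",
}

def flags_fairness(article_text: str) -> bool:
    """Step 2: flag articles containing the stem 'fair' for closer human review."""
    return re.search(r"\bfair\w*", article_text, flags=re.IGNORECASE) is not None

def record_fairness(instrument: str, clusters_found: set) -> dict:
    """Step 3: record the cluster(s) a coder judged to be addressed, or 0 if none."""
    assert all(c in FAIRNESS_CLUSTERS for c in clusters_found)
    return {"instrument": instrument,
            "fairness_code": sorted(clusters_found) or 0}

print(flags_fairness("We examined fairness for English learners."))   # True
print(record_fairness("Example Observation Protocol", {1, 4}))        # clusters 1 and 4
print(record_fairness("Another Questionnaire", set()))                # coded as 0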

3.1.3. Categorizing the Constructs Measured

Once we had a database of each instrument with its associated reliability, validity, and fairness evidence, we went back through to better understand what was being measured. We coded the construct(s) of each measure broadly into buckets of teachers’ affect (e.g., feelings, beliefs, emotions) and behaviors (e.g., practices). We examined the methods section of the primary article associated with each instrument to see how the authors described the construct used in that data source. We began with all researchers coding the constructs on 40 instruments into these broad buckets. We met to resolve disagreements, refine our understanding of the buckets, and then continued coding some instruments together and others individually. We coded some together to ensure there was no drift. For instance, there were instruments that claimed to measure teaching practice but utilized questionnaires administered to teachers. We opted not to include those instruments as measures of teacher behavior because of the large body of evidence that teachers’ self-reports of their practice are more closely aligned with their beliefs than their actual teaching practice (Mayer, 1999; Mu et al., 2022). Once we had completed this step, we then filtered to focus on instruments measuring mathematics teacher behavior (n = 47).
To determine which aspects of mathematics teacher behavior had been measured, we examined the constructs and the instrument type. We noted a significant number of instruments focused on measuring teaching practice, many of which drew on observations of teaching or questionnaires administered to students. We categorized these instruments as focused on teacher behavior, with an emphasis on their enactment, because of their focus on collecting information on teachers’ instructional practice as it was being enacted in the classroom with students. We also encountered several other instruments that measured teachers’ behavior but did so in a simulated setting, rather than in the context of enacted teaching practice. Therefore, we created a second category of teacher behavior: approximation of practice. Instruments that fell into this category included instruments that asked pre-service teachers to enact a teaching practice (e.g., posing tasks) outside of an authentic situation, or questionnaires (e.g., with video prompts) that asked teachers what they noticed about classroom activity.

4. Results

From our Boolean search resulting in 2286 articles across the 24 mathematics education journals (2000–2020), we identified 47 instruments measuring teacher behavior that have some validity, reliability, or fairness evidence. From these 47 instruments, we report our findings with regard to each of our research questions below (see Table S1 for a searchable table of all 47 instruments).

4.1. Approaches Used to Measure Mathematics Teaching

We found two primary approaches to measuring mathematics teaching among the 47 instruments measuring mathematics teacher behavior with some form of reported validity, reliability, or fairness evidence. The first, and certainly the more prominent of the two, was a focus on the enactment of teaching practices (n = 37 instruments), meant to capture practices enacted by teachers during instruction. The second approach focused on approximation of teaching practices (n = 10 instruments). Approximations were typically simulations in which teachers performed behaviors they would carry out on the job, but in a setting of reduced complexity. In the following sections, we describe the instruments we found with respect to each approach.

4.1.1. Measuring Mathematics Teachers’ Enactment of Teaching Practice

The 37 instruments categorized as measuring mathematics teachers’ enactment of teaching practice fell into two instrument types: classroom observations (n = 31) and student questionnaires (n = 6). The classroom observation instruments measured a range of different constructs, including teaching or instructional quality (e.g., Instructional Quality Assessment (IQA; M. Boston, 2012), Classroom Assessment Scoring System (CLASS; Pianta et al., 2012)) and standards-based or inquiry-based teaching (e.g., M-Scan (Walkowiak et al., 2014) and Reformed Teaching Observation Protocol (RTOP; Sawada & Piburn, 2000)). Some of the classroom observation instruments were mathematics-specific (e.g., Mathematical Quality of Instruction (MQI; Hill et al., 2012)) and others were designed to be used beyond mathematics classrooms (e.g., CLASS) but had some available validity evidence for use in mathematics classrooms. The student questionnaire instruments measured a range of constructs including the learning environment (e.g., Constructivist Learning Environment Survey (CLES; Lomas, 2009)), teacher characteristics or strengths (e.g., Students’ Perceptions of Teachers Successes (SPoTS; T. Stevens et al., 2013)), and interactions between teachers and students (e.g., Questionnaire on Teacher Interactions (QTI; Lindorff & Sammons, 2018)). Interestingly, these instruments tended to be more generally focused, rather than focused on teaching practices specific to mathematics.

4.1.2. Measuring Mathematics Teachers’ Approximation of Teaching Practice

The 10 instruments categorized as measuring mathematics teachers’ approximation of teaching practice fell into two instrument types: teacher questionnaires (n = 9) and teacher interviews (n = 1). All but one of the measures of teachers’ approximations of teaching practice were related to one or more dimensions often associated with teacher noticing: attending, interpreting, and deciding how to respond (Jacobs & Spangler, 2017). Teacher noticing is what teachers focus their attention on in the classroom, how they make sense of it, and how they respond, and cannot be observed directly. Most of these teacher noticing instruments were teacher questionnaires; however, one involved interviews with pre-service teachers, measuring their noticing around children’s early numeracy (Schack et al., 2013). While we consider teacher noticing a teaching practice that teachers engage in while they are teaching, it is very difficult to measure it during enactment. Therefore, it makes sense that instruments focused on this teaching practice involve approximations of practice. The one other instrument measuring mathematics teachers’ approximations of teaching practice was focused on pre-service teachers’ task-posing in the context of cycles of writing letters with algebra students (Norton & Rutledge, 2006). Although teachers pose tasks in the context of teaching, we consider it an approximation of teaching for this instrument because the behavior is simulated outside of a formal teaching context.

4.2. Components of an Integrated IUA Framework

We judged an instrument to have an integrated IUA framework if at least one piece of the supporting literature for that instrument included explicit interpretation and use statements as well as specific claims and related evidence. We found only two instruments that presented an integrated IUA framework (see Table 3).
Although we found only 2 instruments that presented integrated IUAs, we found 13 instruments (inclusive of the 2 with integrated IUAs; see Table 4) that had components of an IUA. We found that 2 instruments had both explicit interpretation and use statements but no claims; 3 had explicit interpretation but not use statements, of which 1 included at least one claim; and 6 had explicit use but not interpretation statements, of which 5 also had claims. The majority of the instruments of mathematics teaching quality that we found (n = 35) did not have any components of an IUA framework, and only 2 presented an integrated IUA (i.e., M-Scan and Revised SMPs Look-for Protocol).

4.3. Types of Validity, Reliability, and Fairness Evidence Found

We found evidence of validity, reliability, and fairness, as well as interesting patterns of these evidence types across instrument types (see Table 5).
Reliability evidence was the type of evidence most frequently provided by researchers (n = 41). The reliability evidence provided most often was internal consistency, specifically Cronbach’s alpha (n = 24), and inter-rater reliability (n = 24), reported as either kappa or percent agreement. Additionally, reliability evidence was consistently provided across instrument types.
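For readers less familiar with these indices, the sketch below computes Cronbach’s alpha, Cohen’s kappa, and percent agreement on invented toy data; it illustrates the statistics named above rather than any particular instrument’s analysis.

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal consistency; items is a respondents x items matrix of scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def cohen_kappa(r1: np.ndarray, r2: np.ndarray) -> float:
    """Chance-corrected agreement between two raters' categorical ratings."""
    categories = np.union1d(r1, r2)
    p_o = np.mean(r1 == r2)                           # observed agreement
    p_e = sum(np.mean(r1 == c) * np.mean(r2 == c)     # chance agreement from marginals
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Toy data: 5 students x 4 questionnaire items, and two raters scoring 8 lessons.
survey = np.array([[3, 4, 3, 4], [2, 2, 3, 2], [4, 4, 4, 5],
                   [1, 2, 1, 2], [3, 3, 4, 3]])
rater_a = np.array([1, 2, 2, 3, 1, 2, 3, 3])
rater_b = np.array([1, 2, 3, 3, 1, 2, 3, 2])
print(round(cronbach_alpha(survey), 2))         # internal consistency
print(round(cohen_kappa(rater_a, rater_b), 2))  # inter-rater agreement (kappa)
print(round(np.mean(rater_a == rater_b), 2))    # percent agreement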
After reliability, the next most commonly presented evidence types were test content, internal structure, and relations to other variables. Evidence for test content typically came from expert review (n = 16), alignment to a framework (n = 15), or a literature review (n = 10). With regard to internal structure, all instruments with this type of evidence described either factor analysis (any type; n = 19) or item response theory (IRT; n = 2). Evidence of relations to other variables was described for many instruments (n = 22), with about half of these providing more than one analysis of relations to other variables (n = 10). The most commonly reported evidence of this type examined correlations with related variables (n = 13).
Although we found evidence of all types of validity, reliability, and fairness across the full set of instruments, evidence related to response processes (n = 9), consequences of testing (n = 6), and fairness (n = 8) was much less common than the other types. Evidence related to response processes was typically presented either with a description of some kind of rater training (n = 5) or cognitive interviews (n = 2). Evidence of consequences of testing was presented in all cases through explicit intended uses and interpretations and warnings against inappropriate uses (n = 6). Of the eight measures that included evidence related to fairness, most provided evidence of clusters 4 (n = 6; see Table 2) and 1 (n = 5), and a few provided evidence of clusters 2 (n = 3) and 3 (n = 3). The only instrument that provided evidence related to all four clusters was the International System for Teacher Observation and Feedback (ISTOF; Muijs et al., 2018).

5. Discussion

5.1. Summarizing How Mathematics Teaching Quality Has Been Measured

As one would expect, mathematics teaching quality has typically been measured by examining enactment of teaching practice in the classroom. Of the 47 instruments identified in this study, 37 measure enactment of teaching practice, whereas only 10 measure approximation of teaching practice. Nine of the instruments measuring approximations focused on an aspect of teacher noticing; this is not surprising given that the linked actions of attending to students’ ideas, interpreting, and then deciding are not all directly observable in the context of enactment (Jacobs & Spangler, 2017). Although less common in our data, it is noteworthy that six instruments measure enactment through student questionnaires. Elevating students’ voices in this way seems important when measuring teaching quality, provided researchers remain attentive to the claims made from this type of data. In particular, a number of recent studies highlight the value of using both observational instruments and student questionnaires to triangulate teaching quality (e.g., van der Lans, 2018).
Among the 37 instruments measuring enactment of teaching practices, we found instruments that are specific to mathematics (e.g., IQA; M-Scan) and instruments that are generic and can be used across disciplines (e.g., CLASS). As scholars have pointed out, the range of instruments bring value in the scope of teacher behaviors that can be captured (Charalambous & Praetorius, 2018). At the same time, with a range of available instruments from content-specific to hybrid (i.e., including content-specific and generic indicators) to generic, tensions arise such as differences in terminology, operationalization of teaching quality, and goals (Brunner & Star, 2024; Charalambous et al., 2021). This can present challenges for researchers as they search for available instruments to potentially use in their own studies. To address these challenges, scholars have pushed for the need to develop common instruments for conceptualizing and measuring teaching quality (Klette & Blikstad-Balas, 2018; Praetorius & Charalambous, 2018). Charalambous and Praetorius (2020) generated a model of teaching quality (i.e., the MAIN-TEACH model) by synthesizing across 11 observational instruments to identify broad dimensions. Although this model holds promise in informing instruments that measure teaching quality, the development of common instruments could come at the expense of attending to particular instructional dimensions, especially aspects that are critically important for historically marginalized students (Litke et al., 2021).
Given that we found instruments that measure enactment and instruments that measure approximation of teaching practice, it is important to draw attention to the distinction between the two types. As previously stated, it makes sense to use approximations of practices that may not be directly observable in the mathematics classroom (e.g., noticing). Are there other practices or types of practices that might lend themselves to being measured with approximations? As a field of scholars focused on teaching quality, we need to learn more about the affordances and constraints of using instruments that focus on enactment versus approximation to measure particular constructs. In sum, our findings provide information about how mathematics teaching quality has been measured, and they point to important next steps for examining overlaps, distinctions, affordances, and constraints across instruments and instrument types, thereby informing instrument development and validation practices.

Implications for Future Research on Mathematics Teaching Quality

Although a large majority of the instruments we found measure enactment of teaching practice through either classroom observation (n = 31) or student questionnaires (n = 6), there is still a need to better understand the constructs measured by these instruments. Future research should consider a comparison of how the instruments operationalize enactment of various teaching practices (e.g., facilitating discourse). While researchers have compared classroom observational instruments in the past (J. Bostic et al., 2021; M. Boston et al., 2015; Praetorius & Charalambous, 2018), a more comprehensive review is warranted to identify constructs that may be overrepresented, underrepresented, or missing within existing instruments. In particular, identifying how these measures capture modern conceptualizations of teaching quality that include equitable practices would be informative for the field. We have reviewed which of these measures capture equity (Wilhelm et al., 2024), but have not yet delved more deeply into the different operationalizations of mathematics teaching quality. Another logical next step for researchers would be to carefully compare and contrast existing instruments that measure the enactment of teaching practice, such as the 37 instruments in this study, to identify exactly which practices are being measured. This analysis could identify overlaps and distinctions among instruments, as well as contribute toward a shared understanding among scholars about terminology and goals in the study of teaching quality.

5.2. Summarizing Current Validation Practices

Some aspects of instrument development and validation were more commonly identified in practice. Evidence of reliability was reported for the vast majority of measures of teacher behavior in our study (41 of 47 instruments). Additionally, evidence of test content, internal structure, and/or relationships to other variables was often identifiable. Current validation practices, however, often present evidence without explicitly explaining how it supports the validity of the intended score interpretations and uses. For example, our study found that factor analysis is somewhat commonly conducted for newly developed measurement instruments (19 of 47 instruments), but it is less common for instrument developers to explicitly describe how the statistical results warrant claims about the instrument conforming to a construct of interest (i.e., internal structure validity evidence). Furthermore, rarely do instrument developers articulate the intended meaning of the instrument scores (i.e., interpretation), and how that information—or the test itself—ought to be used (Folger et al., 2023). Integrated approaches to validation, which link evidence of validity, reliability, and fairness to the claims underlying test-score interpretation and use, remain rare in practice.
Overall, findings from the current study corroborate the assertion that instrument development and validation practices are often less than ideal, at least within the field of mathematics education. Very few measures have been robustly developed and evaluated in alignment with the guidelines and criteria outlined in the Standards for Educational and Psychological Testing (AERA et al., 2014). Although the Standards were developed by organizations based in the United States—the American Educational Research Association, the American Psychological Association, and the National Council of Measurement in Education—these organizations possess a global influence and are internationally respected (Sireci, 2016; Zumbo, 2014). Moreover, the guidelines set forth by the Standards, as well as best practices in validation described in other literature (e.g., Cizek, 2016; M. Kane, 2013; Sireci & Benítez, 2023), do not contain novel advances to instrument development and validation. Scholars have long advocated for argumentative approaches to validation, with validity attributed to how test scores are interpreted and subsequently used (e.g., Cronbach, 1988; M. T. Kane, 1992; Messick, 1989; Shepard, 1993). Loevinger (1957) first argued that “since predictive, concurrent, and content validities are all essentially ad hoc, construct validity is the whole of validity from a scientific point of view” (p. 636). Yet, our study found that scholars continue to present evidence of validity in isolation, at times referencing distinct types of validity (e.g., content validity), rather than integrating such evidence into a cohesive validity argument (AERA et al., 2014; M. Kane, 2013; Sireci & Benítez, 2023). The following sections further detail aspects of validity and validation that are currently lacking in practice, as well as considerations for instrument developers who aim to develop robust measures of teaching quality.

5.2.1. Response Processes

Evidence of validity based on response processes was not commonly found. Of the 47 instruments in this study, only 9 instruments have evidence of validity based on response processes. For five of those instruments, the response process evidence was rater training. First, we find it concerning that so few instruments have evidence attending to raters’ interpretations of instrument language. Second, we consider rater training a low bar for evidence of response processes because training should be an automatic component of instrument implementation. We contend that elevating raters’ voices in the accumulation of validity evidence should extend beyond just providing training on how to use the instrument. Using cognitive interviews to examine rater cognition is one approach to evaluating whether raters interpret rubrics/language and apply scores as intended (Willis, 2005). The process of rater training (from initial training to the full application of scoring) is incredibly complex, particularly in the context of measuring complex behaviors such as teaching practice (Bell et al., 2014). Cognitive interviews shed light on multiple factors: processes used by raters in their scoring (e.g., note taking during an observation); comprehension of terms and phrases; mapping observed behaviors to the instrument’s indicators; and distinguishing between levels on a scoring rubric (Groves et al., 2009). Although cognitive interviews have rarely been used, as evidenced in this study, researchers have demonstrated their potential in the instrument development process, especially when utilized iteratively to inform improved versions of an instrument (Walkowiak et al., 2022). We urge instrument developers to elevate the voices of raters through the use of cognitive interviews to evaluate validity evidence and to gain raters’ insights on instrument improvement (e.g., training, rubrics, application).

5.2.2. Consequential and Fairness Considerations

Consequential considerations are rarely made explicit in the literature on instrument development and validation. In our study, only 6 of 47 instruments provided validity evidence based on consequences of testing (AERA et al., 2014). Each of those six instruments described the intended score interpretation and warned against inappropriate uses of the measure. In other words, the consequential evidence presented by these instrument developers focused primarily on preventing misuse of their instrument or its results, which is related to the Standards’ guidelines on evidence of test fairness, cluster 4 (AERA et al., 2014). Our study found no evidence of instrument developers explicitly considering the consequences of legitimate score interpretation and use. Zumbo and Hubley (2016) asserted that “the mere act of measuring or testing engenders consequences” (p. 299). That is, consequential considerations of validity are relevant to both the score interpretation and score/instrument use (Sireci, 2016; Zumbo & Hubley, 2016). Scholars generally agree that researchers do not engage in measurement for the sole act of measuring. There is a purpose or a goal they hope to accomplish through measuring, and there is a fine line between the interpretation of scores and the decisions that follow (Folger et al., 2023; Sireci, 2016). Validation should include collecting evidence of the likelihood that the benefits of measurement will be realized (AERA et al., 2014).
Fairness in instrument development is so crucial that the Standards (AERA et al., 2014) elevated it to its own chapter, as compared to the previous edition of the Standards (AERA et al., 1999), and fairness will have an increased focus in the forthcoming edition of the Standards (Tong et al., 2024). The crux of fairness evidence is to “promote valid and fair score interpretations for the intended uses for all examinees in the intended population by minimizing extraneous factors that distort the meaning of scores” (Jonson & Geisinger, 2022a, p. 1). This suggests that instruments without fairness evidence may distort the meaning of scores for subpopulations. We found only eight instruments with any considerations of fairness, and these tended to be minor. For instance, many of the studies we found that addressed fairness did so by including raters who were mathematics teachers. This partially addresses cluster 1 by including relevant subgroups (i.e., mathematics teachers) in reliability studies. However, this is not sufficient to ensure fair score interpretations for all users, and it is unclear whether the developers engaged in this practice purposefully to address fairness. Overall, our findings echo those of measurement experts across education and psychology (Jonson & Geisinger, 2022b) in suggesting that mathematics education instrument developers need to be more intentional in attending to issues related to fairness in assessment.

5.2.3. Implications for Instrument Developers

Our findings suggest many implications for instrument developers, but most central is the need for the field to better understand and apply IUAs in measure design (Carney et al., 2022). Beginning the instrument development process by framing the intended interpretation and use of instrument scores can help researchers to intentionally plan to collect evidence of validity, reliability, and fairness related to the intended interpretation of scores. We urge all instrument developers to consider the fairness standards (AERA et al., 2014) as they plan their development studies. Considering and studying the impact on all relevant subgroups, especially those that are likely to be adversely impacted by traditional administrations of an instrument or interpretations of scores, is essential for not perpetuating inequitable systems.
Even if measure developers did not begin their work from an IUA, it is important to recognize that validity arguments are not static but evolve over time as instruments are used in new contexts and with diverse populations. For example, while the MCOP2 originally reported evidence based on expert review, a coherent revision process, and factor analyses of internal structure (Gleason et al., 2017), ongoing use of the instrument has led to new insights about how it functions in different educational settings extending the use from in-service teachers to pre-service teacher candidates with appropriate score interpretations (Zelkowski et al., 2024). As developers and users continue to engage with instruments like the MCOP2 for measuring teacher practice, interpretation and use statements should be revisited and refined based on emerging validity evidence and fairness regarding their use. This iterative process reflects the dynamic nature of validation as envisioned in the Standards (AERA et al., 2014) and supports the notion that interpretations and uses should remain flexible and responsive to empirical findings. Treating validity as a continuous argument underscores the collective responsibility of the research community to engage in long-term validation efforts, especially as measures of teaching quality are scaled or adapted for policy or professional development use.

5.2.4. Implications for Researchers

Researchers looking for an instrument to use in a study need to carefully seek out and weigh the validity arguments and the validity, reliability, and fairness evidence for that instrument. If the evidence is insufficient, they may reconsider using the instrument, or they may first gather that evidence before using it in the new study they are planning.

5.3. Limitations

It is important to recognize the inherent limitations of this analysis. The first concerns the scope of our instrument review, which is not comprehensive for the field of mathematics teacher education. Our methodology, constrained to a 20-year period and 24 specific mathematics education journals, excluded instruments found elsewhere. A prominent example is the TRU-Math Instrument, a quantitative observational tool used in studies published in non-mathematics education journals, such as Teaching and Teacher Education (e.g., Kim et al., 2022), which our sampling did not capture. For a more exhaustive and evolving resource, we direct readers to the Validity Evidence for Measurement in Mathematics Education instrument repository (Krupa et al., 2024), which is designed to be continually updated as researchers identify and contribute missing instruments or evidence.
The second concerns our process for gathering validity, reliability, and fairness evidence, which, although broadened by a Google Scholar search to include grey literature, was imperfect. Access to some sources was prevented by language barriers or broken web links. Given the field’s lack of centralized systems for sharing validation data and the limited number of journals that publish such studies, it is reasonable to assume that our analysis did not capture all existing validity, reliability, or fairness evidence.

6. Conclusions

The previous sections summarize how mathematics teaching quality has been measured, but the robustness of that measurement remains questionable. Measurement was broadly conceptualized by our team for this systematic review, and we did not aim to evaluate the quality of an instrument or of an instrument’s development and validation. Nor did we evaluate whether an instrument’s level of measurement should be categorized as ordinal or interval, for example, or whether the statistical techniques used by instrument developers were appropriate for the corresponding level of measurement (e.g., S. S. Stevens, 1946). Measuring teaching quality is an incredibly complex endeavor: a conjunction of factors such as teacher and student actions and interactions, the resources available to them, and even parental and community support of education. As described by Wright and Stone (2004), “Until we invent a way to make separate estimates of the forces involved, any conclusion we might wish to reach with respect to the meaning of the observation is confounded by this conjunction” (p. 5). This raises a larger question for future research: How is measurement conceptualized with respect to teaching quality?
The systematic review conducted by our research team examined literature published between 2000 and 2020. We hope that instrument developers have since begun adopting more robust procedures for measure development and validation, such that recent and forthcoming validation research is aligned with modern approaches. It is critical to recognize that poorly designed measurement instruments, and even psychometrically sound instruments used in inappropriate ways, have historically contributed to the perpetuation of systemic inequities (Cronbach, 1988; Shepard, 2016; Tong et al., 2024). This is one reason why validity is (a) the most fundamental consideration in evaluating measurement instruments, and (b) defined as an attribute of test-score interpretation for proposed test uses (AERA et al., 2014). Measures of teaching quality are used for a variety of purposes, including but not limited to research, evaluation, accountability, teacher licensure, and public policy (AERA et al., 2014; Hill & Shih, 2009; Sireci, 2016; Zumbo & Hubley, 2016). The intended use determines the types of evidence, whether from different sources of validity evidence, reliability evidence, and/or fairness evidence, needed to justify instrument use (AERA et al., 2014; M. Kane, 2013; Sireci & Benítez, 2023). Those of us who engage in quantitative measurement, whether as instrument developers or users, share a collective responsibility to promote valid and fair interpretations of instrument scores and to ensure appropriate uses of measurement instruments.

Supplementary Materials

Author Contributions

Conceptualization, All; Methodology, All; Validation, All; Formal Analysis, All; Investigation, All; Data Curation, T.D.F.; Writing—Original Draft Preparation, All; Writing—Review & Editing, All; Visualization, M.A.G. and T.D.F.; Supervision, M.A.G.; Project Administration, T.D.F. All authors have read and agreed to the published version of the manuscript.

Funding

This material is based on work supported by the National Science Foundation (#1920621; #1920619). Any opinions, findings, or conclusions expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The full repository of mathematics education instruments can be found at https://mathedmeasures.org/, accessed on 4 June 2025.

Acknowledgments

We want to acknowledge the important contributions of other members of our team including Jonathan Bostic, Emanuele Bardelli, and Adrian Neely. Earlier versions of this analysis were published in conference proceedings for the annual meeting of the Psychology of Mathematics Education-North America (2021; 2024) as well as shared at the American Educational Research Association annual meeting (2022; 2023; 2025) and the Association for Mathematics Teacher Educators annual meeting (2022; 2024). A different analysis of this dataset is under review (Wilhelm et al., 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DOI: Digital Object Identifier
IUA: Interpretation and Use Argument

References

* Indicates reference with validity, reliability, or fairness evidence of an instrument included in this review.
  1. *Adams, R., & Wu, M. (Eds.). (2002). PISA 2000 technical report. OECD. Available online: http://www.oecd.org/pisa/pisaproducts/33688233.pdf (accessed on 4 June 2025).
  2. *Ader, E. (2019). What would you demand beyond mathematics? Teachers’ promotion of students’ self-regulated learning and metacognition. ZDM, 51(4), 613–624. [Google Scholar]
  3. Amador, J. M., Bragelman, J., & Superfine, A. C. (2021). Prospective teachers’ noticing: A literature review of methodological approaches to support and analyze noticing. Teaching and Teacher Education, 99, 103256. [Google Scholar] [CrossRef]
  4. American Educational Research Association, American Psychological Association & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. American Educational Research Association. [Google Scholar]
  5. American Educational Research Association, American Psychological Association & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association. [Google Scholar]
  6. American Psychological Association. (1954). Technical recommendations for psychological tests and diagnostic techniques. American Psychological Association. [Google Scholar]
  7. Anastasi, A. (1986). Evolving concepts of test validation. Annual Review of Psychology, 37(1), 1–16. [Google Scholar] [CrossRef]
  8. *Andrews, P. (2007). Negotiating meaning in cross-national studies of mathematics teaching: Kissing frogs to find princes. Comparative Education, 43(4), 489–509. [Google Scholar]
  9. *Andrews, P. (2009). Comparative studies of mathematics teachers’ observable learning objectives: Validating low inference codes. Educational Studies in Mathematics, 71(2), 97–122. [Google Scholar] [CrossRef]
  10. *Appeldoorn, K. L. (2004). Developing and validating the Collaboratives for Excellence in Teacher Preparation (CETP) core evaluation classroom observation protocol (COP). University of Minnesota. [Google Scholar]
  11. Australian Association of Mathematics Teachers. (2006). Standards for excellence in teaching mathematics in Australian schools. AAMT. [Google Scholar]
  12. Bates, M. J. (1989). The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5), 407–424. [Google Scholar] [CrossRef]
  13. Bell, C. A., Qi, Y., Croft, A., Leusner, D., McCaffrey, D. F., Gitomer, D. H., & Pianta, R. (2014). Improving observational score quality: Challenges in observer thinking. In K. Kerr, R. Pianta, & T. Kane (Eds.), Designing teacher evaluation systems: New guidance from the measures of effective teaching project (pp. 50–97). Jossey-Bass. [Google Scholar]
  14. Bentley, B., Folger, T., Bostic, J., Krupa, E., Burkett, K., & Stokes, D. (2024). Evidence types guidebook. Validity Evidence for Measurement in Mathematics Education. Available online: https://www.mathedmeasures.org/training/ (accessed on 4 June 2025).
  15. *Berlin, R., & Cohen, J. (2018). Understanding instructional quality through a relational lens. ZDM, 50(3), 367–379. [Google Scholar] [CrossRef]
  16. Berliner, D. C. (2005). The near impossibility of testing for teacher quality. Journal of Teacher Education, 56(3), 205–213. [Google Scholar] [CrossRef]
  17. Bishop, J. P. (2021). Responsiveness and intellectual work: Features of mathematics classroom discourse related to student achievement. Journal of the Learning Sciences, 30(3), 466–508. [Google Scholar] [CrossRef]
  18. Borsboom, D., & Wijsen, L. D. (2016). Frankenstein’s validity monster: The value of keeping politics and science separated. Assessment in Education: Principles, Policy & Practice, 23(2), 281–283. [Google Scholar]
  19. *Bostic, J. D., Matney, G. T., & Sondergeld, T. A. (2019). A validation process for observation protocols: Using the Revised SMPs Look-for Protocol as a lens on teachers’ promotion of the standards. Investigations in Mathematics Learning, 11(1), 69–82. [Google Scholar] [CrossRef]
  20. Bostic, J., Lesseig, K., Sherman, M., & Boston, M. (2021). Classroom observation and mathematics education research. Journal of Mathematics Teacher Education, 24, 5–31. [Google Scholar] [CrossRef]
  21. Boston, M. (2012). Assessing instructional quality in mathematics. The Elementary School Journal, 113(1), 76–104. [Google Scholar] [CrossRef]
  22. Boston, M., Bostic, J., Lesseig, K., & Sherman, M. (2015). A comparison of mathematics classroom observation protocols. Mathematics Teacher Educator, 3(2), 154–175. [Google Scholar] [CrossRef]
  23. *Boston, M., & Wolf, M. K. (2006). Assessing Academic Rigor in Mathematics Instruction: The Development of the Instructional Quality Assessment Toolkit. CSE Technical Report 672. National Center for Research on Evaluation, Standards, and Student Testing (CRESST). [Google Scholar]
  24. *Boston, M. D., & Smith, M. S. (2011). A ‘task-centric approach’ to professional development: Enhancing and sustaining mathematics teachers’ ability to implement cognitively challenging mathematical tasks. ZDM, 43(6), 965–977. [Google Scholar] [CrossRef]
  25. *Bruckmaier, G., Krauss, S., Blum, W., & Leiss, D. (2016). Measuring mathematics teachers’ professional competence by using video clips (COACTIV video). ZDM, 48(1), 111–124. [Google Scholar] [CrossRef]
  26. Brunner, E., & Star, J. R. (2024). The quality of mathematics teaching from a mathematics educational perspective: What do we actually know and which questions are still open? ZDM, 56(5), 775–787. [Google Scholar] [CrossRef]
  27. *Carney, M. B., Bostic, J., Krupa, E., & Shih, J. (2022). Interpretation and use statements for instruments in mathematics education. Journal for Research in Mathematics Education, 53(4), 334–340. [Google Scholar] [CrossRef]
  28. Charalambous, C. Y., & Praetorius, A. K. (2018). Studying mathematics instruction through different lenses: Setting the ground for understanding instructional quality more comprehensively. ZDM, 50, 355–366. [Google Scholar] [CrossRef]
  29. Charalambous, C. Y., & Praetorius, A.-K. (2020). Creating a forum for researching teaching and its quality more synergistically. Studies in Educational Evaluation 67. [Google Scholar] [CrossRef]
  30. Charalambous, C. Y., Praetorius, A. K., Sammons, P., Walkowiak, T., Jentsch, A., & Kyriakides, L. (2021). Working more collaboratively to better understand teaching and its quality: Challenges faced and possible solutions. Studies in Educational Evaluation, 71, 101092. [Google Scholar] [CrossRef]
  31. Cizek, G. J. (2016). Validating test score meaning and defending test score use: Different aims, different methods. Assessment in Education: Principles, Policy & Practice, 23(2), 212–225. [Google Scholar]
  32. *Copur-Gencturk, Y. (2015). The effects of changes in mathematical knowledge on teaching: A longitudinal study of teachers’ knowledge and instruction. Journal for Research in Mathematics Education, 46(3), 280–330. [Google Scholar] [CrossRef]
  33. Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer, & H. Braun (Eds.), Test validity (pp. 3–17). Erlbaum. [Google Scholar]
  34. Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281. [Google Scholar] [CrossRef]
  35. Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (pp. 621–694). American Council on Education. [Google Scholar]
  36. Desimone, L. M., Hochberg, E. D., & Mcmaken, J. (2016). Teacher knowledge and instructional quality of beginning teachers: Growth and linkages. Teachers College Record, 118(5), 1–54. [Google Scholar] [CrossRef]
  37. Downer, J. T., Stuhlman, M., Schweig, J., Martínez, J. F., & Ruzek, E. (2015). Measuring effective teacher-student interactions from a student perspective: A multi-level analysis. The Journal of Early Adolescence, 35(5–6), 722–758. [Google Scholar] [CrossRef]
  38. *Dreher, A., & Kuntze, S. (2015). Teachers’ professional knowledge and noticing: The case of multiple representations in the mathematics classroom. Educational Studies in Mathematics, 88(1), 89–114. [Google Scholar] [CrossRef]
  39. *Dunekacke, S., Jenßen, L., Eilerts, K., & Blömeke, S. (2016). Epistemological beliefs of prospective preschool teachers and their relation to knowledge, perception, and planning abilities in the field of mathematics: A process model. ZDM, 48(1), 125–137. [Google Scholar] [CrossRef]
  40. *Eddy, C. M., Harrell, P., & Heitz, L. (2017). An observation protocol of short-cycle formative assessment in the mathematics classroom. Investigations in Mathematics Learning, 9(3), 130–147. [Google Scholar] [CrossRef]
  41. *Erickson, A., & Herbst, P. (2018). Will teachers create opportunities for discussion when teaching proof in a geometry classroom? International Journal of Science and Mathematics Education, 16(1), 167–181. [Google Scholar] [CrossRef]
  42. Eurydice. (2011). Mathematics education in Europe: Common challenges and national policies. Education, Audiovisual, and Culture Executive Agency. [Google Scholar]
  43. Fauth, B., Decristan, J., Rieser, S., Klieme, E., & Büttner, G. (2014). Student ratings of teaching quality in primary school: Dimensions and prediction of student outcomes. Learning and Instruction, 29, 1–9. [Google Scholar] [CrossRef]
  44. Feldlaufer, H., Midgley, C., & Eccles, J. (1988). Student, teacher, and observer perceptions of the classroom before and after the transition to junior high school. Journal of Early Adolescence, 8, 133–156. [Google Scholar] [CrossRef]
  45. Folger, T. D., Bostic, J., & Krupa, E. E. (2023). Defining test-score interpretation, use, and claims: Delphi study for the validity argument. Educational Measurement: Issues and Practice, 42(3), 22–38. [Google Scholar] [CrossRef]
  46. Gitomer, D. H., & Bell, C. A. (2013). Evaluating teaching and teachers. In APA handbook of testing and assessment in psychology, Vol. 3: Testing and assessment in school psychology and education (pp. 415–444). American Psychological Association. [Google Scholar]
  47. *Gleason, J., Livers, S. D., & Zelkowski, J. (2017). Mathematics classroom observation protocol for practices (MCOP2): Validity and reliability. Investigations in Mathematical Learning, 9(3), 111–129. [Google Scholar] [CrossRef]
  48. *Gningue, S. M., Peach, R., & Schroder, B. (2013). Developing effective mathematics teaching: Assessing content and pedagogical knowledge, student-centered teaching, and student engagement. The Mathematics Enthusiast, 10(3), 621–646. [Google Scholar] [CrossRef]
  49. *Gotwals, A. W., Philhower, J., Cisterna, D., & Bennett, S. (2015). Using video to examine formative assessment practices as measures of expertise for mathematics and science teachers. International Journal of Science and Mathematics Education, 13(2), 405–423. [Google Scholar] [CrossRef]
  50. Groves, R. M., Fowler, F. J., Coupter, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2009). Survey methodology (2nd ed.). Wiley & Sons. [Google Scholar]
  51. Herman, J., & Cook, L. (2022). Broadening the reach of the fairness standards. In J. L. Jonson, & K. F. Geisinger (Eds.), Fairness in educational and psychological testing: Examining theoretical, research, practice, and policy implications of the 2014 Standards (pp. 33–60). American Educational Research Association. [Google Scholar]
  52. Hill, H. C., Blunk, M. L., Charalambous, C. Y., Lewis, J. M., Phelps, G. C., Sleep, L., & Ball, D. L. (2008). Mathematical knowledge for teaching and the mathematical quality of instruction: An exploratory study. Cognition and instruction, 26(4), 430–511. [Google Scholar] [CrossRef]
  53. *Hill, H. C., Charalambous, C. Y., Blazar, D., McGinn, D., Kraft, M. A., Beisiegel, M., Humez, A., Litke, E., & Lynch, K. (2012b). Validating arguments for observational instruments: Attending to multiple sources of variation. Educational Assessment, 17(2–3), 88–106. [Google Scholar]
  54. *Hill, H. C., Charalambous, C. Y., & Kraft, M. A. (2012a). When rater reliability is not enough: Teacher observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56–64. [Google Scholar]
  55. Hill, H. C., & Shih, J. C. (2009). Research commentary: Examining the quality of statistical mathematics education research. Journal for Research in Mathematics Education, 40(3), 241–250. [Google Scholar] [CrossRef]
  56. *Hill, H. C., Umland, K., Litke, E., & Kapitula, L. R. (2012). Teacher quality and quality teaching: Examining the relationship of a teacher assessment to practice. American Journal of Education, 118(4), 489–519. [Google Scholar] [CrossRef]
  57. *Horizon Research Inc. (2000). Validity and reliability information for the LSC Classroom Observation Protocol. Available online: https://horizon-research.com/LocalSystemicChange/wp-content/uploads/2023/05/cop_validity_2000.pdf (accessed on 3 September 2025).
  58. *Jacobs, V. R., Lamb, L. L., & Philipp, R. A. (2010). Professional noticing of children’s mathematical thinking. Journal for Research in Mathematics Education, 41(2), 169–202. [Google Scholar] [CrossRef]
  59. Jacobs, V. R., & Spangler, D. A. (2017). Research on core practices in K-12 mathematics teaching. In Compendium for research in mathematics education (pp. 766–792). National Council of Teachers of Mathematics. [Google Scholar]
  60. Jonson, J. L., & Geisinger, K. F. (2022a). Conceptualizing and contextualizing fairness standards, issues, and solutions across professional fields in education and psychology. In J. L. Jonson, & K. F. Geisinger (Eds.), Fairness in educational and psychological testing: Examining theoretical, research, practice, and policy implications of the 2014 Standards (pp. 1–9). American Educational Research Association. [Google Scholar]
  61. Jonson, J. L., & Geisinger, K. F. (2022b). Looking forward: Cross-cutting themes for the future of fairness in testing. In J. L. Jonson, & K. F. Geisinger (Eds.), Fairness in educational and psychological testing: Examining theoretical, research, practice, and policy implications of the 2014 standards (pp. 399–416). American Educational Research Association. [Google Scholar]
  62. Kane, M. (2013). The argument-based approach to validation. School Psychology Review, 42(4), 448–457. [Google Scholar] [CrossRef]
  63. Kane, M. (2016). Validation strategies: Delineating and validating proposed interpretations and uses of test scores. In S. Lane, M. R. Raymond, & T. M. Haladyna (Eds.), Handbook of test development (2nd ed., pp. 64–80). Routledge. [Google Scholar]
  64. Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527. [Google Scholar] [CrossRef]
  65. Kane, T., Kerr, K., & Pianta, R. (2014). Designing teacher evaluation systems: New guidance from the measures of effective teaching project. John Wiley & Sons. [Google Scholar]
  66. Kim, J., Salloum, S., Lin, Q., & Hu, S. (2022). Ambitious instruction and student achievement: Evidence from early career teachers and the TRU math observation instrument. Teaching and Teacher Education, 117, 103779. [Google Scholar] [CrossRef]
  67. Klette, K., & Blikstad-Balas, M. (2018). Observation manuals as lenses to classroom teaching: Pitfalls and possibilities. European Educational Research Journal, 17(1), 129–146. [Google Scholar] [CrossRef]
  68. *König, J., & Kramer, C. (2016). Teacher professional knowledge and classroom management: On the relation of general pedagogical knowledge (GPK) and classroom management expertise (CME). ZDM, 48(1–2), 139–151. [Google Scholar] [CrossRef]
  69. Krippendorff, K. (2004). Content analysis: An introduction to its methodology (2nd ed.). Sage. [Google Scholar]
  70. Krupa, E. E., Bostic, J. D., Bentley, B., Folger, T., Burkett, K. E., & VM2ED community. (2024, May). Search. VM2ED Repository. Available online: https://mathedmeasures.org/ (accessed on 4 June 2025).
  71. Krupa, E. E., Bostic, J. D., & Shih, J. C. (2019). Validation in mathematics education: An introduction to quantitative measures of mathematical knowledge: Researching instruments and perspectives. In Quantitative measures of mathematical knowledge (pp. 1–13). Routledge. [Google Scholar]
  72. *Kunter, M., Tsai, Y. M., Klusmann, U., Brunner, M., Krauss, S., & Baumert, J. (2008). Students’ and mathematics teachers’ perceptions of teacher enthusiasm and instruction. Learning and Instruction, 18(5), 468–482. [Google Scholar] [CrossRef]
  73. *Kutnick, P., Fung, D. C., Mok, I., Leung, F. K., Li, J. C., Lee, B. P. Y., & Lai, V. K. (2017). Implementing effective group work for mathematical achievement in primary school classrooms in Hong Kong. International Journal of Science and Mathematics Education, 15(5), 957–978. [Google Scholar] [CrossRef]
  74. Lane, S. (2014). Validity evidence based on testing consequences. Psicothema, 26(1), 127–135. [Google Scholar] [CrossRef]
  75. *Lindorff, A., & Sammons, P. (2018). Going beyond structured observations: Looking at classroom practice through a mixed method lens. ZDM, 50(3), 521–534. [Google Scholar] [CrossRef]
  76. Litke, E., Boston, M., & Walkowiak, T. A. (2021). Affordances and constraints of mathematics-specific observation frameworks and general elements of teaching quality. Studies in Educational Evaluation, 68, 100956. [Google Scholar] [CrossRef]
  77. Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(3), 635–694. [Google Scholar] [CrossRef]
  78. *Lomas, G. (2009). Pre-service primary teachers’ perceptions of mathematics education lecturers’ practice: Identifying issues for curriculum development. Mathematics Teacher Education and Development, 11, 4–21. [Google Scholar]
  79. Lynch, K., Chin, M., & Blazar, D. (2017). Relationships between observations of elementary mathematics instruction and student achievement: Exploring variability across districts. American Journal of Education, 123(4), 615–646. [Google Scholar] [CrossRef]
  80. *Marshall, J. C., Smart, J., & Horton, R. M. (2010). The design and validation of EQUIP: An instrument to assess inquiry-based instruction. International Journal of Science and Mathematics Education, 8(2), 299–321. [Google Scholar] [CrossRef]
  81. *Martin, C., Polly, D., McGee, J., Wang, C., Lambert, R., & Pugalee, D. (2015). Exploring the relationship between questioning, enacted mathematical tasks, and mathematical discourse in elementary school mathematics. The Mathematics Educator, 24(2). [Google Scholar] [CrossRef]
  82. *Matsumura, L. C., Garnier, H. E., Slater, S. C., & Boston, M. D. (2008). Toward measuring instructional interactions “at-scale”. Educational Assessment, 13(4), 267–300. [Google Scholar] [CrossRef]
  83. *Matsumura, L. C., Slater, S. C., Junker, B., Peterson, M., Boston, M., Steele, M., & Resnick, L. (2006). Measuring Reading Comprehension and Mathematics Instruction in Urban Middle Schools: A Pilot Study of the Instructional Quality Assessment. CSE Technical Report 681. National Center for Research on Evaluation, Standards, and Student Testing (CRESST). [Google Scholar]
  84. Mayer, D. P. (1999). Measuring instructional practice: Can policymakers trust survey data? Educational Evaluation and Policy Analysis, 21(1), 29–45. [Google Scholar] [CrossRef]
  85. *McConney, M., & Perry, M. (2011). A change in questioning tactics: Prompting student autonomy. Investigations in Mathematics Learning, 3(3), 26–45. [Google Scholar] [CrossRef]
  86. *Melhuish, K., White, A., Sorto, M. A., & Thanheiser, E. (2021). Two replication studies of the relationships between mathematical knowledge for teaching, mathematical quality of instruction, and student achievement. Implementation and Replication Studies in Mathematics Education, 1(2), 155–189. [Google Scholar] [CrossRef]
  87. Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5–11. [Google Scholar] [CrossRef]
  88. Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741. [Google Scholar] [CrossRef]
  89. *Mikk, J., Krips, H., Säälik, Ü., & Kalk, K. (2016). Relationships between student perception of teacher-student relations and PISA results in mathematics and science. International Journal of Science and Mathematics Education, 14, 1437–1454. [Google Scholar]
  90. Mu, J., Bayrak, A., & Ufer, S. (2022). Conceptualizing and measuring instructional quality in mathematics education: A systematic literature review. Frontiers in Education, 7, 994739. [Google Scholar] [CrossRef]
  91. *Muijs, D., Reynolds, D., Sammons, P., Kyriakides, L., Creemers, B. P., & Teddlie, C. (2018). Assessing individual lessons using a generic teacher observation instrument: How useful is the International System for Teacher Observation and Feedback (ISTOF)? ZDM, 50(3), 395–406. [Google Scholar]
  92. National Council of Teachers of Mathematics. (2000). Principles and standards for school mathematics. NCTM. [Google Scholar]
  93. National Council of Teachers of Mathematics. (2014). Principles to actions: Ensuring mathematical success for all. NCTM. [Google Scholar]
  94. National Research Council. (2001). Adding it up: Helping children learn mathematics. National Academy Press. [Google Scholar]
  95. *Newton, K. J. (2009). Instructional practices related to prospective elementary school teachers’ motivation for fractions. Journal of Mathematics Teacher Education, 12(2), 89–109. [Google Scholar] [CrossRef]
  96. Newton, P. E., & Shaw, S. D. (2016). Disagreement over the best way to use the word ‘validity’ and options for reaching consensus. Assessment in Education: Principles, Policy & Practice, 23(2), 178–197. [Google Scholar]
  97. Nivens, R. A., & Otten, S. (2017). Assessing journal quality in mathematics education. Journal for Research in Mathematics Education, 48(4), 348–368. [Google Scholar] [CrossRef]
  98. *Norton, A., & Rutledge, Z. (2006). Measuring task posing cycles: Mathematical letter writing between algebra students and preservice teachers. Mathematics Educator, 19(2), 32–45. [Google Scholar] [CrossRef]
  99. *Nunnery, J. A., Ross, S. M., & Bol, L. (2008). The construct validity of teachers’ perceptions of change in schools implementing comprehensive school reform models. Journal of Educational Research & Policy Studies, 8(1), 67–91. [Google Scholar]
  100. Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational Evaluation and Policy Analysis, 26(3), 237–257. [Google Scholar] [CrossRef]
  101. Oliveri, M. E., Lawless, R., & Young, J. W. (2015). A validity framework for the use and development of exported assessments. Educational Testing Service. Available online: https://www.ets.org/pdfs/about/exported-assessments.pdf (accessed on 27 August 2025).
  102. Oren, C., Kennet-Cohen, T., Turvall, E., & Allalouf, A. (2014). Demonstrating the validity of three general scores of PET in predicting higher education achievement in Israel. Psicothema, 26(1), 117–126. [Google Scholar] [CrossRef]
  103. *Organisation for Economic Co-Operation and Development. (2012). PISA 2009 technical report. Organisation for Economic Co-Operation and Development. Available online: https://www.oecd.org/pisa/pisaproducts/PISA-2012-technical-report-final.pdf (accessed on 4 June 2025).
  104. Ottmar, E. R., Rimm-Kaufman, S. E., Larsen, R. A., & Berry, R. Q. (2015). Mathematical knowledge for teaching, standards-based mathematics teaching practices, and student achievement in the context of the responsive classroom approach. American Educational Research Journal, 52(4), 787–821. [Google Scholar] [CrossRef]
  105. Padilla, J. L., & Benitez, I. (2014). Validity evidence based on response processes. Psicothema, 26(1), 136–144. [Google Scholar] [CrossRef]
  106. Page, M. J., Moher, D., & McKenzie, J. E. (2022). Introduction to PRISMA 2020 and implications for research synthesis methodologists. Research Synthesis Methods, 13(2), 156–163. [Google Scholar] [CrossRef]
  107. *Pianta, R. C., Hamre, B. K., & Mintz, S. (2012). Classroom assessment scoring system upper elementary manual. Teachstone. [Google Scholar]
  108. *Piburn, M., Sawada, D., Turley, J., Falconer, K., Benford, R., Bloom, I., & Judson, E. (2000). Reformed teaching observation protocol (RTOP) reference manual. Arizona Collaborative for Excellence in the Preparation of Teachers. [Google Scholar]
  109. *Polly, D. (2016). Exploring the relationship between the use of technology with enacted tasks and questions in elementary school mathematics. International Journal for Technology in Mathematics Education, 23(3), 111–118. [Google Scholar] [CrossRef]
  110. Praetorius, A. K., & Charalambous, C. Y. (2018). Classroom observation frameworks for studying instructional quality: Looking back and looking forward. ZDM, 50, 535–553. [Google Scholar] [CrossRef]
  111. *Reinholz, D. L., & Shah, N. (2018). Equity analytics: A methodological approach for quantifying participation patterns in mathematics classroom discourse. Journal for Research in Mathematics Education, 49(2), 140–177. [Google Scholar] [CrossRef]
  112. Rios, J. A., & Wells, C. (2014). Validity evidence based on internal structure. Psicothema, 26(1), 108–116. [Google Scholar] [CrossRef] [PubMed]
  113. *Rubel, L. H., & Chu, H. (2012). Reinscribing urban: Teaching high school mathematics in low income, urban communities of color. Journal of Mathematics Teacher Education, 15(1), 39–52. [Google Scholar]
  114. *Santagata, R., & Stigler, J. W. (2000). Teaching mathematics: Italian lessons from a cross-cultural perspective. Mathematical Thinking and Learning, 2(3), 191–208. [Google Scholar] [CrossRef]
  115. *Santagata, R., Zannoni, C., & Stigler, J. W. (2007). The role of lesson analysis in pre-service teacher education: An empirical investigation of teacher learning from a virtual video-based field experience. Journal of Mathematics Teacher Education, 10(2), 123–140. [Google Scholar] [CrossRef]
  116. *Sawada, D., & Piburn, M. (2000). Reformed teaching observation protocol (RTOP) (ACEPT Technical Report No. IN00-1). Arizona Collaborative for Excellence in the Preparation of Teachers. [Google Scholar]
  117. *Sawada, D., Piburn, M. D., Judson, E., Turley, J., Falconer, K., Benford, R., & Bloom, I. (2002). Measuring reform practices in science and mathematics classrooms: The reformed teaching observation protocol. School Science and Mathematics, 102(6), 245–253. [Google Scholar] [CrossRef]
  118. *Schack, E. O., Fisher, M. H., Thomas, J. N., Eisenhardt, S., Tassell, J., & Yoder, M. (2013). Prospective elementary school teachers’ professional noticing of children’s early numeracy. Journal of Mathematics Teacher Education, 16(5), 379–397. [Google Scholar] [CrossRef]
  119. *Schlesinger, L., Jentsch, A., Kaiser, G., König, J., & Blömeke, S. (2018). Subject-specific characteristics of instructional quality in mathematics education. ZDM 50, 475–490. [Google Scholar] [CrossRef]
  120. Schoenfeld, A. H. (2020). Reframing teacher knowledge: A research and development agenda. ZDM, 52(2), 359–376. [Google Scholar] [CrossRef]
  121. Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19(1), 405–450. [Google Scholar] [CrossRef]
  122. Shepard, L. A. (2016). Evaluating test validity: Reprise and progress. Assessment in Education: Principles, Policy & Practice, 23(2), 268–280. [Google Scholar] [CrossRef]
  123. Shepard, L. A. (2018). Learning progressions as tools for assessment and learning. Applied Measurement in Education, 31(2), 165–174. [Google Scholar] [CrossRef]
  124. Sireci, S. G. (2016). On the validity of useless tests. Assessment in Education: Principles, Policy & Practice, 23(2), 226–235. [Google Scholar]
  125. Sireci, S. G., & Benítez, I. (2023). Evidence for test validation: A guide for practitioners. Psicothema, 35(3), 217–226. [Google Scholar] [CrossRef]
  126. Sireci, S. G., & Faulkner-Bond, M. (2014). Validity evidence based on test content. Psicothema, 26(1), 100–107. [Google Scholar] [CrossRef]
  127. Solano-Flores, G. (2022). Fairness in testing: Designing, using, and evaluating test accommodations for English learners. In J. L. Jonson, & K. F. Geisinger (Eds.), Fairness in educational and psychological testing: Examining theoretical, research, practice, and policy implications of the 2014 Standards (pp. 271–292). American Educational Research Association. [Google Scholar]
  128. *Spruce, R., & Bol, L. (2015). Teacher beliefs, knowledge, and practice of self-regulated learning. Metacognition and Learning, 10, 245–277. [Google Scholar] [CrossRef]
  129. Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680. [Google Scholar] [CrossRef]
  130. *Stevens, T., Harris, G., Liu, X., & Aguirre-Munoz, Z. (2013). Students’ ratings of teacher practices. International Journal of Mathematical Education in Science and Technology, 44(7), 984–995. [Google Scholar] [CrossRef]
  131. Thunder, K., & Berry, R. Q. (2016). Research commentary: The promise of qualitative metasynthesis for mathematics education. Journal for Research in Mathematics Education, 47(4), 318–337. [Google Scholar] [CrossRef]
  132. Thurstone, L. L. (1931). The reliability and validity of tests: Derivation and interpretation of fundamental formulae concerned with reliability and validity of tests and illustrative problems. Edwards Brothers. [Google Scholar]
  133. Thurstone, L. L. (1955). The criterion problem in personality research. Educational and Psychological Measurement, 15(4), 353–361. [Google Scholar] [CrossRef]
  134. Tong, Y., Pitoniak, M., Lipner, R., Ezzelle, C., Ho, A., & Huff, K. (2024, April 11–14). Reconsidering assessment fairness: Extending beyond the 2014 standards for educational and psychological testing [invited speaker session]. American Educational Research Association Annual Meeting, Philadelphia, PA, USA. [Google Scholar]
  135. *van de Grift, W. (2007). Quality of teaching in four European countries: A review of the literature and application of an assessment instrument. Educational Research, 49(2), 127–152. [Google Scholar] [CrossRef]
  136. van der Lans, R. M. (2018). On the “association between two things”: The case of student surveys and classroom observations of teaching quality. Educational Assessment, Evaluation and Accountability, 30, 347–366. [Google Scholar] [CrossRef]
  137. *Wainwright, C., Morrell, P. D., Flick, L., & Schepige, A. (2004). Observation of reform teaching in undergraduate level mathematics and science courses. School Science and Mathematics, 104(7), 322–335. [Google Scholar] [CrossRef]
  138. *Walkington, C., & Marder, M. (2018). Using the UTeach Observation Protocol (UTOP) to understand the quality of mathematics instruction. ZDM, 50(3), 507–519. [Google Scholar] [CrossRef]
  139. *Walkowiak, T. A., Berry, R. Q., Meyer, J. P., Rimm-Kaufman, S. E., & Ottmar, E. R. (2014). Introducing an observational measure of standards-based mathematics teaching practices: Evidence of validity and score reliability. Educational Studies in Mathematics, 85, 109–128. [Google Scholar] [CrossRef]
  140. *Walkowiak, T. A., Berry, R. Q., Pinter, H. H., & Jacobson, E. D. (2018). Utilizing the M-Scan to measure standards-based mathematics teaching practices: Affordances and limitations. ZDM, 50(3), 461–474. [Google Scholar] [CrossRef]
  141. Walkowiak, T. A., Wilson, J., Adams, E. L., & Wilhelm, A. G. (2022). Scoring with classroom observational rubrics: A longitudinal examination of raters’ responses and perspectives. In A. E. Lischka, E. B. Dyer, R. S. Jones, J. N. Lovett, J. Strayer, & S. Drown (Eds.), Proceedings of the 44th annual meeting of the north American chapter of the international group for the psychology of mathematics education (pp. 1869–1873). Middle Tennessee State University. [Google Scholar]
  142. Wilhelm, A. G., Folger, T. D., Gallagher, M. A., Walkowiak, T. A., & Zelkowski, J. (2024). Examining validation practices for measures of mathematics teacher affect and behavior. [Preprint].
  143. Williams, S. R., & Leatham, K. R. (2017). Journal quality in mathematics education. Journal for Research in Mathematics Education, 48(4), 369–396. [Google Scholar] [CrossRef]
  144. Willis, G. B. (2005). Cognitive interviewing: A tool for improving questionnaire design. Sage. [Google Scholar]
  145. Wright, B., & Stone, M. (2004). Making measures. The Phaneron Press. [Google Scholar]
  146. *Wubbels, T., Brekelmans, M., & Hooymayers, H. P. (1992). Do teacher ideals distort the self-reports of their interpersonal behavior? Teaching and Teacher Education, 8(1), 47–58. [Google Scholar] [CrossRef]
  147. *Wubbels, T., Cretan, H. A., & Hooymayers, H. P. (1985, March 31–April 4). Discipline problems of beginning teachers. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL, USA. [Google Scholar]
  148. *Yopp, D. A., Burroughs, E. A., Sutton, J. T., & Greenwood, M. C. (2019). Variations in coaching knowledge and practice that explain elementary and middle school mathematics teacher change. Journal of Mathematics Teacher Education, 22(1), 5–36. [Google Scholar]
  149. Zelkowski, J., Campbell, T. G., & Moldavan, A. M. (2024). The relationships between internal program measures and a high-stakes teacher licensing measure in mathematics teacher preparation: Program design considerations. Journal of Teacher Education, 75(1), 58–75. [Google Scholar] [CrossRef]
  150. Zieky, M. J. (2016). Developing fair tests. In S. Lane, M. R. Raymond, & T. M. Haladyna (Eds.), Handbook of test development (2nd ed., pp. 81–99). Routledge. [Google Scholar]
  151. Zumbo, B. D. (2014). What role does, and should, the test standards play outside of the United States of America? Educational Measurement: Issues & Practice, 33(4), 31. [Google Scholar]
  152. Zumbo, B. D., & Hubley, A. M. (2016). Bringing consequences and side effects of testing and assessment to the foreground. Assessment in Education: Principles, Policy & Practice, 23(2), 299–303. [Google Scholar]
Figure 1. PRISMA diagram of search steps. Note: adapted from the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement (Page et al., 2022), documenting our search and selection process. a Interrater agreement = 78.77%; b Interrater agreement = 83.49%.
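Percent interrater agreement, as reported in the notes to Figure 1, is simply the share of screening decisions on which two coders assigned the same code. The sketch below is a minimal illustration with invented codes; it does not use the actual screening data from this review.

```python
# Minimal sketch: percent agreement between two coders on screening decisions.
# The codes below are invented for illustration only.
import numpy as np

rater_1 = np.array(["include", "exclude", "include", "include", "exclude"])
rater_2 = np.array(["include", "exclude", "exclude", "include", "exclude"])

# Share of items on which the two raters assigned the same code.
percent_agreement = 100 * np.mean(rater_1 == rater_2)
print(f"Interrater agreement = {percent_agreement:.2f}%")  # 80.00%
```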
Table 1. Five Sources of Validity Evidence (AERA et al., 2014).
Source of Evidence | Description | Sample Methods for Collecting Evidence
Test Content | The wording, format, and construct alignment of individual items. | Recruiting subject matter experts to evaluate alignment between the test and the construct of interest (Sireci & Faulkner-Bond, 2014).
Response Process | Respondents’ interpretation and engagement with items. | Conducting cognitive interviews with test-takers to explore the degree to which respondents’ psychological processes and/or cognition align with test expectations (Padilla & Benitez, 2014).
Internal Structure | The degree to which items conform to the construct of interest. | Using statistical methods, such as factor analysis or item response theory, to assess test dimensionality (Rios & Wells, 2014).
Relations to Other Variables | Hypothesized relationships between instrument outcomes and some other variable(s). | Using statistical methods, such as multiple linear regression, to examine whether test scores predict a criterion outcome (Oren et al., 2014).
Consequences of Testing | Intended and unintended implications of testing and test-score interpretation and use. | Collecting data from stakeholders (e.g., students, teachers, administrators) to explore (a) the degree to which the intended benefits of testing are realized, and/or (b) the development of unintended consequences (e.g., narrowing of curriculum, decreased confidence; Lane, 2014).
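To make the statistical methods in the rightmost column of Table 1 concrete, the sketch below simulates observation-rubric scores and illustrates two of them: a factor analysis as internal-structure evidence and a linear regression relating scores to an external criterion as relations-to-other-variables evidence. This is a minimal sketch under stated assumptions; the item loadings, sample size, and criterion are invented and do not correspond to any instrument reviewed here.

```python
# Minimal sketch with simulated rubric scores (not any instrument's actual analysis).
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_lessons = 200

# Simulate a single underlying "teaching quality" factor driving six
# hypothetical observation-rubric items scored on a 1-5 scale.
latent = rng.normal(size=n_lessons)
loadings = np.array([0.9, 0.8, 0.7, 0.8, 0.6, 0.7])
items = latent[:, None] * loadings + rng.normal(scale=0.5, size=(n_lessons, 6))
items = np.clip(np.round(items + 3), 1, 5)  # map onto a 1-5 rubric scale

# Internal structure: do the six items load on a single dimension?
fa = FactorAnalysis(n_components=1).fit(items)
print("Estimated loadings:", fa.components_.round(2))

# Relations to other variables: do total scores predict a criterion
# (here, a simulated student-achievement gain)?
total_score = items.mean(axis=1)
achievement_gain = 0.4 * latent + rng.normal(scale=1.0, size=n_lessons)
reg = LinearRegression().fit(total_score[:, None], achievement_gain)
print("Slope of achievement gain on observation score:", reg.coef_[0].round(2))
```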
Table 2. Fairness clusters (AERA et al., 2014).
Cluster | Topic | Sample Methods for Collecting Evidence
Cluster 1 | Test design, development, administration, and scoring procedures that minimize barriers to valid score interpretations for the widest possible range of individuals and relevant subgroups. | Expert review of language used in the assessment by experts representing different subgroups (Oliveri et al., 2015).
Cluster 2 | Validity of test-score interpretations for intended uses for the intended examinee population. | Including relevant subgroups in initial validation studies and analyzing differences between groups to ensure the instrument performs similarly (or as expected) across groups (Herman & Cook, 2022).
Cluster 3 | Accommodations to remove construct-irrelevant barriers and support valid interpretations of scores for their intended uses. | Using generalizability theory to determine the amount of error variance attributable to an accommodation and its use (Solano-Flores, 2022).
Cluster 4 | Safeguards against inappropriate score interpretations for intended uses. | “[W]arn users to avoid likely misuses of the scores” (Zieky, 2016, p. 95).
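As one concrete illustration of the Cluster 2 methods named above, the sketch below compares simulated instrument scores across two hypothetical subgroups. The group labels, sample sizes, and score distributions are assumptions for illustration only, not data from any instrument in this review; a mean difference by itself is not evidence of unfairness, but a large, unexplained gap would warrant closer study.

```python
# Minimal sketch of a Cluster 2-style subgroup comparison on simulated scores.
# Group labels, sample sizes, and effects are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
scores_group_a = rng.normal(loc=3.1, scale=0.6, size=120)  # e.g., elementary teachers
scores_group_b = rng.normal(loc=3.0, scale=0.6, size=110)  # e.g., secondary teachers

# Compare mean scores across groups; report the test statistic, p-value,
# and a standardized mean difference (Cohen's d).
t_stat, p_value = stats.ttest_ind(scores_group_a, scores_group_b)
pooled_sd = np.sqrt((scores_group_a.var(ddof=1) + scores_group_b.var(ddof=1)) / 2)
cohens_d = (scores_group_a.mean() - scores_group_b.mean()) / pooled_sd
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, d = {cohens_d:.2f}")
```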
Table 3. Instruments with Integrated IUAs.
Instrument Name | Interpretation Statement | Use Statement | Claim(s) and Evidence
Mathematics Scan (M-Scan) | “The M-Scan measures the extent to which these dimensions of standards-based teaching practices, both individually and collectively, are present in a lesson.” (Walkowiak et al., 2014, p. 114) | “The M-Scan was developed for researchers to detect the extent to which teachers are using standards-based mathematics teaching practices. Consequently, researchers can utilize M-Scan data to examine relationships between teaching practices and other constructs. The M-Scan is not designed to be used in a supervisory role, by a school administrator for example, to evaluate an individual teacher’s instruction; however, the data may be used to evaluate the outcomes of a program (e.g., teacher preparation or professional development). The M-Scan rubrics have also been utilized in professional development settings where teachers identify a target area and use the selected rubric to guide improvement. For this use, the numerical scale is removed, and the focus becomes the qualitative descriptors.” (Walkowiak et al., 2018, p. 463) | 5 claims and related evidence
Revised SMPs Look-for Protocol | “Score interpretations provide users with information about teachers’ instruction within a single instance and may be used in conjunction with other instruments to construct a profile of teachers’ instruction.” (Carney et al., 2022, p. 339) | “[I]t is intended for research purposes, evaluation of professional development initiatives related to the SMPs, and coaching; it is not an instrument to make high-stakes decisions, does not explore students’ engagement in the SMPs, and does not capture evidence beyond the observed lesson.” (Carney et al., 2022, p. 339) | 4 claims and related evidence
Table 4. Instruments with interpretation statements, use statements, and claims and evidence.
Instrument Name | Interpretation Statement | Use Statement | Claim(s) and Evidence
Mathematics Scan (M-Scan)
Revised SMPs Look-for Protocol
Questionnaire on Teacher Interactions (QTI)
UTeach Observation Protocol (UTOP)
Mathematics Quality of Instruction (MQI)
Constructivist Learning Environment Survey (CLES)
PISA Student–teacher Relations Questionnaire
Assess Today
Classroom Observation Protocol (COP)
International System for Teacher Observation and Feedback (ISTOF)
Mathematics Classroom Observation Protocol for Practices (MCOP2)
Students’ Perceptions of Teachers Successes (SPoTS)
Schlesinger_2018_Instructional Quality
Table 5. Number of Instruments with Validity, Reliability, and Fairness Evidence, Overall and by Instrument Type.
Evidence Type | Overall | Classroom Observations | Student Questionnaires | Teacher Questionnaires | Teacher Interviews
Test content | 26 | 22 | 0 | 3 | 1
Response processes | 9 | 7 | 0 | 2 | 0
Internal structure | 21 | 12 | 3 | 6 | 0
Relations to other variables | 22 | 15 | 1 | 6 | 0
Consequences of testing | 6 | 6 | 0 | 0 | 0
Reliability | 41 | 27 | 2 | 11 | 1
Fairness | 8 | 7 | 1 | 0 | 0
Total Instruments | 47 | 32 | 5 | 11 | 1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

