Measuring Mathematics Teaching Quality: The State of the Field and a Call for the Future
Abstract
1. Introduction
- What approaches have been used to measure mathematics teaching?
- To what extent are there components of an integrated IUA framework (i.e., interpretation and use statements and claims) for mathematics teaching instruments?
- What types of validity, reliability, and fairness evidence have been described for mathematics teaching instruments? Do these types vary based on how mathematics teaching has been measured?
2. Literature Review
2.1. Mathematics Teaching Quality
Measuring Mathematics Teaching Quality
2.2. Advancement of Validity Theory
2.3. A Modern Approach to Validation
3. Methods
3.1. Data Collection and Analysis
3.1.1. Identifying Relevant Instruments
3.1.2. Identifying IUAs and Claims
3.1.3. Categorizing the Constructs Measured
4. Results
4.1. Approaches Used to Measure Mathematics Teaching
4.1.1. Measuring Mathematics Teachers’ Enactment of Teaching Practice
4.1.2. Measuring Mathematics Teachers’ Approximation of Teaching Practice
4.2. Components of an Integrated IUA Framework
4.3. Types of Validity, Reliability, and Fairness Evidence Found
5. Discussion
5.1. Summarizing How Mathematics Teaching Quality Has Been Measured
Implications for Future Research on Mathematics Teaching Quality
5.2. Summarizing Current Validation Practices
5.2.1. Response Processes
5.2.2. Consequential and Fairness Considerations
5.2.3. Implications for Instrument Developers
5.2.4. Implications for Researchers
5.3. Limitations
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
DOI | Digital Object Identifier |
IUA | Interpretation and Use Argument |
References
- *Adams, R., & Wu, M. (Eds.). (2002). PISA 2000 technical report. OECD. Available online: http://www.oecd.org/pisa/pisaproducts/33688233.pdf (accessed on 4 June 2025).
- *Ader, E. (2019). What would you demand beyond mathematics? Teachers’ promotion of students’ self-regulated learning and metacognition. ZDM, 51(4), 613–624.
- Amador, J. M., Bragelman, J., & Superfine, A. C. (2021). Prospective teachers’ noticing: A literature review of methodological approaches to support and analyze noticing. Teaching and Teacher Education, 99, 103256.
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. American Educational Research Association.
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
- American Psychological Association. (1954). Technical recommendations for psychological tests and diagnostic techniques. American Psychological Association.
- Anastasi, A. (1986). Evolving concepts of test validation. Annual Review of Psychology, 37(1), 1–16.
- *Andrews, P. (2007). Negotiating meaning in cross-national studies of mathematics teaching: Kissing frogs to find princes. Comparative Education, 43(4), 489–509.
- *Andrews, P. (2009). Comparative studies of mathematics teachers’ observable learning objectives: Validating low inference codes. Educational Studies in Mathematics, 71(2), 97–122.
- *Appeldoorn, K. L. (2004). Developing and validating the Collaboratives for Excellence in Teacher Preparation (CETP) core evaluation classroom observation protocol (COP). University of Minnesota.
- Australian Association of Mathematics Teachers. (2006). Standards for excellence in teaching mathematics in Australian schools. AAMT.
- Bates, M. J. (1989). The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5), 407–424.
- Bell, C. A., Qi, Y., Croft, A., Leusner, D., McCaffrey, D. F., Gitomer, D. H., & Pianta, R. (2014). Improving observational score quality: Challenges in observer thinking. In K. Kerr, R. Pianta, & T. Kane (Eds.), Designing teacher evaluation systems: New guidance from the measures of effective teaching project (pp. 50–97). Jossey-Bass.
- Bentley, B., Folger, T., Bostic, J., Krupa, E., Burkett, K., & Stokes, D. (2024). Evidence types guidebook. Validity Evidence for Measurement in Mathematics Education. Available online: https://www.mathedmeasures.org/training/ (accessed on 4 June 2025).
- *Berlin, R., & Cohen, J. (2018). Understanding instructional quality through a relational lens. ZDM, 50(3), 367–379.
- Berliner, D. C. (2005). The near impossibility of testing for teacher quality. Journal of Teacher Education, 56(3), 205–213.
- Bishop, J. P. (2021). Responsiveness and intellectual work: Features of mathematics classroom discourse related to student achievement. Journal of the Learning Sciences, 30(3), 466–508.
- Borsboom, D., & Wijsen, L. D. (2016). Frankenstein’s validity monster: The value of keeping politics and science separated. Assessment in Education: Principles, Policy & Practice, 23(2), 281–283.
- *Bostic, J. D., Matney, G. T., & Sondergeld, T. A. (2019). A validation process for observation protocols: Using the Revised SMPs Look-for Protocol as a lens on teachers’ promotion of the standards. Investigations in Mathematics Learning, 11(1), 69–82.
- Bostic, J., Lesseig, K., Sherman, M., & Boston, M. (2021). Classroom observation and mathematics education research. Journal of Mathematics Teacher Education, 24, 5–31.
- Boston, M. (2012). Assessing instructional quality in mathematics. The Elementary School Journal, 113(1), 76–104.
- Boston, M., Bostic, J., Lesseig, K., & Sherman, M. (2015). A comparison of mathematics classroom observation protocols. Mathematics Teacher Educator, 3(2), 154–175.
- *Boston, M., & Wolf, M. K. (2006). Assessing academic rigor in mathematics instruction: The development of the Instructional Quality Assessment Toolkit. CSE Technical Report 672. National Center for Research on Evaluation, Standards, and Student Testing (CRESST).
- *Boston, M. D., & Smith, M. S. (2011). A ‘task-centric approach’ to professional development: Enhancing and sustaining mathematics teachers’ ability to implement cognitively challenging mathematical tasks. ZDM, 43(6), 965–977.
- *Bruckmaier, G., Krauss, S., Blum, W., & Leiss, D. (2016). Measuring mathematics teachers’ professional competence by using video clips (COACTIV video). ZDM, 48(1), 111–124.
- Brunner, E., & Star, J. R. (2024). The quality of mathematics teaching from a mathematics educational perspective: What do we actually know and which questions are still open? ZDM, 56(5), 775–787.
- *Carney, M. B., Bostic, J., Krupa, E., & Shih, J. (2022). Interpretation and use statements for instruments in mathematics education. Journal for Research in Mathematics Education, 53(4), 334–340.
- Charalambous, C. Y., & Praetorius, A. K. (2018). Studying mathematics instruction through different lenses: Setting the ground for understanding instructional quality more comprehensively. ZDM, 50, 355–366.
- Charalambous, C. Y., & Praetorius, A.-K. (2020). Creating a forum for researching teaching and its quality more synergistically. Studies in Educational Evaluation, 67.
- Charalambous, C. Y., Praetorius, A. K., Sammons, P., Walkowiak, T., Jentsch, A., & Kyriakides, L. (2021). Working more collaboratively to better understand teaching and its quality: Challenges faced and possible solutions. Studies in Educational Evaluation, 71, 101092.
- Cizek, G. J. (2016). Validating test score meaning and defending test score use: Different aims, different methods. Assessment in Education: Principles, Policy & Practice, 23(2), 212–225.
- *Copur-Gencturk, Y. (2015). The effects of changes in mathematical knowledge on teaching: A longitudinal study of teachers’ knowledge and instruction. Journal for Research in Mathematics Education, 46(3), 280–330.
- Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer, & H. Braun (Eds.), Test validity (pp. 3–17). Erlbaum.
- Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281.
- Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (pp. 621–694). American Council on Education.
- Desimone, L. M., Hochberg, E. D., & McMaken, J. (2016). Teacher knowledge and instructional quality of beginning teachers: Growth and linkages. Teachers College Record, 118(5), 1–54.
- Downer, J. T., Stuhlman, M., Schweig, J., Martínez, J. F., & Ruzek, E. (2015). Measuring effective teacher-student interactions from a student perspective: A multi-level analysis. The Journal of Early Adolescence, 35(5–6), 722–758.
- *Dreher, A., & Kuntze, S. (2015). Teachers’ professional knowledge and noticing: The case of multiple representations in the mathematics classroom. Educational Studies in Mathematics, 88(1), 89–114.
- *Dunekacke, S., Jenßen, L., Eilerts, K., & Blömeke, S. (2016). Epistemological beliefs of prospective preschool teachers and their relation to knowledge, perception, and planning abilities in the field of mathematics: A process model. ZDM, 48(1), 125–137.
- *Eddy, C. M., Harrell, P., & Heitz, L. (2017). An observation protocol of short-cycle formative assessment in the mathematics classroom. Investigations in Mathematics Learning, 9(3), 130–147.
- *Erickson, A., & Herbst, P. (2018). Will teachers create opportunities for discussion when teaching proof in a geometry classroom? International Journal of Science and Mathematics Education, 16(1), 167–181.
- Eurydice. (2011). Mathematics education in Europe: Common challenges and national policies. Education, Audiovisual, and Culture Executive Agency.
- Fauth, B., Decristan, J., Rieser, S., Klieme, E., & Büttner, G. (2014). Student ratings of teaching quality in primary school: Dimensions and prediction of student outcomes. Learning and Instruction, 29, 1–9.
- Feldlaufer, H., Midgley, C., & Eccles, J. (1988). Student, teacher, and observer perceptions of the classroom before and after the transition to junior high school. Journal of Early Adolescence, 8, 133–156.
- Folger, T. D., Bostic, J., & Krupa, E. E. (2023). Defining test-score interpretation, use, and claims: Delphi study for the validity argument. Educational Measurement: Issues and Practice, 42(3), 22–38.
- Gitomer, D. H., & Bell, C. A. (2013). Evaluating teaching and teachers. In APA handbook of testing and assessment in psychology, Vol. 3: Testing and assessment in school psychology and education (pp. 415–444). American Psychological Association.
- *Gleason, J., Livers, S. D., & Zelkowski, J. (2017). Mathematics classroom observation protocol for practices (MCOP2): Validity and reliability. Investigations in Mathematics Learning, 9(3), 111–129.
- *Gningue, S. M., Peach, R., & Schroder, B. (2013). Developing effective mathematics teaching: Assessing content and pedagogical knowledge, student-centered teaching, and student engagement. The Mathematics Enthusiast, 10(3), 621–646.
- *Gotwals, A. W., Philhower, J., Cisterna, D., & Bennett, S. (2015). Using video to examine formative assessment practices as measures of expertise for mathematics and science teachers. International Journal of Science and Mathematics Education, 13(2), 405–423.
- Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2009). Survey methodology (2nd ed.). Wiley & Sons.
- Herman, J., & Cook, L. (2022). Broadening the reach of the fairness standards. In J. L. Jonson, & K. F. Geisinger (Eds.), Fairness in educational and psychological testing: Examining theoretical, research, practice, and policy implications of the 2014 Standards (pp. 33–60). American Educational Research Association.
- Hill, H. C., Blunk, M. L., Charalambous, C. Y., Lewis, J. M., Phelps, G. C., Sleep, L., & Ball, D. L. (2008). Mathematical knowledge for teaching and the mathematical quality of instruction: An exploratory study. Cognition and Instruction, 26(4), 430–511.
- *Hill, H. C., Charalambous, C. Y., Blazar, D., McGinn, D., Kraft, M. A., Beisiegel, M., Humez, A., Litke, E., & Lynch, K. (2012b). Validating arguments for observational instruments: Attending to multiple sources of variation. Educational Assessment, 17(2–3), 88–106.
- *Hill, H. C., Charalambous, C. Y., & Kraft, M. A. (2012a). When rater reliability is not enough: Teacher observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56–64.
- Hill, H. C., & Shih, J. C. (2009). Research commentary: Examining the quality of statistical mathematics education research. Journal for Research in Mathematics Education, 40(3), 241–250.
- *Hill, H. C., Umland, K., Litke, E., & Kapitula, L. R. (2012). Teacher quality and quality teaching: Examining the relationship of a teacher assessment to practice. American Journal of Education, 118(4), 489–519.
- *Horizon Research Inc. (2000). Validity and reliability information for the LSC Classroom Observation Protocol. Available online: https://horizon-research.com/LocalSystemicChange/wp-content/uploads/2023/05/cop_validity_2000.pdf (accessed on 3 September 2025).
- *Jacobs, V. R., Lamb, L. L., & Philipp, R. A. (2010). Professional noticing of children’s mathematical thinking. Journal for Research in Mathematics Education, 41(2), 169–202.
- Jacobs, V. R., & Spangler, D. A. (2017). Research on core practices in K-12 mathematics teaching. In Compendium for research in mathematics education (pp. 766–792). National Council of Teachers of Mathematics.
- Jonson, J. L., & Geisinger, K. F. (2022a). Conceptualizing and contextualizing fairness standards, issues, and solutions across professional fields in education and psychology. In J. L. Jonson, & K. F. Geisinger (Eds.), Fairness in educational and psychological testing: Examining theoretical, research, practice, and policy implications of the 2014 Standards (pp. 1–9). American Educational Research Association.
- Jonson, J. L., & Geisinger, K. F. (2022b). Looking forward: Cross-cutting themes for the future of fairness in testing. In J. L. Jonson, & K. F. Geisinger (Eds.), Fairness in educational and psychological testing: Examining theoretical, research, practice, and policy implications of the 2014 Standards (pp. 399–416). American Educational Research Association.
- Kane, M. (2013). The argument-based approach to validation. School Psychology Review, 42(4), 448–457.
- Kane, M. (2016). Validation strategies: Delineating and validating proposed interpretations and uses of test scores. In S. Lane, M. R. Raymond, & T. M. Haladyna (Eds.), Handbook of test development (2nd ed., pp. 64–80). Routledge.
- Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527.
- Kane, T., Kerr, K., & Pianta, R. (2014). Designing teacher evaluation systems: New guidance from the measures of effective teaching project. John Wiley & Sons.
- Kim, J., Salloum, S., Lin, Q., & Hu, S. (2022). Ambitious instruction and student achievement: Evidence from early career teachers and the TRU math observation instrument. Teaching and Teacher Education, 117, 103779.
- Klette, K., & Blikstad-Balas, M. (2018). Observation manuals as lenses to classroom teaching: Pitfalls and possibilities. European Educational Research Journal, 17(1), 129–146.
- *König, J., & Kramer, C. (2016). Teacher professional knowledge and classroom management: On the relation of general pedagogical knowledge (GPK) and classroom management expertise (CME). ZDM, 48(1–2), 139–151.
- Krippendorff, K. (2004). Content analysis: An introduction to its methodology (2nd ed.). Sage.
- Krupa, E. E., Bostic, J. D., Bentley, B., Folger, T., Burkett, K. E., & VM2ED community. (2024, May). Search. VM2ED Repository. Available online: https://mathedmeasures.org/ (accessed on 4 June 2025).
- Krupa, E. E., Bostic, J. D., & Shih, J. C. (2019). Validation in mathematics education: An introduction to quantitative measures of mathematical knowledge: Researching instruments and perspectives. In Quantitative measures of mathematical knowledge (pp. 1–13). Routledge.
- *Kunter, M., Tsai, Y. M., Klusmann, U., Brunner, M., Krauss, S., & Baumert, J. (2008). Students’ and mathematics teachers’ perceptions of teacher enthusiasm and instruction. Learning and Instruction, 18(5), 468–482.
- *Kutnick, P., Fung, D. C., Mok, I., Leung, F. K., Li, J. C., Lee, B. P. Y., & Lai, V. K. (2017). Implementing effective group work for mathematical achievement in primary school classrooms in Hong Kong. International Journal of Science and Mathematics Education, 15(5), 957–978.
- Lane, S. (2014). Validity evidence based on testing consequences. Psicothema, 26(1), 127–135.
- *Lindorff, A., & Sammons, P. (2018). Going beyond structured observations: Looking at classroom practice through a mixed method lens. ZDM, 50(3), 521–534.
- Litke, E., Boston, M., & Walkowiak, T. A. (2021). Affordances and constraints of mathematics-specific observation frameworks and general elements of teaching quality. Studies in Educational Evaluation, 68, 100956.
- Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(3), 635–694.
- *Lomas, G. (2009). Pre-service primary teachers’ perceptions of mathematics education lecturers’ practice: Identifying issues for curriculum development. Mathematics Teacher Education and Development, 11, 4–21.
- Lynch, K., Chin, M., & Blazar, D. (2017). Relationships between observations of elementary mathematics instruction and student achievement: Exploring variability across districts. American Journal of Education, 123(4), 615–646.
- *Marshall, J. C., Smart, J., & Horton, R. M. (2010). The design and validation of EQUIP: An instrument to assess inquiry-based instruction. International Journal of Science and Mathematics Education, 8(2), 299–321.
- *Martin, C., Polly, D., McGee, J., Wang, C., Lambert, R., & Pugalee, D. (2015). Exploring the relationship between questioning, enacted mathematical tasks, and mathematical discourse in elementary school mathematics. The Mathematics Educator, 24(2).
- *Matsumura, L. C., Garnier, H. E., Slater, S. C., & Boston, M. D. (2008). Toward measuring instructional interactions “at-scale”. Educational Assessment, 13(4), 267–300.
- *Matsumura, L. C., Slater, S. C., Junker, B., Peterson, M., Boston, M., Steele, M., & Resnick, L. (2006). Measuring reading comprehension and mathematics instruction in urban middle schools: A pilot study of the Instructional Quality Assessment. CSE Technical Report 681. National Center for Research on Evaluation, Standards, and Student Testing (CRESST).
- Mayer, D. P. (1999). Measuring instructional practice: Can policymakers trust survey data? Educational Evaluation and Policy Analysis, 21(1), 29–45.
- *McConney, M., & Perry, M. (2011). A change in questioning tactics: Prompting student autonomy. Investigations in Mathematics Learning, 3(3), 26–45.
- *Melhuish, K., White, A., Sorto, M. A., & Thanheiser, E. (2021). Two replication studies of the relationships between mathematical knowledge for teaching, mathematical quality of instruction, and student achievement. Implementation and Replication Studies in Mathematics Education, 1(2), 155–189.
- Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5–11.
- Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741.
- *Mikk, J., Krips, H., Säälik, Ü., & Kalk, K. (2016). Relationships between student perception of teacher-student relations and PISA results in mathematics and science. International Journal of Science and Mathematics Education, 14, 1437–1454.
- Mu, J., Bayrak, A., & Ufer, S. (2022). Conceptualizing and measuring instructional quality in mathematics education: A systematic literature review. Frontiers in Education, 7, 994739.
- *Muijs, D., Reynolds, D., Sammons, P., Kyriakides, L., Creemers, B. P., & Teddlie, C. (2018). Assessing individual lessons using a generic teacher observation instrument: How useful is the International System for Teacher Observation and Feedback (ISTOF)? ZDM, 50(3), 395–406.
- National Council of Teachers of Mathematics. (2000). Principles and standards for school mathematics. NCTM.
- National Council of Teachers of Mathematics. (2014). Principles to actions: Ensuring mathematical success for all. NCTM.
- National Research Council. (2001). Adding it up: Helping children learn mathematics. National Academy Press.
- *Newton, K. J. (2009). Instructional practices related to prospective elementary school teachers’ motivation for fractions. Journal of Mathematics Teacher Education, 12(2), 89–109.
- Newton, P. E., & Shaw, S. D. (2016). Disagreement over the best way to use the word ‘validity’ and options for reaching consensus. Assessment in Education: Principles, Policy & Practice, 23(2), 178–197.
- Nivens, R. A., & Otten, S. (2017). Assessing journal quality in mathematics education. Journal for Research in Mathematics Education, 48(4), 348–368.
- *Norton, A., & Rutledge, Z. (2006). Measuring task posing cycles: Mathematical letter writing between algebra students and preservice teachers. Mathematics Educator, 19(2), 32–45.
- *Nunnery, J. A., Ross, S. M., & Bol, L. (2008). The construct validity of teachers’ perceptions of change in schools implementing comprehensive school reform models. Journal of Educational Research & Policy Studies, 8(1), 67–91.
- Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational Evaluation and Policy Analysis, 26(3), 237–257.
- Oliveri, M. E., Lawless, R., & Young, J. W. (2015). A validity framework for the use and development of exported assessments. Educational Testing Service. Available online: https://www.ets.org/pdfs/about/exported-assessments.pdf (accessed on 27 August 2025).
- Oren, C., Kennet-Cohen, T., Turvall, E., & Allalouf, A. (2014). Demonstrating the validity of three general scores of PET in predicting higher education achievement in Israel. Psicothema, 26(1), 117–126.
- *Organisation for Economic Co-Operation and Development. (2012). PISA 2009 technical report. Organisation for Economic Co-Operation and Development. Available online: https://www.oecd.org/pisa/pisaproducts/PISA-2012-technical-report-final.pdf (accessed on 4 June 2025).
- Ottmar, E. R., Rimm-Kaufman, S. E., Larsen, R. A., & Berry, R. Q. (2015). Mathematical knowledge for teaching, standards-based mathematics teaching practices, and student achievement in the context of the responsive classroom approach. American Educational Research Journal, 52(4), 787–821.
- Padilla, J. L., & Benitez, I. (2014). Validity evidence based on response processes. Psicothema, 26(1), 136–144.
- Page, M. J., Moher, D., & McKenzie, J. E. (2022). Introduction to PRISMA 2020 and implications for research synthesis methodologists. Research Synthesis Methods, 13(2), 156–163.
- *Pianta, R. C., Hamre, B. K., & Mintz, S. (2012). Classroom assessment scoring system upper elementary manual. Teachstone.
- *Piburn, M., Sawada, D., Turley, J., Falconer, K., Benford, R., Bloom, I., & Judson, E. (2000). Reformed teaching observation protocol (RTOP) reference manual. Arizona Collaborative for Excellence in the Preparation of Teachers.
- *Polly, D. (2016). Exploring the relationship between the use of technology with enacted tasks and questions in elementary school mathematics. International Journal for Technology in Mathematics Education, 23(3), 111–118.
- Praetorius, A. K., & Charalambous, C. Y. (2018). Classroom observation frameworks for studying instructional quality: Looking back and looking forward. ZDM, 50, 535–553.
- *Reinholz, D. L., & Shah, N. (2018). Equity analytics: A methodological approach for quantifying participation patterns in mathematics classroom discourse. Journal for Research in Mathematics Education, 49(2), 140–177.
- Rios, J. A., & Wells, C. (2014). Validity evidence based on internal structure. Psicothema, 26(1), 108–116.
- *Rubel, L. H., & Chu, H. (2012). Reinscribing urban: Teaching high school mathematics in low income, urban communities of color. Journal of Mathematics Teacher Education, 15(1), 39–52.
- *Santagata, R., & Stigler, J. W. (2000). Teaching mathematics: Italian lessons from a cross-cultural perspective. Mathematical Thinking and Learning, 2(3), 191–208.
- *Santagata, R., Zannoni, C., & Stigler, J. W. (2007). The role of lesson analysis in pre-service teacher education: An empirical investigation of teacher learning from a virtual video-based field experience. Journal of Mathematics Teacher Education, 10(2), 123–140.
- *Sawada, D., & Piburn, M. (2000). Reformed teaching observation protocol (RTOP) (ACEPT Technical Report No. IN00-1). Arizona Collaborative for Excellence in the Preparation of Teachers.
- *Sawada, D., Piburn, M. D., Judson, E., Turley, J., Falconer, K., Benford, R., & Bloom, I. (2002). Measuring reform practices in science and mathematics classrooms: The reformed teaching observation protocol. School Science and Mathematics, 102(6), 245–253.
- *Schack, E. O., Fisher, M. H., Thomas, J. N., Eisenhardt, S., Tassell, J., & Yoder, M. (2013). Prospective elementary school teachers’ professional noticing of children’s early numeracy. Journal of Mathematics Teacher Education, 16(5), 379–397.
- *Schlesinger, L., Jentsch, A., Kaiser, G., König, J., & Blömeke, S. (2018). Subject-specific characteristics of instructional quality in mathematics education. ZDM, 50, 475–490.
- Schoenfeld, A. H. (2020). Reframing teacher knowledge: A research and development agenda. ZDM, 52(2), 359–376.
- Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19(1), 405–450.
- Shepard, L. A. (2016). Evaluating test validity: Reprise and progress. Assessment in Education: Principles, Policy & Practice, 23(2), 268–280.
- Shepard, L. A. (2018). Learning progressions as tools for assessment and learning. Applied Measurement in Education, 31(2), 165–174.
- Sireci, S. G. (2016). On the validity of useless tests. Assessment in Education: Principles, Policy & Practice, 23(2), 226–235.
- Sireci, S. G., & Benítez, I. (2023). Evidence for test validation: A guide for practitioners. Psicothema, 35(3), 217–226.
- Sireci, S. G., & Faulkner-Bond, M. (2014). Validity evidence based on test content. Psicothema, 26(1), 100–107.
- Solano-Flores, G. (2022). Fairness in testing: Designing, using, and evaluating test accommodations for English learners. In J. L. Jonson, & K. F. Geisinger (Eds.), Fairness in educational and psychological testing: Examining theoretical, research, practice, and policy implications of the 2014 Standards (pp. 271–292). American Educational Research Association.
- *Spruce, R., & Bol, L. (2015). Teacher beliefs, knowledge, and practice of self-regulated learning. Metacognition and Learning, 10, 245–277.
- Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680.
- *Stevens, T., Harris, G., Liu, X., & Aguirre-Munoz, Z. (2013). Students’ ratings of teacher practices. International Journal of Mathematical Education in Science and Technology, 44(7), 984–995.
- Thunder, K., & Berry, R. Q. (2016). Research commentary: The promise of qualitative metasynthesis for mathematics education. Journal for Research in Mathematics Education, 47(4), 318–337.
- Thurstone, L. L. (1931). The reliability and validity of tests: Derivation and interpretation of fundamental formulae concerned with reliability and validity of tests and illustrative problems. Edwards Brothers.
- Thurstone, L. L. (1955). The criterion problem in personality research. Educational and Psychological Measurement, 15(4), 353–361.
- Tong, Y., Pitoniak, M., Lipner, R., Ezzelle, C., Ho, A., & Huff, K. (2024, April 11–14). Reconsidering assessment fairness: Extending beyond the 2014 standards for educational and psychological testing [invited speaker session]. American Educational Research Association Annual Meeting, Philadelphia, PA, USA.
- *van de Grift, W. (2007). Quality of teaching in four European countries: A review of the literature and application of an assessment instrument. Educational Research, 49(2), 127–152.
- van der Lans, R. M. (2018). On the “association between two things”: The case of student surveys and classroom observations of teaching quality. Educational Assessment, Evaluation and Accountability, 30, 347–366.
- *Wainwright, C., Morrell, P. D., Flick, L., & Schepige, A. (2004). Observation of reform teaching in undergraduate level mathematics and science courses. School Science and Mathematics, 104(7), 322–335.
- *Walkington, C., & Marder, M. (2018). Using the UTeach Observation Protocol (UTOP) to understand the quality of mathematics instruction. ZDM, 50(3), 507–519.
- *Walkowiak, T. A., Berry, R. Q., Meyer, J. P., Rimm-Kaufman, S. E., & Ottmar, E. R. (2014). Introducing an observational measure of standards-based mathematics teaching practices: Evidence of validity and score reliability. Educational Studies in Mathematics, 85, 109–128.
- *Walkowiak, T. A., Berry, R. Q., Pinter, H. H., & Jacobson, E. D. (2018). Utilizing the M-Scan to measure standards-based mathematics teaching practices: Affordances and limitations. ZDM, 50(3), 461–474.
- Walkowiak, T. A., Wilson, J., Adams, E. L., & Wilhelm, A. G. (2022). Scoring with classroom observational rubrics: A longitudinal examination of raters’ responses and perspectives. In A. E. Lischka, E. B. Dyer, R. S. Jones, J. N. Lovett, J. Strayer, & S. Drown (Eds.), Proceedings of the 44th annual meeting of the North American Chapter of the International Group for the Psychology of Mathematics Education (pp. 1869–1873). Middle Tennessee State University.
- Wilhelm, A. G., Folger, T. D., Gallagher, M. A., Walkowiak, T. A., & Zelkowski, J. (2024). Examining validation practices for measures of mathematics teacher affect and behavior. [Preprint].
- Williams, S. R., & Leatham, K. R. (2017). Journal quality in mathematics education. Journal for Research in Mathematics Education, 48(4), 369–396.
- Willis, G. B. (2005). Cognitive interviewing: A tool for improving questionnaire design. Sage.
- Wright, B., & Stone, M. (2004). Making measures. The Phaneron Press.
- *Wubbels, T., Brekelmans, M., & Hooymayers, H. P. (1992). Do teacher ideals distort the self-reports of their interpersonal behavior? Teaching and Teacher Education, 8(1), 47–58.
- *Wubbels, T., Créton, H. A., & Hooymayers, H. P. (1985, March 31–April 4). Discipline problems of beginning teachers. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL, USA.
- *Yopp, D. A., Burroughs, E. A., Sutton, J. T., & Greenwood, M. C. (2019). Variations in coaching knowledge and practice that explain elementary and middle school mathematics teacher change. Journal of Mathematics Teacher Education, 22(1), 5–36.
- Zelkowski, J., Campbell, T. G., & Moldavan, A. M. (2024). The relationships between internal program measures and a high-stakes teacher licensing measure in mathematics teacher preparation: Program design considerations. Journal of Teacher Education, 75(1), 58–75.
- Zieky, M. J. (2016). Developing fair tests. In S. Lane, M. R. Raymond, & T. M. Haladyna (Eds.), Handbook of test development (2nd ed., pp. 81–99). Routledge.
- Zumbo, B. D. (2014). What role does, and should, the test standards play outside of the United States of America? Educational Measurement: Issues & Practice, 33(4), 31.
- Zumbo, B. D., & Hubley, A. M. (2016). Bringing consequences and side effects of testing and assessment to the foreground. Assessment in Education: Principles, Policy & Practice, 23(2), 299–303.
Source of Evidence | Description | Sample Methods for Collecting Evidence |
---|---|---|
Test Content | The wording, format, and construct alignment of individual items. | Recruiting subject matter experts to evaluate alignment between the test and the construct of interest (Sireci & Faulkner-Bond, 2014). |
Response Processes | Respondents’ interpretation of and engagement with items. | Conducting cognitive interviews with test-takers to explore the degree to which respondents’ psychological processes and/or cognition align with test expectations (Padilla & Benitez, 2014). |
Internal Structure | The degree to which items conform to the construct of interest. | Using statistical methods, such as factor analysis or item response theory, to assess test dimensionality (Rios & Wells, 2014). |
Relations to Other Variables | Hypothesized relationships between instrument outcomes and some other variable(s). | Using statistical methods, such as multiple linear regression, to examine whether test scores predict a criterion outcome (Oren et al., 2014). |
Consequences of Testing | Intended and unintended implications of testing and test-score interpretation and use. | Collecting data from stakeholders (e.g., students, teachers, administrators) to explore (a) the degree to which the intended benefits of testing are realized, and/or (b) the development of unintended consequences (e.g., narrowing of curriculum, decreased confidence; Lane, 2014). |
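
To make one row of this table concrete, the short sketch below illustrates internal-structure evidence: it simulates item-level scores from a hypothetical eight-item observation protocol and inspects the eigenvalues of the item correlation matrix, a common first check on dimensionality of the kind Rios and Wells (2014) describe. The sample size, loadings, and data are invented for illustration; none of the reviewed instruments supplies this code.

```python
# Minimal sketch of internal-structure evidence (hypothetical data):
# eigenvalues of the item correlation matrix as a quick dimensionality check.
import numpy as np

rng = np.random.default_rng(0)
n_lessons, n_items = 200, 8

# Assume one latent "teaching quality" factor drives all items, plus noise.
quality = rng.normal(size=(n_lessons, 1))
loadings = rng.uniform(0.5, 0.9, size=(1, n_items))
scores = quality @ loadings + rng.normal(scale=0.6, size=(n_lessons, n_items))

corr = np.corrcoef(scores, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
print("Eigenvalues:", np.round(eigenvalues, 2))
# One dominant eigenvalue is consistent with a unidimensional scale;
# several large eigenvalues would suggest a multidimensional structure.
```

In practice, developers would follow such a screen with confirmatory factor analysis or an item response theory model before claiming internal-structure evidence.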
Cluster | Topic | Sample Methods for Collecting Evidence |
---|---|---|
Cluster 1 | Test design, development, administration, and scoring procedures that minimize barriers to valid score interpretations for the widest possible range of individuals and relevant subgroups. | Review of the language used in the assessment by experts representing different subgroups (Oliveri et al., 2015). |
Cluster 2 | Validity of test-score interpretations for intended uses for the intended examinee population. | Including relevant subgroups in initial validation studies and analyzing differences between groups to ensure the instrument performs similarly (or as expected) across groups (Herman & Cook, 2022). |
Cluster 3 | Accommodations to remove construct-irrelevant barriers and support valid interpretations of scores for their intended uses. | Using generalizability theory to determine the amount of error variance attributable to an accommodation and its use (Solano-Flores, 2022). |
Cluster 4 | Safeguards against inappropriate score interpretations for intended uses. | “[W]arn users to avoid likely misuses of the scores” (Zieky, 2016, p. 95). |
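
As a concrete illustration of Cluster 2, the sketch below mimics a crude differential-item-functioning (DIF) style check of the sort that motivates the subgroup analyses Herman and Cook (2022) recommend: after matching respondents on total score, a single item is compared across two subgroups. All data, group labels, and score strata are invented; this is a didactic sketch, not a validated DIF procedure.

```python
# Minimal sketch (hypothetical data): a crude DIF-style fairness check.
# After matching on total score, does one item still behave differently
# across two subgroups?
import numpy as np

rng = np.random.default_rng(1)
n = 400
group = rng.integers(0, 2, size=n)      # hypothetical 0/1 subgroup flag
ability = rng.normal(size=n)
# Ten binary items whose success probability depends only on ability,
# so no true DIF is built into the simulated data.
items = (rng.normal(size=(n, 10)) < ability[:, None]).astype(float)

total = items.sum(axis=1)
item0 = items[:, 0]

# Compare item-0 means within coarse total-score strata.
for lo, hi in [(0, 3), (4, 6), (7, 10)]:
    stratum = (total >= lo) & (total <= hi)
    m0 = item0[stratum & (group == 0)].mean()
    m1 = item0[stratum & (group == 1)].mean()
    print(f"total in [{lo},{hi}]: group0={m0:.2f}, group1={m1:.2f}")
# Large, consistent within-stratum gaps would flag the item for the kind
# of expert review described under Cluster 1.
```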
Instrument Name | Interpretation Statement | Use Statement | Claim(s) and Evidence |
---|---|---|---|
Mathematics Scan (M-Scan) | “The M-Scan measures the extent to which these dimensions of standards-based teaching practices, both individually and collectively, are present in a lesson.” (Walkowiak et al., 2014, p. 114) | “The M-Scan was developed for researchers to detect the extent to which teachers are using standards-based mathematics teaching practices. Consequently, researchers can utilize M-Scan data to examine relationships between teaching practices and other constructs. The M-Scan is not designed to be used in a supervisory role, by a school administrator for example, to evaluate an individual teacher’s instruction; however, the data may be used to evaluate the outcomes of a program (e.g., teacher preparation or professional development). The M-Scan rubrics have also been utilized in professional development settings where teachers identify a target area and use the selected rubric to guide improvement. For this use, the numerical scale is removed, and the focus becomes the qualitative descriptors.” (Walkowiak et al., 2018, p. 463) | 5 claims and related evidence |
Revised SMPs Look-for Protocol | “Score interpretations provide users with information about teachers’ instruction within a single instance and may be used in conjunction with other instruments to construct a profile of teachers’ instruction.” (Carney et al., 2022, p. 339) | “[I]t is intended for research purposes, evaluation of professional development initiatives related to the SMPs, and coaching; it is not an instrument to make high-stakes decisions, does not explore students’ engagement in the SMPs, and does not capture evidence beyond the observed lesson.” (Carney et al., 2022, p. 339) | 4 claims and related evidence |
Instrument Name | Interpretation Statement | Use Statement | Claim(s) and Evidence |
---|---|---|---|
Mathematics Scan (M-Scan) | ✔ | ✔ | ✔ |
Revised SMPs Look-for Protocol | ✔ | ✔ | ✔ |
Questionnaire on Teacher Interactions (QTI) | ✔ | ✔ | |
UTeach Observation Protocol (UTOP) | ✔ | ✔ | |
Mathematical Quality of Instruction (MQI) | ✔ | ✔ | |
Constructivist Learning Environment Survey (CLES) | ✔ | | |
PISA Student–teacher Relations Questionnaire | ✔ | | |
Assess Today | ✔ | ✔ | |
Classroom Observation Protocol (COP) | ✔ | ✔ | |
International System for Teacher Observation and Feedback (ISTOF) | ✔ | ✔ | |
Mathematics Classroom Observation Protocol for Practices (MCOP2) | ✔ | ✔ | |
Students’ Perceptions of Teachers’ Successes (SPoTS) | ✔ | ✔ | |
Schlesinger_2018_Instructional Quality | ✔ | | |
Evidence Type | Overall | Classroom Observations | Student Questionnaires | Teacher Questionnaires | Teacher Interviews |
---|---|---|---|---|---|
Test content | 26 | 22 | 0 | 3 | 1 |
Response processes | 9 | 7 | 0 | 2 | 0 |
Internal structure | 21 | 12 | 3 | 6 | 0 |
Relations to other variables | 22 | 15 | 1 | 6 | 0 |
Consequences of testing | 6 | 6 | 0 | 0 | 0 |
Reliability | 41 | 27 | 2 | 11 | 1 |
Fairness | 8 | 7 | 1 | 0 | 0 |
Total Instruments | 47 | 32 | 5 | 11 | 1 |
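
Reliability is the most frequently reported evidence type in the table above (41 of the 47 instruments), and for classroom observation protocols it usually means inter-rater agreement. The sketch below computes Cohen’s kappa for two hypothetical raters scoring the same lessons on a four-point rubric; the ratings and rubric are simulated for illustration only.

```python
# Minimal sketch (hypothetical ratings): Cohen's kappa, a chance-corrected
# index of agreement between two raters.
import numpy as np

def cohens_kappa(r1, r2, n_levels):
    """Chance-corrected agreement between two raters on an ordinal rubric."""
    confusion = np.zeros((n_levels, n_levels))
    for a, b in zip(r1, r2):
        confusion[a, b] += 1
    confusion /= confusion.sum()
    observed = np.trace(confusion)                            # raw agreement
    expected = confusion.sum(axis=1) @ confusion.sum(axis=0)  # chance agreement
    return (observed - expected) / (1 - expected)

rng = np.random.default_rng(2)
true_scores = rng.integers(0, 4, size=60)    # 60 lessons, rubric levels 0-3
rater1 = np.clip(true_scores + rng.integers(-1, 2, size=60), 0, 3)
rater2 = np.clip(true_scores + rng.integers(-1, 2, size=60), 0, 3)
print(f"kappa = {cohens_kappa(rater1, rater2, 4):.2f}")
```

Generalizability studies (e.g., Hill, Charalambous, & Kraft, 2012a) extend this idea by partitioning score variance across raters, lessons, and occasions.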