Low Inter-Rater Reliability of a High Stakes Performance Assessment of Teacher Candidates
Abstract
1. Introduction
- Inter-rater reliability (IRR) has rarely been reported for teacher performance assessments.
- When IRR has been reported, it has frequently been reported as percent agreement, which is considered a “common mistake” [10] (p. 4).
- We had the double-scored data; these data are hard to come by now that they are proprietary to corporations such as Pearson.
- Because of the high-stakes nature of the TPA (receiving or not receiving a credential), we felt it was important to estimate the IRR in our sample.
- What is the IRR of a sample of PACTs from our teacher preparation program?
- Can evaluator descriptions of how they assessed the sample PACTs illuminate and explain the IRR results?
2. Methods
2.1. PACT Training, Calibration, and Scoring
- Complete evaluations in order, starting with a review of the background information provided by the candidate describing the student population, including geographic location, cultural context, race and ethnicity, and other statistical data such as free and reduced-price lunch percentages.
- Review the PACT sections in order: Planning, Instruction, Assessment, and Reflection, while addressing use of Academic Language throughout.
- Identify evidence that meets the criteria for a rating of 2 first (basic novice teacher competency), and then determine whether the evidence falls above or below this mark. Evaluators were then trained on the evidence that defines a 1 or a 3 in all tasks; there was no specific training on what defines a 4 rating.
- Score the evidence submitted by the candidate without inferring what the candidate might have been thinking or intending.
- Assess PACTs using the rubrics only.
- Take notes as they assess.
- Consistently refer to the “Thinking Behind the Rubrics” document to provide a more in-depth explanation of rubric scores.
- Recognize their own biases.
2.2. IRR Calculation
2.3. Interviews
3. Results
3.1. IRR Computed by Weighted Kappa
3.2. IRR Computed by Percentage Agreement
3.3. Qualitative Analysis of Interviews
3.3.1. Confidence, Challenges, and Changes Evaluators Would Like to Make to the PACT
3.3.2. The Step-By-Step Process of Evaluating the PACT
3.3.3. Thinking Behind the Rubrics (TBR)
3.3.4. Academic Language
4. Discussion
4.1. IRR Estimates
4.1.1. Weighted Kappa
4.1.2. Percentage Agreement
4.1.3. Kappa N
4.1.4. Measurement Error in IRR Estimates
4.2. Cognitive Task Analysis
4.3. Study Limitations
4.4. Summary and Suggestions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
1. How many years have you been a PACT evaluator?
    - What is your content area expertise?
    - What is your occupation outside of being a PACT evaluator?
    - Do you evaluate any other TPAs for any other institution?
    - If so, is flipping between the two TPAs challenging for you?
2. How long has it been since your initial PACT training?
    - How useful did you find this training?
    - How useful is the yearly calibration?
3. On average, how long does it take you to complete a PACT evaluation?
4. For PACTs that are double scored, do you know if they initially passed or failed?
5. Please take me through your process of evaluating PACTs. Typically, what do you do first, second, etc., all the way through to the completion of the PACT evaluation?
6. How do you approach the comments portion of the PACT evaluations, and how important do you feel this feedback is to the candidate?
7. How confident do you feel in conducting a PACT evaluation?
    - Do you feel you received sufficient training? Do you have any feedback about your training experience?
8. To what extent do you use supporting documents such as “Thinking Behind the Rubrics” when evaluating PACTs?
9. Do you normally complete PACTs in one day, or does it take you multiple days?
    - If multiple, how many days?
10. Do you get fatigued when you evaluate PACTs, and what steps (if any) do you take to reduce this fatigue?
    - Do you feel you are as fresh on the latter rubrics as you are on the earlier rubrics?
11. How do you address the Academic Language rubrics?
12. What challenges, if any, do you encounter when evaluating PACTs?
13. What changes, if any, would you make to the PACT evaluation process?
References
- Boshuizen, H.P.A. Teaching as regulation and dealing with complexity. Instr. Sci. 2016, 44, 311–314.
- Reagan, E.M.; Schram, T.; McCurdy, K.; Chang, T.; Evans, C.M. Politics of policy: Assessing the implementation, impact, and evolution of the performance assessment for California teachers (PACT) and edTPA. Educ. Policy Anal. Arch. 2016, 24, 1–22.
- Lalley, J.P. Reliability and validity of edTPA. In Teacher Performance Assessment and Accountability Reforms: The Impacts of edTPA on Teaching and Schools; Carter, J.H., Lochte, H.A., Eds.; Palgrave Macmillan: New York, NY, USA, 2017; pp. 47–78.
- Hebert, C. What do we really know about the edTPA? Research, PACT, and packaging a local teacher performance assessment for national use. Educ. Forum 2017, 81, 68–82.
- Gitomer, D.H.; Martinez, J.F.; Battey, D.; Hyland, N.E. Assessing the assessment: Evidence of reliability and validity in the edTPA. Am. Educ. Res. J. 2019, 58, 3–31.
- Okhremtchouk, I.; Seiki, S.; Gilliland, B.; Ateh, C.; Wallace, M.; Kato, A. Voices of pre-service teachers: Perspective on the performance assessment for California teachers (PACT). Issues Teach. Educ. 2009, 18, 39–62.
- Bird, J.; Charteris, J. Teacher performance assessments in the early childhood sector: Wicked problems of regulation. Asia-Pac. J. Teach. Educ. 2020, 1–14.
- Charteris, J. Teaching performance assessments in the USA and Australia: Implications of the “bar exam for the profession”. Int. J. Comp. Educ. Dev. 2019, 21, 237–250.
- Stacey, M.; Talbot, D.; Buchanan, J.; Mayer, D. The development of an Australian teacher performance assessment: Lessons from the international literature. Asia-Pac. J. Teach. Educ. 2020, 48, 508–519.
- Hallgren, K.A. Computing inter-rater reliability for observational data: An overview and tutorial. Tutor. Quant. Methods Psychol. 2012, 8, 23–34. Available online: https://pubmed.ncbi.nlm.nih.gov/22833776/ (accessed on 12 August 2021).
- McClellan, C.A. Constructed-response scoring—Doing it right. R D Connect. 2010, 13, 1–7.
- McMillan, J.H. Classroom Assessment: Principles and Practice for Effective Standards-Based Instruction, 5th ed.; Pearson: Boston, MA, USA, 2010.
- Pufpaff, L.A.; Clarke, L.; Jones, R.E. The effects of rater training on inter-rater agreement. Mid-West. Educ. Res. 2015, 27, 117–141.
- Sherman, E.M.S.; Brooks, B.L.; Iverson, G.L.; Slick, D.J.; Strauss, E. Reliability and validity in neuropsychology. In The Little Black Book of Neuropsychology: A Syndrome-Based Approach; Schoenberg, M.R., Scott, J.G., Eds.; Springer: Boston, MA, USA, 2011; pp. 873–892.
- Stemler, S.E. A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Pract. Assess. Res. Eval. 2004, 9, 1–11.
- Riggs, M.L.; Verdi, M.P.; Arlin, P.K. A local evaluation of the reliability, validity, and procedural adequacy of the teacher performance assessment exam for teaching credential candidates. Issues Teach. Educ. 2009, 18, 13–38.
- Pecheone, R.L.; Chung, R.R. Evidence in teacher education: The performance assessment for California teachers (PACT). J. Teach. Educ. 2006, 57, 22–36.
- Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46.
- Porter, J.M. Performance Assessment for California Teachers (PACT): An Evaluation of Inter-Rater Reliability. Ph.D. Thesis, University of California, Davis, CA, USA, 2010.
- Porter, J.M.; Jelinek, D. Evaluating inter-rater reliability of a national assessment model for teacher performance. Int. J. Educ. Policies 2011, 5, 74–87.
- Thinking Behind Rubrics. Available online: https://www.uwsp.edu/education/documents/edTPA/Resource10.doc (accessed on 29 September 2021).
- GraphPad QuickCalcs. Available online: http://www.graphpad.com/quickcalcs (accessed on 12 August 2021).
- Cohen, J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull. 1968, 70, 213–220.
- Kappa as a Measure of Concordance in Categorical Sorting. Available online: www.vassarstats.net/kappa.html (accessed on 12 August 2021).
- ATLAS.ti (Version 8.4) [Qualitative Data Analysis Software]; ATLAS.ti Scientific Software Development GmbH: Berlin, Germany, 2019.
- Stanford Center for Assessment, Learning and Equity (SCALE). Educative Assessment and Meaningful Support: 2019 edTPA Administrative Report; Stanford Center for Assessment, Learning and Equity (SCALE): Palo Alto, CA, USA, 2021.
- Brennan, R.L.; Prediger, D.J. Coefficient kappa: Some uses, misuses, and alternatives. Educ. Psychol. Meas. 1981, 41, 687–699.
- Scott, W.A. Reliability of content analysis: The case of nominal scale coding. Public Opin. Q. 1955, 19, 321–325.
- Duckor, B.; Castellano, K.E.; Tellez, K.; Wihardini, D.; Wilson, M. Examining the internal structure evidence for the performance assessment for California teachers: A validation study of the elementary literacy teaching event for tier 1 teacher licensure. J. Teach. Educ. 2014, 65, 402–420.
- Ma, W.J.; Husain, M.; Bays, P.M. Changing concepts of working memory. Nat. Neurosci. 2014, 17, 347–356.
- Clark, R.E.; Feldon, D.F.; van Merrienboer, J.J.G.; Yates, K.A.; Early, S. Cognitive task analysis. In Handbook of Research on Educational Communications and Technology, 3rd ed.; Spector, J.M., Merrill, M.D., van Merrienboer, J., Driscoll, M.P., Eds.; Lawrence Erlbaum Associates: New York, NY, USA, 2008; pp. 577–593.
- Feldon, D.F.; Clark, R.E. Instructional implications of cognitive task analysis as a method for improving the accuracy of experts’ self-reports. In Avoiding Simplicity, Confronting Complexity: Advances in Studying and Designing Powerful (Computer-Based) Learning Environments; Clarebout, G., Elen, J., Eds.; Sense Publishers: Rotterdam, The Netherlands, 2006; pp. 119–126.
- Tofel-Grehl, C.; Feldon, D.F. Cognitive task analysis-based training: A meta-analysis of studies. J. Cogn. Eng. Decis. Mak. 2013, 7, 293–304.
- Oswald, M.E.; Grosjean, S. Confirmation bias. In Cognitive Illusions: A Handbook on Fallacies and Biases in Thinking, Judgement and Memory; Pohl, R.F., Ed.; Psychology Press: Hove, UK, 2004; pp. 79–96.
Task | Rubrics |
---|---|
Planning | 1. Establishing a balanced instructional focus; 2. Making content accessible; 3. Designing assessments |
Instruction | 4. Engaging students in learning; 5. Monitoring student learning during instruction |
Assessment | 6. Analyzing student work from an assessment; 7. Using assessment to inform teaching; 8. Using feedback to promote student learning |
Reflection | 9. Monitoring student progress; 10. Reflecting on learning |
Academic Language a | 11. Understanding language demands and resources; 12. Developing students’ academic language repertoire |
Kappa (κ) | Strength of Agreement |
---|---|
−1 to <0 | Worse than expected by chance |
0 to 0.20 | Poor |
>0.20 to 0.40 | Fair |
>0.40 to 0.60 | Moderate |
>0.60 to 0.80 | Good |
>0.80 to 1 | Very good to perfect agreement |
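
The κ values interpreted above are Cohen's weighted kappa. As a reference point (this equation is not restated from the article itself), the statistic written with disagreement weights is:

$$
\kappa_w \;=\; 1 \;-\; \frac{\sum_{i,j} w_{ij}\, p^{\mathrm{obs}}_{ij}}{\sum_{i,j} w_{ij}\, p^{\mathrm{exp}}_{ij}},
\qquad
p^{\mathrm{exp}}_{ij} = p_{i\cdot}\, p_{\cdot j}
$$

where $p^{\mathrm{obs}}_{ij}$ is the observed proportion of rubrics scored $i$ by one evaluator and $j$ by the other, $p_{i\cdot}$ and $p_{\cdot j}$ are the marginal rating proportions, and $w_{ij}$ penalizes disagreements. Which weighting scheme the study used is an assumption here; linear weights, $w_{ij} = |i - j|$, are one common choice and are consistent with the candidate-level values reported below.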
Candidate | Evaluator | P1 | P2 | P3 | I4 | I5 | A6 | A7 | A8 | R9 | R10 | AL11 | AL12
---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 1 | 3 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2
1 | 2 | 3 | 3 | 2 | 2 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | 2
2 | 1 | 3 | 3 | 3 | 2 | 2 | 2 | 3 | 2 | 3 | 2 | 2 | 3
2 | 2 | 3 | 4 | 3 | 3 | 4 | 3 | 3 | 4 | 3 | 3 | 3 | 3
3 | 1 | 4 | 3 | 4 | 3 | 2 | 3 | 3 | 4 | 3 | 2 | 3 | 2
3 | 2 | 3 | 3 | 3 | 3 | 2 | 4 | 2 | 2 | 3 | 3 | 2 | 2
4 | 1 | 2 | 2 | 2 | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 2
4 | 2 | 3 | 2 | 2 | 3 | 2 | 2 | 2 | 3 | 2 | 2 | 2 | 2
5 | 1 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2
5 | 2 | 3 | 3 | 2 | 2 | 2 | 1 | 1 | 1 | 3 | 3 | 2 | 2
6 | 1 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1
6 | 2 | 3 | 2 | 2 | 2 | 1 | 2 | 1 | 1 | 2 | 2 | 1 | 1
7 | 1 | 2 | 3 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2
7 | 2 | 2 | 2 | 2 | 1 | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 2
8 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1
8 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 2
9 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 1 | 1 | 2 | 2 | 2 | 2
9 | 2 | 2 | 2 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2
10 | 1 | 2 | 2 | 3 | 1 | 1 | 2 | 1 | 1 | 2 | 2 | 2 | 2
10 | 2 | 2 | 2 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 1
11 | 1 | 2 | 2 | 3 | 3 | 2 | 3 | 2 | 3 | 3 | 2 | 2 | 2
11 | 2 | 3 | 2 | 3 | 2 | 1 | 2 | 2 | 3 | 2 | 2 | 2 | 2
12 | 1 | 3 | 3 | 2 | 3 | 2 | 3 | 2 | 2 | 2 | 3 | 3 | 2
12 | 2 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 2 | 2
13 | 1 | 2 | 3 | 2 | 2 | 3 | 2 | 2 | 1 | 1 | 2 | 2 | 2
13 | 2 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 2 | 2 | 2
14 | 1 | 2 | 2 | 1 | 2 | 2 | 1 | 2 | 1 | 2 | 2 | 2 | 1
14 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1
15 | 1 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 4 | 2 | 3 | 3 | 2
15 | 2 | 3 | 3 | 3 | 2 | 2 | 2 | 3 | 2 | 2 | 2 | 2 | 3
16 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 2
16 | 2 | 3 | 2 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1
17 | 1 | 2 | 3 | 3 | 2 | 2 | 3 | 2 | 1 | 1 | 2 | 2 | 2
17 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 2 | 1 | 1
18 | 1 | 3 | 3 | 3 | 3 | 2 | 2 | 3 | 3 | 3 | 2 | 2 | 3
18 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2
19 | 1 | 2 | 3 | 3 | 1 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 2
19 | 2 | 3 | 3 | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2
Evaluator 1 PACT Score | Evaluator 2: Score 1 | Evaluator 2: Score 2 | Evaluator 2: Score 3 | Evaluator 2: Score 4 | Total
---|---|---|---|---|---
1 | 3 | 3 | 0 | 0 | 6
2 | 0 | 4 | 1 | 0 | 5
3 | 0 | 0 | 1 | 0 | 1
4 | 0 | 0 | 0 | 0 | 0
Total | 3 | 7 | 2 | 0 | 12
Candidate—Initial, Final Pass (P) or Fail (F) | Cohen’s Weighted Kappa, kw | Standard Error | 95% CI | Kappa Strength of Agreement | Percentage Agreement (Exact) | Percentage Agreement (Exact + Adjacent) |
---|---|---|---|---|---|---|
1—F, F | 0.54 | 0.19 | 0.17 to 0.90 | Moderate | 66.7 | 100 |
2—P, P | 0 | − | − | Poor | 41.7 | 83.3 |
3—P, P | 0.11 | 0.18 | −0.24 to 0.46 | Poor | 41.7 | 91.7 |
4—P, P | 0.08 | 0.22 | −0.36 to 0.51 | Poor | 50 | 100 |
5—P, F | 0.18 | 0.15 | −0.12 to 0.49 | Poor | 50 | 100 |
6—F, F | 0.33 | 0.16 | 0.02 to 0.65 | Fair | 58.3 | 100 |
7—P, F | 0 | − | − | Poor | 66.7 | 91.7 |
8—F, F | −0.29 | − | − | Worse than expected by chance | 50 | 100 |
9—F, F | 0.25 | 0.32 | −0.37 to 0.87 | Fair | 75 | 100 |
10—F, F | 0.33 | 0.24 | −0.14 to 0.80 | Fair | 66.7 | 100 |
11—P, P | 0.23 | 0.23 | −0.21 to 0.67 | Fair | 58.3 | 100 |
12—P, P | 0.33 | 0.26 | −0.17 to 0.84 | Fair | 66.7 | 100 |
13—P, P | 0.17 | 0.23 | −0.27 to 0.61 | Poor | 66.7 | 91.7 |
14—F, F | 0.09 | 0.09 | −0.09 to 0.27 | Poor | 41.7 | 100 |
15—P, P | 0.07 | 0.20 | −0.32 to 0.45 | Poor | 50 | 91.7 |
16—F, F | 0.09 | 0.21 | −0.33 to 0.50 | Poor | 50 | 91.7 |
17—P, F | 0.35 | 0.17 | 0.01 to 0.69 | Fair | 58.3 | 100 |
18—P, P | 0 | − | − | Poor | 33.3 | 100 |
19—F, P, P | 0.41 | 0.23 | −0.03 to 0.86 | Moderate | 66.7 | 100 |
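
The reference list points to online calculators (GraphPad QuickCalcs and the VassarStats kappa page) for these statistics, so the exact computation settings are not restated here. The sketch below is a minimal Python illustration, assuming linear disagreement weights on the 1–4 rubric scale; the function names are illustrative, not part of the study. Under those assumptions it reproduces candidate 1's row in the table above (κw ≈ 0.54, 66.7% exact agreement, 100% exact-plus-adjacent agreement).

```python
import numpy as np

def weighted_kappa(r1, r2, weights="linear"):
    """Cohen's weighted kappa for two raters' ordinal scores on the 1-4 rubric scale."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    cats = np.arange(1, 5)                                   # rubric levels 1-4
    k = len(cats)
    conf = np.zeros((k, k))                                  # rater 1 rows, rater 2 columns
    for a, b in zip(r1, r2):
        conf[a - 1, b - 1] += 1
    n = conf.sum()
    # Disagreement weights: |i - j| (linear) or (i - j)^2 (quadratic)
    dist = np.abs(np.subtract.outer(cats, cats)).astype(float)
    w = dist if weights == "linear" else dist ** 2
    expected = np.outer(conf.sum(axis=1), conf.sum(axis=0)) / n   # chance-agreement table
    return 1.0 - (w * conf).sum() / (w * expected).sum()

def percent_agreement(r1, r2, tolerance=0):
    """Percentage of rubrics on which the two scores differ by at most `tolerance` levels."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    return 100.0 * np.mean(np.abs(r1 - r2) <= tolerance)

# Candidate 1's twelve rubric scores (P1-AL12) from the double-scoring table above
e1 = [3, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2]
e2 = [3, 3, 2, 2, 1, 2, 1, 2, 2, 1, 2, 2]

print(round(weighted_kappa(e1, e2), 2))            # 0.54 (Moderate)
print(round(percent_agreement(e1, e2), 1))         # 66.7 (exact)
print(round(percent_agreement(e1, e2, 1), 1))      # 100.0 (exact + adjacent)
```

Quadratic weights, another common choice, would penalize two-level disagreements more heavily and would generally yield different κw values for the same data.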
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).