Interrater Reliability Comparisons with Generalizability Theory and Structural Equation Modeling
Abstract
1. Introduction
1.1. Reliability
- The observed score on the scale
- The true score on the scale
- Error
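In classical test theory, the three quantities listed above combine additively. A minimal statement of that decomposition follows. The symbols X, T, and E are the conventional ones and are assumed here because the original notation is not shown; the reliability ratio additionally assumes that true scores and errors are uncorrelated.

```latex
X = T + E, \qquad
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```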
1.2. Generalizability Theory
- Score given by rater r to person i
- Overall mean score across persons and raters
- Person effect on the score
- Rater effect on the score
- Interaction of person by rater
- Variance in the scores of person i
- Variance in the scores of rater r
- Variance in the interaction of person i and rater r
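The effects and variances listed above describe a fully crossed persons by raters design. As a rough sketch, the variance components can be estimated from the ANOVA expected mean squares. The function name `variance_components` and the truncation of negative estimates at zero are illustrative choices, not necessarily the procedure used in the study. With a single rating per person-rater cell, the interaction is confounded with residual error.

```python
from statistics import fmean


def variance_components(scores):
    """ANOVA-based variance components for a persons x raters score table.

    `scores` is a list of rows (one row per person, one column per rater).
    Returns person, rater, and residual (interaction + error) components.
    """
    n_p = len(scores)
    n_r = len(scores[0])
    grand = fmean(v for row in scores for v in row)
    person_means = [fmean(row) for row in scores]
    rater_means = [fmean(col) for col in zip(*scores)]

    # Sums of squares for the two main effects and the residual.
    ss_person = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_rater = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_resid = ss_total - ss_person - ss_rater

    ms_person = ss_person / (n_p - 1)
    ms_rater = ss_rater / (n_r - 1)
    ms_resid = ss_resid / ((n_p - 1) * (n_r - 1))

    # Solve the expected mean square equations; truncate negatives at zero.
    return {
        "person": max((ms_person - ms_resid) / n_r, 0.0),
        "rater": max((ms_rater - ms_resid) / n_p, 0.0),
        "residual": max(ms_resid, 0.0),
    }


# Toy data: 3 persons rated by 2 raters (made-up values).
vc = variance_components([[1, 2], [3, 4], [5, 6]])
print(vc)
```

In this toy table the second rater is uniformly one point more lenient, so all disagreement is absorbed by the rater component and the residual component is zero.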
1.3. G and Phi Coefficients
- Variance due to person
- Variance due to error
- Number of raters (or number of items)
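The quantities defined above are the ingredients of the G (relative) and Phi (absolute) coefficients. The sketch below uses the standard single-facet forms, in which relative error is the residual variance divided by the number of raters and absolute error also includes the rater variance. The function and argument names are hypothetical, and the numeric inputs are made up for illustration.

```python
def g_coefficient(var_person, var_residual, n_raters):
    """Relative (norm-referenced) coefficient for a persons x raters design."""
    relative_error = var_residual / n_raters
    return var_person / (var_person + relative_error)


def phi_coefficient(var_person, var_rater, var_residual, n_raters):
    """Absolute (criterion-referenced) coefficient; rater variance counts as error."""
    absolute_error = (var_rater + var_residual) / n_raters
    return var_person / (var_person + absolute_error)


# Illustrative variance components (not taken from the study).
g = g_coefficient(var_person=1.0, var_residual=0.5, n_raters=4)
phi = phi_coefficient(var_person=1.0, var_rater=0.2, var_residual=0.5, n_raters=4)
print(round(g, 3), round(phi, 3))
```

Because absolute error adds the rater main effect to the error term, Phi can never exceed G for the same design.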
1.4. Variance Component Estimation Using Latent Variable Models
1.5. Comparison of Reliability Estimates Between Groups
- Coefficient alpha for group 1
- Coefficient alpha for group 2
- G-coefficient for group 1
- G-coefficient for group 2
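One common way to compare two independent reliability estimates is Feldt's ratio statistic, which can be applied to coefficient alpha or, by substitution, to G-coefficients. The sketch below assumes that this is the kind of comparison intended here. The function name `feldt_w` and the group sizes are hypothetical, and the reference distribution (approximately F with n1 - 1 and n2 - 1 degrees of freedom) should be checked against the original derivation before use. The p-value step is left out because the Python standard library has no F distribution.

```python
def feldt_w(rel_1, rel_2, n_1, n_2):
    """Ratio statistic for comparing two independent reliability coefficients.

    Returns (W, df1, df2), where W = (1 - rel_2) / (1 - rel_1). Under the null
    hypothesis of equal reliabilities, W is referred to an F distribution with
    (n_1 - 1, n_2 - 1) degrees of freedom (an approximation).
    """
    if not (rel_1 < 1.0 and rel_2 < 1.0):
        raise ValueError("reliability estimates must be below 1")
    if n_1 < 2 or n_2 < 2:
        raise ValueError("each group needs at least two persons")
    w = (1.0 - rel_2) / (1.0 - rel_1)
    return w, n_1 - 1, n_2 - 1


# Illustrative values: reliabilities of 0.97 and 0.75, hypothetical n = 100 per group.
w, df1, df2 = feldt_w(0.97, 0.75, 100, 100)
print(round(w, 2), df1, df2)
```

A value of W far from 1 suggests that the two groups differ in reliability; the tail probability can be obtained from an F table or a statistics library.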
1.6. Study Goals
2. Materials and Methods
3. Results
Empirical Example
The three-way interactions of sample size by degree of noninvariance by method, number of raters by degree of noninvariance by method, and number of rating categories by degree of noninvariance by method were statistically significant. All other interactions were either not statistically significant (p > 0.05) or subsumed in these interactions.
4. Discussion
Limitations and Directions for Future Study
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest

| Study Condition | Levels |
|---|---|
| Item/rating categories | 2, 3, 4 |
| Number of raters | 2, 4 |
| Sample size per group | 200, 350, 500 |
| Population rating pattern | Symmetric for both groups; positive skew for group 1 and symmetric for group 2 |
| Rater agreement difference between groups (noninvariance) | 0, 0.2, 0.4, 0.6 |
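The conditions in the table above are fully crossed. As a small sketch, the design grid can be enumerated as follows. The dictionary keys and the pattern labels are hypothetical names chosen for illustration, and the number of replications per cell is not specified here.

```python
from itertools import product

# Levels of each manipulated factor, taken from the study conditions table.
design = {
    "categories": [2, 3, 4],
    "raters": [2, 4],
    "n_per_group": [200, 350, 500],
    "pattern": ["symmetric in both groups", "skewed group 1, symmetric group 2"],
    "noninvariance": [0.0, 0.2, 0.4, 0.6],
}

# One dictionary per cell of the fully crossed design.
conditions = [dict(zip(design, combo)) for combo in product(*design.values())]
print(len(conditions))  # 3 * 2 * 3 * 2 * 4 = 144 cells
```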

| Group | Rater 1 | Rater 2 | Rater 3 | Rater 4 |
|---|---|---|---|---|
| Overall | 2.72 (1.11) | 2.75 (1.10) | 2.75 (1.04) | 2.75 (1.09) |
| Boys | 3.14 (1.05) | 3.07 (1.06) | 3.09 (1.07) | 3.10 (1.04) |
| Girls | 2.36 (1.05) | 2.48 (1.07) | 2.48 (1.04) | 2.46 (1.04) |

| Overall Observed G | Overall Latent G | Observed G Boys | Observed G Girls | Latent G Boys | Latent G Girls |
|---|---|---|---|---|---|
| 0.89 | 0.89 | 0.97 | 0.75 | 0.97 | 0.75 |

| Source | df | F | p | Eta Squared |
|---|---|---|---|---|
| method | 1 | 477.657 | <0.001 | 0.902 |
| categories | 2 | 36.586 | <0.001 | 0.585 |
| raters | 1 | 30.606 | <0.001 | 0.371 |
| pattern | 1 | 0.736 | 0.395 | 0.014 |
| n | 2 | 18.797 | <0.001 | 0.42 |
| noninvariance | 3 | 74.14 | <0.001 | 0.811 |
| categories × noninvariance | 6 | 3.956 | 0.002 | 0.313 |
| n × noninvariance | 6 | 4.038 | 0.002 | 0.318 |
| pattern × noninvariance | 3 | 0.094 | 0.963 | 0.005 |
| raters × noninvariance | 3 | 5.732 | 0.002 | 0.249 |
| categories × n | 4 | 0.355 | 0.839 | 0.027 |
| categories × pattern | 2 | 0.282 | 0.755 | 0.011 |
| categories × raters | 2 | 0.336 | 0.716 | 0.013 |
| pattern × n | 2 | 0.396 | 0.675 | 0.015 |
| raters × n | 2 | 0.267 | 0.767 | 0.01 |
| raters × pattern | 1 | 2.251 | 0.14 | 0.041 |
| categories × n × noninvariance | 12 | 1.457 | 0.171 | 0.252 |
| categories × pattern × noninvariance | 6 | 0.446 | 0.844 | 0.049 |
| categories × raters × noninvariance | 6 | 1.66 | 0.324 | 0.114 |
| pattern × n × noninvariance | 6 | 0.41 | 0.869 | 0.045 |
| raters × n × noninvariance | 6 | 1.159 | 0.342 | 0.118 |
| raters × pattern × noninvariance | 3 | 0.16 | 0.923 | 0.009 |
| categories × pattern × n | 4 | 0.486 | 0.746 | 0.036 |
| categories × raters × n | 4 | 1.69 | 0.166 | 0.115 |
| categories × raters × pattern | 2 | 0.142 | 0.868 | 0.005 |
| raters × pattern × n | 2 | 0.89 | 0.417 | 0.033 |
| n × noninvariance × method | 6 | 4.04 | 0.002 | 0.32 |
| raters × noninvariance × method | 3 | 5.73 | <0.001 | 0.42 |
| categories × noninvariance × method | 6 | 2.66 | 0.02 | 0.24 |
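The effect sizes in the ANOVA table appear to be partial eta squared values, which can be recovered from an F statistic and its degrees of freedom. The sketch below shows the conversion. The error degrees of freedom are not reported in the table; the value 52 is an assumption that happens to reproduce several of the tabled effect sizes, and it should not be read as the study's actual error term.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F ratio: SS_effect / (SS_effect + SS_error)."""
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


ASSUMED_DF_ERROR = 52  # hypothetical; not reported in the source table

# Reproduce two tabled values under that assumption.
print(round(partial_eta_squared(477.657, 1, ASSUMED_DF_ERROR), 3))  # method
print(round(partial_eta_squared(36.586, 2, ASSUMED_DF_ERROR), 3))   # categories
```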
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Finch, H.; French, B.; Immekus, J. Interrater Reliability Comparisons with Generalizability Theory and Structural Equation Modeling. Psychol. Int. 2026, 8, 19. https://doi.org/10.3390/psycholint8010019

