Article

Interrater Reliability Comparisons with Generalizability Theory and Structural Equation Modeling

1 Department of Educational Psychology, Ball State University, Muncie, IN 47304, USA
2 Learning and Performance Research Centre, Psychometric Laboratory, Washington State University, Pullman, WA 99164, USA
3 Department of Educational Leadership, Evaluation & Organizational Development, University of Louisville, Louisville, KY 40292, USA
* Author to whom correspondence should be addressed.
Psychol. Int. 2026, 8(1), 19; https://doi.org/10.3390/psycholint8010019
Submission received: 19 December 2025 / Revised: 10 February 2026 / Accepted: 3 March 2026 / Published: 10 March 2026
(This article belongs to the Section Psychometrics and Educational Measurement)

Abstract

Interrater reliability is a critical aspect of measurement quality, particularly in assessments that rely on subjective judgment. However, interrater reliability estimates vary, and such variability can introduce bias or reduce the accuracy of observed scores, especially when comparing across groups or conditions. Understanding and accounting for these differences is essential when interpreting reliability in applied settings such as education, psychology, and performance evaluation. This study addresses the need for more nuanced approaches to evaluating interrater reliability across groups. Specifically, we examine an approach grounded in generalizability theory (GT) and structural equation modeling (SEM) that enables direct testing of differences in reliability coefficients across groups. A simulation study compared this proposed GT- and SEM-based method to the W statistic for reliability coefficient comparisons. Results demonstrate that the proposed method consistently outperforms the W statistic in terms of Type I error control and achieves adequate statistical power, particularly when sample sizes are moderate to large and differences in rater agreement across groups are substantial. These findings underscore the importance of explicitly modeling differences in interrater reliability and provide researchers with a more robust tool for evaluating the consistency of ratings across diverse contexts and populations.

1. Introduction

Generalizability theory (GT) is perhaps the most widely used tool for assessing interrater reliability (e.g., Choi & Wilson, 2018; Goodwin, 2001). As a psychometric tool, GT yields reliability estimates in the form of G and Phi coefficients. Traditionally, these coefficients are estimated using variance components estimates that are derived from analysis of variance (ANOVA) models (Brennan, 2001). Recently, researchers (e.g., Vispoel et al., 2023) have described an approach to GT based on structural equation modeling (SEM), which can yield estimates of coefficients G and Phi. Vispoel et al. demonstrate that this approach produces essentially identical results to those obtained using ANOVA. In addition, these authors note that the SEM-based approach offers greater flexibility with respect to the types of models that can be fit to the data to obtain G and Phi coefficients in a wider array of scenarios.
In practice, rating data often come from populations containing multiple subgroups (e.g., English language learners). This lack of homogeneity will likely be reflected in samples drawn from such populations as well. In turn, it is possible that ratings will be influenced by the group membership of those being rated. For example, differing opinions among raters regarding the leadership styles of individuals from different gender groups may lead to inconsistency in the leadership potential scores they assign across genders (Kaiser & Wallace, 2016). Consequently, such group differences may yield biased reliability estimates for the overall sample (e.g., Li & Brennan, 2007). When groups’ rating patterns differ, the overall GT estimates can be compromised, thereby necessitating comparison of reliability estimates across the groups.
In response, the purpose of this study is to assess the performance of two approaches for comparing group differences in reliability estimates. The first technique is based on the W statistic first proposed by Feldt (1969) and subsequently refined by Feldt and Kim (2006). This method has been used primarily in the context of reliability estimates obtained for items on tests or surveys; in this study, we applied the approach to scores given by a relatively small number of raters. The second method involves a multiple-groups version of the SEM-based estimator described above; its specifics are described in detail below. The two methods were evaluated using a simulation study and demonstrated with an empirical example. The remainder of this paper is organized as follows: First, a brief review of the theoretical underpinnings of reliability is presented to provide a foundation for the discussion of GT. After describing GT and its associated statistics, the two approaches for comparing the G-coefficient between groups are described. The study goals are then discussed in detail, followed by the method used to meet these goals, the results of the simulation study, and the empirical example. Finally, results and implications are discussed.

1.1. Reliability

GT is a flexible approach for estimating the reliability of a set of measures (Brennan, 2001). However, it is perhaps more widely used when working with rating data in which individuals (e.g., students, teachers, workers) are rated on some metric (e.g., teacher quality, leadership potential) by a set of raters. Researchers, educators, and others using these scores are interested in assessing the extent to which raters provide consistent scores across the individuals being rated, a construct often referred to as interrater reliability. At the heart of all reliability estimates is the classical test theory (CTT) equation:
$X = T + E$  (1)
where
  • $X$ = The observed score on the scale
  • $T$ = The true score on the scale
  • $E$ = Error
The variance of the observed score in Equation (1) can be expressed as a sum of the true score and error variances.
$\sigma_X^2 = \sigma_T^2 + \sigma_E^2$  (2)
In turn, the reliability of the measurements is simply the ratio of true score variance to observed score variance.
$\rho_{xx} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}$  (3)
In other words, the reliability of a score is simply the proportion of observed score variance that is associated with the true score. Higher values of $\rho_{xx}$ indicate that more of the observed score variance is due to true score variance, or that the observed score is more closely related to the true score.
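To make Equation (3) concrete, the following short R simulation constructs observed scores per Equation (1) and recovers their reliability as the ratio of variances; the variances of 8 and 2 are arbitrary illustrative values, not quantities from this study.

```r
# Illustrative check of Equation (3): reliability as the proportion of
# observed-score variance attributable to true scores.
set.seed(1)
n <- 100000
true_score <- rnorm(n, mean = 50, sd = sqrt(8))  # sigma_T^2 = 8
error      <- rnorm(n, mean = 0,  sd = sqrt(2))  # sigma_E^2 = 2
observed   <- true_score + error                 # Equation (1): X = T + E

var(true_score) / var(observed)  # close to 8 / (8 + 2) = 0.80
```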

1.2. Generalizability Theory

The above Equations (1) through (3) demonstrate the central role of error variance in the calculation of score reliability. A key feature of GT is that it explicitly estimates each of the variances in Equation (2). These individual variance estimates can then be used to estimate ρ x x in Equation (3) and thus account for multiple sources of variance that might impact the scores, such as measurement occasion, measurement location, and qualities about the raters themselves when working with rating data (Brennan, 2001). In addition to obtaining a potentially more accurate estimate of reliability by controlling for a variety of sources influencing the observed score, GT also provides estimates for different indices, depending on whether the scores will be used in a norm-referenced or criterion-referenced manner.
It is typically not possible to obtain information on the universe of all possible measures for a particular source of score variance, known as a facet (e.g., all possible raters). For this reason, the relative impact of each facet on a rating must be estimated based on a sample of raters and individuals being rated. As mentioned earlier, in traditional GT, these estimates are made using ANOVA. In the context of ratings given by a set of raters to a group of students on a single construct (e.g., leadership potential), the observed score can be broken down as:
$x_{ir} = \mu + P_i + R_r + PR_{ir}$  (4)
where
  • $x_{ir}$ = Score given by rater r to person i
  • $\mu$ = Overall mean score across persons and raters
  • $P_i$ = Person effect on the score
  • $R_r$ = Rater effect on the score
  • $PR_{ir}$ = Interaction of person by rater
In Equation (4), the observed rating is expressed as a linear combination of the average person proficiency in the domain being rated ($\mu$), the proficiency of one particular person ($P_i$), the average level of the rater in scoring the person ($R_r$), and the remainder of the score after accounting for the first three terms ($PR_{ir}$). This last term corresponds to the error in Equation (1).
Notably, Equation (4) restates Equation (1) in the context of rating data. The variance of the observed score, $\sigma_{x_{ir}}^2$, can be expressed in terms of the variances of the terms that constitute the score in Equation (4), in much the same way that Equation (2) expresses the observed score variance in Equation (1) in terms of true score and error variance.
$\sigma_{x_{ir}}^2 = \sigma_{P_i}^2 + \sigma_{R_r}^2 + \sigma_{PR_{ir}}^2$  (5)
where
  • $\sigma_{P_i}^2$ = Variance in scores due to persons
  • $\sigma_{R_r}^2$ = Variance in scores due to raters
  • $\sigma_{PR_{ir}}^2$ = Variance due to the interaction of persons and raters
The purpose of GT is to estimate the variances in Equation (5) using a sample of data and, in turn, use those values to estimate the reliability of the ratings or scale of interest. This is accomplished in what is known as a G-study, using ANOVA models to estimate the variance components in the ratings. Readers interested in the technical details of how these estimates are obtained are referred to Brennan (2001). The resulting variance component estimates obtained from ANOVA are then employed to estimate the interrater reliability.
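Although the text above describes the ANOVA-based G-study, the same variance components can also be estimated with a crossed random-effects model fit by REML, a common alternative in practice. The sketch below uses the lme4 package and a hypothetical long-format data frame ratings_long with columns person, rater, and score; in a fully crossed design with one observation per person–rater cell, the residual variance corresponds to the person-by-rater interaction confounded with error.

```r
library(lme4)

# G-study variance components for the single-facet person x rater design,
# estimated via a crossed random-effects model rather than ANOVA expected
# mean squares (data frame and column names here are hypothetical).
fit <- lmer(score ~ 1 + (1 | person) + (1 | rater), data = ratings_long)

vc <- as.data.frame(VarCorr(fit))
sigma2_p  <- vc$vcov[vc$grp == "person"]    # person variance
sigma2_r  <- vc$vcov[vc$grp == "rater"]     # rater variance
sigma2_pr <- vc$vcov[vc$grp == "Residual"]  # person x rater interaction/error
```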

1.3. G and Phi Coefficients

In Equation (3), reliability was defined as the ratio of the true score variance to the observed score variance, where the observed score variance was the sum of the true and error variances. GT yields reliability estimates that can be directly tied to Equation (3). As noted earlier, GT provides estimates for both norm- and criterion-referenced decision making. The generalizability coefficient (G-coefficient) is the reliability estimate for use in the norm-referenced context and is defined as
$G = \frac{\sigma_P^2}{\sigma_P^2 + \sigma_\delta^2}$  (6)
where
  • $\sigma_P^2$ = Variance due to person
  • $\sigma_\delta^2$ = Variance due to error = $\hat{\sigma}_{PR}^2 / N_R$
  • $N_R$ = Number of raters (or number of items)
This statistic is directly analogous to reliability as expressed in Equation (3).
In some instances, assigned assessment scores are compared to a standard (or criterion), rather than normatively. In such cases, a rater’s determination of the score will be based upon the extent to which the person or thing being rated conforms to an ideal set of standards (e.g., meets standard, exceeds standard). When such criterion-referenced decisions are being made, the reliability estimate of choice is Phi, also known as the index of dependability.
$\phi = \frac{\sigma_P^2}{\sigma_P^2 + \sigma_\Delta^2}$  (7)
where
  • $\sigma_\Delta^2 = \frac{\sigma_{PR}^2}{N_R} + \frac{\sigma_R^2}{N_R}$
Brennan (2001) refers to $\sigma_\Delta^2$ as the absolute error variance, defined as the expected squared difference between an individual’s observed and universe scores, $E(\bar{x}_p - \mu_p)^2$, commonly referred to as the mean-squared deviation for persons.
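Given estimated variance components, Equations (6) and (7) reduce to a few lines of arithmetic. The R helper below is a minimal sketch; the input values are illustrative only, not taken from this study.

```r
# Compute G (Equation (6)) and Phi (Equation (7)) from estimated variance
# components for the mean across n_r raters.
g_phi <- function(sigma2_p, sigma2_r, sigma2_pr, n_r) {
  G   <- sigma2_p / (sigma2_p + sigma2_pr / n_r)                  # relative error
  Phi <- sigma2_p / (sigma2_p + sigma2_pr / n_r + sigma2_r / n_r) # absolute error
  c(G = G, Phi = Phi)
}

g_phi(sigma2_p = 0.50, sigma2_r = 0.05, sigma2_pr = 0.25, n_r = 4)
# G ~ 0.89, Phi ~ 0.87 for these illustrative values
```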

1.4. Variance Component Estimation Using Latent Variable Models

Many researchers have proposed using a latent variable modeling approach to estimate the variances used in calculating GT coefficients in Equations (6) and (7) (Jorgensen, 2021; Morris, 2020; Raykov & Marcoulides, 2006; Vispoel et al., 2022, 2023). Vispoel et al. (2023) list a number of advantages that this approach offers the measurement professional, including the ability to derive confidence intervals for the individual variance estimates, use estimation methods appropriate for the scale of the data, avoid negative variance component estimates, appropriately account for congeneric scales, test overall model fit, account for method effects in the scores, and incorporate the GT model into a broader framework including relationships with other variables. As with the standard ANOVA-based method, this latent variable approach also allows for the calculation of G and Phi coefficients.
As described by Vispoel et al. (2023), the basic SEM for estimating the appropriate variance components in the single-facet p × i design appears in Figure 1. This figure reflects the latent GT model fit to the empirical example, as described below. The observed ratings are related to the person factor through factor loadings, which, in this scenario, are all fixed to 1. In addition, the model constrains the error variances for the individual ratings to be equal. The resulting model then allows for the estimation of two sources of observed score variance: person and error. These variance estimates can then be used to estimate the G and Phi coefficients.
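A minimal lavaan sketch of the model in Figure 1 follows, assuming a hypothetical wide-format data frame ratings with one column per rater (r1 to r4). Loadings are fixed to 1 and error variances are constrained equal via a shared label, matching the figure; the G-coefficient is then defined from the labeled variances.

```r
library(lavaan)

# Single-facet GT model as an SEM (cf. Vispoel et al., 2023); the data frame
# and column names are hypothetical.
gt_model <- '
  person =~ 1*r1 + 1*r2 + 1*r3 + 1*r4
  person ~~ vp * person      # person variance component
  r1 ~~ ve * r1              # shared label "ve" constrains error variances equal
  r2 ~~ ve * r2
  r3 ~~ ve * r3
  r4 ~~ ve * r4
  G := vp / (vp + ve / 4)    # Equation (6) for the mean of 4 raters
'
fit <- sem(gt_model, data = ratings)
parameterEstimates(fit)      # the G := row includes a delta-method SE
```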

1.5. Comparison of Reliability Estimates Between Groups

As noted above, ratings are often made on individuals from populations containing multiple subgroups (e.g., gender or language groups). A consequence of group differences in score reliability is the potential for biased estimates for the overall sample (e.g., Li & Brennan, 2007). This potential bias is as much an issue for rating data as it is for data obtained from item responses. For example, comparing interrater reliability estimates can inform measurement professionals as to whether rater pairs (e.g., parent–parent vs. parent–teacher) yield different estimates that can influence outcomes (e.g., Stolarova et al., 2014).
Researchers and measurement professionals may want to compare reliability estimates from samples in order to make inferences about differences in internal consistency between groups. Perhaps the most common approach for making such comparisons was described by Feldt (1969), who proposed a statistic for testing the null hypothesis
$H_0: \alpha_1 = \alpha_2$
where
  • $\alpha_1$ = Coefficient alpha for group 1
  • $\alpha_2$ = Coefficient alpha for group 2
This statistic is calculated as
$W = \frac{1 - \hat{\alpha}_1}{1 - \hat{\alpha}_2}$
The W statistic is distributed as an F statistic with $\nu_1$ and $\nu_2$ degrees of freedom. Feldt showed that when the product of the total sample size (N) and the number of items (k) exceeds 1000, $\nu_1 = N_1 - 1$ and $\nu_2 = N_2 - 1$. However, Feldt and Kim (2006) found that for smaller values of N and k, this approximation for the degrees of freedom is not appropriate. They provided a three-step procedure for calculating the appropriate values of $\nu_1$ and $\nu_2$. The details of this technique are not included here, but the interested reader is referred to Feldt and Kim for detailed information on those calculations.
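Under the large-sample degrees of freedom above, W is straightforward to compute. The R sketch below implements only Feldt’s (1969) approximation, not the Feldt and Kim (2006) small-sample adjustment, and the alpha values and sample sizes are illustrative.

```r
# Feldt's (1969) W test with large-sample degrees of freedom (Nk > 1000);
# the Feldt & Kim (2006) small-sample df adjustment is not implemented here.
feldt_w <- function(alpha1, alpha2, n1, n2) {
  W <- (1 - alpha1) / (1 - alpha2)
  p_lower <- pf(W, df1 = n1 - 1, df2 = n2 - 1)
  p <- 2 * min(p_lower, 1 - p_lower)  # two-sided p-value from F reference
  c(W = W, p = p)
}

feldt_w(alpha1 = 0.85, alpha2 = 0.75, n1 = 250, n2 = 250)  # illustrative values
```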
An alternative approach for comparing the reliability of items or ratings between groups extends the SEM-based GT estimation of Vispoel et al. (2023) described above. Specifically, multiple-groups SEM can be used to test differences in G or Phi, creating a multiple-group generalizability theory (MGGT) approach for comparing interrater reliability between independent groups. This approach involves first estimating the GT coefficients separately for each group using the latent variable model. Next, the difference between the two estimates is calculated as part of the model, as is the standard error of the difference. These values are then used to create a test statistic (the ratio of the difference to its standard error), which is distributed as a z value, for assessing the null hypothesis below; a sketch of this model follows the hypothesis.
$H_0: G_1 = G_2$
where
  • $G_1$ = G-coefficient for group 1
  • $G_2$ = G-coefficient for group 2
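A minimal lavaan sketch of the MGGT test follows, again assuming a hypothetical data frame ratings with rater columns r1 to r4 plus a grouping variable named group; group-specific labels define $G_1$, $G_2$, and their difference.

```r
library(lavaan)

# MGGT sketch: group-specific person and error variances, group-specific
# G-coefficients, and a Wald z test of H0: G1 = G2 via the defined difference.
# Data frame, column, and group names are hypothetical.
mggt_model <- '
  person =~ 1*r1 + 1*r2 + 1*r3 + 1*r4
  person ~~ c(vp1, vp2) * person   # person variance, one label per group
  r1 ~~ c(ve1, ve2) * r1           # equal error variances within each group
  r2 ~~ c(ve1, ve2) * r2
  r3 ~~ c(ve1, ve2) * r3
  r4 ~~ c(ve1, ve2) * r4
  G1    := vp1 / (vp1 + ve1 / 4)
  G2    := vp2 / (vp2 + ve2 / 4)
  Gdiff := G1 - G2                 # z = Gdiff / SE(Gdiff)
'
fit <- sem(mggt_model, data = ratings, group = "group")
parameterEstimates(fit)            # the Gdiff row gives estimate, SE, z, p
```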

1.6. Study Goals

The primary goal for this study was to investigate the performance of MGGT for identifying group differences in G and $\phi$. Prior research has focused on the estimation of these coefficients using a latent variable modeling approach (Vispoel et al., 2023). We extend this work by examining the extent to which MGGT can be used to compare the magnitude of these coefficients between groups. In addition, a secondary question compares the noninvariance detection accuracy of MGGT to that of Feldt’s W, which has been widely used for comparing coefficient $\alpha$ (Feldt, 1969; Feldt & Kim, 2006). A Monte Carlo simulation study was used to address these research goals across a variety of conditions that are representative of what is seen in practice and that have been used in prior invariance studies. An empirical example is provided to illustrate how the approach can be used.

2. Materials and Methods

A Monte Carlo simulation (1000 replications per combination of conditions) was used to examine the manipulated conditions, which appear in Table 1. For all study conditions, data were simulated for two groups. The population interrater agreement within Group 1 was set to 0.8 for all conditions, whereas agreement within Group 2 was manipulated to reflect differing levels of noninvariance (i.e., group differences in interrater consistency). Drawing from earlier research in latent variable modeling (e.g., Finch & French, 2018; Immekus et al., 2023; Wicherts & Dolan, 2010), noninvariance was set at 0, 0.2, 0.4, or 0.6 (see Table 1), reflecting a range from complete invariance to a strong difference between groups. In the 0 noninvariance condition, interrater reliability for both groups was 0.8. When noninvariance was 0.2, interrater reliability was 0.8 for group 1 and 0.6 for group 2. Likewise, for the noninvariance level of 0.6, group 1 interrater reliability was 0.8 and group 2 interrater reliability was 0.2. The number of rating categories was 2, 3, or 4, reflecting a short rating scale, and followed the empirical example presented in this work (Kindermann, 2023). The sample size per group was 200, 350, or 500, with equal numbers of individuals in each group (Flake & McCoach, 2018; Sterner et al., 2025). Finally, there were two conditions for the population rating patterns: symmetric for both groups, or positively skewed for group 1 and symmetric for group 2. These conditions were selected to assess the performance of the comparison methods when the location of the ratings was identical across groups and when it differed. This setting is analogous to group mean differences (impact) in invariance and differential item functioning research. As an example, in the symmetric condition with 4 rating categories, the proportions associated with each rating value were 0.1, 0.4, 0.4, and 0.1. In the skewed condition with 4 categories, the proportions were 0.4, 0.3, 0.2, and 0.1.
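As one way to picture the generation of a single cell of this design, the sketch below is a minimal illustration, not the authors’ exact generation procedure; it treats 0.8 as the single-rater agreement, draws continuous ratings for four raters, and discretizes them into four symmetric categories with the stated proportions.

```r
# Minimal sketch of one simulation cell (not the authors' exact code):
# n persons, 4 raters, single-rater agreement of 0.8 (an assumption about
# how "agreement" maps onto the generating model), then discretization into
# 4 symmetric categories with proportions 0.1, 0.4, 0.4, 0.1.
set.seed(123)
n <- 100; n_r <- 4; agree <- 0.8
person <- rnorm(n)                       # person variance fixed at 1
e_sd   <- sqrt((1 - agree) / agree)      # so that 1 / (1 + e_sd^2) = agree
cont   <- person + matrix(rnorm(n * n_r, sd = e_sd), nrow = n, ncol = n_r)

# Category boundaries from empirical quantiles give the target proportions
cuts <- quantile(cont, probs = cumsum(c(0.1, 0.4, 0.4)))
ratings <- apply(cont, 2, function(x)
  as.numeric(cut(x, breaks = c(-Inf, cuts, Inf), labels = 1:4)))
colnames(ratings) <- paste0("r", 1:n_r)

prop.table(table(ratings))  # roughly 0.1, 0.4, 0.4, 0.1 overall
```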
In this study, methods for comparing the G-coefficient values between groups included W and MGGT. The outcomes of interest were the Type I error and power rates; GT reliability estimates are also reported. ANOVA was used to identify important main effects and interactions of the manipulated conditions with respect to study outcomes. The ANOVA model included main effects of the manipulated conditions, along with all interactions of these conditions. ANOVA results are presented in the results section, along with eta-squared ($\eta^2$) effect size values. For brevity, discussion of the results focuses only on the highest-order terms that were statistically significant (p < 0.05); the full results are reported in the tables for completeness. Data generation was conducted using IRRsim (Gamer & Lemon, 2019) in the R software language, whereas the W statistic was calculated and the MGGT model was fit using the lavaan package in R (Rosseel, 2012). In addition to the simulation study, an empirical example was carried out using data from teachers assessing the leadership potential of students in an East African secondary school by comparing ratings assigned to male and female students (Haber-Curran & Tillapaugh, 2017; Hall, 2019; Liu et al., 2023; Rosch et al., 2015). The example data were simulated based upon the original data to protect the confidentiality of the original subjects while allowing for release of the data to readers. The data set is described below when the example is discussed.

3. Results

Empirical Example

Reporting of findings begins with an empirical example demonstrating the use of MGGT and W for comparing G-theory coefficients between groups. To this end, simulated teacher ratings based upon an existing dataset were used. Specifically, students (351 boys, 422 girls) at a high school in an East African country were rated by four teachers regarding their leadership potential on the following four-point scale: 1 = No leadership potential, 2 = Weak leadership potential, 3 = Moderate leadership potential, and 4 = Strong leadership potential. For this analysis, the research question was whether the ratings were invariant across male and female students. The data used in this example were simulated based upon the statistics obtained from the original data.
Table 2 reports the means and standard deviations for each rater for the overall sample and by gender. The raters’ scores were quite consistent, suggesting that they viewed leadership potential in the sample similarly, both overall and by gender. In addition, the teachers rated boys as having higher leadership potential, with mean differences between the genders of approximately 0.6 for each rater.
Table 3 includes the G-coefficients for the overall sample and for each gender group, as calculated using standard G-theory and the latent variable approach. As anticipated, the two sets of values are identical. For the overall sample they are quite high (0.89), suggesting that the raters agreed with one another overall. This high level of agreement was particularly present for the male students, with a G-coefficient of 0.97. On the other hand, the G-coefficient for females was 0.75, suggesting a lower level of agreement among the teachers when rating the leadership potential of this group. The W statistic for comparing the male and female G-coefficients was 0.10, p < 0.001, leading to the rejection of the null hypothesis that the G-coefficients for boys and girls are equal in the population. In other words, there was less agreement among the four raters when assessing the leadership potential of girls than of boys. The estimated difference in G-coefficients from the latent variable model was 0.23, with a standard error of 0.018, z = 12.61, p < 0.001. Therefore, as with the observed variable approach using W, we would conclude that the consistency among raters was lower for girls than for boys.

Simulation Results

Table 4 reports the full ANOVA results for the simulation. The following statistically significant interactions with respect to rejection rates were identified: sample size by degree of noninvariance by method ($F_{6,52} = 4.04$, $p = 0.002$, $\eta^2 = 0.32$), number of raters by degree of noninvariance by method ($F_{3,52} = 5.73$, $p < 0.001$, $\eta^2 = 0.42$), and number of rating categories by degree of noninvariance by method ($F_{6,52} = 2.66$, $p = 0.02$, $\eta^2 = 0.24$). All other interactions were either not statistically significant ($p > 0.05$) or were subsumed within these interactions.
Figure 2 includes the rejection rates for each method of invariance assessment by level of noninvariance and number of rating categories. The W statistic had an elevated Type I error rate across all numbers of categories simulated in this study, whereas the error rate for MGGT was consistently at or below the nominal rate regardless of the number of rating categories. Power for W exceeded 0.90 across these conditions. Power for MGGT was lower than that of W, except in the condition with four categories and noninvariance of 0.6. In addition, power for MGGT increased concomitantly with the number of categories at each level of noninvariance. This pattern differed from that of W, for which the number of categories was not related to the rejection rate.
The rejection rates by sample size, degree of noninvariance, and method appear in Figure 3. When invariance held across groups, the MGGT approach maintained the Type I error rate at approximately the nominal level across sample sizes, as seen in the first panel. In contrast, the Type I error rate for W was consistently near 0.10 for the three sample size conditions considered in this study. When the noninvariance was 0.2, the W statistic consistently had power rates above 0.8, approaching 1.0 for samples of 350 and 500 per group. The power for MGGT was lower than that of W for all sample size conditions; indeed, for noninvariance of 0.2, the power for MGGT never exceeded 0.40. When the noninvariance was 0.4 or above, W had power rates of 1.0. Power for MGGT remained lower than that of W, though it did reach 0.80 or above for sample sizes of 350 per group or larger. It is important to keep in mind when interpreting the power results for W that it exhibited inflated Type I error rates across all study conditions; thus, its power may appear adequate but is inflated by the elevated Type I error rate.
The Type I error and power rates for the comparison of generalizability coefficients by the testing approach, level of noninvariance, and number of raters appear in Figure 4. It is clear that the Type I error rate for MGGT is at or below the nominal rate of 0.05, whereas for W, the error rate is approximately 0.10 for both two and four raters. With respect to power, W exhibited higher rates than MGGT across conditions, as is evident in Figure 2 and Figure 3. For W, power rates exceeded 0.90 across conditions. MGGT’s power rates were larger with four raters as compared with two. When there were four raters and noninvariance of 0.4 or 0.6, MGGT exhibited power rates in excess of 0.8, though lower than that of W. On the other hand, with two raters, the power of MGGT to detect a lack of invariance never exceeded 0.65. As with the other results for W, it is important to keep in mind that it exhibited an inflated Type I error rate across all study conditions.

4. Discussion

As discussed, ignoring group differences in contributions to error and facet differences can yield biased interrater reliability estimates (Li & Brennan, 2007). This is particularly critical to consider when different rating pairs (e.g., parent–parent vs. teacher–parent) provide information on skills (e.g., expressive vocabulary) used to make decisions about individuals (e.g., program placement). Thus, it is critical to assess invariance in GT coefficients across groups. Therefore, in the current study, we compared two approaches for making such assessments and found that the MGGT approach was able to control Type I error rate better than W. Given its inflated Type I error rate, the higher power results for W should be interpreted with caution. Thus, measurement professionals assessing invariance in G or Phi coefficients are likely to obtain more accurate results with larger samples, more raters, and more categories with the MGGT approach.
The interaction effects observed—particularly those involving sample size, number of raters, and rating categories—highlight the nuanced ways in which design factors influence the performance of invariance detection methods. For example, the power of MGGT increased with the number of rating categories and raters, suggesting that its sensitivity improves with richer data structures. This pattern diverged from W, whose power remained high regardless of these factors, again raising concerns about its specificity.
The empirical example further illustrates the practical utility of MGGT. While both MGGT and W detected significant differences in G-coefficients between male and female students, MGGT provided a more conservative estimate of this difference, aligning with its lower Type I error rate. The observed discrepancy in G-coefficients (0.97 for boys vs. 0.75 for girls) suggests differential rater agreement across gender groups, a finding with potential implications for fairness in educational evaluations. Importantly, MGGT’s latent variable modeling framework allows for more nuanced interpretations of such differences, accounting for measurement error and facet structure.
Taken together, these results support the use of MGGT as a robust alternative to traditional methods like the W statistic when assessing invariance in GT coefficients. While W may offer higher power, its inflated Type I error rates pose risks in applied contexts where false positives can have significant consequences. Future research should explore hybrid approaches that balance sensitivity and specificity and investigate the performance of MGGT in more complex designs involving nested facets or longitudinal data.

Limitations and Directions for Future Study

The current study represents a first step in the investigation of MGGT for reliability comparisons. Future work should expand the number of raters/items and sample size conditions. In addition, the MGGT approach should be investigated using more groups and with additional covariates included in the model. Additional studies focusing on differences in rater variance across groups are also needed. The present study underscores the importance of evaluating invariance in generalizability theory (GT) coefficients across groups, particularly when decisions are informed by ratings from multiple sources. Consistent with prior work (Li & Brennan, 2007), our findings demonstrate that ignoring group-level differences in these estimates can lead to biased reliability estimates. This has direct implications for educational assessments where ratings are used to evaluate student competencies, teacher performance, or other latent traits. Researchers should consider these comparisons when examining ratings across groups.
Another important consideration is the computational and practical accessibility of the MGGT approach. Unlike Feldt’s W, which is relatively straightforward to compute using standard statistical software or even spreadsheet tools, MGGT requires the use of structural equation modeling (SEM) software such as R (e.g., lavaan package), Mplus, or similar platforms. Implementing MGGT also necessitates a higher level of statistical expertise, particularly in specifying and interpreting latent variable models, managing model identification constraints, and evaluating model fit. These requirements may pose a barrier for practitioners or researchers without advanced training in SEM or access to appropriate software. As such, while MGGT offers superior control over Type I error and greater modeling flexibility, its adoption in applied settings may be limited by these practical constraints. Future work could explore the development of user-friendly tools or simplified implementations to enhance the accessibility of MGGT for broader audiences.

5. Conclusions

Based on the results of this study, there are several implications for the practice of measurement professionals and psychometricians. First, when the primary concern is controlling the Type I error rate (i.e., not concluding that a group difference exists for reliability when in fact there is none in the population), MGGT should be the method of choice. The price to be paid for the greater certainty in Type I error control is a reduction in power for detecting reliability differences between groups. This power differential can largely be mitigated, however, by increasing the sample size and, when possible, the number of rating/item categories. A second implication is that the MGGT approach provides additional information that is not available with W. Specifically, the researcher can obtain group-specific estimates of the variance components (e.g., rater/item variance, error variance). These values can be useful tools for helping researchers gain insights into how well (or poorly) their measure is functioning, beyond a simple comparison of reliability estimates. Finally, the MGGT approach can be scaled up to include more than two groups because it falls within the larger framework of latent variable modeling. Indeed, this modeling structure would allow for more complex assessments of reliability, including changes over time and interactions of groups with other covariates. Such advanced structure cannot be easily incorporated into the W statistic framework.

Supplementary Materials

The following supporting information can be downloaded at: https://holmesfinch.substack.com (accessed on 2 March 2026).

Author Contributions

H.F. ran simulations and drafted introduction, methods, and results. J.I. wrote portions of introduction and the discussion and edited the manuscript. B.F. wrote portions of the introduction and the discussion and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable. Studies not involving humans or animals.

Informed Consent Statement

Not applicable. Studies not involving humans.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Brennan, R. L. (2001). Generalizability theory. Springer. [Google Scholar]
  2. Choi, J., & Wilson, M. R. (2018). Modeling rater effects using a combination of generalizability theory and IRT. Psychological Test and Assessment Modeling, 60(1), 53–80. [Google Scholar]
  3. Feldt, L. S. (1969). A test of the hypothesis that Cronbach’s alpha or Kuder-Richardson coefficient twenty is the same for two tests. Psychometrika, 34(3), 363–373. [Google Scholar] [CrossRef]
  4. Feldt, L. S., & Kim, S. (2006). Testing the difference between two alpha coefficients with small samples of subjects and raters. Educational and Psychological Measurement, 66(4), 589–600. [Google Scholar] [CrossRef]
  5. Finch, W. H., & French, B. F. (2018). A Simulation investigation of the performance of invariance assessment using equivalence testing procedures. Structural Equation Modeling, 25(5), 673–686. [Google Scholar] [CrossRef]
  6. Flake, J. K., & McCoach, D. B. (2018). An investigation of the alignment method with polytomous indicators under conditions of partial measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 25, 56–70. [Google Scholar] [CrossRef]
  7. Gamer, M., & Lemon, J. (2019). irr: Various coefficients of interrater reliability and agreement (R package version 0.84.1). R Foundation.
  8. Goodwin, L. D. (2001). Interrater agreement and reliability. Measurement in Physical Education and Exercise Science, 5(1), 13–34. [Google Scholar] [CrossRef]
  9. Haber-Curran, P., & Tillapaugh, D. (2017). Gender and student leadership: A critical examination. New Directions for Student Leadership, 154, 11–22. [Google Scholar] [CrossRef] [PubMed]
  10. Hall, J. (2019). Empowering leadership: Counteracting gender bias through focus on individual strengths. The Journal of Student Leadership, 3(1), 49–55. [Google Scholar]
  11. Immekus, J. C., Finch, W. H., & French, B. F. (2023). Recovery accuracy of measurement model and structural coefficients of extended bifactor-(S-1) and (S∙I-1) models. Structural Equation Modeling, 30(4), 633–644. [Google Scholar] [CrossRef]
  12. Jorgensen, T. D. (2021). How to Estimate Absolute-Error Components in Structural Equation Models of Generalizability Theory. Psych, 3(2), 113–133. [Google Scholar] [CrossRef]
  13. Kaiser, R. B., & Wallace, W. T. (2016). Gender bias and substantive differences in ratings of leadership behavior: Toward a new narrative. Consulting Psychology Journal: Practice and Research, 68(1), 72–98. [Google Scholar] [CrossRef]
  14. Kindermann, H. (2023). The reliability of parametric methods in the case of rating scales: A simulation study. Applied Research, 3(3), e202300054. [Google Scholar] [CrossRef]
  15. Li, D., & Brennan, R. L. (2007). A multi-group generalizability analysis of a large-scale reading comprehension test (Technical report). Center for Advanced Studies in Measurement and Assessment, University of Iowa. Available online: https://education.uiowa.edu/sites/education.uiowa.edu/files/2022-10/casma-research-report-25.pdf (accessed on 2 March 2026).
  16. Liu, Z., Rattan, A., & Savani, K. (2023). Reducing gender bias in the evaluation and selection of future leaders: The role of decision-makers’ mindsets about the universality of leadership potential. Journal of Applied Psychology, 108(12), 1924. [Google Scholar] [CrossRef] [PubMed]
  17. Morris, C. A. (2020). Optimal methods for disattenuating correlation coefficients under realistic measurement conditions with single-form, self-report instruments [Doctoral dissertation, University of Iowa]. ProQuest Dissertation and Theses Database. [Google Scholar]
  18. Raykov, T., & Marcoulides, G. A. (2006). A first course in structural equation modeling (2nd ed.). Lawrence Erlbaum Associates Publishers. [Google Scholar]
  19. Rosch, D. M., Collier, D., & Thompson, S. E. (2015). An exploration of students’ motivation to lead: An analysis by race, gender, and student leadership behaviors. Journal of College Student Development, 56(3), 286–291. [Google Scholar] [CrossRef]
  20. Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. [Google Scholar] [CrossRef]
  21. Sterner, P., De Roover, K., & Goretzko, D. (2025). New developments in measurement invariance testing: An overview and comparison of EFA-based approaches. Structural Equation Modeling: A Multidisciplinary Journal, 32(1), 117–135. [Google Scholar] [CrossRef]
  22. Stolarova, M., Wolf, C., Rinker, T., & Brielmann, A. (2014). How to assess and compare inter-rater reliability, agreement and correlation of ratings: An exemplary analysis of mother-father and parent-teacher expressive vocabulary rating pairs. Frontiers in Psychology, 5, 509. [Google Scholar] [CrossRef] [PubMed]
  23. Vispoel, W. P., Hong, H., Lee, H., & Jorgensen, T. D. (2023). Analyzing complete generalizability theory designs using structural equation models. Applied Measurement in Education, 36(4), 372–393. [Google Scholar] [CrossRef]
  24. Vispoel, W. P., Lee, H., Xu, G., & Hong, H. (2022). Expanding bifactor models of psychological traits to account for multiple sources of measurement error. Psychological Assessment, 32(12), 1093–1111. [Google Scholar] [CrossRef] [PubMed]
  25. Wicherts, J. M., & Dolan, C. V. (2010). Measurement invariance in confirmatory factor analysis: An illustration using IQ test performance of minorities. Educational Measurement: Issues and Practice, 29(3), 39–47. [Google Scholar] [CrossRef]
Figure 1. Model for empirical example data.

Figure 2. Rejection rates for MGGT and W by level of noninvariance and number of item categories.

Figure 3. Rejection rates for MGGT and W by level of noninvariance and sample size per group.

Figure 4. Rejection rates for MGGT and W by level of noninvariance and number of raters.
Table 1. Manipulated study conditions.

Study Condition | Levels
Item/rating categories | 2, 3, 4
Number of raters | 2, 4
Sample size per group | 200, 350, 500
Population rating pattern | Symmetric for both groups; positive skew for group 1 and symmetric for group 2
Rater agreement difference between groups (noninvariance) | 0, 0.2, 0.4, 0.6
Table 2. Mean (standard deviation) of leadership ratings for the overall sample and by gender group.

Group | Rater 1 | Rater 2 | Rater 3 | Rater 4
Overall | 2.72 (1.11) | 2.75 (1.10) | 2.75 (1.04) | 2.75 (1.09)
Boys | 3.14 (1.05) | 3.07 (1.06) | 3.09 (1.07) | 3.10 (1.04)
Girls | 2.36 (1.05) | 2.48 (1.07) | 2.48 (1.04) | 2.46 (1.04)
Table 3. G-coefficients for overall sample and by group for leadership potential.

Overall Observed G | Overall Latent G | Observed G Boys | Observed G Girls | Latent G Boys | Latent G Girls
0.89 | 0.89 | 0.97 | 0.75 | 0.97 | 0.75
Table 4. ANOVA results for simulation.

Source | df | F | p | η²
method | 1 | 477.657 | <0.001 | 0.902
categories | 2 | 36.586 | <0.001 | 0.585
raters | 1 | 30.606 | <0.001 | 0.371
pattern | 1 | 0.736 | 0.395 | 0.014
n | 2 | 18.797 | <0.001 | 0.420
noninvariance | 3 | 74.140 | <0.001 | 0.811
categories × noninvariance | 6 | 3.956 | 0.002 | 0.313
n × noninvariance | 6 | 4.038 | 0.002 | 0.318
pattern × noninvariance | 3 | 0.094 | 0.963 | 0.005
raters × noninvariance | 3 | 5.732 | 0.002 | 0.249
categories × n | 4 | 0.355 | 0.839 | 0.027
categories × pattern | 2 | 0.282 | 0.755 | 0.011
categories × raters | 2 | 0.336 | 0.716 | 0.013
pattern × n | 2 | 0.396 | 0.675 | 0.015
raters × n | 2 | 0.267 | 0.767 | 0.010
raters × pattern | 1 | 2.251 | 0.140 | 0.041
categories × n × noninvariance | 12 | 1.457 | 0.171 | 0.252
categories × pattern × noninvariance | 6 | 0.446 | 0.844 | 0.049
categories × raters × noninvariance | 6 | 1.660 | 0.324 | 0.114
pattern × n × noninvariance | 6 | 0.410 | 0.869 | 0.045
raters × n × noninvariance | 6 | 1.159 | 0.342 | 0.118
raters × pattern × noninvariance | 3 | 0.160 | 0.923 | 0.009
categories × pattern × n | 4 | 0.486 | 0.746 | 0.036
categories × raters × n | 4 | 1.690 | 0.166 | 0.115
categories × raters × pattern | 2 | 0.142 | 0.868 | 0.005
raters × pattern × n | 2 | 0.890 | 0.417 | 0.033
n × noninvariance × method | 6 | 4.040 | 0.002 | 0.320
raters × noninvariance × method | 3 | 5.730 | <0.001 | 0.420
categories × noninvariance × method | 6 | 2.660 | 0.020 | 0.240