Using Structural Equation Modeling to Reproduce and Extend ANOVA-Based Generalizability Theory Analyses for Psychological Assessments

Abstract: Generalizability theory provides a comprehensive framework for determining how multiple sources of measurement error affect scores from psychological assessments and using that information to improve those assessments. Although generalizability theory designs have traditionally been analyzed using analyses of variance (ANOVA) procedures, the same analyses can be replicated and extended using structural equation models. We collected multi-occasion data from inventories measuring numerous dimensions of personality, self-concept, and socially desirable responding to compare variance components, generalizability coefficients, dependability coefficients, and proportions of universe score and measurement error variance using structural equation modeling versus ANOVA techniques. We further applied structural equation modeling techniques to continuous latent response variable metrics and derived Monte Carlo-based confidence intervals for those indices on both observed score and continuous latent response variable metrics. Results for observed scores estimated using structural equation modeling and ANOVA procedures seldom varied. Differences in reliability between raw score and continuous latent response variable metrics were much greater for scales with dichotomous responses, thereby highlighting the value of doing analyses on both metrics to evaluate gains that might be achieved by increasing response options. We provide detailed guidelines for applying the demonstrated techniques using structural equation modeling and ANOVA-based statistical software.


Introduction
Cronbach et al. [1] first introduced generalizability theory (GT) to the research community, and it continues to provide an elegant framework for conceptualizing how different sources of measurement error affect scores from assessment measures and how that information can be used to evaluate and improve such measures. GT techniques encompass both objectively and subjectively scored measures and can be readily applied to assessments in affective, cognitive, behavioral, and psychomotor domains. Applications of GT rely heavily on variance component estimates traditionally obtained using analysis of variance (ANOVA)-based expected mean squares within software packages tailored specifically to GT applications such as GENOVA [2], urGENOVA [3], and EduG [4] or from variance component programs within popular statistical packages such as SPSS, SAS, STATA, R, MATLAB, and Minitab (see, e.g., [5]). The computational framework for GT analyses also can be represented within linear mixed models (see, e.g., [6,7]). For example, in contrast to the other programs listed here, the gtheory package in R [8] uses the lme4 package [9] to fit a linear mixed model to the data. It also uses restricted maximum likelihood (REML) rather than conventional expected mean square estimates to derive variance components, with both options also being available in most variance component programs.
Structural equation models (SEMs) offer yet another useful though less frequently applied means to estimate variance components for GT analyses in a variety of ways. Marcoulides [10] and Raykov and Marcoulides [11] were among the first to highlight such connections and demonstrate how to partially analyze one- and two-facet GT designs within SEM frameworks using LISREL [12]. Other researchers have since revisited and expanded those techniques to other designs, estimation procedures, and software packages (see, e.g., [13][14][15][16][17][18][19][20][21][22][23][24]). However, the original applications of GT to SEMs by Marcoulides and Raykov as well as those by later researchers cited here focused predominantly on derivation of variance components reflecting relative differences among scores for making norm-referenced decisions and typically omitted components reflecting absolute differences in scores for making criterion-referenced decisions. Part of the reason for such omissions was that derivation of variance components for absolute differences in scores was often considered unwieldy and fraught with technical difficulties due, for example, to presumptions that the data matrices analyzed needed to be transposed to treat facet conditions as objects of measurement and objects of measurement as facet conditions [13]. However, this method will not work in typical scenarios in which the number of persons exceeds the number of facet conditions, seemingly restricting practical uses of GT-SEMs to estimating indices of score consistency that reflect only relative differences in scores.
To overcome this perceived limitation of GT-SEMs, Jorgensen [14] proposed much simpler alternatives for obtaining variance components for absolute differences using the same GT-SEM designs analyzed in previous studies by imposing effect coding [25] and related constraints on factor loadings, means, and intercepts. However, illustrations of his procedures were based on a generated dataset of 200 normally distributed scores for a hypothetical measure of unspecified content with no clearly defined response options for items. When applying his procedures to that dataset with fully crossed one- and two-facet GT-SEM designs, using the lavaan SEM package in R and maximum likelihood parameter estimates, he obtained generalizability (G or Eρ²) and dependability (D or Φ) coefficients that varied by no more than 0.003 from those produced by the anova() function in R using ANOVA mean square (MS) estimates and the gtheory package in R using restricted maximum likelihood (REML) estimates. After trichotomizing the original data into discrete ordered categories, Jorgensen repeated the SEM analyses using diagonally weighted least squares estimates (WLSMV in R) to place results on a continuous latent response variable (CLRV) metric that corrected indices of score consistency for possible effects of scale coarseness resulting from limited response options and/or unequal underlying intervals between those options (see [13,22]). He found that G and D coefficients were appreciably higher when taking such effects into account. Jorgensen further noted that simple commands from the semTools package in R [26] could be added to code for GT-SEMs within lavaan to produce Monte Carlo-based confidence intervals for G and D coefficients that are typically unavailable in standard GT and variance component programs.
The semTools package also can create Monte Carlo-based confidence intervals from packages outside of R if an asymptotic sampling covariance (ACOV) matrix of variance-component parameters is available. More detailed information about Monte Carlo-based confidence intervals can be found in [27][28][29].
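The resampling logic behind such intervals can be sketched in a few lines of Python. The snippet below is a minimal illustration, not a reproduction of semTools code, and all numeric inputs (the variance-component estimates, ACOV matrix, and number of items) are hypothetical: the variance-component estimates are treated as approximately multivariate normal with their ACOV matrix, many parameter vectors are drawn, a G coefficient is computed for each admissible draw, and percentiles of those draws form the interval.

```python
import numpy as np

# Hypothetical pI-design inputs: point estimates for [sigma^2_p, sigma^2_pi,e]
# and their asymptotic sampling covariance (ACOV) matrix. All values are
# illustrative, not taken from any fitted model.
est = np.array([0.50, 0.25])
acov = np.array([[0.0040, 0.0005],
                 [0.0005, 0.0020]])
n_items = 10

rng = np.random.default_rng(0)
draws = rng.multivariate_normal(est, acov, size=20_000)
draws = draws[(draws > 0).all(axis=1)]     # keep admissible (positive) draws

# G coefficient for each draw, then a percentile-based 90% interval.
g = draws[:, 0] / (draws[:, 0] + draws[:, 1] / n_items)
ci_90 = np.percentile(g, [5, 95])
```

The same recipe extends to any function of the variance components (global or cut-score specific D coefficients, error proportions) simply by changing the formula applied to each draw.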
The purpose of this article is to illustrate and expand SEM procedures for analyzing fully crossed GT designs discussed by Jorgensen [14] using empirical data from respondents who completed popular self-report inventories measuring multiple dimensions of personality, self-concept, and socially desirable responding. We chose these inventories for their widespread use, strong psychometric properties, and variety of item response options. Consequently, the results provide a tangible and stronger empirical foundation for evaluating scale coarseness effects in real-life settings. The analyses also were intended to contribute new evidence of the psychometric properties of scores within GT frameworks for the inventories administered and to extend comparisons of results between SEM-and ANOVA-based procedures beyond G and D coefficients to include variance components, proportions of individual sources of measurement error, and confidence intervals for all reported indices.

Background
In Cronbach et al.'s [1] original treatment of GT and in those by most subsequent authors (see, e.g., [30,31]), distinctions were made between generalizability and decision studies. In a generalizability study, researchers identify the objects of measurement and universes of admissible observations, collect data, and estimate relevant variance components. For our present applications to self-report questionnaires, persons are the objects of measurement, and items and occasions serve as possible universes of generalization. Within GT designs, universes of generalization are represented as facets that correspond to sources of measurement error that limit generalization of results. Systematic (i.e., non-error) variance in GT designs is referred to as universe or person score variance and conceptually parallels true score variance in classical test theory and communality in factor analysis (see, e.g., [20]).
In a decision study, variance components from the generalizability study are used to estimate indices of score consistency and measurement error when using scores for norm- and/or criterion-referencing purposes based on the original generalizability or altered decision study design. The most common alterations in decision studies for questionnaire data are restricting original universes of items and occasions to just items or just occasions and changing the numbers of items and/or occasions from those originally analyzed (see, e.g., [16][17][18][19][20][21][22]24,30,[32][33][34]). To acquaint readers with these fundamental GT techniques for analyzing data from objectively scored self-report measures, we begin with brief introductions to relevant ANOVA-based single- and multi-facet designs and how they can be represented within SEMs.

Single-Facet GT Designs, Key Formulas, and Related SEMs
Basic concepts. Within a persons × items (pi) random effects GT design, persons and items are fully crossed, allowing the observed score for a particular person and item to be decomposed into person, item, and residual effects. The associated variance of each effect is called a variance component. Equations (1) and (2) show how estimated variances for item and item-mean scores are partitioned within this design.
pi design, individual item score level:

σ̂²(Y_pi) = σ̂²_p + σ̂²_i + σ̂²_pi,e (1)

pI design, item-mean score level:

σ̂²(Y_pI) = σ̂²_p + σ̂²_pi,e/n′_i (2)

where σ̂² = estimated variance component, Y_pi = score for a particular person on a given item, Y_pI = mean across all items for a particular person, and n′_i = number of items. Items serve as the single facet of interest here, but the same principles would apply if the facet represented other tasks, occasions, or raters. Equation (1) reveals that the overall estimated variance in scores across all items and persons is partitioned into three additive components, representing persons (or universe scores; σ̂²_p), inter-person differences in item scores plus other confounded residual error (σ̂²_pi,e), and item differences (σ̂²_i). The letter I is capitalized in Equation (2) to emphasize that scores for each person are now averaged across items. The partitioning of item-mean variance across persons is more relevant in practical settings because decisions are typically made using those scores or simple transformations of them that would yield the same estimates of score consistency (e.g., multiplying item-mean scores by the number of items to obtain total scale scores). Primes appear on the ns in Equation (2) and elsewhere to indicate that any number of conditions/replicates for a facet can be specified in a decision study. The variance component for items (σ̂²_i) drops out of Equation (2) because the mean score for items across persons in the partitioning shown is now a constant.
Indices of score consistency and agreement. Once estimated, the three variance components on the right side of Equation (1) can be inserted into Equations (3)-(5) to derive three key indices: G coefficients, global D coefficients, and cut-score specific D coefficients (see, e.g., [20,30,35]).

Ĝ coefficient for pI design = σ̂²_p / (σ̂²_p + σ̂²_pi,e/n′_i) (3)

Global D̂ coefficient for pI design = σ̂²_p / (σ̂²_p + σ̂²_i/n′_i + σ̂²_pi,e/n′_i) (4)

Cut-score specific D̂ coefficient for pI design = (σ̂²_p + (X̄ − λ)² − σ̂²(X̄)) / (σ̂²_p + (X̄ − λ)² − σ̂²(X̄) + σ̂²_i/n′_i + σ̂²_pi,e/n′_i) (5)

where λ = the cut-score, X̄ = the grand mean of observed scores, and σ̂²(X̄) = σ̂²_p/n_p + σ̂²_pi,e/(n_p × n_i) + σ̂²_i/n_i and corrects for bias (see [35]).
G coefficients reflect relative differences in scores used for norm-referencing purposes (e.g., rank ordering). Within the present pI design, they are equivalent to alpha reliability estimates [36] and would be analogous to stability or inter-rater reliability coefficients had occasions or raters been the lone facets in the design. Global and cut-score specific D coefficients take both relative and absolute differences in scores into account. Terms within parentheses in the denominators of Equations (3) and (4) represent relative error and absolute error, respectively. When item means are equal (i.e., σ̂²_i = 0), relative and absolute error will coincide, as will G and global D coefficients. When observed scores are used for screening, selection, classification, or domain-referencing purposes, cut-score specific D coefficients provide the best indices of dependability because they reflect agreement in decisions over random repetitions of the assessment procedure [37]. Values for these coefficients will vary with the cut point chosen and increase as cut-scores deviate from the scale mean.

SEM representation. An SEM for the pi GT design based on administration of three items is shown at the top of Figure 1. This model has a single factor for person linked to each item, with factor loadings set equal to one and uniquenesses set equal. Consequently, only two variance components are directly estimated: the variance for the person factor (σ̂²_p) and the common uniqueness across items (σ̂²_pi,e). To derive the missing variance component for items (σ̂²_i) needed to calculate D coefficients, Jorgensen [14] imposed effect coding constraints on loadings and intercepts [25] that placed results on the same scale as the original indicators (item scores here) and set the mean for the person factor equal to the grand mean of observed scores.
With effect coding, item intercepts are constrained to sum to zero, and factor loadings are constrained to average one (or equivalently, to sum to the number of items). Under these conditions within the present model, Jorgensen noted that σ̂²_i can be derived using Equation (6).
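The arithmetic of the effect-coding constraint can be illustrated numerically. In the toy Python snippet below (the item means are hypothetical, and this is a sketch of the constraint's logic rather than of lavaan's estimator), intercepts defined as deviations of item means from the grand mean sum to zero by construction, and their mean square equals the variability of item means around the grand mean that the absolute-error terms of the D coefficients must capture:

```python
import numpy as np

# Hypothetical item means for a 4-item scale (illustrative values only).
item_means = np.array([2.8, 3.1, 3.4, 2.7])
grand_mean = item_means.mean()

# Effect-coded intercepts: deviations of item means from the grand mean.
# They sum to zero by construction, mirroring the SEM constraint that
# item intercepts sum to zero while loadings average one.
intercepts = item_means - grand_mean

# The mean squared intercept equals the variance of the item means
# around the grand mean -- the between-item variability needed for
# absolute-error terms in the D coefficients.
var_item_means = np.mean(intercepts ** 2)
```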

Two-Facet GT Designs, Key Formulas, and Related SEMs
Basic concepts. To create a two-facet GT design, we will include occasions as an additional facet to produce a persons × items × occasions (pio) random-effects design. Within this design, each person responds to all items on all occasions. The partitioning of estimated variance at individual item and item-mean score levels is shown in Equations (7) and (8).
pio design, individual item score level:

σ̂²(Y_pio) = σ̂²_p + σ̂²_i + σ̂²_o + σ̂²_pi + σ̂²_po + σ̂²_io + σ̂²_pio,e (7)

pIO design, item-mean score level:

σ̂²(Y_pIO) = σ̂²_p + σ̂²_pi/n′_i + σ̂²_po/n′_o + σ̂²_pio,e/(n′_i × n′_o) (8)

where σ̂² = estimated variance component, Y_pio = a score for a particular person on a given combination of item and occasion, Y_pIO = the mean across all items and occasions for a particular person, n′_i = number of items, and n′_o = number of occasions. Equation (7) reveals that the estimated variance in observed scores across persons, items, and occasions (σ̂²(Y_pio)) is now partitioned into seven additive components representing persons (σ̂²_p), inter-person differences in item and/or occasion scores (σ̂²_pi, σ̂²_po, σ̂²_pio,e), and differences across items, occasions, and their combinations (σ̂²_i, σ̂²_o, σ̂²_io). When partitioning variance at the item-mean level in Equation (8), variance components for absolute differences again drop out because the mean for scores averaged over items and occasions is now a constant across persons.
Indices of score consistency and agreement. Formulas for G, global D, and cut-score specific D coefficients for the pIO design are provided in Equations (9)-(11).
Ĝ coefficient for pIO design = σ̂²_p / (σ̂²_p + σ̂²_pi/n′_i + σ̂²_po/n′_o + σ̂²_pio,e/(n′_i × n′_o)) (9)

Global D̂ coefficient for pIO design = σ̂²_p / (σ̂²_p + σ̂²_i/n′_i + σ̂²_o/n′_o + σ̂²_io/(n′_i × n′_o) + σ̂²_pi/n′_i + σ̂²_po/n′_o + σ̂²_pio,e/(n′_i × n′_o)) (10)

Cut-score specific D̂ coefficient for pIO design = (σ̂²_p + (X̄ − λ)² − σ̂²(X̄)) / (σ̂²_p + (X̄ − λ)² − σ̂²(X̄) + σ̂²_i/n′_i + σ̂²_o/n′_o + σ̂²_io/(n′_i × n′_o) + σ̂²_pi/n′_i + σ̂²_po/n′_o + σ̂²_pio,e/(n′_i × n′_o)) (11)

where λ = the cut-score, X̄ = the grand mean of observed scores, and σ̂²(X̄) is the estimated error variance of the grand mean and corrects for bias (see [35]).
Measurement error within Equation (9) for the G coefficient is now subdivided into three sources. σ̂²_pi/n′_i reflects inter-person differences in the ordering of item scores (specific-factor error), σ̂²_po/n′_o reflects inter-person differences in the ordering of occasion scores (transient error), and σ̂²_pio,e/(n′_i × n′_o) reflects inter-person differences in within-occasion "noise" (random-response error; see [24,[38][39][40] for more extended discussions of these sources of measurement error). The global and cut-score specific D coefficients in Equations (10) and (11) also include three additional estimated variance components (σ̂²_i, σ̂²_o, σ̂²_io) to account for absolute differences among item and occasion scores. Values within parentheses for G and global D coefficients in Equations (9) and (10) again represent estimates of relative and absolute error. As before, when means across facet conditions are equal (i.e., σ̂²_i = σ̂²_o = σ̂²_io = 0), relative and absolute error will coincide, as will G and global D coefficients.

SEM representation. An SEM for the pio GT design based on administration of the same three items on two occasions is shown at the bottom of Figure 1. This model has orthogonal factors for person, each item, and each occasion. Occasion variances, item variances, and uniquenesses are, respectively, set equal, and all factor loadings are set equal to one. In total, the four variance components reflecting relative differences in scores (σ̂²_p, σ̂²_pi, σ̂²_po, σ̂²_pio,e) are directly estimated, but those for absolute differences (σ̂²_i, σ̂²_o, σ̂²_io) are not. To estimate the remaining components, effect coding constraints are again imposed along with setting the sum of all item factor means and the sum of all occasion factor means equal to zero. When these restrictions are imposed, Jorgensen [14] noted that variance components for absolute differences in scores can be obtained using Equations (12)-(14).
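Paralleling the single-facet sketch, the Python function below (again an illustrative interface of our own) recovers all seven pio-design variance components from a persons × items × occasions array via the expected-mean-square solutions for a fully crossed random-effects ANOVA and then forms the G and global D coefficients of Equations (9) and (10):

```python
import numpy as np

def gt_pio(Y, n_i_p=None, n_o_p=None):
    """Estimate pio-design GT variance components via ANOVA expected mean
    squares and form G and global D coefficients for the pIO design.
    Y is an (n_p, n_i, n_o) persons x items x occasions array."""
    n_p, n_i, n_o = Y.shape
    n_i_p = n_i if n_i_p is None else n_i_p   # n'_i for the D study
    n_o_p = n_o if n_o_p is None else n_o_p   # n'_o for the D study
    g = Y.mean()
    mp = Y.mean(axis=(1, 2)); mi = Y.mean(axis=(0, 2)); mo = Y.mean(axis=(0, 1))
    mpi = Y.mean(axis=2); mpo = Y.mean(axis=1); mio = Y.mean(axis=0)
    # Mean squares for main effects, two-way interactions, and the residual
    ms_p = n_i * n_o * np.sum((mp - g) ** 2) / (n_p - 1)
    ms_i = n_p * n_o * np.sum((mi - g) ** 2) / (n_i - 1)
    ms_o = n_p * n_i * np.sum((mo - g) ** 2) / (n_o - 1)
    ms_pi = n_o * np.sum((mpi - mp[:, None] - mi[None, :] + g) ** 2) / ((n_p - 1) * (n_i - 1))
    ms_po = n_i * np.sum((mpo - mp[:, None] - mo[None, :] + g) ** 2) / ((n_p - 1) * (n_o - 1))
    ms_io = n_p * np.sum((mio - mi[:, None] - mo[None, :] + g) ** 2) / ((n_i - 1) * (n_o - 1))
    resid = (Y - mpi[:, :, None] - mpo[:, None, :] - mio[None, :, :]
             + mp[:, None, None] + mi[None, :, None] + mo[None, None, :] - g)
    ms_res = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1) * (n_o - 1))
    # Expected-mean-square solutions for the seven variance components
    v = {"pio,e": ms_res,
         "pi": (ms_pi - ms_res) / n_o,
         "po": (ms_po - ms_res) / n_i,
         "io": (ms_io - ms_res) / n_p,
         "p": (ms_p - ms_pi - ms_po + ms_res) / (n_i * n_o),
         "i": (ms_i - ms_pi - ms_io + ms_res) / (n_p * n_o),
         "o": (ms_o - ms_po - ms_io + ms_res) / (n_p * n_i)}
    rel = v["pi"] / n_i_p + v["po"] / n_o_p + v["pio,e"] / (n_i_p * n_o_p)
    ab = rel + v["i"] / n_i_p + v["o"] / n_o_p + v["io"] / (n_i_p * n_o_p)
    return {"components": v, "G": v["p"] / (v["p"] + rel), "D": v["p"] / (v["p"] + ab)}
```

Passing smaller or larger values of n′_i and n′_o than those in the generalizability study mimics the decision-study alterations described earlier.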
Scale coarseness. When considering results from GT analyses, we routinely assume that data are interval level in nature, meaning that equal differences in observed scores represent equal differences in the constructs being measured [1]. However, this assumption is not strictly met with most self-report measures due to scale coarseness effects resulting from limited numbers of response options and/or unequal underlying intervals between those options. To address this problem, Ark [13], Jorgensen [14], and Vispoel et al. [22] described how to conduct GT analyses on continuous latent response variable (CLRV) metrics using SEMs. These GT-SEMs are identical to those for observed scores described here except that estimation methods such as diagonally weighted least squares (DWLS) or pairwise maximum likelihood (PML) can be used to convert observed score results to CLRV metrics.
Vispoel et al. [22] used DWLS estimation for SEMs with delta parameterization to derive G and D coefficients for pi and pio designs with scores for indicators expressed on the same standardized scales. However, Jorgensen [14] noted that variability in means for latent variable indicators could be modeled to provide more informative D coefficients in CLRV analyses by using theta parameterization and constraining thresholds to be the same across all indicators. As a result, we adopted his approach when analyzing designs for CLRVs.
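The practical impact of scale coarseness can be previewed with a small Python simulation. This is a deliberately simplified sketch assuming a one-factor model with equal loadings (not a CLRV estimation procedure): the same continuous responses are scored both on their original metric and after dichotomization, and the pi-design G coefficient (alpha) is computed on each.

```python
import numpy as np

rng = np.random.default_rng(1)
n_p, n_i = 1000, 10

# Continuous "latent" responses from a one-factor model: a common person
# effect plus independent item-level noise (unit variances are arbitrary).
person = rng.normal(size=(n_p, 1))
latent = person + rng.normal(size=(n_p, n_i))

def alpha(Y):
    """Cronbach's alpha, equivalent to the pI-design G coefficient."""
    k = Y.shape[1]
    return (k / (k - 1)) * (1 - Y.var(axis=0, ddof=1).sum()
                            / Y.sum(axis=1).var(ddof=1))

alpha_cont = alpha(latent)                       # original continuous metric
alpha_dich = alpha((latent > 0).astype(float))   # coarse dichotomous scoring
```

Reliability on the dichotomized metric falls below that for the original continuous responses, which is the information loss that CLRV analyses attempt to quantify and that motivates comparing results on both metrics.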

This Investigation
Our goal in the research reported here was to expand upon Jorgensen's [14] preliminary demonstration of ways to represent complete GT designs in SEM frameworks to encompass empirical data collected from respondents in live assessment settings who completed popular measures of personality, self-concept, and socially desirable responding on two occasions. We compared results from GT-SEM analyses using the lavaan package in R [41] to those obtained using the ANOVA-based package GENOVA [2], which remains one of the most comprehensive programs available for conducting traditional GT analyses (see, e.g., [20]). We further compared results from both packages to those obtained from lavaan using diagonally weighted least squares estimation to evaluate effects of scale coarseness and included Monte Carlo-based confidence intervals for variance components, proportions of measurement error, G coefficients, global D coefficients, and selected cut-score specific D coefficients within both the observed score and CLRV analyses.

Participants and Procedure
We collected data from three separate samples of college students from the University of Iowa who completed self-report inventories online using the Qualtrics platform on two occasions a week apart. Data collection was approved by the governing institutional review board (ID# 200809738) and all respondents gave informed consent before participating. Students within their respective samples completed all subscales from either the 100-item International Personality Item Pool Big Five Model Questionnaire (IPIP-BFM-100 [42]; n = 359, 69.58% female, 72.70% Caucasian; mean age = 23.80), Self-Description Questionnaire III (SDQ-III [43]; n = 427, 70.02% female, 78.69% Caucasian; mean age = 23.20), or Balanced Inventory of Desirable Responding (BIDR [44]; n = 595, 76.47% female, 77.31% Caucasian; mean age = 22.46). Inquiries about accessibility to the data should be directed to the first author.

Measures
IPIP-BFM-100. The IPIP-BFM-100 includes 100 items designed to measure the broad personality constructs associated with the Big Five model: Agreeableness, Conscientiousness, Emotional Stability, Extraversion, and Openness [42,45,46]. Each subscale has 20 items answered using a 5-point response metric (1 = Very Inaccurate, 5 = Very Accurate). Goldberg [42] reported alpha reliability estimates for the subscales ranging from 0.81 to 0.97 and exploratory factor analyses for self and peer ratings supporting the anticipated five-factor structure underlying item responses.
SDQ-III. The SDQ-III is a 136-item questionnaire intended for use with late adolescents and adults. It includes one subscale to measure overall self-esteem (General Self) and 12 additional ones to measure self-perceptions in the following areas: Emotional Stability, Honesty-Trustworthiness, Religious-Spiritual Values, Opposite-Sex Relations, Same-Sex Relations, Parental Relations, Physical Appearance, Physical Ability, General Academic Ability, Verbal Skills, Math Skills, and Problem-Solving Skills. Each scale has 10 or 12 items, equally balanced for negative and positive phrasing, rated along an eight-point response metric (1 = Definitely False, 8 = Definitely True). Evidence reported by Marsh [43] and Byrne [47] in support of the reliability and construct validity of SDQ-III subscale scores included alpha coefficients ranging from 0.76 to 0.95, median 1-month and 18-month test-retest coefficients, respectively, equaling 0.87 and 0.74, factor analyses verifying that each subscale measures a distinguishable construct, and logically consistent relationships of subscale scores with each other and external criterion measures.
BIDR. The BIDR (Version 6, [44]) has 40 items comprising two 20-item subscales that measure two dimensions of socially desirable responding: Impression Management and Self-Deceptive Enhancement. Items within each scale are equally balanced for positive and negative phrasing and rated along a 7-point response metric (1 = Not True, 4 = Somewhat True, 7 = Very True). After reversals are made to negatively phrased items, scores can remain on the original polytomous metric, be dichotomized to emphasize exaggeratedly desirable responses by rescoring extremely high responses (6 or 7) to equal 1 and other responses to equal 0, or be dichotomized to emphasize exaggeratedly undesirable responses by rescoring extremely low responses (1 or 2) to equal 0 and other responses to equal 1 [48][49][50]. We included all three approaches here. In practical settings, polytomous scores are more informative when Impression Management and Self-Deceptive Enhancement are treated as psychological traits, whereas dichotomized scores are frequently used to flag possible instances of faking good (i.e., exaggerated endorsement) or faking bad (i.e., exaggerated denial). Paulhus [44], Kilinc [48], Vispoel and Kim [49], Vispoel, Morris, and Clough [50], and Vispoel, Morris, and Sun [51] reported alpha coefficients for BIDR subscales ranging from 0.66 to 0.88, 1-week test-retest coefficients ranging from 0.71 to 0.88, and confirmation of anticipated patterns of convergent and discriminant validity coefficients with other measures.

Analyses
We analyzed fully crossed pi and pio random-effects GT designs for every subscale from each instrument using the ANOVA-based package GENOVA [2] and the SEM-based lavaan package in R [41] with conventional least squares parameter estimates (labeled as expected mean squares in GENOVA and unweighted least squares (ULS) in lavaan). We repeated the lavaan analyses with robust diagonally weighted least squares (DWLS) estimates (WLSMV in R) using theta parameterization to evaluate the effects of scale coarseness and provide more informative D coefficients than those provided by delta parameterization. All SEMs were constrained in the ways described earlier to render indices reflecting both relative and absolute differences in scores. For each scale, design, and analysis, we estimated G coefficients, global D coefficients, cut-score specific D coefficients two standard deviations away from each scale's mean, and proportions of universe score and measurement error variance. We also derived 90% Monte Carlo-based confidence intervals for variance components, proportions of measurement error, G coefficients, global D coefficients, and cut-score specific D coefficients using the semTools package in R [26]. Within the reported GT analyses, n′_i equals the number of items within a given subscale, and n′_o equals one.

Descriptive Statistics and Conventional Reliability Estimates
We provide means, standard deviations, alpha coefficients, and test-retest coefficients for all subscale scores from the IPIP-BFM-100, SDQ-III, and BIDR in Table 1.

In Table 2, we provide variance components, G coefficients, global D coefficients, cut-score specific D coefficients two standard deviations away from the scale mean, and corresponding 90% confidence intervals for IPIP-BFM-100, SDQ-III, and BIDR observed scores within the GT pi design analyses. Across subscales, lavaan and GENOVA results for G and D coefficients are identical to the three decimal places shown in the table, and variance components differ by no more than 0.005. Confidence intervals for all variance components fail to capture zero, thereby reflecting trustworthy effects. G coefficients (which mirror the alpha coefficients reported in Table 1 for Occasion 1) range from 0.691 to 0.952 (M = 0.858), global D coefficients from 0.672 to 0.945 (M = 0.834), and cut-score specific D coefficients from 0.932 to 0.989 (M = 0.962). Confidence interval lower limits for G coefficients equal or exceed 0.830 in all instances except for subscales from the BIDR and the Honesty-Trustworthiness subscale from the SDQ-III. Lower limits for global D coefficients equal or exceed 0.804 except for subscales from the BIDR and the Honesty-Trustworthiness and Problem-Solving Skills subscales from the SDQ-III. Finally, lower limits for cut-score specific D coefficients two standard deviations away from the mean equal or exceed 0.915 for all scales across all instruments.
In Tables 3 and 4, we provide parallel indices for the GT pio designs plus additional variance components and partitioning of measurement error into three sources (specific-factor, transient, and random-response). Across subscales, lavaan and GENOVA results for G coefficients and proportions of measurement error are identical to the three decimal places shown in the tables; D coefficients differ by no more than 0.002; and variance components differ by no more than 0.013. Confidence intervals for all variance components and proportions of measurement error fail to capture zero except for o variance components for most subscales across instruments, io variance components for the SDQ-III Problem-Solving Skills subscale and BIDR Self-Deceptive Enhancement dichotomous subscales, and both po

Note (Table 2). p = person, pi,e = person × item and other error, i = item, G = G coefficient, G-D = global D coefficient, CS-D = cut-score specific D coefficient.

Note (Tables 3 and 4). G = G coefficient, SFE = specific-factor error, TE = transient error, RRE = random-response error, G-D = global D coefficient, CS-D = cut-score specific D coefficient. Table entries represent results obtained from lavaan in R. Values within parentheses are 90% confidence interval limits. Differences with GENOVA are indicated by superscripts: a The lavaan result is 0.001 lower than in GENOVA; b The lavaan result is 0.002 lower than in GENOVA.

Partitioning of Variance, G coefficients, and D coefficients on CLRV Metrics
In Tables 5-7, we provide the same indices for CLRVs as those reported in Tables 2-4 for observed scores within the pi and pio designs based on WLSMV estimates from lavaan. For the pi design results within Table 5, G coefficients range from 0.756 to 0.976 (M = 0.909), global D coefficients from 0.726 to 0.969 (M = 0.886), and cut-score specific D coefficients from 0.943 to 0.994 (M = 0.976). In all instances, score consistency and agreement indices as well as their corresponding confidence interval lower limits exceed those from the observed score analyses. As was the case with observed scores, confidence intervals for all CLRV variance components fail to capture zero, again underscoring trustworthy effects. Minimum confidence interval lower limits for G, global D, and cut-score specific D coefficients for CLRVs, respectively, equal 0.733, 0.700, and 0.937 as compared to 0.691, 0.588, and 0.915 for observed scores.
For the CLRV pio design results in Table 7, differences in lower confidence interval limits between CLRVs and observed scores for G, global D, and cut-score specific D coefficients vary with subscale. Across the 24 subscales, CLRV lower confidence interval limits are greater than or equal to those for observed scores in 10 instances for G coefficients, 10 instances for global D coefficients, and 14 instances for cut-score specific D coefficients. Minimum lower limits for G, global D, and cut-score specific D coefficients, respectively, equal 0.637, 0.614, and 0.917 for CLRVs versus 0.436, 0.411, and 0.874 for observed scores.
In Table 8, we report differences between WLSMV and ULS SEMs in G coefficients, global D coefficients, cut-score specific D coefficients, and proportions of measurement error for all designs to further evaluate effects of scale coarseness. For the pi designs, differences between WLSMV and ULS G and D coefficients are greater for dichotomously scored BIDR scales than for those with five to eight response options, with differences across all scales being noticeably greater for G and global D coefficients than for D coefficients representing cut-scores two standard deviations away from the scale mean. However, even for scales with five to eight options, G and global D coefficients are uniformly higher for WLSMV than for ULS with differences ranging from 0.014 to 0.087 (M = 0.034) for G coefficients and from 0.014 to 0.095 (M = 0.036) for global D coefficients.
In the pio designs, differences between WLSMV and ULS in G and global D coefficients are lower than those in the pi designs and again markedly higher for dichotomous scales than for polytomous scales. For the polytomous scales, differences in G and global D coefficients on average are generally quite small (Ms = 0.010 and 0.011), with the largest being for the Honesty-Trustworthiness subscale from the SDQ-III. The general pattern of differences in relative proportions of variance between WLSMV and ULS within each inventory is for relative proportions of universe score and transient error to increase and relative proportions of specific-factor and random-response error to decrease.

Table 5. Variance Components, G coefficients, and D coefficients for GT CLRV pi Designs.

Note. CLRV = continuous latent response variable, WLSMV = robust diagonally weighted least squares, ULS = unweighted least squares, G = G coefficient, G-D = global D coefficient, CS-D = cut-score specific D coefficient, SFE = specific-factor error, TE = transient error, RRE = random-response error.

Overview
Our goal in the present study was to illustrate recently developed techniques for expanding GT analyses within SEM frameworks more thoroughly than in previous studies by including measures assessing a broad range of psychological traits taken in live assessment settings, deriving indices relevant to both norm- and criterion-referenced interpretations of scores, constructing confidence intervals for a variety of key GT indices, and taking the effects of scale coarseness into account. In doing so, we analyzed results for 24 scales from widely administered inventories assessing multiple dimensions of personality, self-concept, and socially desirable responding with item scale metrics having from two to eight response categories. When considered collectively, the present results highlight the effectiveness of SEMs in replicating results from ANOVA models, the importance of taking multiple sources of measurement error into account, the value of Jorgensen's procedures for deriving GT-based dependability coefficients and Monte Carlo-based confidence intervals for key indices, and the benefits of conducting GT analyses on both observed score and CLRV metrics to gauge scale coarseness effects.

Sources of Measurement Error
Across the analyses for observed scores, mean proportions of explained variance for specific-factor, transient, and random-response error equaled 0.060, 0.076, and 0.069, respectively. These findings are highly consistent with those from previous studies in underscoring how reliability is routinely overestimated in single-occasion research studies that fail to take all relevant sources of measurement error into account [14,16,18–24,32–34,38,39,52–55]. The omission of such effects within single-occasion studies, in turn, can lead to substantial overestimation of reliability and corresponding underestimation of relations between latent constructs when those indices are used to correct correlation coefficients for measurement error (see, e.g., [15,32,38,54]). Such findings emphasize the inherent limitations of single-occasion studies and the importance of using multi-occasion data to better represent the reliability and validity of scores from measures of psychological traits.
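The correction just described is the classical disattenuation formula: an observed correlation is divided by the square root of the product of the two scores' reliabilities, so an inflated single-occasion reliability estimate yields an understated corrected correlation. The sketch below illustrates this arithmetic in Python with hypothetical numbers (an observed correlation of 0.40, a single-occasion estimate of 0.88, and a multi-occasion G coefficient of 0.78); the values are invented for illustration only.

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for measurement error in both scores."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical values: a single-occasion coefficient (0.88) omits transient
# error, whereas a multi-occasion G coefficient (0.78) accounts for it.
print(round(disattenuate(0.40, 0.88, 0.88), 3))  # 0.455
print(round(disattenuate(0.40, 0.78, 0.78), 3))  # 0.513
```

Using the smaller multi-occasion coefficient yields a noticeably larger corrected correlation, which is why overestimated reliability translates into underestimated relations between latent constructs.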

Dependability Coefficients
Applications of SEMs to derive variance components for persons, sources of relative measurement error, and corresponding G coefficients in the published research literature date back to Marcoulides [10]. In their follow-up to that study, Raykov and Marcoulides [11] cited an unpublished paper by Marcoulides [56] that alluded to using SEMs to derive variance components for absolute error. However, they provided no further details about the procedures. Ark [13] later speculated that a Q method could be used to derive each variance component for absolute error separately but acknowledged that this procedure would be cumbersome and of limited utility. More recently, Jorgensen [14] used a small set of contrived data to demonstrate simpler methods for deriving variance components for absolute error using indices embedded within the same SEMs for one- and two-facet GT designs used by Marcoulides, Raykov, and other researchers (see, e.g., [20–22]).
When we applied Jorgensen's [14] procedures here to data obtained from three separate samples of respondents in live settings, results for G coefficients, global D coefficients, cut-score specific D coefficients, and proportions of measurement error within the pi and pio designs varied by no more than 0.002 from those obtained from the ANOVA-based GT package GENOVA. These results, coupled with those from Jorgensen's original study, confirm that SEMs provide a viable option for doing complete GT analyses while offering additional benefits that traditional ANOVA-based analyses rarely provide.
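For readers less familiar with how these indices are assembled from variance components, the sketch below computes a G coefficient, a global D (Φ) coefficient, and a cut-score specific D coefficient for a two-facet p × i × o design using standard GT formulas. All variance components, the numbers of items and occasions, and the scale mean and cut-score are hypothetical, and the cut-score formula omits the small-sample correction to (μ − λ)² that is sometimes applied; it is an illustration of the arithmetic, not the SEM estimation itself.

```python
# Hypothetical variance components for a p x i x o design
vc = {"p": 0.30, "i": 0.05, "o": 0.01,
      "pi": 0.08, "po": 0.02, "io": 0.01, "pio": 0.20}
n_i, n_o = 10, 2   # numbers of items and occasions generalized over

# Relative error: interaction effects involving persons
rel_err = vc["pi"] / n_i + vc["po"] / n_o + vc["pio"] / (n_i * n_o)

# Absolute error adds main effects and interactions not involving persons
abs_err = rel_err + vc["i"] / n_i + vc["o"] / n_o + vc["io"] / (n_i * n_o)

g_coef = vc["p"] / (vc["p"] + rel_err)   # generalizability (G) coefficient
phi = vc["p"] / (vc["p"] + abs_err)      # global D (dependability) coefficient

mean, cut = 3.5, 2.5                     # hypothetical scale mean and cut-score
dev_sq = (mean - cut) ** 2
phi_cut = (vc["p"] + dev_sq) / (vc["p"] + dev_sq + abs_err)

print(round(g_coef, 3), round(phi, 3), round(phi_cut, 3))  # 0.915 0.886 0.971
```

As in our results, the cut-score specific coefficient exceeds both the G and global D coefficients because the squared distance between the mean and the cut-score augments both numerator and denominator.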

Confidence Intervals for Parameters within GT Analyses
One such benefit demonstrated here was to derive Monte Carlo-based confidence intervals for all reported variance components, proportions of measurement error, and indices of score consistency and agreement. Cronbach et al. [57] and others (see, e.g., [30]) have long emphasized the importance of gauging sampling variability in GT parameter estimates, but methods to do so are unavailable within most ANOVA-based packages or limited to procedures based on restrictive assumptions. In contrast, the semTools package in R [26] readily allows for the derivation of more widely applicable Monte Carlo-based intervals at any desired level of confidence for all GT indices considered here simply by adding a few lines of code to link commands within semTools to lavaan (see our online Supplementary Materials).
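Although semTools operates on the fitted lavaan model directly, the underlying Monte Carlo logic is simple: draw parameter vectors from a multivariate normal distribution centered at the point estimates with the parameters' asymptotic covariance matrix, recompute the derived index for each draw, and take percentiles of the resulting distribution. The stripped-down Python sketch below illustrates that logic for a G coefficient; the two variance components and their covariance matrix are hypothetical stand-ins for values a fitted SEM would supply.

```python
import numpy as np

rng = np.random.default_rng(2023)

# Hypothetical point estimates and asymptotic covariance matrix for the
# person and relative-error variance components from a fitted model
est = np.array([0.30, 0.12])
acov = np.array([[0.0010, 0.0001],
                 [0.0001, 0.0008]])

draws = rng.multivariate_normal(est, acov, size=20_000)
draws = draws[(draws > 0).all(axis=1)]   # drop inadmissible (negative) draws

g = draws[:, 0] / (draws[:, 0] + draws[:, 1])  # G coefficient for each draw
lower, upper = np.percentile(g, [5, 95])       # 90% Monte Carlo interval
print(round(lower, 3), round(upper, 3))
```

Because the interval is built from the simulated distribution of the derived index itself, it requires no closed-form standard error for the G coefficient and no distributional assumptions beyond those of the parameter estimates.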
Although hypothesis testing is not part of traditional GT analyses, confidence intervals for variance components can serve a similar function when evaluating effects for persons, sources of measurement error, and differences in absolute levels of scores by noting whether zero or other targeted values fall within the limits of the interval. Our 90% confidence intervals for variance components often captured zero for occasion effects and sometimes for item-by-occasion interaction and transient error effects, whereas G and D coefficients had interval limits no lower than 0.411 across all scales, though some scales clearly yielded much more reliable results than others. On the observed score metric within the pio designs, confidence interval lower limits for both G and global D coefficients exceeded 0.80 for most subscales from the IPIP-BFM-100 (4 out of 5) and SDQ-III (9 out of 13), but not for either subscale from the BIDR across scoring methods. Overall, these results make sense because the psychological traits we assessed were expected to remain stable over the one-week interval between administrations, whereas item means, universe scores, and measurement error effects were expected to vary among respondents as well as within and across scales.

Effects of Scale Coarseness
Another unique benefit of GT-SEMs illustrated here was to use WLSMV estimates in lavaan to transform binary- and ordinal-level observed scores to CLRV metrics. Although we do not advocate substituting CLRV indices directly for those representing observed scores, we find such indices useful in gauging the effects of scale coarseness on reliability and in disattenuating correlation coefficients simultaneously for measurement error and scale coarseness. Because differences between observed scores and CLRVs in consistency and agreement should diminish as scale options increase, indices for CLRVs can serve as upper bounds for improvements that might be gained by increasing response options [14,22]. In essence, doing GT analyses on CLRV metrics serves a function similar to changing n′ values within G and D coefficient formulas by informing ways that assessment procedures might be improved.
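The attenuating effect of coarse response scales is easy to demonstrate by simulation. The sketch below is not the WLSMV procedure itself, only an illustration of the principle: two parallel continuous latent response variables correlating 0.80 are discretized into 2, 5, or 8 equiprobable categories, and the Pearson correlation between the discretized scores recovers more of the latent correlation as the number of categories grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Two parallel continuous latent response variables correlating 0.80
latent = rng.multivariate_normal([0.0, 0.0],
                                 [[1.0, 0.8], [0.8, 1.0]], size=n)

results = {}
for k in (2, 5, 8):   # number of response categories
    # Equiprobable thresholds applied to both variables
    cuts = np.quantile(latent[:, 0], np.linspace(0, 1, k + 1)[1:-1])
    obs_a = np.digitize(latent[:, 0], cuts)
    obs_b = np.digitize(latent[:, 1], cuts)
    results[k] = np.corrcoef(obs_a, obs_b)[0, 1]
    print(k, round(results[k], 3))
```

With a median split the correlation falls to roughly (2/π)·arcsin(0.8) ≈ 0.59, whereas with eight categories it approaches the latent value, mirroring why CLRV-metric indices serve as upper bounds on gains achievable by adding response options.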
To evaluate possible scale coarseness effects here, we intentionally included measures that varied in number of response categories. As expected, differences between WLSMV- and ULS-based G and global D coefficients were more pronounced for the dichotomous BIDR scales than for scales that included five to eight options. In general, these results support use of polytomous BIDR scores when measuring individual differences in Impression Management and Self-Deceptive Enhancement but do not preclude use of dichotomous scores for detecting faking (see, e.g., [48,58,59]). However, even with scales having five or more response options, we observed noticeable differences in score consistency and agreement between WLSMV and ULS estimation in some instances.
One such instance that encompassed nearly all subscales with five or more response options was within the pi designs, in which mean differences between WLSMV and ULS equaled 0.034 for G coefficients and 0.036 for global D coefficients. These results are likely due in part to confounding of universe score and transient error variance in the pi designs. For example, within the corresponding pio designs that separate out transient error effects, mean differences between WLSMV and ULS dropped to 0.010 for G coefficients and 0.011 for global D coefficients. The greatest differences for polytomous scales were for Honesty-Trustworthiness, which had the lowest standard deviations, alpha coefficients, and test-retest coefficients among SDQ-III scales.
Another factor that may be responsible for some differences between WLSMV and ULS for multi-option scales was restricting factor loadings and uniquenesses to be the same across items within the pi and pio GT-SEM designs. In recent studies of GT in which models with equal and varying unstandardized factor loadings and/or uniquenesses have been compared (i.e., congeneric versus essential tau-equivalent relationships), reliability is typically higher for the less restricted models, which in turn may further reduce differences between WLSMV- and ULS-based indices of consistency and agreement (see, e.g., [19,23,60]). In comparison to G and global D coefficients, cut-score specific D coefficients two standard deviations away from the scale mean were higher and varied much less across estimation methods. These results underscore that classification decisions made from extreme cut-scores can be highly congruent even when overall score consistency is relatively low and dichotomous score distributions are highly skewed.

Additional Advantages of GT-SEMs and Further Research
Although not demonstrated explicitly here, SEMs have additional benefits over traditional ANOVA models that merit comment and future exploration. One recent extension just mentioned is to model congeneric rather than essential tau-equivalent relationships between indicators and underlying factors. As is the case when comparing conventional single-occasion omega to alpha coefficients that, respectively, reflect congeneric versus essential tau-equivalent relationships [61,62], G coefficients are generally higher for congeneric than for essential tau-equivalent factor models (see, e.g., [18,23,60]). The offsetting drawback to modeling congeneric relationships within GT designs is that generalization is restricted to items and occasions sharing the same characteristics as those sampled, in contrast to the broader domains from which they were drawn [61].
Vispoel, Xu, and Schneider [60] further showed that, despite differences in the labeling of indices, GT congeneric and latent state-trait orthogonal method models [63,64] are equivalent under certain conditions, thereby providing a useful bridge between the two theories. Researchers also have demonstrated that models within both frameworks can be extended to account for variance due to item phrasing effects (negative, positive, or both) and allow for partitioning of variance at both total score and individual item levels (see, e.g., [16,17,60]). However, a limitation of the GT congeneric models used in these studies was that they were not configured to yield global and cut-score specific D coefficients, thereby highlighting an important area for further investigation.
Other noteworthy recent extensions of GT-SEMs are to represent multivariate [16,54,55,65] and bifactor designs [16,18,19,65]. When analyzing multivariate GT designs, variance components for individual subscale scores are the same as those obtained from univariate designs, but those for composite scores are functions of the variances of subscale scores that comprise them, the covariances among those subscale scores, and the weighting of each subscale in forming the composite (see, e.g., [16,30,54,55,57]). Multivariate GT designs are useful in providing a clearer mapping of content within the global domain represented by subscale and item scores and in producing more appropriate and typically higher indices of score consistency and agreement for composite scores than would a direct univariate analysis of composite scores that ignores subscale representation and interrelationships [18,54,55,65]. Multivariate GT designs would be analyzed as SEMs with individual subscales represented in the same ways as in univariate designs but allowing person and measurement error factors (when appropriate) to covary across subscales (see [16,22,54,55,65]). Multivariate GT designs also allow for calculation of correlation coefficients between pairs of subscale scores corrected for all associated sources of measurement error (see, e.g., [54,55]).
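To make the composite-score point concrete, the brief sketch below computes a composite G coefficient from a hypothetical universe-score covariance matrix for three subscales, subscale relative-error variances (assumed uncorrelated across subscales, hence a diagonal matrix), and equal weights. All numbers are invented for illustration; the quadratic-form structure is the point.

```python
import numpy as np

# Hypothetical universe-score covariance matrix for three subscale scores
sigma_p = np.array([[0.40, 0.15, 0.10],
                    [0.15, 0.35, 0.12],
                    [0.10, 0.12, 0.30]])

# Relative-error variances for each subscale score; errors are assumed
# uncorrelated across subscales, so the error matrix is diagonal
sigma_e = np.diag([0.10, 0.12, 0.08])

w = np.array([1 / 3, 1 / 3, 1 / 3])   # equal subscale weights

true_var = w @ sigma_p @ w            # composite universe-score variance
err_var = w @ sigma_e @ w             # composite relative-error variance
g_composite = true_var / (true_var + err_var)
print(round(g_composite, 3))          # 0.856
```

Note that the composite coefficient (0.856) exceeds the G coefficients of the individual subscales in this example (e.g., 0.40/0.50 = 0.80 for the first), because the positive covariances among subscale universe scores add to composite true-score variance while the uncorrelated errors do not, consistent with the typically higher composite indices noted above.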
Bifactor GT designs bear similarities to both univariate and multivariate GT designs but further partition universe score variance into general factor effects representing variance common to all items across subscales and independent group factor effects representing additional systematic variance specific to each subscale. Score consistency indices in bifactor models are expanded beyond those in univariate and multivariate GT designs to reflect proportions of variance accounted for by just general factor effects, just group factor effects, and general and group factor effects combined. Such partitioning provides a useful means of investigating score dimensionality and the value added when reporting subscale scores in addition to composite scores (see, e.g., [65–68]). Bifactor SEMs would include a general factor linked to all items and orthogonal group factors linked to items within each subscale plus additional factors for sources of measurement error in multi-facet GT designs (see [18,19,65] for further details).
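The partitioning just described reduces to simple ratios once variance components for general, group, and error effects are in hand. The sketch below uses hypothetical components for a single subscale score, with all factors orthogonal as in the bifactor designs cited above:

```python
# Hypothetical variance components for one subscale in a bifactor GT design
var_general = 0.25   # general factor shared by all items across subscales
var_group = 0.10     # group factor specific to this subscale
var_error = 0.15     # all measurement error sources combined

total = var_general + var_group + var_error
prop_general = var_general / total                  # general effects only
prop_group = var_group / total                      # group effects only
prop_universe = (var_general + var_group) / total   # combined universe score

print(round(prop_general, 3), round(prop_group, 3),
      round(prop_universe, 3))  # 0.5 0.2 0.7
```

Here, half of the observed variance reflects the general factor and a fifth reflects subscale-specific systematic variance, the kind of breakdown used to judge the value added by reporting subscale scores.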
Although research in representing GT multivariate and bifactor designs within SEM frameworks is very limited, the techniques illustrated here for traditional univariate GT designs can be extended to those designs to derive G, global D, and cut-score specific D coefficients at composite and subscale score levels; reference results to either observed score or CLRV metrics; and yield Monte Carlo-based confidence intervals for key parameters of interest. As with univariate congeneric SEMs, congeneric multivariate and bifactor designs can produce G coefficients, but methods for obtaining corresponding global and cut-score specific D coefficients remain an important area for further development.

Summary and Conclusions
The results reported here, coupled with those from previous studies, provide compelling evidence that SEMs can reproduce all key indices from GT ANOVA models for one- and two-facet designs while yielding Monte Carlo-based confidence intervals for those indices and referencing results to either observed score or CLRV metrics. Emerging research also suggests that the techniques described here for univariate GT-SEM analyses can be extended to multivariate and bifactor GT-SEMs to provide additional insights into the nature of assessment domains, create more appropriate indices of score consistency and agreement for composite scores, and further partition universe score variance into independent components reflecting general and group factor effects. To aid readers in applying the GT techniques demonstrated here, we provide examples of all illustrated SEM analyses using the lavaan [41] and semTools [26] packages in R and examples of all illustrated ANOVA analyses using the GENOVA package [2]. Additional guidelines for analyzing and applying these and more complex traditional univariate GT ANOVA-based designs are provided by Vispoel, Xu, and Schneider [24] using the gtheory package in R. Related guidelines and illustrations, using the lavaan package in R, are provided by Vispoel, Lee, and Hong [54,55,65] for analyzing multivariate GT-SEM designs and by Vispoel, Lee, et al. [18,19,65] for analyzing bifactor GT-SEM designs. We hope readers find these resources valuable in applying and extending GT-SEM procedures to their own data.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/psych5020019/s1, Table S1: Example Data Structure for GENOVA p × i design-Occasion 1; Table S2: Example Data Structure for GENOVA p × i × o design; Table S3: Variance Components, G coefficients, and D coefficients for GT pi Observed Score Design; Table S4: Variance Components, G coefficients, and D coefficients for GT pi CLRV Design; Table S5: G coefficients, D coefficients, and Partitioning of Variance for GT pio Observed Score Design;