Does Changing Scale Items’ Contexts Impact Its Psychometric Properties? A Comparison Using the PERMA-Profiler and the Workplace PERMA-Profiler

Abstract: The present study evaluated the empirical distinction between the PERMA-Profiler and the Workplace PERMA-Profiler, which measure flourishing using the same items with different contexts (i.e., general vs. workplace orientations). Both scales were administered online via MTurk (N = 601), and single-group measurement and structural invariances were assessed. Partial metric and scalar invariances were supported, indicating that the PERMA constructs were measured equivalently across scales (except for the relationships factor). Structural properties (covariances, means) were not invariant, indicating distinct utility for each scale in their respective contexts. The results suggest that simple adaptations to items to change their context, but not content, may retain the original scale’s psychometric properties and function with discrimination.


Introduction
Scales are often adapted for the context in which they are used, which may involve simply adjusting items to orient the respondent to the study's focal context. However, if a scale's items are adapted to be context-specific (e.g., moving from a general context to a workplace context), will its psychometric properties change? This study aimed to investigate this question by evaluating the measurement and structural equivalence of the PERMA-Profiler (PP) [1] and its workplace adaptation, the Workplace PERMA-Profiler (WPP) [2].
One conceptualization of wellbeing has been gaining traction in the last decade: Seligman's PERMA model of flourishing [3]. According to Seligman, flourishing is a higher-order wellbeing concept comprising positive emotions (P), engagement (E), relationships (R), meaning (M), and accomplishment (A). The PP [1], measuring one's general flourishing, and the WPP [2], measuring one's flourishing at work, have been developed to directly measure this type of flourishing. Although some argue that general wellbeing and work wellbeing are fundamentally different [4], the WPP is a simple adjustment to the items of the PP, making them workplace-specific (i.e., same construct, different context).
One reason the PP and the WPP were selected for the present study is that both versions have been previously validated and used in published research, which is rare for simple item context adjustments (but see Luthans et al.'s [5] and Lorenz et al.'s [6] psychological capital questionnaires for other examples). As indicated above, in many cases, when an original scale does not quite fit the context or population, slight adjustments to item wording are implemented (e.g., Culbertson et al.'s [7] workplace adaptation of Ryff's [8] psychological wellbeing scale). Additionally, the workplace adaptation has been translated and validated in other languages (e.g., Japanese [9] and Korean [10]), with studies identifying statistical distinctions between the WPP and PP (though without direct comparisons). Finally, more recent research has summarily evaluated the psychometric properties of the PP and WPP across a number of samples, concluding that there may be problems with the WPP's item wordings [11,12]. Such issues pose a problem for the use of the WPP in workplaces (its intended context), as leaders wanting to measure their employees' (or their own) flourishing at work may receive questionable data. As such, continued evaluation of the adaptation and statistical comparisons with its parent are necessary if this scale is to remain in use.
In some cases, contextual scale adaptations are used in studies without any statistical evaluation of their consistency with the original. For example, in order to measure happiness in children, Holder and Klassen [13] slightly reworded Lyubomirsky and Lepper's Subjective Happiness Scale [14] for a younger population (e.g., changing "To what extent does this characterization describe you?" to "How much does this sentence describe you?"). Although the change seems innocuous, it is important to note that Holder and Klassen's version of the scale had poor internal consistency (Cronbach's alpha = 0.67, which is lower than in adult applications of the scale [14,15]) [13]. Although the authors found poor reliability, and no work was done to either validate the adaptation or evaluate it for consistency with the original, this adaptation was used in subsequent studies (e.g., Layous et al. [16], who notably did not report any psychometric properties), which calls into question whether the findings themselves may be reliable. For instance, it may not be that Layous et al.'s kindness intervention was ineffective at improving happiness in preadolescents [16], but rather that the measure they used did not function appropriately. As such, contextual adaptations to a scale that do not necessitate content adjustments may have substantial consequences, not just for the measurement of a construct, but for the apparent efficacy of interventions.
In a systematic review of workplace interventions to reduce mental health stigma, Tóth et al. [17] found that general stigma assessments may not have been sensitive enough to detect changes in some interventions due to their lack of a workplace-context focus. Similarly, a meta-analysis of Acceptance and Commitment Therapy (ACT) interventions to reduce dysregulated eating behaviors (e.g., binge eating) found that general psychological flexibility scales (e.g., the Acceptance and Action Questionnaire) were associated with significantly smaller RCT effects compared to their weight-specific adaptations (e.g., the Acceptance and Action Questionnaire for Weight-Related Difficulties) [18]. In summary, it may be necessary for individuals applying interventions in a particular context (e.g., managers implementing a new intervention to improve their employees' wellbeing) to adapt a "general" scale for said context (e.g., the workplace) in order to know whether the intervention was effective. However, unless the effects of those adaptations on the scale's function and validity are appropriately evaluated and understood, the adaptations may introduce idiosyncrasies across implementations and ultimately produce unreliable results.
In the present study, I aimed to empirically illustrate what impacts changing the context or wording of a scale's items may have on its psychometric properties using two previously validated scales of flourishing. Because the PP and the WPP were both developed to measure the same constructs in different contexts, I expected their respective measurement components (e.g., item loadings) to be equivalent. The structural components (e.g., inter-factor covariances, like-factor means), however, may or may not differ by context (e.g., a sample's level of perceived life meaning could be higher than, lower than, or the same as their perceived work meaning). My work was guided by the following research questions:
RQ1. Are the measurement properties of the PP and the WPP invariant? Specifically, are the patterns of latent PERMA constructs equivalent across scales (configural invariance), are the associations between the latent constructs and their respective like-items equivalent (metric invariance), and does each construct, when at an equal level, produce equivalent averages across like-items (scalar invariance)?
RQ2. Are the structural properties (i.e., factor variances, covariances, and means) of PERMA, as measured by each scale, invariant?
RQ3. Is the correlation between general PERMA and workplace PERMA low enough to support distinctiveness?

Participants and Procedure
Participants were recruited via Amazon Mechanical Turk (MTurk), an online crowdsourcing recruitment site for research. Individuals (known as MTurkers) create profiles based on their demographic, work, and other characteristics, and complete human intelligence tasks (HITs), which are commonly psychological studies. Researchers post studies on MTurk, and MTurkers who are eligible to participate (e.g., whose demographics match the inclusion criteria for a study) are able to enroll in the study; ineligible MTurkers are not. After participation, MTurkers' data are evaluated for quality by the study's research team. If acceptable, their HIT is approved; if unacceptable, their HIT is rejected. MTurkers' reputations (i.e., their proportion of approved HITs to total HITs) are tracked. English-speaking adults (≥18 years) in the U.S. employed outside of MTurk were eligible to complete the surveys. Additionally, only high-reputation MTurkers (those with at least 95% approved tasks on MTurk) were allowed to participate [19]. Six hundred one participants were eligible, provided informed consent, and passed a majority of attention checks (3/5). See Table 1 for demographic characteristics.
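The inclusion rules described above can be expressed as a simple filter. The following is a minimal sketch only, not the study's actual screening code: the `Respondent` fields and the function name are hypothetical, and the attention-check threshold assumes that "a majority (3/5)" means at least three of five checks passed.

```python
from dataclasses import dataclass


@dataclass
class Respondent:
    # All field names are hypothetical and chosen only for this illustration.
    age: int
    in_us: bool
    english_speaking: bool
    employed_outside_mturk: bool
    approval_rate: float          # approved HITs / total HITs, from 0.0 to 1.0
    consented: bool
    attention_checks_passed: int  # number passed out of 5

MIN_AGE = 18
MIN_APPROVAL_RATE = 0.95      # "high-reputation" cutoff stated in the text
MIN_ATTENTION_PASSED = 3      # assumed reading of "a majority (3/5)"


def is_included(r: Respondent) -> bool:
    """Return True if the respondent meets every stated inclusion rule."""
    eligible = (
        r.age >= MIN_AGE
        and r.in_us
        and r.english_speaking
        and r.employed_outside_mturk
        and r.approval_rate >= MIN_APPROVAL_RATE
    )
    return eligible and r.consented and r.attention_checks_passed >= MIN_ATTENTION_PASSED


# Two made-up respondents: the second fails the reputation cutoff.
sample = [
    Respondent(34, True, True, True, 0.98, True, 4),
    Respondent(29, True, True, True, 0.90, True, 5),
]
retained = [r for r in sample if is_included(r)]
```

Keeping the thresholds as named constants makes it easy to see how sensitive the retained sample size is to each rule.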

Measures
The PERMA-Profiler [1] and the Workplace PERMA-Profiler [2] were used to measure PERMA constructs in general and at work, respectively. In each scale, three items measured each construct on an 11-point scale (0: "Never" or "Not at All", to 10: "Always" or "Completely"): positive emotions (e.g., "How often do you feel positive?"; "At work, how often do you feel positive?"), engagement (e.g., "To what extent do you feel excited and interested in things?"; "To what extent do you feel excited and interested in your work?"), relationships (e.g., "How satisfied are you with your personal relationships?"; "How satisfied are you with your professional relationships?"), meaning (e.g., "To what extent do you lead a purposeful and meaningful life?"; "To what extent is your work purposeful and meaningful?"), and accomplishment (e.g., "How often do you achieve the important goals you have set for yourself?"; "How often do you achieve the important work goals you have set for yourself?").

Statistical Analyses
RQ1. All models were evaluated using maximum likelihood estimation because the number of ordered response categories exceeded seven (0-10), a condition under which maximum likelihood has previously been shown to perform comparably to least-squares estimation for ordinal data [20], and because all items were normally distributed. Models were identified using the marker indicator approach. First, model fit was evaluated for each scale independently. Model fit criteria included the Comparative Fit Index (CFI > 0.90) and the Standardized Root-Mean-Squared Residual (SRMR < 0.08), in accordance with Brown [21]. Next, I proceeded with single-sample invariance testing [21]. Configural invariance was established by placing both scales into a single model and re-evaluating model fit using the same criteria. Metric invariance was assessed by constraining like-item loadings on the factors to be equivalent between the scales and evaluating the decrement in model fit between this and the configural model via the chi-squared difference test. Likewise, scalar invariance was assessed by constraining like-item intercepts to be equivalent between the scales and evaluating the model fit decrement between this and the metric invariance model via the chi-squared difference test. Partial invariance was evaluated where full metric or scalar invariance did not hold [22]. Correlated residuals between like-items across scales were also included a priori in each model [21].
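The nested-model comparison described above reduces to computing the upper-tail probability of a chi-squared difference. The sketch below is illustrative only and is not the software used in the study: the function names are hypothetical, the 0.05 significance level is an assumption for these tests, and the p-value is computed from a standard closed form for integer degrees of freedom using only the Python standard library.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for a chi-squared variable with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: finite sum exp(-y) * sum_{i < df/2} y^i / i!
        return math.exp(-y) * sum(y ** i / math.factorial(i) for i in range(df // 2))
    # Odd df: start from the df = 1 case (erfc) and add one term per two extra df.
    total = math.erfc(math.sqrt(y))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-y) * y ** (i - 0.5) / math.gamma(i + 0.5)
    return total


def chi2_difference_test(delta_chi2: float, delta_df: int, alpha: float = 0.05):
    """Return (p_value, constraints_tenable) for a nested-model comparison.

    constraints_tenable is True when the added equality constraints do not
    significantly worsen fit at the assumed alpha level.
    """
    p_value = chi2_sf(delta_chi2, delta_df)
    return p_value, p_value >= alpha


# A non-significant decrement (difference of 7.91 on 7 df) and a significant one
# (difference of 115.33 on 10 df), using values of the kind reported in this study.
p_partial, partial_ok = chi2_difference_test(7.91, 7)
p_full, full_ok = chi2_difference_test(115.33, 10)
```

A non-significant difference is read as support for the more constrained model; a significant one prompts freeing individual constraints and re-testing, which is the partial-invariance procedure.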
RQ2. Building on the scalar invariance model, latent factor variances were constrained to 1 in the WPP to be equivalent to those of the PP (which were set to 1 for model identification). A chi-squared difference test was computed to compare fit between this and the scalar invariance model. Next, covariances among latent factors within each scale were constrained to be equivalent across scales (e.g., the covariance between positive emotions and engagement in the PP was constrained to equal the covariance between positive emotions and engagement in the WPP). Latent factor means were compared by evaluating the statistical significance (p < 0.05) of the mean difference between each like-construct across scales. Factor means of the PP were constrained to zero to identify the model, with those of the WPP freely estimated.
RQ3. The correlations between the PERMA like-factors across scales were evaluated from the covariance invariance model. If correlations were below 0.85, the latent factors were considered to be distinct across scales [21].
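The distinctiveness rule for RQ3 can be written as a short check. This is a minimal sketch under stated assumptions: the function names are hypothetical, the covariance and variance inputs in the example are made-up values rather than estimates from the study, and comparing the absolute value of the correlation to the cutoff is an assumption (the text states only "below 0.85").

```python
import math

DISTINCTNESS_CUTOFF = 0.85  # cutoff stated in the text


def latent_correlation(cov: float, var_a: float, var_b: float) -> float:
    """Convert a latent covariance into a correlation."""
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    return cov / math.sqrt(var_a * var_b)


def is_distinct(r: float, cutoff: float = DISTINCTNESS_CUTOFF) -> bool:
    """Like-factors count as distinct when |r| is strictly below the cutoff."""
    return abs(r) < cutoff


# Made-up inputs: a covariance of 0.90 with factor variances of 1.00 and 1.44.
r_example = latent_correlation(0.90, 1.00, 1.44)
distinct = is_distinct(r_example)
```

When both factor variances are fixed to 1, the covariance already equals the correlation, so the conversion step only matters for factors whose variances were freed.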

Results
See Table 2 for model fit and measurement invariance results. PERMA typically is a second-order factor, with positive emotions, engagement, relationships, meaning, and accomplishment as its first-order factors [3]. However, adding this second-order factor resulted in significant decrements in model fit with both the PP, ∆χ²(5, N = 601) = 72.04, p < 0.01, and the WPP, ∆χ²(5, N = 601) = 112.20, p < 0.01, indicating that a higher-order model would not be appropriate for analyses [21]. As such, only the first-order factors were retained. The fit of the five-factor PERMA model using the PP was excellent (CFI > 0.95; SRMR < 0.05). Although slightly worse, the fit of the model using the WPP was also good (CFI > 0.90; SRMR < 0.05). The model fit of the combined PERMA models with both scales was excellent (CFI > 0.95; SRMR < 0.05), supporting configural invariance (see Table 3 for the loadings and intercepts estimated in the configural model). Constraining the like-item loadings to be equivalent resulted in a significant decrement in model fit, ∆χ²(10, N = 601) = 115.33, p < 0.01. Freeing three item loadings (one engagement item, one meaning item, and one accomplishment item) brought the fit reduction to an acceptable level, ∆χ²(7, N = 601) = 7.91, p = 0.34, supporting partial metric invariance. As such, variances and covariances could reliably be compared across scales. Constraining all like-item intercepts to be equivalent also resulted in a significant decrement in model fit, ∆χ²(10, N = 601) = 88.72, p < 0.01. Freeing five item intercepts (one positive emotion item, one engagement item, two relationship items, and one meaning item) brought the fit reduction to an acceptable level, ∆χ²(5, N = 601) = 10.54, p = 0.06, supporting scalar invariance for all factors except for relationships. As such, the means for all but the relationships factor could be reliably compared.
As indicated in Table 2, constraining all factor variances to 1 resulted in a significant decrement in model fit from the scalar invariance model, ∆χ²(5, N = 601) = 51.64, p < 0.01. Freeing two factor variances from the WPP (engagement and meaning) brought the fit reduction to an acceptable level, ∆χ²(3, N = 601) = 7.21, p = 0.06. Constraining all respective covariances to be equivalent across scales resulted in a significant decrement in model fit, ∆χ²(10, N = 601) = 166.47, p < 0.01. Freeing all but one covariance constraint (positive emotions with relationships) brought the fit reduction to an acceptable level, ∆χ²(1, N = 601) = 0.03, p = 0.86, indicating that the associations between PERMA factors were not equivalent across scales. Finally, all like-factor means except for relationships were compared across scales (because the relationships factor was not scalar-invariant). Workplace positive emotions (M_Diff = −0.09, p = 0.01), engagement (M_Diff = −0.02, p < 0.01), and meaning (M_Diff = −0.10, p < 0.01) were all significantly lower than in general. Workplace accomplishment, in contrast, was significantly higher than in general (M_Diff = 0.19, p < 0.01).
See Table 4 for the correlations among latent factors. Although like-construct associations across scales were high, particularly for accomplishment (r = 0.81) and positive emotions (r = 0.81), all like-construct correlations were under 0.85, indicating construct distinctiveness across scales.

Discussion
When adapting scales for a single study, researchers may report the psychometric properties of their adaptation. However, without contextualizing these with the parent scale's properties, it is unknown whether (1) the adaptation measures the intended construct in the same way as the parent scale (i.e., if the construct presents equivalently across contexts), and (2) the adaptation is statistically necessary (e.g., whether the inter-factor covariances and means are distinct). Previous reviews and meta-analyses have shown varying effects of interventions depending on whether a scale's adaptation or its parent was used [17,18], and there have been recent questions about the reliability of PERMA constructs measured with the WPP [11,12]. As such, continuing evaluation of the PERMA scales' psychometric properties is needed to establish empirical support for their use in wider contexts like the workplace.
The results of the present study supported partial invariance across the PERMA-Profiler and the Workplace PERMA-Profiler, indicating that the flourishing constructs within PERMA were measured comparably across contexts. Not only does this finding support the consistent measurement structure of PERMA across contexts, but one could use these scales to directly compare the impacts of an intervention on flourishing both in general and at work. However, the relationships factor was not scalar-invariant, indicating that the means across these scales are not comparable. As such, using criteria derived from the PP to interpret means of the relationships factor from the WPP may not be appropriate. This lack of invariance could be due to the varied environmental setting (i.e., work), the variability of the individuals from whom the respondents are receiving support (i.e., others vs. coworkers), and the type of relational experience (i.e., love vs. appreciation). Each scale's structural covariances and latent means significantly differed, suggesting distinct factor means and inter-factor associations for PERMA constructs in each context. Additionally, correlations among like-constructs supported each scale's latent variable distinctiveness (e.g., workplace engagement and general engagement are not exactly the same).

Limitations and Future Directions
A number of limitations should be mentioned, as well as associated directions for future research. First, the study sample was recruited from MTurk, was predominantly male and educated, and was analyzed cross-sectionally. There have been a number of studies on the quality, representativeness, and ethics of MTurk for survey research [19,23–26]. On average, MTurk samples have been found to be more representative of the general working population, and possibly more likely to provide valid data, than student samples [24] and professional panels [24,27], with some studies supporting normative sample alignment [28]. However, many papers cite problems with the reliability of responses, including inattentiveness, impossible answers, and an overrepresentation of white males [23,27,29,30]. Although I only included high-reputation workers, such efforts to increase the data quality may be futile, as few researchers ever reject HITs [31]. As such, my results may be biased by poor data quality and may not be generalizable to all worker populations, especially those outside of the United States. Replications of this study are recommended using different samples, preferably those recruited directly from workplaces, and with a larger cross-national and female proportion.
Another major limitation is the lack of associative variables for convergent validity comparisons. Previous research has shown stronger associations between workplace interventions and workplace contextual measures than general measures (e.g., workplace mental health stigma vs. general mental health stigma [17]). As such, a more robust evaluation of the PP's and WPP's distinct utility would have involved the structural invariance of PERMA constructs with practical workplace measures (e.g., job performance, satisfaction, burnout [9,32,33]). Future research should evaluate the practical utility of the WPP over the PP in workplace contexts through additional structural invariance assessments. Such results would provide leaders with more evidence for the utilization of one scale over the other for workplace-specific emotional wellbeing pulse-taking.
A final future research direction would be to continue evaluating the psychometric properties of both scales. Recent research has suggested poorer reliability than is acceptable with the PP and WPP [11,12]. In the present study, I too found unacceptably low internal consistency for the engagement subscale of the PP, and the relationships subscale had non-invariant measurement properties. These findings suggest that (1) my results may not hold across replications, and (2) interpretations of relationships and the higher-order PERMA factor may not be comparable across general and workplace contexts. Consequently, the use of either scale may not be recommendable to evaluate relationship-related health at work. I recommend adjustments to the engagement items be made to improve reliability, and adjustments to the relationship items be made (perhaps solely for the WPP) so that these constructs may be more representative of the same factor. Future work should be carried out to develop these adjustments and test the psychometric differences between the subsequent adaptation and the original versions.

Conclusions
When measuring workplace wellbeing, it is important that the scale(s) used are placed in a workplace context. However, if an adaptation to a general scale is made in order to do so, psychometric evaluations of the new scale are necessary to ascertain measurement consistency with and structural distinction from its parent scale. Without such tests, it is unknown whether the results are interpretable in the same way, and any practical implications drawn from them may be questionable. My results suggest that the measurement of PERMA is relatively stable across scales, and the constructs themselves are distinct, supporting the utility of both the PERMA-Profiler and the Workplace PERMA-Profiler. However, I advise caution when measuring the relationships factor in the workplace (and, consequently, when the higher-order PERMA factor is used), due to its non-invariance between general and workplace contexts. The practical implications of this include support of the WPP for leaders to use as an employee emotional health assessment.

Funding:
The data collection was funded by the Marchionne Summer Research Fellowship from Washington State University and the Total Worker Health Dissertation Award from the Oregon Healthy Workforce Center, a Total Worker Health Center of Excellence funded by the National Institute for Occupational Safety and Health (grant number U19OH010154).
Institutional Review Board Statement: This study was conducted in accordance with the Declaration of Helsinki and was approved by the Institutional Review Board of Washington State University (protocol code 17664-001; 23 May 2019).
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: Data are available on request.

Table 1. Demographic and descriptive statistics.
Note. a Mean and standard deviation are rounded to the nearest thousand; b Means and standard deviations are computed using observed averages.

Table 2. Model fit and invariance results.
Note. df is degrees of freedom. CFI is the Comparative Fit Index. SRMR is the Standardized Root-Mean-Squared Residual.

Table 3. Item characteristics across scales.
Note. Coefficients in the table are taken from the configural invariance model. U-Load is the unstandardized loading. SE is the standard error of the unstandardized loading. Std Load is the standardized loading. a Item numbers correspond to those published in [1,2]. b Items freed for partial metric invariance. c Items freed for partial scalar invariance.