Time to Renovate the Humor Styles Questionnaire? An Item Response Theory Analysis of the HSQ

The Humor Styles Questionnaire (HSQ) is one of the most popular self-report scales in humor research. The present research conducted a forward-looking psychometric analysis grounded in Rasch and item response theory models, which have not been applied to the HSQ thus far. Regarding strengths, the analyses found very good evidence for reliability and dimensionality and essentially zero gender-based differential item functioning, indicating no gender bias in the items. Regarding opportunities for future development, the analyses suggested that (1) the seven-point rating scale performs poorly relative to a five-point scale; (2) the affiliative subscale is far too easy to endorse and much easier than the other subscales; (3) the four subscales show problematic variation in their readability and proportion of reverse-scored items; and (4) a handful of items with poor discrimination and high local dependence are easy targets for scale revision. Taken together, the findings suggest that the HSQ, as it nears the two-decade mark, has many strengths but would benefit from light remodeling.


Introduction
The typical "conference hotel" for a psychology conference will get renovated every 4-6 years [1], which is a much faster cycle than the self-report scales presented in the meeting rooms and poster sessions downstairs. The typical self-report scale in the behavioral sciences can go a decade or longer before revision, if ever, and many popular scales grow increasingly dated and stale until they fall into disuse. In the present research, we examine the Humor Styles Questionnaire (HSQ) [2], one of the most popular self-report tools in humor research over the last 20 years [3]. Because the HSQ is nearing the 20-year mark, now is a good time for a forward-looking psychometric analysis that clarifies the scale's strengths and pinpoints opportunities for future improvement. Our analysis applies tools from the family of Rasch and item response theory (IRT) models [4,5], which have apparently not been applied to the HSQ before but offer keen insight into the behavior of items and rating scales. We conclude by outlining suggestions for the future development of the HSQ.

The Humor Styles Questionnaire
Developed by Martin et al. [2], the HSQ is among the most prominent self-report scales in the psychology of humor [6,7]. The scale has been translated into dozens of languages [8] and adapted for versions used in the workplace [9] and with children [10,11]. Several meta-analyses and systematic reviews involving the HSQ [12][13][14][15] and nearly 2000 citations to the original publication (as of November, 2020) attest to its importance to modern humor scholars. The humor styles model grew out of research on humor's role in coping, health, and well-being [3]. Martin et al. [2] proposed four humor styles that form a 2 × 2 matrix. Two styles-affiliative and

The Present Research
In the present research, we apply tools from Rasch and item response theory models to gain insight into the behavior of the HSQ. Despite the scale's wide use and popularity, it has not been examined using these tools. Our aim is to evaluate areas for possible improvement, so we focus on a cluster of core issues: (1) the behavior of the seven-point rating scale versus alternatives; (2) reliability, dimensionality, and local item dependence for the four styles; (3) item difficulty and discrimination, particularly if some subscales are too "easy" or "hard"; (4) test information; (5) possible gender bias (differential item functioning); and (6) item readability and polarity. . RMSD = root mean squared deviation item-fit statistic. RMSD values greater than 0.02 and 0.05 reflect "small" and "medium" misfit, respectively.

Participants
We examined the HSQ with two samples. Our primary sample consisted of 1210 adults-557 women and 652 men, ranging in age from 18 to 80 (M = 30.90, SD = 11.99, Mdn = 27)-who completed the HSQ using a 5-point rating scale. Some of the responses (n = 810) come from data collected by open psychometrics and shared in the R package clustrd [19]; we collected the remaining responses (n = 400) using the Prolific.co survey panel. This 5-point dataset is the primary sample reported throughout the manuscript. Our secondary sample consisted of 532 adults-397 women and 135 men, ranging in age from 18 to 51 (M = 19.04, SD = 2.39, Mdn = 18)-enrolled in psychology courses at the University of North Carolina at Greensboro. This sample completed the HSQ using a 7-point rating scale; this dataset was used to illustrate deficiencies in using 7 response options. In both samples, the responses had been screened for non-native speakers of English (when known) and for careless and inattentive responding using missed directed response items, long-string indices, and Mahalanobis D values, conducted with the R package careless [20].

Analytic Approach
The data were analyzed in R [21] using psych [22] and TAM [23]. For the IRT models, we estimated generalized partial credit models (GPCM) in TAM using marginal maximum likelihood. These models estimate difficulty (b) and discrimination (a) parameters for each item as well as item thresholds (the level of the underlying trait at which people have a 50:50 chance of responding to one or the other scale option). Four models were estimated, one for each HSQ subscale. The data and R code are available at Open Science Framework (Supplementary Materials). For clarity, we combine the results with our explanation of the statistical methods as well as our discussion and interpretation.

Rating Scale Analysis: Is Seven Points Too Many?
The HSQ was developed with a seven-point Likert-type rating scale, a common number in the self-report assessment of individual differences, and most published research appears to have used seven response options. Many researchers, however, have used five-point scales in their HSQ research [24,25], and some have used three-point scales [26]. Some translations of the HSQ have adopted other rating scales, such as the four-point scale used in the Italian version [27]. Notably, the adaptations of the HSQ for children [10,11] use a four-point scale to simplify the judgments for the younger intended population.
The optimal number of rating scale categories-for a particular scale targeted at a particular population-is an empirical question [28]. A common intuition among researchers is that "more is better": a nine-point scale, for example, will "expand the variance," as the saying goes, relative to a four-point scale by "giving people more options." Nevertheless, measurement theory grounded in Rasch and IRT models points out that allowing people to make finer distinctions does not mean that they can make fine distinctions reliably [4]. Researchers can study the empirical behavior of their rating scales to discern the optimal number of categories.
The first step is to evaluate whether the estimated item thresholds are disordered. In the GPCM, a seven-point rating scale has six thresholds that represent the "tipping point" between responses. The threshold between 1 and 2 for affiliative humor, for example, represents the trait level of affiliative humor at which someone has a 50:50 chance of responding with a 1 or 2. The thresholds should be monotonically ordered, ascending from smaller (usually negative) values to larger values. When the thresholds are ordered, it means that as the trait increases from low to high, each response option, from lowest to highest, takes its turn as the most likely response. When thresholds are out of order, it is usually a sign that a rating scale has too many categories-the participants are underusing using some of them.
This point can be illustrated with item characteristic curves, which display the probability of giving a particular item response at different trait levels. Ideally, each rating scale option will be most probable at some trait level. The top panel in Figure 1 shows the well-behaved, ideal curves using item 10 (self-enhancing: "If I am feeling upset or unhappy I usually try to think of something funny about the situation to make myself feel better") completed with a five-point scale. As latent self-enhancing scores move along the trait (theta) continuum from low (−3) to high (3), the most probable response shifts from 1 to 5 in ordered steps. Note how each response takes its turn as the most likely responses. All five response options are thus informative. The five-point HSQ data, in contrast, had only five disordered items, which are flagged in Table  1. Most disordered items were in the aggressive subscale (items 11, 19, and 27); the remaining subscales had at most one disordered item: affiliative (item 25), self-enhancing (no disordered items), and self-defeating (item 28). The bottom panel of Figure 1 illustrates item 22 again, but for the fivepoint data. The pattern is ordered because all five options are most likely at some point of the trait. The pattern is not ideal-option 3 is most likely for only a small theta region-but it is much better behaved than the seven-point version.
Taken together, the behavior of the rating scales suggests that seven response options is probably excessive. The participants were not discerning seven levels of the underlying HSQ styles for nearly In the seven-point HSQ data, however, around a third of the items-10 of 32-had disordered thresholds. These appeared in all four subscales: affiliative (items 1 and 25), self-enhancing (items 2 and 22), aggressive (items 3, 7, 15, and 31), and self-defeating (items 12 and 24). As an example of a disordered item, the middle panel of Figure 1 shows the item characteristic curves for item 22 (self-enhancing: "If I am feeling sad or upset, I usually lose my sense of humor"). For this item, the most likely responses do not move in ordered steps: as the trait moves from low to high, the most likely responses are 2, 3, 5, and 6. Three options in the seven-point scale (1, 4, and 7) are never the most likely responses at any trait level, and most of the trait range (from around −2 to +2.5) is covered by only two scale options (3 and 5). It is obvious that seven options is far too many for this item-it is functionally a four-point scale.
The five-point HSQ data, in contrast, had only five disordered items, which are flagged in Table 1. Most disordered items were in the aggressive subscale (items 11, 19, and 27); the remaining subscales had at most one disordered item: affiliative (item 25), self-enhancing (no disordered items), and self-defeating (item 28). The bottom panel of Figure 1 illustrates item 22 again, but for the five-point data. The pattern is ordered because all five options are most likely at some point of the trait. The pattern is not ideal-option 3 is most likely for only a small theta region-but it is much better behaved than the seven-point version.
Taken together, the behavior of the rating scales suggests that seven response options is probably excessive. The participants were not discerning seven levels of the underlying HSQ styles for nearly a third of the items, so their response behavior indicates that they were using fewer categories to classify their level of the respective humor style. For future HSQ development, then, researchers should consider reducing the number of options. Our data show that a five-point scale outperforms a seven-point scale, but we think researchers should explore other rating scales when revising the HSQ, such as the four-point scale used in the Italian HSQ [27] or further examining the three-point scale explored in recent work [26]. The ideal rating scale for an intended respondent population is an empirical issue, so future scale development work should empirically evaluate its rating scale relative to promising alternatives.

Reliability
We explored score reliability with a few metrics, shown in Table 2. Cronbach's alpha was good for all four subscales (α ranged from 0.79 to 0.86). Omega-hierarchical (ω H ; the proportion of total score variance due to a general factor) was good but lower (0.68 to 0.77), suggesting the possible presence of minor factors. Finally, reliability of the expected a posteriori (EAP) trait scores estimated from the GPCM IRT model were good (0.80 to 0.86). Taken together, the present data support the evidence for score reliability found in much past HSQ research [6].

Dimensionality
For the HSQ, we evaluated essential unidimensionality, a looser criterion that recognizes that self-report scales for measures of personality traits are rarely strictly unidimensional [29]. We assessed the dimensionality of the HSQ subscales with three methods: parallel analysis [30], Velicer's [31] minimum average partial (MAP) criterion, and the ratio of the first-to-second eigenvalues (e.g., greater than 4:1) [29]. The factor analyses, conducted in psych (Revelle, 2020) using maximum likelihood factor analysis, suggested good evidence for essential (but not strict) dimensionality. Parallel analysis suggested 3 or 4 factors per subscale, but there was clearly a dominant first factor. For all subscales, the MAP and the 4:1 eigenvalue-ratio criteria indicated 1 factor. Taken together, these findings are consistent with past research, which regularly finds good evidence for the factor structure of the HSQ [8].

Local Dependence
Although the HSQ subscales appear essentially unidimensional, the minor secondary factors revealed by the parallel analysis are worth exploring in more detail. In self-report scales, these "nuisance factors" commonly come from a few items that have redundant meanings, overlapping wording, or similar structure (e.g., reverse-scored). Understanding these minor nuisance factors gives useful guidance for future scale revision because they can flag items that are causing deviations from unidimensionality. We thus examined local dependence statistics for the items using the adjusted Q3 statistic (aQ3), an extension of Yen's (1984) Q3 that corrects for its negative bias by centering the values on the average Q3 [32]. As residual correlations, aQ3 values are in the r metric, and values greater than |0.20| are commonly flagged as violations of local independence [33]. In brief, these residual correlations reveal if two items correlate with each other after accounting for the influence of the common latent factor. Table 1 notes the items within each HSQ subscale that were flagged for local dependence. These typically reflected "surface dependence" [34], such as overlapping item keywords or items with reversed-scored wording. For the affiliative scale, the two dependent pairs reflected residual correlations between flipped versions of items (e.g., "I laugh and joke a lot with my closest friends" and "I don't often joke around with my friends"). For the self-enhancing subscale, the largest value appeared (aQ3 = 0.39) for a pair of items about finding humor when alone. No local dependence appeared for the aggressive subscale. Finally, for the self-defeating subscale, the cluster of locally dependent items reflected substantive overlap between a group of items about being a target of others' jokes and laughter.
For future HSQ development, the patterns of local dependence suggest that it would be straightforward to improve the scales' unidimensionality, such as by omitting or rewording items to reduce surface dependence. For the self-defeating subscale, however, the cluster of items had a shared substantive meaning-being the target of others' laughter and teasing-which suggests that it may represent a distinct facet of self-defeating humor that could be explored further [35,36].

Item Difficulty, Item Discrimination, and Test Information
The estimates of the generalized partial credit models are shown in Table 1. Item fit was good according to mean-square infit and outfit statistics, for which 1.00 is the ideal value [4]. (Note that TAM estimates infit and outfit using individual posterior ability distributions rather than individual ability estimates, so its fit estimates will vary from WLE-based methods [37]). As an additional measure of item fit, we calculated the root mean squared deviation (RMSD) statistic [37,38]. Köhler et al. [37] suggested that RMSD values under 0.02 indicated "negligible" misfit, values between 0.02 and 0.05 reflected "small" misfit, and values between 0.05 and 0.08 reflected "medium" misfit. The RMSD values suggested good item fit: all values were below 0.05 (see Table 1).

Difficulty
For polytomous self-report items, the item difficulty parameter, known as the b parameter, represents the "endorsability" of the item-how easy or how hard it is for people to express agreement with it. The item difficulty values for the HSQ subscales are reported in Table 1. Figure 2 illustrates the difficulty values for all 32 items, which are ordered from "easiest" to "hardest" and color coded by subscale.
TAM estimates infit and outfit using individual posterior ability distributions rather than individual ability estimates, so its fit estimates will vary from WLE-based methods [37]). As an additional measure of item fit, we calculated the root mean squared deviation (RMSD) statistic [37,38]. Köhler et al. [37] suggested that RMSD values under 0.02 indicated "negligible" misfit, values between 0.02 and 0.05 reflected "small" misfit, and values between 0.05 and 0.08 reflected "medium" misfit. The RMSD values suggested good item fit: all values were below 0.05 (see Table 1).

Difficulty
For polytomous self-report items, the item difficulty parameter, known as the b parameter, represents the "endorsability" of the item-how easy or how hard it is for people to express agreement with it. The item difficulty values for the HSQ subscales are reported in Table 1. Figure 2 illustrates the difficulty values for all 32 items, which are ordered from "easiest" to "hardest" and color coded by subscale. The figure highlights a few problematic features of the HSQ. First, the affiliative and selfenhancing subscales are much easier than the others. Most of the self-enhancing items are on the easy end; surprisingly, all eight affiliative items are on the easy end. Most or nearly all of the aggressive and self-defeating items, in contrast, are on the harder end and require relatively high levels of these traits to endorse. The subscales thus vary greatly in their difficulty. For example, the hardest affiliative item is still easier to endorse than even the easiest aggressive and self-defeating items. The figure highlights a few problematic features of the HSQ. First, the affiliative and self-enhancing subscales are much easier than the others. Most of the self-enhancing items are on the easy end; surprisingly, all eight affiliative items are on the easy end. Most or nearly all of the aggressive and self-defeating items, in contrast, are on the harder end and require relatively high levels of these traits to endorse. The subscales thus vary greatly in their difficulty. For example, the hardest affiliative item is still easier to endorse than even the easiest aggressive and self-defeating items.

Discrimination
The HSQ subscales also vary in their values on the discrimination a parameter (see Table 1), often called the slope parameter for polytomous items [39]. In IRT, these values represent how well an item discriminates between people with different trait levels. Items with higher discrimination scores provide more information and can more reliably rank-order participants according to their trait level. They are conceptually equivalent to the magnitude of factor loadings in a confirmatory factor analysis. As Figure 3 illustrates, the HSQ items vary broadly in their discrimination values, from fairly low (below 0.50) to fairly high. Notably, some subscales had more discriminating items. Most of the affiliative items were among the most discriminating items; most of the aggressive items, in contrast, were in the bottom half.

Discrimination
The HSQ subscales also vary in their values on the discrimination a parameter (see Table 1), often called the slope parameter for polytomous items [39]. In IRT, these values represent how well an item discriminates between people with different trait levels. Items with higher discrimination scores provide more information and can more reliably rank-order participants according to their trait level. They are conceptually equivalent to the magnitude of factor loadings in a confirmatory factor analysis. As Figure 3 illustrates, the HSQ items vary broadly in their discrimination values, from fairly low (below 0.50) to fairly high. Notably, some subscales had more discriminating items. Most of the affiliative items were among the most discriminating items; most of the aggressive items, in contrast, were in the bottom half.

Test Information
In item response theory, error is not constant across the trait; some regions of the trait are measured more reliably than others [5]. Test information functions illustrate how much information the subscales provide and at which traits levels they are most reliable. Figure 4 shows the curves for the HSQ subscales, which distill the items' difficulty and discrimination values into a depiction of

Test Information
In item response theory, error is not constant across the trait; some regions of the trait are measured more reliably than others [5]. Test information functions illustrate how much information the subscales provide and at which traits levels they are most reliable. Figure 4 shows the curves for the HSQ subscales, which distill the items' difficulty and discrimination values into a depiction of how much information they provide across the trait continuum.

Test Information
In item response theory, error is not constant across the trait; some regions of the trait are measured more reliably than others [5]. Test information functions illustrate how much information the subscales provide and at which traits levels they are most reliable. Figure 4 shows the curves for the HSQ subscales, which distill the items' difficulty and discrimination values into a depiction of how much information they provide across the trait continuum. The most notable feature is that the affiliative subscale provides much more information relative to the others, but it does so at a much lower trait level. This function reflects the fact that, as we have seen, the affiliative items are very easy but also highly discriminating. The affiliative subscale thus measures the low end of the underlying trait very well-it can make fine distinctions between very low and moderately low levels, for example-but provides much less information about respondents who are high in the affiliative humor style. Stated differently, these items are quite good at rank- The most notable feature is that the affiliative subscale provides much more information relative to the others, but it does so at a much lower trait level. This function reflects the fact that, as we have seen, the affiliative items are very easy but also highly discriminating. The affiliative subscale thus measures the low end of the underlying trait very well-it can make fine distinctions between very low and moderately low levels, for example-but provides much less information about respondents who are high in the affiliative humor style. Stated differently, these items are quite good at rank-ordering people low in affiliative humor but relatively weak at rank-ordering people high in it. The other three subscales have test information functions that are commonly found for measures of individual differences in typical populations: the peaks are in the region of zero, and although they provide less absolute information (i.e., they have lower peaks than the affiliative subscale), they provide information over a much broader range of the trait.

Summary
Of the four HSQ subscales, the affiliative subscale stands out as problematic and worth a close look in future scale development. First, its items are simply too easy for this group of respondents, which represents the typical population for whom the scale is targeted. These items are highly discriminating, which is a strength, but because the items are all fairly easy to endorse, the scale yields a high level of information centered on the low end of the trait. This scale thus does a relatively poor job of sorting people with average and higher levels of the affiliative humor style. Rewording at least some of the items to be harder to endorse would enable the scale to provide more information about high trait levels.
Regarding difficulty and discrimination, the remaining subscales seem solid. A few items with low discrimination levels deserve a look during the item revision process, especially items that were also flagged for disordered thresholds and local dependence.

Gender Bias and Differential Item Functioning
In the present sample, like much past work [2,8], men had at least slightly higher scores on all four subscales. Table 2 reports Cohen's d values and their 95% confidence intervals. For our purposes, the more interesting issue is whether gender differences represent true differences in the underlying trait or if the HSQ items are contaminated with item bias. One of the major virtues of item response analysis is the ability to appraise differential item functioning (DIF). Women and men can vary in their trait scores, of course, but men and women with identical true trait scores should have the same predicted responses. When this condition is violated for an item, it is evidence for DIF. For such items, group differences are difficult to interpret because they are a mixture of true group differences and unwanted item bias.
In the present sample, we evaluated DIF in lordif [40,41], which implements a logistic ordinal regression approach using IRT-based trait scores and iterative purification. Because of our large sample size, we used McFadden's R 2 [42] as an effect size measure of DIF [43]. A criterion of R 2 = 0.01 (more stringent than the common cutoff of R 2 = 0.02 for a small R 2 effect size) was used to flag items for total DIF. Notably, no items were flagged for any subscale by this criterion, indicating essentially zero gender-based DIF for all the HSQ items. This is a striking finding for a self-report scale with widespread gender differences, and it indicates that the gender differences found in the scale scores reflect true group differences rather than item bias.

Readability and Item Polarity
Readability is an important feature of self-report items, but readability statistics are infrequently evaluated for measures of individual differences. For the HSQ, we computed the average words per items and evaluated two common metrics of readability [44] (see Table 2). Higher Flesch "reading ease" scores indicate greater readability, so reading ease is rather easy for the affiliative and self-defeating subscales but fairly hard for the self-enhancing subscale. Gunning's "fog index" found similar variability between the subscales. The fog index is expressed in American grade levels, so the affiliative subscale had a lower reading level (around 6th grade, 11-12 years old) and the self-enhancing scale had a much higher reading level (around 12th grade, 17-18 years old). The affiliative subscale has mostly short, focused items, and the self-enhancing subscale has some long-winded, passive-voice clunkers, such as "It is my experience that thinking about some amusing aspect of a situation is often a very effective way of coping with problems." In short, the HSQ as a whole has a reading level that is appropriate for adult populations, but it is problematic that the subscales vary in their readability. Future item revision should aim for consistent readability levels so that differences in subscale scores are not contaminated by factors related to reading comprehension and verbal abilities.
Another notable item feature is polarity-whether the item is phrased in a positive or negative direction. The virtues, problems, and consequences of reverse-scored items remain controversial [45][46][47][48], and there is clearly no single approach that is best for all tools, populations, and purposes. Nevertheless, if a scale includes reversed items, they should be distributed evenly across subscales. The four HSQ subscales vary notably in the proportion of reverse-coded items (see Table 2). For the affiliative subscale, more than half of the items (five of eight items) are reverse coded. The aggressive scale is similar, with half of the items (four of eight items) reversed. The self-enhancing and self-defeating subscales, however, each have only one reverse-coded item. The subscales are thus imbalanced, ranging from mostly to minimally reversed, and this imbalance is systematic: the two interpersonal subscales are at least 50% reversed, whereas the two intrapersonal subscales are only 12.5% reversed.
In addition, many of the reverse-coded items use simple negation, such as "don't" and "not," which is a widely criticized practice [48]. Respondents often miss these words and thus interpret the item in the wrong direction, which reduces internal consistency and increases item misfit [49]. Another consequence is reduced unidimensionality due to local dependence. Two similarly worded items that vary only in simple negation will often have a high residual correlation. This appeared for several HSQ items in the affiliative subscale, as we saw earlier. We do not have strong feelings about including or omitting reverse-scored items, but we see the imbalance in the proportion of reversed items across subscales and the predominance of simple negations as problematic for the HSQ, and it should be addressed in future item revisions.

Conclusions
In the present research, we conducted a psychometric analysis of the Humor Styles Questionnaire, one of the most widely used self-report scales in modern humor research. Because the scale is nearly 20 years old and has not yet been evaluated using Rasch and item response theory methods, a thorough evaluation can set the stage for future scale revision by highlighting strengths to preserve and opportunities for improvement. Based on our analyses, the HSQ has many strengths, so any future remodeling is more a matter of light renovations than of gutting the interior and starting over.
We should note some limitations of our analysis. First, our conclusions apply only to the English form of the HSQ. This scale has been widely translated into dozens of languages [6,8], often with noteworthy changes to the response scale and item wording to reflect distinct cultural inflections in the meanings and uses of humor [50]. Each translated form of the scale deserves its own separate analysis. Second, our primary analysis of the HSQ was grounded in the dataset that used a five-point response scale. Our findings provide evidence that the five-point scale has more consistent psychometric behavior and should be preferred to the traditional seven-point scale. Nevertheless, some of the item and scale parameters (e.g., difficulty, discrimination, and item fit) would be expected to differ slightly for the seven-point scale version.
We have focused our attention on psychometric aspects of the HSQ rather than on broader questions of its validity. Some researchers, however, have recently argued that the HSQ has questionable construct validity [51,52]. It is unclear, in our opinion, how much some of the methods used to challenge the scale's validity (e.g., experimentally manipulating items) inform the complex issue of validity for the HSQ [53], especially in light of the large literature that suggests that it is a useful tool for humor inquiry. Validity is not a static feature of a scale but rather an evolving feature of the kinds of interpretations and claims that are legitimately afforded by a growing body of scholarship [54].
If researchers revise and refine the HSQ's items, it would be wise to take advantage of the accumulated literature related to the validity of its underlying constructs, both favorable and critical. As one example, research since the HSQ was first published suggests that self-defeating humor probably has distinct sides: the maladaptive side that the HSQ intends to assess and an adaptive side reflecting healthy self-deprecation and humility [35,36]. The HSQ might not distinguish these as clearly as it could.
In conclusion, here are the major take-home messages from our analysis for the scale's users and for researchers interested in refining and revising the scale. The HSQ has several strengths to preserve and build upon:

•
The HSQ's subscales are essentially unidimensional and yield scores with good reliability relative to their length.

•
The items have essentially no gender-based differential item functioning, which is a striking finding for a scale with prominent gender differences. We see this as a major strength of the HSQ.

•
Except for the affiliative subscale, all the subscales provide most of their information around the center of the trait, which is apt for measures of normal individual differences.
Based on our analyses, here are the major opportunities for improvement: • The rating scale analysis indicates that seven points is probably too many. The five-point items were more consistently ordered, and a five-point scale is a sensible alternative. The behavior of a rating scale is an empirical question, so we would encourage future development of the scale to explore the behavior of alternative rating scales, such as four or six points.

•
The affiliative subscale is simply too "easy" for the intended population, and it is much easier than the other subscales. We see the disparity in overall difficulty as a problem for the HSQ because overall subscale scores are not comparable. The varying difficulty levels are almost certainly an unintended consequence of the scale development process and something that should be targeted in future revisions. There is no ideal difficulty level, but scales that measure normal individual differences usually seek to distribute their items across the difficulty spectrum, thus allowing them to gain insight into a wide range of the trait.

•
The items' readability should be adjusted and harmonized. The subscales vary in sentence length and readability, which introduces unwanted variance.

•
The proportion of reverse-scored items should be more similar across the subscales, and reversing with simple negation should be avoided. • Some items are relatively low in discrimination and thus offer relatively little information. These are good candidates for item revision or deletion. • According to local dependence statistics, a handful of items impair unidimensionality and are thus easy targets for item revision or deletion.
Supplementary Materials: The raw data and R files are publicly available at Open Science Framework (https://osf.io/7zy2h/). We invite researchers to reanalyze and reuse the data for their own purposes.