1. Introduction
A stereotype is a cognitive template that perpetuates fixed patterns of thought, generalizing beliefs that oversimplify individual differences. While stereotypes fundamentally constitute false ideas that distort reality, they do not invariably represent negative attributions. For example, Germans may be stereotypically perceived as reliable and punctual, while Southern Italian women might be characterized as excellent cooks—attributions that, although potentially positive, remain preconceived notions divorced from individual variation. More frequently, however, stereotypes function as powerful beliefs that activate prejudicial and/or discriminatory representations of specific social groups.
The first systematic scientific investigation of stereotypes, conducted by Katz and Braly (1933), demonstrated the pervasiveness of internalized prejudices among white college students, who consistently associated African Americans with attributes such as “lazy” and “superstitious,” while characterizing Germans as “rational” and “industrious”. These findings established that stereotypes erroneously attribute specific qualities (predominantly negative) to particular individuals or social groups based on category membership rather than individual characteristics. Gender stereotypes represent perhaps the most pervasive form of social conditioning, emerging through primary socialization processes wherein children internalize what
Mead (
2015) termed the “generalized other”. Contemporary socialization remains predominantly binary, developing distinct gender-linked cognitive schemas from early childhood (
Bem, 1981). Recent studies examining the influence of gender stereotypes confirm that, even when respondents attempt to manage their answers according to social desirability, their self-representations remain clearly shaped by internalized sexism (
Cerbara et al., 2022;
Lai & Wilson, 2021). Consequently, although younger generations demonstrate greater awareness of gender bias, socialized sexism continues to persist, manifesting in more subtle and implicit forms (
Sulla et al., 2025;
Moors & De Houwer, 2006). Thus, this binary educational framework continues to shape human existence across the lifespan, generating stereotyped perceptions, expectations, and behaviors transmitted through multiple channels: language acquisition, imitation of emotionally significant adults, media content, educational materials, play activities, and cultural narratives. Research has identified a precise developmental trajectory for prejudice formation in children. By age three, children demonstrate nascent prejudices toward out-groups (
Yee & Brown, 1992;
Doyle & Aboud, 1995). Prejudicial attitudes emerge by age four, while categorical thinking and in-group/out-group distinctions consolidate between ages five and seven. Between seven and nine years, rigid categorical interpretations begin to attenuate, with more concrete conceptualizations of human differences—particularly regarding sex, age, ethnicity, and group membership—emerging by age eleven (
Aboud, 1998).
The persistent prevalence of gender stereotypes across cultures ensures that individuals encounter continuous reinforcement of these beliefs through observable gender roles. These roles establish normative constraints and generate expectations based on biological sex, often associated with differentiated expectations regarding leadership, caregiving, and social authority. The widespread manifestation of these patterns in social reality causes children to perceive stereotypical gender arrangements as natural rather than socially constructed (
Berger & Luckmann, 1966). Adolescence represents a particularly critical developmental stage for examining gender stereotypes, as identity consolidation, peer influence, and social comparison processes intensify during this period. Indeed, during adolescence, gender stereotypes increasingly influence educational choices, career aspirations, leadership expectations, and relational norms (
Alfieri et al., 1996). Evidence indicates that internalized gender beliefs shape academic self-concept, occupational trajectories, and power dynamics in intimate relationships (
Moreau et al., 2021). Measuring stereotype endorsement at this developmental stage is therefore essential for understanding how early cognitive schemas may orient long-term life outcomes. This process establishes a social order mistaken for a natural order, perpetuating itself intergenerationally through binary socialization until the “generalized other” becomes fully integrated into individual consciousness, consolidating gender identity. Identity formation, however, represents an ongoing process subject to continuous remodulation through secondary socialization. Nevertheless, primary socialization experiences constitute the most resistant core of identity structure. Consequently, males progressively assume roles emphasizing dominance, control, and power, while females experience constraint toward—and often internalize—subordinate social positions centered on domestic spheres, caregiving, and partner/offspring nurturance.
Stereotypes thus simultaneously categorize individuals within primary group membership while delineating distinctions from other social groups. Multiple categorization criteria operate: sex, physical appearance, profession, sexual orientation, and social affiliation. Group members share distinctive characteristics that they collectively defend (Tajfel, 1981). Tajfel’s (1986) empirical validation of Mead’s theoretical insights through Social Identity Theory demonstrated how group affiliation mechanisms underpin social discrimination, extending beyond gender to encompass race, ethnicity, sexual orientation, and religious affiliation. The fundamental principle operates through ethnocentric bias: perceiving one’s group as superior while devaluing others. When out-groups fail to conform to stereotypically prescribed “natural” standards, discrimination becomes perceived as both justified and inevitable. Gender stereotypes can thus generate discrimination, segregation, and violence.
The propensity to categorize others within meaningful mental containers (sex, ethnicity, religion) while simultaneously self-categorizing derives from stereotypes operating below conscious awareness. Without recognizing these social conditioning processes, individuals enact prescribed social roles while attributing behavior to inherent personality traits or biological predispositions rather than contextual, situational, or cultural factors. This attribution error legitimizes male aggression toward women through perceived superiority and discretionary power, including the justification of violence.
Measuring prejudice presents substantial methodological challenges. Potential discrepancies between questionnaire responses and actual behavior, combined with social desirability bias, complicate accurate assessment (
McConahay, 1986). Additionally, individuals may lack awareness of their stereotyped actions (
Greenwald & Banaji, 1995). While direct behavioral observation could address these limitations, it lacks the statistical representativeness achievable through large-scale questionnaire administration.
Notable gender-focused instruments include the Bem Sex Role Inventory (BSRI;
Bem, 1974) and the Personal Attributes Questionnaire (PAQ;
Spence et al., 1975), measuring conformity to traditional gender roles and personality traits associated with gender roles, respectively. However, both instruments employ outdated binary conceptual categories and 1970s-era stereotypes and social norms. The more contemporary Gender Role Stereotypes Scale (GRSS;
Mills et al., 2012) offers brevity (eight items) and a distinction between male and female role stereotypes; however, its restricted item pool and emphasis on adult-oriented social roles limit its content validity for adolescent populations. The Ambivalent Sexism Inventory (ASI;
Glick & Fiske, 1996), widely utilized internationally, distinguishes hostile from benevolent sexism using an evaluative metric analogous to the scale of Pettigrew and Meertens (1995). Comprising 22 items equally divided between overt and benevolent/paternalistic sexism measurement, it emphasizes stereotypical thinking over gender role assessment. Given the inherent limitations of any scientific inquiry into abstract constructs, and considering the temporal evolution of gender stereotypes and their impact on life trajectories, attitudes, and behaviors, it is now more imperative than ever to adopt a rigorous methodology capable of identifying both the presence and the nature of gender stereotyping, one that integrates a multidimensional socio-psychometric analysis (
Bhatia & Bhatia, 2021;
Menegatti et al., 2021;
Abele et al., 2016;
Hentschel et al., 2019).
Given the limited availability of multidimensional instruments specifically designed to assess gender stereotype endorsement among adolescents, the present study developed the Gender Stereotypes and Roles Adherence Battery for Adolescents (GAB-A).
This battery, employing an interdisciplinary socio-psychological evaluative metric, was administered in a study of 2955 students attending the first year of public secondary schools in Rome (9th grade; aged 14 years), carried out between January and March 2025. The GAB-A is a comprehensive tool comprising three distinct scales: the Gender Stereotyped Attitude Scale (GSAS), which assesses the endorsement of traditional gender stereotypes; the Gender Role Activities Scale (GRAS), which measures beliefs about gender roles; and the Gendered Traits Inventory (GTI), which examines the attribution of personality characteristics to specific genders. The GAB-A was therefore designed to capture three complementary yet distinct components of gender stereotyping among adolescents: prescriptive attitudes, role attributions, and essentialist trait beliefs.
The present study had three primary aims: The first aim was to examine the factorial structure of each scale by testing the robustness of its latent dimensions through a split-sample approach, employing Exploratory Factor Analysis (EFA) in one subsample and Confirmatory Factor Analysis (CFA) in the other. The second aim was to investigate gender differences in endorsement of gender-related beliefs. In this regard, it was hypothesized that male students would report higher endorsement of prescriptive gender stereotypes, as assessed by the GSAS and GRAS, compared to female students. In contrast, no strong directional hypothesis was formulated for essentialist gender trait beliefs, measured by the GTI, and potential differences were explored. The third aim was to assess the psychometric validity of the scales by establishing measurement invariance across gender and school type, thereby ensuring the appropriateness of group comparisons. In addition, convergent validity was examined by testing associations between gender-related beliefs and theoretically relevant constructs, specifically aggressive behaviors and psychological distress. Based on prior literature linking gender stereotype endorsement to psychological adjustment difficulties (
Wong et al., 2017;
Levant et al., 2009), it was hypothesized that higher GAB-A scores would be positively associated with aggression, psychological distress, hostility, and alexithymia, and negatively associated with self-esteem.
4. Discussion
The present study developed and validated the Gender Stereotypes and Roles Adherence Battery for Adolescents (GAB-A), a multi-module assessment tool for measuring endorsement of gender stereotypes among Italian adolescents. Comprehensive psychometric evaluation across content validity, factorial structure, reliability, known-groups validity, convergent and discriminant validity, and measurement invariance demonstrates that the GAB-A provides a psychometrically sound instrument for research and applied contexts. Beyond confirming the technical adequacy of the scales, the validation process revealed several theoretically significant findings regarding the structure, correlates, and demographic patterns of gender stereotype endorsement that warrant detailed discussion.
4.1. Summary of Psychometric Properties
The GAB-A demonstrated strong psychometric properties across all three modules. Content validity assessment by a multidisciplinary expert panel (N = 12) yielded excellent scale-level indices (S-CVI/Ave = 0.967), with 97.8% of items meeting the I-CVI ≥ 0.78 threshold for acceptability. The split-sample cross-validation strategy—with EFA conducted on Sample A (n = 1479) and CFA on the independent Sample B (n = 1476)—provided robust evidence for the hypothesized factorial structures while protecting against overfitting.
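The content validity indices reported above can be illustrated with a minimal sketch. It assumes the common convention of a 4-point relevance scale on which ratings of 3 or 4 count as "relevant"; the rating scale, the function name, and the example ratings are assumptions for illustration and are not taken from the study.

```python
def content_validity_indices(ratings, relevant_min=3, item_cutoff=0.78):
    """Compute I-CVI per item, S-CVI/Ave, and the share of acceptable items.

    ratings: list of per-expert lists, one relevance rating per item
             (assumed 4-point scale; ratings >= relevant_min count as relevant).
    """
    n_experts = len(ratings)
    n_items = len(ratings[0])
    # I-CVI: proportion of experts who rate the item as relevant
    i_cvi = [
        sum(1 for expert in ratings if expert[j] >= relevant_min) / n_experts
        for j in range(n_items)
    ]
    # S-CVI/Ave: mean of the item-level indices
    s_cvi_ave = sum(i_cvi) / n_items
    # Share of items meeting the acceptability cutoff (0.78 in the text)
    share_acceptable = sum(1 for v in i_cvi if v >= item_cutoff) / n_items
    return i_cvi, s_cvi_ave, share_acceptable


# Hypothetical ratings: 12 experts, 3 items (not the study's data)
ratings = [[4, 4, 4] for _ in range(12)]
for expert in ratings[:2]:
    expert[1] = 2  # two experts judge item 2 as not relevant
for expert in ratings[:6]:
    expert[2] = 1  # six experts judge item 3 as not relevant

i_cvi, s_cvi_ave, share_acceptable = content_validity_indices(ratings)
print(i_cvi, round(s_cvi_ave, 3), round(share_acceptable, 3))
```

In this toy example the third item falls below the 0.78 cutoff and would be flagged for revision or removal.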
The GSAS (Gender Stereotyped Attitude Scale) emerged as a 17-item measure with three correlated factors: Traditional Stereotypes (GSAS-TS; 9 items), Violence and Sexuality Myths (GSAS-VM; 5 items), and Relational Control (GSAS-RC; 3 items). CFA demonstrated good-to-excellent fit (CFI = 0.950, TLI = 0.941, RMSEA = 0.049, SRMR = 0.037), and internal consistency was excellent (α = 0.894). This three-factor structure aligns with theoretical distinctions in the gender literature between traditional role beliefs, hostile attitudes toward women, and controlling relationship behaviors (
Glick & Fiske, 1996). The GRAS (Gender Role Activities Scale) demonstrated a two-factor structure distinguishing Leisure Activities (GRAS-LA; 5 items) from Social Roles (GRAS-SR; 9 items), with excellent fit (CFI = 0.989, TLI = 0.987, RMSEA = 0.055, SRMR = 0.049) and good reliability (α = 0.850). The GTI (Gendered Traits Inventory) showed a unidimensional structure with acceptable fit (CFI = 0.948, RMSEA = 0.075) and good reliability (α = 0.802). All scales achieved full or partial scalar measurement invariance across gender and school type, supporting valid group comparisons.
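For readers less familiar with the internal consistency coefficient reported for each scale, the following sketch shows how Cronbach's α is obtained from item and total-score variances. The response matrix is hypothetical, and the function is an illustration rather than the software routine used in the study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    # Sample variance of each item across respondents
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    # Sample variance of the total (summed) score
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 5 respondents x 3 items (not the study's data)
rows = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 5], [5, 5, 4]]
alpha = cronbach_alpha(rows)
print(round(alpha, 3))
```

Because α depends on how items are scored, recoding decisions of the kind discussed for the GTI can change it substantially.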
4.2. The Three-Module Structure: Related but Distinct Constructs
Inter-scale correlations revealed that the three GAB-A modules capture related but distinguishable facets of gender stereotype endorsement. The GSAS and GRAS showed a strong correlation (r = 0.653), indicating substantial shared variance (43%) while maintaining sufficient uniqueness (57%) to justify separate assessment. The GTI correlated more modestly with both the GSAS (r = 0.312) and GRAS (r = 0.440). This correlational pattern supports a hierarchical conceptualization wherein attitudinal and role-based stereotypes form a closely related cluster reflecting prescriptive beliefs about what men and women should do, whereas trait stereotypes represent more fundamental descriptive beliefs about what men and women are (
Eagly & Wood, 2012). A recent systematic review of the ambivalent sexism literature (
Bareket & Fiske, 2023) reinforces this distinction, demonstrating that hostile sexism functions primarily to protect men’s power, whereas benevolent sexism serves to guard traditional gender role arrangements. This pattern is consistent with the differentiation between the GSAS-VM (which captures hostile attitudes) and GSAS-TS (which reflects benevolent sexism and beliefs about complementary roles).
The stronger GTI–GRAS association (compared to GTI–GSAS) may reflect conceptual overlap between attributing personality traits to genders and believing genders are suited for corresponding activities, as both tap into essentialist reasoning about “natural” gender differences (
Haslam et al., 2000). From a social role theory perspective (
Eagly & Wood, 2012), individuals who endorse role-based stereotypes may be particularly likely to invoke essentialist trait explanations to justify those role assignments.
Within the GSAS, the three subscales showed substantial but not redundant intercorrelations (rs = 0.47–0.68), supporting the utility of subscale-level assessment. The GSAS-RC (Relational Control) subscale emerged as particularly noteworthy given its distinct pattern of external correlates. Although strongly correlated with the other GSAS factors, GSAS-RC showed the strongest association with physical aggression (r = 0.24) and was consistently and positively associated with distress in both gender groups, a pattern not observed for the other subscales. This suggests that beliefs legitimizing partner surveillance may represent a particularly problematic dimension of gender ideology, with unique implications for relationship behavior and psychological functioning. Furthermore, the unique link to distress suggests that such beliefs may not be purely ideological but also functionally related to relational insecurity. Distress may plausibly act as an antecedent or mediator, with more anxious individuals endorsing controlling beliefs as a defensive strategy. Thus, GSAS-RC may capture a dimension of gender ideology that is particularly intertwined with the individual’s psychological state, serving as a cognitive scaffold for underlying insecurity.
4.3. Gender Differences Across Scales: Theoretical Implications
The known-groups validity analyses revealed dramatically different patterns of gender differences across the three GAB-A modules, with important implications for theories of gender stereotype development and maintenance.
4.3.1. Large Gender Differences on Attitudinal and Role-Based Measures
For the GSAS and GRAS, males showed substantially higher scores than females, with effect sizes in the large range (GSAS: d = 1.07, 95% CI [0.99, 1.15]; GRAS: d = 0.88, 95% CI [0.81, 0.96], dlatent = 1.06). These findings replicate and extend prior research documenting that males more strongly endorse traditional gender stereotypes across diverse samples and measures (
Glick et al., 2000;
Swim et al., 1995). The pattern is consistent with recent evidence from the Italian adult population showing persistence of gender stereotypes regarding occupations and traits despite high educational attainment (
Carvalho Silva et al., 2024), and with findings from Spanish adolescents linking egalitarian attitudes to lower internalization of both hostile and benevolent sexism (
Bonilla-Algovia et al., 2024). From a system justification perspective (
Jost & Banaji, 1994), male endorsement of gender stereotypes may serve to legitimize existing social arrangements that disproportionately benefit men in terms of economic resources, political power, and social status.
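The effect sizes above are standardized mean differences. As a minimal sketch, Cohen's d and an approximate 95% confidence interval can be computed from group summary statistics as shown below. The large-sample standard error formula is one common approximation and is an assumption here (the source does not state which interval method was used); the input values are hypothetical.

```python
import math


def cohens_d_with_ci(mean_1, sd_1, n_1, mean_2, sd_2, n_2, z_crit=1.96):
    """Cohen's d (pooled SD) with an approximate large-sample confidence interval."""
    pooled_sd = math.sqrt(
        ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    )
    d = (mean_1 - mean_2) / pooled_sd
    # Common large-sample approximation to the standard error of d (assumption)
    se = math.sqrt((n_1 + n_2) / (n_1 * n_2) + d ** 2 / (2 * (n_1 + n_2)))
    return d, d - z_crit * se, d + z_crit * se


# Hypothetical summary statistics (not the study's data)
d, ci_low, ci_high = cohens_d_with_ci(3.0, 1.0, 100, 2.0, 1.0, 100)
print(round(d, 2), round(ci_low, 3), round(ci_high, 3))
```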
The consistency of gender differences across GSAS subscales is particularly informative. The largest effect emerged for Violence and Sexuality Myths (GSAS-VM; d = 1.04), suggesting that male adolescents in this sample are substantially more likely than female adolescents to endorse beliefs that normalize sexual coercion and intimate partner violence. This finding aligns with research linking masculine gender ideology to tolerance of relationship aggression (
Parrott & Zeichner, 2003) and highlights the potential relevance of stereotype reduction interventions for violence prevention.
4.3.2. Negligible Aggregate Differences Masking Strategic Self-Serving Biases
In stark contrast to the attitudinal measures, the aggregate gender difference on the Trait Inventory (GTI) appeared negligible (d = 0.11, 95% CI [0.04, 0.18]). At the aggregate level, male and female adolescents endorsed the same trait stereotypes at similar rates, namely that males are more aggressive and self-confident, while females are more sensitive and cooperative. This surface-level uniformity suggests that essentialist beliefs about gendered personality are deeply internalized cultural schemas that transcend personal gender identity (
Eagly & Steffen, 1984). In this case, respondents fail to recognize the socially constructed nature of the phenomenon, instead misinterpreting it as a personality trait. However, a granular analysis of the Self-Attribution Gap (Δ), defined as the difference between a group’s self-rating and the outgroup’s rating of them (
Supplementary Table S87), reveals that this consensus is actively negotiated to preserve positive social identity. Bonferroni-corrected z-tests across the 20 target-item combinations (p < αadj = 0.0025) identified significant self-serving biases that challenge the assumption of passive internalization.
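The logic of the Self-Attribution Gap test can be sketched as a two-proportion z-test evaluated against a Bonferroni-adjusted threshold. The sketch assumes a pooled-variance z-test and a family-wise α of 0.05 divided across 20 comparisons (consistent with the 0.0025 threshold, but not stated explicitly in the source). The group sizes of 1000 are placeholders chosen only to reproduce proportions of 32.5% and 12.6%; they are not the study's subsample sizes.

```python
import math

N_TESTS = 20                # target-item combinations
ALPHA_ADJ = 0.05 / N_TESTS  # Bonferroni-adjusted threshold (assumed family-wise 0.05)


def two_proportion_z_test(x_self, n_self, x_other, n_other):
    """Return (gap, z, two-sided p) for self-rating vs. outgroup-rating proportions."""
    p_self = x_self / n_self
    p_other = x_other / n_other
    pooled = (x_self + x_other) / (n_self + n_other)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_self + 1 / n_other))
    z = (p_self - p_other) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return p_self - p_other, z, p_value


# Placeholder counts: 325/1000 self-attributions vs. 126/1000 outgroup attributions
gap, z, p_value = two_proportion_z_test(325, 1000, 126, 1000)
significant = p_value < ALPHA_ADJ
print(round(gap, 3), round(z, 2), significant)
```

Under this scheme a gap counts as significant only if its p-value falls below the adjusted threshold, which is considerably stricter than the conventional 0.05 level.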
The analysis revealed two distinct forms of identity negotiation. First, Absolute Ingroup Bias (Directional Discordance): For the highly valued trait of Independence, the groups disagreed on the direction of the stereotype. Both males and females claimed the trait for themselves at significantly higher rates than the outgroup attributed it to them (males, Self: 32.5% vs. Other: 12.6%; females, Self: 20.2% vs. Other: 8.8%). This “double self-serving bias” (ΔM = +19.9%; ΔF = +11.4%, p < 0.001) indicates that for traits central to agency and autonomy, both genders reject outgroup dominance to maintain collective self-esteem.
Second, Relative Ingroup Bias (Intensity Discordance): For traits where the direction was agreed upon, the in-group consistently optimized the magnitude. Enhancement: While both groups agreed females are more Reasonable and Cooperative, female respondents endorsed these positive traits at significantly higher rates than males acknowledged (Δ between 9% and 14%, p < 0.001). Deflection: Conversely, for negative traits like Fragility, females engaged in defensive deflection, endorsing the trait significantly less (48.5%) than males attributed it to them (54.6%; Δ = −6.1%, p = 0.019).
Interestingly, males exhibited a distinct pattern regarding Unpredictability. While females attempted to deflect this negative trait (Δ = −8.0%), males admitted to it significantly more often than females accused them of it (Self: 21.5% vs. Other: 17.0%; p < 0.05). This divergence suggests the operation of a gendered double code, wherein females likely reject the term as a marker of emotional instability, whereas males embrace it as a proxy for agency and strategic autonomy.
To test for gender differences in bias strategies, a standardized Self-Serving Score (SSS) was computed for each item. A distinct asymmetry was found in bias expression: males exhibited a dominant strategy of self-enhancement (Mean SSSpos = 11.0% vs. SSSneg = −1.2%), significantly inflating their association with positive traits. In contrast, females utilized a dual strategy, showing moderate enhancement (7.9%) alongside significant deflection (4.9%).
These findings reframe the theoretical interpretation of the GTI. While the low Cohen’s d indicates shared access to cultural stereotypes, the item-level data suggest these schemas are not accepted passively. Instead, adolescents engage in a dynamic process of Social Identity negotiation (
Tajfel & Turner, 1979), accepting the broad cultural script while adjusting specific attributions to maximize positive distinctiveness. The GTI thus taps into a hybrid construct: the knowledge of descriptive norms (“how men and women are”) filtered through the protective lens of in-group favoritism.
The successful implementation of Schema B scoring for the GTI reinforces this interpretation of trait attribution as an active, identity-relevant process. Under Schema B, the egalitarian response (“It doesn’t matter”) is coded as 0, representing a refusal to engage in gender categorization. Crucially, counter-stereotypical responses are coded as 1, grouping them with traditional stereotypes rather than treating them as “opposites.”
This scoring structure aligns with our finding that both stereotype endorsement and self-serving reversals are manifestations of the same underlying psychological mechanism: gender salience. Whether an adolescent adheres to the traditional script or flips it to favor their in-group, they are engaged in gendered categorization. The dramatic improvement in reliability under Schema B (α improved from 0.515 to 0.802) confirms that the fundamental construct being measured is not a linear continuum from “Sexist” to “Progressive,” but a categorical distinction between those who use gender as a primary explanatory lens (for tradition or self-enhancement) and those for whom gender is irrelevant to personality attribution.
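The Schema B rule described above can be expressed as a short scoring function. The response labels ("male", "female", "no_difference") are hypothetical stand-ins for the questionnaire's actual response options; only the coding logic (egalitarian response = 0, any gendered attribution = 1) follows the text.

```python
EGALITARIAN = "no_difference"    # hypothetical label for "It doesn't matter"
GENDERED = {"male", "female"}    # hypothetical labels for gendered attributions


def score_schema_b(response):
    """Schema B: 0 for the egalitarian response, 1 for any gendered attribution.

    Traditional and counter-stereotypical answers both receive 1, because
    both involve categorizing the trait by gender.
    """
    if response == EGALITARIAN:
        return 0
    if response in GENDERED:
        return 1
    raise ValueError(f"Unexpected response: {response!r}")


def gti_total(responses):
    """Sum of Schema B item scores for one respondent."""
    return sum(score_schema_b(r) for r in responses)


# One hypothetical respondent answering four trait items
total = gti_total(["male", "female", "no_difference", "female"])
print(total)
```

Note that the function never needs to know which gender is the "traditional" answer for an item, which is what distinguishes this scheme from directional coding.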
4.4. Empirical Verification of Stereotype Directions: Cultural Change in Gender Beliefs?
Evidence of cultural evolution in gender stereotypes emerged across multiple GAB-A scales, necessitating empirical verification of item characteristics rather than reliance on classical theoretical assumptions.
4.4.1. Stereotype Direction Recoding in the GTI
The empirical verification of stereotype directions for the GTI revealed that two items required coding opposite to classical theoretical expectations. Unpredictability, traditionally associated with feminine emotionality (
Williams & Best, 1990), was empirically associated with males in the present sample (19.5% attributed to males vs. 14.7% to females). Reasonableness, classically associated with masculine rationality, was empirically associated with females (27.0% vs. 10.2%).
These findings may reflect cultural evolution in gender stereotypes among contemporary Italian adolescents. The association of unpredictability with males, rather than the historically stereotyped “emotional female,” may reflect contemporary framings of masculinity that emphasize impulsivity, risk-taking, and poor emotional regulation (
Levant et al., 2009). This interpretation is supported by our bias analysis: while females actively deflected this trait (ΔF = −8.0%), males showed a distinct pattern of “acceptance,” attributing it to themselves significantly more often than females attributed it to them (ΔM = +4.5%, p = 0.044), which suggests that male adolescents may paradoxically embrace it as a marker of agency rather than instability.
Similarly, the association of reasonableness with females may reflect benevolent sexism beliefs (Glick & Fiske, 1996) that characterize women as more mature, responsible, and sensible than men; alternatively, it could reflect an active reclamation of rationality by young women. Indeed, female respondents exhibited their strongest self-enhancement effect on this item (ΔF = +14.1%, p < 0.001), suggesting this shift is not just an imposed restriction to caretaking roles, but also a strategic rejection of historical narratives regarding female instability.
When items 7 and 10 were coded according to theoretical rather than empirical directions, reliability dropped to unacceptable levels (α = 0.359), demonstrating that coding decisions can dramatically affect scale psychometric properties.
4.4.2. Item Exclusions Reflecting Evolving Stereotypes in the GRAS
Similar evidence of cultural change emerged during GRAS item analysis, where two items were excluded partly because their gender associations appeared to be evolving (
Supplementary Table S84). The item “Talking on the phone” (rit = 0.29) showed weak psychometric properties. What was once a stereotypically feminine activity (extended phone conversations) has become gender-neutral in an era when smartphone use is universal among adolescents. Similarly, “Teaching” (rit = 0.28) was excluded partly because it was “perceived as gender-neutral” by contemporary respondents, despite teaching historically being stereotyped as a feminine profession in many cultural contexts.
A third excluded item, “Making scientific discoveries” (rit = 0.24), showed the lowest item-total correlation in the GRAS. The low discrimination may reflect the abstract nature of the concept, but it could also indicate increased egalitarian attitudes toward women in science among contemporary adolescents, a possibility consistent with educational initiatives promoting STEM participation among girls.
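The item-total correlations (rit) cited for the excluded items can be illustrated as follows. The sketch assumes the corrected variant, in which each item is correlated with the sum of the remaining items; the source does not specify whether the corrected or uncorrected form was used, and the data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corrected_item_total(rows):
    """Correlate each item with the sum of all other items (corrected r_it)."""
    k = len(rows[0])
    result = []
    for j in range(k):
        item = [r[j] for r in rows]
        rest = [sum(r) - r[j] for r in rows]  # total score excluding item j
        result.append(pearson_r(item, rest))
    return result


# Hypothetical responses: 5 respondents x 3 items (not the study's data)
rows = [[1, 1, 3], [2, 2, 1], [3, 3, 3], [4, 4, 1], [5, 5, 2]]
r_it = corrected_item_total(rows)
print([round(v, 3) for v in r_it])
```

In this toy example the third item correlates poorly with the rest of the scale and would be a candidate for exclusion, analogous to the low-rit items discussed above.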
4.4.3. Implications for Scale Development
These patterns across both the GTI and GRAS underscore the necessity of empirical verification before applying theoretical assumptions from classic cross-cultural research to contemporary adolescent populations. Gender stereotypes are not static cultural artifacts but evolving belief systems that respond to technological change (smartphone ubiquity), educational initiatives (women in STEM), and shifting cultural narratives about masculinity and femininity. Importantly, such evolution is not necessarily linear toward egalitarianism; recent research has documented patterns of “retrenchment” or countertrends among younger cohorts, suggesting that stereotype change may be nonmonotonic and context-dependent (
Palomino-Suárez & Aparicio García, 2025). Scale developers working with gender stereotype content should routinely verify stereotype directions and item functioning empirically rather than assuming that classical theoretical expectations will hold in new populations or time periods.
4.5. Stereotype–Wellbeing Relationships: Gender Moderation and Simpson’s Paradox
The criterion validity analyses revealed complex, gender-moderated relationships between stereotype endorsement and psychological wellbeing that require careful interpretation.
4.5.1. Aggregate Correlations: A Misleading Pattern
Aggregate correlations between GSAS/GRAS scores and psychological wellbeing measures initially showed an unexpected pattern: negative associations with distress (GSAS: r = −0.149; GRAS: r = −0.184), hostility, and alexithymia, alongside positive associations with self-esteem. If taken at face value, these correlations would suggest a “protective” effect of stereotype endorsement, implying that holding traditional gender beliefs is associated with better psychological outcomes.
4.5.2. Within-Group Patterns: Reversing the Aggregate Effect
Gender-stratified analyses revealed that the aggregate pattern was a statistical artifact produced by confounding between gender, stereotype endorsement, and wellbeing outcomes. Among males, stereotype endorsement showed positive associations with distress (GSAS: r = 0.066, p = 0.007), hostility (r = 0.085, p < 0.001), and alexithymia (r = 0.095, p < 0.001), and a negative association with self-esteem (r = −0.075, p = 0.002). Among females, these associations were non-significant or weakly negative.
This pattern constitutes a textbook example of Simpson’s paradox (
Kievit et al., 2013), wherein aggregate correlations can be misleading when subgroup differences exist. Notably, this statistical phenomenon has recently been documented in the gender equality literature more broadly, where country-level associations between gender equality and outcomes such as occupational segregation reverse direction when individual-level variation is considered (
Berggren & Bergh, 2025). The “protective” aggregate correlation arose because: (a) males had substantially higher stereotype endorsement than females, (b) males and females differed on wellbeing variables, and (c) within each gender group, the relationship between stereotypes and wellbeing was null (females) or in the theoretically expected positive direction (males). The aggregation across groups created a spurious negative correlation driven by between-group rather than within-group variation (see
Figure 3).
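The mechanism described in points (a) to (c) can be reproduced with a small synthetic example. The values below are constructed purely for illustration and are not the study's data; the within-group correlations are deliberately exaggerated relative to those reported so that the reversal is easy to see.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Deterministic "noise" so the example is fully reproducible
noise = [0.1 * (-1) ** i for i in range(10)]

# (a) Males: higher stereotype endorsement; (b) lower distress on average;
# (c) positive within-group relation between endorsement and distress
males_x = [3.0 + 0.1 * i for i in range(10)]
males_y = [1.0 + 0.5 * (x - 3.0) + e for x, e in zip(males_x, noise)]

# Females: lower endorsement, higher distress, no systematic within-group relation
females_x = [1.0 + 0.1 * i for i in range(10)]
females_y = [3.0 + e for e in noise]

r_males = pearson_r(males_x, males_y)
r_females = pearson_r(females_x, females_y)
r_all = pearson_r(males_x + females_x, males_y + females_y)
print(round(r_males, 2), round(r_females, 2), round(r_all, 2))
```

The pooled correlation is strongly negative even though neither subgroup shows a strong negative relationship, because the pooled estimate is dominated by the difference between group means.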
4.5.3. Theoretical Implications
The gender-moderated pattern has important theoretical implications. For male adolescents, endorsement of traditional gender stereotypes was associated with poorer psychological outcomes, specifically higher distress, hostility, and alexithymia, and lower self-esteem. This finding is consistent with research on the costs of traditional masculinity ideology, which has been linked to restricted emotionality, reluctance to seek help, greater self-stigma regarding psychological services, and both internalizing and externalizing problems (
Wong et al., 2017;
Üzümçeker, 2025).
For female adolescents, stereotype endorsement showed essentially no relationship with wellbeing outcomes. This null pattern may reflect the complex role of benevolent sexism for women, which can simultaneously offer subjective benefits (e.g., feeling protected and cherished) while functioning to maintain inequality (Glick & Fiske, 2001). Alternatively, the low variance in female stereotype endorsement (substantially lower means with compressed distributions) may have attenuated correlations. Future research should examine whether the null pattern holds across samples with greater variability in female stereotype endorsement.
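The attenuation argument can be sketched with a short simulation of range restriction. The slope, cutoff, and sample size are hypothetical and serve only to illustrate the general statistical point: when the predictor is observed over a compressed range, the observed correlation shrinks even though the underlying relationship is unchanged.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(7)

TRUE_SLOPE = 0.3   # hypothetical effect of endorsement on the outcome
CUTOFF = -0.5      # hypothetical ceiling that compresses the distribution

# Full-range population with a fixed linear relationship plus noise.
endorsement = [rng.gauss(0.0, 1.0) for _ in range(20000)]
outcome = [TRUE_SLOPE * e + rng.gauss(0.0, 1.0) for e in endorsement]
r_full = pearson(endorsement, outcome)

# Keep only low scorers, mimicking a group with low means and little spread.
restricted = [(e, o) for e, o in zip(endorsement, outcome) if e < CUTOFF]
r_restricted = pearson([e for e, _ in restricted], [o for _, o in restricted])

print(f"full range r       = {r_full:.3f}")
print(f"restricted range r = {r_restricted:.3f}")
```

The restricted coefficient is smaller than the full-range coefficient although both subsamples share the same generating slope, which is why the null pattern among females cannot be separated from a variance artifact without more variable samples.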
Two exceptions to the general pattern merit emphasis. First, the GSAS-RC subscale (Relational Control) showed consistent positive associations with distress in both genders (males: r = 0.083, p < 0.001; females: r = 0.062, p = 0.026). Beliefs legitimizing partner surveillance appear to be uniquely associated with distress regardless of gender, perhaps because such beliefs implicate interpersonal mistrust and control dynamics that are inherently stressful. Second, the GTI showed positive correlations with distress in both subgroups (males: r = 0.110, p < 0.001; females: r = 0.087, p = 0.002). Essentialist trait attributions, unlike prescriptive role beliefs, appear to be uniformly associated with poorer psychological outcomes across genders. This may be due to the mediating role of psychological rigidity, intolerance of uncertainty and need for cognitive closure (Roets et al., 2015). Individuals with a rigid cognitive style often experience higher distress due to an intolerance of ambiguity and simultaneously endorse essentialist stereotypes to impose structure and certainty on their social world (Allport, 1954; Bastian & Haslam, 2006).
The methodological lesson of this analysis extends beyond the present study. Researchers investigating relationships between stereotype endorsement and psychological outcomes should routinely test for demographic confounding and moderation. Failure to do so risks drawing spurious conclusions from aggregate analyses.
4.6. Measurement Invariance: Implications for Group Comparisons
Measurement invariance testing represents a critical prerequisite for valid group comparisons (Chen, 2007; Putnick & Bornstein, 2016). The present findings have important implications for how the GAB-A should be used in research and practice.
4.6.1. Full Scalar Invariance Across School Type
All three GAB-A scales achieved full scalar invariance across the three school types (Academic, Technical, Vocational). This finding indicates that the scales measure the same constructs in the same way across educational tracks, with equivalent factor loadings and item thresholds.
The school type differences observed in the present study, though smaller than gender differences (GSAS: η² = 0.064; GRAS: η² = 0.021; GTI: η² = 0.003), have practical significance for targeting interventions. Technical and vocational students showed the highest stereotype endorsement on most scales, suggesting that educational context (including social background, curriculum, peer culture, and occupational expectations associated with different tracks) may shape gender-related attitudes. The documented invariance supports valid identification of elevated endorsement profiles for targeted prevention programs.
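For readers less familiar with the effect size reported above, the following sketch shows how η² is obtained in a one-way design as the ratio of between-group to total sum of squares. The toy scores are hypothetical, and the verbal labels follow Cohen's commonly cited conventions (0.01, 0.06, 0.14), which are rough guides rather than thresholds adopted in this study.

```python
def eta_squared(groups):
    """Eta squared for a one-way design: SS_between / SS_total."""
    all_scores = [s for g in groups for s in g]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((s - grand_mean) ** 2 for s in all_scores)
    ss_between = sum(
        len(g) * ((sum(g) / len(g)) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total


def cohen_label(eta2):
    """Verbal label under Cohen's conventional benchmarks (an assumption here)."""
    if eta2 >= 0.14:
        return "large"
    if eta2 >= 0.06:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "negligible"


# Hypothetical toy scores for three school types (not study data).
toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
toy_eta2 = eta_squared(toy_groups)

# Effect sizes as reported in the text for school type differences.
reported = {"GSAS": 0.064, "GRAS": 0.021, "GTI": 0.003}
labels = {scale: cohen_label(value) for scale, value in reported.items()}

print(toy_eta2)
print(labels)
```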
4.6.2. Scalar Invariance Across Gender: Scale-Specific Patterns
The measurement invariance findings across gender were more complex and scale-specific. The GSAS achieved full scalar invariance with all 17 items, indicating that the scale functions equivalently for male and female adolescents. This is a particularly robust finding given the large gender differences in mean levels; it indicates that these mean differences reflect genuine attitudinal differences rather than differential item functioning.
The GRAS similarly achieved full scalar invariance after removal of one item (Driving) that showed substantial differential item functioning (female λ = 0.39 vs. male λ = 0.70). This DIF pattern is substantively interesting: driving appears to be more central to males’ conception of gendered activities than to females’. The removal of this item, already excluded based on low item-total correlations, ensured valid measurement across gender groups.
The GTI achieved partial scalar invariance, with 8 of 10 items (80%) functioning equivalently across gender groups. Two items (Independence and Reasonableness) showed threshold non-invariance. However, considering our item-level bias analysis, this non-equivalence appears to stem not from differing semantic interpretations, but from competing self-enhancement strategies. These two items elicited the strongest self-serving biases in the entire scale: Independence showed the largest male enhancement (Δ = +19.9%) and the second largest female enhancement (Δ = +11.4%), while Reasonableness showed the largest female enhancement (Δ = +14.1%) alongside the second largest male enhancement (Δ = +12.0%). The lack of scalar invariance here likely captures the intensity of this identity negotiation, where both groups aggressively claim these high-value traits, rather than a failure of measurement validity. Given that 80% of items retained scalar constraints, comparison of latent means remains defensible (Putnick & Bornstein, 2016), though researchers should interpret gender comparisons on these specific traits as reflective of both stereotype content and active identity management.
4.7. Comparison with Existing Measures
The GAB-A addresses several gaps in the existing measurement landscape for gender stereotype assessment among adolescents.
4.7.1. Relation to Ambivalent Sexism Measures
The Ambivalent Sexism Inventory (ASI; Glick & Fiske, 1996) remains the most widely used measure of sexist attitudes, distinguishing hostile sexism (antipathy toward women who violate traditional gender norms) from benevolent sexism (subjectively positive but patronizing attitudes toward women). The GSAS shows conceptual overlap with the ASI, particularly in the GSAS-TS subscale, which captures beliefs about complementary gender roles similar to the ASI’s benevolent sexism subscale.
However, the GAB-A extends beyond the ASI in several ways. First, the GSAS-VM subscale specifically targets beliefs about gender-based violence and sexual coercion, content that is not systematically assessed by the ASI but is critically important for adolescent populations given developmental considerations around dating violence and consent. Second, the GSAS-RC subscale assesses partner surveillance beliefs that have gained relevance in the digital age (e.g., social media password access, location tracking). Third, the GRAS and GTI provide a systematic assessment of activity and trait stereotypes, which are not captured by purely attitudinal measures.
4.7.2. Adolescent-Appropriate Content
Many existing stereotype measures were developed and validated with adult samples, raising questions about developmental appropriateness and cultural relevance for adolescents. The GAB-A was developed specifically for Italian adolescents, with items reflecting developmentally relevant domains (e.g., football/soccer as a stereotypically male sport in Italian culture, academic subject aptitudes, leisure activities). The normative data and cutoffs established in the present study provide age-appropriate reference points that adult-normed measures cannot offer.
4.7.3. Multi-Module Assessment
The modular structure of the GAB-A, assessing attitudes (GSAS), role beliefs (GRAS), and trait attributions (GTI) separately, provides more differentiated assessment than single-dimension measures. The differential pattern of findings across modules (e.g., large vs. negligible gender differences; different correlational patterns with wellbeing) demonstrates the value of this multi-faceted approach. Researchers and practitioners can select modules relevant to their specific assessment goals while maintaining psychometric integrity.
4.8. Practical Applications
4.8.1. Identifying Elevated Endorsement Profiles
The normative data and operational cutoffs established in this study enable practical use of the GAB-A for identifying elevated endorsement profiles. The combined classification system for the GSAS identifies approximately 27% of the total sample as having “Elevated” or “Very Elevated” stereotype endorsement (sum score ≥ 39). This proportion rises to 45% among males and drops to 8% among females, highlighting the importance of gender-stratified interpretation.
The GSAS-VM (Violence and Sexuality Myths) and GSAS-RC (Relational Control) subscales may be particularly valuable for screening in contexts concerned with relationship violence prevention. Elevated scores on these subscales showed the strongest associations with physical aggression and were associated with distress across both genders. The relevance of such screening is supported by recent evidence linking sexism to teen dating violence, with gender-differentiated mediation pathways involving personal distress in males and assertiveness deficits in females (Villanueva-Blasco et al., 2024). School-based prevention programs targeting dating violence might use these subscales to identify elevated endorsement profiles for more intensive intervention.
4.8.2. Intervention Targeting
The school type differences documented in this study, combined with the demonstrated measurement invariance across educational tracks, support the use of the GAB-A for targeting interventions by educational context. Technical and vocational students showed consistently higher stereotype endorsement than academic students, suggesting that prevention programs might productively focus resources on technical and vocational educational settings. Because the effects of gender and school type were additive (non-interactive), male students in the non-academic tracks combine both sources of elevated endorsement and therefore represent a particularly vulnerable group.
4.8.3. Monitoring Intervention Effects
The established reliability and measurement invariance properties of the GAB-A support its use for monitoring change over time. The demonstrated stability of factor structures across split samples and demographic subgroups suggests that pre-post differences would reflect genuine attitude change rather than measurement artifacts. Researchers evaluating gender equality interventions should consider the GAB-A as an outcome measure, particularly given the availability of multiple construct-specific subscales that can detect targeted changes.
4.9. Limitations
Several limitations warrant consideration when interpreting the present findings.
4.9.1. Sample Characteristics
The validation sample was limited to Italian adolescents from the Province of Rome, raising questions about generalizability to other Italian regions, age groups, or cultural contexts. The overrepresentation of males (56.4%) reflects differential enrollment patterns across Italian secondary education tracks rather than sampling bias. Future validation efforts should include samples from other Italian regions, different age groups, and cross-cultural samples to establish broader generalizability.
4.9.2. Cross-Sectional Design
The present validation is based on cross-sectional data from the first wave of data collection, which precludes causal inference regarding the observed associations between stereotype endorsement and psychological outcomes. Although stereotype endorsement was observed to be associated with distress among males, it cannot be determined whether stereotype beliefs cause psychological difficulties, whether psychological difficulties lead to stereotype adoption, or whether third variables (e.g., family context, media exposure) influence both. However, the present study is part of the MIB (Mutamenti Interazionali e Benessere) longitudinal research project, which will follow participating students across all five years of Italian secondary school. Future waves of data collection will enable testing of temporal precedence, developmental trajectories of stereotype endorsement, and potential bidirectional effects between stereotypes and psychological outcomes.
4.9.3. Self-Report Methodology
All measures relied on self-report, introducing potential shared method variance and social desirability bias. Although response style analyses indicated minimal impact of acquiescence and extreme responding on scale scores, social desirability effects cannot be ruled out, particularly for overtly prejudicial items in the GSAS-VM subscale. Future research might incorporate implicit measures or behavioral assessments to complement self-report data.
4.9.4. Measurement Invariance Limitations
Although the present study established measurement invariance across gender and school type, invariance testing for the GTI was limited by convergence issues with the sparse categorical data, resulting in partial rather than full scalar invariance. The two non-invariant items (Independence, Reasonableness) may be interpreted somewhat differently by males and females. Researchers using the GTI for gender comparisons should acknowledge this limitation.
4.9.5. Criterion Measure Selection
The criterion measures used for convergent and discriminant validity assessment (BPAQ, K10, RSES, PAQ-S) represent a limited sampling of relevant constructs. Future validation efforts should examine associations with additional theoretically relevant variables, including relationship quality, dating behavior, academic/occupational aspirations, and behavioral measures of discrimination and prejudice. In addition, the percentile-based thresholds proposed in the present study are distribution-based and should be interpreted as normative or operational categories rather than risk cutoffs.
4.9.6. Cluster Sampling Design
Although the cluster sampling design (students nested within schools) may introduce non-independence of observations, several design features mitigate this concern: the two-dimensional stratification of school selection (by educational track and geographical area) reduces systematic between-school variability, the inclusion of only first-year students limits school-level socialization effects, and the standardized CAPI administration protocol minimizes context effects. Nevertheless, formal multilevel modeling with school-level identifiers would allow precise partitioning of variance components, and future waves of the longitudinal study should incorporate this approach.
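The practical stakes of non-independence can be gauged with the standard Kish design effect, deff = 1 + (m − 1) × ICC, where m is the average cluster size and ICC the intraclass correlation. The sketch below uses hypothetical inputs, since school-level variance components were not estimated here, and assumes roughly equal cluster sizes.

```python
def design_effect(mean_cluster_size, icc):
    """Kish design effect for cluster samples with roughly equal cluster sizes."""
    return 1.0 + (mean_cluster_size - 1.0) * icc


def effective_sample_size(n, mean_cluster_size, icc):
    """Sample size of a simple random sample with equivalent precision."""
    return n / design_effect(mean_cluster_size, icc)


# Hypothetical inputs for illustration only (not values from this study).
N_STUDENTS = 1000
MEAN_CLUSTER_SIZE = 25
ICC = 0.05

deff = design_effect(MEAN_CLUSTER_SIZE, ICC)
n_eff = effective_sample_size(N_STUDENTS, MEAN_CLUSTER_SIZE, ICC)

print(f"design effect         = {deff:.2f}")
print(f"effective sample size = {n_eff:.0f}")
```

Even a modest intraclass correlation can shrink the effective sample size considerably when clusters are large, which is why the multilevel analyses planned for future waves matter for the precision of standard errors.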
4.9.7. Cognitive Pretesting
Although a formal cognitive pretesting study (e.g., think-aloud protocols) was not conducted, the iterative refinement process—grounded in a decade of direct engagement with the target population—provides strong ecological validity for item content and wording. Future instrument development should include formal cognitive testing.
4.10. Future Directions
Several directions for future research emerge from the present findings.
4.10.1. Longitudinal and Developmental Research
Longitudinal designs could establish whether stereotype endorsement predicts psychological outcomes over time and how stereotypes develop and change across adolescence. The negligible gender differences observed for trait stereotypes (GTI) raise questions about whether this pattern is present earlier in development or emerges through adolescence. Tracking stereotype trajectories from middle school through late adolescence would illuminate developmental processes.
4.10.2. Cross-Cultural Validation
The GAB-A was developed and validated in an Italian context, and its applicability to other cultural settings requires empirical evaluation. Some content (e.g., football as stereotypically male) may be culture-specific, while other content (e.g., beliefs about domestic roles) may show cross-cultural invariance. Cross-cultural validation studies, particularly in Southern European and Mediterranean contexts with similar gender role traditions, would enhance the international utility of the battery.
4.10.3. Intervention Research
The present findings provide a foundation for intervention research. Experimental studies could test whether evidence-based educational programs produce measurable reductions in GAB-A scores and whether such reductions mediate improvements in relationship quality or reductions in aggressive behavior. The demonstrated measurement invariance supports valid pre-post comparisons within treatment and control groups.
4.10.4. Extension to Gender-Diverse Populations
The present validation focused on binary gender comparisons (male vs. female), excluding 20 participants who identified as non-binary or did not report gender. Future research should explicitly include and validate the GAB-A with gender-diverse populations. Theoretical frameworks on gender stereotypes have been critiqued for implicitly assuming binary gender categories (Hyde et al., 2019), and empirical extension to non-binary individuals represents an important direction. Leaper (2024) has recently proposed an expanded developmental model of ambivalent sexism that addresses gender-diverse youth and emphasizes cultural variation in stereotype content, providing both theoretical grounding and methodological rationale for such validation efforts.
4.10.5. Integration with Implicit Measures
The present study relied exclusively on explicit, self-report measures. Integration with implicit measures (e.g., Implicit Association Test) would enable examination of whether explicit and implicit stereotype measures show convergent or divergent patterns and whether implicit measures add predictive validity for behavioral outcomes beyond explicit measures.
5. Conclusions
The Gender Stereotypes and Roles Adherence Battery for Adolescents (GAB-A) provides a psychometrically sound, multi-dimensional assessment tool for measuring gender stereotype endorsement among Italian adolescents. The three modules (GSAS, GRAS, and GTI) capture related but distinct facets of gender-related beliefs, with trait stereotypes (GTI) emerging as a qualitatively different construct characterized by near-uniform endorsement across gender groups.
The relationship between stereotype endorsement and psychological wellbeing proved to be moderated by gender, with important methodological implications. What initially appeared to be a paradoxical “protective” pattern was revealed through stratified analyses to be a Simpson’s paradox artifact. Among male adolescents, stereotype endorsement showed the theoretically expected positive association with psychological distress, while among females, no significant association emerged. These findings underscore the importance of testing for demographic confounding and moderation in stereotype research.
The documented measurement invariance across gender and school type supports valid group comparisons, enabling both research applications (testing theoretical predictions about group differences) and practical applications (identifying elevated endorsement profiles for interventions). The present findings provide strong initial evidence of structural validity, internal consistency, and measurement invariance, together with gender-stratified normative data. The interdisciplinary design of the study, which made the validation of the GAB-A possible, also enriched the analysis with findings on the key factors underlying the generational reproduction of gender stereotypes. Furthermore, the extensive psychometric component of the Interactional Changes and Wellbeing (MIB) project questionnaire revealed several noteworthy associations with adolescents’ psychological wellbeing; however, at this stage of the project, causal relationships cannot yet be established. For researchers, the GAB-A offers a validated instrument for investigating gender stereotype development and consequences in Italian adolescent populations. For practitioners, it provides a tool with normative data and empirically derived cutoffs for identifying elevated stereotype endorsement that may warrant intervention.