Appendix B. Comparison of Current and Former Studies
How do the findings of the current study compare with those from prior studies? Here we examine all previous studies of academic letters, some of which have reported results that diverge from the current study, which, as shown earlier, found few gender differences, most of which favored women. In this section we highlight several possible reasons for any inconsistencies between these results and former ones.
Date: We used an open-ended analysis that reversed the word count methodology to provide an internal validation of the word count analyses. Instead of starting with “recommendation” words in published word counts and then measuring their gender frequency, we started with words with gendered frequency and looked for those indicating strength of recommendation. This analysis does not depend on the word counts used in the main analysis; it is based on the gendered associations of all words in actual letters. Between 30% and 86% of the patterns in each word count category matched a word in the letter text, but over 75% of the words in the base vocabulary considered below were not found in any word count. Therefore, because a great majority of the words in the letters are not in the word count categories, it is possible for large differences in the letters to be overlooked. We examined three variations of mapping letters to sets of tokens: (1) within-word-category word use differences based on rate-per-word frequencies, (2) a broad examination of all common words based on rate-per-word frequencies, and (3) an examination of candidate descriptors based on fraction-of-letters frequencies.
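The two frequency measures named above can be illustrated with a minimal Python sketch. The tokenizer, the function names, and the toy letters are assumptions made for illustration and are not the study's actual pipeline; in particular, the study matched word-count patterns, whereas this sketch matches whole words only.

```python
import re


def tokenize(text):
    """Lowercase a letter and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def rate_per_word(letters, vocabulary):
    """Occurrences of vocabulary words divided by the total number of words."""
    total_words = 0
    hits = 0
    for text in letters:
        tokens = tokenize(text)
        total_words += len(tokens)
        hits += sum(1 for token in tokens if token in vocabulary)
    return hits / total_words if total_words else 0.0


def fraction_of_letters(letters, vocabulary):
    """Share of letters that contain at least one vocabulary word."""
    if not letters:
        return 0.0
    containing = sum(1 for text in letters if vocabulary & set(tokenize(text)))
    return containing / len(letters)


if __name__ == "__main__":
    # Invented example letters and word list, not data from the study.
    toy_letters = [
        "She is an outstanding and exceptional physicist.",
        "He works hard.",
    ]
    standout = {"outstanding", "exceptional"}
    print(rate_per_word(toy_letters, standout))
    print(fraction_of_letters(toy_letters, standout))
```

The two measures can disagree: a word repeated many times in a few letters raises the rate per word without raising the fraction of letters, which is one reason to examine both.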
Different Fields: Some previous studies examined applicants for jobs in fields that are less math-intensive and have much greater female representation than EPP (for example, medicine, biology, and geoscience). It is therefore even more curious that some of these studies found gender bias in letters of recommendation, while we found little evidence of gender bias in EPP. We therefore urge extreme caution in generalizing our results beyond EPP and social science, including even to other subfields of physics such as theoretical particle physics. The core values of a field such as EPP, which requires large collaborative efforts (hence, a possibly greater emphasis on communal and grindstone traits), may differ from those of fields in which research is carried out by much smaller collaborative teams or even by lone individuals. Again, the social sciences are usually less collaborative and math-intensive than EPP and have dramatically higher representation of women, so we expected the social science effects, if anything, to be markedly smaller, which they were not.
Nationality of Writers: Some prior findings that are inconsistent with the current findings, particularly those of Dutt et al., were based on samples containing a large proportion of letters written by recommenders from outside America. That study found that non-US letters were much shorter and less enthusiastic than letters written by American scientists. Thus, there is evidence that letter length and tone vary significantly across cultures, with letters written by Americans being longer and more positive than those emanating from other regions. The vast majority of the letters in the present study were written by Americans; in contrast, in Dutt et al., 530 out of 1224 letters were written by scientists outside the US. Thus, the current study clarifies what (until now) has been a set of seemingly contradictory findings that result from studies that are mostly small-scale and based on samples with important limitations (such as the unavailability of letters for unsuccessful applicants, or subsamples of female writers too small to support gendered analyses). Some, but not all, of these earlier studies have reported evidence of gender bias; however, there are many exceptions and contradictions that we enumerate next.
Sample Size: The current study is based on the largest sample of letters thus far examined, 2206 letters. Prior studies varied from 237 letters in Li et al. to 1224 letters in Dutt et al. This is important because the smaller samples precluded examining correlations between the gender of applicant and the gender of the writer. As noted elsewhere in this article, it is important to remember that candidates typically apply to many positions and letters for a given candidate rarely differ substantively from one search to another. Therefore, the letters in our samples are likely to resemble those submitted by these applicants to other searches beyond these two institutions. Nonetheless, caution is needed when generalizing these results; for example, the internal culture of EPP, with large collaborations requiring communal behavior, may differ from other fields of physics.
Dependent Variables: With the exception of McCarthy and Goffin, the current study employs the largest number of dependent variables: length, agentic, communal, grindstone, standout, achievement, positive affect, negative affect, ability, homophily, and writer status. The current study also included a bottom-up internal validation using words not limited to the word counts used by prior researchers. This analysis began with words associated with gender that were not contained in any of the word counts; that is, after finding few gender differences in the myriad dependent measures examined, the current study undertook an independent test that confirmed no other terms associated with gender were related to ability. Instead, they were words that described research topics that tended to be gendered (as our findings perhaps imply, with women in developmental social science being more likely to study family processes and men to study neuroscience).
Control of Background: Only a few prior studies attempted to control for applicants’ personal characteristics (by using number of publications, conference presentations, class rank, and/or awards as covariates). These are less than ideal controls for other ways applicants can differ (e.g., status of their mentor, quality of the journals in which they publish, prestige of their university, their contribution to multi-authored studies). In contrast, the current study included an analysis of a subsample of 918 of the 2206 letters for candidates with letters from writers of both genders, neither of whom was the candidate’s primary advisor. Each candidate thereby contributed to the measures for each writer gender. This minimizes differences between male and female candidates caused by gender differences in applicants’ personal characteristics. However, it does not address the possibility that differences in letters for male and female candidates might reflect unmeasured differences in candidate qualifications.
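A minimal Python sketch of how such a within-candidate subsample could be selected and compared is given below. The record fields (`candidate_id`, `writer_gender`, `from_primary_advisor`) and both function names are hypothetical, and the selection rule is one reading of the description above rather than the study's actual procedure.

```python
from collections import defaultdict
from statistics import mean


def paired_subsample(letters):
    """Group non-advisor letters by candidate, keeping only candidates
    who have at least one letter from each writer gender."""
    by_candidate = defaultdict(list)
    for letter in letters:
        if not letter["from_primary_advisor"]:
            by_candidate[letter["candidate_id"]].append(letter)
    return {
        candidate: group
        for candidate, group in by_candidate.items()
        if {"F", "M"} <= {item["writer_gender"] for item in group}
    }


def within_candidate_differences(paired, measure):
    """For each candidate, mean of `measure` in letters by female writers
    minus the mean in letters by male writers."""
    differences = {}
    for candidate, group in paired.items():
        by_gender = defaultdict(list)
        for item in group:
            by_gender[item["writer_gender"]].append(item[measure])
        differences[candidate] = mean(by_gender["F"]) - mean(by_gender["M"])
    return differences
```

Because each retained candidate appears on both sides of the comparison, stable candidate characteristics cancel out of the per-candidate difference.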
Comparing Disciplines that Differ in Women’s Representations: The current study contrasted two disciplines in which women are disparately represented. Only Dutt et al. examined a math-intensive field, analyzing letters for postdoctoral positions in geoscience. The current study examined an even more male-dominated field, EPP (Elementary Particle Physics), and contrasted it with fields with high female representation, psychology and sociology. This contrast provided a principled basis for the expectation of larger correlations between gender and the dependent variables within the less female-represented field, EPP.
In sum, the current study was much larger and contained more measures than most former studies.
Trix and Psenka analyzed 300 letters written on behalf of applicants for faculty positions at a single U.S. medical school. The letters were written in the mid-1990s, preceding the secular movements that occurred two decades later. Unfortunately, Trix and Psenka only had access to letters written on behalf of successful applicants, precluding a comparison with unsuccessful letters. They also were unable to analyze their data as a function of letter writer’s gender, due to their small sample, rendering their results more narrative than quantitative. Like the current study, they found no differences in the frequency of letters that contained standout terms (58% for men vs. 63% for women). However, they found that letters written for men tended to repeat standout terms: on average, such letters contained 2.0 standout terms vs. only 1.5 for women’s letters. Because of the unavailability of letters for unsuccessful candidates, there is no way of knowing how instrumental these features were in hiring decisions. Trix and Psenka also found that letters for women contained twice as many “doubt-raisers” as did letters for men. Finally, like Dutt et al., they found that letters by writers from Europe were shorter: “Even letters from Canada were less hyperbolic than those from the USA. But we did not have enough letters to make more than general observations.” Thus, Trix and Psenka’s study was limited by its small sample size, a single institution, and a single field (medicine), and the authors were unable to analyze letters for unsuccessful candidates or as a function of the gender of the writer, precluding many of the analyses in the current study. Despite these limitations, Trix and Psenka provided a rich narrative analysis that influenced the hypotheses in the current study.
Messner and Shimahara’s study was more than twice as large as Trix and Psenka’s (763 letters vs. 300 letters); the letters were written on behalf of applicants for a 1-year residency in otolaryngology/head and neck surgery at Stanford University’s Medical School. Only 8.8% of letters were written by women, which limited gender-of-applicant by gender-of-writer analyses. They found that all letters were quite positive, which echoes Dutt et al.’s finding that roughly 98.5% of letters were either good or excellent. However, Messner and Shimahara found that letters written for women contained more communal terms (e.g., team player, compassionate), and male writers were more likely to mention a female applicant’s physical appearance. We found zero such occurrences in EPP, although there were statements in social science letters by female writers that male candidates were “adorable” or “charming”, as discussed earlier. Messner and Shimahara found that 86% of all letters contained standout terms (averaging 2.6 per letter). However, like the current findings (and unlike Trix and Psenka’s), standout terms did not differ by gender of applicant. They also found that doubt-raisers (present in 19% of letters) did not differ by gender of applicant, unlike Madera et al. and Trix and Psenka, both of which reported more “doubt-raisers” for women applicants (Trix and Psenka 2003). Finally, unlike most studies (e.g., Trix and Psenka’s), Messner and Shimahara did not find a difference in letter length as a function of gender of writer or gender of applicant, nor did they find a correlation between letter length and favorability. However, the mean length of their letters was less than half of the length in the current study: female writers’ letters averaged 345 words and male writers’ letters averaged 328 words (not statistically significant); in contrast, the mean length of letters in the current study ranged from 915 to 960 words, which is considerably longer than letters in other studies. Our much longer letters are likely to be qualitatively different, with more depth and detail. Comparisons between these two sets are therefore complicated by the evident difference in the commitment of the writer.
In a larger study of letters written for geoscience postdocs, Dutt et al. analyzed 1224 recommendation letters, submitted between 2007 and 2012 by writers from 54 countries (43% were from outside the U.S.), for postdoctoral fellowships in a single field, geosciences (Dutt et al. 2016). Unlike Messner and Shimahara, Dutt and her colleagues found that letters written for women contained fewer words. However, like both Messner and Shimahara and the current study (but unlike Trix and Psenka), Dutt et al.’s letters contained similar numbers of standout words and more grindstone words. Although these researchers found that female applicants were only half as likely as men to receive excellent letters, they found no evidence that male and female recommenders differed in their likelihood to write stronger letters for male applicants. Like Trix and Psenka, they also found that letters from American writers were longer, averaging 561 words, whereas those written by Africans (305 words), South Asians (275 words), and East Asians (320 words) were notably shorter; even Europeans, New Zealanders, and Australians wrote shorter letters (345 words) than American writers. In contrast to the current findings, Dutt et al. concluded: “these results suggest that women are significantly less likely to receive excellent recommendation letters than their male counterparts at a critical juncture in their career.”
Madera et al. analyzed 624 letters from an earlier study that had been written for 174 applicants who had applied for positions in academic psychology at a single R1 university. Letters written for females contained more doubt-raisers, even after controlling for personal accomplishments (number of first-authored publications, honors, etc.). In this regard, their findings agreed with several of the above studies such as Trix and Psenka but disagreed with several others.
Li et al. analyzed 237 letters written on behalf of applicants to a four-year emergency medicine residency at Northwestern University (Li et al. 2017). Of the fifteen dimensions they analyzed, only three revealed gender differences: letters written for female applicants were slightly longer, contained more ability terms that referred to expertise, competence, and intelligence, and also contained more affiliative/communal terms that referred to teamwork, helpfulness, communication, compassion, and empathy. Unlike other studies such as Trix and Psenka, they found no gender differences in doubt-raisers, grindstones, or standouts. Overall, they found little evidence of gender bias in letters, although the special nature of the application process may have influenced their findings (including the constraints that letters were limited to 250 words in length and only the top quarter of applicants were invited to apply). In contrast, French et al. found that letters written by females for surgical residents were 45 words longer than letters written by males (for both genders of applicants) and also that writers used standout terms more in letters about female applicants than in letters about male applicants.
Schmader et al. examined 886 letters written for 235 male and 42 female applicants for a chemistry/biochemistry faculty position at a single R1 university. Sample sizes were too small to analyze data by gender of writer times gender of applicant, so many of the analyses in the current study were not possible. The mean word count of letters for women was 604 words vs. 555 for men. They found no gender differences in the frequency of grindstone words. However, unlike the current study and several others, recommenders used significantly more standout adjectives to describe male than female applicants. Letters containing more standout words also included fewer grindstone words, which runs counter to the current study’s finding of a weak statistical relationship between the occurrence of grindstones and standouts.
Interestingly, in the earlier analysis, Madera et al. analyzed 624 letters written for 194 applicants in psychology and found that male recommenders wrote 262 letters for male applicants and 194 letters for female applicants; in contrast, female recommenders wrote 78 letters for male applicants and 109 letters for females. Hence, to some extent their results resemble the current study’s finding that male applicants submit more letters from male writers than from female writers. They also show the reverse trend for female applicants, which the current study did not find (the current study found homophily only for males in the social sciences). Madera et al. also found that women were described as more communal and less agentic than men, neither of which was found in the current study. Finally, although they found more agentic adjectives in letters for males, there was no difference in “agentic orientation”, a summary of indices of how much writers referred to the applicants as active, dynamic, and achievers (using words such as “earn”, “insight”, “think”, “know”, and “do”).
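The writer-by-applicant counts quoted above can be summarized with a short calculation. The Python sketch below uses the four counts exactly as quoted; the function names and the choice of an odds ratio as a summary are illustrative assumptions, not statistics reported by either study.

```python
# (writer_gender, applicant_gender) -> number of letters, as quoted in the text
COUNTS = {
    ("M", "M"): 262,
    ("M", "F"): 194,
    ("F", "M"): 78,
    ("F", "F"): 109,
}


def share_from_male_writers(counts, applicant_gender):
    """Fraction of an applicant group's letters that were written by men."""
    male = counts[("M", applicant_gender)]
    female = counts[("F", applicant_gender)]
    return male / (male + female)


def same_gender_odds_ratio(counts):
    """Odds ratio of the 2x2 table; values above 1 indicate that
    same-gender writer-applicant pairs are over-represented."""
    same = counts[("M", "M")] * counts[("F", "F")]
    cross = counts[("M", "F")] * counts[("F", "M")]
    return same / cross


if __name__ == "__main__":
    for gender in ("M", "F"):
        share = share_from_male_writers(COUNTS, gender)
        print(f"applicant {gender}: {share:.1%} of letters from male writers")
    print(f"odds ratio: {same_gender_odds_ratio(COUNTS):.2f}")
```

On these counts, about 77% of letters for male applicants and about 64% of letters for female applicants came from male writers, so both groups drew most of their letters from men, but male applicants did so more heavily.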
The study of recommendation letters has continued to the present; we examine two studies that we found particularly relevant. Powers et al. studied a larger sample than ours for an orthopaedic residency program in 2018, examining 2625 letters for both race and gender. The reference letters were standardized to reduce potential bias, a relatively new idea. The researchers concluded (UiO indicating “underrepresented in orthopaedics”):
Small differences were found in the categories of words used to describe male and female candidates and white and UiO candidates. These differences were not present in the standardized LOR compared with traditional LOR. It is possible that the use of standardized LOR may reduce gender- and race-based bias in the narrative assessment of applicants.
The study was performed using LIWC 2015 for the standard categories: agentic/communal, standout/grindstone, and ability. Interestingly, the researchers concluded that standardized letters of reference may only produce a small effect. They also speculated that an orthopaedics-focused word list may have obscured bias; our reverse methodology addresses this issue and is a powerful method for going beyond LIWC or other pre-defined lists.
These authors also note:
A similar discrepancy has been noted in studies analyzing letters of recommendation for surgical residency and suggests that applicants preferentially ask men faculty over women faculty for letters of recommendation… If applicants believe that letters from writers of higher academic rank carry more weight, then the larger proportion of men at higher academic rank could be one explanation for this difference…
The former is exactly what we have observed in our developmental social science letters, although in our study we noted the clustering into research areas as in Figure 5, with no corresponding effect in EPP. These authors also hypothesized a rank/weight correlation, but we observed no significant effect.
Kobayashi et al. studied 2834 letters in another orthopaedic residency study. Their conclusions, again using LIWC 2015, were quite similar to ours:
Although there were some minor differences favoring women, language in letters of recommendation to an academic orthopaedic surgery residency program were overall similar between men and women applicants… Given the similarity in language between men and women applicants, increasing women applicants may be a more important factor in addressing the gender gap in orthopaedics.
The authors made some of the same points regarding the limitations of word lists made in Powers et al. and in the current work.
To recap, amidst many similar findings, there were also many differences between the current study and former studies, any of which might be responsible for inconsistencies when they occurred. There are no studies directly comparable with the current study: they are older (predating the recent focus on gender issues in STEM), not written on behalf of applicants for tenure-equivalent positions in STEM fields, written for less math-intensive STEM fields that have higher representations of women, and/or written by writers from different cultures. All but one, Li et al., employed fewer measures than the current study. These differences may partly explain why we found less evidence of gender bias against women candidates than might have been expected from statements made in articles such as the following quote from Blue et al.:
Standout words in letters of recommendation…portray a candidate as talented and exciting, (and) are most often found in letters of recommendation for men. Grindstone words create the impression that a candidate works hard but is not intellectually exceptional, (and) are more often used for women…As a result of that discrepancy, female candidates seem both more boring and less intellectually promising than their male competitors.
This article appeared in Physics Today, a magazine for members of the American Physical Society, and was not based on original research by the authors; it was written for physicists who wanted to understand the general issue and accurately reflected a distillation of common sentiment.