1. Introduction
Social stratification—the process by which inequality reproduces itself—has long been a subject of interest in academic and policy circles. The interplay among familial background, individual attributes, opportunities, and an array of structural factors constituting one’s social advantage shapes the pathways to famous achievements and perpetuates or disrupts social stratification (
Hällsten and Pfeffer 2017;
Song and Campbell 2017;
Toft and Jarness 2021;
O’Brien 2023). The reproduction and transmission of social advantages have been hypothesized to form an “elite” class with excessive control of resources, power, and influence (
Khan 2012). Many studies have documented substantial endogamy across various geographical contexts regarding specific forms of elite status (
Lomnitz and Pérez-Lizaur 1987;
Stone 1990;
Ermakoff 1997;
Gatewood 2000;
S. Beckert 2001;
Padgett and Powell 2012;
Bodenhorn 2015;
Toft and Jarness 2021;
Chung et al. 2021;
O’Brien 2023). However, studies of intergenerational social mobility have historically been restricted to father–son pairs, measured in terms of occupation, education, or income (
Blau and Duncan 1967;
J. Beckert 2022;
O’Brien 2023;
Bessiere 2023). While this is primarily a consequence of poor data availability, it has led to an incomplete understanding of how social advantages are transmitted across generations. Alternative approaches, such as surname-based analyses, suggest that mobility is far lower than traditionally estimated, with social status persisting across multiple generations (
Clark 2014;
Barone and Mocetti 2021). These findings underscore the need for broader genealogical studies that extend beyond parent–child pairs to capture the full extent of inherited advantage. In the United States (US), linking between censuses is still poor for women, and names recorded on the census often cannot be easily linked to other data sources (
Abramitzky et al. 2021;
O’Brien 2023). Thus, data that capture broader social relations are often limited to easily measurable outcomes strictly across close familial relations. Consequently, the scale of elite endogamy is not well-known or understood.
Multigenerational and broader genealogical data are increasingly available to study such outcomes (
Blau and Duncan 1967;
Lomnitz and Pérez-Lizaur 1987;
Stone 1990;
Ermakoff 1997;
Gatewood 2000;
S. Beckert 2001;
Padgett and Powell 2012;
Khan 2012;
Bodenhorn 2015;
Hällsten and Pfeffer 2017;
Chung et al. 2021;
Abramitzky et al. 2021;
Price et al. 2021;
J. Beckert 2022;
Bessiere 2023;
O’Brien 2023). Several studies have suggested the centrality of intergenerational transmission of advantage to social stratification and demonstrated that individuals with advantaged parents are more likely to achieve social and economic success as adults (
Hout 2015;
Song et al. 2021). While administrative datasets may be better equipped to explore the extent to which parental notability influences individual achievements, variations across occupational domains, and potential gender disparities in famous outcomes, few datasets are as well suited to exploring large-scale network patterns of fame and status. We combined a comprehensive genealogical database comprising 30 million people with individual demographic and social descriptors to explore the dynamics of socioeconomic mobility within the elite populations of the Western world. We examined the proportion of individuals who achieved fame based on parental fame, considering both paternal and maternal influences, including multiple occupational domains and gender-specific distinctions. This allowed for a more nuanced understanding of the complexities of the interplay between familial background and individual achievements than previously reported.
Traditional social stratification research relies on variations of measures of income, occupation, or education to measure the intergenerational association of status attainment (
Hauser and Warren 1997). For a variety of reasons, obtaining such individual-level data at a scale with which they could be matched with large-scale family tree data is infeasible. Administratively, such data are only available for specific time periods and places. Additionally, the meaning of such measures would vary drastically across time periods and places. One unique metric of status that transcends both time and place is celebrity or notability. While distinct from traditional quantitative measures of socioeconomic status, celebrity status can proxy elite status by representing an aspirational level of wealth and success (
Rojek 2001;
Alexander 2010). Achieving celebrity status generally depends on the same networks and resources that are associated with higher socioeconomic status (
Currid-Halkett 2010). It must be emphasized, however, that the concept of celebrity is socially constructed and does not necessarily reflect any intrinsic personal attributes (such as intelligence). Recent advances in data on notable individuals, with coverage that extends across countries and time periods, make notability an extremely valuable metric for status (
Laouenan et al. 2022).
Examining the correlation of fame across specific types of familial relations informs a clearer understanding of how larger-scale clusters of famous individuals are tied together. We found that having famous parents massively increases the likelihood of attaining fame, suggesting the presence of inherited advantages that contribute to social stratification. Independent of parent–child relationships, we identified additional pathways through which famous individuals become linked—marriage, in-law relationships, and grandparents. The culmination of these pathways led us to identify that more than a quarter of the famous individuals in our dataset were all part of a single dense genealogical cluster of famous individuals. While a wide array of social stratification research has documented the association of status across various direct relationships, such as between parents and children, between husbands and wives, and even between grandparents and grandchildren, little social stratification research has been able to descriptively explore the extent to which status is linked across broader familial network structures. Our results provide some of the first definitive evidence that descriptive patterns of status association are not limited to close contacts but persist in the form of larger familial network structures.
2. Materials and Methods
2.1. Data
The primary dataset utilized in this study is derived from WikiTree, a free and shared social networking genealogy website that facilitates collaboration across a single, comprehensive worldwide family tree. WikiTree was founded in 2008 and has become one of the largest genealogical research and collaboration platforms. The platform allows users to create and edit personal profiles, document family history using the wiki markup language, and contribute to a singular worldwide family tree.
The dataset used in this study was obtained from the WikiTree data dump (acquired 7 February 2023), which provides complete data on all deceased individuals who are part of the publicly available family tree. The birth dates of these individuals are summarized in
Figure 1. Users’ data contain detailed information about the individuals registered on WikiTree, including usernames, dates of birth and death, locations, gender, and parent IDs. Dyadic marriage data capture information about marriages between individuals within the WikiTree platform, including the user IDs of the spouses, marriage dates, and locations.
The second dataset used in this study is available in the “A Brief History of Human Time—Cross-verified Dataset” (
Laouenan et al. 2022). This dataset focuses on cross-verified information on individuals who have Wikipedia articles, including the full names of the individuals, and categorizes them based on their presence in different language editions of Wikipedia and Wikidata. Information such as birth and death dates is provided as reported or estimated values. The dataset also includes variables related to each individual’s primary domain of influence, categorized into different layers and sub-groups. To ensure the accuracy of the dataset, the creators of the dataset employed various validation measures, which involved checking for missing information and comparing data between different Wikipedia editions and Wikidata. The validation process revealed a high convergence rate, indicating the reliability of the dataset.
The dataset has several unique advantages. First, by covering non-English versions of Wikipedia, the dataset is less subject to typical Western biases.
Laouenan et al. (
2022) reported that “this significantly reduced the Anglo-Saxon bias” in the data. However, they still reported that an Anglo-Saxon bias is likely to be present in the dataset because of the nature of recorded history. While the dataset is a major improvement over similar datasets, the results of this analysis should be interpreted with that limitation in mind. Additionally, this dataset is uniquely high-quality because the data are cross-verified between Wikipedia and Wikidata, with extremely low error rates, as well as more comprehensive coverage as a result. The data also have rich geospatial, temporal, and categorical metadata, which enables comprehensive analyses on a wide variety of secondary attributes.
The authors carefully articulated what they substantively believe inclusion in the dataset means. Broadly, people in the dataset can be thought of as the union of two groups of people: “the universe of these significant individuals, without any further discussion of what influence means—it is actually entirely specific to the question social scientists would ask, e.g., the dynamics of the arts, or science, or demography” and “the universe of already detected individuals, for instance all those individuals above a certain visibility threshold at their time, that would be invariant over time, e.g., the so called ‘elite’, the top famous individuals”. The authors acknowledged that their data might omit individuals who “are currently forgotten”.
2.2. Linking Procedure
To link individuals between the two datasets, we relied on a variation of the linking procedure pioneered by
Abramitzky et al. (
2021). Abramitzky’s approach, applied initially to link individuals between different US censuses, involves matching people uniquely based on their name and year of birth in a manner that reduces the likelihood of false positives. We additionally considered the year of death due to data availability. Because the scales of our datasets are so different (30,461,478 in the WikiTree dataset and 2,229,187 in the Wikipedia dataset), we believe that false positives are not a substantial concern.
To establish links, we created unique identifiers for everyone in both datasets by combining first names, last names, birth years, and death years (
Figure 1A). By creating these identifiers, we established a standardized format for comparison. Once the identifiers were generated in both datasets, we conducted a matching process using a specific function, identifying exact matches between the unique identifiers in the two datasets. If multiple potential matches were found for a single individual, we did not match any of them. This decision directly follows
Abramitzky et al.’s (
2021) methods and is essential for minimizing the chance of false matches.
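The matching rule described above can be illustrated with a minimal sketch. It assumes records are plain dictionaries with hypothetical field names (`first`, `last`, `birth_year`, `death_year`, `id`), and it lowercases and trims names before comparison, which is an illustrative assumption rather than a documented step of our pipeline. Identifiers that occur more than once in either dataset are left unmatched.

```python
from collections import defaultdict

def make_key(person):
    """Build an identifier from first name, last name, birth year, and death year.
    Lowercasing and trimming are assumptions made for this illustration."""
    return (
        person["first"].strip().lower(),
        person["last"].strip().lower(),
        person["birth_year"],
        person["death_year"],
    )

def index_by_key(records):
    """Group record ids by identifier so duplicated identifiers can be detected."""
    index = defaultdict(list)
    for rec in records:
        index[make_key(rec)].append(rec["id"])
    return index

def link_unique(tree_records, famous_records):
    """Return {tree_id: famous_id} for identifiers occurring exactly once in each dataset."""
    tree_index = index_by_key(tree_records)
    famous_index = index_by_key(famous_records)
    links = {}
    for key, tree_ids in tree_index.items():
        famous_ids = famous_index.get(key, [])
        # Ambiguous identifiers (several candidates on either side) are not linked.
        if len(tree_ids) == 1 and len(famous_ids) == 1:
            links[tree_ids[0]] = famous_ids[0]
    return links

# Toy records (entirely made up for illustration).
tree = [
    {"id": "T1", "first": "Ada", "last": "Example", "birth_year": 1800, "death_year": 1870},
    {"id": "T2", "first": "John", "last": "Sample", "birth_year": 1750, "death_year": 1810},
    {"id": "T3", "first": "John", "last": "Sample", "birth_year": 1750, "death_year": 1810},
    {"id": "T4", "first": "Mary", "last": "Nobody", "birth_year": 1820, "death_year": 1890},
]
famous = [
    {"id": "W1", "first": "ada", "last": "Example ", "birth_year": 1800, "death_year": 1870},
    {"id": "W2", "first": "John", "last": "Sample", "birth_year": 1750, "death_year": 1810},
]
links = link_unique(tree, famous)
print(links)
```

In this toy example, the two genealogical profiles sharing the identifier of `W2` are both dropped, so only `T1` is linked.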
2.3. Linking Results
The linking procedure is skewed toward the US, UK, Canada, and Australia (
Table S1). While 18.1%, 9.6%, 3.0%, and 2.4% of all famous individuals in the Wikipedia universe are attached to these four areas, respectively, we observe disproportionately high rates of these areas in the linked dataset—43.5, 22.4, 4.8, and 2.9%, respectively (
Table S1). Regarding occupation, the sample skews towards politics, a sub-category of leadership (
Table S2). While 27.0% of all individuals in the Wikipedia universe are notable for “leadership”, this proportion is 52.9% in the linked dataset (
Table S3). Lastly, the dataset appears to be disproportionately skewed toward famous individuals born between 1700 and 1900 (
Figure 1B). Approximately 71.6% of linked individuals were born in that period, compared with only 22.0% of individuals in the full Wikipedia dataset (
Figure 2).
One of the most significant constraints is that WikiTree excludes all living individuals, regardless of their fame status. This systematic omission means that the dataset provides a complete view only for historical figures while necessarily truncating genealogical lineages for contemporary individuals. Unlike some historical linkage datasets, where missingness might disproportionately affect certain groups, this exclusion applies universally. As a result, estimates of intergenerational transmission of fame are not necessarily biased downward—there is no systematic undercounting of famous children relative to their non-famous peers. However, the exclusion does have important implications for understanding fame clustering and the structure of elite family networks.
For example, because living individuals are absent, the data capture only a partial picture of contemporary elite networks. If famous families today continue to marry within elite circles but those marriages involve living individuals, those connections will not appear in the dataset, artificially reducing observed assortative mating rates. Similarly, the clustering of famous individuals within extended family networks may appear weaker than it actually is because linkages through currently living relatives are missing. In effect, this creates a time-based truncation: historical clusters of famous individuals may appear denser than they actually were at the time, while contemporary clusters appear more diffuse than they truly are.
Beyond the exclusion of living individuals, a second issue that complicates efforts to validate the dataset is that Wikipedia does not necessarily constitute a representative sample of famous people.
Laouenan et al. (
2022) suggest that the fraction of famous individuals who are forgotten (and excluded from Wikipedia) varies over time.
Past research has established that within given populations, user-generated genealogical data have surprisingly minimal socioeconomic biases. For example,
Kaplanis et al. (
2018) found nearly perfect overlap between the educational distribution of a subset of individuals in their data and a matched population (individuals living in the state of Vermont). While death certificate data are unavailable for validation, we anticipate a similar association, given the similarity between the platform our data were drawn from and the one
Kaplanis et al. (
2018) looked at.
For various reasons, it is virtually impossible to obtain a truly representative linked dataset. Past analyses that have validated the representativeness of genealogical data have typically only been able to do so on small subsets of data (
Kaplanis et al. 2018). In particular, intergenerational patterns of association in the prevalence of fame would be more precisely measured using administrative datasets, where the representativeness of the data can be better assessed and guaranteed. Nonetheless, the large-N nature of genealogical datasets still allows for remarkably insightful analyses that are not possible with other types of data. Specifically, the unique advantage of large-scale genealogical data is the broader network perspective that such data can provide (
Kaplanis et al. 2018;
O’Brien 2023).
In our subsequent analyses, estimates of intergenerational and marital patterns of association in the prevalence of fame are only representative for the WikiTree sample and should not be generalized to other specific populations. We believe there are two primary ways in which representation bias could impact the results. It is likely that individuals with Wikipedia articles are over-represented in the genealogical data.
Laouenan et al. (
2022) documented that at the peak of human history, only about 1 in 3000 people were famous. In our linked dataset, this number is closer to 1 in 600. Over-representation of famous people could result in higher rates of intergenerational transmission of fame while also potentially artificially increasing rates of ties between famous and non-famous people (parent–child and husband–wife).
Network analyses that explore whether major underlying sub-networks of fame exist are partially robust to representativeness issues. Even if a linked dataset is systematically biased, network analyses can still reveal meaningful insights. Additionally, biased but dense networks can be especially insightful. Representative but dispersed networks lack the density required to detect large underlying network structures. In this way, the systematically biased nature of WikiTree is a major advantage of the data. The family tree is so well-connected, in fact, that the majority of individuals in the tree are part of a single component where known paths can connect any two individuals.
While representation issues make it difficult to assess the accuracy of direct fame association patterns or even to suggest what populations such findings generalize to, such metrics provide reasonable baseline descriptive statistics with which to understand the linked dataset. Furthermore, such associations provide valuable information on what types of relationships connect famous people to one another. Such statistics motivate our broader network analyses.
2.4. Intergenerational Mobility
We calculated intergenerational mobility patterns as the proportion of individuals within a specific subpopulation who achieved fame. In line with
Laouenan et al. (
2022), we define fame as a simple binary, indicating whether an individual has a Wikipedia article. This approach allowed for comparison of the rates of intergenerational mobility between groups of people in different contexts.
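As a minimal sketch of this calculation, the share of famous individuals can be tabulated by child gender and parental fame. The record layout (`id`, `gender`, `father`, `mother`) is hypothetical, and the grouping into mutually exclusive father/mother categories is an assumption made for illustration.

```python
from collections import defaultdict

def fame_rates(people, famous_ids):
    """Share of individuals who are famous, grouped by
    (gender, father is famous, mother is famous)."""
    counts = defaultdict(lambda: [0, 0])  # group -> [famous individuals, all individuals]
    for p in people:
        group = (
            p["gender"],
            p.get("father") in famous_ids,
            p.get("mother") in famous_ids,
        )
        counts[group][0] += p["id"] in famous_ids  # True counts as 1
        counts[group][1] += 1
    return {g: fam / total for g, (fam, total) in counts.items()}

# Toy family (made up): person 1 is a famous father of persons 3, 4, and 5.
people = [
    {"id": 1, "gender": "M", "father": None, "mother": None},
    {"id": 2, "gender": "F", "father": None, "mother": None},
    {"id": 3, "gender": "M", "father": 1, "mother": 2},
    {"id": 4, "gender": "M", "father": 1, "mother": 2},
    {"id": 5, "gender": "F", "father": 1, "mother": 2},
]
famous_ids = {1, 3}
rates = fame_rates(people, famous_ids)
print(rates[("M", True, False)])  # share of sons of a famous father (only) who are famous
```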
2.5. Assortative Mating
We constructed a husband–wife fame matrix to investigate assortative mating patterns and fame reciprocation among couples. The dataset comprised information about the fame status of husbands and wives in various categories. To begin the analysis, we examined the prevalence of various pairings within the dataset. This initial exploration involved calculating the frequencies of marriages between two non-famous individuals and of fame-mixed marriages (i.e., one partner is famous, while the other is not). Next, we focused on specific fame categories to identify patterns. The categories included culture, discovery/science, leadership, sports/games, and missing/other categories. By examining each category separately, we could observe any tendencies for individuals within certain fame domains to marry partners with fame in the same category. This analysis involved calculating frequencies of fame-mixed and fame-similar marriages within each category.
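A husband–wife fame matrix of this kind can be sketched as a simple cross-tabulation. The input format (a list of husband and wife identifiers and a hypothetical `domain_of` lookup from identifier to fame category) and the label "not famous" are assumptions for illustration only.

```python
from collections import Counter

NOT_FAMOUS = "not famous"  # label assumed for individuals without a linked article

def fame_matrix(marriages, domain_of):
    """Count marriages by (husband category, wife category)."""
    matrix = Counter()
    for husband_id, wife_id in marriages:
        h = domain_of.get(husband_id, NOT_FAMOUS)
        w = domain_of.get(wife_id, NOT_FAMOUS)
        matrix[(h, w)] += 1
    return matrix

# Toy data (made up): four marriages and the fame category of each famous spouse.
marriages = [("h1", "w1"), ("h2", "w2"), ("h3", "w3"), ("h4", "w4")]
domain_of = {
    "h1": "leadership", "w1": "leadership",
    "h2": "culture",
    "h3": "leadership", "w3": "culture",
}
matrix = fame_matrix(marriages, domain_of)

# Fame-mixed marriages: exactly one spouse is famous.
fame_mixed = sum(n for (h, w), n in matrix.items() if (h == NOT_FAMOUS) != (w == NOT_FAMOUS))
# Fame-similar marriages: both spouses are famous in the same category.
fame_similar = sum(n for (h, w), n in matrix.items() if h == w and h != NOT_FAMOUS)
print(matrix, fame_mixed, fame_similar)
```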
2.6. Grandparent and In-Law Effects
We investigated grandparent effects by relating each person’s attributes to those of their paternal and maternal grandparents. Paternal and maternal grandparents were identified by linking the listed father and mother for each person to their own recorded parents. We did not exclude individuals based on missingness in their grandparents’ data.
In-law effects were investigated by combining marriage and profile data. Using the marriage file, individuals were linked to one or more spouses, then to one or two parents-in-law (depending on what was available) through the profile data (the recorded parents of the profile they were married to). Individuals who were married multiple times appear in multiple observations.
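The two linkage steps can be sketched as follows, assuming a hypothetical `parents` lookup that maps each profile identifier to a `(father, mother)` pair, with `None` marking a missing parent, and a list of marriage pairs. This is an illustration of the logic under those assumptions, not the exact code used.

```python
def grandparents(person_id, parents):
    """Return the recorded paternal and maternal grandparents of a person.
    Missing parents or grandparents are skipped rather than excluding the person."""
    father, mother = parents.get(person_id, (None, None))
    paternal = [g for g in parents.get(father, (None, None)) if g is not None]
    maternal = [g for g in parents.get(mother, (None, None)) if g is not None]
    return {"paternal": paternal, "maternal": maternal}

def in_law_rows(marriages, parents):
    """One row per (person, spouse, parent-in-law).
    A person married several times contributes rows for each marriage."""
    rows = []
    for a, b in marriages:
        for person, spouse in ((a, b), (b, a)):
            for in_law in parents.get(spouse, (None, None)):
                if in_law is not None:
                    rows.append((person, spouse, in_law))
    return rows

# Toy data (made up): "c" has two marriages; "m" has only a recorded mother.
parents = {
    "c": ("f", "m"),
    "f": ("gf1", "gm1"),
    "m": (None, "gm2"),
    "s1": ("sf1", None),
    "s2": ("sf2", "sm2"),
}
marriages = [("c", "s1"), ("c", "s2")]

gp = grandparents("c", parents)
rows = in_law_rows(marriages, parents)
in_laws_of_c = [in_law for person, _, in_law in rows if person == "c"]
print(gp, in_laws_of_c)
```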
2.7. Clustering Method
The clustering method identifies and groups historically significant individuals within genealogical networks. The algorithm exclusively considers famous nodes and creates clusters where all famous figures in the cluster are linked through paths with a high density of famous individuals across them. To initiate the clustering process, we selected a famous node as the initial member of a cluster. Clusters were generated such that all nodes in any given cluster have the following property:
The undirected network is denoted as $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges.
Let $V_1 \subseteq V$ be the set of type 1 nodes in the network, and let $V_2 \subseteq V$ be the set of type 2 nodes.
For a specific value $k$, let us define the relation $\sim_k$ on $V_1$ such that $u \sim_k v$ if and only if there exists a path in $G$ from node $u$ to node $v$ that does not include any sequence of more than $k$ consecutive type 2 nodes. A cluster of type 1 nodes with the described property can be defined as a connected component of relation $\sim_k$.
The algorithm examines the immediate neighbors of the selected famous node to identify any additional famous nodes within its immediate family. If any famous neighbors are found, they are added to the cluster. Next, the algorithm checks if the newly added famous nodes have any additional famous neighbors. If any are found, they are also included in the cluster. This iterative process of adding famous neighbors to the cluster continues until no further famous neighbors can be identified among the previously added famous nodes. The cluster is considered complete and recorded as a single cluster within the genealogical structure. We extended the clustering procedure to allow connections over multiple steps. This approach was identical to the clustering procedure, except that rather than considering two nodes connected if they were married or were parent/child, two nodes were connected if they were within n degrees of each other. We explored clusters where n equals 2, 3, and 4.
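The iterative procedure can be sketched as a breadth-first search that repeatedly absorbs famous nodes lying within a fixed number of steps of any current cluster member. The adjacency-list format, the function names, and the parameter `max_hops` are assumptions for illustration. The sketch follows the hop-based description in the preceding paragraph, under which a limit of `max_hops` steps permits at most `max_hops - 1` consecutive non-famous nodes between two famous nodes.

```python
from collections import deque

def famous_within(graph, start, famous, max_hops):
    """Famous nodes reachable from start in at most max_hops steps (breadth-first search)."""
    seen = {start}
    frontier = deque([(start, 0)])
    found = set()
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                if neighbor in famous:
                    found.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return found

def fame_clusters(graph, famous, max_hops):
    """Group famous nodes into clusters: connected components of the
    'within max_hops steps' relation among famous nodes."""
    unassigned = set(famous)
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        cluster, stack = {seed}, [seed]
        while stack:
            current = stack.pop()
            for other in famous_within(graph, current, famous, max_hops):
                if other in unassigned:
                    unassigned.remove(other)
                    cluster.add(other)
                    stack.append(other)
        clusters.append(cluster)
    return clusters

# Toy undirected network (made up): A - x - B - y - z - C, with D isolated.
graph = {
    "A": ["x"], "x": ["A", "B"], "B": ["x", "y"],
    "y": ["B", "z"], "z": ["y", "C"], "C": ["z"], "D": [],
}
famous = {"A", "B", "C", "D"}
sizes = {n: sorted(len(c) for c in fame_clusters(graph, famous, n)) for n in (1, 2, 3)}
print(sizes)
```

In the toy network, no famous nodes are adjacent, so one step yields four isolates; two steps join A and B; three steps also bring in C. Running a fresh search from every famous node is simple but not optimized for a network of tens of millions of profiles.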
3. Results
We linked open-source genealogical data with cross-verified Wikipedia entries (
Laouenan et al. 2022); this resulted in a comprehensive family tree of over 30 million people. Binary annotations indicated the presence or absence of fame. We first investigated whether the likelihood of achieving fame increased in children of famous parents. The results revealed that individuals with two famous parents were most likely to achieve fame (
Figure 2A,
Table S4). Among those with a famous father, 9.97% of males achieved fame, while 16.31% of those with a famous mother achieved fame. When both parents were famous, 25.55% of males achieved fame. The proportions among females were slightly lower, with 2.08% of females with a famous father achieving fame, 7.70% of those with a famous mother achieving fame, and 12.34% with both a famous father and mother achieving fame (
Figure 2B,
Table S4). Thus, parental fame confers social advantages on progeny that are moderated by the gender of both parents and children.
Next, we asked whether the rates of intergenerational fame transmission varied by the type of parental fame. While the broad categories of parental fame are overly vague, they provide a sense of how the concentration of fame varies. Occupations were coded into one of four major categories: “discovery/science”, “leadership”, “culture”, and “sports/games” (
Figures S2–S5,
Table S6). We conducted this analysis to explore whether heterogeneity exists in the intergenerational persistence of fame. Higher intergenerational transmission of fame was observed in men compared to women, regardless of the type of parental fame; however, having a famous mother extends a greater benefit towards male children in terms of achieving any type of fame than having a famous father (
Figure 2B,D,
Tables S4–S6). Culture showed the highest rates of intergenerational persistence, while sports/games had the lowest. Additional analyses suggest heterogeneity in intergenerational persistence in terms of destinations (child type of fame;
Table S5). Overall, these results suggest that the persistence of fame varies according to the parental gender, child gender, and type of fame.
4. Marriage and In-Law Effects
It has been hypothesized that elite families marry one another and disproportionately concentrate social advantages among their descendants (
O’Brien 2023). Homogamy (the propensity for alike individuals to marry each other) in terms of status, as measured through education (
Schwartz 2013) and elite class membership has been documented across a variety of contexts (
O’Brien 2023). A husband–wife fame matrix was created to observe assortative mating patterns among famous and non-famous individuals (
Figure 3). The most commonly observed pairing type is between two non-famous individuals, which aligns with the notion that fame is rare. This is an expected pattern, given the assortative mating patterns of elite endogamy (
Lomnitz and Pérez-Lizaur 1987;
Stone 1990;
Ermakoff 1997;
Gatewood 2000;
S. Beckert 2001;
Padgett and Powell 2012;
Bodenhorn 2015;
Toft and Jarness 2021;
Chung et al. 2021;
O’Brien 2023). We additionally see a disproportionate number of non-famous women partnering with famous men, which is expected, given the disproportionately larger number of famous men compared to famous women. In many cases, famous men married famous women from categories other than their own. Overall, the data highlight the prevalence of non-famous pairings, the prominence of marriages where husbands are famous while wives are not, and variations in fame reciprocation rates across different categories.
Since famous individuals marry each other at high rates and women are much less likely to be considered famous, we next determined the extent to which famous men married non-famous women who were the children of famous fathers (
Tables S7–S9). Indeed, non-famous women married to famous men were much more likely to have a famous father than non-famous men married to non-famous women.
4.1. Grandparent Effects
Direct associations between grandparents and grandchildren have been observed regarding education, occupation, and other important life outcomes (
Chan and Boliver 2013;
Hällsten 2014;
Pfeffer 2014). However, some research suggests that the apparent grandparent effects are the product of measurement error (
Ferrie et al. 2021). We investigated grandparent effects by relating each person’s attributes to those of their paternal and maternal grandparents (
Tables S10–S15). Paternal and maternal grandparents were identified by linking the listed father and mother for each person to their father and mother’s parents. We did not rely on marriage data to make these linkages; instead, we used recorded parents in the dataset. We did not exclude individuals based on missingness in their grandparents’ data. Indeed, individuals with famous grandparents were much more likely to become famous, regardless of their parents’ fame.
4.2. Cluster Analysis
Next, we performed a cluster analysis of famous individuals to identify broadly connected groups. For most genealogical networks, a low ceiling exists for the number of direct connections between nodes due to the constraint presented by demographic processes (e.g., a person can only have two parents and is unlikely to have more than a few spouses or a few children). As a result, it would be essentially impossible to observe large clusters of people in which all members can reach others with a very short path length. Instead, our procedure creates clusters that allow for more extended genealogical connections, like tentacles linking broader groups of famous people.
We generated clusters for k = 1 through k = 4, where k = 1 is the smallest value possible and k = 4 allows for connections between cousins, which is arguably one of the most extended genealogical relationships that social stratification research has historically engaged with. The clustering method focuses on identifying and grouping historically significant individuals or those with well-documented ancestries within genealogical networks. The algorithm exclusively considers famous nodes and creates clusters where all prominent figures in the cluster are closely related.
Analysis of the genealogical network yielded intriguing results regarding the concentration of famous individuals within families. The k = 1 (famous-only) cluster comprises nodes where famous individuals are connected by a single component, such as parents, spouses, and children. Each of these clusters represents a single famous individual who is not connected to any other famous individuals. We calculated summary statistics for the probability of being an isolate and the degree of exposure to famous individuals in the k = 1 cluster (
Tables S16 and S17). These data suggest that looking only at adjacent family members creates an appearance of fame in isolated individual accounts. The k = 2 (two-hop reachable) clusters, however, encompass both famous and non-famous nodes, where the non-famous nodes link together two or more famous nodes within a maximum of two steps. A two-step association accounts for links between immediate parents, siblings, and grandparents. Most two-hop reachable clusters were found to be of size 1, with 47,289 individual clusters identified.
4.2.1. Three-Hop Cluster
Three-hop reachable clusters (k = 3) encompass associations up to and including siblings, great grandparents, aunts, and uncles. The largest three-hop cluster comprises 7542 famous individuals. These famous individuals are primarily famous for leadership (65.9%). The domains of culture (13.6%) and discovery/science (11.0%) make up the second and third largest categories. Sports/games is the smallest category, with only 2.5%. The cluster skews toward leadership, which makes up 52.9% of famous people overall. All other categories are under-represented compared to the overall rates. The average path length between any two nodes in this cluster is 14.47. Density is much higher inside the main cluster than outside of it. The average famous node within the cluster is within three hops of 4.51 other famous nodes. In contrast, outside of the main cluster, the average famous node is within four hops of 0.51 famous nodes.
Individuals’ birth dates in the cluster range from 1180 to 1954, with an interquartile range of 1761 to 1857. The main area of attachment for 3358 individuals in the cluster is the United Kingdom (UK), while another 2719 individuals are attached to the contemporary (post 1776) US. The last names in the cluster are highly varied: the most common last name appears only 55 times, and only five last names appear more than 20 times. In total, 2926 different last names appear in the cluster.
4.2.2. Four-Hop Cluster
The largest four-hop cluster comprises 14,875 famous individuals (
Figure 4). Although nearly double the size of the largest three-hop cluster, it has approximately the same domain composition. These famous individuals are, again, primarily famous for leadership (65.1%). Culture (13.7%) and discovery/science (12.1%) comprise the second and third largest categories, respectively. The category of sports/games is substantially under-represented, at 2.7%. The average path length between any two nodes in this cluster is 9.36. The substantial decline in the average path length from the largest three-hop cluster indicates that the concentration of fame in certain networks may be hidden unless one considers the broader network (e.g., a slightly greater number of degrees of connection).
Again, density is much higher inside the main cluster than outside. The average famous node within the cluster is within four hops of 6.32 other famous nodes. In contrast, outside of the main cluster, the average famous node is within four hops of 0.46 famous nodes.
Individuals’ birth years in the cluster range from 1102 to 1969, with an interquartile range of 1765 to 1856 (
Figures S6 and S7). This cluster is primarily made up of individuals whose main country of attachment is the UK or the US (
Figure 4). The main area of attachment of 5789 individuals in the cluster is the UK, while the contemporary (post-1776) US is the main area of attachment of 5344 individuals. This implies that the cluster contains 4.8% of all (120,946) deceased famous individuals primarily attached to the UK and 3.1% of all (171,375) deceased famous individuals primarily attached to the US.
Regarding leadership, this cluster contains an even greater concentration of American and British individuals. Approximately 6.1% (3619) of all (59,354) deceased famous individuals in the US in the leadership domain are part of the cluster. More specifically, 30.8% of all people famous for leadership in the US born between 1700 and 1750 are included in the cluster, and 20.7% born between 1750 and 1800 are included in this cluster. This is especially striking when one considers that we failed to link the vast majority of famous individuals in the dataset at all.
The overall percentage is even higher for the UK. Approximately 7.7% (3511) of all (45,886) deceased famous individuals in the UK in the leadership domain are part of the cluster. Notably, the core of this structure in the UK is not nobility, where status is systematically transferred in a familial manner; instead, it mostly comprises political leaders. While only 6.3% of all deceased UK nobility is in the cluster, it includes 8.5% of all deceased UK politicians. Additionally, 8.0% of all deceased UK administration/law individuals are in the cluster.
4.2.3. How Big Are These Clusters?
We can liberally define a ceiling on the size of the clusters by calculating the maximum size of a cluster, assuming each node takes the maximum number of steps to reach another node. This implies the maximum size is n + 3 × (n − 1), where n is the number of famous nodes in the cluster. When applied to the largest four-hop cluster, this approach results in a size of 59,497. This means that approximately one in four nodes in the cluster is “famous” and that out of 30,461,478 individuals, including 58,149 famous ones, 25.6% of all famous individuals can be found in a cluster that contains only 0.195% of all individuals (
Figure 4).
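The ceiling calculation can be reproduced from the counts reported in the text. The sketch below is illustrative only: the function name and the generalization of the factor 3 to k − 1 intermediate nodes per link are assumptions, while the input counts are taken from this section.

```python
def cluster_size_ceiling(n_famous, k):
    """Upper bound n + (k - 1) * (n - 1) on cluster size.

    Assumption: n famous nodes are joined by n - 1 links, and each link
    passes through at most k - 1 non-famous intermediate nodes. For k = 4
    this reduces to the n + 3 * (n - 1) expression given in the text.
    """
    return n_famous + (k - 1) * (n_famous - 1)

# Counts reported in the text.
N_FAMOUS_IN_CLUSTER = 14_875   # famous individuals in the largest four-hop cluster
N_FAMOUS_TOTAL = 58_149        # famous individuals linked to the network
N_INDIVIDUALS = 30_461_478     # all individuals in the network

ceiling = cluster_size_ceiling(N_FAMOUS_IN_CLUSTER, k=4)
famous_share_of_cluster = N_FAMOUS_IN_CLUSTER / ceiling
share_of_all_famous = N_FAMOUS_IN_CLUSTER / N_FAMOUS_TOTAL
share_of_all_individuals = ceiling / N_INDIVIDUALS

print(ceiling)                                    # 59497
print(f"{share_of_all_famous:.1%}")               # 25.6%
print(f"{share_of_all_individuals:.3%}")          # 0.195%
```

Under this ceiling, the famous share of the cluster is 14,875/59,497, which can be read off `famous_share_of_cluster`.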
Another way to estimate the size of the cluster is to take the union of all nodes that can be reached within four steps of any famous node in the cluster. Although this calculation results in a larger value of 729,001, even this value is equivalent to only 2.4% of all nodes in the network. Furthermore, even this approach indicates that famous nodes are intensely concentrated in this cluster. Within this 729,001-person group, 2.0% of all individuals are famous. In contrast, outside the cluster, only 0.15% of individuals are famous.
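The union-based estimate can be sketched as a depth-limited, multi-source breadth-first search, followed by the share calculations from the reported counts. The function name and the toy chain graph are hypothetical, and the sketch is not the implementation used in this study. The outside-cluster rate is computed here as the remaining famous individuals divided by the remaining individuals, which is an assumption about how the 0.15% figure was derived.

```python
from collections import deque

def union_within_k(adj, sources, k):
    """Nodes within k steps of at least one source node (sources included)."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # stop expanding at the hop limit
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return set(dist)

# Hypothetical toy network: a chain 0 - 1 - ... - 9 with famous nodes at both ends.
toy_adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
toy_union = union_within_k(toy_adj, [0, 9], k=2)

# Counts reported in the text.
UNION_SIZE = 729_001
N_FAMOUS_IN_CLUSTER = 14_875
N_FAMOUS_TOTAL = 58_149
N_INDIVIDUALS = 30_461_478

share_of_network = UNION_SIZE / N_INDIVIDUALS
famous_rate_inside = N_FAMOUS_IN_CLUSTER / UNION_SIZE
# Assumed definition: famous individuals outside the group over all individuals outside it.
famous_rate_outside = (N_FAMOUS_TOTAL - N_FAMOUS_IN_CLUSTER) / (N_INDIVIDUALS - UNION_SIZE)

print(f"{share_of_network:.1%}")      # 2.4%
print(f"{famous_rate_inside:.1%}")    # 2.0%
print(f"{famous_rate_outside:.2%}")   # 0.15%
```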
5. Discussion
The concentrated structure of powerful elites in contemporary Western society is a central detriment to resource access and equality of opportunity. A large body of research has documented the mechanisms by which elite family networks concentrate resources and advantages (
Padgett and Powell 2012;
Chung et al. 2021;
Goni 2022), with assortative mating patterns being hypothesized to amplify the intergenerational transmission of status from parent to child and concentrate wealth, influence, and other resources within specific families (
Schwartz 2013). Until recently, the lack of comprehensive generational data has hampered the exploration of these dynamics at a scale beyond two generations. The broader network-level perspective we took is paramount to fully quantifying this familial concentration of resources.
Our overall results shed light on the importance of studying mobility in terms of broader webs instead of father–son or other simple intergenerational approaches. The comprehensive multigenerational network revealed a substantial degree of concentration of fame within family networks. Rather than persisting directly between adjacent generations, famous families tend to weave themselves together through marriage, with fame often appearing to skip generations or take other non-linear paths. This contrasts with the widely held belief that social advantages are strictly passed from parents to children and provides a new framework for understanding how social advantages are transferred between people.
Our specific association findings shed light on the types of relationships that bind larger webs of fame together. We consistently observe men being more likely to attain fame than women, regardless of the mother’s or father’s status. However, we also observe that having a famous mother increases the probability of a child attaining fame more than having a famous father. Women have historically held non-breadwinner roles in the global West and, as demonstrated here, are much less likely to be famous or to hold as broad or as well-recognized a social role (
Yavorsky et al. 2023). There are multiple reasons why having a famous mother may be associated with a higher probability of a child being famous. Given the relative scarcity of famous women, having a famous mother may indicate a higher latent propensity for fame in a family. Additionally, sociological research suggests that mothers matter more for children’s life outcomes, such as education (
Beller 2009). Beyond parent–child analyses, the husband–wife fame matrix reveals assortative mating patterns and fame reciprocation across various categories. The data show the prevalence of non-famous pairings and the prominence of marriages where husbands are famous while wives are not.
The prevalence of fame differs across domains such as leadership, discovery/science, sports/games, and culture. This variation highlights the multidimensional nature of social stratification, where the distribution of resources, opportunities, and cultural capital differs depending on the occupational domain. While fame persisted between generations for all categories, parental fame in leadership was not the strongest predictor of individual fame. This is surprising, as the intergenerational persistence of leadership has historically characterized aristocracy, which existed in the West until very recently and has been hypothesized to be succeeded by a new aristocracy (
Alfani 2023). In addition, we observed high intergenerational persistence for discovery/science, which aligns with the findings of
Bell et al. (
2019), who noted that many “lost Einsteins” exist due to unequal life opportunities. In the US, athletics are often considered a means of social mobility for marginalized groups (
Eitzen 1999). However, the data suggest that the degree of intergenerational persistence for fame in the sports/games category is low. Therefore, while the sports/games domain may serve as a tool for upward mobility, the resulting fame is unlikely to persist across generations.
Overall, the results of this analysis shed light on an important question with respect to social stratification: To what extent do individuals’ family trees form broader networks of status? While a broader array of social stratification research has quantified associations of status intergenerationally, in terms of homogamy and even in terms of grandparent effects, little research has explored the extent to which status attainment can be observed in broader familial networks. While, based on the results of this analysis, it is not possible to say whether “fame”, as a category of status, is inherited more or less than would be expected from what we know of other forms of inheritance, we can leverage fame as a metric of status to test whether status is concentrated in broader familial networks.
It is important to note that there are several limitations to this research. This study relies on user-generated genealogical data, which is biased toward a specific time period (births between 1800 and 1900) and context (UK and US). Our approach for linking elites to the family tree was inherently designed to produce conservative estimates of the concentration of elites within genealogical networks. Thus, false matches were minimized, and the likelihood of missing matches increased. Furthermore, if false matches were to occur, they would likely attenuate the apparent concentration of elites. As such, we view our estimates regarding the concentration of elites in genealogical networks and estimates of the intergenerational and spousal transmission of elite status to be conservative. However, we note that by missing linkages in the dataset, we likely biased estimates of the association between extended kin and individual elite status, conditional on parent status. While this would not upwardly bias our estimates of the scale of elite clustering in the network, it would upwardly bias estimates of conditional grandparent effects and conditional in-law effects.
Additionally, the nature and mechanisms of intergenerational transmission are entirely unclear. At some points in human history, in certain contexts, notable status was mechanically inherited across generations (e.g., monarchy). The fact that our primary cluster transcended contexts with (the UK) and without monarchy (the US) suggests that the familial concentration of fame in Western society transcends mechanical inheritance. Another related concern is that individuals who have a close relative who has a Wikipedia page may have an increased probability of having a Wikipedia page, independent of their own fame/significance. While it is difficult to quantify the extent of this effect, several factors suggest it is not a major concern in interpreting our results. If direct familial influence were the primary driver of fame, we would expect fame to cluster only among immediate relatives rather than forming expansive, multi-hop networks. Moreover, the wide diversity of last names within these clusters indicates that no single family lineage disproportionately biases the results.
Additionally, we used binary distinctions of parental fame (having a famous father, mother, or both). Not all instances of fame are equal, and the analysis does not capture variations in the intensity of individual fame. Future research could explore more nuanced measures of fame and examine how different levels of parental fame impact fame outcomes. The most straightforward ways of quantifying fame in our dataset (such as the word counts of Wikipedia entries) are especially problematic, as they may have little to do with other measures of fame and may be biased by forces that will lead to endogenous results. Similarly, it is unclear how robust the coding of the fame domains is. Certain broad categorizations, such as leadership, may mask meaningful heterogeneity within each category. Future research on richer, more robust datasets may better address these important concerns. Future research that has better established representativeness and includes more comprehensive metadata may also shed better light on more nuanced findings—such as exploring whether the socioeconomic status of descendants of famous athletes is lower compared to those whose famous parents gained recognition in science or discovery.
Ultimately, caution should be exercised when interpreting the results of this analysis. In particular, the mechanisms that link the persistence of fame across generations are entirely unclear. While large-scale family tree data are often used to explore the role of genetics in shaping outcomes (
Kaplanis et al. 2018), we strongly discourage such an interpretation here. Not only do our results provide no evidence to support a genetic interpretation, but we also find it theoretically implausible. The concept of celebrity is socially constructed and does not necessarily reflect any intrinsic personal attributes (such as intelligence).