The Genes of Freedom: Genome-Wide Insights into Marronage, Admixture and Ethnogenesis in the Gulf of Guinea

The forced migration of millions of Africans during the Atlantic Slave Trade led to the emergence of new genetic and linguistic identities, thereby providing a unique opportunity to study the mechanisms giving rise to human biological and cultural variation. Here we focus on the archipelago of São Tomé and Príncipe in the Gulf of Guinea, which hosted one of the earliest plantation societies relying exclusively on slave labor. We analyze the genetic variation in 25 individuals from three communities who speak distinct creole languages (Forros, Principenses and Angolares), using genomic data from expanded exomes in combination with a contextual dataset from Europe and Africa, including newly generated data from 28 Bantu speakers from Angola. Our findings show that while all islanders display mixed contributions from the Gulf of Guinea and Angola, the Angolares are characterized by extreme genetic differentiation and inbreeding, consistent with an admixed maroon isolate. In line with a more prominent Bantu contribution to their creole language, we additionally found that a previously reported high-frequency Y-chromosome haplotype in the Angolares has a likely Angolan origin, suggesting that their genetic, linguistic and social characteristics were influenced by a small group of dominant men who achieved disproportionate reproductive success.


Introduction
During the first half of the 16th century, the archipelago of São Tomé and Príncipe, located in the Gulf of Guinea (1° N, 7° E), became one of the first examples of the so-called "plantation complex", which was soon to take over the New World [1] (Figure 1a). When the Portuguese reached the Gulf of Guinea in the early 1470s, they found the islands of São Tomé (860 km²), Príncipe (136 km²) and Annobón (17 km²) to be uninhabited. A fourth island, Fernando Pó (now Bioko), located only 32 km off the coast of Cameroon, had already been populated by the Bubi, an autochthonous Bantu-speaking group who had presumably reached its shores by canoe [2]. While São Tomé and Príncipe were both settled plays an important role in the Angolar folklore [9], bears striking similarities to the traditional views on the origins of the Garifuna from St. Vincent Island in the Caribbean [18,19]. However, as it has been impossible to identify the date or location of the shipwreck, some historians speculate that the story may have been circulated to disguise the vast number of slave escapes from the plantations themselves [20]. The second and most widely accepted hypothesis assumes that the Angolares are in fact the descendants of a maroon community formed by slaves who managed to escape to the forest after they had been brought to the island as plantation workers or for re-export to the Americas [4,8,21]. Slave rebellions and escapes to the mountainous regions of São Tomé date back to the first days of colonization during the late 15th and early 16th century, culminating in a major uprising in 1595 [4]. While the existing sources do not document a clear link between these events and the modern Angolar community [4], they do show that no shipwreck scenario was needed for the formation of maroon communities on the island.
We have previously detected an unusually strong signal of genetic differentiation between the Angolares and the remaining populations of the island of São Tomé by using a set of just 15 autosomal microsatellite polymorphisms [22]. However, this extreme differentiation did not allow us to recover the historical relationships between the Angolares and other groups from São Tomé and Príncipe and from the African mainland.
Here, we reassess the genetic variation of the three speech communities of São Tomé and Príncipe (Forros, Principenses and Angolares) using genomic data from expanded exomes in combination with a contextual dataset from Europe and from major slave-trading zones in Africa, including new data from Angola. We found that, despite the strong levels of differentiation of the Angolares, the three groups share notable genetic similarities with respect to their Gulf of Guinea/Angolan ancestry ratios, suggesting that the Angolares are an admixed isolate. Based on the available genetic and linguistic evidence, we further propose that their origins trace back to a maroon community strongly influenced by the political and cultural dominance of one or several related men from Angola.

Population Samples
We generated 53 expanded exomes from 9 Angolares (ANG), 8 Forros (FOR) and 8 Principenses (PRI) from São Tomé and Príncipe, as well as from 28 individuals belonging to five Angolan populations: 5 Ovimbundu (OVI), 5 Ganguela (GAN), 5 Nyaneka (NYK), 6 Himba (HIM) and 7 Kuvale (KUV). Samples from Forros and Angolares were obtained from Lungwa Santome and Lunga Ngola speakers whose four grandparents were born in villages where the two languages are still used as a medium of everyday conversation. Although Lung'Ie is highly endangered, we were able to sample five individuals who were still active speakers and three additional individuals who had only a passive knowledge of the language. In all cases, all four grandparents were speakers of Lung'Ie and had been born on the island of Príncipe. Additional details on sampling procedures in São Tomé and Angola have been described elsewhere [22][23][24] (Figure 1a; Supplementary Table S1).

Expanded Exome Sequencing, Variant Calling and Quality Control
DNA samples were extracted from buccal swabs and saliva as previously described [24,25]. Library preparation for expanded exome sequencing (~62 Mb) was done using the Nextera® Rapid Capture Enrichment kit (Illumina, San Diego, CA, USA), following the protocol version #15037436 v01. Indexed samples were sequenced in two runs on an Illumina HiSeq 1500 System with 250 cycles in paired-end mode. A mean depth of coverage of 21x (5-41x) was obtained for captured regions. The 53 newly generated exomes were compared with sequence data from 71 Table S1). To avoid ascertainment bias, we reanalyzed the raw sequence data from these populations together with the newly generated data, instead of just merging the different datasets. Due to the lack of available genome-wide data, Fa d'Ambô speakers from the island of Annobón were not considered in the present study.
To improve the quality of the variant dataset, we further filtered the data with VCFtools (v0.1.13) [35], retaining only autosomal biallelic SNPs (average coverage 16x) without excessive coverage (<35x). Moreover, a minimum genotype coverage of 3 and a minimum genotype quality of 20 were required; with these filters, sites with >15% missing data were excluded. Sites with a Hardy-Weinberg equilibrium p-value < 0.05 for at least two populations were also excluded. Overall, the filtering process yielded 149,501 autosomal SNPs.
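The genotype-level and site-level thresholds described above can be illustrated with a minimal sketch. It assumes that genotypes failing the depth or quality threshold are treated as missing before the missing-data fraction is evaluated; the function name and the input layout are hypothetical and do not reproduce the VCFtools implementation.

```python
def passes_site_filters(genotypes, min_dp=3, min_gq=20, max_missing=0.15):
    """Return True if a site survives the missing-data filter.

    genotypes: one entry per sample, either None (no call) or a
    (depth, genotype_quality) tuple. This layout is a hypothetical
    stand-in for the per-sample fields of a VCF record.
    """
    missing = 0
    for call in genotypes:
        # A genotype counts as missing if it was never called or if it
        # fails the minimum depth / minimum quality thresholds.
        if call is None or call[0] < min_dp or call[1] < min_gq:
            missing += 1
    # Sites with more than max_missing missing genotypes are excluded.
    return missing / len(genotypes) <= max_missing


# Hypothetical site typed in 20 samples: 18 good calls, 1 low-depth, 1 no call.
site = [(12, 60)] * 18 + [(2, 45), None]
print(passes_site_filters(site))
```

With the default thresholds, a site is kept as long as no more than 15% of its genotypes are missing after masking.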
All individuals were checked for relatedness with the --relatedness2 option in VCFtools (v0.1.13) [36,37]. One individual from a pair of samples from Príncipe with a kinship coefficient of 0.248 (first-degree) was removed. The final dataset consisted of 135 samples from 15 populations with 149,501 SNPs with a transition/transversion ratio (Ti/Tv) of 2.58, confirming the high quality of the sequences.
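To illustrate how a kinship coefficient such as 0.248 maps onto a degree of relatedness, the sketch below applies the cut-offs commonly quoted for KING-style kinship estimates. The function name and the exact boundaries are assumptions made for illustration and are not taken from the scripts used in this study.

```python
def relatedness_degree(phi: float) -> str:
    """Map a KING-style kinship coefficient to a relationship category.

    The boundaries (0.354, 0.177, 0.0884, 0.0442) are the values
    commonly quoted for KING-based inference; they are assumed here.
    """
    if phi > 0.354:
        return "duplicate/MZ twin"
    if phi > 0.177:
        return "first-degree"
    if phi > 0.0884:
        return "second-degree"
    if phi > 0.0442:
        return "third-degree"
    return "unrelated"


print(relatedness_degree(0.248))
```

Under these cut-offs, the pair from Príncipe falls in the first-degree range, in agreement with the classification reported above.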

Population Structure Analyses
Haplotype-based coancestry matrices, principal component analysis (PCA) and clustering dendrograms were obtained with fineSTRUCTURE/CHROMOPAINTER v.2 [38]. Phasing and genotype imputation were carried out with BEAGLE (v4.1) [39,40]. We calculated two types of coancestry matrices. In the first type, we assumed that the haploid genomes of each individual are formed by copying DNA chunks from any other individual in the whole sample, independently of the group to which that individual belongs (Supplementary Figure S4a). In the second type of matrix, we defined a group of recipients whose haploid genomes were copied from a group of donors belonging to a specific set of populations (Figure S2a). The differences between the average copy profiles of pairs of recipient populations were quantified using the total variation distance (TVDxy) [41,42] and visualized with a Neighbor-Joining (NJ) consensus tree with weighted branches using SplitsTree4 [43] (Figure 2c). Support for NJ partitions was calculated by generating 1000 replicates of the original coancestry matrix by sampling with replacement the copy profiles of individuals from each recipient population.
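The distance between copy profiles and the resampling step can be sketched as follows. The sketch assumes the usual definition of the total variation distance, TVDxy = 0.5 Σ|x_i − y_i|, applied to copy profiles normalized to sum to one. All function names and the toy profiles are hypothetical, and the construction of the NJ tree itself is not shown.

```python
import random


def normalize(profile):
    """Scale a copy profile so that its entries sum to one."""
    total = sum(profile)
    return [value / total for value in profile]


def tvd(profile_x, profile_y):
    """Total variation distance between two copy profiles."""
    x, y = normalize(profile_x), normalize(profile_y)
    return 0.5 * sum(abs(a - b) for a, b in zip(x, y))


def mean_profile(individual_profiles):
    """Average copy profile of a recipient population (one row per individual)."""
    n = len(individual_profiles)
    return [sum(column) / n for column in zip(*individual_profiles)]


def bootstrap_tvd(pop_x, pop_y, replicates=1000, seed=0):
    """Resample individuals with replacement within each population."""
    rng = random.Random(seed)
    distances = []
    for _ in range(replicates):
        boot_x = [rng.choice(pop_x) for _ in pop_x]
        boot_y = [rng.choice(pop_y) for _ in pop_y]
        distances.append(tvd(mean_profile(boot_x), mean_profile(boot_y)))
    return distances


# Toy profiles over three hypothetical donor groups (chunk counts).
pop_a = [[50, 30, 20], [55, 25, 20], [45, 35, 20]]
pop_b = [[20, 30, 50], [25, 30, 45]]
print(tvd(mean_profile(pop_a), mean_profile(pop_b)))
```

Each bootstrap replicate yields one distance per pair of recipient populations, so that a distance matrix, and from it a tree, can be rebuilt for every replicate.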
To estimate mutation emission and recombination scaling parameters used in the analyses relying on CHROMOPAINTER, we performed initial runs using 10 iterations of the Expectation-Maximization (EM) algorithm for a subset of five randomly selected chromosomes (chr1, chr6, chr11, chr16 and chr21). The inferred parameters were first averaged by chromosome (weighted by their number of SNPs) and then by individuals. These parameters were then used in subsequent CHROMOPAINTER runs on all individuals and chromosomes.
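The two-step averaging described above can be written as a short sketch, assuming that the per-chromosome estimates of each individual are combined with weights proportional to SNP counts and that the resulting per-individual values are then averaged with equal weight. The data layout, names and numbers are hypothetical.

```python
def average_em_parameter(estimates, snp_counts):
    """Average an EM-inferred parameter over chromosomes, then individuals.

    estimates: {individual: {chromosome: value}} (hypothetical layout)
    snp_counts: {chromosome: number of SNPs}
    """
    per_individual = []
    for chrom_values in estimates.values():
        total_snps = sum(snp_counts[chrom] for chrom in chrom_values)
        # SNP-weighted mean across the chromosomes used in the initial runs.
        weighted = sum(
            value * snp_counts[chrom] for chrom, value in chrom_values.items()
        ) / total_snps
        per_individual.append(weighted)
    # Unweighted mean across individuals.
    return sum(per_individual) / len(per_individual)


# Hypothetical values for two individuals and two chromosomes.
snp_counts = {"chr1": 300, "chr21": 100}
estimates = {
    "ind1": {"chr1": 2.0, "chr21": 6.0},
    "ind2": {"chr1": 4.0, "chr21": 4.0},
}
print(average_em_parameter(estimates, snp_counts))
```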
Genotype-based, unsupervised clustering analyses were performed by applying ADMIXTURE v1.3.0 [44] to a linkage disequilibrium (LD) pruned dataset consisting of 62,564 SNPs, obtained with the PLINK --indep-pairwise option [45], using a 200-SNP sliding window incremented by 5 SNPs, and an LD threshold of r² = 0.2. We performed 20 independent ADMIXTURE runs for each K value from 2 to 5, applying a cross-validation (CV) procedure. The results were post-processed and plotted with the pong software [46].
Pairwise Fst values between populations were calculated with EIGENSOFT [47,48] and visualized with a heatmap and UPGMA clustering with the Pheatmap package [49].
We used PCA (Figure 1c) and ADMIXTURE (K = 4) (Figure 1d) to quantify European, Gulf of Guinea and Angolan ancestral contributions to the Forros and Principenses. Using ADMIXTURE, the European contribution was simply taken as the proportion of the European ancestry component (orange) in each population. To estimate the Gulf of Guinea and Angolan contributions, we calculated the average proportions of the ancestry components in blue and red in Gulf of Guinea (Esan and Yoruba) and Angolan (Ovimbundu, Ganguela and Nyaneka) populations, and we used these proportions in Bernstein's equation [50]. All three parental groups (Europe, Gulf of Guinea, Angola) consist of individuals that were assembled into homogeneous clusters using fineSTRUCTURE. The pastoralist Kuvale and Himba form a distinct subcluster in Angola and were not considered in this analysis (Supplementary Figure S4b).
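As an illustration of the last step, the sketch below applies the classical two-parent form of Bernstein's equation, m = (p_h − p_1)/(p_2 − p_1), to the average proportion of a single ancestry component. The function name and all numerical values are hypothetical; how the components were rescaled before being entered in the equation is not specified here and is left out of the sketch.

```python
def bernstein_admixture(p_hybrid, p_parent1, p_parent2):
    """Contribution of parent 2 to the hybrid population.

    p_* are the average proportions of one ancestry component in the
    hybrid population and in the two parental groups.
    """
    if p_parent1 == p_parent2:
        raise ValueError("parental groups must differ for this component")
    return (p_hybrid - p_parent1) / (p_parent2 - p_parent1)


# Hypothetical proportions of one component: Angola 0.20, Gulf of Guinea 0.90,
# admixed population 0.69.
m_gulf_of_guinea = bernstein_admixture(0.69, p_parent1=0.20, p_parent2=0.90)
m_angola = 1.0 - m_gulf_of_guinea
print(m_gulf_of_guinea, m_angola)
```

The estimate is only informative when the chosen component differs clearly between the two parental groups, since the denominator otherwise approaches zero.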
For PCA, we determined the centroids of the European, African, Gulf of Guinea and Angolan populations in PC1 + PC3. To obtain the European admixture proportions we projected each Forro and Principense individual onto the line connecting the European and African centroids and estimated the European contribution as one minus the Euclidean distance of each projection to the European centroid, divided by the Euclidean distance between the European and the African centroids. We repeated the same procedure to obtain the contributions from the Gulf of Guinea using the line connecting the Gulf of Guinea and Angolan centroids [51,52]. Since the Gulf of Guinea/Angolan ancestry proportions of the Angolares could not be assessed with these methods, we additionally used an ad hoc approach based on the relative positions of populations in the NJ tree obtained from the TVDxy distances (Figure 2c). In this approach, the Gulf of Guinea contribution was given by the Euclidean distance between the root of each population from São Tomé and Príncipe to the Yoruba, divided by the distance between the Yoruba and the midpoint between the roots of the Ganguela and Nyaneka.
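The centroid-based projection can be sketched in a few lines. The sketch assumes two-dimensional coordinates (for example PC1 and PC3) and uses hypothetical function names and toy coordinates; it illustrates the geometry described above rather than the code used for the analyses.

```python
import math


def centroid(points):
    """Mean position of a set of individuals in PC space."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def project_onto_line(point, a, b):
    """Orthogonal projection of a point onto the line through a and b."""
    ab = [bj - aj for aj, bj in zip(a, b)]
    ap = [pj - aj for aj, pj in zip(a, point)]
    t = sum(x * y for x, y in zip(ap, ab)) / sum(x * x for x in ab)
    return tuple(aj + t * x for aj, x in zip(a, ab))


def ancestry_from_pca(point, source_centroid, other_centroid):
    """Contribution of the source group: 1 - d(projection, source) / d(source, other)."""
    projection = project_onto_line(point, source_centroid, other_centroid)
    return 1.0 - math.dist(projection, source_centroid) / math.dist(
        source_centroid, other_centroid
    )


# Toy coordinates: a European centroid at (0, 0), an African centroid at (10, 0).
european = centroid([(-1.0, 0.0), (1.0, 0.0)])
african = centroid([(9.0, 1.0), (11.0, -1.0)])
print(ancestry_from_pca((8.0, 3.0), european, african))
```

Because the estimate relies on unsigned distances, it is not clamped to the interval [0, 1]; individuals projecting outside the segment between the two centroids would need separate handling.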

Genetic Diversity
We characterized genetic diversity using observed, individual per locus heterozygosities (Ho), runs of homozygosity (ROH), LD measured by the squared correlation of allele frequencies (r²), folded Site Frequency Spectra (SFS) and individual inbreeding coefficients (Fis). To control for uneven sample sizes in LD, SFS and Fis calculations, we downsampled the number of individuals to 5 (the sample size of the Ovimbundu, Nyaneka and Ganguela). We repeated this process ten times, calculating each summary statistic in each replicate, and taking the average over replicates as the final estimate.
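The downsampling scheme can be summarized with the following sketch, which draws five individuals without replacement in each of ten replicates and averages an arbitrary summary statistic over replicates. The function name, the seed and the toy statistic are hypothetical.

```python
import random
from statistics import mean


def downsampled_statistic(individuals, statistic, size=5, replicates=10, seed=0):
    """Average a summary statistic over random subsamples of fixed size.

    individuals: list of per-individual records (any type)
    statistic: function taking a list of individuals and returning a number
    """
    rng = random.Random(seed)
    values = []
    for _ in range(replicates):
        # rng.sample draws without replacement.
        subsample = rng.sample(individuals, size)
        values.append(statistic(subsample))
    return mean(values)


# Toy example: per-individual heterozygosities for a population of nine.
heterozygosities = [0.21, 0.19, 0.22, 0.20, 0.18, 0.23, 0.20, 0.21, 0.19]
print(downsampled_statistic(heterozygosities, mean))
```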
We calculated LD (r²) between pairs of SNPs in sliding windows of 1 Mb in each population using PLINK 1.9. To evaluate the LD decay, we binned the LD values between pairs of SNPs according to different genomic distance categories (<2 Kb The SFS and Fis statistics were also calculated with PLINK, using --freq and --het, respectively. Ho was calculated by dividing the number of polymorphic sites in each individual by the total number of SNPs in the dataset.

Mitochondrial DNA and Y-Chromosome Variation
We compared newly generated data from the island of Príncipe consisting of mitochondrial DNA (mtDNA) sequences of hypervariable regions (HVR) I and II, and Y-chromosome microsatellite haplotypes, with previously reported data from the island of São Tomé [22]. MtDNA sequences and haplotypes defined by 11 Y-chromosome microsatellite loci (Powerplex Y system, Promega) were obtained for 41 maternally unrelated and 19 paternally unrelated individuals, carrying lineages that could be associated with at least one Lung'Ie speaker up to the grandparental generation. MtDNA sequencing and Y-chromosome typing were done as described [22]. MtDNA haplogroups were assigned with HaploGrep [53]. Y-chromosome haplogroups were inferred from microsatellite haplotypes with Haplogroup Predictor (http://www.hprg.com/hapest5/, accessed on 24 May 2020) [54].
The TMRCA for a previously identified [22] Y-chromosome descent cluster reaching high frequencies in the Angolares was calculated with the rho statistic [58,59], using an average microsatellite mutation rate of 0.0025 per locus per generation [60,61] and a generation time of 30 years [62].
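A minimal sketch of a rho-based age estimate is given below. It assumes a stepwise mutation model in which rho is the average number of repeat-unit differences between each sampled haplotype and the inferred ancestral haplotype, and it converts rho to time by dividing by the per-haplotype mutation rate. The number of loci entering the calculation is treated as a parameter, and the toy haplotypes are hypothetical.

```python
def rho_statistic(haplotypes, ancestral):
    """Mean number of mutational steps from the ancestral haplotype.

    haplotypes: list of tuples of repeat counts, one tuple per sampled
    chromosome (repeated entries encode haplotype frequency).
    """
    steps = [
        sum(abs(allele - root) for allele, root in zip(hap, ancestral))
        for hap in haplotypes
    ]
    return sum(steps) / len(steps)


def tmrca_years(rho, n_loci, mu=0.0025, generation_time=30):
    """Convert rho to years: rho / (mu * n_loci) generations."""
    return rho / (mu * n_loci) * generation_time


# Toy descent cluster at three hypothetical loci: three copies of the
# ancestral haplotype and one single-step neighbor.
ancestral = (14, 13, 30)
cluster = [(14, 13, 30)] * 3 + [(15, 13, 30)]
rho = rho_statistic(cluster, ancestral)
print(rho, tmrca_years(rho, n_loci=3))
```

The sketch returns a point estimate only; the confidence limits reported for rho-based ages require an additional variance estimate that is not shown here.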

Genetic Structure
Using newly generated expanded exome data, we compared the genetic composition of three creole-speaking communities from São Tomé and Príncipe with five populations from Angola, as well as with available sequence data from Europe and from major slave-trading regions in Africa (Figure 1a; Supplementary Table S1).
In a haplotype-based principal component analysis (PCA) implemented by the fineSTRUCTURE algorithm [38], the Angolares show no detectable European ancestry (PC1) and are clearly separated from all other African populations (PC2) (Figure 1b). A pairwise Fst analysis further shows that the average genetic distances between the Angolares and their neighbors from São Tomé and Príncipe (Fst = 0.027) are as high as their distance to other African populations (Fst = 0.028) (Supplementary Figure S1). This differentiation is also confirmed in an LD-pruned dataset using ADMIXTURE [44] (Figure 1d). While the position of the divergent Angolares cannot be determined in this genetic gradient, the Forros and Principenses lie between the Esan and Yoruba from Nigeria, and a group of Bantu-speakers from Angola that includes the Nyaneka, the Ganguela and the Ovimbundu (Figure 1c). This observation is compatible with the available historical records, which identify Nigeria/Gulf of Guinea and Congo/Angola as the two most relevant slave-trading areas involved in the settlement of São Tomé and Príncipe [11].
In contrast to the Forros and Principenses, the Bubi from the neighboring island of Bioko are grouped together with Bantu-speaking populations (Figure 1c). The Bubi sample, however, is quite heterogeneous, and three individuals overlapping in PC3 with the Kuvale and Himba are in fact separated as extreme outliers by PC4 (Figure 1c; Supplementary Figure S2); these individuals, together with three additional samples showing less extreme differentiation, have previously been grouped in a distinct genetic cluster using whole genome data [2].
To better elucidate the relationship between the Angolares and other groups, we tried to reduce the impact of their genetic differentiation by further exploring haplotype sharing profiles generated by CHROMOPAINTER [38,42]. In this analysis (Figure 2a), we split the African populations into a group of donors and a group of recipients, assuming that the haplotypes of recipients were exclusively formed by DNA chunks from donor populations. The recipient group included the three language communities from São Tomé and Príncipe as well as populations from West Africa (Mandinka), Gulf of Guinea (Yoruba) and Angola (Ganguela, Nyaneka, Himba). The donor group consisted of the remaining populations, all from geographical areas located as close as possible to the recipients: Mende in western Africa; Esan in the Gulf of Guinea; Ovimbundu and Kuvale in Angola. To account for a possible contribution of Bioko to São Tomé and Príncipe, the Bubi were also included in the donor group. Figure 2a presents a coancestry matrix based on the number of haplotype segments shared between donors and recipients. As expected, the haplotype copy profiles show that recipient populations from the African mainland derive most of their haplotypes from donor groups that match their geographic and linguistic area. Figure 2b displays pairwise total variation distances (TVDxy) between recipients, calculated on the basis of their inferred African ancestries [38,42]. Remarkably, when genetic similarity is assessed only on the basis of ancestry, the three groups from São Tomé and Príncipe become very close to each other, suggesting that the genetic uniqueness of the Angolares was caused by demographic events occurring within the island of São Tomé, rather than by different external contributions from the African mainland (Figure 2b). 
The genetic similarity between Angolares, Forros and Principenses is further illustrated by a Neighbor-Joining network calculated with the TVDxy matrix, which, as expected, is closely related to geography (Figure 2c).
We additionally used both PCA and ADMIXTURE to quantify the ancestral contributions of Europe (Iberians and CEU-Europeans), Gulf of Guinea/Nigeria (Esan and Yoruba) and Congo/Angola (Ganguela, Nyaneka and Ovimbundu) to the genome of Forros and Principenses (Supplementary Table S2, Supplementary Figure S4). Although the Ovimbundu, Ganguela and Nyaneka are located to the south of more relevant slave trade areas from Angola, these populations can still be considered adequate proxies for the Congo/Angola ancestry, as they have been shown to be genetically very similar to the Kikongo and Kimbundu-speaking groups that inhabit those areas [63,64].
Based on Euclidean distances to African and European PCA centroids (Figure 1c) we estimated a substantially higher European ancestry in the Forros than in the Principenses (13% vs. 3%). Using the same approach, we found Gulf of Guinea/Angolan ancestry ratios of 76%:24% in the Principenses and 66%:34% in the Forros (Supplementary Table S2). These ratios are in accordance with the low impact of Bantu lexicon in Lung'Ie when compared to Lungwa Santome, but do not reach statistical significance (Mann-Whitney p = 0.09). Ancestry estimates based on the frequency of ADMIXTURE components were highly correlated with the PC-based results (rho = 0.99; p < 10⁻⁵ for European ancestry estimates; rho = 0.81; p < 2 × 10⁻⁴ for Gulf of Guinea/Angolan estimates; Supplementary Table S2).
As the extreme divergence of the Angolares does not allow for an assessment of their ancestral proportions using PCA or ADMIXTURE, we additionally estimated Gulf of Guinea/Angolan ratios using an ad hoc approach based on the relative position of the three populations from São Tomé and Príncipe in the TVDxy/Neighbor-Joining network (Figure 2c). In agreement with the outstanding lexical Bantu contribution in Lunga Ngola, the Angolares display a lower Gulf of Guinea/Angolan ratio (58%:42%) than the other two groups (Principenses: 69%:31%; Forros: 64%:36%).

Genetic Diversity
Despite their mixed ancestry, the Angolares display substantially lower levels of genetic diversity than any other African population in our dataset. The patterns of ROH presented in Supplementary Figure S5 provide a remarkable illustration of this homogeneity. Both the number of ROH (nROH) and the average total length in ROH (sROH) of the Angolares are only surpassed by the European populations, who have experienced a bottleneck during the Out-of-Africa migration [65]. Consideration of the average ROH size (sROH/nROH) further shows that the Angolares have unusually long ROH, even when compared with the Europeans, as expected for populations with recent inbreeding [66,67] (Figure 3a). Interestingly, a similar, albeit less pronounced, trend is observed in the Himba and Kuvale from southwestern Angola, who have a well-documented preference for cross first-cousin marriages between a man and his father's sister's daughter [68]. Consistent with these observations, the Angolares, the Himba and the Kuvale stand apart from the other populations especially for longer ROH categories, measuring more than 2 Mb (Supplementary Figure S5d). A further indication of recent inbreeding is provided by a plot of sROH vs. nROH, showing that ROH sizes in the Angolares are longer than expected from their number of ROH, on the basis of the best-fitting line for sROH vs. nROH in outbred populations from mainland Africa (Supplementary Figure S6). However, the signals of inbreeding revealed by the ROH analyses are coupled with negative Fis values that are similar to other populations, and no significant differences in mating patterns could be captured using this statistic (Supplementary Figure S7).
Additional characterization of other aspects of genetic diversity shows that the Angolares have higher levels of LD (Figure 3b, Supplementary Figure S8), lower observed per locus heterozygosities (Ho; Figure 3c) and lower proportions of singletons in the site frequency spectrum (SFS; Figure 3d) than the other African populations. All these summary statistics are strongly intercorrelated (Supplementary Figure S9) and suggest that the low levels of genetic diversity observed among the Angolares were caused by a comparatively small effective population size.
Figure 3 (partial caption). For (b,d), populations were randomly downsampled (without replacement) to the smallest sample size, and the average over replicates is reported.

Reanalyzing Previously Generated Uniparental Data
Previously, we found that a single Y-chromosome microsatellite haplotype reached an unusually high frequency (15/25) in the Angolares, who otherwise retained a small number of equally frequent, molecularly divergent mtDNA lineages. This pattern contrasted with the high variability detected for the two uniparental markers in a sample of linguistically uncharacterized non-Angolar residents of São Tomé [22]. Using newly generated data, we now found that levels of mtDNA and Y-chromosome variability in Príncipe are similar to the non-Angolar sample from São Tomé (Supplementary Figure S10; Supplementary Tables S3-S6).
Moreover, we reassessed the provenance of the most common Angolar patrilineage by investigating its matching profile, using publicly available data from the Y-Chromosome Haplotype Reference Database (yhrd.org). We found that matches with Angola represent 39% (7/18) of all matches with African populations, although the Angolan sample size accounts for only 12% (309/2679) of the total sample size of populations in which at least one match was observed (Supplementary Figure S11; Supplementary Table S7). These results suggest that the most frequent Angolar Y-chromosome lineage is likely to have originated in Angola.
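The proportions quoted above can be checked directly, and one simple way to gauge the excess of Angolan matches is a one-sided binomial tail probability. This test is an illustrative assumption rather than part of the original analysis, and it treats the 18 matches as independent draws, which is a simplification.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


observed_share = 7 / 18        # matches with Angola among African matches
expected_share = 309 / 2679    # Angolan share of the total sample size

print(f"observed: {observed_share:.0%}, expected: {expected_share:.0%}")

# Probability of at least 7 Angolan matches out of 18 if matches were
# distributed in proportion to sample size.
p_value = binomial_tail(7, 18, expected_share)
print(p_value)
```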
We also attempted to estimate the time to the most recent common ancestor (TMRCA) of this lineage. By using previously defined criteria [61], we first delimited a descent cluster of close mutational neighbors that likely derived from the most frequent (ancestral) haplotype (Figure 4). Then, we calculated the time necessary to generate the descent cluster through mutation accumulation, based on the rho statistic [58,59]. Our estimate suggests a TMRCA of ~500 years (95% confidence interval limits ~62-940 years), in broad agreement with historical records indicating that the first slaves from Congo-Angola arrived at São Tomé around 1520.

Figure 4 (caption). Haplotypes were defined by using 10 microsatellite loci (DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439). Locus DYS385 was excluded from the network because it is duplicated. The inset shows the descent cluster centered around the most frequent haplotype in the Angolares. Haplotypes from the Angolares (ANG) and from a sample of linguistically uncharacterized non-Angolar residents of São Tomé (Non-ANG) were previously reported [22]. The newly generated data on the Y-chromosome haplotypes from Príncipe (PRI) are shown in Table S4. Circles represent haplotypes, area is proportional to frequency, and colors represent populations. Lines represent microsatellite mutational differences.

Discussion
The emergence of new ethnic identities defined by distinct cultural and linguistic traits is one of the most remarkable outcomes of the forced displacement of millions of Africans during the Atlantic Slave Trade. In contrast to other instances of European conquest where local societies were subjected to colonial rule, the plantation complex that was installed in the New World, especially in the Caribbean, relied on the mass replacement of indigenous groups with Africans of different geographical origins [1]. Although the coercive amalgamation of people from diverse backgrounds became the defining characteristic of the rapidly developing creole societies, enslavement and forced migration were also met with staunch resistance. Escapes (marronage) and rebellions were a frequent occurrence in all slave societies and eventually led to the creation of independent maroon communities surrounding the plantations [69].
Located right off the African coast, São Tomé and Príncipe anticipated some of these defining characteristics of Caribbean creole societies both in time and space [70]. While members of the modern Forro (São Tomé) and Principense (Príncipe) communities are generally understood to be the descendants of plantation slaves recruited in the Gulf of Guinea and Congo-Angola, the Angolares from São Tomé may represent the oldest maroon society formed during the Atlantic Slave Trade [8]. However, the specific conditions under which this self-governed group emerged remained to be fully elucidated.
By using 15 autosomal microsatellite polymorphisms together with mtDNA partial sequencing and Y-chromosome microsatellites, we have previously detected an unusually strong signal of genetic differentiation between the Angolares and a heterogeneous sample including non-Angolares inhabitants of the island of São Tomé [22]. However, the extreme differentiation of the Angolares, the paucity of comparative data and the low number of genetic markers analyzed did not allow us to recover the historical relationships between this community and other groups from São Tomé and Príncipe and from the African mainland.
Here, we used newly generated genome-wide data to show that when the impact of genetic differentiation is reduced, the Angolares display mixed contributions from the Gulf of Guinea and Angola (58%:42%) and are genetically closer to the Forros and the Principenses than to any other African population (Figure 2). At the same time, we confirm and extend the evidence indicating that the gene pool of the Angolares is unusually homogeneous, as shown by several summary statistics capturing different but related aspects of genetic diversity (LD, SFS and Ho), including long ROH suggestive of substantial inbreeding (Figure 3; Supplementary Figures S5-S9). This combination of features rules out the possibility that the Angolares originated from a single region of Africa, as assumed by the frequently cited hypothesis according to which they descend from survivors of the wreck of a slave ship carrying captives from Angola [4,8]. Instead, it is likely that the Angolares constitute an admixed isolate that was founded through the fusion of a small number of slaves with different geographical backgrounds and that subsequently experienced high levels of genetic drift and extensive isolation in the context of marronage.
Linguistic evidence offers additional insights into this scenario. The creoles spoken by the Forros (Lungwa Santome) and Angolares (Lunga Ngola) differ in the way in which Bantu-derived features are ingrained into different linguistic domains [11,12]. While the Kikongo influence in Lungwa Santome mostly consists of non-core lexical items, Kimbundu features in Lunga Ngola are found in multiple subsets of the language, including core vocabulary and phonology. This qualitatively different impact of the Bantu adstrate in Lunga Ngola suggests that the Angolares resulted from a union between creole-speaking slaves escaping from the plantations and recently arrived Kimbundu-speaking slaves from Angola, who had a considerable influence on the formation of the new language.
An important clue about this process is provided by the finding that the Angolares display a very high frequency of a single Y-chromosome microsatellite haplotype with a likely Angolan origin [22] (Figure 4; Supplementary Figure S11). As in well-known examples of social selection [61,71], this pattern suggests that the founding Angolar population was dominated by a high-ranking Angolan male, or a small group of related males, who achieved greater reproductive success than other men and passed their elevated social status to their male descendants, favoring the rapid expansion of a single patriline.
The high status of prominent Angolan men provides a socio-linguistic context that could readily explain the emergence of Lunga Ngola through the partial relexification of a pre-existing creole under the influence of Kimbundu-speaking leaders. Moreover, the role of headmen, also known as "captains", in the Angolar community has long been attested by historical sources [4,8]. As in many other maroon societies, transmissible male dominance is likely to have been favored by a highly centralized political organization under the strong authority of headmen who had the prerogative of polygyny and could transmit this mating advantage to their offspring [69]. Historical records additionally report episodes of abduction of women from the farms [4,8], suggesting that the maroon communities experienced shortages of females that may have been caused or exacerbated by polygyny. Only after the death of their last captain, Simão Andreza, at the beginning of the 20th century, did the long history of chieftainship among the Angolares come to an end [4,8].
The effects of cultural transmission of male social dominance can be illustrated with a simple deterministic model, where the favored cultural phenotype (higher status) is associated with higher rates of polygyny among individuals inheriting the Y-chromosome from dominant males [72,73]. For example, assuming that dominant men represent 5% of the male population, a mating success three times higher than that of other males would be necessary for the current frequency of the Angolar descent cluster to be reached during the ~500 years corresponding to its estimated TMRCA (Supplementary Figure S12).
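The logic of such a deterministic model can be sketched as a simple recursion. The version below is only an illustration under stated assumptions and is not the model of refs. [72,73] or the one behind Supplementary Figure S12: all dominant men carry the focal Y-chromosome lineage, they leave w times as many sons as other men, and each of their sons inherits dominant status with a hypothetical probability k. The function name, k and the number of generations are assumptions introduced here.

```python
from typing import List


def lineage_trajectory(d0: float, w: float, k: float, generations: int) -> List[float]:
    """Frequency of a Y-chromosome lineage under culturally transmitted dominance.

    d0: initial fraction of men who are dominant (all carry the lineage)
    w:  mating success of dominant men relative to other men
    k:  probability that a son of a dominant man is himself dominant
        (hypothetical parameter; k = 1 means perfect status transmission)
    """
    dominant = d0   # dominant carriers of the lineage
    carriers = 0.0  # non-dominant carriers of the lineage
    freqs = [dominant + carriers]
    for _ in range(generations):
        # Mean male reproductive output; non-dominant men have output 1.
        mean_w = 1.0 + (w - 1.0) * dominant
        new_dominant = k * w * dominant / mean_w
        new_carriers = ((1.0 - k) * w * dominant + carriers) / mean_w
        dominant, carriers = new_dominant, new_carriers
        freqs.append(dominant + carriers)
    return freqs


# Illustrative run: 5% dominant men, threefold mating success, imperfect
# status transmission, and 17 generations (about 500 years at an assumed
# 30 years per generation).
trajectory = lineage_trajectory(d0=0.05, w=3.0, k=0.8, generations=17)
for generation, freq in enumerate(trajectory):
    print(generation, round(freq, 3))
```

In this sketch the lineage frequency never decreases when w > 1, because sons who lose dominant status still carry the Y-chromosome; how fast it rises depends strongly on the assumed k, which is why the parameter values shown should not be read as estimates.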
Even when driven solely by males, social selection is expected to have a strong whole-genome impact. In their pioneering studies on the genetic structure of Native American tribes, James Neel and his colleagues [74,75] have shown that the transmission of polygyny among high-ranking men could increase inbreeding to higher levels than expected by systematic marriage among relatives [4,8]. Our observation that the Angolares display higher amounts of inbreeding than southwestern Angolan groups, such as the Himba and Kuvale, who favor cross-cousin marriage, is congruent with those findings (Figure 3a; Supplementary Figures S5 and S6). Other known consequences of cultural transmission of fitness, including a sharp reduction in effective population size (Ne) and a strong increase in allelic association [76], could explain the low genetic diversity and increased LD of the Angolares (Figure 3). Together with strong isolation, this reduction in Ne probably led to the group's unusual genetic divergence from the other populations of São Tomé and Príncipe.
While the observed patterns of Y-chromosome and genome-wide variation of the Angolares can also be explained by neutral demographic factors such as strong bottlenecks and founder effects, neutrality and social selection are of course not mutually exclusive. Therefore, further work is needed to clarify the roles played by these evolutionary factors in shaping the genetic and non-genetic dimensions of human diversity in São Tomé and Príncipe. Our genome-wide results provide the empirical framework for these analyses.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/genes12060833/s1, Figure S1: Pairwise Fst distances between populations; Figure S2: Haplotype-based Principal Component Analysis; Figure S3: ADMIXTURE analysis; Figure S4: Dendrogram displaying genetic relationships between studied individuals; Figure S5: Individual variation in measures of runs of homozygosity (ROH); Figure S6: Comparison between number of ROH (nROH) and total length of ROH (sROH); Figure S7: Variation in individual Fis values; Figure S8: Variation in average pairwise linkage disequilibrium (LD); Figure S9: Pairwise correlations between summary statistics of genetic diversity; Figure S10: mtDNA and Y-chromosome networks; Figure S11: Matching analysis; Figure S12: Frequency change in Y-chromosome descent cluster; Table S1: Geographic location, language affiliation and sample size of populations used in this study; Table S2: Ancestry estimates; Table S3: mtDNA variation in Príncipe; Table S4: Y-chromosome variation in Príncipe; Table S5: mtDNA diversity statistics; Table S6: Y-chromosome diversity statistics; Table S7: Y-chromosome haplotype sharing results.