Comparative Genome-Wide Survey of Single Nucleotide Variation Uncovers the Genetic Diversity and Potential Biomedical Applications among Six Macaca Species

Macaca is of great importance in evolutionary and biomedical research. Aiming at elucidating genetic diversity patterns and potential biomedical applications of macaques, we characterized single nucleotide variations (SNVs) of six Macaca species based on the reference genome of Macaca mulatta. Using eight whole-genome sequences, representing the most comprehensive genomic SNV study in Macaca to date, we focused on discovery and comparison of nonsynonymous SNVs (nsSNVs) with bioinformatic tools. We observed that SNV distribution patterns were generally congruent among the eight individuals. Outlier tests of nsSNV distribution patterns detected 319 bins with significantly distinct genetic divergence among macaques, including differences in genes associated with taste transduction, homologous recombination, and fat and protein digestion. Genes with specific nsSNVs in various macaques were differentially enriched for metabolism pathways, such as glycolysis, protein digestion and absorption. On average, 24.95% and 11.67% specific nsSNVs were putatively deleterious according to PolyPhen2 and SIFT4G, respectively, among which the shared deleterious SNVs were located in 564–1981 genes. These genes displayed enrichment signals in the ‘obesity-related traits’ disease category for all surveyed macaques, confirming that they were suitable models for obesity related studies. Additional enriched disease categories were observed in some macaques, exhibiting promising potential for biomedical application. Positively selected genes identified by PAML in most tested Macaca species played roles in immune and nervous system, growth and development, and fat metabolism. We propose that metabolism and body size play important roles in the evolutionary adaptation of macaques.


Introduction
Macaca is of great importance in evolutionary and biomedical research, belonging to the Cercopithecidae family, a diverse and widespread primate group that contains 23 extant species [1,2]. These closely related species not only constitute a hotspot in phylogeny research due to their rapid speciation [3], but are also essential nonhuman primate (NHP) models for a wide spectrum of biomedical research [4,5] because of their strong similarities to humans across physiological, developmental, behavioural, immunological, and genetic levels. While rhesus (M. mulatta) and cynomolgus (M. fascicularis) macaques are the most commonly used NHP models, southern pig-tailed (M. nemestrina), Barbary (M. sylvanus), Tibetan (M. thibetana), Assamese (M. assamensis), and Japanese (M. fuscata) macaques are emerging models in epidemiology, immunology, neuroscience, pathology, and behaviour science [6][7][8]. Selection of NHP models for biomedical research and evolutionary studies requires clear genetic information about the NHP system [9,10]. While some previous genetic studies have shown that Macaca species have considerably diverse genetic backgrounds, displaying enormous heterospecific variation [11,12], and that phenotypic variation exists across geographically distinct individuals [13], genome-wide genetic variations across Macaca species have not yet been thoroughly investigated. Additionally, previous biomedical research results and their interpretations can be confounded by indiscriminate use of various macaques, given that genetic data from multiple Macaca species could be incorporated into study designs. Therefore, it is necessary to estimate the genome-wide genetic divergence across these macaques.
Single nucleotide variation (SNV) is a primary form of genetic variation in the genome, and is considered to be significantly correlated with various phenotypes including disease susceptibility, illness severity and drug responses [14]. Characterization of SNVs based on whole genome data can provide a comprehensive and thorough dissection of genetic variation. Some genome analysis on Macaca species have included the identification of SNVs, but most have focused on evolutionary phylogeny or population genetics [15][16][17][18][19]. To date, only a few studies have analyzed macaque genomic SNVs, but are limited to two flagship Macaca species, rhesus and cynomolgus macaques. For example, Malhi et al. [20] identified approximately 23,000 candidate SNVs widely distributed throughout the coding and non-coding regions of M. mulatta genome, with large-scale parallel pyrosequencing technology in combination with bioinformatics tools. Also, Ng et al. [21] conducted a comparative study of heterospecific SNVs between human, rhesus and cynomolgus macaque genomes to determine whether macaque alleles are associated with the same phenotypes as their corresponding human alleles. However, a comparable map of genomic SNVs in other macaque species has thus far been lacking.
In this study, we mainly detected, characterized and compared the autosomal heterospecific SNVs of six Macaca species, from genome information of eight individuals, with an emphasis on nonsynonymous SNVs (nsSNVs). Employing the rhesus macaque as reference, we describe the genetic diversity patterns and evaluate the genetic differences across Macaca. This study aims to not only enrich the genetic data for Macaca, but also to facilitate their evolutionary studies as well as inform the selection of optimal NHP models for various research purposes. The identification of functionally significant genetic variations among macaques will open doors for large quantities of downstream studies, which can shed light on gene functions or genetic basis of certain phenotypic traits.

Genome-Wide Discovery of SNVs
We present the most comprehensive examination of SNVs among macaques to date, involving eight individuals and six species ( Table 1) that are of major importance in evolutionary and biomedical research. In total, we identified approximately 10 million final SNVs within each macaque after a series of filtrations described in Fan et al. [22] (detailed results in Table 2). The Chinese rhesus macaque had the smallest number of SNVs, while the southern pig-tailed macaque held the largest number of SNVs, followed by the stumped-tailed macaque (M. arctoides) and the Assamese macaque (M. assamensis). Counting both homozygous and heterozygous sites, the average SNV frequencies ranged from 3.47/kb (the Chinese rhesus) to 5.15/kb (the southern pig-tailed macaque) for different individuals.  The numbers of SNVs shared by different macaques or unique to only one macaque are shown in Figure 1, which are generally in accordance with the phylogenetic relationships [26,27]. The Tibetan macaque and Assamese macaque, two closely related species, had more SNVs in common than with other individuals. The second largest SNV set was shared by the Tibetan, Assamese, and stump-tailed macaques, which belong to the same sub-clade of macaques. Finally, belonging to another close species pair, the Chinese rhesus macaque shared more SNVs with the cynomolgus macaques than with others. Approximately 40% SNVs were unique to the southern pig-tailed macaque, which is the most phylogenetically diverged species from rhesus macaque in this survey, representing the largest proportion of species-specific SNVs among our sample set. The Tibetan macaque harboured the least species-specific SNVs with a percentage of 13.68%, mainly due to a less diverse genetic background [22] and high degree of homogeneity, as well as being very closely related to the Assamese macaque.
We analyzed two species that were each represented by two individuals. While the two stump-tailed macaques shared the largest number of SNVs, which indicates very low genetic diversity, the two cynomolgus macaques, originating from Vietnam and Malaysia, had far fewer shared SNVs ( Figure 1, Table 2). This lack of common SNVs is even less than those shared between the Tibetan and Assamese macaques, highlighting that the two cynomolgus macaques have strikingly different genetic backgrounds. Therefore, we processed the two cynomolgus macaques separately and surveyed their population-specific SNVs, meanwhile analyzed the two stump-tailed macaques as one sample by only examining the shared SNVs as well as species-specific (not individual-specific) SNVs in subsequent analyses. The numbers of SNVs shared by different macaques or unique to only one macaque are shown in Figure 1, which are generally in accordance with the phylogenetic relationships [26,27]. The Tibetan macaque and Assamese macaque, two closely related species, had more SNVs in common than with other individuals. The second largest SNV set was shared by the Tibetan, Assamese, and stump-tailed macaques, which belong to the same sub-clade of macaques. Finally, belonging to another close species pair, the Chinese rhesus macaque shared more SNVs with the cynomolgus macaques than with others. Approximately 40% SNVs were unique to the southern pig-tailed macaque, which is the most phylogenetically diverged species from rhesus macaque in this survey, representing the largest proportion of species-specific SNVs among our sample set. The Tibetan macaque harboured the least species-specific SNVs with a percentage of 13.68%, mainly due to a less diverse genetic background [22] and high degree of homogeneity, as well as being very closely related to the Assamese macaque.
We analyzed two species that were each represented by two individuals. While the two stumptailed macaques shared the largest number of SNVs, which indicates very low genetic diversity, the two cynomolgus macaques, originating from Vietnam and Malaysia, had far fewer shared SNVs ( Figure 1, Table 2). This lack of common SNVs is even less than those shared between the Tibetan and Assamese macaques, highlighting that the two cynomolgus macaques have strikingly different genetic backgrounds. Therefore, we processed the two cynomolgus macaques separately and surveyed their population-specific SNVs, meanwhile analyzed the two stump-tailed macaques as one sample by only examining the shared SNVs as well as species-specific (not individual-specific) SNVs in subsequent analyses. The transition/transversion ratios across these macaques fluctuated from 2.17 to 2.25, which is in agreement with the previous studies of human and other primates [28,29], and we observed that C→T and G→A substitutions were most prevalent in the mutation spectrums of Macaca species. The heterozygosity level was lowest in the Tibetan macaque (17.16%), followed by two stump-tailed macaques with observed levels equal to 28.76% and 29.32%. These low genetic diversity levels likely resulted from the relatively small effective population sizes of the two species based on the study of Fan et al. [22,25]. In contrast, Chinese rhesus macaque maintained the highest heterozygosity level, followed by Malaysian and Vietnamese cynomolgus macaques, which mirrored the previous finding that these two macaques had high genetic diversity [30]. Heterozygosity estimates revealed varying levels of genetic diversity in our sampled macaques. The transition/transversion ratios across these macaques fluctuated from 2.17 to 2.25, which is in agreement with the previous studies of human and other primates [28,29], and we observed that C→T and G→A substitutions were most prevalent in the mutation spectrums of Macaca species. The heterozygosity level was lowest in the Tibetan macaque (17.16%), followed by two stump-tailed macaques with observed levels equal to 28.76% and 29.32%. These low genetic diversity levels likely resulted from the relatively small effective population sizes of the two species based on the study of Fan et al. [22,25]. In contrast, Chinese rhesus macaque maintained the highest heterozygosity level, followed by Malaysian and Vietnamese cynomolgus macaques, which mirrored the previous finding that these two macaques had high genetic diversity [30]. Heterozygosity estimates revealed varying levels of genetic diversity in our sampled macaques.

Functional Annotation of SNVs
To annotate the putative functional effects of the SNVs detected across the eight macaque genomes, we processed the sites with ANNOVAR [31] (Table 3) and SnpEff [32] (Table S1) based on the rheMac2 reference genome, which produced largely consistent results. Upon mapping our SNVs to the latest rhesus macaque genome assembly Mmul_8 [33] and the human reference genome GRCh37, we also observed concordant ANNOVAR annotations (Table S2). The SNV mapping rates were greater than 99% to Mmul_8 and 77-92% to GRCh37. Though the mapping rate to humans was lower due to larger genetic disparities between human and rhesus macaque, hundreds of additional SNV putative functional variants were obtained. For example, there were 56,515-87,789 exonic SNVs in rheMac2 across these macaques, 56,285-88,136 in Mmul_8, and 59,371-92,402 in human, likely resulting from the more accurate gene models in human genome. Similar to most previous studies [34,35], we consistently observed a higher SNV frequency in intergenic rather than genic regions for all species. As shown in Figure 2, 34.23~36.31% of total SNVs were genic variants, including 0.60~0.69% that were in exons. Among these exonic SNVs, 38.74~40.77% were non-synonymous SNVs (nsSNVs) and less than 1% were stop codon related SNVs (scrSNVs). These nsSNVs and scrSNVs are most likely to be putatively functional variants, which subjected to further analyses later. Finally, the top five mutation biases were shared among all surveyed macaques are as follows: CCG→CCA, AAC→AAT, ACG→ACA, ACA→ACG, and CAC→CAT for codon alterations and A←→T, V←→I, A←→V, P→L, R→H for amino acid changes. An accurate atlas of functional annotations for the SNVs of Macaca was drawn based on rhesus macaque and human reference genomes.

Characterization of SNV and nsSNV Distribution Patterns
We investigated the differences in SNV and nsSNV distribution patterns to illustrate their genetic discrepancies among Macaca species. Employing non-overlapping 50 kb sliding window scans, we observed that the SNVs were not uniformly distributed on the autosomes. While the distribution patterns were generally congruent among different individuals, a few regions were distinct. These exceptions included thirteen bins with relatively more SNVs across all individuals on chromosomes 3, 4, 5, 9, 10, and 20 (Table S3). It should be noted, these were not problematic regions, as these bins were not located near gaps or incomplete reference genome sequence, nor near regions of high structural variation across individuals found by [36]. Additionally, twelve windows on chromosome 2 exhibited very high SNV densities (9.52-23.74/kb), which displayed only in the Assamese macaque (M. assamensis) ( Figure S1 and Table S4). The SNV rates in these twelve windows were two to five times higher than the average number of SNVs observed across other windows. Furthermore, seven known genes are located within this SNV-dense region including a highly conserved gene Robo2 and Dazl, Galnt15, Dph3, Oxnad1, Rftn1, and Plcl2. These genes are involved in immune response, pharmacodynamics, spermatogenesis, O-linked glycosylation, axon guidance and cell migration. While this SNV outlier region could arise due to copy number variations (CNVs) specific to this macaque species, we implemented filtration steps to remove SNVs that overlap with the known duplications in the rhesus macaque [37], rather than that of Assamese macaque using the method described in Fan et al. [22]. Further CNV study on Assamese macaque would be needed to clarify.
As for nsSNVs, the overall autosomal distribution patterns were basically congruent across individuals. Outlier tests revealed a total of 319 bins, harbouring 1299 genes with nsSNVs, which displayed remarkably distinct nsSNV distribution patterns across macaques. The outlier test result on chromosome 1 is exhibited in Figure 3, and details of the 319 bins are provided in Table S5.

Characterization of SNV and nsSNV Distribution Patterns
We investigated the differences in SNV and nsSNV distribution patterns to illustrate their genetic discrepancies among Macaca species. Employing non-overlapping 50 kb sliding window scans, we observed that the SNVs were not uniformly distributed on the autosomes. While the distribution patterns were generally congruent among different individuals, a few regions were distinct. These exceptions included thirteen bins with relatively more SNVs across all individuals on chromosomes 3, 4, 5, 9, 10, and 20 (Table S3). It should be noted, these were not problematic regions, as these bins were not located near gaps or incomplete reference genome sequence, nor near regions of high structural variation across individuals found by [36]. Additionally, twelve windows on chromosome 2 exhibited very high SNV densities (9.52-23.74/kb), which displayed only in the Assamese macaque (M. assamensis) ( Figure S1 and Table S4). The SNV rates in these twelve windows were two to five times higher than the average number of SNVs observed across other windows. Furthermore, seven known genes are located within this SNV-dense region including a highly conserved gene Robo2 and Dazl, Galnt15, Dph3, Oxnad1, Rftn1, and Plcl2. These genes are involved in immune response, pharmacodynamics, spermatogenesis, O-linked glycosylation, axon guidance and cell migration. While this SNV outlier region could arise due to copy number variations (CNVs) specific to this macaque species, we implemented filtration steps to remove SNVs that overlap with the known duplications in the rhesus macaque [37], rather than that of Assamese macaque using the method described in Fan et al. [22]. Further CNV study on Assamese macaque would be needed to clarify.
As for nsSNVs, the overall autosomal distribution patterns were basically congruent across individuals. Outlier tests revealed a total of 319 bins, harbouring 1299 genes with nsSNVs, which displayed remarkably distinct nsSNV distribution patterns across macaques. The outlier test result on chromosome 1 is exhibited in Figure 3, and details of the 319 bins are provided in Table S5. Outlier test of nonsynonymous SNV distribution patterns on chromosome 1 for eight macaque individuals using Cook's distance test in R. The blue circles represent outlier chromosomal bins that hold significantly more nsSNVs than others. In addition, the numbers next to the blue circles are Cook's distances of the outlier bins. The dark blue line shows the threshold we used that is 30 times of the mean Cook's distance.
To ascertain whether genes located in outlier regions of high nsSNV distribution have shared functional roles, we performed enrichment tests for Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO) terms for these genes (Table 4). Our results indicate that these genes are linked to metabolism, such as taste transduction (sweet, umami and bitter taste) (mcc04742; p = 0.0113), fat digestion and absorption (mcc04975; p = 0.0247), pancreatic secretion (mcc04972; p = 0.0339), and proteolysis (GO: 0006508; p = 0.0200). Genes involved in homologous recombination (HR) (mcc03440; p = 0.0163) were also enriched in our outlier regions. Four of the five outlier genes in the HR pathway are associated with double strand break repair (DSBR). Additionally, positive regulation of innate immune response (GO: 0045089; p = 0.0448) was one of the enriched GO terms for these outlier genes. The enrichment analysis showed metabolism and immune were two main functional roles for the nsSNVs in outlier windows. Outlier test of nonsynonymous SNV distribution patterns on chromosome 1 for eight macaque individuals using Cook's distance test in R. The blue circles represent outlier chromosomal bins that hold significantly more nsSNVs than others. In addition, the numbers next to the blue circles are Cook's distances of the outlier bins. The dark blue line shows the threshold we used that is 30 times of the mean Cook's distance.
To ascertain whether genes located in outlier regions of high nsSNV distribution have shared functional roles, we performed enrichment tests for Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO) terms for these genes (Table 4). Our results indicate that these genes are linked to metabolism, such as taste transduction (sweet, umami and bitter taste) (mcc04742; p = 0.0113), fat digestion and absorption (mcc04975; p = 0.0247), pancreatic secretion (mcc04972; p = 0.0339), and proteolysis (GO: 0006508; p = 0.0200). Genes involved in homologous recombination (HR) (mcc03440; p = 0.0163) were also enriched in our outlier regions. Four of the five outlier genes in the HR pathway are associated with double strand break repair (DSBR). Additionally, positive regulation of innate immune response (GO: 0045089; p = 0.0448) was one of the enriched GO terms for these outlier genes. The enrichment analysis showed metabolism and immune were two main functional roles for the nsSNVs in outlier windows.

Enrichment Analyses of Specific nsSNVs with Putative Functions
Missense mutations are of special interest since many are believed to have non-marginal functional effects [38]. To further parse the biological meaning of nsSNVs, we conducted enrichment analyses (KEGG pathway, PANTHER, and GO) for specific nsSNVs (results summarized in Table S6). The seven species/populations shared categories associated with metabolism, which emphasizes their metabolic diversity along with the above-mentioned nsSNV distribution patterns. Specifically, genes with specific nsSNVs in the Vietnamese cynomolgus macaque were uniquely depleted for glycolysis (P00024; p = 0.0319), the Chinese rhesus macaque for tryptophan metabolism (mcc00380; p = 0.0450), the stump-tailed macaque for protein digestion and absorption (mcc04974; p = 0.0383) as well as glycosylphosphatidylinositol (GPI)-anchor biosynthesis (mcc00563; p = 0.0486), and the Assamese macaque for other glycan degradation (mcc00511; p = 0.0439). Additionally, there was an enrichment signal in starch and sucrose metabolism (mcc00500) for both the Vietnamese cynomolgus macaque (p = 0.0415) and the southern pig-tailed macaque (p = 0.0462). Besides, the homologous recombination pathway (mcc03440; p = 0.0233) was also one of the enriched biological pathway terms for the Malaysian cynomolgus macaque, which complemented the above results related to HR and indicated that the Malaysian cynomolgus macaque probably had a distinct homologous recombination function compared with other samples. Also, these macaques shared several significantly enriched biological pathways, including ECM-receptor interaction (mcc04512), hematopoietic cell lineage (mcc04640), ABC transporters (mcc02010) and Fanconi anemia (mcc03460).
Considering that the mammalian target of rapamycin (mTOR) pathway plays a central role in organismal metabolism [39], we investigated the specific nsSNVs within genes belonging to this pathway. There were 10-30 specific nsSNV within 8-22 genes found in these macaques. Tibetan macaque contained a heterozygous SNV in Map2k2 (ENSMMUT00000027482:exon4:c.G391A:p.V131M) which was quite close to the reported heterozygous mutations causing Cardiofaciocutaneous Syndrome in human, P128Q and G132D [40]. The Tibetan, Assamese, and southern pig-tailed macaques displayed nsSNVs in Insr and Igf1r, which are key genes in carbohydrates, lipids and protein metabolism as well as growth and insulin-related phenotypes [41,42]. This is consistent with the observations that they are susceptible to diabetes mellitus, and are relatively sturdier than other macaques.
Significantly enriched GO terms highlighted additional patterns. For example, genes tolerating specific nsSNVs in Malaysian cynomolgus macaque belonged to several immune response related terms (GO:0002218; GO:0002224; GO:0002758; GO:0002764; GO:0002253), which mirrored the findings above based on nsSNV distribution. Also, in the stump-tailed macaque, nsSNVs were enriched in genes linked to the regulation of endocrine process (GO:0044060; p = 0.0236). The enrichment survey successfully identified genes with specific nsSNVs in these macaques may indicate different putative functions, and can provide a better understanding of the diversity among Macaca.

Putatively Damaging nsSNVs and Their Associated Diseases
Since we were interested in the study of multiple macaque species for potential applications to disease research, we identified the putatively deleterious specific nsSNVs which might be associated with human diseases. The identification was on the basis of the human genome assembly (GRCh37) with PolyPhen2 [43] and SIFT4G [44], since deleterious SNVs in human tend to play vital roles in diseases [45]. There were between 13,175 and 40,283 functional mutations among specific nsSNVs based on prediction of PolyPhen2. On average, 55.90% of these mutations (54.41-57.77%) were assessed to be benign, 16.39% (15.88-16.94%) and 24.95% (22.99-26.82%) were possibly damaging and probably damaging, respectively. Since far fewer deleterious SNVs (9.96-12.86%) were predicted by SIFT4G than PolyPhen2, this indicated the first represented more conservative predictions. The comparison between the two approaches are detailed in Table S7. Enrichment analyses of diseases (KEGG disease, NHGRI GWAS Catalog, and OMIM) were run with KOBAS3.0 [46] for the genes with deleterious specific nsSNVs identified by both programs. Results based on PolyPhen2 prediction are exhibited in Table S8.
The disease phenotypes shared by all macaques here included congenital disorders of development, cardiovascular physiology, skin and soft tissue diseases, obesity-related traits, and immune response, which revealed the major hereditary differences between Macaca species and human in the anatomic structure and physiological process. We found that all tested macaques were enriched for nsSNVs in genes linked to type II diabetes mellitus (H00409; OMIM: 125853) except the stump-tailed macaque. In addition, Chinese rhesus macaque is susceptible to type I diabetes (OMIM: 222100; p = 0.0085), inferring it may be an ideal model for type I diabetes.

SNVs Causing Stop Codon Changes
Genes with SNVs causing stop codon changes (scrSNVs) were also investigated by pathway enrichment tests with KOBAS 3.0 [46]. We mainly focused on the disease or medicine related KEGG pathways. A very small proportion of SNVs, ranging from 381 for the Chinese rhesus to 523 for the southern pig-tailed macaque, led to gained or lost termination codons. Due to their induced changes in protein products that could causes serious alterations, we looked closely at these high impact variations.
The genes with scrSNVs were implicated in the KEGG pathway "drug metabolism cytochrome P450" (mcc00982) in two cynomolgus, stump-tailed and Assamese macaques, including Fmo2, Cyp2d17-like, Fmo6p and Gsta5 ( Figure S2 and Table S9). The scrSNVs were predicted to affect the metabolism of tamoxifen (antitumor hormone drug), cyclophosphamide and ifosfamide, citalopram (anti-depression drug), codeine and morphine (dependence producing drug) for these macaques. Additionally, they were also significantly enriched in the metabolism of cyclophosphamide and ifosfamide (anticarcinogen) for the Malaysian cynomolgus macaque. The four above-mentioned macaques shared two scrSNVs with one in Fmo2 (ENSMMUT00000027724:exon9:c.C1606T:p.Q536X) and the other in Cyp2d17-like (ENSMMUT00000025240:exon1:c.C82T:p.Q28X). FMO2 proteins, which catalyze the oxidation of heteroatom centers in numerous drugs and xenobiotics, of many mammals including rhesus macaque are 536 residues [47]. Here we observed the four macaques produced a 535-amino acid protein due to this nonsense mutation, sharing the same allele found in African Americans and Hispanic populations [48,49], therefore can be good candidate experimental animals for drug studies of these populations. Despite Cyp2d17-like is a novel gene, Cyp2d17, highly homologous to human Cyp2d6, metabolizes human Cyp2d6 substrates such as bufuralol and dextromethorphan. The previous study showed nonsynonymous variants I297M and N337D in cynomolgus and rhesus Cyp2d17 significantly altered the catalytic activity of the protein [50]. Therefore, it is possible that the scrSNV in Cyp2d17-like can change the drug metabolism of these macaques.
The peroxisome proliferator-activated receptor (PPAR) signalling pathway, which plays a major regulatory role in energy homeostasis and metabolic function [51], emerged in the enriched terms of both the Tibetan and southern pig-tailed macaques ( Figure S3). The Tibetan macaque displayed an enrichment signature in PPAR-β/δ that involved three genes with scrSNVs, while the southern pig-tailed macaque showed an enrichment signal in PPAR-γ caused by four genes with scrSNVs.
These scrSNVs in the southern pig-tailed macaque might be related to metabolic disease susceptibility according to studies on human and mouse [52,53].

Positive Selection Based on SNVs
Positive selection (also known as Darwinian selection) is an important source of evolutionary innovation and a major force behind the divergence of species [54]. Thus, we surveyed the positively selected genes (PSGs) for Macaca species. Of the 11437 single-copy orthologous genes shared between rhesus macaque (M. mulatta), the olive baboon (Papio anubis), and human (Homo sapiens) (see Methods), 12-33 PSGs were identified (FDR, 0.05) for different macaques according to a standard likelihood ratio test ( Table 5). The phylogenetic tree of the ten species/populations used as the working topology based on these orthologous sequences is shown in Figure S4. PSGs were generally different for various macaques and the top ranked PSGs were also distinct, confirming that these monkeys diversely adapted to environment under natural selections, even though they were very closely related species.
PSGs for most tested macaques were mainly involved in the nervous system, immunity, growth, development, and fat metabolism. Neuro-related genes were represented in the PSGs of most macaques including the Vietnamese (Fam53a) and Malaysian (Htt) cynomolgus macaque, the Chinese rhesus macaque (Myrf ), the Tibetan macaque (Csrp1) and the southern pig-tailed macaque (Htt, Ndnf, and Tnk2). PSGs in the Vietnamese cynomolgus macaque (Fcamr), the Chinese rhesus macaque (Sharpin), the Assamese macaque (Myd88), and particularly the southern pig-tailed macaque (Cmip, Havcr1, Prdm1) also included several innate and adaptive immune related genes. Genes related to growth and development were found to exhibit putatively positive selection patterns in Macaca species, including the Vietnam cynomolgus macaque (Pin1, Igfl1, Evc2), the stump-tailed macaque (Dis3l2), the Tibetan macaque (Dis3l2, Acan), the Assamese macaque (Kcnk1, Ogfr, Evc2) and the southern pig-tailed macaque (Acan). In addition, a strong signal of positive selection was found in fat metabolism among all Macaca species except the Vietnamese crab-eating macaque and the Chinese rhesus macaque. The four categories of PSGs may reflect some of the main forces driving the species divergence of Macaca.

Large Dataset of SNVs for Macaca
We profiled and compared the diversity patterns of genome-wide SNVs for six macaque species by analyzing resequenced, high-coverage genome datasets. Macaques (Cercopithecidae: Macaca) are a group of non-human primates of great evolutionary and biomedical importance. Lack of sufficient genomic information has been a significant obstacle for their broader use. Thus, it was necessary to conduct such a systematic variation survey for this group. In comparison to a similar study conducted by Ng et al. [21], our sample set represented more Macaca species including the stump-tailed (M. arctoides), the Assamese (M. assamensis), and the southern pig-tailed (M. nemestrina) macaques. These species have received far less survey from a genomic perspective, therefore we included these species in our analyses on macaque SNVs and their putative links to biological functions. Altogether, this study represents the most comprehensive comparative assessment of genomic SNVs for Macaca to date.
Our study produced as massive set of genomic differences represented by ten million heterospecific SNVs and had laid the foundation for functional genomic studies in the future with various analyses. This does not just provide a valuable interspecific variation reservoir for Macaca, but also facilitates their evolutionary and other studies in the future. The total numbers of SNVs and the SNVs shared by different sets of macaques were generally congruent with their phylogenetic relationships based on previous studies [22,27]. We also provided additional evidence to confirm that Vietnamese and Malaysian crab-eating macaques were quite genetically different, sharing only 43% of their total SNVs, while the two stump-tailed macaques had more than 73% of SNVs in common.

Suggestive Functional Divergence Inferred from SNV Distribution Patterns
SNV distribution patterns among macaques well represent their genetic differentiation. The twelve SNV outlier windows found only in the Assamese macaque (M. assamensis) had a unique SNV distribution ( Figure S1 and Table S4), and might indicate the unique characteristics of the Assamese macaque in terms of immune response, pharmacodynamics, spermatogenesis, O-linked glycosylation, axon guidance and cell migration. Yet further study is needed to clarify. Enrichment analysis for the outlier genes based on nsSNV distribution demonstrated that these macaques diverged largely in metabolic pathways. The results in Table 4 suggested that these monkeys may prefer different tastes in food and have discriminating abilities to digest fat and protein, which can give clues to scientific feeding of macaques. The genetic discrepancies in metabolism might also have a connection with the diverse body sizes of these sibling species. For example, the rhesus and cynomolgus macaques are generally slim while the Tibetan, stump-tailed and Assamese macaques look larger and sturdier. This assumption is strengthened by the study of Li et al. [55], which analyzed transcriptomic data and found that the Tibetan macaque had more genes annotated to GO terms related to nutrient reservoir activity than the rhesus macaque, indicating a better ability to store nutrients, contributing to its large body size.
Our enrichment results also indicated that these macaques might have diverging patterns in homologous recombination or specifically DSBR (Table 4). DSBR is essential for maintaining the stability and integrity of genomes and can be used as a potential target of cancer treatment [56][57][58]. If these rhesus monkeys perform DSBR or HR differently, we believe that they tend to react differently to cancer therapy according to the previous studies of DSBR in human and animal models [59,60]. This can provide directions for their potential applications to cancer research.
Finally, the enrichment outputs implied that these macaques differ in innate immune response ( Table 4). Given that immune responses of experimental animals heavily affect the results of biomedical experiments, our results confirm that the adverse impacts caused by diverse genetic backgrounds of various macaques should be taken into account in future research designs. On the other hand, their genetic diversity makes it possible to screen optimal models for special research purposes.

Divergent Characteristics Inferred from Enrichment of Putative Functional SNVs
From the perspective of putative functional SNV, this study confirmed that Macaca species were generally characterized by a large diversity in metabolism. Genes with specific nsSNVs in different macaques were depleted in multiple metabolic pathways (Table S6), including glycolysis (P00024), protein digestion and absorption (mcc04974), glycan degradation (mcc00511), and starch and sucrose metabolism (mcc00500), which emphasized their metabolic diversity along with the above results of nsSNV distribution pattern analyses. It was quite consistent with the results of SNV distribution patterns (see detailed information about the discrepancies below).
The PPAR pathway was one of the most enriched metabolic pathways for genes with scrSNVs in the Tibetan macaque (PPAR-β/δ) and the southern pig-tailed macaque (PPAR-γ) ( Figure S3). In particular, PPAR-γ plays a pivotal role in insulin sensitivity, adipogenesis and placental function [61][62][63]. Mutations in this pathway are responsible for the development of severe insulin resistance, Type-2 diabetes, hypertension, elevated triglycerides and low high-density lipoprotein (HDL) levels, and metabolic syndrome [64]. Combined with the investigation of specific nsSNVs in the mTOR pathway, we infer that the southern pig-tailed macaque is probably susceptible to these metabolism disorders. In contrast, PPAR-β/δ is mainly involved in lipid oxidation and cell proliferation [65,66]. The stop codon mutations and specific nsSNVs in Insr and Igf1r found in the Tibetan macaque were more likely to contribute to its fat metabolism and comparatively large body size.
Genes with specific nsSNVs in the Malaysian cynomolgus macaque included those associated with activation of innate immune responses and homologous recombination, implying this species probably displays different special innate immune reactions compared to other macaques. These results are also mirrored in our findings based on nsSNV distribution.
Interestingly, one identified gene belonging to the enriched pathway 'regulation of endocrine process (GO: 0044060)' for the stump-tailed macaque, Pomc, may contribute to the characteristic red face of the stump-tailed macaque, since this gene plays an important role in hair and skin pigmentation based on previous studies [67,68].

Application Potentials in Biomedical Research
Results of this survey suggest that the Chinese rhesus macaque had relatively differential pharmacokinetic characteristics, including responses to antineoplastic agents, anti-depressant and antipsychotic agents, amphetamines and so on. In fact, Irwin et al. reported that the response intensity of α-amphetamine in different NHPs was very different, and it decreased in the order of squirrel monkey > rhesus macaque > pig-tailed macaque > stump-tailed macaque > baboon [69]. This emphasizes that inconsistent results may be generated when using different macaques as experimental animals in drug research. Besides, after examining the scrSNVs in genes belonging to the drug metabolism pathway (mcc00982), we found the assayed cynomolgus, stump-tailed, and Assamese macaques contained a genotype of Fmo2 also found in African Americans and Hispanic populations [48,49], as well as a nonsense mutation in Cyp2d17-like. This indicates that they are very likely to have specific drug-metabolic features as well as potentials to be optimal experimental animals in drug metabolism studies of specific human populations.
All tested macaques can serve as ideal spontaneous NHP models for obesity studies based on our study. For obesity-related traits was one of the enriched disease phenotypes for these macaques (Table S8). This is congruent with the observation that macaques are prone to suffer from obesity [70][71][72], and actually they have been applied to obesity related studies as spontaneous models for a long time [73]. From the aspect of adaptive evolution, we speculated that they were probably more efficient in energy-storing or fat-depositing than humans, in order to survive the harsh environments where availability of food resources always shifts.
Our results also indicated that the Chinese rhesus macaque can be developed as a promising spontaneous model for type I diabetes. Susceptibility to type I diabetes was one of the enriched terms for the Chinese rhesus macaque. Macaques have been wildly used as models for type II diabetes [74,75], which was supported by our convincing results that enrichment signal in type II diabetes appeared for most surveyed macaques. However, the perfect animal model for type I diabetes mellitus has yet to be found [76] and the standardized method of diabetes induction is far from reach [77], which urges the development of additional spontaneous models. Though it still needs more investigation, our survey gives clues to the selection of appropriate macaque models in different types of diabetes studies. The stump-tailed macaque is very likely to be a potentially spontaneous disease model of primary ciliary dyskinesia (PCD) as it displays similar genetic pathogenicity to humans. There was not only a strong enrichment signal identified in PCD, but also probably damaging nsSNVs in the seven disease-causing genes of PCD found in stump-tailed macaque. To date, murine models are most commonly used in PCD studies, and there has been no primate model yet [78]. The stump-tailed macaque may be a strong NHP model to this disease.
Our study also supported there was a genetic basis behind the spontaneous disease susceptibility of crab-eating macaque. This species, along with the rhesus, are the most frequently used NHPs in research, and both are observed to quite often suffer from severely spontaneous diseases in abdominal organs including liver and gastrointestinal tract quite [79]. Our results indicate that Vietnamese cynomolgus macaque had genes with functional SNVs linked to hepatic cysts and neoplasm of the gastrointestinal tract according to the enrichment results, providing a genetic explanation to this disease susceptibility phenomenon.
As in the study of Cornish et al. [80], we also believe that disease enrichments do not guarantee, but reveal a probability that these Macaca species harbouring similar loss-of-function variations indicated in human diseases, had a correspondingly higher susceptibility to certain diseases. Our results can lay foundation for developing better NHP models for a wide array of biomedical research.

Positive selections on Macaca Genomes
Previous research found that positive selection shaped the genetic variation of human populations during their earliest settlements in different environments [81,82]. We believe positive selection also heavily impacted the genetic variations of Macaca species during their adaptations. Thus, addressing the positive selections across Macaca is helpful to understand what drives their genetic differentiation.
Our study found that genes related to nervous system, immune function, growth, development, and fat metabolism display patterns of strong positive selection in most tested macaque species. Our results were in agreement with the previous finding that nervous system [83] and immune response [84][85][86] were two frequent targets for positive selection in primates. Additionally, these different putatively selected genes across these monkeys suggested they probably developed diverse immune mechanisms to adapt to various environments.
Body size is a major discrepancy among our assayed species. The Tibetan macaque is the largest, followed by the stump-tailed and pig-tailed macaques, while the cynomolgus macaque is the smallest. Though the other species are of medium body size, the Assamese macaque is a little more robust than the rhesus macaque. As expected, genes related to growth and development displayed evidence of positive selection, and a strong signal of positive selection was found in fat metabolism genes among all tested Macaca species except the Vietnamese eating-crab macaque and the Chinese rhesus macaque, which are relatively leaner species than the others. Along with the aforementioned discussion about metabolic divergences, it implied that body size played an important role in evolutionary adaptation of genus Macaca. Also, we agreed that dietary adaptation, including fat metabolism, may have been a major driving force behind species evolution, as was found in previous study [87]. More research needs to conduct to answer questions like when the positive selection occurred and what its functional consequences were.
Macaque sequence data were mapped to the rhesus reference genome [23] using Bowtie (v2.2.0; [88]) for Hiseq data, using the local alignment algorithm with very sensitive model and proper insert sizes. Default options were used for other parameters. SNVs on autosomes were called individually using Picard (v1.98; http://broadinstitute.github.io/picard/) and GATK (v3.2; [28]). The SOLiD data of CE2 was mapped to rheMac2 with BioScope and called SNVs using the same pipeline. After standard screening procedures, we employed several conservative filtrations to control the data quality with custom Python scripts (scripts are available upon request of the authors). Genome Filters (GF) and Sample Filters (SF) description in [22] were applied to minimize the errors derived from sequencing and alignment, such as errors resulted from triallelic sites, copy number variations, CpG and proximity to indels or other SNVs. We merged the eight SNV files in VCF format with 'vcf-merge' [89] to concatenate all SNVs and their genotypes. Sites with missing genotypes in any of the samples were deleted and all callable sites were kept for further analyses.
To identify species-and subspecies-specific SNVs, we screened our results using a custom shell script, using a Guinea baboon (Papio papio) as outgroup (GenBank accession: SRX652597 and SRX652598). We mapped the sequences of Guinea baboon to rheMac2 to find SNVs shared by all macaques and baboon, setting mapping threshold as 95%, then excluded them from the species-and subspecies-specific SNVs. For the two individuals of cynomolgus macaques that came from two distinct populations, we identified their population-specific SNVs separately. Upset plot was drawn to show the number of SNVs shared by different pairs or sets of macaques using UpSetR [90].

Functional Annotation of SNVs
Functional annotation was conducted for the final SNV set. To ensure greater accuracy, both ANNOVAR [31] and SnpEff [32] were used with default parameters based on the unique physical positions (bp) of SNVs on chromosomes and the gene annotation from rheMac2. To facilitate further functional analyses, these SNVs were also mapped to the latest assembly of the rhesus genome Mmul_8 and the human reference sequence (GRCh37) using LiftOver (https://genome.ucsc.edu/ cgibin/hgLiftOver), and were then structurally characterized with the above two software according to Mmul_8 and GRCh37, respectively. We compared the annotations obtained from the two software on the basis of three different reference genomes. Functional impact of specific nsSNVs based on GRCh37 were predicted with PolyPhen2 [43] and SIFT4G [44] under 'multiple transcripts' mode. Classifier model was set as 'HumDiv' in PolyPhen2.

Determination of SNV Distribution Patterns and Outlier Detection
We investigated the distribution pattern of SNVs on autosomes by a non-overlapping sliding window approach. The number of SNVs and nsSNVs were counted in each non-overlapping window to draw the distribution patterns. After we tried window sizes of 10 kb, 50 kb, and 100 kb by referring to previous studies [91,92], the size of the window was finally set as 50kb to get a good resolution of the distribution patterns. To figure out whether the SNV-dense bins overlap with regions where the reference genome is questionable or structural variations locate, we examined the bins by compared to 'bad' regions found in [36].
Outliers were detected by employing Cook's distance test in R [93] to identify the windows with significantly different distribution patterns of nsSNVs among the eight macaques, which can infer regions of remarkable genetic differences across the samples/species. A window was identified as an outlier if its Cook's distance is greater than 30 times of the mean Cook's distance. The R script was deposited in a GitHub repository (https://github.com/Jing-Li-SCU/macaques_genomic_SNV). Genes with nsSNVs in outlier windows were identified for further analysis.

Functional Enrichment Analyses of SNVs
Enrichment analyses were performed using standalone KOBAS 3.0 [46]. The genes containing nsSNVs in outlier windows, specific nsSNVs, and scrSNVs were respectively subjected to the enrichment tests based on the pathway (KEGG pathway and PANTHER) and GO databases. We set all rhesus genes as the statistical background in the program. Data were statistically analyzed using the hypergeometric test and Benjamini-Hochberg FDR (false discovery rate) correction in KOBAS 3.0. Significant results received p values < 0.05. Enriched GO categories were further filtered to Biological Process in GO. Categories with less than three associated genes were discarded.
We also identified enriched disease terms from genes with putatively deleterious nsSNVs predicted by PolyPhen2 and SIFT4G (as described above) using KOBAS 3.0. Terms were considered statistically significant when p values < 0.05, too. Furthermore, Human Phenotype Ontology (HPO) enrichment was assessed by g:Profiler [94] for these genes. The number of genes for functional category was set as 5-500, and significance threshold was 0.05 for p value based on the Benjamini-Hochberg FDR. The input genes were compared to a background of all human genes. These analyses allow associating the SNVs of macaques with human diseases and other phenotypic traits, beneficial to understanding the characteristics of macaques and their possibly biomedical applications.

Identification of Positively Selected Genes Based on SNVs
To scan for positively selected genes (PSGs) among the different macaques, we examined 1:1:1 orthologous protein-coding sequences of rhesus macaque (Mmul_8), olive baboon (PapAnu2.0) and human (GRCh37) with OrthoMCL [95]. Large, divergent protein families and proteins with coiled-coil domains were omitted from the analysis. Due to lack of whole genome assemblies, we generated the orthologous coding sequences of other seven macaques by replacing the reference nucleotides with coding SNVs (Mmul_8 version), including M. mulatta lasiota, M. fascicularis from Vietnam and Malaysia, M. arctoides, M. thibetana, M. assamensis, and M. nemestrina. This exploited the fact that gene structures are well-conserved among such closely related species. Then phylogenetic tree, used as the working topology later, of the ten species/populations was reconstructed with RAxML [96] based on these orthologous sequences.
Briefly, CodeML program of PAML (v4.6; [97]) was employed to identify PSGs, with PRANK [98] as the aligner. We tested for selection on each of the eight individual branches within the Macaca clade, by comparing branch-site model A (settings: model 1 4 2, NS sites 1 4 2) with null model (settings: fix_omega = 1, omega = 1). Likelihood Ratio Tests (LRTs) were computed and adjusted for multiple testing with a FDR threshold of 0.05. If significant, codon sites with an Empirical Bayes probability 95% and up were considered to be under positive selection. Finally, the function and tissue expressions of these PSGs were manually queried via RhesusBase [99] to dissect the functional differences among these monkeys.

Conclusions
This study generated a large dataset of heterospecific SNVs for Macaca, providing a better understanding of the genetic divergence among macaques. Analyses based on specific nsSNVs demonstrated that these macaques genetically differed in genes belonging to metabolic pathways that include the protein and fat digestion, which might contribute to their variation in body sizes across species.
Analyses of nsSNVs that are putatively deleterious indicated that some macaque species may be valuable for study of numerous diseases. For example, the Chinese rhesus macaque is an ideal candidate for study of type I diabetes, while other macaques may be suitable for type II diabetes research. The Malaysian cynomolgus macaque displayed a distinct pattern in genes associated with homologous recombination and innate immune function, likely indicating its special reaction to cancer therapy and other medical treatments. The stump-tailed macaque may be a putative spontaneous disease model of PCD, while the southern pig-tailed macaque may be vulnerable to metabolic disorders. Different macaques probably have distinct drug-metabolic features and promising application potentials to drug metabolism studies due to tolerating specific nsSNPs and scrSNVs in some drug metabolism key genes. We re-emphasize the importance of genetic variations in model organism's reactions to biomedical research, and advise that researchers consider the variations when choosing proper macaque models, rather than using different macaques indiscriminately.
Identification of PSGs revealed that genes in pathways associated with immune, neuro system, growth, development, and fat metabolism may have undergone positive selections in Macaca. Along with the nsSNV analyses, this result suggests that metabolism and body size played important roles in evolutionary adaptation of this genus.
Although further research is required to thoroughly address some of the issues presented in this study, this survey has not only contributed to the basic understanding of macaque, but has also paved way for their future studies, such as functional genomics studies as well as determination of optimal NHP models in biomedical research.

Conflicts of Interest:
The authors declare no conflict of interest.