Population Scale Analysis of Centromeric Satellite DNA Reveals Highly Dynamic Evolutionary Patterns and Genomic Organization in Long-Tailed and Rhesus Macaques

Centromeric satellite DNA (cen-satDNA) consists of highly divergent repeat monomers, each approximately 171 base pairs in length. Here, we investigated the genetic diversity in the centromeric region of two primate species: long-tailed (Macaca fascicularis) and rhesus (Macaca mulatta) macaques. Fluorescence in situ hybridization and bioinformatic analysis showed the chromosome-specific organization and dynamic nature of cen-satDNAsequences, and their substantial diversity, with distinct subfamilies across macaque populations, suggesting increased turnovers. Comparative genomics identified high level polymorphisms spanning a 120 bp deletion region and a remarkable interspecific variability in cen-satDNA size and structure. Population structure analysis detected admixture patterns within populations, indicating their high divergence and rapid evolution. However, differences in cen-satDNA profiles appear to not be involved in hybrid incompatibility between the two species. Our study provides a genomic landscape of centromeric repeats in wild macaques and opens new avenues for exploring their impact on the adaptive evolution and speciation of primates.


Introduction
Centromeres are essential chromatin domains for chromosome segregation during cell division and for the maintenance of genome stability across eukaryotes [1,2]. Centromeric DNA is mainly composed of tandem arrays of repetitive DNA sequences with multiple repeat units known as satellite DNA (satDNA) [3,4]. Owing to this satellite-rich architecture, centromeres tend to have high rates of structural mutations that occur through replication

Fosmid DNA Library and Isolation of Satellite DNA Sequences
A genomic library was constructed using a pCC1FOS fosmid library construction kit (Epicentre Biotechnologies, Madison, WI, USA), following the manufacturer's protocol. Genomic DNA fragments of 35-45 kb were mechanically sheared and inserted into the 8.1 kb fosmid pCC1FOS vector. A total of 768 colonies from the long-tailed macaque genomic library were cultured in LB broth liquid medium (TM Media, Titan Biotech Ltd., Rajasthan, India) with 100 mg/mL Chloramphenicol solution (Sigma-Aldrich Co., St.

C-Banding
C-banding was performed to examine the chromosomal distribution of constitutive heterochromatin using the standard barium hydroxide/saline/Giemsa method [66] with slight modifications. Briefly, chromosome slides were treated with 0.2 N HCl (Merck KGaA, Darmstadt, Germany) at 25 • C for 60 min and then with 5% Ba(OH) 2 (Merck KGaA) at 50 • C for 1.30 s, followed by treatment with 2× SSC (Sigma-Aldrich Co.) at 65 • C for 60 min.

Centromeric Satellite DNA Sequencing in Macaques
Fosmid clones with intense signals were mapped on the centromeric region, and the nucleotide sequences of inserted DNA fragments at end-terminals were determined by the DNA sequencing service of First Base Laboratories Sdn Bhd (Seri Kembangan, Selangor, Malaysia) using universal M13 primers (M13F-pUC (−40): 5 -GTTTTCCCAGTCACGAC-3 and M13R (−20): 5 -GCGGATAACAATTTCACACAGG-3 ). The BLASTn program (http://blast.ncbi.nlm.nih.gov/Blast.cgi, accessed on 1 December 2021) was used to search nucleotide sequences in the National Center for Biotechnology Information (NCBI) database to confirm the identity of the amplified DNA fragments. Individual monomers were identified within multimers to design a specific primer pair to amplify cen-satDNA sequences in long-tailed and rhesus macaques. Nucleotide sequences were searched for homologies using the BLASTn program (http://blast.ncbi.nlm.nih.gov/Blast.cgi, accessed on 1 December 2021) and for patterns of repetitive sequences using RepBase (https: //www.girinst.org/repbase/, accessed on 1 December 2021) [69]. Sequence alignment between a candidate clone-derived cen-satDNA from long-tailed macaque and a databasederived cen-satDNA repeat of rhesus macaque (accession number X04006) was performed to design specific primers. Amplification of cen-satDNA was carried out using the designed primers containing specific (Illumina) sequencing adaptors on their 5 ends; Censat F: 5 -TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCTGTGAAATCATCTCACAGAG-3 ; Censat R: 5 -GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCTTTCTCTTTGTTATC GGAGG-3 . Each 15 µL PCR reaction contained 1× ThermoPol ® buffer, 1.5 mM MgCl 2 , 0.2 mM dNTPs, 0.2 µM of each primer, 0.5 U Taq polymerase (Apsalagen Co., Ltd., Bangkok, Thailand), and 25 ng genomic DNA. PCR conditions were as follows: initial denaturation at 94 • C for 4 min, followed by 35 cycles of 94 • C for 30 s, 63 • C for 45 s, 72 • C for 30 s, and a final extension at 72 • C for 10 min. PCR products were reamplified using the Nextera dual-index primer set (Supplementary Materials Table S1). Each primer contained a unique 6 bp index sequence. Each sample was amplified using a different combination index to allow multiplexing of up to 384 samples per run. The conditions of the second PCR reaction were as follows: initial denaturation at 94 • C for 4 min, followed by 35 cycles of 94 • C for 30 s, 55 • C for 45 s, 72 • C for 30 s, and a final extension at 72 • C for 10 min. PCR products were purified using the FavorPrep GEL/PCR purification mini kit (Favorgen Biotech Corp., Ping-Tung, Taiwan) and subsequently pooled at equal amounts based on the concentration measured by Quant-iT™ dsDNA HS assay kits (Invitrogen, Waltham, MA, USA). The pool of DNA libraries (size 340-350 bp) from 377 samples was paired-end sequenced (2 × 250 bp) on an Illumina MiSeq platform (Illumina, Inc., San Diego, CA, USA). The obtained nucleotide sequences were inputted in the BLASTn program (http://blast.ncbi.nlm.nih.gov/Blast.cgi, accessed on 1 December 2021) to search for nucleotide sequences in the National Center for Biotechnology Information database to confirm the identity/similarity of the amplified DNA fragments. The sequences generated were deposited in the Dryad repository (https://doi.org/10.5061/dryad.w3r2280sw, accessed on 19 May 2022).

Identification of Putative Satellite DNA in Long-Tailed and Rhesus Macaque Populations
To identify the occurrence of satDNA sequences in long-tailed and rhesus macaque populations, all Illumina reads were applied to a tandem repeat analyzer (TAREAN) [70], and sequence formats were generated for further analyses. All sequence data (Illumina reads) from the 377 samples were merged into a single FASTA file, which was used as input in the RepeatExplorer2 pipeline using default parameters that detected sequence similarities and variability by examination of reads, clusters, and graph topology [71]. RepeatExplorer is a computational pipeline for characterization of repetitive sequences and identification of centromeric repeats from unassembled next generation sequencing (NGS) data. RepeatExplorer also includes TAREAN tool for detecting putative satDNA sequences based on circular structures in their cluster graphs. Putative satDNA sequences were subsequenttly detected by the presence of circular structures in their cluster graphs. Consensus sequences of satDNA sequences were constructed from the most prevalent k-mers obtained by decomposing read sequences from the respective groups. The most frequent variants of sequences were reconstructed from the count of k-mers in the set of oriented reads [70]. Variants of satDNA monomers were selected based on multiple criteria: (i) by converting k-mer sequences constituting De Bruijin graphs, (ii) alignment of sequences with the same length, (iii) calculating consensus, and (iv) position probability matrix from k-mer weights. The output groups of the satDNA sequences were annotated using RepeatMasker to further confirm the occurrence of satDNA sequences [72]. All Illumina reads were assembled using Geneious Prime (version 2022.0.2) (http://www.geneious.com, accessed on 1 December 2021) [73] and individual contigs were generated using a custom sensitivity method with default parameters [73]. Sequence features including GC content, sequence length, pairwise identity, and consensus length were determined for each assembled contig using Geneious Prime. To elucidate the relationship of the 2 datasets derived from k-mer and assembled contig approaches, multiple sequence alignment of putative satDNA sequences was performed using multiple sequence comparison by log-expectation (MUSCLE version 3.8.31) (http://www.ebi.ac.uk/Tools/msa/muscle/, accessed on 1 December 2021) [74] with default parameters. Sequences were clustered using maximum composite likelihood analysis with neighbor-joining phylogeny and a total of 1000 bootstrap replications following the default parameters of MEGA11: Molecular Evolutionary Genetics Analysis (version 11.0.10) [75]. Evolutionary distances were computed using the number of base substitutions per site, and all ambiguous positions were removed for each sequence pair (pairwise deletion option). The phylogenetic trees were then annotated and edited using Interactive Tree of Life (iTOL) version 4 [76]. Tajima's D neutrality test (D) [77] was performed using MEGA11 with a substitution model type of "nucleotide" to examine whether cen-satDNA sequences had evolved randomly ("neutrally").

Genomic Organization and Comparative Genomics of Satellite DNA Sequences
BLASTn of satDNA sequences was performed to screen their corresponding genomic regions across the 2 macaque genomes. The chromosomal level genome assemblies of longtailed and rhesus macaques (NCBI accessions: GCA_012559485.3 and GCA_003339765.3, respectively) were retrieved from the NCBI assembly database and unplaced scaffolds were eliminated to restrict the analysis for chromosomes. Random hits with alignment size less than 200 bp and percent identity less than 80% were also filtered out. The repeat density of mapped cen-satDNA was calculated and the number of intervals (chromosomal blocks/loci of a specific size) for the region spanning 1 megabase pair (Mbp) was counted using BED-Tools version 2.29.2 [78]. Chromosomes with the highest repeat density in the respective regions were identified and selected to investigate the complex structures of satDNA arrangements in these regions. The StainedGlass pipeline workflow [79] was applied and implemented in Snakemake [80,81]. All possible pairwise alignments between satDNA sequences and respective chromosomal regions were computed using Minimap2 [82], with outputs being generated as heatmap plots. All satDNA sequences were also compared with the human genome (GCA_000001405.28) using the BLAT tool [83], with default parameters in Ensembl to identify the patterns of sequence distribution on chromosomes, resulting in the generation of an ideogram. The default parameters in Ensembl were set as "alignment BLASTn, Search sensitivity normal, Maximum number of hits to report 100, and Match/Mismatch scores 1,-3". The human genome was the only available genome with complete karyotypic information necessary for ideogram construction.

Tests of Genetic Diversity within and between Macaque Populations
To perform genetic diversity analyses within/between the 18 macaque populations, BLAST of the consensus sequences of satDNA sequences was performed against the "nr" database in NCBI and the best hit selected as the reference sequence was retrieved. Sequence reads of all macaques were aligned to the reference sequence using the Burrows-Wheeler Aligner (version 0.7.10-r789) [94] and the Sequence Alignment/Map (SAM) format files in Binary Alignment Map (BAM) and were converted using SAMtools (version 0.1.19) [95], followed by sorting and importing to the STACKS software pipeline (version 2.60) [96] for variant calling used to build variant loci and generate SNP loci. The SNP loci from BAM files were subsequently transformed to variant calling format (VCF) files, which were then used for downstream analyses of genetic diversity within and between macaque populations. Different file formats for data analyses were generated using PGDSpider (version 2.1.1.5) [97] and dartR package (version 1.9.9.1) [98]. The genetic diversity of macaque populations was evaluated using different parameters, including the number of different alleles (N a ), number of effective alleles (N e ), Shannon's information index (I), expected heterozygosity (H e ), observed heterozygosity (H o ), unbiased expected heterozygosity (uHe), and fixation index (F), which were calculated using GenAlEx (version 6.5) [99]. To investigate similarities between individuals and the state of populations, pairwise F ST genetic distances between the 18 populations were calculated using GenAlEx (version 6.5) [99]. A value of F ST = 0.25 indicated large differentiation between populations, a value in the 0.15-0.25 range indicated high differentiation, and a value in the 0.05-0.15 range indicated moderate differentiation, whereas it was negligible if F ST was 0.05 or less [100]. To facilitate understanding of the genetic differences between the 18 populations, hierarchical analyses of molecular variance (AMOVA) [101] were performed using GenAlEx (version 6.5) [99]. This provided hierarchical partitioning of the total genetic variation within individuals, between individuals, between populations, and the estimation of F-statistics, which were tested using random permutation. To test the homogeneity of the population examined, principal coordinates analysis (PCoA) was conducted in GenAlEx (version 6.5) and data were visualized using "ggplot2" [85] in R [86]. Bayesian clustering analysis was also performed to infer the number of genetic clusters for all samples. We ran STRUCTURE from K = 1 to K = 10 with 10 iterations per K, a 10,000 generation burn in, and 50,000 Markov Chain Monte Carlo steps per iteration. STRUCTURE runs were implemented without previous information about group membership. An estimate of the optimal number of genetic clusters (K) was obtained using the method by Evanno et al. (2005) [102] implemented in STRUCTURE HARVESTER (version 0.6.94) [103]; meanwhile, results were compared across various K values regardless of the optimal K to assess the substructuring present within the data. Demographic history was determined using the statistical test of neutrality [77,104]. Tajima's D [77] and Fu and Li's D* and F* tests [104] were calculated using PopGenome (version 2.7.5) in R package [105], while the jackknife resampling function, which is part of this package, was applied for empirical p-value determination [105,106]. The genetic structure of each dataset was examined in relation to its geographic location using a Mantel test [107], implemented in Alleles In Space (AIS) [108], with 1000 permutations to establish the significance of the correlation coefficient (null hypothesis: genetic clustering as a result of isolation-by-distance, p < 0.01). To visualize spatial patterns of genetic diversity, we conducted a landscape shape interpolation (LSI) analysis, wherein geographic coordinates (x-and y-axes) can be related to genetic distances (surface plot heights, z-axis) in a threedimensional surface plot [108]. Peaks in the surface plot represented areas of high genetic distance between individuals and were considered areas of restricted genetic exchange. Troughs in the surface plot revealed areas of low genetic distance. We performed the LSI analysis in the AIS software, using a distance weighting parameter (α) of 1 and grid settings of 80 × 80 grid were specified. All analyses implemented in AIS used sequences as the input matrix (raw genetic distances) and Universal Transverse Mercator coordinates ArcGis were also used to construct maps for each population dataset and to incorporate information on cluster membership [109].

Isolation of Highly Repetitive DNA Sequences and Their Nucleotide Sequences
We then performed genomic hybridization using 768 fosmid clones and identified prominent clones potentially containing highly repetitive sequences. To investigate th chromosomal localization, we mapped all prominent clones to long-tailed macaq chromosomes. We observed more than 20 metaphase spreads, with hybridizati efficiencies ranging from 70% to 90%. We detected that all the prominent clones we localized to the centromeric regions of all the chromosomes, except for the Y chromosom ( Figure 2c). We then characterized the 500-800 bp nucleotide sequences from the termin ends of 2 out of 14 randomly prominent clones using the M13 universal primers. D matrix analysis showed that these two clone sequences contained tandem-array repetitive sequences (Figure 3a). Within the clone insert, we detected five units w lengths ranging from 170 to 174 bp, and GC contents ranging from 36.8% to 41.5%, w an average of 39.5%, which were characterized by a secondary structure (Supplementa Figure S1). We therefore designated these satDNA as cen-satDNA sequences. W identified the following conserved motifs in all the units in the cen-satDNA sequenc CTCACAGAGTTAC termed "core motif1, CM1" and CTTTCTGAGAAACT termed "co motif2, CM2" (Table 1 and Figure 3b). We searched the NCBI database for sequen homology to any of the families of repetitive sequences and found that our detected ce satDNA sequences showed sequence similarity to alpha satDNA in human, rhes

Isolation of Highly Repetitive DNA Sequences and Their Nucleotide Sequences
We then performed genomic hybridization using 768 fosmid clones and identified 14 prominent clones potentially containing highly repetitive sequences. To investigate their chromosomal localization, we mapped all prominent clones to long-tailed macaque chromosomes. We observed more than 20 metaphase spreads, with hybridization efficiencies ranging from 70% to 90%. We detected that all the prominent clones were localized to the centromeric regions of all the chromosomes, except for the Y chromosome ( Figure 2c). We then characterized the 500-800 bp nucleotide sequences from the terminal ends of 2 out of 14 randomly prominent clones using the M13 universal primers. Dot matrix analysis showed that these two clone sequences contained tandem-arrayed repetitive sequences ( Figure 3a). Within the clone insert, we detected five units with lengths ranging from 170 to 174 bp, and GC contents ranging from 36.8% to 41.5%, with an average of 39.5%, which were characterized by a secondary structure (Supplementary Figure S1). We therefore designated these satDNA as cen-satDNA sequences. We identified the following conserved motifs in all the units in the cen-satDNA sequences: CTCACAGAGTTAC termed "core motif1, CM1" and CTTTCTGAGAAACT termed "core motif2, CM2" (Table 1 and Figure 3b). We searched the NCBI database for sequence homology to any of the families of repetitive sequences and found that our detected cen-satDNA sequences showed sequence similarity to alpha satDNA in human, rhesus macaque, bonnet macaque (Macaca radiata) (É. Geoffroy, 1812) [110,111], and olive baboon (Lesson, 1827) [37]. We further observed that these alpha-satDNA sequences showed high sequence similarity with the nucleotide sequences of conserved centromeric motifs in the human genome, such as the pJα motif (100% identity), centromere-binding protein A (CENP-A) box (64.29% identity) located in the CM2, and CENP-B box (37.50-87.50% identity) [112] (Figure 2). We also performed a Repbase [69] search using the cen-satDNA sequences and identified a partial sequence with 77.39% identity to the ALRa-SAT satellitee from primates (122 bp, GC content: 41%) that was also found in human, rhesus macaque, Sumatran orangutan, chimpanzee, and northern white-cheeked gibbon (Nomascus leucogenys, Ogilby, 1840) [113].

Striking Sequence Variability of Cen-satDNA in Macaque Populations
We grouped all the available reads of each population (total 1,764,463 sequences from 377 individuals) using RepeatExplorer and TAREAN. We then assessed the mean read coverage for all 18 populations and recorded deep coverage sufficient to obtain an accurate contig assembly and population-scale sequencing for all individuals across the populations, with a mean coverage exceeding 100× for most individuals (Figure 4a). Using TAREAN, we detected five different k-mer with the most frequent being 27-mer repeats, including 265,060 monomers obtained by decomposing the read sequences from the corresponding groups. We observed 14 distinct cen-satDNA subfamilies with variable monomers (Supplementary Figure S2 and Supplementary Table S3). We noticed that the cent-Msat9 and cent-Msat3 subfamilies showed the highest and lowest abundance (2003 and 501 frequency) across the assembled contigs. We also detected cen-satDNA with a 96.02% level of abundance using RepeatMasker, further validating the prediction of the cen-satDNA subfamilies detected by TAREAN [114]. To confirm the occurrence of sequence variability derived from the NGS reads with highly complex satDNA, we analyzed the same datasets obtained from sequencing using contig assembly. We thus obtained new sequences of monomer units that were highly similar to the alpha-satDNA

Striking Sequence Variability of Cen-satDNA in Macaque Populations
We grouped all the available reads of each population (total 1,764,463 sequences from 377 individuals) using RepeatExplorer and TAREAN. We then assessed the mean read coverage for all 18 populations and recorded deep coverage sufficient to obtain an accurate contig assembly and population-scale sequencing for all individuals across the populations, with a mean coverage exceeding 100× for most individuals (Figure 4a). Using TAREAN, we detected five different k-mer with the most frequent being 27-mer repeats, including 265,060 monomers obtained by decomposing the read sequences from the corresponding groups. We observed 14 distinct cen-satDNA subfamilies with variable monomers (Supplementary Figure S2 and Supplementary Table S3). We noticed that the cent-Msat9 and cent-Msat3 subfamilies showed the highest and lowest abundance (2003 and 501 frequency) across the assembled contigs. We also detected cen-satDNA with a 96.02% level of abundance using RepeatMasker, further validating the prediction of the cen-satDNA subfamilies detected by TAREAN [114]. To confirm the occurrence of sequence variability derived from the NGS reads with highly complex satDNA, we analyzed the same datasets obtained from sequencing using contig assembly. We thus obtained new sequences of monomer units that were highly similar to the alpha-satDNA sequences isolated from the genomic library (more than 88%). We observed 1546 contigs of the cen-satDNA sequences, with remarkable variable lengths ranging from 293 to 718 bp (average size 318 bp) across all individuals (Supplementary Dataset S1). sequences isolated from the genomic library (more than 88%). We observed 1546 contigs of the cen-satDNA sequences, with remarkable variable lengths ranging from 293 to 718 bp (average size 318 bp) across all individuals (Supplementary Dataset S1).   We specifically found that the rhesus macaque population of Wat Tham Pa Mak Ho, Loei (WTPMH), showed the highest variability of cen-satDNA sequences, whereas the longtailed macaque population of Sai Yok, Kanchanaburi (SY), had the lowest range of variation in cen-satDNA sequence size (Figure 4b). Remarkably, we noticed that the GC content of these cen-satDNA sequences was highly variable, ranging from 37% to 63% across all macaque populations (Figure 4c). Tajima's neutrality test recorded a D value of −1.59391, measured across all populations, which was not statistically significant, indicating a possible recent selective sweep of cen-satDNA in macaques (Table 2). We subsequently reconstructed an unrooted phylogenetic tree to infer the evolutionary relationships and identify the putative clusters using all the cen-satDNA sequences. We specifically defined these clusters according to a set of particular nucleotide substitutions. We grouped all the sequences together into four major clusters: A, B, C, and D. (Figure 4d). Finally, we examined these 14 subfamilies for their relationship with the four major clusters. We detected that 10 out of 14 subfamilies corresponded to cluster A, whereas cluster B and cluster C contained a single subfamily, cen-Msat3 and cen-Msat6, respectively (Figure 4d). We did not observe any cen-satDNA subfamilies derived from the k-mer analysis in cluster D.

Genomic Organization of Cen-satDNA Sequences
We also examined the organization of cen-satDNA sequences in the centromeric regions of long-tailed and rhesus macaque chromosomes using comparative genomic analysis. Using in silico mapping, the cen-satDNA sequences with all subfamilies and clusters were mapped onto all chromosomes, except for the Y chromosome (with no alignments). We found that the total number of "filtered nonrandom alignments" (percent identity > 80 and alignment length > 200 bp) was higher in rhesus macaques (total 74,478) compared with that in long-tailed macaques (total 10,272) (Figure 5a,b). Moreover, we observed a repetitive density and enriched regions of cen-satDNA with the highest abundance on chromosomes 3 and 8 of long-tailed macaques, with 1726 and 8724 total repeats, respectively, and on chromosomes 18 and 19 of rhesus macaque, with 9575 and 8724 total repeats, respectively (Supplementary Table S3). We then visualized the complex tandem arrangements of cen-satDNA sequences in the enriched centromeric regions of these four chromosomes using a heatmap (Figure 5c,d). Accordingly, we identified different patterns of arrangement of the cen-satDNA sequences in each centromeric region of different chromosomes in both species, indicating their high structural complexity on the chromosomes (Figures 5c,d, S3 and S4).
We compared the cen-satDNA consensus sequences from the assembled contigs with orthologous sequences from 17 primate species using multiple sequence alignment to provide insights into the polymorphically and evolutionarily diverged sites. We observed a species-specific 120 bp deletion downstream of CM1 (Figure 6b). In particular, we detected that the vervet monkeys (Chlorocebus sabaeus, Linnaeus, 1766) [88], geladas, olive ba-boons, rhesus macaques, long-tailed macaques, and southern pig-tailed macaques showed long cen-satDNA sequences, whereas the western gorilla or Sumatran orangutan contained shorter cen-satDNA sequences (Supplementary Figure S5). By contrast, we did not observe any similarity between cen-satDNA sequences in certain primate genomes, including that of the common marmoset (Callithrix jacchus, Linnaeus, 1758) [115], bamboo lemur (Prolemur simus Gray, 1871) [116], and tarsier (Carlito syrichta, Linnaeus, 1758) [115] (Supplementary Dataset S2). For comparison with the human genome, we mapped these cen-satDNA sequences on autosomes and the X chromosome but not on the Y chromosome (Figure 6c).
We also examined the organization of cen-satDNA sequences in the centromeric regions of long-tailed and rhesus macaque chromosomes using comparative genomic analysis. Using in silico mapping, the cen-satDNA sequences with all subfamilies and clusters were mapped onto all chromosomes, except for the Y chromosome (with no alignments). We found that the total number of "filtered nonrandom alignments" (percent identity > 80 and alignment length > 200 bp) was higher in rhesus macaques (total 74,478) compared with that in long-tailed macaques (total 10,272) (Figure 5a,b). Moreover, we observed a repetitive density and enriched regions of cen-satDNA with the highest abundance on chromosomes 3 and 8 of long-tailed macaques, with 1726 and 8724 total repeats, respectively, and on chromosomes 18 and 19 of rhesus macaque, with 9575 and 8724 total repeats, respectively (Supplementary Table S3). We then visualized the complex tandem arrangements of cen-satDNA sequences in the enriched centromeric regions of these four chromosomes using a heatmap (Figure 5c,d). Accordingly, we identified different patterns of arrangement of the cen-satDNA sequences in each centromeric region of different chromosomes in both species, indicating their high structural complexity on the chromosomes (Figures 5c,d, S3 and S4). contained shorter cen-satDNA sequences (Supplementary Figure S5). By contrast, we did not observe any similarity between cen-satDNA sequences in certain primate genomes, including that of the common marmoset (Callithrix jacchus, Linnaeus, 1758) [115], bamboo lemur (Prolemur simus Gray, 1871) [116], and tarsier (Carlito syrichta, Linnaeus, 1758) [115] (Supplementary Dataset S2). For comparison with the human genome, we mapped these cen-satDNA sequences on autosomes and the X chromosome but not on the Y chromosome (Figure 6c).

Genetic Diversity within and between Macaque Populations
We recorded an alignment rate of 99.2% for the genomic sequences of 377 macaques using reference satDNA sequences and built 29 loci with a mean size of the genotyped sites of 301.34 bp and 520.6X as read coverage per locus. The dataset, with stacks for all populations, contained 1505 variants. We also inferred a phylogenetic tree of macaques from the single nucleotide polymorphisms (SNPs) of the cen-satDNA (Supplementary Figure S6). Consequently, we detected that the population exhibited F values of −0.510, while the I ranged from 0.110 to 0.359. The Ho values ranged from 0.133 to 0.382 (mean ±

Genetic Diversity within and between Macaque Populations
We recorded an alignment rate of 99.2% for the genomic sequences of 377 macaques using reference satDNA sequences and built 29 loci with a mean size of the genotyped sites of 301.34 bp and 520.6X as read coverage per locus. The dataset, with stacks for all populations, contained 1505 variants. We also inferred a phylogenetic tree of macaques from the single nucleotide polymorphisms (SNPs) of the cen-satDNA (Supplementary Figure S6). Consequently, we detected that the population exhibited F values of −0.510, while the I ranged from 0. 110 Table 3. The pairwise F ST values, which represent population structure based on the differences between the allele frequencies of the two populations, were represented as a heatmap plot (Figure 7). In case of F ST > 0.25, a state of different populations was indicated. Our AMOVA analysis demonstrated 67% molecular variation (p < 0.001) within a population and 4% difference (p < 0.001) between populations (Table 4). In addition, PCoA revealed that the first, second, and third principal components accounted for 27.24%, 10.04%, and 7.69% of the total variation, respectively, and provided support for four tentatively differentiated groups (A, B, C, and D) as mentioned above (Figure 8). A Bayesian structural analysis revealed the highest posterior probability based on Evanno's ∆K, with all macaque populations showing the genetic admixture of cen-satDNA sequences with all four major clusters (Figure 9). We used three neutrality tests (Tajima's D values, Fu and Li's F* values, and Fu and Li's D* values) to observe the historical population expansion based on the cen-satDNA sequences. However, we noticed that not all values were statistically significant ( Table 2). The Mantel test revealed no correlation between nucleotide diversities and geographic distance from isolationby-distance (Figure 10a). Of note, our LSI analyses revealed high nucleotide diversities. In particular, the Khao Nor (KN) population belonging to M. fascicularis and the Wat Tham Thep Bandan (WTT) population belonging to M. fascicularis showed high nucleotide diversities (Figure 10b).

Discussion
Despite the importance of the centromere region, knowledge of the centromeric sequences remains limited, mainly because of the highly tandem repetitive and complex nature of these sequences. This obscures the process of genome sequencing and assembly.
Although the potential impact of the variations in the centromeric sequences and their associated satellites on the functions of centromeres and heterochromatin has been recognized, their study remains difficult [118]. However, the recent integration of bioinformatics and cytogenetics as chromosomics has greatly assisted the assessment of repeat variations [119,120]. These combined approaches offer deep insights into the organization of cen-satDNA within primate genomes, thus facilitating our understanding of their possible roles in adaptive evolution at the population level and highlighting the extraordinary diversification of tandem repeats.

Turnover of Cen-satDNA Sequences with Multiple Subfamilies in Long-Tailed and Rhesus Macaque Populations
In this study, we found that the cen-satDNA sequences isolated from long-tailed and rhesus macaque centromeres were homologous to the alpha-satDNA family in primates. More specifically, they showed a tendency toward AT-rich monomers, characterized by secondary structures, including helices and stem-loops [39,121]. The limitations derived from the PCR-based approach drastically reduced the number of available sequences [122]. Genomic and phylogenetic characterizations provided information on the structural organization of each subfamily in each cluster. The co-occurrence of clusters and subfamilies in macaque populations and their interspecific divergence suggested that cen-satDNA followed the library model of satDNA evolution, in which different satDNA subfamilies coexist in the same genome due to amplification-contraction events of these sequence pools. However, this was not the case in the cen-satDNA observed in this study, where 14 satDNA subfamilies were retained within the both long-tailed and rhesus macaques. This rejected our hypothesis (i) as mentioned above, and the organization and diversity of cen-satDNA shared between the two macaque species indicated nonindependent evolution. The cen-satDNA remained in the common ancestor of the two species until at least 5.5 MYA [123]. The occurrence of large numbers of subfamilies in both long-tailed and rhesus macaques has probably been caused by the frequent turnover of the tandem repeat sequences [40]. This redundancy and high rate of change suggest that long-tailed and rhesus macaques contain substitutions, even in functional domains, without deleterious effects. The pattern of differentiation was also variable among repeats, revealing that some k-mers evolved relatively independently of others. Different cen-satDNA subfamilies underwent concerted evolution in abundance and might have retained essential protein-binding sites, structural domains, and sites for epigenetic modifications [124]. Cluster A, with many cen-satDNA subfamilies, adopted a monomeric organization, further supporting the hypothesis that centromeres are composed of repeated arrays evolved from simple monomeric structures [125,126]. Clusters B and C associated into multimers and possible parts of higher order regions (HOR), suggesting the differential evolution of tandem repetitive arrays in the genomes of long-tailed and rhesus macaques. Clusters B and C's multimers are probably tandemly repeated or represent only a part of a longer HOR involving other monomers, as observed in other primates [28,29]. Higher-order structures might have a strict requirement for sequence length and the conservation of particular repeat regions [28,29,[127][128][129]. Cluster D contained 316 bp of cen-satDNA and could not be classified in any subfamily, suggesting that it is an intermediate structural unit. This might have resulted from an ancient dimer of monomers and the subsequent deletion, substitution, and homogenization between the two monomers, followed by homogenization between monomers of nonhomologous chromosomes and the fixation of specific complex structures. An alternative explanation is that a monomer in the cluster D became amplified together with parts of anonymous adjacent monomer variants, and these spread together as a new repeated unit, not necessarily a HOR. The added part might be a favorable consequence of a monomer sequence from a degenerated short segment. For instance, an increase in repeated unit length and complexity through the merging of shorter repeat motifs or degenerated monomers has been found in satellite III on human chromosome 14 [130], human beta satellite DNA [131], and house mouse (Mus musculus, Linnaeus, 1758) [115,132,133].
Comparison of all cen-satDNA sequences derived from 1,764,463 sequences from 18 macaque populations revealed the existence of two conserved core sequence motifs, CM1 and CM2, which are also observed in alpha-satellite arrays in other primates [134,135]. This suggests that the two CMs are likely to be essential for the maintenance of chromosomes in primates, and can be expressed as s = 1, where s is the selection coefficient used in population genetics [136,137]. Comparing these cen-satDNA sequences with those of other primates revealed that similar cen-satDNA sequences were present in Cercopithecoidea (Old World monkeys) and Hominoidea (apes and human) but not in Ceboidea (New World monkeys) [138]. However, comparative genomics identified a 120 bp deletion downstream of CM1 in apes. In addition, CM2 was shown to be homologous to the CENP-A box, which is found at active centromeres [136,137]. The CENP-A box has rapidly evolved in many animals and CENP-A is a histone H3-like protein that binds to this box [139]. This suggests that the CENP-A box imposes a complementary selection pressure on cen-satDNA sequences to ensure protein-DNA compatibility [140,141]. By contrast, the pJα binding site was found to be retained in the repeated motif, with more than 90% of the number of cen-satDNA sequences from all populations, suggesting that it is nonessential with s = 1, but highly positive for a higher chromosome stability of both long-tailed and rhesus macaques. Sequence regions similar to the pJα binding site have been found in some mammalian species as well as primates [142]. Similarly, the core sequences of the CENP-B box, an evolutionary conserved domain (ECD), were partially observed in most cen-satDNA sequences in this study, indicating the presence of CENP-B box-like boxes in long-tailed and rhesus macaques. Although not all ECD nucleotides are present in the CENP-B box-like motif found in the cen-satDNA sequence, we cannot exclude its functional activity. Such divergent motif sequences have also been observed in the cen-satDNAs of the African bush elephant (Loxodonta africana, Blumenbach, 1797) [143], the nine-banded armadillo (Dasypus novemcinctus, Linnaeus, 1758) [115,144], and the two-toed sloth of the Choloepus genus (Illiger, 1811) [145,146]. These findings have collectively suggested the positive value of s of the CENPB-like box. The preservation of both the conserved and variable domains across 18 different populations suggested that cen-satDNA was evolutionarily constrained in long-tailed and rhesus macaques [31,41,[147][148][149]. Differential rates of substitution in cen-satDNA could have resulted from the interaction of DNA-binding proteins with satellite DNA. Immunofluorescence colocalization using polyclonal anti-poly (ADP-ribose) polymerase (PARP) antibody, anti-5-methylcytosine (5meC) antibody, remodeling and spacing factor (RSF) complex (Rsf-1), or SNF2h antibody are required to confirm this functional possibility [150][151][152].

Nonrandom Cen-satDNA Sequences in Long-Tailed and Rhesus Macaque Chromosomes
Our chromosomal analysis showed that all cen-satDNA subfamilies were mapped to the centromeric region in all chromosomes, except for the Y chromosome, suggesting their common evolution and expansion in the genome. Similar cases of different cen-satDNA sequences between autosome/X and Y chromosomes were observed in alpha-satellite families in sun-tailed monkeys (Cercopithecus solatus, Harrison 1988) [59,153]. The absence of cen-satDNA on the Y chromosome in long-tailed and rhesus macaques might be explained by chromosomal rearrangement, such as unequal crossing over between the X and Y chromosomes, resulting in its reduction or almost absence in the Y. The lack of conservation and prevalence of satDNA sequences in the neocentromeres of the genome have raised questions whether any specific satellite sequences are required for function [5,13]. In addition, it has been reported that the deleterious effect of an increase in one satellite DNA group is mitigated by a decrease in a different group on the Y chromosome [154]. Meiotic drive has been proposed as the likely explanation for the salutatory divergence of satDNA sequences and the excess nonsynonymous divergence of several centromere proteins, some of which interact directly with the Y, together with the fitness benefits of maintaining an optimal genome size [21,138,139,155,156]. Alternatively, it has been proposed that the Y chromosome is probably excluded from recombination events with nonhomologous chromosomes [157]. Such changes might contribute to the significant interand intraspecific heterogeneity of the genome size observed in insects [158,159]. In great apes, the nonrandom distribution of various satDNA subfamilies of SatIII DNA has been observed on different chromosomes [160]. Several satDNA subfamilies are present on the human Y chromosome, more than on the human acrocentric chromosomes or the nonhuman primate Y chromosomes, suggesting that the reduction in the number of satDNAs in less ancient primates is due to the loss of the number of repeats in some subfamilies [161][162][163]. However, in other studies, these satDNA subfamilies were undetectable by FISH due to the low copy number or large sequence divergence on the centromeric region of Y, and the nature of highly repetitive sequences obstructing genome assembly [164].
Considering the distribution of cen-satDNA sequences on autosomes and X chromosomes, our analysis of the arrangement of cen-satDNA on chromosomes 3 and 8 of long-tailed macaque and chromosomes 18 and 19 of rhesus macaque revealed the presence of distinct satellite arrays forming structural complexes or possible HORs that might have resulted from centromere repositioning, which is known to recruit a large block of alpha-satDNA leading to rearrangement [165]. This was consistent with the presence of at least nine repositioned centromeres in the macaque lineage [166]. The chromosome-specific distribution of cen-satDNA might ensure faithful chromosomal transmission or prevent deleterious rearrangements, as lengthy satellite blocks might be more prone to unequal crossing over or ectopic recombination [154]. Similar patterns in gibbons and New World monkeys were also associated with the existence of chromosome-specific centromeric sequences [167].

Cen-satDNA in Long-Tailed and Rhesus Macaque Populations
The mean satellite abundance in a population is expected to approximate an equilibrium determined by the mutation rate, the degree of selective constraint, potentially positive selection, and population size. A high F ST value suggests a low rate of genetic exchange between populations. Most variations in cen-satDNA sequences were observed among individuals within populations, probably resulting from the extensive genetic differentiation between populations. In addition, both meiotic drive and segregation distortion also promote strong population-specific patterns [154]. Similar patterns of H o and H e among populations together with STRUCTURE and LSI analyses indicated no geographical bias in the amplification of cen-satDNA subfamilies in any population, suggesting that cen-satDNA divergence at the population level did not identify population-specific mutations or other population-specific features of satDNA sequences. These findings opposed hypothesis (ii) regarding there being no bias of cen-satDNA diversity in any population. The population structure inferred from overall satellite quantities did not recapitulate the expected population relationships. Moreover, the observed admixture patterns that are lacking in population differentiation suggested individual-specific genetic differentiation among satellite sequences.
The demographically isolated macaque populations in Thailand have resulted from the integrated effects of the expanding human population size and human activities [168,169]. Consequently, the inbreeding process in populations with multiple satDNA subfamilies might drive the homogenization of satellite arrays in different populations. Interestingly, both long-tailed and rhesus macaques can produce hybrids with similar genomic profiles. This has suggested that differences in cen-satDNA profiles might not have resulted from hybrid incompatibility between the two species, leading to a lack of chromosomal segregation bias in each species or their hybrids. In these scenarios, chromosomes with a particular k-mer abundance or a composition with better fitness have a segregation advantage in the population. If these cen-satDNA sequences have pleiotropic deleterious effects, the fixation of a suppressor can quickly purge these sequences from populations [154]. As such, negative selection, that is, purging repetitive DNA, might have occurred in these populations [170].

Conclusions
This study investigated the genetic variability and genomic organization of cen-satDNA in two biomedically important primate model species, the long-tailed and rhesus macaques, at the population scale. Several chromosomal approaches have provided unprecedented insights into the diversity and organization of cen-satDNA at the individual, populational, and species levels. The process of diversification was apparent at all levels, indicating the increased occurrence of satellite turnover leading to the concerted evolution of cen-satDNA in both macaque species. This study attempted to unveil the complex evolutionary mechanisms that result in the expansion and homogenization of centromeric repeats. The recent emergence of long-read sequencing technologies (Oxford nanopore and PacBio sequencing) and complete centromere assemblies will provide more insights toward understanding the true nature of primate centromere evolution. Future research should address the biological consequences of sequence variation and the diversity of cen-satDNA on the adaptive evolution of primates. Our study provided evidence for the significant variation of satellite DNA profiles at the population level; however, further research is required to explore the potential influence of satellite DNA dynamics underlying the evolutionary mechanisms and functions involved in the process of speciation.  Table S1: Sampled populations of macaques at location sites in Thailand; Table S2: k-mer-based characterization of subfamilies in macaque cen-satDNA; Table S3: Chromosomal localization and genomic organization of cen-satDNA with respective alignments in macaque genomes; Supplementary Dataset S1: Genomic features of cen-satDNA-assembled contigs in macaque populations; Supplementary Dataset S2: Comparative alignments of cen-satDNA across different primate species; Supplementary Dataset S3: Comparison of Ho and He between 18 populations.