Genomes contain a variety of sequence classes, many of which are repetitive in nature. The smallest of these are microsatellites, simple sequence repeats that have a 2–6-base-pair repeating motif. Microsatellites are highly polymorphic sequences that are most commonly found within non-coding portions of the genome; however, they can also be located in regulatory or coding regions [1
]. The location of microsatellites within the genome can have strong impacts on their stability. For instance, microsatellites occurring in regulatory or protein-coding regions tend to be highly conserved [3
]. Similarly, conserved noncoding microsatellites which occur in the 5′ flanking regions of some protein coding genes in plants are, as their name suggests, conserved among species, possibly for their role in gene regulation [5
]. In contrast, microsatellites found in regions without regulatory or coding functions (e.g., intergenic and some intronic regions) are likely to have little impact on organism fitness and, thus, their frequency and distribution should reflect the underlying mutation processes [7].
Microsatellites are useful to biologists as easily accessed genetic markers, and some microsatellites have fundamental impacts on organismal functioning. Because of their relative neutrality, microsatellites are useful in inferences of population demography, as well as genetic diversity [8
]. The large quantity and variability in microsatellites even within a species make them useful for forensics [10
], kinship analysis [11
], and medical profiling [12
]. Microsatellites also play an important role in chromatin organization [13
], DNA structure [14
], and centromere and telomere function [15
]. However, the most studied effect of microsatellites is on the regulation of gene activity, where microsatellites can impact transcription [16
], gene expression [17
], protein binding [18
], and translation [19
], which can in turn lead to disease [20].
The primary mechanism for expansion and contraction of microsatellites is slippage that occurs during DNA replication [21
]. However, differential abundance of repeats in exonic, intronic, and intergenic regions among taxa may suggest that strand slippage alone is insufficient to explain microsatellite distribution [23
]. While strand slippage can account for the expansion and contraction of microsatellite content, it cannot account for large shifts in the relative abundance of types of microsatellites in closely related species (e.g., a shift from AC repeats to TA repeats being the most common).
While previous studies focused largely on specific classes of microsatellites in one or a handful of species, few studies examined the dynamics of all microsatellite content across large clades [24
]. Adams et al. showed that ray-finned fish, squamate reptile, and mammalian genomes had higher microsatellite content than crocodilian, turtle, and avian genomes. Additionally, some lineages had unusually high rates of change in microsatellite content, providing support for multiple major shifts in the microsatellite genomic landscape. The goal of our study was to determine whether microsatellites evolve differently in different clades of insects and to evaluate the impact of chromosome number, genome size, and centromere type (i.e., holocentric and monocentric) on both the content and the rate of microsatellite evolution. Our analyses revealed that chromosome number has no impact on either the content or the rate of microsatellite evolution, and that centromere type has no impact on total microsatellite content. However, different insect orders have significantly different rates of microsatellite evolution, and the rate differs between species with monocentric and holocentric chromosomes. Additionally, we found that genome size correlates with both total content and rate of microsatellite evolution.
2. Materials and Methods
We downloaded all available insect genome assemblies from NCBI, Ensembl, and the Baylor HGSC (accession numbers and site addresses in Table S1 (Supplementary Materials), accessed August 2018). A total of 304 genomes were available spanning 18 of the 24 insect orders. Six orders were represented by single species, while Diptera and Hymenoptera were the most frequently sequenced orders with 116 and 71 species, respectively. In all cases, the most recent assembly with no masking was downloaded (accession numbers in Table S1 (Supplementary Materials)).
Repetitive sequences are one of the central challenges in genome assembly and, because of this, it is possible that poorly assembled genomes or genomes assembled with shorter read technology will lead to inaccurate inference of microsatellite content. We took two approaches to assess and control for this possibility. Firstly, we reanalyzed data from a survey of microsatellites across 71 vertebrates [26
]. In this analysis, we categorized each genome assembly by the sequencing platform (short, Sanger, and long) and tested whether genomes in these three classes had significantly different microsatellite content. If mixed data were available for a genome assembly (e.g., long-read sequencing with short-read polishing), this was classified as long read for our categories. Secondly, we also evaluated the correlation between scaffold and contig N50 and total microsatellite content. Based on results from this analysis (described below), we chose to include all insect genomes regardless of sequencing platform or N50 statistics.
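The N50 statistic used in this quality check can be illustrated with a minimal sketch. The function name `n50` and the example lengths are hypothetical, and the code follows the common definition (the length of the contig or scaffold at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size); it is not taken from the pipeline used in this study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running total to half of the
        # assembly size defines the N50.
        if running >= half_total:
            return length


# Hypothetical contig lengths (bp) for a toy assembly.
toy_contigs = [100, 80, 50, 30, 20, 20]
toy_n50 = n50(toy_contigs)
```

In an analysis such as the one described, a value like this would be computed per assembly and then compared with total microsatellite content.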
Preliminary inspection of our insect genomes suggested that some were highly incomplete (e.g., assembly size 2% of expected genome size). Because of this, we performed a second quality assessment comparing BUSCO (Benchmarking Universal Single-Copy Orthologs) scores and total microsatellite content in all insect genomes [27
]. We used default settings for BUSCO in conjunction with the insect gene set. Scores were calculated as the proportion of genes searched that were found as complete genes (in either single or duplicate copies). These scores were then compared to the total microsatellite content in each genome to determine if there was a cutoff below which microsatellites were poorly inferred. Based on this approach, we reduced the number of genomes examined to 231.
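As a small illustration of the score described above, the sketch below computes the proportion of searched genes recovered as complete (single-copy plus duplicated). The function name and the example counts are hypothetical; BUSCO reports these categories itself, so this only restates the arithmetic.

```python
def busco_completeness(single_copy, duplicated, fragmented, missing):
    """Proportion of searched BUSCO genes found as complete genes."""
    total = single_copy + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("no genes were searched")
    # Complete genes include both single-copy and duplicated hits.
    return (single_copy + duplicated) / total


# Hypothetical counts for one assembly.
score = busco_completeness(single_copy=80, duplicated=10, fragmented=5, missing=5)
```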
For downstream comparative analyses, we downloaded sequences for the four most frequently sequenced mitochondrial genes (12S, COI, COII, and cytochrome b) and four nuclear genes (18S, 28S, elongation factor 1, and arginine kinase). This yielded a dataset of 221 operational taxonomic units (OTUs) representing members of 12 of the 24 insect orders. All sequences were downloaded from GenBank (accession numbers in Table S2 (Supplementary Materials)
). The sequences were aligned in MAFFT v.7 using default settings [28
]. We used Gblocks 0.91b to remove ambiguously aligned sites from 12S, 18S, and 28S alignments, using options for less stringent selection, including allowing smaller final blocks, gap positions within blocks, and less strict flanking requirements [29
]. This resulted in alignments for 12S, 18S, and 28S of 346 bp, 1442 bp, and 253 bp in length, respectively. We used MEGA to manually adjust the alignments of protein coding genes (COI, COII, elongation factor 1, cytochrome b, and arginine kinase) to ensure that the reading frame was maintained [30
]. These alignments were 1463 bp, 683 bp, 1064 bp, 409 bp, and 1019 bp in length, respectively. For tree inferences in RAxML, alignments were concatenated (total length 6686 bp), while each gene was kept separate for Bayesian tree inference.
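The concatenation step can be sketched as follows. This is an illustrative example only: the function name, the input layout (a dictionary of per-gene alignments), and the choice to pad taxa that lack a gene with gap characters are assumptions, not details reported for this study.

```python
def concatenate_alignments(alignments, gene_order):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: dict mapping gene name -> dict of taxon -> aligned sequence.
    Returns (supermatrix, partitions), where partitions lists
    (gene, start, end) using 1-based inclusive coordinates.
    """
    taxa = sorted({taxon for gene in gene_order for taxon in alignments[gene]})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene in gene_order:
        seqs = alignments[gene]
        lengths = {len(seq) for seq in seqs.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences for {gene} are not aligned to one length")
        length = lengths.pop()
        for taxon in taxa:
            # Assumption: a taxon without this gene is padded with gaps.
            pieces[taxon].append(seqs.get(taxon, "-" * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Toy example with two genes and two taxa (hypothetical data).
toy = {"COI": {"A": "ATG", "B": "ATA"}, "18S": {"A": "GGCC"}}
matrix, parts = concatenate_alignments(toy, ["COI", "18S"])
```

Keeping the partition coordinates alongside the supermatrix makes it straightforward to analyze the same data either concatenated or gene by gene.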
Rogue taxa, i.e., taxa that are placed inconsistently (in several positions with equal probability) during phylogenetic inference because of insufficient or erroneous data, will often lead to overestimation of rates of trait evolution, similar to what is seen in supertrees [31
]. To avoid this problem, we inferred 100 rapid bootstrap trees using RAxML-HPC v.8 on XSEDE using the CIPRES Science Gateway [33
]. Using these trees, we calculated taxon instability index with Mesquite v 3.6 [35
]. A high index value indicates that a taxon has variable placement among trees. By visual inspection of the distribution of index values, we found that 92% of taxa had indices below 5000; above this value, instability increased quickly (Figure S1 (Supplementary Materials)
). To ensure that our estimate of rates was conservative, we removed the 18 taxa with scores higher than this cutoff. Filtering our dataset based on BUSCO scores (discussed below), gene sequence availability, and taxonomic instability index scores led to a final dataset of 201 taxa for our Bayesian analysis.
We used BEAST v2.5.2 for the inference of time-calibrated phylogenies [36
]. For a starting tree, we selected the best maximum likelihood tree from RAxML, which we converted to an ultrametric tree using nonparametric rate smoothing implemented in the function chronos in the R package APE [37
]. We assumed a relaxed log-normal clock, a GTR substitution model with among-site rate variation modeled with a γ distribution, and a birth–death branching model. We estimated nucleotide substitution model parameters independently across four partitions: protein codon positions 1, 2, and 3, and the ribosomal positions. To calibrate divergence time estimation, we placed eight priors on node ages in the tree. Normal distributions with means and standard deviations were chosen to represent previous estimates of the ages of the root of the tree and the origin of Lepidoptera, Diptera, Hymenoptera, Coleoptera, Blattodea/Phasmatodea, Hemiptera, and Ephemeroptera/Odonata (Table S3 (Supplementary Materials)
]. Two independent BEAST runs were completed.
Microsatellite and other trait data:
We used micRocounter v.1.1.0 to characterize the microsatellite content within the insect genome assemblies [39
]. We recorded the number of dinucleotide (2mer), trinucleotide (3mer), tetranucleotide (4mer), pentanucleotide (5mer), and hexanucleotide (6mer) repetitive sequences. Default micRocounter settings for all parameters were used (2mers required six repeats, 3mers required four repeats, and 4–6mers required three repeats). We used a publicly available dataset to gather centromere type (holocentric or monocentric) and chromosome number for as many species as possible in our study [40
]. We additionally gathered available genome size estimates from the Animal Genome Size Database [42].
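The repeat thresholds given above (six repeats for 2mers, four for 3mers, and three for 4–6mers) can be illustrated with a simplified sketch. This is not micRocounter's implementation, and its handling of edge cases may differ; the names `MIN_REPEATS`, `is_primitive`, and `count_microsatellites` are hypothetical, and the sketch assumes that motifs which are themselves repeats of a shorter unit (e.g., ACAC, or homopolymers) should not be counted in the longer size class.

```python
import re

# Minimum number of tandem repeats required per motif length.
MIN_REPEATS = {2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % d == 0 and motif[:d] * (n // d) == motif for d in range(1, n)
    )


def count_microsatellites(sequence):
    """Count loci per motif length and total microsatellite bp."""
    sequence = sequence.upper()
    counts = {k: 0 for k in MIN_REPEATS}
    total_bp = 0
    for k, min_rep in MIN_REPEATS.items():
        # A k-base motif followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for match in pattern.finditer(sequence):
            if is_primitive(match.group(1)):
                counts[k] += 1
                total_bp += len(match.group(0))
    return counts, total_bp


# Hypothetical sequence: one (AC)6 locus, one (ATG)4 locus, and an (AC)5
# run that is too short to qualify.
toy_seq = "GG" + "AC" * 6 + "TT" + "ATG" * 4 + "C" + "AC" * 5
toy_counts, toy_bp = count_microsatellites(toy_seq)
```

Dividing the total bp by the assembly size in Mbp would give a content measure of the form bp per Mbp.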
Estimating rates of microsatellite evolution:
We fit Brownian motion models to estimate rates of microsatellite evolution at several levels. All rate estimates described were generated using the restricted maximum likelihood approach using the ace function in the R package APE [37
]. This function takes observed microsatellite content and the phylogeny, and it returns an ancestral state estimate for every node in the tree and the maximum likelihood estimate for the rate of evolution. For comparison, we fit the same model using the fitContinuous function in the R package Geiger [43
]. Rate estimates between these two approaches were qualitatively identical.
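To make the rate parameter concrete, the following sketch estimates a Brownian motion rate from phylogenetically independent contrasts. For a bifurcating tree, the mean of the squared standardized contrasts is the restricted maximum likelihood estimate of the rate. This is an illustration under assumptions, not the code of ace or fitContinuous: the nested-tuple tree format and the function names are hypothetical.

```python
def _contrasts(node, traits, out):
    """Post-order pass returning (node value, adjusted branch length).

    A leaf is (name, branch_length); an internal node is
    ((left, right), branch_length). Standardized contrasts are appended
    to out.
    """
    label, length = node
    if isinstance(label, str):
        return traits[label], length
    left, right = label
    x1, v1 = _contrasts(left, traits, out)
    x2, v2 = _contrasts(right, traits, out)
    out.append((x1 - x2) / (v1 + v2) ** 0.5)
    # Weighted average of the daughters, weighting by inverse branch length.
    value = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    # Lengthen the subtending branch to account for estimation error.
    return value, length + v1 * v2 / (v1 + v2)


def bm_rate_reml(tree, traits):
    """REML estimate of the Brownian motion rate on a bifurcating tree."""
    out = []
    _contrasts(tree, traits, out)
    return sum(c * c for c in out) / len(out)


# Hypothetical three-taxon tree ((A:1, B:1):1, C:2) with toy trait values.
toy_tree = (((("A", 1.0), ("B", 1.0)), 1.0), ("C", 2.0)), 0.0
toy_tree = ((toy_tree[0][0], toy_tree[0][1]), toy_tree[1])
toy_rate = bm_rate_reml(toy_tree, {"A": 1.0, "B": 3.0, "C": 5.0})
```

The rate is in units of squared trait change per unit of branch length, so its scale depends on whether content is measured as total bp or bp per Mbp.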
Firstly, we fit a model where we assumed a single rate of microsatellite evolution across the entire phylogeny. Next, we estimated rates individually for each order that had at least 10 species in our dataset (for both of these analyses, we fit our model based on microsatellite bp per Mbp). Finally, we calculated tip rates. Using the ancestral state estimates from our combined analysis of all data, tip rates were estimated by taking the difference in microsatellite content of a species and the ancestral state estimate for the node from which it descends. This value represents the change since the last speciation event sampled on our phylogeny (this was calculated based on the total bp of microsatellites estimated for each tip in the tree). This value was then divided by the branch length since that speciation event, providing an estimate for the recent rate of evolution in a species lineage.
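The tip-rate calculation described above can be written as a short sketch. The function name and example values are hypothetical; the text does not state whether the signed or absolute difference was used, so the sketch returns the signed value and leaves that choice to the caller.

```python
def tip_rate(tip_value, parent_estimate, branch_length):
    """Change in microsatellite content per unit branch length for one tip.

    tip_value: observed microsatellite content (bp) of the species.
    parent_estimate: ancestral state estimate at the parent node.
    branch_length: length of the terminal branch leading to the species.
    """
    if branch_length <= 0:
        raise ValueError("branch length must be positive")
    return (tip_value - parent_estimate) / branch_length


# Hypothetical values: 1200 bp at the tip, 1000 bp estimated at the parent
# node, and a terminal branch of 4 time units.
example_rate = tip_rate(1200.0, 1000.0, 4.0)
```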
The impact of centromere type on microsatellite content and evolution:
We firstly tested whether species with holocentric and monocentric chromosomes have significantly different microsatellite content. We analyzed the quantity of each microsatellite size class (2–6mer), total microsatellite content, and microsatellite content per Mbp using a phylogenetic ANOVA implemented in Geiger [43
]. The phylogenetic ANOVA was repeated for each tree from the posterior distribution. To calculate p-values, the observed F-statistic was compared to a null distribution generated from 100 simulations.
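A simulation-based p-value of this kind can be sketched as the fraction of simulated F-statistics at least as large as the observed one. The function name is hypothetical, and the add-one correction (counting the observed value as one draw from the null) is an assumption; the exact formula used by the Geiger implementation is not stated in the text.

```python
def simulation_p_value(observed_f, null_f_values):
    """Empirical p-value for an observed F-statistic against a simulated null."""
    if not null_f_values:
        raise ValueError("the null distribution is empty")
    n_extreme = sum(1 for f in null_f_values if f >= observed_f)
    # Assumption: the observed statistic is counted as one null draw.
    return (n_extreme + 1) / (len(null_f_values) + 1)


# Hypothetical null distribution from four simulations.
example_p = simulation_p_value(5.0, [1.0, 2.0, 3.0, 6.0])
```

With 100 simulations, the smallest p-value this estimator can return is 1/101.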
In addition to differences in microsatellite content, type of centromere may also affect the rate at which microsatellite content evolves. We tested for a difference in the rate of microsatellite evolution in species with holocentric and monocentric chromosomes using a censored rate test implemented in the brownieREML function in phytools v0.6-99 [44
]. This allowed us to compare models where the continuous trait (microsatellite content) evolves at a single rate on all branches to a model where each state has independent rates of evolution (O’Meara et al. 2006). We used the function make.simmap in phytools to generate the stochastic maps (holocentric vs. monocentric states) that are used in brownieREML [44
]. In construction of the stochastic map, we used a Markov model and allowed rates of transition between holocentric and monocentric to differ. To account for uncertainty in ancestral states, we repeated our analysis across 100 stochastic maps.
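One common way to compare a single-rate model with a model that gives each of two states its own rate is a likelihood ratio test with one degree of freedom. The sketch below illustrates that comparison. The text does not specify that this exact test was applied, so treat it as an assumption; the function name and log-likelihood values are hypothetical.

```python
import math


def likelihood_ratio_test_df1(loglik_single_rate, loglik_two_rate):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns (test statistic, p-value), using the chi-square distribution
    with one degree of freedom.
    """
    statistic = 2.0 * (loglik_two_rate - loglik_single_rate)
    # Guard against tiny negative values caused by optimizer noise.
    statistic = max(statistic, 0.0)
    # Survival function of chi-square with df = 1.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods for one stochastic map.
stat, p = likelihood_ratio_test_df1(-100.0, -98.0)
```

Repeating the comparison over every stochastic map gives a distribution of p-values that reflects uncertainty in the ancestral centromere states.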
Comparing rates and content to chromosome number and genome size:
We hypothesized that, if microsatellites are common in structural elements of chromosomes, then species with more chromosomes would have greater microsatellite content. We analyzed the data using a phylogenetic linear model in which microsatellite content (bp of microsatellite per Mbp of genome) was the response variable and chromosome number was the predictor variable. We also fit a phylogenetic linear model in which the tip rate, as described above, was the response variable and chromosome number was the predictor variable. Both models were fit using the function phylolm in the R package phylolm v2.6 and used all 100 posterior distribution trees with 100 bootstraps for each tree [45].
We also assessed genome size (in Mbp) as a predictor of microsatellite evolution. We fit a phylogenetic linear model in which microsatellite content in Mbp was the response variable and genome size was the predictor variable, and a second model in which the tip rate, as described above, was the response variable and genome size was the predictor variable. Both models were fit using the function phylolm in the R package phylolm and used the 100 posterior distribution trees with 100 bootstraps for each tree [45
]. All analyses were completed in R version 3.6.3 [46
]. All tests were considered significant at α = 0.05. All data and code necessary for our analyses are available on Dryad (available upon acceptance).