Genome Variability and Gene Content in Chordopoxviruses: Dependence on Microsatellites

To investigate gene loss in poxviruses belonging to the Chordopoxvirinae subfamily, we assessed the gene content of representative members of the subfamily, and determined whether individual genes present in each genome were intact, truncated, or fragmented. When nonintact genes were identified, the early stop mutations (ESMs) leading to gene truncation or fragmentation were analyzed. Of all the ESMs present in these poxvirus genomes, over 65% co-localized with microsatellites—simple sequence nucleotide repeats. On average, microsatellites comprise 24% of the nucleotide sequence of these poxvirus genomes. These simple repeats have been shown to exhibit high rates of variation, and represent a target for poxvirus protein variation, gene truncation, and reductive evolution.


Introduction
The Poxviridae family consists of viruses with a wide host range and potential for causing disease. The family is divided into subfamilies based on host range, consisting of the Entomopoxvirinae which infect insects, and the Chordopoxvirinae (ChPV) which infect vertebrates [1]. The chordopoxviruses are further divided into several genera, including the Orthopoxvirus genus, which includes the most well-known and the best characterized poxviruses, variola virus and vaccinia virus. Smallpox disease is caused by the viruses belonging to the Variola virus (VARV) species, and is well known both for its severity and its eradication through a worldwide vaccination effort. The vaccine used was based on viruses belonging to the species Vaccinia virus (VACV), which at some point replaced cowpox virus that was originally used in Jenner's smallpox vaccine, although the natural host of vaccinia viruses is unknown [2]. In addition to the Orthopoxvirus genus, chordopoxviruses are categorized into nine other genera, namely Avipoxvirus, Capripoxvirus, Cervidpoxvirus, Crocodylidpoxvirus, Leporipoxvirus, Molluscipoxvirus, Parapoxvirus, Suipoxvirus, and Yatapoxvirus, as well as the unassigned species Squirrelpox virus, and the yet to be classified, yoka poxvirus (YKPV) [1,3].
Poxvirus genomes consist of linear double-stranded DNA of 133,000-360,000 bp bounded by inverted terminal repeats [4]. Poxviruses code for 133-328 genes, and at least 90 gene families exhibit significant homology across the ChPV subfamily [4]. Viruses assigned to the same genus exhibit higher degrees of gene conservation. For example, all orthopoxviruses share a core set of approximately 174 genes. Every orthopoxvirus gene outside this core set is also found among the 214 genes present in cowpox viruses, whichever other orthopoxviruses also carry it [4]. Poxviruses are unique among the DNA viruses because DNA replication occurs in the cytoplasm as opposed to the nucleus [5,6]. The large coding capacity of their genomes is one of the ways poxviruses support replication in the absence of the cellular DNA replication machinery, by coding for many of their own transcription and replication proteins. Most poxvirus genomes share similar genetic organization, with genes involved in essential functions such as transcription, DNA replication, and virion assembly located in a highly conserved central region [7]. Genes involved in immunomodulation and host range determination tend to cluster at the ends of the genomes, which are much more variable in terms of sequence homology and gene content.
Viruses within the species Cowpox virus (CPXV) have the largest genomes in comparison to other orthopoxviruses, and are believed to be most genetically similar to the ancestral orthopoxvirus in terms of gene content [4,8]. Several species including Cowpox virus and Monkeypox virus (MPXV) are capable of causing zoonotic infections in humans, and both species have a wide host range [9]. The orthopoxviruses are antigenically similar, and infection with one can induce long-lasting immunity against the others [10]. Only limited genomic data are available for orthopoxviruses endemic to North America; however, they appear to be phylogenetically distinct from other orthopoxviruses [11].
Viruses of the Avipoxvirus genus infect birds, can be transmitted via arthropod vectors as well as aerosols, and are being investigated as vaccine vectors [12]. Although most chordopoxviruses share similar genome organization, many avipoxviruses contain an extensive genomic rearrangement relative to other ChPV [13]. Chordopoxviruses are often named for the animal from which they were isolated, although in many cases it is not the reservoir host. Ruminants such as cattle, goats, and sheep are the hosts of viruses in the Capripoxvirus genus [14]. Members of the Cervidpoxvirus genus are normally found in mule deer, but have recently been isolated from a gazelle. Viruses of the Leporipoxvirus genus infect rabbits, members of the Suipoxvirus genus infect only swine, and primates are the hosts for members of the Yatapoxvirus genus [15,16].
Some members of ChPV are distinct from the others due to their high genomic GC content. The GC content in chordopoxviruses ranges from 25% in the capripoxviruses to 67% in squirrelpox virus [17], although a biological role for variation in GC content has not yet been identified. The high GC poxviruses include viruses in the Molluscipoxvirus genus, which is specific to humans, Nile crocodilepox virus, a virus in the Crocodylidpoxvirus genus, and members of the Parapoxvirus genus, which have been identified in hoofed mammals, shrews, weasels, and other members of the mammalian Laurasiatheria superorder [18][19][20][21].
The large host range of some poxviruses may be a result of host jumping, where a virus will occasionally infect an animal other than its typical host, and adaptive mutations may allow successful colonization of the new species [22,23]. DNA viruses have a comparatively low genetic mutation rate of approximately 10⁻⁷ to 10⁻⁹ mutations per site per round of replication which is closer to that of their hosts than to RNA viruses [24,25]. It is not well understood how the potentially large number of mutations necessary to cross host species barriers are able to accumulate quickly enough to allow the virus to adapt to a new host. Poxviruses are subject to several mechanisms which introduce genomic variability, including horizontal gene transfer (HGT), single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), larger deletions of sequence which may include entire genes, and recombination [26]. The mechanisms utilized by viruses within any particular poxvirus genus to support their variation and evolution may differ from one genus to another. For example, the orthopoxviruses do not seem to have acquired any new genes through HGT since their most recent common ancestor, however, evidence supports progressive gene fragmentation and loss as one of the major determinants of speciation within the Orthopoxvirus genus [4]. Short indels are the most common cause for variation in orthopoxvirus gene content resulting in gene truncation, fragmentation, and loss [27].
Microsatellites, also called Simple Sequence Repeats (SSRs), or Short Tandem Repeats (STRs), are motifs of 1-6 nucleotides arranged in tandem repeats [28,29]. They are present in eukaryotes, prokaryotes, and viruses. The presence and characteristics of microsatellites have been shown to vary according to genome length and GC content, and the number and length of repeats vary by species and by their location in coding and noncoding regions [30,31]. Microsatellites can function as hypervariable regions compared to surrounding non-repetitive genomic sequence. The motifs are not always replicated perfectly during DNA replication, which can lead to changes in the number of repetitions or to single nucleotide polymorphisms (SNPs). Slipped-strand mispairing is one of the most likely causes of mutation in microsatellites; it occurs when DNA polymerase "slips" backwards or forwards along the template, and is facilitated by the short repeats [30,32]. When insertions and deletions (indels) occur in gene coding regions, they result in frameshift mutations, and therefore frequently in early stop mutations and truncation, unless the indel length is a multiple of 3, in which case the reading frame is preserved and the protein sequence will vary by the deletion or addition of one or more amino acids. Microsatellites can also serve as "hot spots" for recombination, which can lead to duplication or deletion of larger stretches of DNA, sometimes including whole genes [33,34].
Repetitive genetic elements have been shown to mediate different aspects of virus biology such as latency in some herpesviruses [35]. Poxviruses contain clusters of tandem repeats in their telomeres near the exterior ends of inverted terminal repeats bookending the genome, although the repeat units are longer than microsatellites [36]. The high variability often identified at short sequence repeats has led to great interest in characterizing microsatellites in many different species. Microsatellite distribution has been investigated in ssDNA viruses such as members of the family Geminiviridae that infect plants and the families Circoviridae, Parvoviridae, and Anelloviridae that infect vertebrates [37,38]. Short repeats were also identified throughout viruses in the Herpesviridae family, six genotypes of hepatitis C viruses, adenoviruses, influenza viruses, sin nombre virus, and human immunodeficiency virus type 1 [39][40][41][42][43].
Despite widespread identification of microsatellites and their functional significance in eukaryotic and prokaryotic genomes, the biological role of microsatellites in viral genomes remains largely unknown. We previously identified an inverse association between the length of orthopoxvirus genomes and the number of mutations they contain that could lead to truncated and fragmented genes [27]. While analyzing the data, we observed that many of the ESMs occurred at microsatellites (data not reported). In this report, we have assessed the microsatellite content of viruses throughout the Chordopoxvirinae subfamily, characterized the microsatellites present in the viral genomes, and analyzed the relationship between ESMs and microsatellites.

Genome Sequences and Assessment of Gene Content
We chose representative isolates from each genus within the Chordopoxvirinae subfamily, as well as each species within the Orthopoxvirus genus with available complete genome sequences (Table 1). Nucleotide sequences were downloaded from the Viral Bioinformatics Resource Center (vbrc.org) [44], with the exception of SQPV, which was downloaded from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov). As described in Hendrickson et al. [4] and Hatcher et al. [27], we used the Poxvirus Genome Annotation System (PGAS) to determine the coding potential of open reading frames (ORFs), to annotate predicted genes according to their coding state (i.e., intact, truncated, fragmented, or missing genes), and to compare syntenic regions in closely related strains or species. MAFFT alignments using the FFT-NS-2 method were generated for the avipoxviruses, the high GC isolates, and all isolates except the avipoxviruses, and were used to identify genes that have similar location but low homology [45]. An ORF was considered intact if it was similar in length to the longest ORF for all orthologs of that gene. Truncated genes were defined as genes greater than 30 amino acids long and retaining a predicted promoter sequence, but were less than 80% of the amino acid length of intact orthologs. This cutoff is based on a comparison of gene length conservation (described in Hendrickson et al. [4]). Fragmented genes were identified as genomic regions containing identifiable homologous sequence to an orthologous, intact ORF, but with the fragmented ORF coding for less than 30 amino acids, missing a predicted promoter region, or missing the 5' end of the gene, including a start codon.
ORFs that were flagged as potential genes through PGAS automated prediction but did not have an identified promoter, had an absent or weak Kozak consensus sequence, had a low Glimmer score, and showed no orthology with intact poxvirus genes were not labeled as genes or gene remnants unless they were identified as genes when the genome was originally published. To differentiate orthologous from paralogous genes, we assessed sequence homology as well as gene synteny, the conservation of genomic location and gene neighbors [46]. We utilized the term "syntelog" to represent orthologous genes shared across isolates with a common syntenic genome location [4].
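The coding-state definitions above can be restated as a small decision rule. The following is a minimal sketch, not the PGAS implementation: the `OrfCall` record and `classify_orf` function are hypothetical names, an ORF of exactly 30 amino acids is assumed not to be a fragment, and any ORF at or above 80% of the intact ortholog length is assumed to count as intact (the text only says "similar in length").

```python
from dataclasses import dataclass

MIN_AA = 30                # ORFs coding for fewer than 30 aa are treated as fragments
TRUNCATION_CUTOFF = 0.80   # below 80% of the intact ortholog length = truncated


@dataclass
class OrfCall:
    """Hypothetical record describing one ORF and its intact ortholog."""
    length_aa: int            # length of the ORF being classified
    intact_length_aa: int     # length of the longest intact ortholog
    has_promoter: bool        # a predicted promoter is retained
    has_five_prime_end: bool  # 5' end, including the start codon, is present


def classify_orf(orf: OrfCall) -> str:
    """Return 'intact', 'truncated', or 'fragmented' under the stated assumptions."""
    # Fragmented: too short, no predicted promoter, or missing 5' end.
    if orf.length_aa < MIN_AA or not orf.has_promoter or not orf.has_five_prime_end:
        return "fragmented"
    # Truncated: 5' end and promoter retained, but shorter than the cutoff.
    if orf.length_aa < TRUNCATION_CUTOFF * orf.intact_length_aa:
        return "truncated"
    return "intact"
```

Genes that are entirely missing from a genome have no ORF to classify and are outside the scope of this sketch.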

Detection of Early Stop Mutations (ESMs)
ESMs are defined as mutations that give rise to a stop codon in the sequence of a gene that either interrupts the start codon or truncates the ORF to a length 80% or less of the length of the intact orthologous gene. ESMs were identified as in Hatcher et al. [27] by visualizing the nucleotide sequence of each open reading frame and the corresponding amino acid translation of the coding frame and the two alternate frames, and annotating mutations that introduced a stop codon in the normal coding frame, introduced an indel resulting in a frameshift mutation and subsequent early stop codon in the new reading frame, or altered the start codon of a gene. In the case of an altered reading frame, the mutation that caused the change in reading frame is coded as the ESM and not the newly introduced (now in-frame) stop codon.
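The ESMs in this study were annotated by visual inspection, so the following is only an illustrative sketch of the underlying logic: a single-base deletion shifts the reading frame and exposes a stop codon that truncates the ORF to 80% or less of the intact length. The function names and the toy sequences are hypothetical, and the sketch only detects the resulting stop codon; locating the causative indel (which is what is coded as the ESM) would additionally require an alignment to the intact ortholog.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def first_stop_codon_index(seq: str, frame: int = 0):
    """Return the 0-based codon index of the first stop codon in `frame`, or None.

    The returned index equals the number of amino acids encoded before the stop.
    """
    seq = seq.upper()
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return (i - frame) // 3
    return None


def is_early_stop(seq: str, intact_length_aa: int, cutoff: float = 0.80) -> bool:
    """True if the ORF stops at 80% or less of the intact ortholog length."""
    stop = first_stop_codon_index(seq)
    return stop is not None and stop <= cutoff * intact_length_aa


# Toy intact ORF: 10 codons followed by a stop codon.
intact = "ATG" + "GCT" + "CTA" + "GCC" * 7 + "TAA"
# Deleting one base in the second codon shifts the frame: ATG GTC TAG ...
mutant = intact[:4] + intact[5:]
```

Here `first_stop_codon_index(intact)` is 10, while the frameshifted `mutant` stops after 2 codons and is flagged by `is_early_stop(mutant, 10)`.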

Microsatellite Identification
Each of the viruses was assessed individually for the presence of microsatellites using the IMEx program available at http://imex.cdfd.org.in [42]. Repeats with motifs of 1-6 bp were identified if present with a minimum repeat number of 4 for motifs of 1 base, 3 for motifs of 2 or 3 bases, and 2 for motifs of 4, 5, or 6 bases. Perfect and imperfect repeats were allowed, and the imperfections were limited to a variation of 10% within the repeat tract. Microsatellite content was determined by dividing the total number of nucleotides present within the repeats by the length of the genome. When counting ESMs that overlapped microsatellites, if an ESM only partially overlapped a microsatellite, it was not included in the number of ESMs overlapping repeats.
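To make the thresholds concrete, the sketch below finds perfect repeats only, using the minimum repeat numbers given above. It is a simplified stand-in and not the IMEx algorithm: imperfect repeats (the 10% allowance) are not handled, motifs that are themselves repeats of a shorter unit (for example "ATAT") are skipped as an assumption, and bases covered by more than one tract are assumed to be counted once when computing microsatellite content. All function names are hypothetical.

```python
# Minimum number of tandem copies per motif length, as stated in the text.
MIN_REPEATS = {1: 4, 2: 3, 3: 3, 4: 2, 5: 2, 6: 2}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_perfect_microsatellites(seq: str):
    """Return sorted (start, end, motif) tuples, 0-based with exclusive end."""
    seq = seq.upper()
    tracts = []
    for k, min_rep in MIN_REPEATS.items():
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            if set(motif) <= set("ACGT") and is_primitive(motif):
                end = i + k
                while seq[end:end + k] == motif:  # extend while the motif repeats
                    end += k
                if (end - i) // k >= min_rep:
                    tracts.append((i, end, motif))
                    i = end - k + 1  # skip rotations of the same tract
                    continue
            i += 1
    return sorted(tracts)


def microsatellite_content(seq: str, tracts) -> float:
    """Fraction of genome positions covered by at least one tract."""
    covered = [False] * len(seq)
    for start, end, _ in tracts:
        for pos in range(start, end):
            covered[pos] = True
    return sum(covered) / len(seq)


def esm_within_microsatellite(esm_start: int, esm_end: int, tracts) -> bool:
    """An ESM counts only if it lies entirely inside a tract (no partial overlap)."""
    return any(start <= esm_start and esm_end <= end for start, end, _ in tracts)


toy = "GCAAAAATCGATATATGC"  # hypothetical sequence, not from any genome
toy_tracts = find_perfect_microsatellites(toy)
```

On the toy sequence this reports an A5 tract and an (AT)3 tract, which together cover 11 of 18 bases.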

Statistical Analysis
The total number of microsatellites was normalized according to genome size. Relative abundance was calculated as the number of microsatellites per kilobase of genomic sequence. Relative density was calculated as the sum total of microsatellite nucleotides per kilobase of genomic sequence.
Linear regression was used to determine the relationship between various genomic features and microsatellites, and was performed in Microsoft Excel 2010.
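The two normalizations and the regression step can be sketched as follows. The analysis itself was performed in Microsoft Excel; this is an equivalent ordinary least-squares calculation written out by hand, with hypothetical function names and made-up input numbers used purely for illustration.

```python
def relative_abundance(n_microsatellites: int, genome_length_bp: int) -> float:
    """Microsatellites per kilobase of genomic sequence (RA)."""
    return n_microsatellites / (genome_length_bp / 1000)


def relative_density(microsatellite_nt: int, genome_length_bp: int) -> float:
    """Microsatellite nucleotides per kilobase of genomic sequence (RD)."""
    return microsatellite_nt / (genome_length_bp / 1000)


def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up example: a 200,000 bp genome with 3,600 microsatellites
# covering 24,400 nucleotides in total.
ra = relative_abundance(3600, 200_000)
rd = relative_density(24_400, 200_000)
```

With these made-up inputs, `ra` is 18 microsatellites/kb and `rd` is 122 nt/kb.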

Assessment of Chordopoxvirus Gene Content
We annotated the genes of representative viruses from each chordopoxvirus genus, and characterized those genes as intact, truncated, or fragmented (Table 1). Using whole genome multiple sequence alignments, we were able to use homology and synteny to assist with gene annotation, including identifying truncated and fragmented genes by comparing the genomic sequence in these regions with the sequences of intact, syntenic genes present in other isolates. When identifying non-intact genes, we define truncated genes as less than 80% the length of intact genes but with the 5' end intact, as opposed to fragmented genes which maintain nucleotide sequence homology to intact genes but either do not have an intact 5' end of the gene, or the remaining orthologous ORF is less than 30 amino acids long. We were able to identify at least one non-intact gene in most of the viruses, although the majority were in orthopoxviruses (reported in Hatcher et al. [27]), yoka poxvirus, capripoxviruses, and the avipoxviruses. Yoka poxvirus is the closest phylogenetic relative to the orthopoxviruses which has been identified at this time [3]. We were unable to identify any truncated or fragmented genes in the high GC viruses. Figure 1 provides a comparison of the similarities of the isolates as pairwise nucleotide identities for the most conserved central portion of the genomes.

Assessment of Truncated and Fragmented Genes, and Early Stop Mutation Content
Genome reduction through gene fragmentation and loss is one of the primary drivers of speciation in the orthopoxviruses, and previously we described an analysis of early stop mutations (ESMs) leading to gene truncation and fragmentation in orthopoxviruses [27]. We expanded that analysis here, characterizing the ESMs in viruses in the other genera of the Chordopoxvirinae subfamily, and were able to identify the causative ESMs for all but two of the non-intact genes (Table 2). In those two instances, high variability between the non-intact genes and their homologs prevented us from reliably identifying the ESMs.
Orthopoxvirus genes range from highly intact in cowpox viruses (0.013-0.026 non-intact genes per 1000 bp) to highly fragmented in the variola viruses and TATV (0.150-0.167 non-intact genes per 1000 bp). Orthopoxvirus truncated and fragmented genes are interrupted by an average of 2.8 ESMs per gene, similar to the average for the entire subfamily, although some isolates average as high as five ESMs per non-intact gene. YKPV is not only genetically similar to the orthopoxviruses, but has a comparable number of truncated or fragmented genes (17 in YKPV and an average of 20.7 in orthopoxviruses), and ESMs per non-intact gene (2.8 in both YKPV and orthopoxviruses). In the avipoxviruses, FWPV has many more truncated and fragmented genes than CNPV (18 versus 6), a greater number of ESMs (53 versus 22), but the frequency of ESMs per non-intact gene in FWPV is lower than in CNPV (2.9 versus 3.7). Two members of the capripoxviruses have nine non-intact genes each, with an average of 11.8 ESMs per gene in SPPV, and an average of 9.1 in GTPV. The third capripoxvirus isolate included, LSDV, does not appear to have any truncated or fragmented genes. Because we were unable to identify truncated or fragmented genes in the high GC content viruses, we were also unable to identify ESMs in those viruses. It is difficult to say whether the lack of non-intact genes may be due to the low number of sequenced isolates available for high GC viruses compared to high AT viruses, or due to different evolutionary mechanisms that prevent the accumulation of ESMs. The availability of additional sequenced high GC poxvirus isolates may provide the data necessary to answer this question.
Short repeats in DNA sequences are associated with higher rates of mutation in many species, and we wanted to understand whether microsatellites may have a role in the introduction of early stop mutations in poxviruses. To determine if there is an association between ESMs and short repeats, we identified the microsatellites in the genomes, and identified the ESMs that occurred at the microsatellites (Figure 2). The short repeats are present in all of the isolates and account for an average of 24% of the genome in chordopoxviruses, ranging from 12.2% in BPSV to 31.7% in the capripoxviruses (Table 2). In the orthopoxviruses, repeats make up a very narrow range of 22.4%-23.0% of the genome. In most viruses where ESMs were identified, a greater number of ESMs are located at microsatellites than can be explained by the microsatellite content of the genomes (Table 2 and Figure 3A,B). In some cases the incidence of colocalization is striking: in RPXV, 81.1% of the ESMs occur at microsatellites, but the microsatellites comprise only 22.5% of the genome. This is not a universal observation, however, since in FWPV, RFV, and TANV the proportion of ESMs at microsatellites is close to the genome microsatellite content. We did not, however, find any correlation between the frequency of ESMs overlapping microsatellites and the overall genome microsatellite content (Figure 3C).
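One way to judge whether ESMs fall at microsatellites more often than the genome-wide microsatellite content would predict is a one-sided binomial test. This is an illustrative sketch only: the text does not state which test, if any, was applied to these proportions, the null model (each ESM lands in a microsatellite with probability equal to the genome microsatellite fraction) is an assumption, and the counts below are hypothetical values chosen to match the 81.1% reported for RPXV.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical counts: 30 of 37 ESMs at microsatellites (about 81.1%),
# tested against a genome microsatellite fraction of 22.5%.
n_esms = 37
n_at_microsatellites = 30
genome_fraction = 0.225

p_value = binomial_upper_tail(n_at_microsatellites, n_esms, genome_fraction)
```

Because ESMs by definition occur in coding sequence, a stricter null model would use the microsatellite fraction of coding regions rather than of the whole genome.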

Chordopoxvirus Microsatellite Content
Counts of microsatellites with motifs of 1-6 bp are described in Table 3 and Figure 4. The relative abundance (RA) of microsatellites, or the average number of microsatellites in every kilobase of genomic sequence, varies from 18 in BPSV to 55 in SPPV and GTPV. The orthopoxviruses are very similar to each other, and each member has an RA between 34 and 39 microsatellites/kb. As the length of repeat motifs increases, the number of repeats decreases. Although the reported hexamer RA for several of the viruses is close to 0, repeats of 6 bp motifs were observed in all of the viruses. The portion of the genome made up of microsatellites is the relative density (RD), and ranges from 122 microsatellite nucleotides per kilobase of genomic nucleotides in BPSV to 317 nt/kb in SPPV and GTPV. The most numerous repeats are monomers and trimers: monomers are the most numerous in high AT viruses, whereas trimers are the most numerous in high GC viruses. This is true whether comparing actual counts or RA for each virus; however, because RD takes into account the size of each repeat, there is no clear pattern for RD. Microsatellites appear to be distributed relatively evenly across genomes, as shown in Figure 5; however, there is a significant difference between the RA of microsatellites in noncoding and coding regions (t test, p < 0.0001, 95% CI of the difference between the means is 10.146-26.034). The mean RA in noncoding regions is 57 microsatellites/1000 bp, and in coding regions is 39 microsatellites/1000 bp. This is best explained by more relaxed selection pressure on noncoding regions, where mutations due to the increased variability in microsatellites are less likely to affect virus viability.
We next attempted to identify genomic characteristics that may be associated with microsatellite content. The genome length for viruses included in this study ranges from approximately 134,000-360,000 bp. Neither RA nor RD shows significant correlation with genome size in the chordopoxvirus isolates we analyzed (Figure 6). We also analyzed the microsatellites for their GC content, since it has previously been shown that the GC content of the genome is associated with the microsatellite composition in other organisms [31,47]. All of the microsatellites in a virus were assessed for the proportion of G and C nucleotides they contained, and this number was compared to the total GC content of each genome. In these poxviruses, the GC content of the microsatellites was strongly associated with the GC content of the genome (Figure 7).
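The GC comparison amounts to computing two proportions per genome. A minimal sketch follows, assuming the microsatellite GC content is pooled over all tract nucleotides (the text does not say whether tracts were pooled or averaged) and that tracts are supplied as 0-based (start, end) intervals with an exclusive end; the function names and the toy sequence are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Proportion of G and C nucleotides in a sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def microsatellite_gc_fraction(seq: str, tracts) -> float:
    """GC proportion pooled over all nucleotides inside microsatellite tracts."""
    pooled = "".join(seq[start:end] for start, end in tracts)
    return gc_fraction(pooled)


# Toy example: a G4 tract and an (AT)3 tract in a 12 bp sequence.
toy_genome = "GGGGCCATATAT"
toy_intervals = [(0, 4), (6, 12)]

genome_gc = gc_fraction(toy_genome)
tract_gc = microsatellite_gc_fraction(toy_genome, toy_intervals)
```

Computing the two values for each genome gives the paired points that would then be compared across genomes.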

Discussion
Poxviruses have been shown to evolve through a combination of vertical gene transfer, horizontal gene transfer from eukaryotes and other viruses, genetic drift, and, at least in the orthopoxviruses, gene loss. In this report, we cataloged all of the truncated and fragmented genes present in completely sequenced genomes of representative poxviruses from the Chordopoxvirinae subfamily, and we identified the early stop mutations leading to these gene truncations and fragmentations. We also showed that these ESMs were associated with microsatellites, and we characterized the microsatellite content in the genomes.
We observed a wide variation in the number of nonintact genes both in the subfamily as a whole, and often within genera. While reductive evolution has commonly been accepted as a major form of adaptation in the orthopoxviruses, this finding may signify that it is also active in other poxvirus genera such as the avipoxviruses. Although FWPV and CNPV share a similar genomic arrangement, CNPV contains over 75 kbp of additional sequence compared to FWPV [48]. FWPV also contains three times as many nonintact genes compared to CNPV (Table 2). Therefore, genome reduction through a process of gene truncation and fragmentation may have played a significant role in the evolution of FWPV. Similar features can be seen in the genus Capripoxvirus, where LSDV appears to only code for intact genes. In comparison, SPPV and GTPV each have nine non-intact genes.
One of the limitations with this study is that gene truncations and fragments cannot be identified without an example of an intact gene. When analyzing genomes for genes that are truncated or fragmented, we were unable to identify any in the high GC viruses. There are at least two possible interpretations for this. The high GC viruses contain a large number of unique genes and a low number of viruses sequenced for each genus, making it difficult to identify a gene as truncated or fragmented since there is no intact gene in a closely related virus with which to compare. Because of this, several genes may be classified as intact, when they are in fact truncated or fragmented. The second possible reason is that the high GC viruses may be subject to different evolutionary mechanisms, including a lack of gene reduction as compared to viruses with higher AT content and higher numbers of nonintact genes. However, a number of non-high GC viruses also contain low numbers of nonintact genes including LSDV, DPV, SWPV, the leporipoxviruses, and the yatapoxviruses. In addition, CPXV isolates only contain 3-6 nonintact genes in contrast to the other orthopoxviruses that contain between 15 and 33 nonintact genes. This emphasizes that GC content alone does not directly determine the number of nonintact genes present in any single genome, and that other factors must play a role in determining the extent to which gene truncation and fragmentation play a role in the evolution of poxvirus species.
While most viruses show an association between ESMs and microsatellites, this does not appear to be the case for FWPV, RFV, and TANV (Table 2, Figure 3B). This may be due to different mechanisms being used for the introduction of ESMs compared to other poxviruses, but it may also be due to the accumulation of mutations in microsatellites that originally colocalized with ESMs, but whose sequence has now diverged significantly, making it difficult to identify the former microsatellites.
Another interesting observation is the relatively low number of microsatellites in BPSV. BPSV and ORFV are both in the genus Parapoxvirus, and have genomes that are 134,431 and 139,952 bp long, respectively. We were unable to identify any non-intact genes in either ORFV or BPSV (both are high GC viruses). BPSV has a genomic microsatellite content of 12.2%, compared to ORFV, which is 22.8%, and 20.8%-31.7% for all other chordopoxviruses.
DNA viruses, and especially dsDNA viruses, have been shown to have relatively low mutation rates, more similar to eukaryotes than to RNA viruses [25]. Therefore, mechanisms that would support an increase in the number of mutations or that would target mutations to specific regions of the genome may provide an opportunity to adapt to alterations in the environment, selecting for new genome variants that increase the fitness of the virus. Recently, Elde et al. identified just such a mechanism for vaccinia virus that was able to expand the copy number of an immunomodulatory gene when exposed to selective conditions favoring expression of that gene [49]. The genomic regions containing the gene were most likely duplicated by recombination, however the authors also witnessed loss of the duplications when selection pressure was removed. Gene duplication and reduction may be one method of rapidly altering the poxvirus genome. Increased numbers of mutations due to the presence of microsatellites may be another. The two methods are not mutually exclusive, since non-homologous recombination preferentially occurs at repeat regions [33,34], and this may be a mechanism by which the accordion-like expansion and contraction of genes like the instance observed by Elde et al. may occur.

Conclusions
This study highlights the role of microsatellites in poxvirus sequence diversity, and how they may affect viral host range and pathogenesis by playing a role in gene variation and inactivation. We found a high incidence of statistically significant co-localization of ESMs with microsatellites. Genes which are no longer impacted by selection pressure to remain functional may be more likely to accumulate mutations at microsatellites, since these short repeats often show higher variability than the surrounding sequences [50,51]. Microsatellites may therefore serve as a major source of genomic variability in chordopoxviruses, contributing to the introduction of ESMs and therefore impacting gene content, virus biology, and evolution. Microsatellite hypervariability may allow the virus to adapt more quickly to environmental changes, such as the phenotypic changes needed when adapting to newly encountered hosts.