Re-Assembly and Analysis of an Ancient Variola Virus Genome

We report a major improvement to the assembly of published short read sequencing data from an ancient variola virus (VARV) genome by the removal of contig-capping sequencing tags and manual searches for gap-spanning reads. The new assembly, together with camelpox and taterapox genomes, permitted new dates to be calculated for the last common ancestor of all VARV genomes. The analysis of recently sequenced VARV-like cowpox virus genomes showed that single nucleotide polymorphisms (SNPs) and amino acid changes in the vaccinia virus (VACV)-Cop-O1L ortholog, predicted to be associated with VARV host specificity and virulence, were introduced into the lineage before the divergence of these viruses. A comparison of the ancient and modern VARV genome sequences also revealed a measurable drift towards adenine + thymine (A + T) richness.


Introduction
Smallpox, which is caused by the poxvirus variola virus (VARV) [1], was declared eradicated almost 40 years ago, following a massive worldwide vaccination campaign [2,3]. The disease may have been the world's worst scourge, with reports that more than 500 million were killed during the Twentieth Century alone [4]. VARV and vaccinia virus (VACV), which was used as a live vaccine, belong to the Orthopoxvirus genus within the Poxviridae family and have linear dsDNA genomes of approximately 200 kb. VACV is the most studied poxvirus and serves as a model virus for the family. However, several other Orthopoxviruses are notable: monkeypox viruses (MPXV) [5] are zoonotic; ectromelia viruses (ECTV) [6] have served as a model for smallpox in mice; rabbitpox viruses [7,8] are VACV variants that are hypervirulent in rabbits; and cowpox viruses [9][10][11], which despite their names, actually form several species, encode the most genes.
More than 300 essentially complete genomes have been sequenced from poxviruses (Chordopoxvirinae subfamily) that infect vertebrates. Although not all are yet officially recognized, these likely fall into at least 18 genera. When these viruses are aligned, the highest synteny and DNA sequence identity is found within a central core region of approximately 80 to 100 kb, with a gene density averaging approximately one per kb. Outside of this core, the genes, which mostly encode host range and virulence factors and are not present in all viruses, are frequently rearranged and have significantly lower levels of DNA and amino acid (aa) conservation than genes in the core region. Thus, when different sets of genes are compared for a given group of viruses, it is not uncommon to observe significantly different levels of conservation because core genes are more highly conserved [12].
Several other factors complicate genome alignments and the ubiquitous phylogenetic trees that are used to display the evolutionary relationships between the viruses. First, when complete genomes are aligned, the alignment software is frequently confused by the presence of large insertions/deletions (indels) or gene rearrangements in conjunction with the variety of paralogs present outside the central core. The alignment of paralogs instead of orthologs may lead to serious errors in the trees. Second, numerous examples of homologous recombination, which go against the underlying assumptions of phylogenetics, have been described within poxvirus genomes [9, [13][14][15]. In previous work [16], we hypothesized that the limited host range (only humans) and severe virulence might be linked to a relatively large number of single nucleotide polymorphisms (SNPs) grouped together in the VARV ortholog of the VACV-Cop-O1L gene (membrane protein) and that a recombination event had created this sequence arrangement. Therefore, it was of great interest when the sequencing of a VARV isolated from a mummy, which was estimated to be approximately 350 years old, was announced [1]. This genome (GenBank KY358055.1) was described, in the supplementary material, as consisting of 31 contigs with padding of Ns (unidentified nucleotides) to allow alignment to a reference sequence (VARV-India-1967; National Center for Biotechnology Information (NCBI) Reference Sequence NC_001611.1). However, when the sequence file was examined, there were 121 blocks of Ns ranging in size from one to 626 nucleotides (nt). The shorter blocks of Ns (46 are shorter than 10 nt), which are frequently close together, may represent unresolved nucleotide calls rather than all being spacers between un-joined contigs. Here we describe significant improvements to the assembly of the reported ancient VARV genome sequence data [1] and further characterization of the evolution of VARV.

Retrieval of Genome Sequences and Alignment
The genomes used for the construction of multiple sequence alignments (MSAs) are listed in Table 1. The sequences were aligned using MAFFT [17], and both ends of the genome were removed, leaving a 98 kb core region spanning the VACV-WR genome 32958-134689 nt, from gene VACV-WR-043 (ribonucleotide reductase, small subunit) to VACV-WR-144 (RNA polymerase (RPO132)). The sequences were then manually edited using Base-By-Base (BBB; [18]) software to correct alignment errors produced by the software. All alignment columns containing gap-characters were removed prior to the construction of the phylogenetic trees.

Visual Examination of the MSA
The BBB software highlights SNPs in several ways. The user can choose to compare sequences against the top sequence of the alignment, against the consensus sequence, or in a pairwise fashion. Differences between SNPs are highlighted by blue blocks; insertions and deletions are shown by green and red blocks, respectively. This highlighting makes otherwise unrecognizable short patterns of SNPs easily visible, allowing the user to scan the sequence alignment visually for small and imperfect patterns of SNPs and indels. Although this process sounds archaic, the human eye/brain is still usually most effective when looking for novel patterns in subsets of the sequences in MSAs.

Phylogenetic Tree Construction and Evolutionary Analyses
Phylogenetic analyses were performed using a 98 kb alignment of the conserved core region of orthopoxviruses between VACV-Cop-F4L and Cop-A24R orthologs with any gap-containing columns removed. Maximum-likelihood trees were constructed using RAxML (Randomized Axelerated Maximum Likelihood) [19] under the GTRGAMMA (General time reversible) base substitution model using 1000 bootstrap replicates. We performed an evolutionary analysis of the VARV species using the Bayesian Markov chain Monte Carlo (MCMC) inference methods with BEAST 2 (Bayesian evolutionary analysis by sampling trees) software [20], with the rooted RAxML constructed maximum-likelihood tree as the starting tree. All of the viruses included in the evolutionary analysis had reliable collection dates, with the exception of VARV-VD21-uvic (our assembly), which was previously dated using radiocarbon dating techniques [1] , and VARV-V1588 and VARV-V563, which were previously dated by the degree of D-and L-aspartic acid racemization [21]. BEAST 2 was run using a fixed clock model, constant population size coalescent priors, and a 100,000,000 MCMC chain length. The maximum clade credibility tree produced was generated under a GTR substitution model using gamma and invariant sites.

Re-Assembly of the VARV-VD21 Genome
The previously reported VARV-VD21 genome of 31 contigs was created using de novo and reference mapped assemblies. However, the GenBank file (KY358055.1) of this ancient genome contains 121 groups of Ns, which range in size from 1 to 626 nt and total 3,748 nt. Therefore, before attempting to search for evidence of recombination in the severely fragmented genome, we set out to try and fill some of the gaps in the published sequence by using alternative assembly strategies. Our first attempts at assembling the data provided by the Sequence Read Archive (SRP091673), using Cutadapt [22] to trim Illumina adapters and MIRA (Mimicking Intelligent Read Assembly) [23] and SPAdes (St. Petersburg genome Assembler) [24] for assembly, created 56 contigs that could not be resolved by modifying the assembly parameters. However, when these contigs were viewed and aligned to the same reference sequence (VARV-India-1967) using BBB, it was obvious that the ends of the contigs were effectively terminated by reads that contained variants of the sequencing adapters normally removed by the Cutadapt module ( Figure 1).

Re-Assembly of the VARV-VD21 Genome
The previously reported VARV-VD21 genome of 31 contigs was created using de novo and reference mapped assemblies. However, the GenBank file (KY358055.1) of this ancient genome contains 121 groups of Ns, which range in size from 1 to 626 nt and total 3,748 nt. Therefore, before attempting to search for evidence of recombination in the severely fragmented genome, we set out to try and fill some of the gaps in the published sequence by using alternative assembly strategies. Our first attempts at assembling the data provided by the Sequence Read Archive (SRP091673), using Cutadapt [22] to trim Illumina adapters and MIRA (Mimicking Intelligent Read Assembly) [23] and SPAdes (St. Petersburg genome Assembler) [24] for assembly, created 56 contigs that could not be resolved by modifying the assembly parameters. However, when these contigs were viewed and aligned to the same reference sequence (VARV-India-1967) using BBB, it was obvious that the ends of the contigs were effectively terminated by reads that contained variants of the sequencing adapters normally removed by the Cutadapt module ( Figure 1). Therefore, we repeated the trimming process using Cutadapt a further three times, modifying the adapter information to match the remaining adapters at each step. This process reduced the number of contigs to 11, with gaps ranging from 0 to 22 nt when aligned to a reference genome; contigs with gaps of 0 nt abut but do not overlap. Even searches through the fastq files for specific nucleotide sequences within reads failed to find reads that could join these final contigs. Figure 2 displays the organization of our final assembly after alignment with VARV-Bang-1975. The majority of improvement in assembly and coverage came from the repeated removal of the fragmented Illumina adapters. However, we also accepted some very low coverage regions that were 100% consistent with a consensus built from all known VARV genome sequences. Following this reassembly step, these new contigs were compared to the published VARV-VD21 sequence and all differences were examined manually using TABLET software [25]. Surprisingly, of 79 nt differences between these two sequences, all but 10 had been called wrong by the SPAdes software, probably due to failed mismatch correction in regions of low coverage. Next, the new VARV-VD21 contigs were compared to previously sequenced VARV genomes, and nucleotides unique to this new genome were checked for correct nt calling with TABLET; all these SNPs were correct. Finally, we aligned the contigs with VARV-Bang75 as a reference (see below) and padded the contigs with Ns to create an alignable genome sequence. A comparison of our new assembly to the published VARV-VD21 sequence revealed that the number of contigs was reduced from 122 to 11 and that the total number of Ns was reduced from 3748 to 66; this revised version of the ancient VARV genome is Therefore, we repeated the trimming process using Cutadapt a further three times, modifying the adapter information to match the remaining adapters at each step. This process reduced the number of contigs to 11, with gaps ranging from 0 to 22 nt when aligned to a reference genome; contigs with gaps of 0 nt abut but do not overlap. Even searches through the fastq files for specific nucleotide sequences within reads failed to find reads that could join these final contigs. Figure 2 displays the organization of our final assembly after alignment with VARV-Bang-1975. The majority of improvement in assembly and coverage came from the repeated removal of the fragmented Illumina adapters. However, we also accepted some very low coverage regions that were 100% consistent with a consensus built from all known VARV genome sequences. Following this reassembly step, these new contigs were compared to the published VARV-VD21 sequence and all differences were examined manually using TABLET software [25]. Surprisingly, of 79 nt differences between these two sequences, all but 10 had been called wrong by the SPAdes software, probably due to failed mismatch correction in regions of low coverage. Next, the new VARV-VD21 contigs were compared to previously sequenced VARV genomes, and nucleotides unique to this new genome were checked for correct nt calling with TABLET; all these SNPs were correct. Finally, we aligned the contigs with VARV-Bang75 as a reference (see below) and padded the contigs with Ns to create an alignable genome sequence. A comparison of our new assembly to the published VARV-VD21 sequence revealed that the number of contigs was reduced from 122 to 11 and that the total number of Ns was reduced from 3748 to 66; this revised version of the ancient VARV genome is denoted as VARV-VD21-uvic and is used in all subsequent analyses. This new assembly has been submitted to GenBank (Accession number BK010317).
denoted as VARV-VD21-uvic and is used in all subsequent analyses. This new assembly has been submitted to GenBank (Accession number BK010317). For any comparative genomic analysis to be meaningful, it is essential the sequences that are aligned are both (1) optimally aligned and (2) related through common ancestry. As noted above, the organization of genes outside of the central core is such that large indels and paralogous genes often create alignment errors. Therefore, we processed the alignments that are used below in several ways to minimize alignment errors; some of these steps were included to make the data easier to view and interpret. First, we built a MAFFT MSA [17] of all previously published VARV whole genomes. Viewing of this alignment confirmed not only that the terminal regions of the VARV are the most variable but also that (1) it is impossible to create a meaningful alignment along the full length of all the VARV genomes, (2) default MAFFT parameters, set for large DNA sequences, fail to correctly align short regions that need multiple small indels, and (3) the vast majority of repeat regions and indels are within the 25 kb terminal regions. Interestingly, the first region we noticed that was misaligned was a region in VARV-India-1967, which was used as the reference strain by Duggan et al. Figure 3 illustrates a region unique to VARV-India-1967, as aligned by MAFFT to a consensus of all other VARV genomes. However, this alignment can be significantly improved, reducing 12 SNPs to seven by the redistribution of the seven gapped positions. Although the new alignment has seven independent gaps instead of two gaps of 2 nt and three gaps of 1 nt, this arrangement places every one of the seven gaps opposite a short homopolymer run, which strongly suggests that at least some of these might be sequencing errors in the VARV-India-1967 sequence. Therefore, we scanned the other SNPs that were unique to VARV-India-1967 for unusual features. In the short region, 49,859 to 50,005 (146 nt), we found a cluster of 13 such SNPs; a search of the database of poxvirus sequences revealed that this apparently unique VARV region from India-1967 matches perfectly to an orthologous region in a series of VACV isolates ( Figure 4). This result suggests further errors in the sequencing process of VARV-India-1967 or a recombination event between an ancestor of this virus and possibly a vaccinia virus (vaccine) isolate. In view of these results, we decided to select an almost gap-free central region of approximately 98 kb that could be checked for regions of misalignment for further phylogenetic analysis. In addition, we selected subsets of the available genomes, representing the most divergent of the completely sequenced VARV, cowpox (CPXV), and VACV groups and also included camelpox (CMLV) and taterapox (TATV) genomes. After simplifying the MSA by removing any nt column that contained an N (from VARV-VD21-uvic) and also any nt column that contained a gap character (in any genome), we compared VARV-VD21-uvic to the other VARV sequences. Within this region of the alignment, the 10 VARV-major strains had a range of 223 to 254 nt differences to VARV-VD21-uvic (mean 239), whereas VARV-India-1967 had 298 nt differences (>3 standard deviations above the mean). Given this result and the previous anomalies with VARV-India-1967 observed here and by others [26], we decided to exclude this genome from further phylogenetic analyses. For any comparative genomic analysis to be meaningful, it is essential the sequences that are aligned are both (1) optimally aligned and (2) related through common ancestry. As noted above, the organization of genes outside of the central core is such that large indels and paralogous genes often create alignment errors. Therefore, we processed the alignments that are used below in several ways to minimize alignment errors; some of these steps were included to make the data easier to view and interpret. First, we built a MAFFT MSA [17] of all previously published VARV whole genomes. Viewing of this alignment confirmed not only that the terminal regions of the VARV are the most variable but also that (1) it is impossible to create a meaningful alignment along the full length of all the VARV genomes, (2) default MAFFT parameters, set for large DNA sequences, fail to correctly align short regions that need multiple small indels, and (3) the vast majority of repeat regions and indels are within the 25 kb terminal regions. Interestingly, the first region we noticed that was misaligned was a region in VARV-India-1967, which was used as the reference strain by Duggan et al. Figure 3 illustrates a region unique to VARV-India-1967, as aligned by MAFFT to a consensus of all other VARV genomes. However, this alignment can be significantly improved, reducing 12 SNPs to seven by the redistribution of the seven gapped positions. Although the new alignment has seven independent gaps instead of two gaps of 2 nt and three gaps of 1 nt, this arrangement places every one of the seven gaps opposite a short homopolymer run, which strongly suggests that at least some of these might be sequencing errors in the VARV-India-1967 sequence. Therefore, we scanned the other SNPs that were unique to VARV-India-1967 for unusual features. In the short region, 49,859 to 50,005 (146 nt), we found a cluster of 13 such SNPs; a search of the database of poxvirus sequences revealed that this apparently unique VARV region from India-1967 matches perfectly to an orthologous region in a series of VACV isolates ( Figure 4). This result suggests further errors in the sequencing process of VARV-India-1967 or a recombination event between an ancestor of this virus and possibly a vaccinia virus (vaccine) isolate. In view of these results, we decided to select an almost gap-free central region of approximately 98 kb that could be checked for regions of misalignment for further phylogenetic analysis. In addition, we selected subsets of the available genomes, representing the most divergent of the completely sequenced VARV, cowpox (CPXV), and VACV groups and also included camelpox (CMLV) and taterapox (TATV) genomes. After simplifying the MSA by removing any nt column that contained an N (from VARV-VD21-uvic) and also any nt column that contained a gap character (in any genome), we compared VARV-VD21-uvic to the other VARV sequences. Within this region of the alignment, the 10 VARV-major strains had a range of 223 to 254 nt differences to VARV-VD21-uvic (mean 239), whereas VARV-India-1967 had 298 nt differences (>3 standard deviations above the mean). Given this result and the previous anomalies with VARV-India-1967 observed here and by others [26], we decided to exclude this genome from further phylogenetic analyses.

Phylogenetic History of VARV
As in the analysis performed by Duggan et al. we constructed an evolutionary time tree comparing the core regions of CMLV, TATV, and VARV strains ( Figure 5). However, the input data was somewhat different in that we (1) used a more complete version of the VARV-VD21 genome (VARV-VD21-uvic), (2) used a manually corrected alignment, (3) added CMLV and TATV genomes, (4) omitted VARV-India-1967, and (5) added a VARV-major genome from 1942 and a VARV-minor genome from 1850. The tree depicts the evolutionary divergence of the VARV species and provides estimates of the "time since their most recent common ancestor" (tMRCA), where the most recent common ancestor is the latest state of the ancestral virus from which each member of the group of viruses has directly descended. The dates depicted at the nodes of the tree represent the median date estimated from a Bayesian evolutionary analysis performed using BEAST 2 under a fixed clock model and using constant size coalescent priors as per the analysis performed by Duggan et al. [1]. The inclusion of CMLV and TATV sequences provides a stronger root to the rest of the VARV sequences for the estimation of the tMRCAs, as well as providing an estimate of when these orthopox viruses diverged into distinct species and perhaps became host specific.

Phylogenetic History of VARV
As in the analysis performed by Duggan et al. we constructed an evolutionary time tree comparing the core regions of CMLV, TATV, and VARV strains ( Figure 5). However, the input data was somewhat different in that we (1) used a more complete version of the VARV-VD21 genome (VARV-VD21-uvic), (2) used a manually corrected alignment, (3) added CMLV and TATV genomes, (4) omitted VARV-India-1967, and (5) added a VARV-major genome from 1942 and a VARV-minor genome from 1850. The tree depicts the evolutionary divergence of the VARV species and provides estimates of the "time since their most recent common ancestor" (tMRCA), where the most recent common ancestor is the latest state of the ancestral virus from which each member of the group of viruses has directly descended. The dates depicted at the nodes of the tree represent the median date estimated from a Bayesian evolutionary analysis performed using BEAST 2 under a fixed clock model and using constant size coalescent priors as per the analysis performed by Duggan et al. [1]. The inclusion of CMLV and TATV sequences provides a stronger root to the rest of the VARV sequences for the estimation of the tMRCAs, as well as providing an estimate of when these orthopox viruses diverged into distinct species and perhaps became host specific.

Phylogenetic History of VARV
As in the analysis performed by Duggan et al. we constructed an evolutionary time tree comparing the core regions of CMLV, TATV, and VARV strains ( Figure 5). However, the input data was somewhat different in that we (1) used a more complete version of the VARV-VD21 genome (VARV-VD21-uvic), (2) used a manually corrected alignment, (3) added CMLV and TATV genomes, (4) omitted VARV-India-1967, and (5) added a VARV-major genome from 1942 and a VARV-minor genome from 1850. The tree depicts the evolutionary divergence of the VARV species and provides estimates of the "time since their most recent common ancestor" (tMRCA), where the most recent common ancestor is the latest state of the ancestral virus from which each member of the group of viruses has directly descended. The dates depicted at the nodes of the tree represent the median date estimated from a Bayesian evolutionary analysis performed using BEAST 2 under a fixed clock model and using constant size coalescent priors as per the analysis performed by Duggan et al. [1]. The inclusion of CMLV and TATV sequences provides a stronger root to the rest of the VARV sequences for the estimation of the tMRCAs, as well as providing an estimate of when these orthopox viruses diverged into distinct species and perhaps became host specific. The most significant difference between the data presented in Figure 5 and the previous work is that our results push back the date of the tMRCA by about 100 years for all VARVs (from 1588-1645 to 1470-1563), as well as the split of the VARV major and minor clades (from 1734-1793 to 1579-1667). Interestingly, the dates for speciation of all VARV and the major/minor species are midway between those calculated by Duggan et al. [1] and those recently calculated by Pajer et al. [21], who sequenced the VARV-major genome from 1942 and the VARV-minor genome from 1850. The number of unique SNPs associated with particular viruses at nodes on the tree ( Figure 5) are consistent for all of the VARVs, suggesting that, after the speciation of VARV, there has been little to no recombination with other poxvirus species.

Drift in Nucleotide Composition
The nucleotide composition of poxvirus genomes is an interesting puzzle in that the A + T% varies between 33% (squirrelpox virus) and 82% (amsacta moorei entomopoxvirus), but the mechanism by which the drift occurs is unknown. Given that VARV-VD21-uvic represents a true ancient genome, we decided to determine if a difference in the A + T% could be detected between the ancient and modern VARV genomes. Using the same central core of 98 kb, BBB calculated the A + T% for the ancient and modern sequences to be 67.0 and 67.1, respectively. The plot nucleotide content function of BBB showed no discernable difference but emphasized that the A + T composition within a genome varies by more than 10% if a sliding window of 1 kb is plotted across the genome core. However, when we focused on each of the individual SNPs, comparing ancient to modern nucleotides, we found that approximately twice as many nt changes were guanine/cytosine (G/C) to The most significant difference between the data presented in Figure 5 and the previous work is that our results push back the date of the tMRCA by about 100 years for all VARVs (from 1588-1645 to 1470-1563), as well as the split of the VARV major and minor clades (from 1734-1793 to 1579-1667). Interestingly, the dates for speciation of all VARV and the major/minor species are midway between those calculated by Duggan et al. [1] and those recently calculated by Pajer et al. [21], who sequenced the VARV-major genome from 1942 and the VARV-minor genome from 1850. The number of unique SNPs associated with particular viruses at nodes on the tree ( Figure 5) are consistent for all of the VARVs, suggesting that, after the speciation of VARV, there has been little to no recombination with other poxvirus species.

Drift in Nucleotide Composition
The nucleotide composition of poxvirus genomes is an interesting puzzle in that the A + T% varies between 33% (squirrelpox virus) and 82% (amsacta moorei entomopoxvirus), but the mechanism by which the drift occurs is unknown. Given that VARV-VD21-uvic represents a true ancient genome, we decided to determine if a difference in the A + T% could be detected between the ancient and modern VARV genomes. Using the same central core of 98 kb, BBB calculated the A + T% for the ancient and modern sequences to be 67.0 and 67.1, respectively. The plot nucleotide content function of BBB showed no discernable difference but emphasized that the A + T composition within a genome varies by more than 10% if a sliding window of 1 kb is plotted across the genome core. However, when we focused on each of the individual SNPs, comparing ancient to modern nucleotides, we found that approximately twice as many nt changes were guanine/cytosine (G/C) to A/T (157 nt) than were A/T to G/C (79 nt; Table 2). This same dominance of G/C to A/T SNPs was found when comparing the VARV genome isolated in 1850 to the modern VARVs. Thus, the core of the VARV genome has increased in A + T richness by approximately 0.08% in about 300 years. Interestingly, although the CMLV core is slightly less A + T rich than the VARV core, when unique SNPs in CMLV are compared to a consensus of VARV, VACV, TATV, and CPXV-humAac09, there are four times as many SNPs that change G/C to A/T than that change A/T to G/C (Table 3).

Evolution of the VARV Ortholog of VACV-Cop-O1L
The key evolutionary events that lead to the creation of the human-specific pathogen VARV remain a mystery. However, in previous work, we suggested that one or more recombination events prior to the divergence of VARV, CMLV, and TATV produced a large number of SNPs in the ortholog of VACV-O1L, which has been identified as a virulence factor [27,28], may have been a significant factor [16]. Therefore, we examined the distribution of the SNPs associated with this gene in VARV-VD21-uvic and the recently sequenced CPXV-HumAac09, a representative of the VARV-like-CPXV clade [9]. First, a comparison of the VARV-VD21-uvic sequence with the modern VARV-O1L gene sequences found only six SNPs associated with changes distributed across the complete clades (major or minor) of VARV. Nucleotide changes 985R > K, 1029Silent and 1799A > V were found in all modern VARV-O1L gene sequences and nucleotide changes 775E > K, 1173Silent and 1942Y > H were found in the VARV-major-O1L gene sequences. When long sequences are used for phylogenetic analyses, the placement of the VARV-like-CPXVs and other CPXV clades on the phylogenetic tree is consistent. However, because of recombination events between these groups of viruses, it is important to remember that parts of the phylogenetic tree are not displaying true evolutionary relationships for all regions of the sequences. Furthermore, although the presence of orthopoxvirus recombination can be displayed on reticulated phylogenetic trees and with Bootscan analysis [9,29], these do not convey SNP detail about the relationships between the DNA sequences. Therefore, we used BBB to search for SNPs shared between particular sets of viruses and to provide a visual summary of the SNP patterns. The BBB visual summary ( Figure 6) shows SNP differences in sequences compared with the top sequence in an MSA and also allows the user to change the scale of the display and re-order the sequences, both of which are very useful for discovering hidden relationships between the sequences. For example, in Figure 6, the top panel shows a mostly random distribution of SNPs among the VARV + TATV + CMLV sequences; the middle panel shows two large blocks of perfect identity between CPXVs GER-1998 and GER91 at the left side of the alignment; and the bottom panel compares CPXV-HumAac09 to the other sequences and shows a patchwork of regions in the various viruses that are identical to the VARV-like CPXV sequence. of VACV-O1L, which has been identified as a virulence factor [27,28], may have been a significant factor [16]. Therefore, we examined the distribution of the SNPs associated with this gene in VARV-VD21-uvic and the recently sequenced CPXV-HumAac09, a representative of the VARV-like-CPXV clade [9]. First, a comparison of the VARV-VD21-uvic sequence with the modern VARV-O1L gene sequences found only six SNPs associated with changes distributed across the complete clades (major or minor) of VARV. Nucleotide changes 985R > K, 1029Silent and 1799A > V were found in all modern VARV-O1L gene sequences and nucleotide changes 775E > K, 1173Silent and 1942Y > H were found in the VARV-major-O1L gene sequences. When long sequences are used for phylogenetic analyses, the placement of the VARV-like-CPXVs and other CPXV clades on the phylogenetic tree is consistent. However, because of recombination events between these groups of viruses, it is important to remember that parts of the phylogenetic tree are not displaying true evolutionary relationships for all regions of the sequences. Furthermore, although the presence of orthopoxvirus recombination can be displayed on reticulated phylogenetic trees and with Bootscan analysis [9,29], these do not convey SNP detail about the relationships between the DNA sequences. Therefore, we used BBB to search for SNPs shared between particular sets of viruses and to provide a visual summary of the SNP patterns. The BBB visual summary ( Figure 6) shows SNP differences in sequences compared with the top sequence in an MSA and also allows the user to change the scale of the display and re-order the sequences, both of which are very useful for discovering hidden relationships between the sequences. For example, in Figure 6, the top panel shows a mostly random distribution of SNPs among the VARV + TATV + CMLV sequences; the middle panel shows two large blocks of perfect identity between CPXVs GER-1998 and GER91 at the left side of the alignment; and the bottom panel compares CPXV-HumAac09 to the other sequences and shows a patchwork of regions in the various viruses that are identical to the VARV-like CPXV sequence. Previously, we found a block of SNPs in the O1L gene with the BBB search parameter "find position of nucleotides that are identical in VARV + CMLV + TATV but different in all CPXVs and VACVs" [16]. However, when this analysis was repeated with the new CPXVs included, no SNPs were found in O1L with the same query ( Figure 7); a phylogenetic tree of these genomes is shown in Figure 8. This result indicates that the previously identified SNPs must be present in some of the new Previously, we found a block of SNPs in the O1L gene with the BBB search parameter "find position of nucleotides that are identical in VARV + CMLV + TATV but different in all CPXVs and VACVs" [16]. However, when this analysis was repeated with the new CPXVs included, no SNPs were found in O1L with the same query ( Figure 7); a phylogenetic tree of these genomes is shown in Figure 8. This result indicates that the previously identified SNPs must be present in some of the new viruses, the most likely being CPXV-HumAac09, which represents the VARV-like-CPXVs. Therefore, we repeated the BBB query with the addition of a "tolerance of 5" parameter, which instructs BBB to allow up to five other sequences from the alignment to match the nucleotides otherwise unique to VARV + CMLV + TATV. BBB provides the position of the SNP and the number and names of the sequences falling into the set of "tolerated" matches ( Figure 7). Most of the SNPs previously found to be unique to VARV + CMLV + TATV were found to be also present in CPXV-HumAac09 but not in any other CPXV strain (highlighted yellow in Figure 7). Another block of the original SNPs is subdivided into groups that are also in CPXV-HumAac09 + CPXV-HumLit, CPXV-HumAac09 + CPXV-HumLit + CPXV-Ger1980, or CPXV-HumLit + CPXV-HumBer. There are several other blocks of SNPs associated with VARV + CMLV + TATV, CPXV-HumAac09, and one to four other CPXVs in this particular alignment (Figure 7). Thus, the vast majority of the SNPs present in O1L and associated with the VARV clade of orthopoxviruses were introduced before the split of VARV + CMLV + TATV and the VARV-like-CPXVs, which can be represented here by CPXV-HumAac09 because they are almost identical. It is noteworthy that the O1L gene is nonfunctional in both TATV and CMLV but maintained in VARV genomes.
This unusual distribution of mutations within the O1L ortholog, which does not fit the phylogenetic tree, can be further demonstrated with a comparison of the concatenated protein sequences from the same central core (93 genes; F4L-A24R) of these viruses. The related BBB query "find positions of aa that are identical in VARV + CMLV + TATV + CPXV-Aac09 but different in all other CPXVs and VACVs" revealed a total of only 21 unique aa. However, six of these were concentrated within a 40 aa region of the O1L C-terminus.  x x  x  x  25133  x  x  x  x  25187  x  x  25205  x  x  25206  x  x  25229  x  x  25238  x  x  25241  x  x  25253  x  x  25254  x  x  25262  x  x  25271 x x x x 25648 x x x 25652 x x x 25653 x x x 25654 x x x 25656 x x x 25658 x x 25692 x x x 25695 x x x 25698 x x x 25700 x x x 25731 x x x x 25758 x x x 25761 x x x 25797 x x x x x 27533 x 27668 x

Discussion
The history and evolution of variola virus is important because the devastating disease that this poxvirus causes has been instrumental in shaping human populations [4,26]. Although this virus has been eradicated from the global environment by the development and implementation of vaccination, there are several other poxviruses that, either naturally or as an epizootic, infect and cause disease in humans; these include molluscum contagiosum virus (MOCV) [30], monkeypox [5], V_014 [31], escaped VACV [32], orf [33], CPXV [9], and a novel orthopoxvirus isolated in Alaska [34]. Therefore, it is important to understand the viral mechanisms that control pathogenesis and host range. Based on a bioinformatics analysis of the orthopoxviruses, which found an unusual group of SNPs associated specifically with the VARV ortholog of VACV-Cop-O1L, we had previously hypothesized that this gene might have influenced the virulence and/or the host range of the evolving VARV clade of viruses. Curiously, this gene has been lost from CMLV and TATV, which also have very restricted host ranges and are the two most similar viruses to VARV.
It was therefore of great interest when several VARV genomes were sequenced from relatively ancient samples. Unfortunately, the genome sequence from the oldest virus, isolated from a mummified child approximately 360 years old, was reported as a GenBank file (Accession number: KY358055.1) with 3748 unknown nucleotides that broke the genome up into 122 segments. Before analyzing this highly fragmented DNA sequence, we attempted to improve the genome assembly and the quality of the sequence using the previously published sequence data. During this process,

Discussion
The history and evolution of variola virus is important because the devastating disease that this poxvirus causes has been instrumental in shaping human populations [4,26]. Although this virus has been eradicated from the global environment by the development and implementation of vaccination, there are several other poxviruses that, either naturally or as an epizootic, infect and cause disease in humans; these include molluscum contagiosum virus (MOCV) [30], monkeypox [5], V_014 [31], escaped VACV [32], orf [33], CPXV [9], and a novel orthopoxvirus isolated in Alaska [34]. Therefore, it is important to understand the viral mechanisms that control pathogenesis and host range. Based on a bioinformatics analysis of the orthopoxviruses, which found an unusual group of SNPs associated specifically with the VARV ortholog of VACV-Cop-O1L, we had previously hypothesized that this gene might have influenced the virulence and/or the host range of the evolving VARV clade of viruses. Curiously, this gene has been lost from CMLV and TATV, which also have very restricted host ranges and are the two most similar viruses to VARV.
It was therefore of great interest when several VARV genomes were sequenced from relatively ancient samples. Unfortunately, the genome sequence from the oldest virus, isolated from a mummified child approximately 360 years old, was reported as a GenBank file (Accession number: KY358055.1) with 3748 unknown nucleotides that broke the genome up into 122 segments. Before analyzing this highly fragmented DNA sequence, we attempted to improve the genome assembly and the quality of the sequence using the previously published sequence data. During this process, we discovered that our first assembly was similarly poor but observed that the contigs were in fact terminated by variant Illumina adapters that had not been removed from the sequencing reads by the Cutadapt software. Therefore, we repeated the assembly process with a series of adaptor sequences removed and followed up with a manual review of the assembly, accepting low coverage reads that agreed 100% with the consensus sequence from multiple VARV genomes. Thus, we arrived at a greatly improved genome sequence assembly that reduced the number of sequence segments from 122 to 11 and the total number of Ns from 3748 to 66. This sequence will permit a much more comprehensive comparison of the ancient and modern VARV genome sequences. Furthermore, the discovery of variant adaptor sequences highlights a serious problem for the assembly of sequencing reads from very old DNA samples when every read counts.
By using the improved assembly of the ancient VARV-VD21 genome together with TATV + CMLV genomes for phylogenetic analysis and the prediction of clade divergence dates, we determined a tMRCA for all VARVs within the range of 1470-1563. This is approximately midway between the latest two estimates from the publications describing the ancient VARV genomes [1,21]. The BEAST software predicted a split of the VARV and CMLV + TATV clades approximately 1250 years prior to the tMRCA for all VARVs. However, such predictions have limited accuracy, and, based on the simple mutation rate of about 40 SNPs per 100 years seen in the core region of the VARV clade, the tMRCA for CMLV + TATV + VARV would be approximately 2000 years before the tMRCA for all VARVs.
An ortholog of the VACV-Cop-O1L gene can be found in the vast majority of poxviruses, and its absence from TATV, CMLV, and some attenuated myxoma and vaccinia viruses is notable. When mapping clade specific SNPs previously [16], our attention was attracted to this gene because of the high number of SNPs and their organization, which suggested that recombination events had been responsible for these modifications. With the sequencing of the ancient VARV genome and many new CPXV genomes, including the VARV-like CPXVs, we were able to determine that the introduction of these SNPs (Figure 7) took place before the VARV + TATV + CMLV and the VARV-like CPXVs diverged. Although, as we previously reported, twelve more amino acid changes occurred in the O1L protein after the split of VARVs from CMLV and TATV, this result shows that a closely related precursor of the VARV-O1L ortholog is currently circulating in the environment within the VARV-like CPXV population. Therefore, if this protein is indeed associated with the host-adaption of an ancestral VARV to humans, then the ortholog in the VARV-like CPXVs may have some amino acid changes that are associated with adaption to humans [35]. Clearly, this is not an all-or-nothing event; VACVs grow very well in human cells in vitro, as well as in vaccinated humans. In fact, VACV sometimes grows too well in humans and generates adverse vaccine reactions [36]. Similarly, the deletion of the gene is not lethal but is known to affect plaque size [27]. No mammalian ortholog of O1L has been identified, but the biological role of the protein appears to act in conjunction with the poxvirus epidermal growth factor-like protein to sustain activation of the mitogen-activated protein kinase kinase/extracellular signal-regulated kinases (MEK/ERK) pathway [28], thereby promoting the survival of the infected cell in the face of apoptotic signals aimed at killing the cell. Similarly, no mechanism has been proposed for this activity, but if protein-protein interactions are involved then it seems likely that relatively minor changes in the amino acid sequence of this protein could modify the protein such that it performs its function more effectively (or less well) in different mammalian species, thereby generating host specificity.
It was also of interest that a drift towards A + T richness could be observed by examining the SNPs in modern VARVs with the ancient genome as a reference. Approximately twice as many G/C to A/T changes than A/T to G/C changes were observed, but, since this drift is taking place in a genome that is already A + T rich (67%), the bias is actually greater. When the genes from A + T and G + C rich poxviruses are compared, not only are there changes to the codon usage for the particular amino acids, but there are also significant changes to the aa composition. For example, isoleucine and arginine are high in A + T rich poxviruses, with respect to alanine and lysine, respectively, whereas the opposite is true in G + C rich poxviruses. Although genus specific analyses of codon use have been performed in the past [37], it is important that the open reading frames (ORFs) used do in fact represent functional genes; otherwise the data is contaminated by non-coding sequences [38]. Some small regions with unusual nucleotide composition are clearly the result of horizontal gene transfer [39,40], whereas the presence/absence of ribonucleotide reductase subunits [41] and a protein associated with the anaphase-promoting complex [42] have more tenuous associations to the control of the overall nucleotide composition of poxvirus genomes. It will be very interesting to re-analyze codon usage with recently sequenced poxviruses since some of these (e.g., kangaroo poxvirus, unpublished) have intermediate A + T richness levels.

Conclusions
We report a much-improved assembly of a previously published NGS data set for an ancient VARV genome. This improvement was accomplished by modifying the procedure to remove Illumina sequencing tags from the read data and "manually" searching data for reads capable of spanning contig junctions. In addition, we reviewed all regions of the assembly ourselves rather than relying on automation and sequence depth for read support. Using this new assembly with TATV and CMLV genome sequences, we have derived new dates for the tMRCA for the VARV viruses and were able to observe increasing A + T richness in the VARV lineage. Finally, we placed recombination events that modified the VARV ortholog of VACV-Cop-O1L and that may have contributed to host range and virulence of VARV prior to the divergence of the VARV-like CPXVs and the VARV + CMLV + TATV clade.