Orthopoxvirus Genome Evolution: The Role of Gene Loss

Poxviruses are highly successful pathogens, known to infect a variety of hosts. The family Poxviridae includes Variola virus, the causative agent of smallpox, which has been eradicated as a public health threat but could potentially reemerge as a bioterrorist threat. The risk scenario includes other animal poxviruses and genetically engineered manipulations of poxviruses. Studies of orthologous gene sets have established the evolutionary relationships of members within the Poxviridae family. It is not clear, however, how variations between family members arose in the past, an important issue in understanding how these viruses may vary and possibly produce future threats. Using a newly developed poxvirus-specific tool, we predicted accurate gene sets for viruses with completely sequenced genomes in the genus Orthopoxvirus. Employing sensitive sequence comparison techniques together with comparison of syntenic gene maps, we established the relationships between all viral gene sets. These techniques allowed us to unambiguously identify the gene loss/gain events that have occurred over the course of orthopoxvirus evolution. It is clear that for all existing Orthopoxvirus species, no individual species has acquired protein-coding genes unique to that species. All existing species contain genes that are all present in members of the species Cowpox virus and that cowpox virus strains contain every gene present in any other orthopoxvirus strain. These results support a theory of reductive evolution in which the reduction in size of the core gene set of a putative ancestral virus played a critical role in speciation and confining any newly emerging virus species to a particular environmental (host or tissue) niche.


Introduction
Poxviruses are a family of viral pathogens known to infect a variety of organisms including insects, reptiles, birds and mammals. The wide distribution of poxviruses in nature suggests that an ancestral virus to this family might have been able to infect a common ancestor of vertebrates and invertebrates [1]. Based on some of these differences in host range, the Poxviridae family of viruses is subdivided into two subfamilies, the Chordopoxvirinae (ChPV), which infect vertebrates, and the Entomopoxvirinae, which infect insects. Each of these subfamilies is further subdivided into genera based on shared characteristics such as host range, morphology, antigenicity, and sequence similarity [2]. Host range represents one of the significant major phenotypic differences between members of the Poxviridae family [3], and there have been suggestions that one of the major evolutionary driving forces of this virus family has been co-speciation with their hosts [4,5]. Species in the Avipoxvirus genus only infect birds, though abortive infections can occur in other animals [6]. Members of the genus Capripoxvirus infect ruminants, including cattle, sheep, and goats [7]. Suipoxvirus species infects only swine [8] and Leporipoxvirus species infect only rabbit (leporid) species [9]. Viruses in the Parapoxvirus genus have a broader host range infecting animals in the superorder Laurasiatheria and may also occasionally infect humans [10,11]. Members of the Yatapoxvirus genus infect primate species [12,13]. Cervidpoxviruses infect species of deer [14], while the lone species in the genus Molluscipoxvirus is a human-specific pathogen and only causes serious problems for immunosuppressed individuals [15]. Viruses in the species Squirrelpox virus infect various squirrel species [16], and Crocodilepox virus was isolated from the Nile crocodile [17]. Viruses belonging to the various species that are members of the Orthopoxvirus genus have a broad host range and infect a wide variety of mammalian species including humans [18,19]. Viruses that are members of the Cowpox virus species primarily appear to infect rodent species as their natural hosts, but can also infect many other mammals including humans [20]. Members of the Monkeypox virus species also have a wide host range, infecting humans, non-human primates, and other large animals, as well as a large variety of rodents [21]. In fact, rodent species may represent the natural host of monkeypox viruses. The host range of ectromelia virus is more restricted, infecting mice and moles [22], while taterapox virus infects gerbils [23]. Camelpox virus is known to only infect camels [24] and variola virus, the causative agent of smallpox, is specifically restricted to humans [25]. The origins and natural host of viruses belonging to the species Vaccinia virus, that includes viruses used as vaccines to eradicate variola virus from the human population, remain unknown. But in laboratories, vaccinia viruses are able to infect a variety of species, and there have been occasional outbreaks of vaccinia among bovine populations in South America [26,27]. The Orthopoxvirus genus also includes two species whose viruses are native to North American mammalian hosts [18]. Complete genome sequences for strains of these species, Raccoonpox virus and Volepox virus as well as the related skunkpox virus are not yet publicly available, so these viruses could not be included in any genome-sequence-based analysis. They appear to form a distinct clade well separated on a phylogenetic tree from all other orthopoxvirus species. As complete genomic sequences for these viruses become available, it will be important to include them in analyses such as those reported in this manuscript.
The limited host range of some poxviruses in contrast to the broader host range of other poxviruses suggests the existence of gene-specific determinants that are responsible for the various host-range phenotypes [19]. In addition, the consequences of infection can vary from inapparent or very mild illness for some virus-host combinations, to significant disease and high mortality rates for others [23,28]. Studies to better define and understand the evolution of poxvirus species are in many cases attempting to discover the important genotypic differences that are responsible for the phenotypic differences in host range and disease.
A genome sequence is considered as the ultimate genetic map defining a species [29]. DNA sequence data provides the key information for determining the phylogenetic relationships among species, which in turn provides the framework for comparative approaches for biological investigation. The availability of genome sequence data and improvements in methods to analyze these data supports the process of comparative genomics, which allows us to discern both the common and contrasting features between different virus species (and virus strains) at the genome level. Comparative genomics promises a much more thorough and systematic approach to understanding the genetic diversity of species that leads to their different phenotypic properties, and determining the functions of newly identified genes in one species by studying their counterparts in other species [30]. These analyses then proceed to reveal the evolutionary history of, and the relationships between, species [31].
Since the early 1990s, efforts to completely sequence the genomes of multiple poxvirus species and strains have reached the point where currently over 120 complete genome sequences are publicly available. Complete genomic sequences are now available for representative isolates of all genera and most species of the subfamily Chordopoxvirinae as well as a few strains that belong to taxa in the subfamily Entomopoxvirinae. Poxvirus genomes contain a single linear molecule of dsDNA ranging in size from approximately 133,000-134,000 base pairs for members of the Parapoxvirus and Yatapoxvirus genera [32,33], to almost 360,000 base pairs for Canarypox virus, a member of the Avipoxvirus genus [34]. The two telomeres at the ends of the dsDNA genome form covalently closed hairpin structures at the termini [35]. Near the termini are sequences responsible for concatemer resolution of replication intermediates as well as a variable series of direct, tandem repeat sequences [36]. Finally, the ends of the viral genome contain inverted terminal repeats (ITR) that vary in size between species [37,38]. These ITRs can be large enough to contain the coding region for multiple genes, and genes contained within the ITRs are present as diploid copies. For orthopoxviruses, the size of the ITRs range from approximately 200 to 500 base pairs for variola viruses that contain no genes within their ITR, to almost 12,000 base pairs for several vaccinia virus strains containing six diploid genes within their ITR.
The coding potential of poxvirus genomes ranges from approximately 133 genes in parapoxviruses and yatapoxviruses to 328 genes in canarypox virus [39]. Overall, genome organization and syntenic gene locations are consistently maintained throughout Chordopoxvirinae species with the exception of a large genome inversion in avipoxvirus genomes [40]. Since poxviruses replicate in the cytoplasm of host cells, they must code for all proteins and enzymes required for their own transcription and replication instead of relying on the host proteins present in the nucleus [25,41]. These viral genes, encoded in the central portion of the poxvirus genome, generally code for functions involved in basic viral replicative processes: proteins involved in transcription, DNA replication, and virion assembly and release. In orthopoxviruses, this central core region of the genome comprises approximately 75% of the complete sequence and is the most conserved region of the genome. Genes present towards either end of the genome are much more variable between species in terms of both nucleic acid and amino acid sequence as well as whether or not the genes are present or totally absent from the genome. Proteins coded for by the genes in these variable regions are largely involved in host interactions including host range, immunomodulation, and pathogenicity [42,43].
Based on the analysis of poxvirus genomic sequences, a number of different studies have established the evolutionary relationship of members within the Poxviridae family [1, 4,23,39,[44][45][46][47][48]. Evolutionary studies can focus on different aspects of the overall variability that has occurred in poxvirus genomes over the course of their evolutionary history and the mechanisms by which this variation is generated. These mechanisms may include nucleotide sequence variation resulting in single base changes and small insertions and deletions; acquisition of new genes and genetic material through recombination resulting in horizontal gene transfer (HGT); and the loss of existing gene function through the fragmentation and loss of genetic material. The consequences of this variation ranges from neutral changes, to changes in protein function due to amino acid variation, to loss of gene function due to gene fragmentation, to acquisition of new function through HGT, and also potentially to changes in gene expression due to changes in the regulatory motifs such as promoter sequences that control gene expression. Therefore studies that attempt to reconstruct the evolutionary history of poxvirus species must, for example, include phylogenetic inferences based not only on the comparison of sequence information, but also gene content to illustrate some of the higher-level evolutionary processes that influence poxvirus variation [45,47]. Therefore to fully explore the potential changes in genotypic-phenotypic relationships that have occurred in the poxvirus lineage requires that an accurate, and consistently predicted gene set be utilized to ensure that the complete genomic complement of each virus species is compared and contrasted with that of every other virus species. Unfortunately, few previous studies have reassessed the available public annotation of the gene sets for each viral isolate under study. Instead, these studies relied on the annotations that were published along with the GenBank record of the genome sequence. These GenBank annotations are in many cases known to be inaccurate, and have not been updated since the original sequence record was released [49,50].
Our goal in this present work is to predict and utilize a much more accurate gene set for orthopoxviruses, and together with comparative genomic approaches, to identify the genetic diversity between poxvirus species, not only on a sequence comparison level, but also at higher levels that include analyses involving comparisons between sets of functional, as well as fragmented and missing genes. We describe the development and use of bioinformatics tools that allow us to re-predict the gene set coded for by representative isolates of each orthopoxvirus species. We then use these new gene sets to more fully explore the important genotypic and coding differences between species that may lead to better explanations for some of the phenotypic differences that result in differences in host range, immunomodulation, and ultimately pathogenicity and disease.

Poxvirus Phylogeny
Evolutionary analysis of the complete Poxviridae family is difficult due to the extensive divergence seen in gene and sequence content as well as gene synteny differences when comparing members of the Entomopoxvirinae and Chordopoxvirinae subfamilies. In fact, species in separate genera within the Entomopoxvirinae subfamily show almost as much divergence between themselves as they do with ChPVs [51,52]. Nevertheless, high-quality alignments can be generated using a subset of well-conserved genes shared between each subfamily, and these genes can be used to assess the evolutionary history for the whole Poxviridae family of viruses. Figure 1 shows a Bayesian tree based on a concatenated set of aligned amino acid sequences of the 20 conserved genes that could be unambiguously aligned (see Supplementary Table 1 for a list of these genes). The tree was inferred based on the amino acid alignment to minimize artifacts that arise from tree construction based on nucleic acid alignments of sequences with biased base compositions. (The family Poxviridae contains viruses with base compositions ranging from approximately 18% GC for entomopoxvirus strains to 64% GC for parapoxviruses and molluscum contagiosum virus.) This tree clearly delineates the two Poxviridae subfamilies, and shows the phylogenetic relationships that exist between the Chordopoxvirinae genera. Using datasets containing larger numbers of shared orthologous genes, it is possible to further increase the resolving power of these evolutionary analyses and support more detailed analysis of the evolutionary history of these viruses [53]. While 20 gene families could be unambiguously aligned to support the analysis provided above, at least 49 genes show some level of significant sequence similarity across the whole family. Within the chordopoxviruses at least 90 gene families exhibit significant homology across the subfamily, while all orthopoxviruses share a core set of approximately 174 genes, and some subset of the 214 genes present in cowpox viruses are shared between every strain of every species of virus belonging to the Orthopoxvirus genus (see below) [39].
To fully understand the biology of poxviruses-their replication cycle, host interactions, pathogenesis, epidemiology, and evolution-it is necessary to understand at the very least the functions of all proteins encoded by these viruses and their genotypic-phenotypic relationships. But to continue to explore gene function, an accurate set of genes must be available. And to compare and contrast the biology of multiple viruses, accurate gene sets must be available for each one of those viruses. Unfortunately, as one begins to explore the gene sets provided in the GenBank annotations of many of the poxvirus genomic sequence records, it quickly becomes apparent that many of these gene sets are inaccurate and contain many small gene fragments that are annotated as functional genes, while also missing other genes known to be transcribed, translated, and functional in these viruses. In addition, available information on the structure, function, and role of individual poxvirus genes and their contribution to the overall biology of the virus has been greatly enriched [25,[54][55][56][57] since the first poxvirus genome sequence was completed in 1990 [58,59]. However there has been little re-annotation of previously annotated genomes either to correct inaccurate gene predictions or functional annotations, or to identify previously unrecognized genes. Therefore, inconsistencies and inaccuracies remain common in the published annotations for poxvirus genomes.
So to begin a more detailed analysis of the evolution of the orthopoxviruses at the gene level, we first needed to develop a consistent, objective set of bioinformatics analyses to inform the process of genome annotation. We developed a software tool that could be used in a semi-automated manner to run these analyses, and provide for the visualization of the results. This tool then supported the final step of human-directed, manual refinement of the genome annotation. This allowed us to consistently predict gene sets for representative strains of all orthopoxvirus species, and then utilizing these gene sets, compare and contrast the coding potential and evolutionary history of each of these viruses.

Poxvirus Genome Annotation
Computational prediction of gene structures in genomic sequence has been one of the most active areas of bioinformatics research and has resulted in the development and application of many novel and innovative algorithms [60,61]. Approaches to gene prediction can be roughly divided into three categories: statistical approaches to look for features that appear frequently in gene regulatory and coding regions and infrequently elsewhere; similarity-based approaches where sequence similarity to previously identified genes and protein products provides evidence that a particular open reading frame (ORF) by inference may code for a functional protein; and phylogenetic approaches that rely on the observation that conserved regions in aligned genomic sequences of diverse species are more likely to contain coding regions and other functional motifs. But while gene prediction algorithms have steadily improved over the years, the performance of any one individual algorithm is still far from satisfactory, even for predicting simple gene structures like those present in poxviruses where RNA splicing does not occur and all genes are coded for by single, intact open reading frames. Therefore our approach to poxvirus genome annotation involves the utilization of integrated strategies that combine many different algorithms and analyses to improve the overall performance of our gene prediction pipeline. To support this goal, we developed the poxvirus genome annotation system (PGAS) to facilitate the re-annotation of currently available sequences and the annotation of forthcoming sequence projects. Using this system, we have re-annotated representative virus strains of all species in the genus Orthopoxvirus.
The Poxvirus Genome Annotation System PGAS consists of a computational pipeline that runs each genome sequence through a series of analyses using a variety of computational algorithms that provide various types of information pertinent to gene prediction. The results of these analyses are available from a genome visualization tool that supports the annotation of individual genomes, and the comparative analysis of sequence features between genomes. The pipeline proceeds as follows: First, ORFs greater than or equal to 30 amino acids in all six frames are translated into peptides. Homology searches are carried out using BLAST [62] to search for similarities in the NCBI nr protein database or a database of poxvirus proteins. Similarities to known protein family motifs are detected using HMMPFAM to search for hits in the Pfam database [63]. To be functional, a poxvirus ORF must be transcribed, and transcription is regulated by specific promoter sequence motifs. Different promoter motifs exist that interact with virus-encoded transcription factors to support transcription at early, intermediate, or late times after infection. To predict early and late promoter motifs, a novel algorithm was developed based on interpolated Markov models [64][65][66]. This algorithm extended a simple weight matrix model of bases present at specific positions by providing for sequence dependencies between positions in an alignment of known promoter sequences. These models were then used to search for similar motifs present in the genomic sequence, and ranked those motifs according to the degree of similarity to the promoter model. Predicted promoters present upstream of an ORF then provide evidence that supports the expression of that ORF. Potential genes were also evaluated and scored according to their match to a poxvirus gene model constructed by the program Glimmer 2 [66]. All results were then loaded into a Microsoft SQL Server relational database to support manual inspection and curation of the results using a customized graphical user interface (GUI). Gene assignment decisions were based on the evaluation of multiple lines of evidence such as adjacency of a potential promoter sequence, a reasonable Glimmer score, a statistically significant hit to the NCBI nr sequence database or the Pfam database of amino acid motifs, and most importantly, conservation of the ORF between phylogenetically related species. In PGAS, all of these data provide evidence that can be easily visualized using the GUI. Figure 2 provides a flowchart of the automated pipeline that generates the PGAS data. Figure 3 shows an example of the PGAS GUI, displaying a comparison between homologous regions of the cowpox virus (CPXV-GRI) and horsepox virus (HSPV) genomes, where a gene intact in CPXV-GRI is fragmented in HSPV.   The screen shots provided in Figure 3 only show a few of the available screens. The PGAS GUI application contains multiple panels for displaying the various types of evidence that support the identification of a particular ORF as a coding gene. The gene layout panel (top panel of Figure 3a) displays the arrangement of ORFs in the genome and also displays a graph showing the base compositional bias (GC base frequency) along the genome. The dual genome comparison panel (lower panel of Figure 3a) shows similarities detected by a BLAST search between neighboring ORFs around orthologous genes in any two genomes. Using this panel, a user can visualize the similarity between any two selected genes by aligning them with the Needleman-Wunsch algorithm [67]. A separate panel displays predicted promoters, early transcriptional termination signals, and in-frame and out-of-frame ATG triplets. A similarity comparison panel shows hits to the Pfam and NCBI nr database for each potential gene. Finally, a gene up-date panel provides detailed information on each gene, including the sequence and its coding potential as evaluated by Glimmer 2.0. This panel is then used to annotate the gene, provide links to pertinent references in the literature, and make the final assignment as to its coding potential.
During the process of gene annotation, not only are gene assignments made and confirmed, but additional features of the annotations are also assessed. Most-if not all-previous poxvirus annotations assumed that the first codon in an ORF is the translation starting point. Although this happens to be correct for many genes, some genes may not follow this simple rule. By taking into account the position of predicted promoters, strength of the Kozak consensus sequence [68], and comparing orthologous genes in different species, PGAS can identify a more likely translation start site. Alterations to the translation start site for a gene is indicated in Supplementary Table 1. (Of course, the only definitive means of determining the actual translation start site is through N-terminal sequencing of the translated protein.) We also use the comparative analysis features of PGAS to assign all predicted genes to orthologous families. These are families of genes conserved across two or more poxvirus taxa that show significant sequence similarity and therefore allow for an inference of common function to be made. Since the genes present in these orthologous families also share a common genome location and gene order when comparisons are made between all orthopoxvirus genomes-shared gene synteny-we refer to these orthologous families as "syntelogs": a contraction of "synteny" and "ortholog".
While annotating poxvirus genomes, many small sequence fragments can be annotated as partial sequences or pieces of coding ORFs [69]. With PGAS, comparing orthologous genes and syntenic genome regions between two closely related strains or species can easily identify gene fragments. Figure 4 shows two examples of genes that are fragmented in some orthopoxvirus species but not others. In the first example (Figure 4a) a gene coding for guanylate kinase is present between a hemagglutinin gene and a gene coding for serine/threonine kinase. In cowpox viruses, the guanylate kinase protein is 197 amino acids in length, similar in size to the 198 amino acid guanylate kinase protein present in mice. In variola, vaccinia, and ectromelia viruses, there is a break in the ORF resulting in a truncated protein. In camelpox, monkeypox, and taterapox viruses, the gene is so fragmented, that no transcribed coding region has been annotated in PGAS. As seen in Figure 4b, cowpox viruses code for a 1,279 amino acid protein that contains multiple repeats of the A-type inclusion (ATI) protein motif. ATI proteins are believed to form protein aggregates in the cytoplasm of virus-infected cells and these inclusions may be involved in virion assembly [70]. ATI proteins are coded for by most ChPVs, but as seen in Figure 4b, the ATI gene is fragmented in many orthopoxviruses.
When annotating genes, it can be difficult to determine if any particular fragment should be annotated as coding for protein. To provide consistency, we developed defined criteria for gene annotation. In this study, if an ORF is intact at its 5' end and retains a predicted promoter sequence, it is annotated as an intact gene if it is at least 80% or greater of the length of its intact counterpart. If an ORF is intact at the 5' end and maintains a predicted promoter sequence, but that ORF would code for a protein that is less than 80% of the length of the intact orthopoxvirus protein, then it annotated as a truncated gene. Any ORF that has lost its predicted promoter and/or has been significantly truncated at its 5' end is annotated as a fragmented gene. In most cases, we would not expect fragmented genes to be transcribed and/or translated into a functional protein product. If for any particular virus, no remnants of any significant sequence fragments can be detected for a particular gene, then that gene is annotated as missing in that virus. These criteria are admittedly somewhat subjective. And of course the only way to definitively assess the coding potential of any ORF is to experimentally determine if that ORF is transcribed and translated under a variety of in vitro and in vivo conditions.  Table 1.

Orthopoxvirus Gene Prediction
Using PGAS, we re-predicted and annotated the complete gene sets of representative strains of each orthopoxvirus species. The list of viruses that were analyzed is provided in Table 1, along with virus-specific information including the abbreviation used for that virus in the figures; the length of the viral genome; the length of the virus ITR regions; the total number of coding genes (intact plus truncated) predicted to be transcribed and translated into protein; and the number of genes present within the ITR and are therefore present as diploid copies in that particular virus. The GenBank accession number for the genome sequence along with the Pubmed ID of the publication describing the genome sequencing are also provided. The genome sequence analyzed for the Copenhagen strain of vaccinia virus is one base larger than that presented in its GenBank record (accession number M35027) due to an additional base inserted into the A2.5L gene. The absence of this base in the M35027 sequence is the result of a reported sequencing error [71,72].
The lengths of the ITRs reported in Table 1 should be considered to be approximate. These were compiled by determining the number of complementary bases that appear at each end of the reported genomic sequences. But for each genome, the beginning and end of the reported sequence starts and stops at different points near the termini of the linear genome. A few of the available sequences do contain the complete sequence up to (and through) the ends of the genome (e.g., VACV-WR), but most do not. So the true length of the ITRs is unknown, but in all cases the reported length is probably within a few tens, or at worst, a few hundreds of bases of the actual length.
The complete set of genes annotated for each genome is provided in Supplementary Table 1 along with a functional identification for each gene. Figure 5 shows, as an example, the resulting gene map for Variola virus strain Brazil 1966. Intact genes are displayed in light green, truncated genes in dark green, and fragmented genes in yellow. As can be seen, most of the fragmented and truncated genes are present towards the ends of the genome, while most of the genes in the central region of the genome are intact. This pattern corresponds to that seen for all poxviruses with the conserved core set of genes that are responsible for basic replicative processes mostly intact; while the genes coding for proteins that interact with the host and that confer some of the unique biological properties of different viruses showing much more variability.

Comparative Analysis of Orthopoxvirus Gene Content
When compared at the sequence level, any two viruses from different species in the Orthopoxvirus genus share at least a 96% identity when compared at the nucleotide level over the length of the alignable region of their genome, while any two strains from the same species show at least a 99% nucleotide sequence identity [28]. This alignment includes the core region of the genome along with a portion of the more variable region up to the point at which differential copies of repeat sequences, large deletions, and the variable ITR regions prevent the construction of a reasonable alignment [24]. (The left end of the alignment extends from orthologous genes represented by VACV C7L, position 15,716 of the VACV-WR genome, to A51R at the right end of the genome at position 158,673). The presence of multiple, conserved genes in every viral genome provides us the opportunity to compare phylogenetic relationships based on a comparison of the aligned gene sequences of these viruses. As seen in Figure 6, strains of each virus species group together in separate clades. This tree is consistent with those reported in the literature and shows the fairly broad clade of strains that comprise the species Vaccinia virus, and the two separate clades present for both monkeypox viruses and for variola viruses. For monkeypox viruses, the lower pathogenicity Western    It is interesting to note that as has been previously observed, the three currently available completely sequenced strains assigned to the Cowpox virus species do not form one separate clade when analyzed based on their gene sequences. CPXV-GRI forms a branch that lies at a point near the base of the vaccinia virus lineage, while both CPXV-BR and CPXV-GER form a separate clade between the ectromelia virus lineage and the camelpox/taterapox/variola virus lineage. The assignments of these three strains of cowpox virus to a single species were based on shared nonsequence biological properties such as host of isolation, lesion, growth properties, and morphology. But based on sequence-only comparisons and existing demarcation criteria for Orthopoxvirus species, these viruses seem to clearly belong to two separate species. As discussed below, this demarcation into two species is not supported when comparing these viruses based on their gene content.
As emphasized above and in previous publications, when a direct comparison is made between shared orthologous gene families, sequence variation and gene variation is greatest towards both ends of the linear genome [39,44,46,47]. Examining differences in gene length for each orthologous family across the Orthopoxvirus genus further emphasizes this variation. Figure 7 plots the length of each orthologous gene as a percentage of the length of the corresponding cowpox virus gene that is used as a reference. A great deal of variability in gene (and protein) length is observed in genes that are coded for near the ends of these genomes, while much less variability is observed in the core central region of these genomes. (The few anomalous points that appear to be extremely long genes coded for by non-cowpox virus strains are due to genes in particular strains where small repeat sequences have been greatly expanded in number thus increasing the size of the overall gene.) Figure 7. Comparison of orthopoxvirus gene lengths. The length of every annotated orthopoxvirus gene from the genomes listed in Table 1 was compared to the length of the corresponding cowpox virus ortholog. The length of each gene as a percentage of the length of the longest cowpox virus strain gene was plotted with respect to its genomic position in CPXV-GRI.
A graphical view of the genus-wide pattern of gene content is provided in Figure 8. Each column in the figure represents a unique orthopoxvirus gene family, which are ordered according to their syntenic position in CPXV-GRI. (The 214 columns representing unique syntelog gene families are split into three panels representing the left-third, middle-third, and right-third of the genomes.) Each row corresponds to a separate virus strain, and the strains are ordered according to the degree of conservation of orthologous family genes. Each cell within the figure is colored according to the annotation assigned to that gene for that virus within PGAS. Genes are colored according to their degree of conservation: intact (light green), truncated (dark green), fragmented (yellow), and missing genes (red). The map uses genes from CPXV-GRI as a reference to indicate genomic position of the end of each gene (vertical numbers above each column). The two genome positions marked with an asterisk * indicate that the corresponding ORF in CPXV-GRI is not present, and the numbers provided correspond to CPXV-GER gene stop location. Arrows above the gene family columns indicate the direction of transcription and are color coded according to experimentally determined and predicted expression temporality based on gene expression analysis of the corresponding gene in vaccinia virus [81]: early (green arrows), late (red), early/late (yellow), intermediate (blue), and unknown (black). Genes present in the ITR of any particular genome are outlined in a heavy black box. The five genes that are part of the cowpox virus ITR are repeated at either end of the figure to represent the fact that they are present in diploid copies in these cowpox virus strains. Genes labeled 3' are part of a gene family that is present at the 5' end of cowpox viruses, but is located near (but not a part of) the 3' terminal ITR of the indicated genome. Genes labeled 5' are part of a gene family that is present at the 3' end of cowpox viruses, but is located near (but not a part of) the 5' terminal ITR of the indicated genome.
The pattern observed of intact, truncated, fragmented, and missing genes in orthopoxvirus strains, further emphasizes the conservation of genetic material in the core region of poxvirus genomes, and the variability near the ends of the genome. Importantly, a pattern begins to become apparent in the degree to which genes are conserved for any particular strain and species. This is summarized in Figure 9 and Table 2 where a summary of the number of genes conserved, fragmented, and missing in each strain is provided. The three cowpox viruses contain every gene present in every other orthopoxvirus species. (This is another characteristic that these three strains have in common irrespective of the fact that their sequence-based phylogeny predicts that they belong to two separate species.) Therefore, given the reasonable inference that the progenitor to all current-day orthopoxvirus species contained all genes currently present in all of these species, then cowpox virus appears to be most like that progenitor virus, at least in terms of gene content. Viruses of every other orthopoxvirus species contain a subset of those genes.
While strains of cowpox virus contain an essentially complete set of gene family orthologs, variola viruses in contrast, contain the most restricted set of genes for any naturally-occurring orthopoxvirus. (Only VACV-MVA contains fewer genes, but MVA is an attenuated vaccine strain of vaccinia virus that was isolated following multiple passages in vitro [80].) Variola viruses contain a set of 162 intact and 17 truncated genes that would be predicted to code for functional protein. This is in contrast to cowpox viruses that code for up to 214 functional genes. It is interesting that viruses belonging to the orthopoxvirus species Cowpox virus, that infects the widest variety of host species, contains the largest number of genes of any species in the genus; while viruses belonging to the species Variola virus, code for the most restricted gene set, but are the most host restricted, and at the same time the most pathogenic of any other species in the genus.
One additional method for assessing gene content differences between orthopoxvirus strains is to infer their phylogenetic relationship based on these gene content differences, and not based on the usual multiple sequence alignment [48,[82][83][84][85]. Figure 10 shows a phylogenetic inference of virus strains based on their gene content. Each gene in each genome was coded as to whether or not it was intact, truncated, fragmented, or missing in that virus strain. Then this gene content matrix was used as input to the program MrBayes to calculate the gene content phylogeny. For pairwise genome comparisons based on this matrix, each orthologous syntenic gene was compared between each genome, and a mismatch was scored if the character states of the two genes being compared did not match. Then MrBayes was used to infer the most probably phylogenetic relationship based on these gene content comparisons.    While similar to the sequence-based phylogeny presented in Figure 6, there are important differences observed when comparing the two trees. In the gene content tree, the three cowpox virus strains form a common clade since each contains an almost complete set of intact orthopoxvirus genes. This is in contrast to the two clades seen when assessing their phylogenetic relationship based on sequence comparisons. Taterapox virus and camelpox virus still form a common clade, but now, they form a clade with monkeypox viruses instead of with variola viruses. Ectromelia virus continues to lie on the most extended branch of the tree, while most of the vaccinia virus strains analyzed form an extremely broad divergent clade similar to, but perhaps even more extended then that observed in the sequence-based tree. This is probably explained by the artificial nature of vaccinia viruses: while their natural origin remains unknown, current strains are essentially laboratory viruses passaged under a variety of in vivo and in vitro conditions. It is therefore not surprising that they would exhibit extensive variability not only in their sequence, but also in their gene content. It is interesting that based on sequence analysis, horsepox virus is clearly a strain of the species Vaccinia virus [78]. But based on gene content, horsepox virus forms a clade separate from the clade containing all of the other vaccinia virus strains. This is in contrast to the sequence-based tree where horsepox virus lies on an extended branch of the vaccinia virus clade (Figure 6). Horsepox virus was isolated from a natural outbreak of disease in horses, and is therefore the only vaccinia virus in this study that was obtained from a natural outbreak as opposed to being derived from artificial passage in the laboratory. Supporting this difference, the gene content of horsepox virus is much more similar to that of the cowpox virus strains than to the other vaccinia virus strains. These differences between the gene content of horsepox virus when compared to the other vaccinia virus strains, may reflect a history of infection in a "natural" host and the selection pressures of such natural passage in vivo may require a broader complement of host-interacting genes than for the "artificially"-passaged laboratory strains of vaccinia virus.

Discussion
To reliably evaluate the evolutionary history of any virus family, and to better understand the selection pressures that have influenced that evolution, it is necessary to begin with a reliable and consistently annotated set of genes predicted to be coded by viruses belonging to that family. The prediction of genes coded for by large DNA viruses is not possible solely by annotating ORFs larger than some arbitrary cutoff as an expressed gene. Therefore we have developed a poxvirus-specific gene prediction tool that is able, in a semi-automated manner, to streamline the gene prediction and annotation process for poxvirus genomes. PGAS is able to help refine the translation start point of a coding region; identify truncated genes; identify fragmented, possibly non-functional genes; and identify small, previously unrecognized genes. Using its semi-automated prediction pipe-line, the system can either predict the gene content of a newly sequenced poxvirus genome, or re-annotate a previously sequenced genome very efficiently, allowing information pertaining to the remaining ORFs to be inspected and assessed using the GUI inspection tool.
Over-prediction of genes has been a significant problem in previous poxvirus genome annotations. For instance, the initial annotation of the sequence of VACV strain Copenhagen uncovered 263 'potential' genes, many of which are small and overlap with other genes [58,59]. By combining evidence from genome comparison, promoter prediction, similarity searches, and compositional analysis, we have found that a substantial number of previously annotated genes in many poxvirus species to be artifacts due to lack of information or to the limitations of previously employed gene prediction algorithms. Our current annotation of VACV-COP predicts that 187 genes may be expressed and functional in this virus strain. By applying integrated, consistent strategies as implemented in the PGAS tool, we were able to re-predict and annotate the gene sets for representative strains from all species with available complete genomic sequences in the Orthopoxvirus genus. Although the tool was developed for the prediction of poxvirus genes, it should be applicable to other viruses such as herpesviruses and prokaryotic organisms with a simple gene structure.
Inspection of poxvirus genomic sequences and their predicted gene sets emphasize three major mechanisms of variation that have occurred throughout their evolutionary history: (1) single base changes causing amino acid variation or variation in regulatory regions such as promoter sequences; (2) acquisition of new genetic information through horizontal gene transfer or gene duplication driven by the recombination of poxvirus genomic DNA with that derived from the virus host or other co-infecting pathogens; and (3) the gradual loss of genetic information and coding genes through progressive deletion of DNA sequence over many rounds of replication/infection. Single base changes may result in functional variation of existing proteins or changes in gene expression. HGT may result in the acquisition of new function. Gradual deletion will result in the loss of function. The key feature of all types of variation is that while they result from essentially random processes, to influence virus biology and evolution, they must provide some selective advantage that drives fixation within the virus genome. Our analysis of poxvirus genomes reveals the debris resulting from this continuing process of genome evolution. These truncated, fragmented, and missing genes that we observe, reflect the ongoing process of selection and fixation.
Features of poxvirus biology that might influence the ability of viruses to accommodate the loss of existing or acquisition of new genes are the structure of the virion and the organization of the genome. The virion structure must be flexible enough to contain a viral genome that can vary by at least 45,000 bases. The structure of the genome must be able to accommodate the insertion of significant lengths of new genetic material. The flexibility seen in poxviruses being able to introduce recombinant genes into poxvirus vectors emphasizes the ability of these viruses to accommodate new genetic information [86][87][88]. In addition, gene order in poxvirus genomes may not be as restricted by functional constraints as it is for other viruses, and can therefore be disrupted to a greater extent without deleterious effects [89]. A substantial number of genes unique to different poxvirus genera and species are found to be present near the ends of the virus genome. However, when confining the comparisons to the core regions of each genome, only a few unique genes are found in each representative species. Therefore, over the evolutionary history of these viruses, it appears that genes absolutely required for virus replication have "migrated" towards the central part of the virus genome, while genes with more peripheral functions, whose loss may be less disruptive to basic replicative processes, are located in the terminal regions of the viral genome. In this manner, evolution has provided for the segregation of poxvirus genes according to their role in virus biology. The central part of the genome represents the utility room of the virus without which nothing happens-the utility room houses all of the machines required to keep the virus functional, and in general, you do not mess with these machines. The terminal regions of the genome represent the parlor-the place where the guests first arrive. The parlor is where you hold the party where you entertain a wide variety of guests. If the guest proves to be useful, you invite them to stay awhile, and perhaps even give them a more permanent home. If the guest serves no useful purpose, you show them the door.
The absence of a viral fossil record makes it difficult to ascertain whether a taxa-specific gene is due to a recent gene acquisition event in that particular taxa, or is due to the loss of that gene from related taxa. But the presence of gene fragments can help infer which particular evolutionary process may have been responsible. Once the selection pressure to maintain gene function is lost, it appears that the sequence for the gene itself is also fairly rapidly lost through progressive deletion. Truncated genes, and especially gene fragments, are the result of that ongoing process.
The evolutionary process in orthopoxviruses exhibits two seemingly disparate features. On the one hand, species within the genus show extensive differences due to fragmentation and deletion of genes. On the other hand, the core genomic sequences of these viruses show very little variation, as reflected by the fact that more than a 96% nucleotide identity is seen throughout the core region of these genomes. But as demonstrated in the present study, major differences between viruses in the Orthopoxvirus genus predominantly involve the inactivation or loss of a number of genes originally present as a larger repertoire possessed by an ancestral virus similar to modern-day cowpox viruses. Therefore gene loss appears to be the predominant evolutionary process that drives the divergence of orthopoxvirus species. These changes in the non-core gene sets may lead to rapid changes in virus-host interactions resulting in divergence and speciation.
Variola virus, the etiologic agent of smallpox, contains the smallest genome and gene set of any other orthopoxvirus that circulates naturally in their host(s). This suggests that inactivation of genes may have reduced the ability of variola virus to propagate in alternative hosts, eventually restricting replication and transmission to just a single host: humans. It is interesting to note that cowpox viruses, with the largest orthopoxvirus genomes, have the widest host range of viruses belonging to the genus, while variola viruses, with the smallest genomes, have the most restricted host range. Cowpox viruses might be considered to be the sports utility vehicle of the poxvirus world-big lumbering beasts with many disparate functions, none of which is finely tuned, but all of which provide support for a wide range of uses. Variola viruses in contrast can be considered the Ferraris of the poxvirus world-small, streamlined, and tuned to do one thing well: infect and kill humans.
The last common poxvirus ancestor might have been able to infect early eukaryotic organisms, as reflected by their wide, present-day distribution and natural host range, which spans insects to mammals [1,4]. In spite of this wide distribution in the environment, many modern-day poxviruses have a tendency to exhibit a fairly narrow host range for any one particular species. During the evolution of the virus within a host, poxviruses may have lost the genes unnecessary for infection of that particular host species, keeping only those genes necessary to successfully parasitize that particular environmental niche, a process in part recapitulated by endosymbiotic bacteria [90]. These gene-loss events may have jeopardized the virus' ability to infect other organisms, forming a natural host species barrier. Virus evolution due to gene loss may therefore represent one of the defining processes through which the basic biology of modern-day poxviruses is determined.

Genome Sequences
For evaluation of the evolutionary history of the Poxviridae family, we chose a representative strain of the type species from each genus. For re-annotation and analysis of the Orthopoxvirus genus, we chose representative strains of each species in the genus where complete genome sequences have been determined (Table 1). Where distinct clades exist for any particular species (Variola virus, Vaccinia virus, and Monkeypox virus), we utilized members of each clade.

Multiple Sequence Alignments and Phylogenetic Tree Construction
For gene-based phylogenetic analysis, translations of the nucleotide sequences of the open reading frame of each gene were aligned using MEGA 4 [91] and the ClustalW algorithm [92]. Amino acid multiple sequence alignments (MSA) were used to generate corresponding codon-aligned nucleotide MSAs. The individual amino acid alignments or nucleic acid alignments were concatenated together into one large contiguous alignment for subsequent phylogenetic analysis. For the core orthopoxvirus genome nucleotide alignment, the alignment was generated using a combination of the programs MAVID and Multi-LAGAN [93,94]. The final computational alignment was then hand edited extensively to optimize the alignment. The alignment extends from base 15716 to base 158673 of the Vaccinia virus strain Western Reserve (VACV-WR) genome. This alignment starts with VACV-WR_021 and ends with VACV-WR_177 and corresponds to the VACV-Copenhagen (VACV-COP) genes C7L to A51R [79].
Phylogenetic inference of the family Poxviridae was based on an amino acid alignment of 20 conserved genes of virus isolates from representative species of each genus (Supplementary Table 1). For the sequence-based orthopoxvirus phylogeny, an alignment of codon-aligned nucleic acid sequences from 141 conserved genes was used (Supplementary Table 1). All trees were inferred by using Bayesian inference with Markov chain Monte Carlo methods as implemented by MrBayes [95]. The poxvirus amino acid tree was estimated by allowing model jumping between all fixed-rate amino acid models. For the nucleic acid sequence phylogeny, the most appropriate nucleotide substitution model was first determined using the program MODELTEST [96]. The best fitting model was a general time reversible (GTR) model of nucleotide substitution that allowed for gamma-distributed variation across sites with a proportion of invariable sites [95,97]. Tree analysis was performed for at least 100,000 generations with a sampling frequency of 100. Trees were constructed from the MrBayes run data after disregarding the initial 25% as burn in.
For the gene content phylogenetic inference, the MrBayes model utilized was the standard discrete morphological model with variable coding and 4 character states (0: intact, 1: truncated, 2: fragmented, 3: missing). Characters were unordered. The model assumed equal stationary state frequencies and equal substitution rates and the branch lengths were unconstrained, with all topologies equally probable. The prior was set to symmetric Dirichlet with fixed (−1.00) variance parameter. The cost to switch from any one state to any other was set to 1. Tree analysis was run for 200,000 generations with a sampling frequency of 100. Trees were constructed from the MrBayes run data after disregarding the initial 25% as burn in.

Poxvirus Genome Annotation System (PGAS)
Different gene prediction methods often examine different aspects of an actual gene, all of which may complement each other and yield better predictions. Therefore, in order to achieve optimum predictive capability, PGAS was designed to integrate as many useful gene prediction methods as possible as long as they individually provide some predictive capability and the algorithm that provides that capability is not also implemented in another method used in PGAS. PGAS contains four independent approaches: sequence similarity comparison, comparative genomic analysis, promoter detection, and a test of coding potential.

Sequence Comparison
The genomes of living organisms have arisen through modifications of an array of ancestral sequences. Duplication with modification is a central paradigm of protein evolution, wherein new proteins and/or new biological functions are fashioned from earlier ones [98]. Similarity over an extended region of a sequence in most cases implies homology, or descent from a common ancestral gene [99]. The similarity of a region of the genome to a sequence that is already known to be transcribed is the single most powerful predictor of whether the newly annotated genomic sequence is part of a gene and is therefore transcribed into mRNA and translated into protein [100]. Gene prediction algorithms that take sequence similarity into account generally outperform those that do not [101].
However, two concerns have to be taken into consideration when using sequence comparison to predict gene content in a genome sequence. The first one is the quality of the sequence database. Since most sequences deposited into a database are based more or less on results from prediction and annotation by similarity, mistakes made early on will be propagated repeatedly from one sequence to the next. The second concern is the coverage of the sequence database and the sensitivity of the similarity-detecting algorithm. Biological sequence data may be both extremely redundant for some genes, but at the same time relatively sparse for others. Discovering closely related homologues is relatively straightforward due to the development of efficient similarity-search algorithms [62,102]. However, it is much harder to detect similarity between two distantly related sequences due to the accumulation of mutations. To address concerns about the quality of the sequence database and the inability to detect distant similarity by conventional pair-wise sequence comparison algorithms, PGAS uses similarity information as detected by BLAST [62] against common sequence databases such as the NCBI nr database, only as supporting evidence during manual inspection. Instead, PGAS relies heavily on detecting statistically significant matches between an ORF and the Pfam database of functional protein motifs. Pfam is a comprehensive collection of protein domain families curated by experienced biologists and bioinformaticists, with a range of well established uses including genome annotation [63]. In addition, each domain in the Pfam database represents an empirically derived estimate of all possible evolutionary changes for a protein of particular function, which then leads to identification of a much higher proportion of distantly related sequences [103] with a searching algorithm, HMMPFAM, based on a hidden Markov model (HMM).
The tradeoff of using HMM-based searches for increased sensitivity is the intrinsically slow nature of the Viterbi [104] or forward algorithm used in the search application. In addition, as the size of the publicly-available protein database continues to grow at a rapid rate, it also takes a significant amount of time to search these databases even with the relatively efficient BLAST algorithm. To solve this problem, PGAS uses our previously developed algorithm for deploying these searches on a Linux cluster [105].

Genomic Comparative Analysis
Functional sequences are subjected to evolutionary selection. When two sequences are aligned, most of the observed differences are neutral, having no effect on the amino acid sequence of the encoded protein. Other mutations result only in a conservative change of one amino acid for another similar amino acid, or the changes may occur in regions of the protein that do not directly play a role in protein function and may therefore be more tolerant of change. In contrast, regions of a gene that are devoid of mutation may be conserved because the mutations may result in a loss of function and therefore be detrimental to the organism [106]. Using computer-based analysis to focus in on the genomic features that have been preserved in phylogenetically related species, researchers have been able to pinpoint the motifs responsible for function, such as protein coding regions and gene regulatory motifs.
Members of the subfamily Chordopoxvirinae share many basic features with each other. One significant feature is the overall conservation of gene synteny (genomic organization)-the pattern in which their genes are arrayed along the chromosome is almost identical when comparing most of the species within the ChPV subfamily (only the avipoxviruses show significant deviation due to a few large scale rearrangements within their genomes) [44]. On the other hand, in spite of this conservation, other aspects of genome composition have undergone dramatic changes. For example, the GC content in both the parapoxviruses [32] and molluscipoxviruses [107] is more than 60%, while it is only about 25% in Capripoxvirus species [7]. Synteny conservation at the gene level and disparity in GC content of genome sequences permit unambiguous identification of functional motifs including genes, because any similarity due to conservation of nonfunctional sequence is eliminated over time. Synteny conservation also allows the establishment of orthologous relationships among genes. The comparative genomic approach has also proven to be very powerful in detecting small genes that are conserved in several closely related species [108]. Unlike larger genes, the statistical signals to distinguish small functional ORFs from non-coding sequence are very weak and therefore these genes may simply be over-looked by more conventional approaches such as similarity searching and tests for codon-bias or biases in base composition.
PGAS detects similarity between any two ORFs in all sequenced poxvirus genomes with sensitive all-against-all BLAST searching. Two orthologous genes and those surrounding them in each species are displayed graphically in a panel in the PGAS tool, and are connected with lines indicating possible pairs of orthologs. By including similarity relationships for neighboring genes, the orthologous relationships can then be unambiguously resolved due to the conservation of genome organization.
Since it is more difficult to detected similarities between two short genes using the BLAST algorithm, in PGAS a pairwise alignment between any two genes can be generated as desired using an included implementation of the rigorous Needleman-Wunsch algorithm [67].

Promoter Prediction
The expression of poxvirus genes are controlled by a promoter sequence that lies upstream of the mRNA start site and initiating ATG codon. Just as one can reach the melon by following the vine, it is possible to use the predicted promoter sequences as a "signal"-by knowing the position of a promoter one knows at least the approximate starting point of the transcript, thus delineating one end of the gene. This information is also helpful in predicting a small gene, where weakness in the signal of the coding region prevents confirmation of whether it is expressed until experimental evidence is available. Although poxvirus promoters appear to be simple conserved sequence motifs, computational identification is far from a fait accompli. In general, promoter-prediction algorithms that are able to locate a reasonable percentage of true promoters, also give a high number of false-positive predictions [109]. To increase the reliability of gene and promoter prediction, PGAS combines these predictors to provide mutually supportive results. In addition, other types of information-such as adjacency to an ORF and reading frame compatibility of the first ATG codon following the transcription start site with as well as the inter-species conservation of both promoter and gene-are all taken in account. The significant compositional features and inter-dependencies observed in vaccinia virus promoter sequences were incorporated into interpolated context models (ICMs) [65,66], which were then employed to predict promoters in the vaccinia virus genome, as well as all other orthopoxvirus genomes. By coordinating positional and compositional features and dependencies in a signal, ICMs lead to improvement in predictive capability compared to a simple weight matrix model (WMM). Temporal-specific promoter sequence models were built based on experimental data reported in the literature. For each ORF, the predicted promoter sequence, the coding region, and the poxvirus early gene transcriptional termination signal were projected onto a panel in PGAS, which also provides for visualization of the same information for orthologous genes in other species. Manual inspection was then carried out to verify the predicted promoter sequence and downstream open reading frame. For those ORFs with predicted promoters that show the predicted sequence located within the putative coding region, comparison between orthologous genes was used to resolve the conflict by either altering the translation start site to the next in-frame ATG triplet, or by reassessing the accuracy of the predicted promoter sequence.

Characterization of Coding Potential Using Glimmer
The base composition of genes is strongly affected by evolutionary constraints, and therefore may be statistically "unfavorable" in the context of the whole genome. A DNA sequence that encodes protein is not a random chain of available codons for any particular amino acid, but rather an ordered list of specific codons that reflect the evolutionary origin of the gene as well as constraints associated with genome replication, gene expression and function [110]. A composition-based gene finding approach detects the statistical bias present in coding regions. In general, these tools are first trained with the coding regions that comprise a set of known, true genes for the species under study. Then this model is used to evaluate the coding potential of every ORF present in that species. A poxvirus gene consists of a single continuous open reading frame separated from the next gene by a short intergenic region-a pattern that is similar to a prokaryotic gene model. Glimmer 2.0, one of the available prokaryotic gene finders, is widely used for prokaryotic genome annotation [64]. In PGAS, Glimmer 2.0 was customized to evaluate the coding potential of each orthopoxvirus ORF using an isolate-specific model constructed from the ORFs in that isolate that have statistically significant hits to the Pfam database. The Glimmer score provides a useful clue in manually resolving ambiguous predictions.

Semi-Automatic Gene Prediction
For efficiency, scalability, and consistency, automated processes are highly preferable to manual curation. However achieving both efficiency and accuracy simultaneously is not yet possible given the limitations of the current gene prediction algorithms. For that reason, PGAS is designed to operate in a hybrid mode in which the results from an initially automated gene prediction pipeline are displayed in the GUI that then provides additional information to allow for an intelligent, human-directed final assignment of ORFs as "authentic" genes. In the initial phase, ORFs with significant Pfam hits are scored as highly likely to be potential genes. Shorter ORFs entirely overlapping a longer ORF with a strong Pfam hit are scored much lower, as they are less likely to be true genes because of poxvirus gene structure features. Due to the comprehensiveness of Pfam, the majority of genes in any one species can be resolved in this manner. Evidence for other potential genes that did not receive high scores in the initial round can be visualized through the GUI, which when combined, allows a straightforward decision regarding gene assignment to be made. An ORF conserved in several different phylogenetically related species and having a predicted promoter motif present upstream of its coding region is much more likely to be a functional gene than an ORF lacking orthologs in other viruses, or lacking a promoter sequence. For a potential gene that is unique to a species, both the Glimmer score and the existence of a promoter-like motif provide strong clues to assist in determining whether it is an "authentic" gene. In addition, the precise translational start site can be determined based on the position of the predicted transcription start site, and comparison with orthologous genes in other strains and species.

Conclusions
Through the development and use of a new set of bioinformatics tools, we have re-annotated the gene sets of representative strains of all species in the Poxviridae genus Orthopoxvirus. These tools, packaged as the Poxvirus Genome Annotation System (PGAS) provide a semi-automated pipeline for the assessment of the coding potential of every ORF present in each poxvirus genome, and then present this information using a visualization tool that supports the final step of human-directed manual annotation of virus genes. An analysis of the coding potential for each gene of each genome emphasizes the variability of gene content in the orthopoxviruses. This variability is most apparent in genes that are functionally involved in various virus-host interactions and are located near the ends of the virus genome, while the central core region of the genome encodes genes that are much more conserved and are involved in basic virus replicative processes. Poxvirus evolution is mechanistically driven by single base changes and small insertions/deletions; acquisition of new genetic material through horizontal gene transfer; and changes in expression due to alterations in regulatory, promoter sequences. The importance of each mechanism in the evolution of any particular species probably reflects the various selection pressures that impact virus variation, biology, and response to the environment. This current work emphasizes the importance of gene loss in the overall divergence of orthopoxvirus species, and suggests that the loss of gene function through the deletion of genome sequences that no longer provide any selective advantage to virus replication is a major driving force supporting the variation and evolution of these viruses.