Divergent Evolution of E1A CR3 in Human Adenovirus Species D

Adenovirus E1A is the first viral protein expressed during infection. E1A controls critical aspects of downstream viral gene expression and cell cycle deregulation, and its function is thought to be highly conserved among adenoviruses. Various bioinformatics analyses of E1A from 38 human adenoviruses of species D (HAdV-D), including likelihood clade model partitioning, provided highly significant evidence of divergence of HAdV-Ds into two distinct groups for the conserved region 3 (CR3), present only in the E1A 13S isoform. This variance within E1A 13S of HAdV-Ds was not found in any other human adenovirus (HAdV) species. By protein sequence and structural analysis, the zinc finger motif of E1A CR3, previously shown as critical for transcriptional activation, showed the greatest differences. Subsequent codon usage bias analysis revealed substantial divergence in E1A 13S between the two groups of HAdV-Ds, suggesting that these two sub-groups of HAdV-D evolved under different cellular conditions. Hence, HAdV-D E1A embodies a previously unappreciated evolutionary divergence among HAdVs.


Introduction
Human adenoviruses (HAdVs) belong to the genus Mastadenovirus within the Adenoviridae family, and comprise seven species, HAdV-A to HAdV-G. HAdV-D contains the largest number of viruses presently recognized in GenBank, totaling 61 out of 90 HAdV genotypes. Moreover, HAdV-D is the fastest growing clade with 19 of the 28 most recently emerged genotypes [1][2][3][4]. We previously identified homologous recombination as the main evolutionary force driving this rapid expansion [5][6][7][8][9]. Recombination pressure was found to be most prominent in the major structural components of the viral capsid, specifically the penton base, hexon, and fiber proteins, along with the three E3 CR1 open reading frames (ORFs). These regions are also the most hypervariable among all the ORFs in the HAdV-D genome. These insights were possible only through continued efforts to fully characterize newly identified adenoviruses through whole-genome sequencing [5,10]. The relative abundance MEGA6.06 [43] using default parameters. Phylogenetic trees were generated using the bootstrapped neighbor-joining distance method (500 replicates) with MEGA6.06 and maximum likelihood with PhyML (HKY85 model) [44]. The PhyML generated unrooted tree in Newick format was used to label all HAdV-Ds as a single clade partition ($1) or two separate clade partitions, group 1 ($1) and group 2 ($2), while the remaining HAdVs belonging to the other 6 species (A-C, E-G) were labeled as a single clade partition ($0 as default). The corresponding labeled trees in Newick format along with E1A coding sequences aligned in respective preserved codon position in PHYLIP format were used as input for Clade model C (CmC) divergence selection analysis [45]. PAML 4.8 [46] CmC analysis with settings 'Model = 3 and NSsites = 2' was used to identify divergence selection acting on two clades separating HAdV-Ds into two classes. Using codeml, an alternate model was represented with a phylogenetic tree having each class as separate clades (complex tree) and tested against the tree with all HAdV-Ds in a single clade (simpler tree) as a null model to identify divergence. A likelihood ratio test (LRT) between the models was calculated using the resulting ln-likelihood (ln L) for null and alternate models. The statistical significance was calculated from the LRT value and difference in parameter number using a χ2 distribution [47].

Proteotyping
Proteotyping was performed as previously described [5,6]. Amino acid sequences were aligned using ClustalW, and then transferred to MEGA to generate maximum likelihood trees. E1A sequences were then rearranged in the multi-sequence alignment in the same order as in the phylogenetic tree. For each amino acid position, the consensus residues were identified and residues matching the consensus at each respective position colored white. Gaps in the alignment were colored black. The remaining amino acids were each assigned a unique color for all occurrences. A unique proteotype was defined as a pattern representing 10% residue divergence from the consensus sequence [5,6].

Protein Sequence and Structural Analysis
Multi-sequence amino acid alignment of E1A CR3 regions representing 38 HAdV-D E1A sequences was performed using MAFFT [42]. A protein model for HAdV-D37 and D56 CR3 regions was constructed using QUARK [48] (http://zhanglab.ccmb.med.umich.edu/QUARK), an ab initio protein folding and protein structure prediction algorithm, using atomic-level knowledge-based force field and Monte Carlo simulations to generate 3D protein structures in the absence of a template. The generated models were evaluated by Ramachandran plot using RAMPAGE (http://mordred.bioc.cam.ac.uk/~{}rapper/rampage.php/). At least 95% of residues in both models were found in either most-favored or allowed regions. Protein structure analysis and visualization were performed with UCSF Chimera software (http://www.cgl.ucsf.edu/chimera). CR3 protein structure backbone dynamics, including flexibility, and ordered and disordered regions, were predicted using DynaMine [49] (http://dynamine.ibsquare.be/).

Codon Usage Bias Analysis
Relative Synonymous Codon Usage (RSCU) values represent the occurrence frequency of a specific codon in a sequence as compared to other synonymous codons coding for the same amino acid under equal codon usage bias. RSCU values were calculated using DnaSP v5 [50] for all individual codons except for three stop codons, methionine, and tryptophan. The RSCU values for 38 HAdV-D sequences (62 amino acids, 59 codons) for E1A 13S CR3 sequence were examined by correspondence analysis (CA) in R statistical software version 3.3.2 (http://www.R-project.org/) [51], using the CA function in the FactoMineR package [52], and the first two dimensions were plotted.

Segregation of the 13S Isoform of HAdV-D E1A into Two Major Subclades
The evolution of the E1A 13S isoform across 60 HAdVs, representing all seven species (A-G), was analyzed by phylogenomics at the protein level. As shown in Figure 1A, E1A 13S clades segregated as expected according to existing HAdV species delineations. The largest membered HAdV species, B and D, showed further subdivisions with two subclades each, containing roughly equivalent numbers of HAdVs in each subclade. To look more specifically at the CR3 region of E1A 13S, we applied proteotyping to these 60 HAdVs. In this method, a maximum likelihood tree is generated from the putative amino acid sequences of the genomic region of interest, and a unique color is assigned to each amino acid in the resultant amino acid assembly to permit visual discrimination of subgroups. As can be seen from proteotyping ( Figure 1B), two distinct HAdV-D CR3 proteotypes emerge and are resolved more clearly than the HAdV-D subclades in Figure 1A. Only one CR3 proteotype was observed for each of the other HAdV species, including HAdV-B (despite it dividing into two subclades at the complete protein level). The CR3 proteotypes for HAdV-A, B, C, E, F, and G were more similar to each other than to HAdV-D. HAdV-C had the smallest E1A CR3 13S region, comprising only 46 amino acids. Compared to HAdV-C5, with the best-studied E1A protein, the CR3 region of HAdV-Ds appears to contain an additional 16 amino acids. Here, multi-sequence alignment is shown in respect to the maximum-likelihood tree with the consensus sequence hidden as a white background, while non-consensus colored regions represent unique shared protein signatures. The red box outlining the HAdV-Ds represents the most variable region within CR3 for the two HAdV-D groups, arbitrarily named group1 (upper proteotype encompassing HAdV-D45 to D29) and group2 (lower proteotype, encompassing HAdV-D33 to D42), each with highly distinct amino acid signatures. The CR3 of HAdV-Cs, as exemplified by the well-studied HAdV-C5 (red arrows), is the smallest in length among all HAdV species.
Subsequent phylogenetic analysis at the nucleotide level for the unspliced E1A ORF confirmed the clear distinction between these two subgroups within HAdV-D ( Figure 2A). By proteotyping, no such division into distinct proteotypes was evident in E1A 12S ( Figure 2B), nor was the origin of the difference between the two 13S subgroups apparent beyond the CR3 region ( Figure 2C). Due to differential splicing of the first exon, E1A 12S in HAdV-Ds contains 191 amino acids, while E1A 13S comprises 253 amino acids. Proteotyping of HAdV-D E1A 12S suggested four proteotypes, but these were not distinctly different from one another ( Figure 2B). In contrast, proteotyping of 13S showed a remarkable divergence of one proteotype (D29 to D44) from the others, with four proteotypes in total ( Figure 2D). The major source of this bifurcation of E1A was the CR3 region, absent in E1A 12s transcript, and present in E1A 13S. No significantly distinct proteotypes were observed for the E1A 12S isoform. The E1A 13S isoform clearly segregated into distinct proteotypes, arbitrarily named group1 (upper) and group2 (lower), due to differences in the CR3 region (as seen by the non-consensus protein signatures). (D) E1A recombination was identified in assortment analysis performed in the genomic position order of putative proteotypes for penton base hypervariable loops (HVL1 and HVL2), hexon, three E3 ORFs (CR1 α, β, and γ) and fiber open reading frames, with respect to E1A 13S proteotypes. For visual representation and understanding recombination at the whole genomic scale, each unique proteotype was assigned a number and color.

Evidence for Prior Homologous Recombination in HAdV-D E1A
HAdV-D genomes have been shown to contain seven distinguishable hypervariable regions that undergo homologous recombination as a mechanism of viral evolution, leading to the emergence of new pathogens [5,8,9]. These are the penton base hypervariable regions 1 and 2 (separated bỹ 450 nucleotides), hexon hypervariable regions 1-6 and 7 (closely adjacent), E3 CR1-α, E3 CR1-β, E3 CR1-γ, and fiber genes. Unlike the penton base and hexon genes, which contain hypervariable regions within otherwise highly conserved nucleotide sequence, the three E3 genes and the fiber gene are hypervariable over their entire ORFs. Hypervariable regions can recombine as independent units, or the recombination event initiating at one hypervariable region can extend to include a larger area of the genome [2,6]. As shown by proteotype analysis and subsequent sorting based on E1A 13S ( Figure 2D), the data suggest that E1A might be an additional site of homologous recombination in HAdV-D. For example, HAdV-D47, D56, D9, D10, and D26 all share a proteotype including penton base hypervariable region (HVL)1, while the latter four viruses (D56, D9, D10, and D26) also share a proteotype with penton base HVL2. This suggests that the recombination event for HAdV-D47 E1A ended somewhere between penton base HVR1 and HVR2, while for the other four viruses the recombination event concluded after penton base HVR2 but before the hexon hypervariable regions, a pattern observed previously [53].

Evolutionary Divergence in HAdV-D E1A 13S Zinc Finger-Containing CR3 Region
Our proteotyping and phylogenetic analyses indicated that the two separate clades in HAdV-Ds evolved divergently for the E1A-13S protein. Therefore, we tested divergence within clades and site-specific substitution constraints along entire clades at the codon level through quantifying the diversifying selection pressure acting on each clade in comparison. Clade model C [45] (CmC) in PAML [46] was used to test this hypothesis with whole E1A-13S DNA sequences for 60 HAdVs using nested models. The null model had all the HAdV-Ds labeled as a single partition or one clade (D combined), while the second partition consisted of the rest of the HAdVs belonging to other species as the background clade. Although the CR3 region might be evolving in both clades, this null model assumes no divergence within the D combined clade. The alternate model had three partitions, which included two partitions for the two subclades in HAdV-D and a third as background clade with the rest of the HAdVs belonging to other species. This model assumes the divergence between the clades in HAdV-Ds and hence is a complex version of the simpler null model. There was only one parameter difference between these models. Hence, using one degree of freedom and twice the difference in the (ln L) values between these models, the likelihood ratio test (LRT) produced a p = 6.7 × 10 −16 . This is consistent with a highly diversifying selection pressure between HAdV-D clades for the full E1A-13S (Table 1). To better understand the role of the CR3 region in this diversifying selection between the two clades, we performed another test by removing the CR3 region from the E1A-13S nucleotide sequence, which accounted for 62 amino acids. Using the same complex and simpler tree partition models as above, CmC was used to identify the divergence between the clades for this shortened sequence without CR3. Interestingly, the difference was no longer significant (p = 0.108), demonstrating that the divergent selection within CR3 is the major driver of divergence between the two HAdV-D E1A 13S clades (Table 1). Further inspection of the HAdV-D 13S CR3 region by multi-sequence alignment revealed the area of greatest divergence between 13S sequences included the zinc finger motif responsible for transcriptional activation ( Figure 3A), with an average identity of only 64.7% (55-70% range) between the two HAdV-D clades at the protein level. As previously reported [54], most of the E1A protein was intrinsically disordered except for the region of the CR3 zinc finger motif. The E1A CR3 zinc finger motif has been characterized as a Cys-4 zinc finger, comprised of four cysteine residues [55]. As seen in the multi-sequence alignment, these four cysteine residues, previously described for HAdV-C5, are also conserved in HAdV-D, and found at positions 153, 156, 170, and 173. Many zinc finger motif classes contain multiple histidine residues, and we identified a highly conserved histidine residue at position 159 in all 60 HAdVs. We also found a HAdV-D specific and conserved cysteine residue at position 134. This HAdV-D specific cysteine residue-containing region is absent in HAdV-C5, and its role is yet to be studied. Hence, our multi-sequence alignment suggested that HAdV-Ds have a novel zinc finger motif, with an extra (fifth) cysteine residue. Finally, we noted an additional histidine residue at position 140, present only in the larger of two 13S CR3 subgroups (group2) and of unknown significance.
The structure of E1A has not been resolved. We therefore applied ab initio protein modeling to compare the impact of sequence divergence between 13S CR3 subgroups on the corresponding zinc finger fold. Models of the CR3 regions of HAdV-D56, chosen as a representative for group1, and D37, from group2, were studied as examples of each subgroup ( Figure 3B,C, respectively), as they share similar tropisms [1]. In the D56 CR3 protein model, four cysteine residues formed a compact zinc finger scaffold in close proximity with the conserved histidine residue at position 159. The secondary structure of the D56 CR3 protein consists of three alpha helices, and the conserved cysteine at position 134 was distant from the core zinc binding domain. In contrast, the D37 CR3 protein model showed an alpha helix with two beta strands. The HAdV-D specific cysteine C134 in D37 was part of the zinc finger scaffold. Only the Ad37 protein model incorporated all five conserved cysteines in its scaffold, while the Ad56 model was able to incorporate only four cysteines, similar to HAdV-C5.
Hence, it appears that the predicted D56 E1A zinc finger structure is similar to what was previously predicted for E1A in other HAdV species, having four cysteine residues in the zinc finger, while the D37 E1A proteotype evolved to include a fifth cysteine residue in its zinc finger scaffold structure. The ab initio models for HAdV-56 and D37, as shown in Figure 3B,C, respectively, were generated based on the entire 62 aa sequence for the E1A CR3 region. However, the great majority of studies of E1A CR3 have used HAdV-C5. HAdV-Cs are unique in that their E1A CR3 is only 46 aa in length. It could be argued that given the intrinsically disordered character of most of E1A 13S, that the extra 16 aa in other HAdV species do not contribute to the structure of their CR3 zinc finger domain.
To further elucidate the effects of including the entire 62 aa in our models, we next compared ab initio models of HAdV-D56 and D37 and assessed their structural overlap using both the canonical 46 and the extended noncanonical 62 aa sequences. As shown in Figure 4, when modeling was restricted to the canonical 46 aa, the predicted structures for D56 and D37 were nearly identical ( Figure 4A-C). However, when the full E1A CR3 comprising of 62 aa was modeled, the predicted structures diverged dramatically ( Figure 4D-F). Including the entire 62 aa had a greater effect on the predicted structure of D37 than on D56 ( Figure 4G,H). It was previously shown that~57% of the amino acids in E1A CR3 region are required for transcriptional activation based on studies of HAdV-C5 E1A CR3 by conservative mutational analysis [17]. Moreover, the region corresponding to the first half (23 aa out of 46 aa) of the canonical E1A CR3 (of HAdV-C5) [17] in HAdV-Ds is the most divergent region between the two CR3 subgroups with only a 56% average identity. Hence, the amino acid differences between the two CR3 subgroups may be the result of evolutionary divergence in the scaffold structure while maintaining the functionality of the zinc-finger domain.

Codon Usage Bias Analysis and Divergent Evolution of HAdV-D E1A 13S
DynaMine analysis is a linear regression-based model trained with available NMR data to predict protein dynamics and disorder regions, and is based on amino acid sequences independent of structural modeling. DynaMine was used previously to identify interaction motifs in HAdV-C5 E1A [49]. Our analysis predicted greater structural flexibility in the region 140-160 aa, which flanks the cysteine 134 in HAdV-D37 E1A 13S relative to that of D56 ( Figure 5A). Our structural predictions suggest therefore that these two CR3 groups differ principally in the zinc finger motif. We therefore employed codon usage bias analysis to examine the pathway of evolution yielding these two CR3 subgroups. RSCU values representing the frequency of occurrence of synonymous codons in a 59-dimensional codon space for each of the 38 HAdV-Ds were converged to 2-dimensional space. For the HAdV-D E1A 12S isoform, both subgroups had similar RSCU values ( Figure 5B), implying that HAdV-D evolved in a similar codon-optimized environment for E1A 12S. However, the same analysis on the CR3 region of the 13S isoform revealed a clear separation between the two CR3 subgroups ( Figure 5C). These data suggest that these two E1A CR3 subgroups within HAdV-D evolved in different cellular environments.

Discussion
E1A is one of the most highly studied viral proteins, with pleiotropic influences on viral gene expression, apoptosis, and the cell cycle [13]. The E1A protein consists of four regions thought to be highly conserved among HAdV species. However, in the work presented herein, we demonstrate a unique evolutionary variance in the HAdV-D E1A CR3 region with two highly divergent CR3 subgroups. The three major capsid genes and three of eight ORFs in the E3 transcription unit of HAdV-D were previously shown to contain regions of nucleotide hypervariability with clear segregation into multiple distinct proteotypes, relative to the remainder of the highly conserved HAdV-D genome [5]. HAdV-D is the largest and most rapidly expanding HADV species [56]. Variation in nucleotide sequences coding for the three major surface capsid proteins, the hexon, penton base, and fiber, and in proteins that directly mediate immune evasion, namely E3 CR1-α, β, and γ [6], suggests successful adaptive evolution to achieve broad infectivity, replication, and viral transmission across multiple host cellular and tissue microenvironments [8,57]. The E1A CR3 region, present in the 13S but not the 12S E1A isoform, is critical to downstream viral gene expression and has a direct impact on viral replication and, therefore, on viral fitness. The ab initio modeling, in context with the other in silico analyses, suggests that the downstream transcriptional activity of E1A CR3 may differ between HAdV-D subgroups-because of differences in the amino acid backbone-which in turn would impact the motif and folding of the zinc finger scaffold critical to the binding of viral promoters for transcriptional activation. We speculate that the unique zinc-finger fold in the subgroup containing HAdV-D37, with an additional cysteine residue, might generate more robust transcriptional activation. However, whether downstream viral gene expression differs between the two HAdV-D CR3 subgroups remains to be determined by further study. Differences in the evolution (and putative function) between HAdV species have been previously shown to include marked discrepancies in size. For example, the E3 CR1β gene in HAdV-D is almost twice the length of its homolog in any other HAdV species [58]. Similarly, as noted previously, the HAdV-D E1A CR3 is larger than that reported for HAdV-C5. In the study by Ablack and colleagues comparing transcriptional activation by E1A CR3 across HAdV species, HAdV-D9 E1A CR3 transcriptional activation of a GAL4-DNA-binding domain fusion was extremely low, and well below the comparands from other HAdV species. It would be interesting to perform similar studies comparing HAdV-D9 and other members of the subgroup containing HAdV-D56, with transcriptional activation by members of the subgroup containing HAdV-D37. Further, research into the biology of E1A has contributed to the identification of many important proteins involved in mammalian gene regulation, for example, p300 and CtBP [59]. Therefore, comparing not only transcriptional activity but also the protein interactomes of HAdV-D E1A CR3 subgroups may reveal novel binding partners and transcription factor machinery components whose function is yet to be elucidated.
Our studies of HAdV-D E1A CR3 codon usage bias indicated that both CR3 subgroups came from a common ancestor, but took different evolutionary paths to adapt to different host environments. It is unclear what specific host adaptation pressure(s) may have influenced codon usage between E1A CR3 subgroups. The close relationship between simian and human adenoviruses and prior evidence for interspecies transmission [60,61] and co-evolution [7,62,63] suggest one such possibility. Regardless, our results provide a platform to better understand zinc-finger transcription factor motif evolution in the context of viral pathogenesis, and may contribute to improvements in the design of synthetic zinc finger proteins [64].
In summary, in this study, we present the discovery and delineation of two subgroups of E1A CR3 within HAdV-D species, each with a different evolutionary history and predicted protein structure. Next-generation sequencing, genomics, and bioinformatics have made it possible to understand previously hidden aspects of adenoviral evolution.