Coronavirus Genomics and Bioinformatics Analysis

The drastic increase in the number of coronaviruses discovered and coronavirus genomes being sequenced have given us an unprecedented opportunity to perform genomics and bioinformatics analysis on this family of viruses. Coronaviruses possess the largest genomes (26.4 to 31.7 kb) among all known RNA viruses, with G + C contents varying from 32% to 43%. Variable numbers of small ORFs are present between the various conserved genes (ORF1ab, spike, envelope, membrane and nucleocapsid) and downstream to nucleocapsid gene in different coronavirus lineages. Phylogenetically, three genera, Alphacoronavirus, Betacoronavirus and Gammacoronavirus, with Betacoronavirus consisting of subgroups A, B, C and D, exist. A fourth genus, Deltacoronavirus, which includes bulbul coronavirus HKU11, thrush coronavirus HKU12 and munia coronavirus HKU13, is emerging. Molecular clock analysis using various gene loci revealed that the time of most recent common ancestor of human/civet SARS related coronavirus to be 1999–2002, with estimated substitution rate of 4×10−4 to 2×10−2 substitutions per site per year. Recombination in coronaviruses was most notable between different strains of murine hepatitis virus (MHV), between different strains of infectious bronchitis virus, between MHV and bovine coronavirus, between feline coronavirus (FCoV) type I and canine coronavirus generating FCoV type II, and between the three genotypes of human coronavirus HKU1 (HCoV-HKU1). Codon usage bias in coronaviruses were observed, with HCoV-HKU1 showing the most extreme bias, and cytosine deamination and selection of CpG suppressed clones are the two major independent biological forces that shape such codon usage bias in coronaviruses.

HKU12 (ThCoV HKU12) and munia coronavirus HKU13 (MunCoV HKU13) [2]. Most of these genomes were sequenced using the RNA extracted directly from the clinical specimens, such as nasopharyngeal aspirate or stool, as the template, while the viruses themselves were still non-cultivable [2,3,6,[11][12][13][14][15]. This provided more accurate analysis of the in situ viral genomes avoiding mutational bias during in vitro viral replication. These sequence efforts have resulted in a marked increase in the number of coronavirus genomes and have given us an unprecedented opportunity to understand this family of virus at the genomic and in silico levels. These understandings have also led to generation of further hypotheses and experiments in the laboratory. In this article, we reviewed our current understanding on the genomics and bioinformatics analysis of coronaviruses. Details of the bioinformatics tools will not be discussed.

Genomics
Coronaviruses possess the largest genomes [26.4 kb (ThCoV HKU12) to 31.7 kb (SW1)] among all known RNA viruses ( Figure 1) [2,13,16]. The large genome has given this family of virus extra plasticity in accommodating and modifying genes. The G + C contents of coronavirus genomes vary from 32% (HCoV-HKU1) to 43% (Pi-BatCoV HKU5 and MunCoV HKU13) (Table 1) [2,3,10]. Both the 5' and 3' ends of coronavirus genomes contain short untranslated regions. For the coding regions, the genome organizations of all coronaviruses are similar, with the characteristic gene order 5'-replicase ORF1ab, spike (S), envelope (E), membrane (M), nucleocapsid (N)-3', although variable numbers of additional ORFs are present in each subgroup of coronavirus (Table 1, Figure 1). A transcription regulatory sequence (TRS) motif is present at the 3' end of the leader sequence preceding most ORFs ( Table 1). The TRS motifs are thought to be important for a "copy-choice" mechanism that mediates the unique random template switching during RNA replication, resulting in a high frequency of homologous RNA recombination in coronaviruses [17].

ORF1ab
ORF1ab of coronaviruses occupy about two thirds of their genomes. It encodes the replicase polyprotein and is translated from ORF1a (11826 to 13425 nt) and ORF1b (7983 to 8157 nt). In all coronaviruses, a slippery sequence (UUUAAAC), followed by sequences that form a putative pseudoknot structure, are present at the junction between ORF1a and ORF1b. Translation occurs by a -1 RNA-mediated ribosomal frameshift at the end of the slippery sequence. Instead of reading the transcript as UUUAAACGGG, it will be read as UUUAAACCGGG. The replicase polyprotein is cleaved by papain-like protease(s) (PL pro ) and 3C-like protease (3CL pro ), proteins encoded by ORF1ab of the coronavirus genome, at consensus cleavage sites, into 15 to 16 non-structural proteins (nsps) named nsp1, nsp2, nsp3, etc ( Table 1). As the number of coronavirus genomes is expanding, novel cleavage sites have been discovered [3,18]. Some of these non-structural proteins encode proteins of essential functions, such as PL pro (nsp3), 3CL pro (nsp5), RNA-dependent RNA polymerase (Pol) (nsp12) and helicase (nsp13) (Figure 1). The genomes of all known members of Alphacoronavirus and Betacoronavirus subgroup A possess two PL pro (PL1 pro and PL2 pro ), while those of all known members of Betacoronavirus subgroup B, C and D and Gammacoronavirus possess only one PL pro (Table 1, Figure 1). The gene sequences that encode these conserved proteins are frequently used for phylogenetic analysis. In addition to the nsps with essential functions, bioinformatics analysis of some other nsps revealed their putative functions. Downstream to PL pro or PL1 pro in nsp3 is the X domain which contains putative ADP-ribose 1"-phosphatase (ADRP) activity [1]. In other microorganisms, such as Saccharomyces cerevisiae and other eukaryotes, ADRP and its functionally related enzyme cyclic nucleotide phosphodiesterase (CPDase), were important for tRNA processing [19]. ADP-ribose 1",2"-cyclic phosphate (Appr>p) is produced as a result of tRNA splicing. Appr>p is in turn converted to ADP-ribose 1"-phosphate (Appr-1"p) by CPDase and Appr-1"p is then further processed by ADRP. As for nsp13, nsp14 and nsp15, they possess a putative 3'-to-5' exonuclease (ExoN) domain of the DEDD superfamily [1], a putative poly(U)-specific endoribonuclease (XendoU) domain, and a putative S-adenosylmethionine-dependent ribose 2'-O-methyltransferase (2'-O-MT) domain of the RrmJ family respectively [1]. ADRP, CPDase, ExoN, XendoU and 2'-O-MT are enzymes in RNA processing pathways. Contrary to the pre-tRNA splicing pathway that ADRP and CPDase belong to, ExoN, XendoU and 2'-O-MT are enzymes in a small nucleolar RNA processing and utilization pathway.

Haemagglutinin esterase
In all members of Betacoronavirus subgroup A, a haemagglutinin esterase (HE) gene, which encodes a glycoprotein with neuraminate O-acetyl-esterase activity and the active site FGDS, is present downstream to ORF1ab and upstream to S gene ( Figure 1). The HE gene of coronavirus is believed to be acquired from influenza C virus, and is the most notable example of acquisition of new genes from non-coronavirus RNA donors by heterologous recombination [20]. The presence of HE genes exclusively in members of Betacoronavirus subgroup A, but not members of Betacoronavirus subgroup B, C and D suggested that the recombination had probably occurred in the ancestor of members of Betacoronavirus subgroup A, after diverging from the ancestor of other subgroups of Betacoronavirus.

Spike
The S proteins are responsible for the "spikes" present on the surface of coronaviruses and give this family of virus the characteristic crown-like appearance under electron microscopy. The S proteins are type I membrane glycoproteins with signal peptides. The S proteins are used for receptor binding and viral entry, and are the proteins with the most variable sequences in the coronavirus genomes. In some coronaviruses, the S proteins are cleaved into the S1 and S2 domains at consensus cleavage site (RRSRR of BCoV, RRSR of HCoV-OC43, RRKRR of HCoV-HKU1, RSRR of PHEV, RRADR of MHV, RRFRR of SDAV and RRFRR of IBV) (Table 1), with the sequences of the S1 domains much more variable than the S2 domains. In all coronaviruses, most of the S protein is exposed on the outside of the virus, with a short transmembrane domain at the C terminus, followed by a short cytoplasmic tail rich in cysteine residues. Two heptad repeats are present at the C termini of the extracellular parts of the S proteins. At the moment, no bioinformatics tool is available for accurate prediction of the receptor by analyzing the amino acid sequences of the S proteins of the corresponding coronaviruses.

Envelope and membrane
The E and M proteins are small transmembrane proteins associated with the envelope of all coronaviruses. In some coronaviruses, such as MHV and SDAV and possibly HCoV-HKU1, the translation of the E protein is cap-independent, via an internal ribosomal entry site. Although these two genes are conserved among all coronaviruses, they are not good targets for phylogenetic studies because of their short sequences.

Nucleocapsid
Similar to the conserved proteins encoded by ORF1ab, the N gene is also another common target for phylogenetic analysis. Due to its immunogenicity, it is also a common target for cloning and generation of recombinant proteins for serological assays.

Other small ORFS
Variable numbers of small ORFs are present between the various conserved genes in different lineages in the Coronaviridae family (Table 1, Figure 1). In some coronaviruses, small ORFs are present downstream to the N gene (Table 1, Figure 1). Most of these small ORFs are of unknown function. One exception is the small ORFs downstream to N in feline infectious peritonitis virus (FIPV) and TGEV, which are important for virulence and viral replication/assembly respectively [21][22][23]. Another notable exception is the 3a protein of SARSr-CoV, which forms a transmembrane homotetramer complex with ion channel function and modulates virus release [24]. For some of these small ORFs, such as ORF3a and ORF8 of SARSr-CoV genomes, their sequences are as highly variable as those of the S proteins. In particular, the most significant difference between human SARSr-CoV and civet SARSr-CoV genomes was a 29-bp deletion in the ORF8 of human SARSr-CoV [25].

Phylogeny
The first impression of the phylogenetic position of a strain or species of coronavirus is usually acquired by constructing a phylogenetic tree using a short fragment of a conserved gene, such as Pol or N. However, this can sometimes be misleading because the results of phylogenetic analysis using different genes or characters can be different. When SARSr-CoV was first discovered, it was proposed that it constituted a fourth group of coronavirus [26,27]. However, analyses of the amino-terminal domain of S of SARSr-CoV revealed that 19 out of the 20 cysteine residues were spatially conserved with those of the consensus sequence for Betacoronavirus [28]. On the other hand, only five of the cysteine residues were spatially conserved with those of the consensus sequences in Alphacoronavirus and Gammacoronavirus [28]. Furthermore, subsequent phylogenetic analysis using both complete genome sequence and proteomic approaches, it was concluded that SARSr-CoV is probably an early split-off from the Betacoronavirus lineage [1], and SARSr-CoV was subsequently classified as Betacoronavirus subgroup B and the historical Betacoronavirus as Betacoronavirus subgroup A. Therefore, the phylogenetic position of a coronavirus is best appreciated and confirmed by constructing phylogenetic trees using different genes in the coronavirus genome. The most commonly used genes along the coronavirus genome for phylogenetic studies include chymotrypsin-like protease, Pol ( Figure 2), helicase, S and N, because these genes are present in all coronavirus genomes and are of significant length. The envelope and membrane genes, although present in all coronavirus genomes, are too short for phylogenetic studies. It is noteworthy that the cluster formed by the three novel avian coronaviruses BuCoV HKU11, ThCoV HKU12 and MunCoV HKU13, which was originated proposed as group 3c [2], might represent a new coronavirus genus provisionally designated Deltacoronavirus (Figure 2). Using this approach of multiple gene phylogenetic studies, unique phylogeny of individual gene that may have biological significance may be discovered. During our phylogenetic study on Rh-BatCoV HKU2, another coronavirus that has was also found in the stool samples of Chinese horseshoe bats, its unique S protein phylogenetically distinct from the rest of the genome was discovered [15]. The S protein of Rh-BatCoV HKU2 is the shortest among S proteins of all coronaviruses and had less than 30% amino acid identities to those of all known coronaviruses, in contrast to other genes that showed higher amino acid identities to the corresponding genes in other members of Alphacoronavirus. When the S protein of Rh-BatCoV HKU2 is aligned with those of other members of Alphacoronavirus, many of the amino acid residues conserved among and specific to Alphacoronavirus were not found. Rather, the S protein of Rh-BatCoV HKU2 shares the two conserved regions of deletions both of 14 amino acids among members of Betacoronavirus in its C-terminus, suggesting that this segment of the S protein of Rh-BatCoV HKU2 may have co-evolved with the corresponding regions in Betacoronavirus. Most interestingly, a short peptide of 15 amino acids in the S protein of Rh-BatCoV HKU2 was found to be homologous to a corresponding peptide within the RBM in the S1 domain of SARSr-CoV. A similar peptide was also observed in SARSr-Rh-BatCoV, but not in any other known coronaviruses. These suggested that there is a common evolutionary origin in the S protein of SARSr-CoV, SARSr-Rh-BatCoV and Rh-BatCoV HKU2, and Rh-BatCoV HKU2 might have acquired its unique S protein from a yet unidentified coronavirus through recombination.

Evolutionary rate and divergence
In 1992, Sanchez et al. analyzed 13 enteric and respiratory TGEV related isolates and estimated the mutation rate of TGEV to be 710 -4 nucleotide substitutions per site per year [29]. 1n 2005, using linear regression, maximum likelihood and Bayesian inference methods, Vijgen et al. estimated the rate of evolution in BCoV to be 4.3 (95% confidence internal 2.7 to 6.0) 10 -4 nucleotide substitutions per site per year [30]. The estimation of time of divergence was first extensively used in coronaviruses after the SARS epidemic for estimating the date of interspecies jumping of SARSr-CoV from civets to humans and that from BCoV to HCoV-OC43 [31,32]. Subsequently, when various novel human and animal coronaviruses were discovered, evolutionary rates and divergence time in the Coronaviridae family were estimated by various groups using different approaches [31,[33][34][35]. Although Bayesian inference in BEAST is probably the most widely accepted approach and was used by most researchers, the use of different genes (ORF1ab, helicase, S and N genes) and datasets by different groups have resulted in considerable difference in the estimated history of coronaviruses. It was found that the S and N genes of PHEV, BCoV and HCoV-OC43 evolved at different rates, and the divergence time of the PHEV lineage and the HCoV-OC43 and BCoV lineage based on these two rates were 100 years different [31]. One group, using the helicase gene for analysis, has estimated the life history of coronaviruses to be as short as about 420 years [35].
Recently, we used the uncorrelated exponentially distributed relaxed clock model (UCED) in BEAST version 1.4 [36] to estimate the time of divergence of SARSr-CoV based on an alignment of a large set of SARSr-Rh-BatCoV ORF1 sequences collected over a period of five years. Under this model, the rates were allowed to vary at each branch drawn independently from an exponential distribution. Using this model and large dataset, the time of emergence of SARSr-CoV was at 1972, about 31 years before the SARS epidemic; that of SARSr-CoV in civet was at 1995, about eight years before the SARS epidemic; and the most recent common ancestor date of human and civet SARSr-CoV was estimated to be 2001.36, which was comparable to the dates estimated by other groups (Table 2) [37]. The date obtained from the figure of the reference but was not mentioned in the reference's text.

Recombination analysis
As a result of their unique random template switching during RNA replication, thought to be mediated by a "copy-choice" mechanism, coronaviruses have a high frequency of homologous RNA recombination [41,42]. Recombination in coronaviruses was first recognized between different strains of MHV and subsequently in other coronaviruses such as IBV, between MHV and BCoV, and between feline coronavirus (FCoV) type I and canine coronavirus (CCoV) [43][44][45][46]. As shown below, such recombination can result in the generation of coronavirus species or different genotypes within a coronavirus species. In our experience, the possibility of homologous RNA recombination and the possible part of the genome that recombination has taken place are usually first appreciated using bootscan analysis or phylogenetic analysis using different parts of the coronavirus genome. Other methods for recombination analysis, such as those in the RDP3 package, are also available. Then, the exact site of homologous RNA recombination would be best revealed by multiple sequence alignment.
The best documented example of generation of coronavirus species through homologous RNA recombination is the generation of FCoV type II by double recombination between FCoV type I and CCoV. It was first observed that the sequence of the S protein in FCoV type II was closely related to that of CCoV [47,48] but the sequence downstream of the E gene in FCoV type II was more closely related to that of FCoV type I strain than to CCoV [49,50]. This observation suggested that there might have been a homologous RNA recombination event between the genomes of CCoV and FCoV type I, resulting in the generation of FCoV type II. Further analysis by multiple sequence alignments pinpointed the site of recombination to a region in the E gene. A few years later, an additional recombination region in the Pol gene was also discovered, and it was concluded that FCoV type II originated from two recombination events between the genomes of CCoV and FCoV type I [43].
As for the generation of different genotypes in a coronavirus species through homologous RNA recombination, the best documented example is HCoV-HKU1. The possibility of homologous RNA recombination was first suspected when a few strains of HCoV-HKU1 showed differential clustering when the Pol, S and N genes were used for phylogenetic tree construction [51]. This observation has led to our subsequent study on complete genome sequencing of 22 strains of HCoV-HKU1. Recombination analysis by bootscan analysis and phylogenetic analysis using different parts of the 22 complete genomes revealed extensive recombination in different parts of the genomes, resulting in the generation of three genotypes, A, B and C, of HCoV-HKU1 [52]. Using multiple sequence alignment, two sites of recombination were pinpointed. The first one was observed in a stretch of 143 nucleotides near the 3' end of nsp6, where recombination between HCoV-HKU1 genotype B and genotype C has generated genotype A; and the second one in another stretch of 29 nucleotides near the 3' end of nsp16, where recombination between HCoV-HKU1 genotype A and genotype B has generated genotype C [52].

Codon usage bias
Recently, using the complete genome sequences of the 19 coronavirus genomes, we analyzed the codon usage bias in coronaviruses as well as selection of CpG suppressed clones by the immune system and cytosine deamination being the two major independent biochemical and biological selective forces that has shaped such codon usage bias [53]. In the study, we showed that the mean CpG relative abundance in the coronavirus genomes is markedly suppressed [53]. However, we observed that only CpG containing codons in the context of purine-CpG (ACG and GCG), pyrimidine-CpG (UCG and CCG) and CpG-purine (CGA and CGG); but not CpG-pyrimidine (CGU and CGC); are suppressed. However, when trinucleotide frequencies were analyzed in the 19 coronavirus genomes, all the eight trinucleotides with CpG were suppressed [53]. These indicate that another force that has led to an increase use of CGU and CGC as codons for arginine, but does not act on trinucleotides over the whole genome in general, is probably present. Furthermore, this force is probably unrelated to the relative abundance of the corresponding tRNA molecules in the hosts of the coronaviruses, as the pattern of bias in the hosts is not the same as that in the coronaviruses.
In addition to CpG suppression, marked cytosine deamination is also observed in all the 19 coronavirus genomes [53]. Using the six amino acids that are only encoded by NNC or NNU (asparagine, histidine, aspartic acid, tyrosine, cysteine and phenylalanine), hence excluding most other pressures that may affect the relative abundance of cytosine and uracil, it was observed that all NNU are markedly over represented with usage fractions of more than 0.700, whereas the usage fractions of all NNC are less than 0.300 [53]. For all codons that encode the same amino acid and with either U or C in any position, the usage fraction of the codon that uses U is always higher than the one that uses C in all coronaviruses. Furthermore, the percentage of U showed strong inverse relationships with the percentage of C in the coronavirus genomes. These suggest that cytosine deamination is another important biochemical force that shaped coronavirus evolution.
Interestingly, among all the 19 coronaviruses, HCoV-HKU1 showed the most extreme codon usage bias. HCoV-HKU1 is the only coronavirus that had effective number of codons outside the mean  2 standard deviations range. In addition, HCoV-HKU1 also possessed the lowest G + C content, highest GC skew, lowest percentages of G and C and highest percentage of U among all coronavirus genomes. Furthermore, HCoV-HKU1 showed extremely high NNU/NNC ratio of 8.835. The underlying mechanism for the extreme codon usage bias, cytosine deamination and G + C content in HCoV-HKU1 is intriguing.

Database
Rapid and accurate batch sequence retrieval is always the cornerstone and bottleneck for all kinds of comparative genomics and bioinformatics analysis. During the process of batch sequence retrieval for comparative genomics and other bioinformatics analysis of the coronavirus genomes that we have sequenced, we encountered a number of major problems about the coronavirus sequences in GenBank and other coronavirus databases. First, in GenBank, the non-structural proteins encoded by ORF1ab are not annotated. Second, in all databases, the annotations for the non-structural proteins encoded by ORFs downstream to ORF1ab are often confusing because they are not annotated using a standardized system. Third, multiple accession numbers are often present for reference sequences. These problems will often lead to confusion during sequence retrieval. Fourth, coronaviruses, especially SARSr-CoV, amplified from different specimens may contain the same gene or genome sequences, which will lead to redundant work when they are analyzed. In view of these problems, we have developed a comprehensive database, CoVDB, of annotated coronavirus genes and genomes, which offers rapid, efficient and user-friendly batch sequence retrieval and analysis [54]. In CoVDB, first, annotations on all non-structural proteins in the polyprotein encoded by ORF1ab of every single sequence were performed. Second, annotation was performed for the non-structural proteins encoded by ORFs downstream to ORF1ab using a standardized system. Third, all sequences with identical nucleotide sequences were labeled and one can choose to show or not to show strains with identical sequences. Fourth, this database contains not only complete coronavirus genome sequences, but also incomplete genomes and their genes. This is useful because some genes of coronaviruses, such as Pol, S and N, are sequenced much more frequently than others because they are either most conserved or least conserved, and therefore are particularly important for primers design for RT-PCR assays and evolutionary studies.

Concluding remarks
After the SARS epidemic, there has been a marked increase in the number of coronaviruses discovered and coronavirus genomes being sequenced. This increase in the number of coronavirus species and genomes, a comprehensive and user-friendly database for efficient sequence retrieval, and the ever improving bioinformatics tools have enabled us to perform meaningful genomic, phylogenetic, evolutionary rate and divergence, recombination, and other bioinformatics analyses on the Coronaviridae family. Three genera, Alphacoronavirus, Betacoronavirus and Gammacoronavirus, have been used to replace the traditional group 1, 2 and 3 coronaviruses. A fourth genus, Deltacoronavirus, which includes BuCoV HKU11, ThCoV HKU12 and MunCoV HKU13, is likely to emerge. Under this new classification system, bat coronaviruses dominate the Alphacoronavirus and Betacoronavirus genera and bird coronaviruses dominate the Gammacoronavirus and Deltacoronavirus genera. This huge diversity of coronaviruses in bats and birds has made them excellent gene pools for coronaviruses in these four genera [55].