The name “coronavirus” is derived from the Greek word for crown, a reference to the typical crown-like shape of these viruses. The first complete genome of a coronavirus (mouse hepatitis virus, MHV), a positive-sense, single-stranded RNA virus, was reported in 1990 [1]. Coronaviruses belong to the family Coronaviridae, and their genomes range from 26.4 kb (ThCoV HKU12) to 31.7 kb (SW1) in length [2], the largest among all known RNA viruses, with G + C contents varying from 32% to 43% [3]. The Orthocoronavirinae sub-family consists of four genera defined by their genetic properties: Alphacoronavirus, Betacoronavirus (subdivided into subgroups A, B, C and D), Gammacoronavirus and Deltacoronavirus. Coronaviruses can infect humans and diverse animal species, including swine, cattle, horses, camels, cats, dogs, rodents, birds, bats, rabbits, ferrets, minks, snakes and other wildlife.
In this study, we focused on 30 coronavirus (CoV) genomes: 28 viruses from Woo et al. (2010) [4]; the Middle East respiratory syndrome coronavirus (MERS-CoV), which first appeared in 2012; and the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which emerged in Wuhan (China) in December 2019. Only seven CoVs are known to infect humans. Two coronaviruses that cause relatively mild respiratory symptoms have been known since the 1960s, namely human CoV-229E (HCoV-229E) and human CoV-OC43 (HCoV-OC43). Human severe acute respiratory syndrome coronavirus (SARSr-CoV) was identified in 2003 and causes a more severe respiratory syndrome [5]. The human coronavirus NL63 (HCoV-NL63) was first identified in 2004 and causes respiratory symptoms in humans [6]; the fifth member, human CoV-HKU1 (HCoV-HKU1), was described in 2005 [7]. More recently, the pathogenic Middle East respiratory syndrome coronavirus (MERS-CoV) was identified as the sixth human coronavirus [8]. Finally, the present outbreak of a coronavirus-associated acute respiratory disease, called coronavirus disease 19 (COVID-19), is caused by human SARS-CoV-2 infections [9].
The newly sequenced SARS-CoV-2 genome contains two open reading frames (ORFs), ORF1a and ORF1ab, the latter of which encodes the replicase polyproteins, as well as four structural proteins [11]: the spike surface glycoprotein (protein S), the small envelope protein (protein E), the matrix protein (protein M) and the nucleocapsid protein (protein N).
Codon usage bias (CUB) occurs in many genomes, including RNA genomes, and is shaped by mutation and selection [13]. The non-random choice of synonymous codons is known to vary among species that are potential hosts for viruses [16]. It is therefore important to study patterns of codon usage in coronaviruses, because CUB can be related to the driving forces that shape the evolution of small RNA viruses. Mutational bias has been considered the major determinant of codon usage variation among RNA viruses [17]. Indeed, RNA viruses show a fairly high effective number of codons (ENC > 45), pointing to weakly biased, nearly random codon usage, whereas the codon adaptation index (CAI) indicates that the viral CUB is consistent with that of the host, as observed in the equine infectious anemia virus (EIAV) or Zaire ebolavirus (ZEBOV) [18].
The aims of this study were to perform a comprehensive analysis of the nucleotide composition, codon usage and rate of protein divergence of SARS-CoV-2, and to thereby draw inferences regarding its leading evolutionary determinants.
To investigate the factors determining the codon usage patterns of SARS-CoV-2 and other coronaviruses, we used several analytical methods. First, we calculated the relative synonymous codon usage (RSCU) values of SARS-CoV-2. Despite the relatively high mutation rate that characterizes SARS-CoV-2, like other RNA viruses, we could not find any significant differences in codon usage between its genome and those of the other CoVs. Moreover, the associated codon usage vectors did not cluster by geographical position, further supporting the common origin of these genomes.
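As an illustration of the RSCU calculation mentioned above, the following minimal Python sketch computes the RSCU of each codon as its observed count divided by the mean count of all codons synonymous with it. The function name `rscu`, the use of the standard genetic code and the handling of unobserved amino acids are assumptions made for this example; they are not taken from the pipeline used in this study.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group synonymous codons by the amino acid they encode.
FAMILIES = {}
for _codon, _aa in CODON_TABLE.items():
    FAMILIES.setdefault(_aa, []).append(_codon)


def rscu(cds):
    """Return {codon: RSCU} for one in-frame coding sequence (hypothetical helper).

    Stop codons and single-codon amino acids (Met, Trp) are skipped, as are
    amino acids that do not occur in the sequence (assumption of this sketch).
    """
    seq = cds.upper().replace("U", "T")  # accept RNA or DNA alphabets
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    values = {}
    for aa, family in FAMILIES.items():
        if aa == "*" or len(family) == 1:
            continue
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent: RSCU undefined
        for c in family:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(family) / total
    return values
```

An RSCU of 1 means a codon is used as often as expected under uniform synonymous usage; values above 1 indicate preferred codons. Collecting the values in a fixed codon order gives one vector per genome, which can then be compared or clustered.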
In line with the nucleotide composition common to other RNA viruses such as SARS, our results show that SARS-CoV-2 has a high AU content and a low GC content. The results also indicate that codon usage bias exists and that SARS-CoV-2 prefers U-ending codons. The bias is slight, as shown by a mean ENC value of 51.9 (a value greater than 45 is taken to indicate a slight codon usage bias due to mutation pressure or nucleotide compositional constraints). These findings were corroborated by the CAI analysis, which measures how closely the codon usage of a given protein-coding gene matches that of a reference set of the most highly expressed genes in the host. This suggests that RNA viruses with high ENC values (and low CAI) adapt to the host with randomly chosen codons. A slightly biased codon usage pattern might therefore allow the virus to use several codons for a given amino acid, which might be beneficial for viral replication and translation in host cells.
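To make the CAI idea concrete, here is a minimal Python sketch that assumes the classical definition: each codon receives a relative adaptiveness weight (its count in the host reference set divided by the count of the most used synonymous codon), and the CAI of a gene is the geometric mean of these weights. The function names, the pseudocount of 0.5 for codons absent from the reference set, and the skipping of amino acids that never occur in the reference are assumptions of this example rather than details reported in the study.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(cds):
    """Split an in-frame sequence into codons, accepting RNA or DNA letters."""
    seq = cds.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_cds, pseudocount=0.5):
    """Return {codon: w} from a list of host reference coding sequences."""
    counts = Counter()
    for cds in reference_cds:
        counts.update(c for c in split_codons(cds) if c in CODON_TABLE)

    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    weights = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # stops, Met and Trp carry no synonymous information
        if sum(counts[c] for c in family) == 0:
            continue  # amino acid absent from the reference set (assumption: skip)
        # Assumption: unobserved codons get a small pseudocount to avoid log(0).
        adjusted = {c: counts[c] if counts[c] > 0 else pseudocount for c in family}
        best = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / best
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of the informative codons in a gene."""
    logs = [math.log(weights[c]) for c in split_codons(cds) if c in weights]
    if not logs:
        raise ValueError("sequence contains no codons covered by the weights")
    return math.exp(sum(logs) / len(logs))
```

A CAI of 1 means the gene uses only the codons most frequent in the reference set; lower values indicate weaker adaptation to that host. Repeating the calculation with weights derived from different candidate hosts gives one CAI value per gene and host.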
We then analyzed in more detail the relationships between SARS-CoV-2 and various possible hosts other than humans. For this purpose, we calculated the average CAI and SiD values of individual SARS-CoV-2 genes against different candidate hosts. Although previous studies do not support transmission of SARS-CoV-2 from snakes to humans [34], we found that SARS-CoV-2 has its highest CAI values when these two organisms are taken as references, which suggests that its codons are best optimized for snakes and humans. Moreover, the adaptiveness of SARS-CoV-2’s codon usage, as measured by SiD, is also fairly high for pangolins, rats and bats, which is consistent with previous hypotheses regarding the possible origin of SARS-CoV-2 from these species [9].
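The text above does not spell out how SiD is computed. A minimal Python sketch, assuming the commonly used similarity index built from the cosine between the RSCU vectors of virus and host, R(A,B) = Σ a_i b_i / sqrt(Σ a_i² · Σ b_i²) and D(A,B) = (1 − R(A,B)) / 2, is given below. The function name, the dictionary inputs and the restriction to codons present in both dictionaries are assumptions of this example.

```python
import math


def similarity_index(virus_rscu, host_rscu):
    """Return D(A,B) = (1 - R(A,B)) / 2 for two {codon: RSCU} dictionaries.

    Only codons present in both dictionaries are compared (assumption).
    """
    codons = sorted(set(virus_rscu) & set(host_rscu))
    if not codons:
        raise ValueError("no codons shared between the two RSCU tables")

    dot = sum(virus_rscu[c] * host_rscu[c] for c in codons)
    norm = math.sqrt(
        sum(virus_rscu[c] ** 2 for c in codons)
        * sum(host_rscu[c] ** 2 for c in codons)
    )
    if norm == 0:
        raise ValueError("one of the RSCU vectors is all zeros")

    r = dot / norm  # cosine of the angle between the two RSCU vectors
    return (1 - r) / 2
```

Under this definition the index is 0 when the two RSCU profiles are proportional and grows as they diverge. Applying it gene by gene against each candidate host yields a table of values that can be averaged per host.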
The ENC-plot analysis indicated that natural selection plays an important role in the codon choice of the five conserved viral genes under study, namely RdRP, S, E, M and N. However, the points for genes N, S and RdRP are more scattered below the theoretical curve than those for genes M and E, implying that in the latter the codon usage reflects mutational bias more than natural selection. According to the neutrality plot analysis, genes S and RdRP are subject to the strongest action of natural selection. Conversely, the regression line for gene M is closer to the bisector than those of the other genes, meaning that this gene is the least subject to natural selection. Genes E and N are in an intermediate situation.
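The two plots discussed above can be sketched in a few lines of Python. The example assumes Wright's expected curve for the ENC-plot, ENC = 2 + s + 29 / (s² + (1 − s)²) with s the GC content at third codon positions, and an ordinary least-squares regression of GC12 on GC3 for the neutrality plot. The function names are hypothetical, and GC3 is computed here over all codons for simplicity, whereas many analyses restrict it to synonymous third positions.

```python
def gc_fraction(bases):
    """Fraction of G or C among a list of single nucleotides."""
    return sum(b in "GC" for b in bases) / len(bases)


def gc_positions(cds):
    """Return (GC12, GC3) for one in-frame coding sequence."""
    seq = cds.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons:
        raise ValueError("sequence is shorter than one codon")
    gc1 = gc_fraction([c[0] for c in codons])
    gc2 = gc_fraction([c[1] for c in codons])
    gc3 = gc_fraction([c[2] for c in codons])
    return (gc1 + gc2) / 2, gc3


def expected_enc(gc3):
    """Wright's theoretical ENC when codon usage depends only on GC3."""
    return 2 + gc3 + 29 / (gc3 ** 2 + (1 - gc3) ** 2)


def neutrality_regression(gc3_values, gc12_values):
    """Least-squares slope and intercept of GC12 (y) against GC3 (x)."""
    n = len(gc3_values)
    mean_x = sum(gc3_values) / n
    mean_y = sum(gc12_values) / n
    sxx = sum((x - mean_x) ** 2 for x in gc3_values)
    if sxx == 0:
        raise ValueError("GC3 values show no variation")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(gc3_values, gc12_values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

In an ENC-plot, a gene whose observed ENC falls well below `expected_enc(gc3)` is usually read as being shaped by more than compositional constraints. In the neutrality plot, a regression slope close to 1 (the bisector) points to mutational pressure, while a slope close to 0 points to selection, which is the reading applied to genes M, S and RdRP above.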
Forsdyke plots were employed to analyze the mutation statuses of these five genes. Proteins M and E were found to have gentler slopes, reflecting a tendency to evolve slowly as nucleotide mutations accumulate in their respective genes. Conversely, the steeper slopes for genes N, RdRP and S (the last of which encodes the protein responsible for the “spikes” on the surface of coronaviruses) indicate that these three genes, and therefore their protein products, evolve faster than the other two.
Interestingly, all x-intercepts (see Table 4) are negative, and the degree of negativity correlates with the low slope values. Recalling that the x-axis (RNA change) can be viewed as a time axis, it appears that the RNA segments encoding M and E are as resistant to change during the early period of genome divergence (negative x values) as they are during the later period of divergence, when phenotypic changes can be naturally selected (positive x values). M and E are less flexible at the protein level. On the other hand, the RNA segments encoding S, RdRP and N are flexible during the early genome divergence period (high negative x values). As a result, these segments would have been more able to contribute to the initial genotypic divergence that would have decreased recombination between two genomes diverging in a common cell, thereby facilitating speciation. Under the protection of this global “reproductive isolation”, the segments could then evolve during the period corresponding to positive x values. Without reproductive isolation, blending would have occurred and phenotypic divergence would have been less likely.
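For readers who wish to reproduce the slope and x-intercept of a Forsdyke plot, a minimal Python sketch follows. It assumes that each point is a pair of orthologous genes, with RNA divergence on the x-axis and protein divergence on the y-axis, and that divergence is measured as the simple proportion of differing sites in ungapped, pre-aligned sequences of equal length. These choices, and all function names, are assumptions of this example; the divergence measure actually used in the study is not restated here.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values show no variation")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def forsdyke_summary(pairs):
    """Return (slope, y_intercept, x_intercept) for one gene.

    `pairs` is a list of tuples (rna_a, rna_b, protein_a, protein_b), one per
    pair of genomes being compared (hypothetical input layout).
    """
    xs = [p_distance(rna_a, rna_b) for rna_a, rna_b, _, _ in pairs]
    ys = [p_distance(prot_a, prot_b) for _, _, prot_a, prot_b in pairs]
    slope, intercept = fit_line(xs, ys)
    # The x-intercept is where the fitted protein divergence reaches zero.
    x_intercept = -intercept / slope if slope != 0 else float("nan")
    return slope, intercept, x_intercept
```

With this layout, a steeper slope means that protein divergence accumulates faster per unit of RNA divergence, and a negative x-intercept corresponds to a positive y-intercept of the fitted line.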
In future studies, it would be interesting to explore why M and E are less flexible and S, RdRP and N are more flexible towards preventing recombination. Viral RNA recombination requires recognition between two comparable RNA regions and then extensive base pairing, mediated by the kissing stem-loop interaction, to test sequence complementarity thoroughly. Perhaps the M and E genes lack the ability to form stem-loops, but this inflexibility during phenotypic divergence is suggestive of high conservation.
The findings of the present study could be useful for developing diagnostic reagents and probes for detecting a wide range of viruses and isolates in one test and for vaccine development, utilizing the information about codon usage patterns in these genes.
In addition, an interesting potential approach to treating pneumonia related to SARS-CoV-2 and similar viruses is a low dose of ionizing radiation (LDIR). SARS-CoV-2 is an RNA virus with an expected mutation rate similar to that of other RNA viruses, as discussed above. This mutation rate is usually much higher than that of any human host. Therefore, as discussed in a recent paper [43], any antiviral drug against SARS-CoV-2 would exert an intense selective pressure on the virus. This may result in highly adaptive, treatment-resistant virus types with enhanced pathogenicity. It should also be considered that the virus provokes a systemic inflammatory response with detrimental effects on the host organism, such as acute respiratory distress syndrome (ARDS), a form of severe hypoxemic respiratory failure associated with major inflammatory injury to the lung cells and extravasation of protein-rich edema fluid into the airspace [44]. Low-dose radiation (<0.5 Gy) has been shown, in some cases, to have anti-inflammatory effects and to modulate the immune response, and has even been suggested for treating pneumonia [46]. LDIR exposure is not expected to exert significant selective pressure on the new coronavirus. Therefore, and based also on recent suggestions, one can hypothesize that a low-dose treatment of 30 to 100 cGy to the lungs of a patient with COVID-19 pneumonia could significantly ameliorate the inflammation and relieve the life-threatening systemic symptoms of the infection [47].