1. Introduction
SARS-CoV-2 uses its trimeric spike protein to bind host angiotensin-converting enzyme 2 (ACE2) and to fuse with the cell membrane to gain cell entry [
1,
2,
3,
4]. This is a multi-step process involving three separate S protein cleavage events to prime the SARS-2-S for interaction with ACE2 [
2,
3], followed by membrane fusion and cell entry. These processes involve different domains of the S protein interacting with the host cell and with other intracellular and extracellular components. Efficiency at each step could contribute to virulence and infectivity. Disrupting any of these steps could lead to a medical cure.
The domain structure is very similar between SARS-S (UniProtKB: P59594) and SARS-2-S (UniProtKB: P0DTC2). Both are cleaved to generate S1 and S2 subunits at specific cleavage sites (
Figure 1A). S1 serves the function of receptor-binding and contains a signal peptide (SP) at the N terminus, an N-terminal domain (NTD), and receptor-binding domain (RBD). S2 (
Figure 1A) functions in membrane fusion to facilitate cell entry, and it contains a fusion peptide (FP) domain, internal fusion peptide (IFP), two heptad-repeat domains (HR1 and HR2), transmembrane domain, and a C-terminal domain [
2,
3,
5,
6,
7,
8]. However, there are also significant differences between SARS-S and SARS-2-S. For example, the contact amino acid sites between SARS-S and human ACE2 (hACE2) [
5,
7,
9,
10] differ from those between SARS-2-S and hACE2 [
11,
12,
13,
14]. This may explain why some antibodies that are effective against SARS-S are not effective against SARS-2-S [
4], especially those developed to target the ACE2 binding site of SARS-S [
15]. In this article, numerous experiments on SARS-S are considered to facilitate comparisons and to highlight differences between the two.
2. General Features of SARS-S and SARS-2-S
SARS-2-S is 1273 aa long, in contrast to 1255 aa in SARS-S. Individual protein domains in the S protein tend to fold independently and are associated with specific functions. The numbers (
Figure 1A) that indicate the start and end of individual domains in SARS-S and SARS-2-S may mislead readers into thinking that the boundaries are based on clearly recognizable physicochemical landmarks. In fact, these numbers are for rough reference only. For example, the boundaries of the RBD in SARS-S mainly result from experiments with different RNA clones containing different parts of the RBD [
17,
18,
19]. The 5′ side is delimited by the site where upstream mutations/deletions do not affect receptor binding, but downstream mutations/deletions do affect receptor binding. Similarly, the 3′ side is where upstream mutations/deletions affect receptor binding, but downstream mutations/deletions do not have an effect. Boundaries of some domains are substantiated by protein structure, for example, the boundaries of RBD [
11,
12,
13,
14,
20], but some are not substantiated by protein structure.
Some inter-domain segments (
Figure 1C,D) could be much more conserved than neighboring domains. For example, C822, D830, L831, and C833 in SARS-S (corresponding to C840, D848, L849, and C851 in SARS-2-S) are located between FP and IFP but are highly conserved and critically important for membrane fusion [
21]. Similarly, V601 in SARS-S (corresponding to V615 in SARS-2-S) does not belong to any recognized domain (
Figure 1A) but is highly conserved. Replacing it by G contributes to viral escape from neutralizing antibodies [
22]. Experimental mutations at sites 1111–1130 in SARS-S, upstream of HR2 (
Figure 1A), are also associated with viral escape from neutralizing antibodies [
23], suggesting that mutations at those sites affect protein structure. This segment is highly conserved in SARS-2-S and related viruses, and antibodies targeting this region provide broad protection against heterogeneous viral strains [
23]. In short, inter-domain segments may not be functionally less important than those recognized domains, and the sequences in these inter-domain regions are no less conserved than those within domains. More studies will reveal their functions leading to more detailed structure-function maps.
Experiments with a truncated SARS-S excluding the C-terminus indicate that it is synthesized in the endoplasmic reticulum (ER), modified in the Golgi apparatus, glycosylated, and eventually exported to the membrane [
24]. Spike protein synthesis following SARS-CoV infection can cause an unfolded protein response (UPR) [
25], suggesting its association with the ER. The UPR restores ER homeostasis by upregulating chaperone proteins to increase the protein-folding capacity of the ER, and by reducing translation and increasing protein degradation to reduce the folding load (reviewed in [
26]). When prolonged UPR fails to restore ER homeostasis, it often triggers apoptosis. Adenovirus-mediated overexpression of S2 induces apoptosis [
27] and may have implications for viral pathogenicity and secondary bacterial infection.
Coronavirus S proteins are heavily glycosylated with 21–35 N-glycosylation sites [
17]. Replacing these N-glycosylation sites in SARS-S alters protein folding and expression [
18]. Glycosylation events have been identified mainly in two ways. The first way has been to compare the expected molecular weight of an expressed segment of S protein containing a putative N-glycosylation site against the actual molecular weight [
18]. An increase in the actual molecular weight is assumed to be due to N-glycosylation. The second way has been by high resolution mass spectrometry [
28]. O-glycosylation was also found in SARS-2-S [
28]. Glycosylation is not required for receptor-binding in SARS-S [
18] or MHV (murine hepatitis virus) [
17].
3. Cleavage Sites
The S protein undergoes two crucial cleavage events, with the first splitting S1 and S2 and the second splitting S2 into FP and S2′ (
Figure 1A). The most pronounced difference between SARS-S and SARS-2-S is an additional furin cleavage site (site 1,
Figure 2A) resulting from an insertion of 12 nt at the boundary between S1 and S2 [
8,
11,
29]. This additional furin cleavage site is shared among all sequenced SARS-CoV-2 genomes, but absent in all their closest known relatives such as bat RaTG13 and those isolated from pangolin [
29]. The seemingly sudden appearance of this additional polybasic furin cleavage site 1 has been a lasting source of the conspiracy theory that SARS-CoV-2 is man-made, which is discussed later.
The furin cleavage site was predicted in February 2020 [
8] and, in May 2020, its functional importance was confirmed, i.e., that the cleavage was essential for efficient viral entry into human lung cells, especially in cell-cell fusion to form syncytium to facilitate viral spread from one cell to another [
2]. This exemplifies the rapidity in the progress of SARS-2-S research.
The cleavage of the S protein into S1 and S2 is an essential step in viral entry into a host cell, and needs to occur before viral fusion with the host cell membrane [
6]. Different cleavage sites targeted by different proteases are often associated with drastically different virulence and host cell tropism in various RNA viruses. For example, the low-pathogenicity forms of the H1N1 influenza virus have a cleavage site recognized by trypsin-like proteases [
31] in contrast to the high-pathogenicity forms with a furin cleavage site cleaved by furin-like proteases [
32]. Trypsin-like proteases typically have a narrow tissue distribution in humans. For example, trypsin-like transmembrane serine protease 11D (gene name TMPRSS11D) is expressed only in the esophagus (
Figure 2C). Another member of the trypsin family, PRSS1, is expressed mainly in the pancreas [
30]. In contrast, furin-like proteases are ubiquitous (
Figure 2C). Thus, if a coronavirus needs to be cleaved by TMPRSS11D or PRSS1, then its cellular entry is limited to the esophagus, where TMPRSS11D is expressed (
Figure 2C) or the pancreas where PRSS1 is expressed. However, if the virus gains a furin cleavage site, then this restriction is removed because FURIN is ubiquitous in human tissues (
Figure 2C), resulting in dramatic broadening of host cell tropism. In this context, the S protein contributes to host specificity [
6], and also to tissue specificity through its differential requirement of tissue-specific proteases. For this reason, viruses with different cell tropism may accumulate tissue-specific genomic signatures [
33].
Because the C-terminus of the spike protein is anchored inside the viral membrane, one might expect the distal S1 to be lost after cleavage at site 1. However, the distal S1 subunit remains non-covalently bound to the S2 unit in the prefusion conformation after cleavage at site 1 [
10,
11,
34]. In order to stabilize the prefusion conformation to facilitate vaccine design [
10,
35] or structural determination [
11,
12], the furin site is often mutated so that it is not cleaved. For example, the cleavage site RRAR was changed to GSAS in obtaining protein structure 6VSB [
12], and to SGAG in obtaining protein structures 6VXX and 6VYB [
11].
The cleavage site 2 (
Figure 2A) is highly conserved in all sequenced SARS-CoV-2 genomes, as well as in all close relatives including SARS-CoV. This site is likely cleaved by cathepsin L in endosomes in both SARS-S [
34,
36,
37,
38] and SARS-2-S [
4]. Cathepsin L requires an aromatic residue at P2 and a hydrophobic residue at P3 [
39]. Cleavage site 2 has Y at P2 and A at P3 to satisfy this requirement (
Figure 2A). The low pH in endosomes is optimal for cathepsin L activity. Inhibitors of cathepsin L block SARS-CoV infection [
36].
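The P2/P3 requirement described above can be checked mechanically for any site written with a scissile-bond marker. The following is a minimal sketch, not a validated protease-specificity model: the residue classes `AROMATIC` and `HYDROPHOBIC` and the helper names `p_sites` and `fits_cathepsin_l` are assumptions introduced here for illustration only.

```python
# Residue classes are illustrative assumptions, not taken from the review.
AROMATIC = set("FWY")
HYDROPHOBIC = set("AVILMFWY")


def p_sites(site: str, n: int = 3) -> dict:
    """Return residues P1..Pn, counted upstream from the scissile bond '|'."""
    upstream, _downstream = site.split("|")
    return {f"P{i}": upstream[-i] for i in range(1, n + 1)}


def fits_cathepsin_l(site: str) -> bool:
    """True if P2 is aromatic and P3 is hydrophobic (the stated requirement)."""
    p = p_sites(site)
    return p["P2"] in AROMATIC and p["P3"] in HYDROPHOBIC


# Cleavage site 2 as written in the text (vertical bar = scissile bond).
site2 = "VASQSIIAYT|MSLGAEN"
print(p_sites(site2))           # P1 = T, P2 = Y, P3 = A
print(fits_cathepsin_l(site2))  # True under the assumed residue classes
```

Counting positions from the bond rather than from the start of the string keeps the check independent of how much flanking sequence is supplied.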
While cleavage site 1 (
Figure 2A) is known to be cleaved during SARS-CoV-2 assembly, most likely by furin in the Golgi apparatus [
2,
11,
24,
40], it is less clear how cleavage site 2 (
Figure 2A) is used in SARS-2-S priming. One could hypothesize that, if cleavage site 1 is cleaved efficiently [
2], then cleavage site 2 would be redundant and could accumulate mutations in the SARS-2-S gene without a negative impact on the fitness of the virus. However, the amino acid sites near site 2 (VASQSIIAYT|MSLGAEN, where the vertical bar indicates the scissile bond,
Figure 2A) were perfectly conserved among all SARS-2-S sequences available by 8 May 2020. In contrast, each site of the 4-AA insertion (PRRA,
Figure 2A) has experienced at least one amino acid replacement. Thus, in spite of the additional furin cleavage site 1, cleavage site 2 (
Figure 2A) may still be functionally important for it to be so evolutionarily conserved.
In addition to site 1 and site 2 (
Figure 2A) that cleave SARS-2-S into the S1 and S2 domains, a third cleavage site also exists for cleaving S2 into FP and S2′ domains (
Figure 2B,D). This site, often referred to as the S2′ site, is likely cleaved by TMPRSS2 [
41,
42,
43,
44], consistent with the finding that TMPRSS2 is needed for SARS-CoV-2 infection [
3]. In particular, TMPRSS2 needs to be expressed in the target cell for it to be infected [
41]. Because TMPRSS2 is active mainly in the membrane or extracellular space, the third cleavage site is not cleaved during SARS-CoV assembly [
24,
41]. This site can also be cleaved by trypsin. Exogenous trypsin can enhance membrane fusion and SARS-CoV infection [
45,
46]. Trypsin cleaves SARS-S at R797 (
Figure 2D), consistent with the finding that an R797N mutation abolishes this trypsin-induced membrane fusion [
34].
The temporal sequence of cleavage events is not clear, although the following order is likely: For SARS-2-S, furin cleaves at cleavage site 1 during viral assembly [
2]. Then, the third cleavage site is cleaved by TMPRSS2 to yield FP and S2′ (
Figure 2D) to trigger membrane fusion, syncytium formation, and viral entry into a target cell [
3,
11,
34]. For SARS-S, cleavage site 1 does not seem to be used efficiently. The transmembrane TMPRSS2, if expressed, cleaves the third cleavage site to yield FP and S2′ and to trigger cell fusion and viral entry [
3]. This may be termed the membrane-TMPRSS2 pathway of viral entry. If SARS-S is not cleaved by TMPRSS2 into FP and S2′, then the virus can enter the cell through endocytosis with cleavage site 2 cleaved by cathepsin L. This is the endosome-cathepsin pathway of viral entry [
41,
46].
6. The Spike Protein in Vaccine Development
Almost all vaccine candidates against SARS-CoV-2 are based on the spike protein, including the FDA-approved Pfizer/BioNTech and Moderna vaccines that use mRNA encoding a modified spike protein stabilized in its prefusion conformation. It is important for the immune system to respond to the virus at the prefusion stage, because it would probably be too late for the immune system to intervene at the postfusion stage when the virus is gaining entry into an uninfected cell. Therefore, the rationale of vaccine development is to produce a spike protein stabilized in the prefusion conformation as a target to train the immune system to act against it.
Two structural studies on spike proteins, one on Betacoronavirus HKU1 [
10] and the other on MERS-CoV [
79], have demonstrated that replacing two consecutive amino acids by proline near the transition from HR1 to the central helix (
Figure 3) would strongly contribute to the stabilization of the resulting spike protein at the prefusion conformation. These amino acid sites correspond to sites 986 and 987 in SARS-2-S (
Figure 5), located at the transitional bend between HR1 and the central helix (
Figure 3). The amino acids at these two sites are not conserved, being NL in CoV-HKU1, VL in MERS-CoV, and KV in SARS [
79], suggesting that they are probably not functionally important. However, the two amino acid replacements (K986P, V987P), shown in
Figure 5, stabilize the resulting spike protein in the prefusion state and contribute to vaccine efficiency. The mutant SARS-2-S spike protein with these proline replacements is referred to as S-2P [
85,
86], which is encoded by the mRNA vaccines from both Pfizer/BioNTech (BNT162b2) and Moderna (mRNA-1273). A new spike protein variant (HexaPro) that includes four additional amino acid replacements by proline (F817P, A892P, A899P, and A942P) is even more stable and more highly expressed than the original S-2P [
35].
7. Structural Insights into the Emergence of New Viral Variants
Here, one example is described to illustrate how structural biology can shed light on the emergence of new viral variants. In an experiment that used neutralizing monoclonal antibodies to select neutralization-escaping SARS-CoV variants [
22], one of the four variants was V601G, located in SARS-S within the segment 594VAVLYQDVNCTDV606 (V601 is the eighth residue of this segment). The identification of this infection-enhancing V601G variant is puzzling because one would not expect such a V→G replacement to have much phenotypic effect on the S protein. First, site 601 is not involved in receptor binding. Second, both V and G are small and nonpolar, so the replacement is conservative and should not cause a significant structural perturbation of the S protein. Does a replacement of a small nonpolar V by a smaller nonpolar G really matter? One cannot answer the question without structural evidence. It can only be inferred that site 601 is functionally important, and that having the smallest amino acid at site 601 (or in its vicinity) is beneficial to SARS-CoV.
A V601G mutation requires a transversion (i.e., from codon GUN to GGN). Because of proofreading in coronavirus genome replication [
87,
88,
89], transversional mutations are much rarer than transitions. For this reason, V→G at site 601 is expected to occur much less frequently than D→G at site 600, because the latter requires a transition (from codon GAY to GGY) instead of a transversion. Therefore, a small G can be gained more readily by a D600G mutation than by a V601G mutation. The segment 594VAVLYQDVNCTDV606 in SARS-S corresponds to 608VAVLYQDVNCTEV620 in SARS-2-S; therefore, a D600G mutation in SARS-S is equivalent to D614G in SARS-2-S. In this context, it is not surprising that a D614G variant of SARS-CoV-2 quickly increased in frequency [
90], indicating a strong selective advantage.
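The codon argument above rests on two small pieces of bookkeeping: classifying a single-nucleotide change as a transition or a transversion, and mapping residue numbers onto the quoted segments. The sketch below illustrates both. The function names are hypothetical, and the degenerate third codon position (N or Y) is replaced by U purely so that concrete codons can be compared.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}


def substitution_type(ref: str, alt: str) -> str:
    """Purine<->purine or pyrimidine<->pyrimidine is a transition; else transversion."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def codon_change(ref_codon: str, alt_codon: str) -> tuple:
    """Classify a codon change that differs at exactly one position.

    Returns (1-based codon position, substitution type).
    """
    diffs = [i for i, (a, b) in enumerate(zip(ref_codon, alt_codon)) if a != b]
    if len(diffs) != 1:
        raise ValueError("codons must differ at exactly one position")
    i = diffs[0]
    return i + 1, substitution_type(ref_codon[i], alt_codon[i])


def residue_at(segment: str, start: int, position: int) -> str:
    """Residue at a numbered position, given the number of the first residue."""
    return segment[position - start]


print(codon_change("GUU", "GGU"))  # V -> G (GUN -> GGN): transversion at position 2
print(codon_change("GAU", "GGU"))  # D -> G (GAY -> GGY): transition at position 2
print(codon_change("GAU", "AAU"))  # D -> N (GAY -> AAY): transition at position 1

print(residue_at("VAVLYQDVNCTDV", 594, 600))  # D600 in SARS-S
print(residue_at("VAVLYQDVNCTDV", 594, 601))  # V601 in SARS-S
print(residue_at("VAVLYQDVNCTEV", 608, 614))  # D614 in SARS-2-S
```

The last three lines confirm that site 600 in the SARS-S segment and site 614 in the SARS-2-S segment occupy the same offset and both carry D.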
Now, there are two alternative hypotheses concerning the selective advantage of the D614G mutation as follows: (1) the benefit is due to G being the smallest amino acid, or (2) the benefit is due to the loss of a negative charge altering electrostatic interactions. The second hypothesis may be dismissed on the following empirical grounds: Codons encoding D (GAY) could also mutate to AAY encoding N through a single transition. Such a mutation would lose the negative charge carried by D. If it is the loss of a negative charge that is beneficial, we would expect AAY and GGY to be roughly equally represented at this site. However, AAY is entirely missing in sequenced SARS-2-S, which goes against the second hypothesis. Unfortunately, exclusion of the second hypothesis neither implies confirmation of the first (because there are other alternatives), nor helps us understand why the D614G mutation enhances viral fitness. Only through structural studies [
91] can we hope to gain a mechanistic understanding of the effect of the D614G mutation on the S protein.
8. The Spike Protein and the Conspiracy Theory
As previously mentioned, the additional polybasic furin cleavage site 1 (
Figure 2A) has been a lasting source of the conspiracy theory that SARS-CoV-2 is man-made. Advocates of the conspiracy theory assume that scientists have ignored or refused to address their legitimate concerns. In this review, two points are made. First, the evidence for a natural origin of SARS-CoV-2 is accumulating, albeit at a rate slower than desired. Second, the reasons behind the conspiracy theory have been seriously considered by scientists and have been found not to be compelling.
There are three main reasons for the conspiracy theory, all involving the polybasic furin cleavage site (
Figure 2A). First, the furin cleavage site has not been observed in any close relatives of SARS-CoV-2 in nature. A somewhat similar furin cleavage site was present at a roughly homologous site in S protein of the murine hepatitis virus [
45] and in a few alphacoronaviruses [
2,
8,
29]. However, it is not clear how SARS-CoV-2 could gain it from these remote relatives. While recombination might be a possibility, there is hardly any sequence homology between SARS-2-S and its homologues in the murine hepatitis virus or alphacoronaviruses in the sequences flanking the cleavage site; a recombination origin of the cleavage site is therefore tenuous at present. An insertion at the same site was found in a bat-derived coronavirus [
92], but the inserted sequence was different and could not function as a furin cleavage site. A novel bat-derived coronavirus (RmYN02) was reported to have an insertion bearing a weak semblance to the polybasic furin cleavage site in
Figure 2A [
92], suggesting the possibility of a natural origin of the polybasic furin cleavage site. However, the sequence homology between RmYN02 and SARS-2-S is low, and it is not clear whether the insertion in RmYN02 is real or an artefact of alignment. Therefore, if one cannot offer a plausible hypothesis of natural origin of the polybasic site, it is easy to fall back on the hypothesis of artificial origin. This is reminiscent of the period before Darwin: when the origin of species could not be fully explained, it was easy to fall back on the theory of a creator.
The second reason for the conspiracy theory is associated with the feasibility of creating such a polybasic site and a need to create such a site for testing certain biological hypotheses. Some background information arising from SARS-S is needed to understand this reason. The roughly homologous RNA segment in SARS-S is a weak cleavage site, likely cleaved by transmembrane serine protease TMPRSS2 [
93]. R667 in SARS-S (immediately upstream of the site 1 cleavage in
Figure 2A) is required for cleavage by TMPRSS2 [
93]. The site can also be cleaved by trypsin, and processing of SARS-S by trypsin enhances viral infectivity [
34,
45,
94]. Because trypsin and trypsin-like proteases are strongly tissue restricted (
Figure 2C), the site is typically not cleaved in SARS-S [
24]. It is natural for one to hypothesize that adding a furin cleavage site would allow the site to be efficiently cleaved in nearly all tissues, potentially enhancing SARS-CoV infection and broadening its cell tropism. Indeed, introducing a furin cleavage site at the S1 and S2 boundary of SARS-S has increased cell-cell fusion (syncytium formation) and viral infectivity [
34]. This result suggests that the additional polybasic furin cleavage site may have contributed significantly to the efficiency of SARS-CoV-2 in infecting humans. Host cells, in response to viral infection, may reduce furin activities [
8].
In short, given the seemingly sudden appearance of the additional furin cleavage site that cannot be readily explained by a hypothesis of natural origin, and the fact that virologists have already experimented with adding a furin cleavage site at this specific location and learned the consequence of enhanced viral infectivity and cell-cell fusion, the claim that the polybasic furin cleavage site in SARS-2-S has been experimentally inserted is not too far-fetched. However, the global collaboration among scientists, in general, and virologists, in particular, has created scientific communities that are far more closely knit than before. While it is possible to create a viral pathogen, it is extremely unlikely for a laboratory to create SARS-CoV-2 without being noticed.
The third reason is that the 12 nt insertion encoding the polybasic furin cleavage site carries two CpG dinucleotides. Such CpG dinucleotides are very rare in SARS-CoV-2 [
95], and particularly rare in SARS-2-S. Why would such CpG rarity contribute to the conspiracy theory? Mammalian zinc finger antiviral protein (ZAP, gene name ZC3HAV1) targets CpG dinucleotides in viral RNA to mediate RNA degradation and inhibit viral replication [
96]. The ZAP-mediated RNA degradation is cumulative [
96], as shown by the following experiment. When CpG dinucleotides were experimentally added to individual viral segment 1 or 2, the inhibitory effect of ZAP was weak. However, when the same CpG dinucleotides were added to both segments 1 and 2, the ZAP inhibition effect was strong [
96]. This implies that only mRNA sequences of sufficient length would be targeted by ZAP (i.e., S, 1ab, and 1a mRNAs in SARS-CoV and SARS-CoV-2). SARS-CoV-2 and its closest relatives from bat (RaTG13) and pangolin exhibit the strongest genomic CpG deficiency among all betacoronaviruses [95], presumably to evade ZAP-mediated host defense. The S gene is particularly CpG-deficient as measured by two indices, $I_{\mathrm{CpG}}$ [95, 97] and $\ln(N_{\mathrm{CG}}/N_{\mathrm{GC}})$ (Table 1), where $N_{\mathrm{CG}}$ and $N_{\mathrm{GC}}$ are the numbers of CpG and GpC dinucleotides in the S gene. $I_{\mathrm{CpG}} < 1$, or $\ln(N_{\mathrm{CG}}/N_{\mathrm{GC}}) < 0$, means CpG deficiency.
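Both indices can be computed directly from a nucleotide sequence. The sketch below is illustrative only. The text does not spell out the formula for the CpG index, so the code assumes the common observed-over-expected form, the CpG dinucleotide frequency divided by the product of the C and G frequencies; the cited references may define it differently. The function name `cpg_indices` and the toy sequence are hypothetical, and the toy sequence is not a real S gene.

```python
import math


def cpg_indices(seq: str) -> tuple:
    """Return (I_CpG, ln(N_CG / N_GC)) for a nucleotide sequence.

    I_CpG is computed as P_CG / (P_C * P_G), an ASSUMED definition;
    values below 1 indicate CpG deficiency under that definition.
    """
    s = seq.upper().replace("U", "T")
    n = len(s)
    n_cg = sum(1 for i in range(n - 1) if s[i:i + 2] == "CG")
    n_gc = sum(1 for i in range(n - 1) if s[i:i + 2] == "GC")
    if n_cg == 0 or n_gc == 0:
        raise ValueError("both CpG and GpC counts must be non-zero")
    p_c = s.count("C") / n
    p_g = s.count("G") / n
    p_cg = n_cg / (n - 1)  # n - 1 overlapping dinucleotide windows
    return p_cg / (p_c * p_g), math.log(n_cg / n_gc)


# Toy sequence with one CpG and four GpC dinucleotides.
toy = "GGCCAGGCCTGGCCACGGCC"
i_cpg, ln_ratio = cpg_indices(toy)
print(i_cpg < 1, ln_ratio < 0)  # both indices flag CpG deficiency here
```

GpC serves as a natural reference for CpG because it has the same base composition, so the log ratio is negative exactly when CpG is the rarer of the two dinucleotides.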
Because of this ZAP-mediated selection against CpG, SARS-CoV-2 and its close relatives encode most of their arginine residues by the two AGR codons instead of the four CGN codons. The S gene encodes 42 arginine residues, with only 12 (28.57%) encoded by the four CGN codons in contrast to 30 encoded by the two AGR codons. The two arginine residues in the polybasic furin cleavage site are encoded by the rare CGN codons, which seems unnatural in this context. However, the probability of randomly picking two arginine codons that both happen to be CGN codons is not extremely low (i.e., $0.2857^2 \approx 0.0816$).
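The arithmetic behind this probability can be reproduced in a few lines. The sketch uses the counts given in the text (42 arginine codons, 12 of them CGN) and treats the two draws as independent, as the text does. The variant without replacement is an added comparison, not a figure from the review.

```python
from fractions import Fraction

n_arg = 42  # arginine codons in the S gene (from the text)
n_cgn = 12  # of which CGN codons (from the text)

p_cgn = Fraction(n_cgn, n_arg)  # 2/7, about 0.2857
p_both = p_cgn ** 2             # two independent draws

# Added comparison (assumption): drawing two distinct codons without replacement.
p_both_no_repl = Fraction(n_cgn, n_arg) * Fraction(n_cgn - 1, n_arg - 1)

print(round(float(p_cgn), 4))           # 0.2857
print(round(float(p_both), 4))          # 0.0816
print(round(float(p_both_no_repl), 4))  # 0.0767
```

Either way the probability is roughly 8%, which supports the point that two CGN-encoded arginines in a row are unusual but far from implausible.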
One way to dispel the conspiracy theory is to find a set of viral lineages in wildlife that would allow reconstruction of a plausible evolutionary path leading to the origin of the polybasic furin cleavage site. The “missing link” that would satisfy conspiracy theorists is still to be found. However, there is no guarantee that it will be found, because nature is not obliged to preserve all that she has created.