Polymorphism and Selection Pressure of SARS-CoV-2 Vaccine and Diagnostic Antigens: Implications for Immune Evasion and Serologic Diagnostic Performance

The ongoing SARS-CoV-2 pandemic has triggered multiple efforts for serological tests and vaccine development. Most of these tests and vaccines are based on the Spike glycoprotein (S) or the Nucleocapsid (N) viral protein. Conservation of these antigens among viral strains is critical to ensure optimum diagnostic test performance and broad protective efficacy, respectively. We assessed N and S antigen diversity from 17,853 SARS-CoV-2 genome sequences and evaluated selection pressure. Up to 6–7 incipient phylogenetic clades were identified for both antigens, confirming early variants of the S antigen and identifying new ones. Significant diversifying selection was detected at multiple sites for both antigens. Some sequence variants have already spread in multiple regions, in spite of their low frequency. In conclusion, the N and S antigens of SARS-CoV-2 are well-conserved antigens, but new clades are emerging and may need to be included in future diagnostic and vaccine formulations.


Introduction
The emergence and rapid spread of a novel Coronavirus, referred to as SARS-CoV-2, is resulting in one of the worst pandemics in the world, causing an unprecedented health and economic crisis. About seven months after the first cases were identified, over 8 million cases have been reported worldwide, with over 400,000 deaths according to the Johns Hopkins Coronavirus Resource Center [1].
The pandemic has triggered multiple efforts at developing serological tests, able to detect both acute infections by detecting virus-specific IgM, as well as recovered individuals by detecting virus-specific IgG. Several immunochromatographic rapid tests are already available [2], and several more are likely to become available by the end of 2020. Such tools will be critical to increase testing for the accurate and rapid identification of cases and their isolation to limit further transmission of the virus. However, their performance needs to be evaluated, and initial testing suggested variable performance of these tests [2,3]. Test performance relies in part on the antigen used, and its conservation among virus strains circulating in the population being tested. Currently, most of these tests are based on the Spike glycoprotein (S) or the Nucleocapsid (N) viral proteins [2]. The receptor-binding domain (RBD) of the S protein, which mediates binding to the angiotensin-converting enzyme 2 (ACE2) receptor in human cells [4], is also widely used as a diagnostic antigen.
Similarly, vaccine development efforts have been very intense and a growing number of vaccine candidates are being quickly moved into clinical trials. These are based on different technological platforms, ranging from recombinant proteins, RNA and DNA vaccines, or recombinant viral vectors [5,6]. A first RNA vaccine candidate recently completed clinical phase 1 evaluation and is expected to move into Phase 2 shortly. Most of these vaccine candidates are based on the viral S protein, or the RBD as antigen. Multiple potential vaccine epitopes have also been identified in the S, as well as in the N viral proteins [7]. As for diagnostics, conservation of these vaccine antigens among viral strains is critical to ensure broad protection and avoid immune evasion by the virus.
As an RNA virus, SARS-CoV-2 is prone to frequent mutations, in spite of some proof-reading abilities of its RNA polymerase complex [8,9]. An early assessment of genomic changes in SARS-CoV-2 showed a mutation hot-spot in the virus's RNA-dependent RNA polymerase (RdRp), but a few mutations were also detected in other parts of the viral genome, including the N and S proteins [10]. The growing availability of a large number of complete genome sequences gathered since the beginning of the pandemic provides a unique tool to assess the extent of viral antigen polymorphisms, and potential selection pressures. A first analysis of polymorphisms in the S glycoprotein from the first sequences in December 2019 to 13 April 2020 identified a handful of variant sites, including D614G, S943P, and possibly L5F and L8V [11]. Variant sites V367F, G476S, and V483A were also identified in the RBD. Here, we analyzed the sequence variation in a broader set of viral proteins N and S, which represent the main diagnostic and vaccine antigens to date. We then examined the implications of the identified sequence variants on vaccine and serological diagnostic performance.

Results
Analysis of over 17,000 genome sequences confirmed the SARS-CoV-2 is a fast-evolving virus. Indeed, in the less than 5 months that viral sequences have been available, we detected multiple sequence variants, and some virus strains currently circulating in multiple countries have somewhat diverged from the first isolates from December 2019 in Wuhan, China ( Figure 1A). Variants were scattered throughout the viral genome, rather than clustered in specific genes ( Figure 1B,C), and some sequence variation could be detected within both the N and S genes.
These genes were then analyzed in detail and separately. For the N protein, we included a dataset of 16,656 sequences, and significant sequence diversity was detected, with up to 326 distinct protein sequences. For a clearer assessment of their phylogenetic relationship, these variant sequences were analyzed independently ( Figure 2A). Notably, a structuring including up to seven incipient clades was found emerging, with sequences from the first virus from Wuhan, China included in Clade 1 ( Figure 2C). There was no specific geographic clustering of the sequence variants, illustrating the widespread multidirectional spreading of the virus across the world. A notable exception was observed for Clade 3, which included mostly sequences from Europe. Analysis of sequence variation along the protein sequence indicated that about half of the protein on the amino side was mostly conserved, except in two regions at sites 13 and 203-204, respectively ( Figure 2B). On the other hand, the carboxy half of the protein appeared more variable, but this also reflected some sequencing ambiguities.
A total of 178/419 (42.5%) sites presented variation in the N protein. This included seven sites with four variant amino acids, seven sites with three variant amino acids, and 13 sites with two variant amino acids that were found under significant diversifying selection pressure (26/419 (6.2%), Table 1). Because of these changes, the N protein appears to be slowly diverging from the sequence from some of the early virus, belonging to Clade 1, and up to six additional major clades (Clades 2-7) are emerging for the N antigen ( Figure 2C). Site D144 that can be substituted by E, H, Y, or N may disrupt a predicted epitope (ALNTPKDHI 138-146) [7,12]. Importantly, most variants were still found at relatively low frequency among the viral population (0.018 to 0.541%), with only R203X and P13X variants detected at higher frequency (18.108 and 1.589%, respectively, Table 1), indicating an overall high level of conservation of the N protein. Nonetheless, many of the low frequency variants were found to have already dispersed in multiple countries and regions. For example, S202X variants were detected in 90 cases from Australia, China, Democratic Republic of Congo, England, Ghana, India, The Netherland, Russia, Saudi Arabia, Senegal, Turkey, and the USA. Similarly, D22X was detected in 53 cases from Australia, England, Taiwan, Uruguay, and Wales. On the other hand, a few variants were likely more associated with limited clusters of infection, such as A208G variants, which were mostly limited to the USA so far.
A similar analysis of the S antigen was performed, based on 17,802 sequences. It revealed even greater sequence diversity, with up to 681 unique S protein variants, and most of this diversity was observed in the most recent months of March and April 2020 compared to January and February 2020 ( Figure 3A). Furthermore, up to six emerging clades could be defined, that present a clear divergence from Clade 1, which includes some of the first sequences from Wuhan, China ( Figure 3B). While some sequences from Clades 3 and 6 could be detected as early as February 2020, sequences from Clades 2, 4, and 5 appeared in March 2020 and expanded in April 2020. At the same time, sequences from Clade 1 appeared less frequent with time ( Figure 3A,B).
Analysis of S glycoprotein variation indicated that amino acid variants were spread along most of the protein sequence ( Figure 3C), although a few regions of lower sequence conservation could be detected at positions 260-320 just before the RBD, at position 445-515 in the carboxy end of the RBD, and at site 614. Further analysis of each major clade revealed that each had amino acid substitutions that concentrated in different domains of the proteins, except for Clade 1, which accumulated the greatest number of substitutions across the entire protein ( Figure 4). For example, Clade 2 had more substitutions between sites 850-970, Clade 3 between 550-750 and 1150-1250, Clade 4 between 250-320, Clade 5 between 140-250 and 420-500, and Clade 6 between 750-800.
Pathogens 2020, 9, x FOR PEER REVIEW 4 of 12 Scotland, Singapore, Taiwan, Thailand, USA, and Wales, and H49X variants have been found in 36 cases from Australia, China, England, Mexico, Taiwan, and the USA, for example. As mentioned above, some of the sequence variation affecting the S protein was detected within the RBD, which is a key functional domain of the protein and one of the most used targets for serological diagnostics. We thus analyzed in detail its polymorphism. Sequence analysis of RBD revealed that it represented a highly conserved region of the S protein. Nonetheless, up to 54 RBD sequence variants were identified, with again some significant divergence from the first sequences from Wuhan, China ( Figure 5). Importantly, divergence seemed to increase with time as more variants accumulate and become established. A total of seven sites from the RBD were found under significant diversifying selection pressure, and variants sites within the RBD were observed in each of the major clades of the S protein (Table 2). Nonetheless, while possible RBD clades are emerging, these do not match the S protein major clades described above.  A total of 362/1273 (28.4%) sites presented variation in the S protein, of which 30 sites (2.4%) were found under significant diversifying selection pressure (Table 2). These included one site with five variant amino acids, one site with four variants, seven sites with three variants, 11 sites with two variants, and 12 sites with a single variant amino acid. Furthermore, different selection patterns were identified in each major clade, and only a few notable sites had substitutions in more than one clade (Table 2 and Figure 4). For example, sequences from Clade 1 are clearly defined by site D614, which is under strong diversifying selection, together with sites V615, G476, V483, and H519 and their corresponding variants. Most sequences from Clades 2-6 have a D614G substitution, together with clade specific variants. Thus, Clade 2 is characterized by a cluster of substitutions around sites 936-943, with specific sites D936, S940, and S943 and their variants under strong diversifying selection. Clade 3 is characterized by variants sites V622, A653, A684, A703, and their variants, and Clade 5 by sites D215, S221, and Q238 and their variants (Table 2 and Figure 4). A single predicted epitope may be affected by diversifying selection and substitutions at site 1078. The furin cleaveage region (671-692), and particularly the cleavage site, were well conserved, although two sites, Q675 and A684 are under diversifying selection. Substitutions at these less conserved sites may thus not affect furin cleavage, which is unique to SARS-Cov-2 [4]. Similarly, none of the sites with N-linked glycosylation [13] were found under diversifying selection, allowing for the conservation of the glycosylation pattern of the S protein across its diversity.  With the exception of the D614G substitution which has taken over and is now widespread in virus populations across the globe (over 63% of sequences carry this substitution), the other variants under selection still represent a low proportion of viral sequences, ranging from 0.017 to 0.586% (Table 2). A few of these variants likely correspond to limited clusters of infections, as they come from a single geographic region and are grouped in time. This is the case for the G1124V variant, which is limited to 50 cases from Victoria, Australia, between 20-27 March 2020. Similarly, the N439K variant is limited to 40 cases from Scotland, identified between 16 March-5 April 2020. However, most of the other variants have already spread to multiple countries and regions, such as Q675X, which has been found in Denmark, England, Finland, Iceland, Norway, Scotland, Spain, and the USA over March and April 2020. Similarly, L5F variants have been found on 102 cases from Australia, Belgium, Canada, England, France, Iceland, India, Italy, Japan, Netherlands, Portugal, Scotland, Singapore, Taiwan, Thailand, USA, and Wales, and H49X variants have been found in 36 cases from Australia, China, England, Mexico, Taiwan, and the USA, for example.
As mentioned above, some of the sequence variation affecting the S protein was detected within the RBD, which is a key functional domain of the protein and one of the most used targets for serological diagnostics. We thus analyzed in detail its polymorphism. Sequence analysis of RBD revealed that it represented a highly conserved region of the S protein. Nonetheless, up to 54 RBD sequence variants were identified, with again some significant divergence from the first sequences from Wuhan, China ( Figure 5). Importantly, divergence seemed to increase with time as more variants accumulate and become established. A total of seven sites from the RBD were found under significant diversifying selection pressure, and variants sites within the RBD were observed in each of the major clades of the S protein (Table 2). Nonetheless, while possible RBD clades are emerging, these do not match the S protein major clades described above.          Sites in bold are localized within the RBD. * indicate site(s) included in predicted epitopes. # sites included in the furin cleavage region. + site presenting no significant selection pressure but extensive variation.

Discussion
Antigen polymorphism from pathogens has the potential to impair serological diagnostic test performance, as well as vaccine efficacy. It is thus of key importance to consider these aspects for serological test and vaccine development, to ensure their usefulness and broad efficacy. This is commonly done for influenza vaccines for example, which are updated each year based on circulating viral strains as cross protection among strains is still elusive [14]. Here, we investigated the sequence diversity of two major antigens of the novel SARS-CoV-2 virus, the N and S proteins. Importantly, a significant level of sequence diversity was detected for both antigens, with incipient clades emerging as multiple sites were found under significant diversifying selection pressure.
The N protein, mostly used in serological diagnostic tests [2] had a large number of sequence variants, and 6.2% of its residues were found under diversifying selection. Overall, up to seven major sequence clades have been emerging in recent months for this antigen, and these did not show any geographic clustering. A notable exception was Clade 3 of the N protein, which appeared over-represented in sequences from Europe. Importantly, predicted epitopes appeared conserved so far, although a more detailed epitope mapping is still needed for this antigen. Nonetheless, N protein variants diverging from the initial sequences from Wuhan, China are now circulating in most geographic regions. While these changes are so far limited to a relatively small proportion of sequences (23.4%) and may not interfere with protein antigenicity, the inclusion of some of the variants in serological tests would ensure optimum sensitivity of tests, particularly if some of these variants become more frequent.
The S glycoprotein is the main vaccine candidate currently tested in multiple vaccine platforms/formulation [5,6]. Compared to the N antigen, it is more conserved and only 2.4% of its sites were found under diversifying selection pressure. We confirmed the importance of most of the variant sites previously identified in this antigen. These include D614G and S943P, as well as L5F and L8V and variant sites V367F, G476S, and V483A in the RBD [11]. However, multiple additional variants were also identified here, leading to the identification of up to six major clades of the S glycoprotein that are emerging. Most of these variants appeared in the months of March and April 2020 and may be slowly replacing the virus presenting sequences similar to that of the initial isolates from Wuhan, China. Indeed, while most of the variants still have a low frequency in the viral population, several have already spread to multiple countries and regions, where they may reach higher frequencies in the near future if they are successfully transmitted. Importantly, none of the substitutions identified affected the glycosylation pattern of the S protein, and none of the predicted epitopes appear affected. While the functional impact of these variants is unknown, the D614G mutation has been associated with potential increased viral transmission and/or fitness [11], which may explain why it became so frequent. A recent comparison of functional properties of the S proteins with aspartic acid (SD614) and glycine (SG614) confirmed a greater infectivity correlated with less S1 shedding and greater incorporation of the S protein into the pseudovirion with the SG614 variant [15]. Similar functional studies of the additional variants identified here may help evaluate their impact on virus fitness. Future studies will potentially provide data on how the different clades identified here may be successfully transmitted or go extinct.
While the RBD is particularly well conserved, some sequence variation was also detected in this region within the S glycoprotein, with up to 54 sequence variants. Because these differ by only 1-2 amino acids, the overall antibody recognition of the RBD can be expected to be mostly preserved, but some specific epitopes may nonetheless be lost. In addition, our phylogenetic analysis suggested that possible clades may be emerging within the RBD, as well, and newer sequences may diverge further from the sequence from the initial isolates from Wuhan.
In conclusion, we found that the N and S antigens of SARS-CoV-2 are so far highly conserved, so that both are good antigens for both diagnostic and vaccine development. However, some sequence variation is also emerging and 6-7 phylogenetic clades could be identified for both antigens. Some of these sequence variants have already spread in multiple countries and regions, in spite of their low frequency. Sequence variants may arise by random substitutions in the viral genome during replication, but the significant diversifying selection detected at multiple sites in both antigens suggests that immune selection pressure and adaptation to human hosts may be driving some of these changes [16,17], which may lead to the establishment of some of these variants. New variants are also likely to emerge with time. The recent identification of potential co-infections with more than one viral strain suggests that recombination could also contribute to the generation of SARS-CoV-2 genetic diversity [18]. Therefore, further monitoring of antigen drift over time will be needed to ensure that diverging antigens can be identified in a timely manner and included in future diagnostic and vaccine formulations.

Viral Sequence Data
Whole genome sequences from 18,247 SARS-CoV-2 virus were obtained from GISAID (Supplementary Table S1), covering virus isolates from multiple continents, including Asia, Africa, Europe, Oceania, and America. These sequences included those from initial human cases in Wuhan, China, from December 2019 up to sequences from 11 May 2020.

Sequence Analysis
Viral genome sequences were aligned using MAFFT [19] as implemented in Geneious 11, and alignments were edited to exclude partial or low-quality sequences. A final alignment including 17,853 quality sequences were used to construct phylogenetic trees using FastTree [20] for a global analysis of viral diversity across the world. FastTree infers approximately-maximum-likelihood phylogenetic trees. Sequence conservation across genome alignment was calculated using a sliding window analysis in Geneious.
Separate analyses were then performed using S and N genes, as well as the RBD from the S protein (positions 319-540 within the S protein). For these, translated sequences were aligned with the MAFFT algorithm using Blossum62 matrix and the frequency of variants at each site was calculated. Unique sequences from these proteins were then selected, and phylogenetic trees were constructed using FastTree as above. Major clades were defined according to tree structure and branch support measures. Predicted epitopes from these antigens [7,12] were mapped in the alignments, as well as glycosylation sites [13], to assess their conservation among viral sequences. Finally, evolutionary selection pressures on the antigens were analyzed using the Fast, Unconstrained Bayesian AppRoximation (FUBAR), as implemented in HyPhy [21], and statistical significance was considered at a threshold of p < 0.1.