A Simple and Fast Method to Sequence the Full-Length Spike Gene for SARS-CoV-2 Variant Identiﬁcation from Patient Samples

: Since the beginning of the pandemic, a race has been underway to detect SARS-CoV-2 virus infection (PCR screening, serological diagnostic kits), treat patients (drug repurposing, standard care) and develop a vaccine. After almost a year of active circulation worldwide, SARS-CoV-2 variants have appeared in different countries. Those variants include mutations in multiple regions of the genome, particularly in the spike gene. Because this surface protein is a key player in both the spread of the virus and the efﬁcacy of vaccine strategies, the challenge is to efﬁciently monitor the appearance of spike mutations in the population. The present work describes a procedure based on the widely available Sanger technology to produce a full-length sequence of the spike gene from patient-derived samples.


Introduction
In late 2019, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged in Wuhan, Hubei province, China. This local outbreak rapidly developed into the COVID-19 pandemic [1,2]. At the end of March 2021, the total number of cases was estimated to be 128 million (https://coronavirus.jhu.edu, 29 March 2021). Since the middle of 2020, there have been hundreds of thousands of new infections each day, with peaks approaching one million. Unsurprisingly, several new variants have emerged during this expansion, some of which have become variants of interest, or even variants of concern (VOC) [3]. Their prevalence in the population quickly increased and the majority of new infections are now due to these variants rather than the parental strain. Mutations have been described in the 5 untranslated region (UTR), ORF1ab, ORF3a, Spike, M, ORF8 and N. Therefore, monitoring the genetic evolution in the SARS-CoV-2 virus is becoming a top priority in the management of the COVID-19 pandemic.
The entry of coronaviruses is mediated by the surface glycoprotein spike [4][5][6]. Because of its role in receptor binding, the spike protein was quickly found to be the best candidate for producing neutralizing antibodies [7]. Current vaccines and novel therapeutic strategies under development are based on this spike protein. Accordingly, it is crucial to monitor the appearance of mutations in circulating viruses that might affect collective as well as acquired immunity.
The first described amino acid mutation was a D to G transition in the spike coding region (D614G) [8]. Observed in Europe as early as February 2020, this variation increased viral replication/infectivity [9]. It is now present in almost all circulating viruses and is shared among all of the current variants of interest/concern. Of note, this mutation has also been associated with an increased susceptibility to neutralizing antibodies [10]. Conversely, other mutations, such as the N501Y and E484K mutations located in the receptor-binding domain of spike (RBD), can affect the immunity conferred by polyclonal antibodies. The United Kingdom (UK) variant, also known as VOC 202012/01 (Nextstrain 20I/501Y.V1, PANGO lineage B.1.1.7), emerged in the south-east of the country in December 2020 [11]. In addition to the D614G, it harbored multiple mutations including N501Y, as well as a deletion at positions 69/70, affecting the diagnostic capacity of PCR tests involving the S region (Table 1, Alpha).  T  R  20  T  N  26  P  S  69  H  -70  V  -80  D  A  138  D  Y  142  G  D  144  Y  -154  E  K  157  F  -158  R  -190  R  S  215  D  G  242  L  -243  A  -244  L  -246  R  I  417  K  N  T  452  L  R  477  S  N  478  T  K  484  E  K  K  Q  501  N  Y  Y  Y  570  A  D  614  D  G  G  G  G  G  655  H  Y  681  P  H  R  701  A  V  716  T  I  950  D  N  982  S  A  1027  T  I  1071  Q  H  1118  D  H  1176 V F # = total number of mutations compared to the reference sequence. Deletions are indicated with a minus (−) sign. Signature mutations are highlighted in green, while mutations that could impact viral infectivity and/or immune response are in red.
Soon thereafter, another VOC was identified with a similar N501Y transition but in South Africa [12]. Known as 20H/Beta or B.1.351, this variant (Table 1, Beta) appeared independently of the alpha virus and presented additional transitions including the spike E484K. Finally, a third VOC was identified in Brazil. Known as P.1 or 20J/501Y.V3, this variant shared some spike mutations seen in the beta virus. In all cases, numerous other mutations can be associated with these major variants, and new mutations are appearing constantly. That is the case, for example, with the BRE-IPP03921 variant, which recently emerged in Brittany, France, and which was identified after a recrudescence of COVID-19 symptoms in patients who tested negative for SARS-CoV-2 by routine PCR screening. More recently, the VOC delta (B.1.617.2), first described in India, is now spreading in the United Kingdom and all around Western Europe [13].
Overall, all variants harbour mutations in the viral spike, and these mutations spread throughout the protein sequence. However, present sequencing protocols of the spike cover two regions (genome positions 21,701 to 22,344 and 22,763 to 23,310), while sequencing of the whole spike is performed by NGS when a variant is suspected. Next generation sequencing would be the best solution to obtain a full-length spike sequence or even a whole genome sequence in order to survey any kind of mutation. Still, the required technology is absent in most southern countries and is not even readily available in every hospital in Europe. In contrast, Sanger is more widely available and makes it possible to obtain 800-1000 base reads per run. However, the SARS-CoV-2 spike is a 1273-amino acid protein encoded by the corresponding 3822-base open reading frame. Accordingly, we developed a procedure based on Sanger technology to sequence the full-length spike in three steps from nasal or saliva screening swab extracts. This procedure is compatible with other patient-derived samples such as bronchoalveolar lavages (BAL). The entire process can be performed in less than 14 h and provides sufficient coverage to identify existing and future novel variants.

Materials and Methods
Clinical Sample Collection. Patient samples were obtained from the Bordeaux university Hospital virology laboratory. To validate the protocol, various types of clinical samples were used. Twelve among 15 samples were RNA extracts from patients tested at the hospital COVID-19 centre as positive by qualitative real-time RT-PCR (nasal swabs). Three samples were RNA extracted from broncho-alveolar lavages (BAL). The last one (culture strain) consisted of RNA extracts from culture supernatant of Vero E6 cells infected with a reference strain (WIV04/2019 obtained through the EVAg and Pasteur Institute). Infection was performed in the BSL3 facility of Bordeaux University (UB'L3) as described in our recent work [14]. Ct values of RT-qPCR were determined during patient follow-up in the hospital using SARS-CoV-2 R-gene (BioMérieux, Craponne, France) and GeneFinder TM COVID-19 PLUS RealAmp (Osang Healthcare, Gyeonggi-do, Korea) kits.
RNA extraction. A pure viral RNA kit from Roche was used to extract RNA from samples of either BAL or cell culture supernatant (200 µL). After binding and washes, elution was performed with 50 µL of water. The quality of the extraction was checked by monitoring the absorbance profile using a NanoDrop 2000 (ThermoFisher scientific, Waltham, MA, USA).
Amplification of the spike gene. Primers used for amplification and sequencing were purchased from Eurofins Genomics (Germany), and are reported in Table 2. Amplification of a single fragment of 4200 bp covering the entire spike gene was performed from 10 µL of extracted viral RNA, as previously described [15]. RT-PCR was carried out in 25 µL reaction using forward F1-1 and reverse F1-1 primers (0.4 µM of each primer) in a buffer containing 0.4 mM of each dNTP, 3.2 mM MgSO 4 and 0.5 unit of SuperScript™ III RT/Platinum™ Taq Mix (Invitrogen, ThermoFisher Scientific, Waltham, MA, USA). After an initial RT step performed at 50 • C for 30 min, the PCR cycling program consisted of a denaturation at 95 • C for 2 min followed by 30 cycles of 95 • C for 30 s, 60 • C for 15 s and 68 • C for 5 min. The reaction concluded with a final extension at 68 • C for 5 min. A nested PCR was performed with 2 µL of this RT-PCR product at a final volume of 25 µL using 0.8 µM of each primer (forward F1-2 and reverse F1) in a reaction buffer containing 15 mM of MgCl 2 , 5 mM of dNTPs and 0.25 unit of ThermoSTAR 2 HS Taq (Eurobio, Les Ulis, France). The quality of the PCR products was verified by electrophoresis using a QIAxcel Advanced automate (Qiagen, Germantown, MD, USA) before cleaning using ExoSAP-IT™ PCR Product Cleanup Reagent kit (Applied Biosystems, ThermoFisher scientific, Waltham, MA, USA). Sanger Sequencing and sequences analysis. Purified PCR products were sequenced using the BigDye ® Terminator v3.1 Cycle Sequencing Kit on an Applied Biosystems 3500XL DX automat (Applied Biosystems, ThermoFisher Scientific, Waltham, MA, USA). For that, ten sequencing reactions were performed, each using 2 µL of nested PCR products, and using primers F1-2 and F1-2 plus eight additional primers ( Table 2). Abi files were either aligned to the reference SARS-CoV-2 genome in SnapGene 4.3.11 or uploaded to the SmartGene IDNS software Coronavirus Module (IDNS ® version v3_11_0r2(r32344) © SmartGene 2021).

Results
The aim of this work was to establish a robust but simple, fast procedure to generate sequencing data on the full-length spike gene using Sanger technology. Since the beginning of 2020, several hundreds of thousands of people have been screened every day for SARS-CoV-2 infection. Accordingly, there are a tremendous number of samples that merit further characterization through spike sequencing. The majority of the collected samples are nasal swabs that are further processed (RNA extraction) for RT-qPCR detection of two to three viral genes. We intended to make use of these PCR positive samples that were already extracted to enable subsequent full-length spike gene sequencing. Sequencing was performed for 16 samples from patients, comprising 8 males and 7 females. Ct values ranged from 15.14 to 35.76 for the N gene, and 16.6 to undetectable (Ct > 40) for the RdRp gene (Table 3). For some samples, E gene screening was also performed and Ct values ranged from 13.53 to 25.64 (Table 3). First, RNA extracts from nasal swabs were subjected to reverse transcription using primers Forward F1-1 and Reverse F1-1 located in adjacent genes, namely, orf1ab and orf3a, respectively (Table 2). This led to the generation of a 4587-base-pair DNA fragment that was further amplified by nested PCR using primers Forward F1-2 and Reverse F1-2 ( Table 2). At this step, the quality of amplification could be assessed if needed through conventional agarose separation or any other electrophoretic method. As seen in Figure 1B, this process from nasal swab extracts allowed us to produce a single band of 4199 bp. Before further processing, samples were enzymatically cleaned with Exo I and rSAP treatments. Sequencing was performed using 2 µL of the PCR mix per primer pair, allowing up to 13 sequencing reactions per sample. We found that six was the minimal number of Sanger reactions necessary to fully cover the 3822 bp spike gene. Deletion events in the 5 side were unambiguously detected, even with a single read. However, mutations at primer binding sites may affect sequencing efficacy. In addition, the presence of mix bases due to multiple viral populations may interfere with proper base assignment, and higher coverage might be required in hotspot regions. Accordingly, primers were designed to produce a higher overlap (4x coverage) at spike amino acids positions 400 to 800 (22,760-23,962 nt, Figure 1A), because these correspond to the receptor binding domain and represent the most relevant region for variant determination. Thus, we proposed a set of 10 primers to unambiguously sequence the full-length spike gene. Using this protocol, complete sequencing was achieved for all tested samples (Table 3).
COVID 2021, 1, FOR PEER REVIEW 6 Before further processing, samples were enzymatically cleaned with Exo I and rSAP treatments. Sequencing was performed using 2 µL of the PCR mix per primer pair, allowing up to 13 sequencing reactions per sample. We found that six was the minimal number of Sanger reactions necessary to fully cover the 3822 bp spike gene. Deletion events in the 5′ side were unambiguously detected, even with a single read. However, mutations at primer binding sites may affect sequencing efficacy. In addition, the presence of mix bases due to multiple viral populations may interfere with proper base assignment, and higher coverage might be required in hotspot regions. Accordingly, primers were designed to produce a higher overlap (4x coverage) at spike amino acids positions 400 to 800 (22,760-23,962 nt, Figure 1A), because these correspond to the receptor binding domain and represent the most relevant region for variant determination. Thus, we proposed a set of 10 primers to unambiguously sequence the full-length spike gene. Using this protocol, complete sequencing was achieved for all tested samples (Table 3). Sequencing could be achieved using various types of samples. In addition to nasal swabs, BAL or cell culture viral supernatants were used as alternative starting materials. In the latter case, RNA extraction was performed using 200 µL of sample concentrated to 50 µL of total RNA extract. Then, these alternative extracts were used exactly as previously described for the nasal swab extracts. Analyses of sequences were then performed using the SmartGene IDNS software Coronavirus Module (Figure 2). This allowed us to detect mutations in comparison with the reference strain. Combinations of mutations unambiguously allowed us to classify the viruses as WT or in relation to the circulating alpha and beta variants, with alpha representing the most frequent circulating form in our region at this time. In addition to previously reported mutations using current protocols, several additional mutations could be identified in the spike gene, especially the L452R, a mutation found in delta and kappa variants, as well as in epsilon, known to impact antibody binding and increase transmissibility. Interestingly, sequencing data could be ob- Sequencing could be achieved using various types of samples. In addition to nasal swabs, BAL or cell culture viral supernatants were used as alternative starting materials. In the latter case, RNA extraction was performed using 200 µL of sample concentrated to 50 µL of total RNA extract. Then, these alternative extracts were used exactly as previously described for the nasal swab extracts. Analyses of sequences were then performed using the SmartGene IDNS software Coronavirus Module (Figure 2). This allowed us to detect mutations in comparison with the reference strain. Combinations of mutations unambiguously allowed us to classify the viruses as WT or in relation to the circulating alpha and beta variants, with alpha representing the most frequent circulating form in our region at this time. In addition to previously reported mutations using current protocols, several additional mutations could be identified in the spike gene, especially the L452R, a mutation found in delta and kappa variants, as well as in epsilon, known to impact antibody binding and increase transmissibility. Interestingly, sequencing data could be obtained similarly from samples arising from screening facility samples, as well as from intensive care unit derived samples (BAL). Moreover, our data showed that if necessary, hard to sequence samples could alternatively be amplified in the lab (cell culture viral amplification) without introducing additional mutations (at least on early passages).

Discussion
Since the original breakout of SARS-CoV-2 at the end of 2019 in China, tremendous effort has been made by the scientific community to better understand the physiopathology of the virus and find effective antivirals and vaccines. Even though several vaccines have been successfully developed in record time, several mutations have arisen which threaten their future efficacy. Altogether, we developed a simple and reliable method to sequence whole spike gene in less than 14 h, directly from RNA extract left over from screening facilities. This protocol is able to unambiguously identify an exhaustive list of known and yet unseen mutations, which will help monitor circulating variants around the world if widely adopted by screening facilities. The surveillance of new variants will certainly be the most significant challenge in future months, and since next generation

Discussion
Since the original breakout of SARS-CoV-2 at the end of 2019 in China, tremendous effort has been made by the scientific community to better understand the physiopathology of the virus and find effective antivirals and vaccines. Even though several vaccines have been successfully developed in record time, several mutations have arisen which threaten their future efficacy. Altogether, we developed a simple and reliable method to sequence whole spike gene in less than 14 h, directly from RNA extract left over from screening facilities. This protocol is able to unambiguously identify an exhaustive list of known and yet unseen mutations, which will help monitor circulating variants around the world if widely adopted by screening facilities. The surveillance of new variants will certainly be the most significant challenge in future months, and since next generation sequencing is not available in many regions, this simple method, based on Sanger sequencing, appears to be a viable option for global surveillance.

Discussion
Since the original breakout of SARS-CoV-2 at the end of 2019 in China, tremendous effort has been made by the scientific community to better understand the physiopathology of the virus and find effective antivirals and vaccines. Even though several vaccines have been successfully developed in record time, several mutations have arisen which threaten their future efficacy. Altogether, we developed a simple and reliable method to sequence whole spike gene in less than 14 h, directly from RNA extract left over from screening facilities. This protocol is able to unambiguously identify an exhaustive list of known and yet unseen mutations, which will help monitor circulating variants around the world if widely adopted by screening facilities. The surveillance of new variants will certainly be the most significant challenge in future months, and since next generation sequencing is not available in many regions, this simple method, based on Sanger sequencing, appears to be a viable option for global surveillance.  Data Availability Statement: Raw sequence data can be obtained directly from the authors.

Conflicts of Interest:
The authors declare no conflict of interest.