An Upgrade on the Surveillance System of SARS-CoV-2: Deployment of New Methods for Genetic Inspection

SARS-CoV-2 variants surveillance is a worldwide task that has been approached with techniques such as Next Generation Sequencing (NGS); however, this technology is not widely available in developing countries because of the lack of equipment and limited funding in science. An option is to deploy a RT-qPCR screening test which aids in the analysis of a higher number of samples, in a shorter time and at a lower cost. In this study, variants present in samples positive for SARS-CoV-2 were identified with a RT-qPCR mutation screening kit and were later confirmed by NGS. A sample with an abnormal result was found with the screening test, suggesting the simultaneous presence of two viral populations with different mutations. The DRAGEN Lineage analysis identified the Delta variant, but there was no information about the other three mutations previously detected. When the sequenced data was deeply analyzed, there were reads with differential mutation patterns, that could be identified and classified in terms of relative abundance, whereas only the dominant population was reported by DRAGEN software. Since most of the software developed to analyze SARS-CoV-2 sequences was aimed at obtaining the consensus sequence quickly, the information about viral populations within a sample is scarce. Here, we present a faster and deeper SARS-CoV-2 surveillance method, from RT-qPCR screening to NGS analysis.


Introduction
Since late 2019, coronavirus disease (COVID- 19), an illness caused by a novel coronavirus called severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has represented one of the main challenges of public health across the world. Along with the SARS-CoV-2 dissemination over new territories, new mutations such as the spike (S) protein mutation D614G (A23403G) emerged and became dominant over time [1][2][3]. After this evolutionary event, the population of non-D614G-mutants is virtually nonexistent, and it appears to be a consequence of the adaptation of the virus [4], but even after many studies, the reasons for this change in prevalent strains are not totally clear.
The S protein is characteristic of the coronavirus surface, and it is involved in the viral adsorption over the host cell surface because this protein interacts with the cellular receptors such as ACE2 (Angiotensin converting enzyme). Because of this, the S protein is one of the key molecules used as targets in COVID-19 vaccines [5,6]. Along with the replication and dissemination of the virus, several mutations arose and became fixated in the genome of SARS-CoV-2, originating variants of the virus. As variants diverge and accumulate mutations, it is expected that they have a heterogeneous epidemiological behavior, and in some cases even a differential clinical progression, but there is not enough data available to predict the result of mutations combined within a single viral particle [7].
Sampling, SARS-CoV-2 detection, and genetic analysis to identify genomic characteristics of infecting viruses are the major steps for epidemiological surveillance worldwide. However, there are important differences regarding these approaches: (i) the number of samples taken and assayed for the presence of SARS-CoV-2, (ii) data reported to corresponding Health Departments, (iii) criteria for sample selection as sequencing candidate, to list a few. Each government handles the situation as it appears to be the best option for their specific situation, but an essential aspect of this epidemiological approach is the economic situation. The price for virus detection by RT-qPCR has been reduced and become widely available, in contrast to sequencing technology. Moreover, it is important to note that NGS (Next-Generation Sequencing) requires different laboratory equipment, specially trained scientists, in addition to sequencing reagents, which makes the intensive use of NGS technology difficult in several countries. On the other hand, RT-qPCR technology is a readily available technology, and if it is correctly designed, it can help in the screening of samples for mutations. An affordable option of RT-qPCR technology for SARS-CoV-2 variant screening is Master Mut Kit (Genes2Life, Mexico), which can detect mutations present within the spike gene, and therefore, identify if the genetic material belongs to a VOI (Variant of Interest) or VOC (Variant of Concern) virus. As epidemiological surveillance becomes more scrupulous, data about the mutations and their real distribution will be available for most countries, and ultimately, it will have a higher certainty of epidemiological data accuracy. Additionally, as more tests are performed, now rare events, such as simultaneous infection with two or more strains of SARS-CoV-2 will become more frequently detected and relate to their actual occurrence.
NGS data analyses are commonly processed with public-available bioinformatics tools. As main programs and algorithms became widely used by researchers worldwide, the amount of genomic data generated each day increases substantially, representing a potential challenge because the processing power needed to supply the demand increases every day. Additionally, as the speed of sample analysis increases, the depth of analysis is reduced, therefore, losing important data, such as genetic populations. Some of the leading platforms for sequencing, such as ARTIC, obtain information of variants while processing the data, but this is performed at the last stage when a consensus sequence is obtained; all mutations below the threshold level for identification for the variant call are lost.
The threshold level of the Illumina DRAGEN COVID Pipeline is 0.5 (Illumina DRA-GEN COVID Pipeline Software Guide, Document # 1000000158680 v01). This study aims to propose two methods for analyzing SARS-CoV-2, a RT-qPCR method that can accurately identify VOI and VOC at a lower cost and shorter time than NGS, and a bioinformatics data processing pipeline to obtain information from NGS reads which is currently lost in the regular analysis. Both objectives in order to demonstrate that the integration of both methodologies would make the current and future epidemiological surveillance programs and research protocols more efficient.

Master Mut Analysis
Samples collected between March and October 2021 were analyzed with Master Mut Kit. Table 1 shows the summary of the variants found in the 87 samples that were analyzed.

Concordance of SARS-CoV-2 Variant Identification by Master Mut and by Sequencing
All samples analyzed by Master Mut kit were further analyzed by sequencing with Illumina ® COVIDSeq™ Kit in an iSeq platform, and genome sequences were obtained with the Illumina DRAGEN COVID Lineage v3.5.3 app. Samples were prepared following manufacturer instructions. Fasta files were downloaded from the BaseSpace platform for further analysis.
The resulting SARS-CoV-2 genomes were identified using the Pangolin COVID-19 Lineage Assigner web application (Available at pangolin.cog-uk.io, last accession 14 December 2021). The resulting identifications were compared to the mutations and variants previously identified by the Master Mut kit.
Master Mut is capable of identifying VOI and VOC and distinguishing samples that did not belong to any of them. For 86 of 87 samples (98.5%), there was concordance between the Master Mut Kit analysis results and the data obtained from NGS sequencing with COVIDSeq Test. The only sample which did not have matching results between sequencing and Master Mut kit was M84 since the consensus sequence did not match all mutations previously described.
M84 sample presented five mutations in Master Mut Kit, but in the consensus sequence obtained from Illumina DRAGEN COVID Lineage v3.5.3 app, there were only two mutations, L452R and T478K, while R346K, E484K and N501Y were not present.
The fastq files of this and the other three samples were downloaded and analyzed locally.

Results from Local Data Analysis
Since a result from Illumina DRAGEN COVID Lineage v3.5.3 app was not fully concordant with the results from Master Mut (Sample M84), the sequencing reads from four samples (M81, M83, M84 and M86) were manually reviewed, mapped and assembled, in order to analyze and compare the information generated by NGS data processing tools, especially looking for data lost in simplification and automatic consensus generation. It was decided to analyze more samples than just M84 to test the procedure with samples apparently homogeneous, to decrease the possibility of misinterpretation of FreeBayes results. FreeBayes will analyze the mapped reads and calculate the relative abundance of mutations present, given a reference genome. With the parameters used in this paper, the groups will be 3; AC = 3 means the mutation is present in virtually all the reads, AC = 2 indicates the mutation is present in most reads, and AC = 1 indicates the mutation is present in few reads, but at least 15% of them.

Sample M81
This sample was identified as a Delta variant. The mutations detected by DRAGEN are the same as detected by FreeBayes, except for GCT28086ACA, but it is important to notice that this mutation is grouped in AC = 1, which means its abundance is lower than 50%, to be exact, 150 reads have this mutation, while 226 have the wildtype allele; hence just a 40% of the reads present the mutation ( Table 2). Since only 40% of the reads are mutated; therefore, the automatic analysis performed by the DRAGEN COVID Lineage app discards them.

Sample M83
This sample was identified as a Lambda variant. FreeBayes (Table 3) detects three mutations not detected by DRAGEN, all of them are classified as AC = 1, of which 2 are near to the 3 end of the viral genome. The mutation detected by FreeBayes at base 26,894 is to be noted, since 10,674 reads have it, while 15,404 reads had the native base, and the total depth at this position is 26,097, this means that although the mutation was detected in 40% of the reads, it was not represented in the result provided by DRAGEN. This synonymous mutation is located within the M gene of SARS-CoV-2.  Regarding the mutations near the 3 end of the genome (C29370T and C29870A), the number of reads is very low compared with the rest of the genome. The reads at each position are 1427 and 54, respectively, and the abundance was below 25% of those reads. In contrast, the median depth was 4638. This mutation has been reported in other Lambda samples, but the low number of reads and their respective abundance, especially in the case of C29870A, difficulties the determination of mutation presence.

Sample M86
This sample is composed mainly of a Delta variant, and there are two mutations present in many reads but not all ( Table 4).
The first is the deletion 23,583-23,609, present in 94.73% of the reads. This mutation is interesting since, apparently, it has surpassed the wild-type population. A similar mutation is known to arise after passages in cultured cells [8], which is the case of this sample. The other mutation is G24410A, present in 70.98% of the reads.

Sample M84
Previously, DRAGEN COVID Lineage v3.5.3 identified just the presence of Delta variant, with the mutation pattern characteristic of 21J, but this sequence did not contain three of the mutations detected by Master Mut Kit (R346K, E484K and N501Y). Table 5 contains all AC = 3 and AC = 2 mutation groups from FreeBayes. The abundant mutations match almost all the mutations detected in the consensus sequence from DRA-GEN, with one exception (G29742T mutation); This mutation was detected by DRAGEN in the consensus, but FreeBayes considered this mutation as one a part of the less abundant mutations. The number of reads for this position is 40, with a 50/50 distribution between mutant and wild-type reads. Therefore, the reason to consider this mutation in AC = 1 is because it is below the abundance threshold of FreeBayes, but it is at the abundance threshold of DRAGEN COVID Lineage v3.5.3. In Table 6, TAAAATG28270TAAATG mutation is listed because it relates to the abundant mutation TAAAATG28270TAAATG, and Freebayes considers them as alternative alleles at the same position, and mutually exclusive. Another position also presents an alternative mutation (C23604G and C23604A), which encode the P681R and P681H mutations in the S gene, respectively. Cross-contamination of the sample cannot be ruled out just by the results of screening or NGS; therefore, the sample was extracted and sequenced again, and the results were equivalent. These results can be seen in Table 7.
As we can see, there are three different assignations between both sequencing results. Mutation C4002T (First AC = 2, second AC = 3), TAAAATG28270TAAATG (First AC = 2, second AC = 3) and G29742T (First AC = 1, second AC = 2). All these changes can be originated since the percent of the mutation presence in reads is higher in the second NGS, changing from 93.96% to 98.10%, 83.01% to 84.41% and 50% to 62.92%, respectively. It is important to note that despite being at a 93.96% abundance, C4002T mutation was grouped in AC = 2; but TAAAATG28270TAAATG, at just 84.41%, is grouped in AC = 3, and the assignment of this mutation in AC = 3 groups could also be the reason of the secondary mutation at that position (TAAAATG28270TATAATG) not being listed in the vcf file of second sequencing. TGTTAA26157TA is neither listed in the vcf file since the percent of presence at the position must be higher than 15%. At last, G29868C and A29871T were not adequately covered in the first sequencing; 91.02% of the mutation detection and group assignment were fully concordant between both experiments. Five out of seven differences were due to the threshold and the assignment of groups, a larger study, with more sequencing repetitions, could help to adequately tune the threshold to an adequate value in which false negatives or positives are avoided without losing resolution.

Discussion
The Master Mut Kit showed a high concordance with NGS results and could be a valuable tool for mutation screening and variant surveillance. The mutation pattern of VOI and VOC is characteristic to them, and even if some mutations are shared, each combination represents a unique variant. Although VOC and VOI do not represent all currently circulating variants, they represent most of the cases considering the information obtained from sequencing [9]. Thus, variant identification is possible by detecting the presence or absence of specific mutations. Although this method is limited to detecting those nine mutations, the design of the test can be adapted to detect emerging variants, by introducing a new mutation pattern or by changing one or more of the currently detected mutations. Another significant drawback is the interference in the method caused by other mutations in the periphery of those detected since these changes affect the hybridization of probes and could compromise the detection [10]. Until now, these potential issues were solved by continuously updating the kit design, by including new targets and actualization of current ones.
These changes keep the kit at an update cycle, which involves the design of new assays, standardization, validation and deployment. This process is vital for developing tools to analyze highly transmissible viruses such as SARS-CoV-2.
Furthermore, the relevance of SARS-CoV-2 variants in clinical outcomes has been addressed but results aren't homogeneous across studies [11,12]. There could be several reasons for this, from sample size, genetic background of the population, comorbidities, available medical equipment and personnel, and the method employed for variant identification. Some studies rely on sequencing to determine the variant present, others on a test, such as S-target failure. Nevertheless, the first is not available for all medical facilities, and the latter is useful just for the 69-70 deletion detection. A mutation pattern analysis can provide more information about the variant, or variants, present in a sample, than just the S gene dropout. Moreover, if a variant, or a specific mutation within a variant, prove to be critical in clinical outcome, symptom development, or even treatment, the identification of variant could be readily available upon SARS-CoV-2 diagnostic, even simultaneously.
Regarding the sequencing results, sample M81 is relevant because virtually all mutations are classified in group AC = 3, which means they are present in almost 100% of the reads, as the AC value indicates this, but GCT28086ACA mutation is clearly present in some reads, 150 of 376 total reads. Since this value is above the expected error rate of PCR or Illumina sequencing technology [13,14], it is possible that the analysis of mapped reads with FreeBayes reveals the rise of a new mutation from the initial population but is not visible in DRAGEN analysis since it discards them. A similar scenario is observed in sample M83, since most of the mutations are grouped within AC = 3, except for C2919T, G10097A, C26894T, C29370T and C29870A. The last two are close to the 3 , so the read number is low compared to the other sites. Setting those two sites aside, other sites have reads as high as 10,674 for the mutant base, of a total of 26,097, and they are not listed in the consensus sequence obtained from DRAGEN. Finally, for sample M86, the mutation distribution is the same between FreeBayes and DRAGEN. Within this sample, two sites have mutant and native reads, deletion at 23,583-23,609, where 94.73% of reads contain mutations, and G24410A, with 70.98% of mutant reads. The difference in percent suggests that those mutations arose at different events, and the deletion could be the first event since it has a higher relative abundance, but this should be experimentally proven. Since both percentages are higher than 50%, DRAGEN includes them at the consensus sequence; therefore, FreeBayes and this consensus sequence contain the same mutations.
As demonstrated for other viruses, could not represent a homogeneous population, but a mixture of them [15,16], and these analyses suggest that SARS-CoV-2 behaves the same way.
The result from FreeBayes analysis reflects that changes in the SARS-CoV-2 populations can be finely studied through the analysis of sequencing data as a mixture of genomes instead of a homogeneous and unique population, an application with potential for determining the genomic conservation and purity of strains. However, it is necessary to include more samples and controls to thoroughly evaluate the viability and utility of such analysis.
All three samples present a similar result between the mutations observed in DRAGEN and FreeBayes analysis, with little difference between them, but for the M84 sample the difference is higher. FreeBayes detects 78 mutations, and DRAGEN detects just 48 of them. These 30 different mutations are low abundance mutations, an abnormally high number compared with the other samples. All mutations previously detected by Master Mut Kit are listed in the FreeBayes report, with N501Y, E484K and R346K listed in the lower abundance group, which is consistent with the result of the RT-qPCR analysis of the sample.
Considering this sample as a population composed of two variants, the genomes of those hypothetical strains were determined by joining the mutation groups as follows. The first variant genome resulted from merging AC = 3 and AC = 2 groups of mutations, which is almost equal to the DRAGEN consensus genome. Furthermore, the other genome, which belongs to the less abundant, was built using AC = 3 and AC = 1 groups of mutations. The first genome was identified as a Delta variant, while the second genome was identified as a Mu variant, which is the same result obtained previously by Master Mut Kit. With this new result, Master Mut Kit analysis was fully consistent with the NGS result, but just when NGS data was analyzed with FreeBayes. The DRAGEN COVID Lineage v3.5.3 app is part of the BaseSpace platform from Illumina, an integrated online toolkit with numerous applications for a wide variety of applications. Since the diversity of analysis and the demand of computing time is that high, the deep reached at each sequencing analysis is not the best for a comprehensive analysis of fine sequencing results. The tools already available are enough for the determination of a consensus sequence, however, remain as a basic analysis resulting in the loss of essential data but analyses, such as FreeBayes, could provide more information with no experimental procedure changes. This information could be a milestone in the study of SARS-CoV-2 population dynamics or even evolution. In the future, this kind of approach can be directed to the evaluation of changes in the population originated by treatments, replacing current methods and technology and thus eliminating its limitations [17].
As stated before, data of M84 suggest the presence of both variants within the patient, but more studies must be performed to assess if both infections are active, and further, if the patient can be infected at the same time and the virus coexists, or if one of those variants dominates over the other, extinguishing it.
As shown in the tables, the analysis with DRAGEN is accurate for most of the mutations present in samples evaluated but lacks the function to detect and identify populations of genomic variants present in lower abundance. This characteristic is not part of the current epidemiological program aim, but it is important to highlight the potential data that could be obtained from this analysis. To this date, 6,160,790 submissions have been shared in the GISAID database [9]. These submissions contain not only the consensus sequence, but also their taxonomy, collection date, location and patient information, and the sequencing technology used to obtain said consensus. GISAID is designed with an epidemiological purpose, centralizing data and generating statistical analysis based on the data contributed by the whole world. Even if it contains the information of mutations present in each sample, it lacks data generated by NGS other than consensus. Sample characteristics, such as populations, mutations present in lower proportion, mutations present in the same base but in different molecules, and even simultaneous infections, are just overlooked, and the opportunity to fully characterize samples is lost. Of course, it is not an easy task to analyze and store a massive database that contains all NGS results, such as reads or mapped reads, but the storage of a record of single nucleotide mutations, insertions, and deletions in a convenient form, such as a vcf file is by far achievable in an easier way than the storage of all reads and mapping, and more convenient to analyze and compare across samples or regions.
As NGS data is composed of reads that originated from RNA amplification from the sample, it is expected that the proportion found in the sequencing data relates to the proportion present in the original sample, but this proportion can be biased in the amplification step. Nevertheless, some tools consider the percent contribution to deconvolute the reads mixture, such as MixEmt [18] which has proven the separation of haplotypes from mixtures with good accuracy [13]. Other methods such as iterative mapping against references [19] have been used to analyze closely related organisms whose genomes are mixed within a sample, or as specialized software like SNPGenie [16,19,20]. The accuracy of some tools has been analyzed, testing both human WGS and WES [21], but must be proven valid at classifying data from SARS-CoV-2 genomes. Incorporating tools like FreeBayes in NGS analysis and mutations PCR screening in common practice will increase the information available, for genomic analysis and epidemiology, respectively, and will not represent a significant difference in terms of economy, time, or specialized training.
As stated by other authors [16,17,19,20], NGS data can be exploited to obtain information further than the sequence itself, and this information can improve the understanding of the evolution of the virus, both within-host and host-to-host change, the impact of genetic drift and both natural and immunological selection, and ultimately, factors which are determinant and drive the viral genetic change over time. On the other hand, surveillance programs must be reviewed and reinforced with the deployment of new tools and algorithms in order to achieve an extensive data collection, which then could be used for evaluation of the current epidemiological situation, as well to epidemiological forecasting, and finally, enable the analysis of how these mutations arise, and disappear or become fixated, over time, not only as a biochemical and physiological event but as an epidemiological phenomenon.

Samples and Diagnosis
Clinical samples of nasopharyngeal and pharyngeal swabs were taken from patients with COVID-19 symptoms or people without symptoms but at risk of being infected by SARS-CoV-2. Twelve culture samples were provided by a research laboratory.
RNA extraction was performed using Quick-RNA™ Viral Kit (Cat. R1035, Zymo Research ® , Irvine, CA, USA) and SARS-CoV-2 diagnosis was performed with the CoviFlu kit (Genes2Life, Irapuato, Mexico). Each RT-qPCR analysis included a positive control reaction, with a positive template included with the kit, and a negative non template reaction.
Positive samples with a threshold cycle value (Cq) of 31 or earlier were selected and analyzed with Master Mut Kit (Genes2Life, Irapuato, Mexico).  Table S1. Each RT-qPCR analysis performed a positive control reaction, with a mutant template included in the kit, and a negative control reaction, using either NATtrol SARS-Related-Coronavirus 2 (SARS-CoV-2) Stock (ZeptoMetrix, Buffalo, NY, USA) or a sequenced native sample as template.

Sample Sequencing
Eighty-seven samples analyzed by Master Mut kit were further analyzed by sequencing with Illumina ® COVIDSeq™ Kit (Illumina, San Diego, CA, USA) in an iSeq platform, and genome sequences were obtained with the Illumina DRAGEN COVID Lineage v3.5.3 app. Samples were prepared following manufacturer instructions. PhiX Control v3 (Illumina, San Diego, CA, USA) was used in each experiment.

NGS Data Processing and Variant Calling
Two pathways for data analysis were followed and compared.

Automatic Analysis: BaseSpace Sequence Hub Platform (Illumina)
The automatic data process offered by Illumina online platform was employed as the first tool. The main advantage of this online tool is the easy access and friendly user interface which have the full pipeline for SARS-CoV-2 genome sequence determination and subsequent sequence update to GISAID in one platform, thus eliminating the need to install and use each of the software programs and algorithms needed for local genome assembly; with the downside of eliminating the possibility of a deeper analysis of obtained sequencing data.
In brief, the resulting files were classified with the DRAGEN COVID Lineage v3.5.3 app. The consensus sequence obtained was compared with the reference genome of SARS-CoV-2 (NC_045512.2).
The consensus sequence was then uploaded in Nextclade (clades.nextstrain.org, last accession 14 December 2021) and the mutation list was analyzed against the results obtained from the other tools.

Analysis with Other Bioinformatics Tools: Samtools and Freebayes
Trimmed fastq files were downloaded with BaseSpace Sequence Hub Downloader. Then, reads were mapped on the reference genome of the Wuhan SARS-CoV-2 virus (NC_045512.2) using BWA (v0.7.17-r1188). The alignments were sorted and indexed with samtools (v1.13). With this data as input, the bioinformatics tool FreeBayes (v1.3.5) was employed for variant calling. "FreeBayes can act as a frequency-based pooled caller and describe variants and haplotypes in terms of observation frequency rather than called genotypes" [22]; therefore, this tool will classify the mutations present in the fastq file depending on their relative abundance.
This command indicates that all the mutations present in at least ten reads, and representing above 15% of position depth, must be listed in the Output.vcf file. Additionally, mutations listed in the vcf file will be classified into three groups in the function of their relative abundance; those groups are low abundance (AC = 1), abundant but not dominant (AC = 2) and present in virtually all reads (AC = 3).
FreeBayes can separate the mutations in different groups because the ploidy expected from the sample can be changed. Here we used a ploidy of 3, but a different ploidy value could have a better performance depending on the sample. With this ploidy value we can separate present mutations in three clusters: The first, which is present in virtually all reads, with Spike D614G as a perfect example, and two complementary mutations sets, AC = 2 and AC = 1, each one with mutations present at lower abundance.
This means that mutation present in the higher abundance group, AC = 2, plus the mutations of AC = 3, would be from a single viral population. Therefore, mutation group AC = 1 plus mutation group AC = 3, would be the complete mutation pattern of the less abundant viral population.
The resulting vcf file is converted to a spreadsheet for data display. BAM files were visualized with Tablet [23].

Conclusions
RT-qPCR screening of mutations was fully concordant with NGS results; therefore, it can accurately measure the incidence of VOI and VOC, at a lower cost and shorter time compared to NGS. Additionally, the result obtained with this kit allowed identifying a possible co-infection case, an event hard to identify with NGS data and current bioinformatics analysis. Finally, a deeper NGS data analysis with FreeBayes vcf file, or similar software, will provide more information about the genomic characteristics of the population within a sample, and can be implemented in current databases without demanding an excessive storage capacity as it would be required for fastq o bam files.
Our results encourage the use of new validated methods which can be employed for an extensive and affordable genomic surveillance of SARS-CoV-2 variants, and recommend further development of them, especially in developing countries.  Data Availability Statement: All data used in this article is available upon request, including fastq, bam, and vcf files.