Evaluation of EPISEQ SARS-CoV-2 and a Fully Integrated Application to Identify SARS-CoV-2 Variants from Several Next-Generation Sequencing Approaches

Whole-genome sequencing has become an essential tool for real-time genomic surveillance of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) worldwide. The handling of raw next-generation sequencing (NGS) data is a major challenge for sequencing laboratories. We developed an easy-to-use web-based application (EPISEQ SARS-CoV-2) to analyse SARS-CoV-2 NGS data generated on common sequencing platforms using a variety of commercially available reagents. This application performs in one click a quality check, a reference-based genome assembly, and the analysis of the generated consensus sequence as to coverage of the reference genome, mutation screening and variant identification according to the up-to-date Nextstrain clade and Pango lineage. In this study, we validated the EPISEQ SARS-CoV-2 pipeline against a reference pipeline and compared the performance of NGS data generated by different sequencing protocols using EPISEQ SARS-CoV-2. We showed a strong agreement in SARS-CoV-2 clade and lineage identification (>99%) and in spike mutation detection (>99%) between EPISEQ SARS-CoV-2 and the reference pipeline. The comparison of several sequencing approaches using EPISEQ SARS-CoV-2 revealed 100% concordance in clade and lineage classification. It also uncovered reagent-related sequencing issues with a potential impact on SARS-CoV-2 mutation reporting. Altogether, EPISEQ SARS-CoV-2 allows an easy, rapid and reliable analysis of raw NGS data to support the sequencing efforts of laboratories with limited bioinformatics capacity and those willing to accelerate genomic surveillance of SARS-CoV-2.


Introduction
Whole-genome sequencing of SARS-CoV-2 using next-generation sequencing (NGS) is a powerful tool for studying coronavirus disease 2019 (COVID-19) and tracking the evolution and spread of the virus [1]. Accurate information about the global spread of SARS-CoV-2 is critical to allow an adapted public health response. Multiple protocols have been developed and a huge volume of sequencing data has been generated in the past

Patients and Samples
Nasopharyngeal swab (NPS) samples that tested positive for SARS-CoV-2 by quantitative RT-PCR, with cycle threshold (Ct) values ranging from 15 to 30.8, were selected for sequencing in this study.
Samples used for the validation of the EPISEQ SARS-CoV-2 pipeline vs. a reference method (n = 1700) were collected between February 2021 and March 2022 and sequenced as part of random genomic surveillance by the Virology Laboratory of Hospices Civils de Lyon (HCL, France). Investigations were conducted in accordance with the General Data Protection Regulation (Regulation (EU) 2016/679 and Directive 95/46/EC) and the French data protection law (Law 78-17 on 06 January 1978 and Décret 2019-536 on 29 May 2019).
Samples used for the comparison of kits and sequencing platforms using the EPISEQ SARS-CoV-2 bioinformatics pipeline (n = 40) were leftover samples of routine laboratory testing for SARS-CoV-2 infection collected between April 2020 and February 2022, provided by the Virology Laboratory of HCL (Lyon, France), Oriade-Noviale medical laboratories (Saint-Martin d'Hères, France) and Eurofins Biomnis Sample Library (Lyon, France). These samples were sequenced at bioMérieux (Marcy l'Etoile, France), as described below.
The study was conducted in accordance with the Declaration of Helsinki and followed the standards of Good Clinical Practice. Ethical review and approval were waived for this study, as all samples were collected for regular clinical management, with no additional samples needed for the purpose of the study. Patients were informed of the research and their non-opposition to the use of leftover samples for research purposes was obtained, in accordance with French regulations.

Sequencing
For the EPISEQ SARS-CoV-2 validation study, total nucleic acid was isolated from the NPS samples using the automated MGISP-960 system (MGI Tech Co., Ltd., Shenzhen, China). Eluted nucleic acids were used as inputs for cDNA synthesis. cDNA synthesis and multiplexed amplicon-based whole-viral-genome sequencing was performed using  [19]. Libraries were quantified prior to sequencing using the Qubit dsDNA HS Assay Kit (Invitrogen, Waltham, MA, USA, Q32851) and then 100 bp paired-end sequenced using the NovaSeq 6000 Sequencing System SP flow cell (Illumina, San Diego, CA, USA). Three negative controls were processed per 96-well plate run.
For the kit and sequencing platform comparison study, total nucleic acid was isolated with the NucliSENS easyMAG system (bioMérieux, Marcy L'Etoile, France) using the Specific B protocol and an elution volume of 50 µL. Negative control samples (at least one per sequencing run) were generated by processing nuclease-free water as an input sample for nucleic acid extraction. 8 µL eluted nucleic acids (or negative control sample) were used as input for cDNA synthesis. cDNA synthesis and multiplexed amplicon-based whole-viral-genome sequencing were performed using NEBNext® ARTIC kits (New England Biolabs [NEB], Ipswich, MA, USA) and primer pools listed in Table 1, according to the manufacturer's recommendations [20]. NEBNext® ARTIC kits are designed according to the protocols and primers developed by the ARTIC network [19,21]. Libraries were quantified prior to sequencing using the Qubit dsDNA HS Assay Kit (Invitrogen, Waltham, MA, USA, Q32851). For Illumina sequencing, library quality and size were also evaluated by capillary electrophoresis (Femto Pulse System, Agilent, Santa Clara, CA, USA) using the Ultra Sensitivity NGS Kit (Agilent, Santa Clara, CA, USA, FP-1101-0275

Sequencing Data Export and Analysis
For the EPISEQ SARS-CoV-2 validation study, reads of Illumina sequencing conducted on the NovaSeq 6000 Sequencing System SP flow cell were first processed for basecalling and demultiplexing using the Illumina DRAGEN Bio-IT Platform. Raw FASTQ reads were then used as input for a reference analysis using the in-house bioinformatics pipeline seqmet (github genEPII) [22], as recently described [23]. Briefly, paired reads were trimmed with cutadapt to remove sequencing adapters and low-quality ends, only keeping reads longer than 30 bp [24]. Alignment to the SARS-CoV-2 reference genome (isolate Wuhan-Hu-1 MN908947.3) was performed using Minimap2 [25]. Mapped reads were processed to remove duplicates tagged by picard, then realigned by abra2 to improve indel detection sensitivity and finally clipped with samtools ampliconclip to remove read ends containing primer sequences [26][27][28]. Variants present at frequencies ≥ 5% were called using freebayes, then decomposed and normalized with vt and filtered with bcftools to eliminate false positives [28][29][30]. Co-infections were detected as previously described [23]. The percentage of coverage of the consensus sequence to the reference genome was calculated and SARS-CoV-2 variant clade and lineage were identified according to the Nextstrain clade and Pango lineage nomenclatures [31][32][33] using Nextclade v1.11.0 and Pangolin v3.1.20, respectively. For a second time, raw FASTQ reads were used as input for analysis by the EPISEQ SARS-CoV-2 application, as described below.
For the kits and sequencing platforms comparison study, basecalling and demultiplexing were conducted using the Real-Time Analysis (RTA) software v1.18.54 (for NGS data generated on the Illumina MiSeq device) or the Guppy software v4.3.4, v5.0.11, v5.0.13 or v5.1.13, as they became available (for NGS data generated on the ONT GridION device). Raw FASTQ reads were then used as input for analysis by the EPISEQ SARS-CoV-2 application. Alignment to the reference genome (isolate Wuhan-Hu-1 MN908947.3), generation of a consensus sequence, calculation of the percentage of coverage of the reference genome, and identification of amino acid mutations were automatically performed using EPISEQ SARS-CoV-2. SARS-CoV-2 variant clade and lineage were identified in EPISEQ SARS-CoV-2, according to the Nextstrain clade and Pango lineage nomenclatures [31][32][33], using the same versions of Nextclade (v1.11.0) and Pangolin (v3.1.20), respectively, as those used for the reference pipeline.

Data Analysis
EPISEQ SARS-CoV-2 was compared to the validated bioinformatics pipeline (github genEPII; [22,23]) set as a reference, using a large set of raw NGS data (n = 1700 samples). Genome coverage (% of reference genome) established by EPISEQ SARS-CoV-2 was compared to that determined by the reference method using non-parametric Spearman correlation in GraphPad Prism 5.04. A p-value < 0.05 was considered statistically significant. The percentage of coverage was calculated using the following formula: (number of non-ambiguous bases)/29,903 × 100. The percentage of agreement between EPISEQ SARS-CoV-2 and the reference method in clade and lineage assignment and in SARS-CoV-2 amino acid mutation identification was calculated for samples with genome coverage greater than 95% (as determined by the reference method). The Exact Binomial 95% confidence intervals (95% CI) were computed using the SAS Enterprise Guide 8.2 software. In addition, the number of single-nucleotide polymorphisms (SNPs) detected after pairwise alignment of consensus sequences (with >95% genome coverage) generated by EPISEQ SARS-CoV-2 vs. the reference method (between-method SNPs) was evaluated. For concordance analyses, sequence comparisons did not consider regions with undetermined (N) nucleotides and indels (insertions or deletions) in any of the respective consensus sequences.
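The coverage formula above can be illustrated with a minimal Python sketch. It assumes that the consensus sequence is available as a plain string and that only A, C, G and T count as non-ambiguous bases. The function names are hypothetical and are not part of EPISEQ SARS-CoV-2 or of the reference pipeline.

```python
REFERENCE_LENGTH = 29_903  # denominator used in the coverage formula above


def percent_coverage(consensus: str, reference_length: int = REFERENCE_LENGTH) -> float:
    """Return (number of non-ambiguous bases) / reference_length x 100."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    # Only unambiguous A, C, G and T are counted; N, gaps and other
    # IUPAC ambiguity codes are ignored (an assumption of this sketch).
    non_ambiguous = sum(1 for base in consensus.upper() if base in "ACGT")
    return non_ambiguous / reference_length * 100


def passes_coverage_cutoff(consensus: str, cutoff: float = 95.0) -> bool:
    """True when coverage is strictly greater than the cutoff (>95% in this study)."""
    return percent_coverage(consensus) > cutoff
```

With this definition, a consensus containing 29,000 unambiguous bases has a coverage of about 97.0% and would be retained for the agreement analyses.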
Following validation against the reference method, EPISEQ SARS-CoV-2 was used to compare the analysis of raw NGS data of SARS-CoV-2-positive samples generated in parallel on two sequencing platforms (Illumina, ONT) using several commercial kits (Table 1). The percentage of reference genome coverage was calculated and depicted as Tukey box plots [34] using GraphPad Prism 5.04. The percentage of concordance between kits and sequencing platforms in clade and lineage assignment was calculated. Variations in nucleotide and amino acid detection between kits and sequencing platforms were recorded and evaluated using heatmaps (designed in R version 3.6.1) and nucleotide sequence alignments (generated using Geneious 10.0.7).
First, a quality check of the input FASTQ files is performed. It consists of checking the format and integrity of the uploaded files and verifying if enough SARS-CoV-2-related reads are available for analysis using the Fastv public tool [35].
Second, genome assembly is carried out. For Illumina sequencing data, the reads are aligned against the SARS-CoV-2 reference genome (isolate Wuhan-Hu-1 MN908947.3) using bwa (v0.7.17) [36], and automatic detection of the primer kit is performed using a proprietary tool. Primer sequences are then trimmed and a consensus sequence is generated using the ivar (v1.3.1) public tool [37]. For ONT sequencing data, an automatic detection of the primer kit is performed using a proprietary tool before filtering the input reads based on their size to remove potential chimeric reads, aligning the reads on the SARS-CoV-2 reference genome (isolate Wuhan-Hu-1 MN908947.3) using a minimap2 (v2.17) public tool [25], trimming the primers, and creating a consensus sequence according to the ARTIC network bioinformatics protocol [38].
Third, quality controls of the consensus sequence are performed, including its length, the percentage of reference genome coverage, the sequencing depth, the number of ACGT (non-ambiguous) bases, and statistics related to the assembly quality of the spike-coding S gene.
Fourth, variant identification and mutation screening based on the consensus sequence are conducted. Variants are identified according to the up-to-date Nextstrain clade and Pango lineage nomenclatures using the Nextclade and Pangolin public tools, respectively [31][32][33]. Variants of concern (VOC) are labelled according to the definitions of the World Health Organization and Centers for Disease Control and Prevention [39]. Mutations are screened in all SARS-CoV-2 genes, including the S gene, using Nextclade [31,33].
The complete analysis is performed in one click and takes a few minutes upon NGS FASTQ data upload. As an example, it took 12 min to analyse the 19 omicron samples of the study sequenced on the Illumina platform. Multiple samples can be processed in parallel. Following analysis, a simple report is available for download in portable document format (PDF) (Figure S1). The consensus sequence generated during analysis can be downloaded, and the results can also be exported in batch to a Microsoft Excel file.

Validation of EPISEQ SARS-CoV-2

SARS-CoV-2 Genome Coverage
Agreement in sequence analysis by EPISEQ SARS-CoV-2 and the reference method was evaluated using 1700 whole-genome SARS-CoV-2 sequences generated on Illumina NovaSeq 6000. The dataset included sequences of 990 pre-omicron samples sequenced with ARTIC v3 (n = 619) and ARTIC v4 (n = 371) primer sets and 710 samples of the omicron era sequenced with ARTIC v4.1 primer set. Genome assembly length (expressed in % of the reference genome) of the 1700 samples, as determined by EPISEQ SARS-CoV-2 and the reference method was compared. Following the quality control step by EPISEQ SARS-CoV-2, which considers the percentage of genome coverage, the sequencing depth, and the number of non-ambiguous ACGT bases of the consensus sequence, 68 samples were attributed the status "QC Fail" by EPISEQ SARS-CoV-2. A "QC Fail" status implies that no consensus sequence is generated; these samples were excluded from the comparison. Genome coverage of a total of 1632 sequences calculated by both bioinformatics tools was highly correlated (Figure 1) (Spearman correlation r = 0.883, p < 0.0001).
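The Spearman correlation reported above was computed in GraphPad Prism. The following standard-library Python sketch only illustrates the underlying calculation (Pearson correlation of average ranks, with ties sharing their mean rank). The function names are hypothetical and the p-value computation is omitted.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman rank correlation between two equally long lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

In this setting, x and y would hold the per-sample genome coverage values obtained with EPISEQ SARS-CoV-2 and with the reference method, respectively.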

SARS-CoV-2 Variant Call
Out of these 1632 sequences, 1362 with a genome coverage > 95% (based on the reference method) were considered to assess the concordance in variant call (Nextstrain clade and Pango lineage) by the EPISEQ SARS-CoV-2 pipeline vs. the reference method (Table 2). Agreement between both analysis methods to identify SARS-CoV-2 variant clade and lineage over the whole dataset (n = 1362) was >99%, ranging from 98.7% to 100.0% depending on the variants investigated and the respective primer pools used (Table 2). The evaluation of the 12 apparent discordant sequences (two for clade and 10 for lineage; Table 2) revealed that two sequences were not assigned a clade with the reference method due to a large deletion in the S gene (preventing a comparison with EPISEQ SARS-CoV-2) and that 10 sequences were assigned distinct sub-lineages within the same main lineage by the two pipelines (Table S1). Among those 10 sequences, slight differences in the percentage of coverage (<1.6%) between both pipelines were observed and four sequences showed one or two single-nucleotide polymorphisms. Samples with differing lineage attributions were concordant in their clade definition and vice versa (Table S1). Therefore, no major discrepancies were identified between both analysis tools as to clade and lineage assignment.
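The study computed exact binomial 95% confidence intervals in SAS Enterprise Guide. As an illustration only, the sketch below derives a percentage of agreement and a Clopper-Pearson (exact binomial) interval by bisection on the binomial tail probabilities, using the Python standard library. The function names are hypothetical. The counts in the usage note are derived from the 2 clade and 10 lineage discordances reported above (n = 1362).

```python
from math import exp, lgamma, log


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability P(X = k), computed in log space to avoid overflow."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1.0 - p))
    return exp(log_pmf)


def _bisect_increasing(f, iterations: int = 100) -> float:
    """Root of an increasing function f on (0, 1), found by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_binomial_ci(k: int, n: int, confidence: float = 0.95):
    """Clopper-Pearson interval for k agreements out of n comparisons."""
    if not 0 <= k <= n or n == 0:
        raise ValueError("require 0 <= k <= n and n > 0")
    alpha = 1.0 - confidence

    def upper_tail(p):  # P(X >= k), increasing in p
        return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

    def lower_tail(p):  # P(X <= k), decreasing in p
        return sum(binom_pmf(i, n, p) for i in range(0, k + 1))

    lower = 0.0 if k == 0 else _bisect_increasing(lambda p: upper_tail(p) - alpha / 2)
    upper = 1.0 if k == n else _bisect_increasing(lambda p: alpha / 2 - lower_tail(p))
    return lower, upper


def percent_agreement(k: int, n: int) -> float:
    """Percentage of concordant results."""
    return k / n * 100
```

For example, `percent_agreement(1360, 1362)` and `exact_binomial_ci(1360, 1362)` would describe clade agreement over the whole dataset, and the same calls with 1352 would describe lineage agreement.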

SARS-CoV-2 Whole-Genome Consensus Sequence
Genome assemblies performed by the EPISEQ SARS-CoV-2 and the reference pipelines were further compared by evaluating the number of single-nucleotide polymorphisms (SNPs) detected between consensus sequences generated by both pipelines (Table 3). For this nucleotide sequence comparison, regions of the consensus sequences with undetermined (N) nucleotides or indels in either analysis pipeline were excluded.  (Table 3). Among the four sequences generated with ARTIC v4.1 showing >2 SNPs, three were identified as resulting from SARS-CoV-2 co-infections [23], likely explaining the higher number of variable nucleotides identified by the two pipelines (3, 4 and 5 SNPs, respectively). Among the 55 (ARTIC v4) and 137 (ARTIC v4.1) sequences with 1 SNP (   (Table 3). Nucleotide polymorphisms C8829A, T8835C and T15521A lie within amplicons (not primer-annealing regions), and predict the following amino acid mutations: ORF1a:A2855D, ORF1a:V2857A and ORF1b:F685Y, respectively. Polymorphisms T8835C and T15521A have been reported as sequencing artefacts associated with ARTIC v4 and v4.1 primer schemes resulting from mispriming events within amplicons 29 and 51, respectively [40]. To our knowledge, the less frequent C8829A apparent polymorphism has not been reported to date as a sequencing artefact.
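The between-method SNP count described above can be sketched as follows. This is a minimal illustration that assumes the two consensus sequences have already been pairwise aligned to the same length. It skips every column in which either sequence carries an N, a gap, or any other non-ACGT character. The function name is hypothetical.

```python
def between_method_snps(seq_a: str, seq_b: str):
    """Return (alignment position, base in A, base in B) for each mismatch.

    Columns where either sequence is not an unambiguous A, C, G or T
    (undetermined N, gap from an indel, IUPAC code) are not compared.
    Positions are 1-based alignment columns.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    for index, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if a not in "ACGT" or b not in "ACGT":
            continue  # excluded region: undetermined base or indel
        if a != b:
            snps.append((index + 1, a, b))
    return snps
```

The length of the returned list is the number of between-method SNPs for one sample. When the alignment contains no gaps relative to the reference, positions can be read as reference coordinates.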

SARS-CoV-2 Spike Protein Mutations
Concordance in the detection of amino acid mutations within the spike protein by both pipelines was also examined. Regions with undetermined sequences in either analysis pipeline were excluded from the comparison. Amino acids identified by both pipelines showed a strong agreement (>99% over all sequencing data), ranging from 98.3% for sequences generated by ARTIC v4.1 to 100% for sequences generated by ARTIC v3 (Table 4). Each of the 10 discordances observed with ARTIC v4 or v4.1 corresponded to polymorphisms present in roughly equal proportions among reads, which were designated as consensus in one of the pipelines. Altogether, sequence analyses provided by EPISEQ SARS-CoV-2 as to genome assembly, clade and lineage classification, and SNP identification were in strong agreement with those provided by the reference method. Evaluation of discordances also demonstrated that EPISEQ SARS-CoV-2 performed at least as well as the reference method.

Comparative Performance of Sequencing Platforms and Kits Using EPISEQ SARS-CoV-2
We next evaluated the compatibility of the EPISEQ SARS-CoV-2 tool for the analysis of data generated by commonly used sequencing platforms and reagents. We used EPISEQ SARS-CoV-2 to compare the sequencing results obtained on two sequencing platforms (Illumina MiSeq and ONT GridION Mk1) using different commercial kits and primer pools (Table 1). Altogether, 40 SARS-CoV-2-positive samples covering a broad range of Ct values (15.0-30.8) and including pre-omicron (n = 21) and omicron (n = 19) SARS-CoV-2 variants were selected for this analysis, thus generating a total of 244 raw sequencing results (Tables S2 and S3).

SARS-CoV-2 Genome Coverage
The quality of the 244 NGS data was evaluated by calculating the proportion of genome coverage with EPISEQ SARS-CoV-2 (Figure 2). 235/244 (96.3%) NGS results showed a coverage of the reference genome >95%, with a median coverage ranging from 99.6% to 99.8% on the Illumina platform and from 97.0% to 99.5% on the ONT platform (Figure 2a,b). Out of the nine NGS results with a coverage <95%, seven originated from sequencing on ONT using VSS (v1 or v2) primer sets, one from sequencing on Illumina using VSS v2 primer set, and one from sequencing on ONT using ARTIC v4.1 primer set (Figure 2).

SARS-CoV-2 Variant Call
The analysis of the concordance in clade and lineage identification by EPISEQ SARS-CoV-2 between the different sequencing approaches revealed a 100% concordance over the 40 analysed samples (Table 5) and 243/244 NGS data (Tables S2 and S3). One sequencing result with very low genome coverage (69.1%) could not be assigned a Pango lineage by EPISEQ SARS-CoV-2, although the correct clade was attributed (sample 22; Table S3, yellow field).
Table 5. Concordance of sequencing results of SARS-CoV-2-positive samples generated by different kits and sequencing platforms and analysed using the EPISEQ SARS-CoV-2 pipeline. (Columns: SARS-CoV-2 samples; Nextstrain clade; Pango lineage.)
A detailed analysis of concordant and discordant mutations within spike showed an overall good concordance (Figure 3, dark green and light grey) between all approaches for pre-omicron samples using ARTIC v3, v4, v4.1 and VSS v1 primers (Figure 3a, samples 1 to 21) and for omicron BA.1 samples using ARTIC v4.1 and VSS v2 primers (Figure 3b, samples 22 to 30), except for one sample with low genome coverage (Figure 3b, sample 22). Discordant results (Figure 3, pink, orange and red) were mainly due to the use of outdated primer pools, notably ARTIC v3 vs. v4.1 for the sequencing of delta variants (Figure 3a, samples 16 to 21) or to differences in sequencing performance between ARTIC v4.1 and VSS v2 for the sequencing of omicron BA.2 variants, especially between amino acids 339 and 505 (Figure 3b, samples 31 to 40). In these BA.2 variants, the differences also appeared to be sample dependent.
Figure 3. Concordance in mutation detection between kits and sequencing platforms is shown in dark green (mutation detected in all eight (a) or four (b) conditions) and light grey (no mutation detected in all eight (a) or four (b) conditions). Other colours (pink, orange and red) represent mutations detected with some but not all kit/sequencer combinations, thus indicating a discordance in identified mutations (see Tables S2 and S3 for details).
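The concordance categories used for the spike mutation comparison can be sketched as follows. This is a simplified illustration that assumes each kit/sequencer combination yields a set of mutation labels for one sample. It collapses the pink, orange and red classes into a single "discordant" category, since their individual definitions are not given in the text. The function name, condition names and mutation labels are illustrative only.

```python
def classify_concordance(calls, candidates=()):
    """Classify each mutation across kit/sequencer combinations.

    calls: dict mapping a condition name to the set of mutations detected
           for one sample under that condition.
    candidates: optional extra mutations to report even if never detected.
    Returns a dict mapping mutation -> "detected in all", "absent in all"
    or "discordant".
    """
    if not calls:
        raise ValueError("at least one condition is required")
    mutations = set(candidates).union(*calls.values())
    result = {}
    for mutation in sorted(mutations):
        detected = [mutation in found for found in calls.values()]
        if all(detected):
            result[mutation] = "detected in all"   # dark green in the heatmap
        elif not any(detected):
            result[mutation] = "absent in all"     # light grey in the heatmap
        else:
            result[mutation] = "discordant"        # pink, orange or red
    return result
```

A pre-omicron sample would contribute eight entries to `calls` and an omicron sample four, matching the number of conditions compared in each panel.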
Differences in performance between ARTIC v4.1 and VSS v2 primers for the sequencing of omicron BA.2 variants were confirmed by analysing the alignment of the respective S gene nucleotide sequences (Figure 4). Sequences generated using ARTIC v4.1 often showed gaps of undetermined sequences between nucleotides ~700 and 1250 (overlapping amplicon 75), while sequences produced using VSS v2 showed gaps between nucleotides ~1300 and 1700 (overlapping amplicon 57) (Figure 4b, horizontal black bars). These amplicon dropouts over amplicons 75 (ARTIC v4.1) and 57 (VSS v2) likely resulted from sequencing failures caused by mutations within the BA.2 variant that overlap primer 75R (ARTIC v4.1; two mutations at positions 2 and 7 of primer 75R) and 57L (VSS v2; one mutation at position 27 of primer 57L), respectively, as recently reported [41]. Thus, both ARTIC v4.1 and VSS v2 primer pools presented flaws in accurately sequencing the S gene of the omicron BA.2 variants. These flaws explain the discordant results in amino acid mutations detected by EPISEQ SARS-CoV-2 between the sequencing approaches (Figure 3b, orange and red colours, amino acids 339 to 505). Sequencing gaps were rarely observed using the same primers on samples of omicron BA.1 variants (Figure 4a), except for sample 22 sequenced with VSS v2 on the ONT device, in line with the low genomic coverage described earlier (Table S3 and Figure 3b, orange and red colours).
Finally, considering the mispriming artefacts observed in the validation phase (T8835C, T15521A and possibly C8829A), which was based on NGS data obtained using a different protocol (Illumina COVIDSeq Test on NovaSeq 6000 sequencer), we evaluated the 244 sequencing results obtained in this kit comparison analysis for the presence of polymorphisms at nucleotide positions 8829, 8835 and 15521. None of the 244 generated consensus sequences showed the C8829A, T8835C or T15521A apparent polymorphisms. Coincidentally, eight BA.2 omicron samples out of the 40 analysed samples had also been sequenced using the reference protocol (Illumina COVIDSeq Test on NovaSeq 6000, using ARTIC v4.1 primers). Interestingly, of these eight sequences, one presented the T15521A artefact mutation.

Discussion
This study describes the validation of EPISEQ SARS-CoV-2, an easy-to-use and integrative ("one-click") web-based application developed for sequencing laboratories lacking bioinformatics capacity or wishing to speed up SARS-CoV-2 genomic surveillance without saturating their internal bioinformatics capacity. EPISEQ SARS-CoV-2 can analyse raw NGS data generated by different sequencing platforms (Illumina, ONT) within minutes using a variety of kits and primer pools.
We showed that EPISEQ SARS-CoV-2 provides results comparable to those of a reference in-house bioinformatics pipeline in terms of genome coverage (Spearman correlation coefficient r = 0.883; p < 0.0001), Nextstrain clade and Pango lineage classifications (>99% concordance), and amino acid substitution identification (>99% concordance within the spike protein), across 1362 sequences covering alpha to omicron SARS-CoV-2 variants. Interestingly, the comparison of the nucleotide consensus sequences generated by both pipelines upon sequencing with ARTIC v4 and v4.1 revealed the presence of apparent SNPs (T8835C, T15521A) actually resulting from sequencing errors (mispriming artefacts) frequently detected with ARTIC v4 and v4.1 [40]. These sequencing artefacts were observed in 11-22% (T8835C) and 43-47% (T15521A) of the validation dataset (depending on the pipeline used), thus representing a substantial proportion of artefactual mutations. Several types of sequencing artefacts resulting from mispriming, cross-primer dimerisation or reduced coverage due to amplicon dropout have been described [11,40-42]. These sequencing errors can lead to fallacious mutation reporting and distort phylogenetic trees. They can also lead to erroneous biological interpretations, as illustrated by the misinterpretation of mutation G142D being associated with a higher SARS-CoV-2 viral load [11]. Interestingly, in our study, detection of the mispriming artefacts T8835C and T15521A not only depended on the use of ARTIC v4 and v4.1 primers but also seemed to depend on the sequencing protocols (Illumina COVIDSeq Test vs. NEBNext® ARTIC SARS-CoV-2), regardless of the bioinformatics pipeline used. Similarly to our kit comparison analysis, Lambisia et al. reported that they did not detect T8835C and T15521A SNPs using ARTIC v4 primers with their sequencing protocol [43].
Thus, differences in wet lab protocols should be carefully examined regarding the possible occurrence of systematic sequencing errors. In addition, sequence analysis solutions such as error pre-screening, amplicon size filtering and problematic site masking should be considered to avoid erroneous mutation reporting [40,44].
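Problematic site masking, one of the mitigation options mentioned above, can be illustrated with a short Python sketch. It assumes that the consensus sequence is in reference coordinates (no insertions or deletions shifting positions) and replaces the selected 1-based positions with N. Including position 8829 is an assumption of this sketch, since C8829A is only a possible artefact in this study. The function name is hypothetical.

```python
# 1-based reference positions of the apparent polymorphisms discussed above.
# 8829 is included as an assumption; only 8835 and 15521 are reported artefacts.
PROBLEMATIC_SITES = (8829, 8835, 15521)


def mask_sites(consensus: str, sites=PROBLEMATIC_SITES, mask_char: str = "N") -> str:
    """Return the consensus with the given 1-based positions replaced by mask_char."""
    bases = list(consensus)
    for position in sites:
        # Positions outside the sequence are ignored rather than raising.
        if 1 <= position <= len(bases):
            bases[position - 1] = mask_char
    return "".join(bases)
```

Masking lowers the number of non-ambiguous bases, so genome coverage computed after masking would be marginally lower than before.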
The comparison of NGS data from 40 samples generated by a variety of sequencing approaches (ARTIC and VSS kits on Illumina and ONT platforms) using EPISEQ SARS-CoV-2 revealed a perfect concordance in Nextstrain clade and Pango lineage classifications. It also allowed for the identification of differences in performance in terms of genome coverage and in the identification of mutations (within and outside spike). Such differences were in part expected due to established differences in specificities associated with some of the primer pools (notably between ARTIC v3, v4 and v4.1) [45,46]. In addition, this analysis identified flaws in ARTIC v4.1 and VSS v2 primers for the sequencing of two distinct regions of the S gene of the omicron BA.2 subvariant due to amplicon dropouts that had been previously reported [41].
EPISEQ SARS-CoV-2 is a "one-click" application that allows a rapid and reliable analysis of SARS-CoV-2 NGS data. The results of the analysis, which are essential for proper genomic surveillance of SARS-CoV-2 (variant calls, mutation identification), can then be exported in a simple report (Figure S1). This approach is particularly important for small sequencing laboratories with limited bioinformatics capacity and those needing to improve genomic surveillance and is thus highly relevant in times of pandemics. In comparison, the bioinformatics pipelines provided with the respective sequencing platforms (Dynamic Read Analysis for GENomics [DRAGEN] Bio-IT Platform and DRAGEN COVID Lineage application, Illumina; EPI2ME cloud-based analysis platform Fastq QC + ARTIC + NextClade, ONT), albeit reliable, are more complex to use and their results are more difficult to locate, extract and interpret for a non-specialist in bioinformatics. For instance, the configuration of these pipelines requires the user to specify analysis parameters (e.g., primers used), the analysis provides a large amount of detail, sometimes in separate and large tables, and some results cannot be exported in a simple file, all of which might confuse a non-specialist and make routine analyses more error-prone. On the other hand, these platforms allow data visualisation (e.g., phylogenetic tree), which is not provided by EPISEQ SARS-CoV-2. Additionally, EPISEQ SARS-CoV-2 cannot report complex situations, such as co-infections, as opposed to the reference pipeline used in this study [23].
Strengths of this study include the use of a large number of samples (1362) for the validation of EPISEQ SARS-CoV-2 against a reference bioinformatics method, the choice of samples covering a broad range of past and present SARS-CoV-2 variants, and the comparison of NGS data generated in parallel on two sequencing platforms (Illumina MiSeq and ONT GridION Mk1) using a total of five different kits (ARTIC v3, v4, v4.1 and VSS v1, v2), thus comparing up to four or eight experimental combinations, depending on the variants investigated (omicron or pre-omicron, respectively). A possible limitation of this study is that the performance of EPISEQ SARS-CoV-2 to identify indels was not evaluated. In addition, this study focused on amplicon-based sequencing methods, which are most commonly used in the current era of ongoing SARS-CoV-2 genomic surveillance, and on two sequencing platforms (Illumina and ONT). However, preliminary data indicated that EPISEQ SARS-CoV-2 is compatible with target-enrichment sequencing approaches and that it can analyse NGS data generated by additional sequencing platforms, such as ThermoFisher Ion Torrent (not shown). Thus, in addition to being regularly updated as new SARS-CoV-2 variants emerge, EPISEQ SARS-CoV-2 has been conceived to evolve with the implementation of novel sequencing approaches and reagents according to official recommendations.

Conclusions
EPISEQ SARS-CoV-2 is a reliable and easy-to-use web-based application conceived to support the analysis of SARS-CoV-2 NGS data and the reporting of identified mutations by laboratories with limited bioinformatics skills. The platform is updated weekly to evolve with the reporting of new SARS-CoV-2 Nextstrain clades, Pango lineages, and VOC. The application is also conceived to evolve with the implementation of new sequencing tools.

Institutional Review Board Statement:
The study was conducted in accordance with the Declaration of Helsinki and following the standards of Good Clinical Practice. Ethical review and approval were waived for this study, as all samples were collected for regular clinical management, with no additional samples needed for the purpose of the study.

Informed Consent Statement:
Patients were informed of the research and their non-opposition to the use of leftover samples for research purposes was obtained, in place of informed consent, in accordance with French regulations.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to their containing information that could compromise the privacy of research participants.