Evaluation of the Available Variant Calling Tools for Oxford Nanopore Sequencing in Breast Cancer

The goal of biomarker testing, in the field of personalized medicine, is to guide treatments to achieve the best possible results for each patient. The accurate and reliable identification of everyone’s genome variants is essential for the success of clinical genomics, employing third-generation sequencing. Different variant calling techniques have been used and recommended by both Oxford Nanopore Technologies (ONT) and Nanopore communities. A thorough examination of the variant callers might give critical guidance for third-generation sequencing-based clinical genomics. In this study, two reference genome sample datasets (NA12878) and (NA24385) and the set of high-confidence variant calls provided by the Genome in a Bottle (GIAB) were used to allow the evaluation of the performance of six variant calling tools, including Human-SNP-wf, Clair3, Clair, NanoCaller, Longshot, and Medaka, as an integral step in the in-house variant detection workflow. Out of the six variant callers understudy, Clair3 and Human-SNP-wf that has Clair3 incorporated into it achieved the highest performance rates in comparison to the other variant callers. Evaluation of the results for the tool was expressed in terms of Precision, Recall, and F1-score using Hap.py tools for the comparison. In conclusion, our findings give important insights for identifying accurate variants from third-generation sequencing of personal genomes using different variant detection tools available for long-read sequencing.


Introduction
Over time, the field of genetic testing for many cancer biomarkers, such as breast cancer driver genes BRCA1 and BRCA2, improved, starting from single gene sequencing on sanger sequencing technology, followed by multigene panels, which were created as a result of developments in next-generation sequencing technology (NGS), allowing for a broader genetic assessment, a faster testing method, and better throughput, without being cost prohibitive but constrained by the generation of short reads [1,2]. MinION, the first longread Nanopore-based sequencer, was released by Oxford Nanopore Technologies (ONT), overcoming the primary limitations of short-read sequence creation [3] by introducing long-read sequencing technology that was adapted by both ONT and Pacific Biosciences (PacBio) [4]. These technologies proved that new long-read, single-molecule sequencing technologies could reliably be able to identify small variants, indel, and structural variants (SVs), with significant improvements in both sensitivity and specificity [3,5].
In human genomes, single-nucleotide polymorphisms (SNPs) and short insertions and/or deletions (indel) are two forms of genetic variants [6,7]. They contribute to genetic diversity and have the ability to affect phenotypic differences, such as human disease susceptibility. Detecting SNPs and indel is challenging in studying genomic variants and functions using new generations of high-throughput sequencing data [5]. Many different variant (SNP/indel) callers were introduced by the Nanopore community and recommended by ONTs for accurate variant detection based on data from long-read sequencing. Some variant callers implemented variant calling methods using deep learning, such as "Clair" [8], the successor of "Clairvoyant" [9]. "Longshot " [10] calls SNPs on long-read data using a Pair-Hidden Markov Model (pair-HMM) for a small local window surrounding candidate sites. Medaka [11], an SNP/indel caller based on deep learning on long-read data, was recently launched by ONTs [11]. Medaka predicts SNPs from unphased long reads before phasing them. For each set of phased reads, Medaka ends up making SNP and indel calling. Nanocaller [12] is a deep convolutional neural network that incorporates a long-range haplotype structure to improve variant detection on long-read sequencing data. "Clair3" [13] combines the greatest characteristics of two key method categories: pile-up calling, which handles most variant candidates fast, and full alignment, which tackles complicated candidates with precision and recall in an account. Accordingly, in this article, the development of a workflow for detecting disease-causing variants, starting from the sample to the variant call format (VCF) with annotated variants, was proposed where different variant calling tools were tested on reference genome samples to evaluate the output of each tool against "Truth" set of variants. The proposed pipeline for targeted sequencing of the data generated from long-read sequencing technology, where the two genes BRCA1 and BRCA2, which are recurrently mutated in breast cancer, were analyzed as an example of this workflow and an examination of its performance was described for future testing and implementation.

Classification of the Pathogenicity of Variants
The information deposited in the ClinVar database and the recommendations of the American College of Medical Genetics and Genomics (ACMG) were used to classify the detected mutations [21,22]. The results of the BRCA1/2 gene variant detection were classified as wild type (no harmful variants), variant of unknown significance (VUS), pathogenic variants (PV), and likely pathogenic variants (LPV); not all the benign variants were reported [23].

Validation Data Set
To ensure the pipeline's usefulness and readiness, two long-read datasets based on publicly accessible human reference samples HG001 (NA12878) (https://www.ncbi.nlm. nih.gov/popset/?term=NA12878 (accessed on 8 August 2022)) and HG002 (NA24385) (https://www.ncbi.nlm.nih.gov/genome/?term=NA24385 (accessed on 8 August 2022) were provided by the ONT-open-data registry that is provided to support: (1) exploration of the properties of Nanopore sequence data; (2) performance evaluation and replication; (3) tool and method development. These are two of the most used reference samples. The Fastq files provided along with the bam files for this sample were used as input to test the validity of different tools' output (https://registry.opendata.aws/ont-open-data/ (accessed on 8 August 2022)) using the benchmarking tool (https://github.com/Illumina/hap.py (accessed on 8 August 2022)) [24].

Data Analysis Workflow Outcome
Data analysis workflow for the HG001 and HG002 reference genomes started with the read sequence aligner Minimap2, which aligns DNA sequences against the GRCh38 human reference genome with a SAM file as an output. Samtools "View" was used to convert the SAM file to a BAM file, followed by Samtools "Sort" and "Index" to generate a sorted and indexed BAM file ready for variant calling. As a part of the workflow pipeline, a step of PCR duplicate removal from the aligned reads of the two reference samples was included, to avoid overestimation of the coverage and overestimated variants resulting from PCR duplication with Samtools "rmdup". The mean coverage was calculated by bedtools "coverage". For the sample HG001, the mean coverage for the reads before PCRduplicate removal was 32.62 X and 36.89 X for BRCA1 and BRCA2, respectively, while after removal of PCR duplicates, the mean coverage for BRCA1 and BRCA2 was found to be the same. The mean coverage for HG002, before the PCR-duplicate removal, was 53.85 X for BRCA1 and 70.06 X for BRCA2. After removing the duplicates, the mean coverage of BRCA1 and BRCA2 was found to be the same, which suggested that the published reference samples previously underwent the step of PCR-duplicate removal or it was sequenced as a whole-genome sequencing sample, which is more logical ( Table 2).

Primary Filtering Outcomes
The BAM files were ready for the next step, which was the variant calling step. Six tools were used to call variants in the BRCA genes in HG001 and HG002; some of these tools were recommended by ONT and some by the ONT community for variant calling, such as Medaka, Clair, Nanocaller, Longshot, Clair3, and wf-human-snp workflow, which is the workflow provided by ONT employing Clair3 with pre-adjusted parameters for accurate variant calling. All of the generated output VCFs were filtered, including the variants with "PASS" and QUAL > 20 as a threshold for the comparison of the output of the tools. Long-read sequencing data aligned to a reference genome are taken as an input along with a BED file designed to target the two genes' coordination, which restricts the variants called in the target location into different variant callers, which output a VCF file with predicted SNPs and indel. The output after the primary filtering for the three samples is described in Tables 3 and 4.

Comparison of the Variant Caller's Performance
For a comparison of the variant caller's performance, the traditional binary classification performance assessment paradigm of simply determining true and false "positives" and "negatives" lends itself well to evaluating the performance of variant callers [6]. By comparing the results to the truth sets for the NA24385 sample or NA12878 sample using the Hap.py tool that enumerates the variants between a "truth" VCF file containing the truth set of variants and a "query" VCF file, which contains the set of output variants of the variant caller along with a BED file that restricts the comparison to variants in the specified target location to determine the reliability of the variant calling conducted. The hap.py tool outputs a summary with the true positive "TP", false positive "FP", false negative "FN", Precision, Recall, or sensitivity, and finally, F1-score, which is an indication and a representation of both precision and recall. The data generated from the comparison tool "Happy" were summarized to include important metrics, such as Recall, Precision, F1-score, and the time taken by the tool to call the variants in both genes (Tables 5 and 6). With respect to the time taken for the tools to perform the variant calling on only the coordination of BRCA1 and BRCA2 genes for each sample, Nanocaller proved to be faster in this aspect where the time taken for Nanocaller was the lowest and Clair was proved to take the longest time in two samples HG001 and HG002 (Tables 5 and 6).

Discussion
Evaluation of BRCA1/2 molecular status has become the standard of care in the treatment of individuals with breast cancer. Precision medicine has made significant progress against this type of cancer, which accounts for one-third of all new female cancers every year. Female breast cancer is the sixth biggest cause of mortality worldwide, with an estimate of 685,000 deaths in 2020 [25]. One example is the development and clinical application of PARP-inhibitor (PARPi); Poly (adenosine diphosphate-ribose) polymerase inhibitors (PARPi) are a key arrow in the oncologist's quiver among new therapeutics [26,27]. Indeed, PARPi has been found to enhance the clinical outcomes of breast cancer patients with BRCA1/2 germline or somatic mutations, which have been found to improve survival and quality of life [28][29][30][31][32]. As a result, current worldwide guidelines strongly advise BRCA1/2 testing in all patients. Rapid and dependable genetic screening for BRCA1/2 germline or somatic mutations has become critical in identifying individuals who would most likely benefit from these treatments [3,33,34].
The technology used in BRCA 1/2 gene testing held an important impact on getting the full picture of the two genes. Traditional Sanger sequencing is expensive and takes a long turn-around time (TAT). Next-generation sequencing (NGS) is a game-changing high-throughput nucleotide sequencing approach that produces rapid, cheap, and accurate genomic data. NGS developed the clinical methodology for genetic examination across various fields of medicine [34]. NGS can massively sequence millions of DNA reads, allowing for accurate characterization of the "status" of multiple genes; in this context, NGStargeted gene sequencing enables the detection of driver mutations, which are responsible for progression and relapse and might be employed as predictive or prognostic biomarkers in breast cancer [33,34]. When compared to Sanger sequencing, NGS can offer doctors comparable genetic information at a cheaper cost and shorter time to results [2,34], yet the NGS limitations are the small read size and the difficulty in analyzing large alterations as structural variants. Many studies employed the NGS as a technology in the detection of BRCA1/2 gene variants in various ethnic groups to implement the detection of gene variants using NGS in the routine line of diagnostics and may allow doctors to make more prompt and informed decisions about surgery or neo-adjuvant chemotherapy in breast cancer patients [35][36][37][38][39][40][41][42].
However, the use of NGS technologies in clinical diagnostics necessitates a large initial investment in the sequencer, which is a barrier for local research institutions in underdeveloped nations, as well as small research institutes and hospitals. MinION, the first commercially available sequencer based on Nanopore technology, might be a viable alternative [43,44]. MinION has previously been utilized effectively to identify mutations in TP53 and ABL1 genes in CLL and CML patients [45][46][47][48], respectively. Furthermore, the cheap cost, ease of use, and length of the reads make MinION a perfect instrument for targeted gene sequencing; the long read can enable researchers to detect and phase genetic variants, as well as thoroughly define new isoforms and fusion transcripts, using Nanopore technology. Nanopore technology sheds new light on health and disease, ranging from cancer to immunology and neurology [48].
In the current study, the main focus was on the data analysis of data generated using Nanopore technology, as there are many proposed tools by the Nanopore community, a hub for all the Nanopore technology users (https://community.nanoporetech.com/ (accessed on 8 August 2022)) for every step along the way in data analysis. The in-house targeted gene sequencing workflow was divided into two parts: (1) design a data analysis pipeline for SNV/INDEL/SV detection and how to validate this pipeline and (2) design an in-house primer panel for BRCA1/2 genes as a prototype for future implementation. The pipeline design started with a set of tools designed and trained on long-read data generated from the MinIon ONT sequencer; the reference samples used as the input data for validation of this workflow are the publicly published "NA12878" (HG001) reference sample [49] and "NA24385" (HG002) dataset that contain whole-genome sequencing of well-known human cell lines, sequenced using Nanopore technology [50]. Each, therefore, serves as a helpful benchmark sample. The HG002 cell line was used as a "seen" sample in the current (PrecisionFDA Truth Challenge V2) competition [51]. The method of validating the performance of workflows and especially the variant callers is called "Benchmarking", where a reference sample is used either as DNA to be sequenced and undergo the workflow or using the data for this reference sample from the public repository in-silico for a data analysis step, a method that was recommended by Global Alliance for Genomics and Health (GA4GH) [52].
The pipeline went as follows: (1) mapping for the reads stored in the fastq file that outputs the reads into a SAM file format using "Minimap2" mapper for long reads against reference sequences based on the GRCh38/UCSC hg38 public human genome build, (2) sorting and indexing using Samtools as a versatile tool as it was heavily used in many pipelines proposed by other studies, used to convert a SAM file to BAM file, sort and index the BAM output, (3) removing the PCR duplicates even though the reference data samples used to validate this workflow were whole-genome sequencing, not including a PCR step but were included in the workflow as this workflow will be used on targeted gene sequencing data, (4) calculating the mean coverage of the targeted genes using Bedtools, (5) variant calling step, which is the main event in the workflow and the focus of our study; there are many variant callers both recommended by ONT and the Nanopore community, so the output variants were filtered based on "PASS" and QUAL > 20 as a threshold for the comparison of the tools output, (6) annotating the variants using SnpEff as an annotation tool, and (7) checking the clinical significance of the annotated variants using ClinVar clinical database.
The focus of the current study was to evaluate this workflow as well as compare the performance of the commonly used software pipelines for variant calling, which is another key element in variant discovery. The comparison is based on how well the tool calls the "True" variants when compared to the benchmarking VCF file; the tools analyzed in this study are Medaka, Clair, Nanocaller, Longshot, Clair3, and ONT's wf-human-snp workflow for variant calling, which employs Clair3 with pre-adjusted parameters for the accurate calling of variants.
Recent studies attempted to enhance variant calling by using phasing information from long-read sequencing data. Longshot calls SNPs on long-read data using a pair-hidden Markov Model (pair-HMM) for a small local window surrounding candidate sites and then improves genotyping of identified SNPs using Hap-CUT2 [53] based on the most probable pair of haplotypes given the present variant genotypes, but on the other hand, is incapable of detecting indel. Medaka was provided by ONT, an SNP/indel caller that uses deep learning on long-read data. Medaka predicts SNPs from unphased long readings before using WhatsHap [54] to phase the data Medaka eventually makes SNP and indel calls for each phased read group. Clair, the successor of Clairvoyante, is a tool for detecting germline minor variants quickly and accurately using single-molecule sequencing data. Clair outperforms several competing systems for ONT data, including Clairvoyante, Longshot, and Medaka, in terms of precision, recall, and speed. As a deep learning approach, Nanocaller detects SNPs using long-range haplotype information, then phases long reads with identified SNPs and calls indels using local realignment.
Two key designs differ greatly in terms of performance and speed either employing pileup or full alignment as the input of the decision-making neural network. Clair and Nanocaller are pileup-based calling networks that aggregate read alignments into features and counts before sending them into a variant calling network. PEPPER-Margin-DeepVariant5 (PEPPER) [55] is fully alignment based. The DeepVariant variant calling network input is retained with spatial information in the full alignment method and is tens of times greater in size than the pileup method. Medaka is consensus based, using pileup input to generate a diploid consensus in the first iteration and two haploid consensuses in the second. Variants are formed by identifying and combining differences between the reference and consensus. To fill the void, Clair3 was created, which combines the best of both designs. It is as quick as pileup-based callers and performs just as well as full alignment callers. First, the pileup calling network goes through all the variant candidates that met a coverage and alternative allele frequency criterion. The high-quality pileup calls are then used to phase the alignments and generate the final output. Then, for each low-quality pileup call for full-alignment calling, the alignments phased by WhatsHap are utilized to create full-alignment input that is 23-times greater in size than the pileup input. Finally, as the final output, the full-alignment calls are combined with the high-quality pileup calls.
For performance validation of the pipeline along with the variant callers, the process started with the genome in a bottle (GIAB) reference samples HG001 and the Ashkenazi son sample HG002 ONT reads that were used as an input for mapping with Minimap2, sorting and indexing with Samtools, calculating the mean coverage within the BRCA1/2 gene bed file with coordination. for the variant calling step, the default parameters were used for all the variant callers to ensure uniformity in the output variants. The benchmarking variant VCF "Truth set" used was the GIAB v.4.2.1 for each reference genome sample to compare the output of different variant callers. The hap.py [24] tool was used for benchmarking, which is a reference implementation of the GA4GH recommendations for variant caller benchmarking with the "vcfeval" engine for comparison; it generated metrics as "False positive", "False negative", "True positive", "Precision", "Recall", and "F1 score". It was found that three metrics are the most important for variant caller performance evaluations, which are "Precision", "Recall", and, most importantly, "F1 score", which is the mean of precision and recall and is commonly used to test the performance of the callers [56][57][58].
Based on the metrics obtained in our results, it is suggested that Clair3 as a standalone or incorporated into a workflow as Human-SNP-wf by ONT, was found to be outperforming other variant callers concerning performance. The Clair3 method's efficiency is based on its ability to effectively distinguish between true and false calls during pileup calling, allowing only essential candidates to be transferred to the considerably more computationally costly full alignment calling. Following that comes Nanocaller, which performed in a better way than the rest of the variant callers, Longshot, Clair, and Medaka, respectively, agree with the findings of another study. Even though Clair is supposed to outperform Longshot, it was found to have lower F1 scores in both reference samples and that may be because Clair was outdated and was succeeded by Clair3 in May 2021 (https://github.com/HKU-BAL/Clair (accessed on 8 August 2022)). Although Medaka was, up until the release of Clair3, the recommended variant caller for SNP calling using the "medaka_variant" argument, which was formerly implemented inside the medaka package, it has been exceeded in accuracy and computing performance by alternative approaches and is, thus, deprecated and it is advised to utilize Clair3 either directly or through the Oxford Nanopore Technologies offered Nextflow implementation (Human-SNP-wf) (https://github.com/nanoporetech/medaka (accessed on 8 August 2022)) and that may explain the low performance. It was intentional not to test Nanopolish [59], which is also capable of variant calling on ONT data since it requires fast5 raw signals file as input, which are not publicly accessible for HG002, so it was excluded from the variant callers' comparison.
Targeted gene panels are one of the most frequent ways of enriching the genomic areas to be sequenced and they are widely utilized in NGS technology. Using Nanopore technology, we were able to enrich all the gene areas of interest without being limited by the read length. MinION real-time sequencing allows reads to be evaluated as they are produced, considerably speeding up analysis and allowing for the modification of experimental conditions as needed. Another benefit of MinION over second-generation sequencers is its mobility and ease of use for library preparation and sequencing, as well as its low cost. There are currently many custom/academic or commercial BRCA1/2 target panels that have been established in recent years because of investigations on the use and impact of NGS in breast/ovarian cancer [56,[60][61][62], the majority of which are based on the amplicon sequencing technique. There are currently many commercial shortread amplicon-based BRCA gene panels available that detect SNV and/or copy number variation. Nonetheless, efforts to create a complete gene panel useful for BRCA prognosis and medication impact prediction are ongoing. The design of a primer panel targeting different oncology biomarkers will be incorporated into our future plan for trial on different cancer sample types.

Conclusions
In this study, six variant calling tools, including Human-SNP-wf, Clair3, Clair, NanoCaller, Longshot, and Medaka, were evaluated regarding their performance and accuracy in the detection of genetic variants. The tested genetic variants were single-nucleotide polymorphisms (SNPs) and short insertions and/or deletions (indel) of BRCA1 and BRCA2 genes, where two reference genome sample datasets (NA12878) and (NA24385) were used. The set of high-confidence variant calls provided by Genome in a Bottle (GIAB) was used to allow for the evaluation of the performance of six variant calling tools. The obtained results provide important insights for identifying accurate variants from third-generation sequencing of personal genomes using different variant detection tools available for long-read sequencing. The evaluation of the results was expressed in terms of Precision, Recall, and F1-score using Hap.py tools for the comparison. Both Clair3 and Human-SNP-wf tools accomplished the highest performance rates and should be implemented for evaluating the prognosis of breast cancer in humans.

Data Availability Statement:
In publicly accessible repositories, the Nanopore sequencing data have been deposited. The data can be found here: https://www.ncbi.nlm.nih.gov/sra/PRJNA865100 (accessed on 8 August 2022).

Conflicts of Interest:
The authors declare no conflict of interest as well as the funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.