1. Introduction
Whole-genome sequencing (WGS) determines an individual’s entire DNA sequence, uncovering every genetic variant that could underlie biological changes in the organism [1]. From a statistical viewpoint, an observed phenotypic trait—the dependent variable—is influenced by the values of specific genes acting as independent variables.
WGS has become a major focus of contemporary genomic research because it pinpoints hereditary predispositions—facilitating, for instance, early cancer detection [2,3]. Sequencing a single genome is informative, yet analyzing a trio (mother, father, and child) yields deeper insights [4]. A trio dataset comprises the parental genomes and that of their offspring; the child’s DNA is essentially a linear combination of the parents’ sequences, as each chromosome pair contains one copy from each parent. These data enable checks for Mendelian-inheritance consistency and highlight anomalous allele transmissions [5].
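The Mendelian check described above reduces to a simple rule at each genomic site: the child’s genotype must be formable by taking one allele from each parent. A minimal, illustrative sketch (not the pipeline’s actual code; genotype strings follow standard VCF notation):

```python
from itertools import product

def mendelian_consistent(child: str, mother: str, father: str) -> bool:
    """Return True if the child's genotype can be formed by taking one
    allele from the mother and one from the father.
    Genotypes are strings such as "0/1" (unphased) or "0|1" (phased)."""
    c = sorted(child.replace("|", "/").split("/"))
    m = mother.replace("|", "/").split("/")
    f = father.replace("|", "/").split("/")
    # Try every combination of one maternal and one paternal allele.
    return any(sorted([a, b]) == c for a, b in product(m, f))
```

For example, a 0/1 child of a 1/1 mother and 0/0 father is consistent, whereas a 1/1 child of the same parents is an anomaly (a de novo mutation or a pipeline error).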
Errors, however, can arise at several stages. Two steps are pivotal for downstream analyses: alignment, which locates short reads (typically 50–300 bp) within the ~3 Gbp reference genome, and variant calling, which catalogs every deviation from that reference. Alignment can be confounded by sequence homology, which can misplace reads and produce false-positive or false-negative calls [6].
A further complication is the scarcity and limited diversity of benchmark datasets; current references do not span the full breadth of human genetic variation [7]. Therefore, robust, user-friendly methods are needed to measure error rates in non-benchmark datasets, allowing researchers to compare sequencing tools and pipeline versions objectively—and thereby minimize spurious variant calls while maximizing true discoveries.
Comparative studies of whole-genome sequencing (WGS) pipelines are critical for establishing robust genomic analysis standards, as the choice of bioinformatic tools—from read alignment and variant calling to quality control and annotation—can profoundly impact the accuracy, reproducibility, and clinical interpretation of results. These investigations systematically evaluate the performance of different algorithmic combinations and parameters, benchmarking them against known reference materials and gold-standard datasets to identify optimal strategies for variant discovery and to quantify the often-substantial technical biases introduced by the analysis process itself.
1.1. Comparative Analysis of Whole-Genome Sequencing Pipelines Using Benchmark Data
Recent breakthroughs in sequencing now uncover increasingly subtle relationships between genomic variation and disease phenotypes. In response, the community continuously introduces new variant-calling algorithms and full bioinformatics workflows, making rigorous, metric-rich benchmarking indispensable.
A prominent evaluation [8] compared three pipelines—GATK, DRAGEN [5], and DeepVariant—against the Genome in a Bottle (GIAB) reference panels. GIAB supplies the standards, methods, and datasets that underpin clinical-grade whole-genome sequencing and spur next-generation platform development.
For this benchmark, variants were classified as true positives (TP), false positives (FP), or false negatives (FN) under two stringency levels.
DRAGEN and DeepVariant delivered the best accuracy for single-nucleotide polymorphisms (SNPs) and small insertions/deletions (indels) with nearly identical F1-scores. Given DRAGEN’s markedly higher throughput, it is attractive for population-scale WGS, whereas a hybrid DRAGEN + DeepVariant scheme balances speed with maximal precision.
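The TP/FP/FN classification maps directly onto the metrics used in such comparisons. A small illustration (the counts are made up, purely for demonstration):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and their harmonic mean (the F1-score)
    from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp)   # fraction of calls that are correct
    recall = tp / (tp + fn)      # fraction of true variants recovered
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With, say, 95 true positives, 5 false positives, and 5 false negatives, precision and recall are both 0.95, and so is the F1-score.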
A separate study [9] corroborated these findings on GIAB data, underscoring the need to choose alignment and variant-calling tools that fit the specific analytical task.
1.2. Comparative Analyses Based on Trio Data
Conventional accuracy assessments depend on high-confidence benchmark variant sets. An alternative strategy exploits Mendelian-inheritance constraint within family trios.
In investigation [
10] outputs from two pipelines were tabulated in a contingency matrix that tallied genomic loci—fixed chromosomal positions—where both pipelines called the genotype correctly, both erred and only one produced an incorrect call. This Mendelian-driven framework yields a powerful benchmark when population-wide truth sets are scarce or absent.
In this framework the matrix entries take the form N_p(i, j, k) = Σ_v [g1(v) = j]·[g2(v) = k], where the sum runs over the variants v of individual p whose true genotype is i, and [x = y] equals 1 if x = y and 0 otherwise—i.e., N_p(i, j, k) counts how many variants of individual p with the true genotype i are called genotype j by pipeline 1 and k by pipeline 2.
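The counting rule above can be expressed in a few lines. This is an illustrative sketch, assuming the true genotypes and the two pipelines’ calls have already been collected into parallel lists:

```python
from collections import Counter

def contingency_counts(truth, calls1, calls2):
    """Tally N[(i, j, k)]: the number of variants whose true genotype is i,
    which pipeline 1 calls j and pipeline 2 calls k.
    Inputs are parallel lists over the same variant loci."""
    n = Counter()
    for i, j, k in zip(truth, calls1, calls2):
        # Incrementing the (i, j, k) cell is exactly the Iverson-bracket
        # product [g1(v) = j]·[g2(v) = k] summed over variants v.
        n[(i, j, k)] += 1
    return n
```

Cells with j = k = i correspond to loci both pipelines call correctly; off-diagonal cells separate the loci where only one pipeline, or both, erred.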
The generative framework explicitly models family relationships (parents plus child), treating the child’s genome as a linear blend of the parental chromosomes. It was validated on real-world trio datasets as follows. Two GATK callers—HaplotypeCaller and UnifiedGenotyper—were executed on the same samples. The resulting contingency table allowed the model to compute precision (TP/(TP + FP)) and recall (TP/(TP + FN)) for each caller. These estimates were then benchmarked against ground-truth genotypes by directly comparing the child’s calls with high-confidence reference sets.
The experiments showed that the framework consistently detects differences in precision and recall between the two callers, providing a foundation for new variant-calling tools aimed at under-represented populations that lack benchmark references.
1.3. Literature Survey
The Global Alliance for Genomics and Health [11] assessed standardized approaches for measuring the accuracy of sequencing-based variant calls and stressed the importance of a unified benchmarking framework. Current methods often disagree because they differ in how variants are represented, which metrics they report, and how they treat challenging genomic regions [12].
One major obstacle is that the same genetic change can appear in multiple formats within VCF files—especially complex insertions and deletions—so specialized normalization is essential. The authors therefore recommend tools such as vcfeval [13] and hap.py, which compare calls not just by exact allele matches but also by local positional and genotype concordance.
Accuracy is usually gauged by sensitivity (true-variant detection rate) and precision (rate of avoiding false positives), yet these metrics shift markedly with variant class (SNV, indel, complex) and genomic context (repeats, high-GC regions). Benchmarking typically relies on Genome in a Bottle (GIAB) reference sets, which provide high-confidence calls for well-studied genomes—but even GIAB omits many difficult regions, inflating apparent accuracy. For example, DeepVariant [14] and GATK [15] show 99.7% concordance for SNVs inside high-confidence regions but only 76.5% outside them.
The study also highlights the effect of PCR amplification [16]: workflows that include PCR tend to call insertions and deletions less accurately than PCR-free protocols. Large indels remain poorly evaluated despite their substantial contribution to genomic diversity.
Finally, the authors propose a standardized benchmarking roadmap that includes adopting common metrics, stratifying results by variant type and genomic context, and employing a web-based tool that automatically generates performance reports, making pipeline comparison and optimization far more straightforward.
2. Materials and Methods
This section describes the methodological framework adopted in our study. It begins with a description of the architecture of the whole-genome sequencing pipeline and the computational tools used for data processing. Next, the workflow language and modular design that enable integration of the proposed method into the existing pipeline are detailed. Finally, the procedures used for data preparation, software implementation, and validation experiments on real datasets are presented.
2.1. Architecture of the Whole-Genome Sequencing Pipeline
Figure 1 outlines the workflow of the nf-core/sarek whole-genome sequencing pipeline. The core stages—essential for any bioinformatics workflow and most pertinent to the method developed here—begin with read alignment (mapping).
At the read-alignment stage every short read contained in a FASTQ [17] file—a text format that stores nucleotide sequences together with their quality scores—is mapped to its precise position in the reference genome (Figure 2). The output of this step is a set of BAM (Binary Alignment/Map) files, which hold the reference-aligned reads.
Subsequently, during variant calling the pipeline sequentially examines all reads covering each genomic locus and decides whether a deviation from the reference exists. This process yields a comprehensive list of variants recorded in VCF (Variant Call Format) files.
VCF files encode every key detail: the chromosome identifier, precise genomic coordinate and the nucleotide change observed relative to the reference sequence. They also store the parental-origin alleles for each variant—information later used to assess Mendelian consistency.
Because pipeline artifacts or flaws in the raw data can generate spurious calls, VCFs may contain both genuine and erroneous variants. To flag and quantify these errors we introduce an auxiliary stage, collect_coverage_stats, which executes after the main pipeline steps and parses their output files.
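To make the VCF fields concrete, here is a minimal parser for one single-sample VCF data line, assuming a FORMAT column that includes GT (an illustrative sketch; production code should use a dedicated library such as pysam):

```python
def parse_vcf_line(line: str):
    """Extract chromosome, position, ref/alt alleles, and the genotype
    from a single-sample VCF data line (tab-separated)."""
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]            # chromosome identifier, e.g. "chr1"
    pos = int(fields[1])         # 1-based genomic coordinate
    ref, alt = fields[3], fields[4]
    fmt_keys = fields[8].split(":")       # FORMAT column, e.g. "GT:DP"
    sample_vals = fields[9].split(":")    # per-sample values
    genotype = sample_vals[fmt_keys.index("GT")]
    return chrom, pos, ref, alt, genotype
```

Applied to a line such as `chr1 12345 . A G 50 PASS . GT:DP 0/1:30`, it yields the chromosome, coordinate, nucleotide change, and the 0/1 genotype later used for the Mendelian checks.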
2.2. The Nextflow Workflow Language
Nextflow [18], the workflow management system driving nf-core/sarek, is implemented in Groovy [19] and Java [20] and has become a staple in bioinformatics. A Nextflow script revolves around channels, which pass data between components, and processes, which represent discrete pipeline tasks such as read alignment.
Each process is wrapped in a process block that defines its inputs, the script or shell commands to execute, and its outputs. As the workflow runs, these inputs and outputs stream from one process to the next, enabling intermediate artifacts—like BAM files produced during alignment—to be captured and reused downstream. The forthcoming module relies on this data-flow model to collect every file it needs automatically.
2.3. Overall Design of the Module
Figure 3 presents the workflow of the forthcoming module for computing sequencing-result metrics from trio data.
Bedtools [21] is a software suite for processing genomic data: it analyses, compares, and manipulates files in BED, BAM, VCF, and GFF/GTF formats. Widely used in bioinformatics, Bedtools enables format conversion, interval comparison and merging, coverage calculation at specified loci, and much more.
For each BAM file produced in the earlier pipeline steps we first identify regions with non-zero coverage. Coverage depth is the number of aligned short reads observed at every genomic position (i.e., the count assigned to each nucleotide; see Figure 2). Because most disease-related mutations fall within the relatively small, protein-coding fraction of the genome, we restrict subsequent analyses to these functionally relevant regions.
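Coverage depth can be computed with a simple sweep over read start and end events. A sketch of the idea (intervals are half-open [start, end) positions on one chromosome, assumed already extracted from the BAM):

```python
from collections import defaultdict

def nonzero_coverage_regions(reads):
    """Given aligned-read intervals [start, end), return the merged
    regions where coverage depth is greater than zero."""
    events = defaultdict(int)
    for start, end in reads:
        events[start] += 1   # a read begins: depth rises by one
        events[end] -= 1     # a read ends: depth falls by one
    regions, depth, region_start = [], 0, None
    for pos in sorted(events):
        prev = depth
        depth += events[pos]
        if prev == 0 and depth > 0:
            region_start = pos                   # coverage becomes non-zero
        elif prev > 0 and depth == 0:
            regions.append((region_start, pos))  # coverage drops to zero
    return regions
```

Two overlapping reads at positions 0–10 and 5–15 plus an isolated read at 20–30 thus yield the non-zero regions (0, 15) and (20, 30).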
Next, the BAM-derived intervals are merged with bedtools merge [21] and then intersected (bedtools intersect [21]) with the regulatory regions described in study [22]—a dataset focused on blood-cell regulatory elements (illustrated in Figure 4).
After that the same tools are applied to generate the final VCF files, which will be used to assess Mendelian-inheritance consistency.
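For intuition, the interval operation that bedtools intersect performs can be sketched in pure Python (a toy equivalent for sorted, non-overlapping intervals on a single chromosome; the pipeline itself calls the real bedtools binaries):

```python
def intersect_intervals(a, b):
    """Intersect two sorted lists of non-overlapping [start, end) intervals,
    as bedtools intersect does for one chromosome."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:                # the two intervals overlap
            out.append((start, end))
        # advance whichever interval finishes first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out
```

Intersecting the covered regions (0, 15) and (20, 30) with a regulatory region (10, 25), for instance, leaves the fragments (10, 15) and (20, 25) for downstream Mendelian scoring.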
3. Results
3.1. Development of a Trio-Based WGS Consistency Scoring Program
We developed software to quantify Mendelian-inheritance consistency in trio data.
The algorithm begins by loading and processing three VCF files—one each for the father, mother and child. Each file is parsed line by line to extract the key fields: chromosome, variant position and the allelic states inherited from the parents. Depending on the genotype annotation (0/0, 0/1, 1/1), positions are stored in dedicated data structures.
0 denotes the reference allele, 1 an alternate allele.
A “0/1” (or “1/0”) entry indicates that the child received the alternate allele from only one parent, whereas “1/1” indicates inheritance from both parents; “0/0” does not appear in the VCFs because it would signify no deviation from the reference.
Genotype calls missing parental-origin information (denoted by “./.”) were excluded from the analysis.
Using this information the program identifies positions where a variant is present in only one parent (as 1/1) and tallies how often the child exhibits the expected genotypes (e.g., 0/1 when one parent is 1/1 and the other 0/0) versus anomalous patterns. De novo mutations—changes present in the child but absent in both parents—or pipeline errors manifest as unexpected genotype combinations, which are flagged for downstream interpretation. Running different pipeline versions or alternative aligners on the same input data therefore yields quantitative error estimates that can be directly compared.
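A condensed sketch of this tallying logic, assuming the per-sample genotypes have already been parsed into position-to-genotype dictionaries (names and data layout are hypothetical; the actual implementation is the C++ program described below):

```python
def tally_trio(mother, father, child):
    """Count expected vs. anomalous child genotypes at loci where one
    parent is 1/1 and the other carries no alternate allele (absence
    from a parent's VCF is treated as 0/0).
    Inputs map position -> genotype string."""
    expected, anomalous = 0, 0
    # loci present in exactly one parent's variant set
    for pos in set(mother) ^ set(father):
        parent = mother if pos in mother else father
        if parent[pos] != "1/1":
            continue                     # only homozygous-alternate loci
        child_gt = child.get(pos, "0/0")
        if child_gt in ("0/1", "1/0"):   # one alternate allele inherited
            expected += 1
        else:                            # 0/0 or 1/1: Mendelian anomaly
            anomalous += 1
    return expected, anomalous
```

Comparing these counts across pipeline runs gives exactly the per-configuration tallies reported in the Results section.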
The implementation relies heavily on C++ Standard Library containers [23].
std::map stores key–value pairs in which the key is the chromosome name (e.g., “chr1”) and the value is a set of variant positions.
std::set holds unique positions, eliminating duplicates and enabling efficient set operations such as set_intersection and set_difference.
As std::set keeps its elements sorted, these operations run in linear time. This design enables the scalable analysis of large genomic datasets.
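The linear-time merge that std::set_intersection performs over sorted containers can be illustrated in Python with sorted lists (a sketch of the algorithm only; our implementation is the C++ one described above):

```python
def sorted_intersection(a, b):
    """Intersect two sorted lists of unique positions in O(len(a) + len(b))
    time, mirroring std::set_intersection over two std::set containers."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:           # common position: keep it
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:          # advance the pointer at the smaller key
            i += 1
        else:
            j += 1
    return out
```

Because each comparison advances at least one pointer, every element is visited at most once, which is what makes whole-genome position sets tractable.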
3.2. Integrating the Module into the nf-core/sarek Pipeline
To support the new functionality, we extended the nf-core/sarek pipeline with additional processes and parameters.
The modified pipeline accepts an input CSV (Comma-Separated Values) file that may include any or all of the following columns:
a unique patient identifier;
biological sex;
sample type (normal tissue vs. tumor);
paths to the sequence-read files (paired-end FASTQs).
These modifications ensure seamless retrieval of the BAM and VCF outputs produced earlier in the workflow so that the consistency-scoring module can execute automatically at the end of each run.
The input schema (Figure 5) now allows users to specify which member of the trio each sample represents (father, mother, or child). To this end the schema_input.json file has been extended with a family member field that accepts the values father, mother, and son.
As whole-genome sequencing (WGS) analyses often involve long-running processes, it is convenient to submit several individuals—here, multiple trios—in a single run. Given the relationships among the samples (mother, father, child) and the new role indicator described above, the pipeline must also know which individuals belong to the same family. For this reason, users are now required to supply a family identifier whenever the input includes more than one sample.
Nextflow supports parameterization via configuration files or command-line options, so execution of the trio-consistency module is optional. It is toggled with the flag
--collect_coverage_stats true|false.
3.3. Testing
To evaluate the method’s practical utility, we executed the nf-core/sarek WGS pipeline with different variant callers while keeping the aligner fixed to minimap2 [24]. The dataset was provided by the Institute of Biochemistry and Genetics, a subdivision of the Ufa Federal Research Centre of the Russian Academy of Sciences (UFIC RAS).
DeepVariant [14] was used as the first variant caller. DeepVariant converts the aligned reads (BAM or CRAM) into image-like tensors, classifies each tensor with a convolutional neural network and outputs the results in standard VCF format.
The second caller tested was HaplotypeCaller [15]. HaplotypeCaller identifies genomic regions that show evidence of variation, performs local de novo assembly of candidate haplotypes, realigns reads to those haplotypes, compares them with the reference genome, and computes the likelihood of each variant.
Following debugging, the pipeline was executed on two independent trios (mother, father, son) with both algorithms, and all necessary statistics were collected.
3.4. Evaluation of Results
Running the pipeline with the newly developed module produced the following results.
First dataset (mother, father, child):
Each row in Table 1 and Table 2 reports the number of variants called in the child and the corresponding inheritance pattern (0/0, 0/1 [1/0], 1/1). Rows are grouped by the parental allele configuration—e.g., cases where the mother is 1/1 and the father carries no alternate alleles at that locus, and vice versa. Separate columns present the pipeline runs that used DeepVariant and HaplotypeCaller at the variant-calling stage.
Second dataset (mother, father, child):
Table 2. The results for the second dataset run.
| Child genotype | Mother 1/1, Father 0/0 (HaplotypeCaller) | Mother 1/1, Father 0/0 (DeepVariant) | Mother 0/0, Father 1/1 (HaplotypeCaller) | Mother 0/0, Father 1/1 (DeepVariant) |
|---|---|---|---|---|
| 0/0 | 765 | 426 | 1009 | 389 |
| 0/1 (1/0) | 19,177 | 19,074 | 18,845 | 18,961 |
| 1/1 | 640 | 266 | 640 | 300 |
To compare the two pipelines, we construct 95% confidence intervals for each metric and test whether the observed differences are statistically significant.
A 95% confidence interval (CI) defines an approximate range that is expected to contain the true value with high probability. At the 95% level, roughly 95 out of 100 intervals constructed from repeated samples would contain the true parameter.
For our proportions we use the normal-approximation interval
p ± Z·√(p(1 − p)/n),
where
Z = 1.96 is the standard-normal quantile for the 95% confidence level;
n is the total number of called variants (0/0 + 0/1 + 1/1);
p is sensitivity—the fraction of expected calls (0/1 or 1/0) among all calls.
For the first trio we obtained the following confidence intervals: HaplotypeCaller—92.45% to 92.98%; DeepVariant—96.24% to 96.63%. We can also compute a 95% confidence interval for the difference between the two proportions,
(p1 − p2) ± Z·√(p1(1 − p1)/n1 + p2(1 − p2)/n2),
which yields (−4.05%, −3.39%).
Statistical significance assesses whether an observed effect is likely due to chance. As the 95% confidence interval for the difference does not include zero, we conclude that the result is statistically significant and that using DeepVariant is more effective for the genomic regions under study.
Applying the same calculation to the second trio yields
HaplotypeCaller: 92.31–92.82%;
DeepVariant: 96.31–96.68%;
95% CI for the difference: (−4.24%, −3.62%).
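The second-trio figures can be reproduced directly from the Table 2 counts with the normal-approximation formula above (a quick numerical check, not part of the pipeline):

```python
from math import sqrt

Z = 1.96  # standard-normal quantile for a 95% confidence level

def sensitivity_ci(n00, n01, n11):
    """Return (p, lower%, upper%): the sensitivity p = n01 / n and its
    95% CI bounds in percent, where n = n00 + n01 + n11."""
    n = n00 + n01 + n11
    p = n01 / n
    half = Z * sqrt(p * (1 - p) / n)
    return p, 100 * (p - half), 100 * (p + half)

# Table 2 counts, both parental configurations combined
hc = sensitivity_ci(765 + 1009, 19177 + 18845, 640 + 640)   # HaplotypeCaller
dv = sensitivity_ci(426 + 389, 19074 + 18961, 266 + 300)    # DeepVariant
```

Running this gives roughly (92.31%, 92.82%) for HaplotypeCaller and (96.31%, 96.68%) for DeepVariant, matching the intervals reported above.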
These findings corroborate that DeepVariant outperforms HaplotypeCaller at the variant-calling stage and demonstrate the practical utility of the method we developed.
4. Discussion
The obtained results suggest that the use of trio data and the verification of consistency with Mendelian-inheritance principles provide a more objective assessment of sequencing quality compared to traditional approaches relying solely on benchmark references. Our experiments demonstrated that the trio-oriented approach enables quality evaluation of variant identification on datasets lacking benchmark annotations. A comparative analysis of DeepVariant and HaplotypeCaller showed a statistically significant advantage of the former, which is consistent with previous studies highlighting the high sensitivity and accuracy of DeepVariant in detecting single-nucleotide variants and small indels.
From the perspective of the working hypothesis, which assumed that trio analysis would reveal significant differences in the performance of different whole-genome sequencing pipeline configurations, our findings provided confirmation. The method demonstrated that error estimation in variant calling can be performed without relying on external benchmark datasets, making the approach promising for improving the quality of whole-genome sequencing pipelines tailored to specific populations in scenarios where benchmark data on genetic variants are unavailable.
The proposed approach can be applied to detect systematic errors; moreover, the methodology can be used in clinical research where verification of rare or unique variants is critical, including in regulatory regions associated with disease predisposition.
Future research directions include extending the approach to long-read sequencing data, developing statistical models for more accurate discrimination between true de novo mutations and pipeline errors, and conducting large-scale validation across a greater number of trios from diverse populations. Another promising avenue is the integration of the developed module into distributed digital ecosystems for biomedical data analysis, enabling collaborative studies under strict data security and confidentiality requirements.
5. Conclusions
In this study, we designed and evaluated a method for quantifying whole-genome sequencing (WGS) results from trio data by assessing Mendelian-inheritance consistency. The study was conducted using the infrastructure of the prototype cloud digital ecosystem of the biomedical data analysis platform of the ISP RAS. The work comprised the following steps:
A survey of existing approaches.
Implementation of the algorithm to identify regulatory genomic regions with non-zero coverage.
The development of software to compute Mendelian-consistency scores for trio data within these regions.
Integration of the method into the nf-core/sarek WGS pipeline.
Experimental validation on real trio datasets.
Comparative analysis of the experimental outcomes.
Benchmarking demonstrated that DeepVariant achieves higher accuracy in regulatory regions than HaplotypeCaller.
Author Contributions
Conceptualization, E.K.; methodology, E.G.; software, N.K.; validation, N.K.; resources, L.M.; writing—original draft preparation, L.M.; visualization, N.K.; supervision, O.S. and E.K.; project administration, E.K.; funding acquisition, E.K. All authors have read and agreed to the published version of the manuscript.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
WGS | Whole-Genome Sequencing |
VCF | Variant Call Format |
BAM | Binary Alignment/Map |
SNP | Single-Nucleotide Polymorphism |
Indel | Insertion/Deletion |
CI | Confidence Interval |
GIAB | Genome in a Bottle |
CNN | Convolutional Neural Network |
References
- Brlek, P.; Bulić, L.; Bračić, M.; Projić, P.; Škaro, V.; Shah, N.; Primorac, D. Implementing whole genome sequencing (WGS) in clinical practice: Advantages, challenges, and future perspectives. Cells 2024, 13, 504.
- Nakagawa, H.; Fujita, M. Whole genome sequencing analysis for cancer genomics and precision medicine. Cancer Sci. 2018, 109, 513–522.
- Roepman, P.; de Bruijn, E.; van Lieshout, S.; Schoenmaker, L.; Boelens, M.C.; Dubbink, H.J.; Cuppen, E. Clinical validation of whole genome sequencing for cancer diagnostics. J. Mol. Diagn. 2021, 23, 816–833.
- Lin, Y.L.; Chang, P.C.; Hsu, C.; Hung, M.Z.; Chien, Y.H.; Hwu, W.L.; Lee, N.C. Comparison of GATK and DeepVariant by trio sequencing. Sci. Rep. 2022, 12, 1809.
- Behera, S.; Catreux, S.; Rossi, M.; Truong, S.; Huang, Z.; Ruehle, M.; Sedlazeck, F.J. Comprehensive genome analysis and variant detection at scale using DRAGEN. Nat. Biotechnol. 2025, 43, 1177–1191.
- Wong, K.M.; Suchard, M.A.; Huelsenbeck, J.P. Alignment uncertainty and genomic analysis. Science 2008, 319, 473–476.
- Rakocevic, G.; Semenyuk, V.; Lee, W.P.; Spencer, J.; Browning, J.; Johnson, I.J.; Kural, D. Fast and accurate genomic analyses using genome graphs. Nat. Genet. 2019, 51, 354–362.
- Zhao, S.; Agafonov, O.; Azab, A.; Stokowy, T.; Hovig, E. Accuracy and efficiency of germline variant calling pipelines for human genome data. Sci. Rep. 2020, 10, 20222.
- Betschart, R.O.; Thiéry, A.; Aguilera-Garcia, D.; Zoche, M.; Moch, H.; Twerenbold, R.; Ziegler, A. Comparison of calling pipelines for whole genome sequencing: An empirical study demonstrating the importance of mapping and alignment. Sci. Rep. 2022, 12, 21502.
- Kómár, P.; Kural, D. Geck: Trio-based comparative benchmarking of variant calls. Bioinformatics 2018, 34, 3488–3495.
- Krusche, P.; Trigg, L.; Boutros, P.C.; Mason, C.E.; De La Vega, F.M.; Moore, B.L.; Global Alliance for Genomics and Health Benchmarking Team. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 2019, 37, 555–560.
- Wagner, J.; Olson, N.D.; Harris, L.; Khan, Z.; Farek, J.; Mahmoud, M.; Zook, J.M. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2022, 2, 100128.
- Dunn, T.; Narayanasamy, S. Vcfdist: Accurately benchmarking phased small variant calls in human genomes. Nat. Commun. 2023, 14, 8149.
- Poplin, R.; Chang, P.C.; Alexander, D.; Schwartz, S.; Colthurst, T.; Ku, A.; DePristo, M.A. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 2018, 36, 983–987.
- Lefouili, M.; Nam, K. The evaluation of Bcftools mpileup and GATK HaplotypeCaller for variant calling in non-human species. Sci. Rep. 2022, 12, 11331.
- Kramer, M.F.; Coen, D.M. Enzymatic amplification of DNA by PCR: Standard procedures and optimization. Curr. Protoc. Cell Biol. 2001, 10, A-3F.
- Cock, P.J.; Fields, C.J.; Goto, N.; Heuer, M.L.; Rice, P.M. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010, 38, 1767–1771.
- Langer, B.E.; Amaral, A.; Baudement, M.O.; Bonath, F.; Charles, M.; Chitneedi, P.K.; Notredame, C. Empowering bioinformatics communities with Nextflow and nf-core. Genome Biol. 2025, 26, 228.
- Koenig, D.; Glover, A.; King, P.; Laforge, G.; Skeet, J. Groovy in Action; Manning Publications Co.: Shelter Island, NY, USA, 2007.
- Arnold, K.; Gosling, J.; Holmes, D. The Java Programming Language; Addison Wesley Professional: Boston, MA, USA, 2005.
- Quinlan, A.R. BEDTools: The Swiss-army tool for genome feature analysis. Curr. Protoc. Bioinform. 2014, 47, 11–12.
- Meuleman, W.; Muratov, A.; Rynes, E.; Halow, J.; Lee, K.; Bates, D.; Stamatoyannopoulos, J. Index and biological spectrum of human DNase I hypersensitive sites. Nature 2020, 584, 244–251.
- Josuttis, N.M. The C++ Standard Library: A Tutorial and Reference; Addison-Wesley Professional: Boston, MA, USA, 2012.
- Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 2018, 34, 3094–3100.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).