Article

Rethinking DeepVariant: Efficient Neural Architectures for Intelligent Variant Calling

by Anastasiia Gurianova 1, Anastasiia Pestruilova 1, Aleksandra Beliaeva 1,2, Artem Kasianov 3, Liudmila Mikhailova 4,5, Egor Guguchkin 1 and Evgeny Karpulevich 6,*

1 Ivannikov Institute for System Programming of the Russian Academy of Science, 109004 Moscow, Russia
2 Center for Applied AI, Skolkovo Institute of Science and Technology, 121205 Moscow, Russia
3 BIOPOLIS Program in Genomics, Biodiversity and Land Planning, CIBIO, 4485-684 Vairão, Portugal
4 Economic Faculty, Lomonosov Moscow State University, 119991 Moscow, Russia
5 Higher School of Management, Financial University Under the Government of the Russian Federation, 125167 Moscow, Russia
6 Research Center for Trusted Artificial Intelligence, Ivannikov Institute for System Programming of the Russian Academy of Science, 109004 Moscow, Russia
* Author to whom correspondence should be addressed.
Int. J. Mol. Sci. 2026, 27(1), 513; https://doi.org/10.3390/ijms27010513
Submission received: 8 November 2025 / Revised: 7 December 2025 / Accepted: 12 December 2025 / Published: 4 January 2026

Abstract

DeepVariant has revolutionized the field of genetic variant identification by reframing variant detection as an image classification problem. However, despite its wide adoption in bioinformatics workflows, the tool continues to evolve mainly through the expansion of training datasets, while its core neural network architecture—Inception V3—has remained unchanged. In this study, we revisited the DeepVariant design and present a prototype of a modernized version that supports alternative neural network backbones. As a proof of concept, we replaced the legacy Inception V3 model with a mid-sized EfficientNet model and evaluated its performance using the benchmark dataset from the Genome in a Bottle (GIAB) project. The alternative architecture demonstrated faster convergence, a twofold reduction in the number of parameters, and improved accuracy in variant identification. On the test dataset, the updated workflow achieved consistent improvements of +0.1 percentage points in SNP F1 score, enabling the detection of up to several hundred additional true variants per genome. These results show that optimizing the neural architecture alone can enhance the accuracy, robustness, and efficiency of variant calling, thereby improving the overall quality of sequencing data analysis.

1. Introduction

Whole-genome sequencing (WGS) and whole-exome sequencing (WES) generate millions to billions of reads per sample, enabling more sensitive and cost-effective disease diagnostics than traditional screening methods [1,2,3,4,5,6]. Accurate detection of genetic variants—single-nucleotide variants (SNVs) and indels—is a critical step, known as variant calling, required to translate these reads into clinically relevant information. Early and reliable variant identification supports risk assessment and informs treatment decisions [7]. Conversely, calling errors incur immediate clinical and economic costs: false positives necessitate confirmatory tests, while false negatives can lead to diagnostic failure or inappropriate therapy [8]. Current clinical guidelines identify high-accuracy variant calling as a critical bottleneck in the genome analysis pipeline [9].
Traditional variant calling methods rely on statistical models and heuristic algorithms. In contrast, DeepVariant reframes variant calling as an image-classification problem solved by deep convolutional neural networks, a class of models widely adopted in medicine and biology over the past decade [10,11,12,13]. This makes DeepVariant one of the most widely used applications employing image-based data representations together with deep learning models for genome analysis, alongside approaches such as CGR image-encoded analysis and integrative spatial gene-expression imaging models [14]. Multiple studies confirm DeepVariant's superiority over well-established tools such as the Genome Analysis Toolkit (GATK) and Strelka in terms of accuracy and recall [15,16]. Its advantage over statistics-based approaches was demonstrated in the PrecisionFDA Truth Challenge, solidifying its status as one of the most accurate variant callers [15].
The DeepVariant pipeline comprises three stages: (i) ‘make_examples’—aligned reads around putative variants (BAM) are encoded as multi-channel pileup images; (ii) ‘call_variants’—a CNN classifies each image into one of three classes reflecting diploid genotype states: 0/0 (homozygous reference: both alleles match the reference), 0/1 (heterozygous: one reference allele and one alternate), or 1/1 (homozygous alternate: both alleles differ from the reference); and (iii) ‘postprocess_variants’—the CNN outputs are converted to standardized VCF records [17]. These stages are summarized in Figure 1.
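To make the classification step concrete, the sketch below shows how a three-class softmax output could be mapped to a genotype call with a Phred-scaled quality. This is a minimal illustration only, not DeepVariant's actual postprocessing code; call_genotype is a hypothetical helper.

```python
import numpy as np

# Illustrative sketch: map 'call_variants' class probabilities for one pileup
# image to a diploid genotype call. DeepVariant's real 'postprocess_variants'
# stage is more involved (multi-allelic handling, VCF record construction).
GENOTYPES = ["0/0", "0/1", "1/1"]  # hom-ref, het, hom-alt

def call_genotype(class_probs):
    """Return the most likely genotype and a Phred-scaled genotype quality."""
    best = int(np.argmax(class_probs))
    p_error = max(1.0 - float(class_probs[best]), 1e-10)  # P(call is wrong)
    gq = -10.0 * np.log10(p_error)                        # Phred scale
    return GENOTYPES[best], gq

probs = np.array([0.02, 0.95, 0.03])  # example softmax output
print(call_genotype(probs))           # ('0/1', ~13.01)
```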
Continuous accumulation of genomic data and the emergence of new sequencing technologies have driven significant evolution in DeepVariant since its initial release. Most improvements have focused on expanding the volume and diversity of training data and adapting to new sequencing techniques, including extension from WGS/WES to RNA-Seq [16] and support for third-generation platforms such as PacBio and Oxford Nanopore [18,19].
Modifications to the pipeline itself have been largely confined to the ‘make_examples’ stage, where the variant context was enriched by adding a haplotype channel during pileup image encoding. The concept of read haplotagging was first explored by Shafin et al., who proposed the PEPPER-Margin-DeepVariant pipeline for nanopore long reads [18], and by Patterson et al., who showed that haplotype information improves variant detection accuracy with PacBio long reads [19]. Early implementations used Margin and WhatsHap for haplotagging [20,21]. This approach was later refined by Kolesnikov et al., who introduced an approximate haplotagging method that simplifies long-read variant calling [17]. Haplotype information is now included as a default channel in the DeepVariant pipeline.
Despite these successful modifications, DeepVariant's backbone CNN—Inception V3, used in the ‘call_variants’ stage—has remained unchanged since the early releases [22]. In the intervening years, the field of computer vision has advanced substantially. Progress has included improved training of very deep networks through residual connections in ResNet [23] and more efficient feature utilization and compound scaling strategies in models like EfficientNet [24]. Beyond these advancements, computer vision models have increasingly embraced hybrid and fully transformer-based designs. Modernized CNNs such as ConvNeXt integrated architectural insights from transformers [25,26], while Vision Transformers (ViT) demonstrated that, with sufficient data, self-attention can rival or surpass convolutional architectures for image classification [27]. Subsequent refinements—such as DeiT [28], which introduced more data-efficient training strategies, and Swin Transformer [29], which added hierarchical feature maps and shifted windows—facilitated the transition of transformers into fields requiring fine-grained, spatially aware predictions.
The continued use of Inception V3 within DeepVariant is a de facto standard. While most researchers enhance the pipeline through fine-tuning or by adding new input channels, the fundamental question, ‘What if we replace the neural network backbone itself?’, has received comparatively little attention. Several community efforts have experimented with alternative architectures, but these have typically focused on domain-specific modifications, and none provides a direct, head-to-head comparison against DeepVariant's original Inception V3 model within the full training pipeline [30,31]. In this work, we extend the official DeepVariant source code to enable the seamless substitution of alternative neural network backbones, allowing systematic exploration beyond the default Inception V3 architecture. To demonstrate this capability, we selected a representative alternative model and trained it from scratch alongside the baseline model under identical conditions. Performance was evaluated using a GIAB benchmark dataset [32], enabling a direct and reproducible comparison between backbone architectures.

2. Results

2.1. Training

Baseline Inception V3 and a preselected alternative model were trained and evaluated over 10 independent cross-validation runs. The F1 score—the harmonic mean of precision and recall—was the primary evaluation metric; precision and recall were also reported as supplementary metrics. Per-class (stratified) performance analyses are provided as well.
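For reference, the F1 score used throughout is the standard harmonic mean; the snippet below recomputes it from the baseline's mean precision and recall in Table 1. The result differs slightly from the reported mean F1, which is averaged over runs rather than derived from mean precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# Baseline mean precision/recall from Table 1, in percent:
print(round(f1_score(95.63, 93.76), 2))  # 94.69 (reported mean F1: 94.80)
```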

2.1.1. Learning Dynamics

Both baseline and alternative models exhibit a rapid loss decrease during early training steps, indicating quick learning, followed by stabilization and convergence (Figure 2). EfficientNet-B3 achieves lower training loss, demonstrating superior fit to the training data. Validation loss tends to plateau more gradually, while mirroring the training loss trend. On validation, EfficientNet-B3 once again attains lower loss, suggesting better generalization to the unseen validation set with a substantial performance gap relative to the baseline model.

2.1.2. Performance Evaluation

We further analyzed the trajectories of the F1 score, precision, and recall across training epochs for the aforementioned models (Figure 3). The F1 curves demonstrate a consistent separation in favor of the alternative EfficientNet model, indicating an overall performance improvement. Given the already high precision of the baseline model, the gain achieved by the alternative model is comparatively modest. In contrast, the recall curves reveal a persistent and credible advantage—an increase of approximately one percentage point—suggesting that the observed improvement in the F1 score is primarily driven by improved sensitivity.
The alternative architecture outperforms the baseline Inception V3 across all key metrics at the final-epoch evaluation, as shown in Table 1, indicating an improved capacity to classify variants. The EfficientNet-B3 model achieves a higher overall F1 score of 95.31%, a 0.51 percentage-point improvement over the baseline's 94.80%, which is statistically significant (paired t-test p = 0.0067) with a large effect size (Cohen's d = 1.71). The alternative model achieves a precision of 95.78%, surpassing the strong baseline of 95.63% by 0.15 percentage points; however, this difference is not statistically significant (p = 0.243, Cohen's d = 0.40). Most notably, recall improves to 94.78%, a substantial 1.02 percentage-point gain over the baseline's 93.76%, with a notably large effect size (p = 0.0003, Cohen's d = 1.79). Collectively, these results provide evidence that EfficientNet-B3 is more effective at correctly identifying true variants.
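The comparison procedure can be sketched as below. The per-run F1 arrays are illustrative stand-ins, since per-run values are not published, and the paired Cohen's d shown here (mean difference over the standard deviation of differences) is one common convention that may differ from the exact formula used in the paper.

```python
import numpy as np
from scipy import stats

# Hypothetical per-run F1 scores for 10 cross-validation runs (the study
# reports only means, CIs, p-values, and effect sizes).
baseline    = np.array([94.6, 94.9, 94.7, 95.0, 94.8, 94.9, 94.7, 95.1, 94.6, 94.7])
alternative = np.array([95.2, 95.4, 95.3, 95.3, 95.2, 95.4, 95.3, 95.5, 95.2, 95.3])

t_stat, p_value = stats.ttest_rel(alternative, baseline)  # paired t-test

diff = alternative - baseline
cohens_d = diff.mean() / diff.std(ddof=1)  # paired (d_z) convention

print(f"p = {p_value:.4f}, Cohen's d = {cohens_d:.2f}")
```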

2.1.3. Performance Stratified by Genotype

We next conducted a stratified evaluation of validation metrics across three variant classes: homozygous reference (0/0), heterozygous (0/1), and homozygous alternate (1/1), as presented in Figure 4. This analysis enables the assessment of whether the models exhibit any class-specific preferences.
For F1 scores, the alternative model demonstrates consistent improvement over the baseline across all variant classes. Furthermore, the overall performance pattern reflects that of the baseline: the classification is most accurate for genotypes with two alternate alleles (homozygous alternate, 1/1), followed by those with a single alternate allele (heterozygous, 0/1), while the greatest challenge is observed in correctly identifying the homozygous reference (0/0), despite the fact that these variants were the most frequent in the training dataset.
However, examining precision and recall separately reveals further nuances. The baseline model exhibits consistently low sensitivity across all genotype classes. The alternative model follows a similar trend for the most challenging homozygous reference (0/0) class. In contrast, for the other two classes (heterozygous, 0/1, and homozygous alternate, 1/1), the pattern reverses, with recall increasing relative to precision. This highlights potential class-specific strengths and limitations of the EfficientNet architecture in genotype prediction, which remain obscured in aggregate performance metrics.

2.2. Testing

Trained models were integrated into the original DeepVariant pipeline for a final evaluation on two independent test sets from held-out samples HG003 and HG005 using hap.py (v0.3.12). Results are summarized separately for SNP and indel variants.
For SNPs, both models demonstrate very high performance, with F1 scores exceeding 99% (Table 2 and Table 3). On the previously unseen samples—belonging to a distinct ancestry with a divergent genotype class distribution—the performance improvement of the alternative model persists. Across both HG003 and HG005, the DeepVariant pipeline incorporating EfficientNet-B3 consistently improves performance. For HG003, the model yields a 0.10 percentage-point increase in F1 (p = 0.0107, Cohen's d = 1.68), accompanied by a 0.08-point gain in precision (p = 0.0471, Cohen's d = 0.94) and a 0.11-point increase in recall (p = 0.1708, Cohen's d = 0.69). Similarly, for HG005, F1 improves by 0.07 percentage points (p = 0.0075, Cohen's d = 1.75), precision by 0.08 points (p = 0.0372, Cohen's d = 1.00), and recall by 0.07 points (p = 0.2423, Cohen's d = 0.57). Together, these results demonstrate consistent gains across metrics, with statistically significant improvements in F1 and precision for both samples and modest, non-significant enhancements in recall.
For indel classification—traditionally a more challenging scenario—both models demonstrate lower performance, with F1 scores of roughly 94–97%. For HG003, EfficientNet-B3 yields a modest 0.23 percentage-point improvement in F1 (p = 0.6301, Cohen's d = 0.27). Precision shows a clearer benefit, increasing by 0.18 points (p = 0.0042, Cohen's d = 1.61), while recall improves by 0.29 points, although without statistical significance (p = 0.7455, Cohen's d = 0.18). A similar pattern is observed for HG005: F1 improves by 0.19 percentage points (p = 0.3877, Cohen's d = 0.46), precision increases by 0.23 points with strong statistical support (p = 0.0022, Cohen's d = 1.86), and recall shows a small, non-significant gain of 0.15 points (p = 0.7070, Cohen's d = 0.20).
In the context of indel detection, which is considerably less represented in the training data, the proposed model shows clear improvements in precision; however, gains in recall (and consequently, sensitivity) remain challenging. In HG003, for example, recall is nearly 10 percentage points lower than precision, underscoring the difficulty the model faces in recovering true indel variants. Although mean recall and F1 values trend upward across independent runs, the high variability observed for indels prevents drawing confident conclusions about consistent improvements in recall and downstream increases in F1 overall.

2.3. Training Efficiency and Inference Time

Training times varied across models, reflecting differences in architectural complexity and computational efficiency (Table 4). For a 10-epoch training run on chromosome 10 of the HG001 sample (approximately 350,000 variants, 30× WGS), the baseline model required, on average, 2 h and 34 min, whereas the alternative model completed training in 1 h and 59 min. This demonstrates that the alternative model trains faster, offering advantages in both accuracy and training efficiency. During inference, however, both models performed similarly: the full DeepVariant pipeline for the HG003 and HG005 samples (excluding chromosomes used in training to prevent leakage) required 2 h per cross-validation run, indicating that inference speed is largely unaffected by the model architecture.

3. Discussion

3.1. Alternative Model Performance

The alternative mid-sized EfficientNet model demonstrates substantial and consistent advantages over baseline Inception V3 across multiple evaluation dimensions.
During training, losses were markedly lower for EfficientNet-B3, indicating improved fitting to the data and better generalization. The alternative model exhibited more stable learning dynamics and reduced overfitting, reflecting architectural efficiency in extracting relevant features for variant classification. The performance gap becomes apparent early during training and remains stable after convergence, reflecting the architectural robustness and effective feature learning of EfficientNet-B3.
Throughout training, EfficientNet-B3 consistently outperforms the Inception V3 baseline across all key metrics. Curve analysis shows a clear separation in F1 curves; by the final epoch, EfficientNet-B3 achieves a mean F1 score of 95.31% (+0.51 percentage points), highlighting its enhanced classification performance. While both models exhibit high precision, EfficientNet-B3 shows improved recall of 94.78% (+1.02 percentage points), indicating that the overall performance gain is primarily driven by better sensitivity. These findings highlight EfficientNet-B3's superior ability to detect true variants.
Stratified evaluation confirms that EfficientNet-B3 consistently outperforms the Inception V3 baseline across all genotype classes. While both models follow a similar F1 score pattern—best for the homozygous alternate (1/1), then the heterozygous (0/1), and worst for the homozygous reference (0/0)—EfficientNet-B3 achieves higher scores throughout. An analysis of precision and recall reveals further differences. Inception V3 adopts a conservative prediction strategy, resulting in uniformly low sensitivity. EfficientNet-B3 partially retains this pattern for the homozygous reference (0/0) class but shifts toward higher recall for genotypes with alternate alleles. This suggests improved detection of true variants, even despite their lower prevalence in the training set.
In the final evaluation, trained models were integrated into the original DeepVariant pipeline and tested on two held-out WGS samples, HG003 and HG005, using hap.py. Across both datasets, the EfficientNet-B3 architecture consistently outperformed the baseline Inception V3 model. For SNPs, performance was uniformly high, with F1 scores exceeding 99%, and the alternative model provided statistically significant gains in both F1 and precision. Indel classification, while more challenging, showed a similar pattern. EfficientNet-B3 delivered clear and statistically supported improvements in precision for both samples, whereas gains in recall were small and variable. This reflects the difficulty the proposed model faces in enhancing indel sensitivity—particularly evident in HG003, where recall remains nearly 10 percentage points lower than precision for both models. Although average recall and F1 values trend positively across independent runs, high variability limits confidence in consistent improvements for these metrics. Overall, the results indicate that the modernized architecture provides robust and reproducible gains, especially through precision improvements, while highlighting specific areas—most notably indel recognition—where further optimization remains needed.
Notably, these performance gains are achieved alongside substantial improvements in model efficiency. EfficientNet-B3 has approximately half as many parameters as the baseline Inception V3, leading to reduced GPU usage and faster training. These efficiency advantages reflect the architectural strengths of EfficientNet, which is designed around a compound scaling strategy that balances network depth, width, and resolution in a principled manner. This allows the model to capture fine-grained patterns in pileup images more effectively, which is crucial for distinguishing subtle variant signals.
Importantly, in this study, we employed a relatively lightweight configuration; more complex architectures could potentially allow even greater gains, particularly in more challenging scenarios on larger-scale datasets. Together, these findings suggest an alternative trajectory for improving DeepVariant performance—focusing on architectural modernization rather than relying solely on scaling with ever-larger volumes of training data.
In this context, we present a case that we hope will contribute to rethinking current practices in genomic variant classification pipelines: leveraging well-designed models may offer a more sustainable and scalable path forward than continued reliance on computationally intensive legacy architectures.

3.2. Limitations

While these results are promising, several limitations must be acknowledged. The current study is restricted to a subset of Illumina whole-genome sequencing (WGS) samples (2 × 151 bp, 30× coverage), preselected as representative cases to enable fast prototyping. These findings may not fully generalize to other sequencing setups, platforms based on long-read technologies (e.g., PacBio or Oxford Nanopore), or low-coverage libraries (≤15×), which are known to exhibit distinct error profiles and could alter the relative performance of the evaluated models [33,34].

3.3. Future Directions

Future work may explore several directions. Incorporating advanced read alignment methods—already shown to enhance SNV detection accuracy—could further improve classification outcomes [35]. Additionally, experimenting with alternative pileup image encoding strategies, integrating supplementary input channels, or applying data augmentation techniques may enrich the feature space and improve model generalization. Finally, expanding the training dataset remains a logical next step, although the computational and financial cost of large-scale model training presents a significant practical challenge.

4. Materials and Methods

4.1. Datasets

DeepVariant production models were trained and evaluated using a well-defined GIAB dataset. This dataset includes high-confidence variant calls for multiple individuals and serves as a standard benchmark for variant calling. Specifically, seven samples from different individuals were used in the original DeepVariant setup. Six samples (HG001, HG002, HG004, HG005, HG006, HG007) were used for training (chromosomes 1–19 for training itself; chromosomes 21–22 for validation and evaluation of performance during training), while one sample (HG003, chromosome 20) was kept for testing [12].
In our study, we followed the same methodological principles but in a reduced configuration. Training is the most resource-demanding stage, and training full production-scale models for each architectural candidate is computationally prohibitive; therefore, we adopted a simplified GIAB setup consistent with the DeepVariant advanced training tutorial (https://github.com/google/deepvariant/blob/r1.6.1/docs/deepvariant-training-case-study.md, accessed on 10 September 2025) to rapidly prototype and evaluate the proposed models. In designing the dataset, we aimed to ensure diverse samples, as well as complete separation between the training, validation, and testing regions. We used BAM files from the original article’s supplementary materials, kindly provided by the DeepVariant authors [36], corresponding to NovaSeq 30× whole-genome sequencing, a standard and well-characterized coverage for WGS applications (sample details in Table 5). For model development, HG001 chromosome 10 and HG001 chromosome 20 were used for training and validation, respectively. Chromosomes 10 and 20 were selected because they provide representative genome regions of moderate size and variant density, making them suitable for rapid iteration during model development while still reflecting the characteristics of whole-genome variant calling tasks. For the final evaluation, we used the whole HG003 and HG005 samples, excluding the aforementioned regions to prevent data leakage.
Additionally, we present the distribution of variant classes across datasets. As shown in Table 6, in the training and validation datasets (HG001), homozygous reference variants (0/0) predominate, followed by heterozygous (0/1) and homozygous alternate (1/1). This distribution differs from that of the test set, where heterozygous variants represent the largest proportion among the three classes.

4.2. Selection of Alternative Model for Demonstration

Selection of the alternative model was guided by reference information about models from Keras Applications (https://keras.io/api/applications/, accessed on 27 November 2025). We aimed to balance accuracy and computational efficiency. Among the wide range of alternative architectures—including the previously mentioned popular choices—we focused on the EfficientNet family. EfficientNet models employ a compound-scaling strategy that jointly increases network depth, width, and input resolution in a balanced and principled manner. This design achieves substantially better accuracy-parameter trade-offs than those of earlier architectures, including Inception V3. For pileup tensors—which are smaller and structurally more constrained than natural images—such efficiency is particularly advantageous, enabling the model to capture relevant variation patterns without incurring unnecessary depth or computational cost. Additionally, variant-calling signals frequently appear as subtle local features (base-quality gradients, for example), and EfficientNets’ MBConv blocks with squeeze-and-excitation mechanisms enhance the network’s sensitivity to these fine-grained cues, even when the available global context is limited.
Within the EfficientNet family, we selected the mid-sized EfficientNet-B3 model (Figure 5). Model capacity and computational cost grow rapidly across the EfficientNet scale indices (B0–B7) due to the compound-scaling strategy, and preliminary experiments on the public ImageNet benchmark dataset indicated that larger variants (B4–B7) provide only modest accuracy improvements while incurring disproportionately higher memory usage and training time. In contrast, EfficientNet-B3 offered a favorable balance: it delivered substantial accuracy gains over Inception V3 while maintaining a model size and inference cost compatible with the practical DeepVariant training pipeline. This combination of improved accuracy and moderate resource demands made EfficientNet-B3 a compelling and practically viable candidate for integration into DeepVariant.
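The twofold parameter difference is easy to verify with tf.keras.applications; the sketch below instantiates both architectures without pretrained weights and compares their sizes, which match the counts in Table 4.

```python
import tensorflow as tf

# Instantiate both architectures (no pretrained weights) and compare sizes.
inception = tf.keras.applications.InceptionV3(weights=None)
effnet_b3 = tf.keras.applications.EfficientNetB3(weights=None)

print(f"Inception V3:    {inception.count_params() / 1e6:.1f} M parameters")  # ~23.9 M
print(f"EfficientNet-B3: {effnet_b3.count_params() / 1e6:.1f} M parameters")  # ~12.3 M
```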

4.3. Training

4.3.1. Training Pipeline Adaptation

Most changes were made in keras_modeling.py. Key modifications include the introduction of a configurable _BACKBONES dictionary within the model codebase, enabling selection among multiple neural network architectures—Inception V3 and EfficientNet B0–B7—accessed via the tf.keras.applications module, with the potential for future extension to other architectures. The get_model() function was adapted to instantiate the chosen backbone architecture, defaulting to Inception V3, and configured to accept input tensors representing pileup image data. Additionally, logging was updated throughout to improve training reproducibility. The core logic of the train.py script remained unchanged, although we integrated support for additional optimizers such as Adam and AdamW. A simplified sketch of the backbone registry is shown below.
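The sketch assumes only standard tf.keras.applications constructors; the input shape (100 × 221 × 7) is illustrative, since the actual pileup dimensions depend on the make_examples configuration, and the real implementation in the modified keras_modeling.py differs in detail.

```python
import tensorflow as tf

# Simplified sketch of the configurable backbone registry; the actual
# _BACKBONES dictionary in keras_modeling.py covers EfficientNet B0-B7.
_BACKBONES = {
    "inception_v3": tf.keras.applications.InceptionV3,
    "efficientnet_b3": tf.keras.applications.EfficientNetB3,
}

def get_model(backbone: str = "inception_v3",
              input_shape=(100, 221, 7),  # pileup H x W x channels (illustrative)
              num_classes: int = 3) -> tf.keras.Model:
    """Build a three-class genotype classifier on the selected backbone."""
    base = _BACKBONES[backbone](include_top=False, weights=None,
                                input_shape=input_shape, pooling="avg")
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
    return tf.keras.Model(base.input, outputs)

model = get_model("efficientnet_b3")
```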

4.3.2. Training Configurations

In the original pipeline, configurations—including parameters such as batch size, learning rate, optimizer, and logging frequency—are specified through the get_config() function in dv_config.py. We updated this script to allow the selection of different model architectures (e.g., EfficientNet-B0/B1/B2).
Configurations for the baseline Inception V3 model were based on the original script. In our setup, the batch size was reduced from 16,384 to 128 to match the limitations of the hardware. Additionally, as we trained models from scratch with randomly initialized weights, pre-trained model checkpoints were disabled. The baseline model was trained for 10 epochs using the RMSProp optimizer with a momentum of 0.9, an initial learning rate of 1 × 10⁻³ with exponential decay (decay rate of 0.947) every 2 epochs, a weight decay of 4 × 10⁻⁵, and dropout with a probability of 0.2 in the backbone network, following the original script. Model evaluation (validation) was performed at the end of every epoch. Model checkpoints were saved based on the best validation F1 score.
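In TensorFlow terms, this configuration corresponds roughly to the following sketch; steps_per_epoch is illustrative and depends on the dataset and the batch size of 128, and passing weight_decay to the optimizer requires Keras >= 2.11.

```python
import tensorflow as tf

# Sketch of the baseline training configuration described above.
steps_per_epoch = 2500  # illustrative; depends on dataset size / batch size 128
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=2 * steps_per_epoch,  # decay every 2 epochs
    decay_rate=0.947,
    staircase=True,
)
optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=lr_schedule,
    momentum=0.9,
    weight_decay=4e-5,  # supported in Keras >= 2.11
)
```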
To train the alternative model, we used the same settings but performed a preliminary search for the optimal learning rate and optimizer via automated hyperparameter optimization with Optuna [37]. A Tree-Structured Parzen Estimator (TPE) sampler guided the search across 20 trials, and the Successive Halving algorithm was used to terminate underperforming trials early.
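A minimal Optuna setup matching this description might look as follows; train_and_validate is a hypothetical stand-in for a single training run, and the search ranges are illustrative. For Successive Halving to prune trials, the training routine is expected to report intermediate validation F1 scores via trial.report().

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Search space: learning rate and optimizer choice (illustrative ranges).
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    opt = trial.suggest_categorical("optimizer", ["rmsprop", "adam", "adamw"])
    # Hypothetical training routine returning the final validation F1;
    # it should call trial.report(f1, step) during training for pruning.
    return train_and_validate(learning_rate=lr, optimizer=opt, trial=trial)

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(),            # TPE-guided search
    pruner=optuna.pruners.SuccessiveHalvingPruner(), # early termination
)
study.optimize(objective, n_trials=20)
```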

4.3.3. Evaluation of Performance During Training

Model performance was evaluated primarily using the F1 score, as in the original article. Additionally, precision and recall were reported. To assess stability, 10 independent runs per model were performed, and a 95% confidence interval was estimated using the standard error and the t-distribution. In addition, statistical significance between models was evaluated using p-values, and the magnitude of performance differences was quantified using Cohen's d. The results were obtained using the ISP RAS Shared Resource Center of the Ivannikov Institute for System Programming of the Russian Academy of Sciences.
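The interval construction is the standard t-based one; a minimal sketch over illustrative per-run values:

```python
import numpy as np
from scipy import stats

def mean_ci(values, confidence=0.95):
    """Mean with a t-distribution confidence interval across runs."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    sem = values.std(ddof=1) / np.sqrt(n)              # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, n - 1)  # two-sided critical value
    return values.mean(), values.mean() - t_crit * sem, values.mean() + t_crit * sem

# Illustrative recall values from 10 independent runs (hypothetical):
print(mean_ci([93.5, 93.9, 93.7, 94.0, 93.8, 93.9, 93.6, 94.1, 93.5, 93.7]))
```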

4.4. Testing

Trained models were integrated into the full DeepVariant pipeline. The resulting variant calls (VCF files) were evaluated using hap.py, a standard tool for benchmarking variant callers [38]. Performance was assessed separately for SNPs and indels across all 10 independent runs. Statistical significance between models was evaluated using p-values, and the magnitude of performance differences was quantified using Cohen’s d.

4.5. Hardware and Environment

Model training and testing were performed on an NVIDIA A100 GPU with 40 GB of memory. The software environment was built using a custom container based on the official google/deepvariant:1.6.1-gpu Docker image.

4.6. Code Availability

The source code is publicly available at https://github.com/ispras/deepvariant_alternative_models (accessed on 27 November 2025).

5. Conclusions

This study demonstrates that a mid-sized EfficientNet architecture provides a substantial and consistent improvement over the widely used Inception V3 backbone for genomic variant classification within the DeepVariant pipeline. Across the training, validation, and evaluation stages, EfficientNet-B3 exhibits more stable learning dynamics, reduced overfitting, and markedly lower losses, indicating superior feature extraction and stronger generalization. These advantages manifest early during optimization and persist through convergence.
EfficientNet-B3 consistently outperforms the baseline across all principal metrics during training, with F1 curves showing a clear and sustained separation. Stratified analyses further confirm greater robustness across genotype classes. Independent evaluation on held-out WGS samples (HG003 and HG005) reinforces these findings. For SNP classification, EfficientNet-B3 provides statistically significant improvements in both F1 and precision. Indel classification—traditionally more difficult—shows reliable and statistically supported precision improvements, although gains in recall and, consequently, F1 are highly variable and limited. These trends highlight both the strengths and current limits of the architecture, pointing to indel sensitivity as a key target for future refinement.
Beyond accuracy, EfficientNet-B3 delivers notable efficiency benefits. With roughly half the parameters of Inception V3, the model reduces GPU memory requirements and shortens training time, reflecting the advantages of compound scaling for balancing network depth, width, and resolution. These architectural efficiencies enable the extraction of fine-grained patterns in pileup images more effectively without incurring additional computational burden.
Taken together, the results illustrate that thoughtful architectural modernization can yield meaningful advances in variant classification. The success of a relatively lightweight EfficientNet-B3 configuration suggests that more advanced architectures may unlock even greater improvements.

Author Contributions

Conceptualization, A.G. and E.K.; software, A.G., A.P., and A.B.; validation, E.G.; investigation, A.B.; resources, A.P.; writing—original draft preparation, A.G. and E.K.; writing—review and editing, L.M. and A.K.; visualization, A.G.; supervision, E.G. and E.K.; project administration, E.K.; funding acquisition, E.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant provided by the Ministry of Economic Development of the Russian Federation, in accordance with the subsidy agreement (agreement identifier 000000C313925P4G0002) and the agreement with the Ivannikov Institute for System Programming of the Russian Academy of Sciences dated 20 June 2025, No. 139-15-2025-011.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available and can be accessed through the original source cited in the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNN: Convolutional Neural Network
GATK: Genome Analysis Toolkit
GIAB: Genome in a Bottle
NGS: Next-Generation Sequencing
PCR: Polymerase Chain Reaction
SNP: Single Nucleotide Polymorphism
WGS: Whole-Genome Sequencing
WES: Whole-Exome Sequencing

References

1. Imperial, R.; Nazer, M.; Ahmed, Z.; Kam, A.E.; Pluard, T.J.; Bahaj, W.; Levy, M.; Kuzel, T.M.; Hayden, D.M.; Pappas, S.G.; et al. Matched Whole-Genome Sequencing (WGS) and Whole-Exome Sequencing (WES) of Tumor Tissue with Circulating Tumor DNA (ctDNA) Analysis: Complementary Modalities in Clinical Practice. Cancers 2019, 11, 1399.
2. Tirrell, K.M.B.; O’Neill, H.C. Comparing the Diagnostic and Clinical Utility of WGS and WES with Standard Genetic Testing (SGT) in Children with Suspected Genetic Diseases: A Systematic Review and Meta-Analysis. medRxiv 2023.
3. Nielsen, R.; Paul, J.S.; Albrechtsen, A.; Song, Y.S. Genotype and SNP Calling from Next-Generation Sequencing Data. Nat. Rev. Genet. 2011, 12, 443–451.
4. Satam, H.; Joshi, K.; Mangrolia, U.; Waghoo, S.; Zaidi, G.; Rawool, S.; Thakare, R.P.; Banday, S.; Mishra, A.K.; Das, G.; et al. Next-Generation Sequencing Technology: Current Trends and Advancements. Biology 2023, 12, 997; Correction in Biology 2024, 13, 286.
5. Chin, C.-S.; Khalak, A. Human Genome Assembly in 100 Minutes. bioRxiv 2019.
6. Ren, J.; Zhang, Z.; Wu, Y.; Wang, J.; Liu, Y. A comprehensive review of deep learning-based variant calling methods. Brief. Funct. Genom. 2024, 23, 303–313.
7. Rantsiou, K.; Kathariou, S.; Winkler, A.; Skandamis, P.; Saint-Cyr, M.J.; Rouzeau-Szynalski, K.; Amézquita, A. Next Generation Microbiological Risk Assessment: Opportunities of Whole Genome Sequencing (WGS) for Foodborne Pathogen Surveillance, Source Tracking and Risk Assessment. Int. J. Food Microbiol. 2018, 287, 3–9.
8. Bennett, E.P.; Petersen, B.L.; Johansen, I.E.; Niu, Y.; Yang, Z.; Chamberlain, C.A.; Met, Ö.; Wandall, H.H.; Frödin, M. indel detection, the ‘Achilles heel’ of precise genome editing: A survey of methods for accurate profiling of gene editing induced indels. Nucleic Acids Res. 2020, 48, 11958–11981.
9. Muzzey, D.; Evans, E.A.; Lieber, C. Understanding the Basics of NGS: From Mechanism to Variant Calling. Curr. Genet. Med. Rep. 2015, 3, 158–165.
10. Ibragimov, A.; Senotrusova, S.; Markova, K.; Karpulevich, E.; Ivanov, A.; Tyshchuk, E.; Grebenkina, P.; Stepanova, O.; Sirotskaya, A.; Kovaleva, A.; et al. Deep Semantic Segmentation of Angiogenesis Images. Int. J. Mol. Sci. 2023, 24, 1102.
11. Ushakov, E.; Naumov, A.; Fomberg, V.; Vishnyakova, P.; Asaturova, A.; Badlaeva, A.; Tregubova, A.; Karpulevich, E.; Sukhikh, G.; Fatkhudinov, T. EndoNet: A Model for the Automatic Calculation of H-Score on Histological Slides. Informatics 2023, 10, 90.
12. Poplin, R.; Chang, P.-C.; Alexander, D.; Schwartz, S.; Colthurst, T.; Ku, A.; Newburger, D.; Dijamco, J.; Nguyen, N.; Afshar, P.T.; et al. A Universal SNP and Small-indel Variant Caller Using Deep Neural Networks. Nat. Biotechnol. 2018, 36, 983–987.
13. Ananev, V.V.; Skorik, S.N.; Shaklein, V.V.; Avetisyan, A.A.; Teregulov, Y.E.; Turdakov, D.Y.; Gliner, V.; Schuster, A.; Karpulevich, E.A. Assessment of the impact of non-architectural changes in the predictive model on the quality of ECG classification. In Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS), Moscow, Russia, 2–3 December 2021; Volume 33, pp. 87–98. (In Russian)
14. Zhang, Y.Z.; Imoto, S. Genome analysis through image processing with deep learning models. J. Hum. Genet. 2024, 69, 519–525.
15. Carroll, A. Improving the Accuracy of Genomic Analysis with DeepVariant 1.0. 18 September 2020. Available online: https://research.google/blog/improving-the-accuracy-of-genomic-analysis-with-deepvariant-10/ (accessed on 27 November 2025).
16. Cook, D.E.; Venkat, A.; Yelizarov, D.; Pouliot, Y.; Chang, P.-C.; Carroll, A.; De La Vega, F.M. A deep-learning-based RNA-seq germline variant caller. Bioinform. Adv. 2023, 3, vbad062.
17. Kolesnikov, A.; Cook, D.; Nattestad, M.; Brambrink, L.; McNulty, B.; Gorzynski, J.; Goenka, S.; Ashley, E.A.; Jain, M.; Miga, K.H.; et al. Local Read Haplotagging Enables Accurate Long-Read Small Variant Calling. Nat. Commun. 2024, 15, 5907.
18. Shafin, K.; Pesout, T.; Chang, P.-C.; Nattestad, M.; Kolesnikov, A.; Goel, S.; Baid, G.; Kolmogorov, M.; Eizenga, J.M.; Miga, K.H.; et al. Haplotype-Aware Variant Calling with PEPPER-Margin-DeepVariant Enables High Accuracy in Nanopore Long-Reads. Nat. Methods 2021, 18, 1322–1332.
19. Patterson, M.; Marschall, T.; Pisanti, N.; van Iersel, L.; Stougie, L.; Klau, G.W.; Schönhuth, A. WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads. J. Comput. Biol. 2015, 22, 498–509.
20. UCSC-Nanopore-Cgl/Margin. Available online: https://github.com/UCSC-nanopore-cgl/margin (accessed on 27 November 2025).
21. Martin, M.; Ebert, P.; Marschall, T. Read-Based Phasing and Analysis of Phased Variants with WhatsHap. In Haplotyping: Methods and Protocols; Peters, B.A., Drmanac, R., Eds.; Springer: New York, NY, USA, 2023; pp. 127–138. ISBN 978-1-0716-2819-5.
22. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. arXiv 2015, arXiv:1512.00567.
23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
24. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. 2020. Available online: https://api.semanticscholar.org/CorpusID:167217261 (accessed on 27 November 2025).
25. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. arXiv 2022, arXiv:2201.03545.
26. Todi, A.; Narula, N.; Sharma, M.; Gupta, U. ConvNext: A Contemporary Architecture for Convolutional Neural Networks for Image Classification. In Proceedings of the 2023 3rd International Conference on Innovative Sustainable Computational Technologies (CISCT), Dehradun, India, 8–9 September 2023; pp. 1–6.
27. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
28. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training Data-Efficient Image Transformers & Distillation through Attention. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021.
29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002.
30. Hasan Sifat, M.S.; Rahat Hossain, K.M. DeepIndel: A ResNet-Based Method for Accurate Insertion and Deletion Detection from Long-Read Sequencing. Int. J. Adv. Comput. Sci. Appl. 2025, 16, 509.
31. Yang, H.; Kao, W.; Li, J.; Liu, C.; Bai, J.; Wu, C.; Geng, F. ResNet Combined with Attention Mechanism for Genomic Deletion Variant Prediction. Autom. Control Comput. Sci. 2024, 58, 252–264.
32. Zook, J.M.; Chapman, B.; Wang, J.; Mittelman, D.; Hofmann, O.; Hide, W.; Salit, M. Integrating Human Sequence Data Sets Provides a Resource of Benchmark SNP and indel Genotype Calls. Nat. Biotechnol. 2014, 32, 246–251.
33. Wang, Y.; Zhao, Y.; Bollas, A.; Wang, Y.; Au, K.F. Nanopore Sequencing Technology, Bioinformatics and Applications. Nat. Biotechnol. 2021, 39, 1348–1365.
34. Rhoads, A.; Au, K.F. PacBio Sequencing and Its Applications. Genom. Proteom. Bioinform. 2015, 13, 278–289.
35. Guguchkin, E.; Kasianov, A.; Belenikin, M.; Zobkova, G.; Kosova, E.; Makeev, V.; Karpulevich, E. Enhancing SNV Identification in Whole-Genome Sequencing Data through the Incorporation of Known Genetic Variants into the Minimap2 Index. BMC Bioinform. 2024, 25, 238; Correction in BMC Bioinform. 2024, 25, 268.
36. Baid, G.; Nattestad, M.; Kolesnikov, A.; Goel, S.; Yang, H.; Chang, P.-C.; Carroll, A. An Extensive Sequence Dataset of Gold-Standard Samples for Benchmarking and Development. bioRxiv 2020.
37. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. arXiv 2019, arXiv:1907.10902.
38. Illumina/hap.py. Available online: https://github.com/Illumina/hap.py (accessed on 27 November 2025).
Figure 1. Overview of the DeepVariant variant-calling pipeline, which consists of three main steps: ‘make_examples’, ‘call_variants’, and ‘postprocess_variants’. Our proposed modification involves replacing the baseline CNN architecture in the classification step.
Figure 2. Training (left panel) and validation (right panel) loss curves for the baseline Inception V3 model and the alternative EfficientNet-B3 model. Each line represents the mean across 10 independent runs, and the nearby shaded area indicates the confidence interval (CI).
Figure 3. Progression of F1 score (left panel), precision (middle panel), and recall (right panel) throughout the validation steps for the baseline Inception V3 and alternative EfficientNet-B3 models. Each line represents the mean across 10 independent runs, and the nearby shaded area indicates the confidence interval (CI).
Figure 4. Stratified validation performance across genotype classes. Values are reported as the mean across independent runs. Different colors represent the baseline Inception V3 and alternative EfficientNet-B3 models; hatch patterns correspond to validation metrics: F1 score, precision, and recall.
Figure 5. Comparison of baseline Inception V3 and EfficientNet models in terms of accuracy, parameter count, and GPU training time. Bars represent top-1 accuracy (%); dots connected by a line indicate GPU training time; crosses denote the number of parameters. The selected EfficientNet-B3 model offers a preferable balance of accuracy gain, parameter count, and training/inference time.
Table 1. Performance comparison of baseline Inception V3 and EfficientNet-B3 models on the validation set. Values are reported as the mean across independent runs with 95% confidence intervals (CI), paired t-test p-values, and Cohen’s d effect sizes. ↑ indicates superior results across evaluated metrics; * indicates statistically significant results (p < 0.05).

| Metric    | Baseline (Inception V3)     | Alternative (EfficientNet-B3) | p-Value  | Cohen’s d |
|-----------|-----------------------------|-------------------------------|----------|-----------|
| F1        | 94.80 (95% CI: 94.56–95.05) | ↑ 95.31 (95% CI: 95.20–95.42) | 0.0067 * | 1.88      |
| Precision | 95.63 (95% CI: 95.41–95.84) | ↑ 95.78 (95% CI: 95.69–95.88) | 0.2432   | 0.67      |
| Recall    | 93.76 (95% CI: 93.44–94.08) | ↑ 94.78 (95% CI: 94.64–94.91) | 0.0003 * | 2.96      |
Table 2. Performance comparison of baseline Inception V3 and alternative EfficientNet-B3 on independent test sample HG003. Values are reported as the mean across independent runs with 95% confidence intervals (CI), paired t-test p-values, and Cohen’s d effect sizes. ↑ indicates superior results across evaluated metrics; * indicates statistically significant results (p < 0.05).

| Metric          | Baseline (Inception V3)     | Alternative (EfficientNet-B3) | p-Value  | Cohen’s d |
|-----------------|-----------------------------|-------------------------------|----------|-----------|
| SNP F1          | 99.04 (95% CI: 99.01–99.08) | ↑ 99.14 (95% CI: 99.10–99.18) | 0.0107 * | 1.68      |
| SNP Precision   | 99.64 (95% CI: 99.57–99.71) | ↑ 99.72 (95% CI: 99.67–99.78) | 0.0471 * | 0.94      |
| SNP Recall      | 98.45 (95% CI: 98.36–98.55) | ↑ 98.56 (95% CI: 98.44–98.68) | 0.1708   | 0.69      |
| Indel F1        | 93.80 (95% CI: 93.20–94.40) | ↑ 94.03 (95% CI: 93.38–94.69) | 0.6301   | 0.27      |
| Indel Precision | 98.83 (95% CI: 98.74–98.92) | ↑ 99.01 (95% CI: 98.94–99.07) | 0.0042 * | 1.61      |
| Indel Recall    | 89.26 (95% CI: 88.20–90.32) | ↑ 89.55 (95% CI: 88.36–90.73) | 0.7455   | 0.18      |
Table 3. Performance comparison of baseline Inception V3 and alternative EfficientNet-B3 on independent test sample HG005. Values are reported as the mean across independent runs with 95% confidence intervals (CI), paired t-test p-values, and Cohen’s d effect sizes. ↑ indicates superior results across evaluated metrics; * indicates statistically significant results (p < 0.05).

| Metric          | Baseline (Inception V3)     | Alternative (EfficientNet-B3) | p-Value  | Cohen’s d |
|-----------------|-----------------------------|-------------------------------|----------|-----------|
| SNP F1          | 99.11 (95% CI: 99.07–99.14) | ↑ 99.18 (95% CI: 99.15–99.22) | 0.0075 * | 1.75      |
| SNP Precision   | 99.64 (95% CI: 99.57–99.71) | ↑ 99.72 (95% CI: 99.67–99.78) | 0.0372 * | 1.00      |
| SNP Recall      | 98.58 (95% CI: 98.50–98.66) | ↑ 98.65 (95% CI: 98.55–98.75) | 0.2423   | 0.57      |
| Indel F1        | 97.10 (95% CI: 96.82–97.38) | ↑ 97.29 (95% CI: 96.99–97.59) | 0.3877   | 0.46      |
| Indel Precision | 99.01 (95% CI: 98.90–99.12) | ↑ 99.24 (95% CI: 99.17–99.30) | 0.0022 * | 1.86      |
| Indel Recall    | 95.27 (95% CI: 94.78–95.76) | ↑ 95.42 (95% CI: 94.85–95.98) | 0.7070   | 0.20      |
Table 4. Performance comparison of baseline Inception V3 and alternative EfficientNet-B3 during training. Values are reported as ImageNet top-1 accuracy (%), number of parameters, and training time.

| Metric        | Baseline (Inception V3) | Alternative (EfficientNet-B3) |
|---------------|-------------------------|-------------------------------|
| Acc@1         | 77.9                    | 81.6                          |
| Params        | 23.9 M                  | 12.3 M                        |
| Training time | 2 h 34 min × 10 CV runs | 1 h 59 min × 10 CV runs       |
Table 5. Characteristics of GIAB samples used in this study.

| Sample | Gender | Ancestry                                   | BAM File                                    |
|--------|--------|--------------------------------------------|---------------------------------------------|
| HG001  | Female | Utah/European Ancestry                     | HG001.novaseq.pcr-free.30x.dedup.grch38.bam |
| HG003  | Male   | Eastern European Ashkenazi Jewish Ancestry | HG003.novaseq.pcr-free.30x.dedup.grch38.bam |
| HG005  | Male   | Chinese Ancestry                           | HG005.novaseq.pcr-free.30x.dedup.grch38.bam |
Table 6. Distribution of genotype classes across datasets, samples, and chromosomes.

| Dataset    | Sample | Chromosomes | Total Variants | 0/0 (Homozygous Reference) | 0/1 (Heterozygous) | 1/1 (Homozygous Alternate) |
|------------|--------|-------------|----------------|----------------------------|--------------------|----------------------------|
| Training   | HG001  | 10          | 355,674        | 153,737                    | 123,611            | 78,326                     |
| Validation | HG001  | 20          | 156,159        | 71,858                     | 53,263             | 31,038                     |
| Test 1     | HG003  | Other       | 4,822,356      | 1,196,591                  | 2,196,336          | 1,429,429                  |
| Test 2     | HG005  | Other       | 4,473,118      | 1,008,100                  | 1,975,012          | 1,490,006                  |