Article

Multi-Level Parallel CPU Execution Method for Accelerated Portion-Based Variant Call Format Data Processing

1 Artificial Intelligence Department, Lviv Polytechnic National University, 12 S. Bandery St., 79013 Lviv, Ukraine
2 Department of Automated Control Systems, Lviv Polytechnic National University, 79013 Lviv, Ukraine
* Author to whom correspondence should be addressed.
Computation 2026, 14(2), 48; https://doi.org/10.3390/computation14020048
Submission received: 12 January 2026 / Revised: 4 February 2026 / Accepted: 6 February 2026 / Published: 8 February 2026
(This article belongs to the Section Computational Engineering)

Abstract

This paper proposes and experimentally evaluates a multi-level CPU-oriented execution method for high-throughput portion-based processing of file-backed Variant Call Format (VCF) data and automated mutation classification. The approach is based on a formally defined local processing scheme and integrates three coordinated levels of parallelism: block-based partitioning of file-backed VCF portions read sequentially into localized fragments with data-level parallel processing; task-level decomposition of feature construction into independent transformations; and execution-level specialization via JIT compilation of numerical kernels. To prevent performance degradation caused by nested parallelism, a resource-control mechanism is introduced as an execution rule that bounds effective parallelism and mitigates oversubscription, improving throughput stability on a single multi-core CPU node. Experiments on a public chromosome-17 VCF dataset for BRCA1-region pathogenicity classification demonstrate that the proposed multi-level local CPU execution (parsing/filtering, feature construction, and JIT-specialized numeric kernels) reduces runtime from 291.25 s (sequential) to 73.82 s, yielding a 3.95× speedup. When combined with resource-coordinated parallel model training, the end-to-end runtime further decreases to 51.18 s, corresponding to a 5.69× speedup, while preserving classification quality (accuracy 0.8483, precision 0.8758, recall 0.8261, F1 0.8502). A stage-wise ablation analysis quantifies the contribution of each execution level and confirms consistent scaling under resource-bounded execution.

1. Introduction

The integration of medicine and digital technologies is one of the most dynamic and promising areas of modern science, with the potential to fundamentally transform healthcare systems. The analysis of genetic alterations at the DNA level enables a deeper understanding of the molecular mechanisms of diseases and facilitates the development of personalized approaches to diagnosis and treatment. With the growing volume of data on human biological and genetic characteristics, personalized medicine has gained importance, tailoring medical decisions to individual patients through comprehensive analysis of personal data [1,2,3]. However, the implementation of such approaches presents a number of challenges. One of the main risks lies in the misinterpretation of genetic tests, which may lead to unnecessary costs, ineffective treatment, or even harmful interventions. Several studies have documented that errors in the interpretation of genetic variants can result in inappropriate clinical management, including unnecessary surveillance, unwarranted preventive surgeries, or missed opportunities for intervention [4]. In this context, machine learning methods support automated interpretation of genomic variants and reduce the risk of inconsistent clinical decisions [5].
Another challenge is the integration of heterogeneous biomedical sources—genetic, epigenetic, and clinical—into unified predictive models. This requires new solutions for data processing, standardization, and integration [6,7]. At the same time, the rapid growth of sequencing data makes purely sequential processing increasingly inadequate for current-scale Variant Call Format (VCF) workloads [8].
Variant analysis workflows commonly store and exchange mutation information in the VCF, which has become a de facto standard and is supported by widely used toolchains such as VCFtools [9].
Recent studies demonstrate the effectiveness of parallel architectures for bioinformatics workloads, including GPUs and hybrid CPU–GPU systems. However, dependence on accelerators restricts deployment in many institutions. This motivates CPU-oriented methods that can increase throughput of VCF preprocessing and feature construction on widely available multicore hardware [10].
In practice, a major unresolved issue for CPU-only environments is the coordination of parallelism across processing layers. Block-level parallel parsing, task-level feature extraction, and library-level multithreading may overlap and lead to thread oversubscription, which reduces throughput and makes runtime unstable [11]. Therefore, a key requirement is a resource-coordinated execution scheme that prevents nested-parallel oversubscription and stabilizes runtime in a single-node CPU setting.
With the continuous growth of genomic data, efficient processing methods that operate on conventional CPU infrastructures are increasingly needed. Many CPU-based tools focus on specific tasks, for example, imputation, and require extensive parameter tuning, while runtime and resource costs are often underreported. In contrast, this work targets end-to-end acceleration of portion-based VCF processing for mutation pathogenicity classification in a CPU-only setting [12].
The aim of this study is to develop an efficient multi-level CPU-parallel method for BRCA1 mutation pathogenicity classification that improves end-to-end throughput while maintaining classification quality in resource-constrained environments.
To address the above aim, this paper proposes and evaluates a multi-level parallel CPU execution method for accelerated portion-based processing of VCF data in a single-node setting. The method integrates data-parallel processing of independent VCF fragments; task-parallel feature construction via independent transformations; and execution-level specialization through just-in-time compilation of numeric kernels [13]. Execution is explicitly resource-coordinated to prevent nested-parallel oversubscription and to stabilize runtime on multicore CPUs.
The main contributions are as follows:
  • We propose a single-node CPU execution method for portion-based VCF processing and mutation classification that combines block-wise VCF partitioning with multi-level parallel execution.
  • We formulate and enforce a resource-control mechanism that coordinates multi-level parallelism and mitigates oversubscription when processing-layer parallelism interacts with internally threaded learning components, thereby supporting stable runtime and scalable execution on multicore CPUs.
  • We provide a stage-wise ablation and scalability study that isolates the runtime contribution of each component, including data-level parallel parsing and filtering, task-level feature construction, JIT-specialized numeric kernels, and controlled parallel hyperparameter search and model training. We report both per-stage and cumulative effects on end-to-end runtime.
  • Using a public dataset, we demonstrate that the proposed CPU-only methodology substantially reduces runtime while preserving classification quality, making it suitable for institutions with limited access to accelerator hardware.
Overall, the proposed method coordinates multi-level CPU parallelism for VCF parsing, filtering, feature construction, and model training in a single-node setting. By bounding effective parallelism and preventing nested oversubscription, it improves throughput and stabilizes runtime while preserving classification quality. The method is validated via stage-wise ablation and scalability experiments on a public chromosome-17 dataset for BRCA1-region pathogenicity classification.

2. State of the Art

Personalized medicine increasingly relies on large-scale genomic data processing and computational decision support. As sequencing throughput grows, efficient and reproducible analysis of variant data becomes essential for clinical and research workflows [1,12]. VCF has become a widely used interchange format for variant datasets and is supported by established utilities such as VCFtools [9].
Machine-learning methods are widely used for variant interpretation, ranging from deep models to classical classifiers. While deep approaches can achieve high accuracy in specific settings, they often require substantial computational resources and optimized training environments, which are not consistently available in clinical practice [5]. This motivates CPU-oriented solutions that provide competitive predictive performance with lower hardware requirements.
The issue of data compatibility and quality remains a significant barrier to clinical implementation. In [7], the authors identified considerable heterogeneity in the monitoring of clinical data quality, which manifests in differences such as variant annotation completeness, clinical-significance labeling criteria across data sources, update frequency of reference databases, and the handling of missing INFO-field attributes. Such variability directly affects VCF processing pipelines by increasing the complexity of parsing, filtering, and feature construction stages, and by introducing inconsistencies that may propagate into downstream machine-learning models. For example, the same genetic variant may receive different clinical-significance labels (e.g., “Pathogenic” versus “Conflicting interpretations”) across different releases of ClinVar, which necessitates explicit filtering strategies and label-stability rules during VCF preprocessing. In turn, [6] proposes a method for integrating heterogeneous biomedical data for classification tasks. These directions increase the need for scalable preprocessing and feature construction, especially for downstream VCF-based machine-learning tasks.
One of the key directions is improving the performance of genetic data processing. Study [8] systematizes approaches to parallel genome sequence processing, highlighting significant potential for both CPU and GPU platforms. Meanwhile, most modern solutions are based on GPU accelerators, such as NVIDIA Parabricks, which are discussed in detail in [14]. These solutions primarily target variant calling and primary processing, whereas downstream VCF parsing, feature construction, and classification remain important CPU-bound stages. However, GPUs are not always available in medical institutions, making the development of CPU-oriented strategies highly relevant. Study [15] investigates hybrid CPU + GPU systems and demonstrates their effectiveness, while also emphasizing the challenges associated with the performance of CPU-only solutions.
Machine-learning-based variant interpretation is actively studied in oncology and clinical genomics, where it supports diagnostics and risk assessment [15,16,17,18,19]. These works motivate computationally efficient and reproducible workflows that remain deployable under limited hardware resources.
Recent findings also confirm the effectiveness of ensemble learning for genomic prediction tasks. For example, studies [18,19] report that Random Forest models can provide robust and competitive predictive performance for treatment-outcome analysis in large patient cohorts when supported by extensive preprocessing, feature construction, and controlled validation strategies. Collectively, these results underscore the growing importance of computationally efficient and reproducible machine-learning workflows in precision medicine, motivating the development of optimized genomic analysis workflows capable of delivering high accuracy and stable performance in real-world settings.
As the use of machine learning in genomics continues to expand, efficient data-processing workflows become essential to ensure timely and accurate computational analysis, particularly in environments with limited hardware resources. The volume of genomic data is primarily driven by next-generation sequencing throughput—including coverage depth, cohort size, and panel width—rather than by machine-learning workloads themselves. Machine-learning and deep-learning methods increase computational demand but do not generate additional raw data volume. Therefore, accelerators such as GPUs are highly effective for large-scale, high-throughput genomic tasks; however, many clinical and academic laboratories operate on CPU-only workstations or shared clusters where accelerators are scarce or unavailable.
Based on the conducted analysis, current approaches exhibit several key limitations. First, they often rely on GPU-oriented infrastructures, resulting in high computational requirements and limited scalability in settings where accelerator access is restricted. Second, existing solutions frequently demonstrate low adaptability to routine clinical environments, where computational workflows must be cost-efficient, reproducible, and deployable on widely available CPU-based systems. Third, there remains a need for CPU-optimized genomic workflows that coordinate multi-level parallelism to avoid nested oversubscription, while maintaining competitive accuracy and high throughput.
These gaps motivate the CPU-oriented multi-level execution method proposed in this work. Table 1 presents a comparison between existing approaches and the method developed in this study.
Overall, CPU-oriented solutions remain important for accessible and scalable personalized-medicine pipelines, especially in settings where accelerator resources are unavailable or constrained. This work contributes a resource-coordinated multi-level CPU execution strategy that increases throughput for file-backed, portion-based VCF processing and downstream mutation classification.

3. Formal Execution Model

3.1. Problem Statement and Portion-Based Processing Constraint

We consider a single shared-memory CPU node that processes VCF data for the BRCA1 region and constructs a dataset for mutation pathogenicity classification using a Random Forest model. In our implementation and experiments, the input is stored in a finite VCF file. Nevertheless, the computation is formulated in a portion-based manner by sequentially reading and processing the file in portions (chunks). This formulation enables (i) a clear definition of per-portion service time and throughput, and (ii) a capacity-style stability condition that would apply if the same portioned pipeline were driven by an online producer.
Let the file-backed VCF input be represented as a sequence of portions (1).
X = \{ X^{(k)} \}_{k \ge 1},
where $X^{(k)}$ is the $k$-th portion containing $N_k = |X^{(k)}|$ variant records. In a true online scenario, portions would arrive with an inter-arrival time $t_{in}^{(k)}$. In the present study, portions are file-backed (i.e., available by sequential reading), so $t_{in}^{(k)}$ is used as a conceptual arrival parameter rather than an experimentally enforced timing constraint.
For each portion $X^{(k)}$, let $T_{loc}^{(k)}$ denote the local end-to-end processing time on the CPU node, and let the local throughput be (2).
\lambda_{loc}^{(k)} = \frac{N_k}{T_{loc}^{(k)}}.
If an online producer were present, portion-based execution would be stable when processing does not create an unbounded backlog of portions. This capacity condition can be expressed as (3).
T_{loc}^{(k)} \le t_{in}^{(k)} \;\Longleftrightarrow\; \lambda_{loc}^{(k)} \ge \lambda_{in}^{(k)} = \frac{N_k}{t_{in}^{(k)}}.
In our experiments, we do not claim hard real-time guarantees because $t_{in}^{(k)}$ is not imposed by an external source; instead, we report $T_{loc}^{(k)}$ and $\lambda_{loc}^{(k)}$ as practical capacity indicators for portion-based processing on a single CPU node.
The objective of the local CPU execution method is to increase $\lambda_{loc}^{(k)}$ while preserving semantic equivalence to a sequential baseline, that is, producing identical features under the same deterministic preprocessing parameters and identical random seeds in the learning stage.
For mutation classification, each processed record contributes a feature vector $z_i$ and a label $y_i$. The training dataset for the learning stage is denoted as (4).
D = \{ (z_i, y_i) \}_{i=1}^{N}.
The learning stage produces a classifier h(⋅) that maps feature vectors to predicted pathogenicity labels. This section formalizes the CPU execution of preprocessing and feature construction. The learning stage is treated as an internally threaded component whose parallelism must be coordinated with preprocessing to avoid nested oversubscription on a single shared-memory node.
Backpressure handling and dynamic resource allocation are outside the scope of this study because the evaluated pipeline is executed on a file-backed input; (3) is therefore included as a capacity-style condition that would apply under online ingestion.
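As a worked illustration of (2) and (3), the following sketch computes the local throughput and checks the capacity condition for a single portion. All numbers are illustrative assumptions, not measured values from the experiments.

```python
def local_throughput(n_records, t_loc):
    """Local throughput lambda_loc^(k) = N_k / T_loc^(k) in variants/s, as in (2)."""
    return n_records / t_loc

def is_stable(n_records, t_loc, t_in):
    """Capacity condition (3): T_loc <= t_in, equivalently
    lambda_loc >= lambda_in = N_k / t_in."""
    return local_throughput(n_records, t_loc) >= n_records / t_in

# Hypothetical portion: 50,000 records served in 2.0 s, arriving every 2.5 s.
lam = local_throughput(50_000, 2.0)    # 25,000 variants/s
stable = is_stable(50_000, 2.0, 2.5)   # capacity condition holds
```

With a service time above the inter-arrival time (e.g., 3.0 s versus 2.5 s), the same check reports an unstable configuration, which is the backlog scenario the capacity condition rules out.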

3.2. Formal Model of VCF Portions and Localized Transformations

Each record $x_i^{(k)} \in X^{(k)}$, $i = 1, \ldots, N_k$, is represented as a tuple (5).
x_i^{(k)} = \left( \mathrm{CHROM}_i, \mathrm{POS}_i, \mathrm{REF}_i, \mathrm{ALT}_i, \mathrm{INFO}_i \right),
where additional VCF fields may be included explicitly or absorbed into $\mathrm{INFO}_i$.
To enable data-level parallel processing while preserving record order, each portion $X^{(k)}$ is partitioned into $B_k$ non-overlapping blocks defined by index ranges. Let $\oplus$ denote ordered concatenation, restoring the original order of records within the portion. Then, we have (6).
X^{(k)} = \bigoplus_{b=1}^{B_k} X_b^{(k)}, \qquad X_b^{(k)} = \left( x_{s(k,b)}^{(k)}, \ldots, x_{e(k,b)}^{(k)} \right),
where the index ranges $[s(k,b), e(k,b)]$ are disjoint and cover $\{1, \ldots, N_k\}$.
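The partitioning in (6) can be sketched as follows. Contiguous, near-equal block sizes are an illustrative choice; any disjoint covering of the index range satisfies the definition.

```python
def partition_blocks(n_records, n_blocks):
    """Disjoint, covering index ranges [s(k,b), e(k,b)] over 1..n_records,
    as in (6). Ranges are contiguous, so ordered concatenation of the
    per-block results restores the original record order."""
    base, rem = divmod(n_records, n_blocks)
    ranges, start = [], 1
    for b in range(n_blocks):
        size = base + (1 if b < rem else 0)
        if size == 0:
            continue  # more blocks requested than records: emit fewer ranges
        ranges.append((start, start + size - 1))
        start += size
    return ranges

blocks = partition_blocks(10, 3)  # [(1, 4), (5, 7), (8, 10)]
```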
Definition 1 (localized block). 
A block $X_b^{(k)}$ is called localized if all operations applied to its elements require no access to records outside $X_b^{(k)}$.
Definition 2 (local operation). 
An operation (or transformation) is called local if, when applied to a record $x_i^{(k)}$ or a localized block $X_b^{(k)}$, it accesses only the fields of that record or records within the same block and uses only local parameters; it does not require access to records from other blocks and does not modify shared mutable state.
We adopt the following assumptions for local transformations:
  • (A1) each transformation is deterministic for fixed parameters and does not mutate shared state;
  • (A2) the output for a record depends only on that record and local parameters;
  • (A3) aggregation across blocks is performed after block completion and does not alter transformation values.
For each record, local processing is defined as a compositional scheme (7).
L(x_i^{(k)}) = L^{(3)} \circ L^{(2)} \circ L^{(1)} (x_i^{(k)}),
where $L^{(1)}$ performs parsing and filtering of VCF fields and emits an intermediate representation; $L^{(2)}$ constructs the feature vector by independent transformations; $L^{(3)}$ applies numeric kernels that are specialized at runtime through just-in-time compilation.
This compositional definition is used to specify both semantic equivalence and admissible parallel execution.

3.3. Multi-Level CPU Execution Method

The proposed execution method is defined as a hierarchy of three coordinated levels of parallelism applied to the localized processing scheme $L$.
Level I—data-level parallelism (block processing)
For each portion $X^{(k)}$, parsing and filtering are evaluated independently on localized blocks (8).
L^{(1)}(X^{(k)}) = \bigoplus_{b=1}^{B_k} L^{(1)}(X_b^{(k)}).
Lemma 1. 
Under Definitions 1 and 2 and assumptions (A1)–(A2), block-parallel evaluation of $L^{(1)}$ is correct and equivalent to sequential evaluation with preserved record order.
Proof. 
Since $L^{(1)}$ does not access records outside the current block and does not mutate shared state, results for different blocks are independent. Ordered concatenation $\oplus$ restores the sequential record order. □
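A minimal sketch of Level I block-parallel evaluation with ordered concatenation is given below. The actual implementation uses the Python multiprocessing module; a thread pool and a hypothetical parse_record stand-in for $L^{(1)}$ are used here only to keep the example compact and self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_record(line):
    """Hypothetical stand-in for the L1 parse/filter step: split a
    tab-separated VCF line into the fields used downstream."""
    chrom, pos, _id, ref, alt, *rest = line.split("\t")
    return {"CHROM": chrom, "POS": int(pos), "REF": ref, "ALT": alt}

def level1_parallel(portion, ranges):
    """Evaluate L1 independently on localized blocks as in (8);
    executor.map yields block results in submission order, realizing
    the ordered concatenation of Lemma 1."""
    blocks = [portion[s - 1:e] for s, e in ranges]  # 1-based index ranges
    with ThreadPoolExecutor(max_workers=len(blocks)) as ex:
        block_results = ex.map(lambda blk: [parse_record(r) for r in blk],
                               blocks)
        return [rec for block in block_results for rec in block]

lines = ["17\t%d\t.\tA\tG" % (43_044_295 + i) for i in range(6)]
parsed = level1_parallel(lines, [(1, 3), (4, 6)])  # record order preserved
```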
Level II—task-level parallelism (feature decomposition)
Feature construction for each record is decomposed into independent transformations (9).
L^{(2)}(x_i^{(k)}) = \left( \varphi_{pos}(x_i^{(k)}), \varphi_{seq}(x_i^{(k)}), \varphi_{ann}(x_i^{(k)}) \right),
where $\varphi_{pos}$ normalizes genomic coordinates, $\varphi_{seq}$ encodes allele information, and $\varphi_{ann}$ extracts annotation-derived attributes. The feature vector is formed by concatenating the component outputs in a fixed order to form $z_i$.
The feature vector $z_i$ is computed deterministically per VCF record from the standard fields POS, REF, ALT, QUAL, FILTER and a fixed set of INFO tags. Table 2 reports the exact inputs, transformations/encodings, feature ordering, and missing-value rules used to construct $z_i$. ClinVar-derived clinical-significance attributes (used only to assign the ground-truth label $y_i$ from the INFO field) are explicitly excluded from the input feature set to prevent label leakage.
Lemma 2. 
Under (A1)–(A2), parallel computation of $\varphi_{pos}$, $\varphi_{seq}$, and $\varphi_{ann}$ is correct and independent of evaluation order.
Proof. 
Each component depends only on the input record and local parameters and does not modify shared state, hence there are no write–write or read–write conflicts and the final concatenated feature vector is invariant to the order of evaluation. □
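The decomposition (9) can be illustrated as follows. The specific encodings (interval normalization, first-base integer codes, a QUAL_PROXY annotation attribute) are hypothetical stand-ins; the transformations actually used in the pipeline are those specified in Table 2.

```python
BRCA1_START, BRCA1_END = 43_044_295, 43_125_482
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def phi_pos(rec):
    """Normalize the genomic coordinate over the BRCA1 interval."""
    return [(rec["POS"] - BRCA1_START) / (BRCA1_END - BRCA1_START)]

def phi_seq(rec):
    """Encode allele information: first-base codes and length change."""
    ref, alt = rec["REF"], rec["ALT"]
    return [BASES.get(ref[0], -1), BASES.get(alt[0], -1), len(alt) - len(ref)]

def phi_ann(rec):
    """Extract an annotation-derived attribute (missing -> 0.0);
    QUAL_PROXY is a hypothetical INFO tag."""
    return [float(rec.get("INFO", {}).get("QUAL_PROXY", 0.0))]

def level2(rec):
    """Concatenate component outputs in a fixed order to form z_i; the
    components are independent and may be evaluated in parallel (Lemma 2)."""
    return phi_pos(rec) + phi_seq(rec) + phi_ann(rec)

z = level2({"POS": 43_044_295, "REF": "A", "ALT": "AG", "INFO": {}})
```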
Level III—execution-level specialization (JIT kernels)
Numeric kernels in $L^{(3)}$ are specialized at runtime (see (10)).
L^{(3)} = \mathrm{JIT}(P_\theta),
where $P_\theta$ is a parameterized kernel family and $\theta$ captures the active feature layout, datatypes, and array sizes observed in the current execution. In our implementation, $P_\theta$ includes kernels for coordinate normalization and array-wise feature encoding whose loops are specialized for the observed dtypes and contiguous memory layout.
Lemma 3. 
JIT specialization preserves the semantics of $L^{(3)}$ and can reduce execution overhead by changing only the execution form while preserving the input–output mapping.
Proof. 
Specialization does not alter the mathematical function computed by the kernel; it only changes the compiled implementation selected for the observed $\theta$. □
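A minimal Numba sketch of such a coordinate-normalization kernel is shown below. The fallback decorator is included only so the example runs when Numba is unavailable, in which case no specialization occurs and the same input–output mapping is computed in plain Python.

```python
import numpy as np

try:
    from numba import njit  # JIT specialization, as used in the pipeline
except ImportError:          # fallback: plain-Python execution, same semantics
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda func: func

@njit(cache=True)
def normalize_positions(pos, start, end):
    """Array-wise coordinate normalization; on first call Numba compiles a
    kernel specialized for the observed dtypes and contiguous layout
    (the parameter theta in (10))."""
    out = np.empty(pos.shape[0], dtype=np.float64)
    span = float(end - start)
    for i in range(pos.shape[0]):
        out[i] = (pos[i] - start) / span
    return out

pos = np.array([43_044_295, 43_084_888, 43_125_482], dtype=np.int64)
norm = normalize_positions(pos, 43_044_295, 43_125_482)
```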
Thus, local processing for each portion is defined by the hierarchical composition (11).
L = L^{(3)} \circ L^{(2)} \circ L^{(1)},
where Levels I–III correspond to data-level parallel block processing, task-level parallel feature decomposition, and execution-level JIT specialization, respectively.
The flowchart of the method is presented in Figure 1.

3.4. Resource Coordination and Bounded Nested Parallelism

The execution involves two potential sources of concurrency:
  • external execution units used by Level I block processing;
  • internal threading used by the learning stage, including model training, cross-validation, and parameter search.
Let $C_{CPU}$ be the number of physical CPU cores available to the workflow. Let $m_1$ denote the number of external execution units used for Level I block processing. Let $m_{ML}$ denote the internal degree of parallelism used by the learning stage.
Uncoordinated nesting may create effective concurrency that exceeds physical resources, leading to oversubscription and unstable runtimes. The method therefore introduces a resource-coordination mechanism that regulates when and where parallel execution is enabled.
Definition 3 (resource-coordinated execution). 
The execution is resource-bounded if (12).
m_1 + m_{ML} \le C_{CPU},
Internal threading is constrained whenever external block execution units are active, so that concurrent competition between preprocessing execution units and internally threaded learning routines is avoided. The parameters $m_1$ and $m_{ML}$ are treated as explicit configuration variables of the execution method and are fixed for each experimental run.
In practice, this is enforced by setting internal library thread counts (i.e., $m_{ML}$) to a constrained value during block processing and enabling full internal parallelism only in the learning-only phase. If block processing runs with $m_1 > 1$ external workers, then model selection, training, and validation are executed with constrained internal parallelism. Full internal parallelism is enabled only when external block workers are inactive. In other words, the method permits high parallelism in data processing or in learning routines, but not in both simultaneously.
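The coordination rule can be sketched as follows. The environment variables shown are read by common numeric backends (OpenMP, MKL, OpenBLAS); in practice they must be set before those libraries initialize their thread pools, and the exact mechanism used in the implementation may differ.

```python
import os

C_CPU = os.cpu_count() or 1  # stand-in for the physical-core count

def coordinate(m1, c_cpu=C_CPU):
    """Internal parallelism m_ML admissible alongside m1 external block
    workers so that the bound m1 + m_ML <= C_CPU of (12) holds."""
    return max(1, c_cpu - m1)

def enter_block_phase(m1, c_cpu=C_CPU):
    """While m1 external workers are active, cap library threading via
    environment variables read by common numeric backends."""
    m_ml = coordinate(m1, c_cpu)
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(m_ml)
    return m_ml

def enter_learning_phase(c_cpu=C_CPU):
    """With block workers inactive (m1 = 0), full internal parallelism is
    admissible; the returned value can be passed as an n_jobs setting."""
    return coordinate(0, c_cpu)
```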
Proposition 1 (stable single-node execution). 
Under Definition 3, the workflow avoids oversubscription and exhibits more stable runtime behavior within a single CPU node, meaning that excessive scheduling overhead is mitigated and runtime variability is reduced.
Proof. 
The bound $m_1 + m_{ML} \le C_{CPU}$ prevents the number of simultaneously runnable threads from systematically exceeding available physical cores, reducing excessive context switching and contention. As a result, runtime behavior is governed primarily by the computational structure of the pipeline and the learning stage. □
The key novelty is the explicit resource-coordination rule that aligns multi-level preprocessing parallelism with internally threaded learning routines under a single resource bound. This turns nested parallelism from an uncontrolled library side effect into a controllable element of the formal execution model.

4. Results and Analysis

This section reports the dataset used, the evaluation protocol, and the experimental results that quantify classification quality and end-to-end execution efficiency of the proposed multi-level CPU execution method, including scalability and stage-wise ablation.

4.1. Dataset Description (BRCA1 Region of Chromosome 17)

To address the task of automated classification of mutations in the BRCA1 gene, an open human DNA variant dataset Homo sapiens variation data (Release 110) in Variant Call Format (VCF) was used, containing information on single nucleotide variants (SNPs), multiple nucleotide variants (MNPs), insertions, and deletions (InDels) relative to the reference genome [20]. The variants were obtained from the Ensembl resource [21], which provides genome assemblies, gene annotations, and curated variant tracks.
For this study, the file homo_sapiens-chr17.vcf.gz [22] was downloaded, containing variants on chromosome 17. The downloaded VCF contained N_total = 19,772,917 variant records (excluding header lines). Variants were filtered to include only those located within the BRCA1 genomic region (GRCh38). In our experiments, the BRCA1 region was defined by the genomic interval [43,044,295; 43,125,482] in GRCh38 (chromosome 17), and all variants outside this interval were discarded. After coordinate-based filtering to the BRCA1 interval, N_BRCA1_coord = 26,548 variants were retained, which corresponds to 0.13% of the downloaded chromosome-17 records. Clinical labels were derived from ClinVar-derived clinical-significance attributes available in the VCF INFO field. Variants with unambiguous significance (Pathogenic/Likely_pathogenic) were assigned to class 1, and variants with unambiguous significance (Benign/Likely_benign) were assigned to class 0. Among the BRCA1-interval variants, N_excl = 21,475 records (80.89% of the BRCA1-interval subset) had undefined, ambiguous, or conflicting clinical-significance annotations and were removed. After coordinate-based filtering to the BRCA1 interval and excluding variants with undefined or conflicting clinical significance, the dataset comprised N = 5073 variants (N1 = 2642 pathogenic/likely pathogenic; N0 = 2431 benign/likely benign). Overall, the final labeled dataset corresponds to 19.11% of the BRCA1-interval variants and 0.03% of the original downloaded chromosome-17 file.
The resulting dataset defines a binary classification task: clinically significant variants (pathogenic, class 1) versus clinically insignificant variants (benign, class 0). Each record includes the variant position, reference and alternative alleles, and aggregated annotations, which, after preprocessing and feature construction, are used as inputs for the classification model.
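The coordinate filter and label-assignment rule described above can be sketched as follows. The '/'-separated representation of the clinical-significance string is an assumption, since the exact INFO-field encoding depends on the VCF release.

```python
BRCA1_INTERVAL = (43_044_295, 43_125_482)  # GRCh38 coordinates, chromosome 17

PATHOGENIC = {"Pathogenic", "Likely_pathogenic"}
BENIGN = {"Benign", "Likely_benign"}

def label_variant(chrom, pos, clnsig):
    """Return 1 for pathogenic/likely pathogenic, 0 for benign/likely
    benign, and None for records that are outside the BRCA1 interval or
    carry undefined, ambiguous, or conflicting significance."""
    lo, hi = BRCA1_INTERVAL
    if chrom != "17" or not (lo <= pos <= hi):
        return None                      # coordinate-based filtering
    sig = set(clnsig.split("/")) if clnsig else set()
    if sig and sig <= PATHOGENIC:
        return 1
    if sig and sig <= BENIGN:
        return 0
    return None                          # excluded from the labeled dataset
```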

4.2. Classification Performance Metrics

The following metrics were used to evaluate classification performance [23]. Let class 1 (pathogenic) be the positive class. Accuracy is the overall proportion of correctly classified instances, as shown in (13).
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},
where TP (true positives) is the number of pathogenic variants correctly predicted as pathogenic, TN (true negatives) is the number of benign variants correctly predicted as benign, FP (false positives) is the number of benign variants incorrectly predicted as pathogenic, and FN (false negatives) is the number of pathogenic variants incorrectly predicted as benign.
Precision is the fraction of predicted positives that are truly positive, as shown in (14).
\mathrm{Precision} = \frac{TP}{TP + FP}.
Recall is the fraction of true positives that are correctly identified, as shown in (15).
\mathrm{Recall} = \frac{TP}{TP + FN}.
F1-score is the harmonic mean of Precision and Recall, as shown in (16):
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
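The metrics (13)–(16) follow directly from the confusion-matrix counts; the counts in the example below are illustrative only, not the experimental results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute (13)-(16) from confusion-matrix counts, with class 1
    (pathogenic) as the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts only, not the experimental confusion matrix.
m = classification_metrics(tp=80, tn=85, fp=15, fn=20)
```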

4.3. Results and Efficiency of Implementations

Experiments were executed on a system with an Intel Core i7-13700H CPU (Intel Corporation, Santa Clara, CA, USA) (14 cores, 20 threads), 16 GB RAM, and Python 3.10. The implementation was developed in Python using the standard library multiprocessing module and Numba JIT (Numba version 0.62.1) compilation. End-to-end execution time T was measured with time.perf_counter() and validated across five independent runs. The measured runtime includes VCF portion reading, parsing and filtering, feature construction, model selection and training, and evaluation to ensure a consistent end-to-end comparison.
Although the dataset is stored as a finite file, the pipeline processes the VCF in a file-backed, portion-based (chunked) manner: the file is read sequentially in fixed-size portions, and the Level I–III transformations are applied to each portion before global aggregation. The effective throughput is reported as λ = N/T (variants/s), where N is the total number of processed variants and T is the end-to-end runtime; λ is reported alongside runtime to characterize processing capacity rather than hard real-time latency guarantees.
To ensure reproducibility, we fixed random_state = 42 and used identical hyperparameter-search settings in all runs. Peak memory consumption was measured as resident set size (RSS) using psutil by sampling the main Python process and all worker processes at 0.25 s intervals during the end-to-end run. We report (i) peak total RSS (sum over processes) and (ii) peak per-process RSS (maximum over processes) to characterize memory footprint under different core counts. The dataset was split into training and test subsets using an 80/20 ratio with random_state = 42. Hyperparameters were optimized on the training subset only via RandomizedSearchCV (50 sampled configurations) with 5-fold cross-validation, and the final reported classification metrics were computed on the held-out test subset.
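The evaluation protocol can be sketched as follows on synthetic stand-in data. Here n_iter is reduced from the 50 configurations used in the experiments to keep the example fast, and n_jobs = 1 reflects constrained internal parallelism during coordinated execution.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))                 # synthetic stand-in features
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # separable synthetic labels

# 80/20 split with a fixed seed, as in the evaluation protocol.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)

param_dist = {
    "n_estimators": [100, 236, 400],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 10, 18],
    "min_samples_leaf": [1, 2],
}

# Search on the training split only; metrics come from the held-out split.
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_dist, n_iter=5, cv=5,
                            random_state=42, n_jobs=1)
search.fit(X_tr, y_tr)
test_accuracy = search.score(X_te, y_te)      # held-out test accuracy
```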
The reported results correspond to the execution levels defined in Section 3: Level I (data-parallel block processing) accelerates parsing and filtering; Level II (task-parallel feature decomposition) accelerates independent feature transformations; Level III (JIT specialization) reduces numeric preprocessing overhead. To avoid nested oversubscription (Section 3.4), preprocessing parallelism and internally threaded learning routines are executed as coordinated phases rather than unconstrained concurrent nesting.
After training and evaluating the Random Forest model, we performed a comparative analysis of sequential versus multi-level parallel execution. The analysis includes classification quality and end-to-end runtime across different CPU configurations. All experiments were conducted under identical data splits, random seeds, and hyperparameter-search settings. The results are summarized in Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8. All memory values in Table 5 are reported as the maximum observed peak across five independent end-to-end runs for each configuration.
As shown in Table 3, classification quality is preserved after parallelization. Precision and F1-score slightly improve, while end-to-end runtime decreases from 291.25 s to 51.18 s. For a deeper analysis, recall was computed separately for both classes, indicating balanced sensitivity: recall for class 1 is approximately 0.83 and for class 0 approximately 0.87.
The scaling dynamics are summarized in Table 4. With an increasing number of cores, speedup increases; however, efficiency decreases due to parallel overheads and synchronization costs [24].
Table 5 provides the corresponding execution times. The monotonic reduction in runtime confirms the effectiveness of the parallelization strategy.
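The speedup and efficiency values in Table 4 follow directly from the runtimes in Table 5 via S(p) = T(1)/T(p) and E(p) = S(p)/p; a quick sketch (values agree with the table up to rounding):

```python
# End-to-end runtimes from Table 5 (seconds), keyed by core count.
runtimes = {1: 291.25, 2: 151.69, 4: 81.81, 8: 58.60, 14: 51.18}
t_seq = runtimes[1]

for cores, t in runtimes.items():
    if cores == 1:
        continue
    speedup = t_seq / t           # S(p) = T(1) / T(p)
    efficiency = speedup / cores  # E(p) = S(p) / p
    print(cores, round(speedup, 2), f"{efficiency:.0%}")
```

For example, the 14-core configuration gives S = 291.25 / 51.18 ≈ 5.69 and E ≈ 0.40, matching the reported end-to-end figures.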
As expected, the memory footprint increases with the number of worker processes due to per-process parsing buffers and intermediate feature arrays. In our setup, the peak total RSS for 14 cores remained within the available 16 GB RAM (Table 5), indicating feasibility for CPU-only deployment under moderate memory constraints. For memory-constrained systems, the configuration can be adapted by reducing the number of worker processes or the portion size, trading peak RSS for longer runtime. In practice, we recommend selecting the highest core count that keeps peak total RSS comfortably below available RAM to avoid paging and performance collapse.
To strengthen the portion-based processing interpretation, we additionally report the effective throughput computed from the end-to-end runtime (Table 6).
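The throughput values in Table 6 follow directly from λ = N / T with N = 5073 processed variants and the end-to-end runtimes in Table 5:

```python
# Effective throughput lambda = N / T for each CPU configuration.
N = 5073  # total processed variants (Table 6)
runtimes = {1: 291.25, 2: 151.69, 4: 81.81, 8: 58.60, 14: 51.18}  # Table 5
throughput = {cores: round(N / t, 2) for cores, t in runtimes.items()}
print(throughput)  # → {1: 17.42, 2: 33.44, 4: 62.01, 8: 86.57, 14: 99.12}
```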
Random Forest hyperparameters were automatically optimized using RandomizedSearchCV for both implementations; the selected configuration was: max_depth = None, max_features = ‘sqrt’, min_samples_split = 18, min_samples_leaf = 1, n_estimators = 236.
Hyperparameter tuning can constitute a substantial fraction of the learning-stage cost, so the runtime of the hyperparameter-search phase was measured explicitly as a standalone component, separate from preprocessing and the final model fitting. This stage includes all RandomizedSearchCV operations (sampling of configurations, 5-fold cross-validation, and model fitting inside each fold). The measured search runtime (mean ± SD) was 141.87 ± 4.64 s for sequential execution (1 core) and decreased with parallelization to 100.68 ± 5.35 s (2 cores), 46.62 ± 2.92 s (4 cores), 29.66 ± 4.04 s (8 cores), and 23.64 ± 3.76 s (14 cores). This explicit separation confirms that the reported end-to-end speedups remain valid even when the hyperparameter-search cost is included.
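The sampling stage of randomized search can be sketched with the standard library; the actual pipeline uses scikit-learn's RandomizedSearchCV, and the parameter distributions below are illustrative assumptions (only the number of configurations, the seed, and the selected-configuration shape come from the paper):

```python
import random

# Hypothetical search space (the study's exact distributions are not
# specified beyond the selected configuration reported in the text).
param_space = {
    "n_estimators": list(range(100, 301)),
    "max_depth": [None, 10, 20, 30],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": list(range(2, 21)),
    "min_samples_leaf": list(range(1, 11)),
}

rng = random.Random(42)  # mirrors random_state = 42
configs = [
    {name: rng.choice(values) for name, values in param_space.items()}
    for _ in range(50)  # 50 sampled configurations, as in the paper
]
print(len(configs))  # → 50
```

Each sampled configuration is then scored by 5-fold cross-validation on the training subset only, and the best-scoring one is refitted as the final estimator.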
To quantify the incremental contribution of each execution component, we progressively enabled optimizations from the sequential baseline to the full workflow. Table 7 reports the end-to-end runtime for cumulatively enabled steps, separating (i) local VCF processing and feature construction (Levels I–III) from (ii) learning-stage acceleration, decomposed into parallel hyperparameter search and parallel Random Forest fitting under resource-coordinated execution.
The ablation results demonstrate a cumulative reduction in end-to-end runtime as successive execution components are enabled. JIT compilation reduces interpreter overhead in numeric kernels; block-level data parallelism accelerates VCF parsing and filtering; and task-parallel feature processing accelerates independent feature transformations. The first four stages correspond to the local multi-level CPU execution (Levels I–III), achieving a 3.95× speedup (291.25 s → 73.82 s).
The learning stage is decomposed into two separately measured components. First, parallel hyperparameter search (RandomizedSearchCV) reduces the model-selection overhead; in this ablation step the acceleration is attributed to the search procedure alone, while the final model fitting is accounted for in the subsequent step. Second, enabling resource-coordinated parallelism for the final Random Forest fitting further decreases the end-to-end runtime to 51.18 s (5.69×). This separation makes the contributions of preprocessing, hyperparameter search, and final training directly interpretable, while the resource-coordination scheme prevents nested oversubscription when externally parallel preprocessing is combined with internally threaded learning routines. Overall, the workflow achieves a 5.69× end-to-end speedup while maintaining classification quality (F1 ≈ 0.85) for BRCA1-region variant classification.
To ensure statistical robustness of the reported speedup values, each CPU configuration (1, 2, 4, 8, and 14 cores) was evaluated over five independent runs, reporting mean runtime, standard deviation, coefficient of variation, and 95% confidence intervals; paired t-tests were used to assess the significance of runtime differences relative to the sequential baseline.
The observed coefficient of variation remains below 10% across all configurations, indicating low runtime variability and stable performance behavior despite inherent multicore execution noise. The monotonic increase in speedup with the number of cores exceeds the reported confidence intervals, confirming that the observed performance gains are statistically significant and not attributable to random system fluctuations. A paired t-test between adjacent configurations confirmed statistically significant runtime reductions (p < 0.01).
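The coefficient of variation can be recomputed directly from the published means and standard deviations in Table 8 (small differences, e.g., 8.07 versus the reported 8.09 at 8 cores, stem from rounding of the published values):

```python
# Mean and SD of end-to-end runtime over five runs (Table 8).
stats = {
    1:  (284.31, 17.23),
    2:  (148.73, 7.39),
    4:  (79.54, 4.22),
    8:  (57.60, 4.65),
    14: (49.56, 4.64),
}
# CV (%) = SD / mean * 100 for each configuration.
cv = {cores: round(sd / mean * 100, 2) for cores, (mean, sd) in stats.items()}
print(cv)  # → {1: 6.06, 2: 4.97, 4: 5.31, 8: 8.07, 14: 9.36}
assert all(v < 10 for v in cv.values())  # low runtime variability claim
```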

5. Discussion

5.1. Effect of Multi-Level Parallelization on End-to-End Runtime and Quality

The proposed multi-level CPU execution strategy reduced the end-to-end runtime from 291.25 s (sequential baseline) to 51.18 s, corresponding to a 5.69× speedup, while preserving classification behavior and quality metrics (Table 3). The local multi-level execution for preprocessing and feature construction (Levels I–III) accounts for a substantial part of this gain (down to 73.82 s, i.e., 3.95×), and the end-to-end time is further reduced by resource-coordinated parallelism in the learning stage. Importantly, acceleration was achieved without degrading predictive performance: Accuracy remained 0.8483, Recall remained 0.8261, and the F1-score slightly increased from 0.8494 to 0.8502. This indicates that the introduced execution-level optimizations changed the execution form rather than the underlying learning objective or the semantics of feature construction.
Per-class recall further suggests a practically balanced sensitivity: recall is approximately 0.83 for pathogenic variants (class 1) and 0.87 for benign variants (class 0). In biomedical settings, maintaining sensitivity to positive cases is critical, and the obtained values confirm that throughput improvements did not come at the cost of missing pathogenic variants.
The stability of the selected Random Forest configuration across sequential and parallel runs supports the same conclusion. The best hyperparameters found by RandomizedSearchCV were consistent (e.g., max_depth = None, max_features = ‘sqrt’, min_samples_split = 18, min_samples_leaf = 1, n_estimators = 236), suggesting that the search and training logic remained equivalent under the controlled parallel execution.
From an execution perspective, the achieved speedup is explained by the cumulative effect of the three levels defined in Section 3 and quantified in Table 7. JIT specialization reduces preprocessing overhead in numeric kernels, data-parallel chunking accelerates parsing and filtering, and task-level decomposition accelerates independent feature transformations. Finally, enabling parallelism in the learning stage (n_jobs = −1) reduces tree-construction time. A key point is that these gains rely on coordinated use of parallel resources: preprocessing parallelism and internally threaded learning routines are treated as separate phases to avoid nested oversubscription, consistent with the formal resource bound in Section 3.4.
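The coordination rule can be sketched as a simple budget computation, assuming the bound takes the multiplicative form outer_workers × inner_threads ≤ cores (the helper name is illustrative; Section 3.4 gives the formal rule):

```python
def coordinated_workers(total_cores, inner_threads):
    """Bound the number of outer worker processes so that
    outer_workers * inner_threads never exceeds the core budget,
    preventing nested oversubscription."""
    return max(1, total_cores // inner_threads)

# Phase 1: externally parallel preprocessing with single-threaded workers.
print(coordinated_workers(14, inner_threads=1))   # → 14
# Phase 2: internally threaded learning (e.g., n_jobs = -1 inside the
# estimator); outer parallelism collapses to a single coordinator.
print(coordinated_workers(14, inner_threads=14))  # → 1
```

Running the two phases as separate, coordinated stages keeps the effective parallelism at or below the core count in both, which is why preprocessing parallelism and threaded learning are never nested.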

5.2. Scalability and the Observed Efficiency Drop

Scaling results (Table 4 and Table 5) show that end-to-end speedup increases with the number of CPU cores, reaching 5.69× at 14 cores. However, efficiency decreases from 0.96 (2 cores) to 0.40 (14 cores), which is expected for a multi-stage shared-memory workflow due to unavoidable serial components and increasing overheads from synchronization and memory contention.
This behavior is consistent with Amdahl’s law [25]: when a non-parallelizable fraction exists, the marginal benefit of adding cores diminishes. The results indicate that beyond roughly 8–10 cores, additional workers contribute progressively smaller runtime reductions, reflecting resource saturation and coordination overheads typical for shared-memory multiprocessing workloads. For the observed speedup S = 5.69 achieved on p = 14 CPU cores, the effective serial fraction can be estimated using Amdahl’s law as (17):
f = (1/S − 1/p) / (1 − 1/p) = (1/5.69 − 1/14) / (1 − 1/14) ≈ 0.11.
This indicates that approximately 11% of the end-to-end pipeline remains effectively sequential, while the theoretical upper bound on achievable speedup is given by (18).
S_max = 1/f ≈ 1/0.11 ≈ 8.9×.
In practice, the measured speedup remains below this bound because the idealized Amdahl model does not capture contention on shared resources and non-scaling overheads. At higher core counts, VCF parsing and feature materialization increasingly become limited by memory bandwidth and cache capacity, so additional parallel processes compete for the same memory subsystem and last-level cache. Chunk-level multiprocessing also adds coordination overhead from task scheduling, synchronization, inter-process communication, and result aggregation, which grows with the number of parallel processes. Moreover, end-to-end runtime includes input-output and decompression costs as well as operating-system effects such as page-cache misses and frequency scaling. These factors explain why the observed 5.69× speedup on 14 cores does not reach the theoretical maximum.
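The serial-fraction estimate and the Amdahl bound follow in two lines from the measured speedup:

```python
# Estimate the effective serial fraction f from the measured speedup
# S = 5.69 on p = 14 cores (Eq. (17)), then the Amdahl bound (Eq. (18)).
S, p = 5.69, 14
f = (1 / S - 1 / p) / (1 - 1 / p)
s_max = 1 / f  # upper bound on speedup as p -> infinity
print(round(f, 2), round(s_max, 1))  # → 0.11 8.9
```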
Practically, the best operating point depends on whether the objective is minimal end-to-end runtime or maximal efficiency per core. Using all available cores minimizes runtime, whereas an 8-core configuration can provide a more favorable efficiency–performance trade-off on shared servers.

5.3. Optional Reference Comparison with GPU Execution

In addition to CPU scaling, we performed a reference experiment on an NVIDIA RTX 3060 (NVIDIA, Santa Clara, CA, USA) to contextualize the achieved CPU performance for institutions with limited or intermittent access to accelerators. Table 9 summarizes end-to-end execution time for the CPU configurations and the GPU run.
The GPU run completed in 45.00 s, which corresponds to 6.47× speedup relative to the sequential CPU baseline. The optimized CPU execution at 14 cores achieved 87.9% of the GPU speed for this workload, indicating that a carefully coordinated multi-level CPU method can approach GPU-level performance for the considered end-to-end pipeline. This supports the intended deployment scenario: a high-throughput CPU-only alternative when GPU access is constrained.
Figure 2 clearly illustrates the reduction in processing time with increasing core counts, the gradual decline in efficiency under scaling, and the comparable performance levels achieved between the optimized CPU implementation and the GPU variant.
Similar conclusions are reported by [26], where it is shown that well-optimized multicore CPUs can deliver performance comparable to GPUs in bioinformatics tasks without relying on CUDA or specialized accelerators. Reference [27] emphasizes that CPU-based solutions remain a viable alternative for genomic data processing, particularly in cases involving large numbers of independent tasks or limited access to GPU resources.

5.4. Comparative Analysis of Algorithm Efficiency Across Different Genomic Regions

The evaluation was extended from a single genomic region (BRCA1) to three additional genes: MLH1 (chr3: 36993313-37050477), TP53 (chr17: 7668421-7687490), and CDKN2A (chr9: 21967752-21995324) (see Table 10). These regions differ substantially in variant count, class balance, and annotation density, providing a realistic test of scalability and robustness. Both sequential and parallel implementations of the Random Forest–based pipeline were applied to each gene.
As shown in Table 10, parallel execution consistently reduced total runtime across all evaluated regions, achieving speedups between 2.8× and 4.4×. The largest acceleration was observed for TP53, which exhibits a relatively balanced class distribution and moderate variant density. In contrast, CDKN2A shows a lower speedup, reflecting the impact of smaller labeled sample size and increased parallelization overhead. When compared to the BRCA1 experiment (Table 3), which achieved a higher speedup due to a larger and more densely annotated variant set, these results indicate that absolute performance gains are influenced by region-specific characteristics, while the overall benefit of parallelization remains consistent.
The results in Table 11 and Table 12 show that parallelization preserves the predictive performance of the classification pipeline across all evaluated genomic regions. For MLH1, the parallel execution yields nearly identical Accuracy and Precision compared to the sequential pipeline, with a slight increase in Recall and F1-score (0.732 versus 0.729). For TP53, all reported metrics remain exactly the same under sequential and parallel execution, indicating fully preserved classification behavior. For CDKN2A, classification metrics are identical between the two implementations, confirming that parallelization does not affect model performance even in regions characterized by strong class imbalance and limited pathogenic variant representation.
These results confirm that the proposed parallelization strategy generalizes across multiple genomic regions, preserving model quality while significantly improving runtime. Variant density, gene length, and annotation complexity influence absolute execution time but do not impact the consistency of learned Random Forest models.

5.5. Comparison with Existing Studies

Due to platform limitations, widely used VCF processing tools such as VCFtools and BCFtools are not natively available for the Microsoft Windows 11 operating system, which prevents direct execution and runtime comparison under identical hardware conditions. To address this limitation and to further validate the effectiveness of the proposed approach, a comparison was conducted against results reported in the literature using the ClinVar dataset [28], which contains clinically verified genetic variants. Table 13 summarizes the performance of the proposed method relative to recently published models, including integrated architectures such as GPN + ESM + Alphamissense, described in [29].
The results in Table 13 confirm that the proposed parallel approach not only significantly reduces execution time but also delivers comparable or even higher classification quality relative to other methods. Therefore, the proposed solution can be considered an efficient and accessible alternative for genomic variant classification tasks, particularly in resource-constrained computational environments. These findings also highlight the potential of extending the approach to hybrid CPU–GPU systems and adapting it for real-time biomedical data processing.

6. Conclusions

In this work, we developed a multi-level local CPU execution method for end-to-end VCF processing and mutation classification. The method is defined as a hierarchical composition of block-wise data parallelism for parsing and filtering, task-level parallelism for feature decomposition, and JIT-based execution-level specialization of numeric kernels, together with an explicit resource-coordination mechanism that prevents oversubscription when pipeline parallelism interacts with internally threaded learning routines.
The proposed approach enabled a substantial reduction in processing time from 291.25 s to 51.18 s while maintaining high accuracy, recall, and stable classification behavior for BRCA1 gene mutations. A reference run on an NVIDIA RTX 3060 indicates that the optimized 14-core CPU execution approaches GPU-level performance for the considered end-to-end pipeline (51.18 s on CPU versus 45.00 s on GPU), supporting practical deployment in environments with limited access to accelerators while preserving classification quality.
The experimental results further confirm predictable scalability with increasing numbers of CPU cores, alongside the expected efficiency degradation governed by Amdahl’s law. These findings demonstrate that carefully coordinated multi-level CPU parallelism can deliver high-throughput genomic data processing without reliance on specialized hardware.
Future work will focus on extending the proposed execution method to hybrid CPU–GPU systems and adapting it for real-time streaming of biomedical data. The approach can be integrated into practical personalized diagnostic systems operating on conventional multicore CPU infrastructures.

Author Contributions

Conceptualization, L.M.; methodology, L.M.; software, L.M., V.M. and K.K.; validation, I.T.; formal analysis, L.M.; investigation, I.T.; resources, V.M. and K.K.; data curation, V.M. and K.K.; writing—original draft preparation, L.M.; writing—review and editing, L.M.; visualization, L.M.; supervision, I.T.; project administration, L.M.; funding acquisition, L.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available datasets were used in this study. Ensembl “Homo sapiens variation data (Release 110)” is available at [22]; ClinVar is available at [28]. The source code and scripts used for data processing and analysis in this study are publicly available at [https://github.com/ovavl06-sys/Multi-Level-Parallel-CPU-Execution-Method-for-Accelerated-Portion-Based-VCF-Data-Processing (accessed on 4 February 2026)].

Acknowledgments

This work was supported by the Department of Artificial Intelligence Systems at Lviv Polytechnic National University, to which the authors express their gratitude.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Delpierre, C.; Lefèvre, T. Precision and Personalized Medicine: What Their Current Definition Says and Silences about the Model of Health They Promote. Implication for the Development of Personalized Health. Front. Sociol. 2023, 8, 1112159. [Google Scholar] [CrossRef]
  2. Alsayaydeh, J.A.J.; Yusof, M.F.B.; Halim, M.Z.B.A.; Zainudin, M.N.S.; Herawan, S.G. Patient Health Monitoring System Development Using ESP8266 and Arduino with IoT Platform. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 617–624. [Google Scholar] [CrossRef]
  3. Mochurad, L.; Solomiia, A. Optimizing the Computational Modeling of Modern Electronic Optical Systems. In Lecture Notes in Computational Intelligence and Decision Making; Lytvynenko, V., Babichev, S., Wójcik, W., Vynokurova, O., Vyshemyrskaya, S., Radetskaya, S., Eds.; Advances in Intelligent Systems and Computing; Springer International Publishing: Cham, Switzerland, 2020; Volume 1020, pp. 597–608. ISBN 978-3-030-26473-4. [Google Scholar]
  4. Manrai, A.K.; Funke, B.H.; Rehm, H.L.; Olesen, M.S.; Maron, B.A.; Szolovits, P.; Margulies, D.M.; Loscalzo, J.; Kohane, I.S. Genetic Misdiagnoses and the Potential for Health Disparities. N. Engl. J. Med. 2016, 375, 655–665. [Google Scholar] [CrossRef] [PubMed]
  5. Sahraeian, S.M.E.; Fang, L.T.; Karagiannis, K.; Moos, M.; Smith, S.; Santana-Quintero, L.; Xiao, C.; Colgan, M.; Hong, H.; Mohiyuddin, M.; et al. Achieving Robust Somatic Mutation Detection with Deep Learning Models Derived from Reference Data Sets of a Cancer Sample. Genome Biol. 2022, 23, 12. [Google Scholar] [CrossRef]
  6. Polewko-Klim, A.; Mnich, K.; Rudnicki, W.R. Robust Data Integration Method for Classification of Biomedical Data. J. Med. Syst. 2021, 45, 45. [Google Scholar] [CrossRef]
  7. Houston, L.; Yu, P.; Martin, A.; Probst, Y. Heterogeneity in Clinical Research Data Quality Monitoring: A National Survey. J. Biomed. Inform. 2020, 108, 103491. [Google Scholar] [CrossRef]
  8. Zou, Y.; Zhu, Y.; Li, Y.; Wu, F.-X.; Wang, J. Parallel Computing for Genome Sequence Processing. Brief. Bioinform. 2021, 22, bbab070. [Google Scholar] [CrossRef]
  9. Danecek, P.; Auton, A.; Abecasis, G.; Albers, C.A.; Banks, E.; DePristo, M.A.; Handsaker, R.E.; Lunter, G.; Marth, G.T.; Sherry, S.T.; et al. The Variant Call Format and VCFtools. Bioinformatics 2011, 27, 2156–2158. [Google Scholar] [CrossRef]
  10. Mochurad, L.; Kotsiumbas, O.; Protsyk, I. A Model for Weather Forecasting Based on Parallel Calculations. In Advances in Artificial Systems for Medicine and Education VI; Hu, Z., Ye, Z., He, M., Eds.; Lecture Notes on Data Engineering and Communications Technologies; Springer Nature: Cham, Switzerland, 2023; Volume 159, pp. 35–46. ISBN 978-3-031-24467-4. [Google Scholar]
  11. Iwasaki, S.; Amer, A.; Taura, K.; Seo, S.; Balaji, P. BOLT: Optimizing OpenMP Parallel Regions with User-Level Threads. In Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques (PACT), Seattle, WA, USA, 23–26 September 2019; pp. 29–42. [Google Scholar]
  12. Mochurad, L.; Horun, P. Improvement Technologies for Data Imputation in Bioinformatics. Technologies 2023, 11, 154. [Google Scholar] [CrossRef]
  13. Lam, S.K.; Pitrou, A.; Seibert, S. Numba: A LLVM-Based Python JIT Compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, Austin TX, USA, 15 November 2015; pp. 1–6. [Google Scholar]
  14. O’Connell, K.A.; Yosufzai, Z.B.; Campbell, R.A.; Lobb, C.J.; Engelken, H.T.; Gorrell, L.M.; Carlson, T.B.; Catana, J.J.; Mikdadi, D.; Bonazzi, V.R.; et al. Accelerating Genomic Workflows Using NVIDIA Parabricks. BMC Bioinform. 2023, 24, 221. [Google Scholar] [CrossRef]
  15. Czarnul, P. Investigation of Parallel Data Processing Using Hybrid High Performance CPU. Comput. Inform. 2020, 39, 510–536. [Google Scholar] [CrossRef]
  16. Waarts, M.R.; Stonestrom, A.J.; Park, Y.C.; Levine, R.L. Targeting Mutations in Cancer. J. Clin. Investig. 2022, 132, e154943. [Google Scholar] [CrossRef] [PubMed]
  17. Mishra, R.; Li, B. The Application of Artificial Intelligence in the Genetic Study of Alzheimer’s Disease. Aging Dis. 2020, 11, 1567. [Google Scholar] [CrossRef]
  18. Jadhav, R.S.; Borkar, P.; Waghmode, R.R.; Mahalle, P.N.; Babar, A.V.; Bhattacharya, S. Advancing Personalized Medicine Using Genomic Data Analysis and Predictive Modeling. In Proceedings of the 2025 International Conference on Emerging Smart Computing and Informatics (ESCI), Pune, India, 5 March 2025; pp. 1–6. [Google Scholar]
  19. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  20. Garrison, E.; Kronenberg, Z.N.; Dawson, E.T.; Pedersen, B.S.; Prins, P. A Spectrum of Free Software Tools for Processing the VCF Variant Call Format: Vcflib, Bio-Vcf, Cyvcf2, Hts-Nim and Slivar. PLoS Comput. Biol. 2022, 18, e1009123. [Google Scholar] [CrossRef]
  21. Howe, K.L.; Achuthan, P.; Allen, J.; Allen, J.; Alvarez-Jarreta, J.; Amode, M.R.; Armean, I.M.; Azov, A.G.; Bennett, R.; Bhai, J.; et al. Ensembl 2021. Nucleic Acids Res. 2021, 49, D884–D891. [Google Scholar] [CrossRef]
  22. Ensembl. Homo sapiens Variation VCF Files (Release 110). Available online: https://ftp.ensembl.org/pub/release-110/variation/vcf/homo_sapiens/ (accessed on 4 February 2026).
  23. Miller, C.; Portlock, T.; Nyaga, D.M.; O’Sullivan, J.M. A Review of Model Evaluation Metrics for Machine Learning in Genetics and Genomics. Front. Bioinform. 2024, 4, 1457619. [Google Scholar] [CrossRef] [PubMed]
  24. Córdoba, M.L.; Dopico, A.G.; García, M.I.; Rosales, F.; Arnaiz, J.; Bermejo, R.; Del Sastre, P.G. Efficient Parallelization of a Regional Ocean Model for the Western Mediterranean Sea. Int. J. High Perform. Comput. Appl. 2014, 28, 368–383. [Google Scholar] [CrossRef]
  25. Amdahl, G.M. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In Proceedings of the AFIPS Spring Joint Computer Conference, Atlantic City, NJ, USA, 18–20 April 1967; pp. 483–485. Reprinted in IEEE Solid-State Circuits Soc. Newsl. 2007, 12, 19–20. [Google Scholar] [CrossRef]
  26. Costanzo, M.; Rucci, E.; García-Sánchez, C.; Naiouf, M.; Prieto-Matías, M. Analyzing the Performance Portability of SYCL across CPUs, GPUs, and Hybrid Systems with SW Sequence Alignment. Future Gener. Comput. Syst. 2025, 170, 107838. [Google Scholar] [CrossRef]
  27. Khorsand, P.; Denti, L.; Human Genome Structural Variant Consortium; Bonizzoni, P.; Chikhi, R.; Hormozdiari, F. Comparative Genome Analysis Using Sample-Specific String Detection in Accurate Long Reads. Bioinform. Adv. 2021, 1, vbab005. [Google Scholar] [CrossRef] [PubMed]
  28. Landrum, M.J.; Chitipiralla, S.; Kaur, K.; Brown, G.; Chen, C.; Hart, J.; Hoffman, D.; Jang, W.; Liu, C.; Maddipatla, Z.; et al. ClinVar: Updates to Support Classifications of Both Germline and Somatic Variants. Nucleic Acids Res. 2025, 53, D1313–D1321. [Google Scholar] [CrossRef] [PubMed]
  29. Boulaimen, Y.; Fossi, G.; Outemzabet, L.; Jeanray, N.; Levenets, O.; Gerart, S.; Vachenc, S.; Raieli, S.; Giemza, J. Integrating Large Language Models for Genetic Variant Classification. arXiv 2024, arXiv:2411.05055. [Google Scholar] [CrossRef]
Figure 1. Multi-level organization of local parallel CPU processing of portion-based VCF data.
Figure 2. Relationship between execution time, speedup, and efficiency across CPU core counts and GPU performance (NVIDIA RTX 3060).
Table 1. Summary of the qualitative positioning of related works and the proposed approaches.
Approach/Source | Advantages | Disadvantages
Deep learning for somatic mutations [5] | High accuracy, detailed data modeling | High computational requirements, need for GPU
Data integration and quality [6,7] | Comprehensive approach to heterogeneous data | Standardization complexity; data heterogeneity
Parallel CPU and GPU computing [8,15] | Scalability, increased speed | High equipment cost, complex setup
NVIDIA Parabricks [14] | Accelerated processing, ready GPU solutions | Limited availability, GPU-only orientation
Applied ML models in medicine [15,16,17,18,19] | High clinical relevance, broad applicability | Resource-intensive; mostly implemented in advanced centers
Ensemble ML methods (Random Forest, etc.) [18,19] | High prediction accuracy, stability, ability to handle large genetic datasets | Require complex data preparation, sensitive to hyperparameter selection
CPU-oriented multi-level approach (this study) | Improved throughput on CPU-only systems; multi-level optimization; wide accessibility | Requires careful configuration; evaluated on a specific use case (BRCA1), with broader generalization as future work
Table 2. Per-record feature specification used in this study.
Feature Name | Source Field(s) | Exact Transformation | Encoding/Type | Missing Values Rule
pos_raw | POS | integer(POS) | numeric | n/a
pos_norm | POS | (POS − L)/(R − L), L = 43,044,295, R = 43,125,482 | numeric | n/a
len_ref | REF | |REF| | numeric | n/a
len_alt | ALT | |ALT| | numeric | n/a
delta_len | REF, ALT | |ALT| − |REF| | numeric | n/a
is_SNV | REF, ALT | 1 if |REF| = 1 and |ALT| = 1 else 0 | binary | n/a
is_INS | REF, ALT | 1 if |ALT| > |REF| else 0 | binary | n/a
is_DEL | REF, ALT | 1 if |ALT| < |REF| else 0 | binary | n/a
qual | QUAL | 0 if QUAL == ‘.’ else float(QUAL) | numeric | n/a
filter_pass | FILTER | 1 if FILTER == “PASS” else 0 | binary | n/a
miss_QUAL | QUAL | 1 if QUAL == ‘.’ else 0 | binary | n/a
miss_FILTER | FILTER | 1 if FILTER is missing else 0 | binary | n/a
The INFO clinical-significance attribute used to derive labels (ClinVar-derived) is not included in the feature vector. Missing-value handling is encoded via miss_QUAL and miss_FILTER indicators; QUAL values equal to ‘.’ are mapped to 0.
Table 3. Classification metrics for different implementations.
Metric | Sequential | Parallel | 95% CI for Sequential Pipeline | 95% CI for Parallel Pipeline
Execution time (s) | 291.25 | 51.18 | - | -
Accuracy | 0.8483 | 0.8483 | [0.8255, 0.8698] | [0.8255, 0.8698]
Precision | 0.8740 | 0.8758 | [0.8431, 0.9044] | [0.8449, 0.9062]
Recall | 0.8261 | 0.8261 | [0.7936, 0.8565] | [0.7936, 0.8565]
F1-score | 0.8494 | 0.8502 | [0.8250, 0.8729] | [0.8258, 0.8737]
Table 4. Speedup and efficiency of parallel execution.
Number of Cores | Speedup | Efficiency, %
2 | 1.92 | 96
4 | 3.56 | 89
8 | 4.96 | 62
14 | 5.69 | 40
Table 5. Execution time and memory footprint for different configurations.
Implementation/Cores | 1 | 2 | 4 | 8 | 14
Execution time (s) | 291.25 | 151.69 | 81.81 | 58.60 | 51.18
Peak total RSS (GB) | 1.10 | 1.25 | 1.55 | 2.05 | 2.65
Peak per-process RSS (GB) | 1.10 | 0.75 | 0.60 | 0.50 | 0.45
Table 6. Effective throughput for different CPU configurations.
Cores | 1 | 2 | 4 | 8 | 14
Effective throughput λ (variants/s), N = 5073 | 17.42 | 33.44 | 62.01 | 86.57 | 99.12
Table 7. Stage-wise contribution of the multi-level scheme (ablation).
Stage | Runtime (s) | Speedup
Sequential baseline | 291.25 | 1.00×
+JIT optimization (Numba) | 238.10 | 1.22×
+Data-parallel VCF chunking | 126.45 | 2.30×
+Task-parallel feature construction | 73.82 | 3.95×
+Parallel hyperparameter search (RandomizedSearchCV, n_jobs = −1) | 67.80 | 4.30×
+Resource-coordinated parallel RF training (n_jobs = −1) | 51.18 | 5.69×
Table 8. Statistical summary of runtime measurements (5 runs).
Configuration | Mean (s) | SD (s) | CV (%) | 95% CI (s) | Paired t-Test vs. Sequential (p-Value)
Sequential (1 core) | 284.31 | 17.23 | 6.06 | [271.06, 297.56] | -
Parallel (2 cores) | 148.73 | 7.39 | 4.97 | [141.90, 155.56] | 1.0553 × 10^−4
Parallel (4 cores) | 79.54 | 4.22 | 5.31 | [76.01, 83.07] | 3.1523 × 10^−5
Parallel (8 cores) | 57.60 | 4.65 | 8.09 | [54.79, 60.41] | 3.2279 × 10^−5
Parallel (14 cores) | 49.56 | 4.64 | 9.36 | [47.10, 52.02] | 3.7680 × 10^−5
Table 9. Execution time, speedup, and efficiency on CPU (different core counts) and a reference GPU run (NVIDIA RTX 3060).
Processing Method | Execution Time (s) | Speedup | Efficiency (%)
Sequential | 291.25 | 1.00 | 100
Parallel (2 cores) | 151.69 | 1.92 | 96
Parallel (4 cores) | 81.81 | 3.56 | 89
Parallel (8 cores) | 58.60 | 4.96 | 62
Parallel (14 cores) | 51.18 | 5.69 | 40
GPU (NVIDIA RTX 3060) | 45.00 | 6.47 | -
Table 10. Execution time and speedup of sequential and parallel processing across different genomic regions.
Genomic Regions | Sequential Execution Time (s) | Parallel Execution Time (s) | Speedup
MLH1 | 113.20 | 38.11 | 2.97
TP53 | 51.29 | 11.60 | 4.42
CDKN2A | 52.09 | 18.41 | 2.83
Table 11. Classification performance of the sequential pipeline across multiple genomic regions.

Genomic Region | Accuracy | Precision | Recall | F1-Score
MLH1           | 0.738    | 0.739     | 0.719  | 0.729
TP53           | 0.821    | 0.800     | 0.823  | 0.811
CDKN2A         | 0.757    | 0.578     | 0.578  | 0.578
Table 12. Classification performance of the parallel pipeline across multiple genomic regions.

Genomic Region | Accuracy | Precision | Recall | F1-Score
MLH1           | 0.737    | 0.738     | 0.726  | 0.732
TP53           | 0.821    | 0.800     | 0.823  | 0.811
CDKN2A         | 0.757    | 0.578     | 0.578  | 0.578
Table 13. Comparison of the proposed approach with results from [28] on the ClinVar dataset.

Method                                            | Accuracy | Precision | Recall | F1-Score | Notes
Random Forest (Boulaimen et al. [29])             | 0.83     | 0.82      | 0.83   | 0.83     | Integrated model with GPN + ESM + AlphaMissense
XGBoost (Boulaimen et al. [29])                   | 0.82     | 0.81      | 0.83   | 0.82     | Integrated model with GPN + ESM + AlphaMissense
Multi-input Neural Network (Boulaimen et al. [29])| 0.825    | 0.82      | 0.82   | 0.82     | Integrated model with GPN + ESM + AlphaMissense
Random Forest (proposed, sequential)              | 0.84     | 0.83      | 0.81   | 0.83     | CPU, sequential processing
Random Forest (proposed, parallel)                | 0.84     | 0.83      | 0.81   | 0.83     | CPU, 14 cores, multi-level parallelization
Mochurad, L.; Tsmots, I.; Mostova, V.; Kystsiv, K. Multi-Level Parallel CPU Execution Method for Accelerated Portion-Based Variant Call Format Data Processing. Computation 2026, 14, 48. https://doi.org/10.3390/computation14020048