PathoPredictor: A Machine Learning Framework for Predicting Pathogenic Missense Variants in the Human Genome

Bahmane, Karima; Bhattacharya, Sambit; Kassem, My Abdelmajid

doi:10.3390/jgbg1010003

Open Access Article

by Karima Bahmane ¹,²,³, Sambit Bhattacharya ² and My Abdelmajid Kassem ¹,*

¹ Plant Genomics and Bioinformatics Lab, Department of Biological and Forensic Sciences, Fayetteville State University, Fayetteville, NC 28301, USA

² Department of Mathematics and Computer Science, Fayetteville State University, Fayetteville, NC 28301, USA

³ LISAD Laboratory, Department of Computer Science, National School of Applied Sciences, Agadir 80000, Morocco

* Author to whom correspondence should be addressed.

J. Genome Biotechnol. Genet. 2026, 1(1), 3; https://doi.org/10.3390/jgbg1010003

Submission received: 16 January 2026 / Revised: 12 March 2026 / Accepted: 17 March 2026 / Published: 24 March 2026

Abstract

Missense single nucleotide variants (SNVs) represent one of the most common forms of genetic variation and account for a substantial proportion of variants of uncertain significance in clinical databases. Accurate computational classification of these variants remains an important challenge in precision medicine and genomic research. In this study, we present PathoPredictor, an interpretable machine-learning framework designed to distinguish pathogenic from benign missense variants using curated clinical variant data and functional annotations. High-confidence variants were obtained from the November 2023 ClinVar release and annotated using dbNSFP v5.1 (GRCh37). After data filtering, imputation, and normalization, 59,302 expert-reviewed missense variants were retained for model development. Six machine-learning algorithms were evaluated under identical cross-validation conditions applied to the training set. Among the evaluated models, LightGBM demonstrated the strongest overall performance and was selected as the final PathoPredictor classifier, achieving a mean ROC–AUC of 0.93 ± 0.004, accuracy of 0.90 ± 0.006, and Matthews correlation coefficient (MCC) of 0.80 ± 0.008 across five cross-validation folds. Model interpretability was examined using SHAP (SHapley Additive exPlanations), enabling both global feature ranking and variant-level explanation of predictions. Temporal validation using ClinVar variants submitted after November 2023 showed consistent predictive performance on previously unseen submissions within the same database ecosystem (ROC–AUC = 0.91). While the framework demonstrates strong discrimination and structured interpretability, potential limitations include training data bias and partial circularity associated with the inclusion of existing meta-predictors.
Overall, PathoPredictor provides a reproducible and interpretable computational framework for integrating functional annotations in missense variant prioritization, supporting research and genomic analysis workflows.

Keywords:

missense variants; ClinVar; pathogenicity prediction; machine learning; SHAP; variant interpretation; LightGBM; XGBoost; Random Forest; explainable AI; clinical genomics; variant effect prediction

1. Introduction

Missense single nucleotide variants (SNVs), which alter the amino acid sequence of a protein, represent one of the most common forms of genetic variation in the human genome [1,2]. Large-scale sequencing initiatives such as the 1000 Genomes Project, the Genome Aggregation Database (gnomAD), and UK Biobank have demonstrated that every individual carries thousands of missense variants, most of which are rare and often population-specific [1,2,3,4]. Although many amino acid substitutions are functionally neutral, a subset contribute to Mendelian disorders, complex diseases, and cancer susceptibility [3,4,5,6]. Consequently, distinguishing pathogenic from benign missense variants remains a fundamental task in precision medicine, clinical genetics, and translational genomics [7,8].

Despite major advances in next-generation sequencing, variant interpretation remains a critical bottleneck [4,7]. ClinVar, a public archive of genotype–phenotype relationships, reports that a substantial proportion of submitted missense variants remain classified as variants of uncertain significance (VUS), reflecting insufficient or conflicting evidence regarding their clinical impact [3,4,9]. These uncertainties complicate genetic counseling, delay diagnosis, and hinder effective clinical management, particularly in hereditary disease evaluation [4,10]. Manual, expert-driven curation alone cannot keep pace with the exponential growth of genomic data, underscoring the need for robust, scalable, and reproducible computational approaches [4,11].

Numerous in silico methods have been developed to predict missense variant pathogenicity. Classical tools such as SIFT, PolyPhen-2, and MutationTaster rely on evolutionary conservation, protein structural features, and sequence homology to estimate functional impact [12,13,14]. More recent meta-predictors including CADD, REVEL, and MetaSVM integrate multiple annotations into unified scores and often achieve improved discrimination [15,16,17]. However, existing tools exhibit heterogeneous performance across genes and variant classes [18,19,20,21], may rely on limited or predefined feature subsets, and frequently lack transparency in their decision-making processes [22]. Furthermore, evaluation of pathogenicity predictors has been complicated by issues of circularity and dataset dependency, which may inflate reported performance when training and benchmarking data overlap [16]. In clinical settings such as diagnostic laboratories and variant review boards, interpretability and reproducibility are essential prerequisites for adoption [4,7].

The dbNSFP resource provides a comprehensive aggregation of functional, structural, evolutionary, and ensemble annotations for human missense variants [23]. The current release integrates dozens of complementary features, including conservation metrics (phyloP, phastCons, GERP++), protein-level descriptors, and widely used predictors such as CADD, PolyPhen-2, and SIFT [23]. This centralized and standardized annotation framework enables reproducible feature extraction and facilitates the development of machine learning-based pathogenicity classifiers grounded in a broad and biologically diverse feature space [23].

Machine learning approaches, particularly ensemble tree-based methods, have demonstrated strong performance in genomics due to their ability to model nonlinear relationships and complex feature interactions that single-score predictors cannot capture [24,25,26]. Gradient boosting frameworks such as XGBoost (v1.7) and LightGBM (v4.0) are now widely used for tabular biomedical data because of their scalability, robustness, and predictive accuracy [24,25]. Moreover, advances in explainable artificial intelligence, notably SHAP (SHapley Additive exPlanations), provide a theoretically grounded method to quantify feature contributions at both global and individual prediction levels [22,27]. Such interpretability frameworks are particularly important for genomic applications where transparency and traceability are required for responsible clinical integration [7,8].

Recent deep learning approaches, including deep generative models of evolutionary variation and primate-based constraint learning, have further expanded the landscape of computational variant effect prediction [28,29,30]. While these methods often achieve high predictive performance, their complexity and limited interpretability may present challenges for routine clinical deployment [8,29].

To address these challenges, we developed PathoPredictor, a supervised machine-learning framework trained on expert-reviewed pathogenic and benign missense variants derived from the November 2023 ClinVar release [3]. Variants were annotated using dbNSFP functional and evolutionary predictors and processed through a structured preprocessing and feature-selection pipeline designed to minimize information leakage. Multiple machine-learning algorithms were benchmarked under identical cross-validation conditions applied to the training set, and LightGBM was selected as the final model based on its cross-validated performance and stability (Figure 1).

Importantly, PathoPredictor integrates several existing pathogenicity predictors (e.g., CADD and REVEL) as input features. Because some of these predictors have historically been trained or benchmarked using ClinVar-related datasets, a degree of circularity cannot be completely excluded [18,31]. For this reason, the model is presented primarily as an integrative and interpretable framework for combining multiple computational signals, rather than as a replacement for existing predictors. To evaluate stability over time, we additionally performed temporal validation on variants submitted to ClinVar after the training cutoff. Although this evaluation provides an initial indication of model generalization, it should not be interpreted as fully independent external validation.

2. Results

2.1. Dataset Characteristics

A total of 987,214 missense single nucleotide variants (SNVs) were initially retrieved from the November 2023 release of ClinVar and annotated using dbNSFP v5.1 [3,23]. Following quality control procedures including removal of duplicate entries, variants lacking expert-reviewed clinical significance, and records with excessive missing annotation values, a final dataset of 59,302 high-confidence missense variants was retained for downstream analysis [3,4,7,9,18].

Among the curated variants, 48.3% were classified as Pathogenic or Likely Pathogenic (P/LP), while 51.7% were categorized as Benign or Likely Benign (B/LB) based on their ClinVar clinical significance [3]. This near-balanced class distribution reduced the need for external resampling strategies during model training [24,25].

Feature completeness analysis indicated that over 95% of retained variants included the full set of selected functional prediction scores and evolutionary conservation metrics [9,10,11,12,13,19]. Residual missing values were sparse and subsequently handled using the imputation procedures described in Section 4.2.2 [23,25].

2.2. Performance of Individual Predictors

The predictive performance of commonly used pathogenicity scores available in dbNSFP was evaluated on the independent test set using continuous score outputs [23,26]. Evaluated predictors included SIFT, PolyPhen-2, MutationAssessor, FATHMM, CADD, and REVEL [12,13,15,16,17,32]. Among individual predictors, REVEL demonstrated the strongest discriminative performance, achieving an ROC-AUC of 0.88, followed by CADD with an ROC-AUC of 0.84 [15,16,32]. Functional impact predictors such as MutationAssessor and FATHMM showed moderate performance, with ROC-AUC values ranging from 0.78 to 0.82 [17,19]. Conservation-based metrics including GERP++, phyloP, and phastCons exhibited lower discriminative ability, with ROC-AUC values between 0.70 and 0.80 [23,33]. In contrast, predictors derived from amino acid substitution matrices, such as BLOSUM62-based features, showed substantially weaker performance, typically yielding ROC-AUC values below 0.65 [23].

These results highlight the variability in predictive performance among individual annotation scores when evaluated independently on the held-out dataset [18,19,26]. They also illustrate the potential benefit of integrating complementary functional and evolutionary annotations within supervised learning frameworks, which can leverage multiple signals simultaneously rather than relying on a single predictor.
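The single-predictor comparison above can be reproduced in outline with a rank-based ROC-AUC. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the row layout (one dict of scores per variant), the function names, and the `LOWER_IS_DAMAGING` set are illustrative. The one substantive point it encodes is score direction: SIFT assigns lower values to more damaging substitutions, so such scores must be negated before they are ranked against pathogenic labels.

```python
def roc_auc(labels, scores):
    """ROC-AUC from the Mann-Whitney U statistic; tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required to compute ROC-AUC")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Assumption: scores where LOWER means more damaging are listed here and negated.
# Extend the set for any other predictor with the same orientation.
LOWER_IS_DAMAGING = {"SIFT_score"}


def single_predictor_auc(rows, labels, feature):
    """AUC of one annotation score, skipping variants where the score is missing."""
    kept = [(r.get(feature), y) for r, y in zip(rows, labels)
            if r.get(feature) is not None]
    sign = -1.0 if feature in LOWER_IS_DAMAGING else 1.0
    return roc_auc([y for _, y in kept], [sign * s for s, _ in kept])
```

Dropping missing scores per predictor means each AUC may be computed on a slightly different subset of variants; whether the study handled missing single-predictor scores this way is not stated in the text.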

2.3. Machine Learning Model Performance

Supervised machine-learning algorithms including Logistic Regression, Support Vector Machine (SVM), Random Forest, k-Nearest Neighbors (kNN), XGBoost, and LightGBM were trained using independently optimized hyperparameters (Section 2.4) and evaluated under identical five-fold cross-validation splits and on an independent hold-out test set [24,25,26]. Among the evaluated models, LightGBM demonstrated the highest overall cross-validated performance [25] (Table 1).
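Evaluating every algorithm under identical splits requires that the fold assignment be generated once and reused. A minimal sketch of a stratified five-fold assignment is given below; the function names and the fixed seed are assumptions made for illustration, and the study's own splitting code is not described at this level of detail.

```python
import random


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    rng = random.Random(seed)  # fixed seed: every model sees the same folds
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)  # deal indices round-robin per class
    return [sorted(f) for f in folds]


def cv_splits(labels, n_splits=5, seed=0):
    """Yield (train_indices, validation_indices) pairs, one per fold."""
    folds = stratified_folds(labels, n_splits, seed)
    for k in range(n_splits):
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, folds[k]
```

Materializing the splits once, for example with `splits = list(cv_splits(y_train))`, and passing the same list to each candidate model keeps the comparison paired across folds.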

During hyperparameter optimization on the training data, the LightGBM model achieved a mean cross-validated ROC–AUC of 0.93 ± 0.004, accuracy of 0.90 ± 0.006, and MCC of 0.80 ± 0.008 across five folds. When subsequently evaluated on the independent hold-out test set, the final trained model achieved an ROC–AUC of 0.93 (95% CI: 0.92–0.94) as estimated using bootstrap resampling [26]. XGBoost was the second-best performer, achieving a test ROC–AUC of 0.92 (95% CI: 0.91–0.93), accuracy of 0.89, sensitivity of 0.85, specificity of 0.92, precision of 0.88, and F1-score of 0.86 [24]. Compared with individual annotation predictors evaluated in Section 2.2, machine-learning models exhibited consistently higher discriminative performance across all evaluation metrics [18,19,26].
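For reference, the threshold-dependent metrics reported above (accuracy, sensitivity, specificity, precision, F1-score, and MCC) all derive from the four cells of the confusion matrix. The sketch below shows the standard definitions; the function name and the convention of reporting an undefined ratio as 0.0 are assumptions of this example, and the decision threshold used to obtain hard predictions in the study is not specified here.

```python
import math


def _ratio(num, den):
    # Convention for this sketch only: a ratio with zero denominator is reported as 0.0.
    return num / den if den else 0.0


def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for labels coded 1 (pathogenic) and 0 (benign)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    precision = _ratio(tp, tp + fp)
    sensitivity = _ratio(tp, tp + fn)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": _ratio(tn, tn + fp),
        "precision": precision,
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
        # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
        "mcc": _ratio(tp * tn - fp * fn,
                      math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }
```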

Statistical comparison of ROC–AUC values using the DeLong test confirmed that the LightGBM model achieved significantly higher discriminative performance than individual annotation-based predictors such as REVEL and CADD (p < 0.001) [16,26]. The comparative ROC curves of all evaluated models are shown in Figure 2, while precision–recall curves for the top-performing ensemble models are presented in Figure 3 [26].

Although LightGBM achieved the highest point estimate for ROC–AUC among the evaluated models, the confidence intervals of the top-performing ensemble classifiers partially overlap. This suggests that several gradient-boosting approaches provide comparably strong performance when integrating heterogeneous functional annotations, with LightGBM demonstrating the most consistent cross-validation stability across folds.
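The confidence intervals discussed above were estimated by bootstrap resampling. A minimal percentile-bootstrap sketch follows; the number of replicates, the seed, and the function names are assumptions, and the pairwise AUC used here is quadratic in sample size, so it is suited to illustration rather than to the full test set.

```python
import random


def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute ROC-AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower bound, upper bound) of a percentile bootstrap CI."""
    point = roc_auc(labels, scores)  # also validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample variants with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip degenerate resamples containing a single class
        stats.append(roc_auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int(round((alpha / 2) * (n_boot - 1)))]
    hi = stats[int(round((1 - alpha / 2) * (n_boot - 1)))]
    return point, lo, hi
```

Overlapping intervals obtained this way for two models do not by themselves establish that the models are equivalent; a paired comparison on the same resamples would be needed for that.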

2.4. Feature Importance Analysis

Feature importance analysis for the final PathoPredictor model (LightGBM) was conducted using gain-based metrics (LightGBM importance_type = “gain”) to quantify the relative contribution of individual predictors to classification performance [25]. The most influential features included the REVEL score and CADD_phred score, followed by evolutionary conservation metrics such as phyloP100way_vertebrate and GERP++_RS [15,16,23,32].

Additional high-ranking predictors included protein impact features derived from amino acid substitution properties (e.g., Grantham distance) and splicing-related annotations obtained from dbscSNV, including the AdaBoost and Random Forest ensemble scores [23,33]. To complement model-specific importance measures, global SHAP (SHapley Additive exPlanations) values were computed for the final PathoPredictor model (LightGBM) [27]. SHAP-based rankings demonstrated consistent feature importance patterns, with REVEL, CADD_phred, and conservation-based predictors contributing most prominently to model predictions across test samples [15,16,27,32]. Global SHAP summary plots illustrating feature importance and effect directionality are presented in Figure 4 [22,27].

Notably, several of the top-ranked features correspond to widely used meta-predictors such as REVEL and CADD. Because these predictors integrate information from multiple annotation sources and have historically been evaluated using ClinVar-related datasets, their strong contribution to model predictions likely reflects the aggregation of previously established pathogenicity signals rather than entirely independent predictive discovery [15,16,18,31,32] (Table 2).

An illustrative case-level analysis was conducted for a representative missense variant present in the independent test set (BRCA1 p.Val1736Ala; ClinVar Variation ID: 37648) [3]. The variant was assigned a high predicted probability of pathogenicity by the final PathoPredictor model (LightGBM) [25]. SHAP-based local interpretability analysis indicated that this prediction was primarily influenced by elevated CADD_phred scores [15,32], high phyloP100way conservation values [23], and damaging functional impact scores from SIFT and PolyPhen-2 [12,13]. Global SHAP summary plots illustrating feature influence across all variants are shown in Figure 5 [22,27]. An example SHAP decision plot demonstrating variant-level interpretability is presented in Figure 6 [22,27].
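To make the variant-level attributions concrete, the sketch below computes exact Shapley values for a single instance by enumerating feature subsets. It is not the SHAP tree algorithm applied in the study, and it scales exponentially with the number of features; the toy scoring function, its coefficients, and the feature values are hypothetical and chosen only so that the result is easy to check by hand.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance; absent features take baseline values."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n! for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_margin(z):
    """Hypothetical raw score over (CADD_phred, phyloP100way, SIFT_score)."""
    cadd, phylop, sift = z
    return 0.02 * cadd + 0.05 * phylop + 0.01 * cadd * phylop - 0.3 * sift


x = [30.0, 7.5, 0.0]          # hypothetical variant: high CADD, conserved, damaging SIFT
baseline = [10.0, 0.5, 0.5]   # hypothetical reference values
phi = shapley_values(toy_margin, x, baseline)
```

The attributions sum to the difference between the score for the variant and the score for the baseline, which is the additivity property that a SHAP decision plot visualizes.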

2.5. Temporal Validation on Independent ClinVar Submissions

To assess model stability over time, we conducted temporal validation using an independent set of missense variants submitted to ClinVar after the November 2023 release. These variants were not included in model training, feature selection, or hyperparameter optimization [18,31]. Because the validation dataset originates from the same database ecosystem, this experiment represents temporal validation rather than fully independent external validation.

On this temporally independent dataset, the final PathoPredictor model (LightGBM) achieved an ROC-AUC of 0.91 (95% CI: 0.90–0.92) and an overall accuracy of 0.87 [25,26]. Model calibration was additionally evaluated on the temporal validation dataset. Calibration curves demonstrated close agreement between predicted probabilities and observed outcome frequencies, with a corresponding Brier score of 0.11 [20,26,34]. Although this evaluation does not constitute fully independent external validation, it provides an initial assessment of model stability on temporally separated expert-reviewed submissions [18,31].
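The calibration assessment above rests on two quantities: the Brier score, which is the mean squared difference between predicted probability and observed outcome, and a binned calibration curve. A minimal sketch is shown below; the equal-width binning and the number of bins are assumptions, since the binning scheme used for the published calibration curves is not stated.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_bins(y_true, y_prob, n_bins=10):
    """Return (mean predicted probability, observed positive fraction, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[k].append((p, y))
    curve = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            frac_pos = sum(y for _, y in members) / len(members)
            curve.append((mean_p, frac_pos, len(members)))
    return curve
```

A well-calibrated model yields bins whose mean predicted probability is close to the observed positive fraction. As a point of reference, a constant prediction of 0.5 gives a Brier score of 0.25 regardless of the labels.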

3. Discussion

Despite major advances in sequencing technologies, interpreting the functional impact of missense variants remains a persistent challenge in human genomics [1,35,36]. Numerous computational predictors have been proposed, yet their performance often varies across genes, variant classes, and evaluation datasets, and many still lack transparent interpretability [18,19,20,22]. To address these challenges, we developed PathoPredictor, an interpretable machine-learning framework that integrates functional and evolutionary annotations from dbNSFP to classify pathogenic and benign missense variants [3,4,7,9,23]. The final LightGBM model demonstrated strong discriminative performance within the evaluated dataset, achieving an ROC–AUC of 0.93 on the independent test set and outperforming individual annotation-based predictors assessed under the same evaluation framework [24,25,26].

These findings are consistent with previous studies showing that gradient-boosting algorithms are well suited for modeling nonlinear relationships and complex feature interactions in high-dimensional genomic datasets [24,25,26]. Feature importance and SHAP-based interpretability analyses further indicated that predictors such as CADD_phred, phyloP100way_vertebrate, SIFT_score, MutationTaster_score, and PolyPhen2_HVAR_score contributed substantially to classification outcomes [12,13,14,15,23,27]. These results reinforce existing evidence that combining complementary computational signals—particularly evolutionary conservation and predicted functional impact—can improve discrimination between pathogenic and benign missense variants within curated clinical datasets [17,19,26].

An important advantage of PathoPredictor is its emphasis on interpretability. While many high-performing predictive models function as opaque “black boxes,” clinical genomics workflows increasingly require transparent and explainable decision processes [8,22]. By incorporating SHAP-based analyses, PathoPredictor provides both global feature attribution and local, variant-level explanations, enabling users to examine how individual annotations influence model predictions [22,27]. Such interpretability may facilitate expert review during variant interpretation and may support more transparent integration of computational evidence in genomic analyses [4,37].

The performance of PathoPredictor also highlights the importance of high-quality training data [4,7,9]. By restricting model development to expert-reviewed ClinVar variants and excluding entries with conflicting interpretations, we sought to reduce label noise and improve model stability [3,4,7,9,18]. In addition, the use of dbNSFP enabled the integration of diverse annotation categories, including conservation metrics and meta-predictor scores, thereby providing a comprehensive feature space for classification [15,16,17,23,32].

Nevertheless, several limitations should be considered when interpreting these findings. First, the model relies on precomputed annotations derived from existing prediction tools such as SIFT and PolyPhen-2 [12,13]. Updates to these upstream resources may require periodic retraining of the model to maintain optimal performance. Second, although the curated dataset included expert-validated variants, it represented only a subset of all possible missense substitutions in the human genome [1,35]. Variants located in underrepresented genes or genomic regions may therefore be predicted with reduced confidence [10,36]. Third, the current framework focuses exclusively on missense SNVs and does not incorporate additional sources of biological evidence, such as protein structural modeling or transcript-level expression data, which may further improve predictive accuracy [38,39,40].

Future work could extend this framework by incorporating additional functional and structural features, including protein–protein interaction networks, splicing predictions, and three-dimensional structural data generated by tools such as AlphaFold [30,33,38]. Moreover, validation using fully independent datasets, such as those derived from clinical laboratory variant curation efforts, will be necessary to establish broader generalizability beyond the ClinVar-derived training and validation framework [18,31]. The development of a user-accessible interface may also facilitate integration of the model into variant prioritization and genomics research workflows [8].

Overall, PathoPredictor provides an interpretable machine-learning framework for prioritizing human missense variants using integrated functional and evolutionary annotations [23,26]. By combining expert-curated training data, ensemble learning strategies, and SHAP-based explainability, the proposed approach offers a complementary computational method for variant assessment in translational genomics research [22,24,25,27].

3.1. Clinical Significance of PathoPredictor

The potential clinical relevance of PathoPredictor lies in its ability to provide computational evidence that may assist variant interpretation within the American College of Medical Genetics and Genomics/Association for Molecular Pathology (ACMG/AMP) framework. In particular, the integrated predictive outputs generated by the model may contribute supportive computational evidence relevant to criteria such as PP3 (“multiple computational lines of evidence support pathogenicity”) and BP4 (“multiple computational lines of evidence support a benign impact”) when used alongside other established lines of evidence [4,16,34].

Unlike many opaque predictive models, PathoPredictor incorporates SHAP-based interpretability analyses that provide transparent, feature-level explanations for individual variant predictions [22,27]. These explanations may assist molecular geneticists and clinical laboratory specialists in evaluating the computational evidence underlying a given classification. This approach is consistent with recent recommendations from the Clinical Genome Resource (ClinGen) Sequence Variant Interpretation Working Group, which emphasize the importance of transparent and reproducible computational methods in the interpretation of variants of uncertain significance (VUS) [4,9,34,41].

3.2. Comparison with State-of-the-Art Predictors

PathoPredictor was developed within the context of existing computational tools for missense variant interpretation, including ensemble predictors such as REVEL, MetaSVM, and CADD, as well as more recent deep learning–based approaches such as PrimateAI-2.0 and EVE [15,16,17,29,30,32].

Within our evaluation framework, PathoPredictor demonstrated improved discriminative performance compared with individual annotation-based predictors assessed on the same independent dataset, including REVEL and CADD (Section 2.2) [15,16,32]. In addition to predictive accuracy, the proposed framework incorporates several methodological features designed to support transparency and reproducibility, including SHAP-based interpretability [22,27], restriction of training data to expert-reviewed ClinVar submissions [3], and integration of diverse functional and evolutionary annotations from dbNSFP v5.1 [23].

However, comparisons between PathoPredictor and individual annotation-based predictors such as REVEL or CADD should be interpreted cautiously. Because these predictors are included as input features within the model, the comparison is inherently asymmetric. Consequently, improved discriminative performance primarily reflects the structural advantage of integrating multiple complementary annotations within an ensemble learning framework, rather than demonstrating independent predictive superiority.

Recent deep learning approaches have reported strong performance in variant pathogenicity prediction, but these models often rely on large-scale training datasets or sequence-based embeddings that may limit the interpretability of individual predictions [28,29,30]. In contrast, PathoPredictor provides feature-level attribution through SHAP analysis, enabling detailed examination of how specific annotations influence model predictions [22,27]. These characteristics position PathoPredictor as a complementary approach that combines ensemble learning with interpretable model outputs within a reproducible computational pipeline [20,26].

3.3. Limitations

Despite its promising performance, PathoPredictor has several limitations that should be considered. First, the current framework is restricted to missense single nucleotide variants (SNVs) and does not address other classes of potentially pathogenic variation, including insertions/deletions (indels), splice-altering variants, or structural variants, which may also contribute significantly to disease phenotypes [1,11].

Second, model development relied on variants curated in ClinVar. Although the dataset was restricted to expert-reviewed submissions to improve label reliability, ClinVar is known to be enriched for well-studied disease-associated genes. Consequently, variants located in underrepresented genes or populations may be predicted with reduced confidence [3,10,36,42].

Third, several annotation features used for model training are derived from existing computational predictors, including SIFT, PolyPhen-2, CADD, and REVEL. Because some of these meta-predictors were originally trained or benchmarked using datasets that partially overlap with ClinVar-related evidence, a degree of circularity between training signals and evaluation cannot be completely excluded [18,31]. As a result, the reported predictive performance may partially reflect inherited signals from previously developed predictors rather than entirely independent predictive discovery. The results should therefore be interpreted primarily as demonstrating the effectiveness of integrating complementary annotation sources within an interpretable machine-learning framework rather than as evidence of entirely novel pathogenicity signals. Future work should evaluate the framework using variant datasets curated independently from the ClinVar ecosystem, including prospective clinical sequencing cohorts or laboratory-curated variant repositories [15,16,18,31,32].

Finally, although SHAP-based explanations provide valuable insight into feature contributions, interpretation of these explanations in complex clinical contexts may still require domain expertise. Accordingly, these explanations should be considered supportive rather than definitive evidence in variant classification workflows [22,27].

3.4. Future Directions

Several directions may further enhance the predictive scope of PathoPredictor. Integration of protein structural representations, including those generated by tools such as AlphaFold2, may provide additional structural context for variants located within conformationally constrained protein domains [38]. Incorporating transcript-specific expression data and tissue-resolved functional annotations could also refine predictions for disorders with strong tissue-specific effects [30,39,43].

In addition, retraining or augmenting the model using population-scale datasets such as gnomAD may help reduce ancestry-related biases present in clinically curated datasets [2,10]. Expanding the framework to include additional classes of genomic variation including splice-altering variants, regulatory variants, and small insertions or deletions represents another promising direction for future development [23,44].

4. Methods

4.1. Data Sources

4.1.1. ClinVar Variant Dataset

Pathogenic and benign missense single nucleotide variants (SNVs) were extracted from the November 2023 release of ClinVar [3]. Variant coordinates were aligned to the GRCh37/hg19 reference genome build to ensure compatibility with downstream annotation using dbNSFP v5.1 [23]. ClinVar provides standardized clinical significance classifications submitted by clinical laboratories, expert panels, and research groups, following the American College of Medical Genetics and Genomics/Association for Molecular Pathology (ACMG/AMP) guidelines for sequence variant interpretation [4]. While ClinVar represents a curated and structured resource suitable for supervised learning applications [3], prior studies have documented potential gene-level enrichment, reporting bias, and differences in submission practices that may influence dataset composition [9,10].

Only variants annotated with a molecular consequence of “missense variant” were retained, consistent with established approaches in computational pathogenicity prediction studies [18,19]. To ensure high-confidence labels, we restricted the dataset to entries with strong review status, specifically those labeled as “criteria provided, multiple submitters, no conflicts”, “reviewed by expert panel”, or “practice guideline”.

This filtering strategy aligns with recommendations for minimizing label noise in clinical variant benchmarking datasets [4,18]. Variants classified as “Pathogenic” or “Likely Pathogenic” were grouped into a single “Pathogenic” class, while those labeled “Benign” or “Likely Benign” were merged into a single “Benign” class, following ACMG/AMP evidence aggregation principles [4,34].

Variants annotated as “Uncertain significance,” “Conflicting interpretations of pathogenicity,” or lacking sufficient review status were excluded to reduce ambiguity in supervised model training [3,18]. Somatic variants and entries with inconsistent genomic coordinates were also removed to ensure consistency with germline-focused predictive frameworks [4]. Duplicate records and multi-allelic entries were consolidated to ensure that each variant was uniquely represented in the final dataset, consistent with best practices in genomic dataset harmonization [3,23]. After applying all filtering criteria, a total of 59,302 high-confidence missense variants were retained for downstream analysis.
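The filtering and label-merging rules described above can be summarized in a short sketch. The record layout (a dict per ClinVar entry) and all field names are hypothetical and do not correspond to actual ClinVar column names; the review-status strings are those quoted in the text, and the handling of somatic entries and inconsistent coordinates is omitted for brevity.

```python
# Review statuses accepted as high confidence, as quoted in the text.
ACCEPTED_REVIEW = {
    "criteria provided, multiple submitters, no conflicts",
    "reviewed by expert panel",
    "practice guideline",
}

# P/LP merged into class 1 and B/LB into class 0; anything else is excluded.
LABELS = {
    "pathogenic": 1,
    "likely pathogenic": 1,
    "pathogenic/likely pathogenic": 1,
    "benign": 0,
    "likely benign": 0,
    "benign/likely benign": 0,
}


def curate(records):
    """Keep expert-reviewed missense variants with a binary label, one row per variant."""
    seen = set()
    kept = []
    for rec in records:
        if rec["molecular_consequence"] != "missense variant":
            continue
        if rec["review_status"] not in ACCEPTED_REVIEW:
            continue
        label = LABELS.get(rec["clinical_significance"].strip().lower())
        if label is None:  # VUS, conflicting interpretations, and other values
            continue
        key = (rec["chrom"], rec["pos"], rec["ref"], rec["alt"])
        if key in seen:  # drop duplicate records of the same variant
            continue
        seen.add(key)
        kept.append({**rec, "label": label})
    return kept
```

Keeping the first record for a duplicated variant is a simplification; the text states only that duplicates and multi-allelic entries were consolidated.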

To ensure methodological transparency and prevent information leakage, the analytical workflow followed a predefined sequence of steps consisting of dataset construction, preprocessing, train–test separation, feature selection conducted exclusively within the training data, model training with cross-validation-based hyperparameter optimization, independent hold-out evaluation, and temporal validation on newly submitted ClinVar variants (Figure 1). This pipeline was implemented prior to model evaluation to maintain strict separation between training and testing procedures [18,31].

4.1.2. Functional and Evolutionary Annotations (dbNSFP v5.1)

The retained missense variants were annotated using dbNSFP v5.1 (GRCh37), a comprehensive resource that aggregates functional predictions, evolutionary conservation metrics, and ensemble pathogenicity scores for all possible nonsynonymous single nucleotide variants in the human genome [23]. The database was accessed in its GRCh37-aligned release to ensure compatibility with the ClinVar dataset and consistency of genomic coordinate mapping [3,23].

For each variant, annotations were retrieved by matching genomic chromosome, position, reference allele, and alternate allele to ensure precise alignment. Variants were normalized to a consistent representation prior to matching to avoid coordinate mismatches following established best practices for variant harmonization in large-scale genomic analyses [3,23].
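As a rough sketch of the exact-match annotation step, the join below uses `pandas.DataFrame.merge` with `indicator=True` so that variants without a dbNSFP entry can be counted and inspected. The column names and the light normalization (stripping a `chr` prefix, upper-casing alleles) are illustrative assumptions; the normalization actually applied in the study is not specified beyond "a consistent representation".

```python
import pandas as pd

KEY = ["chrom", "pos", "ref", "alt"]  # hypothetical column names


def annotate(clinvar: pd.DataFrame, dbnsfp: pd.DataFrame):
    """Left-join annotations onto variants; return (matched, unmatched)."""
    left = clinvar.copy()
    right = dbnsfp.copy()
    for frame in (left, right):
        # Harmonise representation so "chr1"/"1" and "a"/"A" compare equal.
        frame["chrom"] = frame["chrom"].astype(str).str.replace("^chr", "", regex=True)
        frame["pos"] = frame["pos"].astype(int)
        frame["ref"] = frame["ref"].str.upper()
        frame["alt"] = frame["alt"].str.upper()
    merged = left.merge(
        right, on=KEY, how="left", indicator=True, validate="many_to_one"
    )
    matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
    unmatched = merged.loc[merged["_merge"] == "left_only", KEY]
    return matched, unmatched
```

The `validate="many_to_one"` argument raises an error if the annotation table contains duplicate keys, which is a cheap guard against silently multiplying rows during the join.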

The extracted annotations encompassed several categories of functional and evolutionary features. Evolutionary conservation metrics included phyloP100way_vertebrate, phyloP30way_mammalian, phastCons100way, and GERP++ RS scores. These measures quantify evolutionary constraint across multiple species and have been widely used to infer the functional importance of genomic positions [23,33]. Positions that are strongly conserved across vertebrate or mammalian lineages are typically under functional constraint, so substitutions at these positions are more likely to be deleterious to protein function or to contribute to disease susceptibility [23,33].

Functional impact predictors were also incorporated to estimate the potential effect of amino acid substitutions on protein function. These included SIFT, PolyPhen-2 (HDIV and HVAR models), MutationTaster, MutationAssessor, CADD-Splice, and PROVEAN scores [12,13,14,17,45,46]. These predictors evaluate amino acid substitutions using combinations of sequence homology, evolutionary conservation, protein structural context, and biochemical properties of amino acids, providing complementary perspectives on the potential functional consequences of missense variants [47].

In addition to individual predictors, several integrated meta-predictors were included, namely CADD (raw and PHRED-scaled scores), FATHMM, LRT, MetaSVM, and MetaLR [15,16,17,32]. These methods combine multiple functional annotations and conservation signals into unified probabilistic or ranking-based scores, which have been shown to improve discriminatory performance compared with individual predictors alone [15,16,17,32]. Such integrative approaches are particularly useful when modeling heterogeneous biological signals across genes and variant classes.

To further capture biochemical context, features describing amino acid substitution properties, including Grantham distance and other physicochemical changes, were incorporated when available. Additionally, UniProt functional domain annotations were included to provide protein domain-level context for missense substitutions. Previous studies have shown that incorporating domain-level information can improve interpretation of variants located within conserved protein regions or functional motifs [23,39].

When multiple transcript-specific annotations were available for a given variant, the canonical transcript annotation was selected whenever defined. In cases where no canonical transcript was specified, the first listed transcript entry was retained to ensure consistent annotation across variants. Transcript-aware harmonization is essential because predicted pathogenicity may vary substantially depending on transcript context and isoform-specific protein structure [23,38].
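The transcript-selection rule can be expressed as a small helper. This is a sketch under the assumption that the per-transcript scores and a parallel list of canonical flags have already been parsed from the annotation record; the function name and input layout are hypothetical.

```python
from typing import Optional, Sequence


def pick_transcript_value(
    values: Sequence[Optional[float]],
    canonical_flags: Sequence[bool],
) -> Optional[float]:
    """Return the score of the canonical transcript, else the first listed one."""
    if len(values) != len(canonical_flags):
        raise ValueError("values and canonical_flags must have the same length")
    for value, is_canonical in zip(values, canonical_flags):
        if is_canonical:
            return value
    # No canonical transcript defined: fall back to the first listed entry.
    return values[0] if values else None
```

For example, `pick_transcript_value([0.1, 0.9], [False, True])` selects the second score, while `pick_transcript_value([0.1, 0.9], [False, False])` falls back to the first.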

Finally, it should be noted that some meta-predictors included in dbNSFP, such as CADD and REVEL, were originally trained or benchmarked using datasets that partially overlap with ClinVar-derived evidence [15,16,32]. This introduces the possibility of partial circularity in performance evaluation, a challenge that has been widely discussed in benchmarking studies of variant pathogenicity predictors [18,31]. Accordingly, this limitation is considered when interpreting model performance and comparative evaluation results in the present study.

4.2. Data Curation and Preprocessing

4.2.1. Variant Filtering

Following annotation with dbNSFP v5.1, variants underwent a multi-stage filtering process to ensure data quality, feature completeness, and coordinate consistency [23]. Rigorous dataset curation is essential in supervised genomic modeling to minimize label noise and prevent biased performance estimation [4,18].

Variants lacking any of the essential functional predictors (CADD, SIFT, or PolyPhen-2) were excluded, as these predictors constitute widely validated core features in computational pathogenicity assessment frameworks [12,13,15,18,19,32]. Additionally, variants missing more than 40% of the selected annotation features were removed. The 40% threshold was chosen to balance the retention of informative variants against the risks of excessive imputation and feature sparsity, which may adversely affect model stability and generalization [23,25]. Post-filtering assessment confirmed that this threshold preserved class balance and broad gene representation across the dataset [3].
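A minimal pandas sketch of this completeness filter is given below. The core column names are assumptions chosen to resemble dbNSFP field names, and the rule is read as "drop a variant if any core predictor is missing, or if more than 40% of the selected features are missing".

```python
import pandas as pd

# Assumed names for the three core predictors.
CORE = ["CADD_phred", "SIFT_score", "PolyPhen2_HVAR_score"]


def completeness_filter(
    df: pd.DataFrame, feature_cols: list, max_missing: float = 0.40
) -> pd.DataFrame:
    """Keep rows with all core predictors and at most max_missing missing features."""
    has_core = df[CORE].notna().all(axis=1)
    frac_missing = df[feature_cols].isna().mean(axis=1)
    return df.loc[has_core & (frac_missing <= max_missing)].reset_index(drop=True)
```

Because the filter looks only at feature availability and never at the outcome label, it can be applied before the train-test split without leaking label information.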

Duplicate entries were identified based on identical chromosome, genomic position, reference allele, and alternate allele and were collapsed into a single unique record, consistent with established best practices in genomic dataset harmonization [3,23]. Variants with inconsistent genomic coordinates, ambiguous allele representation, or mapping to alternative contigs were excluded to prevent annotation mismatches and alignment artifacts that may introduce systematic bias [10,23]. Only variants located on canonical chromosomes (1–22, X, Y) were retained to ensure uniform genomic representation and compatibility with downstream annotation resources [23].

Filtering was performed prior to train–test splitting and was based exclusively on annotation completeness and coordinate consistency, independent of outcome labels, to prevent information leakage between training and evaluation datasets [18,31]. Avoiding leakage is particularly critical in variant effect prediction, where subtle correlations between features and labels may otherwise inflate performance estimates [18,31].

After merging ClinVar missense variants with dbNSFP annotations using exact genomic coordinate and allele matching, 3115 variants could not be aligned due to annotation incompatibility or missing database entries and were removed. Similar discrepancies between annotation resources have been reported in previous benchmarking studies and are typically attributable to transcript updates, assembly differences, or variant normalization inconsistencies [19,23].

This systematic filtering process produced a high-confidence dataset suitable for supervised learning and reproducible model development. A summary of dataset composition and filtering steps is presented in Table 3.

The final dataset consisted of approximately balanced classes (48% pathogenic, 52% benign). Train–test splitting was performed using stratified sampling to preserve class distribution across subsets.

4.2.2. Handling Missing Values

After filtering, the remaining dataset contained a limited proportion of missing values across selected features. Feature-wise missingness was below 5% for all retained predictors [23]. To address incomplete entries, numerical features were imputed using the median value computed from the training set [25]. For categorical features, ordinal encoding was applied prior to imputation, and missing values were replaced using the mode of the corresponding feature within the training data [23,25].

To prevent information leakage, imputation parameters were estimated exclusively on the training set and subsequently applied to the held-out test set using the same fitted imputer [18,31]. All imputation procedures were implemented using the SimpleImputer [48] class within the scikit-learn framework. Preprocessing steps were incorporated into a reproducible pipeline to ensure consistent application during model training and evaluation [18,31].

4.2.3. Feature Normalization

The annotation set comprised numerical predictors with heterogeneous value ranges and distributional properties, including probability-based conservation metrics (bounded between 0 and 1) and Phred-scaled meta-scores (e.g., CADD_phred) with extended heavy-tailed distributions [15,23,32]. To improve numerical stability and ensure comparability across predictors, feature scaling was applied prior to model training [24,25].

Distributional characteristics were assessed using summary statistics and visual inspection of histograms. Features exhibiting approximately symmetric, near-Gaussian distributions were standardized using Z-score [48] normalization (mean subtraction and division by standard deviation). In contrast, features with heavy-tailed or skewed distributions, such as CADD_phred, were transformed using robust scaling based on median centering and interquartile range (IQR) scaling to reduce sensitivity to outliers.
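The two scaling rules can be combined in a single transformer, sketched here with scikit-learn. Which columns count as near-Gaussian or heavy-tailed was decided by inspection in the study; the assignment below (column 0 standardized, column 1 robust-scaled) and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, StandardScaler

scaler = ColumnTransformer([
    ("zscore", StandardScaler(), [0]),   # (x - mean) / standard deviation
    ("robust", RobustScaler(), [1]),     # (x - median) / interquartile range
])

X_train = np.array([
    [1.0, 1.0],
    [2.0, 2.0],
    [3.0, 3.0],
    [4.0, 4.0],
    [5.0, 100.0],   # outlier in the heavy-tailed column
])
X_test = np.array([[5.0, 4.0]])

scaler.fit(X_train)                      # parameters come from training data only
X_test_scaled = scaler.transform(X_test)
```

In the second column the outlier value of 100 barely affects the scaling, because the median (3) and the IQR (2) are unchanged by how extreme the largest value is.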

Normalization parameters were estimated exclusively from the training set and subsequently applied to the test set using the same fitted transformers, thereby preventing information leakage [18,31]. All preprocessing steps were implemented within a reproducible pipeline using the scikit-learn preprocessing utilities to ensure consistent application during cross-validation and final evaluation. Although tree-based models are generally invariant to monotonic feature scaling, normalization was retained to maintain methodological consistency across comparative experiments and to support potential extension to scale-sensitive algorithms [24,25].

4.3. Feature Selection and Engineering

Because the dbNSFP annotation framework provides more than forty potential predictive features, many of which are correlated or partially redundant [23], a structured multi-stage feature-selection strategy was implemented [18,24,25]. All feature selection procedures were conducted exclusively within the training dataset prior to model evaluation to prevent information leakage into the independent test set [18,31].

4.3.1. Correlation Filtering

Pairwise Pearson correlation coefficients were computed among numerical predictors using only the training set [48]. When two features exhibited a correlation coefficient greater than 0.95, the feature with lower univariate predictive relevance (measured by mutual information with the outcome) was removed [22]. This step reduced redundancy while preserving the most informative representation of correlated feature groups [18,23].
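One way to implement this pruning rule is sketched below. It assumes the features are already imputed, compares the absolute value of the Pearson coefficient with the threshold (the text does not state whether the sign is ignored), and resolves each pair greedily in column order; the function name is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def drop_correlated(X: pd.DataFrame, y, threshold: float = 0.95, seed: int = 0) -> list:
    """Return the columns kept after pruning highly correlated pairs."""
    corr = X.corr(method="pearson").abs()
    # Univariate relevance of each feature with respect to the label.
    mi = pd.Series(
        mutual_info_classif(X.values, y, random_state=seed), index=X.columns
    )
    dropped = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > threshold:
                # Remove the member of the pair with lower mutual information.
                dropped.add(a if mi[a] < mi[b] else b)
    return [c for c in cols if c not in dropped]
```

The function should be called on training rows only, so that neither the correlation matrix nor the mutual-information scores see the held-out labels.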

4.3.2. Variance Thresholding

Low-variance predictors were removed using a variance threshold of 0.01 (computed on the training set), eliminating features with minimal discriminatory power across samples.

4.3.3. Model-Based Feature Selection

To further refine the feature set, model-based importance scores were computed using a Random Forest classifier trained exclusively on the training subset. Feature importance was averaged across cross-validation folds to improve stability [24,25]. Additionally, preliminary SHAP (SHapley Additive exPlanations) analyses were performed to assess consistency of feature contributions across folds [27]. Only features consistently ranked among the top contributors across folds were retained for final modeling [22,27]. All feature selection steps were conducted within the training data only, and the selected feature subset was subsequently applied to the held-out test set to prevent information leakage [18,31].
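Fold-averaged importance can be sketched as follows. The synthetic data, the number of trees, and the function name are assumptions for illustration; the study's own Random Forest settings and its rule for "consistently ranked among the top contributors" are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold


def fold_averaged_importance(X, y, n_splits=5, seed=0):
    """Average impurity-based importances over the training part of each fold."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    per_fold = []
    for train_idx, _ in cv.split(X, y):
        rf = RandomForestClassifier(n_estimators=100, random_state=seed)
        rf.fit(X[train_idx], y[train_idx])
        per_fold.append(rf.feature_importances_)
    return np.mean(per_fold, axis=0)


# Synthetic stand-in for the training set: with shuffle=False the two
# informative features are columns 0 and 1, the rest are noise.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=0,
)
importance = fold_averaged_importance(X_demo, y_demo)
ranking = np.argsort(importance)[::-1]   # feature indices, most important first
```

Averaging over folds damps the run-to-run variation of a single forest, which makes a top-k cutoff less sensitive to one particular resampling of the training data.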

This procedure produced a final set of 28 predictors representing complementary sources of biological evidence. The retained features included functional impact predictors such as CADD_phred, SIFT_score, PolyPhen2_HVAR_score, and MutationTaster_score [12,13,14,15,32], evolutionary conservation metrics including phyloP100way_vertebrate and GERP++_RS [23,33], and several integrated meta-predictors that combine multiple functional annotations [15,16,17,32]. Additional retained features captured structural and biochemical properties of amino acid substitutions, splice-related prediction scores [23,44,48], and population-level allele frequency measures derived from large-scale sequencing datasets [2,35]. Collectively, these predictors represent diverse biological signals related to evolutionary constraint, predicted functional impact on protein structure, and patterns of population-level genetic variation, which have all been shown to contribute to the interpretation of missense variant pathogenicity [4,23,33]. The categories of retained features are summarized in Table 4.

4.4. Machine Learning Model Development

4.4.1. Models Evaluated

To assess predictive performance across diverse learning paradigms, six supervised machine-learning algorithms were evaluated [24,25,48]. These models are widely applied in genomic variant classification and collectively represent linear, kernel-based, instance-based, and ensemble learning approaches capable of modeling both simple and complex relationships in high-dimensional biological datasets [19,26]. The evaluated algorithms included Logistic Regression, Support Vector Machine (SVM), Random Forest, Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and k-Nearest Neighbors (kNN) [24,25,48].

Logistic Regression was used as a linear baseline model to estimate the probability of pathogenicity based on a weighted combination of input features [48]. The Support Vector Machine (SVM) with a radial basis function (RBF) kernel was included to capture potential nonlinear decision boundaries in the feature space [48]. The k-Nearest Neighbors (kNN) algorithm was selected as a distance-based classifier that assigns labels based on the similarity of each variant to its nearest neighbors in the multidimensional feature space, allowing the model to capture local structural patterns in the data [48].

Tree-based ensemble methods were incorporated to model nonlinear feature interactions and hierarchical relationships among predictors. Random Forest constructs multiple decision trees using bootstrap aggregation to improve prediction stability and reduce overfitting [48]. Gradient boosting frameworks, including XGBoost and LightGBM, iteratively build ensembles of decision trees that correct errors made by previous trees and optimize predictive performance [24,25]. Both XGBoost and LightGBM were evaluated because their distinct tree-growing strategies and optimization mechanisms may influence predictive performance in high-dimensional biological datasets [24,25].

All models were implemented using the scikit-learn framework, with gradient boosting models implemented via XGBoost and LightGBM Python APIs [24,25,48]. To ensure fair comparison, all models were trained using the same training–test split, identical preprocessing pipelines (imputation, normalization, and feature selection), and stratified sampling to preserve class distribution [24,25,48].

4.4.2. Training and Validation Strategy

The curated dataset was divided into an 80% training set and a 20% independent hold-out test set using stratified sampling to preserve the original class distribution of pathogenic and benign variants [20,48]. The test set was strictly reserved for final model evaluation and was not used during feature selection, preprocessing parameter estimation, or hyperparameter optimization [18,31].
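A stratified 80/20 split of this kind can be sketched with scikit-learn's `train_test_split`. The random feature matrix and the seed value are placeholders; only the 48%/52% class proportions are taken from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))            # placeholder feature matrix
y = np.array([1] * 480 + [0] * 520)       # 48% pathogenic, 52% benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
```

With `stratify=y`, both subsets keep approximately the same pathogenic fraction as the full dataset, so metrics computed on the hold-out set are not distorted by a shifted class ratio.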

Hyperparameter tuning was conducted exclusively within the training set using 5-fold stratified cross-validation [20,26]. For models with relatively small hyperparameter spaces (Logistic Regression, kNN, Random Forest), grid search was applied [48]. For more complex gradient-boosting models (XGBoost and LightGBM), Bayesian optimization was employed to efficiently explore the hyperparameter space while minimizing computational cost [24,25]. The primary optimization metric during cross-validation was the area under the receiver operating characteristic curve (ROC-AUC), given its robustness to class imbalance [19,26].

Class imbalance was addressed using built-in class-weight adjustments where supported (Logistic Regression, SVM, Random Forest, and LightGBM) [25,48]. For XGBoost, the scale_pos_weight parameter was tuned during optimization [24]. Because the filtered ClinVar dataset exhibited near-balanced class proportions (approximately 48% pathogenic and 52% benign), synthetic oversampling techniques such as SMOTE were evaluated preliminarily but ultimately not applied, as they did not yield measurable improvement in cross-validation performance and introduced additional variance [3]. All preprocessing steps, feature selection procedures, and model fitting operations were integrated into a unified pipeline to ensure reproducibility and to prevent information leakage between cross-validation folds [18,22,31].

4.4.3. Hyperparameter Optimization

Hyperparameter optimization was performed to maximize predictive performance while minimizing the risk of overfitting [24,25]. All optimization procedures were conducted exclusively within the training dataset using 5-fold stratified cross-validation, with ROC–AUC used as the primary objective function for model selection [20,26]. This approach ensured that hyperparameter tuning was performed without exposure to the independent test data, thereby maintaining a strict separation between model training and final evaluation.

For the gradient boosting models XGBoost and LightGBM, Bayesian optimization was applied to efficiently explore the hyperparameter space and identify parameter configurations associated with improved predictive performance [24,25]. The optimized parameters included the number of boosting iterations (n_estimators), maximum tree depth (max_depth), learning rate, subsampling ratio (subsample), column sampling ratio (colsample_bytree or feature_fraction), minimum child weight or minimum number of samples required in leaf nodes, and L1 and L2 regularization coefficients. Early stopping was incorporated during the optimization process to prevent overfitting, with model performance monitored across cross-validation folds [24,25].

For the Random Forest model, hyperparameter tuning was conducted using grid search [48]. The explored parameters included the number of trees (n_estimators), maximum tree depth, the maximum number of features considered at each split, and the minimum number of samples required for node splitting and leaf node formation. For the Support Vector Machine (SVM) classifier, the regularization parameter (C) and kernel coefficient (gamma) were optimized to control model complexity and decision boundary flexibility [48].

Hyperparameter optimization for the k-Nearest Neighbors (kNN) model focused on the number of neighbors (k), the distance metric used for neighbor selection, and the weighting scheme applied to neighboring observations [48]. For Logistic Regression, optimization centered on the regularization strength and the choice of regularization penalty, allowing the model to balance bias and variance [48]. The optimized hyperparameter configurations for each evaluated machine-learning model are summarized in Table 5.

Search ranges for all models were selected based on prior literature in genomic variant classification and were adjusted to balance computational feasibility with sufficient coverage of plausible parameter configurations [20,26].
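The grid-search part of this protocol can be sketched with scikit-learn as below, using Logistic Regression as the example. The grid values and the synthetic training data are illustrative assumptions, not the search ranges used in the study, and the Bayesian optimization applied to the boosting models is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split.
X_train, y_train = make_classification(n_samples=400, n_features=10, random_state=0)

# Preprocessing lives inside the pipeline, so it is refit on each training fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}   # illustrative grid

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)

best_C = search.best_params_["clf__C"]
best_auc = search.best_score_    # mean cross-validated ROC-AUC of the best setting
```

Wrapping the scaler and the classifier in one `Pipeline` is what keeps the validation fold out of the scaling statistics during each cross-validation round.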

4.5. Model Evaluation

Model evaluation was designed to distinguish between cross-validated training performance used for model selection and independent hold-out test performance used for final assessment, ensuring that reported metrics reflect generalization beyond the training data [18,31]. Model performance was assessed using the independent hold-out test set, which was not involved in feature selection, preprocessing parameter estimation, or hyperparameter tuning [18,31]. The primary evaluation metric was the area under the receiver operating characteristic curve (ROC-AUC), chosen for its threshold-independent assessment of discrimination performance [20,26]. Given the clinical relevance of identifying pathogenic variants, additional threshold-dependent metrics were computed, including accuracy, precision, recall (sensitivity), F1-score, and Matthews Correlation Coefficient (MCC) [20,26]. MCC was included because it provides a balanced measure of classification performance even when class distributions are uneven [26].
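For reference, the threshold-dependent metrics listed above follow directly from the four confusion-matrix counts. The sketch below uses plain Python with hypothetical function names; the study itself reports using the scikit-learn metrics module.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels with 1 = pathogenic."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC uses all four cells, so it stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }
```

With two true positives, two true negatives, one false positive, and one false negative, accuracy, precision, recall, and F1 all equal 2/3 while MCC is 1/3, which illustrates that MCC penalizes the same errors more heavily than the other metrics.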

The area under the precision–recall curve (PR-AUC) was also reported, as it is particularly informative when evaluating positive-class detection performance in datasets with potential class imbalance [20,26]. For threshold-dependent metrics, the optimal probability cutoff was determined on the training set using Youden’s J statistic and then applied unchanged to the test set to avoid optimistic bias [20,26]. To quantify uncertainty, 95% confidence intervals for ROC-AUC were estimated using bootstrap resampling (1000 iterations) of the test set [26]. Statistical comparisons between competing models were conducted using the DeLong test for correlated ROC curves where appropriate [26].
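The threshold selection and the bootstrap interval can be sketched in plain Python as follows. This is a simplified illustration with hypothetical function names: the cutoff is searched over the observed training scores, the AUC is computed by pairwise comparison, and the interval is a simple percentile interval; the study's own implementation may differ in these details.

```python
import random


def youden_threshold(y_true, scores):
    """Cutoff maximising J = TPR - FPR, calling 'score >= cutoff' positive."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_j, best_t = -1.0, None
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j


def roc_auc(y_true, scores):
    """Probability that a positive outranks a negative (ties count one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for ROC-AUC."""
    rng = random.Random(seed)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:   # a resample needs both classes
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lo = aucs[int((alpha / 2) * n_boot)]
    hi = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Cutoff chosen on training scores, then applied unchanged to test scores.
y_train = [0, 0, 0, 1, 1, 1, 1]
s_train = [0.1, 0.2, 0.6, 0.4, 0.7, 0.8, 0.9]
cutoff, j_stat = youden_threshold(y_train, s_train)
s_test = [0.65, 0.75]
y_pred_test = [int(s >= cutoff) for s in s_test]
```

Fixing the cutoff before looking at the test scores is what prevents the threshold-dependent metrics from being tuned to the hold-out set.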

Model calibration was evaluated using calibration plots and quantitatively assessed via the Brier score. These analyses examined the agreement between predicted probabilities and observed outcome frequencies [20,26]. All evaluation metrics were computed using the scikit-learn metrics module [48].
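The two calibration summaries can be written out in a few lines of plain Python. The equal-width binning used for the calibration curve is an assumption, since the text does not state the binning scheme, and the function names are hypothetical.

```python
def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def calibration_bins(y_true, probs, n_bins=5):
    """Per-bin (mean predicted probability, observed positive fraction, count)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)   # p = 1.0 goes in the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if members:   # skip empty bins
            mean_p = sum(p for _, p in members) / len(members)
            frac_pos = sum(y for y, _ in members) / len(members)
            rows.append((mean_p, frac_pos, len(members)))
    return rows
```

A well-calibrated model produces bins in which the mean predicted probability is close to the observed fraction of pathogenic variants; plotting the first element of each row against the second gives the calibration plot.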

Feature Importance

To enhance interpretability and understand the relative contribution of predictors, feature importance analyses were derived from models trained on the training data using cross-validation and subsequently applied to the independent test set predictions for descriptive purposes only [18,31]. For XGBoost and LightGBM, feature importance was quantified using gain-based metrics, reflecting the average improvement in loss attributable to splits involving each feature [24,25]. For Random Forest, mean decrease in Gini impurity was computed [48]. Because impurity-based measures may exhibit bias toward continuous or high-variance features, importance rankings were additionally examined for consistency across cross-validation folds [22,26].

To complement these model-specific importance measures, SHAP (SHapley Additive exPlanations) analyses were performed for the best-performing model [27]. SHAP values quantify the marginal contribution of each feature to individual predictions, enabling both global and local interpretability while accounting for feature interactions [22,27].

Across ensemble models, several predictors consistently ranked among the top contributors, including CADD_phred, phyloP100way_vertebrate, SIFT_score, and MutationTaster_score [12,13,14,15,32]. These features represent complementary biological signals, integrating evolutionary conservation and predicted functional impact [21,23,33]. Feature importance rankings were derived exclusively from models trained on the training data, and no feature re-ranking was performed using the test labels, thereby preventing information leakage [18,31].

4.6. Model Explainability (SHAP Analysis)

To enhance interpretability and promote clinical transparency, SHAP (SHapley Additive exPlanations) analysis was performed on the best-performing tree-based model. SHAP is grounded in cooperative game theory and assigns each feature a Shapley value representing its marginal contribution to an individual prediction [27].

For tree-based models, SHAP values were computed using the TreeSHAP algorithm, which provides an exact and computationally efficient estimation of Shapley values [27]. Global SHAP analysis was conducted to quantify overall feature importance by calculating the mean absolute SHAP value across samples. This analysis enabled ranking of influential predictors and facilitated visualization of feature effect distributions and potential interaction patterns [22,27].
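To make the quantities concrete, the sketch below computes exact Shapley values by brute-force enumeration of feature coalitions for a toy function, then aggregates them into a global ranking by mean absolute value. It is not TreeSHAP: enumeration is exponential in the number of features and is shown only to illustrate the definition that TreeSHAP evaluates efficiently for tree ensembles. Replacing absent features by a fixed baseline vector is a simplifying assumption, and the function names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; absent features take baseline values."""
    d = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(d)]
        return f(z)

    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def global_importance(shap_matrix):
    """Mean absolute SHAP value per feature across samples."""
    n = len(shap_matrix)
    d = len(shap_matrix[0])
    return [sum(abs(row[j]) for row in shap_matrix) / n for j in range(d)]


def toy_model(z):
    # One additive term and one interaction term.
    return 2.0 * z[0] + z[1] * z[2]


phi = shapley_values(toy_model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
```

For this toy model the additive feature receives its full contribution of 2, while the interaction term of 1 is split equally between the two interacting features, and the three values sum to the difference between the prediction at `x` and at the baseline.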

Local SHAP explanations were generated for individual variants to illustrate how specific features contributed to the predicted probability of pathogenicity. Visualization techniques included force plots, decision plots, and waterfall plots, allowing detailed examination of variant-specific decision pathways [27]. SHAP analyses were applied post hoc to predictions generated by the final trained model on the independent test set and were used solely for interpretability purposes. These analyses were conducted after model training and evaluation were completed and did not influence feature selection, model tuning, or performance estimation [18,31]. This explainability framework provides transparent, instance-level reasoning that complements predictive performance metrics and supports biologically meaningful interpretation in the context of variant pathogenicity assessment [22,27].

4.7. Software, Tools, and Reproducibility

All analyses were conducted in Python 3.10. Core machine-learning workflows were implemented using scikit-learn (v1.4), XGBoost (v1.7), LightGBM (v4.0), and SHAP (v0.44) [24,25,27,48]. Data manipulation and preprocessing were performed using pandas and NumPy, while visualization was carried out with matplotlib and seaborn.

All computations were executed on a Linux workstation running Ubuntu 22.04 LTS, equipped with 32 GB of RAM and an 8-core CPU. Models were trained using CPU-based implementations; no GPU acceleration was employed. To enhance reproducibility, random seeds were fixed across all relevant components, including data splitting, cross-validation, and model initialization [18,31]. Where applicable, deterministic training settings were enabled in gradient boosting frameworks [24,25]. The complete computational environment, including package versions and dependencies, was documented to facilitate replication [18,48].

4.8. Problem Formulation

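The task of predicting missense variant pathogenicity can be formalized as a supervised binary classification problem. Given a missense variant represented by an annotation feature vector

$$x = (f_1, f_2, \ldots, f_d) \in \mathbb{R}^d,$$

the goal is to learn a predictive function

$$h: \mathbb{R}^d \to \{0, 1\},$$

where 1 = pathogenic and 0 = benign.

The classifier h is trained using a labeled dataset

$$D = \{(x_i, y_i)\}_{i=1}^{N},$$

with $y_i \in \{0, 1\}$ derived from expert-reviewed ClinVar classifications.

Model selection aims to maximize discriminatory performance, measured by the cross-validated ROC–AUC computed on the training data only:

$$h^* = \arg\max_{h \in \mathcal{H}} \mathrm{AUC}_{\mathrm{CV}}(h; D_{\mathrm{train}}),$$

where $\mathcal{H}$ denotes the set of candidate models and hyperparameter configurations, $D_{\mathrm{train}} \subset D$ is the training split, and the AUC is computed from the predicted probability of pathogenicity before thresholding. The held-out test set $D_{\mathrm{test}}$ is used only to evaluate the selected $h^*$.

This formulation aligns with best practices in computational genomics and provides a mathematically grounded description of the classification task.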

5. Conclusions

This study introduces PathoPredictor, an interpretable machine-learning framework designed to classify the pathogenicity of human missense variants [3,23]. Among the evaluated algorithms, the LightGBM model demonstrated the strongest discriminative performance within the analyzed dataset, achieving a test ROC–AUC of 0.93, while other ensemble models such as XGBoost showed comparable performance (ROC–AUC = 0.92) [24,25]. These results indicate that ensemble tree-based learning methods can effectively integrate heterogeneous functional annotations for missense variant classification [26].

In addition to predictive performance, PathoPredictor emphasizes model interpretability through SHAP-based feature attribution, enabling both global feature importance assessment and variant-level explanation of predictions [22,27]. Such transparency is particularly important in genomic applications where computational predictions are increasingly used to support, but not replace, expert interpretation within variant analysis workflows [4,8].

Although the framework demonstrated strong performance within the evaluated dataset, several limitations must be considered. The model relies on annotations derived from existing prediction tools and was trained using ClinVar-derived datasets, which may introduce bias toward well-studied genes and may partially inflate predictive performance due to potential circularity [15,16,18,31,32]. Furthermore, the current framework is limited to missense SNVs and has not yet been evaluated across other classes of genomic variation.

Future work will require validation on fully independent external datasets, integration of additional biological features such as structural or transcript-level annotations, and evaluation across more diverse genomic contexts [18,31,49]. These extensions will be important for determining the broader generalizability and practical utility of the framework.

Overall, PathoPredictor provides a transparent and reproducible computational framework for integrating multiple functional annotations in missense variant classification. Rather than serving as a standalone clinical diagnostic tool, the model is intended to support variant prioritization and research-oriented genomic analyses, where interpretable machine-learning approaches may complement existing pathogenicity predictors and assist expert review processes [4,8,22,27].

Author Contributions

Conceptualization, M.A.K.; methodology, K.B.; validation, K.B. and S.B.; formal analysis, K.B. and M.A.K.; investigation, K.B. and M.A.K.; data curation, K.B. and M.A.K.; writing – original draft preparation, K.B.; writing – review and editing, S.B. and M.A.K.; visualization, K.B. and M.A.K.; supervision, M.A.K.; project administration, M.A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical approval was not required because this study exclusively analyzed publicly available, de-identified variant annotation datasets without involving human subjects or protected health information, in accordance with NIH and institutional guidelines.

Data Availability Statement

The data presented in this study are derived from publicly available resources, including ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/; accessed on 15 March 2026) and dbNSFP v5.1 (https://sites.google.com/site/jpopgen/dbNSFP; accessed on 15 March 2026). The processed dataset and all code used in this study are publicly available at the corresponding author’s GitHub repository: https://github.com/abdelmajidk/PathoPredictor.

Conflicts of Interest

The authors declare no conflict of interest.

References

Auton, A.; Brooks, L.D.; Durbin, R.M.; Garrison, E.P.; Kang, H.M.; Korbel, J.O.; Marchini, J.L.; McCarthy, S.; McVean, G.A.; Abecasis, G.R.; et al. A global reference for human genetic variation. Nature 2015, 526, 68–74.
Karczewski, K.J.; Francioli, L.C.; Tiao, G.; Cummings, B.B.; Alföldi, J.; Wang, Q.; Collins, R.L.; Laricchia, K.M.; Ganna, A.; Birnbaum, D.P.; et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 2020, 581, 434–443.
Landrum, M.J.; Lee, J.M.; Benson, M.; Brown, G.; Chao, C.; Chitipiralla, S.; Gu, B.; Hart, J.; Hoffman, D.; Hoover, J.; et al. ClinVar: Public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 2016, 44, D862–D868.
Richards, S.; Aziz, N.; Bale, S.; Bick, D.; Das, S.; Gastier-Foster, J.; Grody, W.W.; Hegde, M.; Lyon, E.; Spector, E.; et al. Standards and guidelines for the interpretation of sequence variants. Genet. Med. 2015, 17, 405–424.
Cooper, D.N.; Krawczak, M.; Polychronakos, C.; Tyler-Smith, C.; Kehrer-Sawatzki, H. Where genotype is not predictive of phenotype: Towards an understanding of the molecular basis of reduced penetrance in human inherited disease. Hum. Genet. 2013, 132, 1077–1130.
Frazer, K.A.; Murray, S.S.; Schork, N.J.; Topol, E.J. Human genetic variation and its contribution to complex traits. Nat. Rev. Genet. 2009, 10, 241–251.
Amendola, L.M.; Jarvik, G.P.; Leo, M.C.; McLaughlin, H.M.; Akkari, Y.; Amaral, M.D.; Berg, J.S.; Biswas, S.; Bowling, K.M.; Conlin, L.K.; et al. Performance of ACMG-AMP variant-interpretation guidelines among nine laboratories in the Clinical Sequencing Exploratory Research Consortium. Am. J. Hum. Genet. 2016, 98, 1067–1076.
Topol, E.J. High-performance medicine: The convergence of human and artificial intelligence. Nat. Med. 2019, 25, 44–56.
Harrison, S.M.; Dolinsky, J.S.; Knight Johnson, A.E.; Pesaran, T.; Azzariti, D.R.; Bale, S.; Chao, E.C.; Das, S.; Vincent, L.; Rehm, H.L.; et al. Clinical laboratories collaborate to resolve differences in variant interpretations submitted to ClinVar. Hum. Mutat. 2017, 38, 1245–1251.
Manrai, A.K.; Funke, B.H.; Rehm, H.L.; Bhatt, D.L.; Baras, A.; Celia-Terrassa, T.; Fishler, K.; Kohane, I.S.; Maas, R.L.; Ginsburg, G.S.; et al. Genetic misdiagnoses and the potential for health disparities. N. Engl. J. Med. 2016, 375, 655–665.
Goldstein, D.B.; Allen, A.; Keebler, J.; Margulies, E.H.; Petrou, S.; Petrovski, S.; Sunyaev, S. Sequencing studies in human genetics: Design and interpretation. Nat. Rev. Genet. 2013, 14, 460–470.
Ng, P.C.; Henikoff, S. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003, 31, 3812–3814.
Adzhubei, I.; Schmidt, S.; Peshkin, L.; Ramensky, V.E.; Gerasimova, A.; Bork, P.; Kondrashov, A.S.; Sunyaev, S.R. A method and server for predicting damaging missense mutations (PolyPhen-2). Nat. Methods 2010, 7, 248–249.
Schwarz, J.M.; Cooper, D.N.; Schuelke, M.; Seelow, D. MutationTaster2: Mutation prediction for the deep-sequencing age. Nat. Methods 2014, 11, 361–362.
Kircher, M.; Witten, D.M.; Jain, P.; O’Roak, B.J.; Cooper, G.M.; Shendure, J. A general framework for estimating the relative pathogenicity of human genetic variants (CADD). Nat. Genet. 2014, 46, 310–315.
Ioannidis, N.M.; Rothstein, J.H.; Pejaver, V.; Middha, S.; McDonnell, S.K.; Baheti, S.; Musolf, A.; Li, Q.; Holzinger, E.; Karyadi, D.; et al. REVEL: An ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 2016, 99, 877–885.
Dong, C.; Wei, P.; Jian, X.; Gibbs, R.; Boerwinkle, E.; Wang, K.; Liu, X. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum. Mol. Genet. 2015, 24, 2125–2137.
Grimm, D.G.; Azencott, C.A.; Aicheler, F.; Gieraths, U.; MacArthur, D.G.; Samocha, K.E.; Cooper, D.N.; Stenson, P.D.; Daly, M.J.; Smoller, J.W.; et al. The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Hum. Mutat. 2015, 36, 513–523.
Mahmood, K.; Jung, C.H.; Philip, G.; Georgeson, P.; Chung, J.; Pope, B.J.; Park, D.J. Variant effect prediction tools assessed using independent sets of pathogenic and benign variants. Genome Biol. 2017, 18, 212.
Ghosh, R.; Oak, N.; Plon, S.E. Evaluation of in silico algorithms for use with ACMG/AMP clinical variant interpretation guidelines. Genome Biol. 2017, 18, 225.
Porretta, A.P.; Fressart, V.; Surget, E.; Morgat, C.; Bloch, A.; Messali, A.; Algalarrondo, V.; Vedrenne, G.; Pruvot, E.; Leenhardt, A.; et al. Making sense of missense: Benchmarking MutScore for variant interpretation in inherited cardiac diseases. Mol. Diagn. Ther. 2025, 29, 539–552.
Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 2nd ed.; Independently Published: Munich, Germany, 2022; ISBN 979-8411463330.
Liu, X.; Li, C.; Mou, C.; Dong, Y.; Tu, Y. dbNSFP v4: A comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Med. 2020, 12, 103.
Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794.
Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 3146–3154.
Tabet, D.R.; Kuang, D.; Lancaster, M.C.; Li, R.; Liu, K.; Weile, J.; Coté, A.G.; Wu, Y.; Hegele, R.A.; Roden, D.M.; et al. Benchmarking computational variant effect predictors by their ability to infer human traits. Genome Biol. 2024, 25, 172.
Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4765–4774.
Sundaram, L.; Gao, H.; Padigepati, S.R.; McRae, J.F.; Li, Y.; Kosmicki, J.A.; Fritzilas, N.; Hakenberg, J.; Dutta, A.; Shon, J.; et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 2018, 50, 1161–1170.
Frazer, J.; Notin, P.; Dias, M.; Gomez, A.; Min, J.K.; Brock, K.; Zemla, Y.; Gal, Y.; Marks, D.S. Disease variant prediction with deep generative models of evolutionary data. Nature 2021, 599, 91–95.
Gao, H.; Hamp, T.; Ede, J.; Schraiber, J.G.; McRae, J.; Singer-Berk, M.; Yang, Y.; Dietrich, A.S.D.; Fiziev, P.P.; Kuderna, L.F.K.; et al. The landscape of tolerated genetic variation in humans and primates (PrimateAI-2.0). Science 2023, 380, eabn8197.
Grimm, D.G.; Roqueiro, D.; Salomé, P.A.; Kleeberger, S.; Greshake, B.; Zhu, W.; Liu, C.; Lippert, C.; Stegle, O.; Schölkopf, B.; et al. Addressing circularity in genomic machine learning. Hum. Mutat. 2021, 42, 1523–1536.
Rentzsch, P.; Witten, D.; Cooper, G.M.; Shendure, J.; Kircher, M. CADD: Predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 2019, 47, D886–D894.
Shihab, H.A.; Rogers, M.F.; Gough, J.; Mort, M.; Cooper, D.N.; Day, I.N.M.; Gaunt, T.R.; Campbell, C. An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Hum. Mol. Genet. 2013, 22, 4002–4011.
Tavtigian, S.V.; Greenblatt, M.S.; Harrison, S.M.; Nussbaum, R.L.; Prabhu, S.A.; Boucher, K.M.; Biesecker, L.G. Modeling the ACMG/AMP variant classification guidelines as a Bayesian classification framework. Genet. Med. 2018, 20, 1054–1060.
Lek, M.; Karczewski, K.J.; Minikel, E.V.; Samocha, K.E.; Banks, E.; Fennell, T.; O’Donnell-Luria, A.H.; Ware, J.S.; Hill, A.J.; Cummings, B.B.; et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 2016, 536, 285–291.
Petrovski, S.; Wang, Q.; Heinzen, E.L.; Allen, A.S.; Goldstein, D.B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 2013, 9, e1003709.
Whiffin, N.; Minikel, E.; Walsh, R.; O’Donnell-Luria, A.H.; Karczewski, K.; Ing, A.Y.; Barton, P.J.R.; Funke, B.; Cook, S.A.; MacArthur, D.; et al. Using high-resolution variant frequencies to empower clinical genome interpretation. Nat. Genet. 2017, 49, 1465–1471.
Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589.
Pejaver, V.; Urresti, J.; Lugo-Martinez, J.; Pagel, K.A.; Lin, G.N.; Nam, H.J.; Mort, M.; Cooper, D.N.; Sebat, J.; Iakoucheva, L.M.; et al. Inferring the molecular and phenotypic impact of amino acid variants with MutPred2. Nat. Commun. 2020, 11, 5918.
Fowler, D.M.; Fields, S. Deep mutational scanning: A new style of protein science. Nat. Methods 2014, 11, 801–807.
Walsh, R.; Thomson, K.L.; Ware, J.S.; Funke, B.H.; Woodley, J.; McGuire, K.J.; Mazzarotto, F.; Blair, E.; Sellers, N.; Taylor, J.C.; et al. Reassessment of Mendelian gene pathogenicity using 7,855 cardiomyopathy cases and 60,706 reference samples. Genet. Med. 2017, 19, 192–203.
Samocha, K.E.; Robinson, E.B.; Sanders, S.J.; Stevens, C.; Sabo, A.; McGrath, L.M.; Kosmicki, J.A.; Rehnström, K.; Mallick, S.; Kirby, A.; et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 2014, 46, 944–950.
Starita, L.M.; Ahituv, N.; Dunham, M.J.; Kitzman, J.O.; Roth, F.P.; Seelig, G.; Shendure, J.; Fowler, D.M. Variant interpretation: Functional assays to the rescue. Am. J. Hum. Genet. 2017, 101, 315–325.
Bycroft, C.; Freeman, C.; Petkova, D.; Band, G.; Elliott, L.T.; Sharp, K.; Motyer, A.; Vukcevic, D.; Delaneau, O.; O’Connell, J.; et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 2018, 562, 203–209.
Rentzsch, P.; Schubach, M.; Shendure, J.; Kircher, M. CADD-Splice: Improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Med. 2021, 13, 31.
Vaser, R.; Adusumalli, S.; Leng, S.N.; Sikic, M.; Ng, P.C. SIFT missense predictions for genomes. Nat. Protoc. 2016, 11, 1–9.
Thornton, J.W.; DeSalle, R. Gene family evolution and homology: Genomics meets phylogenetics. Nat. Rev. Genet. 2014, 15, 689–701.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
Livesey, B.J.; Marsh, J.A. Updated benchmarking of variant effect predictors using deep mutational scanning. Mol. Syst. Biol. 2023, 19, e11474.

Figure 1. Workflow of the PathoPredictor framework. Schematic overview of the PathoPredictor pipeline. Expert-reviewed missense variants were retrieved from ClinVar (November 2023) and annotated using dbNSFP v5.1. After preprocessing (removal of conflicting variants, imputation of missing values, and normalization), six supervised machine learning classifiers were trained and evaluated. The final model incorporated SHAP-based interpretability to provide global feature importance profiles and variant-level explanations.

Figure 2. ROC curves of all evaluated machine learning models. Receiver operating characteristic (ROC) curves comparing the performance of Logistic Regression, Random Forest, XGBoost, LightGBM, Support Vector Machine, and k-Nearest Neighbors. LightGBM and XGBoost achieved the highest ROC-AUC values of 0.93 and 0.92, respectively, with Random Forest also performing strongly. These results indicate strong discriminative power for missense variant classification.

Figure 3. Precision–Recall curves of the top three models. Precision–Recall (PR) curves for LightGBM, XGBoost, and Random Forest. All three models exhibited high precision across a wide recall range, underscoring their robustness in detecting pathogenic variants across a near-balanced evaluation dataset.

Figure 4. Global SHAP feature importance ranking. Bar plot of mean absolute SHAP values per feature from the final PathoPredictor model (LightGBM). The most influential features included CADD_phred, MutationTaster_score, phyloP100way_vertebrate, SIFT_score, and PolyPhen2_HVAR_score, emphasizing the combined relevance of evolutionary conservation and predicted functional impact.

Figure 5. SHAP summary plot showing feature influence across all variants. SHAP beeswarm plot illustrating how each feature contributes to individual predictions from the LightGBM model. Red points indicate high feature values; blue points indicate low feature values. Conservation metrics and integrated predictive scores produce the strongest directional impact toward pathogenic classification.

Figure 6. Example SHAP decision plot for a pathogenic variant. SHAP decision plot for the BRCA1 p.Val1736Ala variant, illustrating how individual features from the final PathoPredictor model (LightGBM) cumulatively shift the prediction from the baseline toward a “pathogenic” classification. The plot demonstrates the variant-level decision pathway, allowing clinicians to trace and align the model’s reasoning with biological evidence.
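
The cumulative pathway traced by a decision plot follows from the additivity of SHAP values: the model output for one variant equals the baseline (expected) output plus the sum of that variant's feature contributions. The sketch below illustrates this with invented numbers. The baseline, the contribution values, and the assumption that contributions are expressed in log-odds units are all hypothetical and are not taken from the BRCA1 p.Val1736Ala example.

```python
import math

# Hypothetical values for illustration only (not from the paper).
BASE_VALUE = -0.10  # assumed expected model output, in log-odds
CONTRIBUTIONS = {
    "CADD_phred": 0.90,
    "phyloP100way_vertebrate": 0.60,
    "SIFT_score": 0.40,
    "gnomAD_AF": -0.20,
}


def cumulative_path(base, contributions):
    """Accumulate contributions from smallest to largest absolute value."""
    path = [("baseline", base)]
    running = base
    for name, value in sorted(contributions.items(), key=lambda kv: abs(kv[1])):
        running += value
        path.append((name, running))
    return path


def sigmoid(x):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


path = cumulative_path(BASE_VALUE, CONTRIBUTIONS)
final_log_odds = path[-1][1]
probability = sigmoid(final_log_odds)

for name, running in path:
    print(f"{name:26s} running log-odds = {running:+.2f}")
print(f"predicted probability of pathogenicity = {probability:.3f}")
```

Ordering features by absolute contribution is an illustrative choice here; the end point of the path is the same for any ordering, because only the sum matters.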

Table 1. Performance metrics of evaluated classifiers reported as mean ± standard deviation across five cross-validation folds.

Model	Accuracy	Precision	Recall	F1-Score	ROC-AUC
Logistic Regression	0.81 ± 0.01	0.78 ± 0.01	0.74 ± 0.01	0.76 ± 0.01	0.84 ± 0.01
Random Forest	0.87 ± 0.01	0.86 ± 0.01	0.83 ± 0.01	0.84 ± 0.01	0.90 ± 0.01
XGBoost	0.89 ± 0.01	0.88 ± 0.01	0.85 ± 0.01	0.86 ± 0.01	0.92 ± 0.01
LightGBM	0.90 ± 0.01	0.90 ± 0.01	0.86 ± 0.01	0.88 ± 0.01	0.93 ± 0.01
KNN	0.85 ± 0.01	0.83 ± 0.01	0.80 ± 0.01	0.81 ± 0.01	0.88 ± 0.01
SVM (RBF)	0.83 ± 0.01	0.81 ± 0.01	0.77 ± 0.01	0.79 ± 0.01	0.86 ± 0.01
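
As a quick consistency check on Table 1, the F1-score can be recomputed as the harmonic mean of precision and recall. The sketch below uses the fold-mean values from the table. Because the table reports means across folds, the F1 derived from mean precision and mean recall need not match the mean of per-fold F1 exactly, so agreement is only expected to within rounding.

```python
# (precision, recall, reported F1) per model, copied from Table 1.
TABLE_1 = {
    "Logistic Regression": (0.78, 0.74, 0.76),
    "Random Forest": (0.86, 0.83, 0.84),
    "XGBoost": (0.88, 0.85, 0.86),
    "LightGBM": (0.90, 0.86, 0.88),
    "KNN": (0.83, 0.80, 0.81),
    "SVM (RBF)": (0.81, 0.77, 0.79),
}


def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


derived_f1 = {
    model: f1_from_precision_recall(precision, recall)
    for model, (precision, recall, _) in TABLE_1.items()
}

for model, (_, _, reported) in TABLE_1.items():
    print(f"{model:20s} derived F1 = {derived_f1[model]:.3f}  reported = {reported:.2f}")
```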

Table 2. Top 10 most informative predictors ranked by mean absolute SHAP value from the final LightGBM model.

Rank	Feature Name	Mean |SHAP| Value	Contribution Interpretation
1	CADD_phred	0.137	Strong integrated impact predictor
2	MutationTaster_score	0.121	Functional disruption likelihood
3	phyloP100way_vertebrate	0.113	Deep evolutionary conservation
4	SIFT_score	0.099	Amino acid substitution tolerance
5	PolyPhen2_HVAR_score	0.093	Protein structural impact
6	GERP++_RS	0.082	Evolutionary constraint
7	REVEL	0.079	Meta-predictor risk score
8	FATHMM_score	0.071	Functional impact
9	MutationAssessor	0.065	Protein-level disruption
10	PROVEAN_score	0.058	Functional damage indicator
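
The ranking in Table 2 rests on the mean absolute SHAP value, which averages the magnitude of each feature's contribution over all variants regardless of direction. A minimal sketch of that aggregation follows, using a small invented matrix of per-variant SHAP values. The three feature columns and all numbers are hypothetical and serve only to show the computation.

```python
from statistics import mean

# Hypothetical SHAP matrix: rows are variants, columns are features.
FEATURES = ["CADD_phred", "SIFT_score", "gnomAD_AF"]
SHAP_VALUES = [
    [0.30, -0.10, 0.02],
    [-0.20, 0.15, -0.01],
    [0.10, -0.05, 0.03],
]


def mean_abs_shap(shap_rows, feature_names):
    """Return (feature, mean |SHAP|) pairs sorted from most to least influential."""
    columns = zip(*shap_rows)  # transpose rows into per-feature columns
    scores = {
        name: mean(abs(value) for value in column)
        for name, column in zip(feature_names, columns)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = mean_abs_shap(SHAP_VALUES, FEATURES)
for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank}  {name:12s} {score:.3f}")
```

Taking the absolute value before averaging matters: a feature that pushes some variants strongly toward pathogenic and others strongly toward benign would otherwise average out near zero.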

Table 3. Summary of ClinVar-derived datasets used for model development and evaluation, including filtering and preprocessing steps applied prior to machine-learning analysis.

Dataset Component	Count	Notes
Total ClinVar variants (November 2023)	2,988,631	All variant types
Missense SNVs	987,214	After filtering for substitution variants
Expert-reviewed missense SNVs	62,417	Pathogenic + benign, no conflicting interpretations
Final dataset after merging with dbNSFP	59,302	After removing missing-feature rows
Train set (80%)	47,441	Stratified
Test set (20%)	11,861	Stratified
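
The train and test counts in Table 3 can be reproduced from the final dataset size with simple arithmetic. The sketch below assumes the test-set size is obtained by rounding 20% of the dataset up to the next integer, with the remainder assigned to training; this is, to our understanding, the convention followed by scikit-learn's train_test_split, but that link is an assumption and not something stated in the table.

```python
import math

N_EXPERT_REVIEWED = 62_417  # expert-reviewed missense SNVs (Table 3)
N_FINAL = 59_302            # after merging with dbNSFP (Table 3)
TEST_FRACTION = 0.20

# Assumed convention: round the test size up, give the rest to training.
n_test = math.ceil(TEST_FRACTION * N_FINAL)
n_train = N_FINAL - n_test

# Variants lost between expert-review filtering and the final dataset.
n_dropped = N_EXPERT_REVIEWED - N_FINAL

print(f"train = {n_train:,}  test = {n_test:,}  dropped = {n_dropped:,}")
print(f"retained after dbNSFP merge = {N_FINAL / N_EXPERT_REVIEWED:.1%}")
```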

Table 4. Categories of functional and evolutionary annotation features used for machine-learning model training. Example predictors are shown for each feature category.

Feature Category	Examples	Description
Conservation scores	phyloP100way, GERP++	Evolutionary constraint
Protein impact scores	SIFT, PolyPhen-2	Predicted deleteriousness
Meta-predictors	CADD, MutationAssessor	Integrated functional impact
Structural predictions	PROVEAN, FATHMM	3D/biochemical impact
Splicing-related scores	dbscSNV	Effects on splice sites
Allele frequency	gnomAD_AF	Population variation

Table 5. Optimized hyperparameter configurations for the evaluated machine-learning models obtained through cross-validation–based hyperparameter optimization.

Model	Key Hyperparameters
Logistic Regression	Penalty = L2; C = 1.0; Solver = liblinear; Class weight = balanced; Max iterations = 500
Random Forest	n_estimators = 500; criterion = gini; max_depth = 25; min_samples_split = 2; min_samples_leaf = 1; max_features = sqrt; class_weight = balanced_subsample; bootstrap = True
XGBoost	n_estimators = 600; learning_rate = 0.05; max_depth = 7; subsample = 0.8; colsample_bytree = 0.8; gamma = 0; reg_alpha = 0.1; reg_lambda = 0.1; objective = binary:logistic; eval_metric = AUC
LightGBM	n_estimators = 600; learning_rate = 0.03; num_leaves = 64; max_depth = -1; feature_fraction = 0.8; bagging_fraction = 0.8; bagging_freq = 5; min_data_in_leaf = 20; lambda_l1 = 0.1; lambda_l2 = 0.1; objective = binary; metric = AUC
Support Vector Machine (RBF)	Kernel = RBF; C = 2.0; Gamma = scale; Class weight = balanced
KNN	n_neighbors = 15; weights = distance; metric = minkowski (p = 2)
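
For readers who wish to reuse the final configuration, the LightGBM row of Table 5 can be transcribed into a parameter dictionary. The sketch below is one possible transcription and not the authors' code: the helper name build_classifier and the random seed are hypothetical, the metric is written in lowercase as LightGBM expects, and the lightgbm package is imported lazily so that the dictionary itself can be inspected without it.

```python
# Hyperparameters transcribed from the LightGBM row of Table 5.
LIGHTGBM_PARAMS = {
    "n_estimators": 600,
    "learning_rate": 0.03,
    "num_leaves": 64,
    "max_depth": -1,  # values <= 0 mean no depth limit in LightGBM
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "min_data_in_leaf": 20,
    "lambda_l1": 0.1,
    "lambda_l2": 0.1,
    "objective": "binary",
    "metric": "auc",
}


def build_classifier(params=None, seed=42):
    """Return an untrained LGBMClassifier (hypothetical helper).

    The seed is an assumption for reproducibility; it is not given in Table 5.
    Requires the optional lightgbm package.
    """
    from lightgbm import LGBMClassifier  # lazy import: optional dependency

    return LGBMClassifier(random_state=seed, **(params or LIGHTGBM_PARAMS))
```

With lightgbm installed, `build_classifier().fit(X_train, y_train)` would train a model under these settings, where X_train and y_train stand for the caller's own feature matrix and labels.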

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
