PathoPredictor: A Machine Learning Framework for Predicting Pathogenic Missense Variants in the Human Genome
Abstract
1. Introduction
2. Results
2.1. Dataset Characteristics
2.2. Performance of Individual Predictors
2.3. Machine Learning Model Performance
2.4. Feature Importance Analysis
2.5. Temporal Validation on Independent ClinVar Submissions
3. Discussion
3.1. Clinical Significance of PathoPredictor
3.2. Comparison with State-of-the-Art Predictors
3.3. Limitations
3.4. Future Directions
4. Methods
4.1. Data Sources
4.1.1. ClinVar Variant Dataset
4.1.2. Functional and Evolutionary Annotations (dbNSFP v5.1)
4.2. Data Curation and Preprocessing
4.2.1. Variant Filtering
4.2.2. Handling Missing Values
4.2.3. Feature Normalization
4.3. Feature Selection and Engineering
4.3.1. Correlation Filtering
4.3.2. Variance Thresholding
4.3.3. Model-Based Feature Selection
4.4. Machine Learning Model Development
4.4.1. Models Evaluated
4.4.2. Training and Validation Strategy
4.4.3. Hyperparameter Optimization
4.5. Model Evaluation
- Feature Importance
4.6. Model Explainability (SHAP Analysis)
4.7. Software, Tools, and Reproducibility
4.8. Problem Formulation
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest

| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.81 ± 0.01 | 0.78 ± 0.01 | 0.74 ± 0.01 | 0.76 ± 0.01 | 0.84 ± 0.01 |
| Random Forest | 0.87 ± 0.01 | 0.86 ± 0.01 | 0.83 ± 0.01 | 0.84 ± 0.01 | 0.90 ± 0.01 |
| XGBoost | 0.89 ± 0.01 | 0.88 ± 0.01 | 0.85 ± 0.01 | 0.86 ± 0.01 | 0.92 ± 0.01 |
| LightGBM | 0.90 ± 0.01 | 0.90 ± 0.01 | 0.86 ± 0.01 | 0.88 ± 0.01 | 0.93 ± 0.01 |
| KNN | 0.85 ± 0.01 | 0.83 ± 0.01 | 0.80 ± 0.01 | 0.81 ± 0.01 | 0.88 ± 0.01 |
| SVM (RBF) | 0.83 ± 0.01 | 0.81 ± 0.01 | 0.77 ± 0.01 | 0.79 ± 0.01 | 0.86 ± 0.01 |
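
As a quick consistency check on the table above, F1 can be recomputed as the harmonic mean of the reported precision and recall. This is a minimal sketch, not the authors' evaluation code: the helper names `f1_from` and `check_f1` are hypothetical, and because the table reports rounded means, the F1 of the mean precision and recall need not equal the mean of per-run F1 scores exactly. The tolerance of 0.01 (one unit in the last reported digit) is an assumption.

```python
# Reported (precision, recall, F1) means, transcribed from the table above.
REPORTED = {
    "Logistic Regression": (0.78, 0.74, 0.76),
    "Random Forest": (0.86, 0.83, 0.84),
    "XGBoost": (0.88, 0.85, 0.86),
    "LightGBM": (0.90, 0.86, 0.88),
    "KNN": (0.83, 0.80, 0.81),
    "SVM (RBF)": (0.81, 0.77, 0.79),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def check_f1(reported, tol=0.01):
    """Map each model to (recomputed F1, whether it matches within tol)."""
    results = {}
    for model, (precision, recall, f1) in reported.items():
        recomputed = f1_from(precision, recall)
        results[model] = (round(recomputed, 3), abs(recomputed - f1) <= tol)
    return results


if __name__ == "__main__":
    for model, (recomputed, ok) in check_f1(REPORTED).items():
        print(f"{model}: recomputed F1 = {recomputed}, consistent = {ok}")
```

Under these assumptions, every reported F1-Score agrees with its precision and recall to within rounding.
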

| Rank | Feature Name | Mean SHAP Value | Contribution Interpretation |
|---|---|---|---|
| 1 | CADD_phred | 0.137 | Strong integrated impact predictor |
| 2 | MutationTaster_score | 0.121 | Functional disruption likelihood |
| 3 | phyloP100way_vertebrate | 0.113 | Deep evolutionary conservation |
| 4 | SIFT_score | 0.099 | Amino acid substitution tolerance |
| 5 | PolyPhen2_HVAR_score | 0.093 | Protein structural impact |
| 6 | GERP++_RS | 0.082 | Evolutionary constraint |
| 7 | REVEL | 0.079 | Meta-predictor risk score |
| 8 | FATHMM_score | 0.071 | Functional impact |
| 9 | MutationAssessor | 0.065 | Protein-level disruption |
| 10 | PROVEAN_score | 0.058 | Functional damage indicator |
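
A global ranking like the one above is commonly obtained by averaging the absolute SHAP value of each feature over all samples. The sketch below illustrates that aggregation on toy numbers. It assumes the "Mean SHAP Value" column is a mean absolute SHAP value, which the table does not state; the function name `mean_abs_shap` and the toy matrix are hypothetical, and only the feature names come from the table.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean |SHAP| across samples (rows)."""
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    n_samples = len(shap_rows)
    scores = {name: total / n_samples
              for name, total in zip(feature_names, totals)}
    # Highest mean |SHAP| first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy per-sample SHAP values (3 samples x 3 features); not data from the paper.
toy_rows = [
    [0.2, -0.1, 0.0],
    [-0.4, 0.1, 0.1],
    [0.3, -0.4, -0.2],
]
ranking = mean_abs_shap(toy_rows, ["CADD_phred", "SIFT_score", "GERP++_RS"])

if __name__ == "__main__":
    for rank, (name, score) in enumerate(ranking, start=1):
        print(rank, name, round(score, 3))
```

Taking absolute values before averaging keeps positive and negative contributions from cancelling, so a feature that pushes predictions strongly in both directions still ranks high.
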

| Dataset Component | Count | Notes |
|---|---|---|
| Total ClinVar variants (November 2023) | 2,988,631 | All variant types |
| Missense SNVs | 987,214 | After filtering for substitution variants |
| Expert-reviewed missense SNVs | 62,417 | Pathogenic + benign, no conflicting interpretations |
| Final dataset after merging with dbNSFP | 59,302 | After removing missing-feature rows |
| Train set (80%) | 47,441 | Stratified |
| Test set (20%) | 11,861 | Stratified |
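
The stratified 80/20 split in the table above (47,441 + 11,861 = 59,302 variants) can be illustrated with a small standard-library sketch. It is not the authors' code: the function name `stratified_split`, the seed, and the synthetic labels are assumptions, and the class ratio used here is invented because the table does not report one.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), preserving class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out the same fraction of every class.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Synthetic labels (1 = pathogenic, 0 = benign); the 30/70 ratio is made up.
labels = [1] * 300 + [0] * 700
train_idx, test_idx = stratified_split(labels)

if __name__ == "__main__":
    print(len(train_idx), len(test_idx))
```

Splitting within each class separately is what keeps the pathogenic-to-benign ratio the same in the training and test sets.
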

| Feature Category | Examples | Description |
|---|---|---|
| Conservation scores | phyloP100way, GERP++ | Evolutionary constraint |
| Protein impact scores | SIFT, PolyPhen-2 | Predicted deleteriousness |
| Meta-predictors | CADD, MutationAssessor | Integrated functional impact |
| Structural predictions | PROVEAN, FATHMM | 3D/biochemical impact |
| Splicing-related scores | dbscSNV | Effects on splice sites |
| Allele frequency | gnomAD_AF | Population variation |

| Model | Key Hyperparameters |
|---|---|
| Logistic Regression | Penalty = L2; C = 1.0; Solver = liblinear; Class weight = balanced; Max iterations = 500 |
| Random Forest | n_estimators = 500; criterion = gini; max_depth = 25; min_samples_split = 2; min_samples_leaf = 1; max_features = sqrt; class_weight = balanced_subsample; bootstrap = True |
| XGBoost | n_estimators = 600; learning_rate = 0.05; max_depth = 7; subsample = 0.8; colsample_bytree = 0.8; gamma = 0; reg_alpha = 0.1; reg_lambda = 0.1; objective = binary:logistic; eval_metric = AUC |
| LightGBM | n_estimators = 600; learning_rate = 0.03; num_leaves = 64; max_depth = -1; feature_fraction = 0.8; bagging_fraction = 0.8; bagging_freq = 5; min_data_in_leaf = 20; lambda_l1 = 0.1; lambda_l2 = 0.1; objective = binary; metric = AUC |
| Support Vector Machine (RBF) | Kernel = RBF; C = 2.0; Gamma = scale; Class weight = balanced |
| KNN | n_neighbors = 15; weights = distance; metric = minkowski (p = 2) |
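
The hyperparameters above can be transcribed into keyword arguments for the scikit-learn-style estimator classes. This sketch makes several assumptions: that the models were built with scikit-learn, XGBoost, and LightGBM classes of these names, that the table labels map to the argument names shown, and that `AUC` corresponds to the lowercase `"auc"` metric string. `MODEL_SPECS` and `build_model` are hypothetical names. Libraries are imported only when a model is built, so the configuration itself can be inspected without them installed.

```python
from importlib import import_module

# (module, class, keyword arguments) per model, transcribed from the table.
MODEL_SPECS = {
    "Logistic Regression": ("sklearn.linear_model", "LogisticRegression", dict(
        penalty="l2", C=1.0, solver="liblinear",
        class_weight="balanced", max_iter=500)),
    "Random Forest": ("sklearn.ensemble", "RandomForestClassifier", dict(
        n_estimators=500, criterion="gini", max_depth=25,
        min_samples_split=2, min_samples_leaf=1, max_features="sqrt",
        class_weight="balanced_subsample", bootstrap=True)),
    "XGBoost": ("xgboost", "XGBClassifier", dict(
        n_estimators=600, learning_rate=0.05, max_depth=7,
        subsample=0.8, colsample_bytree=0.8, gamma=0,
        reg_alpha=0.1, reg_lambda=0.1,
        objective="binary:logistic", eval_metric="auc")),
    "LightGBM": ("lightgbm", "LGBMClassifier", dict(
        n_estimators=600, learning_rate=0.03, num_leaves=64, max_depth=-1,
        feature_fraction=0.8, bagging_fraction=0.8, bagging_freq=5,
        min_data_in_leaf=20, lambda_l1=0.1, lambda_l2=0.1,
        objective="binary", metric="auc")),
    "SVM (RBF)": ("sklearn.svm", "SVC", dict(
        kernel="rbf", C=2.0, gamma="scale", class_weight="balanced")),
    "KNN": ("sklearn.neighbors", "KNeighborsClassifier", dict(
        n_neighbors=15, weights="distance", metric="minkowski", p=2)),
}


def build_model(name):
    """Instantiate the named model; requires the relevant library."""
    module_name, class_name, params = MODEL_SPECS[name]
    estimator_class = getattr(import_module(module_name), class_name)
    return estimator_class(**params)
```

For example, `build_model("Random Forest")` would return an unfitted `RandomForestClassifier` with the listed settings, provided scikit-learn is installed.
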
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Bahmane, K.; Bhattacharya, S.; Kassem, M.A. PathoPredictor: A Machine Learning Framework for Predicting Pathogenic Missense Variants in the Human Genome. J. Genome Biotechnol. Genet. 2026, 1, 3. https://doi.org/10.3390/jgbg1010003

