Deep Learning Framework for Complex Disease Risk Prediction Using Genomic Variations
Abstract
1. Introduction
2. Materials and Methods
2.1. Genotype Datasets
2.2. Method
2.2.1. Feature Selection
2.2.2. Deep Learning
2.2.3. Evaluation
3. Results and Discussion
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Collins, F.S.; Brooks, L.D.; Chakravarti, A. A DNA polymorphism discovery resource for research on human genetic variation. Genome Res. 1998, 8, 1229–1231.
2. Davis, S.; Pettengill, J.B.; Luo, Y.; Payne, J.; Shpuntoff, A.; Rand, H.; Strain, E. CFSAN SNP Pipeline: An automated method for constructing SNP matrices from next-generation sequence data. PeerJ Comput. Sci. 2015, 1, e20.
3. Visscher, P.M.; Brown, M.A.; McCarthy, M.I.; Yang, J. Five years of GWAS discovery. Am. J. Hum. Genet. 2012, 90, 7–24.
4. International Parkinson Disease Genomics Consortium. Imputation of sequence variants for identification of genetic risks for Parkinson’s disease: A meta-analysis of genome-wide association studies. Lancet 2011, 377, 641–649.
5. Sladek, R.; Rocheleau, G.; Rung, J.; Dina, C.; Shen, L.; Serre, D.; Boutin, P.; Vincent, D.; Belisle, A.; Hadjadj, S.; et al. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature 2007, 445, 881.
6. Tsai, F.J.; Yang, C.F.; Chen, C.C.; Chuang, L.M.; Lu, C.H.; Chang, C.T.; Wang, T.Y.; Chen, R.H.; Shiu, C.F.; Liu, Y.M.; et al. A genome-wide association study identifies susceptibility variants for type 2 diabetes in Han Chinese. PLoS Genet. 2010, 6, e1000847.
7. Li, H.; Gan, W.; Lu, L.; Dong, X.; Han, X.; Hu, C.; Yang, Z.; Sun, L.; Bao, W.; Li, P.; et al. A genome-wide association study identifies GRK5 and RASGRP1 as type 2 diabetes loci in Chinese Hans. Diabetes 2013, 62, 291–298.
8. Shiraishi, K.; Kunitoh, H.; Daigo, Y.; Takahashi, A.; Goto, K.; Sakamoto, H.; Ohnami, S.; Shimada, Y.; Ashikawa, K.; Saito, A.; et al. A genome-wide association study identifies two new susceptibility loci for lung adenocarcinoma in the Japanese population. Nat. Genet. 2012, 44, 900.
9. Hu, Z.; Wu, C.; Shi, Y.; Guo, H.; Zhao, X.; Yin, Z.; Yang, L.; Dai, J.; Hu, L.; Tan, W.; et al. A genome-wide association study identifies two new lung cancer susceptibility loci at 13q12.12 and 22q12.2 in Han Chinese. Nat. Genet. 2011, 43, 792.
10. Xu, J.; Mo, Z.; Ye, D.; Wang, M.; Liu, F.; Jin, G.; Xu, C.; Wang, X.; Shao, Q.; Chen, Z.; et al. Genome-wide association study in Chinese men identifies two new prostate cancer risk loci at 9q31.2 and 19q13.4. Nat. Genet. 2012, 44, 1231.
11. Eyre, S.; Bowes, J.; Diogo, D.; Lee, A.; Barton, A.; Martin, P.; Zhernakova, A.; Stahl, E.; Viatte, S.; McAllister, K.; et al. High-density genetic mapping identifies new susceptibility loci for rheumatoid arthritis. Nat. Genet. 2012, 44, 1336.
12. Janssens, A.C.J.; van Duijn, C.M. Genome-based prediction of common diseases: Advances and prospects. Hum. Mol. Genet. 2008, 17, R166–R173.
13. Jostins, L.; Barrett, J.C. Genetic risk prediction in complex disease. Hum. Mol. Genet. 2011, 20, R182–R188.
14. Kruppa, J.; Ziegler, A.; König, I.R. Risk estimation and risk prediction using machine-learning methods. Hum. Genet. 2012, 131, 1639–1654.
15. Kooperberg, C.; LeBlanc, M.; Obenchain, V. Risk prediction using genome-wide association studies. Genet. Epidemiol. 2010, 34, 643–652.
16. Evans, D.T. A SNP Microarray Analysis Pipeline Using Machine Learning Techniques. Ph.D. Thesis, Ohio University, Athens, OH, USA, 2010.
17. Qi, Q.; Liang, L.; Doria, A.; Hu, F.B.; Qi, L. Genetic predisposition to dyslipidemia and type 2 diabetes risk in two prospective cohorts. Diabetes 2012, 61, 745–752.
18. Goh, C.; Schumacher, F.; Easton, D.; Muir, K.; Henderson, B.; Kote-Jarai, Z.; Eeles, R. Genetic variants associated with predisposition to prostate cancer and potential clinical implications. J. Intern. Med. 2012, 271, 353–365.
19. Mittag, F.; Büchel, F.; Saad, M.; Jahn, A.; Schulte, C.; Bochdanovits, Z.; Simón-Sánchez, J.; Nalls, M.A.; Keller, M.; Hernandez, D.G.; et al. Use of support vector machines for disease risk prediction in genome-wide association studies: Concerns and opportunities. Hum. Mutat. 2012, 33, 1708–1718.
20. Botta, V.; Louppe, G.; Geurts, P.; Wehenkel, L. Exploiting SNP correlations within random forest for genome-wide association studies. PLoS ONE 2014, 9, e93379.
21. Maier, A.; Syben, C.; Lasser, T.; Riess, C. A gentle introduction to deep learning in medical image processing. Z. Med. Phys. 2019, 29, 86–101.
22. Kim, Y. Convolutional neural networks for sentence classification. arXiv 2014, arXiv:1408.5882.
23. Elgart, M.; Lyons, G.; Romero-Brufau, S.; Kurniansyah, N.; Brody, J.A.; Guo, X.; Lin, H.J.; Raffield, L.; Gao, Y.; Chen, H.; et al. Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations. Commun. Biol. 2022, 5, 856.
24. Li, Y.; Huang, C.; Ding, L.; Li, Z.; Pan, Y.; Gao, X. Deep learning in bioinformatics: Introduction, application, and perspective in the big data era. Methods 2019, 166, 4–21.
25. Tabares-Soto, R.; Orozco-Arias, S.; Romero-Cano, V.; Bucheli, V.S.; Rodríguez-Sotelo, J.L.; Jiménez-Varón, C.F. A comparative study of machine learning and deep learning algorithms to classify cancer types based on microarray gene expression data. PeerJ Comput. Sci. 2020, 6, e270.
26. Alatrany, A.S.; Khan, W.; Hussain, A.J.; Mustafina, J.; Al-Jumeily, D. Transfer Learning for Classification of Alzheimer’s Disease Based on Genome Wide Data. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023.
27. Liu, L.; Meng, Q.; Weng, C.; Lu, Q.; Wang, T.; Wen, Y. Explainable deep transfer learning model for disease risk prediction using high-dimensional genomic data. PLoS Comput. Biol. 2022, 18, e1010328.
28. Abd El Hamid, M.M.; Omar, Y.M.; Shaheen, M.; Mabrouk, M.S. Discovering epistasis interactions in Alzheimer’s disease using deep learning model. Gene Rep. 2022, 29, 101673.
29. Uppu, S.; Krishna, A.; Gopalan, R.P. A Deep Learning Approach to Detect SNP Interactions. JSW 2016, 11, 965–975.
30. Pudjihartono, N.; Fadason, T.; Kempa-Liehr, A.W.; O’Sullivan, J.M. A review of feature selection methods for machine learning-based disease risk prediction. Front. Bioinform. 2022, 2, 927312.
31. Ho, D.S.W.; Schierding, W.; Wake, M.; Saffery, R.; O’Sullivan, J. Machine learning SNP based prediction for precision medicine. Front. Genet. 2019, 10, 267.
32. Wei, Z.; Wang, K.; Qu, H.Q.; Zhang, H.; Bradfield, J.; Kim, C.; Frackleton, E.; Hou, C.; Glessner, J.T.; Chiavacci, R.; et al. From disease association to risk assessment: An optimistic view from genome-wide association studies on type 1 diabetes. PLoS Genet. 2009, 5, e1000678.
33. Hajiloo, M.; Damavandi, B.; HooshSadat, M.; Sangi, F.; Mackey, J.R.; Cass, C.E.; Greiner, R.; Damaraju, S. Breast cancer prediction using genome wide single nucleotide polymorphism data. BMC Bioinform. 2013, 14, S3.
34. Pirooznia, M.; Seifuddin, F.; Judy, J.; Mahon, P.B.; Potash, J.B.; Zandi, P.P.; Bipolar Genome Study (BiGS) Consortium. Data mining approaches for genome-wide association of mood disorders. Psychiatr. Genet. 2012, 22, 55.
35. Alzubi, R.; Ramzan, N.; Alzoubi, H.; Amira, A. A hybrid feature selection method for complex diseases SNPs. IEEE Access 2017, 6, 1292–1301.
36. Abraham, G.; Kowalczyk, A.; Zobel, J.; Inouye, M. Performance and robustness of penalized and unpenalized methods for genetic prediction of complex human disease. Genet. Epidemiol. 2013, 37, 184–195.
37. Guo, Y.; Wei, Z.; Keating, B.J.; Hakonarson, H. Machine learning derived risk prediction of anorexia nervosa. BMC Med. Genom. 2015, 9, 4.
38. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007, 447, 661.
39. Davies, R.W.; Dandona, S.; Stewart, A.F.; Chen, L.; Ellis, S.G.; Tang, W.W.; Hazen, S.L.; Roberts, R.; McPherson, R.; Wells, G.A. Improved prediction of cardiovascular disease based on a panel of single nucleotide polymorphisms identified through genome-wide association studies. Circ. Cardiovasc. Genet. 2010, 3, 468.
40. Roshan, U.; Chikkagoudar, S.; Wei, Z.; Wang, K.; Hakonarson, H. Ranking causal variants and associated regions in genome-wide association studies by the support vector machine and random forest. Nucleic Acids Res. 2011, 39, e62.
41. Behravan, H.; Hartikainen, J.M.; Tengström, M.; Pylkäs, K.; Winqvist, R.; Kosma, V.M.; Mannermaa, A. Machine learning identifies interacting genetic variants contributing to breast cancer risk: A case study in Finnish cases and controls. Sci. Rep. 2018, 8, 13149.
42. Behravan, H.; Hartikainen, J.M.; Tengström, M.; Kosma, V.M.; Mannermaa, A. Predicting breast cancer risk using interacting genetic and demographic factors and machine learning. Sci. Rep. 2020, 10, 11044.
43. Mittag, F.; Römer, M.; Zell, A. Influence of Feature Encoding and Choice of Classifier on Disease Risk Prediction in Genome-Wide Association Studies. PLoS ONE 2015, 10, e0135832.
44. Manor, O.; Segal, E. Predicting disease risk using bootstrap ranking and classification algorithms. PLoS Comput. Biol. 2013, 9, e1003200.
45. Bennasar, M.; Hicks, Y.; Setchi, R. Feature selection using joint mutual information maximisation. Expert Syst. Appl. 2015, 42, 8520–8532.
46. Evans, D.M.; Visscher, P.M.; Wray, N.R. Harnessing the information contained within genome-wide association studies to improve individual prediction of complex disease risk. Hum. Mol. Genet. 2009, 18, 3525–3531.
47. He, Q.; Lin, D.Y. A variable selection method for genome-wide association studies. Bioinformatics 2010, 27, 1–8.
48. Ye, C.; Cui, Y.; Wei, C.; Elston, R.C.; Zhu, J.; Lu, Q. A non-parametric method for building predictive genetic tests on high-dimensional data. Hum. Hered. 2011, 71, 161–170.
49. Mieth, B.; Rozier, A.; Rodriguez, J.A.; Höhne, M.M.; Görnitz, N.; Müller, K.R. DeepCOMBI: Explainable artificial intelligence for the analysis and discovery in genome-wide association studies. NAR Genom. Bioinform. 2021, 3, lqab065.
50. Rich, S.; Goodarzi, M.; Palmer, N.; Langefeld, C.; Ziegler, J.; Haffner, S.; Bryer-Ash, M.; Norris, J.; Taylor, K.; Haritunians, T.; et al. A genome-wide association scan for acute insulin response to glucose in Hispanic-Americans: The Insulin Resistance Atherosclerosis Family Study (IRAS FS). Diabetologia 2009, 52, 1326–1333.
51. Michel, S.; Liang, L.; Depner, M.; Klopp, N.; Ruether, A.; Kumar, A.; Schedel, M.; Vogelberg, C.; von Mutius, E.; von Berg, A.; et al. Unifying candidate gene and GWAS Approaches in Asthma. PLoS ONE 2010, 5, e13894.
52. Kang, G.; Childers, D.K.; Liu, N.; Zhang, K.; Gao, G. Genome-wide association studies of rheumatoid arthritis data via multiple hypothesis testing methods for correlated tests. BMC Proc. 2009, 3, S38.
53. Uppu, S.; Krishna, A.; Gopalan, R. A review on methods for detecting SNP interactions in high-dimensional genomic data. IEEE/ACM Trans. Comput. Biol. Bioinform. 2016, 15, 599–612.
54. Miller, D.J.; Zhang, Y.; Yu, G.; Liu, Y.; Chen, L.; Langefeld, C.D.; Herrington, D.; Wang, Y. An algorithm for learning maximum entropy probability models of disease risk that efficiently searches and sparingly encodes multilocus genomic interactions. Bioinformatics 2009, 25, 2478–2485.
55. Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw. 1994, 5, 537–550.
56. Meyer, P.E.; Schretter, C.; Bontempi, G. Information-theoretic feature selection in microarray data using variable complementarity. IEEE J. Sel. Top. Signal Process. 2008, 2, 261–274.
57. Brown, G.; Pocock, A.; Zhao, M.J.; Luján, M. Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection. J. Mach. Learn. Res. 2012, 13, 27–66.
58. Riedmiller, M.; Braun, H. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA, USA, 28 March–1 April 1993; pp. 586–591.
59. Mieth, B.; Kloft, M.; Rodríguez, J.A.; Sonnenburg, S.; Vobruba, R.; Morcillo-Suárez, C.; Farré, X.; Marigorta, U.M.; Fehr, E.; Dickhaus, T.; et al. Combining multiple hypothesis testing with machine learning increases the statistical power of genome-wide association studies. Sci. Rep. 2016, 6, 36671.
60. Pahikkala, T.; Okser, S.; Airola, A.; Salakoski, T.; Aittokallio, T. Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations. Algorithms Mol. Biol. 2012, 7, 11.
Dataset | No. of Samples | No. of Excluded Samples | No. of Samples after Filtration |
---|---|---|---|
Bipolar disorder (BD) | 1998 | 129 | 1869 |
Coronary artery disease (CAD) | 1998 | 62 | 1936 |
Inflammatory bowel disease (IBD) | 2005 | 256 | 1749 |
Hypertension (HT) | 2001 | 48 | 1953 |
Rheumatoid arthritis (RA) | 1999 | 136 | 1863 |
Type 1 diabetes (T1D) | 2000 | 37 | 1963 |
Type 2 diabetes (T2D) | 1999 | 75 | 1924 |
UK National Blood Service (UKBS) | 1500 | 42 | 1458 |
1958 British Birth Cohort (58C) | 1504 | 24 | 1480 |
Hyperparameter | Description | Range |
---|---|---|
Activation function | Neuron’s activation function | ReLU, sigmoid, tanh |
Optimizer | Optimisation algorithm that updates the network weights during learning | RMSprop, Nadam, Adam, SGD |
Epochs | Number of passes over the training data | 50, 100, 200, 300 |
Learning Rate | Step size of the weight updates during learning | 0.001, 0.0001, 0.00001 |
No. of hidden nodes | No. of neurons in the hidden layer | 64, 128, 256, 512 |
Dropout | Fraction of nodes dropped out during training | 0.2, 0.4, 0.6 |
Mini batch size | Number of samples passed to the model per training step | 16, 32, 64, 100 |
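
The search space in the hyperparameter table above can be enumerated programmatically. The sketch below is illustrative only: the surviving text does not state whether an exhaustive grid or a sampled search was used, so both are shown, and the names `SEARCH_SPACE`, `iter_grid`, and `sample_configs` are hypothetical. The candidate values are copied from the table; the lowercase string labels are a convention of this sketch.

```python
import itertools
import random

# Candidate values copied from the hyperparameter table above.
# The dictionary keys are hypothetical names chosen for this sketch.
SEARCH_SPACE = {
    "activation": ["relu", "sigmoid", "tanh"],
    "optimizer": ["rmsprop", "nadam", "adam", "sgd"],
    "epochs": [50, 100, 200, 300],
    "learning_rate": [1e-3, 1e-4, 1e-5],
    "hidden_nodes": [64, 128, 256, 512],
    "dropout": [0.2, 0.4, 0.6],
    "batch_size": [16, 32, 64, 100],
}


def iter_grid(space):
    """Yield every combination in the space as a dict (exhaustive grid)."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def sample_configs(space, n, seed=0):
    """Draw n random configurations (with replacement) from the space."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n)
    ]


# 3 * 4 * 4 * 3 * 4 * 3 * 4 combinations in the full grid.
grid_size = sum(1 for _ in iter_grid(SEARCH_SPACE))
print(grid_size)  # 6912
```

With 6912 combinations in the full grid, and one model trained per combination for each disease, random sampling via `sample_configs` may be the more practical option when the training budget is limited.
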
Fold | BD | CAD | HT | IBD | RA | T1D | T2D |
---|---|---|---|---|---|---|---|
0.2 | 1991 | 2053 | 1767 | 1988 | 1758 | 1603 | 2224 |
0.4 | 1167 | 1099 | 1147 | 1183 | 1128 | 1121 | 1121 |
0.6 | 830 | 794 | 878 | 832 | 882 | 907 | 750 |
0.8 | 602 | 607 | 695 | 597 | 705 | 764 | 555 |
1 | 410 | 447 | 513 | 400 | 527 | 605 | 350 |
Disease | Accuracy | Sensitivity | Precision | F1-Score | MCC |
---|---|---|---|---|---|
BD | 0.839 | 0.812 | 0.882 | 0.846 | 0.697 |
CAD | 0.948 | 0.934 | 0.966 | 0.950 | 0.891 |
HT | 0.838 | 0.798 | 0.904 | 0.848 | 0.685 |
IBD | 0.796 | 0.847 | 0.726 | 0.782 | 0.606 |
RA | 0.885 | 0.884 | 0.886 | 0.885 | 0.764 |
T1D | 0.917 | 0.901 | 0.936 | 0.918 | 0.901 |
T2D | 0.846 | 0.857 | 0.831 | 0.844 | 0.696 |
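
The five metrics reported in the table above all follow from the counts of a binary confusion matrix. The sketch below uses the standard textbook definitions; the surviving text does not include the authors' own formulas, so this is an assumption about how the figures were obtained. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts.

    tp/fn: cases predicted as case/control; tn/fp: controls predicted as control/case.
    Ratios with a zero denominator are returned as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    pr_sum = precision + sensitivity
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0
    # Matthews correlation coefficient: uses all four cells of the matrix.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Made-up counts for illustration only (not data from the paper).
example = binary_metrics(tp=40, fp=10, tn=40, fn=10)
print({name: round(value, 3) for name, value in example.items()})
# {'accuracy': 0.8, 'sensitivity': 0.8, 'precision': 0.8, 'f1': 0.8, 'mcc': 0.6}
```

On a class-balanced test set with symmetric errors, as in this example, MCC equals 2 × accuracy − 1, which gives a quick plausibility check when reading the table.
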
Disease/Method | T1D | T2D | BD | IBD | CAD | RA | HT |
---|---|---|---|---|---|---|---|
Proposed Model | 0.92 | 0.85 | 0.84 | 0.79 | 0.94 | 0.89 | 0.84 |
BootRank [44] | 0.90 | 0.82 | 0.83 | 0.70 | 0.72 | 0.74 | 0.68 |
GWASRank [44] | 0.88 | 0.69 | 0.68 | 0.67 | 0.72 | 0.75 | 0.65 |
LO, AC [46] | 0.75 | 0.60 | 0.67 | 0.63 | 0.60 | 0.67 | 0.61 |
DeepCOMBI [49] | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 |
SVM [40] | 0.82 | - | - | - | - | 0.71 | - |
GWASelect [47] | 0.79 | - | - | - | - | - | - |
SVM, LR [32] | 0.89 | - | - | - | - | - | - |
Forward ROC [48] | - | - | - | - | - | 0.71 | - |
LR, SVM, RF, BN [34] | - | - | 0.56 | - | - | - | - |
Elastic-net [15] | - | - | - | 0.64 | - | - | - |
LR, AC, SVM [39] | - | - | - | - | 0.60 | - | - |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Alzoubi, H.; Alzubi, R.; Ramzan, N. Deep Learning Framework for Complex Disease Risk Prediction Using Genomic Variations. Sensors 2023, 23, 4439. https://doi.org/10.3390/s23094439