miGAP: miRNA–Gene Association Prediction Method Based on Deep Learning Model
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset Construction
2.2. Embedding Method
2.3. Negative Dataset
2.4. Deep Learning Model
- Begin by multiplying the previous cell state value (Ct−1) by the output of the forget gate (ft), which determines what information is to be discarded.
- Multiply the candidate cell state value (C̃t) by the output of the input gate (it). This operation gauges how much new information should be stored.
- Add the two products to obtain the current cell state value: Ct = ft × Ct−1 + it × C̃t. This step identifies the information to be retained.
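The cell state update described in the steps above can be illustrated with a minimal pure-Python sketch of the standard LSTM formulation [13]. The function name `update_cell_state`, the candidate state `c_tilde`, and the toy gate pre-activations are assumptions made for illustration; they are not taken from the miGAP implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used for the forget and input gates."""
    return 1.0 / (1.0 + math.exp(-x))


def update_cell_state(c_prev, f_t, i_t, c_tilde):
    """Element-wise C_t = f_t * C_{t-1} + i_t * C~_t (standard LSTM)."""
    kept = [f * c for f, c in zip(f_t, c_prev)]        # information kept from C_{t-1}
    written = [i * g for i, g in zip(i_t, c_tilde)]    # new information admitted
    return [k + w for k, w in zip(kept, written)]


# Toy values (hypothetical): three cell-state units.
c_prev = [1.0, -2.0, 0.5]
f_t = [sigmoid(10.0), sigmoid(-10.0), sigmoid(0.0)]    # keep, forget, half-keep
i_t = [sigmoid(-10.0), sigmoid(10.0), sigmoid(0.0)]    # block, admit, half-admit
c_tilde = [math.tanh(3.0), math.tanh(-3.0), math.tanh(0.0)]

c_t = update_cell_state(c_prev, f_t, i_t, c_tilde)
```

In this toy case the first unit keeps almost all of its previous value, the second is almost entirely overwritten by its candidate value, and the third keeps half of its previous value.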
3. Results
3.1. Dataset
3.2. LSTM and Bi-LSTM Experiment
3.3. K-Fold Cross-Validation and Model Performance
3.4. Validation of the Model
4. Discussion
4.1. Dataset
4.2. Comparison with Other Related Studies
4.3. Future Work
5. Conclusions
- Extensive training data: our model was trained on what is, to our knowledge, the largest sequence dataset constructed for miRNA–gene associations (358,864 positive and 358,864 negative pairs).
- Optimal data embedding: we vectorized miRNA and gene sequence features using an embedding technique based on a model originally designed for protein sequence embedding (protein2Vec [32]).
- Logical negative data construction: by considering Euclidean distance, cosine similarity, and Mahalanobis distance, we defined a set of criteria that allowed for the logical construction of negative data.
- Optimized model architecture: drawing from the above data, we designed an effective miRNA–gene LSTM deep learning model.
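As a minimal sketch of the three measures named above, the following pure-Python functions compute the Euclidean distance, cosine similarity, and Mahalanobis distance between two embedding vectors. The two-dimensional toy vectors, the function names, and the covariance estimate are assumptions for illustration only; the thresholds and the way the three measures are combined into a filtering criterion are not stated in this section and are not reproduced here.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def covariance_2d(points):
    """Sample covariance matrix of 2-D points (toy case only)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def mahalanobis_2d(u, v, cov):
    """sqrt((u - v)^T S^-1 (u - v)) for a 2x2 covariance matrix S."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = u[0] - v[0], u[1] - v[1]
    q = (dx * (inv[0][0] * dx + inv[0][1] * dy)
         + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(q)


# Hypothetical 2-D embeddings used only to exercise the functions.
sample = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
cov = covariance_2d(sample)
d_euc = euclidean((0.0, 0.0), (2.0, 0.0))
s_cos = cosine_similarity((1.0, 1.0), (2.0, 2.0))
d_mah = mahalanobis_2d((0.0, 0.0), (2.0, 0.0), cov)
```

Unlike the Euclidean distance, the Mahalanobis distance rescales each direction by the spread of the data, so the same pair of vectors can be near or far depending on the covariance of the embedding space.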
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Cai, Y.; Yu, X.; Hu, S.; Yu, J. A brief review on the mechanisms of miRNA regulation. Genom. Proteom. Bioinform. 2009, 7, 147–154.
2. Fu, L.; Peng, Q. A deep ensemble model to predict miRNA-disease association. Sci. Rep. 2017, 7, 14482.
3. Huang, L.; Zhang, L.; Chen, X. Updated review of advances in micrornas and complex diseases: Towards systematic evaluation of computational models. Brief. Bioinform. 2022, 23, bbac407.
4. Xie, W.; Luo, J.; Pan, C.; Liu, Y. SG-LSTM-FRAME: A computational frame using sequence and geometrical information via LSTM to predict miRNA–gene associations. Brief. Bioinform. 2021, 22, 2032–2042.
5. Deepthi, K.; Jereesh, A.; Liu, Y. A deep learning ensemble approach to prioritize antiviral drugs against novel coronavirus SARS-CoV-2 for COVID-19 drug repurposing. Appl. Soft Comput. 2021, 113, 107945.
6. Chou, C.-H.; Shrestha, S.; Yang, C.-D.; Chang, N.-W.; Lin, Y.-L.; Liao, K.-W.; Huang, W.-C.; Sun, T.-H.; Tu, S.-J.; Lee, W.-H. miRTarBase update 2018: A resource for experimentally validated microRNA-target interactions. Nucleic Acids Res. 2018, 46, D296–D302.
7. Griffiths-Jones, S.; Grocock, R.J.; Van Dongen, S.; Bateman, A.; Enright, A.J. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 2006, 34, D140–D144.
8. Durinck, S.; Spellman, P.; Birney, E.; Huber, W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc. 2009, 4, 1184–1191.
9. Danielsson, P.-E. Euclidean distance mapping. Comput. Graph. Image Process. 1980, 14, 227–248.
10. Rahutomo, F.; Kitasuka, T.; Aritsugi, M. Semantic cosine similarity. In Proceedings of the 7th International Student Conference on Advanced Science and Technology ICAST, Seoul, Republic of Korea, 29–30 October 2012; Volume 4, p. 1.
11. Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1188–1196.
12. Ahmed, N.K.; Rossi, R.; Lee, J.B.; Willke, T.L.; Zhou, R.; Kong, X.; Eldardiry, H. Learning role-based graph embeddings. arXiv 2018, arXiv:1802.02896.
13. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
14. Gu, T.; Zhao, X.; Barbazuk, W.B.; Lee, J.-H. miTAR: A hybrid deep learning-based approach for predicting miRNA targets. BMC Bioinform. 2021, 22, 96.
15. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551.
16. Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681.
17. Wen, M.; Cong, P.; Zhang, Z.; Lu, H.; Li, T. DeepMirTar: A deep-learning approach for predicting human miRNA targets. Bioinformatics 2018, 34, 3781–3787.
18. Pla, A.; Zhong, X.; Rayner, S. miRAW: A deep learning-based approach to predict microRNA targets by analyzing whole microRNA transcripts. PLoS Comput. Biol. 2018, 14, e1006185.
19. Xie, W.; Zheng, Z.; Zhang, W.; Huang, L.; Lin, Q.; Wong, K.-C. SRG-vote: Predicting miRNA-gene relationships via embedding and LSTM ensemble. IEEE J. Biomed. Health Inform. 2022, 26, 4335–4344.
20. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
21. Grover, A.; Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 855–864.
22. Graves, A.; Jaitly, N.; Mohamed, A.-r. Hybrid speech recognition with deep bidirectional LSTM. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, 8–12 December 2013; pp. 273–278.
23. Liu, Y.; Luo, J.; Ding, P. Inferring microRNA targets based on restricted Boltzmann machines. IEEE J. Biomed. Health Inform. 2018, 23, 427–436.
24. Oughtred, R.; Stark, C.; Breitkreutz, B.-J.; Rust, J.; Boucher, L.; Chang, C.; Kolas, N.; O’Donnell, L.; Leung, G.; McAdam, R. The BioGRID interaction database: 2019 update. Nucleic Acids Res. 2019, 47, D529–D541.
25. Geer, L.Y.; Marchler-Bauer, A.; Geer, R.C.; Han, L.; He, J.; He, S.; Liu, C.; Shi, W.; Bryant, S.H. The NCBI biosystems database. Nucleic Acids Res. 2010, 38, D492–D496.
26. Lee, B.; Baek, J.; Park, S.; Yoon, S. deepTarget: End-to-end learning framework for microRNA target prediction using deep recurrent neural networks. In Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, Seattle, WA, USA, 2–5 October 2016; pp. 434–442.
27. Medsker, L.R.; Jain, L.C. Recurrent Neural Networks: Design and Applications; CRC Press: Boca Raton, FL, USA; London, UK; New York, NY, USA; Washington, DC, USA, 2001.
28. Xiao, F.; Zuo, Z.; Cai, G.; Kang, S.; Gao, X.; Li, T. miRecords: An integrated resource for microRNA–target interactions. Nucleic Acids Res. 2009, 37, D105–D110.
29. Yang, S.; Wang, Y.; Lin, Y.; Shao, D.; He, K.; Huang, L. LncMirNet: Predicting LncRNA–miRNA interaction based on deep learning of ribonucleic acid sequences. Molecules 2020, 25, 4372.
30. Harrow, J.; Frankish, A.; Gonzalez, J.M.; Tapanari, E.; Diekhans, M.; Kokocinski, F.; Aken, B.L.; Barrell, D.; Zadissa, A.; Searle, S. GENCODE: The reference human genome annotation for The ENCODE Project. Genome Res. 2012, 22, 1760–1774.
31. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
32. Yang, K.K.; Wu, Z.; Bedbrook, C.N.; Arnold, F.H. Learned protein embeddings for machine learning. Bioinformatics 2018, 34, 2642–2648.
33. Fang, Y.; Pan, X.; Shen, H.-B. Recent Deep Learning Methodology Development for RNA–RNA Interaction Prediction. Symmetry 2022, 14, 1302.
34. De Maesschalck, R.; Jouan-Rimbaud, D.; Massart, D. The Mahalanobis Distance, Chemometrics and Intelligent Laboratory Systems; Elsevier: Amsterdam, The Netherlands, 2000.
35. Ahn, H. Performance Evaluation of a Feature-Importance-based Feature Selection Method for Time Series Prediction. J. Inf. Commun. Converg. Eng. 2023, 21, 82–89.
36. Agarwal, V.; Bell, G.W.; Nam, J.-W.; Bartel, D.P. Predicting effective microRNA target sites in mammalian mRNAs. eLife 2015, 4, e05005.
37. Sticht, C.; De La Torre, C.; Parveen, A.; Gretz, N. miRWalk: An online resource for prediction of microRNA binding sites. PLoS ONE 2018, 13, e0206239.
38. Mavaddat, N.; Peock, S.; Frost, D.; Ellis, S.; Platte, R.; Fineberg, E.; Evans, D.G.; Izatt, L.; Eeles, R.A.; Adlard, J. Cancer risks for BRCA1 and BRCA2 mutation carriers: Results from prospective analysis of EMBRACE. JNCI J. Natl. Cancer Inst. 2013, 105, 812–822.
39. Barshir, R.; Fishilevich, S.; Iny-Stein, T.; Zelig, O.; Mazor, Y.; Guan-Golan, Y.; Safran, M.; Lancet, D. GeneCaRNA: A comprehensive gene-centric database of human non-coding RNAs in the GeneCards suite. J. Mol. Biol. 2021, 433, 166913.
miRNA Association Object | Model Name | Dataset: miRNA Sequence Data | Dataset: Object Sequence Data | Dataset: miRNA–Object Relation Data | Dataset: Positive | Dataset: Negative | Negative Data Method | Embedding Method | Computational Model | Results
---|---|---|---|---|---|---|---|---|---|---
Gene | SG-LSTM-FRAME [4] | miRBase [7] | biomaRt [8] | miRTarBase [6] | 15,540 | 15,540 | Euclidean, Cosine Similarity | Doc2Vec [11], Role2Vec [12] | LSTM | 0.9363 (AUC)
Gene | SRG-Vote [19] | miRBase | biomaRt | miRTarBase | 141,824 | 141,824 | Euclidean, Cosine Similarity | Doc2Vec, Role2Vec, GCN | LSTM, Bi-LSTM | 0.95 (AUC)
Gene | miTAR [14] | miRBase | - | DeepMirTar [17], miRAW [18] | 3908 | 3898 | miRanda | Embedding Layer | CNN, Bi-RNN | 95.5 (Accuracy)
mRNA | DeepTarget [26] | miRBase | - | miRecords [28] | 3398 | 3640 | Fisher–Yates Shuffle, miRanda | Dense Vector | AE | 0.9641 (Accuracy)
mRNA | DeepMirTar [17] | miRBase | UCSC Genome Browser | mirMark, CLASH | 3915 | 3905 | miRanda | One-Hot Encoding | AE | 0.9793 (AUC)
mRNA | IMTRBM [23] | miRBase | NCBI [25] | miRTarBase, BioGRID [24] | 8420 | - | - | - | RBM | -
lncRNA | LncMirNet [29] | miRBase | GENCODE [30] | lncRNASNP2 | 15,386 | 15,386 | Knuth–Durstenfeld Shuffle | K-mer, CTD, Doc2Vec, Role2Vec | CNN | 0.9381 (AUC)
Data | Count of Data Elements | Database
---|---|---
miRNA–Gene Associations | 358,864 | miRTarBase [6]
miRNA | 2656 | miRBase [7]
Gene | 14,319 | biomaRt [8]
Step | Count of Data Elements | Dataset
---|---|---
All Combined (miRNA × Gene) | 38,031,264 | -
Minus Positive Data | 37,672,400 | Unknown Data Pool
After Three Distance-Based Filtering | 4,932,554 | Negative Candidate Pool
Randomly Selected (Same Count as Positive Data) | 358,864 | Negative Data
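The counts in the table above follow from simple arithmetic on the numbers of miRNAs, genes, and positive pairs. The short check below reproduces them; the variable names are illustrative, and the size of the negative candidate pool is taken from the table as reported, not derived.

```python
# Counts reported in the dataset tables.
n_mirna = 2656         # miRNAs from miRBase
n_gene = 14319         # genes from biomaRt
n_positive = 358864    # validated miRNA-gene pairs from miRTarBase

# Every possible miRNA x gene combination.
all_pairs = n_mirna * n_gene

# Pairs with no known association: the unknown data pool.
unknown_pool = all_pairs - n_positive

# Size after the three distance-based filters (reported value, not derived).
candidate_pool = 4932554

# Share of the unknown pool that survives filtering.
retained_fraction = candidate_pool / unknown_pool

# Negatives are drawn at random from the candidate pool, matching the positives.
n_negative = n_positive
total_training_pairs = n_positive + n_negative
```

Drawing as many negatives as there are positives gives a balanced set of 717,728 pairs for training and evaluation.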
Purpose | Data Name | Number of Data Elements
---|---|---
Model Training and Evaluation | Positive Data | 358,864
Model Training and Evaluation | Negative Data | 358,864
Model Training and Evaluation | Total Data | 717,728
Case Study and Negative Data Generation | Unknown Data Pool | 37,672,400
Case Study and Negative Data Generation | Negative Candidate Pool | 4,932,554
Data Components | miRNA | 2656
Data Components | Gene | 14,319
Metric | LSTM | Bi-LSTM
---|---|---
AUC | 0.98 | 0.936
Fold | Precision | Recall | Accuracy | F1 Score | AUC
---|---|---|---|---|---
Fold-1 | 0.949 | 0.924 | 0.94 | 0.936 | 0.9819
Fold-2 | 0.925 | 0.933 | 0.93 | 0.929 | 0.9807
Fold-3 | 0.937 | 0.938 | 0.94 | 0.937 | 0.9834
Fold-4 | 0.915 | 0.925 | 0.92 | 0.92 | 0.9732
Fold-5 | 0.964 | 0.919 | 0.94 | 0.941 | 0.9826
Average | 0.938 | 0.9278 | 0.934 | 0.9326 | 0.9804
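The "Average" row is the arithmetic mean of the five folds. As a small check, the sketch below recomputes it from the per-fold values in the table; the dictionary layout and variable names are illustrative only.

```python
from statistics import mean

# Per-fold metrics copied from the 5-fold cross-validation table.
folds = {
    "precision": [0.949, 0.925, 0.937, 0.915, 0.964],
    "recall":    [0.924, 0.933, 0.938, 0.925, 0.919],
    "accuracy":  [0.94, 0.93, 0.94, 0.92, 0.94],
    "f1":        [0.936, 0.929, 0.937, 0.92, 0.941],
    "auc":       [0.9819, 0.9807, 0.9834, 0.9732, 0.9826],
}

# Mean of each metric across the five folds, rounded to four decimals.
averages = {name: round(mean(values), 4) for name, values in folds.items()}

# Fold with the highest AUC (1-based index).
best_auc_fold = folds["auc"].index(max(folds["auc"])) + 1
```

The highest single-fold AUC (Fold-3) matches the 0.9834 reported for miGAP in the comparison table, while the cross-fold mean is 0.9804.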
Rank | miRNA | Gene | Support |
---|---|---|---|
1 | hsa-miR-3913-5p | ARL13B | |
2 | hsa-miR-3913-5p | RALGAPA1 | TargetScan [36] |
3 | hsa-miR-3913-5p | OTOGL | |
4 | hsa-miR-32-5p | NNAT | |
5 | hsa-miR-3913-5p | PNISR | |
6 | hsa-miR-5697 | FOXO6 | |
6 | hsa-miR-5697 | GLRA3 | TargetScan |
7 | hsa-miR-3913-5p | CPED1 | |
8 | hsa-miR-5196-3p | EREG | TargetScan |
9 | hsa-miR-5196-3p | ZNF419 | TargetScan |
10 | hsa-miR-3913-5p | ZNF644 | |
11 | hsa-miR-3913-5p | ZSWIM6 | |
12 | hsa-miR-6893-3p | STARD8 | TargetScan, miRWalk [37] |
13 | hsa-miR-5196-3p | JADE1 | |
14 | hsa-miR-5697 | TFEC | |
15 | hsa-miR-548c-3p | UBE2D4 | TargetScan |
16 | hsa-miR-548c-3p | REPS2 | TargetScan |
17 | hsa-miR-5697 | BMPR2 | |
18 | hsa-miR-5697 | CDKN2B | |
19 | hsa-miR-3913-5p | KAT2B | |
20 | hsa-miR-3913-5p | ARL13B |
Rank | miRNA | Gene | Support |
---|---|---|---|
1 | hsa-miR-503-5p | BRCA2 | |
2 | hsa-miR-6499-3p | BRCA2 | |
3 | hsa-miR-125a-3p | BRCA2 | |
4 | hsa-miR-665 | BRCA2 | |
5 | hsa-miR-27b-3p | BRCA2 | |
6 | hsa-miR-4442 | BRCA2 | |
7 | hsa-miR-6081 | BRCA2 | |
8 | hsa-miR-4478 | BRCA2 | |
9 | hsa-let-7e-5p | BRCA2 | TargetScan |
10 | hsa-miR-6836-5p | BRCA2 |
Negative Set Method | Three Distance-Based Filtered Negative | Random Negative |
---|---|---|
AUC | 0.98 | 0.91 |
miRNA Association Object | Model Name | Dataset: miRNA Sequence Data | Dataset: Object Sequence Data | Dataset: miRNA–Object Relation Data | Dataset: Positive | Dataset: Negative | Negative Data Method | Embedding Method | Computational Model | Results
---|---|---|---|---|---|---|---|---|---|---
Gene | SG-LSTM-FRAME [4] | miRBase [7] | biomaRt [8] | miRTarBase [6] | 15,540 | 15,540 | Euclidean, Cosine Similarity | Doc2Vec [11], Role2Vec [12] | LSTM | 0.9363 (AUC)
Gene | SRG-Vote [19] | miRBase | biomaRt | miRTarBase | 141,824 | 141,824 | Euclidean, Cosine Similarity | Doc2Vec, Role2Vec, GCN | LSTM, Bi-LSTM | 0.95 (AUC)
Gene | miTAR [14] | miRBase | - | DeepMirTar, miRAW [18] | 3908 | 3898 | miRanda | Embedding Layer | CNN, Bi-RNN | 95.5 (Accuracy)
Gene | miGAP [Ours] | miRBase | biomaRt | miRTarBase | 358,864 | 358,864 | Euclidean, Cosine Similarity, Mahalanobis | protein2Vec [32] | LSTM | 0.9834 (AUC)
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Yoon, S.; Hwang, I.; Cho, J.; Yoon, H.; Lee, K. miGAP: miRNA–Gene Association Prediction Method Based on Deep Learning Model. Appl. Sci. 2023, 13, 12349. https://doi.org/10.3390/app132212349