Harnessing Unsupervised Ensemble Learning for Biomedical Applications: A Review of Methods and Advances
Abstract
1. Introduction
2. A Brief Overview of Binary Classification
3. Reframing Bioinformatics Tasks as Binary Classification Challenges
3.1. Differential Expression Calling
3.2. Network Inference
3.3. Somatic Mutation Calling
3.4. Other Common Bioinformatics Tasks as Binary Classification Problems
4. Introduction to Ensemble Learning
5. Discussion and Conclusions
Funding
Data Availability Statement
Conflicts of Interest
References
Aspect | AUROC (Area Under the Receiver Operating Characteristic Curve) | PRAUC (Area Under the Precision–Recall Curve) |
---|---|---|
Definition | Measures the ability of the model to distinguish between classes by plotting True Positive rate against False Positive rate at various thresholds. | Measures the trade-off between precision and recall across different thresholds, emphasizing performance on the positive class. |
Use Cases | Best suited for balanced datasets. | More informative for imbalanced datasets where the positive class is rare. |
Advantages | Provides a single metric summarizing the model’s ability to discriminate across thresholds; widely used and well-understood theoretical interpretation. | Focuses on the positive class, making it more relevant for imbalanced datasets; highlights performance in scenarios where False Negatives matter. |
Limitations | Can overestimate performance in imbalanced datasets due to the inclusion of True Negatives in the calculation. | Does not account for True Negatives, making it less suitable for balanced datasets or when the performance of the negative class is critical. |
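
The contrast in the table can be made concrete with a small calculation. The following is a minimal sketch in plain Python, not a reference implementation: the function names `auroc` and `average_precision`, the toy scores, and the use of average precision as the estimate of the area under the precision–recall curve are all assumptions made for illustration.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count pairwise wins; a tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    true_positives = 0
    total = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_positives += 1
            total += true_positives / k  # precision at the rank of each positive
    return total / n_pos


# Hypothetical imbalanced example: 20 items, only 2 positives.
scores = list(range(20, 0, -1))   # 20, 19, ..., 1 (already sorted, highest first)
labels = [0] * 20
labels[1] = 1                     # a positive ranked 2nd
labels[3] = 1                     # a positive ranked 4th

print(f"AUROC             = {auroc(labels, scores):.3f}")              # 0.917
print(f"Average precision = {average_precision(labels, scores):.3f}")  # 0.500
```

With 2 positives among 20 items, the ranking looks strong by AUROC (about 0.92), yet the precision at each recovered positive is only 0.5. This is the behavior the Limitations row describes: the many true negatives raise AUROC but do not enter the precision–recall calculation.
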
Aspect | Supervised Ensemble Learning | Unsupervised Ensemble Learning | Semi-Supervised Ensemble Learning |
---|---|---|---|
Type of Data Used | Labeled data | Unlabeled data | Combination of labeled and unlabeled data |
Label Requirement | Requires fully labeled data for training | Does not require labeled data | Requires a small amount of labeled data along with unlabeled samples |
Examples of Methods | Bagging, Boosting, Stacking | Clustering ensembles, Dimensionality Reduction Ensembles | Self-training, Co-training, Graph-based methods |
Advantages | High accuracy with sufficient labeled data; easy evaluation | The only option when no labeled data are available; useful when labels are of low quality; less prone to overfitting | Balances the strengths of supervised and unsupervised methods; reduces label dependency |
Challenges | Labeled data are expensive and time-consuming to obtain; easily biased by low-quality labels; risk of overfitting | Existing methods fail when classifiers are conditionally correlated and are sensitive to initial model parameters | Effective use of unlabeled data is challenging; requires careful tuning |
Bioinformatics Example Use Cases | Predicting disease phenotypes, Disease Prognosis Prediction | Inferring interaction networks, gene network inference, somatic mutation calling | Inferring gene/protein functions, Drug Repurposing, Integrative Multi-Omics Analysis |
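
To illustrate what combining predictors without labels can look like, the sketch below averages the ranks that several base classifiers assign to each item. It is only one simple label-free aggregation rule, chosen here for brevity, and is not presented as the method of any particular paper; the function names and the three score lists are hypothetical.

```python
def ranks(scores):
    """Rank items so the highest score gets rank 1 (ties broken by position)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, item in enumerate(order, start=1):
        result[item] = position
    return result


def mean_rank_ensemble(score_lists):
    """Average each item's rank across classifiers; no ground-truth labels are used."""
    all_ranks = [ranks(s) for s in score_lists]
    n_items = len(score_lists[0])
    return [sum(r[i] for r in all_ranks) / len(all_ranks) for i in range(n_items)]


# Hypothetical confidence scores from three base classifiers for five items
# (for example, five candidate gene-gene edges or five candidate variants).
classifier_a = [0.9, 0.2, 0.8, 0.1, 0.4]
classifier_b = [0.7, 0.6, 0.9, 0.2, 0.1]
classifier_c = [0.8, 0.1, 0.3, 0.2, 0.6]

mean_ranks = mean_rank_ensemble([classifier_a, classifier_b, classifier_c])

# Items ordered from most to least likely positive according to the ensemble.
consensus_order = sorted(range(len(mean_ranks)), key=lambda i: mean_ranks[i])
print(consensus_order)  # [0, 2, 4, 1, 3]
```

Working on ranks rather than raw scores makes classifiers with different output scales comparable. This unweighted average treats every base classifier as equally reliable, which is a simplifying assumption; weighted schemes would replace the plain mean.
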
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ahsen, M.E. Harnessing Unsupervised Ensemble Learning for Biomedical Applications: A Review of Methods and Advances. Mathematics 2025, 13, 420. https://doi.org/10.3390/math13030420