Machine Learning Applications in Bioinformatics and Biomedicine: 2nd Edition

A special issue of International Journal of Molecular Sciences (ISSN 1422-0067). This special issue belongs to the section "Molecular Informatics".

Deadline for manuscript submissions: closed (20 March 2025) | Viewed by 27827

Special Issue Information

Dear Colleagues,

Machine learning has been under development for over 40 years. In recent years, with the rapid accumulation of data in biology and medicine, it has been widely adopted in both fields. This Special Issue aims to provide a platform for publishing the latest cutting-edge work on applications of machine learning in biology and medicine, and to promote the development of related fields. It will focus on the development and application of computational methods and techniques for biological and medical data, with an emphasis on discovering disease markers. The subtopics include, but are not limited to, the following:

  • Identification of disease markers from the genome, transcriptome, proteome and metabolome;
  • Discovery of drug targets using machine learning;
  • Drug design based on machine learning;
  • Analysis of clinical data using machine learning;
  • Research on physical examination big data based on machine learning and artificial intelligence;
  • Prediction of drug side effects based on machine learning;
  • Discovery of epigenetic markers for disease using artificial intelligence;
  • Discovery of molecular network markers for disease diagnosis and therapy;
  • Early screening of diseases based on artificial intelligence.

Dr. Hao Lv
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. International Journal of Molecular Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. There is an Article Processing Charge (APC) for publication in this open access journal. For details about the APC please see here. Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • genome
  • transcriptome
  • proteome
  • metabolome
  • drug target
  • machine learning
  • prediction of drug side effects
  • epigenetics markers discovery for disease
  • molecular network marker
  • early screening of diseases

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (18 papers)

Research

15 pages, 28582 KiB  
Article
Exploring the Role of Circadian Rhythm-Related Genes in the Identification of Sepsis Subtypes and the Construction of Diagnostic Models Based on RNA-seq and scRNA-seq
by Xuesong Wang, Zhe Guo, Ziwen Wang, Xinrui Wang, Yuxiang Xia, Dishan Wu and Zhong Wang
Int. J. Mol. Sci. 2025, 26(9), 3993; https://doi.org/10.3390/ijms26093993 - 23 Apr 2025
Viewed by 134
Abstract
Sepsis is a severe systemic response to infection that may lead to the dysfunction of multiple organ systems and may even be life-threatening. Circadian rhythm-related genes (CRDRGs) regulate the circadian clock and affect many physiological processes, including immune responses. In patients with sepsis, circadian rhythms may be disrupted, which can in turn affect immune responses. RNA-seq datasets of sepsis and control groups were downloaded from the Gene Expression Omnibus (GEO) database, and two sepsis subtypes were identified based on differentially expressed CRDRGs. Two gene modules related to sepsis diagnosis and subtypes were obtained using the weighted gene co-expression network analysis (WGCNA) algorithm. Subsequently, using four machine learning algorithms (random forest, support vector machine, a generalized linear model, and XGBoost), genes related to sepsis diagnosis were identified from the intersection genes of the two modules, and a diagnostic model was constructed. Single-cell sequencing (scRNA-seq) data were obtained from the GEO database to explore the expression landscape of diagnostic-related genes in different cell types. Finally, an RT-qPCR analysis of diagnosis-related genes confirmed the differences in expression trends between the two groups. Multiple differentially expressed CRDRGs were observed in the sepsis and control groups, and two subtypes were identified based on their expression levels. There were apparent differences in the distribution of samples of the two subtypes in two-dimensional space and the pathways involved. Using multiple machine learning algorithms, the intersection genes in the two most relevant modules of the WGCNA were identified, and a robust diagnostic model was constructed with five genes (ARHGEF18, CHD3, PHC1, SFI1, and SPOCK2). The AUC of this model reached 0.987 on the validation set, showing excellent predictive performance. In this study, two sepsis subtypes were identified, and a sepsis diagnostic model was constructed via consensus clustering and machine learning algorithms. Five genes were identified as diagnostic markers for sepsis and can thus assist in clinical diagnosis and guide personalized treatment.

19 pages, 5835 KiB  
Article
Machine Learning Identification of Neutrophil Extracellular Trap-Related Genes as Potential Biomarkers and Therapeutic Targets for Bronchopulmonary Dysplasia
by Xuandong Zhang, Bingqian Yan, Zhou Jiang and Yujia Luo
Int. J. Mol. Sci. 2025, 26(7), 3230; https://doi.org/10.3390/ijms26073230 - 31 Mar 2025
Viewed by 441
Abstract
Neutrophil extracellular traps (NETs) play a key role in the development of bronchopulmonary dysplasia (BPD), yet the molecular mechanisms by which they contribute to BPD remain unexplored. Using the GSE32472 dataset, which includes 100 blood samples from postnatal day 28, we conducted comprehensive bioinformatics analyses to identify differentially expressed genes (DEGs) and construct gene modules. We identified 86 DEGs, which were enriched in immune and inflammatory pathways, including NET formation. Weighted gene co-expression network analysis (WGCNA) revealed a key gene module associated with BPD. By intersecting 69 NET-related genes (NRGs), 149 module genes, and 86 DEGs, we identified 12 differentially expressed NET-related genes (DENRGs). Immune infiltration analysis revealed an increase in neutrophils, dendritic cells, and macrophages in BPD patients. Machine learning models (LASSO, SVM-RFE, and RF) identified five upregulated biomarkers (MMP9, Siglec-5, DYSF, MGAM, and S100A12) with potential as diagnostic biomarkers for BPD. Validation using a nomogram, ROC curves, and qRT-PCR confirmed the diagnostic accuracy of these biomarkers. Clinical data analysis showed that Siglec-5 was most strongly correlated with BPD severity, while DYSF correlated with the grade of retinopathy of prematurity (ROP) and its laser treatment. Clustering analysis revealed two distinct BPD subtypes with different immune microenvironment profiles. Drug–gene interaction analysis identified potential inhibitors targeting MGAM and MMP9. In conclusion, the study identifies five NET-related biomarkers as reliable diagnostic tools for BPD; their upregulation and association with disease severity and complications such as ROP highlight their clinical relevance and potential for advancing BPD diagnostics and treatment.

15 pages, 1676 KiB  
Article
SST-ResNet: A Sequence and Structure Information Integration Model for Protein Property Prediction
by Guowei Zhou, Yanpeng Zhao, Song He and Xiaochen Bo
Int. J. Mol. Sci. 2025, 26(6), 2783; https://doi.org/10.3390/ijms26062783 - 19 Mar 2025
Viewed by 310
Abstract
Proteins are the basic building blocks of life and perform fundamental functions in biology. Predicting protein properties based on amino acid sequences and 3D structures has become a key approach to accelerating drug development. In this study, we propose a novel sequence- and structure-based framework, SST-ResNet, which consists of the multimodal language model ProSST and a multi-scale information integration module. This framework is designed to deeply explore the latent relationships between protein sequences and structures, thereby achieving superior synergistic prediction performance. Our method outperforms previous joint prediction models on Enzyme Commission (EC) numbers and Gene Ontology (GO) tasks. Furthermore, we demonstrate the necessity of multi-scale information integration for these two types of data and illustrate its exceptional performance on key tasks. We anticipate that this framework can be extended to a broader range of protein property prediction problems, ultimately facilitating drug development.

20 pages, 2065 KiB  
Article
Exploring Potential Medications for Alzheimer’s Disease with Psychosis by Integrating Drug Target Information into Deep Learning Models: A Data-Driven Approach
by Oshin Miranda, Chen Jiang, Xiguang Qi, Julia Kofler, Robert A. Sweet and Lirong Wang
Int. J. Mol. Sci. 2025, 26(4), 1617; https://doi.org/10.3390/ijms26041617 - 14 Feb 2025
Viewed by 861
Abstract
Approximately 50% of Alzheimer’s disease (AD) patients develop psychotic symptoms, leading to a subtype known as psychosis in AD (AD + P), which is associated with accelerated cognitive decline compared to AD without psychosis. Currently, no FDA-approved medication specifically addresses AD + P. This study aims to improve psychosis predictions and identify potential therapeutic agents using the DeepBiomarker deep learning model by incorporating drug–target interactions. Electronic health records from the University of Pittsburgh Medical Center were analyzed to predict psychosis within three months of AD diagnosis. AD + P patients were classified as those with either a formal psychosis diagnosis or antipsychotic prescriptions post-AD diagnosis. Two approaches were employed: (1) a drug-focused method using individual medications and (2) a target-focused method pooling medications by shared targets. The updated DeepBiomarker model achieved an area under the receiver operating characteristic curve (AUROC) above 0.90 for psychosis prediction. A drug-focused analysis identified gabapentin, amlodipine, levothyroxine, and others as potentially beneficial. A target-focused analysis highlighted significant proteins, including integrins, calcium channels, and tyrosine hydroxylase, confirming several medications linked to these targets. Integrating drug–target information into predictive models improves the identification of medications for AD + P risk reduction, offering a promising strategy for therapeutic development.

22 pages, 13927 KiB  
Article
Discovery of TRPV4-Targeting Small Molecules with Anti-Influenza Effects Through Machine Learning and Experimental Validation
by Yan Sun, Jiajing Wu, Beilei Shen, Hengzheng Yang, Huizi Cui, Weiwei Han, Rongbo Luo, Shijun Zhang, He Li, Bingshuo Qian, Lingjun Fan, Junkui Zhang, Tiecheng Wang, Xianzhu Xia, Fang Yan and Yuwei Gao
Int. J. Mol. Sci. 2025, 26(3), 1381; https://doi.org/10.3390/ijms26031381 - 6 Feb 2025
Viewed by 966
Abstract
Transient receptor potential vanilloid 4 (TRPV4) is a calcium-permeable cation channel critical for maintaining intracellular Ca²⁺ homeostasis and is essential in regulating immune responses, metabolic processes, and signal transduction. Recent studies have shown that TRPV4 activation enhances influenza A virus infection, promoting viral replication and transmission. However, there has been limited exploration of antiviral drugs targeting the TRPV4 channel. In this study, we developed the first machine learning model specifically designed to predict TRPV4 inhibitory small molecules, providing a novel approach for rapidly identifying repurposed drugs with potential antiviral effects. Our approach integrated machine learning, virtual screening, data analysis, and experimental validation to efficiently screen and evaluate candidate molecules. For high-throughput virtual screening, we employed computational methods to screen open-source molecular databases targeting the TRPV4 receptor protein. The virtual screening results were ranked based on predicted scores from our optimized model and binding energy, allowing us to prioritize potential inhibitors. Fifteen small-molecule drugs were selected for further in vitro and in vivo antiviral testing against influenza. Notably, glecaprevir and everolimus demonstrated significant inhibitory effects on the influenza virus, markedly improving survival rates in influenza-infected mice (protection rates of 80% and 100%, respectively). We also validated the mechanisms by which these drugs interact with the TRPV4 channel. In summary, our study presents the first predictive model for identifying TRPV4 inhibitors, underscoring TRPV4 inhibition as a promising strategy for antiviral drug development against influenza. This pioneering approach lays the groundwork for future clinical research targeting the TRPV4 channel in antiviral therapies.

15 pages, 4398 KiB  
Article
Elucidating the Mechanism of VVTT Infection Through Machine Learning and Transcriptome Analysis
by Zhili Chen, Yongxin Jiang, Jiazhen Cui, Wannan Li, Weiwei Han and Gang Liu
Int. J. Mol. Sci. 2025, 26(3), 1203; https://doi.org/10.3390/ijms26031203 - 30 Jan 2025
Viewed by 952
Abstract
The vaccinia virus (VV) is extensively utilized as a vaccine vector in the treatment of various infectious diseases, cardiovascular diseases, immunodeficiencies, and cancers. The vaccinia virus Tiantan strain (VVTT) has been instrumental as an irreplaceable vaccine strain in the eradication of smallpox in China; however, it still presents significant adverse toxic effects. After the WHO recommended that routine smallpox vaccination be discontinued, the Chinese government stopped the national smallpox vaccination program in 1981. The outbreak of monkeypox in 2022 has focused people’s attention on the Orthopoxvirus genus. However, there are limited reports on the safety and toxic side effects of VVTT. In this study, we employed a combination of transcriptomic analysis and machine learning-based feature selection to identify key genes implicated in the VVTT infection process. We utilized four machine learning algorithms for feature selection: random forest (RF), minimum redundancy maximum relevance (MRMR), eXtreme Gradient Boosting (XGB), and least absolute shrinkage and selection operator cross-validation (LASSOCV). Among these, XGB was found to be the most effective and was used for further screening, resulting in an optimal model with an area under the ROC curve of 0.98. Our analysis revealed the involvement of pathways such as spinocerebellar ataxia and the p53 signaling pathway. Additionally, we identified three critical targets during VVTT infection (ARC, JUNB, and EGR2) and further validated these targets using qPCR. Our research elucidates the mechanism by which VVTT infects cells, enhancing our understanding of the smallpox vaccine. This knowledge not only facilitates the development of new and more effective vaccines but also contributes to a deeper comprehension of viral pathogenesis. By advancing our understanding of the molecular mechanisms underlying VVTT infection, this study lays the foundation for the further development of VVTT. Such insights are crucial for strengthening global health security and ensuring a resilient response to future pandemics.

13 pages, 2569 KiB  
Article
Multi-Objective Optimization Accelerates the De Novo Design of Antimicrobial Peptide for Staphylococcus aureus
by Cheng-Hong Yang, Yi-Ling Chen, Tin-Ho Cheung and Li-Yeh Chuang
Int. J. Mol. Sci. 2024, 25(24), 13688; https://doi.org/10.3390/ijms252413688 - 21 Dec 2024
Viewed by 955
Abstract
Humans have long used antibiotics to fight bacteria, but increasing drug resistance has reduced their effectiveness. Antimicrobial peptides (AMPs) are a promising alternative with natural broad-spectrum activity against bacteria and viruses. However, their instability and hemolysis limit their medical use, making the design and improvement of AMPs a key research focus. Designing antimicrobial peptides with multiple desired properties using machine learning is still challenging, especially with limited data. This study utilized a multi-objective optimization method, the non-dominated sorting genetic algorithm II (NSGA-II), to enhance the physicochemical properties of peptide sequences and identify those with improved antimicrobial activity. Combining NSGA-II with neural networks, the approach efficiently identified promising AMP candidates and accurately predicted their antibacterial effectiveness. The method improves peptide stability by optimizing factors such as hydrophobicity, instability index, and aliphatic index. It offers a more efficient way to address the limitations of AMPs, paving the way for the development of safer and more effective antimicrobial treatments.
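
As a rough illustration of the multi-objective idea named in this abstract, the sketch below shows only the Pareto-dominance test and the first non-dominated front, which is the building block NSGA-II sorts on. It is not the authors' pipeline: the function names, the two objectives, and the candidate values are hypothetical, and both objectives are assumed to be minimized (the aliphatic index is negated so that higher values count as better).

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b on every objective and
    strictly better on at least one (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Return indices of points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical peptide candidates scored as
# (instability index, negated aliphatic index); values are made up.
candidates = [
    (30.0, -90.0),
    (25.0, -80.0),
    (35.0, -85.0),
    (40.0, -95.0),
]

front = pareto_front(candidates)
print(front)
```

A full NSGA-II would repeat this peeling step to rank the remaining candidates into successive fronts and add crowding-distance selection; this sketch stops at the first front.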

25 pages, 1859 KiB  
Article
AEmiGAP: AutoEncoder-Based miRNA–Gene Association Prediction Using Deep Learning Method
by Seungwon Yoon, Hyewon Yoon, Jaeeun Cho and Kyuchul Lee
Int. J. Mol. Sci. 2024, 25(23), 13075; https://doi.org/10.3390/ijms252313075 - 5 Dec 2024
Viewed by 928
Abstract
MicroRNAs (miRNAs) play a crucial role in gene regulation and are strongly linked to various diseases, including cancer. This study presents AEmiGAP, an advanced deep learning model that integrates autoencoders with long short-term memory (LSTM) networks to predict miRNA–gene associations. By enhancing feature extraction through autoencoders, AEmiGAP captures intricate, latent relationships between miRNAs and genes with unprecedented accuracy, outperforming all existing models in miRNA–gene association prediction. A thoroughly curated dataset of positive and negative miRNA–gene pairs was generated using distance-based filtering methods, significantly improving the model’s AUC and overall predictive accuracy. Additionally, this study presents two case studies to highlight AEmiGAP’s application: first, a list of the 30 miRNA–gene pairs with the highest predicted association scores among previously unknown pairs, and second, a list of the top 10 miRNAs strongly associated with each of five key oncogenes. These findings establish AEmiGAP as a new benchmark in miRNA–gene association prediction, with considerable potential to advance both cancer research and precision medicine.

12 pages, 6850 KiB  
Article
vScreenML v2.0: Improved Machine Learning Classification for Reducing False Positives in Structure-Based Virtual Screening
by Grigorii V. Andrianov, Emeline Haroldsen and John Karanicolas
Int. J. Mol. Sci. 2024, 25(22), 12350; https://doi.org/10.3390/ijms252212350 - 18 Nov 2024
Viewed by 1566
Abstract
The enthusiastic adoption of make-on-demand chemical libraries for virtual screening has highlighted the need for methods that deliver improved hit-finding discovery rates. Traditional virtual screening methods are often inaccurate, with most compounds nominated in a virtual screen not engaging the intended target protein to any detectable extent. Emerging machine learning approaches have made significant progress in this regard, including our previously described tool vScreenML. The broad adoption of vScreenML was hindered by its challenging usability and dependencies on certain obsolete or proprietary software packages. Here, we introduce vScreenML 2.0 to address each of these limitations with a streamlined Python implementation. Through careful benchmarks, we show that vScreenML 2.0 outperforms other widely used tools for virtual screening hit discovery.

29 pages, 4568 KiB  
Article
AI-Assisted Identification of Primary and Secondary Metabolomic Markers for Postoperative Delirium
by Vladimir A. Ivanisenko, Artem D. Rogachev, Aelita-Luiza A. Makarova, Nikita V. Basov, Evgeniy V. Gaisler, Irina N. Kuzmicheva, Pavel S. Demenkov, Artur S. Venzel, Timofey V. Ivanisenko, Evgenia A. Antropova, Nikolay A. Kolchanov, Victoria V. Plesko, Gleb B. Moroz, Vladimir V. Lomivorotov and Andrey G. Pokrovsky
Int. J. Mol. Sci. 2024, 25(21), 11847; https://doi.org/10.3390/ijms252111847 - 4 Nov 2024
Cited by 1 | Viewed by 1503
Abstract
Despite considerable investigative efforts, the molecular mechanisms of postoperative delirium (POD) remain unresolved. The present investigation employs innovative methodologies for identifying potential primary and secondary metabolic markers of POD by analyzing serum metabolomic profiles utilizing the genetic algorithm and artificial neural networks. The primary metabolomic markers constitute a combination of metabolites that optimally distinguish between POD and non-POD groups of patients. Our analysis revealed L-lactic acid, inositol, and methylcysteine as the most salient primary markers, with which the prediction of POD manifestation achieved an AUC of 99%. The secondary metabolomic markers represent metabolites that exhibit perturbed correlational patterns within the POD group. We identified 54 metabolites as the secondary markers of POD, incorporating neurotransmitters such as gamma-aminobutyric acid (GABA) and serotonin. These findings imply a systemic disruption in metabolic processes in patients with POD. The deployment of gene network reconstruction techniques facilitated the postulation of hypotheses describing the role of established genomic POD markers in the molecular-genetic mechanisms of metabolic pathway dysregulation involving the identified primary and secondary metabolomic markers. This study not only expands the understanding of POD pathogenesis but also introduces a novel technology for the bioinformatic analysis of metabolomic data that could aid in uncovering potential primary and secondary markers in diverse research domains.

27 pages, 7417 KiB  
Article
An Accurate and Efficient Approach to Knowledge Extraction from Scientific Publications Using Structured Ontology Models, Graph Neural Networks, and Large Language Models
by Timofey V. Ivanisenko, Pavel S. Demenkov and Vladimir A. Ivanisenko
Int. J. Mol. Sci. 2024, 25(21), 11811; https://doi.org/10.3390/ijms252111811 - 3 Nov 2024
Cited by 1 | Viewed by 3299
Abstract
The rapid growth of biomedical literature makes it challenging for researchers to stay current. Integrating knowledge from various sources is crucial for studying complex biological systems. Traditional text-mining methods often have limited accuracy because they do not capture semantic and contextual nuances. Deep-learning models can be computationally expensive and typically have low interpretability, though efforts in explainable AI aim to mitigate this. Furthermore, transformer-based models have a tendency to produce false or made-up information (a problem known as hallucination), which is especially prevalent in large language models (LLMs). This study proposes a hybrid approach combining text-mining techniques with graph neural networks (GNNs) and fine-tuned LLMs to extend biomedical knowledge graphs and interpret predicted edges based on published literature. An LLM is used to validate predictions and provide explanations. Evaluated on a corpus of experimentally confirmed protein interactions, the approach achieved a Matthews correlation coefficient (MCC) of 0.772. Applied to insomnia, the approach identified 25 interactions between 32 human proteins absent in known knowledge bases, including regulatory interactions between MAOA and 5-HT2C, binding between ADAM22 and 14-3-3 proteins, which is implicated in neurological diseases, and a circadian regulatory loop involving RORB and NR1D1. The hybrid GNN-LLM method efficiently analyzes biomedical literature to uncover potential molecular interactions for complex disorders. It can accelerate therapeutic target discovery by focusing expert verification on the most relevant automatically extracted information.
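
For readers unfamiliar with the metric reported in this abstract, the following minimal sketch computes the Matthews correlation coefficient from a binary confusion matrix using its standard definition. The function name and the confusion-matrix counts are hypothetical and are not taken from the paper; returning 0.0 when the denominator is zero is a common convention assumed here.

```python
import math


def matthews_corrcoef_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when any marginal sum is zero, so the ratio is undefined
    (a convention assumed for this sketch).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for a predicted-interaction classifier.
mcc = matthews_corrcoef_from_counts(tp=90, tn=85, fp=15, fn=10)
print(round(mcc, 3))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when positive and negative classes are imbalanced.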

14 pages, 2886 KiB  
Article
DeepIndel: An Interpretable Deep Learning Approach for Predicting CRISPR/Cas9-Mediated Editing Outcomes
by Guishan Zhang, Huanzeng Xie and Xianhua Dai
Int. J. Mol. Sci. 2024, 25(20), 10928; https://doi.org/10.3390/ijms252010928 - 11 Oct 2024
Viewed by 1457
Abstract
CRISPR/Cas9 has been applied to edit the genome of various organisms, but our understanding of editing outcomes at specific sites after Cas9-mediated DNA cleavage is still limited. Several deep learning-based methods have been proposed for repair outcome prediction; however, there is still room for improvement in terms of performance regarding frameshifts and model interpretability. Here, we present DeepIndel, an end-to-end multi-label regression model for predicting repair outcomes based on the BERT-base module. We demonstrate that our model outperforms existing methods in terms of accuracy and generalizability across various metrics. Furthermore, we utilized Deep SHAP to visualize the importance of nucleotides at various positions in the DNA sequence and found that mononucleotides and trinucleotides in DNA sequences surrounding the cut site play a significant role in repair outcome prediction.

14 pages, 1403 KiB  
Article
PROTA: A Robust Tool for Protamine Prediction Using a Hybrid Approach of Machine Learning and Deep Learning
by Jorge G. Farias, Lisandra Herrera-Belén, Luis Jimenez and Jorge F. Beltrán
Int. J. Mol. Sci. 2024, 25(19), 10267; https://doi.org/10.3390/ijms251910267 - 24 Sep 2024
Viewed by 1145
Abstract
Protamines play a critical role in DNA compaction and stabilization in sperm cells, significantly influencing male fertility and various biotechnological applications. Traditionally, identifying these proteins is a challenging and time-consuming process due to their species-specific variability and complexity. Leveraging advancements in computational biology, [...] Read more.
Protamines play a critical role in DNA compaction and stabilization in sperm cells, significantly influencing male fertility and various biotechnological applications. Traditionally, identifying these proteins has been a challenging and time-consuming process due to their species-specific variability and complexity. Leveraging advancements in computational biology, we present PROTA, a novel tool that combines machine learning (ML) and deep learning (DL) techniques to predict protamines with high accuracy. For the first time, we integrate Generative Adversarial Networks (GANs) with supervised learning methods to enhance the accuracy and generalizability of protamine prediction. Our methodology evaluated multiple ML models, including Light Gradient-Boosting Machine (LIGHTGBM), Multilayer Perceptron (MLP), Random Forest (RF), eXtreme Gradient Boosting (XGBOOST), k-Nearest Neighbors (KNN), Logistic Regression (LR), Naive Bayes (NB), and Radial Basis Function-Support Vector Machine (RBF-SVM). During ten-fold cross-validation on our training dataset, the MLP model with GAN-augmented data demonstrated superior performance metrics: 0.997 accuracy, 0.997 F1 score, 0.998 precision, 0.997 sensitivity, and 1.0 AUC. In the independent testing phase, this model achieved 0.999 accuracy, 0.999 F1 score, 1.0 precision, 0.999 sensitivity, and 1.0 AUC. These results establish PROTA, accessible via a user-friendly web application, as a robust tool for protamine prediction. We anticipate that PROTA will be a crucial resource for researchers, enabling the rapid and reliable prediction of protamines, thereby advancing our understanding of their roles in reproductive biology, biotechnology, and medicine.
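For readers comparing the figures quoted in this abstract, the following is a minimal sketch of how accuracy, precision, sensitivity, and F1 score are conventionally derived from confusion-matrix counts. It is not code from the PROTA paper; the function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, sensitivity (recall), and F1 from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + sensitivity
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }


# Hypothetical counts for illustration only; they are not taken from the paper.
example = classification_metrics(tp=90, fp=10, tn=80, fn=20)
print(example)
```

Because F1 combines precision and sensitivity, it can never exceed the larger of the two, which is a quick sanity check when reading reported metric sets.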

14 pages, 6340 KiB  
Article
Computational Insights into Reproductive Toxicity: Clustering, Mechanism Analysis, and Predictive Models
by Huizi Cui, Qizheng He, Wannan Li, Yuying Duan and Weiwei Han
Int. J. Mol. Sci. 2024, 25(14), 7978; https://doi.org/10.3390/ijms25147978 - 22 Jul 2024
Cited by 2 | Viewed by 1645
Abstract
Reproductive toxicity poses significant risks to fertility and progeny health, making its identification in pharmaceutical compounds crucial. In this study, we conducted a comprehensive in silico investigation of reproductively toxic molecules, identifying three distinct categories represented by Dimethylhydantoin, Phenol, and Dicyclohexyl phthalate. Our analysis included physicochemical properties, target prediction, and KEGG and GO pathway analyses, revealing diverse and complex mechanisms of toxicity. Given the complexity of these mechanisms, traditional molecule-target research approaches proved insufficient. Support Vector Machines (SVMs) combined with molecular descriptors achieved an accuracy of 0.85 on the test dataset, while our custom deep learning model, integrating molecular SMILES and graphs, achieved an accuracy of 0.88 on the test dataset. These models effectively predicted reproductive toxicity, highlighting the potential of computational methods in pharmaceutical safety evaluation. Our study provides a robust framework for utilizing computational methods to enhance the safety evaluation of potential pharmaceutical compounds.

15 pages, 11884 KiB  
Article
Novel AT2 Cell Subpopulations and Diagnostic Biomarkers in IPF: Integrating Machine Learning with Single-Cell Analysis
by Zhuoying Yang, Yanru Yang, Xin Han and Jiwei Hou
Int. J. Mol. Sci. 2024, 25(14), 7754; https://doi.org/10.3390/ijms25147754 - 15 Jul 2024
Cited by 2 | Viewed by 2274
Abstract
Idiopathic pulmonary fibrosis (IPF) is a long-term condition with an unidentified cause, and currently there are no specific treatment options available. Alveolar epithelial type II cells (AT2) constitute a heterogeneous population crucial for secretory and regenerative functions in the alveolus, essential for maintaining lung homeostasis. However, a comprehensive investigation into their cellular diversity, molecular features, and clinical implications is currently lacking. In this study, we conducted a comprehensive examination of single-cell RNA sequencing data from both normal and fibrotic lung tissues. We analyzed alterations in cellular composition between IPF and normal tissue and investigated differentially expressed genes across each cell population. This analysis revealed the presence of two distinct subpopulations of IPF-related alveolar epithelial type II cells (IR_AT2). Subsequently, three unique gene co-expression modules associated with the IR_AT2 subtype were identified through the use of hdWGCNA. Furthermore, we refined and identified IPF-related AT2-related gene (IARG) signatures using various machine learning algorithms. Our analysis demonstrated a significant association between high IARG scores in IPF patients and shorter survival times (p-value < 0.01). Additionally, we observed a negative correlation between the percent predicted diffusing capacity of the lung for carbon monoxide (% DLCO) and increased IARG scores (cor = −0.44, p-value < 0.05). The cross-validation findings demonstrated a high level of accuracy (AUC > 0.85, p-value < 0.01) in the prognostication of patients with IPF utilizing the identified IARG signatures. Our study has identified distinct molecular and biological features among AT2 subpopulations, specifically highlighting the unique characteristics of IPF-related AT2 cells. Importantly, our findings underscore the prognostic relevance of specific genes associated with IPF-related AT2 cells, offering valuable insights into the progression of IPF.

13 pages, 2603 KiB  
Article
Machine Learning Identifies Key Proteins in Primary Sclerosing Cholangitis Progression and Links High CCL24 to Cirrhosis
by Tom Snir, Raanan Greenman, Revital Aricha, Matthew Frankel, John Lawler, Francesca Saffioti, Massimo Pinzani, Douglas Thorburn, Adi Mor and Ilan Vaknin
Int. J. Mol. Sci. 2024, 25(11), 6042; https://doi.org/10.3390/ijms25116042 - 30 May 2024
Cited by 4 | Viewed by 2117
Abstract
Primary sclerosing cholangitis (PSC) is a rare, progressive disease, characterized by inflammation and fibrosis of the bile ducts, lacking reliable prognostic biomarkers for disease activity. Machine learning applied to broad proteomic profiling of sera allowed for the discovery of markers of disease presence, severity, and cirrhosis and the exploration of the involvement of CCL24, a chemokine with fibro-inflammatory activity. Sera from 30 healthy controls and 45 PSC patients were profiled with proximity extension assay, quantifying the expression of 2870 proteins, and used to train an elastic net model. Proteins that contributed most to the model were tested for correlation to enhanced liver fibrosis (ELF) score and used to perform pathway analysis. Statistical modeling for the presence of cirrhosis was performed with principal component analysis (PCA), and receiver operating characteristic (ROC) curves were used to assess the usability of potential biomarkers. The model successfully predicted the presence of PSC, where the top-ranked proteins were associated with cell adhesion, immune response, and inflammation, and each had an area under the receiver operating characteristic (AUROC) curve greater than 0.9 for disease presence and greater than 0.8 for ELF score. Pathway analysis showed enrichment for functions associated with PSC, overlapping with pathways enriched in patients with high levels of CCL24. Patients with cirrhosis showed higher levels of CCL24. This data-driven approach to characterize PSC and its severity highlights potential serum protein biomarkers and the importance of CCL24 in the disease, implying its therapeutic potential in PSC.
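As context for the AUROC thresholds quoted in this abstract, the sketch below computes AUROC for a single candidate biomarker using its rank-based definition: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. This is a generic illustration, not the authors' pipeline; the function name `auroc` and the toy values are hypothetical.

```python
def auroc(scores, labels):
    """AUROC as P(positive score > negative score), counting ties as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Compare every positive/negative pair; fine for small cohorts.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy protein levels and disease labels for illustration only (1 = disease, 0 = control).
toy_levels = [0.9, 0.2, 0.8, 0.1]
toy_labels = [1, 1, 0, 0]
print(auroc(toy_levels, toy_labels))
```

The pairwise comparison is quadratic in cohort size, which is acceptable for a cohort of tens of samples; a sort-based implementation would be preferable for larger datasets.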

20 pages, 868 KiB  
Article
Prediction of Protein–Protein Interactions Based on Integrating Deep Learning and Feature Fusion
by Hoai-Nhan Tran, Phuc-Xuan-Quynh Nguyen, Fei Guo and Jianxin Wang
Int. J. Mol. Sci. 2024, 25(11), 5820; https://doi.org/10.3390/ijms25115820 - 27 May 2024
Cited by 5 | Viewed by 2530
Abstract
Understanding protein–protein interactions (PPIs) helps to identify protein functions and develop other important applications such as drug preparation and protein–disease relationship identification. Deep-learning-based approaches are being intensely researched for PPI determination to reduce the cost and time of previous testing methods. In this work, we integrate deep learning with feature fusion, harnessing the strengths of both handcrafted features and protein sequence embeddings. The accuracies of the proposed model using five-fold cross-validation on Yeast core and Human datasets are 96.34% and 99.30%, respectively. In the task of predicting interactions in important PPI networks, our model correctly predicted all interactions in one-core, Wnt-related, and cancer-specific networks. The experimental results on cross-species datasets, including Caenorhabditis elegans, Helicobacter pylori, Homo sapiens, Mus musculus, and Escherichia coli, also show that our feature fusion method helps increase the generalization capability of the PPI prediction model.

Review

24 pages, 426 KiB  
Review
Alzheimer’s Disease: Exploring Pathophysiological Hypotheses and the Role of Machine Learning in Drug Discovery
by Jose Dominguez-Gortaire, Alejandra Ruiz, Ana Belen Porto-Pazos, Santiago Rodriguez-Yanez and Francisco Cedron
Int. J. Mol. Sci. 2025, 26(3), 1004; https://doi.org/10.3390/ijms26031004 - 24 Jan 2025
Viewed by 1600
Abstract
Alzheimer’s disease (AD) is a major neurodegenerative dementia, with its complex pathophysiology challenging current treatments. Recent advancements have shifted the focus from the traditionally dominant amyloid hypothesis toward a multifactorial understanding of the disease. Emerging evidence suggests that while amyloid-beta (Aβ) accumulation is central to AD, it may not be the primary driver but rather part of a broader pathogenic process. Novel hypotheses have been proposed, including the role of tau protein abnormalities, mitochondrial dysfunction, and chronic neuroinflammation. Additionally, the gut–brain axis and epigenetic modifications have gained attention as potential contributors to AD progression. The limitations of existing therapies underscore the need for innovative strategies. This study explores the integration of machine learning (ML) in drug discovery to accelerate the identification of novel targets and drug candidates. ML offers the ability to navigate AD’s complexity, enabling rapid analysis of extensive datasets and optimizing clinical trial design. The synergy between these themes presents a promising future for more effective AD treatments.