Deep Learning in Bioinformatics and Biological Data Analysis

A special issue of International Journal of Molecular Sciences (ISSN 1422-0067). This special issue belongs to the section "Molecular Informatics".

Deadline for manuscript submissions: closed (30 April 2024) | Viewed by 11300

Special Issue Editor


Guest Editor
College of Computer Science and Technology, Jilin University, Changchun 130000, China
Interests: deep learning; computer

Special Issue Information

Dear Colleagues,

Bioinformatics and biocomputing represent frontier, interdisciplinary subjects derived from the theories and methodologies of computer science, life science, and biology, and they play integral roles in research on the regulatory mechanisms of DNA, RNA, proteins, and other molecules. In recent years, significant progress has been achieved in the fields of medical science and health informatics. The resulting generation, collection, and accumulation of massive data demand in-depth analytics that traditional analytical methods can no longer provide. At the same time, algorithms in bioinformatics and biocomputing have been significantly improved thanks to the rapid development of deep learning, which includes but is not limited to convolutional neural networks, recurrent neural networks, autoencoders, and generative adversarial networks. Accordingly, applying deep learning in bioinformatics and biocomputing to gain insight from data has been emphasized in both academic and life science fields.

At present, due to the rapid development of biotechnology, the biological data generated in various research and application fields have increased exponentially, ranging from the molecular level (gene functions, protein interactions, metabolic pathways, etc.) through the biological tissue level (brain connectivity maps, X-ray images, magnetic resonance images, etc.) to the clinical level (intensive care units, electronic medical records, etc.). The speed of growth and heterogeneous structure of biological data make them much more challenging to handle using only conventional data analysis methods. Therefore, it is necessary to establish more powerful theoretical methods and practical tools for analyzing and extracting meaningful information from these complex biological data. Analyzing such complex and heterogeneous data is a typical complex system problem: we need to analyze the dependence, relationship, or interaction between different levels of data and their environment. Because the raw data exhibit nonlinearity, emergence, spontaneous order, adaptation, and feedback loops, modeling with traditional methods is difficult; only through deep learning can we solve these problems.

This Special Issue seeks to highlight the latest developments in applying advanced deep learning techniques in bioinformatics and large-scale biological data analysis. Both original research papers and review articles related to deep learning for big data analysis of DNA, RNA, proteins, and other types of molecules will be considered for publication.

Potential topics include but are not limited to:

  • Deep learning methods in genome sequencing and single cell sequencing
  • Deep learning methods in epigenomics and genomics regulatory analysis
  • Deep learning methods in multi-omics integration
  • Deep learning methods in protein structure prediction
  • Deep learning methods in molecule property prediction
  • Deep learning methods in protein–ligand binding prediction

Prof. Dr. Hao Zhang
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. International Journal of Molecular Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. There is an Article Processing Charge (APC) for publication in this open access journal. For details about the APC please see here. Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Published Papers (7 papers)


Research


20 pages, 5330 KiB  
Article
Comprehensive Research on Druggable Proteins: From PSSM to Pre-Trained Language Models
by Hongkang Chu and Taigang Liu
Int. J. Mol. Sci. 2024, 25(8), 4507; https://doi.org/10.3390/ijms25084507 - 19 Apr 2024
Viewed by 334
Abstract
Identification of druggable proteins can greatly reduce the cost of discovering new potential drugs. Traditional experimental approaches to exploring these proteins are often costly, slow, and labor-intensive, making them impractical for large-scale research. In response, recent decades have seen a rise in computational methods. These alternatives support drug discovery by creating advanced predictive models. In this study, we proposed a fast and precise classifier for the identification of druggable proteins using a protein language model (PLM) with fine-tuned evolutionary scale modeling 2 (ESM-2) embeddings, achieving 95.11% accuracy on the benchmark dataset. Furthermore, we made a careful comparison to examine the predictive abilities of ESM-2 embeddings and position-specific scoring matrix (PSSM) features by using the same classifiers. The results suggest that ESM-2 embeddings outperformed PSSM features in terms of accuracy and efficiency. Recognizing the potential of language models, we also developed an end-to-end model based on the generative pre-trained transformers 2 (GPT-2) with modifications. To our knowledge, this is the first time a large language model (LLM) GPT-2 has been deployed for the recognition of druggable proteins. Additionally, a more up-to-date dataset, known as Pharos, was adopted to further validate the performance of the proposed model. Full article
(This article belongs to the Special Issue Deep Learning in Bioinformatics and Biological Data Analysis)

14 pages, 2947 KiB  
Article
Attenphos: General Phosphorylation Site Prediction Model Based on Attention Mechanism
by Tao Song, Qing Yang, Peng Qu, Lian Qiao and Xun Wang
Int. J. Mol. Sci. 2024, 25(3), 1526; https://doi.org/10.3390/ijms25031526 - 26 Jan 2024
Viewed by 575
Abstract
Phosphorylation site prediction has important application value in the field of bioinformatics. It can act as an important reference and help with protein function research, protein structure research, and drug discovery. So, it is of great significance to propose scientific and effective calculation methods to accurately predict phosphorylation sites. In this study, we propose a new method, Attenphos, based on the self-attention mechanism for predicting general phosphorylation sites in proteins. The method not only captures the long-range dependence information of proteins but also better represents the correlation between amino acids through feature vector encoding transformation. Attenphos takes advantage of the one-dimensional convolutional layer to reduce the number of model parameters, improve model efficiency and prediction accuracy, and enhance model generalization. Comparisons between our method and existing state-of-the-art prediction tools were made using balanced datasets from human proteins and unbalanced datasets from mouse proteins. We performed prediction comparisons using independent test sets. The results showed that Attenphos demonstrated the best overall performance in the prediction of Serine (S), Threonine (T), and Tyrosine (Y) sites on both balanced and unbalanced datasets. Compared to current state-of-the-art methods, Attenphos has significantly higher prediction accuracy. This proves the potential of Attenphos in accelerating the identification and functional analysis of protein phosphorylation sites and provides new tools and ideas for biological research and drug discovery. Full article
(This article belongs to the Special Issue Deep Learning in Bioinformatics and Biological Data Analysis)

16 pages, 681 KiB  
Article
Enhancer Recognition: A Transformer Encoder-Based Method with WGAN-GP for Data Augmentation
by Tianyu Feng, Tao Hu, Wenyu Liu and Yang Zhang
Int. J. Mol. Sci. 2023, 24(24), 17548; https://doi.org/10.3390/ijms242417548 - 16 Dec 2023
Viewed by 741
Abstract
Enhancers are located upstream or downstream of key deoxyribonucleic acid (DNA) sequences in genes and can adjust the transcription activity of neighboring genes. Identifying enhancers and determining their functions are important for understanding gene regulatory networks and expression regulatory mechanisms. However, traditional enhancer recognition relies on manual feature engineering, which is time-consuming and labor-intensive, making it difficult to perform large-scale recognition analysis. In addition, if the original dataset is too small, there is a risk of overfitting. In recent years, emerging methods, such as deep learning, have provided new insights for enhancing identification. However, these methods also present certain challenges. Deep learning models typically require a large amount of high-quality data, and data acquisition demands considerable time and resources. To address these challenges, in this paper, we propose a data-augmentation method based on generative adversarial networks to solve the problem of small datasets. Moreover, we used regularization methods such as weight decay to improve the generalizability of the model and alleviate overfitting. The Transformer encoder was used as the main component to capture the complex relationships and dependencies in enhancer sequences. The encoding layer was designed based on the principle of k-mers to preserve more information from the original DNA sequence. Compared with existing methods, the proposed approach made significant progress in enhancing the accuracy and strength of enhancer identification and prediction, demonstrating the effectiveness of the proposed method. This paper provides valuable insights for enhancer analysis and is of great significance for understanding gene regulatory mechanisms and studying disease correlations. Full article
(This article belongs to the Special Issue Deep Learning in Bioinformatics and Biological Data Analysis)
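
As an illustration of the k-mer principle mentioned in the abstract above, the following is a minimal sketch of how a DNA sequence can be split into overlapping k-mers and mapped to integer ids before being passed to an embedding layer. It is not the authors' implementation: the function names (`kmer_tokens`, `build_vocab`, `encode`), the default `k = 3`, the stride of 1, and the handling of non-ACGT characters are all assumptions made for this example.

```python
from itertools import product


def kmer_tokens(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a sequence into overlapping k-mers (hypothetical helper)."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(k: int, alphabet: str = "ACGT") -> dict[str, int]:
    """Map every possible k-mer over the alphabet to an integer id."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def encode(sequence: str, k: int = 3) -> list[int]:
    """Encode a sequence as k-mer ids.

    Assumption: k-mers containing characters outside the alphabet
    (for example N) share one extra "unknown" id, len(vocab).
    """
    vocab = build_vocab(k)
    unknown_id = len(vocab)
    return [vocab.get(token, unknown_id) for token in kmer_tokens(sequence, k)]


if __name__ == "__main__":
    print(kmer_tokens("ACGTAC", k=3))
    print(encode("ACGTAC", k=3))
```

With `k = 3` the vocabulary has 4^3 = 64 entries plus the unknown id; larger `k` preserves more local context per token at the cost of a vocabulary that grows as 4^k.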

15 pages, 1864 KiB  
Article
Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction
by Yang Qu, Zitong Niu, Qiaojiao Ding, Taowa Zhao, Tong Kong, Bing Bai, Jianwei Ma, Yitian Zhao and Jianping Zheng
Int. J. Mol. Sci. 2023, 24(22), 16496; https://doi.org/10.3390/ijms242216496 - 18 Nov 2023
Cited by 1 | Viewed by 1013
Abstract
Machine learning has been increasingly utilized in the field of protein engineering, and research directed at predicting the effects of protein mutations has attracted increasing attention. Among them, so far, the best results have been achieved by related methods based on protein language models, which are trained on a large number of unlabeled protein sequences to capture the generally hidden evolutionary rules in protein sequences, and are therefore able to predict their fitness from protein sequences. Although numerous similar models and methods have been successfully employed in practical protein engineering processes, the majority of the studies have been limited to how to construct more complex language models to capture richer protein sequence feature information and utilize this feature information for unsupervised protein fitness prediction. There remains considerable untapped potential in these developed models, such as whether the prediction performance can be further improved by integrating different models to further improve the accuracy of prediction. Furthermore, how to utilize large-scale models for prediction methods of mutational effects on quantifiable properties of proteins due to the nonlinear relationship between protein fitness and the quantification of specific functionalities has yet to be explored thoroughly. In this study, we propose an ensemble learning approach for predicting mutational effects of proteins integrating protein sequence features extracted from multiple large protein language models, as well as evolutionarily coupled features extracted in homologous sequences, while comparing the differences between linear regression and deep learning models in mapping these features to quantifiable functional changes. We tested our approach on a dataset of 17 protein deep mutation scans and indicated that the integrated approach together with linear regression enables the models to have higher prediction accuracy and generalization. Moreover, we further illustrated the reliability of the integrated approach by exploring the differences in the predictive performance of the models across species and protein sequence lengths, as well as by visualizing clustering of ensemble and non-ensemble features. Full article
(This article belongs to the Special Issue Deep Learning in Bioinformatics and Biological Data Analysis)

14 pages, 1267 KiB  
Article
Effective Local and Secondary Protein Structure Prediction by Combining a Neural Network-Based Approach with Extensive Feature Design and Selection without Reliance on Evolutionary Information
by Yury V. Milchevskiy, Vladislava Y. Milchevskaya, Alexei M. Nikitin and Yury V. Kravatsky
Int. J. Mol. Sci. 2023, 24(21), 15656; https://doi.org/10.3390/ijms242115656 - 27 Oct 2023
Viewed by 916
Abstract
Protein structure prediction continues to pose multiple challenges despite outstanding progress that is largely attributable to the use of novel machine learning techniques. One of the widely used representations of local 3D structure—protein blocks (PBs)—can be treated in a similar way to secondary structure classes. Here, we present a new approach for predicting local conformation in terms of PB classes solely from amino acid sequences. We apply the RMSD metric to ensure unambiguous future 3D protein structure recovery. The selection of statistically assessed features is a key component of the proposed method. We suggest that ML input features should be created from the statistically significant predictors that are derived from the amino acids’ physicochemical properties and the resolved structures’ statistics. The statistical significance of the suggested features was assessed using a stepwise regression analysis that permitted the evaluation of the contribution and statistical significance of each predictor. We used the set of 380 statistically significant predictors as a learning model for the regression neural network that was trained using the PISCES30 dataset. When using the same dataset and metrics for benchmarking, our method outperformed all other methods reported in the literature for the CB513 nonredundant dataset (for the PBs, Q16 = 81.01%, and for the DSSP, Q3 = 85.99% and Q8 = 79.35%). Full article
(This article belongs to the Special Issue Deep Learning in Bioinformatics and Biological Data Analysis)
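
The Q3, Q8, and Q16 figures quoted in the abstract above are per-residue accuracies over 3, 8, and 16 classes, respectively. The sketch below shows how such a score is typically computed from a predicted and an observed state string. It is an illustration only, not the authors' evaluation code: the function names are hypothetical, and the 8-to-3 state reduction used here (H, G, I to H; E, B to E; everything else to C) is one common convention that is assumed for the example, since the abstract does not state which reduction was applied.

```python
# Assumed 8-state to 3-state reduction; other conventions exist.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
}


def reduce_to_three_states(states: str) -> str:
    """Collapse 8-state DSSP labels to helix (H), strand (E), coil (C)."""
    return "".join(EIGHT_TO_THREE.get(s, "C") for s in states)


def q_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted class matches the observed one.

    The same function yields Q3, Q8, or Q16 depending on the label
    alphabet of the two input strings.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


if __name__ == "__main__":
    observed_8 = "HHHGEEBT"
    predicted_8 = "HHHHEEET"
    print(q_accuracy(predicted_8, observed_8))  # 8-state accuracy
    print(q_accuracy(reduce_to_three_states(predicted_8),
                     reduce_to_three_states(observed_8)))  # 3-state accuracy
```

Because several 8-state labels collapse into one 3-state class, a Q3 score is never lower than the Q8 score for the same prediction, which is consistent with the ordering of the values reported in the abstract.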

17 pages, 2231 KiB  
Article
MSGNN-DTA: Multi-Scale Topological Feature Fusion Based on Graph Neural Networks for Drug–Target Binding Affinity Prediction
by Shudong Wang, Xuanmo Song, Yuanyuan Zhang, Kuijie Zhang, Yingye Liu, Chuanru Ren and Shanchen Pang
Int. J. Mol. Sci. 2023, 24(9), 8326; https://doi.org/10.3390/ijms24098326 - 05 May 2023
Cited by 2 | Viewed by 1995
Abstract
The accurate prediction of drug–target binding affinity (DTA) is an essential step in drug discovery and drug repositioning. Although deep learning methods have been widely adopted for DTA prediction, the complexity of extracting drug and target protein features hampers the accuracy of these predictions. In this study, we propose a novel model for DTA prediction named MSGNN-DTA, which leverages a fused multi-scale topological feature approach based on graph neural networks (GNNs). To address the challenge of accurately extracting drug and target protein features, we introduce a gated skip-connection mechanism during the feature learning process to fuse multi-scale topological features, resulting in information-rich representations of drugs and proteins. Our approach constructs drug atom graphs, motif graphs, and weighted protein graphs to fully extract topological information and provide a comprehensive understanding of underlying molecular interactions from multiple perspectives. Experimental results on two benchmark datasets demonstrate that MSGNN-DTA outperforms the state-of-the-art models in all evaluation metrics, showcasing the effectiveness of the proposed approach. Moreover, the study conducts a case study based on already FDA-approved drugs in the DrugBank dataset to highlight the potential of the MSGNN-DTA framework in identifying drug candidates for specific targets, which could accelerate the process of virtual screening and drug repositioning. Full article
(This article belongs to the Special Issue Deep Learning in Bioinformatics and Biological Data Analysis)

Review


35 pages, 624 KiB  
Review
Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models
by Tianwei Yue, Yuanxin Wang, Longxiang Zhang, Chunming Gu, Haoru Xue, Wenping Wang, Qi Lyu and Yujie Dun
Int. J. Mol. Sci. 2023, 24(21), 15858; https://doi.org/10.3390/ijms242115858 - 01 Nov 2023
Cited by 2 | Viewed by 4589
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning, since we expect a superhuman intelligence that explores beyond our knowledge to interpret the genome from deep learning. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics. Full article
(This article belongs to the Special Issue Deep Learning in Bioinformatics and Biological Data Analysis)