International Journal of Molecular Sciences

Editorial

3 pages, 459 KB

Open Access Editorial

Editorial of Special Issue “Deep Learning and Machine Learning in Bioinformatics”

by Mingon Kang and Jung Hun Oh

Int. J. Mol. Sci. 2022, 23(12), 6610; https://doi.org/10.3390/ijms23126610 - 14 Jun 2022

Cited by 5 | Viewed by 3509

Abstract

In recent years, deep learning has emerged as a highly active research field, achieving great success in various machine learning areas, including image processing, speech recognition, and natural language processing, and is now rapidly becoming a dominant tool in biomedicine [...]

(This article belongs to the Special Issue Deep Learning and Machine Learning in Bioinformatics)

Research

16 pages, 19879 KB

Open Access Article

Generative Adversarial Networks for Creating Synthetic Nucleic Acid Sequences of Cat Genome

by Debapriya Hazra, Mi-Ryung Kim and Yung-Cheol Byun

Int. J. Mol. Sci. 2022, 23(7), 3701; https://doi.org/10.3390/ijms23073701 - 28 Mar 2022

Cited by 25 | Viewed by 4156

Abstract

Nucleic acids are the basic units of deoxyribonucleic acid (DNA) sequencing. Every organism demonstrates different DNA sequences with specific nucleotides. It reveals the genetic information carried by a particular DNA segment. Nucleic acid sequencing expresses the evolutionary changes among organisms and revolutionizes disease diagnosis in animals. This paper proposes a generative adversarial network (GAN) model to create synthetic nucleic acid sequences of the cat genome tuned to exhibit specific desired properties. We obtained the raw sequence data from Illumina next-generation sequencing. Various data preprocessing steps were performed using the Cutadapt and DADA2 tools. The processed data were fed to the GAN model, which was designed following the architecture of Wasserstein GAN with gradient penalty (WGAN-GP). We introduced a predictor and an evaluator in our proposed GAN model to tune the synthetic sequences to acquire certain realistic properties. The predictor was built for extracting samples with a promoter sequence, and the evaluator was built for filtering samples that scored high for motif-matching. The filtered samples were then passed to the discriminator. We evaluated our model based on multiple metrics and demonstrated outputs for latent interpolation, latent complementation, and motif-matching. Evaluation results showed our proposed GAN model achieved 93.7% correlation with the original data and produced significant outcomes as compared to existing models for sequence generation.

14 pages, 2657 KB

Open Access Article

BioS2Net: Holistic Structural and Sequential Analysis of Biomolecules Using a Deep Neural Network

by Albert Roethel, Piotr Biliński and Takao Ishikawa

Int. J. Mol. Sci. 2022, 23(6), 2966; https://doi.org/10.3390/ijms23062966 - 9 Mar 2022

Cited by 3 | Viewed by 4194

Abstract

Background: For decades, the rate of solving new biomolecular structures has been exceeding that at which their manual classification and feature characterisation can be carried out efficiently. Therefore, a new comprehensive and holistic tool for their examination is needed. Methods: Here we propose the Biological Sequence and Structure Network (BioS2Net), which is a novel deep neural network architecture that extracts both sequential and structural information of biomolecules. Our architecture consists of four main parts: (i) a sequence convolutional extractor, (ii) a 3D structure extractor, (iii) a 3D structure-aware sequence temporal network, as well as (iv) a fusion and classification network. Results: We have evaluated our approach using two protein fold classification datasets. BioS2Net achieved a 95.4% mean class accuracy on the eDD dataset and a 76% mean class accuracy on the F184 dataset. The accuracy of BioS2Net obtained on the eDD dataset was comparable to results achieved by previously published methods, confirming that the algorithm described in this article is a top-class solution for protein fold recognition. Conclusions: BioS2Net is a novel tool for the holistic examination of biomolecules of known structure and sequence. It is a reliable tool for protein analysis and their unified representation as feature vectors.

10 pages, 1278 KB

Open Access Communication

Deep-4mCGP: A Deep Learning Approach to Predict 4mC Sites in Geobacter pickeringii by Using Correlation-Based Feature Selection Technique

by Hasan Zulfiqar, Qin-Lai Huang, Hao Lv, Zi-Jie Sun, Fu-Ying Dao and Hao Lin

Int. J. Mol. Sci. 2022, 23(3), 1251; https://doi.org/10.3390/ijms23031251 - 23 Jan 2022

Cited by 32 | Viewed by 4263

Abstract

4mC is a type of DNA alteration that has the ability to synchronize multiple biological movements, for example, DNA replication, gene expression, and transcriptional regulation. Accurate prediction of 4mC sites can provide exact information about their hereditary functions. The purpose of this study was to establish a robust deep learning model to recognize 4mC sites in Geobacter pickeringii. In the proposed model, two kinds of feature descriptors, namely binary and k-mer composition, were used to encode the DNA sequences of Geobacter pickeringii. The features obtained from their fusion were optimized by using a correlation and gradient-boosting decision tree (GBDT)-based algorithm with the incremental feature selection (IFS) method. Then, these optimized features were fed into a 1D convolutional neural network (CNN) to classify 4mC sites from non-4mC sites in Geobacter pickeringii. The performance of the proposed model on independent data exhibited an accuracy of 0.868, which was 4.2% higher than the existing model.
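
The two descriptor types named in this abstract, binary and k-mer composition, can be illustrated with a minimal sketch. The abstract does not give the value of k, the normalization, or how the two vectors are fused, so the function names, the default k = 2, and the simple concatenation below are assumptions made for illustration rather than the authors' implementation.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def binary_encode(seq):
    """Binary (one-hot) encoding: each base becomes a 4-element indicator vector."""
    return [1 if base == n else 0 for base in seq for n in NUCLEOTIDES]


def kmer_composition(seq, k=2):
    """Frequency of every possible k-mer, listed in lexicographic order."""
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of sliding windows
    if total <= 0:
        return [0.0] * len(kmers)
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[window] += 1
    return [counts[m] / total for m in kmers]


def fused_features(seq, k=2):
    """Assumed fusion: plain concatenation of the two descriptor vectors."""
    return binary_encode(seq) + kmer_composition(seq, k)
```

For a sequence of length L, this yields 4L binary features plus 4^k composition features; a feature-selection step such as the one described in the abstract would then operate on that fused vector.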

13 pages, 3106 KB

Open Access Article

Pan-Cancer Prediction of Cell-Line Drug Sensitivity Using Network-Based Methods

by Maryam Pouryahya, Jung Hun Oh, James C. Mathews, Zehor Belkhatir, Caroline Moosmüller, Joseph O. Deasy and Allen R. Tannenbaum

Int. J. Mol. Sci. 2022, 23(3), 1074; https://doi.org/10.3390/ijms23031074 - 19 Jan 2022

Cited by 18 | Viewed by 5067

Abstract

The development of reliable predictive models for individual cancer cell lines to identify an optimal cancer drug is a crucial step to accelerate personalized medicine, but vast differences in cancer cell lines and drug characteristics make it quite challenging to develop predictive models that result in high predictive power and explain the similarity of cell lines or drugs. Our study proposes a novel network-based methodology that breaks the problem into smaller, more interpretable problems to improve the predictive power of anti-cancer drug responses in cell lines. For the drug-sensitivity study, we used the GDSC database for 915 cell lines and 200 drugs. The theory of optimal mass transport was first used to separately cluster cell lines and drugs, using gene-expression profiles and extensive cheminformatic drug features, represented in the form of data networks. To predict cell-line specific drug responses, random forest regression modeling was separately performed for each cell-line drug cluster pair. Post-modeling biological analysis was further performed to identify potential biological correlates associated with drug responses. The network-based clustering method resulted in 30 distinct cell-line drug cluster pairs. Predictive modeling on each cell-line-drug cluster outperformed alternative computational methods in predicting drug responses. We found that among the four drugs top-ranked with respect to prediction performance, three targeted the PI3K/mTOR signaling pathway. Predictive modeling on clustered subsets of cell lines and drugs improved the prediction accuracy of cell-line specific drug responses. Post-modeling analysis identified plausible biological processes associated with drug responses.

20 pages, 3544 KB

Open Access Article

DSResSol: A Sequence-Based Solubility Predictor Created with Dilated Squeeze Excitation Residual Networks

by Mohammad Madani, Kaixiang Lin and Anna Tarakanova

Int. J. Mol. Sci. 2021, 22(24), 13555; https://doi.org/10.3390/ijms222413555 - 17 Dec 2021

Cited by 44 | Viewed by 6109

Abstract

Protein solubility is an important thermodynamic parameter that is critical for the characterization of a protein’s function, and a key determinant for the production yield of a protein in both the research setting and within industrial (e.g., pharmaceutical) applications. Experimental approaches to predict protein solubility are costly, time-consuming, and frequently offer only low success rates. To reduce cost and expedite the development of therapeutic and industrially relevant proteins, a highly accurate computational tool for predicting protein solubility from protein sequence is sought. While a number of in silico prediction tools exist, they suffer from relatively low prediction accuracy, bias toward the soluble proteins, and limited applicability for various classes of proteins. In this study, we developed a novel deep learning sequence-based solubility predictor, DSResSol, that takes advantage of the integration of squeeze excitation residual networks with dilated convolutional neural networks and outperforms all existing protein solubility prediction models. This model captures the frequently occurring amino acid k-mers and their local and global interactions and highlights the importance of identifying long-range interaction information between amino acid k-mers to achieve improved accuracy, using only protein sequence as input. DSResSol outperforms all available sequence-based solubility predictors by at least 5% in terms of accuracy when evaluated on two different independent test sets. Compared to existing predictors, DSResSol not only reduces prediction bias for insoluble proteins but also predicts soluble proteins within the test sets with an accuracy that is at least 13% higher than existing models. We derive the key amino acids, dipeptides, and tripeptides contributing to protein solubility, identifying glutamic acid and serine as critical amino acids for protein solubility prediction. Overall, DSResSol can be used for the fast, reliable, and inexpensive prediction of a protein’s solubility to guide experimental design.
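
The link this abstract draws between dilated convolutions and long-range interactions can be made concrete with a small sketch. It shows a generic unpadded 1D dilated convolution and the receptive field of a stack of such layers; the kernel sizes and dilation rates are illustrative assumptions, since the abstract does not state the values used in DSResSol.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Unpadded 1D dilated convolution (cross-correlation form).

    Kernel taps are spaced `dilation` positions apart, so a short kernel
    can relate sequence positions that are far from each other.
    """
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [
        sum(kernel[j] * signal[i + j * dilation] for j in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Input positions seen by one output of stacked stride-1 dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

With a kernel of size 3 and assumed dilation rates of 1, 2, and 4, three stacked layers already cover 15 consecutive positions, which is how dilation widens context without adding parameters.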

12 pages, 4226 KB

Open Access Article

Sparsely Connected Autoencoders: A Multi-Purpose Tool for Single Cell omics Analysis

by Luca Alessandri, Maria Luisa Ratto, Sandro Gepiro Contaldo, Marco Beccuti, Francesca Cordero, Maddalena Arigoni and Raffaele A. Calogero

Int. J. Mol. Sci. 2021, 22(23), 12755; https://doi.org/10.3390/ijms222312755 - 25 Nov 2021

Cited by 15 | Viewed by 4089

Abstract

Background: Biological processes are based on complex networks of cells and molecules. Single cell multi-omics is a new tool aiming to provide new insights into the complex network of events controlling the functionality of the cell. Methods: Since single cell technologies provide many sample measurements, they are the ideal environment for the application of Deep Learning and Machine Learning approaches. An autoencoder is composed of an encoder and a decoder sub-model. An autoencoder is a very powerful tool in data compression and noise removal. However, the decoder model remains a black box from which it is impossible to depict the contribution of the single input elements. We have recently developed a new class of autoencoders, called Sparsely Connected Autoencoders (SCA), which have the advantage of providing a controlled association among the input layer and the decoder module. This new architecture has the benefit that the decoder model is not a black box anymore and can be used to depict new biologically interesting features from single cell data. Results: Here, we show that the SCA hidden layer can grab new information usually hidden in single cell data, such as providing clustering on meta-features that are difficult (e.g., transcription factor expression) or technically impossible (e.g., miRNA expression) to depict in single cell RNAseq data. Furthermore, SCA representation of cell clusters has the advantage of simulating a conventional bulk RNAseq, which is a data transformation allowing the identification of similarity among independent experiments. Conclusions: In our opinion, SCA represents the bioinformatics version of a universal “Swiss-knife” for the extraction of hidden knowledgeable features from single cell omics data.

15 pages, 2726 KB

Open Access Article

A Deep Learning Approach with Data Augmentation to Predict Novel Spider Neurotoxic Peptides

by Byungjo Lee, Min Kyoung Shin, In-Wook Hwang, Junghyun Jung, Yu Jeong Shim, Go Woon Kim, Seung Tae Kim, Wonhee Jang and Jung-Suk Sung

Int. J. Mol. Sci. 2021, 22(22), 12291; https://doi.org/10.3390/ijms222212291 - 13 Nov 2021

Cited by 22 | Viewed by 4676

Abstract

As major components of spider venoms, neurotoxic peptides exhibit structural diversity, target specificity, and have great pharmaceutical potential. Deep learning may be an alternative to the laborious and time-consuming methods for identifying these peptides. However, the major hurdle in developing a deep learning model is the limited data on neurotoxic peptides. Here, we present a peptide data augmentation method that improves the recognition of neurotoxic peptides via a convolutional neural network model. The neurotoxic peptides were augmented with the known neurotoxic peptides from the UniProt database, and the models were trained using a training set with or without the generated sequences to verify the augmented data. The model trained with the augmented dataset outperformed the one trained with the unaugmented dataset, achieving an accuracy of 0.9953, precision of 0.9922, recall of 0.9984, and F1 score of 0.9953 on the simulation dataset. From the set of all RNA transcripts of the spider Callobius koreanus, we discovered neurotoxic peptides via the model, resulting in 275 putative peptides, of which 252 were novel sequences and only 23 showed homology with known peptides by Basic Local Alignment Search Tool. Among these 275 peptides, four were selected and shown to have neuromodulatory effects on the human neuroblastoma cell line SH-SY5Y. The augmentation method presented here may be applied to the identification of other functional peptides from biological resources with insufficient data.
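
The four metrics reported in this abstract follow from the confusion-matrix counts of a binary classifier. The sketch below uses the standard definitions; the function names and the toy labels in the usage note are illustrative assumptions and are not taken from the paper.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


def f1_from(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```

As a consistency check, `f1_from(0.9922, 0.9984)` rounds to 0.9953, which matches the F1 score quoted in the abstract.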

13 pages, 2779 KB

Open Access Article

Multi-Run Concrete Autoencoder to Identify Prognostic lncRNAs for 12 Cancers

by Abdullah Al Mamun, Raihanul Bari Tanvir, Masrur Sobhan, Kalai Mathee, Giri Narasimhan, Gregory E. Holt and Ananda Mohan Mondal

Int. J. Mol. Sci. 2021, 22(21), 11919; https://doi.org/10.3390/ijms222111919 - 3 Nov 2021

Cited by 15 | Viewed by 3861

Abstract

Background: Long non-coding RNA plays a vital role in changing the expression profiles of various target genes that lead to cancer development. Thus, identifying prognostic lncRNAs related to different cancers might help in developing cancer therapy. Method: To discover the critical lncRNAs that can identify the origin of different cancers, we propose the use of the state-of-the-art deep learning algorithm concrete autoencoder (CAE) in an unsupervised setting, which efficiently identifies a subset of the most informative features. However, CAE does not identify reproducible features in different runs due to its stochastic nature. We thus propose a multi-run CAE (mrCAE) to identify a stable set of features to address this issue. The assumption is that a feature appearing in multiple runs carries more meaningful information about the data under consideration. The genome-wide lncRNA expression profiles of 12 different types of cancers, with a total of 4768 samples available in The Cancer Genome Atlas (TCGA), were analyzed to discover the key lncRNAs. The lncRNAs identified by multiple runs of CAE were added to a final list of key lncRNAs that are capable of identifying 12 different cancers. Results: Our results showed that mrCAE performs better in feature selection than single-run CAE, standard autoencoder (AE), and other state-of-the-art feature selection techniques. This study revealed a set of 128 top-ranking lncRNAs that could identify the origin of 12 different cancers with an accuracy of 95%. Survival analysis showed that 76 of the 128 lncRNAs have the prognostic capability to differentiate high- and low-risk groups of patients with different cancers. Conclusion: The proposed mrCAE, which selects actual features, outperformed the AE even though the AE selects latent or pseudo-features. By selecting actual features instead of pseudo-features, mrCAE can be valuable for precision medicine. The identified prognostic lncRNAs can be further studied to develop therapies for different cancers.
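
The multi-run idea in this abstract, keeping features that recur across stochastic runs, can be sketched as a simple frequency filter. The abstract does not state the exact aggregation rule, so the `min_runs` threshold, the function name, and the placeholder feature names below are assumptions for illustration only.

```python
from collections import Counter


def stable_features(runs, min_runs):
    """Return features selected in at least `min_runs` independent runs.

    `runs` is a list of per-run selections (any iterables of feature names).
    Each feature is counted at most once per run.
    """
    counts = Counter(f for selected in runs for f in set(selected))
    return sorted(f for f, c in counts.items() if c >= min_runs)


# Hypothetical selections from three runs of a stochastic selector.
example_runs = [
    ["lnc_a", "lnc_b", "lnc_c"],
    ["lnc_b", "lnc_c", "lnc_d"],
    ["lnc_c", "lnc_e"],
]
```

Here `stable_features(example_runs, 2)` keeps `lnc_b` and `lnc_c`, while raising the threshold to 3 keeps only `lnc_c`; a stricter threshold trades list size for reproducibility.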

17 pages, 2055 KB

Open Access Article

Machine Learning Assisted Approach for Finding Novel High Activity Agonists of Human Ectopic Olfactory Receptors

by Amara Jabeen, Claire A. de March, Hiroaki Matsunami and Shoba Ranganathan

Int. J. Mol. Sci. 2021, 22(21), 11546; https://doi.org/10.3390/ijms222111546 - 26 Oct 2021

Cited by 16 | Viewed by 5512

Abstract

Olfactory receptors (ORs) constitute the largest superfamily of G protein-coupled receptors (GPCRs). ORs are involved in sensing odorants as well as in other ectopic roles in non-nasal tissues. Matching of an enormous number of the olfactory stimulation repertoire to its counterpart OR through machine learning (ML) will enable understanding of the olfactory system, receptor characterization, and exploitation of their therapeutic potential. In the current study, we have selected two broadly tuned ectopic human OR proteins, OR1A1 and OR2W1, for expanding their known chemical space by using molecular descriptors. We present a scheme for selecting the optimal features required to train an ML-based model, based on which we selected the random forest (RF) as the best performer. High activity agonist prediction involved screening five databases comprising ~23 M compounds, using the trained RF classifier. To evaluate the effectiveness of the machine learning-based virtual screening and check receptor binding site compatibility, we used docking of the top target ligands to carefully develop receptor model structures. Finally, experimental validation of selected compounds with significant docking scores through in vitro assays revealed two high activity novel agonists for OR1A1 and one for OR2W1.

15 pages, 3033 KB

Open Access Article

iBitter-Fuse: A Novel Sequence-Based Bitter Peptide Predictor by Fusing Multi-View Features

by Phasit Charoenkwan, Chanin Nantasenamat, Md. Mehedi Hasan, Mohammad Ali Moni, Pietro Lio’ and Watshara Shoombuatong

Int. J. Mol. Sci. 2021, 22(16), 8958; https://doi.org/10.3390/ijms22168958 - 19 Aug 2021

Cited by 51 | Viewed by 5468

Abstract

Accurate identification of bitter peptides is of great importance for better understanding their biochemical and biophysical properties. To date, machine learning-based methods have become effective approaches for providing a good avenue for identifying potential bitter peptides from large-scale protein datasets. Although a few machine learning-based predictors have been developed for identifying the bitterness of peptides, their prediction performances could be improved. In this study, we developed a new predictor (named iBitter-Fuse) for achieving more accurate identification of bitter peptides. In the proposed iBitter-Fuse, we have integrated a variety of feature encoding schemes for providing sufficient information from different aspects, namely compositional information and physicochemical properties. To enhance the predictive performance, the customized genetic algorithm utilizing self-assessment-report (GA-SAR) was employed for identifying informative features, followed by inputting the optimal ones into a support vector machine (SVM)-based classifier for developing the final model (iBitter-Fuse). Benchmarking experiments based on both 10-fold cross-validation and independent tests indicated that iBitter-Fuse was able to achieve more accurate performance as compared to state-of-the-art methods. To facilitate the high-throughput identification of bitter peptides, the iBitter-Fuse web server was established and made freely available online. It is anticipated that iBitter-Fuse will be a useful tool for aiding the discovery and de novo design of bitter peptides.

13 pages, 1988 KB

Open Access Article

Cross-Predicting Essential Genes between Two Model Eukaryotic Species Using Machine Learning

by Tulio L. Campos, Pasi K. Korhonen and Neil D. Young

Int. J. Mol. Sci. 2021, 22(10), 5056; https://doi.org/10.3390/ijms22105056 - 11 May 2021

Cited by 12 | Viewed by 4156

Abstract

Experimental studies of Caenorhabditis elegans and Drosophila melanogaster have contributed substantially to our understanding of molecular and cellular processes in metazoans at large. Since the publication of their genomes, functional genomic investigations have identified genes that are essential or non-essential for survival in each species. Recently, a range of features linked to gene essentiality have been inferred using a machine learning (ML)-based approach, allowing essentiality predictions within a species. Nevertheless, predictions between species are still elusive. Here, we undertake a comprehensive study using ML to discover and validate features of essential genes common to both C. elegans and D. melanogaster. We demonstrate that the cross-species prediction of gene essentiality is possible using a subset of features linked to nucleotide/protein sequences, protein orthology and subcellular localisation, single-cell RNA-seq, and histone methylation markers. Complementary analyses showed that essential genes are enriched for transcription and translation functions and are preferentially located away from heterochromatin regions of C. elegans and D. melanogaster chromosomes. The present work should enable the cross-prediction of essential genes between model and non-model metazoans.

17 pages, 4665 KB

Open Access Article

smartPARE: An R Package for Efficient Identification of True mRNA Cleavage Sites

by Kristian Persson Hodén, Xinyi Hu, German Martinez and Christina Dixelius

Int. J. Mol. Sci. 2021, 22(8), 4267; https://doi.org/10.3390/ijms22084267 - 20 Apr 2021

Cited by 8 | Viewed by 3360

Abstract

Degradome sequencing is commonly used to generate high-throughput information on mRNA cleavage sites mediated by small RNAs (sRNA). In our datasets of potato (Solanum tuberosum, St) and Phytophthora infestans (Pi), initial predictions generated high numbers of cleavage site predictions, which highlighted the need for improved analytic tools. Here, we present an R package based on a deep learning convolutional neural network (CNN) in a machine learning environment to optimize discrimination of false from true cleavage sites. When applying smartPARE to our datasets on potato during the infection process by the late blight pathogen, 7.3% of all cleavage windows represented true cleavages distributed on 214 sites in P. infestans and 444 sites in potato. The sRNA landscape of the two organisms is complex, with uneven sRNA production and cleavage regions widespread in the two genomes. Multiple targets and several cases of complex regulatory cascades, particularly in potato, were revealed. We conclude that our new analytic approach is useful for anyone working on complex biological systems and interested in identifying cleavage sites, particularly those inferred by sRNA classes beyond miRNAs.

15 pages, 3303 KB

Open Access Article

MET Exon 14 Skipping: A Case Study for the Detection of Genetic Variants in Cancer Driver Genes by Deep Learning

by Vladimir Nosi, Alessandrì Luca, Melissa Milan, Maddalena Arigoni, Silvia Benvenuti, Davide Cacchiarelli, Marcella Cesana, Sara Riccardo, Lucio Di Filippo, Francesca Cordero, Marco Beccuti, Paolo M. Comoglio and Raffaele A. Calogero

Int. J. Mol. Sci. 2021, 22(8), 4217; https://doi.org/10.3390/ijms22084217 - 19 Apr 2021

Cited by 9 | Viewed by 4682

Abstract

Background: Disruption of alternative splicing (AS) is frequently observed in cancer and might represent an important signature for tumor progression and therapy. Exon skipping (ES) represents one of the most frequent AS events, and in non-small cell lung cancer (NSCLC) MET exon 14 skipping was shown to be targetable. Methods: We constructed neural networks (NN/CNN) specifically designed to detect MET exon 14 skipping events using RNAseq data. Furthermore, for discovery purposes we also developed a sparsely connected autoencoder to identify uncharacterized MET isoforms. Results: The neural networks had a MET exon 14 skipping detection rate greater than 94% when tested on a manually curated set of 690 TCGA bronchus and lung samples. When globally applied to 2605 TCGA samples, we observed that the majority of false positives were characterized by a blurry coverage of exon 14, but interestingly they share a common coverage peak in the second intron, and we speculate that this event could be the transcription signature of a LINE1 (Long Interspersed Nuclear Element 1)-MET (Mesenchymal Epithelial Transition receptor tyrosine kinase) fusion. Conclusions: Taken together, our results indicate that neural networks can be an effective tool to provide a quick classification of pathological transcription events, and sparsely connected autoencoders could represent the basis for the development of an effective discovery tool.

11 pages, 1666 KB

Open Access Article

PredNTS: Improved and Robust Prediction of Nitrotyrosine Sites by Integrating Multiple Sequence Features

by Andi Nur Nilamyani, Firda Nurul Auliah, Mohammad Ali Moni, Watshara Shoombuatong, Md Mehedi Hasan and Hiroyuki Kurata

Int. J. Mol. Sci. 2021, 22(5), 2704; https://doi.org/10.3390/ijms22052704 - 8 Mar 2021

Cited by 19 | Viewed by 3442

Abstract

Nitrotyrosine, which is generated by numerous reactive nitrogen species, is a type of protein post-translational modification. Identification of site-specific nitration modification on tyrosine is a prerequisite to understanding the molecular function of nitrated proteins. Thanks to the progress of machine learning, computational prediction can play a vital role before biological experimentation. Herein, we developed a computational predictor, PredNTS, by integrating multiple sequence features including K-mer, composition of k-spaced amino acid pairs (CKSAAP), AAindex, and binary encoding schemes. The important features were selected by the recursive feature elimination approach using a random forest classifier. Finally, we linearly combined the successive random forest (RF) probability scores generated by the different RF models, each of which employs a single encoding. The resultant PredNTS predictor achieved an area under the curve (AUC) of 0.910 using five-fold cross-validation. It outperformed the existing predictors on a comprehensive and independent dataset. Furthermore, we investigated several machine learning algorithms to demonstrate the superiority of the employed RF algorithm. PredNTS is a useful computational resource for the prediction of nitrotyrosine sites. The PredNTS web application, together with its curated datasets, is publicly available.
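As an illustration of one of the encodings named in the abstract above, the following is a minimal sketch of the composition of k-spaced amino acid pairs (CKSAAP). It is not the PredNTS implementation: the function name `cksaap`, the default `k_max=2`, and the choice to normalise by the number of valid pairs (ignoring non-standard residues such as padding characters) are assumptions made for this example.

```python
from itertools import product

# The 20 standard amino acids in a fixed order (assumed alphabet for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cksaap(window, k_max=2):
    """Return CKSAAP features for a peptide window.

    For each gap k in 0..k_max, count every residue pair separated by k
    positions and normalise by the number of valid pairs at that gap.
    The result has 400 * (k_max + 1) values.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        for i in range(len(window) - k - 1):
            pair = window[i] + window[i + k + 1]
            if pair in counts:  # skip pairs containing non-standard residues
                counts[pair] += 1
                total += 1
        # Hypothetical normalisation: frequency among valid pairs at this gap.
        features.extend(counts[p] / total if total else 0.0 for p in pairs)
    return features


if __name__ == "__main__":
    vector = cksaap("AYA", k_max=1)
    print(len(vector))
```

In a pipeline like the one described, vectors of this kind would be computed for a fixed-length window centred on each candidate tyrosine and passed to a classifier; the window length and the classifier settings are not given in the abstract and are left open here.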

(This article belongs to the Special Issue Deep Learning and Machine Learning in Bioinformatics)

21 pages, 5593 KB

Open Access Article

Dissecting Response to Cancer Immunotherapy by Applying Bayesian Network Analysis to Flow Cytometry Data

by Andrei S. Rodin, Grigoriy Gogoshin, Seth Hilliard, Lei Wang, Colt Egelston, Russell C. Rockne, Joseph Chao and Peter P. Lee

Int. J. Mol. Sci. 2021, 22(5), 2316; https://doi.org/10.3390/ijms22052316 - 26 Feb 2021

Cited by 16 | Viewed by 3916

Abstract

Cancer immunotherapy, specifically immune checkpoint blockade, has been found to be effective in the treatment of metastatic cancers. However, only a subset of patients achieve clinical responses. Elucidating pretreatment biomarkers predictive of sustained clinical response is a major research priority. Another research priority is evaluating changes in the immune system before and after treatment in responders vs. nonresponders. Our group has been studying immune networks as an accurate reflection of the global immune state. Flow cytometry (FACS, fluorescence-activated cell sorting) data characterizing immune cell panels in peripheral blood mononuclear cells (PBMC) from gastroesophageal adenocarcinoma (GEA) patients were used to analyze changes in immune networks in this setting. Here, we describe a novel computational pipeline to perform secondary analyses of FACS data using systems biology/machine learning techniques and concepts. The pipeline is centered around comparative Bayesian network analyses of immune networks and is capable of detecting strong signals that conventional methods (such as FlowJo manual gating) might miss. Future studies are planned to validate and follow up the immune biomarkers (and combinations/interactions thereof) associated with clinical responses identified with this computational pipeline.

(This article belongs to the Special Issue Deep Learning and Machine Learning in Bioinformatics)

12 pages, 1509 KB

Open Access Article

PUP-Fuse: Prediction of Protein Pupylation Sites by Integrating Multiple Sequence Representations

by Firda Nurul Auliah, Andi Nur Nilamyani, Watshara Shoombuatong, Md Ashad Alam, Md Mehedi Hasan and Hiroyuki Kurata

Int. J. Mol. Sci. 2021, 22(4), 2120; https://doi.org/10.3390/ijms22042120 - 20 Feb 2021

Cited by 11 | Viewed by 3775

Abstract

Pupylation is a type of reversible post-translational modification of proteins, which plays a key role in the cellular function of microbial organisms. Several proteomics methods have been developed for the prediction and analysis of pupylated proteins and pupylation sites. However, the traditional experimental methods are laborious and time-consuming. Hence, computational algorithms that can predict potential pupylation sites using sequence features are highly needed. In this research, a new prediction model, PUP-Fuse, has been developed for pupylation site prediction by integrating multiple sequence representations. We explored five types of feature encoding approaches and three machine learning (ML) algorithms. In the final model, we integrated the successive ML scores using a linear regression model. PUP-Fuse achieved a Matthews correlation coefficient of 0.768 in a 10-fold cross-validation test. It also outperformed existing predictors in an independent test. The PUP-Fuse web server, together with its curated datasets, is freely available.
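For readers unfamiliar with the Matthews correlation coefficient reported above, the sketch below computes it from the four cells of a binary confusion matrix. The function name and the convention of returning 0.0 when the denominator is zero are assumptions of this example, and the counts in the usage lines are invented for illustration; they are not results from the paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when the denominator is zero (a common convention,
    assumed here).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Illustrative counts only, not data from the study.
    print(round(matthews_corrcoef(tp=90, tn=80, fp=20, fn=10), 4))
```

Unlike accuracy, this score uses all four cells of the confusion matrix, which is why it is often preferred when positive and negative sites are imbalanced.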

(This article belongs to the Special Issue Deep Learning and Machine Learning in Bioinformatics)

16 pages, 1684 KB

Open Access Article

A Computational Framework Based on Ensemble Deep Neural Networks for Essential Genes Identification

by Nguyen Quoc Khanh Le, Duyen Thi Do, Truong Nguyen Khanh Hung, Luu Ho Thanh Lam, Tuan-Tu Huynh and Ngan Thi Kim Nguyen

Int. J. Mol. Sci. 2020, 21(23), 9070; https://doi.org/10.3390/ijms21239070 - 28 Nov 2020

Cited by 58 | Viewed by 5617

Abstract

Essential genes contain key information of genomes that could be the key to a comprehensive understanding of life and evolution. Because of their importance, studies of essential genes have been considered a crucial problem in computational biology. Computational methods for identifying essential genes have become increasingly popular to reduce the cost and time consumption of traditional experiments. A few models have addressed this problem, but performance is still not satisfactory because of high-dimensional features and the use of traditional machine learning algorithms. Thus, there is a need to create a novel model to improve the predictive performance on this problem from DNA sequence features. This study took advantage of a natural language processing (NLP) model in learning biological sequences by treating them as natural language words. To learn from the NLP features, a supervised learning model based on an ensemble deep neural network was subsequently employed. Our proposed method could identify essential genes with sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC) values of 60.2%, 84.6%, 76.3%, 0.449, and 0.814, respectively. The overall performance outperformed the single models without ensemble, as well as the state-of-the-art predictors on the same benchmark dataset. This indicated the effectiveness of the proposed method in determining essential genes, in particular, and other sequencing problems, in general.
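One common way to treat a DNA sequence as natural language words is to split it into overlapping k-mers, which can then be fed to a word-embedding model. The sketch below shows only that tokenisation step. The abstract does not state how the authors formed their words, so the function name `kmer_tokens`, the default `k=3`, and the `stride` parameter are assumptions of this example.

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a DNA sequence into k-mer 'words'.

    With stride=1 the k-mers overlap; with stride=k they do not.
    Sequences shorter than k yield an empty list.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    # A gene becomes a 'sentence' of k-mer words.
    print(" ".join(kmer_tokens("atgcgt", k=3)))
```

The resulting word lists could be passed to any embedding model that accepts tokenised sentences; which model and which value of k work best is an empirical question not addressed by this sketch.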

(This article belongs to the Special Issue Deep Learning and Machine Learning in Bioinformatics)

Review

20 pages, 1701 KB

Open Access Review

Protein Design with Deep Learning

by Marianne Defresne, Sophie Barbe and Thomas Schiex

Int. J. Mol. Sci. 2021, 22(21), 11741; https://doi.org/10.3390/ijms222111741 - 29 Oct 2021

Cited by 40 | Viewed by 8324

Abstract

Computational Protein Design (CPD) has produced impressive results for engineering new proteins, resulting in a wide variety of applications. In the past few years, various efforts have aimed at replacing or improving existing design methods using Deep Learning technology to leverage the amount of publicly available protein data. Deep Learning (DL) is a very powerful tool to extract patterns from raw data, provided that data are formatted as mathematical objects and the architecture processing them is well suited to the targeted problem. In the case of protein data, specific representations are needed for both the amino acid sequence and the protein structure in order to capture 1D and 3D information, respectively. As no consensus has been reached about the most suitable representations, this review describes the representations used so far, discusses their strengths and weaknesses, and details their associated DL architecture for design and related tasks.

(This article belongs to the Special Issue Deep Learning and Machine Learning in Bioinformatics)

18 pages, 1422 KB

Open Access Review

Artificial Intelligence in Bulk and Single-Cell RNA-Sequencing Data to Foster Precision Oncology

by Marco Del Giudice, Serena Peirone, Sarah Perrone, Francesca Priante, Fabiola Varese, Elisa Tirtei, Franca Fagioli and Matteo Cereda

Int. J. Mol. Sci. 2021, 22(9), 4563; https://doi.org/10.3390/ijms22094563 - 27 Apr 2021

Cited by 25 | Viewed by 8099

Abstract

Artificial intelligence, or the discipline of developing computational algorithms able to perform tasks that require human intelligence, offers the opportunity to improve our idea and delivery of precision medicine. Here, we provide an overview of artificial intelligence approaches for the analysis of large-scale RNA-sequencing datasets in cancer. We present the major solutions to disentangle inter- and intra-tumor heterogeneity of transcriptome profiles for an effective improvement of patient management. We outline the contributions of learning algorithms to the needs of cancer genomics, from identifying rare cancer subtypes to personalizing therapeutic treatments.

(This article belongs to the Special Issue Deep Learning and Machine Learning in Bioinformatics)

31 pages, 1886 KB

Open Access Review

Towards the Interpretability of Machine Learning Predictions for Medical Applications Targeting Personalised Therapies: A Cancer Case Survey

by Antonio Jesús Banegas-Luna, Jorge Peña-García, Adrian Iftene, Fiorella Guadagni, Patrizia Ferroni, Noemi Scarpato, Fabio Massimo Zanzotto, Andrés Bueno-Crespo and Horacio Pérez-Sánchez

Int. J. Mol. Sci. 2021, 22(9), 4394; https://doi.org/10.3390/ijms22094394 - 22 Apr 2021

Cited by 59 | Viewed by 7373

Abstract

Artificial Intelligence is providing astonishing results, with medicine being one of its favourite playgrounds. Machine Learning and, in particular, Deep Neural Networks are behind this revolution. Among the most challenging targets of interest in medicine are cancer diagnosis and therapies but, to start this revolution, software tools need to be adapted to cover the new requirements. In this sense, learning tools are becoming a commodity but, to be able to assist doctors on a daily basis, it is essential to fully understand how models can be interpreted. In this survey, we analyse current machine learning models and other in-silico tools as applied to medicine, specifically to cancer research, and we discuss their interpretability, performance and the input data they are fed. Artificial neural networks (ANN), logistic regression (LR) and support vector machines (SVM) have been observed to be the preferred models. In addition, convolutional neural networks (CNNs), supported by the rapid development of graphics processing units (GPUs) and high-performance computing (HPC) infrastructures, are gaining importance when image processing is feasible. However, the interpretability of machine learning predictions, which doctors need in order to understand them, trust them and gain useful insights for clinical practice, is still rarely considered. This needs to improve in order to enhance doctors’ predictive capacity and achieve individualised therapies in the near future.

(This article belongs to the Special Issue Deep Learning and Machine Learning in Bioinformatics)

19 pages, 867 KB

Open Access Review

Incorporating Machine Learning into Established Bioinformatics Frameworks

by Noam Auslander, Ayal B. Gussow and Eugene V. Koonin

Int. J. Mol. Sci. 2021, 22(6), 2903; https://doi.org/10.3390/ijms22062903 - 12 Mar 2021

Cited by 96 | Viewed by 40788

Abstract

The exponential growth of biomedical data in recent years has urged the application of numerous machine learning techniques to address emerging problems in biology and clinical research. By enabling automatic feature extraction, selection, and generation of predictive models, these methods can be used to efficiently study complex biological systems. Machine learning techniques are frequently integrated with bioinformatic methods, as well as curated databases and biological networks, to enhance training and validation, identify the best interpretable features, and enable feature and model investigation. Here, we review recently developed methods that incorporate machine learning within the same framework with techniques from molecular evolution, protein structure analysis, systems biology, and disease genomics. We outline the challenges posed for machine learning, and, in particular, deep learning in biomedicine, and suggest unique opportunities for machine learning techniques integrated with established bioinformatics approaches to overcome some of these challenges.

(This article belongs to the Special Issue Deep Learning and Machine Learning in Bioinformatics)
