Non-coding RNAs (ncRNAs) are a type of RNA [1] that is not translated into protein. Instead, ncRNA genes carry information that yields functional RNA molecules [2], which are produced through gene transcription and act without being translated into proteins. This process is represented in Figure 1, which shows the transcription step of ncRNA genes. In physiology and disease development, these RNAs regulate gene expression at numerous levels. Recent studies of the human genome [3] have identified many regulatory ncRNAs, including microRNAs, small RNAs, and various types of long ncRNAs (lncRNAs) [4]. In practice, ncRNAs also achieve regulation through modularity, assembling diverse combinations of proteins and possibly RNA and DNA interactions [5]. A regulatory framework was proposed in [6] to construct a network between lncRNAs and protein-coding genes using a Bayesian network (BN). The authors used 762 prostate RNA-seq data sets to construct this regulatory network and observed that lncRNAs are involved in tissue development. Beyond their sequences, ncRNAs are also characterized by specific secondary and tertiary structures [7]. Still, determining the function and structure of ncRNAs is becoming challenging because of the huge volumes of data involved in human next-generation sequencing (NGS) [8]. Computational intelligence (CI) techniques were developed in past studies to determine the function or annotation of ncRNAs. Applying CI techniques to large human NGS datasets remains challenging, and the prediction of ncRNAs [9] is likewise a major issue for CI techniques.
We present a detailed review of state-of-the-art computational intelligence (CI) techniques from 2001 to 2016 for the automatic functional annotation and finding of non-coding RNA (ncRNA) genes. The primary aim of this review article is to show both biologists and computer experts the problems and importance of CI techniques for studying human disease in various domains. In the literature, researchers mainly use CI algorithms such as support vector machines (SVMs), neural networks (NNs), Bayesian networks (BNs), genetic algorithms (GAs), hidden Markov models (HMMs), and hybrid classifiers to find ncRNA genes. The latest trend is to develop more advanced CI techniques to classify or annotate ncRNA genes. Currently, authors widely use deep neural network (DNN) and/or convolutional neural network (CNN) classifiers to predict ncRNA sequences. The DNN is a more advanced CI technique that can handle multiclass problems without relying on domain-expert knowledge.
However, annotating ncRNAs poses a number of challenges [10] because medical and bioinformatics experts continue to predict many classes, a consequence of the lack of an unambiguous classification framework in past studies. Similarly, differentiating [11] lncRNAs from messenger RNAs (mRNAs) has also been a challenging task. In contrast to lncRNAs, the microRNA (miRNA) [12] is a type of ncRNA that regulates gene expression at the post-transcriptional level. MicroRNAs have been observed to play special roles in the development of cancer cells. However, this review suggests that the functional identification of miRNAs remains challenging because the human genome contains more than 1000 distinct miRNA genes. Furthermore, we describe data sources that are provided to help researchers develop computational algorithms. This review will serve as a good reference for newcomers to the computational field of ncRNA research. Moreover, it focuses on the technical side and the limitations of state-of-the-art CI techniques.
This review article is organized as follows: Section 2 describes the detailed state-of-the-art computational intelligence (CI) approaches for finding ncRNAs and miRNA genes. The online tools and data sources are also presented for the researchers to analyze ncRNA genes. The discussions are presented in Section 3, and conclusions and future works are presented in Section 4 and Section 5.
3. Online Tools and Data Sources
The online tools and data sources are provided for researchers to develop new studies based on new CI approaches. We list not only the sources for finding ncRNA and microRNA genes but also those for other classes of non-coding RNAs, as displayed in Table 3, whereas the available online tools are described in Table 1. As mentioned before, these databases are useful for testing various annotation and gene-finding techniques. The Rfam database [21] was one of the first. It integrated various new and existing curated structural alignments into a common structure-annotated format. It also uses covariance modeling and automated sequence annotation software.
The NONCODE database [19] brought together most publicly available information about experimentally confirmed or computationally predicted ncRNAs, with the exception of transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs). It also introduced a classification system termed process function class (PfClass), based on the cellular processes and functions associated with the ncRNA.
RNAdb [18] was launched as a sequence repository for experimentally supported regulatory mammalian ncRNAs (miRNAs, small nucleolar RNAs (snoRNAs), but not tRNAs, rRNAs, and spliceosomal RNAs). Apart from bioinformatics analyses, it was also meant to facilitate microarray chip characterization experiments. This database also includes a large number of commonly accepted ncRNAs from reputable complementary DNA (cDNA) libraries. The authors described the computational methods to identify genes and presented a brief technical reference for future studies.
One of the first programs for searching a sequence database for homologs of a single RNA molecule on the basis of secondary structure was RSEARCH [49]. It relies on a local alignment algorithm together with a series of base-pair and single-nucleotide substitution matrices for RNA sequences. The web-based tool RNALOSS [9] was developed to provide information about the distribution of locally optimal secondary structures.
FastR was applied to the discovery of riboswitches, a class of RNA domains that regulate metabolite synthesis. Given an RNA sequence with known secondary structure, FastR efficiently computes all structural homologs in a genomic database. The tool relies heavily on filter design and optimization as well as the actual filtering algorithms and computation.
The interpretation and annotation of ncRNA genes is an emerging trend. Annotating and interpreting ncRNAs poses numerous challenges because many classes are still being predicted by medical and bioinformatics experts [52]. In recent years, CI approaches have attracted many researchers to these tasks. However, those CI approaches lacked a definitive classification framework that built on past studies. Some reviews have summarized CI approaches but focused on a particular methodological viewpoint. In this article, the CI techniques for interpretation and annotation of ncRNA gene finding are summarized in detail, differently from the existing body of research, and we attempt to deliver a concise technical discussion.
Biological sequences (such as DNA, RNA, and protein sequences) naturally fit recurrent NNs, which are capable of temporal modeling. Nonetheless, prior work on applying deep learning to bioinformatics utilized only convolutional and fully connected NNs. The main novelty of deepTarget lies in its use of recurrent NNs to model RNA sequences and to learn their sequence-to-sequence interactions without laborious feature engineering (e.g., more than 151 features of miRNA-target pairs have been proposed in the literature). As shown in the authors' experimental results, even without any of the known features, deepTarget delivered substantial performance gains (over 25% increase in F-measure) over existing miRNA target detectors, demonstrating the effectiveness of recent advances in end-to-end learning methodologies.
The training of deepTarget [26] focused on improving its capability to reject false positives (i.e., bogus miRNA-mRNA pairs) as a target predictor. The decision was based on the observation that more priority should be given to sensitivity in the search for potential targets of specific miRNAs, whereas specificity should be emphasized in the examination of miRNAs that regulate specific genes. Depending on the specific needs, deepTarget could alternatively be trained to put more priority on specificity. For instance, this could be done by altering the composition of the mock negative dataset to include additional mispairings between miRNA and mRNA sequences outside the seed sequence.
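The trade-off between sensitivity and specificity discussed above can be made concrete with a minimal sketch. The function name and the example counts are illustrative assumptions, not values reported for deepTarget [26].

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives, and false negatives (hypothetical counts here).
    """
    # Sensitivity (recall): fraction of true targets that are detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of bogus pairs that are correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Precision: fraction of predicted targets that are real.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F-measure: harmonic mean of precision and sensitivity.
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts for a miRNA target predictor.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

A predictor tuned to reject more false positives lowers `fp`, which raises specificity and precision, usually at the cost of a higher `fn` and therefore lower sensitivity.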
Notably, deepTarget [26] does not depend on any sequence alignment operation, which has been used in many bioinformatics pipelines as a holy grail to reveal similarity or interactions between sequences. Although effective in general, sequence alignment is sensitive to changes in parameters (e.g., gap/mismatch penalty and match premium) and often fails to reveal the true interactions between sequences, as is observed in most alignment-based miRNA target detectors. By processing miRNA and mRNA sequences with recurrent neural network (RNN)-based autoencoders without alignment, deepTarget discovered the inherent sequence representations, which are then used in its next step for interaction learning. Although the reported performance of deepTarget is considerably higher than that of the existing tools it was compared with (Fritah et al. [54]), there remains room for further improvement. An additional breakthrough may be possible by enhancing the current step for learning sequence-to-sequence interactions. The current version of deepTarget concatenates the RNA representations from two autoencoders and learns interactions between them using a unidirectional two-layer RNN architecture. Although this architecture was effective to some extent, as shown in the authors' experiments [26], adopting more sophisticated approaches may further boost the capability of deepTarget to detect subtle interactions that currently go undetected.
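To illustrate why alignment-based scoring is sensitive to its parameters, the following minimal sketch computes a Smith-Waterman-style local alignment score with a linear gap penalty. It is not part of deepTarget or of any tool reviewed here; the function name, the default scores, and the two toy sequences are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of sequences a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Toy sequences: the second has one extra nucleotide in the middle.
seq_a = "AAAACCCC"
seq_b = "AAAAGCCCC"
lenient = local_alignment_score(seq_a, seq_b, gap=-1)   # 15
strict = local_alignment_score(seq_a, seq_b, gap=-10)   # 13
```

With these toy sequences, a gap penalty of -1 favors a gapped alignment (score 15), whereas a penalty of -10 favors an ungapped alignment with one mismatch (score 13), so the same pair of sequences is aligned differently depending on the parameters.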
As we are living in the era of big data, transforming biomedical big data [1] into valuable knowledge has been one of the most important problems in bioinformatics. At the same time, deep learning has advanced rapidly since the early 2000s and has recently shown state-of-the-art performance in various fields. This article reviews some research on deep learning [55] in bioinformatics. To provide a big picture, the authors of past studies utilized deep learning architectures (i.e., deep neural network, convolutional neural network, recurrent neural network, modified neural network) and presented brief descriptions of each work. Additionally, we introduce a few issues of deep learning in bioinformatics, such as class-imbalanced data, and suggest future research directions [28] such as multimodal deep learning. The authors believe that the study could provide valuable insights and be a starting point for researchers to apply deep learning in their bioinformatics studies.
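One common way to mitigate class imbalance, given here only as a hedged illustration and not as a method taken from the reviewed studies, is to weight each class inversely to its frequency during training. The function name and the toy labels are assumptions.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class so that rare classes count more in the loss.

    Uses the heuristic n_samples / (n_classes * class_count).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy dataset: 90 negative examples and 10 positive examples.
toy_labels = ["negative"] * 90 + ["positive"] * 10
weights = inverse_frequency_weights(toy_labels)
```

Under this heuristic the rare positive class receives a weight of 5.0 and the frequent negative class a weight of about 0.56, so each class contributes equally to a weighted loss.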
Certainly, bioinformatics is no exception to such trends. Various forms of biomedical data, including omics data, images, and signals, have accumulated significantly, and their great potential in biological and health-care research has caught the interest of industry as well as academia. For instance, IBM provided Watson for Oncology, a platform that analyzes patients’ medical information and assists clinicians with treatment options [2]. In addition, Google DeepMind, having achieved great success with AlphaGo in the game of Go, recently launched DeepMind Health to develop effective healthcare technologies [4].
To extract knowledge from huge data in bioinformatics, machine learning has been one of the most widely used methodologies. Machine learning algorithms use training data to uncover underlying patterns, build a model, and then make predictions on new data based on the model. Some of the well-known algorithms (SVM, HMM, BNs, and Gaussian networks) have been applied in genomics, proteomics, systems biology, and many other domains [7]. Conventional machine learning algorithms have limitations in processing raw data, so researchers put tremendous effort into transforming the raw data into suitable high-level features, which requires considerable domain expertise [56]. On the other hand, deep learning, a new type of machine learning algorithm, has emerged recently on the basis of big data, the power of parallel and distributed computing, and sophisticated algorithms. Deep learning algorithms have overcome the former limitations and are making major advances in diverse fields such as image recognition, speech recognition, and natural language processing.
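As an illustration of the kind of hand-crafted feature transformation that conventional algorithms require, the sketch below converts a raw RNA sequence into a fixed-length vector of k-mer frequencies that a classifier such as an SVM could consume. The choice of k-mer features, the function name, and the example sequence are assumptions; the reviewed works use a variety of feature sets.

```python
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Return the relative frequency of every k-mer over the given alphabet."""
    # Normalize case and treat DNA-style input (T) as RNA (U).
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing ambiguous symbols such as N
            counts[kmer] += 1
            total += 1
    # Fixed ordering of k-mers gives a fixed-length feature vector.
    return [counts[m] / total if total else 0.0 for m in kmers]


# Hypothetical sequence; the vector has 4**k entries (16 for k=2).
features = kmer_frequencies("ACGUACGU", k=2)
```

Deep learning approaches aim to learn such representations directly from the raw sequence instead of relying on a feature set chosen in advance.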
Certainly, bioinformatics is no exception in deep learning applications. Several studies have been conducted [57] to apply deep learning in bioinformatics, as in Figure 1. We categorized the research by the form of input data into three domains: omics, biomedical imaging, and biomedical signal processing. Detailed lists of bioinformatics research topics where deep learning is applied and input data examples of each domain are shown in Table 1.
Figure 2 shows the percentage of each class of CI technique used since 2001. This pie chart shows that most of the CI techniques have been applied to this problem, and that ANN-based methods such as CNN and DNN account for approximately 35% of them, which suggests that ANN-based approaches are the most promising compared with other CI techniques. Table 1 shows that an NN-based approach achieved approximately 99% accuracy in predicting new miRNAs known as pre-miRNAs [28], which is the maximum so far. Most NN-based approaches have been proposed recently. Figure 2 also shows that SVM has been used for this problem around 25% of the time, but Table 1 shows that the maximum accuracy achieved by an SVM-based approach is 95%, which is lower than that of the NN-based approaches. SVM-based approaches were proposed before 2010, as shown in Table 1. Table 1 also shows that other classifiers such as RT, RF, SMO, HMM, GA, and logistic regression have been used, but these approaches do not show results as promising as those of NN-based approaches. Thus, it can be concluded that NN-based approaches are the most suitable for the classification and prediction of miRNA.
In fact, the differentiation between normal and cancer tissues depends on the analysis of lncRNA transcription patterns. It was also noticed that lncRNA expression in human cancers is highly aberrant relative to that in normal tissues. Therefore, 272 human serial analysis of gene expression (SAGE) libraries were used to detect transcription patterns of lncRNAs [58].
State-of-the-art advances have been presented at three levels of lncRNAs (the primary sequence, the secondary structure, and the function annotation) along with CI methods [59]. Computational approaches for the analysis of ncRNA through deep sequencing techniques were discussed in [60]. One review of lncRNAs [23] also argues that the quality of annotations and the function of these genes are important. In that study, the authors proposed a novel cancer-related finding of lncRNA genes and discussed the limitations.