The transcriptional landscape of eukaryotic organisms (e.g., humans) is now perceived as far more intricate than originally thought [1], following the discovery that only about 2% of the human genome encodes proteins, while the remaining sequences are non-coding [2]. Since most of the human genome is transcribed regardless of its coding potential, a major part of it is pervasively transcribed into non-coding RNAs (ncRNAs). Among ncRNAs, long non-coding RNAs (lncRNAs), which are more than 200 nucleotides in length, have recently been in the limelight because of evidence linking mutations in their sequences to dysregulation in many human diseases [3]. For example, genome-wide association studies (GWAS) have found that the lncRNA ANRIL is significantly associated with susceptibility to type 2 diabetes, intracranial aneurysm, coronary disease, and several types of cancer [3]. Several mutations within the ANRIL gene body, as well as in its surroundings, are correlated with a propensity for developing these diseases [3]. Another example is the lncRNA Gas5, which is involved in susceptibility to autoimmune disorders [4] and may also act as a tumor suppressor in breast cancer [5]. Besides these examples, numerous other lncRNAs are involved in a multitude of human diseases. Interested readers may refer to the following articles for a more detailed picture of the role of lncRNAs in different diseases [3
In the early 1980s, scientists used the hybridization of complementary DNA (cDNA) to clone genes and to measure their expression and tissue specificity [9]. Initially, these efforts focused on genes known to produce proteins. The scientific community then applied the same approach to RNAs without regard to their coding potential. Using this approach, the first lncRNA discovered in a eukaryotic organism was H19. The intriguing aspect of H19 was that it was not translated, even though its gene body contained small open reading frames. Surprisingly, H19 transcripts resembled messenger RNAs (mRNAs) in terms of splicing, polyadenylation, localization in the cytoplasm, and transcription by RNA polymerase II [10]. Among the earliest discovered lncRNAs, X-inactive-specific transcript (XIST) is one of the most well studied because of its role in X-chromosome inactivation (XCI) [11]. The XIST locus was discovered in the early 1990s, and it showed very low expression levels in undifferentiated mouse embryonic stem (ES) cells from both males and females [12]. Since the pioneering discoveries of H19 and XIST, the scientific community's view of non-coding genes has changed completely, rejuvenating efforts to discover and characterize novel non-coding RNAs. In particular, the study of lncRNAs has expanded dramatically. Additionally, advances in next-generation sequencing technology have enabled the discovery of many functional lncRNAs in the non-coding regions of the human genome. Although the regions encoding lncRNAs were regarded as junk DNA for approximately twenty years [14], they are now recognized as being pervasively transcribed, and non-coding RNA transcriptomes (specifically lncRNAs) have become a major field in biomedical research.
The pervasive nature of the transcriptomes in humans [15] and mice [16] has also been highlighted by the Functional Annotation of the Mammalian Genome (FANTOM) consortium in the largest collection of functional lncRNAs, with over 23,000 lncRNA genes [17]. GENCODE [18] v25 provides a list of ~18,000 human lncRNA genes. MiTranscriptome has collected 58,548 lncRNA genes [19]; however, it is unclear whether all of them are functional. These figures show that novel lncRNAs are being discovered regularly and that the catalogue of lncRNAs is constantly growing. Therefore, it is of interest to analyze this large, versatile, and dynamic collection of lncRNAs systematically, using state-of-the-art computational techniques to derive novel hypotheses, discover unanticipated links, and make sound functional inferences [20]. Machine learning (ML)-based methods are well suited for lncRNA research, since they can generate insights and discover new patterns from the growing number of lncRNA repositories.
Though ML-based methods are applicable to different types of data, their performance depends on how the data are represented: both the quality of the representation and the relevance of the data to the problem at hand affect model performance. Deep learning (DL), a sub-field of ML, can address this issue by learning an embedding of the data as part of the model, yielding end-to-end models [21]. DL, which is built on biology-inspired neural networks [22], uses multiple hidden layers and is considered to be among the best paradigms for classification and prediction in the ML field [23]. In the past ten years, DL-based models have achieved tremendous success in computer vision [24], machine translation [25], and speech recognition [26]. The main reasons for their success are the unprecedented availability of massive volumes of data, improvements in computational capacity, and the development of sophisticated algorithms [27]. The enormous amount of biological data, once considered a major analysis challenge, has become an opportunity for biomedical researchers [29]. DL-based methods have now been successfully applied in the genomics research domain [21
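The dependence on data representation noted above can be made concrete with a small sketch. Assuming a transcript is available as a plain nucleotide string, the hypothetical helper `kmer_frequencies` builds a hand-crafted feature vector of the kind often given to classical ML models, whereas `one_hot_encode` produces a raw positional encoding from which a DL model could learn its own features. The function names, the `ACGU` channel order, and the handling of ambiguous bases are assumptions made for illustration and are not taken from any specific tool.

```python
from itertools import product

# Channel order of the encoding (an assumption of this sketch).
ALPHABET = "ACGU"


def _normalize(seq):
    """Upper-case the sequence and map DNA 'T' to RNA 'U'."""
    return seq.upper().replace("T", "U")


def one_hot_encode(seq):
    """Return one row of len(ALPHABET) floats per nucleotide.

    Bases outside ALPHABET (e.g. 'N') are encoded as all-zero rows.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in _normalize(seq):
        row = [0.0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1.0
        encoded.append(row)
    return encoded


def kmer_frequencies(seq, k=3):
    """Return the relative frequency of each of the 4**k possible k-mers.

    Windows containing a base outside ALPHABET are skipped.
    """
    seq = _normalize(seq)
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0.0] * len(kmers)
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:
            counts[pos] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


if __name__ == "__main__":
    toy = "AUGGCUAACGN"  # made-up sequence, not a real transcript
    print(len(one_hot_encode(toy)), "positions x", len(ALPHABET), "channels")
    print(len(kmer_frequencies(toy, k=3)), "k-mer features")
```

The k-mer vector has a fixed length regardless of transcript size but discards positional context; the one-hot matrix keeps every position, which is what lets convolutional or recurrent layers learn motifs directly from the sequence.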
Considering the functionally diverse roles of lncRNAs in human biological processes and diseases, and the strong capacity of DL to identify informative patterns in big data, we reviewed how DL has facilitated, in a data-driven fashion, the discovery of the roles of lncRNAs in different human diseases and of the underlying mechanisms. To the best of our knowledge, this article is the first to summarize the contribution of DL across multiple research domains of the lncRNAome.
We organized this article as follows. We first provide a primer on the DL techniques that have been successfully applied to different lncRNAome-related problems. Then, we highlight the DL-based methods that have been successfully applied to several lncRNA-related research problems. We continue by discussing the issues that researchers might encounter while implementing DL-based solutions for the lncRNAome, along with possible resolutions. Finally, we conclude by discussing the perspectives of DL methods in lncRNAome research areas.
5. Future Perspectives for Deep Learning in lncRNAome Research
DL-based methods are already used extensively in lncRNA research. However, to date, the most common DL architectures used in lncRNA-related research are CNN and RNN (see Table 1). Several other emerging architectures may also have applications in lncRNA-related research.
Di Lena et al. [127] applied deep spatio-temporal neural networks (DST-NNs) [128] with spatial features (e.g., protein secondary structures, orientation probabilities, and alignment probabilities) to predict protein structure. Baldi et al. [129] applied multidimensional recurrent neural networks (MD-RNNs) [130] to amino acid sequences, the correlated profiles, and the secondary structures of proteins. Convolutional auto-encoders (CAEs) are designed to combine the advantages of CNN and AE to learn hierarchical representations of data [131]. To the best of our knowledge, CAEs, MD-RNNs, and DST-NNs have not yet been used in the lncRNA domain.
Graph convolutional networks (GCNs) have been successfully used to predict molecular attributes such as solubility and drug efficacy. Recently, GCNs and attention-based mechanisms have been used in lncRNA–disease prediction [65]. However, neither GCNs nor attention-based mechanisms have been used in lncRNA–protein prediction thus far, and this might be an interesting area for further research.
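To illustrate how graph convolution could be applied in an lncRNA–protein setting, the sketch below runs a single GCN propagation step, ReLU(D^-1/2 (A + I) D^-1/2 · H · W), on a toy interaction graph. The graph, the node ordering, the feature matrix, and the fixed weights are all hypothetical and untrained; this is not the method of [65], only a minimal pure-Python instance of the standard layer.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, where D is the degree matrix of A + I."""
    n = len(adj)
    # Add self-loops so that each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Degrees are at least 1 because of the self-loops, so no division by zero.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [
        [inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
        for i in range(n)
    ]


def gcn_layer(adj_norm, features, weights):
    """One graph-convolution step: ReLU(adj_norm @ features @ weights)."""
    propagated = matmul(matmul(adj_norm, features), weights)
    return [[max(0.0, value) for value in row] for row in propagated]


if __name__ == "__main__":
    # Hypothetical graph: nodes 0-1 are lncRNAs, nodes 2-3 are proteins.
    # Edges (known interactions, made up for this example): 0-2, 1-2, 1-3.
    adjacency = [
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0, 1.0],
        [1.0, 1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
    ]
    # Identity features: each node starts with a unique indicator vector.
    features = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    # Arbitrary fixed weights (4 input features -> 2 hidden units), not trained.
    weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6], [0.2, 0.1]]

    embeddings = gcn_layer(normalize_adjacency(adjacency), features, weights)
    for node, vector in enumerate(embeddings):
        print(node, [round(v, 3) for v in vector])
```

In a trained model, the resulting node embeddings for an lncRNA and a protein would typically be combined (for example, by a dot product or a small classifier) to score a candidate interaction; that scoring step is omitted here.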
Generative adversarial networks (GANs) belong to the unsupervised learning methods, where the goal is to discover the underlying patterns in the data. GANs can also generate new sample data (e.g., sequences) with some variations. To date, the application of GANs has mainly focused on image processing [43]. Because GANs are a relatively new method, their application in genomics remains extremely limited. GAN models have been used to generate protein-coding DNA sequences [132] as well as to design DNA probes for protein binding microarrays, but they have not been used in lncRNA research.
Capsule network models are a relatively new invention in the DL domain [133]. These models attempt to mimic the hierarchical representation of the human brain. Recently, capsule network models have been successfully used to classify brain tumor images [134]. However, capsule networks have not yet been used in any significant application in the lncRNA domain, and the lncRNAome might be an interesting area for capsule network-based research.
In this article, we summarized the contribution of DL in nine different lncRNAome research areas and highlighted the challenges that researchers may face while developing DL-based models for the lncRNAome. Comparative results from DL- and ML-based models highlight the superiority of DL-based models in different lncRNAome prediction tasks. Specifically, in lncRNA identification, the distinction of transcription regulation programs for lncRNA, lncRNA–protein interaction prediction, and lncRNA–disease association prediction, DL-based models have outperformed traditional ML-based models. Based on these results, there is significant potential for the application of DL-based techniques in the lncRNAome. Unfortunately, only a few DL-based models exist for lncRNA localization prediction, lncRNA–DNA interaction prediction, and the distinction of transcription regulation programs for lncRNA. Researchers should consider developing new DL-based models in these areas, which have received relatively little attention from the scientific community. However, the development of DL-based models for the lncRNAome is a daunting task. Because lncRNAs are expressed at low levels and in a cell-/tissue-specific manner, DL-based model development may need to overcome the challenge of working with relatively small datasets while building cell-/tissue-specific models. Additionally, the evolving annotations of lncRNAs from multiple research groups add another layer of complication when integrating newly discovered lncRNAs into existing models. Thus, although DL-based models have achieved high prediction accuracy thus far, major challenges in applying them to the lncRNAome remain. By leveraging state-of-the-art DL-based techniques while improving the existing ones, we expect to gain better insight into the lncRNAome in the near future.