Panning for Long Noncoding RNAs

The recent advent of high-throughput approaches has revealed widespread transcription of the human genome, leading to a new appreciation of transcription regulation, especially from noncoding regions. Distinct from most coding and small noncoding RNAs, long noncoding RNAs (lncRNAs) are generally expressed at low levels, are less conserved and lack protein-coding capacity. These intrinsic features of lncRNAs have not only hampered their full annotation in the past several years, but have also generated controversy concerning whether many or most of these lncRNAs are simply the result of transcriptional noise. Here, we assess these intrinsic features that have challenged lncRNA discovery and further summarize recent progress in lncRNA discovery with integrated methodologies, from which new lessons and insights can be derived to achieve better characterization of lncRNA expression regulation. Full annotation of lncRNA repertoires and the implications of such annotation will provide a fundamental basis for comprehensive understanding of pervasive functions of lncRNAs in biological regulation.


Introduction
It is well known that DNA is transcribed into messenger RNA (mRNA), which is then translated to protein(s) with the help of housekeeping noncoding RNAs (ncRNAs) such as transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs). Messenger RNAs serve as intermediate carriers, forwarding genetic information (as coding genes) from DNA to protein. Characterization of coding genes and their protein products has been of great importance in our goal to understand gene expression regulation. Early expectations were of about 100,000 genes in the human genome; since the first draft of the human genome was released in 2001 [2], the estimate has been revised to 20,000-25,000 genes [1]. We have now learned that only about 2% of the human genome encodes protein sequences [1]. Much of the remaining noncoding sequence was long considered "junk" or "dark matter" [3,4], despite evidence of its participation in gene expression regulation at multiple levels. Housekeeping ncRNAs with known functions have been studied for many decades. For example, they play key roles in translation (tRNA and rRNA), splicing (snRNA), and RNA modification (snoRNA). The advent of state-of-the-art deep sequencing technology has revealed that most of the human genome is pervasively transcribed [5,6], indicating a rich pool of ncRNAs beyond the aforementioned well-characterized molecules.
New small regulatory ncRNAs were first identified by exogenous RNA interference in plants and nematodes, and later found to exist endogenously. These small ncRNAs, including but not limited to microRNAs (about 22 nt long), function as posttranscriptional repressors [7]. Through a combination of size-selected high-throughput sequencing and computational approaches, a very large number of small ncRNAs have now been identified and predicted in genomes, and their evolutionary conservation and structural stability have been extensively analyzed [8]. Generally speaking, the computational pipeline for small ncRNA prediction with high-throughput experiments is now relatively mature [9,10], and over 1600 precursors and 2042 mature miRNAs have been reported in the human genome (miRBase 19, released August 2012).
Beyond the small regulatory ncRNAs, the multifaceted transcriptome has become even more complex with the discovery of the pervasive transcription of long noncoding RNAs (lncRNAs, at least 200 nt long). LncRNAs are known to play important roles in both biological and pathological events [11][12][13][14], including X-chromosome inactivation (Xist) [15], genomic imprinting (Air, Kcnq1ot1) [16,17] and nuclear trafficking (NRON) [18]. The application of tiling arrays allowed the discovery of additional lncRNAs, including the well-characterized HOTAIR [19], NEAT1 and MALAT1 [20]. These lncRNAs are involved in trans-acting gene regulation (HOTAIR) [19], providing a structural scaffold in nuclear architectures (NEAT1) [21][22][23][24] and alternative splicing regulation (MALAT1) [25], although the effects might be very subtle, as indicated by discrepancies between cell cultures [25] and mouse models [26]. Detailed studies of these abundant lncRNAs have served as road maps for the functional characterization of other lncRNAs. More recently, the finding of pervasive transcription from the "dark matter" has drawn attention to the integrated annotation of lncRNAs from transcriptomes. The existence of thousands of lncRNAs from intergenic regions (large intergenic noncoding RNA, lincRNA) has been inferred from massive high-throughput sequencing data including histone modification landscapes (chromatin signatures) in both mouse [27] and human [28]. In addition, functional investigations of certain lncRNAs revealed further roles of these molecules in gene expression regulation, from controlling chromatin complexity [29], to acting as competing endogenous RNAs [30], to performing enhancer-like functions [31], and to maintaining pluripotency [32] and embryogenesis [33]. Furthermore, non-polyadenylated RNA enrichment from human transcriptomes, followed by computational analysis, revealed that some excised introns can stably accumulate as lncRNAs [34]. In some cases, intron-derived lncRNAs are capped by snoRNAs at both ends to protect intronic sequences from degradation after splicing, leading to the formation of a new class of lncRNAs (sno-lncRNAs) [35].
As up to 70% of the human genome can be transcribed [36] and only about 2% of the human genome encodes protein coding genes including UTRs [1], it is not surprising that the majority of lncRNAs were previously classified as "junk sequences" or "dark matter". Some of the best-characterized lncRNAs are highly expressed and conserved across species, but these features are more the exception than the rule and cannot be generalized to the thousands of other lncRNAs identified by large-scale screening. The latter are generally expressed at a low level [37] and are less conserved [38], features that have impeded their discovery and functional study. In this review, we assess issues that have challenged lncRNA discovery in the past, and also highlight recent experimental and computational designs that have facilitated lncRNA identification and characterization. These advances not only shed light on lncRNA characterization but also reveal the complex mechanisms they use to regulate other molecules.

Challenges for LncRNA Discovery
In the first decade of this century, whole genome sequencing revealed approximately 20,000 protein coding genes in humans, which is comparable to estimates in the fly and worm, although humans exhibit much more complexity through alternative splicing [1,2]. With the rapid development of high-throughput technologies, growing lines of evidence have indicated that genomes are pervasively transcribed, with many previously ignored portions of the genome transcribed as lncRNAs [6,36,38] (Figure 1a). However, several intrinsic features of lncRNAs have posed challenges for their discovery as well as their functional study, as discussed below.

LncRNAs in General Are Expressed at Low Levels in vivo, but with High Tissue-Specificity
RNA-seq (deep sequencing from reverse-transcribed RNAs) datasets revealed that the human genome is pervasively transcribed [5]. However, the extent of this pervasive transcription has been disputed [39,40]. The controversy stems partly from the different datasets and computational approaches [6] applied in individual analyses, and partly from the low expression of most noncoding regions in genomes. For example, many such transcripts from intergenic or intronic regions were detected at very low levels by various technologies [41]. In addition, the median expression level of lincRNAs was approximately one-third of that of coding genes in the mouse [42] and about 10-fold lower than that of coding genes in humans [28]. Moreover, the recent Encyclopedia of DNA Elements (ENCODE) project released a variety of transcriptomes of RNA repertoires from 15 human cell lines. The complete annotation of these transcriptomes suggested that lncRNAs have lower expression levels than coding RNAs [36]. In particular, 80% of detected lncRNAs exist in ≤1 copy per cell, compared with only 25% of coding RNAs in examined cell lines [36]. Taken together, the low expression of lncRNAs complicates their discovery, precise annotation, and subsequent functional study. Nonetheless, the expression of a few lncRNAs is comparable to or even higher than that of coding genes in certain cell lines (e.g., H19 in NHEK cells [36] and sno-lncRNAs in hES cells [35]).
Accumulated results suggest that most lncRNAs exhibit a low level of expression but high tissue-/cell-specific patterns [37,38,43]. About 78% of human lincRNAs are tissue-specific, compared with about 19% of protein coding genes [28]. Moreover, the complete transcriptome analyses from 15 human cell lines in the ENCODE project showed that 29% of all detected lncRNAs were found in only one cell line and only 10% were expressed in all cell lines. In contrast, 7% of expressed coding RNAs were detected in only one cell line, but 53% of them were expressed in all cell lines [36]. These observations indicate that tissue-specific expression makes the identification and characterization of these lncRNAs quite challenging if only a small portfolio of tissues/cell lines is chosen for analysis.
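The cell-line breadth statistics quoted above (the fraction of detected transcripts seen in exactly one cell line versus in all of them) can be illustrated with a minimal sketch. The function name `detection_breadth`, the input layout (a dictionary from gene name to per-cell-line expression values) and the detection threshold of 1.0 are assumptions made for illustration only; they are not taken from the ENCODE analysis [36].

```python
from typing import Dict, List, Tuple


def detection_breadth(expression: Dict[str, List[float]],
                      threshold: float = 1.0) -> Tuple[float, float]:
    """Return (fraction seen in exactly one cell line, fraction seen in all).

    `expression` maps a gene name to its expression values, one per cell line.
    A gene counts as "detected" in a cell line when its value is >= threshold
    (a hypothetical cutoff). Genes detected nowhere are excluded from the
    denominator, mirroring the phrase "of all detected lncRNAs".
    """
    only_one = 0
    in_all = 0
    detected = 0
    for values in expression.values():
        n_hit = sum(1 for v in values if v >= threshold)
        if n_hit == 0:
            continue  # never detected: not part of the denominator
        detected += 1
        if n_hit == 1:
            only_one += 1
        if n_hit == len(values):
            in_all += 1
    if detected == 0:
        return 0.0, 0.0
    return only_one / detected, in_all / detected


# Toy data for three cell lines (made-up numbers, not real measurements).
toy = {
    "lnc_a": [5.0, 0.0, 0.0],   # detected in one cell line only
    "lnc_b": [2.0, 3.0, 4.0],   # detected in all cell lines
    "lnc_c": [0.0, 0.0, 0.0],   # never detected, ignored
    "lnc_d": [1.5, 2.0, 0.0],   # detected in two of three
}
frac_one, frac_all = detection_breadth(toy)
```

Applied separately to lncRNAs and coding genes, such a summary makes explicit why a small panel of cell lines will miss many lncRNAs: the smaller the panel, the more single-cell-line transcripts fall outside it.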

Evolutionary Conservation of LncRNAs on Average Is Relatively Lower than That of Coding RNAs
Homologous sequence comparison is an efficient method for identifying genes that exhibit similar functions between species and for discovering novel coding regions [44]. However, it is less effective for non-protein coding sequences, because they are less conserved. For example, only a small portion (<5%) of noncoding sequences are conserved between human and mouse [5,45]. Recent transcriptome analyses by a variety of RNA-seq experiments indicated the existence of thousands of poorly conserved lncRNAs in genomes from zebrafish [46] to mouse [27] and human [28]. Only 29 out of 550 lincRNAs in zebrafish have detectable sequence similarity with putative mammalian orthologs, and similar sequences are typically restricted to a single short region of high conservation [46]. Nonetheless, while lncRNAs are less conserved across species than protein coding genes, on average they still show somewhat higher conservation than random regions or introns [42].
Usually, evolutionary constraint can be estimated from the nucleotide substitution rate in functional sequences [47]. Nucleotide substitutions in ncRNAs are on average about 90-95%, compared with about 10% in coding genes. This is reasonable, as nucleotide substitutions tend to be less deleterious in noncoding sequences than in coding ones [47]. The limited phylogenetic range of many ncRNAs can be explained by their rapid emergence or decline within particular lineages [48]. For instance, it has been suggested that about one third of lncRNAs have arisen within the primate lineage only [38].
The aforementioned studies suggested that low evolutionary conservation might be a natural feature of noncoding transcripts, which is consistent with their rather poor genome-wide annotations in early studies [1,2,4]. However, considering the relatively higher species divergence, it is possible to identify more novel lncRNAs from different species/evolutionary lineages. Their generally low expression level together with poor conservation initially led researchers to conclude that transcripts from noncoding segments may represent transcriptional noise [49]. However, lack of conservation does not mean lack of function [50]. For example, human NEAT1 RNA and its mouse homolog Men ε/β have low sequence similarity [20] but are functionally conserved [21][22][23][24]. Interestingly, some mouse pseudogenes, whose ancestors have lost their protein-coding capabilities during rodent evolution, have retained their expression and act as competitive noncoding RNAs and function as miRNA-decoys [51]. In fact, an increasing number of intensive functional studies have shown that lncRNAs are not just ancient relics with little function, but have a variety of roles from epigenetic regulation to pluripotency maintenance, and are also highly correlated with some human diseases [52,53].

Controversial Coding Capacity of LncRNAs
Exclusion of protein-encoding capacity is a fundamental requirement for lncRNA definition. In the post-genomic era, this capacity can be predicted genome wide using computational approaches, mainly based on the length and conservation of ORFs [54]. Cutoffs for minimal ORF length, whether set at 300 nt (100 amino acids) [55] or even 60 nt (20 amino acids) [56], can still cause controversy. For example, some well-characterized lncRNAs, such as Xist [57], contain remnant ORFs longer than 100 amino acids. With widespread transcription from a given genome, one can imagine that many transcripts identified as lncRNAs may contain ORF remnants, while some coding RNAs may contain only small ORFs for short polypeptides. In this case, computational algorithms that incorporate multiple features are needed to distinguish truly noncoding RNAs from coding ones. For instance, CPC [58] uses six features not only to evaluate the extent and quality of ORFs, but also to assess ORF conservation using BLASTX [59]. Although low conservation of an ORF may reflect gene evolution in specific lineages or gene loss in other lineages, studies suggested that most putative human ORFs with no cross-species counterparts are likely to be random occurrences [60], and this is indeed the case for Xist [57]. The phylogenetic codon substitution frequency (PhyloCSF) metric, based on comparison of orthologous transcripts, was chosen to distinguish noncoding transcripts from coding ones [61], and was successfully applied for lincRNA predictions in both mouse [42] and humans [28].
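The ORF-length criterion discussed above can be illustrated with a minimal sketch. It scans the three forward reading frames of a transcript for the longest ATG-to-stop ORF and compares it with the 300 nt cutoff [55]. The function names and the decision to count the stop codon in the ORF length are assumptions made for illustration; this is not the CPC or PhyloCSF algorithm, and it deliberately ignores conservation, which is why a transcript such as Xist would be misjudged by it.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(transcript: str) -> int:
    """Length in nt of the longest ORF (ATG through stop codon, inclusive).

    Only the three forward frames are scanned, because a transcript
    sequence is already strand-specific. ORFs lacking an in-frame stop
    codon are not counted (a simplification).
    """
    seq = transcript.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def passes_orf_length_filter(transcript: str, max_orf_nt: int = 300) -> bool:
    """True when the longest ORF is shorter than the cutoff (default 300 nt)."""
    return longest_orf_length(transcript) < max_orf_nt


# Toy sequences (synthetic, for illustration only).
short_orf = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"   # ORF of 21 nt
long_orf = "ATG" + "GCT" * 100 + "TAA"                # ORF of 306 nt
```

With these toy inputs, `passes_orf_length_filter(short_orf)` is true and `passes_orf_length_filter(long_orf)` is false. Lowering `max_orf_nt` to 60 mimics the stricter cutoff [56], at the cost of rejecting more transcripts that merely carry random ORF remnants.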
Besides computational judgments based on critical features of putative ORFs, several other crucial criteria, such as subcellular localization and accessibility to the translation machinery, can also be used to evaluate whether a given transcript is a true lncRNA. Predominantly nuclear localization of a transcript suggests a primarily noncoding function. This can be estimated experimentally by RNA fractionation from nuclear homogenates [38], as exemplified by NEAT1 [21] and DEB-T [62], despite the risk of possible nuclear/cytoplasmic leakage during RNA isolation. RNA fluorescence in situ hybridization (FISH) is an alternative way to examine subcellular localization. A growing list of well-characterized lncRNAs localize in the nucleus and within specific subnuclear structures, as illuminated by RNA FISH, and are associated with nuclear proteins, as revealed by RNA-protein double FISH [63]. Furthermore, ribosome profiling coupled with RNA-seq can provide extra insight into the accessibility of a given transcript to the translational machinery [64]. Moreover, proteome datasets covering a spectrum of protein products can also be mined for the presence or absence of coding products from tested transcripts. These datasets offer the most direct evidence of the coding capacity of any transcript, although with low resolution and low availability. Finally, it cannot be ruled out that some transcripts have a dual nature, acting both as ncRNA and producing protein products [65,66].
The best way to distinguish between coding and non-coding sequences is to integrate computational and experimental approaches that enhance understanding of lncRNA expression regulation and biological function in vivo.

Recent Progress in LncRNA Discovery Using New Strategies
With technological improvements and the application of integrated methodologies, significant progress has been achieved in uncovering new lncRNA molecules. Some of these practical strategies can be further applied to achieve new insights into lncRNA functions.

Application of Chromatin Signatures to Determine LncRNAs from Intergenic Regions
Several individual studies have applied a systematic and integrative strategy with multiple biological features to identify lncRNAs, mainly in intergenic regions (lincRNAs), first in mouse [27] and then in zebrafish [46] and human [28] genomes. In contrast to earlier attempts, these studies used "H3K4me3-H3K36me3" chromatin signatures in all three species: the histone 3 Lys 4 trimethylation (H3K4me3) signature marks lncRNA promoters, and the histone 3 Lys 36 trimethylation (H3K36me3) signature marks actively transcribed lncRNA regions. By differentiating the "H3K4me3-H3K36me3" chromatin signatures of lncRNAs from those of known coding genes/microRNAs/endogenous siRNAs, these analyses reliably identified lncRNA-expressed genomic sequences, largely in intergenic regions (Figure 1b). In addition, other stringent criteria have also been taken into account for lncRNA characterization, including the identification of poly(A) sites, transcription initiation signals, expression patterns among tissues and potential coding capacity. Loss-of-function and gain-of-function studies of certain conserved lncRNAs demonstrated crucial biological roles of lncRNAs in zebrafish [46], indicating functional conservation despite limited sequence conservation. More importantly, some lincRNAs have been shown to play important roles in multiple layers of biological processing, including epigenetic regulation and pluripotency maintenance (reviewed by Guttman [14], Rinn [13] and their colleagues).
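The logic of the chromatin-signature screen can be sketched as a simple interval filter: keep H3K36me3 domains that have an H3K4me3 peak overlapping or close to them, and discard those that overlap already annotated genes. This is a simplified sketch under stated assumptions, not the published pipeline [27,28]: the interval representation (chromosome, start, end; half-open), the function names and the `max_gap` distance of 1000 bp are all hypothetical, and strand, peak calling and the additional criteria listed above are ignored.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def overlaps(a: Interval, b: Interval, slack: int = 0) -> bool:
    """True if a and b are on the same chromosome and overlap,
    or lie within `slack` bp of each other."""
    return a[0] == b[0] and a[1] < b[2] + slack and b[1] < a[2] + slack


def candidate_lincrna_domains(k4me3_peaks: List[Interval],
                              k36me3_domains: List[Interval],
                              annotated_genes: List[Interval],
                              max_gap: int = 1000) -> List[Interval]:
    """Return H3K36me3 domains that have a nearby H3K4me3 peak
    and do not overlap any annotated gene."""
    candidates = []
    for domain in k36me3_domains:
        has_promoter_mark = any(
            overlaps(domain, peak, slack=max_gap) for peak in k4me3_peaks
        )
        is_annotated = any(overlaps(domain, gene) for gene in annotated_genes)
        if has_promoter_mark and not is_annotated:
            candidates.append(domain)
    return candidates


# Toy intervals (synthetic coordinates, for illustration only).
peaks = [("chr1", 900, 1100), ("chr2", 50, 150)]
domains = [("chr1", 1000, 5000), ("chr1", 20000, 24000), ("chr2", 100, 3000)]
genes = [("chr2", 2000, 6000)]
hits = candidate_lincrna_domains(peaks, domains, genes)
```

In this toy case only the first chr1 domain survives: the second has no H3K4me3 peak nearby, and the chr2 domain overlaps an annotated gene. The quadratic scan is acceptable for a sketch; genome-scale data would call for sorted intervals or an interval tree.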

Development of a Non-Polyadenylated RNA Enrichment Strategy to Uncover LncRNAs from Introns
Most RNA polymerase II transcripts, including mRNAs and lncRNAs, are polyadenylated (poly(A)+) at their 3' ends. The application of transcriptome analysis of poly(A)+ RNA by high-throughput deep sequencing (mRNA-seq) has revealed a digital map of poly(A)+ transcripts from both known and previously un-annotated genes [67]. However, the transcribed portion of the genome comprises more than poly(A)+ transcripts, and there are a large number of non-polyadenylated transcripts (poly(A)− transcripts), including ribosomal RNAs (rRNAs) generated by RNA polymerases I and III, other small RNAs generated by RNA polymerase III, replication-dependent histone mRNAs [68] and some lncRNAs [24,69] transcribed by RNA polymerase II. Depletion of ribosomal RNAs (RiboMinus) from total RNA leaves both poly(A)+ and poly(A)− transcripts available for deep sequencing analysis. This has led to the discovery of many new poly(A)− transcripts when compared with poly(A)+ RNA deep sequencing [70,71]. However, rRNA-depletion methods cannot physically separate poly(A)− transcripts from poly(A)+ RNAs, so it is difficult to directly annotate poly(A)− transcripts using only the rRNA-depletion method. Recently, a combination of both rRNA and poly(A)+ RNA removal was applied to obtain a largely pure population of poly(A)− RNAs for high-throughput deep sequencing [34]. This type of poly(A)− RNA-seq of human cell transcriptomes surprisingly revealed many previously un-annotated RNA transcripts, including a new family of lncRNAs from introns in humans [35] (Figure 1b). In addition, with the same separation strategy for poly(A)− transcripts followed by deep sequencing analyses, additional poly(A)− lncRNAs from intronic regions were also found in various human cell lines [38]. Interestingly, RNA fractionation from nuclear homogenates also indicated the presence of stable intronic sequence RNAs in X. tropicalis [72]. As most lncRNAs are tissue/cell-specific and species-specific, further application of poly(A)− RNA-seq to different tissues and species may result in the identification of additional intron-derived lncRNAs.
What mechanism(s) can generate RNA transcripts without canonical poly(A) tails at their 3' ends? For most of the replication-dependent histone pre-mRNAs, evolutionarily conserved stem-loop structures in their 3' UTRs direct U7 snRNA-mediated 3' end formation to stabilize mature mRNAs and confer cell cycle dependent regulation of their accumulation [67]. For MALAT1 and Men ε/β lncRNAs, 3' end maturation depends on RNase P cleavage [24,69], and the resulting 3' ends are stabilized by highly conserved A- and U-rich motifs that form a triple-helical structure [73,74]. For telomerase RNA in S. pombe, incomplete rather than complete splicing generates a functional TER1 transcript [75].
However, it appears that none of the above mechanisms can explain the biogenesis of lncRNAs from introns, as introns are generally rapidly degraded after splicing. Yin et al. recently demonstrated that intron-derived sno-lncRNAs depend on the snoRNA machinery at both ends for their processing and on snoRNP complexes at both ends to protect intronic sequences from exonucleolytic trimming [35]. Genome-wide analysis of poly(A)− RNAs from introns has revealed a large number of lncRNAs from intron regions [34,38]; however, only some are capped with snoRNAs. The biogenesis of others needs to be further addressed. Finally, in addition to poly(A)− RNA-seq, the development of more specific experimental and computational approaches will help to understand other poly(A)− lncRNAs matured by RNase P cleavage or incomplete splicing.

Determination of Co-Factors to Study LncRNA Biogenesis and Function
It is now clear that lncRNAs play important roles in a variety of biological processes [13,14,63]. So far, only a handful of mechanisms have been identified to explain how lncRNAs function in vivo. Accumulated lines of evidence suggest that lncRNAs very often function by recruiting and assembling other co-factors, which are usually proteins but may also be other RNAs [51,76,77] or DNAs [78]. Clearly, identifying these co-factors is of key importance for understanding lncRNA function.
The lncRNA Xist is capable of recruiting Polycomb Repressive Complex 2 (PRC2) to remodel chromatin modifications [79], resulting in transcriptional inactivation of one X chromosome. Similarly, Air and Kcnq1ot1 lncRNAs achieve transcriptional silencing by recruiting chromatin-remodeling complexes during genomic imprinting [80,81]. Indeed, many lncRNAs have been identified to bind with PRC2 or other chromatin-modifying complexes for transcriptional repression [32,82]. In addition, lncRNAs can also activate gene transcription by binding specific protein factors. For instance, Evf-2 binds the Dlx-2 protein, which in turn increases the activity of the Dlx-5/6 enhancer [83]. Interestingly, one specific lncRNA might play complementary roles in gene expression regulation by selectively recruiting either PcG for repression [84] or Trithorax group proteins (TrxG) for activation [85].
In addition, lncRNAs can act as molecular scaffolds. For example, telomerase RNA component (TERC) acts as a flexible scaffold that bridges protein subunits together to promote telomerase activity [86]. NEAT1 lncRNA is crucial for the integrity of paraspeckles [21][22][23][24], and a recent study revealed that NEAT1 is capable of initiating de novo paraspeckle formation [87].
Moreover, lncRNAs can also function as molecular sponges or decoys to affect gene regulation mediated by protein co-factors. For example, Gas5 lncRNA binds the glucocorticoid receptor (GR) and competes with glucocorticoid response DNA elements for association with the GR, resulting in functional repression of GR [88]. PWS region sno-lncRNAs trap Fox family members to alter local Fox protein concentration and, subsequently, modulate Fox-regulated alternative splicing events [35]. Meanwhile, lncRNAs also act as competing endogenous decoys through their microRNA response elements (MREs) to titrate the availability of miRNAs for other RNA molecules [30,51,76,77]. Finally, promoter-associated lncRNAs can directly interact with enhancer DNA elements to form DNA:RNA triplexes to carry out their regulatory function [78].
Taken together, these studies suggest that the functional specificity of a given lncRNA is largely dependent on the association with its co-factors, mainly protein partners. Hence, it is important to find associated protein co-factors in order to fully understand the functional roles of lncRNAs. While the potential binding capacity can be predicted by computationally searching for consensus RNA sequences/motifs, direct lncRNA-protein interactomes can also be retrieved from cross-linking immuno-precipitation coupled with high-throughput sequencing (CLIP-seq) (Figure 1b), or using labeled lncRNAs as baits to pull down protein partners.
How do lncRNAs bind to their protein co-factors? A variety of mechanisms are known. Xist contains at least two distinct domains. One is the RepC domain, which is bound by YY1 and hnRNP U for localization; the other is the RepA domain, which recruits PRC2 for in-cis gene expression regulation [89,90]. In contrast to Xist, the PWS region sno-lncRNAs contain multiple consensus hexamer motifs for Fox family splicing regulators [91], which leads to the sequestration of Fox proteins and subsequently alters the patterns of Fox-regulated alternative splicing [35]. Interestingly, lncRNAs with low evolutionary conservation have been found associated with the same proteins. For example, human NEAT1 and mouse Men ε/β share low primary sequence similarity, but both are associated with DBHS proteins [21][22][23][24]. This suggests that RNA structural features may sometimes play important roles in determining protein partners. In this regard, the recent application of genome-wide structural analysis that determines ncRNA secondary structure has begun to decipher the functional elements of the yeast transcriptome [92]. Similar studies in higher eukaryotes will help to reveal structural information and diverse biological insights into lncRNAs, possibly together with their protein co-factors.
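Searching a transcript for consensus binding motifs, as mentioned above for the Fox hexamers in sno-lncRNAs, can be sketched in a few lines. The motif UGCAUG used here is the commonly reported Fox-1/Fox-2 binding hexamer and is supplied as an assumption, since the text above does not spell out the sequence; the function names and the per-kilobase density measure are likewise illustrative choices rather than the method of the cited studies.

```python
import re
from typing import List


def motif_positions(rna: str, motif: str = "UGCAUG") -> List[int]:
    """Return 0-based start positions of every occurrence of `motif`,
    including overlapping ones. DNA input (T) is converted to RNA (U)."""
    seq = rna.upper().replace("T", "U")
    pattern = re.escape(motif.upper().replace("T", "U"))
    # A zero-width lookahead lets finditer report overlapping matches.
    return [m.start() for m in re.finditer(f"(?={pattern})", seq)]


def motif_density_per_kb(rna: str, motif: str = "UGCAUG") -> float:
    """Number of motif occurrences per 1000 nt of transcript."""
    if not rna:
        return 0.0
    return 1000.0 * len(motif_positions(rna, motif)) / len(rna)


# Toy sequence (synthetic, for illustration only).
toy_rna = "AAUGCAUGCAUGAA"
sites = motif_positions(toy_rna)
```

For the toy sequence the hexamer occurs at positions 2 and 6, the two copies sharing two nucleotides. A motif density well above that of length-matched background sequences would be one hint, to be confirmed experimentally (for example by CLIP-seq), that a transcript may act as a protein sink.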

Perspectives
In the era of post-genomics, elucidating the full spectrum of RNA molecules expressed by a given cell is important for understanding gene expression and functional regulation. Largely from the previously imagined "dark matter" of the genome, a variety of lncRNAs have been systematically revealed from different tissues and species with clear characteristics distinguishing them from coding RNAs. The characteristics of lncRNAs are (1) low expression but with a pattern of tissue-specificity, (2) decreased conservation in primary sequence but with a likelihood of functional conservation, and (3) restrained coding capacity but with a probability of ancestral ORF relics. Transcriptome analyses by high-throughput technologies (including tiling arrays and RNA-seq) with high coverage, high sensitivity, and high efficiency represent a leap forward in our methodology for lncRNA characterization. Recent studies have inspired new insights into the study of lncRNAs, and in turn, these insights have prompted further application of novel methodologies for lncRNA study.
Despite recent and rapid progress in our understanding of lncRNAs, a number of important features remain to be further addressed. For example, what are the landscapes of lncRNA expression in specific tissues/species and what are the connections of specific expression repertoires with specific tissue/species function? What are the distinct mechanisms for regulation of lncRNAs in specific tissues/species? What secondary structures are associated with lncRNA functions? Furthermore, existing computational algorithms are not sufficiently robust to deal with these sequence analyses. For example, they are less efficient for the accurate alignment of sequencing reads to lncRNAs in repetitive regions as well as for the precise transcript alignment of multiple lncRNA molecules from the same genomic segments.
Clearly, the integration of new computational pipelines with further experimental approaches will be required to advance our ability to discover new lncRNAs and to understand how they function in gene regulation.