Long non-coding RNAs (lncRNAs) are an important part of the mammalian transcriptome [1
]. Although the set of lncRNAs that is well understood with respect to biological function and molecular mechanisms is still limited, it is rapidly expanding. Genes such as ANRIL
], or Xist
] may serve as examples. Widespread roles include—but are not limited to—interactions with chromatin to silence or activate chromatin [9
], and the regulation of splicing [11
]. Still, the coverage and precision of the lncRNA annotations lag behind the accurate maps of protein-coding genes. The GENCODE project [12
] provides the most accurate transcript and gene annotation for the human genome. It is a combination of manual and automated annotation techniques which endeavours to list gene features from HAVANA and Ensembl datasets. Detailed surveys of expression patterns across many tissue and cell-types (e.g., [13
]) provide evidence for intricate regulatory networks in which lncRNAs are key players.
There is mounting evidence, however, that lncRNA isoforms may differ drastically in their biological function [16
]. Due to their usually low expression values, gene models for lncRNAs historically were often truncated—a situation that has only been improving recently. Notably, the recent lncRNA atlas by Hon et al. [15
] is specifically aimed at providing accurate 5’ ends. The situation is still more difficult at the 3’ side, since long, large, unspliced 3’ regions make it difficult to determine complete transcript from Illumina data (e.g., [18
]). Furthermore, lncRNAs such as ANRIL
] exhibit complex patterns of alternative splicing. Even in extremely well-studied protein-coding loci, rare isoforms keep being discovered [20
]. It thus remains an unresolved question to what extent the current maps of lncRNAs are complete, both in terms of the number of expressed transcripts per gene and in terms of the variability of their isoforms.
One component towards answering this question is to ask to what extent the transcript portfolio of a particular cell type has been mapped completely. Conversely, one asks in this context to what extent reported transcripts are noise. To address these issues, we investigate here a very large set of transcriptome data from B cell lymphomas. By virtue of aggregating hundreds of independently generated transcriptome datasets, we can study in detail if the set of detectable splice junctions converges to a consensus.
A comparison of published annotation data showed substantial differences in the average number of exons per transcript, with some systematic trends over time. The recent Ensembl version 83 [21
], which shares GENCODE v.24 [22
] as basis of gene annotation, and reports slightly fewer introns per lncRNA gene than earlier versions. Earlier Ensembl versions report only a limited number of long intergenic non-coding RNA (lincRNA) genes. Ensembl version 60 [23
], for example, reports only 1443 lncRNAs, compared to 15,941 lncRNAs in GENCODE v.24. The average number of introns is very close to the data reported for the much more complete GENCODE v.7 annotation. In contrast, the NONCODE database is very inclusive and provides more than an order of magnitude more entries. Correspondingly, the GENCODE annotation also exhibits a moderate decrease in the number of exons per lncRNA over time (Table 1
). This is probably a consequence of the fact that more recently included lncRNA contain a larger fraction of intron-poor or even intronless RNAs.
To better understand the distribution and structure of rare isoforms in the human transcriptome, we systematically investigated the influence of the dataset size on the complexity of inferred gene structures. We focussed on RNA-Seq data from 111 lymphoma samples consisting of the different subtypes Burkitt lymphoma (BL), follicular lymphoma (FL), and diffuse large B cell lymphoma (DLBCL) that was published in the context of the ICGC MMML-Seq project [24
]. The RNA-Seq data comprised of a total of more than 10 billion mapped reads. Since the RNA-seq data is not strand-specific, only non-overlapping lncRNA genes (lincRNAs) were considered for this analysis. Using GENCODE version 19 as reference, we obtained a set of 5257 lincRNAs that were detectably expressed in our lymphoma data.
summarizes the effect of increasing coverage on the estimated average number of introns per gene locus. Here, only the subset of expressed lncRNAs with at least one annotated intron is used to reduce the chance of erroneously or ambiguously mapped reads. The data qualitatively reproduce the observation of the ENCODE project that there is a large difference in the average number of splice junctions between protein-coding loci and ncRNAs [25
]. Although restricted to a quite narrowly defined cancer type, the average number of introns in lncRNAs is larger by about one than the GENCODE dataset, which uses a composite of a broad range of cell lines and tissues. While GENCODE v.19 reports an average of 3.21 introns for these genes, the analysis of the RNA-seq data reveals a mean of 4.29 introns. We can confirm that slightly more than 8% of the introns are novel. This is clearly an effect of the extreme sequencing depth of combined lymphome data. The curves also show that the data saturate very slowly, requiring dozens or even a hundred samples to reach the plateau value.
The data show that the detected splice junctions are unlikely to be noise, since the curves saturate well for all three minimum read support thresholds instead of showing linear growth, affirming their biological reality.
In Figure 2
, we compare the data in more detail with the GENCODE v.19 annotation, which was used here as the reference annotation dataset since the genes in the lymphome dataset have the largest overlap with this version. Alternative splicing is a very common phenomenon throughout the human genome [26
], and expression levels of alternatively spliced isoforms are usually tissue-specific. It is not surprising that genes with multiple exons are more likely to have alternate splice sites [28
]. For the case of protein-coding loci, we tend to miss some splice junctions at those loci that are extremely lowly expressed in the lymphome transcriptomes. This is not surprising, as rare variants are of course easier to detect in transcriptomes where they are more highly expressed; after all, the GENCODE annotation is a composite of vastly diverse cell types and tissues. It is interesting to note, however, that we systematically observe more introns at moderate reads per kilobase and million reads (RPKM) values, even from the very narrowly-defined cell types used here. This attests that large numbers of well-defined but rare isoforms have so far eluded annotation.
Comparing our data to existing GENCODE annotations, we find that lncRNAs exhibit systematically larger numbers of exons and introns. However, the discrepancy is moderate and applicable—in particular to lncRNAs that already have a large number of exons annotated. Around 41% of genes are found to have a greater number of introns. Fourteen percent of the genes have at least one intron more, and 19% are calculated to have at least two introns more.
In Figure 3
, we show two examples that appear to be substantially more complicated in our data than in the GENCODE annotation. No functional annotation is available at present for either locus. As the figure shows, at least some of the additional exons also appear in EST data tracks provided by the UCSC genome browser.
3. Concluding Remarks
We have shown that human transcriptome data harbour a large number of rare exons (and thus also introns) that have remained unannotated. Due to their low abundance, they appear only when data from large-scale experiments are pooled. As shown in Figure 1
, they can nevertheless be reproduced very accurately. There is very little noise in these data, as shown by the near perfect saturation of the average number of splice junctions per gene. Transcriptional noise—whether biological or technical—would result in a linear increase of the number of detected junctions as a function of the size of the data set. If such a slope exists, it is too small to be detectable from our data, which comprise of more than
reads. Therefore, we have to conclude that the current annotation of the human transcriptome is confined to a very well-defined set of splice variants. As a consequence, it is a meaningful and—in our view—worthwhile task to attempt the construction of an exhaustive atlas.
The well definedness of isoforms by itself does not imply that all isoforms carry biological functions. If the vast majority of isoforms are indeed non-functional junk, however, an explanation is needed for the precision of the processing and its restriction to very few splice sites.
4. Materials and Methods
The 111 RNA-seq lymphoma samples from the ICGC MMML-Seq project [24
] were mapped onto the human reference genome hg19 using the splicing-aware mapping tool segemehl
, version 0.1.7 [20
] with split-read mapping (option -S
) enabled in addition to the default parameters. The total number of reads were around 120 million per sample. The read length was 101 nucleotides, and around 90% of the reads were mappable.
The read supports for all identified splice junctions (i.e., genomic intervals spanning exon–exon boundaries) were calculated. Since segemehl
identifies split-reads independent of any annotation, we consider all splice junctions located within genomic coordinates taken from GENCODE version 19 as potential introns belonging to that GENCODE gene. In order to call it an intron, we require a minimum of one, five, or ten reads representing the junction. This procedure was performed using the complete range of all dataset sizes from one sample to all 111 samples. Normalized mean expression values—quantified as RPKM (reads per kilobase and million reads)—were averaged over all 111 samples and used to define a gene’s expression level in the lymphoma data set in Figure 2
. We utilized the UCSC genome browser [31
] to visualize additional introns displayed in Figure 3
To analyse the annotation landscape of lncRNAs, we used the detailed annotation data from GENCODE (versions 7 through 24) [32
], Ensembl (releases 60 and 83) [33
], NONCODE (version of 2016) [35
], and the dataset by Cabili et al. [13
]. We included the categories “antisense”, “lincRNA”, “processed_transcript” and “sense_intronic” biotypes as lncRNAs. The annotation data were used to define the location of individual genes to count the total number of introns. In order to address changes in the annotated gene structures over time, we used the GENCODE v.19 genes as reference. The intersection of GENCODE v.19 with the loci that are supported by at least 10 reads in the lymphoma data set comprises 5257 lincRNAs.
An in-house pipeline was used to aggregate the multiple transcriptomes and compute the summary statistics of interest. Input datasets are provided in the standard Gene Transfer Format (GTF) files, which contain fields indexing the various features of each gene, viz. its exact location, as well as its set of constituent transcripts. Within a gene, transcripts typically overlap and share exons. We therefore first determined a collection of unique exons for each gene, which was subsequently used to determine the unique introns. Mean and median number of introns and exons per gene were computed from the number of unique exons and introns, respectively (i.e., without considering how often they appear in distinct transcripts). A summary of the overlap between lincRNAs of the lymphoma dataset and GENCODE annotations can be found in Table 2