4.1. Dealing with Ambiguity in RNA-Seq Reads Alignment: A Challenge to Resolve TEs Expression Quantification
TEs have been co-opted in different biological scenarios, acting as novel molecules able to regulate the tissue-specific transcriptional networks that are established in physiological and pathological contexts. The advent of evolving NGS technologies, together with the formation of international consortia that produced a multitude of datasets and developed bioinformatic tools, has been indispensable for realizing how broad the involvement of TEs in mammalian biology is, and for defining precise functions for certain classes, superfamilies and subfamilies of TEs in a given spatiotemporal frame. However, in order to precisely define the contribution of a given TE locus to the regulatory networks of specific genes, it is important to identify and characterize TEs at genomic instance resolution. Because TEs are repeated in several highly homologous interspersed genomic loci, their systematic and unambiguous analysis using RNA-seq, either at the genomic instance level or within genes containing TEs, is a non-trivial task (Figure 4), due to the limitations of mapping algorithms, which do not allow the assignment of multi-mapping reads to a precise genomic locus [137
Here, we provide a comprehensive overview of the technological progress in NGS and computational methods, from the sequencing design (e.g., read length and pairing) to the development of specific tools for the downstream analysis of TE annotation and expression. We also outline how these advances have contributed to the knowledge of TE functions in genome biology that we summarized above.
Some precautions in library preparation can help mitigate the amount of multi-mapping TE-derived reads, such as using a paired-end layout and a longer read length, which make it more likely that a read will contain a unique genomic sequence that can be mapped. However, long repeat instances, such as LTR and LINE retrotransposons, can span from hundreds to thousands of nucleotides, which challenges their unambiguous identification with current state-of-the-art RNA-seq protocols. Some of the longest TEs harbor an intact promoter and ORF sequences, and are therefore able to be transcribed and to retrotranspose under conditions that remove their repression, such as hypomethylation in cancer (see Section 3.1). Therefore, being able to resolve the quantification of these TEs can be crucial to properly study the contribution of transposons in such pathological conditions.
Since the early years of the NGS era, multi-reads have been handled in different ways, each with its own advantages and drawbacks: (i) ignoring multi-reads by selecting unique alignments only. This option assigns reads with the highest confidence, but may lead to underestimating the expression levels of TEs and their derivatives, as well as the overall expression level of a sample; (ii) reporting the best alignment for each multi-mapping read based on the alignment quality score calculated by the mapping algorithm. Here, the results may vary depending on how mismatches and gaps between the reads and the reference genome are weighted, making it difficult to provide the exact genomic location with high confidence; (iii) keeping multi-reads and counting them once for each mapped feature. This prevents potentially relevant loci from being discarded from the downstream analysis. However, the expression of genomic features characterized by a high number of multi-reads, as well as the total library size, will be overestimated.
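The three strategies can be contrasted on a toy example. The following is a minimal sketch, assuming a hypothetical input in which each read is associated with a list of (feature, alignment score) pairs; the read names, feature names and scores are invented for illustration and do not come from any real aligner output.

```python
from collections import Counter

# Hypothetical toy input: read id -> list of (feature, alignment score).
alignments = {
    "r1": [("geneA", 60)],                              # uniquely mapped read
    "r2": [("L1_a", 40), ("L1_b", 38)],                 # multi-read with a best hit
    "r3": [("L1_a", 30), ("L1_b", 30), ("L1_c", 30)],   # multi-read with tied hits
}

def count_unique_only(alns):
    """Strategy (i): keep only reads with exactly one alignment."""
    counts = Counter()
    for hits in alns.values():
        if len(hits) == 1:
            counts[hits[0][0]] += 1
    return counts

def count_best_alignment(alns):
    """Strategy (ii): keep the highest-scoring alignment of each read.

    Ties are broken by input order, which is arbitrary; this is exactly
    the weakness of the strategy for highly similar TE copies.
    """
    counts = Counter()
    for hits in alns.values():
        best_feature, _ = max(hits, key=lambda hit: hit[1])
        counts[best_feature] += 1
    return counts

def count_all_alignments(alns):
    """Strategy (iii): count each read once for every feature it maps to."""
    counts = Counter()
    for hits in alns.values():
        for feature in {feature for feature, _ in hits}:
            counts[feature] += 1
    return counts

unique_counts = count_unique_only(alignments)
best_counts = count_best_alignment(alignments)
all_counts = count_all_alignments(alignments)
```

In this toy case, strategy (i) retains a single read out of three, strategy (ii) assigns the tied read `r3` to `L1_a` only because it is listed first, and strategy (iii) yields six counts from three reads, which illustrates the inflation of the library size.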
To avoid discarding relevant biological signals from multi-mapping or ambiguous reads, multi-mapping reads should be either assigned to a unique genomic feature or re-distributed across the multi-mapped regions. For the assignment to a unique genomic feature, available methods implement algorithms that identify, according to different criteria (see below), the genomic feature that is the source of transcription for those reads. Whenever this is not possible, the reads can be assigned probabilistically: they are proportionally re-distributed across the mapped genomic features according to how likely each feature is to be the source of transcription (often based on the level of transcription of the genomic features, see below). This approach offers a more precise estimation of expression and read coverage across genes, and some of the methods implementing it are discussed below.
In 2008, Mortazavi et al. [138] described one of the earliest efforts in this direction, in which multi-reads are recovered by distributing them across the aligned genes in proportion to the number of unique alignments on each gene. For several mouse genes, this method increased the expression level estimates by more than 30% compared to discarding multi-reads.
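The proportional rescue described above can be sketched in a few lines. This is an illustration of the general idea, not the code of the original study: the function name `rescue_multireads`, the input layout and the even-split fallback for genes without unique support are assumptions made for this example.

```python
def rescue_multireads(unique_counts, multireads):
    """Distribute each multi-read across its candidate genes.

    unique_counts: dict gene -> number of uniquely aligned reads.
    multireads: iterable of tuples, each listing the genes one read maps to.
    Each multi-read contributes a total weight of 1, split in proportion
    to the unique counts of its candidate genes.
    """
    counts = {gene: float(n) for gene, n in unique_counts.items()}
    for genes in multireads:
        weights = [unique_counts.get(gene, 0) for gene in genes]
        total = sum(weights)
        for gene, weight in zip(genes, weights):
            # Assumption: split evenly when no candidate has unique support.
            share = weight / total if total > 0 else 1.0 / len(genes)
            counts[gene] = counts.get(gene, 0.0) + share
    return counts

# Gene A has three times the unique support of gene B,
# so it receives three quarters of each shared read.
rescued = rescue_multireads({"A": 30, "B": 10}, [("A", "B")] * 4)
```

With these toy numbers, the four shared reads add 3 counts to gene A and 1 count to gene B, so the total number of reads is preserved, unlike when each multi-read is counted once per mapped feature.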
The importance of using multi-reads in gene expression profiling of cancer has been more recently considered by Robert and Watson [139], with a survey of 12 common methods for gene-level expression quantification from RNA-seq data. The expression levels of hundreds of genes are underestimated by one or more of those methods; interestingly, many of these genes are implicated in human diseases. The authors propose quantifying such genes via multi-map groups (MMGs), i.e., groups of genes to which the same multi-reads map. With this approach, MMGs were found to be differentially expressed between normal and lung tumour mouse cells, while the methods based on unique counts failed to produce this result [139]. By avoiding the quantification of individual ambiguous genes, Robert and Watson could retrieve important data that would otherwise have been missed; on the other hand, transcript-level information is not considered in the analysis. This technical gap was filled by the multi-mapper resolution tool (MMR), developed by Kahles et al. in 2016 [140]. In contrast to the previous methods, MMR returns an expression estimate for each individual gene or transcript, and it does not proportionally distribute multi-reads across the aligned features. Rather, MMR assumes that the read coverage should be uniform within a local region, and thus selects the alignment that leads to the smoothest coverage signal across a window of fixed length.
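The smoothness criterion can be illustrated with a simplified, single-pass sketch. It is not MMR's actual implementation: the use of variance as the smoothness measure, the window size, the single coverage vector and the function names are all assumptions made for this example.

```python
from statistics import pvariance

def variance_change(coverage, start, length, window):
    """Change in local coverage variance caused by placing one read at `start`.

    The window extends `window` positions on each side of the read.
    A negative value means that the read makes the local coverage smoother.
    """
    lo = max(0, start - window)
    hi = min(len(coverage), start + length + window)
    before = list(coverage[lo:hi])
    after = list(before)
    for pos in range(start, min(start + length, len(coverage))):
        after[pos - lo] += 1
    return pvariance(after) - pvariance(before)

def pick_smoothest(coverage, candidates, length, window=5):
    """Return the candidate start position that yields the smoothest coverage."""
    return min(
        candidates,
        key=lambda start: variance_change(coverage, start, length, window),
    )

# Toy coverage: a dip at positions 5-7, flat coverage everywhere else.
coverage = [5] * 5 + [4] * 3 + [5] * 18
chosen = pick_smoothest(coverage, candidates=[5, 18], length=3)
```

Here the read is placed at position 5, where it fills the dip and flattens the signal, rather than at position 18, where it would create a bump on an otherwise uniform region.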
Recently, pseudo-alignment algorithms emerged as an alternative to aligning RNA-seq reads to a reference genome, by directly inferring the transcript from which each read originates [141]. The ambiguity of highly overlapping transcripts in the human genome is circumvented by probabilistically distributing the read counts across a given transcriptome, which avoids the generation of multi-reads in the first place. Tools based on pseudo-alignment have become valuable in transcriptomics, providing a fast and reliable method for transcript-level quantification.
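The probabilistic distribution of reads is commonly formulated as an expectation-maximization (EM) problem over the sets of transcripts compatible with each read. The following is a minimal, generic sketch of such an EM loop, not the implementation of any specific tool; it deliberately ignores transcript lengths and fragment-level biases, and the function name and input layout are assumptions made for this example.

```python
def em_abundances(read_compat, transcripts, n_iter=100):
    """Estimate relative transcript abundances from read compatibility sets.

    read_compat: list of tuples, each listing the transcripts one read is
    compatible with. Returns a dict transcript -> estimated read fraction.
    """
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        # E-step: split each read across its compatible transcripts
        # in proportion to the current abundance estimates.
        expected = {t: 0.0 for t in transcripts}
        for compat in read_compat:
            denom = sum(theta[t] for t in compat)
            if denom == 0:
                continue
            for t in compat:
                expected[t] += theta[t] / denom
        total = sum(expected.values())
        if total == 0:
            break
        # M-step: renormalize the expected counts into abundances.
        theta = {t: count / total for t, count in expected.items()}
    return theta

# Three reads unique to T1, one unique to T2, four compatible with both.
reads = [("T1",)] * 3 + [("T2",)] + [("T1", "T2")] * 4
theta = em_abundances(reads, ["T1", "T2"])
```

In this toy case, the estimates converge to 0.75 for T1 and 0.25 for T2, so the ambiguous reads end up being shared in the same 3:1 ratio as the unambiguous evidence.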
Besides RNA-seq, several NGS methods are designed to meet specific needs in transcriptome analysis. Among these, cap analysis of gene expression (CAGE) is a high-throughput technology for sequencing the 5′ end of transcripts into short reads (tags) [143]. CAGE has proven valuable for the discovery of novel transcription start sites (TSS) of either novel genes or alternative transcript isoforms of known genes [144]. Faulkner et al. [145] presented a method to recover short multi-reads produced by tag-based NGS technologies such as CAGE, in which a score is given to each tag-TSS association according to the number of individual tags associated with the same TSS; multi-mapping tags are then proportionally assigned to the mapped TSS according to the calculated scores. With this method, it has been demonstrated that up to 30% of transcripts initiate from within TEs [36], and that some of them are associated with enhancer regions in stem cells, regulating their pluripotency [46
Therefore, rescuing multi-mapping CAGE tags, or multi-reads in other NGS technologies complementary to RNA-seq, has been fundamental for clarifying the extent to which TEs influence the transcriptional output of mammalian cells in both physiological and pathological contexts.
4.2. Current Computational Methods for TEs Transcriptome Analysis
General-purpose computational methods, such as the aforementioned ones, help recover ambiguous reads, including those originating from TEs, for their inclusion in downstream analyses. However, some analyses require complementary specialized tools designed for TEs, either to survey the overall contribution of the various TE categories to the transcriptional output of a certain tissue, or to properly distribute the RNA-seq signal among active TE instances and TEs expressed as part of other transcripts.
Several TE-centric tools have been developed to (i) identify and quantify expressed TEs from transcriptomic datasets, either at the subfamily level (counting a subfamily as an individual entity) or at the genomic instance level (quantifying the expression of individual elements), and (ii) discern TEs that are actively transcribed as individual transcriptional units from those that are co-expressed within other transcripts (Table 1
Criscione et al. [146] published RepEnrich in 2014. They rescued most multi-reads by assigning them proportionally to the subfamilies on which they align, and showed that many TE subfamilies are expressed in a tissue-specific manner and are significantly enriched in cancer [148]. Recently, Jung et al. [156] used TEtranscripts to improve the expression estimate in cancer of L1Hs, a subfamily that is potentially active in the human genome. By quantifying L1Hs somatic insertions and their overall expression in whole-genome and RNA sequencing data from matched TCGA gastrointestinal cancer samples, they found that the number and expression of L1 insertions are significantly higher in cancer tissues than in normal ones, and that L1 insertions cause abnormal mRNA splicing and gene expression [156
TEtranscripts does not discern potentially autonomously transcribed TEs from pervasively transcribed ones. To do that, Navarro et al. [155] recently released TeXP, a method that removes the noise due to pervasive transcription from the RNA-seq signal mapping on evolutionarily young subfamilies [155]. They applied this method to several RNA-seq datasets from cancer and healthy human cell lines and tissues, and found a greater amount of autonomous transcription of transposons in the human germline and in tumor cell lines.
A different approach to quantify the expression of TEs at the class, superfamily or subfamily level is to align RNA-seq reads to a custom transcriptome of TE sequences, rather than to a reference genome. TEtools [153] is a pipeline that works in this way: given the sequences of TE instances, it enables the analysis of a TE transcriptome by computing class-, superfamily- and subfamily-level counts and performing a differential expression analysis. In a recent epigenetic study on the function of histone 3 lysine 4 oxidation by LOXL2 in breast cancer cells, Cebrià-Costa et al. used TEtools to perform a differential expression analysis of TEs, and to rule out the possibility that the overexpression of TEs was responsible for the DNA damage response in LOXL2 KD cells [157
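The aggregation of instance-level counts into higher levels of the TE hierarchy can be sketched as follows. This is only an illustration of the roll-up step, not code from TEtools: the annotation table, the instance identifiers and the function name `roll_up` are hypothetical.

```python
from collections import defaultdict

# Hypothetical annotation: TE instance -> (class, superfamily, subfamily).
annotation = {
    "L1HS_chr1_100": ("LINE", "L1", "L1HS"),
    "L1HS_chr5_900": ("LINE", "L1", "L1HS"),
    "L1PA2_chr2_50": ("LINE", "L1", "L1PA2"),
    "AluY_chr3_10": ("SINE", "Alu", "AluY"),
}

def roll_up(instance_counts, annotation):
    """Sum per-instance read counts at class, superfamily and subfamily level."""
    levels = {
        "class": defaultdict(float),
        "superfamily": defaultdict(float),
        "subfamily": defaultdict(float),
    }
    for instance, count in instance_counts.items():
        te_class, superfamily, subfamily = annotation[instance]
        levels["class"][te_class] += count
        levels["superfamily"][superfamily] += count
        levels["subfamily"][subfamily] += count
    return {level: dict(counts) for level, counts in levels.items()}

instance_counts = {
    "L1HS_chr1_100": 10,
    "L1HS_chr5_900": 5,
    "L1PA2_chr2_50": 3,
    "AluY_chr3_10": 7,
}
hierarchy_counts = roll_up(instance_counts, annotation)
```

The resulting count tables, one per hierarchy level, are the kind of input that a standard differential expression analysis would then take.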
As mentioned above, pseudo-alignment can quantify transcripts using both unique and ambiguous reads, avoiding the generation of multi-reads. Recently, TE-centric pipelines based on pseudo-alignment have been released, such as SalmonTE [149] and REdiscoverTE [95], both of which leverage Salmon’s pseudo-alignment algorithm. Kong et al. illustrate REdiscoverTE using a reference that includes over five million genomic repetitive elements annotated by RepeatMasker [158], together with cDNA transcript sequences and the sequences of introns containing repetitive elements. They show that including all genomic repeat instances in the reference transcriptome allows the sequence diversity within TE subfamilies to be taken into account. This covers any genomic TE loci that significantly deviate from the Repbase consensus sequence, and results in a more accurate quantification of TE hierarchies. Further, the inclusion of intronic sequences containing repetitive elements allows reads to be mapped on TEs transcribed within unannotated alternative exons or retained introns. By applying this pipeline to 7750 TCGA cancer samples, Kong and colleagues [95] described the TE expression landscape in cancer, differentiated between TEs co-expressed within host genes and intergenic TEs, and found the latter to be more expressed and more correlated with DNA demethylation, DNA damage and immune response in cancer [95
Measuring the expression enrichment of TEs in RNA-seq data when comparing different cell types, developmental stages or pathological conditions can provide important evidence on the regulatory networks in which TEs are involved. However, to investigate in depth the involvement of TEs in a specific mechanism or phenotype, it is crucial to study TE expression at individual genomic instance resolution. For example, a different function would be expected for evolutionarily old TEs compared with the youngest ones, which retain a promoter and are able to retrotranspose in the genome. For this purpose, Yang et al. published SQuIRE in 2019, the first bioinformatics tool designed for locus-specific quantification of interspersed repeats [150], based on the spliced alignment of RNA-seq data to a reference genome. By applying this method, they show the differential expression of individual TE instances across different tissues of healthy mice, as well as TEs differentially expressed in a D. melanogaster model of amyotrophic lateral sclerosis, and highlight the structure of the transcripts containing such TEs, which would not have been possible without locus-level resolution.
Besides SQuIRE, other tools report expression estimates of TEs at the genomic instance level [146], such as the L1EM tool [146], which has been developed to quantify the expression of autonomously transcribed L1 elements at the locus level. As reported by the L1EM analysis, full-length L1 loci of the L1Hs subfamily are highly expressed in stem and cancer cells, while being less expressed in differentiated tissue samples.
Bioinformatics analyses in TE-centric studies need not be limited to the expression of TE instances. As we reviewed, TEs influence the transcription of coding and non-coding RNAs in several ways [36]. Jang et al. characterized the landscape of TE onco-exaptation across RNA-seq data from TCGA tumors and normal samples, which they reanalyzed using a pipeline for transcript assembly and integrated with data from the FANTOM5 consortium for the annotation of TE-derived transcription start sites [116]. This analysis revealed the prevalence of TE usage as novel regulatory sequences in cancer and its importance for oncogene activation and tumorigenesis. In this context, a recent tool, LIONS [147], is specifically designed to detect and quantify transcripts initiated from within TEs. This tool is able to estimate the expression levels of both TEs and exons, and to compute a specific metric that discerns TE-initiation from TE-exonization events based on read coverage. Finally, if more than one experimental group is being processed, LIONS performs a differential analysis between them.
Alternative approaches for an accurate quantification of TE expression could also use data generated by new technologies, although these are less available than RNA-seq. For example, Deininger et al. developed a pipeline based on RNA-seq and 5′ RACE coupled with PacBio sequencing of 1200 base pair-long reads to estimate the expression of L1 RNAs expressed as independent transcriptional units [159]. In particular, they show that a large part of the total expression of full-length L1 elements derives from the transcription of a relatively small number of L1 loci. This method anticipates the potential of long-read sequencing in identifying the TEs that contribute to the majority of expression and new insertions in several cancer conditions. Indeed, recent advancements in long-read sequencing, which obtains and maps reads that are tens of thousands of base pairs long, should make it possible to identify the TEs that are expressed and contribute to new insertions in cancer, and may signal a new era for the analysis of TEs in transcription regulation, as well as for genomics as a whole [160
Despite the limitations of NGS technologies for studying interspersed repetitive elements, recent efforts in bioinformatic research have undoubtedly increased the level of confidence with which the expression levels of such elements are estimated, and have enabled the discovery of several transcriptional regulatory networks in which they are involved in physiological and pathological conditions. Nonetheless, further efforts are still required to improve bioinformatic practices and to increase the awareness of the biological relevance of what was once called “junk DNA”.