**The Intertwining of Transposable Elements and Non-Coding RNAs**

**Michael Hadjiargyrou 1 and Nicholas Delihas 2, \***


*Received: 20 May 2013; in revised form: 5 June 2013 / Accepted: 5 June 2013 / Published: 26 June 2013*

**Abstract:** Growing evidence shows a close association of transposable elements (TE) with non-coding RNAs (ncRNA), and a significant number of small ncRNAs originate from TEs. Further, ncRNAs linked with TE sequences participate in a wide range of regulatory functions. Alu elements in particular are critical players in gene regulation and molecular pathways. Alu sequences embedded in both long non-coding RNAs (lncRNA) and mRNAs form the basis of targeted mRNA decay via short imperfect base-pairing. Imperfect pairing is prominent in most ncRNA/target RNA interactions and found throughout all biological kingdoms. The piRNA-Piwi complex is multifunctional, but plays a major role in protection against invasion by transposons. This is an RNA-based genetic immune system similar to the one found in prokaryotes, the CRISPR system. Thousands of long intergenic non-coding RNAs (lincRNAs) are associated with endogenous retrovirus LTR transposable elements in human cells. These TEs can provide regulatory signals for lincRNA genes. A surprisingly large number of long circular ncRNAs have been discovered in human fibroblasts. These serve as "sponges" for miRNAs. Alu sequences encoded in introns that flank exons are proposed to participate in RNA circularization via Alu/Alu base-pairing. Diseases are increasingly found to have a TE/ncRNA etiology. A single point mutation in a SINE/Alu sequence in a human long non-coding RNA leads to brainstem atrophy and death. On the other hand, genomic clusters of repeat sequences as well as lncRNAs function in epigenetic regulation. Some clusters are unstable, which can lead to formation of diseases such as facioscapulohumeral muscular dystrophy. The future may hold more surprises regarding diseases associated with ncRNAs and TEs.

**Keywords:** non-coding RNAs; transposable elements; microRNAs; Alu sequences; endogenous retrovirus LTR; epigenetics; disease formation

#### **1. Introduction**

The genome is a dynamic entity, ever-changing as a result of endogenous DNA movement or the acquisition of exogenous DNA leading to genomic rearrangements. Such events have contributed to the plasticity and evolution of the genome and all of its complexity. Much of this complexity has slowly come to light over the past decades, but the pace of discovery has accelerated in the last few years as a result of breakthroughs in genomic technologies, development of newer sequencing techniques, and the deposition of data in public databases by scientists all over the world. This new and vast genomic knowledge has led to a revolutionary and unparalleled in-depth examination of the genome, transcriptome, proteome, interactome, *etc.* Indeed, the concept and definition of a gene may have to be altered [1,2].

However, one key question before us is what new functional loci and regulatory mechanisms have been formed during genomic evolution, especially as they pertain to the genesis of non-coding RNAs (ncRNAs), ncRNA regulatory roles and their association and interaction with transposable elements (TEs). In this review, we outline recent advances in origins of microRNAs (miRNA) and functional properties of ncRNAs as they pertain to their interaction with TEs, especially in humans. What emerges is a fascinating new picture of interconnected molecular interactions and regulatory pathways.

#### *1.1. Non-coding RNAs*

The development and use of new sequencing techniques, such as RNA-Seq, has greatly accelerated the discovery of new RNAs [3,4]. The Encyclopedia of DNA Elements (ENCODE), an international project with the goal of determining functional elements of the entire human genome, has employed these techniques and found thousands of new RNA transcripts [2,5]. Surprisingly, about 75%–85% of the human genome is transcribed into primary and processed transcripts [2]; yet only 1.2% of the human genome encodes proteins. This suggests that most of the human genome space is devoted to RNA synthesis that does not serve protein coding. Functions for most non-coding RNA (ncRNA) transcripts are unknown, but the future may hold fascinating prospects of finding new roles and molecular pathways. For example, thousands of circular ncRNAs (cirRNA) have recently been identified; they represent scrambled coding sequences that originate from exons (nonrandom products of RNA splicing) and are involved in small ncRNA regulation [6,7]. These cirRNAs are transcripts that do not encode proteins but have a regulatory role in the cell and thus are regulatory ncRNAs. A significant number of ncRNAs stem from non-protein-coding regions of the genome (intergenic regions), but many also originate from protein-coding regions as antisense transcripts, or from intron regions, and as just mentioned, from scrambled coding sequences. Many ncRNAs target mRNAs and induce their degradation. Others are associated with regulation of transcription. Indeed, the biological significance of regulation by RNA was grossly underestimated in the past.

Given the barrage of published studies on newly discovered ncRNAs, especially in eukaryotes, their classification and subclassification is very challenging. However, Di Leva and Garofalo [8] used a simple classification system and presented three basic categories: (1) housekeeping RNAs (rRNAs, tRNAs, snRNAs and snoRNAs); (2) short non-coding RNAs that are less than 200 nucleotides, which include but are not limited to microRNAs (miRNAs), Piwi-interacting RNAs (piRNAs) and retrotransposon-derived ncRNAs; and (3) long non-coding RNAs (lncRNAs) that are greater than 200 nucleotides. lncRNAs are currently divided into long intergenic ncRNAs (lincRNAs), which are encoded in intergenic regions; transcripts from introns; long ncRNAs that are antisense transcripts in coding regions but do not encode proteins; and circular RNA transcripts from coding regions that have scrambled exon sequences and also do not encode proteins. A recent identification and classification of long ncRNAs lists additional categories [5].

A common theme that prevails in target RNA regulation by ncRNAs is the formation of short intermolecular RNA/RNA base-paired stems that contain Watson–Crick pairs, imperfect pairing (bulged and looped out positions), and non-canonical pairs. Imperfect ncRNA/target RNA pairing was determined experimentally with the first discovered and functionally characterized non-coding RNA in prokaryotes [9–11]. This is a type of ncRNA/target RNA binding that prevails throughout all biological kingdoms, although RNA binding proteins are also key factors in stable binding. Most of the ncRNAs discussed in this paper interact with their target RNAs via imperfect pairing.
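The notion of an imperfect intermolecular stem can be made concrete with a small sketch. The Python below is illustrative only and is not a method from any of the cited studies: it labels each position of an equal-length antiparallel duplex as a Watson–Crick pair, a G·U wobble pair (taken here as the representative non-canonical pair), or a mismatch. Bulges and looped-out positions are not modelled, and the function names `classify_duplex` and `paired_fraction` are hypothetical.

```python
# Illustrative sketch: label positions of an antiparallel RNA/RNA duplex.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}  # representative non-canonical pair


def classify_duplex(ncrna, target):
    """Pair ncrna (5'->3') against target read 3'->5', position by position.

    Both sequences are given 5'->3' and must have equal length;
    bulged or looped-out positions are not modelled in this sketch.
    """
    if len(ncrna) != len(target):
        raise ValueError("sequences must have equal length in this sketch")
    labels = []
    for a, b in zip(ncrna.upper(), reversed(target.upper())):
        if (a, b) in WATSON_CRICK:
            labels.append("WC")
        elif (a, b) in WOBBLE:
            labels.append("wobble")
        else:
            labels.append("mismatch")
    return labels


def paired_fraction(labels):
    """Fraction of positions that form a Watson-Crick or wobble pair."""
    if not labels:
        return 0.0
    return sum(label != "mismatch" for label in labels) / len(labels)
```

For a toy pair such as `classify_duplex("GGAUC", "AAUUC")`, four of the five positions pair (three Watson–Crick, one wobble) and one is a mismatch, which is the kind of imperfect stem described above.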

#### *1.2. Transposable Elements*

TEs are defined as mobile genetic elements (pieces of DNA capable of moving to new locations); they also constitute the "mobilome" in that they can impact cell transcription [12]. TEs are characterized as either Class I retrotransposons or Class II DNA transposons. Retrotransposons are further subdivided based on the presence of a long terminal repeat (LTR) that contains the element's functions for mobility and regulatory sequences. LTRs flank endogenous retroviruses (ERV) and are capable of transposition. ERVs are mostly inactive viruses due to accumulation of mutations, but LTRs are active transposons, encode all the essential factors for mobility and can multiply within a cell independent of the ERV. They carry promoter and enhancer sequences enabling host genes to be transcribed, as well as lncRNA genomic sequences. There are about 500,000 copies of LTR sequences in *Homo sapiens*, which make up about 8 percent of the human genome.

The other type of Class I retrotransposons are composed of the Long Interspersed Nuclear Elements (LINEs) as well as the Short Interspersed Nuclear Elements (SINEs). LINEs are mobile, whereas SINEs are non-autonomous DNA transposable elements and require LINEs for their mobility and propagation [13]. Duplicate copies are generated during mobility, with sequences identical to the original element inserted in a new location on the genome. In time, such copies may accumulate mutations independently and therefore will differ in sequence from their original sequence leading to increased divergence.

SINEs do not encode proteins. Alu sequences are classified as SINE elements and are about 300 nucleotides long. They originated from the 7SL RNA transcript via retrotranscription into DNA. 7SL RNA forms part of the signal recognition particle. A hallmark of Alu sequences is that Alu RNAs fold into specific stable stem-loop structures, albeit with extensive imperfect base-pairing (Figure 1). Alus are highly abundant in mammalian cells, e.g., there are ~10⁶ copies of Alu in the human genome that make up ~11 percent of the genome [14], but most of these cannot be mobilized due to accumulation of mutations. Alu sequences are also found embedded in lncRNAs, where they directly participate in base-pairing to target mRNAs (see Section 3.1). Additionally, they are found in piRNA genetic clusters (see Section 3.9).

**Figure 1.** Secondary structural model of Alu RNA. Modified from [15] with permission from Dr. Jennifer Doudna.

All-in-all, repeat sequences comprise 50%–75% of the human genome [14,16]. Repeats are broadly classified either as TE repeats or tandem repeats [17], but the major fraction of repeats represent transposable elements, either active or inactive. Tandem repeats represent a rather heterogeneous group, but some may have originated from TEs [18]. Repeats are regions where there is high recombination and this may sometimes result in genetic abnormalities.

In this review we focus on the origins of ncRNAs, the microRNAs from TEs and the interaction of ncRNA with TEs, primarily as found in mammalian tissues. Evidence is rapidly accumulating to show that this intimate association plays a central role in molecular and genetic mechanisms, such as RNA-based immunity. Furthermore, Cowley and Oakley have already described some of the impact of TEs in the promotion of human transcript diversity [12].

#### **2. TE Origins of miRNAs**

microRNAs (miRNA) are small non-coding regulatory RNA molecules that function post-transcriptionally by binding to the 3'UTR of target mRNAs and ultimately inhibiting target mRNA function. While the majority of miRNAs originate from intergenic genomic sequences, some arise from genes and TEs. The molecular origins of many miRNAs support the hypothesis that miRNA hairpin generation is based on the insertion of two related TEs flanking a single genomic locus (see below). As such, transcription that occurs across this locus leads to the biogenesis of functional miRNAs. One of the earliest studies to indicate that a number of mammalian miRNAs are derived from TEs utilized a bioinformatics approach, where the authors analyzed the Sanger miRNA database using a software program that specifically detects well-characterized repeats [19]. Specifically, 11 different miRNA precursors contained repeat sequences (4 derived from LINE-2 repeats and others with SINEs, LTRs and simple repeats). The majority of these miRNAs are highly conserved across human, mouse and rat, but some are confined to only one or two species.

In a subsequent and more in-depth study, Piriyapongsa and Jordan investigated the relationship between human miRNAs and TEs by comparing the genomic locations of experimentally characterized human miRNA genes with the locations of annotated genomic TE sequences [20]. A correlation was observed, and nucleotide sequence comparisons showed a high identity between seven members of the hsa-mir-548 family of miRNAs and the miniature inverted repeat transposable element (MITE), Made1. By use of human genome tiling arrays that visualize genomic expression, one Made1 element was found to be inserted into a transcriptionally active intergenic site. Made1 and other MITEs have palindromic sequences, and when transcribed, show a segment that has an imperfect stem-loop RNA structure. As RNAi-related enzymes can recognize this type of imperfect stem-loop and process it into the ~22 nt mature miRNA sequences, the authors proposed that Made1 TE transcripts are processed into hsa-mir-548 miRNAs. The expression data and high sequence identities strongly support the proposed TE origin of several hsa-mir-548 family members.

In a related study, Piriyapongsa *et al*. used comparative genomic sequence data from the UCSC Genome Browser and evaluated the evolution of TE-derived human miRNAs [21]. They found 55 experimentally characterized human miRNA genes that were derived from TEs (LINEs, SINEs, LTRs and DNA transposons). Sequence comparisons showed that, on average, TE-derived miRNAs are less conserved than non-TE-derived miRNAs. Further, a subset of these are related to the ancient L2 and MIR families. Results also predicted an additional 85 novel TE-derived miRNA genes. Lastly, for some of the TE-derived miRNAs and their putative target genes, a comparison of expression patterns (miRNA *vs.* mRNA) was performed and revealed a number of them to have anti-correlated expression, consistent with regulation via mRNA degradation and thus supporting their regulatory function.
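As a rough illustration of what "anti-correlated expression" means in practice, the sketch below computes a Pearson correlation coefficient between a miRNA expression profile and a putative target mRNA profile measured across the same samples. This is a generic statistic and not necessarily the one used in [21]; the function names and the `-0.5` cutoff are hypothetical choices made only for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def is_anticorrelated(mirna_profile, mrna_profile, cutoff=-0.5):
    """True when the correlation is at or below a (hypothetical) cutoff."""
    return pearson(mirna_profile, mrna_profile) <= cutoff
```

A miRNA whose level rises across samples while its putative target falls yields a coefficient near −1, the pattern expected if the miRNA promotes degradation of that mRNA.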

Examination of fourteen previously identified marsupial (*Monodelphis domestica*) specific miRNAs and their flanking sequences revealed that half of these miRNAs evolved from marsupial-specific TEs [22]. More specifically, six of these TE sequences were identified as LINEs and one as a Mariner DNA transposon. In a subsequent study, Yuan and colleagues investigated a placental-specific miRNA gene family (miR-1302) that at the time of the analysis had 11 members distributed in the human genome (present in miRBase) [23]. They demonstrated that all members of this family were derived from a single transposon (MER53 element). MER53 is a TE with a 193-bp consensus sequence and is characterized by the presence of terminal inverted repeats and TA target site duplications that can form palindromic structures [24]. Further analysis of the phylogenetic distribution and evolutionary dynamics of the miR-1302 family identified 36 potential paralogs of MER53-derived miR-1302 genes in the human genome and another 58 potential orthologs in placental mammals, and showed that these members of the hsa-mir-1302 family emerged within the last 180 million years, since placental mammals diverged from marsupials. Lastly, the authors also explored the targets of the mature human miR-1302 and found 1835 genes with predicted function in transportation, localization, system development processes and their regulation, as well as in binding and in transcription regulation [23].

Genome-wide studies were performed using a comparative genomics approach in order to identify human miRNA paralogs (in mouse and rhesus) in segmental duplication pair data [25]. Of ~1000 miRNA genes and ~1000 mature sequences from human, ~700 miRNA genes and ~1000 mature sequences from mouse, and ~500 miRNA genes and ~500 mature sequences from rhesus, the authors identified 228 novel miRNA homologs in the rhesus genome and 22 novel miRNA homologs in the mouse genome (using miRBase 16). Further, they also found 12 and 2 novel miRNA paralogs in the human and mouse genomes, respectively, but none were found in the rhesus genome. In a separate analysis, the authors also examined the coverage density of repetitive elements, and if it was at least 50% in a miRNA gene or 100% in one of the associated mature miRNA sequences, then the miRNA gene was considered to be a RdmiR. Using this rule, the study identified a large number of miRNA genes that overlap with repeats (TEs: LINEs, SINEs and LTRs) and other types of repetitive elements (DNA transposons, specifically MADE1 elements) within the three genomes: 226 (human), 115 (rhesus) and 141 (mouse). The study also identified a smaller number of possible repeat-derived miRNAs, which they termed RrmiRs. Lastly, a computational analysis was conducted to investigate the functions of 19 of the RrmiR families conserved between the three genomes by identifying their target genes, and it was found that the most significant targets are involved in transcriptional regulation, central nervous system development, and negative regulation of biological process. Collectively, the results of this study suggest that repetitive elements contribute to the de novo origin of miRNAs, and that large segmental duplication events most likely accelerate the expansion of miRNA families (including RdmiRs).
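The coverage rule described above (at least 50% repeat coverage of the miRNA gene, or 100% coverage of one of its mature sequences) can be sketched as follows. This is a minimal illustration and not the pipeline used in [25]; it assumes half-open `(start, end)` coordinates on a single chromosome and strand, and the function names are hypothetical.

```python
def covered_fraction(start, end, repeats):
    """Fraction of the half-open interval [start, end) covered by repeats.

    `repeats` is an iterable of (start, end) intervals, which may overlap
    one another; overlapping repeats are counted only once.
    """
    length = end - start
    if length <= 0:
        raise ValueError("interval must have positive length")
    # Clip each repeat to the query interval and drop non-overlapping ones.
    clipped = sorted(
        (max(s, start), min(e, end)) for s, e in repeats if s < end and e > start
    )
    covered = 0
    cursor = start  # right-most position already counted
    for s, e in clipped:
        s = max(s, cursor)
        if e > s:
            covered += e - s
            cursor = e
    return covered / length


def meets_repeat_rule(gene, matures, repeats, gene_cutoff=0.5):
    """Apply the rule: >= 50% of the gene, or 100% of any mature sequence."""
    gene_hit = covered_fraction(gene[0], gene[1], repeats) >= gene_cutoff
    mature_hit = any(covered_fraction(s, e, repeats) == 1.0 for s, e in matures)
    return gene_hit or mature_hit
```

For example, a gene at `(100, 200)` overlapped by repeats `(90, 130)` and `(120, 160)` has 60% coverage and satisfies the rule, whereas a single repeat at `(100, 120)` satisfies it only if a mature sequence lies entirely within that repeat.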

A more recent study involved a comprehensive analysis of the genomic events responsible for the formation of ~15,000 annotated miRNAs against the principal datasets for TEs and ncRNAs and found 2392 (~15%) TE-based miRNAs [26]. The majority of these TE-based miRNAs may have originated via the proposed mechanism depicted in Figure 2.

The authors further investigated the exact TE origins of these 2392 miRNAs and showed that DNA transposons comprise the TE most frequently responsible for miRNA generation (891); others were: LTR Retrotransposon (414), Non-LTR Retrotransposon (814), LINE (312), SINE (353), Satellite (137) and others (136). This last category ("Other") had significant sequence identity to known noncoding RNA sequences (e.g., snoRNAs, scaRNAs, tRNAs). Lastly, a hypothetical scheme proposes that the regulatory miRNAs may have arisen via selective subfunctionalization created by the associated benefit of regulating host genes containing portions of TEs.
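A quick tally helps in reading the counts above. The listed numbers add up to 2392 only if the LINE and SINE figures are read as subsets of the Non-LTR Retrotransposon group rather than as separate top-level categories; that reading is an assumption made here from the arithmetic and is not stated explicitly in the text.

```python
# Counts as quoted in the text; the grouping of LINE and SINE under
# "Non-LTR retrotransposon" is an assumption inferred from the totals.
top_level = {
    "DNA transposon": 891,
    "LTR retrotransposon": 414,
    "Non-LTR retrotransposon": 814,
    "Satellite": 137,
    "Other": 136,
}
non_ltr_subsets = {"LINE": 312, "SINE": 353}

total = sum(top_level.values())
non_ltr_remainder = top_level["Non-LTR retrotransposon"] - sum(non_ltr_subsets.values())

print(f"top-level total: {total}")
print(f"non-LTR not assigned to LINE or SINE: {non_ltr_remainder}")
```

Under this reading, the top-level categories sum to the reported 2392 TE-based miRNAs, and 149 of the non-LTR entries fall outside the LINE and SINE subsets.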

**Figure 2.** Schematic of proposed origin of TE-based miRNAs. When two related but not identical LINE1 elements insert themselves near each other, but on opposite strands of the DNA (in blue), they can create a precursor miRNA containing an imperfect stem-loop upon transcription (red line). The pre-miRNA is shown above the arrow and transcription is indicated from the positive strand LINE1. The stem is potentially recognized and processed by the endogenous RNAi machinery. The pre-miRNA stem-loop depicted is representational. Modified from [26].
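The mechanism in Figure 2 can be illustrated with a toy model. When two related elements sit on opposite strands, a transcript across the locus contains one element followed by the reverse complement of the other, so the two arms can fold back on each other; every position at which the two elements differ becomes a mismatch in the stem. The sketch below uses DNA letters and short made-up sequences, ignores G·U pairing and insertions or deletions, and its function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def transcript_across_locus(element_a, spacer, element_b):
    """Plus-strand sequence when element_b lies on the opposite strand."""
    return element_a.upper() + spacer.upper() + reverse_complement(element_b)


def stem_mismatches(element_a, element_b):
    """Count stem positions that fail to form a Watson-Crick pair.

    The left arm is element_a; the right arm is the reverse complement of
    element_b. Position i of the left arm faces the i-th base from the 3'
    end of the right arm. Elements must have equal length in this sketch.
    """
    if len(element_a) != len(element_b):
        raise ValueError("elements must have equal length in this sketch")
    left = element_a.upper()
    right = reverse_complement(element_b)
    return sum(a != b.translate(COMPLEMENT) for a, b in zip(left, reversed(right)))
```

With two toy elements that differ at a single position, such as `"ACGTTGCA"` and `"ACGTAGCA"`, the stem contains exactly one mismatch, which is the "related but not identical" situation that yields an imperfect stem-loop.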

Using the miRBase database, a more recent study sought to map all miRNA precursors to several genomes and to determine the repetition and dispersion of the corresponding loci, as well as the repetitive elements overlapping these loci. To facilitate this analysis, an automatic method called ncRNAclassifier was used to classify the relationship of TEs with pre-ncRNAs [27]. By applying this method, a correlation between the number of pre-ncRNA candidates and the presence of TEs was determined using six genomes (frog, human, mouse, nematode, rat and sea squirt). The results indicate that 235 and 68 mis-annotated pre-miRNAs correspond completely to TEs out of 1426 human and 721 mouse pre-miRNAs of miRBase (10.0 release), respectively. Further, the various types of TEs involved were also identified and include MADE1 and other MITEs, DNA transposons, LTR/ERV, CR1/RTE, L1, SINE, and other non-LTR elements. Lastly, the authors suggest that the ncRNAclassifier can be openly used to determine if a given ncRNA hairpin sequence corresponds to a TE sequence.

An investigation of the TE origins of miRNAs focused on miRNAs in the human genome that are derived from MER transposons (MEdium Reiteration frequency interspersed repeats in the genomes of primates, rodentia, and lagomorpha). Once again, a bioinformatics approach was undertaken to identify the specific miRNAs that are derived from palindromic MERs, by analyzing MER paralogs in the human genome. Results from this study identified three miRNAs derived from MER96 located on chromosome 3, and MER91C paralogs located on chromosome 8 and chromosome 17 [28]. More importantly, this study also experimentally validated the interactions of these MER-derived miRNAs with AGO1, AGO2, and AGO3 proteins (which are involved in gene silencing and act as the catalytic component of the RNA-induced silencing complex [RISC]).

Lastly, there are additional classes of small ncRNAs that originated from TEs and/or consist of TE sequences, e.g., certain piRNAs and the specialized SINE and Alu transcripts that function as small ncRNAs; these are discussed below with respect to their functional roles.

In summary, a sizable proportion of miRNAs appear to be derived from TEs. It is highly probable that future bioinformatic analyses will increase the number of miRNA-transposable element relationships, as not all miRNAs have been discovered and, most likely, not all consensus repetitive elements have yet been described. As ncRNAs serve such a critical regulatory role, TE colonization of the genome has given rise to a number of regulatory processes, several of which we discuss here.

#### **3. Interaction of TEs with ncRNAs—Functional and Disease-Related Significance**

#### *3.1. Alu Element Embedded in Long ncRNAs and mRNAs—Crucial Role in Target mRNA Decay*

Gong and Maquat revealed the importance of Alu intermolecular base-pairing to lncRNA-induced degradation of mRNA [29,30]. By computational analysis, Alu sequences were found in ~380 lncRNAs in HeLa cells [30]. In addition, mRNAs were identified that contain an Alu sequence in their 3'UTRs. Certain mRNAs are targets of the double-stranded RNA binding protein Staufen1 (Stau1), which can induce degradation of the mRNAs. Co-immunoprecipitation experiments show that an Alu-containing lncRNA, which originates from chromosome 11, binds to and decreases the abundance of target messages, *i.e.*, plasminogen activator inhibitor type 1 (SERPINE1) mRNA and an mRNA that encodes an unknown protein termed FLJ21870. Both of these mRNAs have an Alu sequence in their 3'UTRs. By secondary structure modeling, it was shown that Alu sequences in lncRNA can base-pair to Alu sequences in the 3'UTR of target mRNAs by intermolecular base-pairing with a stable −ΔG (Figure 3). This pairing is imperfect and contains bulged positions. The intermolecular stem structure, formed by the interaction between the Alu sequences present in the lncRNA and in the 3'UTR of the mRNA, serves as the binding site for Stau1, which subsequently recruits UPF1, a protein required to initiate mRNA decay. Several hundred other lncRNAs contain Alu sequences and have the potential to base-pair with Alu-containing mRNAs, but the possible functions of the majority of these Alu-containing lncRNAs are unknown.

Alu/Alu sequence pairing is not the only interaction seen with Stau1-binding mRNAs, but it represents an important "variation on a theme". In previous experiments with another Stau1-binding mRNA that contains no Alu sequences but a perfect 19 bp stem in the 3'UTR, it was shown that Stau1 binds the perfect 19 bp stem that is formed intramolecularly between distal sequences in the 3'UTR of the mRNA [31] (Figure 3, left). ARF1 is the ADP-ribosylation factor 1 protein, and the *arf1* mRNA transcript has a Stau1-binding site. The 19 bp stem is phylogenetically conserved in different mammalian species. The question remains how many Stau1-binding mRNAs are targets for decay via intramolecular base-pairing within a message and how many via intermolecular Alu/Alu pairing. As mentioned, there are several hundred lncRNAs that have Alu sequences. On a related topic, it should also be pointed out that inverted Alu elements are found in many human mRNA 3'UTR sequences. These can form double-stranded intramolecular stems, and they appear to affect mRNA translation efficiency [32].

**Figure 3.** Schematic of binding of Stau1 to 3'UTR mRNA intramolecular stem (*arf1* mRNA) (**left**) and to intermolecular stem formed by Alu sequence in lncRNA with Alu sequence within 3'UTR mRNA (**right**). UPF1 is an RNA helicase. The two RNA duplex stems shown are not drawn to scale. Modified from Gong and Maquat [30].

Thus, mRNAs that have a Stau1 binding site, whether it is formed intermolecularly via Alu/Alu lncRNA/mRNA imperfect base-pairing or by perfect Watson–Crick intramolecular pairing within the mRNA, can be targeted for degradation. This raises an interesting question concerning the specificity of Stau1 and its recognition sites on duplex RNA stems, and on imperfect *vs.* perfect double-stranded stems. The probability of mistakes in recognition must be very low, yet two or more types of RNA tertiary structures are recognized with great accuracy. Crystal structures of perfect and imperfect double-stranded RNA/Stau1-protein complexes would be of major interest. Gleghorn *et al*. have already determined crystal structures of Stau1 and showed that dimerization of Stau1 occurs by a degenerate dsRNA-binding domain on Stau1 [33]. In another recent study, it was revealed that dimerization can also involve Stau2 [34].

Another study from this laboratory indicates that rodents appear to use the same mechanism of mRNA regulation involving intermolecular imperfect base-pairing between lncRNAs and mRNAs, except that it occurs via the SINE elements B1, B2 or B4 found in 3'UTRs and in lncRNAs [35].
