Introduction of Plant Transposon Annotation for Beginners

Gao, Dongying

doi:10.3390/biology12121468

Open AccessReview

Introduction of Plant Transposon Annotation for Beginners

by

Dongying Gao

Small Grains and Potato Germplasm Research Unit, USDA-ARS, Aberdeen, ID 83210, USA

Biology 2023, 12(12), 1468; https://doi.org/10.3390/biology12121468

Submission received: 7 November 2023 / Revised: 21 November 2023 / Accepted: 23 November 2023 / Published: 26 November 2023

(This article belongs to the Special Issue De Novo Detection of Transposons)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

Simple Summary

Transposons are the most abundant repeats in plant genomes, and many of them can produce transcripts and encode proteins that may result in overestimating and incorrectly annotating functional genes. Thus, accurate transposon annotation is essential for all plant genome sequencing projects and other research. Although numerous tools have been developed, it is still challenging to annotate plant transposons for most scientists. The aims of this review are to introduce the basic knowledge about plant transposons and to provide a beginner’s guide on plant transposon annotation. I accentuate the importance of transposons and summarize the general strategies for transposon annotation. I briefly introduce the unique features of different transposon superfamilies in plants and the related resources for annotating plant transposons. I further present the information on improving the quality of transposon annotation. The challenges and future prospects for plant transposon annotation are also discussed.

Abstract

Transposons are mobile DNA sequences that contribute large fractions of many plant genomes. They provide exclusive resources for tracking gene and genome evolution and for developing molecular tools for basic and applied research. Despite extensive efforts, it is still challenging to accurately annotate transposons, especially for beginners, as transposon prediction requires necessary expertise in both transposon biology and bioinformatics. Moreover, the complexity of plant genomes and the dynamic evolution of transposons also bring difficulties for genome-wide transposon discovery. This review summarizes the three major strategies for transposon detection including repeat-based, structure-based, and homology-based annotation, and introduces the transposon superfamilies identified in plants thus far, and some related bioinformatics resources for detecting plant transposons. Furthermore, it describes transposon classification and explains why the terms ‘autonomous’ and ‘non-autonomous’ cannot be used to classify the superfamilies of transposons. Lastly, this review also discusses how to identify misannotated transposons and improve the quality of the transposon database. This review provides helpful information about plant transposons and a beginner’s guide on annotating these repetitive sequences.

Keywords:

transposon annotation; plant; genome; bioinformatics pipeline; database

1. Introduction of Plant Transposons

Transposons or transposable elements (TEs) are genomic sequences that have the potential to change their positions in the host genome. According to their transposition mechanisms, transposons are grouped into two major classes: Class I elements or retrotransposons that move via a copy-and-paste model and Class II elements or DNA transposons that transpose via a cut-and-paste model, rolling-circle replication, or other mechanisms [1,2]. Each transposon class can be further divided into different superfamilies based on their sequence structures and the encoded proteins. Thus far, over 30 transposon superfamilies have been identified in prokaryotic and eukaryotic genomes [1,2,3,4,5,6]. However, only 16 superfamilies were found in the plant kingdom (Figure 1), including 2 superfamilies of long terminal repeat (LTR) retrotransposons, 3 superfamilies of long interspersed nuclear elements (LINEs), 1 superfamily of short interspersed nuclear elements (SINEs), and 9 DNA transposon superfamilies. Additionally, many plant genomes harbor endogenous plant pararetroviruses (EPRVs), which share similar core genes with LTR retroelements for reverse transcription but lack both functional integrase (INT) and LTRs [7,8]. Numerous elements can still be recognized as TEs as they show some typical features of transposons such as LTRs, terminal inverted repeats (TIRs), and flanking target site duplications (TSDs). However, they do not encode transposase (TPase) proteins and are difficult to group. These elements would usually be considered as unclassified transposons that include large retrotransposon derivatives (LARDs) [9], terminal-repeat retrotransposons in miniatures (TRIMs) [10,11], and miniature inverted-repeat transposable elements (MITEs) [12]. Notably, with the advance of sequencing technologies and related bioinformatics software, some novel types of transposons may be identified in the existing and/or newly sequenced plant genomes.

Transposons are important contributors for plant gene and genome evolution as their movements may alter gene expression and regulatory networks and result in chromosome rearrangements [13]. They can create novel genetic and morphological variations that are beneficial for the host’s fitness under disadvantageous environmental conditions. Transposons are key components of functional centromeres and play pivotal roles in the rapid divergence of centromeres between closely related plants [14]. TEs are frequently related to hybrid defects that might cause reproductive isolation across diverse species [15]. Transposons represent the most abundant repetitive sequences in plant genomes, and they are found in all sequenced plants including the model plant Arabidopsis and many major crops in the world such as rice (Oryza sativa), wheat (Triticum aestivum), barley (Hordeum vulgare), maize (Zea mays), and soybean (Glycine max) (Figure 2). Unlike some repeats in plant genomes, such as telomeric tandem repeats and ribosomal DNA (rDNA) repeats, most transposons are poorly conserved and can be quickly replaced in a relatively short period [16]. In many plants with larger genome sizes, transposons make up large fractions of the genomes. For example, about 85% of the maize genome is composed of transposons [17]. Therefore, transposon annotation is one of the most important and fundamental tasks for genome sequencing projects as it represents the precondition for many genomic analyses and the first step in the computation phase of genome annotation [18,19].

2. Strategies of Transposon Discovery

One of the most, or maybe the most, important goal of all plant genome sequencing projects is to accurately identify and annotate functional genes in the genomes. However, it is important to define and understand genes before the gene annotation is started. Generally, a gene is a transcribed unit of heredity that may or may not translate to a protein. It is well known that many TEs are expressed and can generate proteins. Thus, these transposon sequences can be treated as genes. However, except limited TEs called domesticated transposons, which have been co-opted by the host genome and confer some biological traits [20], the only function of the transposon-encoding proteins is for catalyzing transpositions. Therefore, TEs are frequently called ‘jumping genes’ or ‘selfish genes’, and they were excluded by many gene annotation projects which mostly focused on annotating the genes that encode non-transposase proteins and associate with biological and molecular functions and phenotypic traits, such as the genes controlling the yield, biotic, and abiotic tolerance, and other phenotypic traits in crops.

Like genes and other genomic components, transposons are DNA sequences and are composed of four nucleotides including adenine (A), cytosine (C), guanine (G), and thymine (T). It is impossible to distinguish transposons from genes based on the composition of nucleotides. However, transposons are usually repetitive and exhibit unique structures; some of them may encode transposase proteins or reverse transcriptase (RT), which can be used for transposon prediction. Thus far, many resources including bioinformatics software and a TE database have been developed. Here, I just mention some of these TE-related software and databases as most of the resources were well documented in previous publications [21,22,23,24]. Generally, three major strategies are widely used for transposon annotations.

2.1. Repeat-Based Annotation

Transposons are dispersed repeats, and many TE families are present in plant genomes in multiple copies; thus, these features can be used to develop bioinformatics tools for de novo transposon identification. One of the most popular programs is RECON, which identifies and groups TEs based on the pairwise alignments between genomic sequences [25]. To conduct genome-wide transposon annotations with RECON, the assembled genome sequences are usually split into smaller fragments, such as 10–20 Kb, and then used as the queries to search the whole genome sequences again to identify repetitive sequences and group them into different families. The represented element or consensus sequence of each repeat family are extracted, and the boundaries of the repeats can be defined via computational analysis and/or manual inspections (Figure 3A). For large plant genomes, such as wheat (~17 Gb), it is extremely challenging to conduct all-against-all comparisons. One practical option is to use a certain fraction (10–20% or more) of the genome to search against the whole genome sequences. However, this could miss some repeat families. It should be noted that not all transposon families are highly repetitive; some single-copy or low-copy transposons may not be detected by RECON. Additionally, many genes including gene families and duplicate genes are also scattered throughout plant genomes, and these repeats should be removed during transposon annotations.

2.2. Structure-Based Annotation

Transposons exhibit some unique structural features such as LTRs, TIRs, and Poly(A) tails (Figure 1). Additionally, many transposon superfamilies can generate TSDs when they are moved and inserted into new genomic positions. Therefore, these unique characteristics can be used to develop bioinformatics programs for de novo transposon annotations. For example, many non-LTR retrotransposons contain a 3′ poly(A) terminal motif and are flanked by variable TSDs; thus, scientists can develop software to recognize the poly(A) motif, and the boundaries of non-LTR retroelements can be defined by TSDs. Then, the annotated sequences are extracted and grouped into different families based on their sequence comparisons (Figure 3B). Compared to RECON, structure-based software can annotate transposons with well-defined boundaries, and many of those programs are able to conduct genome-wide transposon annotations for the plants with larger genome sizes in a relatively short time. Thus far, most of the transposon annotation programs were developed based on the unique structures of distinct transposon superfamilies including those for annotating LTR retrotransposons [26,27,28,29,30,31], non-LTR retrotransposons [32,33], and DNA transposons [34,35,36,37,38,39,40,41]. Notably, the structure-based annotation software may find full-length transposons with both low and high copies, but they may miss some transposons without the typical transposon features.

2.3. Homology-Based Annotation

This strategy is based on the hypothesis that transposons from related organisms could share common origins and show certain sequence similarities. Thus, researchers can use the known transposons which have already been annotated to find their homologous sequences in new genomes. The best-known and most widely used program for homology searches is RepeatMasker (https://www.repeatmasker.org, 1 November 2023), which applies customer or precompiled repeat libraries to identify homologous repeats. There are several ways to obtain known transposon sequences. The first is to check related transposon databases such as Repbase Update or Repbase, which is a well-curated and the most comprehensive database of repetitive elements in eukaryotic genomes [42]. Another way is to use the published transposon database from relative plants. For example, Garcia et al. (2021) used the annotated transposons in common bean (Phaseolus vulgaris) to find the homologous sequences in Lima bean (Phaseolus lunatus L.) [43]. The third way is to download deposited transposons from GenBank. However, the efficiency and accuracy of homology-based annotation heavily depend on the genomic similarity between the host plants of the reference transposons and the plants under study. As plant transposons are dynamic sequences and many TE families are species-specific, the reference transposons may yield poor annotations in distantly related plants. Also, this approach is not practical for the orphan plant lineages for which no transposon database is available in the relative species. Despite the fact that we can use the repeats in Repbase or other databases to find transposon-related sequences in distantly related organisms, in most cases, the homology sequences are short and fragmental and lack typical features of transposons. To obtain a better annotation, the proteins of identified transposons can be used to conduct a TBLASTN search against the newly sequenced plant genomes and to identify homologous sequences. Then, the hits and their flanking regions (5–10 Kb for each side) are extracted and used for BLASTN searches and genome-wide comparisons to identify complete transposons (Figure 3C). As a TBLASTN search is very sensitive in detecting homologous sequences, technically, it can identify more transposase-encoding elements including both full-length and truncated transposons as well as both single-copy and multiple copies of elements.

Except the three major strategies above, other new methods of TE annotations were also developed, such as TASR, which is based on the fact that transposons are usually targeted and epigenetically silenced by 24 nt-siRNAs; thus, this feature can be used to develop a bioinformatics pipeline to recognize the regions targeted by the small RNAs and to define transposons [44]. Occasionally, some transposons inserted into genes and caused phenotypic variations. Thus, these transposons can be identified by cloning and comparing the gene sequences of mutants and the wild types. However, this method is not suitable for genome-wide transposon discovery. As plant genomes are very complicated and each of the transposon annotation methods has its own limitations, it is better to combine multiple computational pipelines to obtain high-quality transposon annotations.

3. Steps for Transposon Annotation

There are three basic steps in genome-wide transposon annotation: preliminary annotation, classification, and quality check and data improvement. However, many bioinformatics pipelines for transposon annotation combine the first and second steps and can directly output classified transposons.

3.1. Brief Introduction of Plant Transposon Superfamilies and Their Annotations

3.1.1. LTR Retrotransposons

LTR retrotransposons represent the most abundant repeats in many, maybe most, plant genomes, such as maize, which consists of about 75% LTR retroelements [17]. These retroelements exhibit some structural hallmarks including LTR, TSD, reverse transposase proteins, primer binding site (PBS), polypurine tract (PPT), and the 5′TG…CA3′ terminal motifs that can be used for characterizing LTR retrotransposons. Many plant transposon annotations usually start with LTR retrotransposons for three reasons: (1) LTR retrotransposons contribute large fractions of plant genomes; (2) several excellent bioinformatics programs have been developed [26,27,28,29,30,31] for annotating LTR retrotransposons; and (3) this software may generate lower error rates of data mining than many programs for the de novo detection of DNA transposons and non-LTR retrotransposons. By following the protocols of the developed software, users can generate sufficient data for preliminary predictions of LTR retrotransposons. In plant genomes, only two superfamilies of LTR retrotransposons, Ty1/Copia and Ty3/Gypsy, are found; these two superfamilies share similar structures but have a different order of the reverse transcriptase and integrase (INT) domains (Figure 1). Most complete LTR retrotransposons are large (about 4–10 Kb) and some can be over 20 Kb. However, some types of LTR retroelements called TRIMs are small (about 250 bp to 1500 bp) and have tiny LTRs (<100 bp) [11]. Therefore, users need to pay attention to the default parameters of the related software. For LTR_Finder, the default minimum size for LTRs and the internal regions is 100 bp and 1000 bp, respectively [27]. To obtain better annotations with LTR_ Finder, we usually use the sets ./ltr_finder -d 30 -D 15000 -l 30 -L 5000 -s ./tRNAdb/Athal-tRNAs.fa where d represents the minimum distance between 5′ and 3′ of the LTRs, D represents the maximum distance between 5′ and 3′ LTRs, l represents the minimum size of 5′ and 3′ LTRs, L represents the maximum size of 5′ and 3′ LTRs, and s means the tRNA sequence file we used.

3.1.2. LINEs

LINEs are non-LTR retrotransposons and may be the most ancient types of retrotransposons in plant genomes. The typical LINEs usually contain non-LTR reverse transposases and the poly(A) terminal motif. The general methods for LINEs annotations are presented in Figure 3, and the related software including RECON [25] and MGEScan-Non-LTR [45] can be used for automatic LINE annotations. The boundaries of full-length LINEs can be defined by TSDs. However, most LINE elements in plant genomes are truncated at their 5′ end due to unsuccessful reverse transcription or other reasons. It should be noted that RTE retrotransposons, which is one superfamily of LINEs, lack a poly (A) tail and instead contain tandem repeats [46,47]. It seemed that the RTEs in flowering plants were horizontally transferred from aphids or other unsequenced animals [47]. As the RTEs from different plants show a high sequence similarity, thus, the identified RTEs can be used to find the homologous elements in new plant genomes.

3.1.3. SINEs

SINEs are another type of non-LTR retrotransposon. Unlike LINEs, they are very small (<500 bp) and lack conserved coding domains. Also, the 5′ end of many SINEs are truncated. Thus, SINEs may represent the superfamily that is most difficult to precisely annotate among all plant transposons. Nearly all plant SINEs were derived from tRNAs; they usually contain an internal RNA polymerase III (pol III) promoter consisting of box A and box B motifs, and the 3′ poly (A) flanking by TSDs [48]. Thus far, some SINE annotation software has been developed to recognize these structural motifs and conduct genome-wide SINE prediction [32,33]. SINEs were found in a wide range of plants, many of which are present in specific lineages, such as p-SINE1 in the Oryza genus [49], but some SINE families such as Au are widely distributed in both dicots and monocots [50] that provide good resources for homology-based SINE annotations [51].

3.1.4. Endogenous Plant Pararetrovirus (EPRVs)

Pararetroviruses are double-stranded DNA viruses that replicate through an RNA intermediate. They were named by Temin (1985) to distinguish hepadnaviruses in animals from retroviruses that are RNA viruses and integrate their DNA copies into host genomes [52]. Except very few members, such as Petunia vein-clearing virus (PVCV) [53], the vast majority of plant pararetroviruses do not have the integrase domain. All plant pararetroviruses belong to the family Caulimoviridae, they contain one to eight open reading frames (ORFs), and their genome sizes range from 7.1 to 9.8 Kb [54]. Endogenous pararetroviruses (EPRVs) are present in the genomes of a wide range of plants [55,56], and they encode polyproteins that share a high sequence similarity with Ty3/Gypsy LTR retrotransposons; thus, EPRVs were frequently misannotated as Ty3/Gypsy LTR retrotransposons or were ignored. All EPRVs identified thus far lack LTRs [7,8], which brings difficulties in annotating these pararetroviral sequences and accurately defining their boundaries. Recently, the bioinformatics pipeline CAULIFINDER has been developed for the automatic annotation and classification of EPRVs in plant genomes [7].

3.1.5. DNA Transposons with TIRs

Among the nine superfamilies of DNA transposons identified in plant genomes, seven superfamilies have TIRs including Mutator, CACTA, hAT, PIF/Harbinger, Tc1/Mariner, Sola, and Ginger.

Mutator Transposons

Mutator transposons or mutator-like transposable elements (MULEs) were originally discovered in maize [57] and have since been identified in other plants, animals, and fungi [58]. The transposons of this superfamily usually carry relatively long TIRs (can be over 300 bp) and produce TSDs of 7–12 bp [59]. MULEs are abundant in plant genomes and may represent the most mutagenic plant transposons identified thus far [60]. They are usually less than 5 Kb in size and contain one ORF-encoding mutator transposase. However, MULEs can be over 8 Kb and may have multiple ORFs including one encoding transposase protein and other(s), which may help transposon movement, or their functions are not very clear [60,61]. Some MULEs called Pack-MULEs have captured functional genes or fragments of expressed genes and play important roles in plant gene evolution [59,62].

CACTA Transposons

CACTA transposons were first found in maize and named Enhancer (En) and Suppressor-mutator (Spm) [63,64]. The typical features of this superfamily are the terminal motifs starting with CACTA or CACTG and ending with TAGTG or CAGTG. The TIRs of CACTA elements are short (mostly < 50 bp) and they generate 3-bp TSDs. The complete CACTA transposons are usually large (>10 Kb), and some of them can be over 20 kb in size [65]. Due to their large sizes and high copy numbers, CACTA transposons contribute larger fractions of many plant genomes than other DNA transposons. Like Pack-MULEs, some CACTA transposons have been found to carry host gene sequences [66].

hAT Transposons

hAT transposons were first identified by Barbara McClintock as the Activator or Ac element [67], but the name of this superfamily is an acronym of its three members: hobo from Drosophila, Ac from maize, and Tam3 from snapdragon [68]. hAT elements are widely distributed in plants and other eukaryotes, they are typically less than 5 kb in length, and usually contain short TIRs (5–27 bp) flanked by 8-bp TSDs [69,70]. In addition, the transposase of many hAT elements contains the conserved domain of 50 amino acids at the C terminus, which may be involved in dimerization or other functions [71].

PIF/Harbinger Transposons

The PIF/Harbinger transposon superfamily obtains its name from the two founding members: the P instability factor (PIF) from maize [72] and Harbinger from Arabidopsis thaliana [73]. They usually contain short TIRs (about 14–50 bp) and generate 3-bp TSDs (TAA or TTA). Unlike other DNA transposons, the intact PIF/Harbinger elements have two main open reading frames (ORFs); one ORF produces catalytic transposase containing a conserved DDE motif and another ORF encodes a Myb-like protein [74]. PIF/Harbinger transposons have been found in the genomes of many plants and some of them were co-opted or domesticated to serve as new molecular functions associated with yield and other traits in plants [75,76].

Tc1/Mariner Transposons

Tc1 and Mariner transposons was first discovered in Caenorhabditis elegans and Drosophila mauritiana in the 1980s, respectively [77,78], and then were found in a wide range of organisms including both prokaryotes and eukaryotes [58]. The Tc1/Mariner transposons in plants mostly range in size from 1.5 to 6 Kb and contain relatively short TIRs (~30 bp) [79,80]. In contrast to other TEs that create variable TSDs, the Tc1/Mariner superfamily always generates 2-bp TSDs of TA. They are not very abundant in many plant genomes and were missed by some plant genome annotations.

Sola Transposons

Sola transposons were first reported in 2009 and they are distributed in a wide range of organisms including bacteria and metazoans [3]. The transposons of the Sola superfamily encode DDD-TPase and are flanked by 4-bp TSDs. Most Sola elements range in size from 2 Kb to 9 Kb, but some can be over 15 Kb. They also have very variable terminal sequences and the TIRs of the reported Sola transposons ranged from 11 bp to 1124 bp [3]. Despite Sola transposons being found in some plants, such as the moss Physcomitrella patens, it seems that they are not common in flowering plants.

Ginger Transposons

Gypsy INteGrasE Related (Ginger) transposons are unusual DNA transposons as they contain TIRs and a protein which shares a high sequence similarity to the integrase encoded by Gypsy LTR retrotransposons [4]. DNA transposons containing a Gypsy-like integrase were first found in Dictyostelium discoideum [81,82], and subsequently identified in animals [4] and plants [5]. The TIRs of Ginger transposons are relatively long (40–270 bp) and contain the 5′-TG…CA-3′ terminal dinucleotides, and the insertions of Ginger transposons generate TSDs of 4–6 bp [4,5].

MITEs

MITEs were first identified in maize in 1992 [12], and they are small DNA transposons (mostly < 500 bp) with TIRs surrounded by variable TSDs. In contrast to many previously reported DNA transposons that were mobile with the ‘cut and paste’ model and were usually present in low copy numbers, MITEs were frequently found in high copy numbers, which can be over 10,000 copies for some families [83]. Additionally, dramatic copy number differences for the same MITE family can be detected between closely related genomes [74], suggesting that MITEs can rapidly increase their copy numbers in a short period. Therefore, MITEs were considered as a special type of DNA transposons or even Class III transposons [84]. However, more studies revealed that MITEs were likely derived from the internal deletions of large DNA elements as the TIRs of MITEs showed a high sequence identity to that of several reported DNA transposon superfamilies such as PIF/Harbinger and hAT [74,85,86,87].

TIR DNA Transposon Annotation

Several programs have been developed to annotate TIR DNA transposons such as TIRvish [34], TIR-Learner [35], Generic Repeat Finder [36], and Inverted Repeat Finder [88]. The software provides good resources to detect transposons solely based on the recognition of TIRs or the combination of structure and homology transposon annotations. In addition, MITE-Hunter [37], detectMITE [38], MiteFinderII [39], MITE Tracker [40], and other related bioinformatics pipelines can be used to find small TIR-DNA transposons that lack transposases. As the TIRs of DNA transposons are usually short and less conserved, it is still challenging to accurately detect these DNA transposons and define their boundaries.

3.1.6. Helitron Transposons

Helitron transposons were first identified in Arabidopsis, rice, and nematode (Caenorhabditis elegans) in 2001 [89]. They have conservative termini (5′-TC … CTRR (mostly CTAG)-3′) and contain hairpins (16–20 bp) separated by 10–12 nucleotides from the 3′ end. However, all Helitron elements do not have terminal inverted repeats and do not generate TSDs; they transpose precisely between 5′AT3′. Helitron-like sequences are widely present in plant genomes; some of them are large and can capture gene fragments [90]. Helitron transposons are difficult to annotate and precisely define their boundaries as they lack both TSDs and TIRs. They can be detected using the unique terminal motifs and the helicase sequences or with related software such as HelitronScanner [41].

3.1.7. Replitron Transposons

Replitron is a new group of DNA transposon reported in 2023 [6]. They lack TIRs but contain short direct terminal repeats ranging from 5 to 11 bp and HUH endonuclease that is distantly related to Helitron transposons. Replitron elements are present in the genomes of green algae, liverworts, mosses, lycophytes, and ferns, but absent in hornworts and seed plants [6]. The identified Replitrons are 900 bp to 4 kb, and their insertions can generate TSDs of 2–8 bp. Thus far, no software has been developed for annotating Replitron transposons. One practical approach is to conduct homology searches with the endonuclease proteins of known Replitrons and identify new Replitron sequences based on sequence comparisons.

3.1.8. Others TE Annotation Resources

Despite the programs used to predict specific types of transposons that have been developed, there is no need to annotate each of the transposon superfamilies one by one. Several bioinformatics packages have been created through combining multiple programs and methods to annotate all types of transposons, such as Extensive de-novo TE Annotator (EDTA) [91], RepeatExplorer2 [92], TransposonUltimate [93], Earl Grey [94], and other comprehensive bioinformatics pipelines.

3.2. Classification of Transposons

Once the potential transposon sequences have been identified, the next step is to group them and define their families. Many bioinformatics pipelines can annotate and automatically generate classified transposons. Here, I just describe how to handle the transposon sequences that are not grouped.

3.2.1. Definition of Transposon Superfamilies

The superfamilies of transposons can be defined based on the encoded transposase proteins and other sequence features including TIRs, LTRs, and the sizes of TSDs. To classify the transposon superfamilies, the widely used method is to use the generated transposon sequences for conducting a BLASTX or BLASTN search against GenBank or other databases such as Gypsy Database (GyDB) [95] that provide a valuable resource to define the superfamilies of retroelements and determine the conserved domains of LTR retrotransposons and EPRVs. However, many annotated transposons have accumulated mutations or undergone deletions and encode no transposase or short protein sequences, and the superfamilies for these TEs can be defined based on their terminal motifs and flanked TSDs, or they can be classified as unclassified transposons, such as TRIMs, LARDs, or MITEs. It is important to note that unclassified transposons should be real transposons and non-transposon sequences should be excluded.

3.2.2. Definition of Transposon Families

In some cases, it is challenging to define transposon families as different scientists applied distinct cutoffs or strategies. Thus far, transposon families are commonly defined using the ‘80-80-80’ rule, which means that any members of the same transposon family should be over 80 bp and show more than an 80% sequence identity over 80% of their sizes [2]. However, other studies argued that this definition may ignore the consensus sequences [96] and may not be applicable for monophyletic groups [97]. As many transposons may contain highly conserved regions, such as the internal RT domain of LTR retrotransposons, transposons of the same family defined by the ‘80-80-80’ rule may be grouped into different phylogenetic clades. The terminal sequences including both LTRs and TIRs are less conserved than the transposase-encoded regions, and elements of the same transposon family usually exhibit extensive sequence identity at their termini. Therefore, the terminal sequences of transposons may be more sensitive in defining transposon families. In addition, members from the same transposon family should share close phylogenetic relationships and should be catalyzed by the same transposase enzymes. The better strategies for defining a transposon family should consider sequence identity and phylogenetic origins [98] as well as the molecular interactions.

3.2.3. Autonomous and Non-Autonomous Transposons

Transposons can also be grouped into autonomous and non-autonomous elements. The former are usually complete transposons and encode the entire enzymes required for transposition, whereas the latter may have undergone internal deletions or mutations and lack functional transposases and their movement is catalyzed by their autonomous partners. It should be noted that the terms ‘autonomous’ and ‘non-autonomous’ are defined based on the functional mobility, and they cannot be used to classify superfamilies of transposons. Any superfamily of both Class I and II transposons can be divided into either autonomous or non-autonomous elements. In many cases, autonomous and non-autonomous transposons were defined solely based on in silico gene prediction; if any transposons encode transposase proteins and contain the entire domains for transposition, they can be generally considered as autonomous transposons. However, the accuracy and reliability of computational predictions depend on many factors. If the mobility of a transposon has not been experimentally validated, any autonomous elements defined by computational prediction should be treated as potentially (but not true) autonomous transposons even their gene models are well supported by transcriptional data.

3.3. Quality Control and Improvement of Transposon Annotation

Transposons are highly repetitive, and many plants have complex genome structures, especially those with large genome sizes and/or duplicated chromosomes. Despite many bioinformatics software or pipelines having been developed and used for transposon prediction, not all annotated sequences are real transposons and some of them may be misannotated. Additionally, some protein-coding genes can also be predicted as repetitive sequences by some de novo annotation tools [19]. Therefore, users must carefully evaluate the annotated sequences, discard the incorrect annotation, and improve the quality of transposon annotation when they obtain the first version of the transposon database for newly sequenced plant genomes. Computational analysis is important, but manual inspection is strongly encouraged for this step. Goubert et al. have proposed a guideline for the manual curation of transposons [99]. Here, I just want to highlight four things that are helpful for improving the quality of transposon annotation.

3.3.1. Identification and Exclusion of Misannotated Sequences

Generally, three major types of genomic sequences were frequently misannotated: tandem repeats (TRs), non-transposon sequences, and misclassified transposons. Occasionally, some transposons such as TRIMs were tandemly organized [11]. However, nearly all tandem repeats are long track repeats such as centromeric tandem repeats, they are widely dispersed, and consist of various basic units. These sequences were frequently misannotated as LTR retrotransposons and significantly reduce the accuracy of transposon annotations. TRs can be identified with related programs [100] or pairwise sequence alignments (Figure 4). Non-transposon sequences, such as genes, pseudogenes, and other genomic sequences, can also be misidentified as transposons, especially for those annotated as fragmental or unclassified transposons. Two methods can be used to identify these types of misannotations: one is to use these annotated sequences for BLASTX or BLASTN searches and see if they show a significant sequence similarity to the known transposon protein sequences; and another is to extract the annotated sequences and the flanking regions and check if they are repetitive and contain any sequence structures of transposons (TIRs, LTRs, and TSDs). In some cases, some transposon sequences were misclassified. For example, I manually inspected a SINE database annotated by a computational program and found that some annotated SINEs were fragments of true LTR retrotransposons.

3.3.2. Elimination of Nested Transposons

One transposon called bottom transposon may contain other transposon(s) named nested transposon(s). Nested organizations of transposons are very common in many plants with higher repeat contents such as maize [101,102]. They are also frequently found in some specific regions of several plant genomes with lower repeat contents such as the centromeric and pericentromeric regions in rice [14]. In some cases, transposons were organized into multiple layers of nested insertions in which nested elements further served as target transposons for other TEs [14,102]. In plants, LTR retrotransposons frequently acted as target transposons to harbor other LTR retrotransposons of the same or different families, but DNA transposons can also be inserted by LTR retrotransposons and other transposons. Despite nested transposons providing some insights into the evolutionary dynamics of transposons, they may cause problems for transposon classifications. Therefore, nested transposons should be removed from the reference transposons. Nested transposons can be easily identified by comparing different family members of the same transposon family or orthologous elements between multiple genomes from the same or related species. Despite identification of inserted transposons being tedious and time-consuming, this process may find some novel transposon families and many of them may be recently active.

3.3.3. Definition of Transposon Boundaries

The exact boundaries of transposons can be determined based on the terminal sequences, including LTR and TIRs, and TSDs. However, the movements of some transposons such as Helitrons do not generate TSDs. Additionally, not all transposons possess the typical features of their superfamilies, some of them may have accumulated mutations or undergone deletions, and their termini and TSDs are difficult to recognize. Unfortunately, no software is available for defining the boundaries for fragmental or mutated transposons. One strategy is to extract the flanking sequences of transposons and conduct sequence alignment analysis. It is exhausting and time-consuming but is practical. With the advance of sequencing technologies, multiple genomes from the same and closely related species have been sequenced, and the exact boundaries of many transposons can also be defined with multiple reference sequence sets [103].

3.3.4. Identification of Unannotated Transposons

Due to the complexity of plant genomes and the current limitations of available bioinformatics resources for de novo TE discovery, some transposons may be missed. There are several ways to identify missing transposons. The first method is to apply more and different annotation approaches and see if any new transposons can be discovered. The second strategy is to check the reported transposons in newly sequenced genomes. Transposons are usually very dynamic, but some transposons are present in a wide range of plants. For example, An-RTEs (RTE clade of LINEs), Cassandra (TRIMs), and Au-SINEs are present in many dicots and monocots [11,47,50,104]. If they are not found in a new sequenced dicot or monocot, one possibility is that they may be missed by the transposon annotations. The third way is to use the annotated transposons to screen the genome with RepeatMasker and use the transposase proteins of identified superfamilies to conduct a TBLASTN search against the masked genome. If multiple significant hits were detected, it suggested that some transposons must be missed.

3.4. Criteria for Good Transposon Database

One question is how to evaluate the quality of the annotated transposon database? If two research groups (A and B) independently annotated transposons in the same genome, the A group’s transposons can define 75% of the genome whereas the B group’s transposons only mask 70% of the genome. Can we say the transposon database of the A group was better than that of the B group? The answer for this may be yes or no. The coverage is an indicator, but not the sole one to evaluate the quality of transposon annotation. In my opinion, there are four criteria to judge a good transposon database. a. Transposons should be accurately annotated. Two major points for this are (1) the annotated transposons should be real transposons and non-TE sequences should be excluded; (2) all annotated transposons should be correctly classified. For example, a Ty3/Gypsy retrotransposon cannot be classified as a Ty1/Copia. In some publications, the majority of the annotated transposons were listed as unknown/unclassified/ungrouped transposons, which implied the extremely poor quality of transposon annotations and/or genome sequencing and assembling. b. The boundaries of transposons should be well defined. c. Nearly all transposons should be annotated. It is impossible to identify all transposons in plant genomes, but we should not miss many of them. d. The transposon database should be non-redundant. Transposons are very redundant, and it is not necessary to include all or many copies of a transposon family into the database as it will enlarge the size of the TE library and impede computational analyses; thus, it is better to have a well-defined reference transposon for each family.

4. Conclusions and Future Directions

Despite extensive research having been conducted, the accurate annotation of transposons remains challenging and time-consuming. Thus far, 16 superfamilies of transposons have been found in plants and numerous bioinformatics software and pipelines have been developed to annotate transposons based on the three major annotation strategies. Although different systems or parameters were proposed for transposon classification and family definition, the terms ‘autonomous’ and ‘non-autonomous’ cannot be used to classify transposon superfamilies. A good transposon database should be accurately annotated and fully represented, and the boundaries of transposons should be well defined. As transposon annotation heavily depends on the available genomes, it is impossible to have a high-quality TE database with poorly sequenced and assembled genomes. With the innovation in software development and sequencing techniques, more complete plant genomes will be well annotated, and new types of transposons may also be discovered in plants. The current bioinformatics programs offer incredible resources for transposon predictions; however, some of them are difficult to install and use for beginners, especially for those with little expertise in bioinformatics and computational biology, and many of these programs also produce numerous false-positive annotations. New methodologies and algorithms are needed to develop next-generation annotation tools for significantly improving the accuracy and efficiency of transposon annotation. It is highly appreciated for beginners to create user-friendly transposon annotation platforms via national and international collaborations and to allow users to submit their genomes and download the annotated transposons online. Another need is to develop bioinformatics tools for comparatively annotating transposons with multiple genomes and to identify more new and unshared transposons. Thus far, multiple genomes from the same species and/or related species have been sequenced for many important crops or model plants; however, the comprehensive annotation and comparison of transposon landscapes in the sequenced pan-genomes are very limited. It is also highly recommended to periodically update the current transposon databases and make them to be accessible publicly.

Funding

This research was supported by USDA-ARS CRIS Project No. 2050-21000-038-000D.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

I am grateful for Ning Jiang for providing the initial training on transposons. I also would like to thank Scott Jackson for giving the opportunity to work on plant transposon annotations.

Conflicts of Interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the United States Department of Agriculture.

References

Kapitonov, V.V.; Jurka, J. A universal classification of eukaryotic transposable elements implemented in Repbase. Nat. Rev. Genet. 2008, 9, 411–412. [Google Scholar] [CrossRef]
Wicker, T.; Sabot, F.; Hua-Van, A.; Bennetzen, J.L.; Capy, P.; Chalhoub, B.; Flavell, A.; Leroy, P.; Morgante, M.; Panaud, O.; et al. A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 2007, 8, 973–982. [Google Scholar]
Bao, W.; Jurka, M.G.; Kapitonov, V.V.; Jurka, J. New superfamilies of eukaryotic DNA transposons and their internal divisions. Mol. Biol. Evol. 2009, 26, 983–993. [Google Scholar] [CrossRef]
Bao, W.; Kapitonov, V.V.; Jurka, J. Ginger DNA transposons in eukaryotes and their evolutionary relationships with long terminal repeat retrotransposons. Mob. DNA 2010, 1, 3. [Google Scholar] [CrossRef]
Cerbin, S.; Wai, C.M.; VanBuren, R.; Jiang, N. GingerRoot: A novel DNA transposon encoding integrase-related transposase in plants and animals. Genome Biol. Evol. 2019, 11, 3181–3193. [Google Scholar] [CrossRef]
Craig, R.J. Replitrons: A major group of eukaryotic transposons encoding HUH endonuclease. Proc. Natl. Acad. Sci. USA 2023, 120, e2301424120. [Google Scholar] [CrossRef] [PubMed]
Vassilieff, H.; Haddad, S.; Jamilloux, V.; Choisne, N.; Sharma, V.; Giraud, D.; Wan, M.; Serfraz, S.; Geering, A.D.W.; Teycheney, P.Y.; et al. CAULIFINDER: A pipeline for the automated detection and annotation of caulimovirid endogenous viral elements in plant genomes. Mob. DNA 2022, 13, 31. [Google Scholar] [CrossRef] [PubMed]
Richert-Pöggeler, K.R.; Noreen, F.; Schwarzacher, T.; Harper, G.; Hohn, T. Induction of infectious petunia vein clearing (pararetro) virus from endogenous provirus in petunia. EMBO J. 2003, 22, 4836–4845. [Google Scholar] [CrossRef] [PubMed]
Havecker, E.R.; Gao, X.; Voytas, D.F. The diversity of LTR retrotransposons. Genome Biol. 2004, 5, 225. [Google Scholar] [CrossRef] [PubMed]
Witte, C.P.; Le, Q.H.; Bureau, T.; Kumar, A. Terminal-repeat retrotransposons in miniature (TRIM) are involved in restructuring plant genomes. Proc. Natl. Acad. Sci. USA 2001, 98, 13778–13783. [Google Scholar] [CrossRef]
Gao, D.; Li, Y.; Do Kim, K.; Abernathy, B.; Jackson, S.A. Landscape and evolutionary dynamics of terminal repeat retrotransposons in miniature in plant genomes. Genome Biol. 2016, 17, 7. [Google Scholar] [CrossRef] [PubMed]
Bureau, T.E.; Wessler, S.R. Tourist: A large family of small inverted repeat elements frequently associated with maize genes. Plant Cell. 1992, 4, 1283–1294. [Google Scholar]
Bennetzen, J.L.; Wang, H. The contributions of transposable elements to the structure, function, and evolution of plant genomes. Annu. Rev. Plant Biol. 2014, 65, 505–530. [Google Scholar] [CrossRef]
Gao, D.; Jiang, N.; Wing, R.A.; Jiang, J.; Jackson, S.A. Transposons play an important role in the evolution and diversification of centromeres among closely related species. Front. Plant Sci. 2015, 6, 216. [Google Scholar] [CrossRef]
Serrato-Capuchina, A.; Matute, D.R. The role of transposable elements in speciation. Genes 2018, 9, 254. [Google Scholar] [CrossRef]
Gao, D.; Gill, N.; Kim, H.; Walling, J.G.; Zhang, W.; Fan, C.; Yu, Y.; Ma, J.; SanMiguel, P.; Jiang, N.; et al. A lineage-specific centromere retrotransposon in Oryza brachyantha. Plant J. 2009, 60, 820–831. [Google Scholar] [CrossRef]
Schnable, P.S.; Ware, D.; Fulton, R.S.; Stein, J.C.; Wei, F.; Pasternak, S.; Liang, C.; Zhang, J.; Fulton, L.; Graves, T.A.; et al. The B73 maize genome: Complexity, diversity, and dynamics. Science 2009, 326, 1112–1115. [Google Scholar] [CrossRef]
Jamilloux, V.; Daron, J.; Choulet, F.; Quesneville, H. De novo annotation of transposable elements: Tackling the fat genome issue. Proc. IEEE 2016, 105, 474–481. [Google Scholar] [CrossRef]
Yandell, M.; Ence, D. A beginner’s guide to eukaryotic genome annotation. Nat. Rev. Genet. 2012, 13, 329. [Google Scholar] [CrossRef] [PubMed]
Jangam, D.; Feschotte, C.; Betrán, E. Transposable element domestication as an adaptation to evolutionary conflicts. Trends Genet. 2017, 33, 817–831. [Google Scholar] [CrossRef] [PubMed]
TE Hub Consortium; Elliott, T.A.; Heitkam, T.; Hubley, R.; Quesneville, H.; Suh, A.; Wheeler, T.J. TE Hub: A community-oriented space for sharing and connecting tools, data, resources, and methods for transposable element annotation. Mob. DNA 2021, 12, 16. [Google Scholar]
Lerat, E. Identifying repeats and transposable elements in sequenced genomes: How to find your way through the dense forest of programs. Heredity 2010, 104, 520–533. [Google Scholar] [CrossRef]
Storer, J.M.; Hubley, R.; Rosen, J.; Smit, A.F.A. Methodologies for the de novo discovery of transposable element families. Genes 2022, 13, 709. [Google Scholar] [CrossRef]
Mokhtar, M.M.; Alsamman, A.M.; El Allali, A. PlantLTRdb: An interactive database for 195 plant species LTR-retrotransposons. Front. Plant Sci. 2023, 14, 1134627. [Google Scholar] [CrossRef] [PubMed]
Bao, Z.R.; Eddy, S.R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 2002, 12, 1269–1276. [Google Scholar] [CrossRef]
McCarthy, E.M.; McDonald, J.F. LTR STRUC: A novel search and identification program for LTR retrotransposons. Bioinformatics 2003, 19, 362–367. [Google Scholar] [CrossRef] [PubMed]
Xu, Z.; Wang, H. LTR-FINDER: An efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 2007, 35, 265–268. [Google Scholar] [CrossRef] [PubMed]
Ellinghaus, D.; Kurtz, S.; Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinform. 2008, 9, 18. [Google Scholar] [CrossRef]
Steinbiss, S.; Willhoeft, U.; Gremme, G.; Kurtz, S. Fine-grained annotation and classification of de novo predicted LTR retrotransposons. Nucleic Acids Res. 2009, 37, 7002–7013. [Google Scholar] [CrossRef] [PubMed]
Ou, S.; Jiang, N. LTR_retriever: A highly accurate and sensitive program for identification of long terminal-repeat retrotransposons. Plant Physiol. 2017, 176, 1410–1422. [Google Scholar] [CrossRef]
Orozco-arias, S.; Liu, J.; Id, R.T.; Ceballos, D.; Silva, D.; Id, D.; Ming, R.; Guyot, R. Inpactor, integrated and parallel analyzer and classifier of LTR retrotransposons and its application for pineapple LTR retrotransposons diversity and dynamics. Biology 2018, 7, 32. [Google Scholar] [CrossRef] [PubMed]
Wenke, T.; Döbel, T.; Sörensen, T.R.; Junghans, H.; Weisshaar, B.; Schmidt, T. Targeted identification of short interspersed nuclear element families shows their widespread existence and extreme heterogeneity in plant genomes. Plant Cell 2011, 23, 3117–3128. [Google Scholar] [CrossRef]
Li, Y.; Jiang, N.; Sun, Y. AnnoSINE: A short interspersed nuclear elements annotation tool for plant genomes. Plant Physiol. 2022, 188, 955–970. [Google Scholar] [CrossRef] [PubMed]
Gremme, G.; Steinbiss, S.; Kurtz, S. GenomeTools: A comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM Trans. Comput. Biol. Bioinform. 2013, 10, 645–656. [Google Scholar] [CrossRef]
Su, W.; Gu, X.; Peterson, T. TIR-Learner, a New Ensemble Method for TIR Transposable Element Annotation, Provides Evidence for Abundant New Transposable Elements in the Maize Genome. Mol. Plant. 2019, 12, 447–460. [Google Scholar] [CrossRef] [PubMed]
Shi, J.; Liang, C. Generic Repeat Finder: A High-Sensitivity Tool for Genome-Wide De Novo Repeat Detection. Plant Physiol. 2019, 180, 1803–1815. [Google Scholar] [CrossRef] [PubMed]
Han, Y.; Wessler, S.R. MITE-Hunter: A program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 2010, 38, e199. [Google Scholar] [CrossRef]
Ye, C.; Ji, G.; Liang, C. detectMITE: A novel approach to detect miniature inverted repeat transposable elements in genomes. Sci. Rep. 2016, 6, 19688. [Google Scholar] [CrossRef]
Hu, J.; Zheng, Y.; Shang, X. MiteFinderII: A novel tool to identify miniature inverted-repeat transposable elements hidden in eukaryotic genomes. BMC Med. Genom. 2018, 11, 101. [Google Scholar] [CrossRef]
Crescente, J.M.; Zavallo, D.; Helguera, M.; Vanzetti, L.S. MITE Tracker: An accurate approach to identify miniature inverted-repeat transposable elements in large genomes. BMC Bioinform. 2018, 19, 348. [Google Scholar] [CrossRef]
Xiong, W.; He, L.; Lai, J.; Dooner, H.K.; Du, C. HelitronScanner uncovers a large overlooked cache of Helitron transposons in many plant genomes. Proc. Natl. Acad. Sci. USA 2014, 111, 10263–10268. [Google Scholar] [CrossRef]
Bao, W.; Kojima, K.K.; Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 2015, 6, 11. [Google Scholar] [CrossRef]
Garcia, T.; Duitama, J.; Zullo, S.S.; Gil, J.; Ariani, A.; Dohle, S.; Palkovic, A.; Skeen, P.; Bermudez-Santana, C.I.; Debouck, D.G.; et al. Comprehensive genomic resources related to domestication and crop improvement traits in Lima bean. Nat. Commun. 2021, 12, 702. [Google Scholar] [CrossRef]
El Baidouri, M.; Kim, K.D.; Abernathy, B.; Arikit, S.; Maumus, F.; Panaud, O.; Meyers, B.C.; Jackson, S.A. A new approach for annotation of transposable elements using small RNA mapping. Nucleic Acids Res. 2015, 43, e84. [Google Scholar] [CrossRef]
Rho, M.; Tang, H. MGEScan-Non-LTR: Computational Identification and Classification of Autonomous Non-LTR Retrotransposons in Eukaryotic Genomes. Nucleic Acids Res. 2009, 37, e143. [Google Scholar] [CrossRef]
Malik, H.S.; Eickbush, T.H. The RTE class of non-LTR retrotransposons is widely distributed in animals and is the origin of many SINEs. Mol. Biol. Evol. 1998, 15, 1123–1134. [Google Scholar] [CrossRef]
Gao, D.; Chu, Y.; Xia, H.; Xu, C.; Heyduk, K.; Abernathy, B.; Ozias-Akins, P.; Leebens-Mack, J.H.; Jackson, S.A. Horizontal Transfer of Non-LTR Retrotransposons from Arthropods to Flowering Plants. Mol. Biol. Evol. 2018, 35, 354–364. [Google Scholar] [CrossRef]
Deragon, J.M.; Zhang, X. Short interspersed elements (SINEs) in plants: Origin, classification, and use as phylogenetic markers. Syst. Biol. 2006, 55, 949–956. [Google Scholar] [CrossRef]
Umeda, M.; Ohtsubo, H.; Ohtsubo, E. Diversification of the rice Waxy gene by insertion of mobile DNA elements into introns. Jpn. J. Genet. 1991, 66, 569–586. [Google Scholar] [CrossRef]
Yasui, Y.; Nasuda, S.; Matsuoka, Y.; Kawahara, T. The Au family, a novel short interspersed element (SINE) from Aegilops umbellulate. Theor. Appl. Genet. 2001, 102, 463–470. [Google Scholar] [CrossRef]
Mao, H.; Wang, H. SINE_scan: An efficient tool to discover short interspersed nuclear elements (SINEs) in large-scale genomic datasets. Bioinformatics 2017, 33, 743–745. [Google Scholar] [CrossRef]
Temin, H.M. Reverse transcription in the eukaryotic genome: Retroviruses, pararetroviruses, retrotransposons and retrotranscripts. Mol. Biol. Evol. 1985, 2, 455–468. [Google Scholar] [PubMed]
Richert-Poggeler, K.R.; Shepherd, R.J. Petunia vein clearing virus: A plant pararetrovirus with the core sequence of an integrase function. Virology 1997, 236, 137–146. [Google Scholar] [CrossRef] [PubMed]
Teycheney, P.-Y.; Geering, A.D.W.; Dasgupta, I.; Hull, R.; Kreuze, J.F.; Lockhart, B.; Muller, E.; Olszewski, N.; Pappu, H.; Pooggin, M.; et al. ICTV Virus taxonomy profile: Caulimoviridae. J. Gen. Virol. 2020, 101, 1025–1026. [Google Scholar] [CrossRef]
Jakowitsch, J.; Mette, M.F.; van Der Winden, J.; Matzke, M.A.; Matzke, A.J. Integrated pararetroviral sequences define a unique class of dispersed repetitive DNA in plants. Proc. Natl. Acad. Sci. USA 1999, 96, 13241–13246. [Google Scholar] [CrossRef] [PubMed]
Geering, A.D.W.; Maumus, F.; Copetti, D.; Choisne, N.; Zwickl, D.J.; Zytnicki, M.; McTaggart, A.R.; Scalabrin, S.; Vezzulli, S.; Wing, R.A.; et al. Endogenous florendoviruses are major components of plant genomes and hallmarks of virus evolution. Nat. Commun. 2014, 5, 5269. [Google Scholar] [CrossRef] [PubMed]
Robertson, D.S. Characterization of a Mutator system in maize. Mutat. Res. 1978, 51, 21–28. [Google Scholar] [CrossRef]
Feschotte, C.; Pritham, E.J. DNA transposons and the evolution of the eukaryotic genomes. Annu. Rev. Genet. 2007, 41, 331–368. [Google Scholar] [CrossRef]
Yu, Z.; Wright, S.I.; Bureau, T.E. Mutator-like elements in Arabidopsis thaliana. Structure, diversity and evolution. Genetics 2000, 156, 2019–2031. [Google Scholar] [CrossRef]
Lisch, D. Mutator transposons. Trends Plant Sci. 2002, 7, 498–504. [Google Scholar] [CrossRef]
Gao, D.; Caspersen, A.M.; Hu, G.; Bockelman, H.E.; Chen, X. A novel mutator-like transposable elements with unusual structure and recent transpositions in barley (Hordeum vulgare). Front. Plant Sci. 2022, 13, 904619. [Google Scholar] [CrossRef] [PubMed]
Jiang, N.; Bao, Z.; Zhang, X.; Eddy, S.R.; Wessler, S.R. Pack-MULE transposable elements mediate gene evolution in plants. Nature 2004, 431, 569–573. [Google Scholar] [CrossRef] [PubMed]
Peterson, P.A. A mutable pale green locus in maize. Genetics 1953, 38, 682–683. [Google Scholar]
McClintock, B. Mutations in maize and chromosomal aberrations in Neurospora. In Annual Report of the Director of the Department of Genetics, Carnegie Institution of Washington Year Book No. 53, 1953–1954; Carnegie Institution of Washington: Cold Spring, NY, USA, 1954; pp. 254–260. [Google Scholar]
Zabala, G.; Vodkin, L. A putative autonomous 20.5 kb-CACTA transposon insertion in an F3′H. allele identifies a new CACTA transposon subfamily in Glycine max. BMC Plant Biol. 2008, 8, 124. [Google Scholar] [CrossRef] [PubMed]
Kawasaki, S.; Nitasaka, E. Characterization of Tpn1 family in the Japanese morning glory: En/Spm-related transposable elements capturing host genes. Plant Cell Physiol. 2004, 45, 933–944. [Google Scholar] [CrossRef] [PubMed]
McCLINTOCK, B. The origin and behavior of mutable loci in maize. Proc. Natl. Acad. Sci. USA 1950, 36, 344–355. [Google Scholar] [CrossRef]
Calvi, B.R.; Hong, T.J.; Findley, S.D.; Gelbart, W.M. Evidence for a common evolutionary origin of inverted repeat transposons in Drosophila and plants: Hobo, Activator, and Tam3. Cell 1991, 66, 465–471. [Google Scholar] [CrossRef]
Atkinson, P.W. hAT transposable elements. Microbiol. Spectr. 2015, 3, 773–800. [Google Scholar] [CrossRef]
Rubin, E.; Lithwick, G.; Levy, A.A. Structure and evolution of the hAT transposon superfamily. Genetics 2001, 158, 949–957. [Google Scholar] [CrossRef]
Essers, L.; Adolphs, R.H.; Kunze, R. A highly conserved domain of the maize activator transposase is involved in dimerization. Plant Cell 2000, 12, 211–224. [Google Scholar] [CrossRef]
Zhang, X.; Feschotte, C.; Zhang, Q.; Jiang, N.; Eggleston, W.B.; Wessler, S.R. P Instability Factor: An Active Maize Transposon System Associated with the Amplification of Tourist-like MITEs and a New Superfamily of Transposases. Proc. Natl. Acad. Sci. USA 2001, 98, 12572–12577. [Google Scholar] [CrossRef]
Kapitonov, V.V.; Jurka, J. Molecular paleontology of transposable elements from Arabidopsis thaliana. Genetica 1999, 107, 27–37. [Google Scholar] [CrossRef]
Jiang, N.; Bao, Z.; Zhang, X.; Hirochika, H.; Eddy, S.; McCouch, S.R.; Wessler, S.R. An active DNA transposon family in rice. Nature 2003, 421, 163–167. [Google Scholar] [CrossRef] [PubMed]
Velanis, C.N.; Perera, P.; Thomson, B.; de Leau, E.; Liang, S.C.; Hartwig, B.; Förderer, A.; Thornton, H.; Arede, P.; Chen, J.; et al. The domesticated transposase ALP2 mediates formation of a novel Polycomb protein complex by direct interaction with MSI1, a core subunit of Polycomb Repressive Complex 2 (PRC2). PLoS Genet. 2020, 16, e1008681. [Google Scholar] [CrossRef]
Mao, D.; Tao, S.; Li, X.; Gao, D.; Tang, M.; Liu, C.; Wu, D.; Bai, L.; He, Z.; Wang, X.; et al. The Harbinger transposon-derived gene PANDA epigenetically coordinates panicle number and grain size in rice. Plant Biotechnol. J. 2022, 20, 1154–1166. [Google Scholar] [CrossRef] [PubMed]
Emmons, S.W.; Yesner, L.; Ruan, K.; Katzenberg, D. Evidence for a transposon in Caenorhabditis elegans. Cell 1983, 32, 55–65. [Google Scholar] [CrossRef] [PubMed]
Jacobson, J.W.; Medhora, M.M.; Hartl, D.L. Molecular structure of a somatically unstable transposable element in Drosophila. Proc. Natl. Acad. Sci. USA 1986, 83, 8684–8688. [Google Scholar] [CrossRef] [PubMed]
Dupeyron, M.; Baril, T.; Bass, C.; Hayward, A. Phylogenetic analysis of the Tc1/mariner superfamily reveals the unexplored diversity of pogo-like elements. Mob. DNA 2020, 11, 21. [Google Scholar] [CrossRef] [PubMed]
Liu, Y.; Yang, G. Tc1-like transposable elements in plant genomes. Mob. DNA 2014, 5, 17. [Google Scholar] [CrossRef]
Wells, D.J. Tdd-4, a DNA transposon of Dictyostelium that encodes proteins similar to LTR retroelement integrases. Nucleic Acids Res. 1999, 27, 2408–2415. [Google Scholar] [CrossRef] [PubMed]
Glockner, G.; Szafranski, K.; Winckler, T.; Dingermann, T.; Quail, M.A.; Cox, E.; Eichinger, L.; Noegel, A.A.; Rosenthal, A. The complex repeats of Dictyostelium discoideum. Genome Res. 2001, 11, 585–594. [Google Scholar] [CrossRef] [PubMed]
Gao, D.; Zhao, D.; Abernathy, B.; Iwata-Otsubo, A.; Herrera-Estrella, A.; Jiang, N.; Jackson, S.A. Dynamics of a novel highly repetitive CACTA family in common bean (Phaseolus vulgaris). G3 2016, 6, 2091–2101. [Google Scholar] [CrossRef] [PubMed]
Boutanaev, A.M.; Osbourn, A.E. Multigenome analysis implicates miniature inverted-repeat transposable elements (MITEs) in metabolic diversification in eudicots. Proc. Natl. Acad. Sci. USA 2018, 115, E6650–E6658. [Google Scholar] [CrossRef] [PubMed]
Jurka, J.; Kapitonov, V.V. PIFs Meet Tourists and Harbingers: A Superfamily Reunion. Proc. Natl. Acad. Sci. USA 2001, 98, 12315–12316. [Google Scholar] [CrossRef] [PubMed]
Grzebelus, D.; Lasota, S.; Gambin, T.; Kucherov, G.; Gambin, A. Diversity and structure of PIF/Harbinger-like elements in the genome of Medicago truncatula. BMC Genom. 2007, 8, 409. [Google Scholar] [CrossRef]
Deprá, M.; Ludwig, A.; Valente, V.L.; Loreto, E.L. Mar, a MITE family of hAT transposons in Drosophila. Mob. DNA 2012, 3, 13. [Google Scholar] [CrossRef]
Warburton, P.E.; Giordano, J.; Cheung, F.; Gelfand, Y.; Benson, G. Inverted repeat structure of the human genome: The X-chromosome contains a preponderance of large, highly homologous inverted repeats that contain testes genes. Genome Res. 2004, 14, 1861–1869. [Google Scholar] [CrossRef]
Kapitonov, V.V.; Jurka, J. Rolling-circle transposons in eukaryotes. Proc. Natl. Acad. Sci. USA 2001, 98, 8714–8719. [Google Scholar] [CrossRef] [PubMed]
Du, C.; Fefelova, N.; Caronna, J.; He, L.; Dooner, H.K. The Polychromatic Helitron Landscape of the Maize Genome. Proc. Natl. Acad. Sci. USA 2009, 106, 19916–19921. [Google Scholar] [CrossRef] [PubMed]
Ou, S.; Su, W.; Liao, Y.; Chougule, K.; Agda, J.R.A.; Hellinga, A.J.; Lugo, C.S.B.; Elliott, T.A.; Ware, D.; Peterson, T.; et al. Benchmarking Transposable Element Annotation Methods for Creation of a Streamlined, Comprehensive Pipeline. Genome Biol. 2019, 20, 275. [Google Scholar] [CrossRef] [PubMed]
Novák, P.; Neumann, P. Global analysis of repetitive DNA from unassembled sequence reads using RepeatExplorer2. Nat. Protoc. 2020, 15, 3745–3776. [Google Scholar] [CrossRef]
Riehl, K.; Riccio, C.; Miska, E.A.; Hemberg, M. TransposonUltimate: Software for transposon classification, annotation and detection. Nucleic Acids Res. 2022, 50, e64. [Google Scholar] [CrossRef] [PubMed]
Baril, T.; Imrie, R.M.; Hayward, A. Earl Grey: A fully automated user-friendly transposable element annotation and analysis pipeline. bioRxiv 2022. [CrossRef]
Llorens, C.; Futami, R.; Covelli, L.; Domínguez-Escribá, L.; Viu, J.M.; Tamarit, D.; Aguilar-Rodríguez, J.; Vicente-Ripolles, M.; Fuster, G.; Bernet, G.P.; et al. The Gypsy Database (GyDB) of mobile genetic elements: Release 2.0. Nucleic Acids Res. 2011, 39, 70–74. [Google Scholar] [CrossRef] [PubMed]
Flutre, T.; Duprat, E.; Feuillet, C.; Quesneville, H. Considering transposable element diversification in de novo annotation approaches. PLoS ONE 2011, 6, 0016526. [Google Scholar] [CrossRef]
Hayward, A.; Gilbert, C. Transposable elements. Curr. Biol. 2022, 32, R904–R909. [Google Scholar] [CrossRef]
Neumann, P.; Novák, P.; Hoštáková, N.; Macas, J. Systematic Survey of Plant LTR-Retrotransposons Elucidates Phylogenetic Relationships of Their Polyprotein Domains and Provides a Reference for Element Classification. Mob. DNA 2019, 10, 1. [Google Scholar] [CrossRef]
Goubert, C.; Craig, R.J.; Bilat, A.F.; Peona, V.; Vogan, A.A.; Protasio, A.V. A beginner’s guide to manual curation of transposable elements. Mob. DNA 2022, 13, 7. [Google Scholar] [CrossRef]
Benson, G. Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res. 1999, 27, 573–580. [Google Scholar] [CrossRef]
SanMiguel, P.; Tikhonov, A.; Jin, Y.K.; Motchoulskaia, N.; Zakharov, D.; Melake-Berhan, A.; Springer, P.S.; Edwards, K.J.; Lee, M.; Avramova, Z.; et al. Nested retrotransposons in the intergenic regions of the maize genome. Science 1996, 274, 765–768. [Google Scholar] [CrossRef]
Kronmiller, B.A.; Wise, R.P. TEnest: Automated chronological annotation and visualization of nested plant transposable elements. Plant Physiol. 2008, 146, 45–59. [Google Scholar] [CrossRef] [PubMed]
Buisine, N.; Quesneville, H.; Colot, V. Improved detection and annotation of transposable elements in sequenced genomes using multiple reference sequence sets. Genomics 2008, 91, 467–475. [Google Scholar] [CrossRef] [PubMed]
Kalendar, R.; Tanskanen, J.; Chang, W.; Antonius, K.; Sela, H.; Peleg, O.; Schulman, A.H. Cassandra retrotransposons carry independently transcribed 5S RNA. Proc. Natl. Acad. Sci. USA 2008, 105, 5833–5838. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Structures and typical features of different types of plant transposons. GAG, group-specific antigen; POL, polyprotein; PR, protease; RT, reverse transcriptase; RH, RNAase-H; INT, integrase; EN, endonuclease; CP, coat protein; MP, movement protein; TAV, transactivation protein; TPase, transposase; REP, replication protein.

Figure 2. Transposon fractions of 21 sequenced plant genomes.

Figure 3. Three major strategies for annotating transposons (LINEs as the example). (A) Repeat-based transposon annotation. The plant genome sequences (assembled chromosomes or large scaffolds) were split into smaller fragments, and then used to search against all plant genome sequences. All repetitive sequences were extracted and grouped into different repeat families based on sequence identity. The boundaries of transposons can be defined by computational comparisons. (B) Structure-based transposon annotation. The poly(A) motif of LINEs was recognized by developed software and the boundaries of LINEs were defined based on TSDs (smiley faces). Then, the identified LINE sequences were extracted and grouped into different families. (C) Homology-based transposon annotation. The proteins of represented LINEs were used to search against the plant genomes and to identify homologous sequences (blue rectangles). Then, the homologous hits and the flanking regions (usually 5–10 Kb for each direction) were extracted together and used to conduct sequence comparisons for defining boundaries of LINEs and grouping them into families.

Figure 4. Pairwise sequence comparisons of two potential LTR retrotransposons in barley annotated by bioinformatics software. Both annotated sequences have LTRs (black rectangles) and were flanked by 5-bp TSDs. However, pairwise comparisons revealed that sequence A was a real LTR retroelement whereas sequence B was misannotated as it contained tandem repeats.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Gao, D. Introduction of Plant Transposon Annotation for Beginners. Biology 2023, 12, 1468. https://doi.org/10.3390/biology12121468

AMA Style

Gao D. Introduction of Plant Transposon Annotation for Beginners. Biology. 2023; 12(12):1468. https://doi.org/10.3390/biology12121468

Chicago/Turabian Style

Gao, Dongying. 2023. "Introduction of Plant Transposon Annotation for Beginners" Biology 12, no. 12: 1468. https://doi.org/10.3390/biology12121468

APA Style

Gao, D. (2023). Introduction of Plant Transposon Annotation for Beginners. Biology, 12(12), 1468. https://doi.org/10.3390/biology12121468

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Introduction of Plant Transposon Annotation for Beginners

Abstract

Simple Summary

Abstract

1. Introduction of Plant Transposons

2. Strategies of Transposon Discovery

2.1. Repeat-Based Annotation

2.2. Structure-Based Annotation

2.3. Homology-Based Annotation

3. Steps for Transposon Annotation

3.1. Brief Introduction of Plant Transposon Superfamilies and Their Annotations

3.1.1. LTR Retrotransposons

3.1.2. LINEs

3.1.3. SINEs

3.1.4. Endogenous Plant Pararetrovirus (EPRVs)

3.1.5. DNA Transposons with TIRs

Mutator Transposons

CACTA Transposons

hAT Transposons

PIF/Harbinger Transposons

Tc1/Mariner Transposons

Sola Transposons

Ginger Transposons

MITEs

TIR DNA Transposon Annotation

3.1.6. Helitron Transposons

3.1.7. Replitron Transposons

3.1.8. Others TE Annotation Resources

3.2. Classification of Transposons

3.2.1. Definition of Transposon Superfamilies

3.2.2. Definition of Transposon Families

3.2.3. Autonomous and Non-Autonomous Transposons

3.3. Quality Control and Improvement of Transposon Annotation

3.3.1. Identification and Exclusion of Misannotated Sequences

3.3.2. Elimination of Nested Transposons

3.3.3. Definition of Transposon Boundaries

3.3.4. Identification of Unannotated Transposons

3.4. Criteria for Good Transposon Database

4. Conclusions and Future Directions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI