Discovering the Repeatome of Five Species Belonging to the Asteraceae Family: A Computational Study

Genome divergence by repeat proliferation and/or loss is a process that plays a crucial role in species evolution. Nevertheless, knowledge of the variability related to repeat proliferation among species of the same family is still limited. Considering the importance of the Asteraceae family, here we present a first contribution towards the metarepeatome of five Asteraceae species. A comprehensive picture of the repetitive components of all genomes was obtained by genome skimming with Illumina sequence reads and by analyzing a pool of full-length long terminal repeat retrotransposons (LTR-REs). Genome skimming allowed us to estimate the abundance and variability of repetitive components. The structure of the metagenome of the selected species was composed of 67% repetitive sequences, of which LTR-REs represented the bulk of annotated clusters. The species essentially shared ribosomal DNA sequences, whereas the other classes of repetitive DNA were highly variable among species. The pool of full-length LTR-REs was retrieved from all the species and their age of insertion was established, showing several lineage-specific proliferation peaks over the last 15-million years. Overall, a large variability of repeat abundance at superfamily, lineage, and sublineage levels was observed, indicating that repeats within individual genomes followed different evolutionary and temporal dynamics, and that different events of amplification or loss of these sequences may have occurred after species differentiation.


Introduction
The collection of all repetitive sequences distributed along chromosomes, known as the "repeatome", constitutes one of the major components of eukaryotic genomes [1]. Overall, repeat types can be characterized as satellite DNA (i.e., sequences organized as tandem repetitions) and interspersed repeats (i.e., transposable elements) [2]. Transposable elements (TEs) are DNA sequences that can move independently within the genome through specific transposition mechanisms. The discovery of TEs dates back to the 1940s, when U.S. biologist Barbara McClintock identified DNA sequences capable of moving from one locus to another within the Zea mays genome [3]. Based on their transposition mechanism, TEs are divided into two main classes: retrotransposons (REs), or Class I TEs; and DNA transposons, or Class II TEs. Both classes are autonomous and non-autonomous elements based on the presence or absence of specific open reading frames encoding transposon proteins. Non-autonomous elements are not able to transpose autonomously but can still proliferate by exploiting the transposition proteins encoded by the autonomous elements [4][5][6][7]. DNA transposons can move through a mechanism of transposition called "cut-and-paste", whereas retrotransposons use a "copy-and-paste" type of replication involving an intermediate RNA molecule [8]. REs can also be divided into two major approach could be particularly useful in discovering repeats that are difficult to identify with structural tools. The identification of repeats based only on structural features, in fact, could lead to mismeasurements of repeat abundance. Repeat sequences in species where transposition events occurred in very ancient times could have accumulated mutations and have been poorly detected. Furthermore, scanning genome sequences for identifying full-length elements could result in a low number of repetitive elements because of common mis-assembly events (i.e., repeats collapsing during the assembly procedure) [30].
Based on graph clustering and the identification of full-length repeats, this study aimed to clarify the repeatome belonging to important Asteraceae and shed light on various evolutionary and temporal dynamics of retrotranspositional activity following species separation by: (i) Establishing the extent of repetitive DNA variation among species belonging to the same family; and (ii) Analyzing the relationship between changes in LTR-RE abundance and variations in the dynamics of specific LTR-REs among related species.

Metarepeatome Analysis of Asteraceae Species
The repeatomes of five species of the Asteraceae family (i.e., Helianthus annuus, Lactuca sativa, Cynara cardunculus var. scolymus, Artemisia annua, and Chrysanthemum seticuspe) were studied to classify repetitive sequences and identify their homologous groups in individual genomes (Table 1). A comparative analysis using hybrid clustering was performed with RepeatExplorer2 using a set of 1,000,000 random reads from each of the five chosen species for a total of 5,000,000 reads. The clustered sequence reads, i.e., the repetitive DNA, ranged from 60.44% of the genome of Cynara cardunculus var. scolymus to 78.44% of the genome of Helianthus annuus (Table 2). In total, 2,190,582 reads were grouped into 100,231 clusters, representing different subfamilies of specific repetitive elements. Furthermore, exploiting the feature of paired-end reads, clusters were grouped into 99,971 superclusters, which included repeats belonging to the same repeat family. In total, this analysis estimated the repetitive component as In total, 2,190,582 reads were grouped into 100,231 clusters, representing different subfamilies of specific repetitive elements. Furthermore, exploiting the feature of pairedend reads, clusters were grouped into 99,971 superclusters, which included repeats belonging to the same repeat family. In total, this analysis estimated the repetitive component as 67% of the metagenome structure of the five species, while 725,621 sequences remained singlets (Figure 1). Of the 528 top clusters (i.e., clusters representing >0.01% of the analyzed reads), 455 were annotated as repeats belonging to the LTR order, showing that the overall structure of the five species was largely composed of LTR-RE-related clusters. Among the clusters annotated as LTR-REs, the two major superfamilies were represented by similar percentages: 21.36% and 19.52% of the metarepeatome for Copia and Gypsy, respectively. DNA transposons accounted for 1.03% of the metarepeatome, rDNA sequences for 0.92%, and satellite DNA for 0.14%. Finally, 23.67% consisted of unidentifiable repeated elements, Of the 528 top clusters (i.e., clusters representing >0.01% of the analyzed reads), 455 were annotated as repeats belonging to the LTR order, showing that the overall structure of the five species was largely composed of LTR-RE-related clusters. Among the clusters annotated as LTR-REs, the two major superfamilies were represented by similar percentages: 21.36% and 19.52% of the metarepeatome for Copia and Gypsy, respectively. DNA transposons accounted for 1.03% of the metarepeatome, rDNA sequences for 0.92%, and satellite DNA for 0.14%. Finally, 23.67% consisted of unidentifiable repeated elements, and 33.12% was attributable to single or low-copy-number sequences, including repeats that were not abundant in the respective species ( Figure 2).
Concerning LTR-retrotransposons (Table 3), the repeats annotated as LTR-REs ranged from 35.27% of the genome of Cynara cardunculus var. scolymus to 52.32% in Chrysanthemum seticuspe. Gypsy elements ranged from 10.61% in Cynara cardunculus var. scolymus to 41.11% in Helianthus annuus, whereas Copia elements ranged from 6.35% in Helianthus annuus to 35.28% in Chrysanthemum seticuspe. The ratio between the genomic proportions of Gypsy and Copia elements largely differed among these Asteraceae species, from 0.35 in Chrysanthemum seticuspe to 6.47 in Helianthus annuus ( Table 3). The maximum difference of genome proportion of each LTR-RE superfamily or lineage among the five species analyzed gave us an estimation of genome proportion variability of Copia and Gypsy elements within Asteraceae. Such variability among genomes was larger for each Gypsy lineage compared to Copia lineages, and it was even larger for whole superfamilies; the maximum difference was 28.93% for the Copia superfamily and 30.50% for the Gypsy superfamily (Table 3). LTR-RE redundancy was also studied after annotating elements at the lineage level: six lineages (plus one group that could not be annotated) were identified among Copia REs (Ale, Angela, Ikeros, Ivana, SIRE, and TAR), and three lineages (plus one group that could not be annotated) were identified among Gypsy REs (Chromovirus, Athila and Tat) ( and 33.12% was attributable to single or low-copy-number sequences, including repeats that were not abundant in the respective species ( Figure 2). Concerning LTR-retrotransposons (Table 3), the repeats annotated as LTR-REs ranged from 35.27% of the genome of Cynara cardunculus var. scolymus to 52.32% in Chrysanthemum seticuspe. Gypsy elements ranged from 10.61% in Cynara cardunculus var. scolymus to 41.11% in Helianthus annuus, whereas Copia elements ranged from 6.35% in Helianthus annuus to 35.28% in Chrysanthemum seticuspe. The ratio between the genomic proportions of Gypsy and Copia elements largely differed among these Asteraceae species, from 0.35 in Chrysanthemum seticuspe to 6.47 in Helianthus annuus ( Table 3). The maximum difference of genome proportion of each LTR-RE superfamily or lineage among the five species analyzed gave us an estimation of genome proportion variability of Copia and Gypsy elements within Asteraceae. Such variability among genomes was larger for each Gypsy lineage compared to Copia lineages, and it was even larger for whole superfamilies; the maximum difference was 28.93% for the Copia superfamily and 30.50% for the Gypsy superfamily (Table 3). LTR-RE redundancy was also studied after annotating elements at the lineage level: six lineages (plus one group that could not be annotated) were identified among Copia REs (Ale, Angela, Ikeros, Ivana, SIRE, and TAR), and three lineages (plus one group that could not be annotated) were identified among Gypsy REs (Chromovirus, Athila and Tat) ( Table 3). Among the Copia REs, the SIRE lineage had a genome proportion higher than 3% in all species, while the Angela lineage was particularly abundant (14.66%) in Lactuca sativa. Each Gypsy lineage accounted for different percentages of the genome, with Chromoviruses being the most abundant, especially in Helianthus annuus (31.11%).  To investigate the possible variability within lineages and to identify species-specific repeats, hierarchical clustering was performed on the annotated clusters based on their abundance within the analyzed genomes and grouping the homologous shared clusters. As shown in Figure 3, the analyzed species essentially shared rDNA sequences. The other To investigate the possible variability within lineages and to identify species-specific repeats, hierarchical clustering was performed on the annotated clusters based on their abundance within the analyzed genomes and grouping the homologous shared clusters. As shown in Figure 3, the analyzed species essentially shared rDNA sequences. The other DNA repeat classes were very specific, with the presence of distinct sublineages, except for Artemisia annua and Chrysanthemum seticuspe (which belong to the same tribe, Anthemideae, and share some of their repeats).

Isolation and Analysis of Full-Length LTR Retrotransposons
Because LTR-REs are largely the most abundant repeat class in the genomes of the five Asteraceae species, full-length LTR-REs were identified based on the structural features in the sequenced genomes of each selected species. Overall, 48,872 full-length LTR-REs were retrieved (Table 4).  Most of the full-length LTR-REs (34,580 out of 48,872) were identified in the large genome of Helianthus annuus, 77.5% of which were annotated as Gypsy-related LTR-REs, with a prevalence of elements belonging to the Chromovirus/Tekay lineage. Then, 6875 fulllength LTR-REs were found in lettuce, 79% of which belonged to the Copia superfamily.
The sequences encoding RT domains (15,431 intact RT domains for the Copia superfamily and 26,203 intact RT domains for the Gypsy superfamily) were identified and collected from the pool of full-length elements and analyzed to infer the phylogenetic relationship occurring among the LTR-REs of a single species, highlighting a clear separation of lineages in each of the studied genomes (Figures 4 and 5).
Furthermore, the phylogenetic trees based on all RT sequences of Copia and Gypsy elements, separated according to the lineage, revealed that for all lineages, RT sequences clustered randomly, i.e., not based on the species to which they belonged (Supplementary Figures S1 and S2).
Finally, proliferation time profiles of the full-length LTR-REs were analyzed in the five genomes by measuring pairwise distances between the LTRs of the same element. The two LTR sequences of a retrotransposon are identical immediately after the insertion event and then undergo mutations over time [36]. If LTR-REs accumulate more mutations than genes as time passes [21], distances between LTR sequences are converted into timing profiles using a mutation rate that is twice the rate calculated for synonymous substitutions in Helianthus annuus gene sequences [37,38]. This analysis showed the proliferation of LTR-REs in the last 15 MY (Figures 6 and 7). The species presented different insertion time profiles specific to the different lineages. Most of the lineages of the Copia superfamily showed a proliferation peak at about 1 MYA (Figure 6), except for elements belonging to some lineages (SIRE, TAR and Tork) that showed older proliferation peaks in certain species, such as Lactuca sativa and Cynara cardunculus var. scolymus. The lineages belonging to the Gypsy superfamily were generally older and showed abundant proliferation activity between 1 and 5 MYA (Figure 7). Appreciable differences were also found by studying the proliferation events of the different lineages in the individual species. In Helianthus annuus, all but one lineage of the Copia superfamily revealed proliferation peaks around 1 MYA, while the Bianca lineage still appeared to be going through proliferation events, showing an upward curve ( Figure 6). In Lactuca sativa, the Gypsy lineages Chromovirus and Athila showed proliferation peaks around 1 and 2 MYA, respectively, while Tat elements seemed older, with two different peaks around 8 and 5 MYA (Figure 7).

Discussion
The Asteraceae family is of considerable economic importance, and Helianthus has been a model system for studying the genetic mechanisms of speciation, hybridization, and domestication for more than two decades [39]. However, the characterization and possible involvement of the repeatome of other Asteraceae genomes in evolutionary processes are still poorly studied.
Repetitive sequences have been identified and quantified by hybrid graph-based clustering [26], a strategy commonly used to gain insight into the composition and sequence variation of repetitive components in a pool of related species [21,40,41]. Among the five selected Asteraceae species, repetitive DNA ranged from 60.44% in Cynara cardunculus var. scolymus to 78.44% in Helianthus annuus, similar to what has already been reported for this species by Giordani [42]. On the other hand, differences in transposable elements abundance were observed in the selected species comparing to previous studies [22,32,33,35,43]. Such variability can be due to the different genotypes analyzed, as reported in sunflower [24], or to the usage of diverse methods of repeat discovery and quantification. Clustering analyses, using unassembled reads obtained from low-coverage genome sequencing for estimating the genome proportion of the repeated sequences, is one of the most reliable methods as it has been demonstrated in other study systems [44][45][46].
The genome structure was similar among the analyzed species, with LTR-REs representing the most repetitive sequences. The prevalence of LTR-retrotransposons in the fraction of highly repeated sequences is a common feature of higher plant genomes, where retroelements represent one of the major forces driving genome size evolution [47][48][49] and were previously observed in Asteraceae by Staton [50]. However, striking differences in

Discussion
The Asteraceae family is of considerable economic importance, and Helianthus has been a model system for studying the genetic mechanisms of speciation, hybridization, and domestication for more than two decades [39]. However, the characterization and possible involvement of the repeatome of other Asteraceae genomes in evolutionary processes are still poorly studied.
Repetitive sequences have been identified and quantified by hybrid graph-based clustering [26], a strategy commonly used to gain insight into the composition and sequence variation of repetitive components in a pool of related species [21,40,41]. Among the five selected Asteraceae species, repetitive DNA ranged from 60.44% in Cynara cardunculus var. scolymus to 78.44% in Helianthus annuus, similar to what has already been reported for this species by Giordani [42]. On the other hand, differences in transposable elements abundance were observed in the selected species comparing to previous studies [22,32,33,35,43]. Such variability can be due to the different genotypes analyzed, as reported in sunflower [24], or to the usage of diverse methods of repeat discovery and quantification. Clustering analyses, using unassembled reads obtained from low-coverage genome sequencing for estimating the genome proportion of the repeated sequences, is one of the most reliable methods as it has been demonstrated in other study systems [44][45][46].
The genome structure was similar among the analyzed species, with LTR-REs representing the most repetitive sequences. The prevalence of LTR-retrotransposons in the fraction of highly repeated sequences is a common feature of higher plant genomes, where retroelements represent one of the major forces driving genome size evolution [47][48][49] and were previously observed in Asteraceae by Staton [50]. However, striking differences in abundance and variability were observed after analyzing the different LTR-REs from the superfamily to the lineage level.
The ratio between the abundance of Gypsyand Copia-related sequences was highly variable, ranging from 0.35 in the chrysanthemum to 6.47 in the sunflower. The TE abundance biased towards Gypsy TEs was observed in Asteraceae by Staton [50], suggesting that the two superfamilies have contributed differently to the genome community. Generally, in Angiosperms, Gypsy elements are more abundant than Copia elements, with valuable exceptions, such as pear, date palm, and banana [11]. However, this ratio is not apparently related to the taxonomy of species. The large variability of this ratio among the selected Asteraceae species confirms the data reported for higher plants (Angiosperms and Gymnosperms [11]) at the intrafamily level.
At the lineage level, among Copia lineages, SIRE elements were by far the most abundant in all analysed species, varying from 3.88% in lettuce to 29.91% (i.e., more than 7-fold) in chrysanthemum. Regarding Gypsy lineages, Chromovirus elements were the most frequent in the genomes, and their abundance varied from 2.07% in chrysanthemum to 30.42% (i.e., more than 14-fold) in sunflower. The predominance of SIRE and Chromovirus elements has also been observed in other Asteraceae genera, including Hieracium [45], Senecio [46], and Stevia [51] These variations indicate that the high amplification rate was maintained in certain species even after speciation or that other rearrangements, such as duplications of chromosomal fragments, may have occurred, producing such large variations. These results suggest that after species separation, the repetitive components underwent different rates of amplification/loss but also that new LTR-RE sublineages originated (by mutations or by horizontal transfer) in the genomes. This is because DNA repeats can co-evolve but also have a different and independent evolution with respect to the genome of the host [4].
The hybrid clustering of Illumina short reads from five species also provided information about an "average" composition of the analysed genomes, showing the extent of sharing repetitive sequences within this family.
On average, repetitive DNA represented about 67% of this "metagenome". However, most of this repetitive fraction was comprised of repeats specific to each species, i.e., most repeats were not shared between Asteraceae species. In this sense, only the most abundant repeats of each species were represented in the clusters of the metagenome.
Moreover, the analyzed species shared ribosomal DNA sequences, while the other classes of repetitive DNA were generally species-specific. The exceptions were Artemisia annua and Chrysanthemum seticuspe, both belonging to the tribe Anthemideae, which shared several repeat clusters.
The dendrogram obtained by hierarchical clustering analysis ( Figure 3) did not recapitulate the phylogenetic relationship between the five species, except for the two species belonging to the same tribe (A. annua and C. seticuspe), for which the dendrogram was consistent with the Asteraceae phylogeny. This suggests that the evolution of LTR-REs was partially independent of the evolution of such species, and that individual genomes have undertaken different evolutionary dynamics in the composition and abundance of repeated elements following speciation. This aspect is not surprising given the potential autonomy of these elements in replication within the host genomes [4].
Other analyses were performed to identify and characterize full-length elements belonging to the LTR-RE fraction of the repetitive DNA, i.e., the most abundant REs in the genome of each selected species, using the available genome assemblies (at both chromosome and scaffold levels).
Overall, 48,872 full-length LTR-REs were retrieved from the five analyzed species. Most of the full-length LTR-REs, about 71%, were isolated in sunflower, the species with the largest genome (3.6 Gbp) [31] and the largest abundance in repeats [24]. However, many full-length elements were identified and characterized for the first time in the other Asteraceae species evaluated in this study. The isolation of full-length LTR-REs enabled us to obtain important information about the variability and phylogeny of REs within the studied genomes. Indeed, full-length LTR-REs present highly conserved domains that may preserve their functionality and allow effective reconstruction of the evolutionary dynamics that lead to the differentiation of the repeatomes within Asteraceae.
The phylogenetic trees showed a well-defined clustering of RT-encoding sequences according to the LTR-RE lineages within each species (Figures 4 and 5), indicating that LTR-RE lineage separation occurred before Asteraceae speciation.
However, in RT-related dendrograms constructed by separating LTR-RE lineages (Supplementary Figures S1 and S2), the separation among species was less defined, suggesting that different sublineages had undergone different transposition rates after speciation.
Finally, a large variability was also observed concerning the temporal profiles of transposition bursts, established by comparing LTR sequences of isolated full-length elements [36]. As a result of the amplification burst(s) that may have occurred, our data on the LTR-RE insertion age (Figures 6 and 7) demonstrate that RE amplification occurred at different times for different species.

Conclusions
Our study exploits the potentiality of massive parallel sequencing technologies applied to the analysis of genome structure and evolution, representing a first contribution towards the metarepeatome of the Asteraceae family. The identification and characterization of repeat sequences in these species will aid in genome annotation, as well as in the development of molecular markers for breeding programs. Overall, a large variability of repeat abundance at superfamily, lineage, and sublineage levels was observed, suggesting that the repeatomes within individual genomes followed different evolutionary and temporal dynamics, indicating that different events of amplification or a loss of most LTR-RE lineages occurred after species separation. This is in line with studies highlighting the potential autonomous nature of repeats [4]: cases of species-specific huge amplification of LTR-RE lineages were already reported in sunflowers [52,53], where LTR-REs were identified as retrotranspositionally active [54]. Further analyses related to the mobility of retrotransposons will be useful to define with more precision the evolution of the repetitive component along the selected genomes, knowing that LTR-REs can affect not only the coding portion of the genome but also modify the cis-regulatory sequences of the genes, with possible heritable phenotype changes in plant species.

Sequence Data Collection
After exploring the data available in the NCBI GenBank, five economically relevant species of the Asteraceae family were chosen. In particular, the genome assembly and read packages produced by NGS Illumina sequencing techniques of Helianthus annuus, Lactuca sativa, Cynara cardunculus var. scolymus, Artemisia annua, and Chrysanthemum seticuspe were selected and downloaded.
FastQC v0.11.5 [55], software embedded in the Galaxy platform of RepeatExplorer2 [56], was used to perform sequence quality checks of the FASTQ-formatted read packages. At the end of the process, the software provided a quality report. Trimming by Trimmomatic v0.39 [57] was performed based on the quality control results to clean up the read datasets and to make subsequent analyses easier and more accurate. Using this tool, reads with a low-quality score were discarded, and adapters were removed. All reads containing organellar DNA sequences were removed using CLC-BIO Genomic Workbench 9.5.3 (CLC-BIO, Aarhus, Denmark) against a library consisting of the chloroplast sequences of the five Asteraceae species (NCBI codes: MK341452.1, Helianthus annuus; AP007232.1, Lactuca sativa; KP842713.1, Cynara cardunculus var. scolymus; PKPP01000155.1, Artemisia annua; NC_040920.1, Chrysanthemum lucidum) and the mitochondrial sequence of Helianthus annuus (NCBI code: CM007908).

Clustering Analyses with RepeatExplorer2
The reads of all five Asteraceae species, processed as above, were used to perform hybrid clustering with RepeatExplorer2. A total of 1,000,000 reads (forward and reverse) extracted from the input files of each species were used for this analysis. The resulting clusters were built by an all-to-all comparison of sequence reads to reveal their similarities and represent different repetitive element subfamilies. This tool also provided a list of superclusters, i.e., clusters of shared paired-end reads representing the same repeat family.
Similarity searches by blastn and tblastx, using the BLAST package v2.6.0+ [58] with default parameters, were performed on the remaining unknown clusters against a library of repetitive sequences belonging to sunflower, SUNREP [23], to increase the number of annotated clusters.

Phylogenetic Analysis of LTR-REs
The pool of LTR-REs was analysed to isolate sequences corresponding to the reverse transcriptase (RT) protein domains. The RT domain was chosen because it represents a protein region essential for the transposition process (present in both superfamilies) and is, therefore, conserved among species. The sequences were aligned using MAFFT v7.475 [61], and then ClustalW v2.1 [62] was used to build neighbour-joining (NJ) trees. The NJ trees were edited with R software [63]. The robustness of the trees was tested by repeated random resamplings for 100 interactions. Phylogenetic trees were constructed by separating the species or LTR-RE lineages.

Evaluation of the Insertion Time of LTR-REs
The age of insertion of the LTR-REs was estimated by comparing the LTR sequence at the 5 end and the LTR sequence at the 3 end of each full-length element [36]. The two LTRs of each element were first aligned using the Stretcher tool (EMBOSS package v6.6.0.0) [64], and then the nucleotide distances between the LTRs were measured using the Kimura two-parameter method (K2P) [65] implemented in the Distmat tool (EMBOSS package) [64] using an in-house built perl script. The K2P method is one of the most widely used mathematical models for predicting nucleotide substitutions, i.e., mutations caused by exchanging one nucleotide with another. For the analyzed sequences, the Kimura distances were converted to MYA using a synonymous substitution rate that is twice that calculated for sunflower genes, i.e., 2 × 10 −8 [21].