A Transposon Story: From TE Content to TE Dynamic Invasion of Drosophila Genomes Using the Single-Molecule Sequencing Technology from Oxford Nanopore

Transposable elements (TEs) are the main components of genomes. However, due to their repetitive nature, they are very difficult to study using data obtained with short-read sequencing technologies. Here, we describe an efficient pipeline to accurately recover TE insertion (TEI) sites and sequences from long reads obtained by Oxford Nanopore Technology (ONT) sequencing. With this pipeline, we could precisely describe the landscapes of the most recent TEIs in wild-type strains of Drosophila melanogaster and Drosophila simulans. Their comparison suggests that this subset of TE sequences is more similar than previously thought in these two species. The chromosome assemblies obtained using this pipeline also allowed recovering piRNA cluster sequences, which was impossible using short-read sequencing. Finally, we used our pipeline to analyze ONT sequencing data from a D. melanogaster unstable line in which LTR transposition was derepressed for 73 successive generations. We could rely on single reads to identify new insertions with intact target site duplications. Moreover, the detailed analysis of TEIs in the wild-type strains and the unstable line did not support the trap model claiming that piRNA clusters are hotspots of TE insertions.


Introduction
Transposable elements (TEs) are major components of almost all eukaryotic genomes [1,2]. They can be separated into three main groups that include several TE superfamilies and families: repetitive regions like PIWI-interacting RNA (piRNA) clusters that contribute to maintaining genome integrity by repressing TE mobility.
Here, we developed strategies to generate de novo assemblies of high quality long-read sequencing data, suitable for genomic analyses of TEs present at high and low frequencies in Drosophila populations. We first validated our method by comparing the data (genome size, TE content and TEI site estimation) obtained by short and long-read sequencing in D. melanogaster and D. simulans, two closely related species, but that may vary in TE content [24,25]. We found that, although the D. simulans genome contains a large number of old and degraded TE copies, among the most recent pool of insertions, DNA transposons display higher intra-family sequence divergence than LTR elements, suggesting that elements of this group invaded the genome more recently than DNA transposons. Moreover, we observed that piRNA production correlates with TE genome occupancy. When considering the most recent pool of TE insertions, we could not find convincing evidence supporting the piRNA clusters trap model [26,27]. Finally, we developed and validated an approach to identify TEI that occur at low frequencies in a population.
A previously published D. melanogaster laboratory line [28] was used for Piwi knockdown (piwi KD) in adult follicle cells. This line carries three components: (i) a GAL4 UAS activator driven by the follicle cell-specific traffic jam (tj)-promoter (tj-GAL4), (ii) an UAS short hairpin(sh)-piwi that induces Piwi RNAi, and (iii) the ubiquitously expressed thermo-sensitive GAL4-inhibitor GAL80 ts . At 20 • C, GAL80 ts sequesters GAL4, preventing sh-piwi expression. At 25 • C, GAL80 ts is partially inactive, allowing some GAL4-driven expression of sh-piwi in somatic follicle cells. The resulting partial Piwi depletion allows for the derepression of at least two LTR families (ZAM and gtwin) in follicle cells and their integration as new proviruses in the progeny genome [28]. The polymorphism of this line was partially reduced by isolating a single pair of parents and the line was thereafter stably maintained at 20 • C as a large population (more than 500 progenitors at each generation). The G0 and G0-F100 genomic libraries were prepared shortly after isolation of this line and at the hundredth generation, respectively. Soon after isolation of this isofemale line, a subset of individuals at the pupal to early adult stages was shifted to 25 • C for 5 days, and this was repeated for at least 500 flies for 73 successive generations of partial Piwi KD. Then, after six more generations of stabilization at 20 • C, a third genomic library, called G73, was generated.

Genome Size Estimations
Flow cytometry: genome size was estimated according to [29] using fresh samples of 4-day-old females heads with 10 replicates (five heads per replicate) for each Drosophila wild-type strain.
findGSE: k-mer distribution was established from the Illumina reads using findGSE [30]. Briefly, adaptors were first removed from the reads with Skewer version 0.2.2 (paired-ends) or NxTrim version 0.4.3-6eb8d5e (mate pairs), when necessary. Reads were then treated essentially as previously described [31] to remove duplicates, filter out reads mapping to reference mitochondrial genomes Cells 2020, 9,1776 4 of 23 (GenBank AF200854.1 and AF200828.1 [32]) or microbial contaminants. This allowed for establishing the 21-mer distributions from which genome sizes were estimated using findGSE [30] with default parameters, except for dmsj23 in which the k-mer distribution clearly displayed a peak corresponding to heterozygous regions and was thus treated accordingly.

Illumina Sequencing
Wild-type strains: DNA was extracted from 3 to 5-day-old females for each strain using the Qiagen DNeasy Blood&Tissue kit (# 69506) and following the manufacturer's instructions. Genomic DNA (1.5 µg) was fragmented for a target insert size of 300 base pairs and sequenced by paired-end Illumina HiSeq (125 bp reads). Library and sequencing were performed by the GeT-PlaGe facility, Génopole Toulouse (France).

DNA Isolation, Oxford Nanopore MinION Sequencing and Base Calling
DNA was extracted from ∼100 males from each wild-type and from thePiwi KD lines using the Qiagen DNeasy Blood&Tissue kit. The genomic DNA quality and quantity were evaluated using a NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) and a Qubit ® 1.0 Fluorometer (Invitrogen, Carlsbad, CA, USA), respectively. Three micrograms of DNA were repaired using the NEBNext FFPE DNA Repair Mix (NEB M6630). End repair and dA-tailing were performed using the NEBNext End repair/dA-tailing Module (E7546, NEB). Ligation was then performed with the Ligation Sequencing Kit 1D (SQK-LSK108, ONT, for G0, and SQK-LSK109 ONT for wild type strains, G73 and G0-F100 samples). MinION sequencing was performed according to the manufacturer's guidelines using R9.4.1 flow cells (FLO-MIN106, ONT) and a Nanopore MinIon Mk1b sequencer (ONT) controlled by the ONT MinKNOW software (version 18.3.1 for G0, version 19.05.0.0 for isogenic wild-type strains, and version 19.10.1 for the G73 and G0-F100 samples). Base calling was performed after sequencing using Albacore (version 2.3.3) for G0, and the GPU-enabled guppy basecaller in high accuracy mode for isogenic wild-type strains (version 3.1.5), G73 (version 3.3.3) and G0-F100 samples (version 3.4.4).

TE Content and TEI Site Estimates from Illumina Sequencing
TE abundance was estimated using forward reads and two methods: the TEcount module of TEtools [33] and dnaPipeTE (v1.0.0 and v.1.3.1) [11]. TEcount estimates TE abundance by quantifying reads that map to a set of known TE sequences, here the rosetta fasta file [34]. This tool was run using default parameters and Bowtie2 (v2.2.4) [35,36]. dnaPipeTE assembles repeated sequences from a subsample of reads (<1x) and quantifies reads mapping to these sequences to estimate TE abundance. dnaPipeTE was used with the following parameters: -sample_number 2, -genome_coverage 0.25). Concerning the genome size option, 175 Mb and 147 Mb were used for D. melanogaster and D. simulans samples, respectively. The rosetta fasta file was used as library [34].
TEIs were detected in Illumina sequencing data using a dedicated mapping-based algorithm similar to that implemented in PoPoolationTE2 [17] with paired-end reads as input, FlyBase reference genomes (ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_ r6.16_FB2017_03/fasta/dmel-all-chromosome-r6. 16.fasta.gz and ftp://ftp.flybase.net/genomes/Dro sophila_simulans/dsim_r2.02_FB2017_04/gtf/dsim-all-r2.02.gtf.gz), and the TE sequence library at https://github.com/bergmanlab/transposons/raw/e2a12ff708c42dcce5b15d6af290506d78021212/relea ses/D_mel_transposon_sequence_set_v10.1.fa. Sequencing reads are mapped to the reference genome and TE sequences using Bowtie2 (version 2.3.3) [36]. Then, the algorithm scans the resulting Binary Alignment/Map (BAM) files for pairs in which one end matches to the reference genome, the other end to a TE sequence, and the pair cannot be mapped concordantly to the genome. For each pair, the position of the genome-mappable read is noted, and positions are clustered in order to have no read further apart than 100 bp in that cluster. Each cluster is then interpreted as an insertion, the position of which is the mean of the position of the reads it contains, and the strength of which is evaluated on the basis of the number of reads it contains. For the purpose of this study, only insertions that were supported by at least 50 reads were retained. Unlike PoPoolationTE2, the insertions detected with this procedure correspond to occurrences absent from the reference genome.

Small RNA Extraction and Sequencing
For small RNA sequencing, two replicates per strain were prepared. Small RNA was isolated from 50 pairs of ovaries using HiTrap Q HP anion exchange columns (Cytiva, Velizy-Villacoublay, France) as described in [37], and the eluate was run on a 10% TBE urea gel (Thermo Fisher Scientific). Small RNA size selection (18-50 bp) was performed on gel at the sequencing facility. Quality was checked with the Bioanalyzer small RNA kit (Agilent, Santa Clara, CA, USA). Library construction was performed using the TruSeq Small RNA Library kit (Illumina, San Diego, CA, USA) and sequenced (1 × 50 single reads) on an Illumina HiSeq 4000 at the IGBMC Microarray and Sequencing facility. Adapter sequences were removed using cutadatp [38]. Size selection was then performed using PRINSEQ lite version 0.20.4 [39]. All subsequent analyses were built upon small RNA counts after normalization according to the miRNA amounts, as described in [34].

Global Structural Variant Detection
Global variant detection (i.e., variants common to most genomes of a considered sample compared with the reference genome, see below) was performed using the svTEidentification.py tool (available at https://github.com/DrosophilaGenomeEvolution/TrEMOLO). Briefly, this tool recovers the insertion and deletion positions and creates the associated fasta sequence, based on the Assemblytics report from the RaGOO scaffolding (the deletions are extracted from the reference and the insertions from the new assembly). Once the fasta file corresponding to the SVs was recovered, these sequences were matched with the Basic Local Alignment Search Tool, nucleotide to nucleotide (BLASTN)+ v2.4.0 to a specified TE database. Hits larger than 80% of the TE sequence and identical to more than 80% at the nucleotide level were considered as candidate for new TE insertions/deletions (TEI/TED) in the G0, G0-F100 and G73 samples. For wild-type strains, new insertions/deletions were detected without any filter. The potential candidates were then listed in a tabular format that included their position, size and percentage of size or similarity compared with the reference TEs. The used TE database was a collection of the reference TEs from Bergman's laboratory (https://github.com/bergmanlab/transposons) and from previously published data [50].

LTR Minor Insertional Variant (LTR MIV) Detection
Each raw long read was mapped using minimap2 v2.16 (-ax map-ont -t 16 as options) to the assembly corresponding to that set of long reads. After recovering the sam file, samtools v1.10.0 was used to compress and sort the sam file in BAM with samtools view and samtools sort (basic options, but with 16 threads), and the MD tag was added using samtools calmd. Then, SV were detected in the resulting sorted BAM file using Sniffles v1.0.10 with at least 1 read and -report_seq -s 1 -n -1 as parameters [51]. These sequences longer than 1000 bp were aligned with BLASTN v2.4.0+ (-outfmt 6) to the LTR subset (60 families) of the database used before. A nucleotide alignment of more than 94% identity and a minimum of 90% of the total length of the TE consensus sequence were then considered as criteria to validate a putative LTR minor insertion variant (LTR MIV), if the length of the variant did not exceed the total size of the TE by more than 18 nt. This corresponds to the largest target site duplication (TSD) ever reported to flank any LTR TE [52]. All codes are available in a snakemake file at https://github.com/DrosophilaGenomeEvolution/TrEMOLO.

Fluorescent In Situ Hybridization on Polytene Chromosomes
Polytene chromosomes were squashed from salivary glands of third instar male larvae. NotI and PstI restriction enzymes were used to extract a fragment of the ZAM pol gene from a previously published plasmid [53]. The probe was labeled with digoxigenin-11-dUTP using the Nick Translation Mix (Roche #11 745 816 910), and signals were detected with anti-digoxigenin-rhodamine Fab fragments (Roche). The fluorescent in situ hybridization method was adapted from a previously described protocol [54].

Automatic Identification of the Target Site Duplication for LTR MIV
The putative LTR MIVs matching to six LTR families (blood, gtwin, mdg3, ZAM, roo, and copia) were studied. One read supporting each MIV, previously extracted in a fasta file, was compared by BLASTN v2.4.0+ with the corresponding consensus sequence. To automatically check for the presence of a TSD, the positions of the 5 and 3 end of the TE alignment were determined within the read. 30nt-long sequences upstream and downstream the putative insertion site were extracted and were aligned to detect the presence, on both sides of the insertion, of a short duplication, the size of which was previously reported by [55] for ZAM and by [52] for the other TEs. The resulting TSD sequences were then extracted and used to create sequence logos with WebLogo (https://github.com/WebLogo/weblogo). All scripts and codes for this automatic extraction are available at the project GitHub.

piRNA Cluster Identification in the Assembled Genomes
To determine the piRNA cluster localization in genome assemblies, a previous annotation of piRNA clusters in the D. melanogaster Dmel_R6.04 genome release was used [56]. The flanking genes for each of the 153 major piRNA clusters were identified, their sequence was extracted and mapped to the new reference using BLASTN to locate the limits of the corresponding piRNA clusters in the corresponding assemblies. When only a single gene could be used as border, the piRNA cluster length described in [56] was used to define the other border. Bona fide piRNAs were extracted from the previously published G0 small RNA-seq library [28], and from each of the small RNA-seq libraries presented here, as reads longer than 23 nt that do not map (bowtie -best) to sequences of other known small RNAs (downloaded from FlyBase [57] and MirBase [58]). These selected small RNA reads were then mapped to the corresponding assemblies using Bowtie 1.2.2 [59]. Bowtie parameters were selected to keep only reads that display unique alignments and <2 mismatches (-best -v 2 -m 1). The positions of uniquely mapped reads were determined in the assembly, and sequences with more than 500 reads were conserved and compared to the piRNA cluster coordinates determined in the assembly of that line. Table S4 shows the list of the 42 piRNA clusters corresponding to the best piRNA producers in the G0 line. The coordinates of these 42 regions were then determined in the G73 and G0-F100 assemblies. For wild-type strains, the piRNA abundance was computed within 1 kb windows.

Comparison of ZAM Sequences
After obtaining the corresponding region of the ZAM insertions the fasta sequence was extracted (using bedtools getfasta) and compared with the ZAM sequence at a global level using redotable v1.1.

Using Oxford Nanopore Technologies (ONT) to De Novo Assemble the Highly Contiguous Genomes of Several Isogenic Wild-Type Strains and of one Unstable Line
The ONT-based single-molecule long-read sequencing data provided between 5 and 24 million reads, with a depth of coverage ranging from 40x to 196x (mean = 130x), and a N50 ranging from 3.7 to almost 20 kb (mean = 11 kb) (QC 7 reads only; Table S1). The N50 large range was explained by the different methods used for genomic DNA extraction and ligation (Materials and Methods). Our assembled genome procedure is summarized in Figure 1. To compare our data with the reference D. melanogaster and D. simulans genomes, whole genome alignments and local dot plots were performed using D-genies and Gepard, respectively (Figure S1).
A strong correspondence was observed between most de novo assemblies and the corresponding reference genome, except for the G73 and dsgoth31 assemblies in which incongruent contigs were detected. These incongruent contigs were manually broken at the discrepancy points ( Figure S1) and the final statistics for the de novo assemblies were obtained using Assembly-Stats (Table 1).
Using our approach based only on ONT data, the N50 ranged from 1.2 Mb (L50 of 33 contigs) to 21 Mb (L50 of 3 contigs). The previously described de novo D. melanogaster hybrid assembly obtained using BioNano and assembly merging [23] reported a N50 of 9 Mb (L50 of 6 contigs) for the raw data, and a N50 of 21.3 Mb (L50 of 3 contigs) after merging. Moreover, the BUSCO score of their hybrid assembly was 97.2% after Illumina polishing, while the BUSCO score of our assemblies ranged from 93.7% to almost 99% (98.5% for the reference Dmel_R6.23 assembly [23]) only with RACON polishing. This comparison indicates that our assemblies are of high quality, and that RaGOO use as scaffolder allowed obtaining high-quality assemblies at the chromosome scale.

Estimation of Genome size Using Different Methods
To determine the quality of the ONT-based assemblies of the isogenic wild-type D. melanogaster and D. simulans genomes, their sizes were compared to the genome sizes estimated with two other approaches: findGSE (based on k-mer estimation) and flow cytometry (Table S2).
Genome size estimates varied between 142 and 144 Mb (flow cytometry) and 129 and 132 Mb (findGSE) for the D. simulans strains and between 162 and 163 Mb (flow cytometry) and 133 and 137 Mb (findGSE) for the D. melanogaster strains after excluding dmsj7. The k-mer distribution obtained for this strain was much more scattered than the others, and resulted in a k-mer-based genome size estimate of 147 Mb, most probably an artefact. The size estimated obtained using the ONT data ranged between 131 and 142 Mb for the wild-type D. simulans strains and between 130 and 134 Mb for the D. melanogaster strains, with similar values for the final assemblies. The correlation coefficients were significant only between the ONT-based and the flow cytometry estimates for D. melanogaster (r = 0.9675, p = 0.0325), but not D. simulans (flow cytometry: r = 0.7564, p = 0.2436; findGSE: r = 0.1237, p = 0.8763). The correlation only with the flow cytometry estimate indicates that the different genome compositions, and probably the different amounts of heterochromatin affect the estimations obtained by findGSE. The genome size estimates obtained with findGSE were globally more similar than those obtained using the de novo assembly approach, but no correlation was observed between these values, probably due to the different amounts of repeats present in the various strains. In conclusion, genome size estimations present several biases in function of the used method, and ONT assemblies seem to give values close to those obtained by flow cytometry, which is a more global method.

Comparison of TE Abundance in the Isogenic Wild-Type Strains Measured by Illumina and ONT Sequencing
To validate the ONT approach, the TE abundance in the isogenic wild-type D. melanogaster and D. simulans strains was evaluated using dnaPipeTE [11] and TEcount [33] for Illumina sequencing data, and RepeatMasker for ONT assembled chromosomes ( Figure 2). Overall, TE content (expressed as genome percentage) was often higher when estimated using dnaPipeTE (Illumina data) (Wilcoxon matched-pairs signed rank test; p = 0.0234) than with RepeatMasker (ONT assemblies) (Wilcoxon matched-pairs signed rank test; p = 0.0156). This might be explained by the fact that unlike the RepeatMasker TE database, dnaPipeTE is based on the de novo detection of TEs and the local assembly of TE families, independently of a previously annotated reference genome, thus recovering the maximum number of reads that correspond to known and unknown TEs. In agreement, the correlation was higher between the results obtained with RepeatMasker (ONT data) and the results obtained with TEcount, which is based on the read similarity against a curated database of known TEs [34] (r = 0.8921, p < 0.0001), than with dnaPipeTE (r = 0.8504, p < 0.0001) (Figure 2, right panel). As previously reported, Cells 2020, 9, 1776 9 of 23 the LTR group was more abundant than the LINE and DNA transposon groups in all Drosophila genomes (see [60] for a review).

Comparison of the TEI Sites Identified in the Isogenic Wild-Type Strains Using the Illumina and ONT Data
Before focusing on the results provided by the ONT approach, we first compared these data to the classically used Illumina results based on discordant pairs of reads (method developed in the laboratory, see Material and Methods). The number of TEI sites tended to be higher when using the Illumina data than ONT data (Wilcoxon paired test, p-value = 0.023). This could be due to the presence of false positives caused by PCR artefacts during the Illumina library preparation [18], and/or to the fact that some TEIs might have been too short (fragmented or partially deleted) to be identified using the assembled ONT data. Using the Illumina approach, TEI numbers were significantly lower in the D. simulans than in the D. melanogaster strains (Wilcoxon test, p-value = 0.029), but not when using the ONT data (Wilcoxon test, p-value = 0.343) (Figure 3). This may reflect a bias towards D. melanogaster sequences in our TE reference file, and/or a long-term difference in TE dynamics between these species [25,61]. Comparisons (chi-square tests) of TEI distributions across TE groups (DNA, LINE, LTR) (see Table S3) showed that in D. simulans, the distributions obtained using both approaches were similar. Conversely, in D. melanogaster, the TEI number for retrotransposons was significantly higher relative to the other groups, when using the Illumina approach. This may be due to the higher propensity of D. melanogaster retrotransposons to be involved in Illumina PCR chimeras [18] because of their higher genome occupancy (Figure 2), and this difference may be amplified by the exponential behavior of the PCR reaction.
In the subsequent analyses, only TEIs identified using the ONT approach (i.e., the most reliable set of recent insertions) were considered.

Comparison of the TEI Sites Identified in the Isogenic Wild-Type Strains Using the Illumina and ONT Data
Before focusing on the results provided by the ONT approach, we first compared these data to the classically used Illumina results based on discordant pairs of reads (method developed in the laboratory, see Material and Methods). The number of TEI sites tended to be higher when using the Illumina data than ONT data (Wilcoxon paired test, p-value = 0.023). This could be due to the presence of false positives caused by PCR artefacts during the Illumina library preparation [18], and/or to the fact that some TEIs might have been too short (fragmented or partially deleted) to be identified using the assembled ONT data. Using the Illumina approach, TEI numbers were significantly lower in the D. simulans than in the D. melanogaster strains (Wilcoxon test, p-value = 0.029), but not when using the ONT data (Wilcoxon test, p-value = 0.343) (Figure 3). This may reflect a bias towards D. melanogaster sequences in our TE reference file, and/or a long-term difference in TE dynamics between these species [25,61]. Comparisons (chi-square tests) of TEI distributions across TE groups (DNA, LINE, LTR) (see Table S3) showed that in D. simulans, the distributions obtained using both approaches were similar. Conversely, in D. melanogaster, the TEI number for retrotransposons was significantly higher relative to the other groups, when using the Illumina approach. This may be due to the higher propensity of D. melanogaster retrotransposons to be involved in Illumina PCR chimeras [18] because of their higher genome occupancy (Figure 2), and this difference may be amplified by the exponential behavior of the PCR reaction.

TEI Landscape in the Isogenic Wild-Type Strains
Using the ONT approach, the de novo genome assembly of each wild-type strain was compared with the reference genome and the detected insertional structural variants were called global variants (see Figure 1). These global variants correspond to the most recent TEIs. On average, there were 492 and 456 global variants in D. melanogaster and D. simulans, respectively (Table 2). DNA transposons were the most abundant group in both species (188 and 215 copies, on average, in D. melanogaster and D. simulans, respectively), and LTR retrotransposons the least abundant (147 and 117 copies, on average, in D. melanogaster and D. simulans, respectively). These results may seem in contradiction with the previous data on genome occupancy. However, in this analysis only recent insertions were considered. Moreover, as DNA transposons are in general smaller than LTR retrotransposons, similar levels of genome occupancy correspond to higher copy numbers for DNA transposons than for LTR retrotransposons.
Comparison of the locations of the insertions identified in the chromosome assemblies showed that 22 global variants were present in all four D. melanogaster strains, and 23 in all four D. simulans strains. These were mainly DNA transposons (n = 9 and n = 10, respectively). The number of shared pairwise global variants was rather low, roughly 10% of all insertions in most comparisons ( Figure  4a). D. simulans strains appeared equally distant in terms of insertion sites. Conversely, a geographical structuring could be observed in the D melanogaster comparisons: strains from the same population shared more insertion sites than strains from distinct populations.  In the subsequent analyses, only TEIs identified using the ONT approach (i.e., the most reliable set of recent insertions) were considered.

TEI Landscape in the Isogenic Wild-Type Strains
Using the ONT approach, the de novo genome assembly of each wild-type strain was compared with the reference genome and the detected insertional structural variants were called global variants (see Figure 1). These global variants correspond to the most recent TEIs. On average, there were 492 and 456 global variants in D. melanogaster and D. simulans, respectively (Table 2). DNA transposons were the most abundant group in both species (188 and 215 copies, on average, in D. melanogaster and D. simulans, respectively), and LTR retrotransposons the least abundant (147 and 117 copies, on average, in D. melanogaster and D. simulans, respectively). These results may seem in contradiction with the previous data on genome occupancy. However, in this analysis only recent insertions were considered. Moreover, as DNA transposons are in general smaller than LTR retrotransposons, similar levels of genome occupancy correspond to higher copy numbers for DNA transposons than for LTR retrotransposons.
Comparison of the locations of the insertions identified in the chromosome assemblies showed that 22 global variants were present in all four D. melanogaster strains, and 23 in all four D. simulans strains. These were mainly DNA transposons (n = 9 and n = 10, respectively). The number of shared pairwise global variants was rather low, roughly 10% of all insertions in most comparisons (Figure 4a). D. simulans strains appeared equally distant in terms of insertion sites. Conversely, a geographical structuring could be observed in the D melanogaster comparisons: strains from the same population shared more insertion sites than strains from distinct populations.   The mean copy numbers for the different TE families were weakly correlated between D. melanogaster and D. simulans (Figure 4b) (Spearman rho = 0.33, p-value = 1e-4, across 129 TE families). Few families were found in the D. simulans strains but not in the D. melanogaster strains, and vice versa. In D. melanogaster strains, the most abundant families were roo (mean copy number: 24.00), jockey (mean copy number: 48.00), and pogo (mean copy number: 44.25), for LTR retrotransposons, LINEs, and DNA elements, respectively. In D. simulans, they were roo (mean copy number: 23.00), Cr1a (mean copy number: 18.50), and hobo (mean copy number: 70.50). In addition, some TE families displayed different copy numbers across strains. For instance, the 297 family had 18 copies in dmgoth63, 6 in dmgoth101, 6 in dmsj23, and 5 in dmsj7. Such patterns are suggestive of recent, independent activations, or even bursts of some families in specific strains, as suggested by in situ hybridization studies in a large number of samples [62]. Kofler et al. (2015) studied TE patterns in D. melanogaster and D. simulans field samples using Illumina pool-seq data [63]. By computing the insertion frequencies for each family of a subset of 121 TE families, they established that LTR elements were more frequent in D.

Comparison of TE Dynamics in Isogenic Wild-Type D. Melanogaster and D. Simulans Strains by Studying TEI Sequences in ONT Assemblies
The major advantage of the ONT approach is its ability to retrieve whole TEI sequences, while short read-based approaches only give access to TE insertion sites. First, the TEI sizes across strains were compared by parsing the BLAST results at the insertion level and by computing the insertion lengths (Figure 5a). The mean insertion lengths (i.e., fragment sizes) significantly varied among TE groups (2-way ANOVA, p-value = 2e-81), but not between species (2-way ANOVA, p-value = 0.22). LTR retrotransposons were the largest (mean size = 2692 bp), followed by LINEs (mean size = 1290 bp), and DNA transposons (mean size = 1210 bp). The observed absence of difference between species in these global variants differs from what was previously described. Indeed, for a subset of 15 families, Lerat et al. found that TE copies were more internally deleted (i.e., shorter) in D. simulans than in D. melanogaster [24]. However, analysis of these 15 families using our ONT data indicated that they displayed, on average, longer fragment sizes compared with the other TE families in D. melanogaster (Wilcoxon test, p-value = 8e-19), but not in D. simulans (Wilcoxon test, p-value = 0.34) [24]. This suggests that Lerat et al. 2011 focused on TE families that have particularly large copies in D. melanogaster [24], probably because they have been more studied in the past due to their easier analysis by in situ hybridization on polytene chromosomes [7,25,64].
Then, the Refiner module of RepeatModeler (http://www.repeatmasker.org/RepeatModeler) was used to compute the intra-family sequence divergence (average Kimura distance) (Figure 5b). This measure is a proxy of the time passed since the last transposition wave(s). Overall, these distributions were not significantly different between D. melanogaster and D. simulans and among TE groups (2-way ANOVA; species effect, p-value = 0.151; group effect, p-value = 0.701), showing that the TE recent dynamics are similar in these two species. However, in D. simulans, DNA transposons displayed significantly higher intra-family divergence compared with LTR retrotransposons (Wilcoxon test, p-value = 0.023). This suggests that among the most recent transposition events, DNA transposon insertions occurred slightly less recently in D. simulans.
Kofler et al. 2015 assumed that population frequencies of TE insertions provide an estimator for the insertion age. However, we find that their population frequencies were not correlated with our measures of intra-family sequence divergence (Spearman correlation coefficients: −0.714 (p-value = 0.136) and 0.116 (p-value = 0.827) for D. melanogaster and D. simulans, respectively). We think that intra-family sequence divergence is a more direct estimate of the age of transposition events; however, this discrepancy may also reflect differences in the origins of the sampled flies [61,64]. Alternatively, it may suggest that other factors influence insertion frequencies, besides the age since the initial transposition burst. In addition, our analysis only included TEIs that are not found in the reference genome, i.e., TEIs that result from transposition events more recent than the set-up of the actual populations. Altogether, while the TE ancient dynamics are different between D. melanogaster and D. simulans [60], the present results suggest that D. melanogaster and D. simulans TE landscapes are rather similar when comparing only global variants (i.e., the subset of the most recent insertions). As already proposed [25], this may reveal that the colonization of D. simulans genome by TEs has now reached a state similar to that of D. melanogaster, although it started more recently. Then, the Refiner module of RepeatModeler (http://www.repeatmasker.org/RepeatModeler) was used to compute the intra-family sequence divergence (average Kimura distance) (Figure 5b). This measure is a proxy of the time passed since the last transposition wave(s). Overall, these distributions were not significantly different between D. melanogaster and D. simulans and among TE groups

piRNAs, piRNA Clusters and TEIs in Isogenic Wild-Type Strains
Another way to study TE dynamics is to understand the way the production of piRNAs is linked to the TEI type and structure. Indeed, some relationships might exist between piRNA abundance and the recent activity of TEs, estimated by the intra-family sequence divergence. Therefore, piRNA production, TE length and intra-family sequence divergence were analyzed for each TE group and strain. This analysis highlighted a significant TE group effect: piRNA counts were higher in retrotransposon families (LTR elements and LINEs) than in DNA transposon families (p-value = 2e-9). Moreover, piRNA counts were significantly and positively correlated with genome occupancy (p-value = 5e-7), which strongly depends on TE copy number (Figure 6a). The hypothesis that TE copy numbers determine piRNA abundance was previously suggested in D. melanogaster [65,66] and is confirmed here also for D. simulans. However, it should be noted that genome occupancy accounts only for 6.2% of the total variation of piRNA counts, indicating that many other factors are involved in TE control.  Figure S2. The off-scale peaks might correspond to microRNAs that are absent from miRBase.

Recent TEIs May not be Frequent Enough to Be Incorporated in the Assembled Genomes
To challenge the ONT assembly approach, a bioinformatic analysis was performed to identify recent LTR TEIs that occurred during the last 73 generations (G73) in the unstable Piwi KD line (Materials and Methods and [28]). As a control, to estimate the basal transposition rate when TEs are normally repressed by the functional piRNA pathway, the genome of the hundredth generation (called G0-F100) after establishment of the stable G0 isofemale line was also sequenced. Using the pipeline for detection of global variants (Figure 1), no new ZAM insertion could be detected in the G73 assembled genome compared with the G0 reference genome. This is not consistent with previous data obtained by PCR quantification of the ZAM copy number [28]. Therefore, in situ hybridization  Figure S2. The off-scale peaks might correspond to microRNAs that are absent from miRBase.
These observations are also in agreement with the idea that newly integrated copies become piRNA producers [67], and that longer copies produce more piRNAs. It should be noted that retrotransposons are on average longer than DNA transposons.
As ONT assemblies also include piRNA cluster sequences, 42AB and flamenco (the two major piRNA cluster producers in D. melanogaster) could be retrieved using their flanking genes (see Material and Methods) [68] from each assembly. Alignment of the uniquely mapped piRNA sequences against the assembly of each wild-type isogenic strain (Figure 6b and Figure S2, black lines) indicated that the regions corresponding to 42AB and flamenco did not display any enrichment in global variant insertion numbers (Figure 6b, gray lines). This indicates that recent TEIs are not specifically enriched in the two major piRNA cluster producers in D. melanogaster and D. simulans strains. Therefore, the analysis of the de novo assembled genomes to follow the piRNA cluster dynamics in these isogenic wild-type strains did not highlight the previously reported high TEI insertion rate within piRNA clusters [26,50,69,70]. Our data suggests the number of recent TEIs fixed in these piRNA clusters is not different compared with anywhere else in the genome. This discrepancy could be explained by the high frequency of deletions (from several base pairs up to several kilobases) that seems to occur in these regions and that affect ancient TEs, which remain as vestiges in these loci, and also recently inserted TEs [50].

Recent TEIs May Not Be Frequent Enough to Be Incorporated in the Assembled Genomes
To challenge the ONT assembly approach, a bioinformatic analysis was performed to identify recent LTR TEIs that occurred during the last 73 generations (G73) in the unstable Piwi KD line (Materials and Methods and [28]). As a control, to estimate the basal transposition rate when TEs are normally repressed by the functional piRNA pathway, the genome of the hundredth generation (called G0-F100) after establishment of the stable G0 isofemale line was also sequenced. Using the pipeline for detection of global variants (Figure 1), no new ZAM insertion could be detected in the G73 assembled genome compared with the G0 reference genome. This is not consistent with previous data obtained by PCR quantification of the ZAM copy number [28]. Therefore, in situ hybridization analysis was performed to determine whether de novo ZAM insertions were present on polytene chromosomes of G73 male larvae (Figure 7a and Figure S3). This analysis confirmed the presence of the two preexisting ZAM insertions identified on chromosome 2R as global variants in the G0 de novo assembly (compared with the Dmel_R6.23 reference genome). These two insertions were also detected in all three G73 larvae analyzed, as well as many other ZAM signals that were not observed in the G0 samples (Figure 7a and Figure S3). As each of these many G73-specific new ZAM insertions was present in a single larva, they were not incorporated in the G73 de novo assembled genome due to their low frequency, and therefore could not be detected as global variants. Based on the G0 assembled genome, the sequences of the two shared ZAM detected by FISH on chromosome 2R could be accessed. One contained the full length canonical ZAM consensus sequence, while the other displayed an internal deletion (Figure 7b).

A Long Read-Based Pipeline to Detect Low Frequency TEI Polymorphisms
To determine whether ONT can be used to detect TEIs with a frequency not high enough to be recovered in the assembled haplotype, an approach to identify "minor insertional variants" (MIV) was developed (Material and Methods, paragraph 2.9, and Figure 1 (gray)). Minimap2 was used to map each individual long read to the corresponding assembled genome, and Sniffles to obtain the list of variants that had been neglected during the assembling process. Some of the sequences identified as MIVs matched to the 60 canonical LTR TE consensus sequences (Materials and Methods).
As expected, very few LTR MIVs were detected in the G0-F100 "stable line". Only copia and roo, which have high transposition rates [71], exhibited more than four variants (14 and 22, respectively) among the 51 LTR MIVs detected (Figure 7c). Also in the G73 line, copia and roo were among the more active LTR families (35 and 48 LTR MIVs among the 274 LTR MIVs detected) (Figure 7c). However, two other LTR families, ZAM and gtwin (51 and 93 LTR MIVs, respectively), showed a 50-fold increase in G73 compared with G0-F100, which is more than an order of magnitude higher than what observed for any other LTR family.  The next question was to determine whether the 274 LTR MIVs, present at low frequency in G73, had occurred after the establishment of the isofemale line. Indeed, such insertions could have been already present in G0 at high frequency (and therefore, could have been incorporated in the G0 but not in the G73 assembled genome) or at low frequency (and, therefore, detectable only as MIVs in G0). The first hypothesis was ruled out by comparing global deletions in G73 and G0. Very few G0 insertions were lost in the G73 assembly and they all belonged to five LTR families (mdg3, Transpac, 3S18, blood, and driver) that did not show a large MIV increase in G73 (data not shown). The total absence of LTR MIVs in G0 was not in favor of the second hypothesis.
As a large fraction of the 274 LTR MIVs in G73 were supported by a single read (Figure 7d), the next step was to check whether they were bona fide insertions by looking for insertional hallmarks, such as the target site duplications (TSDs) that occur upon integration as a result of staggered double-strand breaks at this site [72]. Flanking duplications were first detected automatically for each of the top six LTR families (mdg3, blood, copia, roo, ZAM, and gtwin) by aligning the two 30nt-long sequences that flank each putative LTR MIV extracted from the read supporting the variant. This analysis showed that depending on the LTR family, 30-80% of MIVs were flanked by a short duplication of the expected size (4 or 5 nt) ( Table 3) [52]. The TSD consensus sequences identified are presented in Figure 7e. The failure to automatically detect a TSD for the other LTR MIVs could be due to the frequent sequencing errors, a known ONT drawback. When located in the genome-LTR junction region, such errors, which may include several nt-long indels, could impair the automatic detection of the expected TSD, as shown in Figure S4 for the manual inspection of the 2R-33863 putative ZAM insertion. Even when junctions are correctly determined, a simple sequencing error in one of the duplicated sequences might prevent their perfect matching. However, it was possible to correct the errors present in these single reads by aligning them with the empty genomic target present on the assembled genome (see, Figure S4). Using this method to manually inspect the sequence of all 51 ZAM variant reads, 48 bona fide insertions were identified, as judged by the presence of the expected 4-nt TSD included in a palindromic GC-rich 6-nt target site motif (TSM) (Figure 7f) [52,55].
Therefore, despite ONT low sequencing accuracy, LTR MIVs could be detected with high sensitivity (insertions present in a population at a frequency <1%, because detected as single reads in a 197x average coverage library) and specificity (FDR of 3/51 = 6%).

Invading LTR Elements Are Not Preferentially Trapped by piRNA Clusters
It is widely assumed that a TE invasion is stopped when a member of the TE family jumps into a piRNA cluster that then triggers the production of piRNAs to repress this TE family (i.e., trap model) [27]. Long-read sequencing data allowed determining whether new insertions accumulated in major piRNA source loci during the 73 generations of LTR TE derepression. Comparison of the 42 major piRNA clusters after their localization in the G0 and G73 assemblies (Table S4) did not highlight any new TEI into any of these piRNA clusters in the G73 assembled genome. However, new insertions that occurred during the 73 generations of piRNA pathway impairment could still segregate as MIVs in the G73 population. Indeed, among the 274 LTR MIVs present in G73, 6.57% (n = 18) were located within the 42 major piRNA producers ( Figure 8). However, this proportion was very similar to that of the piRNA cluster size relative to the total de novo assembled genome size (7.36%). Therefore, unlike what expected in the trap model, LTR retrotransposons do not seem to have preferentially accumulated in piRNA clusters during the 73 generations of transposition burst. Specifically, assuming a binomial law with n = 274 and p = 0.0736 and using a one-tailed test, more than 29 insertions (and not the 18 detected) belonging to many different TE families would have been necessary to validate the hypothesis that piRNA clusters are TE trappers (5% probability threshold). More than 50% of the LTR MIVs located in piRNA clusters belonged to the gtwin family, suggesting that this family inserts preferentially into piRNA clusters. Indeed, among the 93 gtwin MIVs, 11 (11.8%) were found in piRNA clusters, which is very close to the minimal number (n = 12) required to reject the null hypothesis of random insertion in the genome (binomial law with n = 93, p = 0.0736, and 5% probability threshold). More data on de novo gtwin mobilization are needed to confirm their preferential integration in piRNA clusters during a transposition burst and to support the trap model for this TE family.

Conclusions
Our work demonstrates that long reads are crucial in order to finely describe TE landscapes at the intra-genome scale. Using isogenic wild-type strains and an unstable line with a succession of transposition bursts, we could characterize the most common TE variants in different strains and identify TE minor variants observed soon after transposition. The parallel analysis of two close More than 50% of the LTR MIVs located in piRNA clusters belonged to the gtwin family, suggesting that this family inserts preferentially into piRNA clusters. Indeed, among the 93 gtwin MIVs, 11 (11.8%) were found in piRNA clusters, which is very close to the minimal number (n = 12) required to reject the null hypothesis of random insertion in the genome (binomial law with n = 93, p = 0.0736, and 5% probability threshold). More data on de novo gtwin mobilization are needed to confirm their preferential integration in piRNA clusters during a transposition burst and to support the trap model for this TE family.

Conclusions
Our work demonstrates that long reads are crucial in order to finely describe TE landscapes at the intra-genome scale. Using isogenic wild-type strains and an unstable line with a succession of transposition bursts, we could characterize the most common TE variants in different strains and identify TE minor variants observed soon after transposition. The parallel analysis of two close species (D. melanogaster and D. simulans) and two genetic backgrounds allowed us to show that overall, TE recent dynamics are quite similar between species and among strains. However, there is still some strain specificity concerning the identity of the most recently active TE families. ONT is also a powerful tool to investigate the dynamics of piRNA clusters, which are in general inaccessible using short-read sequencing methods. We show here that recent TEIs are not enriched in piRNA clusters, despite recent bursts of TE transposition. Moreover, ONT allows detecting very recent TEIs that are sequenced as singleton reads.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4409/9/8/1776/s1, Figure S1: D-genies genome-wide dot plot of ONT assembly contigs versus reference genome, Figure S2: piRNA analyses in wild-type strains, Figure S3: ZAM copies were visualized by fluorescent in situ hybridization on G0 and G73, Figure S4: Alignments of the 2R-33863 insertion variant to the ZAM consensus sequence. Table S1: Statistics about sequencing data. All lengths are expressed in bases. Quality is expressed in standard Phred scale, Table S2: Genome size estimations using different methods, Table S3: Comparison of TEI distributions across TE groups using chi-square tests, Table S4: piRNA cluster coordinates based on flanking genes in de novo assembled genomes. Funding: This research was funded by the Fondation pour la Recherche Médicale, grant number "DEQ20180339167" to S.C, by the ANR Exhyb to C.V., by the CNRS.