1. Introduction
The genome contains the genetic information necessary for a given species to function optimally in its environment. Coding and non-coding regions, mutations, structural variants, as well as other genomic entities such as transposable elements and regulatory elements can be identified through various molecular and bioinformatics analyses. Sequencing experiments generate large volumes of genomic data, which makes bioinformatics analysis challenging. The main difficulties stem from the limitations of current data analysis methods and from the complexity of handling high-throughput data.
Assembling genomic sequences is one of the most important steps in genomics [
1]. Long reads generated by Oxford Nanopore Technologies (ONT) sequencing are useful for analyzing repetitive regions and structural variants in genomes and contribute to the quality and the completeness of an assembly. However, sequences generated by this type of technology are reported to have a relatively high error rate, which can be at least partially corrected before the assembly process [
2,
3].
Two distinct strategies are currently used for de novo assembly of long reads: assembling the uncorrected reads and then correcting the resulting assembly, or correcting the reads first and then assembling them [
2]. Some de novo sequence assemblers, such as Flye [
4] and Shasta [
5] start by generating the assembly of uncorrected reads and then refine the genome assembly. Conversely, tools such as Canu [
6] first correct the reads and then assemble the corrected reads. Both assembly strategies have strengths and drawbacks in terms of computational requirements, run time, contiguity and accuracy of the resulting assembly.
In this study, we describe and compare alternative genome assemblies of long reads generated by nanopore genome sequencing of a Romanian local strain of
Drosophila melanogaster (the fruit fly), named Horezu_LaPeri (Horezu). We used multiple bioinformatics applications dedicated to the assembly of long reads such as Flye [
4], Canu [
6], minimap2 [
7] and NGMLR [
8], and then compared the results obtained with either unfiltered or quality-filtered read datasets. Using unfiltered data proved more effective for the annotation of NTs.
3. Discussion
Our study is the first sequencing project of a local natural Romanian strain of D. melanogaster, named Horezu and collected from the Horezu region. We evaluated whether a rapid ONT sequencing kit, designed for fast library preparation without a ligation step, is appropriate for generating collections of long reads suitable for a good-quality genome assembly. We also examined whether genomic assemblies generated by various methods are suitable for accurate annotation of genes and NTs. Our results show that reads that were not quality filtered are adequate both for searching for sequences of predefined sets of genes and for NTs mapping. In addition, de novo and guided assembly steps performed using the unfiltered reads revealed several advantages in terms of coverage and assembly completeness.
Using Data set I for de novo assembly, we obtained a genome fraction coverage of 94.8% with Canu and 91.44% with Flye, respectively (
Table 3). Interestingly, the Flye assembly does not reach the highest genome coverage, but it yields better values for key quality parameters, such as the largest contig, the longest alignment or the N50 score. These characteristics could be an advantage for the identification and analysis of genes in the Horezu genome. Conversely, searching for NTs revealed that the Canu assembly considerably outperforms the Flye assembly according to the results of the RepeatMasker and mdg1 assessments.
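For readers less familiar with the contiguity metrics mentioned above, the following minimal Python sketch illustrates the usual definition of N50: the length of the contig at which the cumulative length of the contigs, sorted from longest to shortest, first reaches half of the total assembly length. The contig lengths are toy values, not data from the Horezu assemblies, and the function name `n50` is illustrative; QUAST-LG computes this metric internally.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or read) lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2             # half of the total length
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:             # half of the assembly is covered
            return length


# Toy contig lengths (arbitrary units), not values from this study.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70: 80 + 70 = 150, which is half of 300
```

A higher N50 indicates that a larger share of the assembly is contained in long contigs, which is why it is reported alongside the largest contig and the longest alignment.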
The Canu—Data set II and Flye—Data set II assemblies provided D. melanogaster reference genome coverages of 89.7% and 86.5%, respectively. Important parameters, such as the overall error rate during assembly, the number of misaligned bases and the number of partially misaligned contigs, displayed better values for the assemblies compiled with Data set II. These features make the Data set II assemblies suitable for the identification of structural variants or genes. Consistent with this, the Flye—Data set II assembly harbors the maximum number of genes having 100 percent identity with the corresponding reference sequences. Conversely, the Canu—Data set II assembly is a better option than the Flye—Data set II assembly for NTs mapping.
As an overall quality control, we mapped both genes and NTs in the four distinct assemblies. The best BUSCO score was achieved for the Flye—Data set I assembly, which has 876 USCOs (91.8%) detected in at least one copy (
Figure 2). The assemblies obtained with Data set I provided better BUSCO scores than those generated using Data set II. For example, the Canu—Data set II assembly allows detection of only 750 complete USCOs. Since BUSCO assessment can provide false-positive results [
26], we tested the assembly quality using BLAST and a supplemental set of genes involved in innate immunity (Toll and Imd-JNK pathway genes). The majority of the genes displayed similarity scores of over 95% in each assembly, but for a few genes minor quality issues were detected in the assemblies obtained with filtered reads. Globally, there are only small differences among the four assemblies of the Horezu strain, as shown in
Table S2. We conclude that searching for genes in assemblies compiled from filtered reads provides some quality improvements.
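The BUSCO percentages quoted above follow directly from the size of the metazoa_odb10 set (954 USCOs, see Section 4.8). The short sketch below reproduces that arithmetic; the helper name `busco_percent` is hypothetical, and the second percentage is derived here from the reported count of 750 rather than quoted from the BUSCO output.

```python
def busco_percent(detected, total=954):
    """Percentage of USCOs detected, rounded to one decimal place."""
    return round(100 * detected / total, 1)


# Flye, Data set I: 876 USCOs detected in at least one copy.
print(busco_percent(876))  # 91.8
# Canu, Data set II: 750 complete USCOs (percentage derived here).
print(busco_percent(750))  # 78.6
```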
Regarding NTs mapping, the differences between the Canu and Flye assemblies are obvious. The results obtained with RepeatMasker indicate that the Canu—Data set I assembly offers the best values for every NTs category considered, while the Flye assemblies are outperformed by the Canu ones. The Canu—Data set I assembly has a total NTs content representing approximately 20.32% of the Horezu genotype. Since the estimated total NTs content of the
D. melanogaster reference genome is ~20% [
21], it appears that standalone ONT sequencing is very reliable for the analysis of transposons in Drosophilidae genomes. The differences between the Canu assemblies are less evident, but Canu—Data set I allows better mapping for a selection of NTs, such as the Gypsy and Copia transposon families, according to RepeatMasker (10,436 versus 8912 copies and 772 versus 671 copies, respectively). As a complementary approach for assessing minute quality differences, we mapped the mdg1 retroelement and, as expected, the results revealed that the Canu—Data set I assembly is the best option for this purpose.
The NTs analyses were also performed on the minimap/miniasm—ISO1 assembly [
18], using RepeatMasker and GA_v2. The comparative analysis of the minimap/miniasm—ISO1 and Horezu assemblies regarding NTs detection reveals that the Canu—Data set I assembly is the most appropriate one for this purpose. The total content of NTs identified in Canu—Data set I (20.32%) is substantially higher than in minimap/miniasm—ISO1 (13.61%). Accordingly, the number of mdg1 copies identified with GA_v2 in the respective assemblies is 44 versus 17.
Therefore, care should be taken when considering what sequencing data are to be used with de novo assemblers, such as Canu and Flye. Adjustments of the assembly strategy paradigm might be considered in accordance with the research objectives.
The analysis of Horezu-guided assemblies indicated that minimap2 was the most efficient one for mapping reads to the reference genome for both datasets. The percentage of aligned reads, the average coverage value and the average alignment quality had the best values when using minimap2 (
Table 5). Additionally, the BAMstats analysis revealed that the averages of the coverage values for each chromosome are higher for the assemblies generated with minimap2 (
Table S8).
Next-generation sequencing technologies are used on a large scale in whole-genome sequencing (WGS) projects, but short reads fail to cover the entire genome, often leaving gaps or producing assembly errors in repetitive regions [
27,
28]. Instead, long-read sequencing technologies have been tested for sequencing of large genomes, mainly those of model organisms, in order to simplify genome assembly and to resolve low-complexity regions [
29,
30]. For example, using ONT for genome sequencing of the experimental model
Arabidopsis thaliana, Debladis et al. [
31] generated a number of 118,554 reads with a minimum length of 6 nt, a maximum of 691,915 nt and a median of 4.6 kb. Even with a low level of coverage, their sequencing data allowed the identification of transposon insertions such as LTR retroelements and DNA transposons CACTA and CAC1. In a similar study, Michael et al. [
32] sequenced genomic DNA from
Arabidopsis thaliana and obtained 300,053 sequencing reads with an average read length of 11.4 kb. WGS experiments using nanopore sequencing were also performed on pea (
Pisum sativum), yielding approximately 33.2 million reads with an N50 read length of 15.5 kb and totaling 262.1 Gb of data. After de novo assembly, 117,981 contigs (3.3 Gb) were generated, with an N50 value of 51.2 kb and a BUSCO score of 51% [
33]. For zebrafish (
Danio rerio), long-read sequencing of the genome produced sequences with N50 = 15 kb and a longest read of 464,751 nt. The assembly generated with Canu (1.42 Gb) showed a coverage of the reference genome of 90.8%, while the assembly produced with miniasm (1.39 Gb) had 88% coverage [
34]. The
Caenorhabditis elegans genome was recently recompleted in a sequencing experiment using ONT that generated 225,835 raw reads. After filtering according to quality score, 166,198 reads were obtained with an average length of 16,413 nt, a minimum length of 15 nt and a maximum read length of 336,266 nt [
35]. In a different study performed on
C. elegans VC2010 wild-type strain [
36], combined data from three flow cells, consisting of 1,116,324 reads, revealed an average read length ranging from 13 kb to 20 kb, with a maximum of 134 kb. Raw reads were filtered according to quality score (q > 10) and size (>1 kb), improving sequence quality but reducing the number of reads to 583,466. When utilizing only q10 long reads for genome assembly, Canu generated 73 contigs, the largest contig having more than 9.9 Mb. Moreover, half of the reference genome was contained in the 10 largest contigs. In addition, the contigs were corrected with Illumina short reads, increasing sequence identity with the reference genome to 99.8% [
36].
ONT was used in 2018 to sequence the genomes of 15 species of Drosophilidae [
37]. A total of 23 million reads were generated, with an average read length of 4302 nt. A proportion of 76% of reads passed Albacore filter (≥7) and had an average read length of 5894 nt. Genome assembly was performed with Canu and miniasm, which had similar assembly statistics: an average contig N50 value of 4.4 Mb and average BUSCO score of 97.7% [
37]. Additionally, in a study aiming to test ONT technology on
D. melanogaster reference genome, Solares et al. generated a total of 663,784 reads with an average read length of 7122 nt. Of these, 593,354 reads (89%) were marked as “pass” (having a quality score ≥7). A comparison between Canu and minimap/miniasm assemblies revealed a higher accuracy and completeness of the Canu assembly (contig N50 = 3.0 Mb and BUSCO score of 67.7%) [
18]. In another recent study using nanopore technology to sequence 101 Drosophilidae species, Flye assemblies with an average N50 of 10.5 Mb and a BUSCO score greater than 97% were obtained for 97 of the 101 species [
38]. N50 values of 6.6 Mb and 5.4 Mb were obtained for contigs assembled with Canu and scaffolded with Hi-C data in a study using 713,692 and 481,640 reads for the DGRP379 and DGRP732 strains of
D. melanogaster [
39].
On average, the parameters of our ONT reads and assemblies are in the range of the values reported for the above-mentioned ONT experiments, and the results of the quality assessments by detection of genes and NTs are supportive. Therefore, we consider that our ONT-only genome assemblies are reliable for the annotation of both unique and repetitive genomic features of the Horezu strain. This approach could contribute to a more detailed analysis and understanding of the structure and evolution of the D. melanogaster genome, as no Romanian fruit fly strain had been sequenced so far.
4. Materials and Methods
4.1. Fly Stock
The fruit flies were collected in August 2018 from the location Romanii de Sus, Horezu, Vâlcea County, Romania. For isogenization, Horezu stock was maintained for about 2 years at 18 °C in standard medium-sized bottles containing culture medium based on an agar and banana recipe. Prior to sequencing, the fly stock was maintained for one day at 25 °C.
4.2. DNA Isolation and Quantification
To obtain long DNA fragments, we used an adapted version of the DNA extraction protocol previously described by Miller et al. [
37].
We collected about 50 D. melanogaster males from the Horezu strain, which were kept at −20 °C for about an hour before DNA extraction. We used pestles to grind the chitinous layer of the cuticle of the frozen males placed in a 1.5 mL Eppendorf tube to which we added 1 mL homogenization buffer (0.1 M NaCl, 30 mM Tris-HCl, 10 mM EDTA, 0.5% Triton X).
The homogenized suspension was transferred to a 1.5 mL Eppendorf tube using a wide-bore pipette tip and the tissue debris was separated by centrifugation at 500× g at 4 °C for 1 min. The supernatant was then transferred into a new tube, and nuclei were pelleted by centrifugation for 5 min at 2000× g at 4 °C. Pelleted nuclei were resuspended in 200 µL homogenization buffer. For nuclear membrane lysis, we added 1.268 mL extraction buffer (0.1 M Tris-HCl, 0.1 M NaCl, 20 mM EDTA), 1.5 µL proteinase K (20 mg/mL) and 30 µL of 10% SDS. Subsequently, the tube was maintained at 37 °C for about 3 h. The nucleic acid solution was mixed with an equal volume of phenol:chloroform:isoamyl alcohol, pH 8.0. We performed a succession of two homogenizations and two centrifugations at 5000× g for 5 min at room temperature, transferring the upper aqueous phase after each step. Finally, we transferred the aqueous phase to a new tube, to which we added 3 M sodium acetate (NaOAc) (10% v/v) and 97% ethanol (EtOH) (twice the volume of the aqueous phase).
After an overnight incubation at 4 °C, we stimulated DNA precipitation by adding 2 µL glycogen and centrifuged the solution at 14,000 rpm at 4 °C. The DNA precipitate was collected with a wide-bore pipette tip and washed with 500 µL of 70% ethanol, then centrifuged at low speed. After air-drying, DNA was stored at 4 °C in 67 µL ultrapure water. This DNA extract was used for the first sequencing run, designated Run_1.
For the second sequencing run, designated Run_2, we collected 60 males from the same Horezu stock. DNA was extracted as described above.
The DNA concentration was 113.5 ng/µL for Run_1 and 76 ng/µL for Run_2.
4.3. Nanopore Library Preparation, Sequencing, and Basecalling
The library preparation, sequencing and basecalling processes were performed according to the manufacturer’s protocol for the Rapid Sequencing kit (SQK-RAD004). In order to prepare the library, we mixed 7.5 μL genomic DNA with 2.5 μL FRA (Fragmentation Mix). After incubating the mixed DNA library at 30 °C for 1 min and then at 80 °C for 1 min, we added 1 μL of RAP (Rapid Adapter) in order to attach the sequencing adapters to the ends of the DNA fragments. The DNA/FRA/RAP mixture was incubated for 5 min at room temperature. Prior to loading the library, the flow cell was set up using SQB (Sequencing Buffer), FLT (Flush Tether), FB (Flush Buffer) and LB (Loading Beads) solutions. After removing the air bubbles inside the flow cell, we loaded 800 μL of the priming mix (30 μL FLT + an FB tube) into the priming port and let it stand for 5 min. In a separate tube, we mixed 34 μL of SQB, 25.5 μL of LB, 4.5 μL ultrapure water and the DNA library (11 μL). The resulting 75 μL mix was loaded into the sequencing port (SpotON) of the MinION.
We used two FLO-MIN106 type flow cells. Run_1 started with 909 available pores and ran for approximately 48 h. Run_2 started with 1400 available pores and ran for 72 h.
We used the MinKNOW tool version 3.6.5 for data acquisition and for converting the raw data files represented by electrical signals (FAST5) into FASTQ files (basecalling).
Both collections of ONT reads have been uploaded to SRA/NCBI, under accession numbers SRX8215201 and SRX17355721, respectively.
4.4. Computational Environment
The Oxford Nanopore MinION sequencing device was connected to a computer equipped with 32 GB DDR4 RAM, an i7-6500U processor, a 500 GB SSD and the Linux Mint 20 operating system. Basecalling and assembly steps were performed on the same computer.
4.5. Data Processing and Quality Control
We used the EPI2ME platform (Oxford Nanopore, Oxford, UK) for the analysis of ONT data. EPI2ME (accessed on 22 December 2021) is able to provide quality control of the data and splits reads into “pass” and “fail”, based on high/low quality scores of the reads.
To eliminate adapters, we used Porechop version 0.2.4 (accessed on 30 March 2020) [
10], which aligns subsets of reads to the sequences of all adapters specific to ONT sequencing and removes the adapter sequences from the ends of the reads if they are detected. Then, we filtered the reads according to the quality score with the NanoFilt tool (accessed on 21 April 2020) [
9], designed for reads obtained by nanopore sequencing. The processed reads were quality assessed with NanoPlot [
9], an application for visualizing and processing long reads (accessed on 30 March 2020).
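As an illustration of read filtering by quality score, the sketch below computes a mean read quality by averaging the per-base error probabilities and converting the average back to the Phred scale, then keeps the reads that pass a threshold. It is a simplified stand-in and not the NanoFilt implementation; the function names, the Phred+33 encoding, the threshold of 15 and the minimum length of 3 are assumptions chosen for the toy example and are not the settings used in this study.

```python
import math


def mean_read_quality(quality_string, offset=33):
    """Mean Phred quality of a read, averaged in error-probability space.

    Assumes Phred+33 encoded FASTQ quality characters.
    """
    if not quality_string:
        raise ValueError("empty quality string")
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(error_probs) / len(error_probs))


def filter_reads(records, min_quality, min_length=0):
    """Yield (name, sequence, quality) records that pass both thresholds."""
    for name, seq, qual in records:
        if len(seq) >= min_length and mean_read_quality(qual) >= min_quality:
            yield name, seq, qual


# Toy records: '+' encodes Q10 and '5' encodes Q20 in Phred+33.
toy_reads = [
    ("read_1", "ACGT", "++++"),   # mean quality about 10
    ("read_2", "ACGT", "5555"),   # mean quality about 20
    ("read_3", "AC", "55"),       # good quality but too short
]
kept = [name for name, _, _ in filter_reads(toy_reads, min_quality=15, min_length=3)]
print(kept)  # ['read_2']
```

Averaging error probabilities rather than raw Phred scores gives more weight to low-quality bases, so a few poor bases lower the mean more than a simple arithmetic average of the scores would suggest.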
4.6. De Novo Assembly
The de novo assembly step was performed in a Linux environment using the following assemblers: i. Flye, version 2.8.3 (accessed on 5 July 2021)—an application for assembling sequences generated by ONT and Pacific Biosciences (PacBio), which can be used for both bacterial and eukaryotic genomes [
4]; ii. Canu version 2.1.1 (accessed on 4 August 2021), specialized for assembly of high-noise long sequences [
6].
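The sketch below shows how typical Flye and Canu invocations for raw nanopore reads could be assembled as argument lists in Python. It only builds and prints the commands; it does not run the assemblers. The file names, output directories, thread count and the genome size value `140m` are placeholders, and the options shown are common usage rather than the exact parameters used for the assemblies described here.

```python
import shlex


def flye_command(reads, out_dir, threads=4):
    """Argument list for a Flye run on uncorrected nanopore reads."""
    return ["flye", "--nano-raw", reads, "--out-dir", out_dir,
            "--threads", str(threads)]


def canu_command(reads, prefix, out_dir, genome_size):
    """Argument list for a Canu run; Canu requires an estimated genome size."""
    return ["canu", "-p", prefix, "-d", out_dir,
            f"genomeSize={genome_size}", "-nanopore", reads]


# Placeholder inputs: file names and genome size are assumptions.
commands = [
    flye_command("reads.fastq", "flye_out"),
    canu_command("reads.fastq", "horezu", "canu_out", "140m"),
]
for cmd in commands:
    # To execute, one could pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Keeping each command as a list rather than a single string avoids shell quoting problems if the same lists are later passed to `subprocess.run`.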
The Flye—Data set II assembly was submitted to GenBank/NCBI, accession number JANZWZ000000000.1.
4.7. Assembly versus the Reference Genome of D. melanogaster
The guided assembly was performed using the
D. melanogaster r6.39 reference genome from FlyBase [
40]. Reference scaffolds that could not be associated with any
D. melanogaster chromosomes (or mitochondrial DNA) were removed. The following programs were used to perform guided assembly:
Minimap2 version 2.20 (accessed on 20 June 2021)—a bioinformatics application designed to align long ONT and PacBio reads to a reference sequence. The program can quickly align nucleotide sequences with each other to identify overlapping regions, and it can align reads to a reference genome [
7].
NGMLR version 0.2.8 (accessed on 21 June 2021)—a bioinformatics tool able to map ONT reads to a large reference genome. NGMLR application provides quick and accurate nucleotide sequences alignments, taking into account both possible sequencing errors and genomic variations [
8].
SAMtools version 1.7 (accessed on 20 June 2021)—a suite of programs dedicated to process high-throughput sequencing data [
41].
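A reference-guided workflow with the three tools listed above can be sketched as follows: align the reads with minimap2 or NGMLR, then sort and index the alignments with SAMtools so that they can be inspected by downstream quality tools. The Python code only builds and prints the command lines. The file names and thread count are placeholders, and the presets (`map-ont` for minimap2, `ont` for NGMLR) are the usual choices for nanopore reads rather than confirmed settings from this study.

```python
import shlex


def minimap2_command(reference, reads, sam_out, threads=4):
    """minimap2 alignment of nanopore reads, written as SAM."""
    return ["minimap2", "-a", "-x", "map-ont", "-t", str(threads),
            "-o", sam_out, reference, reads]


def ngmlr_command(reference, reads, sam_out, threads=4):
    """NGMLR alignment of nanopore reads, written as SAM."""
    return ["ngmlr", "-x", "ont", "-t", str(threads),
            "-r", reference, "-q", reads, "-o", sam_out]


def samtools_commands(sam_in, bam_out, threads=4):
    """Coordinate-sort a SAM file into BAM, then index the BAM file."""
    return [
        ["samtools", "sort", "-@", str(threads), "-o", bam_out, sam_in],
        ["samtools", "index", bam_out],
    ]


# Placeholder file names; the reference would be the filtered r6.39 FASTA.
pipeline = [minimap2_command("reference.fasta", "reads.fastq", "minimap2.sam")]
pipeline += samtools_commands("minimap2.sam", "minimap2.sorted.bam")
pipeline.append(ngmlr_command("reference.fasta", "reads.fastq", "ngmlr.sam"))
pipeline += samtools_commands("ngmlr.sam", "ngmlr.sorted.bam")
for cmd in pipeline:
    print(shlex.join(cmd))
```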
4.8. Assessing the Quality of Generated Assemblies
The following tools were used for the qualitative evaluation of the generated assemblies:
QUAST-LG (accessed on 4 August 2021) is one of the best-known tools for evaluating the quality of de novo genome assemblies. The application can also be used with a reference genome and supports multiple assemblies at the same time, which makes it suitable for comparative analyses [
11];
BUSCO version 5.2.2 (accessed on 3 December 2021) searches in de novo assemblies for highly conserved USCOs. We used the metazoa_odb10 database, which contains 954 USCOs likely to be present in many metazoan genomes [
12];
Qualimap version 2.2.1 (accessed on 19 July 2021) is a Java application that allows qualitative evaluation of the assemblies resulting from read alignment to a reference genome. Guided assembly data (BAM files) are used to obtain a qualitative report that includes graphs and statistical parameters of the assembly [
24];
BAMstats version 1.25 (accessed on 20 July 2021) is a graphical interface program used to calculate mapping statistics of reads from a BAM file. This application provides an overview of the query/reference genome alignment quality [
25].
As an additional quality evaluation, we examined the Horezu genotype for gene sequence integrity by comparing it to the sequences of 53 control genes drawn from the
D. melanogaster r6.39 reference genome. The control gene set consisted of
the γCOP gene and a selection of 52 genes involved in the Toll and Imd-JNK immune pathways. Sequences of these genes were downloaded from FlyBase [
40] and aligned against our de novo assemblies using blastn (accessed on 6 June 2021) [
13] in the Linux terminal.
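One way to summarize such blastn results is to keep, for each control gene, the hit with the highest bit score and read its percent identity. The sketch below assumes the standard 12-column tabular output (`-outfmt 6`: query, subject, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, e-value, bit score). The output format, the gene and contig names and all values are hypothetical and serve only to illustrate the parsing.

```python
def best_hit_identity(tabular_text):
    """Map each query to the percent identity of its highest-scoring hit."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank and comment lines
        fields = line.split("\t")
        query = fields[0]
        pident = float(fields[2])         # column 3: percent identity
        bitscore = float(fields[11])      # column 12: bit score
        if query not in best or bitscore > best[query][1]:
            best[query] = (pident, bitscore)
    return {query: pident for query, (pident, _) in best.items()}


# Hypothetical hits for two placeholder genes.
rows = [
    ("geneA", "contig_1", "100.000", "1500", "0", "0", "1", "1500", "200", "1699", "0.0", "2771"),
    ("geneA", "contig_7", "91.200", "300", "26", "1", "10", "309", "40", "339", "1e-90", "340"),
    ("geneB", "contig_2", "96.500", "2000", "70", "0", "1", "2000", "50", "2049", "0.0", "3300"),
]
sample = "\n".join("\t".join(row) for row in rows)

identities = best_hit_identity(sample)
perfect = sorted(q for q, p in identities.items() if p == 100.0)
above_95 = sorted(q for q, p in identities.items() if p > 95.0)
print(perfect)   # ['geneA']
print(above_95)  # ['geneA', 'geneB']
```

Note that the percent identity of a single hit describes only the aligned segment; a gene split across several contigs would need its hits combined before judging whole-gene integrity.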
In addition, we used RepeatMasker version 4.1.2 (accessed on 27 October 2022), a widely used program developed to quantify the NTs content in re-sequenced genomes and currently considered the gold standard for this type of analysis [
19]. The program was run using the alignment application rmblastn version 2.10.0+ and NTs consensus database Dfam 3.3. To identify and analyze the insertions of mdg1 retroelement in the
D. melanogaster Horezu genotype, we used Genome ARTIST (GA) v2 software (accessed on 1 November 2022) [
20]. Genome sequencing, preprocessing and data analysis were performed in the
Drosophila laboratory of the Department of Genetics, Faculty of Biology, University of Bucharest.