Establishing MinION Sequencing and Genome Assembly Procedures for the Analysis of the Rooibos (Aspalathus linearis) Genome

While plant genome analysis is gaining speed worldwide, few plant genomes have been sequenced and analyzed on the African continent. Yet, this information holds the potential to transform diverse industries as it unlocks medicinally and industrially relevant biosynthesis pathways for bioprospecting. Considering that South Africa is home to the highly diverse Cape Floristic Region, local establishment of methods for plant genome analysis is essential. Long-read sequencing is becoming standard procedure for plant genome research, as these reads can span repetitive regions of the DNA, substantially facilitating reassembly of a contiguous genome. With the MinION, Oxford Nanopore offers a cost-efficient sequencing method to generate long reads; however, DNA purification protocols must be adapted for each plant species to generate ultra-pure DNA, essential for these analyses. Here, we describe a cost-effective procedure for the extraction and purification of plant DNA and evaluate diverse genome assembly approaches for the reconstruction of the genome of rooibos (Aspalathus linearis), an endemic South African medicinal plant widely used for tea production. We discuss the pros and cons of nine tested assembly programs, specifically Redbean and NextDenovo, which generated the most contiguous assemblies, and Flye, which produced an assembly closest to the predicted genome size.


Introduction
Most large plant genomes contain a high proportion of repetitive DNA as a result of whole-genome, chromosomal, subchromosomal and/or tandem duplications [1][2][3]. These structural features impede whole-genome assembly when using only sequencing data that was produced on Second Generation Sequencing platforms, such as Illumina and Ion Torrent. While accuracy is very high, these sequencers only generate short reads (50-350 bp in length), which generally do not span longer repeat regions, leading to incomplete or highly fragmented genome assemblies [4]. Third Generation Sequencing technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have revolutionized the field of plant genomics, as they produce reads that can be many thousands of base pairs long [5]. Such reads can span large repetitive regions and thus improve the contiguity and quality of genome assemblies. ONT uses a novel nanopore technology to determine nucleotide sequences. Briefly, DNA molecules are pulled through

DNA Extraction, Purification and Quantification
Leaf material was obtained from one commercial (plant 1) and one wild rooibos plant (plant 2) that originated from Nieuwoudtville, Northern Cape province, South Africa (31° 40.0 E, respectively). Plant samples were flash frozen in the field, transported in liquid nitrogen and maintained at −80 °C. DNA was extracted using two methods: a sodium dodecyl sulfate (SDS) method and a hexadecyltrimethylammonium bromide (CTAB) method [18], modified as described below. For all extraction and purification steps, wide-bore tips were used to minimize DNA shearing. The final elution buffer did not contain EDTA, as this may interfere with the construction of MinION libraries. To avoid DNA degradation, samples were stored overnight at −20 °C and sequenced the day after extraction/purification.
For the SDS method, 1 g of rooibos leaf material was ground into a fine powder using liquid nitrogen. Thereafter, 4 mL of heated (55 °C) SDS lysis buffer (containing 20 mM EDTA pH 8.0, 100 mM Tris-HCl pH 8.0, 1.4 M NaCl, 1% (w/v) SDS, 0.04% (w/v) PVP-40 and 20 µg/mL proteinase K, to which 0.5% (v/v) β-mercaptoethanol was added just before use) were poured into the mortar and mixed vigorously as soon as the slurry started to thaw. For this protocol, all centrifugation steps were carried out at 8000× g for 12 min at room temperature, unless stated otherwise. The homogenate was incubated at 55 °C for 30 min. The solution was centrifuged and the supernatant was transferred to a new tube. This was followed by adding half a volume of chloroform, mixing by gentle inversion and centrifugation using a fixed-angle rotor. After transferring the supernatant to a new tube, an equal volume of chloroform:isoamyl alcohol (24:1) was added, the solution was centrifuged and the supernatant was recovered. DNA was precipitated by adding 0.1 volumes of 5 M NaCl and 2.5 volumes of 100% ice-cold ethanol, followed by incubation on ice for 30 min. If the DNA precipitated immediately, it was retrieved using a hooked glass pipette and transferred to a 1.5 mL microcentrifuge tube containing 100 µL of 10 mM Tris-HCl pH 8.0. If not, the precipitate was centrifuged and the pellet was washed twice with 1 mL ice-cold 70% ethanol. The pellet was air-dried for 5-10 min and dissolved in 100 µL of 10 mM Tris-HCl pH 8.0 at 37 °C for about 1 h.
Thereafter, the SDS samples were purified using chloroform:isoamyl alcohol extraction. For this, an equal volume of chloroform:isoamyl alcohol (24:1) was added to the sample, the phases were separated by centrifugation and the supernatant was gently pipetted to a new tube. Thereafter, 0.1 volumes of 5 M NaCl and 2.5 volumes of 100% ice-cold ethanol were added and the sample was incubated on ice for 30 min. The solution was again centrifuged and, after removing the supernatant, the pellet was washed three times with 1 mL of 70% ice-cold ethanol. The pellet was then air-dried for 5-10 min, resuspended in 100 µL of 10 mM Tris-HCl pH 8.0, and stored at −20 °C until further use.
For the CTAB method, 1 g of rooibos leaves was ground into a fine powder using liquid nitrogen in the presence of 0.5 g PVP (polyvinylpyrrolidone, Mr 10,000), making sure that the tissue never thawed. This was followed by adding 15 mL of freshly prepared CTAB buffer (containing 20 mM EDTA pH 8.0, 100 mM Tris-HCl pH 8.0, 1.5 M NaCl and 2% (w/v) CTAB, to which 1% (v/v) β-mercaptoethanol was added just before use). As the slurry began to thaw, it was mixed vigorously. The resulting homogenate was transferred into a 50 mL Nalgene tube and incubated at 65 °C for 2 h with intermittent mixing by inversion every 10 min to maximize yields. Thereafter, the solution was centrifuged at 11,000× g for 15 min at room temperature and the aqueous phase was carefully transferred into a new 50 mL Nalgene tube. A double volume of chloroform:isoamyl alcohol (24:1) was mixed into the sample by gentle inversion, the tube was centrifuged at 11,000× g for 15 min at room temperature and the aqueous layer was recovered. If this aqueous solution appeared translucent, the step was repeated until the solution was transparent. To precipitate the DNA, a double volume of chilled isopropanol and 1/3 volume of 3 M sodium acetate (pH 5.2) were added and mixed by inversion, after which the sample was kept at −20 °C overnight. After centrifugation at 4500× g for 90 min at 4 °C, the supernatant was discarded. The pellet was washed twice by adding 5 mL of 70% chilled ethanol, dislodging and gently swirling the pellet, and centrifugation at 4500× g for 15 min at 4 °C. The pellet was air-dried, then fully dissolved in 500 µL of nuclease-free water, transferred to a 2 mL Eppendorf tube, re-precipitated and washed as described above, and then dissolved in 100 µL of 10 mM Tris-HCl pH 8.0. Thereafter, the sample was stored at −20 °C, to be used for DNA purification or library preparation.
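The precipitation steps of both protocols are stated as multiples of the sample volume, so the reagent volumes for a given tube can be worked out with a small helper. The following is a minimal sketch, assuming that all "volumes" refer to the aqueous sample volume before any reagent is added; the class and function names are hypothetical and not part of the published protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrecipitationRecipe:
    """Reagent amounts as multiples of the aqueous sample volume (assumption)."""
    salt_name: str
    salt_volumes: float
    alcohol_name: str
    alcohol_volumes: float


# Ratios as given in the protocol text.
SDS_RECIPE = PrecipitationRecipe("5 M NaCl", 0.1, "100% ethanol", 2.5)
CTAB_RECIPE = PrecipitationRecipe("3 M sodium acetate pH 5.2", 1 / 3, "isopropanol", 2.0)


def reagent_volumes(sample_ml: float, recipe: PrecipitationRecipe) -> dict:
    """Return the salt, alcohol and total volumes (mL) for one precipitation."""
    if sample_ml <= 0:
        raise ValueError("sample volume must be positive")
    salt = sample_ml * recipe.salt_volumes
    alcohol = sample_ml * recipe.alcohol_volumes
    return {
        recipe.salt_name: round(salt, 3),
        recipe.alcohol_name: round(alcohol, 3),
        "total": round(sample_ml + salt + alcohol, 3),
    }


if __name__ == "__main__":
    # Hypothetical sample volumes, chosen only for illustration.
    print(reagent_volumes(4.0, SDS_RECIPE))
    print(reagent_volumes(12.0, CTAB_RECIPE))
```

The "total" entry is useful for checking that the mixture still fits the tube in use before the alcohol is added.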
CTAB-extracted DNA was purified using three commercial kits, namely the Zymoclean™ Large Fragment DNA Recovery Kit (Zymo Research, United States of America), QIAGEN ® DNeasy PowerClean CleanUp Kit (QIAGEN, Germany) and QIAGEN ® Genomic-tip 500/G (QIAGEN, Germany). When using the Zymoclean™ kit, 3.5 µg of crude rooibos DNA were loaded into one well of a 0.8% agarose gel, which was run at 80 V for 1.5 h. Gel slices with HMW DNA were cut from the gel and further divided into 100 mg cubes, each of which was subsequently processed separately using the Zymoclean™ kit. Both the Zymoclean™ kit and the QIAGEN ® DNeasy PowerClean CleanUp kit were used following the manufacturers' instructions, except that all centrifugation steps were carried out at half the speed and double the time specified in the respective protocols to minimize DNA shearing. For the QIAGEN ® Genomic-tip 500/G, we followed two protocols: one published by the manufacturer ([19]: Genomic-tip 1) and one published by ONT ([20]: Genomic-tip 2).
DNA concentrations were measured with the Qubit ® 2.0 Fluorometer (Thermo Fisher Scientific) using the double-stranded DNA (dsDNA) High-Sensitivity (HS) assay kit (Thermo Fisher Scientific), following the manufacturer's protocol. The purity was evaluated with a Nanodrop ® 2000 (Thermo Fisher Scientific), assessing the 260/280 nm and 260/230 nm ratios. The genomic DNA was analyzed on a 0.8% (w/v) agarose gel: the DNA samples were mixed with 6× purple gel loading dye (New England Biolabs) that had been spiked with red gel (New England Biolabs) in a 1:1 ratio, and the samples were run at 90 V in 1× TBE buffer for 1 h. Lambda DNA digested with HindIII (New England Biolabs) was used as a molecular weight marker. DNA was visualized using a UV (302 nm) transilluminator and a photo was captured using the ENDURO™ Gel Documentation System.

Genome Sequencing
For the Illumina sequencing dataset, DNA extraction, library preparation and data preprocessing steps have been described previously [16]. Here, the paired-end read datasets were further quality filtered using the bbmap tool FilterByTile (v 37.90, [21]) with default settings. Read quality was subsequently assessed using FastQC (v 0.11.5, [22]) to determine read lengths, read numbers, %GC, and to visualize per tile sequence quality.
Each MinION library was prepared using 1.5 µg of genomic DNA (as measured by the Qubit ® Fluorometer) and the ONT 1D Ligation Sequencing Kit, SQK-LSK109 (ONT), following the manufacturer's instructions. The third-party reagents NEBNext ® end repair/dA-tailing Module (NEB #E7546), NEBNext ® formalin-fixed paraffin-embedded (FFPE) DNA Repair Mix (NEB #M6630), and NEB Quick Ligation Module (NEB #E6056) were used during library preparation. The adapter-ligated DNA sample was quantified using a Qubit ® dsDNA HS assay kit. Each MinION library was sequenced using a FLO-MIN106 R9.4.1 spotON Flow Cell (ONT), which was primed and loaded according to the manufacturer's instructions on a MinION Mk1B device (ONT). Sequencing was performed for 72 h using the MinKNOW software (v 4.3.20, ONT), installed on a computer running a Linux operating system (Ubuntu 18.04). The fast5 files containing raw reads were base called using Guppy (v 6.1.7, ONT) with the configuration file dna_r9.4.1_450bps_hac.cfg and default parameters. The read length and quality of the MinION data were assessed using minion_qc (v 1.4.2, [23]) and NanoPlot (v 1.33.0, [24]). NanoPlot computed read statistics using only MinION sequencing reads with a minimum quality threshold of Q ≥ 7.
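To make the two read statistics used throughout this study concrete (the Q ≥ 7 filter and the read N50), the following minimal sketch computes both from in-memory values. It is not the code of minion_qc or NanoPlot. The mean read quality is derived by averaging per-base error probabilities, which is one common convention and is an assumption here; the function names and toy reads are hypothetical.

```python
import math


def mean_read_quality(phred_scores):
    """Mean quality of one read, averaging error probabilities rather than scores."""
    if not phred_scores:
        raise ValueError("read has no quality values")
    mean_error = sum(10 ** (-q / 10) for q in phred_scores) / len(phred_scores)
    return -10 * math.log10(mean_error)


def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no read lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def high_quality_lengths(reads, min_q=7.0):
    """Keep the lengths of reads whose mean quality is at least min_q."""
    return [len(quals) for quals in reads if mean_read_quality(quals) >= min_q]


if __name__ == "__main__":
    # Toy reads, each given as a list of per-base Phred scores (hypothetical data).
    reads = [[12] * 6000, [9] * 4000, [3] * 9000, [15] * 2000]
    kept = high_quality_lengths(reads)
    print(len(kept), "reads pass; N50 =", n50(kept))
```

Averaging error probabilities gives a lower mean than averaging Phred scores directly, so a few very poor bases can pull a read below the threshold.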

Genome Assemblies and Evaluation
All programs below were run on a High-Performance Computing cluster at the Centre for High-Performance Computing (CHPC, Cape Town, South Africa).
First, all quality-processed Illumina reads (including the MiSeq and HiSeq reads from a 300 bp insert library, and the HiSeq reads from two mate-pair libraries with insert sizes of 3 Kb and 8 Kb, respectively) were assembled using the short-read de novo genome assembly programs Platanus (v 1.2.4, [25]) and MaSuRCA (v 4.0.1, [26]). The entire dataset amounted to 2.5 billion reads (292 Gbp; minimum 235× genome coverage). For Platanus, the genome size was set to 1.2 Gbp and the k-mer value to 81 (same as the one automatically computed by MaSuRCA). Platanus was run in three distinct steps (contig assembly, scaffold assembly and gap closing), all with default parameters. For MaSuRCA, a configuration file was created, which contained the paths to the input data, as well as the assembly parameters, all set to default. A shell script was then generated from the configuration file to run the final assembly. Two passes of mega-reads were performed.
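The coverage figure quoted above follows from dividing the sequenced bases by the genome size. The sketch below reproduces that arithmetic with the two genome size estimates reported elsewhere in this paper (1.24 Gbp from flow cytometry and 1 Gbp from k-mer analysis) and shows the inverse calculation for a target coverage. The function names are hypothetical.

```python
def fold_coverage(sequenced_gbp: float, genome_gbp: float) -> float:
    """Average depth of coverage: sequenced bases divided by genome size."""
    if genome_gbp <= 0:
        raise ValueError("genome size must be positive")
    return sequenced_gbp / genome_gbp


def gbp_needed(target_coverage: float, genome_gbp: float) -> float:
    """Amount of sequence data (Gbp) required to reach a target coverage."""
    return target_coverage * genome_gbp


if __name__ == "__main__":
    illumina_gbp = 292.0          # quality-processed Illumina data
    flow_cytometry_gbp = 1.24     # genome size estimate from flow cytometry
    kmer_gbp = 1.0                # genome size estimate from k-mer analysis
    print(f"{fold_coverage(illumina_gbp, flow_cytometry_gbp):.0f}x")
    print(f"{fold_coverage(illumina_gbp, kmer_gbp):.0f}x")
    print(f"{gbp_needed(25, flow_cytometry_gbp):.1f} Gbp for 25x")
```

Using the larger genome size estimate yields the more conservative (minimum) coverage value.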
Hybrid assemblies using 40× of Illumina and 25× of MinION data were performed with MaSuRCA (v 4.0.1) as described above, as well as with Haslr (v 0.8a1, [27]) and Wengan (v 0.2, [28]) using default settings. The data size had to be reduced because hybrid assemblers are resource-intensive.
MinION data amounting to 25× genome coverage was also de novo assembled using the long-read assemblers Flye (v 2.8.3, [29]), Raven (v 1.6.0, [30]), Redbean (wtdbg2, v 2.5, [31]), Canu (v 2.2, [32]) and NextDenovo (v 2.4.0, [33]). Again, the programs were run with default parameters, setting the genome size to 1.2 Gbp. Long-read assemblies were polished with two long-read and two short-read polishing programs. Briefly, assemblies were first polished with four rounds of Racon (v 1.4.3, [34]) using all long reads and recommended parameters (-m 8 -x -6 -g -8 -w 500), where read mapping was completed using minimap2 (v 2.22, [35]). Next, the Racon consensus was polished with one round of Medaka (v 1.5.0, [36]) using the model r941_min_hac_g507. Medaka was run in three distinct steps, namely: (1) mini_align to align reads to the input assembly; (2) medaka consensus, which runs the consensus algorithm and outputs results into hdf files; and (3) medaka stitch, which aggregates the results of the previous step and creates a consensus sequence (.fa or .fasta). The Medaka-polished consensus was further polished with Illumina reads using one round of either Racon or Nextpolish (v 1.1.0, [37]), with all parameters set to default.
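The iterative Racon step described above can be sketched as a loop in which each round maps the reads to the current draft with minimap2 and then feeds the overlaps to Racon. The helper below only builds the shell command strings and does not run them. It is an illustration, not the authors' script: the file names, thread count and the `map-ont` preset are assumptions, and the scoring options are written with negative mismatch and gap scores (-x -6 -g -8), which is how Racon expects penalties to be given.

```python
import shlex

# Scoring options for Racon; negative mismatch/gap values are assumed here.
RACON_OPTS = ["-m", "8", "-x", "-6", "-g", "-8", "-w", "500"]


def racon_rounds(draft, reads, rounds=4, threads=8, prefix="racon"):
    """Build minimap2 + Racon commands for successive polishing rounds.

    Returns (commands, final_fasta). Each round polishes the output of the
    previous one; both tools write to stdout, hence the redirections.
    """
    q = shlex.quote
    commands = []
    current = draft
    for i in range(1, rounds + 1):
        paf = f"{prefix}.round{i}.paf"
        polished = f"{prefix}.round{i}.fasta"
        commands.append(
            f"minimap2 -x map-ont -t {threads} {q(current)} {q(reads)} > {q(paf)}"
        )
        commands.append(
            f"racon {' '.join(RACON_OPTS)} -t {threads} "
            f"{q(reads)} {q(paf)} {q(current)} > {q(polished)}"
        )
        current = polished
    return commands, current


if __name__ == "__main__":
    # Hypothetical input file names.
    cmds, final = racon_rounds("assembly.fasta", "minion_reads.fastq")
    print("\n".join(cmds))
    print("final consensus:", final)
```

Keeping each round's output under its own name makes it possible to compare BUSCO or contiguity statistics between rounds before deciding how many rounds are worthwhile.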

Optimizing DNA Sample Purity
Our first goal was to optimize DNA extraction and purification procedures to generate ultra-pure HMW genomic DNA, suitable for long-read sequencing. Of the two tested DNA extraction procedures, the SDS-based method returned 4- to 25-fold less DNA than the CTAB-based protocol (Tables 1 and S1). In addition, the 260/230 ratios of the SDS samples were outside the range of values associated with high purity of DNA, indicating trace amounts of organic solvents (260/280 and 260/230 ratios should be within 1.8-2.0 and 2.0-2.2, respectively). For the CTAB samples, the 260/280 and 260/230 ratios were within the expected ranges. However, 3- to 5-fold differences in DNA concentrations between the Qubit and Nanodrop readings were observed, which implies the presence of contaminants. Gel electrophoresis confirmed that both extraction methods yielded intact HMW DNA and revealed a substantial amount of RNA in the CTAB samples (Figure 1, Lane 2). Due to higher yields, only CTAB-extracted DNA samples were used to establish DNA purification procedures.
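The purity criteria used in this section (absorbance ratios within fixed ranges and agreement between Qubit and Nanodrop concentrations) can be expressed as a simple check. This is a minimal sketch: the absorbance ranges are those stated above, whereas the 1.5-fold cut-off for concentration agreement and the sample values are illustrative assumptions, not values from this study.

```python
from dataclasses import dataclass

A260_280_RANGE = (1.8, 2.0)   # range stated in the text
A260_230_RANGE = (2.0, 2.2)   # range stated in the text


@dataclass(frozen=True)
class DnaSample:
    name: str
    a260_280: float
    a260_230: float
    qubit_ng_per_ul: float
    nanodrop_ng_per_ul: float


def in_range(value, bounds):
    low, high = bounds
    return low <= value <= high


def assess_purity(sample, max_fold=1.5):
    """Flag absorbance ratios and the fold difference between the two readings.

    max_fold is a hypothetical cut-off; the text only states that a ratio
    close to 1:1 is desirable and that 3- to 5-fold differences are not.
    """
    readings = (sample.qubit_ng_per_ul, sample.nanodrop_ng_per_ul)
    fold = max(readings) / min(readings)
    return {
        "a260_280_ok": in_range(sample.a260_280, A260_280_RANGE),
        "a260_230_ok": in_range(sample.a260_230, A260_230_RANGE),
        "fold_difference": round(fold, 2),
        "concentrations_agree": fold <= max_fold,
    }


if __name__ == "__main__":
    # Hypothetical readings for a crude and a cleaned-up sample.
    print(assess_purity(DnaSample("crude", 1.9, 2.1, 50.0, 200.0)))
    print(assess_purity(DnaSample("cleaned", 1.85, 2.1, 100.0, 105.0)))
```

A sample can pass both absorbance checks and still fail the concentration check, which mirrors the behaviour of the crude CTAB extracts described above.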
Four DNA purification methods were tested. Of those, the worst results were obtained when using the Zymoclean™ (Zymo) kit. While Nanodrop absorbance curves for the other three DNA purification methods showed well-defined peaks with a maximum at 260 nm (Figure S1d-f), the samples purified using the Zymo kit showed a maximum absorbance around 235 nm (Figure S1c). Moreover, the 260/280 and 260/230 ratios were out of range, and a significant amount of DNA was lost during the clean-up procedure (at most, 600 ng were recovered out of 3.5 µg of crude, CTAB-extracted DNA). When CTAB-extracted DNA samples were purified using the QIAGEN ® Genomic-tip 500/G (Genomic-tip), the 260/280 and 260/230 ratios were close to the intended ranges, but the Qubit:Nanodrop ratios were still too high. Moreover, the Genomic-tip 1-purified samples still contained substantial amounts of RNA (Figure 1, Lane 5). Independent of the protocol, only 2-6.9% of the DNA was recovered after Genomic-tip DNA purification. By far the best results were consistently obtained when purifying the crude DNA using the QIAGEN ® DNeasy PowerClean CleanUp (DNeasy) kit. The DNA was ultra-pure, as indicated by good absorption ratios and a nearly 1:1 relationship for the Qubit:Nanodrop DNA concentrations. Recovery of HMW DNA ranged between 39% and 70% (Tables 1 and S1).

Nanopore Sequencing
Table 2 shows the run statistics for seven MinION sequencing runs: run1 was completed using CTAB-extracted unpurified DNA from rooibos plant 1; runs 2-4 using DNA from plant 1 that was purified using different clean-up procedures; run5 using CTAB-extracted DNA from plant 2 purified with the DNeasy kit; and runs 6 and 7 using oxidized leaf samples of plant 1, where the CTAB-extracted DNA was also purified using the DNeasy kit. Each run was conducted on a separate MinION flowcell with 1003-1485 pores available before sequencing, and lasted 72 h, i.e., until nearly all pores were irreversibly exhausted.
The unpurified DNA sample of plant 1 (run1) yielded the least amount of data (1 Gbp). For this sample, sequencing statistics were lowest for all measured parameters, including number of sequenced bases, number of reads, read length, N50 and read quality. Nonetheless, this run generated over 300,000 high-quality reads of up to 58.7 Kbp in length, the total base count amounting to at least 0.8× coverage of the rooibos genome. Samples purified using the Genomic-tip (run2 and run3) generated 4.56 Gbp and 6.62 Gbp of high-quality sequencing data, respectively. Read lengths as well as sequencing quality were substantially improved in comparison to the unpurified sample. The differences between run2 (Genomic-tip 1 protocol) and run3 (Genomic-tip 2 protocol) were small: on average, the two runs produced 1.4 ± 0.2 million high-quality reads, reaching similar N50s (6.0 Kbp and 6.6 Kbp) and maximum read lengths of 115 Kbp and 75 Kbp, respectively. For plant 1, DNA purification using the DNeasy kit (run4) yielded by far the best results in terms of read numbers (≈5 million HQ reads amounting to 19 Gbp), with an N50 of 6.5 Kbp and a maximum read length of 79.6 Kbp. When this clean-up protocol was tested on plant 2 (a wild type of rooibos with high amounts but markedly different profiles of phenolic compounds; run5), the run statistics were again notably higher than for the Genomic-tip-purified samples (3.9 million HQ reads, 11.2 Gbp and a maximum read length of 180.9 Kbp). The DNeasy kit was also tested on two oxidized leaf samples of plant 1 (runs 6 and 7). Although all quality tests (light absorbance and Qubit:Nanodrop ratios, as well as gel electrophoresis) indicated that the DNA was of high quality and purity, the total yield in Gbp and the number of reads were much lower than in run4, but still higher than in run1 (the non-purified DNA sample of plant 1). Moreover, the median read lengths and the N50 values were substantially lower than in any of the other sequencing runs.
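The yields reported above can be compared directly by expressing each run relative to the unpurified sample and by pooling the plant 1 runs. The sketch below uses the rounded Gbp values quoted in this section (runs 6 and 7 are omitted because no yields are stated for them in the text); the function names are hypothetical.

```python
# High-quality yield per run in Gbp, as quoted in the text (rounded values).
HQ_YIELD_GBP = {
    "run1": 1.0,    # plant 1, unpurified
    "run2": 4.56,   # plant 1, Genomic-tip 1
    "run3": 6.62,   # plant 1, Genomic-tip 2
    "run4": 19.0,   # plant 1, DNeasy
    "run5": 11.2,   # plant 2, DNeasy
}


def fold_over(baseline, yields):
    """Yield of every run expressed as a multiple of the baseline run."""
    reference = yields[baseline]
    return {run: value / reference for run, value in yields.items()}


def pooled_yield(runs, yields):
    """Total yield (Gbp) of the selected runs."""
    return sum(yields[run] for run in runs)


if __name__ == "__main__":
    for run, fold in fold_over("run1", HQ_YIELD_GBP).items():
        print(f"{run}: {fold:.1f}x the unpurified sample")
    plant1 = pooled_yield(["run1", "run2", "run3", "run4"], HQ_YIELD_GBP)
    print(f"runs 1-4 pooled: {plant1:.2f} Gbp, "
          f"about {plant1 / 1.24:.0f}x of a 1.24 Gbp genome")
```

Pooling runs 1-4 gives roughly 31 Gbp, which corresponds to about 25× coverage when the flow cytometry genome size estimate is used.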

Evaluation of Genome Assemblies
The second aim of this study was to evaluate different programs for plant genome assembly. Illumina sequencing generated nearly three billion raw reads, with an average read length of 119 bp after removal of adapter sequences. After quality trimming, the data amounted to 235× to 290× genome coverage (depending on whether the flow cytometry or k-mer-derived genome size value is used for reference).
To see how the addition of MinION data improves genome assembly, Illumina data was first assembled using the two de novo genome assembly programs, Platanus and MaSuRCA (Tables 3 and S2). When using the flow cytometry genome size value (1.24 Gbp) as a reference, the two assemblers reconstructed only 56% (Platanus) and 69% (MaSuRCA) of the rooibos genome into 78 k and 70 k scaffolds, respectively. MaSuRCA outperformed Platanus in terms of scaffold lengths, but assembly accuracy may have been reduced, as indicated by lower BUSCO statistics. When adding 25× genome coverage of HQ MinION sequencing data from runs 1-4 (amounting to 31 Gbp of data) and 40× genome coverage of Illumina data, the MaSuRCA assembly statistics improved dramatically (Tables 3 and S2): the number of scaffolds was reduced to 29 k, the N50 increased nearly five-fold, the largest scaffold was over 1 Mbp long (15-fold increase), and the proportion of identified BUSCO genes increased to 86% (only 2% of which were fragmented). At approximately 1.5 Gbp, the total assembly length was notably larger than the estimated genome size, indicating that the assembly contains redundant sequence. This was also reflected in the elevated proportion of complete duplicated BUSCO hits (35.3%). Neither Haslr nor Wengan produced reasonable assemblies, as the total assembly lengths (≈0.2 Gbp) were only a fraction of the estimated genome size (Table 3).
The next five tested assemblers (Flye, Canu, Raven, Redbean and NextDenovo) only accept long reads and require polishing of the reconstructed genome using long and short-read sequencing data. The results for the unpolished assemblies differed substantially between the programs (Tables 4 and S2). Flye generated an assembly closest to the predicted genome size (total length 1.1 Gbp) with the highest proportion of completed BUSCO genes (97.3%). However, the assembly was very fragmented (33 k contigs) and the N50 comparatively low (77 Kbp). Canu produced the same number of contigs and achieved similar BUSCO statistics as Flye, but the total assembly length was below the estimated genome size values. Moreover, the N50 and the largest contig size were the lowest among the long-read assemblers. Raven also generated a comparatively small assembly in terms of total size (0.91 Gbp) and contig numbers (11 k), but BUSCO results indicate high accuracy. At 99 Kbp, the N50 was higher than in the assemblies generated by MaSuRCA, Flye and Canu. The Redbean assembly had the second-highest N50 (142 Kbp) and included a contig of nearly 1.7 Mbp in length. Contig numbers were reduced to half of those generated by MaSuRCA, Flye and Canu. However, BUSCO statistics for the unpolished assembly were the lowest. The most contiguous unpolished assembly was produced by NextDenovo, which had the highest N50 (218 Kbp) and the longest contig (3.4 Mbp), with the assembly comprising just 5431 contigs. However, the total length of this assembly amounted to only 66-82% of the predicted genome size (1-1.24 Gbp).
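Because two genome size estimates are in use (1 Gbp from k-mer analysis and 1.24 Gbp from flow cytometry), each assembly length translates into a range of genome fractions. The sketch below shows this calculation for the total lengths quoted above. The NextDenovo length of 0.82 Gbp is not stated in the text and is back-calculated here from the reported 66-82% range, so it should be read as an assumption.

```python
GENOME_ESTIMATES_GBP = (1.0, 1.24)  # k-mer analysis, flow cytometry


def size_fraction_range(assembly_gbp, estimates=GENOME_ESTIMATES_GBP):
    """Assembly length as a percentage of each genome size estimate.

    Returns (lowest, highest) percentage, rounded to one decimal place.
    """
    fractions = sorted(100 * assembly_gbp / estimate for estimate in estimates)
    return round(fractions[0], 1), round(fractions[-1], 1)


if __name__ == "__main__":
    assemblies = {
        "Flye": 1.1,         # total length stated in the text
        "Raven": 0.91,       # total length stated in the text
        "NextDenovo": 0.82,  # assumption, inferred from the 66-82% range
    }
    for name, length in assemblies.items():
        low, high = size_fraction_range(length)
        print(f"{name}: {low}-{high}% of the predicted genome size")
```

A value above 100% for one estimate, as for Flye relative to the k-mer estimate, indicates that the assembly is longer than that estimate rather than more complete.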
In general, assembly polishing using only long reads tended to improve assembly statistics, slightly reducing the number of contigs and increasing both the N50 and the maximum contig lengths. While the differences in BUSCO statistics were negligible for the Flye and Raven assemblies, the proportion of complete BUSCOs in the Canu, Redbean and NextDenovo assemblies increased substantially, with Canu and Redbean assemblies now outperforming Flye and Raven. The two programs used for assembly polishing with the Illumina data (Racon and Nextpolish) produced similar results. This last step of data analysis improved BUSCO statistics even further, the proportion of complete BUSCOs now ranging between 94.5% (NextDenovo assembly) and 99.6% (Canu assembly).
All tested programs required substantial amounts of computational resources (Table 5). Most programs completed analyses when provided with access to 56 cores and a standard 900 GB of RAM, although CPU times ranged from 33 h (Medaka polishing) to 2462 h (MaSuRCA hybrid assembly). The exception was Canu, which only completed analyses when provided with 112 cores and nearly 3000 GB of RAM, taking 35,401 CPU hours to run.
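CPU hours are easier to interpret when converted into an approximate wall-clock time. The sketch below does this under two loud assumptions: that the reported CPU hours are summed over all cores, and that the jobs scale perfectly across the cores they were given. Real run times would therefore be longer than these lower bounds. The function name is hypothetical.

```python
def ideal_wall_hours(cpu_hours: float, cores: int) -> float:
    """Lower bound on wall-clock time, assuming perfect parallel scaling."""
    if cores <= 0:
        raise ValueError("cores must be a positive integer")
    return cpu_hours / cores


if __name__ == "__main__":
    jobs = {
        "MaSuRCA hybrid assembly": (2462, 56),
        "Canu long-read assembly": (35401, 112),
    }
    for name, (cpu_hours, cores) in jobs.items():
        hours = ideal_wall_hours(cpu_hours, cores)
        print(f"{name}: at least {hours:.0f} h ({hours / 24:.1f} days) on {cores} cores")
```

Under these assumptions the Canu run corresponds to roughly two weeks of cluster time even though it was given twice as many cores as the other programs.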

Discussion
Long-read sequencing is becoming standard procedure for plant genome analysis, as these reads are able to span repetitive regions of the DNA, substantially facilitating reassembly of a contiguous genome. With the MinION, Oxford Nanopore offers a comparatively cost-efficient and compact sequencing technology that promises feasibility of high-throughput plant DNA sequencing, even in small laboratories with limited financial means. However, DNA purification and sequencing protocols must still be adapted to generate sufficient high-quality long-read data that would cover larger genomes several times at a reasonable price, particularly when working with non-model plant species.
The main challenge for long-read sequencing of plant DNA using MinION is the requirement of copious amounts of HMW ultra-pure DNA. Obtaining such DNA is not straightforward, as most plant species contain secondary metabolites that bind to the DNA during extraction, affecting DNA quantity and quality. Rooibos, for example, produces a unique combination of phenolic compounds, including flavones, flavonols and glucosyl dihydrochalcones. The total amount of polyphenols can make up to 30% of the plant dry weight and aspalathin alone can contribute up to 13.5% to the leaf dry weight [40]. In previous studies, we established a CTAB-based DNA extraction procedure that generated high yields of HMW DNA suitable for Illumina sequencing [16]. However, CTAB may be difficult to remove after extraction, and we therefore also tested a DNA extraction protocol that uses SDS as a detergent. When assessing yield and purity of the DNA, the best results were achieved with the CTAB extraction method. Similar results were reported previously by the authors of [41], who found that CTAB was more efficient than SDS in extracting genomic DNA from the endemic Brazilian plant species, Croton linearifolius. Plants vary dramatically in their chemical compositions, which explains why there is no universal method for plant DNA extraction. CTAB-based methods worked well for diverse medicinal plant species that were high in phenolic compounds [42][43][44], although SDS was the preferred detergent for DNA extraction from eucalyptus [45].
According to [46], the best MinION sequencing results are achieved when the Qubit:Nanodrop ratio for DNA concentrations is close to 1:1, as it implies very high purity of the DNA. Essentially, it indicates a high proportion of dsDNA in the sample, which can ligate to the ONT adapter and is then available for sequencing. Neither the SDS nor the CTAB protocol yielded such ultra-pure DNA. Since the SDS extraction method did not generate sufficient amounts of DNA, only the CTAB-extracted DNA was further purified using three commercial kits: the Zymoclean™ Large Fragment DNA Recovery Kit, the QIAGEN ® DNeasy PowerClean CleanUp Kit and the QIAGEN ® Genomic-tip 500/G. The Zymoclean kit failed to generate suitable DNA. It efficiently removed RNA (as visualized on the gel) and even appeared to narrow the Qubit:Nanodrop ratio in comparison to unpurified DNA; however, the 260/280 ratios were always too high and the 260/230 ratios were close to zero. The low 260/230 ratios were most likely associated with the presence of carbohydrates due to insufficient removal of the agarose [47]. This may be the result of the reduced centrifugation speeds that were employed to minimize breakage of the DNA. Consequently, this sample was not sequenced. The second-best results were obtained when using the QIAGEN ® Genomic-tip 500/G kit, which is the clean-up procedure recommended by ONT (personal communications). In this kit, the columns operate based on gravity flow, which means that centrifugation steps that could potentially damage HMW DNA can be avoided. Of the two tested protocols, the one published by ONT yielded slightly better results in terms of DNA recovery and purity. Nonetheless, the Qubit:Nanodrop ratios in DNA concentration were still high. The QIAGEN ® Genomic-tip 500/G protocol had initially been tested using Arabidopsis and wheat species [48].
For plant species that are rich in secondary metabolites, such as lavender (Lavandula angustifolia), catmint (Nepeta mussinii) and poplar species [3,48], additional clean-up procedures (e.g., through Amicon Buffer Exchange or AMPure XP beads purification) were tested with variable results. In our study, by far the best results in terms of DNA purification were consistently obtained when using the QIAGEN ® DNeasy PowerClean CleanUp Kit. The DNA losses during purification were comparatively small and DNA purity was excellent, as indicated by the light absorbance and the Qubit:Nanodrop ratios.
When comparing across the DNA purification methods with DNA samples from plant 1, our MinION sequencing results confirm that a Qubit:Nanodrop ratio close to 1:1 is a good indicator for successful sequencing: the best results were obtained when the sample was purified using the QIAGEN ® DNeasy PowerClean CleanUp Kit, which yielded 19× more HQ data (in Gbp) than the unpurified DNA, and approximately three times more than the Genomic-tip purified DNA samples at similar N50 values. Good performance of this clean-up procedure was confirmed when using flash-frozen green leaf material of plant 2 (a wild type of rooibos plant that has high concentrations, but substantially different profiles of phenolic compounds). Here, read lengths of up to 230 Kbp were achieved, surpassing the results obtained with the Genomic-tip kit. If not immediately flash-frozen or dried after harvest, rooibos leaves turn brown within a very short period of time, which is associated with the rapid oxidation of the plant material. We therefore also used the QIAGEN ® DNeasy kit to purify the DNA from oxidized leaf samples of plant 1, where the DNA had likely undergone some degradation. Even here, the sequencing output of HQ data in terms of total number of nucleotides and maximum read length was comparable to the results obtained with the Genomic-tip kit used on DNA from non-oxidized leaf samples, although read length distributions were substantially skewed toward smaller sizes, as indicated by the lower N50 values.
A second challenge for plant genome analysis is the requirement of substantial computational capacities for the reassembly of the genome from the sequencing datasets. Using our Illumina and MinION data, two approaches were tested: hybrid assembly, where long and short reads are used simultaneously, and long-read assembly followed by polishing, first with the long and then with the short-read datasets. When using short-read data only, the hybrid assembler MaSuRCA generated a better assembly than previously tested short-read assembly programs, including ABySS, SoapDenovo and Platanus [17]. The effect of providing MaSuRCA with both Illumina and MinION data was substantial: total scaffold numbers dropped from 70 k to 29 k, the N50 increased nearly five-fold to 81 Kbp, and the longest scaffold length increased 15-fold to 1.2 Mbp. It must be noted that this program is computationally very demanding: for the combined assembly of the short and long-read data, MaSuRCA required 864 GB of RAM and 2462 CPU hours of run time. The other two tested hybrid assemblers, Haslr and Wengan, were found unsuitable for rooibos genome reconstruction. After polishing, several long-read assemblers performed better than MaSuRCA, particularly in BUSCO statistics, which implies better assembly accuracy. Top BUSCO scores were obtained for the Canu, Flye and Redbean assemblies. To date, Canu remains the most commonly used assembler for plant genome analysis [49,50]. It was specifically developed to handle long, high-noise sequencing reads produced by PacBio and MinION. Tests with Macadamia nut [2] and flax [50] datasets showed that Canu outperformed other assembly programs in terms of contiguity and completeness of the reassembled genomes, which is why it was included in the LeafGo protocol for plant genome analysis [51]. In Canu, the first step includes read correction, followed by trimming and assembly.
Consequently, this program has substantial computational requirements, which may be a limiting factor when dealing with large heterozygous genomes. Therefore, Canu consistently crashed when attempting to reassemble the highly heterozygous 1.6 Gbp genome of the mollusc Mytilus coruscus [52]. In our study, when using only the MinION rooibos sequencing data, Canu also failed to complete the analyses until it was provided with double the number of cores (112) and more than three times the amount of RAM memory (≈3000 GB) than the other programs. Yet, it generated a highly fragmented assembly with a comparatively low N50. This is likely associated with the relatively low genome coverage obtained from the MinION sequencing data (Koren, S.; personal communications). Moreover, our studies indicate that rooibos has a high rate of heterozygosity (approximately 2% [17]), which may have also impeded the assembly. Just like Canu and MaSuRCA, Flye required substantial computational resources and generated a rather fragmented assembly. Previous studies indicated that this assembler is suitable for plant genome assembly [53], but may be less efficient when working with data from highly heterozygous, repeat-rich genomes. This may explain our results, as rooibos is also predicted to have a high proportion of repetitive DNA (>50%; [17]). However, it must be noted that in terms of total assembled sequence length (1.1 Gbp), Flye yielded a value that was closest to the rooibos genome size predicted using flow cytometry (1.24 Gbp). Redbean completed the assembly within 32 h, using only 54 GB of RAM. This program appeared to successfully reconstruct large sections of the genome (as indicated by the second-highest N50 and second-longest contig length), but polishing was found to be essential to improve sequence accuracy. After polishing, however, Redbean yielded outstanding BUSCO values (99.2%), which were comparable to the Canu and Flye assembly results. 
With 1 Gbp, the total assembly length was similar to the genome size predicted using k-mer analysis (also 1 Gbp). Redbean had performed worst in the Macadamia nut [2] and the flax [50] studies, reconstructing only half of the expected genome sizes. Both plant species have experienced a whole genome duplication event, which permits the conclusion that Redbean may excessively collapse DNA regions with high sequence similarity, leading to higher N50s and longer maximal read lengths, but also to reduced total genome assembly sizes. NextDenovo generated the most contiguous but also the smallest assembly. While the values for N50 and the longest contig length were outstanding, they must be taken with caution. The total assembly may only represent as little as 66-82% of the actual genome size, implying that NextDenovo may have collapsed genome regions with high sequence similarity. In this assembly, a high proportion of BUSCOs were still missing, even after polishing, although not as many as in the MaSuRCA assembly.
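The comparison of total assembly length against the two genome size estimates can be made explicit with a short calculation. The sketch below is a minimal illustration that uses only the rounded values quoted in the text for Flye and Redbean (1.1 Gbp and 1 Gbp) and the two genome size predictions (1 Gbp from k-mer analysis, 1.24 Gbp from flow cytometry). The function and variable names are hypothetical. The NextDenovo assembly length is not stated in this section and is therefore left out, although the 66-82% range quoted for it is presumably the same kind of ratio taken against the two estimates.

```python
# Genome size estimates for rooibos quoted in the text (Gbp).
GENOME_SIZE_ESTIMATES_GBP = {"k-mer analysis": 1.0, "flow cytometry": 1.24}

# Total assembly lengths quoted in the text (Gbp, rounded values).
ASSEMBLY_LENGTHS_GBP = {"Flye": 1.1, "Redbean": 1.0}


def percent_of_estimate(assembly_gbp, estimate_gbp):
    """Express an assembly length as a percentage of a genome size estimate."""
    if estimate_gbp <= 0:
        raise ValueError("genome size estimate must be positive")
    return 100.0 * assembly_gbp / estimate_gbp


def summarize(assemblies, estimates):
    """Return {(assembler, method): percentage rounded to one decimal}."""
    return {
        (assembler, method): round(percent_of_estimate(length, size), 1)
        for assembler, length in assemblies.items()
        for method, size in estimates.items()
    }


summary = summarize(ASSEMBLY_LENGTHS_GBP, GENOME_SIZE_ESTIMATES_GBP)
for (assembler, method), pct in summary.items():
    print(f"{assembler} vs {method}: {pct}%")
```

With these rounded inputs, the Flye assembly corresponds to roughly 89% of the flow cytometry estimate and the Redbean assembly to roughly 81%, while Redbean matches the k-mer estimate. Because the two estimates differ by about a quarter, which assembly appears closest to the true genome size depends on the estimate used as the reference.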

Conclusions
To date, only a few plant genomes as large as or larger than that of rooibos have been sequenced using Oxford Nanopore sequencing technologies [1,54]. Consequently, DNA extraction and purification procedures still require optimization. In this study, DNA purification after CTAB extraction substantially improved yields, particularly when using the QIAGEN® DNeasy PowerClean CleanUp Kit. However, the median read lengths did not improve. Reducing the centrifugation speed and performing size selection before library construction may improve read lengths. Considering the substantially lower price per sample, the QIAGEN® DNeasy kit may be considered a suitable alternative to the QIAGEN® Genomic-tip 500/G kit for DNA purification. With regard to genome assembly, the aim of this study was to obtain an overview of the computational requirements and program performance with the rooibos sequencing datasets. Therefore, the different assembly and polishing programs tested here were run using only default parameters, which are largely optimized for data analysis (see also [55][56][57]). However, each of the programs has a varying number of adjustable multi-level parameters that permit fine-tuning of the assembly and polishing processes. Testing each of these parameters is beyond the scope of this study. For the final genome assembly, the parameters of the selected program(s) should be optimized. Although Redbean appeared to be the superior assembly program in terms of computational resource requirements and genome assembly statistics, such as N50, contig numbers, assembly length and BUSCO hits, it may collapse regions of high sequence similarity. Future studies will focus on comparative analyses of genome assemblies generated by different assembly programs to identify the most accurate representation of the rooibos genome.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/plants11162156/s1, Table S1: Yields and quality metrics for rooibos DNA samples generated using various DNA extraction and purification procedures; Figure S1: Illustration of DNA purity and Nanodrop readings from different DNA extraction and purification procedures. (a) DNA extraction using SDS method followed by chloroform:isoamyl alcohol extraction. Table S2: Relevant assembly and BUSCO statistics for the assemblies of the rooibos genome generated using Illumina and MinION sequencing data with short, hybrid and long-read assembly programs.

Data Availability Statement:
The datasets generated and analyzed in this article cannot be submitted to public databases due to the restrictive Biodiversity Legislation of South Africa.