1. Introduction
Wildlife health and conservation initiatives benefit tremendously from genetic methods of species identification for infectious disease screening [
1,
2], detecting illegally traded wildlife products [
3], uncovering food label fraud [
3,
4,
5], and documenting understudied biodiversity [
6]. One major challenge for wildlife molecular studies is obtaining fresh samples from live or dead wild animals. Such endeavors can be logistically challenging, generally involving highly skilled teams, detailed planning, and acquisition of permissions from local, regional, and international partners and governmental agencies for animal handling, sample collection, and sample transfer for molecular testing. Consequently, environmental samples [
7,
8] and animal samples that can be collected non-invasively (e.g., hair, feathers, scat) [
9,
10,
11] are increasingly being used for ecological studies, wildlife health assessments, and characterizing biodiversity. Non-invasively collected samples are easier to obtain than fresh organ tissues, but may contain PCR inhibitors, have lower DNA yields, or be degraded from environmental exposure [
10,
12,
13,
14]. Archived historical wildlife samples, often preserved in formalin, also offer a unique opportunity to obtain genetic information [
15]. However, challenges for molecular studies include formalin-related fragmentation and DNA cross-linking [
16,
17].
DNA barcoding is a common molecular technique for species identification [
18,
19] that was traditionally carried out on standard lab-based sequencing equipment until recent developments in portable technology. The Oxford Nanopore Technologies (ONT) MinION sequencer is currently the only available portable sequencer. Although nanopore sequencing is known to have higher raw sequence error rates than standard short-read sequencing platforms such as Illumina or BGI-Seq, particularly in homopolymeric regions [
20,
21], significant improvements in the accuracy of MinION sequencing chemistry have led to its recent rise in popularity for field applications (reviewed in [
22]). This sequencer is especially useful in situations where there is a lack of access to sequencing facilities or when sample export is difficult. The MinION also has a lower investment cost and shorter turnaround times than traditional sequencing platforms (e.g., Sanger, Illumina).
MinION DNA barcoding studies have primarily used laboratory-based QIAGEN® kits for reliable and pure DNA extraction products (e.g., [
23,
24,
25]). To expand the potential for portable sequencing applications, field-friendly DNA extraction methods can be used to reduce lab equipment requirements. While field-friendly DNA extraction methods are often less effective at producing DNA of high concentration and purity levels, MinION DNA barcoding has been successfully performed using QuickExtract™ solution (Lucigen, Middleton, USA), which only requires a heat source [
26]. The Chelex® 100 resin (Bio-Rad Inc., Hercules, USA) extraction method similarly only requires a heat source (e.g., heat block or PCR thermocycler), but is less expensive and has not been tested for MinION sequencing so far. Both methods have short protocols, but do not remove cellular debris or PCR inhibitors, which can affect downstream applications [
27,
28]. The Biomeme M1 Sample Prep™ Kit (Biomeme Inc., Philadelphia, USA) is another DNA extraction kit developed for field use. While more expensive than either QuickExtract or Chelex methods, the Biomeme kit includes all necessary components and both protein and salt wash steps to remove impurities. Studies have shown that Biomeme-extracted samples have higher levels of inhibitors compared to Qiagen extractions, and thus require additional dilution steps [
8,
29].
To date, MinION DNA barcoding pipelines have used either de novo assembly [
23,
24], clustering-based [
25], or alignment [
26,
30] methods to generate consensus sequences for species identification. Assembly approaches generally work more consistently for longer barcodes (≈1 kb), as the underlying software was originally designed for assembling long reads into genomes rather than for amplicons. To date, published clustering [
25] or alignment [
26,
30] pipelines use subsets of the data (100–200 reads) to generate scaffolds for read error correction. While these approaches may work for high quality sequence data, such data subsets could include more sequence error bias in lower quality datasets. Thus, we developed a clustering-based pipeline, SAIGA (
https://github.com/marisalim/Saiga), with software specifically designed for error-prone MinION reads. SAIGA processes data regardless of barcode length and places no limit on the number of reads that can be clustered, thus maximizing the use of demultiplexed reads for downstream species identification analysis.
In this study, we systematically evaluate the accuracy of the MinION for DNA barcoding across a range of wildlife sample types, including two field-friendly DNA extraction approaches. We sequenced a short fragment of the commonly used mitochondrial cytochrome b (Cytb) gene from scat, hair, feather, fresh frozen liver, and formalin-fixed paraffin embedded (FFPE) liver. For each sample type, we compared the accuracy of Cytb consensus sequences for three different DNA extraction methods: QIAGEN silica membrane-based kits (Qiagen Inc., Germantown, USA), Chelex 100 resin (Bio-Rad Inc., Hercules, USA), and the Biomeme M1 Sample Prep Kit (Biomeme Inc., Philadelphia, USA). All analyses were conducted with SAIGA. We demonstrate that MinION sequencing can be used with field-friendly extraction methods to accurately identify wildlife species from a variety of sample types.
2. Materials and Methods
2.1. Sample Collection
For this study, scat, hair, feather, fresh frozen liver, and FFPE liver samples were collected opportunistically during necropsy examinations from a snow leopard (Panthera uncia) and a cinnamon teal (Anas cyanoptera) from a zoological collection. The FFPE liver samples were part of a suite of tissues that were collected, stored in 10% neutral buffered formalin, and subsequently processed and paraffin-embedded for histologic examination and routine tissue archiving. Fresh liver, scat, hair, and feather samples were frozen (−80 °C) immediately after collection.
2.2. DNA Extraction
DNA was extracted from each sample type using three different approaches: (1) Qiagen (QIAamp® DNA Mini Kit or QIAamp® DNA Stool Mini Kit, Qiagen Inc., Germantown, MD, USA); (2) Chelex 100 Resin (Bio-Rad Inc., Hercules, CA, USA); and (3) Biomeme M1 Sample Prep Kit for DNA (Biomeme Inc., Philadelphia, PA, USA). DNA quantification is inaccurate for Chelex extracts due to the presence of cellular components; thus, Chelex extracts were not quantified. All Qiagen and Biomeme extracts were quantified using the Qubit™ dsDNA High Sensitivity Kit on the Qubit™ 4 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA). The Qiagen, Chelex, and Biomeme extraction protocols are summarized for each tissue type in Text S1. All Qiagen, Biomeme DNA extracts with >10 ng/µL, and all Chelex extracts were run on a 1.0% gel to assess DNA fragmentation by sample type.
2.3. PCR and Library Preparation
2.3.1. DNA Barcoding PCR—Round 1
Approximately 460 bp of the mitochondrial Cytb gene was amplified using primers mcb398 and mcb869 [
31], with universal tailed sequences on each primer that are compatible with the ONT PCR Barcoding Expansion kit EXP-PBC001 (ONT, Oxford, UK) (
Table S1). These primers were designed from an alignment of 67 animal species, and validated for mammals, reptiles, and birds [
31].
PCR was carried out with 6.25 µL DreamTaq HotStart PCR Master Mix (Thermo Fisher, Waltham, MA, USA), 1.25 µL DNA template, and 2 µL of each primer (10 µM stock) in a final volume of 12.5 µL. Cycling conditions were: 95 °C for 3 min; 35 cycles of 95 °C for 30 s, 55 °C for 30 s, and 72 °C for 30 s; and a final extension of 72 °C for 5 min. All Chelex extractions were diluted for the DNA Barcoding PCR as described in
Text S1. PCR products were purified using 1.8× Agencourt AMPure XP beads (Beckman Coulter, Indianapolis, IN, USA), tested for purity using the NanoDrop™ One spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA), and quantified fluorometrically using the Qubit dsDNA High sensitivity kit.
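The round 1 reaction composition can be checked with simple arithmetic. The following is a minimal sketch that tallies the listed components and derives the final concentration of each primer. It assumes that the remainder of the 12.5 µL reaction is made up with water, which is not stated explicitly above, and all variable names are illustrative.

```python
# Round 1 PCR components as listed in the text (volumes in µL).
components_ul = {
    "DreamTaq HotStart PCR Master Mix": 6.25,
    "DNA template": 1.25,
    "forward primer (10 µM stock)": 2.0,
    "reverse primer (10 µM stock)": 2.0,
}
final_volume_ul = 12.5
primer_stock_um = 10.0
primer_volume_ul = 2.0

# Sum of the components that the text lists explicitly.
listed_total_ul = sum(components_ul.values())

# Assumption: the unlisted remainder of the reaction is water.
remainder_ul = final_volume_ul - listed_total_ul

# C1 * V1 = C2 * V2 gives the final concentration of each primer.
primer_final_um = primer_stock_um * primer_volume_ul / final_volume_ul

print(f"listed components: {listed_total_ul} µL")
print(f"remainder (assumed water): {remainder_ul} µL")
print(f"final primer concentration: {primer_final_um} µM each")
```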
2.3.2. Indexing PCR—Round 2
To attach dual ONT PCR index sequences to the Cytb amplicons, a second round of PCR was carried out for each sample with the ONT PCR Barcoding Expansion kit EXP-PBC001. Each reaction contained 25 µL KAPA Biosystems HiFi HotStart ReadyMix (2×) (Thermo Fisher Scientific, Waltham, USA), 25 ng of first-round PCR amplicon, and 1 µL ONT PCR Barcode in a final volume of 50 µL. Cycling conditions were: 95 °C for 3 min; 11 cycles of 95 °C for 15 s, 62 °C for 15 s, and 72 °C for 15 s; and a final extension of 72 °C for 1 min. Hereafter, we refer to ONT PCR barcodes as ‘indexes’ to reduce confusion with the Cytb barcode. Indexed PCR products from round 2 were purified and assessed for purity and quantity as described for round 1 products.
2.3.3. Library Preparation
Samples were grouped into four libraries by sample type (FFPE, scat, hair/feather, frozen liver). For each library, purified indexed amplicons were pooled in equal ratios to produce 1.0–1.2 µg in a total of 45 µL nuclease-free water. Pooled libraries were next prepared using the ONT Ligation Sequencing kit SQK-LSK109 (ONT, Oxford, UK) with modifications to the manufacturer’s instructions: 25 µL of the pooled library was mixed with 3.5 µL NEBNext Ultra II End-Prep Reaction buffer and 1.5 µL Ultra II End-prep Enzyme mix (New England Biolabs, Ipswich, MA, USA), incubated for 10 min at room temperature, then 10 min at 65 °C. For adapter ligation, 15 µL of the end-prepped library (not bead-purified) was mixed with 25 µL Blunt/TA Ligase and 10 µL Adapter Mix (AMX), incubated at room temperature for 20 min and eluted in a final volume of 12 µL of Elution Buffer.
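The pooling step can be illustrated with a short calculation. The sketch below computes the volume of each indexed amplicon needed for equal mass contributions to a pool of a chosen total mass (equal mass approximates equal molarity here because all amplicons are the same length). The function name, the 1.0 µg target within the stated 1.0–1.2 µg range, and the Qubit concentrations are hypothetical values chosen for illustration, not measurements from this study.

```python
def equal_mass_pool(conc_ng_per_ul, target_total_ng=1000.0, max_volume_ul=45.0):
    """Return the volume (µL) of each sample needed for equal mass contributions.

    Raises ValueError if the pooled volume would exceed max_volume_ul.
    """
    per_sample_ng = target_total_ng / len(conc_ng_per_ul)
    volumes = {name: per_sample_ng / conc for name, conc in conc_ng_per_ul.items()}
    if sum(volumes.values()) > max_volume_ul:
        raise ValueError("pooled volume exceeds the available library volume")
    return volumes


# Hypothetical Qubit readings (ng/µL) for a six-sample library.
concentrations = {
    "BC01": 40.0, "BC02": 55.0, "BC03": 32.0,
    "BC04": 48.0, "BC05": 60.0, "BC06": 36.0,
}

volumes = equal_mass_pool(concentrations)
total_ul = sum(volumes.values())
water_ul = 45.0 - total_ul  # nuclease-free water to bring the pool to 45 µL

for name, vol in volumes.items():
    print(f"{name}: {vol:.2f} µL")
print(f"water: {water_ul:.2f} µL")
```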
2.4. Sequencing
The four libraries were split between two FLO-MIN106D R9.4.1 chemistry flow cells (ONT, Oxford, UK) to minimize bleed-through between experiments, with two libraries run on each flow cell—FAL19910: (1) FFPE (four samples, BC01-04), (2) scat (six samples, BC05-10); FAL19272: (1) hair/feather (six samples, BC01-06), (2) frozen liver (six samples, BC07-12). Flow cells were washed with Wash Solution A followed by the addition of Storage Buffer S according to the manufacturer’s protocols. All libraries were sequenced for approximately 1 h to obtain at least 100,000 raw reads per sample.
For comparison to MinION sequences, Sanger sequencing in the forward and reverse directions was performed on all purified indexed amplicons (Eton Bioscience Inc., Newark, NJ, USA). Sanger consensus sequences were generated using Geneious Prime v2019.0.4 software (Biomatters Ltd., Auckland, New Zealand).
2.5. Bioinformatics
The SAIGA bioinformatics pipeline is available on GitHub (
https://github.com/marisalim/Saiga) and steps are outlined in
Figure 1. MinKNOW (ONT) was used for sequencing and the raw sequence data were basecalled using Guppy v3.5.1 (ONT) with basecalling model “dna_r9.4.1_450bps_fast.cfg”.
2.5.1. Demultiplexing and Filtering
Assigning sequencing reads to the correct sample is a critical step to avoid mixing sample sequences within or between sequencing runs. Thus, we compared results from two demultiplexing programs: (1) qcat v1.1.0 (ONT,
https://github.com/nanoporetech/qcat) and (2) MiniBar v0.21 [
24]. The qcat software was built specifically for demultiplexing reads indexed with ONT’s barcode kits, while MiniBar is a general demultiplexing program that accepts any set of user-specified index and primer sequences. We chose stringent demultiplexing filters, informed by software recommendations and sensitivity analyses, to minimize incorrect read assignments. The qcat software uses the epi2me demultiplexing algorithm, and we trimmed adapter and index sequences with the trim option. Using the min-score option, demultiplexed reads with alignment scores <99 were removed prior to downstream analysis, where a score of 100 means every nucleotide of the index is correct. Lower min-score thresholds (i.e., 60–90) reduced downstream consensus sequence quality. In MiniBar, up to two nucleotide differences were allowed for the index sequences and up to 11 for the primer sequences, per software recommendations; MiniBar primarily uses the index sequence information to demultiplex reads and trim the dual index and primer sequences.
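The idea of tolerating a small number of nucleotide differences in an index can be sketched as follows. This is not MiniBar's or qcat's implementation; it is a minimal illustration that uses Levenshtein edit distance and rejects reads whose best match is ambiguous. The index sequences and function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def assign_sample(observed_index, sample_indexes, max_diff=2):
    """Return the sample whose index is within max_diff edits.

    Returns None when no index is close enough or when two samples tie.
    """
    scored = sorted(
        (edit_distance(observed_index, seq), name)
        for name, seq in sample_indexes.items()
    )
    best_dist, best_name = scored[0]
    if best_dist > max_diff:
        return None
    if len(scored) > 1 and scored[1][0] == best_dist:
        return None  # ambiguous: two indexes are equally close
    return best_name


# Hypothetical 12-nt indexes (not ONT sequences).
indexes = {"sample_A": "ACGTACGTACGT", "sample_B": "TTGACCATGGCA"}

print(assign_sample("ACGTACCTACGT", indexes))  # one substitution from sample_A
print(assign_sample("GGGGGGGGGGGG", indexes))  # too divergent from both
```

A stricter `max_diff` discards more reads but lowers the chance of assigning a read to the wrong sample, which mirrors the trade-off described above.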
After demultiplexing, reads were removed if they had mean Phred quality scores <7 or if their length fell outside a 100 bp buffer around the target amplicon length (≈421 bp excluding primers; reads of 321–521 bp were retained) in NanoFilt v2.5.0 [
32]. Following each of the above steps, we calculated and visualized read quality statistics for raw, demultiplexed, and filtered reads with NanoPlot v1.21.0 [
32]. To standardize dataset size across the four sequencing experiments and to investigate the effect of read depth, we generated random subsets of 100, 500, and 5000 reads for each sample from the filtered demultiplexed read files. Hereafter, we refer to these subsets as 100R, 500R, and 5KR, respectively.
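The filtering and subsetting logic can be sketched in a few lines. This is an illustration of the criteria, not the NanoFilt code itself: it assumes mean read quality is computed by averaging per-base error probabilities (one common convention for nanopore reads), and the function names and the random seed are hypothetical.

```python
import math
import random


def mean_phred(quals):
    """Mean read quality from per-base Phred scores.

    Assumption: scores are averaged as error probabilities, then
    converted back to the Phred scale.
    """
    mean_err = sum(10 ** (-q / 10) for q in quals) / len(quals)
    return -10 * math.log10(mean_err)


def passes_filter(seq, quals, min_q=7, min_len=321, max_len=521):
    """Keep reads of 321-521 bp with mean quality of at least 7."""
    return min_len <= len(seq) <= max_len and mean_phred(quals) >= min_q


def random_subset(reads, n, seed=0):
    """Draw n reads without replacement (all reads if fewer than n exist)."""
    rng = random.Random(seed)
    return rng.sample(reads, min(n, len(reads)))


# Toy reads as (sequence, per-base Phred scores) pairs.
reads = [
    ("A" * 421, [10] * 421),  # kept
    ("A" * 300, [20] * 300),  # removed: too short
    ("A" * 421, [5] * 421),   # removed: mean quality below 7
]
kept = [r for r in reads if passes_filter(*r)]
print(len(kept), "of", len(reads), "reads kept")

# Subset sizes matching 100R, 500R, and 5KR.
subsets = {n: random_subset(list(range(20000)), n) for n in (100, 500, 5000)}
```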
2.5.2. Read Clustering and Consensus Sequence Generation
To generate the consensus sequence for each sample, all reads were first clustered using isONclust v0.0.4 [
33]. We chose isONclust over clustering tools previously used in nanopore-based DNA barcoding pipelines, such as VSEARCH (implemented in ONTrack, [
25]), as it was specifically designed to work with error-prone long-read data and thus should be less affected by read errors and more efficient in cluster formation. Next, SAIGA outputs the number of reads per cluster, only retaining clusters with >10% of the total reads (user-defined). We implemented this step to minimize the inclusion of reads with high sequence error and possible contaminant reads in downstream analysis. Intermediate consensus sequences are then generated using SPOA v3.0.1 (
https://github.com/rvaser/spoa), which is based on a partial order alignment (POA) algorithm [
34]. SPOA also conducts error corrections, resulting in more accurate consensus sequences. The SPOA consensus sequences are then clustered using cd-hit-est v4.8.1 with a stringent similarity cutoff (0.9; user-defined) [
35,
36]. Since isONclust separates reads in different strand orientations, this second round of clustering groups reverse-complement SPOA consensus sequences, ensuring that more filtered reads are used for generating the final consensus sequence. The reads contributing to all SPOA consensus sequences that group with the majority isONclust cluster’s SPOA consensus sequence are combined into a single file for mapping. SAIGA then maps these reads to the SPOA consensus sequence of the majority isONclust cluster for consensus polishing with ONT’s Medaka software v0.10.0 (
https://github.com/nanoporetech/medaka).
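Two of the steps above, the cluster size threshold and the grouping of opposite-strand consensus sequences, can be illustrated with a minimal sketch. It does not reproduce SAIGA, isONclust, or cd-hit-est; the function names and read counts are hypothetical, and the sketch assumes the 10% threshold is applied to the sum of reads across all clusters.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def retain_clusters(cluster_sizes, min_fraction=0.10):
    """Keep clusters holding more than min_fraction of all clustered reads."""
    total = sum(cluster_sizes.values())
    return {cid: n for cid, n in cluster_sizes.items() if n / total > min_fraction}


def majority_cluster(cluster_sizes):
    """Return the ID of the cluster with the most reads."""
    return max(cluster_sizes, key=cluster_sizes.get)


# Hypothetical read counts per cluster for a 500-read subset.
sizes = {"c0": 260, "c1": 190, "c2": 30, "c3": 20}
kept = retain_clusters(sizes)   # c2 (6%) and c3 (4%) fall below the threshold
major = majority_cluster(kept)

# Consensus sequences from opposite strands look different as strings but
# describe the same molecule once one of them is reverse-complemented.
forward = "ACGTTGCA" + "GGA"
reverse = reverse_complement(forward)
same_molecule = reverse_complement(reverse) == forward

print(kept, major, same_molecule)
```

In this toy case, the two retained clusters could represent the forward and reverse strands of the same amplicon, which is why a second clustering round that recognizes reverse complements allows both sets of reads to contribute to the final consensus.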
2.6. Consensus Accuracy and Analysis
The MinION consensus sequences were compared to Sanger sequences from the same sample using a nucleotide Blast search v2.8.1+ [
37]. To assess and compare species identification results across tissue types, extraction methods, demultiplexing programs, and data subsets, the following were evaluated: (1) the percent of matching nucleotides between consensus and Sanger sequences, (2) the number of matching nucleotides between consensus and Sanger sequences, and (3) the proportion of filtered reads in the cluster used to generate the final consensus sequence. Accurate species identifications were defined as those with >99% sequence similarity to the Sanger sequence and ≈421 bp of matching nucleotides. The proportion of demultiplexed reads contributing to the final consensus indicates how much data was used for species identification. For samples with consensus sequences generated from fewer than ≈75% of reads, we investigated the non-majority isONclust clusters for potential sequence error or contaminant reads. Finally, all MinION consensus and Sanger sequences across tissue types, extraction methods, demultiplexing software, and data subsets were aligned with Mafft v1.3.7 in Geneious Prime v2019.0.4 to identify common regions with sequence errors.
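The evaluation criteria can be expressed as simple checks on the match counts from a pairwise comparison. The sketch below is illustrative only: the function names are hypothetical, and because the text gives the length criterion as approximately 421 bp, the `min_matches` value of 419 is an assumed cutoff rather than a threshold defined in the methods.

```python
def percent_identity(matches, alignment_length):
    """Percent of aligned positions that match."""
    return 100.0 * matches / alignment_length


def is_accurate(matches, alignment_length, min_identity=99.0, min_matches=419):
    """Apply the >99% identity criterion plus an assumed minimum match count."""
    return (percent_identity(matches, alignment_length) > min_identity
            and matches >= min_matches)


def needs_cluster_review(reads_in_consensus, filtered_reads, min_fraction=0.75):
    """Flag samples whose consensus used fewer than ~75% of filtered reads."""
    return reads_in_consensus / filtered_reads < min_fraction


print(round(percent_identity(419, 421), 2))
print(is_accurate(419, 421), is_accurate(410, 421))
print(needs_cluster_review(300, 500), needs_cluster_review(450, 500))
```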
2.7. Data Availability
A representative Sanger sequence for each species is available on GenBank (MN823069-70), and MinION fastq files (basecalled, demultiplexed, and filtered) are available on the NCBI Sequence Read Archive (BioProject: PRJNA594927, accessions: SRR10678113-SRR10678156). Raw MinION sequence data are available on the EBI European Nucleotide Archive (ERP119594).
4. Discussion
We demonstrate that a MinION-based DNA barcoding workflow can generate accurate consensus sequences from scat, hair, feather, and FFPE liver tissue samples, which are often considered challenging for molecular studies. The ability to use field-friendly DNA extraction protocols with these sample types will help to overcome logistical challenges, such as the need for cumbersome or expensive equipment, for molecular field research. The accuracy of our species identifications is on par with previous MinION DNA barcoding studies and pipelines [
23,
24,
25,
26,
30]. For all tissue types, extraction methods, and subsets tested with our pipeline, we obtained high quality reads and, for each sample, a consensus sequence that matched the Sanger sequence at >99.29% identity and at least 419/421 bp. Although Oxford Nanopore’s goal is the “analysis of any living thing, by anyone, anywhere,” major barriers to its use are ease of sample processing, complicated data analysis, and cost. The results of our study help to reduce these barriers.
4.1. Field-Friendly Protocols for Wildlife Samples Expand Conservation Applications with the MinION
We show that the Chelex and Biomeme extraction methods can be used to generate highly accurate MinION consensus sequences, similar to Qiagen extraction methods, even with low starting DNA concentrations. Our PCR amplicon purification and library prep protocols resulted in libraries of sufficient purity; cellular debris or contaminants present in the Chelex and Biomeme extracts did not affect sequencing of the Cytb amplicons. However, we caution that while PCR amplification resulted in high purity, direct sequencing of DNA extracts obtained using these methods is likely to be negatively affected by contaminants (e.g., by inhibiting library preparation or sequencing). ONT’s protocols recommend using highly purified DNA for genomic sequencing. Although the field-friendly DNA extracts had low DNA concentrations overall, amplification was successful for all samples, including scat (known for containing PCR inhibitors), hair and feather (low DNA quantities), and FFPE tissue, from which DNA is generally difficult to amplify. Alternatively, ONT’s Native Barcode Expansion kits can be used to ligate indexes onto the amplicon of interest.
Formalin can cause DNA fragmentation, cross-linking, subsequent sequence artifacts, and altered base pairs [
16,
17]. As artifacts are randomly distributed, they should not affect the final Sanger sequence if sufficient starting template is used [
38,
39]. Indeed, we accurately sequenced Qiagen-extracted DNA from FFPE samples, and further show that amplifiable DNA was successfully isolated from FFPE tissue using Chelex and Biomeme extraction methods.
4.2. SAIGA: A DNA Barcoding Bioinformatics Pipeline for New MinION Users
We developed the SAIGA bioinformatics pipeline with a read clustering and consensus calling approach using software tools that were specifically designed for long-read and error-prone sequence data (isONclust, SPOA, Medaka). SAIGA performed successfully and consistently with as few as 100 reads per sample, allowing researchers to reduce sequencing time and cost per sample (e.g., by multiplexing more samples). As in other studies investigating read coverage requirements, species identification accuracy still met our requirements but dropped slightly for the largest subset (5KR) [
23,
24]. Further, SAIGA options allow users to explore parameters and provide informative data quality checks and statistics throughout the pipeline. All software components are freely available, and the pipeline structure allows for integration of new software in the future.
Our results show that both qcat and MiniBar correctly demultiplex reads between samples in a sequence run and across multiple runs on a flow cell. Due to the very stringent demultiplexing parameters, the majority of raw data loss occurred during read assignment. More relaxed settings reduce raw read loss, but increase the chance of including incorrectly assigned reads or reads with higher sequencing error. Srivathsan et al. [
26] and Maestri et al. [
25] noted similar magnitudes of read loss with ≈76% and ≈53.6% of reads lost after demultiplexing, respectively; other MinION DNA barcoding publications have not reported this statistic. Despite the read loss, MiniBar- and qcat-demultiplexed reads performed well based on all our metrics for accurate species identification. Both demultiplexers tend to under-trim reads, which is preferable because over-trimming can remove regions of the amplicon that are useful for distinguishing species. Although the consensus accuracy of qcat results was slightly higher than that of MiniBar results, we prefer MiniBar for its flexibility to analyze non-ONT index sequences. Customized indexes are less expensive than ONT indexes and can be lyophilized for field use.
Measuring the proportion of clustered filtered reads used for consensus sequence generation provides a benchmark for detecting sequencing error and potential contamination. For example, SAIGA created separate SPOA consensus sequence clusters for some samples even though these clusters produce the same species identification result. Lowering the sequence similarity threshold in cd-hit could force the sequences to form a single cluster. However, for the purpose of validating SAIGA, we used very stringent sequence similarity thresholds to reduce species identification bias from sequence error. Using this measure, we also show that SAIGA can handle low to medium amounts of laboratory contamination (≈4%–20% reads of total subsample) from relatively distinct species in samples without affecting final species identification since contaminant reads were successfully filtered out during the clustering process. Since contaminant teal reads had the correct indexes used for the three snow leopard samples, contamination likely occurred during library preparation rather than from mis-assignment of reads during demultiplexing. These snow leopard samples were either difficult to amplify during the Barcoding PCR (scat/Chelex) or had low recovery of indexed PCR product used in the sequencing run (hair/Biomeme and liver/Chelex). The contamination risk for these samples was likely exacerbated by the two-step PCR protocol and low starting DNA concentration and/or purity. Further development is needed to adapt this workflow and pipeline for mixed species samples, for which it may be more difficult to differentiate between true sample species and laboratory contaminants.
4.3. Cost-Effective Strategies for Field Implementation
Each field-friendly method has its advantages and disadvantages. The Chelex method is inexpensive and the resin can be transported at room temperature, but the method requires heating equipment and the Chelex solution must be kept cool (4 °C) once prepared. The Biomeme kit is stable at room temperature and self-contained. However, it is more expensive than both the Chelex resin and Qiagen kits ($15/sample versus $0.17 and $3, respectively) and yielded lower DNA concentrations than the Qiagen kit.
We show that qcat and MiniBar can correctly assign reads to samples within and between runs, which reduces costs by allowing multiple sequence runs per flow cell. Future experiments can also scale up by sequencing more samples per flow cell because relatively few reads per sample are required for a consistent, accurate consensus (e.g., [
26]). For the Cytb barcode amplified in this study, reads were sequenced at a rate of ≈100,000 reads per ≈10 min. Sufficient sequence data for species barcoding can therefore be obtained rapidly, depending on the barcoding gene length and number of samples. We also reduced the volume of ONT PCR index used per sample by 50% to lower costs and maximize use of the ONT kit.
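The relationship between sequencing rate, read requirements, and run time can be sketched as a back-of-envelope estimate. The calculation below assumes that reads are distributed evenly across samples and that a fixed fraction of raw reads survives demultiplexing and filtering. The retained fraction of 0.25 is a hypothetical value chosen for illustration, and the function name is not part of any published tool.

```python
def minutes_needed(n_samples, reads_per_sample, retained_fraction,
                   raw_reads_per_min=10_000):
    """Estimate the sequencing time needed to reach a per-sample read target.

    raw_reads_per_min defaults to ~100,000 reads per ~10 min, the rate
    reported for the Cytb amplicon in this study.
    """
    raw_reads = n_samples * reads_per_sample / retained_fraction
    return raw_reads / raw_reads_per_min


# Hypothetical run: 12 samples, 500 filtered reads each, 25% of raw reads retained.
estimate = minutes_needed(12, 500, 0.25)
print(f"{estimate:.1f} min")
```

Uneven amplicon pooling or lower pore occupancy would lengthen this estimate, so it is best read as a lower bound for planning purposes.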