Genetic and genomic data are of critical importance for many applications, including species delimitation [1], studies on evolution and phylogenies [4], biodiversity assessments and conservation [7], reconstructions of past plant communities [9], or for more applied tasks such as forensics [12], pollination and food web studies [14] and monitoring of invasive species [17]. While many of these tasks can be undertaken by sequencing plastid or rDNA amplicons [1], increasing emphasis has been given to the potential of using genomic data for DNA barcoding and wider phylogenomic studies [4]. One key approach for gathering large-scale genomic data from plants is genome skimming, which consists of shallow-pass shotgun sequencing [23]. The major advantage of genome skimming is the large amount of genetic information it provides. Genome skims notably allow for simultaneous assembly of both nuclear ribosomal and plastid DNA. Thus, a single analysis may provide complete plastid and ribosomal assemblies, including all the plastid and nuclear ribosomal markers that have been used in plant DNA barcoding (e.g., plastid genes such as matK, plastid spacers/introns such as trnH-psbA, and the nuclear ribosomal regions ITS1 and ITS2; see also further discussion of plant barcodes [2]).
The use of the complete chloroplast genome as a standard barcode has been repeatedly suggested [23] because of its capacity to increase resolution at lower taxonomic levels in plants [31]. It is also a useful source of information for deeper-level phylogenetic studies [4]. Most chloroplast genomes are 110–160 kbp, a size that, on the one hand, provides much more information than a few loci and, on the other, allows the chloroplast genome to be sequenced and assembled far more easily than the much larger nuclear genome. Moreover, when genome skimming approaches are used, the problem of non-universal primer sites, which has been a limitation for several of the most used markers such as matK, ITS1 and ITS2 [1], is avoided. However, the structure and complexity of the chloroplast genome vary [32], and in our experience taxa with chloroplast genomes harbouring many repeats are especially challenging to assemble and therefore to annotate. Also, in some genera, species may not be distinguishable by their chloroplast genomes due to recent alloploid origins, chloroplast sharing, or hybrid speciation [2]. Nuclear ribosomal DNA is a good complement to the chloroplast genome, as it includes the frequently used and rapidly evolving markers ITS1 and ITS2, as well as the more conserved 18S, 5.8S and 28S [33].
Generating large-scale datasets involving thousands of samples is a major effort, even with standard amplicon sequencing (e.g., building DNA barcoding reference libraries for regional floras [35]). There is thus considerable interest in developing approaches that increase the number of loci recovered from plant samples, are scalable over multiple individuals of multiple species, and remain tractable and manageable at the scale of many thousands of samples. Two recent studies have tackled this using genome skimming and generated large-scale genomic data from plants, showing the potential to extend sampling coverage to the scale of regional floras: the first from China (n = 1659) [4], the second from Australia (n = 672) [37]. Further studies are required to refine protocols and assess which approaches result in efficient and cost-effective recovery of data. Of particular importance are the development and testing of informatics pipelines across diverse sample sets, and the development of robust laboratory protocols that cope with the inevitable heterogeneity of tissue type and quality found in large-scale studies.
A very important but potentially challenging source of tissue for large-scale studies is the plant collections housed in the world’s herbaria. They contain all described species of multicellular plants worldwide, including their type specimens, as well as species that are extinct or not yet described [38]. They represent several hundred years of global effort in collecting, describing and identifying plants in both easily accessible and more remote areas [26]. Using herbarium collections for large-scale genome skimming thus offers the opportunity to open the ‘treasure vault’ that these specimens represent [41]. However, the quality and quantity of DNA in herbarium specimens depends on conditions during collection and storage, and is generally lower than for freshly collected plant material that is immediately dried in silica gel or frozen [43]. Low DNA quantity and quality at the outset can affect all downstream steps, such as sequencing, assembly and annotation, and may therefore affect the overall success of a large-scale project. However, genome skimming methods have improved, and several recent studies have shown that it is possible to extract DNA of sufficient quality and quantity from herbarium material to retrieve partial or complete plastid genomes [41].
In this study we focus on the practicalities of large-scale genome skimming. We share experience gained from two large-scale projects involving several thousand species, to guide future deployment of genome skim sequencing for understanding plant biodiversity. The first project (PhyloAlps, including PhyloCarpates) is focused on the European Alps and the Carpathians and is mainly based on freshly collected leaf material dried in silica gel. The second (PhyloNorway) is mainly based on herbarium material from Norway and the Arctic region. Except for a modification of the extraction protocol for the herbarium material, the two projects use the same methods, which also allows herbarium material to be evaluated as a cost-efficient source for large-scale genome skimming. Specifically, we evaluate (1) the quality of the DNA recovered and the success of genome skimming for herbarium and silica gel dried material, (2) the recovery of standard plant barcodes from the genome skim data, and (3) the effect of sample age and time of growing season on assembling the full chloroplast genome from herbarium material.
4. Materials and Methods
4.1. Sampling and DNA Extraction
For PhyloAlps, we collected most of the fresh leaves during the summer months in 2009, 2010, and 2011, with some additional materials collected in subsequent years to fill sampling gaps. Most PhyloCarpates samples were collected during the 2013, 2015 and 2016 fieldwork seasons, focusing on Carpathian endemics and regionally distributed Carpathian–Balkan taxa.
For PhyloNorway, we sampled leaf material from herbarium specimens at Tromsø Museum (herbarium TROM, 220,000 specimens). The only treatment used for minimising insect damage in this herbarium is freezing at −30 °C for 4 days. The specimens are stored at around 15 °C in wooden cabinets at around 50% humidity. Specimens were selected using five criteria: (1) the species is native to boreal and/or arctic regions; (2) the specimen is healthy (every specimen was inspected under a dissecting microscope to exclude specimens with, e.g., visible fungal infections); (3) the collection date is as early in the growing season as possible; (4) specimens collected in the field after the year 2000 were prioritised, where they met the other criteria; (5) the sample has good documentation and a reliable taxonomic identification. The primary aim was to cover all species of Norway and the polar regions, but common invasive plant species in this region were also included.
DNA extractions were performed using the Macherey-Nagel NucleoSpin 96 Plant II kit with the following specifications and modifications. A minimum of 20 mg dried leaf material was collected from each specimen; a few specimens yielded less material due to their small size. Two tungsten carbide beads (3 mm diameter) were added to each sample before grinding in the TissueLyser for 4 × 1 min at 25 Hz. For each batch of 96 samples, a lysis buffer consisting of 50 mL Buffer PL1 and 1 mL RNase A was prepared, and 500 µL lysis buffer was dispensed to each sample. For silica gel dried samples (PhyloAlps and PhyloCarpates), a brief spin was performed at this step; this was skipped for herbarium material (PhyloNorway). Incubation at 65 °C was extended to overnight for all samples, followed by a centrifugation step: 10 min at 16,000× g for silica gel dried material and 15 min at 13,200 rpm for herbarium material. A filtration step was performed after step 3 of the original protocol, loading 400 μL cell lysate into a NucleoSpin Flash Filter Plate stacked on top of a square-well block and centrifuging for 2 min at 2500× g for silica gel dried material and at 4600 rpm for herbarium material. Thereafter, 450 μL Binding Buffer PC was added to the square-well block. For step 6 of the original protocol (DNA binding to silica membrane), centrifugation was increased to 20 min at 4600 rpm for herbarium material. In step 7 (wash and dry silica membrane), all wash steps for herbarium material were centrifuged at 4600 rpm. For the third wash, we first centrifuged for 2 min before the square-well block was emptied and re-centrifuged without the seal for 5 min, and the membrane was then dried at room temperature for 5 min instead of the original 10 min centrifugation. For step 8 (DNA elution), we used 150 μL preheated Buffer PE, and the flow-through was re-applied onto the filter to increase the DNA yield for herbarium material. See the full extraction protocol in Supplementary Appendix 2.
4.2. Library Preparation and Sequencing
The library preparation protocol was chosen on the basis of the DNA extraction yield. When available, 250 ng of genomic DNA was sonicated using a Covaris E210 instrument (Covaris, Inc., USA). NEBNext DNA Modules (New England Biolabs, MA, USA) were used for end-repair, 3’-adenylation and ligation of NEXTflex DNA barcodes (Bioo Scientific Corporation). After two consecutive 1× AMPure XP clean-ups, the ligated products were amplified by 12 cycles of PCR using the Kapa HiFi HotStart NGS Library Amplification kit (Kapa Biosystems, Wilmington, MA, USA), followed by a 0.6× AMPure XP purification. When the extraction yielded low DNA quantities, 10–50 ng of genomic DNA was sonicated. Fragments were end-repaired and 3’-adenylated, and NEXTflex DNA barcoded adapters were added using the NEBNext Ultra II DNA Library Prep kit for Illumina (New England Biolabs). After two consecutive 1× AMPure clean-ups, the ligated products were PCR-amplified with the NEBNext Ultra II Q5 Master Mix included in the kit, followed by a 0.8× AMPure XP purification.
All libraries were subjected to size-profile analysis on an Agilent 2100 Bioanalyzer (Agilent Technologies, USA) and qPCR quantification (MxPro, Agilent Technologies, USA), then sequenced using 101 base-length read chemistry on a paired-end flow cell on an Illumina HiSeq2000 sequencer (Illumina, USA). For 155 libraries, the same extract was sequenced twice, either as a quality control or because the first results were poor.
An Illumina filter was applied to remove the least reliable data from the analyses. The raw data were filtered to remove any clusters with too much intensity corresponding to bases other than the called base. Adapters and primers were removed from the whole read. Nucleotides with a low Illumina quality score (below 20) were trimmed from both ends of the read. Sequences between the second unknown nucleotide (N) and the end of the read were also removed. Reads shorter than 30 nucleotides after trimming were discarded. Finally, reads and their mates that mapped onto run quality-control sequences (the PhiX genome) were removed. These trimming steps were performed using internal software based on the FastX package [65].
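The per-read trimming rules above can be sketched in a few lines. This is an illustrative reimplementation, not the authors’ internal FastX-based software; the function name `clean_read` and its parameters are assumptions:

```python
def clean_read(seq, quals, min_q=20, min_len=30):
    """Apply the read-cleaning rules described above:
    (1) trim bases with Phred quality < min_q from both ends,
    (2) truncate the read at the second unknown base (N),
    (3) discard reads shorter than min_len after trimming.
    Returns (seq, quals), or None if the read is discarded."""
    # trim low-quality bases from both extremities
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    seq, quals = seq[start:end], quals[start:end]
    # remove everything from the second N to the end of the read
    n_positions = [i for i, base in enumerate(seq) if base == "N"]
    if len(n_positions) >= 2:
        cut = n_positions[1]
        seq, quals = seq[:cut], quals[:cut]
    if len(seq) < min_len:
        return None
    return seq, quals
```

Adapter/primer removal and PhiX mapping are omitted here, as they require external references rather than per-read logic.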
4.3. Global Assembly and Annotation
For each sample, the complete sequences of the nrDNA and of the chloroplast genome were first assembled using the Organelle Assembler [66], a De Bruijn graph based assembler specifically developed for the PhyloAlps and PhyloNorway projects and designed for the assembly of high-copy genetic elements, such as organelle genomes and nrDNA, from genome skimming datasets.
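To illustrate the principle behind a De Bruijn graph assembler, the following is a deliberately minimal sketch (assuming error-free reads covering a single non-branching sequence, nothing like the engineering in the actual Organelle Assembler):

```python
def debruijn_assemble(reads, k=5):
    """Toy De Bruijn assembly: nodes are (k-1)-mers, edges come from
    read k-mers; the contig is the unambiguous walk through the graph.
    Assumes error-free reads from one linear, repeat-free sequence."""
    edges = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.setdefault(kmer[:-1], set()).add(kmer[1:])
    # the start node is a (k-1)-mer that is never a successor
    succs = {s for targets in edges.values() for s in targets}
    node = next(n for n in edges if n not in succs)
    contig = node
    while node in edges:
        targets = edges[node]
        if len(targets) != 1:
            break  # branching point: a real assembler must resolve it
        node = next(iter(targets))
        contig += node[-1]
    return contig
```

Repeats longer than k−1 create branches in this graph, which is why repeat-rich chloroplast genomes are harder to assemble, as noted in the Introduction.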
The sequence data were indexed with “oa index” using a variable-length cut-off that retains 90% of the input sequences. The chloroplast protein-coding genes and nrDNA of Arabidopsis were used to find assembly seeds in the sequence index with the “oa seed” command. For both the chloroplast and nrDNA assemblies, the assembly graphs were constructed with “oa buildgraph”, allowing up to 30 iterations for filling assembly gaps. The final assembled contigs were produced with “oa unfold” and “oa unfoldrdna” for the chloroplast and nrDNA assemblies, respectively. We attempted to generate a circular contig from the chloroplast assembly graph; if none could be obtained, separate contigs were produced instead. The assembled sequences were annotated with the ORG.Annot pipeline [67].
4.4. Targeted Assembly for matK and rbcL
As complete chloroplast assemblies could not be obtained for all specimens, we used the OrthoSkim pipeline [68] to retrieve the chloroplast genes for samples lacking complete assemblies (n = 1815). This pipeline assembles all sequencing reads into genomic contigs and extracts the targeted genes from these contigs by mapping to a close reference. For this, we formatted a database of chloroplast coding genes from our annotations by keeping all protein sequences. For each sample, assembly was performed in OrthoSkim using the SPAdes assembler [69] with the “SPAdes_assembly” mode. Afterwards, OrthoSkim selected the closest taxon for each gene of the database in the NCBI taxonomy, and contigs were first mapped to this closest reference with diamond [70] to extract the matching contigs from the contig set. Selected contigs were then mapped with exonerate [71] to identify the exonic regions of each gene, which were then extracted. This was implemented using the “extraction” mode with the “chloroplast_CDS” target.
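The final exon-extraction step can be pictured as follows. This is a simplified stand-in for what the exonerate-based extraction produces, not OrthoSkim itself; `extract_cds` and its interval format are assumptions for illustration:

```python
def extract_cds(contig, exons, strand="+"):
    """Concatenate exon regions identified on a genomic contig
    (e.g., from an exonerate protein-to-genome alignment) into a
    coding sequence. `exons` is a list of 0-based, end-exclusive
    (start, end) intervals on the contig's forward strand; minus-strand
    genes are reverse-complemented after concatenation."""
    comp = str.maketrans("ACGT", "TGCA")
    cds = "".join(contig[s:e] for s, e in sorted(exons))
    if strand == "-":
        cds = cds.translate(comp)[::-1]
    return cds
```

Most chloroplast genes are single-exon, so the interval list often has length one; the multi-exon case matters for genes with introns.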
4.5. Quality Control
For PhyloNorway, the chloroplast rbcL and nuclear ribosomal ITS2 barcode regions were extracted from the annotated database for quality control. For each marker, the data were uploaded to BOLD Systems [72] and analyzed via the Taxon ID Tree option (visualization via a simple NJ tree). Samples that were misplaced in the tree were manually checked for misidentification based on the herbarium material, corrected where possible, or removed from the final dataset if the final identification was unclear. A total of 87 samples were corrected and 8 libraries were removed in this step.
Additionally, the in-house quality-control process applied to the reads that passed the Illumina quality filters included a taxonomic assignment step. For each dataset, taxonomic assignment was performed on a random sample of 20,000 reads using MegaBLAST [73], Kraken [74], or Centrifuge [75]. This allowed us to identify 35 additional PhyloNorway samples and 113 PhyloAlps samples that likely corresponded to identification or sampling errors. These samples were discarded from subsequent analyses. We also tagged 42 PhyloNorway samples that were contaminated by other DNA from the environment (bacteria, fungi, birds, fish, human; contamination amounted to 0.5–14% of total reads). These had a lower success rate than the overall PhyloNorway dataset (Fisher p = 2.43 × 10⁻⁴), and we were only able to assemble the full chloroplast genome for 11 of these samples. They were nevertheless kept in the overall dataset to give realistic success-rate statistics.
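The subsample-and-screen step above amounts to computing an off-target fraction per library. A minimal sketch, assuming per-read taxonomic labels as input (the lineage name, sample size and seed are placeholders, not the authors’ exact pipeline):

```python
import random

def contamination_fraction(read_taxa, expected="Viridiplantae",
                           n=20000, seed=42):
    """Subsample up to n reads from a library's per-read taxonomic
    assignments, then return the fraction assigned outside the
    expected plant lineage (the quantity reported above as 0.5-14%
    for the flagged contaminated samples)."""
    rng = random.Random(seed)
    sample = read_taxa if len(read_taxa) <= n else rng.sample(read_taxa, n)
    off_target = sum(1 for taxon in sample if taxon != expected)
    return off_target / len(sample)
```

In practice the assignments themselves come from MegaBLAST, Kraken, or Centrifuge; this sketch only covers the bookkeeping after assignment.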
4.6. Statistical Analyses
All success rates are calculated on a per-library basis. To evaluate the significance of associations between continuous variables (coverage, insert size, age) and assembly success or preservation method, Wilcoxon rank-sum tests were used. To assess the association between success rate and preservation method, Fisher’s exact test was used. All statistical analyses were done in R version 3.6 [76].
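For the 2×2 case (assembly success/failure by preservation method), Fisher’s exact test can be computed directly from the hypergeometric distribution. The analyses themselves were run in R; this Python stdlib version is only meant to make the test explicit:

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].
    Sums the hypergeometric probabilities of all tables with the
    observed margins that are no more likely than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def p_table(x):
        # probability of a table whose top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # small tolerance guards against float ties when summing tails
    return sum(p for x in range(lo, hi + 1)
               if (p := p_table(x)) <= p_obs * (1 + 1e-9))
```

This matches R’s `fisher.test` definition of the two-sided p-value for 2×2 tables (summing all tables at most as probable as the observed one).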
4.7. Data Availability
For PhyloNorway, the full dataset of matK and ITS2 is available on BOLD [72]. A subset of 1535 samples has been included in ongoing work (Wang et al., in prep.), and the raw reads and sequence assemblies will be deposited at the European Nucleotide Archive [77]. The remaining data will be released after further quality control. Metadata for the majority of specimens are provided on PhyloAlps [78].