Comparing the Effectiveness of Exome Capture Probes, Genotyping by Sequencing and Whole-Genome Re-Sequencing for Assessing Genetic Diversity in Natural and Managed Stands of Picea abies

Conifer genomes are characterized by their large size and high abundance of repetitive material, making large-scale genotyping in conifers complicated and expensive. One of the consequences of this is that it has been difficult to generate data on genome-wide levels of genetic variation. To date, researchers have mainly employed various complexity reduction techniques to assess genetic variation across the genome in different conifer species. These methods tend to capture variation in a relatively small subset of a typical conifer genome and it is currently not clear how representative such results are. Here we take advantage of data generated in the first large-scale re-sequencing effort in Norway spruce and assess how well two commonly used complexity reduction methods, targeted capture probes and genotyping by sequencing perform in capturing genome-wide variation in Norway spruce. Our results suggest that both methods perform reasonably well for assessing genetic diversity and population structure in Norway spruce (Picea abies (L.) H. Karst.). Targeted capture probes were slightly more effective than GBS, likely due to them targeting known genomic regions whereas the GBS data contains a substantially greater fraction of repetitive regions, which sometimes can be problematic for assessing genetic diversity. In conclusion, both methods are useful for genotyping large numbers of samples and they greatly reduce the cost involved with genotyping a species with such a complex genome as Norway spruce.


Introduction
Norway spruce (Picea abies (L.) Karst.) is an evergreen gymnosperm tree that belongs to the biggest group of gymnosperms, the family Pinaceae [1]. It is one of the most important conifer species in Europe, with a distribution range extending from the west coast of Norway to mainland Russia in northern Europe and across the Alps, the Carpathians and the Balkans in central Europe [2]. Norway spruce is a wind pollinated and predominantly outcrossing [3] and this creates high levels of gene flow that is one of the most important factors influencing the genetic structure of populations. High gene flow results in high genetic diversity within populations and reduces the difference between populations [3]. Conifers, as a group, are characterized by high levels of pollen gene flow and this is often found in studies of seed orchards contamination from surrounding forests [4,5]. A review study completed on six different conifer species gave an average contamination rate of seed orchards of 45% [6]. In Norway spruce seed orchards, high levels of pollen contamination have also been recorded. In Finland a seed orchard was found to have pollen contamination rates of 69-71% [7] and a study on two seed orchards in Sweden reported 43% and 59% of pollen contamination, respectively [8,9]. Scotti et al. [10] showed that seed dispersal creates patches of genotype diversity in mitochondrial markers that arise due to the limited long-distance dispersal seen in seeds. Pollen dispersal (based on chloroplast markers), on the other hand, had a homogenizing effect over both short and long distances due to the effective wind dispersal of pollen.
Pinaceae genomes have a largely conserved karyotype, consisting of 12 (2n = 24) chromosomes [11][12][13] and in Picea these are of similar size and shape [12,14]. Most conifer genomes are also characterized by their large size (>15 Gbp) and high abundance of repetitive material, making sequencing and assembling a draft genome a daunting task in conifers. Nystedt et al. [14] published the first draft assembly for Norway spruce and found little evidence that recent whole genome duplications are responsible for the large genome size of Picea abies and other conifers. The genome contains a high abundance of repetitive sequences (around 60%) belonging to different families of transposable elements that have expanded and inserted into the spruce genome over tens of millions years, creating large introns and a high number of pseudogenes [14]. One consequence of the large and complex genomes of conifers is that it has proven difficult to generate data on genome-wide levels of genetic variation. Early studies aimed at characterizing genetic diversity in conifers were mainly focused on assessing variation in short genic [15,16] or non-coding regions [17,18]. To be able to get a more unbiased view of genome-wide variation in Norway spruce, researchers have recently employed various complexity reduction techniques, such as genotyping by sequencing (GBS), restriction digest associated sequencing (RADseq) or targeted capture probes [19][20][21][22]. However, these methods also capture variation in a relatively small subset of a typical conifer genome and it is not clear how representative such results are.
Even with current sequencing technologies, resequencing a single conifer genome is expensive and time consuming and this effectively limits the number of individuals that can be assessed. The large and repetitive nature of a typical conifer genome also make handling of re-sequencing data difficult, due to, for instance, missing regions in the genome assembly and multi-mapping of reads derived from repetitive regions. As the first studies employing whole-genome re-sequencing have been performed in conifers [23,24] it is therefore interesting to assess how well different complexity reduction methods capture genome-wide variation in genetic diversity in a typical conifer genome. Complexity reductions methods are used to capture a smaller fraction of the target genome, and this is usually achieved by either selecting specific regions using digestions with restriction enzymes (e.g., RADseq and GBS) or by using capture probes targeting specific genomic regions. In this paper we take advantage of data generated in the first large-scale re-sequencing effort in Norway spruce [24] and use this to assess how well two commonly used complexity reduction methods, targeted capture probes [22] and genotyping by sequencing [19,21], perform in capturing genome-wide variation in Norway spruce.

Sampling
We sampled a total of 34 trees from the natural distribution range of Norway spruce (Picea abies). Individuals were collected from different populations in Finland, Sweden, Norway, Poland, Belarus and Romania (locations for all samples is shown in Table 1). Samples were taken either as dormant buds or fresh needles and stored in −80 • C until DNA extraction. The 20 individuals that were sampled in northern Sweden (Västerbotten county) were sampled from either young and planted (<20 years old, Marsfjället 12.3 and Långrumpskogen 2.3) or old and untouched (>150 years old, Marsfjället 12.1 and Långrumpskogen 2.1) stands in the same general location. Marsfjället is located close to the Norwegian border in the mountain area in the west of Sweden where as Långrumpskogen is located close to the eastern coast of Sweden.

Whole-Genome Re-Sequencing Data
The sequencing and SNP calling procedures for these samples have previously been described in detail in Bernhardsson et al. [25] and Wang et al. [24]. Briefly, DNA was extracted with a Qiagen plant mini kit (Qiagen, Hilden, Germany) according to the instructions for the WGS data. Sequencing was performed at the National Genomics Initiative platform (NGI) at SciLifeLab in Stockholm, Sweden. Paired-end libraries with an insert size of 500 bp were sequenced using either the Illumina HiSeq 2000 (Pab002-Pab006) or Illumina HiSeq X platforms (Pab007-Pab034). Samples sequenced on both platforms were analysed using the same general bioinformatics pipeline [25]. Raw sequence reads were mapped against the P. abies reference genome v.1.0 [14] using BWA-MEM with default settings. Following read mapping against the entire reference genome, the reference genome was reduced by only keeping genomic scaffolds greater than 1kb in length [26]. Before SNP calling, PCR duplicates, generated during PCR steps in sequencing library construction, were marked in all data subsets using MarkDuplicates in Picard v2.0.1 (http://broadinstitute.github.io/picard/) to eliminate artefacts introduced due to DNA amplification by polymerase chain reaction (PCR). Artefacts in read alignments occurring in regions with insertions and/or deletions (indels) were addressed using local realignment with GATK v3.7 [27]. Finally, SNP calling was performed using GATK HaplotypeCaller. Variant filtering was performed to only include biallelic SNPs that were (i) positioned >5 bp away from an indel, (ii) fulfilled GATKs quality parameters recommendations for hard filtering, (iii) had a mapping depth in the range 6-30× and (iv) a p-value for excess of heterozygosity greater than 0.05. All SNPs that passed these filtering criteria were used in downstream analyses.

Targeted Capture Probe Data
Vidalis et al. [22] developed a set 40,018 targeted capture probes that align to exonic and intronic regions in the Norway spruce genome. For all analyses in this paper we use data from the extended probe regions that consist of the 120 bp probes plus 100 bp on either side of the probe, for a total length 320 bp per probe (see Vidalis et al. [22] for more details). SNP data for all extended probe regions were then extracted from the whole-genome re-sequencing data set using BEDTools v2.26.0 [28]. We also extracted data for all scaffolds that were targeted by the capture probes using BEDTools to generate a comparable but more extensive WGS data set to compare with the data from the probe regions. Finally, we supplemented the data for the probe regions extracted from the individuals subjected to re-sequencing with a large data set consisting of 517 individuals previously genotyped using the capture probes [29], to enhance our sample for assessing population structure in Norway spruce.

Genotyping-by-Sequencing Data
Genotyping-by-sequencing (GBS) was performed using samples of Norway spruce individuals sampled from 45 populations across Västerbotten county in northern Sweden. These samples include 20 of the samples used for whole-genome sequencing (Pab016-Pab035). DNA was extracted from all individuals using the Omega Bio-Tek E-Z 96 plant kit (Omega Bio-Tek, Norcross, GA, USA). Before library preparations 30 samples from each population were pooled in equimolar concentrations with a total of 200 ng (DNA concentrations measured on a Qubit, ThermoFisher Scientific, Waltham, MA, USA) of DNA per sample. To keep the reaction volume down the pools were divided into 8 tubes and dried. Library preparation followed the methods described in Pan et at. [21] with some minor changes outlined below. DNA digestion was made using the restriction enzyme PstI (New England BioLab, Woburn, MA, USA) and during the same time adaptors were ligated to the DNA pools using one of five unique barcodes. Following DNA digestion and ligation, five populations with unique barcodes were pooled producing all together nine independent pools (each containing five populations). All samples were subsequently purified with a QIAquick PCR purification Kit (Qiagen, Hilden, Germany). The DNA was then amplified using a PCR step and purified a second time. Size selection was made with the E-gel Size-Select II pre-cast gel (ThermoFisher Scientific, Waltham, MA, USA) targeting fragments with a size in the range 350-450 bp (including the 125-132 bp barcodes and sequencing adapters). After size selection, DNA was extracted from the gel using QIAquick Gel Extraction Kit (Qiagen, Hilden, Germany). Pair-end sequencing (2 × 150 bp) of all pools were performed on an Illumina HiSeqX by Novogene (Hong Kong) with unique barcodes for all nine pools. Each pool was sequenced individually on one HiSeq X lane per pool, yielding >120 Gbp raw sequencing data per pool.
The raw sequencing data were quality checked using FastQC v0.11.8 (https://www.bioinformatics. babraham.ac.uk/projects/fastqc/) and sequencing adaptors were trimmed using Trimmomatic v0.36 [30]. The data was then demultiplexed using the process_radtags routine from Stacks v2.2 [31]. Sequencing reads for all samples were mapped against the P. abies v1.0 genome using BWA-MEM. BAM files were intersected using the multiinter tool from BEDTools v2.26.0 [28] to identify genomic regions common to all samples. SNP data for the genomic regions targeted by the GBS data were then extracted from the whole-genome re-sequencing data set using BEDTools v2.26.0 [28]. As for the probe data, we extracted data for all scaffolds that were targeted by the GBS data using BEDTools to create a comparable data set to compare with the data from the GBS regions.

Data Analysis
We first estimated relatedness among all 34 individuals using the relatedness2 option in VCFtools [32]. By allowing for the presence of unknown population structure [33], we identified two samples (Pab034 and Pab035) that showed an unusually high degree of relatedness. We therefore excluded one of these samples, Pab034, from all remaining analysis. We then used the 33 remaining individuals, together with the 517 individuals from Baison et al. [29], to assess population structure in the capture probe data to evaluate how much data is needed to capture overall patterns of population structure. Population structure was quantified using a principle component analysis on the relatedness matrix of all samples using the function "prcomp" in R v3.3.3 [34].
In order to determine how well the extended probe regions and the GBS regions capture genetic variation in the WGS data derived from the corresponding scaffold data sets, we estimated a number of summary statistics from the different SNP data sets. For these analyses we only used SNP data from the 19 individuals that were sampled in northern Sweden (Pab016-Pab033 and Pab035). We used ANGSD v0.921 [35] to calculate Tajima's D [36] and nucleotide diversity in the form of pairwise theta [37] for all extended probe regions and for all GBS regions as well as for all probe bearing scaffolds and all GBS scaffolds. We also used VCFtools to calculate F ST values between the four original populations that were sampled in northern Sweden.

Results
The extended probe regions cover a total of 12.8 Mb and the corresponding probe-bearing scaffolds cover 567.6 Mb with the probe regions make up approximately 2.3% of the genomic regions encompassed by the probe-bearing scaffolds ( Table 2). The GBS regions, on the other hand, cover 1.2 Mb and the scaffolds that encompass the GBS regions have a total size of 171.1 Mb (Table 2), yielding a coverage by the GBS regions of 0.7%. The probe data set cover a larger fraction of coding regions that the GBS data, both for the actual probe regions and the probe bearing scaffolds (Table 2). Similarly, the GBS regions and scaffolds harbor a greater fraction of sequences characterized as repetitive (Table 2). In total, 58,092 bp are common between the extended probe and GBS region data sets and a total of 1965 scaffolds are common between the two scaffold datasets. When analyzing relatedness among the WGS samples, two individuals from the old population at the coast of northern Sweden turned out to be closely related (kinship coefficient Φ = 0.526). To reduce the effect of this on downstream analyses, the sample with the largest amount of missing sequence data (Pab034) was excluded from all subsequent tests.
We used the SNP data for the extended probe regions taken from the WGS dataset to estimate population structure in the 34 WGS samples plus the 517 samples derived from Baison et al. [29]. The results show a clear separation of the samples that mirror the geography of the samples, with clusters representing northern Sweden, Norway, Finland and Central Europe (Figure 1). The sampling is, however, heavily biased towards samples from Scandinavia and Finland and hence best capture variation in the so-called Nordic domain of Norway spruce [38]. This analysis confirms earlier results on the population structure in the re-sequencing samples [24] and earlier results that have shown that stands from southern Sweden contain a large fraction of individuals that have been introduced from central and eastern Europe, including samples that are admixed between central/eastern Europe and southern Sweden (Figure 1, [29,38]). The samples from old growth forests from northern Sweden also separate according to geography (mountains vs. coastal regions, shown as dark green in Figure 1).
Forests 2020, 11, x FOR PEER REVIEW 6 of 13 southern Sweden (Figure 1, [29,38]). The samples from old growth forests from northern Sweden also separate according to geography (mountains vs. coastal regions, shown as dark green in Figure 1). The capture probes were initially selected to have intermediate levels of genetic variation [22] and this will hence result in a certain degree of ascertainment bias in the polymorphism data generated from the extended probe regions. Tajima's D does not appear to differ in any systematic way between the extended probe regions and the data generated from the entire probe-bearing scaffold set (Figure 2A, Table 3). For nucleotide diversity (pairwise theta), scaffolds appear to be slightly more variable on average than the extended probe regions, but overall the extended probe regions accurately capture both the amount and frequency spectrum of genome-wide nucleotide polymorphism.  The capture probes were initially selected to have intermediate levels of genetic variation [22] and this will hence result in a certain degree of ascertainment bias in the polymorphism data generated from the extended probe regions. Tajima's D does not appear to differ in any systematic way between the extended probe regions and the data generated from the entire probe-bearing scaffold set (Figure 2A, Table 3). For nucleotide diversity (pairwise theta), scaffolds appear to be slightly more variable on average than the extended probe regions, but overall the extended probe regions accurately capture both the amount and frequency spectrum of genome-wide nucleotide polymorphism. The GBS data display a similar pattern, only with slightly fewer extreme values in Tajima's ( Figure 3A, Table 3). There are fewer data points in Figure 3 than in Figure 2, simply because the GBS data contain a substantially smaller number of unique regions compared to the extended probe regions ( Table 2). While GBS regions have lower diversity values than their corresponding scaffolds ( Figure 3B), the site frequency spectrum summarized by Tajima's D, does not show any appreciable differences overall between regions and scaffolds ( Figure 3A). Table 3. Median and standard deviation (in parenthesis) for nucleotide diversity and Tajima's D for the different data sets.

Subset
Pairwise Theta Tajimas Genetic differentiation between populations also show little evidence for any systematic differences between targeted regions and scaffolds in both probe and GBS data sets (Figures 4 and 5, [39]). Short probe or GBS regions thus appear to accurately capture genome-wide variation in genetic differentiation. Population 2.1 is an old costal population and 2.3 a young, recently planted costal population. Similarly, population 12.1 is an old population sampled from the mountain region in Västerbotten county and 12.3 is a newly planted population also located in the mountain region. FST values between the different pairs of populations from northern Sweden all show very weak population differentiation independent on the data source and populations that are compared (median FST for probe data is −0.019-−0.0018, median FST for GBS data is −0.018-0.007). However, even if the average genetic differentiation among populations is low, there is abundant variation in genetic differentiation across the Norway spruce genome, as many genomic regions in both the probe and GBS data that show substantial genetic differentiation (FST values > 0.2).  The GBS data display a similar pattern, only with slightly fewer extreme values in Tajima's ( Figure 3A, Table 3). There are fewer data points in Figure 3 than in Figure 2, simply because the GBS data contain a substantially smaller number of unique regions compared to the extended probe regions (Table 2). While GBS regions have lower diversity values than their corresponding scaffolds ( Figure 3B), the site frequency spectrum summarized by Tajima's D, does not show any appreciable differences overall between regions and scaffolds ( Figure 3A).
Genetic differentiation between populations also show little evidence for any systematic differences between targeted regions and scaffolds in both probe and GBS data sets (Figures 4 and 5, [39]). Short probe or GBS regions thus appear to accurately capture genome-wide variation in genetic differentiation. Population 2.1 is an old costal population and 2.3 a young, recently planted costal population. Similarly, population 12.1 is an old population sampled from the mountain region in Västerbotten county and 12.3 is a newly planted population also located in the mountain region. F ST values between the different pairs of populations from northern Sweden all show very weak population differentiation independent on the data source and populations that are compared (median F ST for probe data is −0.019-−0.0018, median F ST for GBS data is −0.018-0.007). However, even if the average genetic differentiation among populations is low, there is abundant variation in genetic differentiation across the Norway spruce genome, as many genomic regions in both the probe and GBS data that show substantial genetic differentiation (F ST values > 0.2).

Discussion
In this study we found no large differences when using either WGS, capture probe data or GBS data for assessing genetic diversity in Norway spruce. The main difference between the two complexity reduction methods is what prior knowledge that is needed to develop and use a given method. Sequence capture probes are designed to target known regions in a genome and hence require access to either a draft genome or at least extensive sequence information, such as RNAseq or other targeted sequencing, from an organism in order to design them before use. GBS, on the other hand, can be applied to any organism without any prior genome information. Complexity reduction with GBS is only based on selecting a specific fragment size from a gel and is thus more 'random' with respect to the genomic regions it targets and should therefore, at least in theory, provide a more accurate representation of the genome in an organism. In our data this is most apparent from higher repetitive content contained in the different sequence data sets related to the GBS method compared to the sequence capture probe method (Table 2).
Comparing estimates of genetic diversity for the different datasets show that the scaffold data sets had less extreme values, both high and low, compared to the reduced representation regions (capture probes or GBS, Figures 2A and 3A), a simple consequence of the greater size of regions

Discussion
In this study we found no large differences when using either WGS, capture probe data or GBS data for assessing genetic diversity in Norway spruce. The main difference between the two complexity reduction methods is what prior knowledge that is needed to develop and use a given method. Sequence capture probes are designed to target known regions in a genome and hence require access to either a draft genome or at least extensive sequence information, such as RNAseq or other targeted sequencing, from an organism in order to design them before use. GBS, on the other hand, can be applied to any organism without any prior genome information. Complexity reduction with GBS is only based on selecting a specific fragment size from a gel and is thus more 'random' with respect to the genomic regions it targets and should therefore, at least in theory, provide a more accurate representation of the genome in an organism. In our data this is most apparent from higher repetitive content contained in the different sequence data sets related to the GBS method compared to the sequence capture probe method (Table 2).
Comparing estimates of genetic diversity for the different datasets show that the scaffold data sets had less extreme values, both high and low, compared to the reduced representation regions (capture probes or GBS, Figures 2A and 3A), a simple consequence of the greater size of regions covered in the scaffold data. This makes estimates of diversity in these data sets less prone to stochastic variation. For the two complexity reduction data sets, measurements of diversity GBS seems to be slightly less effective than probe data and this is likely a consequence of the method itself. Using a restriction enzyme to perform complexity reductions does not allow for precise control over where the DNA gets cut. For example, if a target cut site region has high nucleotide diversity it is possible that there is also polymorphism at sites in the base pairs making up the cut site, effectively reducing the number of cuts and also generating differences among individuals in what fragments gets included in the final GBS data. Since GBS relies on selecting a predetermined fragment size, fragments that lack a specific cut site will generate fragment sizes outside of the selection range for some individuals and this ultimately leads to missing data between different individuals and also to a relatively small overlap across individuals if no missing data is accepted. Depending on how much missing data that is allowed, the overlap of regions across individuals will vary since missing data will create regions where data is not present in every individual. This is simply a consequence of the GBS method and partly explain why the size of genomic regions covered in the GBS data is relatively small, only 1.2 Mb, when we require no missing data across individuals. If one allows for 10% or 20% missing data in the GBS results, the size of the regions covered increase to 1.4 Mb and 1.6 Mb, respectively. Repetitive regions are also problematic for downstream genotyping [25], regardless of method and the greater fraction of repetitive regions in the GBS could hence further skew the results from these regions. The overall summary of the data ( Table 2) show that the capture probes generates more data both in terms of total size and number of unique scaffolds and also, as expected, contain a greater fraction coding regions.
Using only capture probe data we were also able to reconstruct the population structure of Norway spruce trees from Sweden and the rest of Europe with high accuracy and thereby also confirming earlier results from Wang et al. [24]. For assessing population structure and genetic differentiation a relatively modest amount of data is required to accurately capture the overall patterns in the data. In the analyses of genetic differentiation among populations, both the capture probe and GBS data yield the same general results, although the GBS is slightly more variable. This is again likely caused by the lower amount of data available in the GBS data set. However, even though the GBS data has more extreme values the median F ST values are similar for all data sets. It is also worth noticing that even if median values for F ST are very low and close to zero for all data sets, there are still regions showing strong genetic differentiation with F ST values above 0.2 (Figures 4 and 5).

Conclusions
Our results show that both capture probes and GBS are performing reasonably well for assessing genetic diversity and population structure in Norway spruce. Targeted capture probes are slightly more effective and moderately variable than GBS, by virtue of targeting regions known a priori to be largely unique in the Norway spruce genome while the GBS data contains a substantially greater fraction of repetitive regions. This is not surprising since GBS selects regions anonymously and should hence more accurately reflect the high repetitive content present in Norway spruce [14]. However, GBS methods can be tailored to reduce the fraction of repetitive regions targeted, for example by using methylation sensitive restriction enzymes that allows for enrichment of unique genomic regions, such as coding regions [40]. Nevertheless, both capture probes and GBS yield the same results on average. These methods are hence useful when genotyping large numbers of samples and they greatly reduce the cost involved with genotyping a species with such a complex genome as Norway spruce. Our research is focused on large geographic areas with differences in both biotic and abiotic factors. For us to be able to show possible differences caused by forestry practices we have to be able to genotype a large number of samples, both planted and from natural stands from every area. With a cost-efficient option like GBS we now have the tools to conduct genotyping on multiple stands of Norway spruce.