Genetic Diversity and Structure of a Diverse Population of Picea sitchensis Using Genotyping-by-Sequencing

: Picea sitchensis , Sitka spruce, is of interest to forestry as both a conservation species and a highly productive crop. Its native range stretches from Alaska to California, and it is hence distributed across a large environmental cline with areas of local adaptation. The IUFRO collection, established in 1968–1970, consists of 81 provenances of commercial and scientiﬁc interest spanning this native range. We used genotyping-by-sequencing on 1177 genotypes, originating from 80 of the IUFRO provenances which occupy 19 geographic regions of the Paciﬁc Northwest, resulting in an SNP database of 36,567 markers. We detected low levels of genetic differentiation across this broad environmental cline, in agreement with other studies. However, we discovered island effects on geographically distant populations, such as those on Haida Gwaii and Kodiak Island. Using glaciation data, alongside this database, we see apparent post-glacial recolonization of the mainland from islands and the south of the range. Genotyping the IUFRO population expands upon the use of the collection in three ways: (i) providing information to breeders on genetic diversity which can be implemented into breeding programs, optimizing genetic gain for important traits; (ii) serving a scientiﬁc resource for studying spruce species; and (iii) utilizing provenances in breeding programs which are more tolerant to climate change.


Introduction
Picea sitchensis (Bong.) Carrière (Sitka spruce) is one of the seven Picea species native to North America, with a native range from Alaska 69 • N to coastal California 39 • N, occupying coastal areas, islands in the Alexander Archipelago and river regions [1]. Picea sitchensis is a species of considerable commercial importance in Atlantic Europe, where it is well adapted to the mild maritime conditions of Britain, Ireland and the coastal region from France to Denmark. In its native range, commercial interests are limited to Alaska, British Columbia and Haida Gwaii. Its natural distribution overlaps with Picea engelmannii Parry ex. Engelm. (Engelmann spruce) and Picea glauca (Moench) Voss (White spruce), which it hybridizes with to form Picea x lutzii Little, in areas such as South Alaska and Northwestern British Columbia [2]. However, Sitka and Engelmann spruce hybrids do not commonly occur. Hybrids of White and Engelmann spruce can also occur in this area, but neither hybrid is of economic importance [3]. The geographical niche occupied by Sitka spruce reflects its adaption to moist coastal areas provided by mild maritime conditions with high humidity. It continues inland until these conditions change and other species dominate. Peripheral zones are occupied largely by White spruce. This creates a large latitudinal spread and a thin longitudinal niche of Sitka along coastal areas. As a result of this, genetic isolation is likely due largely to geographical distance along the north-south range. the invasion of new insect pest species. The genotyping of this material will allow for the acceleration of breeding and genetic gain, hopefully combating productivity losses associated with climate change. This will act as a DNA bank that can be used to investigate traits and screen for resistant genotypes for breeding. In this study, we aimed to build a key resource for Sitka Spruce and North American Spruce research and will expand on existing resources that are available to research. We also described this population by using the genotyping data to highlight areas of interest to scientists and breeders.

Sample Populations
The IUFRO population was planted in 1975 and 1978 at John F. Kennedy Arboretum in county Wexford, Ireland (52.315, −6.941) [23]. This population consisted of seeds collected from 80 provenances of the original 81 available in the IUFRO population (  Table S1). Vascular cambium was sampled for DNA extractions in May 2021, using a cork borer. The samples were then frozen at −20 • C, which allowed the cambium layer to split from the bark layer. The cambial layer was removed and freeze-dried prior to DNA extraction.
tures the diversity of the native population, along with areas of interest to breeders. Secondly, a collection this large will allow us to investigate fundamental questions about its ancestry, hybridization, clinal adaptation and post-glacial spread. Finally, with the effects of climate change, it is uncertain that our current breeding stock is resilient against some of the foreseeable climatic changes, such as drought, and unforeseeable changes, such as the invasion of new insect pest species. The genotyping of this material will allow for the acceleration of breeding and genetic gain, hopefully combating productivity losses associated with climate change. This will act as a DNA bank that can be used to investigate traits and screen for resistant genotypes for breeding. In this study, we aimed to build a key resource for Sitka Spruce and North American Spruce research and will expand on existing resources that are available to research. We also described this population by using the genotyping data to highlight areas of interest to scientists and breeders.

Sample Populations
The IUFRO population was planted in 1975 and 1978 at John F. Kennedy Arboretum in county Wexford, Ireland (52.315, −6.941) [23]. This population consisted of seeds collected from 80 provenances of the original 81 available in the IUFRO population ( Figure  1) (Supplementary Table S1 Table S1). Vascular cambium was sampled for DNA extractions in May 2021, using a cork borer. The samples were then frozen at −20 °C, which allowed the cambium layer to split from the bark layer. The cambial layer was removed and freeze-dried prior to DNA extraction.

DNA Extraction and Sequencing
First, 30 mg of the freeze-dried cambial sample was transferred to deep-well plates, and three titanium beads were added to each well prior to milling. Cambium samples were milled for 1 h at 30 oscillations per minute on a Retsch MM400 mill (long milling time was used due to the woody nature of the samples). The samples were extracted by using the Machery and Nagel NucleoMag Plant DNA kit (744400.1) modified for the Kingfisher flex extraction system. The incubation time was increased to three hours to aid in the breakdown of woody tissue. A second elution step was added at the end of the Kingfisher protocol that increased the yield of DNA. Samples were quantified after extraction by using the Quant-iT PicoGreen dsDNA Assay Kit (P7589) and a BioTek Synergy HT to assess if samples were at a quantity of 30 ± 5 ng/µL. Samples were sent to LGC Genomics (Berlin, Germany), where genotyping-by-sequencing (GBS) libraries were prepared by using restriction enzymes to reduce genome complexity [24]. The restriction enzymes used for library preparation were Pstl-ApeK1, and the resulting libraries were sequenced on an Illumina platform in paired-end mode and read lengths of 150 bp to achieve a target depth of 3 M paired-end reads per sample.

Variant Calling
The FASTQ sequences generated from the GBS were aligned to the Q903_v1_1000 plus Sitka spruce genome (GCA_010110895.2) [25], using BWA-mem with default parameters [26]. Samtools (v1.9) mpileup generated a mpileup file, which was processed with bcftools (v1.9) to call variants [27]. A minimum site mapping quality of 20 was required to retain an SNP. The VCF files were summarized and visualized by using vcfR (v1.12.0) and vcftools (v0.1.16) [28] that allowed for the filtering of variants. We removed indels and only retained biallelic SNPs. The minimum genotype quality (GQ) was set to 20, and the minimum read depth was set to 5, as the majority of variants were above this threshold. The minimum allele frequency was set to 0.05. We filtered out sites with greater than 30% of missing data points, leaving 108,606 SNPs (-remove-indels -min-alleles 2 -max-alleles 2 -minDP 5 -maf 0.05 -max-maf 0.975 -minGQ 20 -max-missing 0.7). SNPs were thinned so that no two SNPs would be within 10 bp of each other, using vcftools (v0.1. 16). An investigation into the Loss of Heterozygosity (LoH e ) and Gain of Heterozygosity (GoH e ) was completed by using vcftools (v0.1.16)(-hwe -het) and dplyr (v1.0.8) to access filtering on HWE.

Genetic Diversity Statistics
The data analysis was completed by using R version 4.0.2, unless specified otherwise. The VCF file of 36567 variants was converted into a genind object by using the adegenet (v 2.15) package [29,30]. Observed heterozygosity (H o ), expected heterozygosity (H e ), gene diversity (H t and Dst), allelic richness (A R ), fixation index (Fst) and population differentiation statistics (Dest and Gst) were calculated across the entire population, using the hierfstat (v0.5-10) function basic.stats [31], and summarized per population, using dplyr (v1.0.8).

Population Structure
Here we investigated the prior assignments of the geographic regions for discernible population structure. An Analysis of Molecular Variance (AMOVA) was used to compare the given geographic regions of the IUFRO population and run by using poppr (v2.9.3) [32] on a geneclone object created from the VCF file, using adegenet [29,30]. The AMOVA was run with the geographic region being hierarchical to the provenance. A principal component analysis (PCA) was run on the dataset, using adegenet and ggpolot2 (v3.3.5), retaining 1400 principal components. A discriminate analysis of principal components (DAPC) retaining 6 discriminate analyses was applied to analyze population structure, using supervised clustering with regions as prior [32,33]. Analyses of the population structure while using undefined prior regional assignments were also conducted to investigate the population structure. Optimal genetic clusters (K) were analyzed by using snapclust in adegenet, using the Bayesian information criterion (BIC) [29]. Values for each BIC model were compared, and the optimal K was determined by using the elbow method.

Phylogenetic Analysis
The relationships between the geographic regions were investigated to show the genetic distance between regions and also highlight any population structure. The IUFRO population was clustered by geographic region and transformed into a genind object by adegenet and dplyr (v1.0.8) [29,30]. The aboot function in poppr was used to create an unweighted pair group method with arithmetic mean (UPGMA) clustering tree with 100 bootstrap replicates [32,34].

Isolation by Distance
The geographic distance amongst the populations was compared to their genetic distance to investigate gene flow and diversity. The pairwise Weir and Cockerham Fst were calculated for each pair of provenances, using the hierfstat package [31,35]. Geographic distances between provenances were measured by using the package geodist (v0.0.7). The geographic distance (km) was plotted against Fst/1-Fst, and outliers were investigated.

Genotyping a Large Sitka Spruce Population, Covering Its Native Geographic Range
The continuous development of genotyping technologies, such as GBS, has resulted in a dramatic increase in sequencing density and reduction in cost, leading to the adoption of SNP genotyping across many forestry species [16]. However, this uptake in SNP genotyping has produced a wealth of data which cannot be fully taken advantage of in most forestry tree species due to the absence of high-quality reference genomes and/or appropriate tools designed and benchmarked for large-genome species [40,41]. The availability of a draft Sitka genome, while being a fragmented assembly, allows for refined processing of the GBS data, resulting in high mapping rates. Encouragingly, the average mapping rate to the draft assembly was 84.9% across the 1184 samples (seven samples were removed due to poor mapping rates). In total, 81.9% of the reads were properly paired.
Putative SNPs were identified in the population, and standard filtering on read depth, minor allele frequency and genotype quality resulted in a final database of 108,608 variants [42]. The variant filtration applied to the database was aimed at finding a balance between false negatives and false positives. The filtering of the SNP sets is dependent on the task with, for example, association studies requiring hard filtration of the dataset [43]. Overly harsh filtering to reduce false positives would bias sampling toward frequent alleles impacting our ability to capture diversity, and population structure. This final dataset we analyzed was formed after thinning to ensure that no two SNPs were within 10 bp of each other; this was performed to remove SNPs in strong LD, as conventional LD filtering is challenging in fragmented assemblies [44,45]. Deviations in Hardy-Weinberg equilibrium (HWE) were due to GoH e (Supplementary Figure S1), indicating that deviations were due to sequencing and alignment errors rather than natural processes [46]. HWE was filtered to p > 0.001, resulting in 36,567 variants that were used in a further analysis. The sequence data were submitted to NCBI (BioProject PRJNA852515) The resulting SNP database in this study includes 36,567 variants over 1177 genotypes and captures most of the native range of Sitka, but future improvements in the genome assembly and bioinformatics tools can allow for the reanalysis of the GBS dataset [47]. Here we created a key genetic database for the IUFRO population grown in Ireland. For breeders, this database can provide a baseline of genetic diversity to compare against breeding stock and seed orchards. Most notably SNP sets have been used for evaluating the parental contributions and contamination in seed orchards, allowing for the better design and selection of parents [14]. For researchers, this database can allow for the investigation of traits of interest, using methods such as Environmental Association Analysis (EAA) [48,49]. For example, responses to climate change have been investigated by using EAA in both Lolium perenne [48] and Pinus taeda [49], utilizing SNP sets.

Geographic and Genetic Diversity
The IUFRO collection that was used in this study primarily captures a representative sample of the entire native range of Sitka, with a large diversity of habitats [12]. There are large clusters of populations sampled from around the Vancouver region, Coastal British Columbia, Alaska and Haida Gwaii (Figure 1). These areas are more populous, with larger forestry industries, leading to more seed being included in the original collection. The more isolated populations, such as those of Montague Island, Afognack Island, Kodiak Island, Southeastern Alaska and Northern California, are not as well represented in this database. The high geographic diversity and isolation of these adjacent populations does not, however, result in high genetic differentiation due to the high amounts of gene flow between populations. The genetic diversity of the geographic regions measured by H e ranged from 0.17 to 0.24, with an overall H e of 0.21. In most cases, the H o was greater than the H e , but some regions had the same H o and H e . Heterozygosity is low compared to a study using SSRs, but it is similar to what is found in a study using SNPs, thus highlighting marker differences [50,51]. Additionally, no private alleles were discovered within the geographic regions, and the overall allelic richness (A R ) was 1.198 (Table 1), further suggesting high amounts of gene flow among the provenances and geographic regions. It is in the peripheral populations where genetic differentiation is seen, notably Kodiak Island, which is the most isolated region. However, there is effective gene flow across thousands of kilometers in non-isolated populations, similar to what is seen in other conifers [52]. The efficacy of this gene flow is reduced over larger distances, but physical barriers such as the ocean result in the larger genetic differentiation of those populations. Our results for genetic differentiation amongst provenances and regions are in accordance with those of another study [4] with a reported Gst of 0.03 across eight sampling sites, from Alaska to California. H o and H e differed between studies, but it is difficult to compare with these studies due to differences in sample size, sampling range, marker type and marker number; however, the large sample size and distribution combined with detailed genotype data used in this study allows for a more complex analysis.

Genetic Structure
The AMOVA for the entire Sitka spruce population indicated that the majority of genetic variation occurred between samples with between-region genetic variation only accounting for 1.98% (p > 0.01) of the total genetic variation. The variation between provenances was 7.49% (p > 0.01), with within-provenance variation being 3.77%, yet not significant (p > 0.31). The low overall Fst value (0.023) shows low levels of differentiation within the entire population. These Fst values are similar to what are seen in previous studies and suggest an excess of heterozygosity within the populations [53]. Some regions have higher levels of differentiation (Table 2), notably Central Oregon, with an Fst of 0.16. This lack of distinct population structure is somewhat typical in conifer species with large distributions across environmental clines when gene flow is not prohibited. The PCA analysis revealed a high degree of uncertainty in population structure ( Figure 2A); however, some general differentiation is apparent, notably in regard to the Kodiak and Montague islands. The axes of the PCA account for 2.1% and 1.6% of the variation, further indicating the presence of high gene flow within and among populations of this species, as described in other studies. The indication of some structure and differentiation is seen in the DAPC ( Figure 2B), which also shows differentiation of Kodiak and Montague islands, along with Haida Gwaii, suggesting an isolation effect on these islands. The same is not seen on the other large island in the population, Vancouver Island.
The BIC supports the clustering seen in the PCA and DAPC, with estimates of K at 3/4 (Supplementary Figure S2). The BIC estimation of four optimal clusters does not take into account prior regional assignment, and the DAPC and PCA alone do not lead to any strong conclusions on population assignment, but taken together, they show a core population of Sitka on the North American mainland, with the separation of populations isolated on islands. This island effect is typical across species due to limitations on gene flow. Spruce seeds have a mean wind dispersal of 345 m and typically conifer pollen travels 4-6 km, which isolates Afognak, Haida Gwaii and Kodiak Island ( Figure 1); however, some pollen may be introduced by strong winds or by human and animal interference, which can also generate seed-mediated gene flow [15,[54][55][56]. The BIC supports the clustering seen in the PCA and DAPC, with estimates of K at 3/4 (Supplementary Figure S2). The BIC estimation of four optimal clusters does not take into account prior regional assignment, and the DAPC and PCA alone do not lead to any strong conclusions on population assignment, but taken together, they show a core population of Sitka on the North American mainland, with the separation of populations isolated on islands. This island effect is typical across species due to limitations on gene flow. Spruce seeds have a mean wind dispersal of 345 m and typically conifer pollen travels 4-6 km, which isolates Afognak, Haida Gwaii and Kodiak Island ( Figure 1); however, some The overall isolation effect is not just one of distance ( Figure 3A) but largely due to isolation based on geographical barriers, namely the open sea separating islands ( Figure 3B). The correlation coefficient of isolation by distance was 0.46 and 0.41 without these outlier islands of Kodiak and Montague. Phylogenetic structure clustering shows some distinct clustering groups with high bootstrap confidence support (Figure 4). Again, the Kodiak and Montague islands are outliers, likely due to their geographic distance from the mainland. The regions surrounding Naas and Skeena Rivers cluster away from the main population.
The AMOVA was rerun, taking into account the structure as signified by the PCA, BIC, DAPC and UPGMA, but this did not change the outcome.
The overall isolation effect is not just one of distance ( Figure 3A) but largely due to isolation based on geographical barriers, namely the open sea separating islands ( Figure  3B). The correlation coefficient of isolation by distance was 0.46 and 0.41 without these outlier islands of Kodiak and Montague. Phylogenetic structure clustering shows some distinct clustering groups with high bootstrap confidence support (Figure 4). Again, the Kodiak and Montague islands are outliers, likely due to their geographic distance from the mainland. The regions surrounding Naas and Skeena Rivers cluster away from the main population. The AMOVA was rerun, taking into account the structure as signified by the PCA, BIC, DAPC and UPGMA, but this did not change the outcome.    On the mainland, genetic clustering indicates segregation of the genotypes originating from the Skeena and Naas rivers (Figure 4). This area is a known hybrid zone with White spruce, so the genetic influence is likely to be seen between the two closely related species [2,3]. Conifers have a slow rate of speciation, so the genetic differences between White spruce and Sitka spruce may be slight in many areas of this population, especially in the hybrid zone [9]. Identifying "pure" White spruce and "pure" Sitka spruce markers in our dataset would allow for these genetic differences to be fully quantified but the effect of gene flow within the entire population and the close relationship between the two species means the recognition of "pure" Sitka or White spruce is not a realistic concept.

Genetic Admixture and Ancestry
The admixture model confirms the population structure and enlightens some of the questions about the evolutionary history of the species. We again confirm the island effects on Kodiak and Montague Islands, with no clear admixture events at all in the Kodiak region. Kodiak Island is primarily composed of the K7 ancestral population, indicating there has been little historical mixing with other populations. Montague Island is also primarily composed of a single ancestral population, K11, yet mixing has occurred with K2 and K3. This raises questions about when the isolation event occurred (Figures 5 and 6). The geological history of the Kodiak archipelago shows no land bridges between it and the mainland [57]. The geological history of the area and the history of spruce evolution leads to the conclusion that the Pleistocene glaciation created a refugium on Kodiak Island which in turn recolonized the mainland [13]. During the Pleistocene glaciation, the Cordilleran ice sheet was at its largest 18,500 years ago [58]. The ice sheet covered the area of the entire IUFRO collection range apart from Oregon and Washington states, with some debate on whether refugia existed on islands such as Kodiak or Haida Gwaii [59]. Costal On the mainland, genetic clustering indicates segregation of the genotypes originating from the Skeena and Naas rivers (Figure 4). This area is a known hybrid zone with White spruce, so the genetic influence is likely to be seen between the two closely related species [2,3]. Conifers have a slow rate of speciation, so the genetic differences between White spruce and Sitka spruce may be slight in many areas of this population, especially in the hybrid zone [9]. Identifying "pure" White spruce and "pure" Sitka spruce markers in our dataset would allow for these genetic differences to be fully quantified but the effect of gene flow within the entire population and the close relationship between the two species means the recognition of "pure" Sitka or White spruce is not a realistic concept.

Genetic Admixture and Ancestry
The admixture model confirms the population structure and enlightens some of the questions about the evolutionary history of the species. We again confirm the island effects on Kodiak and Montague Islands, with no clear admixture events at all in the Kodiak region. Kodiak Island is primarily composed of the K7 ancestral population, indicating there has been little historical mixing with other populations. Montague Island is also primarily composed of a single ancestral population, K11, yet mixing has occurred with K2 and K3. This raises questions about when the isolation event occurred (Figures 5 and 6). The geological history of the Kodiak archipelago shows no land bridges between it and the mainland [57]. The geological history of the area and the history of spruce evolution leads to the conclusion that the Pleistocene glaciation created a refugium on Kodiak Island which in turn recolonized the mainland [13]. During the Pleistocene glaciation, the Cordilleran ice sheet was at its largest 18,500 years ago [58]. The ice sheet covered the area of the entire IUFRO collection range apart from Oregon and Washington states, with some debate on whether refugia existed on islands such as Kodiak or Haida Gwaii [59]. Costal glaciation retreat occurred earliest, around 18,000 years ago, allowing for the recolonization of the species from islands, as is apparent in the admixture model with the K7 and K9 ancestral populations ( Figure 6). The southern retreat of the ice sheet was slower with regions such as Washington and Vancouver with glacial retreat occurring 15,000-16,000 years ago [59].   The K8 population occurs frequently in the hybrid zones, possibly due to hybridization events. This is also supported by the phylogenetic tree, which indicates clustering of provenances located around the Skeena and Naas rivers. This area did not become fully clear of ice until approximately 12,000 years ago, suggesting that it could have been one of the last areas to be recolonized by both Sitka and White spruce. The North Alaskan provenances have the most diverse genetic admixture; however, the region also has the largest sample size (n = 159) ( Figure 5). In the southern regions of the IUFRO collection, K3, K1 and K11 are most represented. These ancestral populations may be of interest to breeding programs, and association studies may find traits of interest in these populations.
The population genetic structuring of the species has clearly been shaped by the Pleistocene glaciation and the retreat of the Cordilleran ice sheet, which has isolated areas such as Kodiak, but remixture of some of the refugia has occurred into the main population. This answers key questions about isolation events of Sitka but raises additional questions about the evolutionary spread of ancestral populations and how they relate to East Asian spruce. A phylogenetic analysis of North American and East Asian species is required at a detailed level to piece together the puzzles of Picea ancestry. Unfortunately, such genetic resources are not all collected yet; however, here we have contributed an SNP dataset for Sitka spruce, a species which is at the forefront of the ancient land bridge between Asia and America. On the western reaches of this range, White spruce is well distributed and often hybridizes with Sitka. Future work should focus on comparing these closely related species and identifying patterns of speciation and adaptation between the The K8 population occurs frequently in the hybrid zones, possibly due to hybridization events. This is also supported by the phylogenetic tree, which indicates clustering of provenances located around the Skeena and Naas rivers. This area did not become fully clear of ice until approximately 12,000 years ago, suggesting that it could have been one of the last areas to be recolonized by both Sitka and White spruce. The North Alaskan provenances have the most diverse genetic admixture; however, the region also has the largest sample size (n = 159) ( Figure 5). In the southern regions of the IUFRO collection, K3, K1 and K11 are most represented. These ancestral populations may be of interest to breeding programs, and association studies may find traits of interest in these populations.
The population genetic structuring of the species has clearly been shaped by the Pleistocene glaciation and the retreat of the Cordilleran ice sheet, which has isolated areas such as Kodiak, but remixture of some of the refugia has occurred into the main population. This answers key questions about isolation events of Sitka but raises additional questions about the evolutionary spread of ancestral populations and how they relate to East Asian spruce. A phylogenetic analysis of North American and East Asian species is required at a detailed level to piece together the puzzles of Picea ancestry. Unfortunately, such genetic resources are not all collected yet; however, here we have contributed an SNP dataset for Sitka spruce, a species which is at the forefront of the ancient land bridge between Asia and America. On the western reaches of this range, White spruce is well distributed and often hybridizes with Sitka. Future work should focus on comparing these closely related species and identifying patterns of speciation and adaptation between the two species. Carrying out similar work in the western range of White spruce may allow for the reclassification of some samples in our database as hybrids or White spruce.

Conclusions
The IUFRO collection has been a valuable resource for scientific research of Sitka spruce and for use within breeding programs and has been distributed to many organizations across different countries. Here we built on the existing resource by genotyping 80 of the 81 provenances in the original collection. We used the data to study population diversity and found a high degree of gene flow within the population on Mainland North America, with diverse clusters occurring on isolated islands. The IUFRO collection and the genotyping data will allow for trait-association studies as the genomic resources available in Sitka spruce improve. This will be beneficial for breeders and enable further phylogenetic comparisons of spruce species of scientific interest.
Supplementary Materials: The following supporting information can be downloaded at https://www. mdpi.com/article/10.3390/f13091511/s1. Figure S1: Gain of Heterozygosity (GoHe) associated with sequencing based errors causes deviations from Hardy Weinberg Equilibrium; Figure S2: Optimal clusters as determined by Bayesian Information Criterion (BIC); Table S1: Provenances and their respective geographic regions.