Digital Commons @ Michigan Tech Digital Commons @ Michigan Tech Analysis of environment-marker associations in American Analysis of environment-marker associations in American chestnut chestnut

: American chestnut ( Castanea dentata Borkh.) was a dominant tree species in its native range in eastern North America until the accidentally introduced fungus Cryphonectria parasitica (Murr.) Barr, that causes chestnut blight, led to a collapse of the species. Different approaches (e.g., genetic engineering or conventional breeding) are being used to ﬁght against chestnut blight and to reintroduce the species with resistant planting stock. Because of large climatic differences within the distribution area of American chestnut, successful reintroduction of the species requires knowledge and consideration of local adaptation to the prevailing environmental conditions. Previous studies revealed clear patterns of genetic diversity along the northeast-southwest axis of the Appalachian Mountains, but less is known about the distribution of potentially adaptive genetic variation within the distribution area of this species. In this study, we investigated neutral and potentially adaptive genetic variation in nine American chestnut populations collected from sites with different environmental conditions. In total, 272 individuals were genotyped with 24 microsatellite (i.e., simple sequence repeat (SSR)) markers (seven genomic SSRs and 17 EST-SSRs). An F ST -outlier analysis revealed ﬁve outlier loci. The same loci, as well as ﬁve additional ones, were signiﬁcantly associated with environmental variables of the population sites in an environmental association analysis. Four of these loci are of particular interest, since they were signiﬁcant in both methods, and they were associated with environmental variation, but not with geographic variation. Hence, these loci might be involved in (temperature-related) adaptive processes in American chestnut. This work aims to help understanding the genetic basis of adaptation in C. dentata , and therefore the selection of suitable provenances for further breeding efforts.


Introduction
American chestnut (Castanea dentata Borkh.) has been a dominant tree species in its distribution range in eastern North America and one of the ecologically and economically most important species in that region [1][2][3]. The accidentally introduced fungus Cryphonectria parasitica (Murr.) Barr, that causes chestnut blight, dramatically reduced the number and vitality of chestnut trees, so that the species nowadays occurs mainly as a vegetative propagating (from stump sprouts) understory shrub [1,3,4]. Mainly three different approaches have been used to fight against chestnut blight [1]: inoculation of diseased trees with hypovirulent strains of C. parasitica, genetic engineering, and breeding of resistant trees. Treatment of cankers with hypoviruses is effective in intensively managed plantations or orchards and hypoviruses naturally occur in Michigan (USA), but biological control with hypovirulence has not been successful in forest populations in North America [5]. Genetic engineering of resistant trees has made substantial progress, but is currently limited by public acceptance and regulatory and legal restrictions. For instance, genetically modified trees to be grown in North America must be sterile or have some other means to control flowering [6], and hence, natural repopulation by sexual reproduction is impossible. Large efforts have also been made in conventional breeding to establish blight resistant chestnut trees. Specifically, backcross breeding was used to incorporate blight resistance from Chinese chestnut (Castanea mollissima Blume) into C. dentata, resulting in hybrid trees that show a lower disease incidence [7][8][9]. The development of new genomic resources [10][11][12][13][14] for different Castanea species will help to enhance breeding success for disease resistance.
Because of large climatic differences within the distribution area of American chestnut, a successful reintroduction of the species requires knowledge and consideration of local adaptation to the prevailing environmental conditions. For instance, there are indications that hybrid backcrossed chestnut trees are less cold tolerant than pure American chestnut trees [15]. Different studies investigated the distribution of genetic diversity based on different genetic markers in American chestnut populations in North America. Thus, the highest genetic diversity was detected in southwestern populations with a decrease along the Appalachian Mountains to the northeast [16,17]. Furthermore, a clinal variation of allele frequencies was observed along the Appalachian axis [16][17][18]. The observed distribution of genetic variation is most likely an effect of postglacial recolonization. Refugial areas of American chestnut were likely located far south along the Gulf Coast and the species re-migrated comparatively slowly to the north arriving in Connecticut (northeast USA) only about 2000 years ago [19,20]. Less is known, however, about the distribution of potentially adaptive genetic variation within the distribution area of American chestnut. Therefore, we investigated neutral and potentially adaptive genetic variation in nine American chestnut populations growing under different climatic conditions. The same populations were analyzed previously by Gailing and Nelson [16] with partly overlapping markers (17 EST-SSR markers as well as five chloroplast SSR markers). They found a decrease of genetic diversity from southwest to northeast but also a high genetic diversity of a population from Ontario (Canada). This population clustered together with southwestern populations from the USA. Furthermore, allele frequencies were strongly associated with longitude, and population pairs east and west of the Appalachian axis showed pronounced allele frequency differences over a small geographic range. In this study, seven additional putatively neutral SSR markers were included to separate neutral from potentially adaptive variation patterns. In total, 272 individuals were genotyped with 24 SSR markers (seven genomic SSRs (g-SSRs) and 17 genic EST-SSRs). The objectives of the study were: (1) to identify markers with signatures of selection by means of outlier tests, and (2) to associate these markers with environmental variables of the population sites.

Plant Material and Environmental Variables
DNA samples of nine American chestnut populations covering a wide portion of the species' native range were described in a previous study [16] (Figure 1, Table 1). The original sampling of the populations (dormant buds or expanded leaves) was conducted as described in Kubisiak and Roberds [17]. A total of 272 individuals (with 25 to 32 individuals per population) were investigated in the present study. Climate data (19 bioclimatic variables "bioclim", period 1950-2000, resolution: 30 arc sec) for the sampling sites were obtained from the WorldClim database [22] using the Data Extraction Tool of the Senckenberg Research Society (http://dataportal-senckenberg.de/dataExtractTool/). Based on the WorldClim data, two additional variables were calculated that may be important for adaptation in tree species: mean growing season temperature and mean growing season precipitation, where the growing season was considered to last from May 1 to 30 September. The former ranged from 16.5 °C for the Maryland population to 21 °C for the Kentucky population, whereas the latter ranged from 404 mm (Ontario population) to 628 mm (Asheville population) ( Table 1). An overview of all environmental variables can be found in Table S1. Climate data (19 bioclimatic variables "bioclim", period 1950-2000, resolution: 30 arc sec) for the sampling sites were obtained from the WorldClim database [22] using the Data Extraction Tool of the Senckenberg Research Society (http://dataportal-senckenberg.de/dataExtractTool/). Based on the WorldClim data, two additional variables were calculated that may be important for adaptation in tree species: mean growing season temperature and mean growing season precipitation, where the growing season was considered to last from May 1 to 30 September. The former ranged from 16.5 • C for the Maryland population to 21 • C for the Kentucky population, whereas the latter ranged from 404 mm (Ontario population) to 628 mm (Asheville population) ( Table 1). An overview of all environmental variables can be found in Table S1.

Data Analysis
A principle component analysis (PCA) was conducted on the climate variables to obtain principle components (PCs) for the association analysis (see below). In addition to the 19 bioclim variables, longitude, latitude, altitude, mean growing season temperature, and mean growing season precipitation were included in the analysis. The PCA was performed with the "prcomp" function in R 3.4.3 [25]. For the PCA the variables were standardized to a mean of 0 and a standard deviation of 1. For an interpretation of the PCs we calculated Spearman's rank correlation coefficients among environmental variables and PCs using the "corr.test" function in the psych 1.7.8 R package [26] with correction for multiple testing (false discovery rate (FDR) [27] with a threshold of 5%).
The GenAlEx 6.5 software [28,29] was used to calculate the number of alleles (N a ), the mean number of private alleles (P a ), the observed heterozygosity (H o ), and the expected heterozygosity (H e ) separately for the seven g-SSRs and the 17 EST-SSRs as well as across all markers. The inbreeding coefficient (F IS ) [30] and linkage disequilibrium (LD) were calculated with the Genepop software 4.7 [31] using 10,000 demorizations, 100 batches and 5000 iterations per batch for Markov chain parameters. Presence and frequency of null alleles were determined with the Micro-Checker software 2.2.3 [32]. Population structure was inferred with the STRUCTURE 2.3.4 software [33]. Two different analyses were performed: the first one based on the complete marker set, and the second one based only on potentially neutral SSR markers (markers CmSI0495, CmSI0527, CmSI0537, CmSI0559, CmSI0561, CmSI0608, CmSI0611, CsCAT7 and CsCAT24). Markers were considered as neutral when they did not show deviations from neutrality in the outlier analyses, and when they were not associated with the PCs based on the environmental variables (see below). For both analyses, the admixture model and correlated allele frequencies were selected. A burn-in period of 50,000 and Markov chain Monte Carlo (MCMC) replicates of 100,000 were used. Potential clusters (K) from 1 to 16 were tested using 10 iterations. The ∆ K method by Evanno et al. [34] was applied to determine the most likely number of K using the STRUCTURE HARVESTER 0.6.94 program [35]. The CLUMPAK software [36] was used for summation and graphical representation of the STRUCTURE results.
Outlier analyses were conducted for all loci using LOSITAN 1.0 [37] with 70,000 simulations, a FDR of 0.1, and the stepwise mutation model. Additionally, the BayeScan software 2.1 [38] was used to detect outlier loci. We used default parameters including 100,000 iterations after a burn-in of 50,000. A q-value threshold of 10% was applied to determine significant outliers. The markers used in this study were not directly derived from Castanea dentata. Therefore, sequences in which the outlier loci and loci that were significantly associated with the PCs of the environmental variables (see results) were located, were used for sequence similarity searches against Castanea dentata transcripts using the BLAST search option on the hardwood genomics homepage (https://www.hardwoodgenomics.org/blast). A general linear model (GLM) implemented in the TASSEL 2.1 software [39] was used to detect marker-environment associations. In association studies, it is necessary to account for neutral population structure [40]. The outlier analysis revealed not only significant EST-SSRs, but also two significant g-SSR loci (see below). Furthermore, loci under weak selection may not be detected by outlier approaches [40,41]. Thus, in the final association analysis only markers without signatures of selection in the outlier and initial association analyses (see below) were used to infer neutral population structure. Initially, only populations among which no population structure was detected using all markers were included in the association analyses (analysis 1: Ontario, Maryland, Kentucky, Asheville and Murphy; analysis 2: Massachusetts, New York and Portland), and hence, no Q-matrix was used as covariate in the model to correct for population structure. Tested were associations between all markers and the first three PCs (PC1, PC2 and PC3) obtained from the PCA of the environmental variables. In the final association analysis, loci that showed significant associations in the initial analyses and outlier loci were tested for associations with the three PCs in all populations. The remaining neutral loci were used to calculate the Q-matrix in STRUCTURE as covariate to account for population structure. The reported p-values are based on 1000 permutations and correction for multiple testing based on the methods by Churchill and Doerge [42] and Ge et al. [43] implemented in the TASSEL software.

Environmental Variables
The PCA showed that the first three PCs (hereafter PC1, PC2 and PC3) had eigenvalues higher than 1 and explained 92.01% of the variance of the environmental variables. For a better interpretation of the different PCs, they were correlated with the environmental variables. PC1 showed a significant negative correlation with latitude, temperature seasonality (bio4), and temperature annual range (bio7), but also a positive correlation with several climatic variables such as annual mean temperature (bio1), minimum temperature of the coldest month (bio6), or annual precipitation (bio12) ( Table 2). PC2 was significantly positively correlated with the maximum temperature of the warmest month (bio5), and PC3 was significantly negatively correlated with altitude.

Genetic Diversity and Population Structure
The number of alleles (N a ) ranged from 4.1 in the Portland population to 7.3 in the Murphy population, and the mean number of private alleles ranged from 0 (Portland) to 0.833 (Murphy) ( Table 3). The observed heterozygosity (H o ) ranged from 0.521 (Kentucky) to 0.586 (Murphy), whereas the expected heterozygosity (H e ) ranged from 0.469 (Portland) to 0.610 (Murphy). The mean fixation index (F IS ) was 0.0157 and it was significantly different from zero in six populations (Table 3), albeit no population showed consistently positive or negative F IS values for all markers (Table S3). The mean genetic diversity indices (except P a ) were lower based on EST-SSRs than on g-SSRs (Table 3), even though several EST-SSRs also showed high values of N a , H o and H e ( Table 4). The percentage of loci pairs in linkage disequilibrium ranged from 2.5% in the New York population to 9.8% in the Kentucky population, with a mean of 5.6% among all populations (Table S4). Only few markers showed evidence for the presence of null-alleles in the different populations (Table S5).
The STRUCTURE analysis revealed the most likely number of K = 2 based on the ∆K method [34] and the complete marker set, and the most likely number of K = 3 based on only potentially neutral markers ( Figure S1). Based on the complete marker set, the northeastern populations Massachusetts, New York and Portland formed one cluster, and the remaining populations a second cluster (Figure 2). Only the Pennsylvania population was not clearly assigned to one of the two clusters, but reveals a high degree of admixture. Based on potentially neutral loci, a similar clustering was observed, albeit the Pennsylvania population showed a similar admixture level as the southern populations and Ontario ( Figure S2).  The STRUCTURE analysis revealed the most likely number of K = 2 based on the ΔK method [34] and the complete marker set, and the most likely number of K = 3 based on only potentially neutral markers ( Figure S1). Based on the complete marker set, the northeastern populations Massachusetts, New York and Portland formed one cluster, and the remaining populations a second cluster (Figure 2). Only the Pennsylvania population was not clearly assigned to one of the two clusters, but reveals a high degree of admixture. Based on potentially neutral loci, a similar clustering was observed, albeit the Pennsylvania population showed a similar admixture level as the southern populations and Ontario ( Figure S2).

Outlier and Environmental Association Analysis
The LOSITAN analysis revealed five outlier loci that showed higher FST values than expected under neutral assumptions (CsCAT1, CsCAT3, CmSI0031, CmSI0600, and CmSI0594). With BayeScan no significant outlier loci were detected.

Outlier and Environmental Association Analysis
The LOSITAN analysis revealed five outlier loci that showed higher F ST values than expected under neutral assumptions (CsCAT1, CsCAT3, CmSI0031, CmSI0600, and CmSI0594). With BayeScan no significant outlier loci were detected.
Association analysis 1, in which the populations from Ontario, Maryland, Kentucky, Asheville, and Murphy (second cluster as identified by STRUCTURE) were included, revealed 8 loci significantly associated with at least one of the PCs (Table S6), whereas association analysis 2, in which the populations from Massachusetts, New York, and Portland (first cluster) were included, revealed 11 significant loci (Table S6). A total of nine markers revealed no signatures of selection in the outlier and initial association analyses and were used to calculate the Q-matrix as covariate to be used in the final association analysis based on all populations. In this analysis, 10 significant loci were found, including the loci that were also significant in the outlier analysis (Table 5). Table 5. SSR markers significantly associated with principal components.
x CmSI0327 x x CmSI0391 x CmSI0594 x CmSI0600 x x QaCA022 x X-p < 0.05, bold-loci that were significant in the outlier analysis.
The phenotypic variation explained by markers (R 2 ) ranged from 0.014 for marker CmSI0594 to 0.45 for marker CsCAT3, and was on average higher for markers significantly associated with the PCs than for non-significant markers (Figure 3). In total, 7 of the 10 significant loci were successfully assigned to Castanea dentata transcripts using BLAST against the C. dentata UniGene-transcript database (Table S7).
which the populations from Massachusetts, New York, and Portland (first cluster) were included, revealed 11 significant loci (Table S6). A total of nine markers revealed no signatures of selection in the outlier and initial association analyses and were used to calculate the Q-matrix as covariate to be used in the final association analysis based on all populations. In this analysis, 10 significant loci were found, including the loci that were also significant in the outlier analysis (Table 5). Table 5. SSR markers significantly associated with principal components.
x CmSI0327 x x CmSI0391 x CmSI0594 x CmSI0600 x x QaCA022 x X-p < 0.05, bold-loci that were significant in the outlier analysis.
The phenotypic variation explained by markers (R 2 ) ranged from 0.014 for marker CmSI0594 to 0.45 for marker CsCAT3, and was on average higher for markers significantly associated with the PCs than for non-significant markers ( Figure 3). In total, 7 of the 10 significant loci were successfully assigned to Castanea dentata transcripts using BLAST against the C. dentata UniGene-transcript database (Table S7).

Discussion
Seven g-SSRs were used to complement EST-SSR-based genotypic data of nine American chestnut populations sampling the native range that were previously investigated by Gailing and Nelson [16]. Compared to the EST-SSRs, g-SSRs revealed higher genetic diversity values (N a , H o , H e ), which was expected due to the usually higher variability of the latter marker type [44]. Only the mean number of private alleles (P a ) was higher for EST-SSRs. These private alleles may present adaptive beneficial variants, but this remains open. The reported distribution of genetic variation among the analyzed populations by Gailing and Nelson [16] could be confirmed with both the newly applied g-SSRs and the complete marker set (g-SSRs and EST-SSRs): populations from further south as well as the Ontario population in Canada showed a higher allelic diversity than populations further northeast in the USA. Furthermore, F IS values were positive for (south) western populations and negative for northeastern populations. One exception was the Massachusetts population that showed a significantly positive F IS value based on g-SSRs, while the F IS value was not significantly different from zero based on EST-SSRs. Also the population structure among populations was similar to that revealed by Gailing and Nelson [16]. The northeastern populations of Massachusetts, New York and Portland formed one cluster and the remaining populations a second cluster. Only the Pennsylvania population reveals a high level of admixture between both clusters. Since all SSR-markers were transferred from related tree species, we tested for the presence of null alleles. Only a few SSRs showed evidence for null alleles in the populations (Table S5), and hence, did likely not bias the results of our study. Further, only 5.6% of marker pairs were found to be in LD among all populations. This could be expected, since the markers have previously been mapped to 10 different LGs in C. mollissma [11].
The outlier analysis based on the Fdist approach [45] implemented in the LOSITAN software [37] revealed five outlier loci, whereas no outlier loci were detected with a Bayesian method implemented in the BayeScan software [38]. BayeScan has been shown to be more conservative in detecting outliers compared to other methods before [46][47][48].
All five F ST -outlier loci were significantly associated with PC1, four with PC2 and only two with PC3. Additionally, five more loci were associated with at least one of the PCs. The phenotypic variation explained by markers (R 2 ) was higher for significant loci (in the outlier and association analysis; R 2 : 0.08 to 0.45, mean of 0.25) compared to neutral loci (R 2 : 0.01 to 0.24, mean 0.09). The R 2 values were relatively high and at the top end of other reported values for tree species summarized in Lind et al. [49]. In association analyses, population structure can lead to spurious associations [50]. Therefore, neutral population structure is usually included as a covariate in models searching for significant marker-trait associations. Likewise, population structure based on potentially neutral markers was considered in our association analysis, but with our marker set it was challenging to reliably identify neutral loci. Since PC1 was significantly correlated with several environmental variables, but also with latitude, the associations of markers with PC1 may be biased by population structure related to geography. PC2, however, was only correlated with bio5 (maximum temperature of the warmest month), and PC3 was only (negatively) correlated with altitude. Hence, loci associated with these two PCs may indeed be involved in adaptive processes related to environmental conditions. In total, four of the six loci that were associated with PC2 and PC3 were also significant in the F ST -outlier analysis, and hence, detected by two different approaches. Therefore, the four loci CmSI0031 (LG_H, 25.3 cM), CmSI0600 (LG_J, 55.4 cM), CsCAT1 (LG_C;~45.7 cM), and CsCAT3 (LG_J; 39.0 cM) [11,51] might be the most promising ones for further analyses. Loci CsCAT1 and CsCAT3 are g-SSRs originally developed in C. sativa. Usually, g-SSRs are located within non-coding genomic regions, and hence, they are likely not directly involved in adaptive processes, but rather linked with loci under selection. Locus CsCAT1, however, was successfully assigned to a Castanea dentata transcript (AC454_contig17130_v3; uncharacterized protein) using BLAST (Table S7). Thus, this locus might be located next to or within a coding region, and therefore be directly involved in adaptive processes. Also the other two loci were successfully assigned to C. dentata transcripts: locus CmSI0031 showed similarities with the transcript "AC454_contig35613_v3," which shows homology to a Tubulin alpha chain, and locus CmSI0600 was assigned to transcript "AC454_contig7150_v3," for which no specific protein could be inferred ("uncharacterized protein"). Thus, the specific function of the genes, in which the outliers are located, remains open. Nevertheless, PC2 and PC3 are almost exclusively related to the maximum temperature of the warmest month or altitude. Hence, the loci associated with these PCs may be involved in temperature-related adaptive processes. Temperature can play an important role in the performance and adaptation of American chestnut provenances. For instance, C. dentata growth was (among others) negatively correlated with previous year August temperature [52]. In addition, Schaberg et al. [53] showed that height growth was correlated with winter shoot injury. In general, American chestnut is vulnerable to winter injury in its northern distribution area [15], and provenances differ in cold hardiness [53,54].

Conclusions
The wider genomic coverage represented by the enlarged marker set compared to Gailing and Nelson [16] revealed similar distribution patterns of genetic diversity and differentiation among the populations. By means of outlier and environmental association analysis, ten markers were identified that are significantly associated with environmental variables of the population sites. Four of these markers are of particular interest and could be involved in temperature-related adaptive processes in American chestnut. Future studies may take advantage of genomic resources that have recently been developed for chestnut species [10,11,13,14] to get a better understanding of adaptation patterns in C. dentata.

Supplementary Materials:
The following are available online at http://www.mdpi.com/1999-4907/9/11/695/s1, Data file S1: Genotypic data, Figure S1: Plots of delta K and log likelihood for each K based on the complete marker set (a), and potentially neutral markers that did not show deviations from neutrality in the outlier analysis (b), Figure S2: Clustering of individuals for K = 3 based on potentially neutral markers, Table S1: Environmental variables of the analyzed populations, Table S2: Primer information, Table S3: F IS values of the SSR markers in the different populations, Table S4: Percentage of SSR pairs with significant linkage disequilibrium, p < 0.05, Table S5: Frequency of null alleles, Table S6: SSR-markers significantly associated with principal components within the two different population clusters (a) Ontario, Maryland, Kentucky, Asheville, and Murphy; (b) Massachusetts, New York, and Portland, Table S7: BLAST results and linkage groups of the loci significantly associated with the environmental variables.