Modelling Shifts and Contraction of Seed Zones in Two Mexican Pine Species by Using Molecular Markers

: A seed zone or provenance region is an area within which plants can be moved with little risk of maladaptation because of the low environmental variation. Delineation of seed zones is of great importance for commercial plantations and reforestation and restoration programs. In this study, we used AFLP markers associated with environmental variation for locating and delimiting seed zones for two widespread and economically important Mexican pine species ( Pinus arizonica Engelm. and P. durangensis Mart í nez), both based on recent climate conditions and under a predicted climate scenario for 2030 (Representative Concentration Pathway of ~4.5 Wm − 2 ). We expected to observe: (i) associations between seed zones and local climate, soil and geographical factors, and (ii) a meaning latitudinal shift of seed zones, along with a contraction of species distributions for the period 1990–2030 in a northward direction. Some AFLP outliers were signiﬁcantly associated with spring and winter precipitation, and with phosphorus concentration in the soil. According to the scenario for 2030, the estimated species and seed zone distributions will change both in size and position. Our modeling of seed zones could contribute to reducing the probabilities of maladaptation of future reforestations and plantations with the pine species studied.


Introduction
A seed zone or provenance region is an area within which plants can be moved with little risk of maladaptation because of low environmental variation [1]. That is, because of local adaption to zone-specific environmental conditions, plants originating from the same seed zone are expected to have higher fitness than plants from outside that zone, as demonstrated by higher survival, reproduction, productivity, disease resistance and abiotic resilience in several trees [2]. Delineation of seed zones is therefore of great importance for commercial plantations and reforestation and restoration programs of tree species [3,4]. As local adaptation usually differs clinally across the landscape [5], fixed-boundary zones can be studied by considering the maximum climatic or geographic distance of transfer between the origin and planting sites, as outlined in the "floating principle" of seed transfer [6,7] This principle also recognizes that similar genotypes tend to occur in similar environments [5].
Seed zones can be delineated: (i) by conducting provenance trials to detect associations between genetic, geographic and climatic factors [8][9][10]; (ii) by correlating climate and geomorphological data; and (iii) by using climate surrogates, such as latitude, longitude and/or elevation [5]; with the risk that such variables are not always good phenotype predictors for widespread tree species (e.g., [11]). Provenance trials are however timeconsuming and usually only allow measuring phenotypic differences between individuals and populations under common conditions. Only reciprocal trials allow disclosing the contribution of phenotypic plasticity, such as the interaction between genotype and environment [1,12]. Moreover, environmental variation does not always translates into adaptive variation; it can also result in non-adaptive phenotypic plasticity [13], which is difficult to detect in the field. The use of non-reciprocal provenance trials and environmental information as proxies to determine seed zones could therefore result in pinpointing more and smaller provenance regions than is practical or necessary. However, provenance trials cannot be made for all species and, using environment-related environmental markers and comparing them to the neutral genetic structure could be more effective.
Differences from putatively adaptive molecular markers are expected to reveal genetic differentiation linked to regional adaptation more accurately than environmental variables alone, because neutral genetic differences caused by gene flow can be excluded, and genetic drift is not affected by plastic phenotypic responses. Under these conditions, genetic patterns can be used to identify optimal sites and sizes for seed zones [1,14].
In most mega-diverse countries with low political investment in applied science, like Mexico, provenance trials of tree species are scarce (e.g., [15]), and trial-based delineation of seed zones are thus unavailable. Currently, seed zones are delineated based on physiographical features or climate-based models for creating 'provinces' and 'sub-provinces', which are not species-specific [7,16]. The lack of a more detailed delineation is mainly due to the absence of precise tree species distribution models for most of the 4331 Mexican tree species [17], and the absence of spatial genetic analyses (let alone studies of local adaptation). However, responses to anthropogenic and environmental pressures are highly species-specific, and adaptively-based seed zone models are urgently required for improving current management and conservation plans of individual Mexican tree species.
Pinus arizonica Engelm. and P. durangensis Martínez are two Mexican species nearly endemic and endemic, respectively, both from the subgenus Pinus, section Trifoliae, and subsection Ponderosae [18]. While P. arizonica is evaluated as "Least Concern", P. durangensis is listed by the IUCN as "Near Threatened" due to still-ongoing extensive logging throughout its range along the Sierra Madre Occidental, which has reduced the abundance of large trees (see IUCN's Red List, Available online: https://www.iucnredlist.org/species/ 42341/2973948 (accessed on 19 September 2020)).
These two pines have clearly different ecological niches defined by temperature, precipitation and soil variables, with P. durangensis growing in warmer and more humid areas than P. arizonica (Tables S1 and S2). Knowledge about seed quality and productive potential, essential for delineating seed areas for these two pines, is scarce. For instance, production efficiency has been estimated in 46% ± 26% (SD) for P. durangensis [19,20], and between 28% ± 26% (SD) for P. arizonica [21]. However, special silvicultural recommendations for these two species, based on those findings, have not yet been proposed.
Friedrich et al. [22] reported a mean AFLP diversity of Hill ([23], a = 2) of 1.37 and significant spatial genetic structures in Pinus arizonica (var. arizonica and cooperi) at the large scale (among 12 seed stands), which may be the result of isolation by distance. In P. durangensis, Hernández-Velasco et al. [24] showed a mean AFLP diversity of Hill ([23], a = 2) of 1.36, and also significant spatial genetic structures were found among the five seed stands studied.
In this study, we used AFLP markers associated with environmental variation as a proxy for delimiting seed zones of these two economically important Mexican pines, by comparing past (1990) and future climate conditions (predicted for 2030). We expected that molecular markers related to climate were good predictors for delineating seed zones in both taxa, and that northward shifts and contractions were evident for both pines' seed zones between the two analyzed periods.

Study Area
To detect and model associations between genetic markers and climate, soil and geographical factors, we surveyed 25 Pinus arizonica seed stands and 25 P. durangensis seed stands collecting foliage in 35 trees per stand. The mean distances to the next tree were 80 m and 75 m, respectively (Table S3a,b). For selecting the study stands we followed [25], with the intention of including a representation of the large populations of natural forests of Pinus arizonica and P. durangensis in the states of Durango and Chihuahua, along the Sierra Madre Occidental in northwestern Mexico (see Figure 1). Elevation was between 1841 and 3062 m ( Table S3a,b). The prevailing climate is temperate; the annual average precipitation ranges from 625 to 1345 mm with most rainfall occurring in summer; the annual average temperature range from 8.3 to 15.6 • C (Table S1). comparing past (1990) and future climate conditions (predicted for 2030). We expected that molecular markers related to climate were good predictors for delineating seed zones in both taxa, and that northward shifts and contractions were evident for both pines' seed zones between the two analyzed periods.

Study Area
To detect and model associations between genetic markers and climate, soil and geographical factors, we surveyed 25 Pinus arizonica seed stands and 25 P. durangensis seed stands collecting foliage in 35 trees per stand. The mean distances to the next tree were 80 m and 75 m, respectively (Table S3a,b). For selecting the study stands we followed [25], with the intention of including a representation of the large populations of natural forests of Pinus arizonica and P. durangensis in the states of Durango and Chihuahua, along the Sierra Madre Occidental in northwestern Mexico (see Figure 1). Elevation was between 1841 and 3062 m ( Table S3a,b). The prevailing climate is temperate; the annual average precipitation ranges from 625 to 1345 mm with most rainfall occurring in summer; the annual average temperature range from 8.3 to 15.6 °C (Table S1).

Determination of Climate Variables
An already existing climate model [26] was used to extract 22 climate variables for each seed stand. This model is based on the Thin Plate Spline (TPS) method [27] implemented for Mexico [28], and estimates standardized monthly average values of minimum and maximum temperatures and precipitation, based on more than 200 climatic stations in Chihuahua and Durango, for the period 1961-1990 (Table S1). Point estimates for each location were obtained from a public database (available online: http://forest.moscowfsl. wsu.edu/climate/ (accessed on 10 December 2020)) by inputting geographic coordinates (latitude, longitude, and elevation).

Determination of Soil Variables
Soil subsamples were collected at a depth of 0-15 cm at the base of each of studied tree. Subsamples were combined in a composite sample (1 kg) per stand (50 samples in total), which was assumed to be representative of the soil of each entire stand. The following edaphic variables were determined: texture (relative proportion of clay, silt and sand), density (DEN, g/cm3), pH, and potassium (K, ppm), magnesium (Mg, ppm), sodium (Na, ppm), copper (Cu, ppm), iron (Fe, ppm), manganese (Mn, ppm), zinc (Zn, ppm) and calcium concentrations (Ca, ppm), which were measured according to Castellanos et al. [29]. Furthermore, phosphorus (P) concentration (ppm) was determined as in Olsen et al. [30], and nitrate amounts (NO 3 , kg/ha) were determined following Baker et al. [31] The relative content of organic matter (OM, %) was measured as in León and Aguilar, [32], and electrical conductivity (CE, dS/m) was determined according to Vázquez and Bautista, [33]. Cation exchange capacity (CEC, meq 100 g soil), the relative proportions (%) of hydrogen, Ca CEC , Mg CEC , K CEC , Na CEC and "other bases" (OB) were determined with the ammonium acetate method (pH 8.5) [34]. Finally, hydraulic conductivity (HC, CM/h) was determined following Mualem [35], and saturation percentage (SAT, %) according to Herbert [36]. Mean values per stand are available in Table S2.

Cluster Analysis
To detect the number of environmentally homogeneous seed stand clusters, we used the Affinity Propagation (AP) clustering technique, using the quantile (q) of input similarities [37]. This method includes all data points as potential "exemplars". Furthermore, AP has several advantages over related techniques, such as k-centres clustering, the expectation maximization (EM) algorithm, Markov chain Monte Carlo procedures, hierarchical clustering and spectral clustering (see details in [38]). More importantly, a pre-defined number of groups is not needed [37]. For all the 25 seed stands per tree species, the AP was applied to all the climate and soil variables together (Tables S1 and S2). All analyses were implemented using the "apcluster" package for R [37].

Genetic Analysis
DNA data from the 35 needle samples per stand were analyzed using the AFLP technique, according to the protocol described by Vos et al. [39] and modified by Ávila-Flores et al. [40]. DNA was extracted using the QIAGEN DNeasy96 Plant Kit (Qiagen, Venlo, The Netherlands), and digested with restriction enzymes MseI and EcoRI. PCR amplification was carried out with double strands MseI and EcoRI adaptors, to bind to the end of the restriction fragments and produce DNA templates. Pre-amplification used the E01/M03 (EcoRI-A/MseI-G) primer combination, and selective amplification E35 primers (fluorescently marked with FAM) and M63 + C (MseI-GAAC), for both species. All PCR reactions were performed in a Peltier ThermoCycler (PTC-200 version 4.0, MJ Research, St. Bruno, Quebec, Canada). Products were electrophoresed in a genetic analyzer (ABI 3100 16 capillaries) together with an internal size standard (ROX fluorescent dye GeneScan 500 ROX from Applied Biosystems, Foster City, CA, USA). AFLP fragments were measured with GeneScan 3.7 and Genotyper 3.7 packages (Applied Biosystems) after confirming quality and reproducibility by using 16 randomly selected reference samples, which were duplicated within plates in independent analysis (PCR replicates). Only those peaks observed in replicated samples were retained [40,41]. Markers were coded in a binary matrix as presence (code 1) or absence (code 0) of specific peaks [42,43].

Outlier Detection
Outlier loci were identified with BayeScan v 2.1 [44,45], which is a software based on the multinomial Dirichlet model and uses the Jump Markov Chain Monte Carlo Reversible algorithm (RJMCMC) to obtain the posterior distributions. Inbreeding coefficients (F IS ) used to estimate allele frequencies were not able to be estimated from binary data as purely dominant AFLP markers [46]. In BayeScan, F IS moves freely within its prior range in order to incorporate the uncertainty associated with F IS . A positive value of the locusspecific component (α) and a posterior probability >0.99 indicate differential selection (false discovery rates < 0.01), whereas negative α values with posterior probabilities >0.99 indicate possible balancing or purifying selection [45,47]. The following parameters were used: output number of iterations (5000), thinning interval size (10), pilot runs (20), length of pilot runs (5000), burn in (50,000), prior odds for the neutral model (10), lower boundary for uniform prior on the inbreeding coefficients F IS (0) and the upper boundary for uniform prior on F IS (1).
In addition, genetic differentiation (δ) among populations and D j components for each AFLP locus [48] were used to identify loci putatively under differential selection. The non-parametric measure δ has various conceptually desirable properties and is based on the genetic distance (d 0 ) [49], measured as the genetic difference between two biological collectives through pairwise comparison of genotypes, with a value of zero indicating genetically identical collectives, and a value of one completely different stands. Consequently, if the observed δ is smaller than 99.99% of the expected δ (i.e., p < 0.0001), the existence of non-random homogenizing forces (i.e., gene flow, balancing selection) is suspected, but if the observed δ is higher than 99.99% of the expected δ (i.e., p > 0.0001), non-random diversifying forces (differential selection or non-recurrent mutation) are inferred (see [48,50]). δ and its p value were calculated in GSED 3.0 ( [51]; Available online: http://www.uni-goettingen.de/forstgenetik/gsed (accessed on 6 June 2020)) and checked in the self-written C program "Delta" in Linux).

Determination of AFLP Diversity
We analysed the genetic diversity of each seed stand through the mean genetic diversity (dg [23]), the proportion of polymorphic fragments (pr poly ) and the DW index (which quantifies rare markers within cohorts). This last index is the sum of the proportion of occurrence of each marker per stand divided by the number of occurrences of the same marker in all the stands together. DW values are expected to be high when rare markers are accumulated within a stand [52]. The median values of dg, pr poly and DW observed in seed stand clusters which grow in similar environmental conditions, were compared with the overall median dg, pr poly and DW of the 25 seed stands together per tree species. It has been reported that differential selection caused by environmental conditions reduces genetic diversity [48]. Therefore, if differential selection occurs in a given cluster, the overall genetic diversity is expected to be higher.

Analysis of Molecular Variance and of Linkage Disequilibrium
To detect genetic differentiation between clusters, we further performed a hierarchical Analysis of Molecular Variance (AMOVA) [53] with GenAlEx 6.5 using 9999 permutations. Such differentiation was determined using the phiPT value (an analogue of the fixation index, F ST ) which allows for suppressing the within-cluster variance.
Natural selection, supporting adaptation to local environmental conditions, will increase overall linkage disequilibrium (LD) caused by differences in allele frequencies among subpopulations whenever alleles at different loci are favored. According to [54], the F statistic, when partitioning overall LD, is appropriate when trying to determine whether differences in LD result only from differences in allele frequency or from other factors, such as selection, that differ among populations [55][56][57].
In order to detect signs of possible selection, observed linkages among AFLP against expected distributions from permutation were evaluated using the standardized index of association (r d ) [58]. This index corresponds to the corrected (standardized) difference of the observed variance (as the sum of the individual variances of the AFLP) divided by the expected variance (as the sum of all the variances over all the AFLP together). r d accounts for the number of AFLP sampled, which is less biased when calculated with the "poppr" package [59] in R and the p value is evaluated with 999 and 9999 permutations. "poppr" was created to perform analysis of populations with mixed reproductive patterns and allows analyzing haploid and diploid dominant/co-dominant marker data, including also AFLP. r d was calculated among all the AFLP, excepting all the outliers (i.e., only the putatively neutral AFLP) within each cluster of seed stands growing in similar environmental conditions and also for the whole sample of trees together (i.e., the overall r d ).
If differential selection occurs, the overall r d value is expected to be significantly higher (p < 0.01) when using only the outlier AFLP found, than the r d values among each cluster of seed stands growing in similar environmental conditions when using all the AFLP without outliers (i.e., putatively neutral AFLP). Such difference was verified with a non-parametric Kruskal-Wallis test [60] in R (version 4.0.2).

Modelling Seed Zones from Environmentally-Associated AFLP Variants and Environmental Factors
To evaluate the ability of climate, soil and geographic variables to explain the variation of outlier loci per species, we applied a multi-tiered variance partitioning method (e.g., [61]) to the results of a redundancy analysis (RDA) after including the Hellinger's transformation, as implemented in the vegan R package (rda and varpart functions) [62,63].
We then evaluated the relationships between the relative frequencies (f r ) of each outlier locus and the climate, soil and geographical variables of each species through covariance coefficients (C) with their respective p (Z ≥ C) values [48]. Since those variables are mostly nonlinear, we used methods that detect monotonous, but not necessarily linear, covariation. C allows for such a detection, where values close to 1 indicate highly positive covariation and those close to −1 a strictly negative covariation. C was calculated as: where C = Covariation between each genetic variant at an outlier locus at the species X and each variable Y in the study sites. (X i − X j ) = difference in the genetic variant values of each studied species X, between the i-th and the j-th loci. (Y i − Y j ) = difference in value of the variable Y, between the i-th and j-th studied stands. The denominator of Equation (1) is the sum of absolute values of the products indicated. We selected no-collinear environmental variables (|r| < 0.7) that were significantly correlated to the relative frequency (f r ) of outlier loci (C) for modelling six species-specific machine learning algorithms (MLM) of f r , all including cross-validation (CV). We used the following methods: (i) linear regression ("lm"), (ii) Random Forest ("rf"), (iii) Neural Network ("nnet"), iv) Model Averaged Neural Network ("avNNet"), (v) Multi-Layer Perceptron ("mlpWeightDecay") and (vi) Bayesian Regularized Neural Networks ("brnn") [64,65] (Available online: http://topepo.github.io/caret/index.html (accessed on December 2020)) in R (version 3.3.4) [63] (see more in [11]). Models were tested 15 through 10-fold crossvalidation, by using 80% of the dataset as a training set and the remaining 20% as a test set. The maximum number of variables per model was determined by the rule of ten events per variable [66,67], i.e., two selected variables for P. arizonica and two for P. durangensis with 25 stands (repetitions). We evaluated the goodness-of-fit of the regression using the (pseudo) coefficient of determination (R 2 ), the root mean square error (RMSE) and mean squared error (MSE). The associations between f r and important environmental variables were represented graphically using the "ggplot2" package and the function (method = 'gam'), for generalized additive models (GAM) [68] in R (version 3.3.4) [63].

Species Distribution Models
Species distribution models (SDM) were generated using the selected climate, soil, geographical and topographical variables. Random forest classification (RF) [69] was used to calculate, map and delineate each defined seed zone and its potential shift. We used the geographical coordinates of presence-absence points (for P. arizonica 615 presence and 4473 absence locations, and for P. durangensis 1119 presence and 3757 absence locations) to project the distribution depending on climate, soil, geographical and topographical variables in the study area. Climate data were obtained at a 30-arc second resolution (about 800 m) from the WorldClim database (Available online: https://worldclim.org/ data/bioclim.html (accessed on 13 December 2020)). Models were initially generated for the 1960-1990 period, and then projected to the year 2030, according to a future emission scenario: RCP~4.5 Wm −2 , or approximately 650 ppm of CO 2 -equivalent in the year 2100 (Tables S5 and S8). In the A1FI and A2 scenarios, Houghton et al. [70] predicted 449 and 444 ppm for 2030. This early-future climate scenario was chosen to better verify the models, and also given the lack of other available climate models that predict low CO 2 concentrations. We obtained soil variables from Available online: https://soilgrids.org/ (accessed on 13 December 2020)) (Tables S7 and S10), and the topographical data from a digital elevation model of INEGI (Available online: https://www.inegi.org.mx/ (accessed on 25 November 2020)) (Tables S6 and S9).
Using the Waikato Environment for Knowledge Analysis (WEKA) open source software [71], the potentially most important regressor variables were selected by the Wrapper method [72] to fit the random forest (RF) algorithm. We used a 10-fold cross validation approach, repeated 10 times [73,74]. The final presence-absence models were applied to environmentally spatial variables, which were resampled to a 30 arc-seconds resolution, to generate spatially continuous maps. The predictive ability of the RF model was further evaluated by the True Skill Statistic (TSS) [75], using WEKA (for more details, see [76]. The TSS (also known as the Hanssen-Kuipers discriminant) is an appropriate alternative to the area under a Receiver Operating Characteristic (ROC) Curve (AUC; [77]) in cases where model predictions are formulated as presence-absence models. The TSS accounts for both omission and commission errors and is not affected by the sample size of each class. The TSS is defined as sensitivity + specificity, and ranges from −1 to +1, where +1 indicates a perfect classification model and TSS values of zero or less indicate model performance no better than random [75,78].
The Matthews correlation coefficient (MCC) was also calculated, which is reported as a more informative and truthful score for evaluating binary classifications than accuracy (ratio between the number of correctly classified samples and the overall number of samples) [79]. We calculated the threshold of probability of presence (PoP) using the average PoP value that maximized the sum of sensitivity and specificity, and the PoP value that minimized the difference between the absolute values of sensitivity and specificity. The intuitive, and most commonly used, threshold of 0.5 may actually not be the threshold above which the species is most likely to be present [80].
Although adaptation is expected to vary clinally across the landscape [7], we created fixed-boundary zones. As survey data of spatial soil variables correlated with f r are not available, two seed zones per AFLP were classified and mapped, one zone including the dominant climatically AFLP variant (f r ≥ 0.5) and the other including the recessive variant (f r < 0.5). Thus, for more climatically-associated and edaphically-associated AFLP, the resolution of mosaic seed zone classes will be finer (2 n AFLP zones). The SDMs included seed zones that were mapped and for which their size could be calculated. Finally

Amount and Distribution of Genetic Diversity, AMOVA and Linkage Disequilibrium
We identified 371 polymorphic bands between 75 and 450 base-pairs long for Pinus arizonica and 369 for P. durangensis. Outlier analysis pinpointed 11.1% of these bands in P. arizonica and 12.5% in P. durangensis (Table S4).
The overall genetic diversity was not significantly higher than the median genetic diversity among the clusters of seed stands separated by climate, topographical and soil variables; either using all AFLP together or using only the outlier loci. However, AMOVA and LD estimates support the occurrence of selection signs in these three environmentally determined seed stand clusters. For each species, the AMOVA shows that the most differentiated clusters (Cluster 2 for P. arizonica and Cluster 3 for P. durangensis) responding to climatic and edaphic conditions were also the most genetically differentiated. Besides, the overall r d value using the outliers AFLP was significantly higher than the average r d among clusters (p < 0.05) (Tables S14-S24 and Figures S9 and S10).

Modelling Seed Zones through Environmentally Associated Variants
For P. arizonica and using all 42 outliers, the redundancy analysis revealed that the edaphological, climate and geographic variables explained 18.7% (p = 0.003) of the total variation of the retained outliers. The climatic variables proved most suitable for predicting the relative frequency of outliers (14.4%, p = 0.01), followed by soil variables (5.5%, p = 0.04). Climate variables further showed the largest independent effect (13.2%, p = 0.02) on the variation of outliers, again followed by the edaphological variables (4.2%, p = 0.06) (Figure 2, Table S12). On the other hand, the redundancy analysis for P. durangensis did not reveal any significant relationships for any group of variables (edaphological, climate or geographical) for explaining variation using all 47 outliers. Using individual outlier loci, we observed four moderate but significant (p < 0.001) positive covariations between the relative frequency of some outliers (209, 314, 349 and 351) in P. arizonica and soil phosphorus concentration (P) and spring precipitation (SPRP; Table 1). For P. durangensis, two significantly negative and moderate covariations were  Using individual outlier loci, we observed four moderate but significant (p < 0.001) positive covariations between the relative frequency of some outliers (209, 314, 349 and 351) in P. arizonica and soil phosphorus concentration (P) and spring precipitation (SPRP; Table 1). For P. durangensis, two significantly negative and moderate covariations were detected (C = −0.91 and −0.81; p < 0.001) for the relative frequency of two outliers (240 and 416 fr AFLP ) with WINP and LAT, and positive covariations were observed for four other markers (236, 406, 416 and 342) and SPRP, MTWM, latitude and aridity index (AAI; Table 1).  For P. durangensis, the best models included variables WINP, SPRP, LAT and AAI for predicting the frequency of AFLP variants 236, 240, 342, 406 and 416. Regarding P. arizonica, the best model for environmentally associated AFLP included phosphorous as a predictive variable. Figure 3 graphically represents the association between marker frequency and several predictor variables, for selected results from Table 1.

Species Distribution and Seed Zone Models for Pinus arizonica and P. durangensis
The best species distribution (SDM) models for P. arizonica and P. durangensis for the two-time frames considered (1961-1990 and 1990-2030; RCP~4.5 Wm −2 ) were respectively elaborated with 18 and 21 independent environmental variables (Figures 4 and 5, Figures S1-S3 and Tables S5-S10).
The most important variables for projecting P. arizonica distribution were the mean temperature of the warmest quarter, the precipitation in the driest month, the temperature seasonality, the mean temperature in the coldest quarter, and the mean diurnal range of temperature. For P. durangensis, the most important variables were precipitation in the driest quarter, soil pH, the mean temperature in the driest quarter, bulk density, and temperature seasonality (Tables S5-S10 and Figures S4 and S5). Relatively good validation statistics were obtained for both species' models, with AUCs of 0.92. However, the TSS only showed values of 0.41 and 0.59 for P. arizonica and P. durangensis, respectively (Table 2).
For P. durangensis, the best models included variables WINP, SPRP, LAT and AAI fo predicting the frequency of AFLP variants 236, 240, 342, 406 and 416. Regarding P. arizon ica, the best model for environmentally associated AFLP included phosphorous as a pr dictive variable. Figure 3 graphically represents the association between marker fr quency and several predictor variables, for selected results from Table 1.

Species Distribution and Seed Zone Models for Pinus Arizonica and P. Durangensis
The best species distribution (SDM) models for P. arizonica and P. durangensis for the two-time frames considered (1961-1990 and 1990-2030; RCP ~ 4.5 Wm −2 ) were respectively elaborated with 18 and 21 independent environmental variables (Figures 4 and 5, Figures S1-S3 and Tables S5-S10).  Table 1). The species distribution was modeled by 18 environmental variables and the Random Forest classifier (Tables S5, S6 and S7a,b); the threshold for probability of presence (PoP) was 0.30; black points = P. arizonica seed stands studied here (Table S3a); blue color = relative frequency of the AFLP outlier locus 351 between 0.0 and 0.50; and red color = relative frequency of the AFLP outlier locus 351 between 0.51 and 1.0.  Table 1); the  Table 1). The species distribution was modeled by 18 environmental variables and the Random Forest classifier (Tables S5, S6 and S7a,b); the threshold for probability of presence (PoP) was 0.30; black points = P. arizonica seed stands studied here (Table S3a); blue color = relative frequency of the AFLP outlier locus 351 between 0.0 and 0.50; and red color = relative frequency of the AFLP outlier locus 351 between 0.51 and 1.0.

Species Distribution and Seed Zone Models for Pinus Arizonica and P. Durangensis
The best species distribution (SDM) models for P. arizonica and P. durangensis for the two-time frames considered (1961-1990 and 1990-2030; RCP ~ 4.5 Wm −2 ) were respectively elaborated with 18 and 21 independent environmental variables (Figures 4 and 5, Figures S1-S3 and Tables S5-S10).  Table 1). The species distribution was modeled by 18 environmental variables and the Random Forest classifier (Tables S5, S6 and S7a,b); the threshold for probability of presence (PoP) was 0.30; black points = P. arizonica seed stands studied here (Table S3a); blue color = relative frequency of the AFLP outlier locus 351 between 0.0 and 0.50; and red color = relative frequency of the AFLP outlier locus 351 between 0.51 and 1.0.  Table 1); the  Table 1); the species distribution was modeled by 21 environmental variables using the Random Forest classifier (Tables S8, S9 and S10a,b). The threshold for probability of presence (PoP) was 0.35; black points = P. durangensis seed stands studied (Table S3b); green color = relative frequency of the adaptive AFLP-240 assumption between 0 and 0.50; red color = relative frequency of the adaptive AFLP-240 assumption between 0.51 and 1; blue color = relative frequency of the AFLP outlier locus 416 assumption between 0 and 0.50; and yellow color = relative frequency of the AFLP outlier locus 416 assumption between 0.51 and 1. Two seed zones were defined for P. arizonica based on the 1990 projection (Figure 4), which were defined by the relative frequency of AFLP 351 (fr) (>0.5 and ≤0.5) modelled against spring precipitation ( Table 1). The seed zone for the major allele (fr > 0.5; shown in red in Figure 4) covered an area of about 1,859,761 ha (and included all seed stands studied), and the second seed zone, which mostly accounts variation of the minor allele (fr ≤ 0.5; shown in blue in Figure 4 Table 3).  For P. durangensis, four seed zones were identified; these were based on the distribution of markers 240 and 416 modelled against spring and winter precipitation (SPRP and WINP) ( Table 1). The first seed zone (red) covered an area 11,723 ha, the second one (yellow) an area of 347,708 ha, the third one (green) spanned over 790,409 ha (and included seed stands PD-AN, PD-CE and PD-PB), and the fourth one (blue) covered an area of 2,254,168 ha. Thus, the total modelled area for this species was 3,404,008 ha in climatic conditions of 1990 ( Figure 5).
The size and position of all P. durangensis seed zones will change by 2030. The first seed zone (red in Figure 5) will expand 309 ha and move 56 km to the northwest; the second zone (yellow in Figure 5) will expand to an area of 482,322 ha and move 67 km to the northwest; the third zone (green) will also expand until reaching an area of 1,056,237 ha and will shift 105 km to the northwest. Finally, the fourth zone (blue) will decline by 25.4% and move 100 km to the northwest. The total distribution of P. durangensis will therefore  Table 3 and Table S13).

AFLP-Environment Associations
In this study, we found that seed stands growing in similar climate-soil conditions are genetically similar, either using all AFLP or only using the outlier loci (Tables S14-S21). However, the genetic diversity was similar among the three environmentally homogeneous seed stand clusters, probably caused by: (i) the large populations with high gene flow and small gene drift and similar selection degree [84] in which the seed stands grow and (ii) using all (mostly neutral) AFLP. The redundancy analysis model (RDA) for P. arizonica showed that the explained outlier variance was especially associated with climate ( Figure 2 and Table S12). Higher proportions of the environmental-associated variance were also explained in similar studies with other species. For instance, Capblancq et al. [85] reported that 35.2% of the variance was explained by climate variables in a study on Fagus sylvatica involving 36 locations, 570 trees and 65 outlier SNPs. Cullingham et al. [86] reported 27.3% of the variance was explained by climate variables, in a study of Pinus contorta, based on 36 localities. However, in a study of Quercus lobata involving 13 locations, 45 trees and 195 SNPs, Sork et al. [87] found that only 6% of the total variation was explained by climate variables. All of these studies also reported higher proportions of explained variance, of respectively 47%, 15% and 68%. However, none of the studies included soil variation, which in our case accounted for up to 5.5% of the explained environmental-associated variance ( Figure 2 and Table S12). A similar result was observed for Quercus liaotungensis [88].
Some outlier loci were significantly associated with climate, edaphic and geographical variables, especially spring and winter precipitation, and phosphorus (P) concentration in the soil (Table 1 and Figure 3). Previous studies on plant species have reported correlations between both climate (rainfall/temperature) and geographical variables and outlier loci [89][90][91][92][93][94]. Temperature-linked spatial and temporal trends in AFLP frequency have also been reported for Fagus sylvatica L. [90], in which an association between AFLP and genecological prediction including 35 climate variables was observed. However, plant studies involving relationships between AFLP and soil traits are scarce; one of the few studies on this topic was presented by Quintela-Sabarís et al. [95], who reported associations between AFLP and the Mn and Sb total soil content in Cistus ladanifer L. Soil-associated AFLP differentiation has also been observed in Picea chihuahuana M. and attributed to different copper concentrations in the topsoil [96].
Phosphorus is a vital macronutrient for plant growth and development. Various genes confer adaptation to inorganic P, mainly by regulating P acquisition, internal remobilization, metabolic changes and signal transduction [97,98].
In the present study, the regression models for the relative frequency of some outlier loci by AFLP showed a similar fit for some soil and climate variables, also confirmed by the root-mean-square error (0.13-0.18 for the P concentration in the soil and 0.09-0.15 for climate variables) and the cross-valuated pseudo R 2 (0.45-0.62 for P in the soil and 0.30-0.69 for climate variables) ( Table 1 and Figure 3). The mean differences between extreme AFLP frequencies were 0.71 (range 0-0.71) and 0.74 (range 0-0.74) for the winter and the spring precipitation gradients, respectively (Table S11). Hornoy et al. [99] reported RF regression models of SNP allele frequency in Picea glauca with climatic gradients using 46 SNPs (from 11,085 SNPs) of those most strongly associated with climate, reaching a plateau at around 10 SNPs (pseudo R 2 = 0.50-0.55); the same authors found that the mean differences between extreme SNP allele frequencies were 0.31 ± 0.08 (range 0.04-0.46, n = 39) and 0.32 ± 0.08 (range 0.21-0.47, n = 11) for the gradients of mean annual temperature and the total annual precipitation, respectively. Lotterhos et al. [100] found an association between each of 801 top candidate SNPs (out of 585,270 exome SNPs, n > 250) and 18 climate and three geographical variables (Spearman's R 2 ≤ 0.36). Therefore, our models relating AFLP frequencies and some environmental variables (Table 1) can be considered powerful enough to estimate the spatial distribution of some environmental-associated AFLP, in the two pine species under study. Joint analysis of the covariation, regression models (Table 1, Figure 3) and RDA ( Figure 2) suggest that many outlier loci found were probably associated with other variables not included in our study. Sample size and design, marker type, and environmental cline can influence the detection of association of adaptive genetic traits with environmental variables [87].

Species Distribution and Seed Zones Models
Given the large number of climate, soil, geographical and topographical variables considered, and the many presence points and recognized statistical techniques [101,102] used to build our models, we were able to produce SDMs for both P. arizonica and P. durangensis supported by good validation statistics (AUC = 0.92) in comparison to other studies on tree species [101,103]. In contrast to Castellanos-Acuña et al. [7] and Shirk et al. [101], our study showed that shifts in species and seed zones distribution may take place from the southeast to the northwestern of the Sierra Madre Occidental (Table 3) probably since these mountains run southeast-northwest.
Castellanos-Acuña et al. [7] determined the concept of global seed zones for the whole of Mexico, independently of tree species. However, our study indicated that both the pine species and their genetic variants studied, respectively, were environment-associated (Tables S5-S10, Table 2). Therefore, the seed zones must be delimited for specific species, in order to achieve higher rates of survival of future reforestations.
In this work, of the total AFLP outliers, we found only 4-5 AFLP which were significantly directly or indirectly associated with specific climate and soil variables (Table 1). Such a few loci are probably not enough to delimit seed zones, particularly as we have not performed studies on associations between individual fitness or population density and these markers' frequencies. On the other hand, only one or a few mutations in a single gene can be sufficient to dramatically impact tree fitness. E.g., Liu et al. [104] found Pinus monticola pathogenesis-related gene PmPR10-2 alleles as defense candidates for stem quantitative disease resistance against white pine blister rust (Cronartium ribicola) often provoking a lethal infection. Likewise, Hernández-Velasco et al. [21] found strong significant association (R 2 > 0.72, p < 0.00004) between seedlings' growth traits which were growing in the same climate and soil conditions and mean annual precipitation, winter precipitation, mean annual temperature, length of the frost-free period and maximum temperature in the warmest month in 20 provenances of five Mexican pine species (including P. arizonica and P. durangensis). However, although it cannot be ruled out that there are other genetic evolutionary factors, as selection, leading to stronger AFLP differentiation in the genome, various studies have used only few AFLP outlier [45,105] to detect natural selection, adaptation and provenances regions (e.g., [89,[106][107][108]. The 4.5 Wm −2 projections used for the period 1990-2030 predict higher temperatures [70] than are likely for this period resulting in smaller predicted distributions for both P. arizonica and P. durangensis than in reality. Our SDMs indicated that the species distributions were slightly larger in 2019 than in the corresponding forest management plans for the same year (Table 3), indicating that climate change between 1990 and 2019 (and 2030) can only partly explain the strong reductions of the species and seed zone distributions found in reality.
Thus, other factors such as forest fires, plagues, diseases, species over-exploitation, along with clearing to make room for pastures for grazing cattle may also contribute to reducing the species distributions. In this context, the main silvicultural system "Método Mexicano de Ordenación de Bosques Irregulares (MMOBI)", applied to forest management in the Sierra Madre Occidental, is based on selective extraction still focused mainly on large trees, and relatively little attention is given to species composition in the natural regeneration [109,110]. This silvicultural system may therefore tend to over-exploit commercially attractive tree species, such as P. arizonica and P. durangensis, in the typically uneven-aged mixed pine-oak forests.
To reduce the over-exploitation of these two species, we recommend adapting or changing the current forest management system [111], making correct selections including different diameter categories of the trees to be harvested, conducting a strict, independent and professional state supervision and, promoting higher levels of forest knowledge [112]. Alternative regeneration methods, such as irregular group shelterwood (Expanding Gap Silviculture "Femelschlag"), should also be considered, in order to improve natural regeneration. Complementary reforestation and plantation with appropriate seeds, taking into account the optimal seed zones and their shifting, found in this study, may help enhance the regeneration of the forest [111].

Conclusions
In conclusion, this research shows that environmental and genetic traits may be useful to delimit seed zones. However, AFLP have limits when predicting local adaptation [43,[113][114][115], especially if only a small number of populations and AFLP related to environmental variables are used. Therefore, our delimitation of proposed seed zones should be understood as a working hypothesis for subsequent studies. In order to find a better delimitation of seed zones, we recommend checking our results using genome-wide clearly adaptive molecular markers, such as SNP [1], paired with reciprocal provenance trials.
However, to restore populations of native tree species, seed collection limited to a very local area is commonly recommended. Given that adaptation is specific for the two economically important pine species studied and their seed zones included in this research, complementary studies are needed to evaluate the possible differences of using local versus non-local seeds. Consequently, more research on plant provenances and seed zones restoration is recommended, aiming to achieve best practices in forest restoration and conservation. Additionally, a reforestation guide taking into account the issue of the distribution reduction and the shifting of the studied species and their seed zones, could be helpful to improve forest management.