Genomic Variation Shaped by Environmental and Geographical Factors in Prairie Cordgrass Natural Populations Collected across Its Native Range in the USA

Prairie cordgrass (Spartina pectinata Link) is a native perennial warm-season (C4) grass common in North American prairies. With its high biomass yield and abiotic stress tolerance, there is a high potential of developing prairie cordgrass for conservation practices and as a dedicated bioenergy crop for sustainable cellulosic biofuel production. However, as with many other undomesticated grass species, little information is known about the genetic diversity or population structure of prairie cordgrass natural populations as compared to their ecotypic and geographic adaptation in North America. In this study, we sampled and characterized a total of 96 prairie cordgrass natural populations with 9315 high quality SNPs from a genotyping-by-sequencing (GBS) approach. The natural populations were collected from putative remnant prairie sites throughout the Midwest and Eastern USA, which are the major habitats for prairie cordgrass. Partitioning of genetic variance using SNP marker data revealed significant variance among and within populations. Two potential gene pools were identified as being associated with ploidy levels, geographical separation, and climatic separation. Geographical factors such as longitude and altitude, and environmental factors such as annual temperature, annual precipitation, temperature of the warmest month, precipitation of the wettest month, precipitation of Spring, and precipitation of the wettest month are important in affecting the intraspecific distribution of prairie cordgrass. The divergence of prairie cordgrass natural populations also provides opportunities to increase breeding value of prairie cordgrass as a bioenergy and conservation crop.


Introduction
Prairie grass species native to North America, such as switchgrass (Panicum virgatum L.), big bluestem (Andropogon gerardii Vitman), indiangrass (Sorghastrum nutans L.), and prairie cordgrass have shown potential use for conservation practices and potential bioenergy production [1][2][3][4]. To develop prairie grass species as a new crop for either conservation practices or bioenergy feedstock on marginal conditions, it is important to characterize and maintain the genetic resources of local or regional populations that show agronomic advantages, such as high biomass yield and strong biotic and abiotic stress tolerances. However, the presence of North American tallgrass prairie has been diminished by agriculture and urban development since European settlement. Although scattered throughout the historical range, thousands of remnant prairie sites still exit in North Amer-ica [5]. Locally adapted natural populations from those remnant prairie sites are valuable genotypes that harbor adaptive traits to various environments [6,7].
Prairie cordgrass is a native, perennial, warm-season (C4) grass that once dominated North American tallgrass prairies. The habitats of prairie cordgrass cover wet to moist prairies and low areas alongside rivers and tributaries [8,9]. Mobberley [10] found prairie cordgrass also thrive in open, dry prairie and along railroads in the Midwestern United States. Common nursery evaluation of prairie cordgrass in Europe [11], eastern South Dakota [12,13], and central Illinois [1], has shown high biomass yield potential, comparable to that of switchgrass and other warm-season grasses. According to Boe and Lee [12], seven natural populations of prairie cordgrass from South Dakota produced more biomass than switchgrass while showed significant differences for biomass production among populations. Another study evaluating populations collected from an area spanning Midwestern and Eastern USA also showed extensive phenotypic variation among prairie cordgrass populations [1]. Compared to other perennial grasses, such as switchgrass and big bluestem, prairie cordgrass has a limited breeding history, with only five cultivars released as sourceidentified genetic material [14,15]. The information of genetic background of other prairie cordgrass natural populations is also limited.
Cytotaxonomic studies of prairie cordgrass revealed different ploidy levels existing among populations, including tetraploid (2n = 4x = 40) populations distributed from Southern Canada and the Eastern USA, and octoploid (2n = 8x = 80) populations distributed across Midwestern USA [16][17][18]. A mixed-ploidy population consisting of tetraploid and hexaploid (2n = 6x = 60) individuals was found in central Illinois, USA [19]. Within this mixed-ploidy population, substantial phenotypic variability was observed between two ploidy levels, such as flowering time, stomatal size, and plant morphological characteristics [19]. Kim et al. [20] reported the presence of all three ploidy levels among 11 surveyed natural populations and found that a positive association between genome size and the stomata size between octoploids and tetraploids. A cytogenetic survey of 60 prairie cordgrass natural populations found the tetraploid populations extended from the East North Central to the New England regions of the U.S., and the octoploid cytotypes distributed in the West North Central regions [21]. A study of prairie cordgrass chloroplast DNA (cpDNA) also showed a strong relationship between cpDNA haplotypes and geographic distribution [22]. Three cpDNA haplotype groups including "PCG1" haplotypes occurred in the New England/Middle Atlantic regions in the east and central U.S., a "PCG2" haplotypes found in southern SD and northern IA, IL, and MO in the central U.S., and a "PCG3" haplotypes identified in a distinct region that includes portions of ND, SD, and MN. The major cpDNA haplotype group ("PCG1") includes members of all three cytotypes. The wide dispersal of cytotypes within cpDNA haplotypes could be resulting from a combination of migration and polyploidization, which is not uncommon in Spartina species [23,24].
To fully investigate the genetic variation and phylogeography in this outcrossing species, nuclear molecular markers with single nucleotide polymorphisms (SNPs) should be jointly used with organelle molecular markers [25,26]. With the advent of high-throughput sequencing technology, it is now feasible to survey the whole genome and provide trait-associated molecular markers for phylogenetic studies. DNA libraries constructed using Genotypingby-sequencing (GBS) on restriction site takes advantages of high-throughput sequencing technology to generate thousands of SNPs across many individuals [27,28]. This simultaneous polymorphism discovery and genotyping approach avoid ascertainment bias while lowering the overall cost by combining many genotypes in a single run [28,29]. A greater number of molecular markers improves clustering of the wild taxa as sources of useful genes in breeding programs and identifies conservation territories of a particular species of interest [30][31][32].
Undomesticated species have often gone through extensive inter-specific gene flow, lineage splitting, and genetic drift, resulting in incongruences of genealogical information carried by each gene [33][34][35]. Environmental and geographical factors are significant contributors in shaping population structure through the above-mentioned divergence events. A better understanding of environmental and geographical adaptation within a species could benefit research communities such as plant breeding, conservation ecology, and evolutionary ecology. Therefore, in this study, we collected 96 prairie cordgrass populations across the east and central midwest US range and genotyped them using a GBS approach. The objectives of this study are to: (1) identify intraspecific genetic diversity among prairie cordgrass natural populations collected in U.S.; (2) reveal the intraspecific bio-geographical distribution of those natural populations.; and (3) evaluate the influences of environmental and geographical variables on the distribution of those populations.  (Table A1). For the best representation of a local population at each location, seeds were collected from all visually identifiable clones within a 1-km radius of the sampling area. When a large cohort of plants were identified, seeds were collected from a random sampling of inflorescences covering the area. Northern Appalachian mountain areas, Ohio, West Virginia, and West New York were searched for prairie cordgrass natural populations in the remnant prairie area. However, there were no prairie cordgrass remnant populations found or reported by the local USDA plant materials collection centers. The county-based USDA-NRCS distribution map also showed a sparse adaptation in those regions for prairie cordgrass (USDA-NRCS, https://plants.usda.gov/core/profile?symbol=sppe, accessed on 29 November 2019). In addition, more than 100 rhizomes of each of two cultivars ('Kingston' germplasm (KST), 'Southampton' germplasm (STP)) were obtained from the USDA-NRCS Big Flats Plant Material Center, NY. Seeds of 'Red River' prairie cordgrass, a cultivar developed by interpopulation open pollination among vegetative propagules obtained from east central Minnesota (Grant County), northeastern South Dakota (Day County), and east central North Dakota (Cass and Grand Forks Counties) [13], were also included in this study ( Table 1). Seedlings of four genotypes developed from each population were transplanted on 0.9-m centers in a common field nursery at the University of Illinois Energy Farm in Urbana, IL (40 • 6 N, 88 • 13 W). The dominant soil was Drummer silty clay loam (fine-silty, mixed, super-active, mesic typic Endoaquolls). A randomized complete block design with four replications was used to arrange populations. Each plot consisted of 16 plants of the same genotype spaced on 0.9-m centers, and individual plot size was 3.6 m × 3.6 m. Weeds were controlled by applying 0.28 kg ai ha −1 quinclorac (3,7-dichloroquinoline-8-carboxylic acid) before the emergence and 0.79 kg ae ha −1 2,4-D ester (2-ethylhexyl ester of 2,4-dichlorophenoxyacetic acid) in the growing season from 2011 to 2013. All plots were also fertilized with 112 kg N ha −1 in April of 2011, 2012 and 2013.

Genotyping-by-Sequencing
Leaf tissue samples from each genotype were collected and bulked for DNA extraction in 96-well frozen plant format using a standard CTAB protocol [36]. A minimum of two genotypes from each population were collected. Up to four genotypes were collected when possible. In total, 213 individuals were included in the preparation of sequencing library. DNA was then quantified with PicoGreen (Life Technologies, Grand Island, NY, USA) and prepared for GBS library construction following the proctocol proposed by Poland et al. [28]. Genomic DNA was double-digested using PstI-HF (rare cutting) and HinP1I (common cutting) enzyme. Rare and common restriction overhangs were ligated with two sets of barcoded adapters. Illumina primers (Beckman Coulter, Inc., Indianapolis, IN, USA) were used to amplify pooled restriction ligation reactions. Size of generated fragments were measured using an Agilent Bioanalyzer (Agilent, Santa Clara, CA, USA). The library was submitted to the University of Illinois Keck Biotechnology Center for sequencing on an Illumina Hi-Seq2000 to obtain single-end, 100-bp reads. Raw sequence reads were processed using the GBS-SNP-CROP pipeline [37]. The sequence data were first demultiplexed and trimmed from barcode and cut sites using TRIMMOMATIC [38]. Reads from ten individuals with diverse geographical origins were assembled and clustered to create a pseudo-reference genome using VSEARCH [39]. The diverse set of samples were chosen based on several factors, including the representative populations reported in the previous phylogenetics study using chloroplast sequences, read depth, and ploidy levels. A minimum of 2.5 million reads is required for a sample to be chosen for creating the pseudo-reference genome. An even number of samples were selected from tetra-and octo-ploidy populations Table A1. Processed reads were aligned to the pseudo-reference genome using BWA-mem and SAMtools algorithm to identify all potential SNPs for each sample. Given the high ploidy levels among prairie cordgrass populations and the purpose of this phylogenetic study, SNPs were filtered first based on read depth and then allele frequency. A minimum read depth of 11x is required to call a locus homozygous in the absence of any reads of the alternative for tetraploid or higher levels of ploidy [37]. In addition, the minimum read depth of calling a locus heterozygous is 3x, the required proportion of secondary reads to all non-primary reads is 0.9, the proportion of genotyped individuals to accept a SNP is 0.75, and the acceptable ratio of the depth of the secondary allele to that of the primary allele is 0.1. At last, individuals with read depths lower than 4x were eliminated. Although the read depth filters reduced the number of SNPs retained from the pipeline, it avoided calling SNPs in regions with low coverage. The average read depth for each sample is presented in Table A1. Diploid genotypes were generated in this study to estimate the population structure and heterozygosity for tetra-, hexa-, and octoploids. A study by Bishop et al. [40] indicated all three cytotypes are highly likely allopoly-ploidy by examining the chromosome pairing patterns. However, there was indeed a higher chance for hexa-ploidy to behave differently. Another study by Crawford et al. [41] reported a disomic inheritance based on the distribution of allele frequencies in a bi-parental F1 tetra-ploidy population.

Ploidy Levels
To estimate ploidy level of each population, flow cytometry was performed on the main tiller of four clonally propagated plants collected from one genotype for each population when one or more secondary tillers were initiated. Nuclear DNA content was determined using a procedure modified from Rayburn et al. [42] and Kim et al. [20]. Details on sample preparation were described by Lee et al. [43], and the analysis of relative DNA content was conducted with DB LSR flow cytometry (BD Biosciences, San Jose, CA, USA) in the Flow Cytometry Laboratory (Biotechnology Center, University of Illinois at Urbana-Champaign, Champaign, IL, USA). The relative DNA content was calculated by dividing the relative fluorescence of the sample using the relative fluorescence of the standard. Ploidy level of each population was determined according to Kim et al. [20]. Briefly, a plant sample with 1.6 picogram (pg) DNA content would be designated tetraploid (2n = 4x), a 2.3 pg plant would be considered to be hexaploid individual (2n = 6x), while the 3.1 pg plants would represent octoploid plants (2n = 8x).

Population Structure and Genetic Diversity
For population and genetic diversity analyses, we selected data from one genotype from each population to represent each population. Samples with higher than 15% missing rate were also avoided. A total of 96 samples were selected. Single nucleotide polymorphisms data were first imputed using the LD-kNNi algorithm in Tassel V5 [44,45] and scored in a binary format as homozygous primary allele (0), heterozygous (1), and homozygous secondary allele (2). The LD-KNNi algorithm is based on a k-nearest neighbor genotype imputation method, designed for unordered markers on unphased genotype data from heterozygous species. Population structure was analyzed using fastSTRUC-TURE, a Bayesian-based algorithm [46], and the discriminant analysis of principal components (DAPC, 'adegenet' package, R Development Core Team 2013) [47] to visualize the genome-wide patterns of distribution and potential group membership of each population. The fastSTRUCTURE was run from K = 1 to K = 10 using default parameters for 96 samples. Geographical distribution of populations was mapped on the US state map using ggplot ('ggplot2') [48] based on the coordinates of collection origins. We also evaluated the likelihood provided by fastSTRUCTURE and the Bayesian information criterion (BIC) score provided by DAPC to infer the best number of demes supported by the data Figure A2. The principal coordinate analysis (PCOA) was then performed using pcoa ('ape' package) [49] to investigate the genetic differentiation among and within demes. An analysis of molecular variance (AMOVA) was conducted on all individuals for genetic variation associated with ploidy levels, populations, demes, and plants using the poppr.amova ('poppr' package) [50] in R. We proposed two models. In the first model, we explored the genetic variation at levels of ploidy, populations within each ploidy, and samples within each population. This provides information of genetic variation for a plant breeder to perform selections within and among potential landraces. In the second model, we evaluated the effects of levels of demes, ploidy levels within demes, and populations within ploidy. Since we imposed the category of demes in the second model, results from second model are approximates of levels of variance explained by demes, ploidy levels, populations, and genotypes. Heterozygosity and fixation statistics were calculated for within and among potential genetic groups using genet.dist ('hierfstat' package) [51] in R according to Weir & Cockerham [52].

Environmental and Geographical Variables
To assess the association of environmental and geographical variation with the genetic variation of natural prairie cordgrass populations, a 30-year normals for temperature and precipitation were collected from National Oceanic and Atmospheric Administration (NOAA) weather stations located closest to the collection site (https://www.ncei.noaa. gov/products/us-climate-normals, accessed on 16 May 2019). Three geographical variables are longitude (LONG) and latitude (LAT) (expressed in hundredths of degrees) and altitude (ALT) (expressed in meter). The environmental variables were then calculated to generate more biologically meaningful variables using a 'biovar' function in the R package 'dismo' [53]. The values of each variable were converted using log-transformation based on a Box-Cox transformation test to promoting normality (Shapiro-Wilk tests: p > 0.05). A total of 17 environmental variable including an EcoregionIII Factor (EF), 8 temperature variables and 8 precipitation variables over 30 years (from 1987 to 2017) were collected. The EF was created according to Omernik [54], who defined a local ecosystem for its quality and integrity, by evaluating its pattern and composition of biotic and abiotic phenomena. The ambient temperature variables (expressed in°C) are mean annual temperature (MAT), standard deviation of annual temperature (SDAT), mean temperature of the warmest month (MTWM), and mean temperature of the coldest month (MTCM). We also collected mean temperature of each of four meteorological seasons: Spring (1 March-31 May), Summer (1 June-31 August), Autumn (September 1st-November 30th), and Winter (1 December-28 February), expressed as MTSP, MTSU, MTAU, and MTWI, respectively. The precipitation variables (expressed in mm) were collected in the same way as the temperature variables, hence mean annual precipitation (MAP), standard deviation of annual precipitation (SDAP), mean precipitation of the wettest month (MPWM), mean precipitation of the driest month (MPDM), and mean precipitation of Spring (MPSP), Summer (MPSU), Autumn (MPAU), and Winter (MPWI).

Mantel Tests and Canonical Correlation Analyses
To evaluate the influence of environmental adaptation on the formation of subgroups within prairie cordgrass, we conducted mantel tests and canonical correlation analyses using the SNP data, environmental and geographical data. For mantel tests, the environmental distance matrix was created based on 17 environmental variables, using the vegdist function in 'vegan' package in R [55,56]. To create a geographical distance matrix among populations, we calculated pair-wise geographical distances based on latitude/longitude degrees on an ellipsoidal model of the Earth, also known as the method of Vincenty's Formulae [57], using the gdist function under 'Imap' package in R [58]. Genetic distance (F ST values) matrix was calculated using the genet.dist function in 'hierfstat' package, in which the Weir & Cockerham approach was used [52]. The mantel tests were carried out using the 'vegan' package in R [56]. Significance testing of the correlations was performed with 10,000 permutations. In this study, we conducted Mantel test of correlations of both environmental and geographical distance with genetic distance. In addition, a Partial Mantel test was run between environmental and genetic distance, while controlling for geographical distance. Although Mantel test is popular in landscape genetics studies, it provides low detecting power in studying relationships between distance matrices and lacks ability to estimate proportional contribution of variation from environmental and geographical variables [55,[59][60][61]. Therefore, we conducted canonical correlation analyses (CCA) following Mantel tests to evaluate the rank of importance of environmental and geographical variables in contributing to the genetic variation within the species [62,63]. In order to reduce the dimensionality of genomic data while retaining information covering the whole genome, we chose the first 10 PCOA axes from the PCOA of the SNP data (a total of 50.7% variance explained, data not shown) for the CCA. All 20 environmental and geographical variables were used for creating pair-wise canonical variables with the 10 PCOA axes. The CCA first decomposed the variance contributed by each canonical variable, and then calculated the correlations between environmental/geographical variables with the selected canonical variables. The canonical correlation analysis was carried out using cc function in the 'CCA' package [64]. The statistical significance of canonical correlation coefficients was examined using F-approximations of Wilks' Lambda using p.asym function in 'CCP' package [65].

SNP Discovery
A total of 240 million single-end sequence reads were produced from the Illumina Hi-Seq2000 platform. The minimum and maximum lengths were 32 and 90 base pairs (bp), respectively. A total of 29.7 million reads were used to build the pseudo-reference genome based on available computational power. The final pseudo-reference genome contains 371,332 sequences and 19.1% of them were non-singletons which were filtered in the downstream analyses. The average length of the reference sequences was 82 bp with a standard deviation of 13.5 bp. The initial assembly and SNP calling without filters yielded 211,294 SNPs. A final subset of 9315 SNPs was retained and genotyped in 213 samples after applying restrictions on read depths and allele frequencies. The average read depth was 36.4 × per SNP. The distribution of sample read depth is also provided Figure A1. Individuals on average had 19.8% and 8.9% missing SNPs before and after imputation, respectively.

Ploidy Levels
There were three DNA ploidy levels found in this study: tetraploid, hexaploid, and octoploid cytotypes (Table A1). The intraspecific ploidy level variation was congruent to the results from Kim et al. [21]. In this study, 56 populations were classified as tetraploids (2n = 4x), 4 populations were classified as hexaploids (2n = 6x), and 36 populations were classified as octoploids (2n = 8x). Most of the tetraploids were identified in the East North Central and New England U.S. regions (CT, NJ, MA, ME, IN, MO, and LA) ( Figure 1). All four hexaploids were identified in Illinois. The majority of octoploids were identified in the West Central region (SD, NE, and ND). Two different ploidy levels were identified in OK, NY, MN, KS, IL, WI, and IA.

Population Structure
The simulation result from fastSTRUCTURE and the BIC score from DAPC suggested two genetic demes (K = 2, BIC = 581.08), based on 9315 nuclear SNPs (Figures 1 and A2). The first genetic deme (East deme) includes populations mostly from the New England (MA and ME), East North Central (WI, IL, and IN), and West Central (KS, OK, and LA) regions. The second genetic deme (West deme) includes populations mostly from West North Central (MN, NE, and SD) and West Central (KS and OK) regions. The prairie cordgrass cultivar, 'Red River', was categorized into West deme, and New York cultivars 'STP' and 'KST' were placed in East deme. Populations from the two demes were largely separated by the border of mixed-prairie and tallgrass prairie as defined by Weaver [8] and mapped by Lauenroth et al. [66]. The populations collected in North Dakota, South Dakota, Nebraska, and Kansas are generally in the mixed prairie. The populations collected in Minnesota, Iowa, Missouri, and Illinois are mostly located in the tallgrass prairie.
The first two principal coordinates separated East deme from West deme, indicating two major gene pools (Figures 2 and 3). Although ploidy level is not fixed within these two gene pools, tetraploids (4x) and octoploids (8x) are primary cytotypes in East deme and West deme, respectively. In East deme, populations were more scattered on directions of both PCOA1 and PCOA2 compared to those in West deme ( Figure 2). For example, populations collected from IA, OK, MO, and KS tended to form a subgroup separate from other populations. One octoploid populations from MN was clustered with the large IL group. Three octoploid populations from ND and NE located in the area between East deme and West deme on PCOA1. Three of the four hexaploid populations from IL clustered closely together and with most other IL populations, but one hexaploid population (marked as IL † ) was considerably different. In West deme, populations were less variable/more tightly clustered than populations from East deme on PCOA1 and PCOA2. There were only four tetraploids in West deme. There were several populations from MO, NE, IA, IL, IN, CT, and KS that could equally likely be categorized in East deme or West deme. Furthermore, most of these populations are geographically adjacent to both populations from East deme and from West deme. These populations were also distributed along the mixed-to tallgrass prairie border. This provides support that populations from these states (MO, NE, IA, IL, and KS) are in areas where intraspecific breeding occurred. Populations from both demes were tightly distributed on PCOA3, except for eight populations from MA, ME, IL, NJ, and CT ( Figure 3). Those populations could potentially be a subgroup from New England area. Differences in SNP missing rate could affect the results of population structure analysis. In our study, there was no significant difference between samples from two inferred demes for percentage of SNPs imputed (Kruskal-Wallis rank sum test: p = 0.77) Figure A3.

Analysis of Molecular Variance and Heterozygosity
Using 9315 SNP markers, analysis of molecular variance showed that SNP marker variance was significant among ploidy levels, among populations within ploidy levels, and among samples within populations ( Table 1). Variance of SNP markers that accounted for ploidy levels, populations within ploidy levels, and sample within populations were 2.8%, 32.9%, and 64.3%, respectively. SNP marker variance was also significant among demes, ploidy levels within demes, and populations within ploidy levels. Deme, ploidy levels within demes, and populations within ploidy levels accounted for 14.3%, 4.6%, 81.1% of the SNP marker variance, respectively.
Average heterozygosity was calculated across a whole population panel, within each deme, and between two demes ( Table 2). Average observed heterozygosity (Ho), average expected heterozygosity (He), and overall genetic diversity (Ht) across all populations were 0.27, 0.22, and 0.24, respectively. Ho, He, and Ht within East deme were 0.21, 0.19, and 0.20, respectively. Ho, He, and Ht within East deme were 0.35, 0.26, and 0.27, respectively. Inbreeding coefficients (Fis) across all populations, within East deme, and within West deme were −0.212, −0.133, and −0.369, respectively. The fixation index (Fst) across all populations, within East deme, within West deme, and between two demes were 0.05, 0.045, 0.053, and 0.079, respectively. Since a population-based imputation method (LD-kNNi) was used, the heterozygosity calculated in our study could be underestimated.

Mantel Tests and Canonical Correlation Analyses
The Mantel tests showed significant correlations of the genetic distance with both the environmental (r = 0.25) and the geographical distance (r = 0.33) (p < 0.001). The partial Mantel test generated a significant correlation between genetic and environmental distance (r = 0.068, p = 0.025) after controlling for geographical distance. This indicated that it is necessary to dissect the relationship between specific environmental variable and genetic distance. Following the Mantel and partial Mantel tests, CCA showed significant coefficients from five pairs of canonical variables (i.e., Canonical axes) (p < 0.01, Table 3), with correlations (r) ranged from 0.71 to 0.92. The top three canonical axes explained 73.5% variance cumulatively. Therefore, we selected the top three canonical axes and presented their correlations with genetic (PCOAs) and environmental/geographical variables in Table 4. The first three PCOAs (i.e., PCOA1, PCOA2, and PCOA3) showed significant correlations with canonical axes (II) (r = 0.618), (II) (r = 0.578), and (I) (r = 0.947), respectively. This indicated that populations separated by PCOA3 in PCOA were largely contributed by canonical variables on canonical axis (I). Populations separated by PCOA1 and PCOA2 were largely contributed by canonical variable on axis (II).   In canonical axis (I), LONG, MPDM, MPSP, and MPWI were significantly correlated with canonical axis (I) (r = 0.805, 0.601, 0.549, and 0.617, respectively), which resulted in separating East deme, West deme, and the New England populations (i.e., MA, ME, IL, NJ, and CT). The ALT and SDAT were also contributing to the pattern in a relatively low magnitude (r = −0.421 and −0.417, respectively). In canonical axis (II), ALT, MTWM, MPDM, and EF were significantly correlated with canonical axis (II) (r = 0.483, 0.426, −0.463, and 0.496, respectively), which resulted in separating East deme, West deme, and a widely scattered pattern in East deme populations. The LONG also showed a moderately high correlation with canonical axis (II) (r = −0.419). In canonical axis (III), a complex of environmental and geographical variables was correlated with PCOAs, especially PCOA7 and PCOA8. The LAT, MAT, MTWM, MTCM, MPSP, MPSU, MPAU, and MPWI were the top contributors in canonical axis (III). Notably, several precipitation variables were associated with LONG in canonical axis (I) while temperature variable were associated with LAT on canonical axis (II).

Intraspecific Genetic Diversity
Using nuclear molecular markers, significant genetic diversity and population structures were found within perennial grass species [67][68][69]. In this study, analysis of molecular variance showed significant variance among ploidy levels and demes but accounted for only a small portion (2.78% and 14.32%, respectively) of the total variance. Large variance among and within populations indicated a greater landscape diversity in the population level in prairie cordgrass. This is congruent to results from a genetic diversity study on big bluestem (Andropogon gerardii Vitman) accessions by Price et al. [69]. Fragmentation of habitats and rhizome-preferred reproductive nature of prairie cordgrass could explain the significant divergence among populations and individuals [1]. We used fixation statistics such as He, Ho, Fis, and Fst to evaluate the degree of divergence within and among demes. The populations within West deme exhibited higher heterozygosity than that in East deme, as indicated by both the fixation statistics and the principal component analysis between and within demes. However, this is likely due to a larger number of octoploidy populations in the West Deme compared to that in the East Deme. Although dismoic SNPs calls were used for samples of all three ploidy levels, we still observe a higher heterozygosity level in the octoploidy dominant West Deme than that in the tetraploidy dominant East Deme. Similar to switchgrass, the two identified potential regional gene pools have a dominant ploidy level, either tetraploid or octoploid [70]. Compared to the phylogeograhic study using chloroplast sequences by Kim et al. [22], we included more populations and revealed different patterns of adaptation patterns among those populations. In this study, two major population demes were divided largely by the border of mix-grass prairie and tallgrass prairie. However, in the Kim et al. [22] study, the first chloroplast haplotype group had a wide longitudinal distribution that ranged from central Nebraska to the east coast of Maine. Prairie cordgrass is an efficient wind-pollinated species, population formation through bi-paternal introgression could be potentially detected using nuclear DNA. As chloroplast DNA is mediated through a uni-parental inheritance pattern, the populations from the first chloroplast haplotype group may represent one large ancestral ecotype with a wide distribution that later subdivided into two ecotypes as indicated by the SNP data from this study. The third chloroplast haplotype group distantly apart from the first group may indicate a historical migration of populations from south to north whenever they are able to adapt to different hardiness zones. The geographical barrier and mismatching ploidy levels may prevent further admixture between tetraploids and octoploids separated by the mix-and tallgrass prairie border. In early studies on switchgrass, upland-lowland switchgrass differential is largely latitudinal, caused by a combination of temperature and photoperiod [32,71]. However, recent studies indicated that the differences in ploidy levels also played an important role in restricting gene flow between potential genetically distinct groups [70,72,73]. In this study of prairie cordgrass, the gradient appears to be largely longitudinal. The wide dispersal of several populations from IL, IA, MN, ME, and WI could be due to human activities such as railroad transportation in the Central US regions or migratory trafficking from the west to east coasts as many of the natural populations were collected along the railway tracks [22,74,75].

Genetic and Geographical/Environmental Associations
Phenological and morphological differentiation due to geographic isolation and climatic gradients was observed within several tallgrass species native to North America [1,[76][77][78][79]. In this study, we also detected significant influence of environmental condi-tion on distribution of natural prairie cordgrass populations collected from the east and midwest US regions. The Mantel tests and CCA indicated a strong correlation of genetic distance and environmental/geographical variables among the prairie cordgrass populations. Our results also indicated that both LONG and ALT played important roles in forming a general separation of gene pools from east to west in prairie cordgrass. Geographical barriers such as the tall-and mixture prairie border in the USA, could create separate gene pools [80,81]. Precipitation patterns or moisture gradient also has an impact on the distribution of grass species [81][82][83]. In our study, precipitation variable such as MPDM, MPSP, and MPWI, together with LONG and ALT, were highly correlated with the genetic distance in the same canonical axis. This is not surprising since the origins of populations collected in this study are generally covered by the east-west decreasing precipitation gradients, which plays an important role in shaping the great plain grass prairie [84]. Temperature patterns in mid-and eastern US are largely governed by LAT and ALT [85]. This explains the high correlations of LAT with the temperature variables such as MAT, MTWM, MTCM, MTSP, MTSU, MTAU, and MTWI. However, LAT and temperature variables are mostly significant on canonical axis (III), which indicated that temperature has less impact on the genetic distance among prairie cordgrass populations compared to precipitation and ALT. As ALT correlates both precipitation and temperature in eastern US, it is expected that ALT was significantly correlated with climatic variables such as MAP, SDAT, MPDM, MPSP, and MTWM, MPDM, MTAU, MPWI on canonical axes (I) and (II), respectively. Ecoregion factor is highly correlated with canonical axis (II) which indicated that factors such as landforms, soils, vegetation, land use, wildlife, and hydrology also play important roles in shaping the prairie cordgrass distribution. However, ecoregion factor is also highly correlated with environmental factors. According to Bailey [86], the ecoregions were first defined by the largest units and successively subdivides them. At the continental level, temperature and precipitation are major factors in defining the large sections of ecoregions. In our study, we also found that the ecoregion factor showed the similar level of correlation with genetic variation as those for temperature and precipitation variables. In conclusion, geographical factors of LONG and ALT, and environmental factors of MAT, MAP, MTWM, MPDM, MPSP, and MPWI are most important in distinguishing the intraspecific distribution of prairie cordgrass.

Conclusions
Our research reported the population genomic variation and potential diversity centers in prairie cordgrass based on the analysis of 9315 SNPs from 96 natural populations. Two distinct genetic groups were identified which were associated with ploidy levels and geographical and ecotypic separation. Analysis of intraspecific variation among and within genetic groups and ploidy levels revealed evidence of adaptation history of prairie cordgrass in the Midwest and Eastern USA. The major gene flow in prairie cordgrass could be a consequence of geographic, climatic events, and human activity. Future studies on local landscape variation in prairie cordgrass could provide further information on the adaptation strategies in perennial grass species. From a standpoint of improving prairie cordgrass, both for biofuel production and conservation purpose, the identification of divergent genetic resource could provide opportunities to combine breeding value from different gene pools. Data Availability Statement: Raw sequencing reads were submitted to NCBI BioProject: PRJNA594199 Geographical and environmental data were submitted to the DRYAD repository (https://datadryad. org/stash/share/vd_e9EKUGtcnMVPYie42Na2dNGtiuoRa006H4A8WyHY, (accessed on 10 July 2021).

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results. Appendix A Figure A1. Graph presenting distribution of mean read depth for all 96 samples genotyped using 9315 SNPs with a GBS approach. In the x-axis, read depth data including primary and secondary alleles are presented. In the y-axis, the frequency of samples with corresponding read depth is presented.     --4x  -----11  0  100  East  Red River  --8x  -----12  100  0  West  STP *  §  --4x  -----9