Phylogeography and the Evolutionary History of Sunflower (Helianthus annuus L.): Wild Diversity and the Dynamics of Domestication

Patterns of genetic variation in crops are the result of selection and demographic changes that occurred during their domestication and improvement. In many cases, we have an incomplete picture of the origin of crops in the context of their wild progenitors, particularly with regard to the processes producing observed levels of standing genetic variation. Here, we analyzed sequence diversity in cultivated sunflower (Helianthus annuus L.) and its wild progenitor (common sunflower, also H. annuus) to reconstruct phylogeographic relationships and population genetic/demographic patterns across sunflower. In common sunflower, south-north patterns in the distribution of nucleotide diversity and lineage splitting indicate a history of rapid postglacial range expansion from southern refugia. Cultivated sunflower accessions formed a clade, nested among wild populations from the Great Plains, confirming a single domestication event in central North America. Furthermore, cultivated accessions sorted by market type (i.e., oilseed vs. confectionery) rather than breeding pool, recapitulating the secondary development of oil-rich cultivars during its breeding history. Across sunflower, estimates of nucleotide diversity and effective population sizes suggest that cultivated sunflower underwent significant population bottlenecks following its establishment ~5000 years ago. The patterns inferred here corroborate those from previous studies of sunflower domestication, and provide a comprehensive overview of its evolutionary history.


Introduction
Patterns of genetic variation in cultivated plants are the product of multiple processes that have occurred over their evolutionary histories. Gaining a comprehensive understanding of these patterns and the underlying processes requires reconstructing them within the broader context of the wild species from which they are derived. Such species-wide assessments can provide inferences into the ancestral lineages that gave rise to early domesticates and modern cultivars, and yield insights into the factors explaining the distribution of genetic diversity across gene pools. Such knowledge also has practical value, and can be applied to identify sources of novel alleles that should be preserved in germplasm collections and which may be of value in modern breeding programs [1][2][3][4]. Here, we characterize patterns of genetic variation in cultivated sunflower-a globally important oilseed crop-and its wild progenitor, common sunflower (both Helianthus annuus L.).
Common sunflower is a widely distributed annual herb whose native geographic range is centered in the Great Plains region of the United States and Canada [5]. Sunflower is thought to have been domesticated 3000-5000 years ago [6,7] by Native Americans who primarily used it as a source of edible seed [8]. Descendants of these early domesticates-the Native American landraces-were Table 1. Accession numbers, improvement status, and geographic origins of wild and cultivated lines examined in this study. Cultivated accessions are categorized as exotic (i.e., Native American landraces and open-pollinated varieties [OPVs]), HA, or RHA lines. Market type (i.e., oilseed or confectionery) is denoted for HA/RHA lines. All seeds were obtained from the North Central Regional Plant Introduction Station (Ames, IA, USA).

Sequence Processing and Variant Calling
We used iPyrad version 0.9.16 [28] to process reads and call variants for downstream analysis. Briefly, raw demultiplexed reads were filtered with cutadapt 1.12 [29], to remove reads containing adapter sequences and > 5% low quality (phred score < 20) or ambiguous bases. Filtered reads were aligned to the HA412-HOv2 genome assembly [30] with BWA-MEM 0.7.17 [31] using default parameters, then sorted and indexed using samtools 1.10 [32]. Indexed reads were merged with BEDtools 2.29 [33], and bases were called for sites with ≥ 6 and ≤ 10,000 reads. Merged reads were clustered across samples and aligned into GBS loci, and loci with > 20% shared heterozygous sites and > 10% variable sites were filtered to remove poorly aligned and paralogous loci. Remaining loci DNA extractions were prepared for sequencing following a two-enzyme GBS protocol [27] using the restriction enzymes HpaII and MseI. Resulting libraries were pooled at 96-plex and sequenced on the Illumina Nextseq 500 sequencing platform (Illumina, San Diego, CA, USA) in high-output mode and set to produce 75 bp single-end reads. All library preparation and sequencing was performed at the Georgia Genomics and Bioinformatics Core (Athens, GA, USA).

Sequence Processing and Variant Calling
We used iPyrad version 0.9.16 [28] to process reads and call variants for downstream analysis. Briefly, raw demultiplexed reads were filtered with cutadapt 1.12 [29], to remove reads containing adapter sequences and > 5% low quality (phred score < 20) or ambiguous bases. Filtered reads were aligned to the HA412-HOv2 genome assembly [30] with BWA-MEM 0.7.17 [31] using default parameters, then sorted and indexed using samtools 1.10 [32]. Indexed reads were merged with BEDtools 2.29 [33], and bases were called for sites with ≥ 6 and ≤ 10,000 reads. Merged reads were clustered across samples and aligned into GBS loci, and loci with > 20% shared heterozygous sites and > 10% variable sites were filtered to remove poorly aligned and paralogous loci. Remaining loci anchored to the 17 chromosome-level scaffolds on the HA412-HOv2 genome assembly were retained, and SNP filtering was conducted in VCFtools 0.1.16 [34].
Phylogenetic analyses were conducted on the "phylogenomics" dataset composed of biallelic SNPs present in ≥ 50% of ingroup and outgroup samples (232 H. annuus; 2 H. argophyllus, 3 H. petiolaris samples) at a minor allele frequency greater than 1% (MAF > 0.01). The sample coverage threshold in this dataset was chosen because it allowed for the retention of lower coverage, high mutation rate sites which are useful in resolving recent divergences [35,36]. Population genetic analyses were conducted on three datasets, each consisting of both cultivated and wild sunflower samples ("ingroup_all"), and only wild ("ingroup_wild") or cultivated ("ingroup_cultivated") samples. The three datasets were composed of biallalic SNPs present in ≥ 80% of samples at MAF > 0.01, thinned to include SNPs that were ≥ 1 kb apart to reduce non-independence amongst sites. Demographic reconstructions were conducted on an "ingroup_dadi" dataset consisting of two samples per wild population (32 total) and all 30 of the cultivated samples. The "ingroup_dadi" dataset consisted of biallelic SNPs present in ≥ 50% of samples with a minor allele count of 2 (MAC = 2), spaced ≥ 1 kb apart, and polarized by alleles fixed in H. petiolaris and argophyllus (i.e., the ancestral state at each site was set to the alleles observed in H. petiolaris and argophyllus). The reduced sample size and sample coverage thresholds of the "ingroup_dadi" dataset were chosen as they increased the number of segregating sites available for demographic analysis. Furthermore, the relatively low MAF and MAC thresholds of the "ingroup" datasets were chosen to allow for the inclusion of rare variants which provide greater resolution of genetic structure and demographic events [37], while excluding singleton alleles that may reflect sequencing error.

Patterns of Genetic Diversity across Breeding Pools and Geographic Space
Estimates of mean nucleotide diversity (π) across all sites were calculated with VCFtools to identify patterns in the distribution of genetic diversity across breeding pools within H. annuus (i.e., wild, OPV, RHA, HA), as well as the wild populations, separately. Differences in π among breeding pools were assessed by computing the 95% confidence intervals from 1000 bootstrap replicates of per-site estimates of π using the "boot" and "boot.ci" functions in the R [38] package boot [39].
We then estimated patterns of genetic diversity across wild sunflower populations. Global population genetic parameters (e.g., Weir and Cockerham's [40] F ST and F IS ) were estimated using all sites with the hierfstat R package [41]. Evidence of isolation-by-distance (IBD) was evaluated using a Mantel's test with the "mantel.randtest" function as implemented in the R package ade4 [42,43]. Pairwise estimates of Weir and Cockerham's F ST [40] were calculated using the "pariwise.WCfst" function in the hierfstat R package [41], and geographic distances between populations were calculated using the "distm" function in the geosphere R package [44]. Significance was assessed with 1 x 10 6 Monte Carlo simulations. Clinal trends in π were investigated with linear regression, treating latitude and longitude as fixed effects using the "lm" function in R. AMOVA was conducted using the poppR package in R [45] to determine how genetic diversity is distributed across the following scales: between genetic clusters identified through fastSTRUCTURE (described below); among populations within genetic clusters; among samples within each population; and within samples. Significance was assessed with 1000 Monte Carlo simulations.

Phylogenetic Relationships
We inferred phylogenetic relationships across wild and cultivated sunflower using RAxML 8.2.1 [46]. Analyses were conducted under GTR + CAT with ascertainment bias correction, with 20 tree searches and 100 bootstrap replicates to assess support. Trees were rooted with samples of H. petiolaris.

Population Clustering
We estimated individual ancestry coefficients with fastSTRUCTURE 1.0 [47] and ADMIXTURE 1.3 [48]. fastSTRUCTURE and ADMIXTURE analyses were run 5 times for K = 1-20 clusters using default parameters. The optimal number of clusters was determined using the "chooseK" tool and 10-fold cross-validation for fastSTRUCTURE and ADMIXTURE analyses, respectively. We then visualized samples in two-dimensional genetic space with principal component analysis (PCA) as implemented in the R package LEA [49].

Demographic History of Domestication
We modeled the divergence history between wild and cultivated sunflower using the diffusion approximation approach implemented in δaδi 2.0.3 [50]. We formulated three models that vary with respect to the presence and directionality of gene flow. Model A ( Figure S1A) describes a simple divergence without gene flow scenario, where ancestral populations of sunflower split at time T into wild and cultivated lineages. Following the split, the wild lineage undergoes an instantaneous size change to a current effective size of Nwild-current, while the cultivated lineage has a founding effective size of Ncult-founder, that grows or declines to a current effective size, Ncult-current. Model B ( Figure S1B) expands on Model A, and describes a divergence with gene flow scenario, allowing for symmetrical gene flow (Mw↔c) between the wild and crop lineages. Similarly, Model C ( Figure S1C) describes the same scenario as Model B, but allows for asymmetric gene flow (Mw→c, migration from wild into cultivated; Mw←c migration from cultivated into wild) between the lineages.
An unfolded 2D site-frequency spectrum was generated using the program easySFS (https: //github.com/isaacovercast/easySFS), sampling 24 and 30 haplotypes from the cultivated and wild lineages, respectively, to maximize the number of segregating SNPs for analysis [50]. Model fitting was performed using dadi_pipeline (https://github.com/dportik/dadi_pipeline), described in Portik et al. [51]. dadi_pipeline was run using custom settings (rounds = 4; replicates = 50, 50, 50, 100; algorithm steps = 3, 5, 15, 50; -fold parameters = 3, 2, 2, 1) and models were extrapolated to a grid size of 40, 50, 60 points and fitted with Nelder-Mead optimization. Maximum-likelihood parameter estimates from the best replicate run (i.e., highest log-likelihood) for each model were used to calculate the Akaike information criterion (AIC) scores for model testing [52] following Carstens et al. [53]. Standard deviations for parameter estimates were obtained using the FIM approach [54], which has been demonstrated to provide reasonable uncertainty estimates for datasets composed of effectively unlinked SNPs, compared to more computationally expensive bootstrapping. Parameter estimates and their associated 95% confidence intervals were converted to biological units assuming a mutation rate of 6.1 × 10 −9 substitutions/site/generation [55], and an effective sequence length ([bases sequenced to derive SNPs]*[SNPs used in the frequency spectrum/total number of SNPs called]) of L = 11.7 × 10 6 bp.

Patterns of Genetic Diversity across Breeding Pools and over Geographic Space
There were notable differences in nucleotide diversity (π) among breeding pools (Figure 2A), with a roughly two-fold difference in π between wild sunflower and both the exotic and elite lines (mean (95% CI): wild sunflower = 0.096 (0.094-0.098); exotic = 0.054 (0.051-0.058); HA = 0.046 (0.042-0.049); RHA = 0.036 (0.033-0.039)). Differences in π amongst cultivated samples were less pronounced, but the HA and RHA lines possessed 85% and 66% of the nucleotide diversity present in the exotic lines, respectively. The marked differences in diversity across breeding pools in H. annuus indicates that primitive domesticated lines and improved cultivars harbor progressively less genetic diversity than their wild progenitor.

Patterns of Genetic Diversity across Breeding Pools and over Geographic Space
There were notable differences in nucleotide diversity (π) among breeding pools (Figure 2A), with a roughly two-fold difference in π between wild sunflower and both the exotic and elite lines (mean (95% CI): wild sunflower = 0.096 (0.094-0.098); exotic = 0.054 (0.051-0.058); HA = 0.046 (0.042-0.049); RHA = 0.036 (0.033-0.039)). Differences in π amongst cultivated samples were less pronounced, but the HA and RHA lines possessed 85% and 66% of the nucleotide diversity present in the exotic lines, respectively. The marked differences in diversity across breeding pools in H. annuus indicates that primitive domesticated lines and improved cultivars harbor progressively less genetic diversity than their wild progenitor.
We observed moderate genetic differentiation (FST = 0.169) and inbreeding (FIS = 0.177) across and within populations of wild sunflower. FST between populations varied widely (range = 0.048-0.336; Table S1), and genetic differentiation was found to be spatially structured, as indicated by a significant pattern of IBD ( Figure 2B; Mantel's r = 0.371, P = 0.007). Furthermore, clinal patterns in the distribution of diversity were observed, as indicated by significant declines in π with increasing latitude ( Figure 2C; F1,14 = 4.82, P = 0.045, r = -0.450) and decreasing longitude ( Figure 2D; F1,14 = 18.1, P < 0.001, r = 0.729). AMOVA found that most genetic variation is partitioned within samples (68.8%, P < 0.001), with relatively little variation explained by differences between samples (13.2%, P < 0.001), populations (13.3%, P < 0.001), and genetic clusters (4.56%, P < 0.001). The sum of these results suggest that genetic diversity is continuously distributed across the range of wild sunflower, with the highest levels of diversity being concentrated in populations located in the southeastern portion of the range. In wild sunflower, pairwise genetic distances increase with geographic distances between populations (B). Furthermore, in wild sunflower, nucleotide diversity decreases with increasing latitude (C) and decreasing longitude (D).

Phylogenetic Relationships
Phylogenetic analyses infer a clear geographic pattern of lineage splitting in wild sunflower ( Figure 3A). Samples from each population were resolved as monophyletic (ML BS > 85), with the exception of the population in Wyoming (WY1), where one sample was resolved as sister to a clade of samples from a neighboring population in Montana. The earliest diverging lineage in our sample We observed moderate genetic differentiation (F ST = 0.169) and inbreeding (F IS = 0.177) across and within populations of wild sunflower. F ST between populations varied widely (range = 0.048-0.336; Table S1), and genetic differentiation was found to be spatially structured, as indicated by a significant pattern of IBD ( Figure 2B; Mantel's r = 0.371, P = 0.007). Furthermore, clinal patterns in the distribution of diversity were observed, as indicated by significant declines in π with increasing latitude ( Figure 2C;  Figure 2D; F 1,14 = 18.1, P < 0.001, r = 0.729). AMOVA found that most genetic variation is partitioned within samples (68.8%, P < 0.001), with relatively little variation explained by differences between samples (13.2%, P < 0.001), populations (13.3%, P < 0.001), and genetic clusters (4.56%, P < 0.001). The sum of these results suggest that genetic diversity is continuously distributed across the range of wild sunflower, with the highest levels of diversity being concentrated in populations located in the southeastern portion of the range.

Phylogenetic Relationships
Phylogenetic analyses infer a clear geographic pattern of lineage splitting in wild sunflower ( Figure 3A of wild sunflower was resolved as a population from south Texas (TX1; ML BS = 100), followed by a population from central Texas (TX2; ML BS = 97). Samples from outside of Texas form a wellsupported clade (ML BS = 96), with populations from New Mexico and the central Great Plains region (Oklahoma, Kansas, and Nebraska) forming a grade (i.e., relationships resolved among these populations were resolved with ML BS < 50), with respect to a strongly-supported western clade (ML BS = 100), comprising populations from Colorado, Wyoming, Montana, and Alberta. Relationships in the western clade show a south-north pattern of lineage splitting, mirroring the patterns observed more broadly across wild sunflower. Taken together, these phylogenetic patterns suggest that range expansion in wild sunflower occurred along two separate, south-north migration fronts, with multiple genetic lineages colonizing the central portion of the Great Plains, and a single genetic lineage migrating into and diversifying over the western portion of its range. All cultivated sunflower accessions were resolved as a strongly supported clade (ML BS = 100) nested within the New Mexico-Central Great Plains grade ( Figure 3A). Four Native American landraces (i.e., Hopi Dye, Arikara, Seneca, Maíz de Tejas, Maíz Negro) diverged early and form a grade at the base of the cultivated clade ( Figure 4A). There is little phylogenetic structure after the early diverging Native American landraces, but cultivated accessions appear to sort largely by market type (i.e., oilseed vs. confectionery) rather than heterotic group. This is apparent in the resolution of all but two of the oilseed lines as a clade (the most-inclusive clade containing the oilseed lines PI 599775 (HA123) and two high-oil OPVs (Peredovik and VNIIMK 8931)), and the paraphyly of the confectionery lines. Two oilseed lines (PI 599771 (HA061) and PI 561918 (HA378)) cluster with the confectionery lines, which may be a result of introgressions rather than independently derived oilseed lines. Overall, relationships among cultivated accessions are decidedly complex, but the resolution of Native American landraces at the base of the cultivated sunflower phylogeny is consistent with the view that all modern cultivars of sunflower are descended from Native American landraces. Furthermore, the paraphyly of the confectionery lines, and the sorting of most oilseed lines into a clade, indicate that the oilseed lines were derived from a non-oilseed progenitor, consistent with the known breeding history of cultivated sunflower.  All cultivated sunflower accessions were resolved as a strongly supported clade (ML BS = 100) nested within the New Mexico-Central Great Plains grade ( Figure 3A). Four Native American landraces (i.e., Hopi Dye, Arikara, Seneca, Maíz de Tejas, Maíz Negro) diverged early and form a grade at the base of the cultivated clade ( Figure 4A). There is little phylogenetic structure after the early diverging Native American landraces, but cultivated accessions appear to sort largely by market type (i.e., oilseed vs. confectionery) rather than heterotic group. This is apparent in the resolution of all but two of the oilseed lines as a clade (the most-inclusive clade containing the oilseed lines PI 599775 (HA123) and two high-oil OPVs (Peredovik and VNIIMK 8931)), and the paraphyly of the confectionery lines. Two oilseed lines (PI 599771 (HA061) and PI 561918 (HA378)) cluster with the confectionery lines, which may be a result of introgressions rather than independently derived oilseed lines. Overall, relationships among cultivated accessions are decidedly complex, but the resolution of Native American landraces at the base of the cultivated sunflower phylogeny is consistent with the view that all modern cultivars of sunflower are descended from Native American landraces. Furthermore, the paraphyly of the confectionery lines, and the sorting of most oilseed lines into a clade, indicate that the oilseed lines were derived from a non-oilseed progenitor, consistent with the known breeding history of cultivated sunflower.

Population Clustering
fastSTRUCTURE and ADMIXTURE analyses infer diffuse, geographically defined population structure across wild sunflower. Both analyses disagree with respect to the optimal value of K (fastSTRUCTURE, K = 3; ADMIXTURE, K = 8), but consistently identified a distinct cluster of cultivated accessions while sorting wild samples into increasingly smaller, geographically defined clusters until K = 17, where all wild samples were sorted by their collecting locality. This fractal pattern of population clustering is consistent with IBD, so we present results from fastSTRUCTURE analyses for K = 2-6, which circumscribe landscape-level, geogenetic clusters ( Figure 3B). At K = 2, samples were sorted into cultivated and wild clusters. At K = 3, the wild samples split into two clusters corresponding to a southern/eastern and a western cluster. At K = 4, the western cluster split into a southern-western cluster composed of samples from Colorado populations and a northernwestern cluster composed of samples from Wyoming, Montana, and Alberta. Finally, at K = 5, the southern-eastern cluster split into a southern cluster composed of samples from Texas, New Mexico and Oklahoma and an eastern cluster composed of samples from Kansas and Nebraska. Instances of admixture were uncommon across populations, with most samples possessing > 80% ancestry in a

Population Clustering
fastSTRUCTURE and ADMIXTURE analyses infer diffuse, geographically defined population structure across wild sunflower. Both analyses disagree with respect to the optimal value of K (fastSTRUCTURE, K = 3; ADMIXTURE, K = 8), but consistently identified a distinct cluster of cultivated accessions while sorting wild samples into increasingly smaller, geographically defined clusters until K = 17, where all wild samples were sorted by their collecting locality. This fractal pattern of population clustering is consistent with IBD, so we present results from fastSTRUCTURE analyses for K = 2-6, which circumscribe landscape-level, geogenetic clusters ( Figure 3B). At K = 2, samples were sorted into cultivated and wild clusters. At K = 3, the wild samples split into two clusters corresponding to a southern/eastern and a western cluster. At K = 4, the western cluster split into a southern-western cluster composed of samples from Colorado populations and a northern-western cluster composed of samples from Wyoming, Montana, and Alberta. Finally, at K = 5, the southern-eastern cluster split into a southern cluster composed of samples from Texas, New Mexico and Oklahoma and an eastern cluster composed of samples from Kansas and Nebraska. Instances of admixture were uncommon across populations, with most samples possessing > 80% ancestry in a given cluster. However, one Native American landrace accession (Hopi Dye) was consistently estimated to have ca. 50% membership in other clusters across all values of K, and samples from the Wyoming population were found to possess 30-40% admixed ancestry at K = 4 and 5 ( Figure 3B). PCA recapitulates these patterns, with cultivated accessions positioned distantly from wild samples ( Figure 3C), and wild samples from nearby populations grouping together in PC space ( Figure 3D).
The population structure within cultivated sunflower is more complex. fastSTRUCTURE and ADMIXTURE analyses favored lower values of K (fastSTRUCTURE, K = 2; ADMIXTURE, K = 1) and, for both analyses, ancestry assignments at K > 3 were difficult to interpret. That being said, we present fastSTRUCTURE results for K = 2-3, which reveal subtle, biologically interpretable structure within the cultivated lines. At K = 2, a single Native American landrace accession (Hopi Dye) is inferred as its own cluster, with three other Native American landrace accessions (Arikara, Seneca, Maíz de Tejas) sharing some ancestry (< 20%) with the Hopi Dye cluster ( Figure 4B). At K = 3, Hopi Dye remains a unique cluster, and two additional clusters emerge to separate accessions largely by market type (i.e., oilseed vs. confectionery lines) rather than breeding pool ( Figure 4B). PCA also infers subtle structure, with little separation of accessions by breeding pool (Figure 4C), and some differentiation occurring among market types ( Figure 4D).

Demographic Reconstruction
Demographic reconstructions estimate that the wild and cultivated sunflower lineages diverged between 900-5400 ybp ( Table 3). All models estimate current effective size of the wild lineage (Nwild-current) to be roughly 10-to 20-times greater than the current effective size of the cultivated lineage (Ncult-current). Similarly, all models estimate up to a 20-fold reduction in effective size between the founding (Ncult-founder) and contemporaneous (Ncult-current) cultivated lineage. These dramatic differences in current and historical population sizes between the cultivated and wild lineages are consistent with significant losses of genetic diversity during domestication and subsequent improvement. AIC favored model C (Table 3; Figure 5; Figure S2), which models asymmetric migration between the cultivated and wild lineages. In this model, the cultivated and wild lineages diverged 5370 ybp, with the cultivated lineage undergoing sequential bottlenecks, ultimately resulting in a nearly 20-fold reduction in the effective size of the modern breeding pool. Migration from the wild lineage into the cultivated lineage (Mw→c) was estimated at 3.81 migrants per year (95% CI: 3.58-4.04), which is an order of magnitude greater than migration from the cultivated into wild lineage (Mw←c = 0.35 (0.301-0.405)).

Phylogeography of Wild Sunflower
Populations across the range of wild sunflower have diverged primarily along a south-north axis ( Figure 3A), which is consistent with a scenario of postglacial range expansion from a southern refugium. The observed pattern of IBD ( Figure 2B), partitioning of most genetic diversity at finer spatial scales, and relatively weak population structure ( Figure 3B-D) indicates that genetic diversity is continuously distributed over the species range, and suggests that range expansion occurred rapidly following glacial retreat. Furthermore, linear declines in nucleotide diversity with increasing latitude and decreasing longitude ( Figure 2C,D) indicate that range expansion likely occurred in a stepwise fashion [56], through sequential founding events as colonizing populations migrated north-and westward from refugial populations located in the southeastern portion of the range.
Paleoecological studies provide a finer resolution to the approximate locations of refugial areas during the late Pleistocene, and suggest that many of the aforementioned species may have persisted in macrorefugia distributed along the Gulf Coast [74], with colder-adapted species surviving in smaller refugia along the Atlantic Coast [75], or in cryptic microrefugia at mid-latitudes [76,77]. In the case of wild sunflower, there are no fossils that place it in any of these regions during the LGM. However, there are records of composite pollen originating during the LGM from eastern Texas, the lower Mississippi River Valley, peninsular Florida, and the coastal Carolinas [74]. Given that the core of the wild sunflower distribution is centered in the Great Plains region [5], and that phylogenetic and genetic diversity is concentrated in the southern and eastern portions of its range, we postulate that its refugial areas may have been located in adjacent areas such as eastern Texas and the lower Mississippi River Valley, both of which harbored grassland species during the LGM [74]. Future ecogeographic studies incorporating well-curated distributional data of wild sunflower populations and paleoclimatic niche modelling would serve as excellent tests of this hypothesis.
Our analysis of wild sunflower adds to the growing body of work demonstrating the general trend of south-north postglacial migration inferred in wide-ranging North American plant species. Interestingly, there is a dearth of phylogeographic studies that have examined a wide-ranging plant species that spans the entirety of the Great Plains (but see [78]). As such, our study provides interesting insights into central North American plant phylogeography. One pattern inferred in wild sunflower is the resolution of separate south-north patterns of lineage splitting in the western and central Great Plains region ( Figure 3A), with multiple genetic lineages colonizing the central Great Plains, and a single genetic lineage migrating and diversifying over the western Great Plains. This pattern may be explained by the physiography of the western Great Plains, which is at higher elevation and possesses a cool, arid climate. This region-the High Plains-has been shown to be an important biogeographic break for many animals in North America [79], where previously isolated taxa/populations situated on either side of the region have been shown to have come into secondary contact as climates warmed during the Holocene [80][81][82]. Indeed, in our study, samples from a population collected in eastern Wyoming were resolved as paraphyletic ( Figure 3A), possessing slightly greater nucleotide diversity compared to other populations at similar latitudes ( Figure 2C) and having > 30% admixed ancestry at higher values of K ( Figure 3B). Our findings in wild sunflower suggest that the High Plains played an important role in generating contemporary patterns of divergence and genetic structure in not just animals, but wide-ranging plants distributed across central North America.

Insights into the Domestication and Breeding History of Cultivated Sunflower
Patterns of divergence and population structure in cultivated sunflower are complex, but largely reflective of its domestication and breeding history. Cultivated accessions form a strongly supported clade, nested among wild sunflower populations from the central Great Plains (Figure 3A), and estimated to have arisen into an independently evolving entity ca. 5370 ybp (Table 3; Figure 5). Within the cultivated clade, five Native American landraces (Hopi Dye, Arikara, Seneca, Maíz de Tejas, and Maíz Negro) were resolved as the earliest diverging lineages, which split in succession from a single founding lineage and eventually gave rise to the modern cultivars ( Figure 4A). These findings agree well with those from previous studies, demonstrating that extant sunflower cultivars trace back to a single origin of domestication [13][14][15] ca. 5000 ybp [7] in east-central North America [13,16]. Propagules of this initial domestication were then presumably dispersed between different Native American cultures who used it for food and cultural purposes [8].
Patterns of phylogenetic and population structure outside of the early diverging landraces become apparent when the cultivars are coded by market type (i.e., oilseed vs. confectionery) rather than breeding pool (i.e., exotic vs. HA vs. RHA) (Figure 4). The sorting of cultivars by market type is not unexpected [17,21,22], as early breeding efforts in Eastern Europe were focused on increasing oil content [10], which likely resulted in substantial genetic differentiation. The development of inbred lines and accompanying transition to hybrid breeding occurred much more recently [83]. Our results reflect this history, where nearly all oilseed lines were resolved as a clade in relation to a grade composed of confectionery lines, with two Russian developed high-oil OPVs (Peredovik and VNIIMK 8931) splitting early within the oilseed clade's history. Given the limited sample of accessions included in our study, future studies examining a greater number and diversity of both wild and domesticated lineages will be useful in confirming the patterns inferred in this study, and gaining more pointed insights into the origin of domesticated sunflower and the effects of historical breeding efforts in generating observed patterns of relatedness and genetic structure. Of particular value might be an expansion of the wild sunflower sampling to provide better coverage of the eastern and western portions of its range.

Domestication and Its Effects on Polymorphism
Domestication and improvement have generated large differences in observed levels of genetic diversity in the wild and cultivated sunflower breeding pools. Unsurprisingly, exotic and elite lines were found to harbor ca. 60% and 50% of the nucleotide diversity (π) present in wild sunflower, respectively (Figure 2A). These results compare favorably with those from previous surveys of SSR and SNP diversity in sunflower, which have consistently estimated up to a 50% reduction in various measures of diversity (primarily gene diversity (H e )) between wild and cultivated sunflower [16][17][18][19][20][21]84]. The consistency across studies and marker types suggest that the effects of domestication and improvement were dramatic, affecting both SSR and SNP variation across the sunflower genome.
Demographic reconstructions provide some insight into the timing and order of these changes in genetic diversity (Table 3; Figure 5). For example, we inferred a dramatic 12-fold reduction in effective size over the history of the cultivated lineage (i.e., Ncult-founder vs. Ncult-current) and a more subtle 1.5-fold difference in effective sizes between the wild lineage and the founding population of the cultivated lineage (i.e., Nwild-current vs. Ncult-founder). These results support the notion that genetic diversity in cultivated sunflower was lost progressively, with a moderate loss of diversity during the initial domestication bottleneck, and more severe reductions in diversity following strong directional selection and additional bottlenecks during improvement (reviewed in [85][86][87]). Losses in genetic diversity following domestication and improvement are a common feature of many cultivated plant species [87,88]. However, the patterns observed in sunflower contrasts with those reconstructed in other annual crops such as common bean and maize, where current effective sizes are much larger than their inferred domestication bottleneck sizes, possibly due to rapid population expansion or ongoing gene flow with wild relatives [89][90][91].
In sunflower, moderate rates of gene flow from the wild into cultivated breeding pools do not appear to have strongly influenced current effective sizes (Table 3; Figure 5), which may be reflective of the targeted nature of introgression events in cultivated sunflower breeding (e.g., the introduction of disease resistance loci from wild donors [84,92,93]). Overall, these results are consistent with the known history of cultivated sunflower, but many issues remain unresolved: specifically, the duration of the domestication bottleneck, and the tempo and mode of bottleneck-induced population declines. A genomic analysis of contemporary and archeological specimens (e.g., [94]) with recently developed methods designed to infer more granular changes in effective population size through time [95] may be useful in generating richer insights into the broad demographic patterns observed in this study.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4425/11/3/266/s1, Table S1: Pairwise F ST estimates between the cultivated accessions and wild sunflower populations, Figure S1: Schematic representations of demographic models fitted using δaδi. Estimated parameters are noted with text for each model, Figure S2. δaδi analysis of model C. Upper panels are the observed and expected site frequency spectra. Lower panels are a heat map and histogram of residuals.