Development of a Multipurpose Core Collection of Bread Wheat Based on High-Throughput Genotyping Data

Modern plant breeding practices have narrowed the genetic base of wheat throughout the world, increasing crop vulnerability. Therefore, there is clearly a need for introducing new germplasm in breeding programs to search for variability related to traits of agronomic interest for wheat improvement. The existence of subsets of accessions (core collections) that represent the diversity conserved in germplasm collections is a favored approach for breeders to explore novel variation and enhance the use of germplasm. In this study, a core collection of Spanish landraces of bread wheat has been created using high-throughput genotyping technologies (DArTseq), which yielded more than 50 K molecular markers. This marker system not only provides a robust estimate of the diversity, but also information about its distribution in the genome. Two core collections of 94 entries were created by using two common sampling strategies: the maximization strategy and the population structure-based method. Both core collections showed high geographic, phenotypic and genetic representativeness, but the collection obtained with the maximization strategy captured better the diversity displayed by the initial collection. This core collection, which includes a broad range of adapted genotypes, can be efficiently utilized for mining new alleles for useful traits in wheat breeding.


Introduction
Bread wheat (Triticum aestivum L.) is a major staple food crop that is widely grown throughout the world. Currently, in order to meet the increasing requirements of a growing population and tackle the challenges of global climate change, the genetic improvement of this crop must achieve several goals, including higher yield, adaptation to specific environments, tolerance to biotic stresses and quality enhancement. Modern plant breeding practices, in which only a small number of elite cultivars are included in breeding programs, have narrowed the genetic base of wheat throughout the world, increasing crop vulnerability. Therefore, there is clearly a need for introducing new germplasm in breeding programs so as to broaden the gene pool in which to search for new traits of agronomic interest necessary for wheat improvement. Wheat landraces are among the most suitable germplasm resource where the genetic variation required to that end can be searched [1] and utilized through a method formed core collections that had higher average genetic distances between genotypes, whereas the M-strategy captured more allelic diversity [20,21].
There are several computer programs to aid in core collection design. The Core Hunter algorithm has proved to be a fast and powerful method for designating core collections with increased genetic diversity, which can be applied with or without stratification of the whole collection [21][22][23]. Core Hunter has additional advantages: repeating the selection process produces a consistent solution, it is freely available and it is less time-consuming than other algorithms [21,22].
The objectives of the research were (a) to apply DArTseq GBS technology to provide a molecular basis for the design of a core collection of Spanish wheat landraces using the two most common approaches, M-strategy and stratified sampling, and (b) to evaluate the quality of the resulting core collections in order to select the most appropriate for a more efficient use of this germplasm in breeding. The use of the GBS technology has allowed the selection of a core collection based on more than 50K molecular markers distributed along the whole genome, and the detection of the presence of genomic regions where the different sampling methods employed performed differently. The results showed that the core collection created with the M-strategy using the Core Hunter algorithm performed better at retaining the diversity available in the initial collection.

Materials
The Spanish National Plant Genetic Resources Centre, CRF-INIA (Centro de Recursos Fitogenéticos, INIA, Madrid), maintains the national collection composed of 522 Spanish landraces of Triticum aestivum subsp. vulgare (Vill.). From this collection, a total of 189 genotypes were selected based on their collection site data (altitude, longitude, latitude [24]) and morphological spike traits (see Supplementary Tables S1 and S2) to represent the available diversity [25]. Homozygous lines were derived from these selected genotypes by collecting single bagged spikes from single selected plants during three generations. These 189 genotypes constituted the primary subset collection (PS) from which the entries for the final core collection were selected.
In the present study, the term "accessions" refer to genotypes that constitute the PS and "entries" are genotypes of the core collection [26].

Genetic and Phenotypic Characterization
High-throughput genotyping data for the PS accessions were obtained by DArTseq GBS technology at SAGA (Genetic Analysis Service for Agriculture, Mexico City, Mexico) as described in Pascual et al. [27]. This genotyping technology produces two different sets of markers: SNPs (Single Nucleotide Polymorphisms) and PAVs (Presence Absence Variants), from now on referred as DArTs (Diversity Arrays Technology markers). For this study, we selected a total of 59,276 DArTs and 14,830 SNPs, which were obtained after filtering out the markers that presented the same allelic profile or more than 10% missing data, as described by [27]. In that study, the genetic structure of the 189 accessions of the present research were analysed based on DArT markers, and the allelic profiles for the vernalization gene Vrn-A1 and the Glu-1 homoeoloci, determinants of wheat quality [28], were obtained. Both winter and spring landraces are included in the PS [27].
For phenotypic characterization, the accessions were sown in an augmented design during the season 2016-2017 at Alcala de Henares (Madrid). Seven qualitative (growth habit, awnedness, awn color, spike density, glume hairiness, glume color and seed color) and five quantitative agromorphological traits (days to heading and to maturity, plant height, spike length and spikelets per spike) were recorded according to the International Board of Plant Genetic Resources (IBPGR) [29] from five different plants in each accession.

Creation of the Core Collections
In order to determine the optimal collection size, simulations for sizes ranging from 5% to 100% of accessions included in the PS were performed with the DArT markers in the software Bio-R [30], which provides a graphical interface for the Core Hunter algorithm [21]. For simulations, the heterozygosity of the selected collections (HE = 1) was maximized, while the default values for the rest of parameters were maintained. Finally, the relationship between collection size and genetic diversity, quantified as the number of polymorphic markers retained, was examined. The optimal size was established as the point where the number of polymorphic markers increased asymptotically.
Two core collections were constructed based on the DArT markers. The maximization core collection (MCC) was obtained with the M-strategy using the Core Hunter algorithm and the expected heterozygosity (HE = 1) as the criteria of maximization (as described for the simulations). The stratified core collection (SCC) was created using the stratified sampling strategy. First, accessions were grouped based on populations. Then, inside each population, a number of accessions proportional to the genetic diversity (H s ), calculated with the DArTs as described by Nei [31], was selected to maximize the expected heterozygosity (HE = 1). Finally, a random core collection (RCC), where accessions were sampled randomly from the PS, was created to serve as reference.

Evaluation of the Core Collections
The quality of the different core collections created was evaluated using geographic, agromorphological and genetic data. Statistical analyses were performed with the software R version 3.5.2 [32].
For qualitative agromorphological traits and allelic profiles for the Vrn-A1 and Glu-1 loci, significant differences between the frequencies in the core collections and the PS were checked by Fisher's Exact Test (p-value < 0.05) [33]. For quantitative characters, the mean, variance, range and coefficient of variation were calculated for the PS, and for each one of the core collections. A homogeneity test (F-test) for variances and a t-test for means (p-value < 0.05) were used to compare the core collections and PS. The following evaluation parameters were calculated as described by Hu et al. [8]: mean difference percentage (MD), variance difference percentage (VD), coincidence rate of range (CR) and variable rate of coefficient of variation (VR). According to these parameters, a core collection can be considered representative if the percentage of traits with significant differences in their means is less than 20% (MD ≤ 20) and the coincidence rate of the range retained by the core collection is greater than 80% (CR ≥ 80%) [8].
The genetic diversity captured in each core collection was assessed with SNP markers. Different approaches were followed to evaluate the collections. First, the genetic diversity (H s ; [31]) was calculated for the PS and each of the core collections. Second, SNPs markers were classified according to their MAF (Minimum Allele Frequency) in: >0.1 (present in at least 19 accessions), ≥0.05 (9 accessions), ≥0.03 (6 accessions), >0.01 (2 accessions) and ≤0.01 (only in one accession). The number of markers fixed for each category in the core collections was calculated. Third, accessions selected and non-selected in each core collection were plotted in the Principal Coordinate Analysis (PCoA) performed by [27] to detect areas not sufficiently covered by the different core collections. Fourth, H s along the bread wheat genome in the PS and different core collections was calculated based on SNP markers located in the bread wheat genome as described by [27]. Results were analyzed in order to detect genomic regions in which the core collections failed to retain the available diversity.
Finally, in order to quantify the degree of dissimilarity, Gower's genetic distances [34] between accessions were computed using the agromorphological data, SNP markers and Glu-1 and Vrn-A1 alleles. For each core collection, entry to entry mean-distance (E-E), the distance between each accession in the PS and the nearest entry in the core collection (A-NE) and the distance between each entry in the core collection and the nearest neighboring entry (E-NE) were calculated and averaged over all entries as described in [26]. A-NE represents the selection of entries close to each accession in the PS; thus, lower values are obtained for greater representativeness in the core entries. The E-NE distance indicates the presence of groups of similar entries in the core; thus, both the mean and minimum E-NE reach a maximum when all the entries are far apart.

Results
High-throughput genotyping provides genetic information that can guarantee the full inclusion of the available genetic diversity when creating a core collection. In order to avoid constraints in terms of budget and time, a primary set of accessions with low genetic redundancy, representing the entire collection, was selected before genotyping. From the full collection of 522 T. aestivum subsp. vulgare accessions, we selected a PS of 189 landraces covering the full collection range for latitude, longitude and altitude (Supplementary Table S1). This PS included landraces collected in the first half of the 20th century from all Spanish regions (including the two Spanish archipelagos), in which nine agroecological growing zones have been described [35]. The PS also covers the variability for six traits currently used in wheat accession characterization [29] (Supplementary Table S2). According to a previous study [27], the PS was subdivided into 4 populations, with Pop 2 having the highest number of accessions. The genetic diversity values (H s ) in each population ranged from 0.13 to 0.32 (Table 1).

Creation of the Core Collections
The final size of the core collection was determined based on simulations using the 59,276 polymorphic DArT markers present in the PS. For sizes larger than 94 entries, the genetic gain (estimated as the number of polymorphic markers) increased asymptotically ( Figure S1). Thus, the size of the core collection was established in 94 genotypes, which captured 56,451 of the polymorphic DArT markers.
DArT markers were also used to select entries of the core collections. The MCC was obtained with the M-strategy by maximizing the expected heterozygosity. The SCC was created using the stratified sampling strategy based on the genetic structure of the PS and the genetic diversity within each population. Thus, those populations with higher diversity contributed a higher proportion of entries to the core collection. Finally, the RCC collection was established by randomly selecting entries from the PS. The final number of entries from each population included in the MCC, SCC and RCC are indicated in Table 1. PS, primary set; MCC, core collection generated with maximization strategy; SCC, core collection generated with stratified sampling; RCC, core collection generated with random sampling.

Evaluation of Core Collections
A core collection of germplasm is a useful approach for breeders only when it includes most of the available variation (both genotypic and phenotypic). In this study, each core collection was evaluated considering genotypic and phenotypic data not used for the selection of entries. We evaluated the quality of the three core collections (CCs) at different levels: (1) representativeness of the CC for geographic, phenotypic and allelic (Glu-1 and Vrn-A1 loci) variability present in the PS; (2) allelic richness estimated from SNPs; (3) degree of dissimilarity and redundancy according to distances between accessions; and (4) distribution of the genetic variability included in each CC with respect to the PS and along the full genome.

Representativeness of the Core Collections
The three CCs represented adequately the geographic diversity and variability of qualitative morphological traits present in the PS (Table 2, Figure 1). Table 2. Latitude, longitude and elevation ranges covered by the primary set and the core collections.

Latitude
Longitude PS, primary set; MCC, core collection generated with maximization strategy; SCC, core collection generated with stratified sampling; RCC, core collection generated with random sampling.
Agronomy 2020, 10, x FOR PEER REVIEW 6 of 16 Table 2. Latitude, longitude and elevation ranges covered by the primary set and the core collections.

Latitude
Longitude PS, primary set; MCC, core collection generated with maximization strategy; SCC, core collection generated with stratified sampling; RCC, core collection generated with random sampling. The results also showed that the frequency distributions of the qualitative agromorphological traits in the CCs were not significantly different from the PS (p-values for Fisher tests ranging from 0.37 to 1). For the quantitative traits, no significant differences among means were detected in the CCs (Table 3). Moreover, the null values for the evaluation parameters MD and VD indicated that The results also showed that the frequency distributions of the qualitative agromorphological traits in the CCs were not significantly different from the PS (p-values for Fisher tests ranging from 0.37 to 1). For the quantitative traits, no significant differences among means were detected in the CCs (Table 3). Moreover, the null values for the evaluation parameters MD and VD indicated that the PS was properly represented in the core collections (Table 3), and the high VR and CR values in the SCC and MCC revealed the prevalence of diverse entries in these subsets. The MCC generally had the highest coefficient of variation values, higher than those of the PS for some traits. Regarding the genotypic data, all Glu-1 alleles were included in the MCC and SCC (Table 4), whereas the RCC failed to capture one allele at each of the Glu-B1 and Glu-D1 loci. The three CCs included the three Vrn-A1 alleles identified in the Spanish landraces in a similar proportion, especially in the MMC (Table 4). Table 4. Alleles at the Glu-1 homoeoloci and gene Vrn-A1 in the primary set and the core collections.

Representativeness of the Core Collections
The allelic richness of the CCs was evaluated with 14,830 polymorphic SNPs in the PS. To analyze the degree of allele fixation in the three subsets, we studied the presence of monomorphic markers ( Table 5). All the SNPs with predominant alleles (MAF > 0.1) were polymorphic in the three CCs. For the rest of the markers, the MCC had the lowest number of fixed markers, whereas the SCC showed the highest values. Taking into account the number of accessions in the PS, the least frequent allele in markers with MAF ≤ 0.01 was present in only one accession. Thus, it was expected that some of them would not be included. Overall, the MCC possessed the highest gene diversity value, equal to that in the PS.

Distances between Entries
The mean Gower's distance between entries (E-E) was higher in the CCs than in the PS, thereby providing a gain of 1% in the RCC and SCC, and 2% in the MCC ( Table 6). The three CCs showed higher values for the minimum distances among the entries (mean E-NE distance and minimum E-NE distance) than the PS, especially the MCC. This last subset and the RCC also showed the lowest A-NE distances. PS, primary set; MCC, core collection generated with maximization strategy; SCC, core collection generated with stratified sampling; RCC, core collection generated with random sampling; E-E, mean distance between entries; A-NE, average distance between each accession in the PS and the nearest entry in the core collection; E-NE, average distance between each entry in the core collection and the nearest neighboring entry; E-NE min, minimum distance between each entry in the core collection and the nearest neighboring entry.

Distribution of Genetic Variability
The capture of the available genetic diversity in the three CCs was analyzed by representing the selected accessions in each one of them with respect to the PS by a PCoA analysis explaining 19.2% of the available SNPs diversity (Figure 2). The three subsets well covered the genetic variation, capturing accessions from the four populations of the PS. The RCC, however, failed to include some accessions from Pop 2 placed in the central part of the PCoA graph ( Figure 2).

Discussion
Several studies have shown the considerable variability among the Spanish wheat landraces compared to other germplasm collections [36,37]. These materials can be an important source of genes for wheat improvement, such as rust resistance [38] and quality traits [37,39]. However, even after the removal of redundant accessions, the collection maintained at CRF-INIA comprising more than 500 landraces is too large for evaluation and use in most breeding programs. Defining a core collection is, therefore, a pre-requisite and valuable tool for utilizing this germplasm, particularly for more complicated phenotypic screens, as has been demonstrated by the Spanish core collections of barley and durum wheat [40,41]. Both core subsets have facilitated detailed study of some difficult to analyze traits such as yield performance, root architecture, disease resistance or characterization of low molecular glutenin subunits [42][43][44][45][46].
On the other hand, the high genetic variability of the Spanish collection complicated the designing of a core collection of suitable size and able to capture the diversity present in the entire collection. To determine the optimal number of entries needed to retain an acceptable proportion of alleles present in the primary set, we tried to find a "point of compromise" between gain of genetic variation and elimination of genetic redundancy in the core collection. The final size of 94 entries captured 96% of polymorphic DArT markers present in the PS, which is in agreement with other studies that have reported values between 70 and 98% using SNPs [47,48] and SSR [40,49,50]. This size represents 18% of the entire collection, and is within the range from 5 to 30% recommended for

Discussion
Several studies have shown the considerable variability among the Spanish wheat landraces compared to other germplasm collections [36,37]. These materials can be an important source of genes for wheat improvement, such as rust resistance [38] and quality traits [37,39]. However, even after the removal of redundant accessions, the collection maintained at CRF-INIA comprising more than 500 landraces is too large for evaluation and use in most breeding programs. Defining a core collection is, therefore, a pre-requisite and valuable tool for utilizing this germplasm, particularly for more complicated phenotypic screens, as has been demonstrated by the Spanish core collections of barley and durum wheat [40,41]. Both core subsets have facilitated detailed study of some difficult to analyze traits such as yield performance, root architecture, disease resistance or characterization of low molecular glutenin subunits [42][43][44][45][46].
On the other hand, the high genetic variability of the Spanish collection complicated the designing of a core collection of suitable size and able to capture the diversity present in the entire collection.
To determine the optimal number of entries needed to retain an acceptable proportion of alleles present in the primary set, we tried to find a "point of compromise" between gain of genetic variation and elimination of genetic redundancy in the core collection. The final size of 94 entries captured 96% of polymorphic DArT markers present in the PS, which is in agreement with other studies that have reported values between 70 and 98% using SNPs [47,48] and SSR [40,49,50]. This size represents 18% of the entire collection, and is within the range from 5 to 30% recommended for retaining a great part of the genetic variability with a manageable number of accessions (e.g., [5,40,51]).
Preserving maximum genetic variation with a small number of accessions is a challenging task, and several improvements in sampling strategies have been devised over the last two decades. In order to create a multipurpose collection, we constructed two core collections using the most common approaches: M-strategy [16] and stratified sampling [6,16,17]. According to the objectives of these methodologies, we expected that the MCC maximized the total allelic diversity, and selected more diverse entries, whereas the SCC optimized the representativeness of the genetic diversity, including more representative entries. However, the effective utilization of the resulting core collections in breeding programs depends directly on their quality, which should be correctly evaluated.
Depending on the purpose of a core collection, a variety of metrics can be used to evaluate its quality. In the present study, considering that the aim is a multipurpose collection, the quality was evaluated using different types of variables such as geographic, phenotypic (discrete and continuous) and genotypic data that were not used in the core selection [26,51]. Also, a random core collection was generated to serve as a reference. Both the M-strategy and the stratified sampling selected entries that represented the geographic, phenotypic and genotypic variability of the PS, which validated both sampling strategies. However, the MCC included higher variation for quantitative agromorphological characters, indicating that this subset maximized the representativeness of the pattern of variation of these traits in the PS [26]. The allelic richness of the collections was analyzed with 14,830 SNPs covering all chromosomes. This genome-wide assay has proved to be a very convenient method for the analysis of the variability in germplasm collections [52]. On average, the CCs captured between 92.6 (in the SCC) and 93.8% (in the MCC) of SNPs present in the PS, and more than 90% of common alleles (0.1 ≥ MAF > 0.01). Common-localized alleles may be biologically specialized alleles that enhanced adaptation to different agroecological conditions, and is often the class of alleles most interesting to breeders. Other studies have reported that crop improvement is accompanied by a selective advantage of rare alleles present at low frequency (MAF < 0.05) [53,54]. The percentage of rare allele recovery was 82.9% in the SCC and 85.7% in MCC, which was in agreement with that obtained in other core collections with SNPs [55]. The core collections performed worse in preserving very rare-localized alleles (MAF ≤ 0.01) that were present in only one accession, which is in agreement with Wingen et al. [49]. Some studies reported that this type of allele, likely to be maintained by deleterious mutation-selection balance, would be of less interest; they seldom contribute to the improvement of elite varieties and, therefore, their inclusion in the CC might not be worthwhile [7,56,57]. In contrast, other authors have more recently proposed core subsets focused on preserving rare accessions and uncommon alleles, which may have unique genetic potentials for plant breeding [58]. In our case, the high number of accessions possessing specific rare alleles makes it difficult to retain a greater number of very rare alleles without increasing the sample size. Nevertheless, the MCC also maximized the coverage of this type of alleles and included an increased number of the most divergent accessions (56 in MCC, 52 in SCC and 50 in RCC out of the 94 accessions with the highest mean E-E distance in the PS). Moreover, the screening of the genetic diversity in each CC along the genome revealed that the MCC was able to better capture the available diversity from the PS in chromosomes 2A, 4A and 2D than the SCC.
Allelic richness is an evaluation criterion that ensures the inclusion of restricted alleles, whereas genetic distances between accessions is an evaluation criterion related to the concept of the maximization of the representation of genetic diversity in the whole collection [19,26,57]. In the present study, both phenotypic and genotypic variables were combined to calculate the distance among accessions since they provided complementary information thereby maximizing overall diversity for analysis [55,59]. The mean distance between accessions (E-E) reaches a maximum when diverse entries are sampled, but the presence of similar entries at the extreme ends of the distributions cannot be distinguished by this criterion. Therefore, two additional distances (A-NE and E-NE) recommended to evaluate the quality of multipurpose CCs were calculated [26]. The A-NE distance is a good criterion to evaluate the representativeness of genetic diversity of the PS, whereas the E-NE distance allows evaluation of whether the core collection has entries that are as different as possible from each other [26]. The lower A-NE and the higher E-NE distances in the MCC indicated that this subset maximized the representativeness of the genetic diversity of the PS and that all the entries were far apart genetically. The PCoA of the MCC also demonstrated that this subset was well distributed within the PS, covering the four genetic populations, even though the information on the genetic structure of the collection was not used to constrain the subset extraction. To some extent, this latter result could be related to the small number of populations identified in the collection [60].
Other studies have shown that the CC created based on the M-strategy maximizes the genetic variability index [23,40], whereas the structured method yields subsets that better represent the distribution of the genetic variability of the initial collections [7,20,40,57]. Also, worse values for E-NE distances have been reported in collections developed using the M-strategy [23,40]. In the present study, however, the M-strategy performed better than the stratified method, by increasing genetic diversity and reducing redundancy [26]. The higher quality of the MCC could be due to the lower degree of stratification of our collection. Only Pop 4 was clearly separated, while some overlap was shown among the other three populations, especially within Pop 2, which was the largest population [27]. In contrast, the stratified sampling performed better than the M-strategy in the Spanish durum wheat collection [40]. Comparisons of the genetic structure of the two wheat collections revealed that bread wheat populations exhibited a higher level of admixture and less genetic differentiation (Population differentiation index Dest = 0.22 in durum and 0.17 in bread wheat) [27]. This finding may explain why population-based sampling did not optimize the representativeness of the genetic diversity in bread wheat. Furthermore, the little gain by minimizing A-NE in the MCC compared to RCC could be also caused by the weaker structure of our bread wheat collection [26]. The results presented here have shown that the core collection designed with the M-strategy had superior performance; thus, this subset was selected as the Spanish core collection.
The Spanish wheat core collection constructed in the present study contains genotypes collected from every region in Spain where bread wheat is cultivated. Such coverage is essential because growing regions possess very diverse environmental conditions in terms of climate, altitude and soil characteristics. Wheat is grown from cold sub-humid areas in the northern parts of Spain to warm semi-arid regimes in the southeast [61], in basic or neutral soils in the Centre and East, and acid soils in the western regions [62]. Our core collection also includes all the Vrn-A1 alleles for the vernalization response identified in Spanish landraces. The Vrn-A1 gene is one of the most determinant loci involved in the transition from vegetative to reproductive growth [63], and thus for wheat adaptability. Such adaptation of Spanish landraces to different agroecological conditions has resulted in the accumulation of favorable alleles, including for stress tolerance, which can be incorporated into breeding programs [44,[64][65][66]. Furthermore, in the case of wheat, functional quality requirements must also be kept in mind in order to have a multipurpose collection useful for wheat improvement. Our collection covers the allelic variation for 30 alleles at the Glu-1 loci, the main genetic determinants of gluten quality [28,37].

Conclusions
The use of high-throughput genotyping technologies has allowed the selection of a core collection based on more than 50K molecular markers distributed across the whole genome. In addition, this approach has enabled us to detect the presence of genomic regions where different sampling methods employed performed differently. The M-method using the Core Hunter algorithm has demonstrated to be a fast and powerful method for designating core collections, especially for non-highly structured collections or in the absence of knowledge of a clear genetic structure in the whole collection. The core collection of Spanish landraces of bread wheat designed in the present study includes a broad range of adapted genotypes, and maximizes the representativeness of the genetic and phenotypic diversity in the initial collection of 522 landraces. This wheat core collection can be efficiently utilized in mining new alleles for useful traits and in broadening the genetic base in the cultivated wheat germplasm pool.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4395/10/4/534/s1, Figure S1: Relationship between number of polymorphic DArT markers and sample size. The trend line is shown in black. Table S1: Latitude, longitude and elevation ranges covered by the initial collection of 522 accessions and the primary set. Table S2: Qualitative agromorphological traits classes included in the initial 522 accessions and primary set.