Optimizing Training Population Size and Content to Improve Prediction Accuracy of FHB-Related Traits in Wheat

Genomic selection combines phenotypic and molecular marker data from a training population to predict the genotypic values of untested lines. It can improve breeding efficiency as large pools of untested lines can be evaluated for selection. Training population (TP) composition is one of the most important factors affecting the accuracy of genomic prediction. The University of Minnesota wheat breeding program implements genomic selection at the F5 stage for Fusarium head blight (FHB) resistance. This study used field data for FHB resistance in wheat (Triticum aestivum L.) to investigate the use of small-size TPs designed with and without stratified sampling for three FHB traits in three different F5 populations (TP17, TP18, and TP19). We also compared the accuracies of these two TP design methods with the accuracy obtained from a large size TP. Lastly, we evaluated the impact on trait predictions when the parents of F5 lines were included in the TP. We found that the small size TP selected randomly, without stratification, had the lowest predictive ability across the three F5 populations and across the three traits. This trend was statistically significant (p = 0.05) for all three traits in TP17 and two traits in TP18. Designing a small-size TP by stratified sampling led to a higher accuracy than a large-size TP in most traits across TP18 and TP19; this is because stratified sampling allowed the selection of a small set of closely related lines. We also observed that the addition of parental lines to the TP and evaluating the TP in two replications led to an increase in predictive abilities in most cases.


Introduction
Wheat (Triticum aestivum L.) is the most grown food crop in the world and considered the most important source of calories for humans [1]. Fusarium head blight (FHB), caused by Fusarium graminearum Schwabe, is a destructive fungal disease of wheat that threatens global wheat production and food security. It can cause a significant loss of grain yield [2,3] while also affecting grain quality due to the accumulation of mycotoxins, potentially making it unsafe for human and animal consumption [4]. No fungicide application, tillage system or crop rotation technique can completely eradicate FHB in wheat [5]. Fungicides can reduce FHB damage by as much as 40-70% [6,7] if application is within a few days following anthesis but the high cost of this additional input emphasizes the importance of developing wheat varieties resistant to FHB.
The success of a breeding program in improving FHB resistance depends on the availability of resistant germplasm, enough genetic variation, effective methods to evaluate resistance and select improved individuals [8]. While phenotyping for FHB resistance still requires considerable resources, advancement in sequencing technologies has significantly reduced the cost of genotyping [9]. Minimizing the cost of phenotyping for FHB resistance would ensure effective allocation of resources in a breeding program and genomic selection offers the potential to limit the need for phenotyping while also allowing breeders to broaden the pool for selection [10,11]. Genomic selection uses information from a training population that has been both genotyped and phenotyped to train a model that is applied to a breeding population of non-phenotyped individuals to predict their genotypic values that can be used in place of phenotypic values. Several studies have shown that genomic selection is a promising breeding strategy to improve FHB resistance [12][13][14][15].
The first step in the implementation of genomic selection in a breeding program is the establishment of a training population [16], which is the set of lines that are phenotyped and used to estimate marker effects. Since 2017, the hard-red spring wheat breeding program at the University of Minnesota has implemented genomic selection at the F5 (pre-yield trial) stage to select for FHB resistance. The goal at this stage is to identify and eliminate highly susceptible lines before entering them in yield trials. Any line advanced to the preliminary yield trial stage or later will continue to undergo phenotypic assessment of FHB resistance in misted inoculated nurseries at two distinct Minnesota locations on an annual basis. The training population in the 2017 growing season was a subset of 500 F5 lines selected by genomic and pedigree information (more details below). After 2017, our goal was to design an optimized training population (TP) that minimized the cost of phenotyping for FHB resistance while maximizing prediction accuracy for the untested lines. While simulation and empirical results have shown improved prediction accuracies when TP size is increased [17][18][19][20][21], other studies have achieved high accuracies with smaller TP sizes that were more closely related to the breeding population and lower accuracies as TP became more unrelated to the breeding population [22][23][24][25]. For example, Rincent et al. [22] found that an optimized set of 100 lines achieved the same prediction accuracy as a set of 200 lines selected at random.
Isidro et al. [26] introduced a TP optimization technique based on stratified sampling that allows the optimization of the TP based on genomic relationships, resulting in higher prediction accuracies in structured populations. This procedure maximizes relationships between the training and breeding population and allows for a TP to be designed around a breeding (target) population. The k-means clustering algorithm is a form of cluster analysis that partitions datapoints into groups to reflect underlying similarities. The assignment of an individual into a cluster is determined by its Euclidean distance relative to the mean of the cluster. Details on how the algorithm works are deferred to specialized texts [27]. However, the main aim is to reduce variability within clusters while increasing variability among clusters. This method is particularly useful when population structure is present as it tends to group lines into different subpopulations as stratified sampling has been shown to improve the efficiency of sampling compared to a simple random sample without stratification [28,29].
The objectives of this study are to determine if: 1) a small sized TP selected by stratified sampling will improve predictive abilities compared to a same size TP selected without stratification; 2) the addition of parents to the TP will improve the predictive abilities. In addition, we also evaluated the effects of multiple replications on predictive abilities.

Population Development
Each year the University of Minnesota's wheat breeding program evaluates between 1500-3000 F5 lines, resulting from approximately 170 unique crosses. These two-way or three-way crosses are mostly elite × elite but it is not uncommon to make crosses with un-adapted parents for some special trait improvement. The F5 lines are selected from F4 head rows based on height, lodging, leaf and stem rust resistance but have never been screened for FHB resistance. Usually, seven heads (~30 seeds per head) are harvested from F4 headrows and are bulk threshed, except in 2017, in which the seven heads were threshed individually. Seeds from these heads were used to plant short rows (1.5 m) for phenotypic evaluation of disease resistance and agronomic traits in the F5 plants. A single seed was used to plant an F5 plant in the greenhouse, which was genotyped and seed from this plant was sent to our winter nursery for the F6 generation. In 2017, a total of 1550 lines, selected from the 2016 F4 head row nurseries, were advanced to the F5 stage of the breeding program. Out of these, 500 lines were selected to serve as a genomic selection training population (TP17) and used to predict the remaining untested lines in the population (Table 1). These 500 individuals were selected based on their pedigree so that the wider genetic diversity of the breeding population was captured in the TP. In the 2018 growing season, the F5 population consisted of 2590 lines selected from the 2017 F4 head row nursery (Table 1). However, instead of selecting 500 lines to serve as TP, we selected an optimized set of 200 lines based on stratified sampling with the k-means clustering method (details below). The selection process used a k-means clustering analysis performed on all 2590 F5 lines so that the lines were divided into three clusters with each cluster containing lines that are more closely related to each other. To determine the value of k (i.e., the number of clusters), we tried different values for k until we were satisfied with the cluster separation and we chose that value as the optimum number of k, which in this case was three. After grouping into different clusters, a simple random sample was performed within each cluster.
Since we wanted a total of 200 F5 lines in the TP18 selected by stratified sampling, few lines were sampled within each cluster to capture the entire genetic space of the cluster. The number of lines selected from each cluster was proportional to the size of the cluster in the entire F5 population. We sampled 95 lines from cluster-one, 46 lines from cluster-two and 59 lines from cluster-three to make a total of 200 lines selected by genomic relationship. The lines sampled from each cluster served as the TP for the remaining untested lines in that cluster. To serve as both buffer and basis of comparison, a different set of 300 lines were also selected from the remaining F5 population. These lines were first selected so that each pedigree is sampled at least once and then more lines from each pedigree were added randomly in proportion to the family size. This method allows for the sampling of lines across the entire genetic distribution of the F5 population. In a structured population, random sampling of lines across subpopulations to make up the TP will reduce the similarity across lines in the TP. Lastly, 45 parental lines also were evaluated to make a total of 545 phenotyped lines.
The training population for the 2019 growing season (TP19) was selected similar to TP18 and consisted of 544 lines (Table 1). In summary, TP17 was sampled to maximize genetic diversity without stratification while TP18 and TP19 were first divided into clusters to maximize relationships within clusters and sampling was done within each cluster. Also, TP18 and TP19 had parental lines evaluated in the field while parental lines were not evaluated for TP17.
The 500 TP17 lines were grown as a single replication in misted, inoculated FHB nurseries in St. Paul and Crookston, Minnesota in 2017. For TP18 and TP19, the 200 lines selected by stratified sampling, and the parents were tested in two replications while the 300 lines sampled randomly without stratification were tested in one replication in the same FHB nurseries. Unfortunately, the irrigation system in the Crookston 2018 nursery failed, resulting in biased data, thus, the data from this location was omitted and only the data from St. Paul was used for further analysis in 2018. In all FHB nurseries five checks (Alsen (MR, [30], MN00269 (S), Roblin (S), Rollag (MR, [31]), and Wheaton (S, [32])) were planted for every one hundred plots. Procedures for misting, inoculation and data collection for the FHB traits disease index (DIS), visually scabby kernels (VSK) and micro-test weight (MicroTwt) were as described by Fuentes-Granados et al. [33]. For each plot, the number of infected spikelets per head was counted on each of 20 randomly selected main heads. FHB severity was calculated as the average percentage of infected spikelets per head for each plot. FHB incidence was calculated as the proportion of heads that had any visible infection out of the 20 heads used for FHB severity assessment. DIS was calculated as follows: To rate VSK, approximately 200-300 kernels were assessed visually to determine the percentage of kernels that showed signs of infection [34]. For MicroTwt, cleaned seed samples were poured into a 15.7 mL copper vessel (20 mm in diameter and 50 mm in height). A ruler was used to level the sample at the top of the vessel and then the sample was weighed.

Training Population Comparison
We assessed trait predictive abilities (correlation between the observed phenotypic values and the predicted genotypic values) in an optimized TP selected by stratified sampling compared with the same size TP selected without stratification. We also compared the two methods with training a prediction model on the entire TP (entire TP is a total of 500 F5 lines-200 selected by stratified sampling and 300 selected by pedigree information). For the set of 200 lines selected by stratified sampling in TP18 and TP19, cross-validation was performed within each cluster and the average predictive ability across clusters was used to assess this method of TP selection. For the 300 lines sampled by pedigree information and without stratification, we further sampled 200 lines randomly to remove confounding effects based on TP size. The process was repeated 100 times and a crossvalidation performed each time. The average predictive ability across 100 cross-validations was used to assess this method of TP selection. Lastly, we performed a cross-validation within the entire F5 TP (i.e., 500 F5 lines) and compared the predictive ability to the smaller sized TPs (Figure 1). To determine the effect of adding parents to the TP on the predictive abilities, we included parental lines with the F5 lines in each cluster. For example, there are 95 F5 lines in cluster one of TP18, and the addition of 45 parents increased the TP size to 140 lines and the addition of 45 parents to cluster two increased the TP size to 91 lines and so on. In the same vein, cross-validation was performed in each cluster and the average predictive abilities across the clusters were used to assess the effect of parental addition on the F5 lines in the TPs selected by stratified sampling. Furthermore, after randomly selecting 200 F5 lines from the 300 selected by pedigree information and without stratification, we added the parental lines to make a total of 245 lines in TP18 and 244 lines in TP19. The process was repeated 100 times with cross-validation performed each time. The average predictive abilities across the 100 cross-validations was used to assess how well this method performed. Finally, we added parental lines to the entire F5 TP so that there were 545 lines in TP18 and 544 in TP19. We compared the predictive ability of this larger sized TP with the smaller ones.
Since we have implemented a stratified sampling method for TP18 and TP19, we decided to also divide TP17 into clusters using the k-means clustering algorithm and then sample a few lines from each cluster to make a total of 200 F5 lines selected by stratified sampling. The performance of this selection method was assessed as the average predictive ability across clusters. From the remaining 300 lines, 200 were sampled randomly without stratification and a cross-validation was repeated 100 times. We assessed this selection method as the average predictive ability across the 100 crossvalidations. Again, we compared the performance of the small-sized TP by using the entire set of 500 F5 lines as the TP.
To assess the effect of replications, we used the lines selected by stratified sampling combined with parental lines in TP18 and TP19, since both sets of lines were tested in two replications. For comparison, we used data from the first replication (Rep1) as the second replication will not have existed if the study was not replicated. Thus, we compared the predictive abilities when phenotypic data from Rep1 and the mean across both replications were used to train the model. To assess significant differences in accuracies, we applied a paired t-test after Fisher's Z transformation.

Phenotypic Analysis
Since the 2018 cohort had data from just a single location, the 200 lines selected by genomic relationship and the 45 parental lines were averaged across two replications. The observed data for this single location was modeled as: where yik is the kth observation of the ith genotype, µ is the intercept, gi is the effect for the ith genotype, and ɛik is the plot error effect for yik. Assuming gi, and ɛik are random and independent with a mean of zero and variances of σ 2 g, and σ 2 ɛ, the phenotypic variance can be calculated as: where σ 2 g is the genetic variance, σ 2 ɛ is the plot error variance, and nr is the number of replications. For 2017 and 2019 cohort with multiple locations, the observed data was modeled as: where yijk is the kth observation of the ith genotype at the jth location, µ is the intercept, gi is the main effect for the ith genotype, lj is the main effect for the jth location, (gl)ij is the ith genotype-by-location interaction effect, and ɛijk is the plot error effect for yijk. Phenotypic values for each genotype was then estimated as: Assuming gi, lj, (gl)ij, and ɛijk are random, independent, each has a mean of zero and variances of σ 2 g, σ 2 l, σ 2 gl, and σ 2 ɛ respectively, phenotypic variance was calculated as: σ 2 P = σ 2 g + σ 2 gl/nl + σ 2 ɛ/nlnr (6) where σ 2 g is the genetic variance, σ 2 gl is the variance for genotype × location interaction, σ 2 ɛ is the plot error variance, nl is the number of locations, and nr is the number of replications. For the 2019 population, the 200 lines and 46 parents were first averaged across replications and then the above linear model was used to calculate the best linear unbiased estimates (BLUEs) of each line. The emmeans package version 1.4.2 [35] in R was used to estimate marginal means (BLUEs) for each line. To estimate phenotypic variances, the number of replications was set to one in both 2018 and 2019 cohorts to remain conservative.
Broad-sense heritability for TP18 with one location was calculated as: While the broad-sense heritability of TP17 and TP19 with two locations was calculated as:

Genotyping
The populations were genotyped using the genotyping by sequencing method [36]. Reads were aligned to the Triticum aestivum IWGSC RefSeq v1.0 [37] using bwa [38], and samtools and bcftools [39] for single nucleotide polymorphism (SNP) calling. This resulted in 7102 markers for the 2017 set, 4934 markers for the 2018 set, and 3046 markers for the 2019 set. SNP markers with minor allele frequency of <5% and >20% missing data were discarded. Missing markers were imputed with the linkage disequilibrium k-nearest neighbor (LD-kNNi) genotype imputation algorithm [40] implemented in TASSEL [41] with default parameters.

Stratified Sampling with k-means Clustering Algorithm
Euclidean distances estimated from SNP markers in each population were used to generate partitions that represent the underlying groups present in the population, based on the specified number of clusters. After the clustering process was performed on the entire F5 population (i.e., about 1500-3000 lines), a principal component analysis was carried out using the genotypic data. The first two principal components for each line with its corresponding cluster assignment were used to visualize the cluster partitioning of the data (Figure 2a-c). The cluster analysis is based on Euclidean distances derived from genotypic data so lines that fall within a cluster were expected to be more genetically similar to each other versus lines in a different cluster. Since we are trying to exploit the power of genetic relationships and its effects on prediction accuracy, individuals selected from a cluster will serve as the training population for the remaining untested individuals in that particular cluster. In addition, the number of individuals that represent a cluster as the training set depends on the number of lines in that cluster, i.e., clusters with more individuals will have more lines selected as the training set and vice-versa. This training population selection method tends to define a training set based on its relationship to the breeding set (untested individuals) and not the other way around.

Linkage Disequilibrium (LD) Analysis
We evaluated the LD between pairs of markers for the entire TP and for each cluster. The extent of LD in each TP was indicated as r 2 which is the squared allele frequency correlation between pairs of markers with known genomic positions. Generally, the larger the r 2 value, the larger the degree of association among loci in the genome. This estimation was performed in TASSEL v5.2.60 [41].

Genomic Selection Model
Three genomic selection models, Bayesian LASSO, Reproducing Kernel Hilbert Space (RKHS), and Ridge-Regression Best-Linear Unbiased Prediction (RR-BLUP), were used for model training. However, RR-BLUP performed similarly or slightly better than the two other models and had the fastest computational time; therefore, only results from RR-BLUP are presented in this study.
An RR-BLUP model was used for all genomic predictions with models trained as follows: u ~ MVN (0, Iσ 2 u) ɛ ~ MVN (0, Iσ 2 ɛ) (11) where y is a vector of BLUEs of the lines in the training population; 1 is an n × 1 vector with elements equal to 1; µ is the grand mean; Z is an n × p design matrix with elements of 1 or -1 (where -1 represents the minor allele); u is a p × 1 vectors of marker effects; ɛ is an n × 1 vector of residuals; σ 2 u is the variance of marker effects; I is an identity matrix; and σ 2 ɛ is the error variance. σ 2 u and σ 2 ɛ were estimated using restricted maximum likelihood (REML). The RR-BLUP model was implemented with the rrBLUP package version 4.6 [42] in R. Predictive abilities were calculated using a five-fold cross validation method with 500 iterations within the rrBLUP package. All analyses were performed using R version 3.5.2.

Accuracy of Genomic Selection for Line Advancement
Prior to implementing genomic selection, approximately 25% of the most susceptible F5 lines were discarded each year based on phenotypic evaluation. We assessed the effectiveness of genomic prediction in discarding susceptible lines and selecting moderately resistant lines. The likelihood of making a correct decision was calculated as the proportion of observed susceptible lines that the model predicted to be susceptible (true positives) and the proportion of lines observed to be moderately resistant and also predicted to be moderately resistant (true positives) expressed as a percentage of the entire cohort as shown below: Correct decision index (CDI) = Number of true positives + Number of true negatives Total number of lines x 100 The higher the correct decision index, the more likely it is to advance moderately resistant lines and discard susceptible lines and vice-versa.

Principal Component and Cluster Analyses for F5 Populations
All 2590 lines in the 2018 F5 population were divided into three clusters (Figure 2b). Cluster one had the highest number of lines with 1122, while cluster two had the lowest number with 679 F5 lines. Principal component (PC) analysis showed that the first two components accounted for 7.2% and 5.8% of genetic variance in the TP18 dataset. After stratification, 200 (8.5%) lines were selected to capture the genetic diversity of each cluster. Ninety-five lines were selected across cluster one, 46 across cluster two and 59 across cluster three. In 2019, the 2715 F5 lines also were divided into three clusters (Figure 2c). The cluster sizes ranged from 314 F5 lines in cluster three to 1406 F5 lines in cluster one with 90, 65, and 45 lines selected from cluster one, two and three, respectively. PC analysis revealed that the first component explained 9.3% of variation while the second component explained 7.4% of variation in the TP19 dataset. To evaluate the TP sampling with and without stratification on the 2017 set, the k-means clustering algorithm divided the 500 F5 lines into four clusters (Figure 2a). PC analysis showed that about 6.7% and 5.5% of variance were explained by the first and second components respectively, the lowest among the three populations. The size of the clusters ranged from 73 lines in cluster four to 201 lines in cluster one. From each cluster, 40% of the lines were selected to achieve a total population size of 200 and ranged from 30 lines in cluster four to 80 lines in cluster one.

Trait Correlations, Phenotypic Variances, Heritabilities and G × E Interaction
Heritability estimates were significantly different from zero for all sets and traits. The mean heritabilities across years were 0.39 for DIS, 0.40 for VSK, and 0.53 for MicroTwt (Table 2). Genotype by environment interaction was significant for all traits in TP17 and TP19 except MicroTwt weight in TP19. Across the 2018 and 2019 cohorts, 18 parental lines were common in both populations and their differential performance in the three environments (St. Paul 2018, St. Paul 2019, and Crookston 2019) are shown in Figure 3. Trait distribution for DIS, VSK and MicroTwt in the three TPs are presented in Figure 4.
FHB traits were significantly correlated in each population (Table 3, p < 0.05). As expected, MicroTwt had a negative relationship with DIS and VSK with a stronger association to VSK than DIS. Correlations between DIS and VSK were highest in TP17 at r = 0.57 and lowest in TP18 at 0.40 while the correlation between DIS and MicroTwt ranged from -0.47 in TP17 to -0.27 in TP18. Lastly, the correlation between VSK and MicroTwt weight ranged from -0.75 in TP18 to -0.61 in TP17.   TP17 ‡ TP18 TP19 TP17  TP18  TP19

Linkage Disequilibrium
As expected, genome-wide pairwise LD (r 2 ) was highest in the 2017 set that had the largest marker density (7102) with genome-wide r 2 values ranging from 0.23 in the entire TP17 to 0.28 in TP17 selected by stratified sampling. Lowest LD estimates were observed in TP19 as it had the least number of markers (3046) with genome-wide r 2 values ranging from 0.06 in the entire TP19 to 0.14 in TP19 lines selected by stratified sampling (Table 4). Across all the three populations, r 2 values were the lowest in the entire TP and the highest in sets selected by stratified sampling (Table 4).

Training Population Comparison
For TP17, predictive ability was the lowest for VSK and MicroTwt at 0.34 and highest for DIS at 0.45 when all 500 lines were used (Table 5). We further divided the 500 F5 lines into clusters to test the ability of TPs selected with and without stratified sampling. The average predictive abilities of TP17 designed by stratified sampling was lowest for MicroTwt at 0.29 and highest for VSK at 0.32. This range was lower than what we observed when the entire TP17 of 500 F5 lines was used to train the prediction model. However, we observed even lower predictive abilities (0.005 for DIS, 0.03 for VSK, and 0.02 for MicroTwt) when the same size TP17 was designed by random sampling without stratification with prediction abilities that were not significantly different from zero (Table 5).
Average predictive abilities for TP18 designed by stratified sampling ranged from 0.09 for DIS to 0.49 for MicroTwt while predictive abilities for TP18 selected randomly without stratification were not significantly different from zero for all traits. When the entire TP18 was used, predictive ability was the lowest for DIS at 0.30 while MicroTwt was the highest at 0.34. The predictive ability for DIS in the entire TP18 (0.30) was significantly higher (p < 0.05) than that of small size TP18 (0.09) selected by stratified sampling. We observed poorer predictive abilities for TP19 compared to TP17 and TP18. Average predictive ability in TP19 designed by stratified sampling was lowest for MicroTwt at 0.01 and highest for DIS at 0.10 while predictive abilities for TP19 selected randomly without stratification were not significantly different from zero for the three traits. Finally, when the entire TP19 was used, predictive abilities were also not significantly different from zero for all traits (Table 5).

Adding Parents to the 2018 and 2019 Training Populations
Eighteen parental lines were common to both the 2018 and 2019 F5 populations. When parents of the 2018 F5 lines were added to TP18 designed by the stratified sampling method, the average predictive abilities ranged from 0.25 for DIS to 0.53 for MicroTwt, a significant improvement (p < 0.05) in predictive ability of up to 16 percentage points for DIS, the trait that had the lowest predictive ability without parents in the TP. Although we also observed an increase in predictive ability with the addition of parents to the TP for the other two traits, the increments were rather small and non-significant (i.e., one and four percentage points for VSK and MicroTwt, respectively). While the addition of F5 parents to the lines selected randomly without stratification resulted in improved predictive abilities for DIS, VSK and MicroTwt by 13, 11 and 3 percentage points respectively, those increments were only significant for DIS (p < 0.05). When the parental lines were added to the entire TP18, small, non-significant improvements in predictive ability were observed ( Table 5).
The addition of 2019 F5 parents to the TP19 designed by stratified sampling led to a significant (p < 0.05) increase in predictive ability for DIS with a 34 percentage point gain. Although nonsignificant, we also observed an eight percentage point increase in predictive ability for VSK and a four percentage point increase for MicroTwt. For the TP19 designed without stratification, we observed an 11 percentage point improvement in predictive ability for DIS, 2 for VSK and 5 for MicroTwt. However, none of those increases were statistically significant. When parental lines were added to the entire TP19, the only significant increase (p < 0.05) in predictive ability was observed for DIS (15 percentage points), while VSK and MicroTwt had a non-significant 4 percentage point increase and 6 percentage point decrease, respectively (Table 5). Table 5. Predictive abilities of F5 lines, with and without parents, selected with different TP design (Stratified Sampling, Non-Stratified Sampling, and Entire Training Population) for FHB disease index (DIS), visual scabby kernels (VSK) and micro-test weight (MicroTwt).

Trait
Predictive

Single Replication vs. Multiple Replications
The effect of replication in our FHB disease nurseries was assessed with the combined set of F5 parents and lines selected by stratified sampling in TP18 and TP19. In the single location (St. Paul) for TP18, predictive abilities in Rep1 was highest for MicroTwt with 0.41 and lowest for DIS with 0.11. When the predictive abilities of phenotypic data averaged across replications was compared to Rep1, we observed significant (p < 0.05) increases up to 14 percentage points in predictive ability for DIS, 28 percentage points for VSK and 12 percentage points for MicroTwt (Table 6).
For TP19, predictive abilities when the phenotypic data from Rep 1 was used ranged from 0.06 for DIS to 0.11 for MicroTwt in St. Paul and from 0.08 for MicroTwt to 0.19 for DIS in Crookston. In St. Paul, when the predictive abilities of Rep1 were compared with the average across both replications, we found that the use of phenotypic data averaged across the two replications led to a significant 36 percentage point increase (p < 0.05) for DIS and non-significant 8 percentage point reduction for VSK and 8 percentage point increase for MicroTwt. In Crookston, when the predictive abilities of phenotypic data averaged across replications were compared to the Rep1, we observed a significant 16 percentage points increase (p < 0.05) in predictive abilities for DIS and no significant difference for both VSK and MicroTwt (Table 6).

Accuracy of Genomic Selection for Line Advancement
As expected, the ability to make correct selection decisions increases with increasing predictive ability. TP17 was highest for correct decision index with values ranging from 81% for VSK to 86% for DIS. The percentage of lines that were rightly discarded were between 52% for VSK and 72% for DIS. In TP18, the correct decision index was lowest for DIS at 73% and highest for MicroTwt at 80% with about 47% (DIS) to 59% (MicroTwt) lines correctly discarded. TP19 that had the lowest predictive ability also had the lowest correct decision index with values varying between 56% for VSK and 68% for DIS. The number of lines correctly discarded ranged from 30% for VSK to 36% for DIS and MicroTwt ( Figure 5). The red dashed line represents the cutoff limit to discard genotypes based on the observed field data so that every genotype to the right side of the red dashed line would be discarded for traits in which lower values are desirable (i.e., DIS and VSK) while the genotypes to the left of the red dashed line are discarded for MicroTwt. The green dashed line is the cutoff limit based on genomic predicted values, every genotype above the green dashed line would be discarded for DIS and VSK while genotypes below the green dashed line would be discarded for MicroTwt.

Small Population Size
Our results clearly showed that higher predictive abilities can be obtained when prediction models were trained on a small sized TP selected by stratified sampling compared to a TP of similar size selected without stratification. Except for the three traits in TP17 and DIS in TP18, the small sized population selected by stratified sampling also had higher predictive abilities than when the entire TP was used. Similar to our findings, several studies [23][24][25]43] have shown that a small TP with closely related lines had a greater predictive ability than a same-sized or larger TP with less-related lines.
With complete linkage between marker and QTL, increasing the size of the TP should increase the predictive abilities [44,45]; hence, using the entire TP should lead to a higher accuracy than a small optimized set. However, a large sized TP can have a lower predictive ability if the marker and QTL are not in complete linkage disequilibrium (LD). A smaller set of closely related lines should have increased precision in estimating additive genetic relationships that will increase predictive abilities [45]. A less-related TP (in our case, the small TP selected randomly without stratification and the entire TP) should have poorer marker and QTL associations and will require large TP size and marker density to have high accuracies [46]. This may explain why our lowest prediction accuracies were observed for the small TP selected randomly without stratification compared to when the entire TP was used. From theoretical expectations and field studies [17][18][19][20][21], we expected the predictive ability to be greatest when the entire TP was used; however, in this study, predictive abilities were only highest in TP17 when the entire TP was used. Comparatively, predictive ability for the entire TP was the highest in TP17 that also had the largest number of markers, and the lowest in TP19 that had less than half the marker density of TP17. The large marker density and relatively high r 2 values in TP17 suggests that genome-wide marker coverage adequately captured genomic relationships within the lines, leading to higher accuracies. Increasing marker density in TP19 would increase the probability that markers will be in LD with QTL, which might improve accuracies [46,47]. Furthermore, the low predictive ability observed in TP19 could also be due to the atypical environmental conditions in 2019 (cold season and late harvesting due to rains). Rainfalls late in the season caused seed bleaching that might have confounded the VSK ratings for TP19.
The greater predictive ability of a small optimized TP over the entire TP in 2018 and 2019 indicate the presence of population or family structure in the breeding germplasm. Since each cluster had closely related lines, the large DNA segments shared among these lines coupled with a higher accuracy to separate true signal from noise, due to the absence of less-related lines, allowed for a more precise estimate of realized genomic relationships that increases the power to accurately predict genotypic values, leading to higher predictive abilities.

Addition of Parents to TP
Across different populations, traits, and TP design methods, we found that the addition of parental lines to the TP led to improved accuracies for all but one trait in the TPs investigated. Several studies [48][49][50] also reported similar findings across different crop species. Interestingly, we observed the largest improvements in predictive abilities for DIS across the different TP design methods tested. While we are not sure why the addition of parents to DIS led to such large improvement in accuracies, a possible contributing factor for this might be that the level of inbreeding confounds our ability to accurately phenotype field-based traits like DIS. In our breeding program, parents are advanced breeding lines that are almost completely homogeneous (F7-derived) and homozygous. With such a high level of homogeneity and homozygosity, the parental lines are less likely to segregate within a row, thus ensuring more accurate disease scoring in the field. However, the F5 lines phenotypically evaluated are F3-derived, in theory; only ~94% homozygous and highly heterogenous. It is not uncommon to observe F5 lines segregating within a row for heading date or height in the field nurseries, the two traits that affect the amount and effectiveness of inoculum that reaches the spike and subsequently, the level of infection and DIS scoring. Hence, the addition of high quality DIS data of the parental lines likely augments the data for the F5 lines that are not well characterized; therefore, improving the prediction accuracy.
After correcting for bias due to heterosis, Liang et al. [49] observed a moderate increase in prediction accuracy when inbred parents were combined with their hybrid population compared to a scenario where data from inbred parents were excluded in pearl millet. Minamikawa et al. [50] studying Japanese pear, also showed that training a genomic prediction model on both parental and breeding populations led to a greater accuracy than when each population was used separately. Across eight growth and wood traits of eucalyptus, prediction accuracy increased in seven out of eight traits when all parental lines were included as part of the TP compared to a model trained on random lines [48]. The improvement in predictive ability when parental and F5 populations are combined as the TP over using the F5 lines alone indicates that the parental lines added more information for accurate estimation of genotypic effects.

Conclusions
In this study, we found that a small size TP of 200 F5 lines selected by stratified sampling had higher predictive abilities compared to the same-size TP selected randomly without stratification or using a larger TP of 500 F5 lines. We also found that combining both parental lines and the F5 lines as the TP for model training led to higher accuracies than when the F5 lines were used as the TP alone.
Our findings indicate that a training population consisting of a few hundred lines and their parents, tested in two replications, generate satisfactory predictions for the unphenotyped lines. This is especially important in traits such as FHB that are difficult and expensive to phenotype. In fact, genomic selection was implemented at the F5 stage of our breeding program to help reduce the amount of resources expended at this stage. Smaller breeding programs, particularly those with a limited capacity to operate FHB disease nurseries desiring to implement genomic selection can also benefit by developing TPs that are closely related to the target populations.  Acknowledgments: E.A. was supported by a Project Aggrad Fellowship. We thank Susan Reynolds for helping with field experiments and Ruth Dill-Macky for providing pathology support.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.