Increasing Predictive Ability by Modeling Interactions between Environments, Genotype and Canopy Coverage Image Data for Soybeans

Phenomics is a new area that offers numerous opportunities for its applicability in plant breeding. One possibility is to exploit this type of information obtained from early stages of the growing season by combining it with genomic data. This opens an avenue that can be capitalized by improving the predictive ability of the common prediction models used for genomic prediction. Imagery (canopy coverage) data recorded between days 14–71 using two collection methods (ground information in 2013 and 2014; aerial information in 2014 and 2015) on a soybean nested association mapping population (SoyNAM) was used to calibrate the prediction models together with the inclusion of several types of interactions between canopy coverage data, environments, and genomic data. Three different scenarios were considered that breeders might face testing lines in fields: (i) incomplete field trials (CV2); (ii) newly developed lines (CV1); and (iii) predicting lines in unobserved environments (CV0). Two different traits were evaluated in this study: yield and days to maturity (DTM). Results showed improvements in the predictive ability for yield with respect to those models that solely included genomic data. These relative improvements ranged 27–123%, 27–148%, and 65–165% for CV2, CV1, and CV0, respectively. No major changes were observed for DTM. Similar improvements were observed for both traits when the reduced canopy information for days 14–33 was used to build the training-testing relationships, showing a clear advantage of using phenomics in very early stages of the growing season.


Introduction
To meet the needs of feeding an increasing world population [1] and to satisfy people's dietary needs, it is important to increase food production. Soybean is a major oil seed in the United States; more than 90% of the oil seed production comes from soybean programs [2]. Since soybeans have a low production cost, they have the potential to increase yield, and are produced mostly as part of a crop rotation with maize, where it is essential to increase soybean yield to reduce the per-bushel production cost. Furthermore, the harvestable land area cannot be significantly increased for soybean, so we must sustainably improve the yield potential of soybean to enhance food production, thus there is a need to develop methods that enable us to increase soybean yield. Nowadays, modern platforms can be used for monitoring large planted regions intensively in time and space to deliver accurate information about the specific (physiological) conditions of the plants, thereby allowing a better characterization of the response of these genotypes to specific stress stimuli. Thus, we can characterize genotypes using high dimensional marker information and high dimensional phenotypic information.
Genomic Prediction (GP) techniques have become an important part of plant breeding programs due to their advantages when compared to traditional phenotypic and pedigree-based selections [3]. GP is a technique that aids selection for yield and quality related traits, and it has been shown [4] that it has the potential to lead to a threefold increase in genetic gain when compared to marker assisted selection.
In GP, the marker and phenotypic information of individuals in the training set are used to model the relationship between the phenotype and genotype, then the model is used to predict the phenotype for individuals in the testing set for which only marker information is available. GP was first introduced by Meuwissen et al. [5], and since then an extended number of models have been developed for phenotype prediction incorporating marker information [6][7][8]. Early applications of GP for the selection of soybean varieties were introduced by Jarquin et al. [9].
The prediction accuracy of GP methods can be improved by including the genotype by environment (G × E) component by borrowing information from related materials and correlated environments. Jarquin et al. [10] utilized the reaction norm model for genomic prediction where the genetic and environmental values were replaced by the regression on the markers, and in the interaction between the markers and the environmental covariates, respectively. Dealing with high-throughput phenotypic information, several authors [11][12][13] have shown improvements in predictive ability with the inclusion of these sources of information in the models for wheat and maize. Montesinos-Lopez et al. [14] showed that accounting for the band (hyper-spectral image data)-by-environment interaction also improved yield predictability in wheat when compared with those models that did not include this component in the models.
Herein, we extended the reaction norm model for prediction using canopy coverage image data. Xavier et al. [15] incorporated canopy coverage image data into a selection scheme by studying the large effect QTL associated with canopy coverage. A QTL (Quantitative Trait Loci) is a particular region in the genome that is statistically associated with the variation of one or several traits. Since canopy coverage is a trait with high estimated heritability and correlation with grain yield [15], it has the potential to improve prediction models when this information is included into the model.
We used the soybean nested association mapping (soyNAM) population data to implement our prediction models. The soyNAM data used consisted of phenotypic yield data on 5600 F5-derived recombinant inbred lines, over 4000 single nucleotide polymorphism (SNP) markers, and ground-based imagery and/or aerial imagery data, depending on the year. This article is organized as follows. First, we provide a brief description of the soyNAM phenotypic and marker data that were used for the predictions. Then, we describe how the canopy coverage image data were collected, and why it has the potential to increase prediction accuracy compared to traditional GP models when included into the prediction models. In the next section, the statistical models and cross validation (CV) schemes implemented for genomic-enabled prediction are described. We compared nine prediction models with three different CV schemes for yield and days to maturity (DTM) for the soyNAM data set. Additionally, we compared the effects of predictive ability with models that included canopy data captured during the early stages of the growing season (days 14-33) instead of the whole growing season. Finally, we discuss the results, and some future research avenues.

SoyNAM Phenotypic and Genotypic Data
The predictions were conducted using a soybean nested association mapping population (SoyNAM) containing 5600 recombinant inbred lines (RILs) created by crossing a common high yielding parent (IA3023) to 40 parents. For a detailed description of the structure of the soyNAM population the reader can refer to Xavier et al. [16]. For the analysis, 39 families and 5143 RILs were kept due to uncertainty in the phenotypic and/or genotypic data.
Phenotypic data were collected on grain yield (in kg/ha), days to maturity (in days after planting), plant height (in cm), lodging (score 1-5), seed size (mass of 100 seed in grams), seed composition (in % in the grain) of protein, oil, and fiber content. For this study, however, the focus was on grain yield and days to maturity.
A total of 5303 single nucleotide (SNP) markers were used to design the SoyNAM 6K BreadChip array [17]. Markers with a minor allele frequency of less than 0.15 were removed, and missing markers were imputed using a random forest algorithm [18]. After quality control, the final set of markers was comprised of 4240 SNPs.
For more information about the soyNAM parents, RILs, the experimental design, the collected phenotypes and corresponding genotypes, the reader can refer to Xavier et al. [16].

Canopy Coverage and Imagery Data Collection
Grain yield in soybean is influenced by genetic and environmental factors and their interactions. Among the environmental factors, drought, salinity and temperature have the largest impact on the agronomical traits (i.e., yield) [19]. Many of these factors are hard to measure, therefore, even though they have a large influence on the trait, it is not cost effective to include them in prediction models. Canopy light interception (LI) is one of the important factors that influences yield and growth in soybean, and is difficult to measure [20]. Canopy coverage, which is the proportion of ground area covered by the soybean plant, is a trait that can be more easily measured, and can be used as a replacement for canopy LI measurements. High-throughput phenotyping platforms allow us to capture information about a trait in a non-destructive way. It also permits the collection of time-series measurements at a relatively low cost that is essential to follow the growth change of a plant [21], and aids in quantifying the genotype by environmental interaction.
The canopy data used for the analyses were obtained either as ground-based imagery (in 2013 and 2014) or as aerial imagery data (in 2014 and 2015) as described by Xavier et al. [15]. Xavier et al. noted that the correlation between the ground-based data and the aerial data was high (r 2 = 0.87) for 2014, thus in 2015 only aerial imagery data were collected, and in our case, we only used data from the aerial platform for 2014. Thus, our models were calibrated using ground data for 2013 and aerial data for 2014 and 2015. The data were collected at different time points at regular intervals ranging from two to eight weeks after planting to provide information on the different phases of growth and crop physiology. The collected data were classified to determine the canopy coverage, which was defined as the proportion of image pixels that were canopy pixels to the total number of pixels for any given field plot. The canopy coverage data of the ground-based imageries were determined using the software SigmaScan Pro ® (SYSTAT, San Jose, CA, USA), which implements the method described by Karcher and Richardson [22]. The canopy coverage data of aerial imageries were obtained by the software ENVI 5.0 TM (Harris Geospatial Solutions, Boulder, CO, USA) using a binomial model.

Relationship between Canopy Coverage Data and Grain Yield
In order to use canopy coverage data successfully to increase the predictive ability of the prediction models, it should show some type of relationship with the traits that are predicted. In this study, for each year (2013-2015)-by-acquisition method (ground and aerial) combination, we computed the correlation between the interpolated canopy values (days 14-71) and the observed grain yield values ( Figure 1). The four data sets showed different patterns; however, in all cases the correlation decreased as the number of days increased. The 2013-ground and 2014-ground sets reached the highest correlation (0.27 and 0.34) on days 33 and 32, respectively. For the aerial method, the highest correlations (0.33 and 0.44) were reached on days 45 and 41, respectively. To assess the usefulness of canopy coverage data in the early stages of the vegetative growth, the information from days 14-30 (for the four data sets) was included in the prediction models, and the results were compared with those obtained using canopy information from the whole season.

Statistical Prediction Models
In this study we evaluated nine prediction models: six main effects models that included combinations of line, environment, marker genotype, and canopy coverage image information; seven models with two-way interaction(s) among the components; and two models with a three-way interaction between environments, marker genotypes, and the canopy coverage data. The interaction components were modeled using the reaction norm model [10]. All models assumed that the components were random effects. For all of the models, we considered two responses: grain yield and DTM. Since the canopy coverage imagery data were either ground based or collected from the air, we incorporated two types of canopy coverage image data. In the statistical models described below, the term CC (Canopy Coverage) denotes the canopy coverage where the ground-based imagery data were included from 2013 and the aerial canopy coverage image data were included from 2014 and 2015.

Model 1: Environment + Line
The response of the j th genotype in the i th environment y ij can be written as: where µ is the overall mean, E i (i = 1, . . . , I) is the random effect of the i th environment assuming ) denoting a normal density, where iid stands for independent and identically distributed observations, and σ 2 E represents the variance component of environments; L j represents the random effect of the j th line (j = 1, . . . , J) assuming L j iid ∼ N 0, σ 2 L , where σ 2 L is the variance of the line; and e ij is the random error term where e ij iid ∼ N 0, σ 2 e with σ 2 e as the residual variance. All the models described below include the terms specified for Model 1 with additional components.

Model 2: Environment + Line + CC
The only difference between this model and Model 1 is that the CC component is also included. The model can be described as follows: where µ, E i , L j and e ij are defined as before, and CC ij is the canopy coverage imagery data. It is defined as CC ij = ∑ 70 k=14 P ijk c k where P ijk is the k th canopy measurement for genotype j in environment i, c k is the effect of the k th canopy measurements, and c k iid where P is the centered and standardized (per columns) canopy matrix. For later derivations we define ω = PP K as the canopy coverage covariance structure between pairs of lines × environment combinations. The entries of the matrix describe the canopy coverage similarities between pairs of lines × environment combinations. The number of total canopy measurements for a genotype in an environment after the extrapolation of missing daily measures was K = 57 = 70 − 13.

Model 3: Environment + Line + Marker
In this model the response of the j th genotype in the i th environment y ij can be modeled as a linear function of the environment effect, line effect, marker effect, and a random residual: where µ, E i , L j and e ij are defined as before, and g j is the linear combination of p markers and the corresponding marker effects such that g j = ∑ p m=1 x jm b m . The marker effects are assumed to be is the marker effect variance. If we write the genomic values in a vector form so that g = g 1 , . . . , g J , then the covariance matrix can be expressed in the form Cov(g) = Gσ 2 g , where G = XX p is the genomic relationship matrix [23], X is the centered and standardized genotype matrix, and σ 2 g = p × σ 2 b . In summary, we can write that g = g j ∼ N 0, Gσ 2 g .

Model 4: Environment + Line + Marker + CC
This model is the combination of Model 2 and Model 3 as both the marker and CC canopy component are added in addition to the environment and line components: where all of the terms are defined before (Models 2 and 3). This model is an extension of Model 2 where the interaction term between canopy and environments is added: where all of the terms except CCE ij are defined as before (for Model 2). CCE ij is the canopy × environment measurement interaction term with CCE = CCE ij ∼ N 0, (ω) where Z E is the incidence matrix for the environments, which connects the environments to the phenotypes, σ 2 CCE is the variance component for the interaction term, ω is defined previously, and ' • ' stands for the Hadamard or Schur (element-by-element or cell-by-cell) product between two matrices.

Model 6: Environment + Line + Marker + (Marker × Environment Interaction)
This model accounts for the environment, line, and genomic main effects as per Model 3, but it also incorporates the genotype × environment interaction via co-variance structures, as per the environment × canopy interaction in Model 5. In this case, the model can be written as where gE = Eg ij ∼ N 0, Z g GZ g • (Z E Z E )σ 2 gE , where Z g and Z E are the incidence matrices for the lines and environments, respectively, σ 2 gE is the variance component of the gE ij interaction component, and G is the additive relationship matrix defined previously. This model is an extension to Model 5 as it does not only account for the main effect of the environment, line, marker and the canopy coverage, but also includes the interaction between the marker and the canopy coverage: where all of the terms except gCC ij are defined as before (for Model 6). gCC ij is the marker × canopy measurement interaction term with gCC = gCC ij ∼ N 0, Z G GZ G • (ω)σ 2 gCC , where Z E is the incidence matrix for the genotypes, which connects the genotypes and the phenotypes, σ 2 gCC is the variance component for the genotype × canopy coverage interaction term.

Model 8: Environment + Line + Marker + CC + (Marker × Environment Interaction) + (CC × Environment Interaction)
This model is a combination of Models 6 and 7 as it accounts for the main effect of environment, line, marker, and canopy coverage, and the interactions between the marker and environment and between the environment and the canopy coverage: where all of the terms are the same as defined previously.

Three-Way Interaction Models
Model 9: Environment + Line + Marker + CC + (Marker × Environment Interaction) + (CC × Environment Interaction) + (Marker × CC × Environment Interaction) This three-way interaction model is an extension of Model 8 with the addition of the marker × canopy coverage × environment interaction: where the main effects and two-way interaction terms are defined the same as previously, where Z E , Z G , G, and ω are defined as before, and σ 2 gCCE is the variance component for the three-way interaction term.

Description of Cross-Validation Schemes Implemented for Assessing Predictive Ability
The performance of the models was compared based on the Pearson correlation coefficient between the predicted phenotypic values and the observed phenotypic values. Three cross validation techniques (CV2, CV1, and CV0) were implemented and compared when predicting yield and DTM. The cross-validation schemes mimicked real plant breeding situations.
CV2 is the case where some lines were observed in some environments but not in others, and we attempted to predict the performance of these unobserved line × environment combinations. This scheme mimicked the situation of predicting incomplete field trials. The graphical representation of CV2 is shown in Figure 2a. Figure 2a is a simplified representation of CV2 where we observed values for five lines in five environments, and the goal was to predict the phenotype of the lines that were not observed in a particular environment. Y ij represents the phenotypic value for the i th line in the j th environment in the training set where i = 1, . . . , 5 and j = 1, . . . , 5, and NA represents the lines for which the phenotype needed to be predicted (testing set).
CV1 refers to the case where the performance of lines was evaluated in some environments, and some different genotypes were predicted in the same environments. The graphical representation of CV1 is shown in Figure 2b. Here, the performance of a new developed line (line 3) needed to be predicted. CV0 is the cross-validation scheme where the performance of lines was evaluated in some environments, and the goal was to predict the performance of the already tested lines in untested environments. To determine the training and testing sets for the CV0 prediction, the leave-one-environment out scheme was implemented, and since there was no random partitioning involved in the CV0 method, it was only implemented once.
For both the CV1 and CV2 schemes, five-fold random partitions (80% of data was used as the training set, and the remaining 20% was used as the testing set) were used to determine the training and testing sets, and the partitioning was repeated 50 times.

Analysis of Variance Components
The estimated variance components for the nine models for yield are shown in Table 1, and for DTM in Table 2. For both traits, a large amount of the total variation was accounted by the environmental variation. However, the environmental variation was reduced when the marker and canopy information were included into the model. Furthermore, the residual variation was reduced when the interaction components were added in the models. For yield, the proportion of the variability explained by the residual term was reduced from 30.7% to 20.9%. The DTM also showed a reduction of the residual variance from 22.9% to 10.7%.  E = environment, L = line, G = genotype, GE = genotype × environment interaction, CC = canopy information, CCE = canopy × environment interaction, GCC = genotype × environment × canopy interaction, GCCE = genotype × canopy × environment interaction, R = residual term.

Assessment of Predictive Ability
In this study, nine models were evaluated in terms of prediction accuracy. The main objective was to identify whether incorporating canopy coverage data into the prediction models would increase the prediction accuracy, and whether the inclusion of interaction terms would further improve the models. Jarquin et al. [10] showed the advantage of including the genotype × environment interaction into the prediction models, and here we wanted to determine whether the inclusion of interaction terms among the environment, the genotype, and the canopy coverage measurement would enhance the prediction accuracy.
Out of the nine models, only three did not include any canopy coverage: Model 1: Environment + Line, Model 3: Environment + Line + Marker, and Model 6: Environment + Line + Marker + (Marker × Environment Interaction). We evaluated the models using SNP marker information, canopy coverage measurement, and phenotypic information collected on yield and DTM on the SoyNAM recurrent inbred lines. The phenotypic information and the canopy coverage measurements were collected in 2013, 2014, and 2015. All of the prediction accuracies were calculated implementing the CV2, CV1, and CV0 cross-validation schemes. Figures 3a and 4a show the results for the nine models using CV2 for yield and DTM, respectively. Similarly, Figures 3b and 4b present the results for CV1, and Figures 3c and 4c for CV0.   Figure 3. Average prediction accuracy obtained for the nine models for yield using the CV2 (a); CV1 (b); and CV0 (c) cross-validation schemes. The three different colors represent the years for which the prediction was carried out. E = environment, L = line, G = genotype, GE = genotype × environment interaction, CC = canopy information, CCE = canopy × environment interaction, GCC = genotype × environment × canopy interaction, GCCE = genotype × canopy × environment interaction. The performance of the CV1 scheme depended mostly on the genetic similarities between the training and testing sets while CV2 and CV0 also accounted for the environmental variations via replicates of the same genotypes observed in other environments.
When we considered yield as the trait to be predicted, for the CV1 and CV2 cross-validation schemes, models 7-9 performed better than the models that did not include the marker, canopy data and environmental interactions. In these cases, the correlation between the predicted and observed values was around 0.61 for 2013 and 2014, and 0.42 for 2015. These values were slightly increased when more terms were included in the model. The common feature among the models was that all of these models (7)(8)(9) included the genotypic information, the canopy coverage information, and at least one interaction component involving canopy coverage and marker data. For CV0, model 8 reached the same levels of predictive ability than the CV2 and CV1 schemes.
For all of the cross-validation techniques, most of the models had the lowest mean prediction accuracy when the prediction was carried out for 2015. In addition, similar patterns were shown when contrasting results for 2013 vs. 2014. In general, all of the cross-validation schemes showed significant variations among the models in terms of predictive ability. However, in all cases, predictive ability was improved when the interactions involving canopy data were included.
When we evaluated the models for DTM, and when CV2 and CV0 were implemented, we did not see a large variation among the models in terms of accuracy of prediction. The prediction accuracy of all models was within 0.55 and 0.65. For CV1, models 1, 2, and 5 had significantly lower prediction accuracy than the rest of the models. These were the only models that did not include genotype as a main effect term or an interaction with environments. For DTM, we could not identify a year that had the highest prediction accuracy across the cross-validation schemes. For all of the three cross-validation techniques, model 6-the Environment + Line + Marker + (Marker × Environment interaction)-performed the best, but for CV0 and CV2 the difference between model 6 and some of the other models was not significant.
When predicting yield, we clearly saw the advantage of including the canopy coverage measurements, and the interactions among markers, environment, and canopy coverage measurement. The highest predictive ability for CV2 and CV1 were delivered by model 9 (three-way interaction model) while for CV0, model 8 produced the highest values.
For DTM, the advantage of including these terms was not as evident as for yield, but including the canopy coverage measurement as a main effect and as part of interactions did not hurt the model performance when the markers were also present in the model.

Effectiveness of Canopy Data from Early Stages
As mentioned before, another objective of this study was to compare the usefulness of canopy data collected during the early stages (days 14-33) of the growing season with the whole data set (days 14-71). Figures 5 and 6 show the correlation between the predicted and observed values for the nine models using the reduced (days 14-33) and the whole canopy sets (days 14-71) for yield and DTM, respectively. For yield ( Figure 5), in schemes CV2 and CV1, the whole and reduced sets showed similar results. The whole set was always slightly better in terms of prediction accuracy than the reduced set. With CV0, in most of the cases, the reduced set provided better results than the whole set. For DTM (Figure 6), the whole and reduced sets performed the same for CV2 and CV1, while for CV0 there were a few cases where the reduced set slightly outperformed the results of the full set.
Since the results of the reduced set of canopy values measured between days 14 and 33 were similar to those obtained using the information from the whole set (days 14-71), we are confident about the applicability of this technique to improve the predictive ability of the genomic prediction models using information from the early stages of the growing season. An important practical implication of these results is that we could reach the same degree of predictive ability using canopy coverage data collected during the very early stages of the growing season instead of collecting the canopy information throughout the whole growing season.  Average prediction accuracy obtained for the nine models for DTM using the CV2 (a); CV1 (b); and CV0 (c) cross-validation schemes for the complete (days 14-71) and reduced (days 14-33) canopy sets. The three different colors represent the years for which the prediction was carried out.

Conclusions
In our study we evaluated nine prediction models, from which six included the canopy coverage information either as a main effect or as an interaction effect with the genotype and/or environment information. The models were compared in terms of prediction accuracy for three different cross validation schemes (CV2, CV1, and CV0) and for two different traits (yield and DTM). We compared the model performance in two situations; when all of available canopy image data were fitted, and when the canopy image data was used only for days 14-33 (20 days). Our results indicated no significant difference between model performances when all of the available canopy information was utilized versus when the canopy data for only 20 days of the growing season was included into the models. When we compared the models with the common genomic prediction model (Model 3) for yield we observed a 27-123% improvement for CV2, a 27-148% improvement for CV1, and a 65-165% improvement for CV0 depending on the model. For DTM the improvements were 2-8% for CV2, 3-8% for CV1, and 1-2% for CV0 depending on the model.
Since the ultimate goal of performing predictions of unobserved genotypes is to save time and resources developing new improved cultivars, planting those genotypes for collecting canopy data might not be helpful to shorten breeding cycles to avoid spending money and resources to phenotype these lines. However, as shown, the integration of canopy coverage data and marker data improved model performance. Thus, this might allow a more accurate selection of superior breeding lines as pointed out by Crain et al. [12]. From a more practical perspective, the use of the canopy coverage information taken during the early stages of the growing season could aid the selection process when no phenotypes for yield and/or DTM are available in the case where breeders prefer phenotypic selection. In this scenario, when an extreme hydro-climatic (hurricane, tornado, etc.) event occurs, it might partially or completely destroy the cultivars before the harvest season, leaving no chance to perform phenotypic selection. Thus, the use of canopy coverage data might allow us to recover valuable information about these destroyed cultivars, therefore improving the accuracy of selection. Some future work should attempt to investigate the use of canopy information to predict the performance of future generations. In this case, our proposed solution is to predict the canopy values of untested lines, and use these predicted values as covariates for predicting yield and/or DTM.