A Multi-Trait Gaussian Kernel Genomic Prediction Model under Three Tuning Strategies

While genomic selection (GS) began revolutionizing plant breeding when it was proposed around 20 years ago, its practical implementation is still challenging, as many factors affect its accuracy. One such factor is the choice of the statistical machine learning method. For this reason, we explore the tuning process under a multi-trait framework using the Gaussian kernel with a multi-trait Bayesian genomic best linear unbiased predictor (GBLUP) model. We explored three tuning methods (manual tuning, grid search and Bayesian optimization) using five real datasets from breeding programs. We found that grid search and Bayesian optimization improve prediction accuracy by between 1.9% and 6.8% relative to manual tuning. While the improvement in prediction accuracy can be marginal in some cases, it is very important to carry out the tuning process carefully to improve the accuracy of the GS methodology, even though this entails greater computational resources.


Introduction
Genomic selection (GS) is frequently used for genetic improvement and has many advantages over phenotype-based selection [1]. Nevertheless, breeders face a variety of challenges in improving the accuracy of the GS methodology, even with approaches such as multi-trait (MT) genomic prediction models, which take advantage of correlated traits to improve prediction accuracy [2] under multiple environments. Consequently, accurately predicting breeding values or phenotypic values is a challenge of primary interest in GS, as the goal is to increase genetic gain. When the traits of interest do not have a complex genetic architecture, this is usually simple to accomplish. However, for complex heritable traits, traits with complex genetic architecture (such as grain yield) and traits with strong epistatic effects, this goal has had limited success [3,4].
Reproducing Kernel Hilbert Spaces (RKHS) regression is a popular method in plant and animal breeding [5,6] for predicting complex traits and modeling complex interactions more efficiently. The central idea of RKHS regression is to project the original input data, available in a finite-dimensional space, onto an infinite-dimensional Hilbert space. After applying a kernel function, any statistical machine learning algorithm can be applied to the resulting transformed data. Empirical experience indicates that better results are generally achieved with the transformed input.

Dataset 1. Japonica
This dataset contains information on the phenotypic performance of four traits (GY = Grain Yield, PHR = Percentage of Head Rice Recovery, GC = Percentage of Chalky Grain, PH = Plant Height) of rice, as reported by Monteverde et al. [15], evaluated over the course of five years (2009, 2010, 2011, 2012 and 2013). The genotypes evaluated were 93, 292, 316, 316 and 134 lines for years 2009, 2010, 2011, 2012 and 2013, respectively, for a total of 1051 assessments over the five years. The dataset contains 54 environmental covariates but, in this application, these covariates were not included in the analysis. In total, 320 distinct genotypes were evaluated and, for each, 44,598 markers coded as 0, 1 and 2 remained after quality control. For more details about the data, see Monteverde et al. [15].

Dataset 2. Indica
This dataset contains information on the same traits as the Japonica dataset [15], with only three environments (years 2010, 2011 and 2012). In each year (environment), 327 genotypes were evaluated. Although this dataset contained environmental covariates, they were not used in this study. The total number of observations in this balanced dataset was 981, since each line was included once in each environment. The genotyping-by-sequencing (GBS) marker datasets were filtered to retain markers with less than 50% missing data after imputation and a minor allele frequency (MAF) > 0.05. After quality control, 92,430 SNPs remained for each line, coded as 0, 1 and 2, where 0 was used if the SNP was homozygous for the major allele, 1 if the SNP was heterozygous and 2 if the SNP was homozygous for the minor allele. For more details about the data, see Monteverde et al. [15].
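As an illustration of this kind of filter (a sketch, not the actual pipeline of Monteverde et al. [15]), the following R snippet applies the missing-data and MAF thresholds to a marker matrix M coded as 0, 1 and 2, with NA for missing calls; all object names are illustrative.

```r
# Hypothetical quality-control filter for a J x p marker matrix M (0/1/2 coding)
maf  <- colMeans(M, na.rm = TRUE) / 2      # allele frequency from 0/1/2 dosages
maf  <- pmin(maf, 1 - maf)                 # fold to the minor allele frequency
miss <- colMeans(is.na(M))                 # proportion of missing calls per marker
M_qc <- M[, maf > 0.05 & miss < 0.50]      # keep markers passing both thresholds
```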
Dataset 3. Groundnut
This dataset contained a total of 1272 observations and is balanced, since each genotype was included once in each environment. For each genotype, 8268 single nucleotide polymorphism (SNP) markers (coded as 0, 1 and 2) were available after quality control. For more details about the data, see Pandey et al. [16].

Dataset 4. Cotton
This dataset contains a total of 859 observations and is not balanced, since each genotype was not included in each environment. For each genotype, 5000 SNP markers (coded as 0, 1 and 2) were available after quality control. For more details about the data, see Gapare et al. [17].

Dataset 5. Disease
This dataset contains 438 wheat genotypes (lines) and three traits: PTR, which denotes Pyrenophora tritici-repentis; SN, which denotes Parastagonospora nodorum, a major fungal pathogen of wheat; and SB, which denotes Bipolaris sorokiniana, the cause of seedling diseases, common root rot and spot blotch in several crops such as barley and wheat. These 438 lines were evaluated in the greenhouse in six replicates over a period of time. The replicates were treated as environments (Env1, Env2, Env3, Env4, Env5 and Env6).
For the three traits evaluated, the total number of observations was 438 × 6 = 2628. DNA samples were genotyped using 67,436 SNPs. For each marker, the genotype of each line was coded as the number of copies of a designated marker-specific allele carried by the line (absence = zero, presence = one). SNP markers with unexpected heterozygous genotypes were recoded as either AA or BB. Markers with more than 15% missing values or with MAF < 0.05 were removed. A total of 11,617 SNPs remained available for analysis after quality control and imputation.

Multi-Trait Kernel Model
This model is given in Equation (1) as:

Y = 1_n µ^T + X_E β_E + Z_L g + Z_EL gE + ε,  (1)

where Y is the matrix of phenotypic response variables of order n × n_T, ordered first by environments and then by lines; n_T denotes the number of traits; 1_n is a vector of ones of length n; µ is a vector of intercepts (one per trait) of length n_T, where T denotes the transpose of a vector or matrix, that is, µ = [µ_1, ..., µ_{n_T}]^T; X_E is the design matrix of environments of order n × I, with I the number of environments; β_E is the matrix of coefficients for environments with a dimension of I × n_T; Z_L is the design matrix of lines of order n × J, with J the number of lines; g is the matrix of random effects of lines of order J × n_T, distributed as g ∼ MN_{J×n_T}(0, K, Σ_T), that is, with a matrix-variate normal distribution with parameters M = 0, U = K and V = Σ_T; K is the Gaussian kernel (GK), a J × J matrix built with marker data that mimics a covariance matrix capturing the degree of similarity between lines, analogous to the genomic relationship matrix (linear kernel) proposed by [18]; and Σ_T is the variance-covariance matrix of traits of order n_T × n_T. Z_EL is the design matrix of the genotype × environment interaction of order n × JI; gE is the matrix of genotype × environment interaction random effects, distributed as gE ∼ MN_{JI×n_T}(0, (Σ_E ⊗ J_J) ∘ (I_I ⊗ K), Σ_T), where Σ_E is a diagonal variance-covariance matrix of environments of order I × I, J_J is a J × J matrix of ones, I_I is the identity matrix of order I, ⊗ denotes the Kronecker product and ∘ denotes the Hadamard product. ε is the residual matrix of dimension n × n_T, distributed as ε ∼ MN_{n×n_T}(0, I_{IJ}, R), where I_{IJ} is the identity matrix of order IJ and R is the residual variance-covariance matrix of order n_T × n_T.
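A minimal sketch of how a model of this form can be fitted in R with the Multitrait() function of the BGLR package is shown below. The objects Y, XE, ZL and K are assumed to be prebuilt as defined above, the Hadamard construction of the G × E kernel is one common choice with a homogeneous interaction variance, and the chain settings are illustrative; this is a sketch under these assumptions, not the authors' exact script.

```r
library(BGLR)  # provides Multitrait() for multi-trait Bayesian models

K_lines <- ZL %*% K %*% t(ZL)           # expand the J x J kernel to the n observations
K_gxe   <- (XE %*% t(XE)) * K_lines     # Hadamard product: genotype x environment kernel

ETA <- list(
  Env = list(X = XE, model = "FIXED"),      # fixed environmental effects
  G   = list(K = K_lines, model = "RKHS"),  # main genetic effects with kernel K
  GE  = list(K = K_gxe, model = "RKHS")     # G x E term (simplified, homogeneous variance)
)

fit  <- Multitrait(y = Y, ETA = ETA, nIter = 10000, burnIn = 2000, verbose = FALSE)
Yhat <- fit$ETAHat   # posterior means of the predicted values for all traits
```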
The GK was computed using the GK function

GK(x_i, x_j) = exp(−γ ||x_i − x_j||²),

where x_i and x_j are the marker vectors for the ith and jth individuals (genotypes), respectively, and γ is the bandwidth hyperparameter [19,20]. It is necessary to point out that the GK function was reparametrized (Caamal-Pat et al. [1]) as

K(x_i, x_j) = ρ^(||x_i − x_j||²),

using the change of variable ρ = e^(−γ). The three strategies for tuning the bandwidth (γ) hyperparameter used in this implementation are listed as follows: (1) Manual tuning (no tuning, denoted as NT), setting the value of γ = 1, which is equivalent to setting ρ = e^(−1).
(2) Tuning using a grid search (GrS) strategy with 26 values of ρ between 0.01 and 0.999 in increments of 0.04; that is, 26 values of ρ were evaluated. The average of the normalized root mean square error of the predicted traits (NRMSE = (1/n_T) ∑_{t=1}^{n_T} NRMSE_t) was used as the metric for choosing the optimal ρ value in the inner-testing set.
(3) Tuning ρ using the Bayesian optimization (BO) method. The average NRMSE was also used as the metric to select the optimal ρ value in the inner-testing set.
The implementation of this model with the three strategies of tuning the ρ hyperparameter of the GK was carried out in the R statistical software [21,22].
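To make the three strategies concrete, the R sketch below builds the reparametrized kernel and runs the GrS and BO searches over ρ. It is a schematic under stated assumptions: gaussian_kernel, fit_and_score and the median scaling of the squared distances are illustrative placeholders rather than the authors' code, and the BO step uses the rBayesianOptimization package as one possible R backend.

```r
library(rBayesianOptimization)  # one possible BO implementation in R (assumption)

# Reparametrized Gaussian kernel K = rho^(d^2); X is the J x p marker matrix.
# Scaling the squared distances (here by their median) is a common convention;
# the exact normalization used by the authors is not shown in the text.
gaussian_kernel <- function(X, rho) {
  D2 <- as.matrix(dist(X))^2
  rho^(D2 / median(D2[upper.tri(D2)]))
}

# Hypothetical helper: builds K from rho, fits model (1) on the inner-training
# folds and returns the average NRMSE on the inner-testing set.
fit_and_score <- function(rho) {
  K <- gaussian_kernel(X, rho)
  mean_nrmse_inner_cv(K)  # placeholder for the five-fold inner CV described below
}

## Strategy (2): grid search over 26 candidate values of rho
rho_grid <- seq(0.01, 0.999, length.out = 26)  # 26 values in [0.01, 0.999]
scores   <- vapply(rho_grid, fit_and_score, numeric(1))
rho_grs  <- rho_grid[which.min(scores)]        # smallest average NRMSE wins

## Strategy (3): Bayesian optimization (Score is maximized, so NRMSE is negated)
bo <- BayesianOptimization(
  FUN = function(rho) list(Score = -fit_and_score(rho), Pred = 0),
  bounds = list(rho = c(0.01, 0.999)),
  init_points = 5, n_iter = 15, verbose = FALSE
)
rho_bo <- bo$Best_Par["rho"]
```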

Evaluation of Prediction Performance
In each of the five datasets, a seven-fold outer cross-validation (7FCV) was implemented (Montesinos-López et al. [19]). Six (7 − 1) folds were assigned to the outer-training set, while the remaining fold was assigned to the outer-testing set, until each of the seven folds had been tested once. For tuning the bandwidth hyperparameter of the Gaussian kernel, a five-fold nested cross-validation was used; that is, the outer-training set was divided into five groups, where four were used as the inner-training set (80% of the outer-training set) and one as the validation (inner-testing) set (20% of the outer-training set). The average over the five validation folds was reported as the metric of prediction performance used to select the optimal hyperparameter (bandwidth of the Gaussian kernel). Using this optimal hyperparameter, the multi-trait kernel model (1) was then refitted with the whole outer-training set (the 7 − 1 folds) and, finally, the predictions for each outer-testing set were obtained.
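The nested scheme described above could be organized as in the following sketch, where tune_rho stands for either the GrS or the BO inner loop and gaussian_kernel is the helper from the previous sketch; masking the outer-testing phenotypes with NA so that the Bayesian model predicts them is the usual BGLR workflow, but the helper names here are illustrative.

```r
set.seed(123)
n <- nrow(Y)
outer_folds <- sample(rep(1:7, length.out = n))      # assign observations to 7 outer folds
predictions <- matrix(NA_real_, nrow = n, ncol = ncol(Y))

for (k in 1:7) {
  test <- outer_folds == k
  # (1) tune rho with the 5-fold inner CV inside the outer-training set
  rho_opt <- tune_rho(Y[!test, , drop = FALSE], n_inner_folds = 5)  # placeholder
  # (2) refit on the whole outer-training set, masking outer-testing phenotypes
  Y_masked <- Y
  Y_masked[test, ] <- NA
  K <- gaussian_kernel(X, rho_opt)
  fit <- Multitrait(y = Y_masked,
                    ETA = list(list(K = ZL %*% K %*% t(ZL), model = "RKHS")),
                    nIter = 10000, burnIn = 2000, verbose = FALSE)
  # (3) collect predictions for the outer-testing fold
  predictions[test, ] <- fit$ETAHat[test, ]
}
```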
The prediction accuracy was reported in terms of the average normalized root mean square error of each trait, NRMSE_t = (1/7) ∑_{k=1}^{7} RMSE_{t,k} / ȳ_{t,k}, for t = 1, ..., n_T, where n_T is the number of predicted traits and RMSE_{t,k} and ȳ_{t,k} denote the root mean square error and the mean of the t-th trait in the k-th fold, respectively. In addition to the NRMSE of each trait, we also report the average NRMSE over all traits, NRMSE = (1/n_T) ∑_{t=1}^{n_T} NRMSE_t. These metrics were computed under the three strategies for tuning the bandwidth (γ) hyperparameter used in the implementation of the model, so that NRMSE_NT, NRMSE_GrS and NRMSE_BO denote the NRMSE of the no tuning (NT), grid search (GrS) and Bayesian optimization (BO) tuning strategies, respectively. The relative efficiencies were also reported; they were computed as RE_GrS = NRMSE_NT / NRMSE_GrS and RE_BO = NRMSE_NT / NRMSE_BO. When RE_GrS > 1 (RE_BO > 1), the best prediction performance in terms of NRMSE was obtained with the GrS (BO) strategy; when RE_GrS < 1 (RE_BO < 1), the NT strategy was superior in terms of prediction accuracy; and when RE_GrS = 1 (RE_BO = 1), both strategies of hyperparameter tuning were equally efficient. We also computed the relative efficiency in terms of NRMSE between the grid search (GrS) and Bayesian optimization (BO) strategies (RE_GrS/BO = NRMSE_GrS / NRMSE_BO), with an analogous interpretation.
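In code, these metrics and relative efficiencies reduce to a few lines; nrmse_nt, nrmse_grs and nrmse_bo below stand for the per-strategy average NRMSE values and are illustrative names, not objects from the authors' script.

```r
# Per-fold NRMSE of one trait: RMSE divided by the observed fold mean
rmse  <- function(y, yhat) sqrt(mean((y - yhat)^2))
nrmse <- function(y, yhat) rmse(y, yhat) / mean(y)

# NRMSE_t averages nrmse over the 7 outer folds; NRMSE averages NRMSE_t over traits.

# Relative efficiencies as defined above:
RE_GrS    <- nrmse_nt  / nrmse_grs   # > 1: GrS outperformed no tuning
RE_BO     <- nrmse_nt  / nrmse_bo    # > 1: BO outperformed no tuning
RE_GrS_BO <- nrmse_grs / nrmse_bo    # > 1: BO outperformed GrS
```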

Results
The results are provided in three sections for datasets 1 (Japonica), 2 (Indica) and 3 (Groundnut), respectively. The results from dataset 1 (Japonica) are given in Table A1, Figure 1A-D, and Appendix B Table A4. The results from dataset 2 (Indica) are in Table A2, Figure 2A-D, and Appendix B Table A5. The results from dataset 3 (Groundnut) are shown in Table A3, Figure 3A-D, and Appendix B Table A6.
The results from dataset 4 (Cotton) can be found in Supplementary Materials Tables S1 and S2 and Figure S1A-D, whereas the results from dataset 5 (Disease) are found in Supplementary Materials Tables S3 and S4 and Figure S2A-D.


Dataset 1 Japonica
The prediction performance for each environment and across environments (Global) for the Japonica dataset, in terms of normalized root mean squared error (NRMSE) and relative efficiency (RE), comparing the three methods of hyperparameter tuning (no tuning (NT), grid search (GrS) and Bayesian optimization (BO)) under the 7FCV strategy, is provided below. NRMSE_GC, NRMSE_GY, NRMSE_PH and NRMSE_PHR denote the NRMSE of traits GC, GY, PH and PHR, respectively.
As can be seen in Figure 1 and Table A1, there was one exception, for which the best NRMSE (0.827) was found under the GrS strategy. The standard error of prediction performance for each environment and across environments (Global) is provided in Appendix B Table A4.
Across traits, the prediction performance can be observed in Figure 1A (Table A1), where the best predictions (lower NRMSE) were observed under the BO and GrS strategies and the worst under the NT strategy. The RE of comparing the NT strategy versus the BO strategy for each environment and across environments (Figure 1B and Table A1) indicates that the BO method outperformed the NT strategy, and the RE of comparing the GrS method versus the BO method shows that the BO strategy was slightly better in terms of prediction performance than the GrS method, since the RE were slightly superior to one.
In Figure 1C (Table A1), the prediction performance is provided for each trait across environments, while Figure 1D provides the relative efficiencies comparing NT versus BO, NT versus GrS, and GrS versus BO; these show that in all traits, the best tuning strategies in terms of NRMSE were BO and GrS, with no relevant differences between them.

Dataset 2 Indica
NRMSE_GC, NRMSE_GY, NRMSE_PH and NRMSE_PHR denote the NRMSE of traits GC, GY, PH and PHR, respectively. Figure 2 and Table A2 show that, in terms of NRMSE for the GC trait, the best performance under the GrS strategy was in environment 2010. The standard error of prediction performance for each environment and across environments (Global) is provided in Appendix B Table A5.
As we summarize across traits for each environment, we can see in Figure 2A and Table A2 that the best predictions (lower NRMSE) were observed under the BO and GrS strategies and the worst under the NT strategy. We can also observe that the RE of comparing the NT strategy versus the BO strategy for each environment and across environments was 1.12 (2010), 1.056 (2011), 1.039 (2012) and 1.064 (Global) (Figure 2B and Table A2). This indicates that the BO method outperformed the NT strategy in terms of NRMSE in all environments, by 12% (2010), 5.6% (2011), 3.9% (2012) and 6.4% (Global). The RE of comparing the NT strategy versus the GrS strategy for each environment and across environments was 1.129 (2010), 1.061 (2011), 1.046 (2012) and 1.068 (Global) (Figure 2B and Table A2). This indicates that the GrS method outperformed the NT strategy in terms of NRMSE in all environments, by 12.9% (2010), 6.1% (2011), 4.6% (2012) and 6.8% (Global). Finally, the RE of comparing the GrS method versus the BO method was 0.991 (2010), 0.995 (2011), 0.993 (2012) and 0.996 (Global) (Figure 2B and Table A2). These results indicate that the BO strategy was slightly worse in terms of prediction performance than the GrS method, since in most cases the RE was less than one.
The prediction performance in terms of NRMSE of each trait across environments is given in Figure 2C and Table A2, and the relative efficiencies comparing NT versus BO, NT versus GrS, and GrS versus BO are given in Figure 2D and Table A2. In three out of four traits, the best tuning strategies were BO and GrS, while the worst was the NT strategy. It should be noted that there are no relevant differences between the BO and GrS methods.

Dataset 3 Groundnut
Here, NRMSE_NPP, NRMSE_PYPP, NRMSE_SYPP and NRMSE_YPH denote the NRMSE of traits NPP, PYPP, SYPP and YPH, respectively. As shown in Figure 3 and Table A3, in terms of NRMSE for the SYPP trait, the best performance under the GrS strategy was in environment ALIYARNAGAR_R15. The standard errors of prediction performance for each environment and across environments (Global) are provided in Appendix B Table A6.
The prediction performance in terms of NRMSE of each trait across environments is given in Figure 3C and Table A3, while the relative efficiencies comparing NT versus BO, NT versus GrS, and GrS versus BO are given in Figure 3D and Table A3. In all four traits evaluated, the best tuning strategies were BO and GrS, while the worst was the NT strategy. No relevant differences were observed between the BO and GrS methods.

Discussion
As a predictive methodology, GS can help increase genetic gain while saving significant resources, since candidate phenotypes do not need to be measured in the field, as they are predicted [23]. However, a number of factors still need to be improved for prediction performance to be optimized. One of these factors is the choice of the statistical machine learning method that will be used. In this regard, some statistical machine learning methods can only capture linear patterns, while others are able to capture non-linear patterns [19].
Kernel methods are very attractive since they are able to capture non-linear patterns and are very versatile, as they can be used with many statistical machine learning methods. For example, kernel methods can be applied in conventional mixed models, in support vector machines and even in many other machine learning methods, such as random forest and gradient boosting machines, by providing a modified input. In conventional mixed models, the use of kernels is quite straightforward, since the genomic relationship matrix is simply replaced by a particular kernel, enhancing the power of mixed models to capture non-linear patterns in the data [19]. However, many kernels, such as the Gaussian kernel implemented in this study, have hyperparameters that must be appropriately tuned to guarantee a successful implementation.
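As a single-trait illustration of this "replace the relationship matrix with a kernel" idea (a sketch with assumed objects M and y, and the gaussian_kernel helper from the earlier sketch, not the authors' code), the same BGLR RKHS term can be fitted with the linear kernel or with the Gaussian kernel:

```r
library(BGLR)

Z <- scale(M, center = TRUE, scale = TRUE)   # centered and scaled markers (J x p)
G <- tcrossprod(Z) / ncol(Z)                 # genomic relationship matrix (linear kernel)
K <- gaussian_kernel(M, rho = exp(-1))       # Gaussian kernel with rho = e^-1 (the NT value)

fit_linear <- BGLR(y = y, ETA = list(list(K = G, model = "RKHS")),
                   nIter = 10000, burnIn = 2000, verbose = FALSE)
fit_gk     <- BGLR(y = y, ETA = list(list(K = K, model = "RKHS")),
                   nIter = 10000, burnIn = 2000, verbose = FALSE)
```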
For this reason, in this study we evaluated the influence, in terms of prediction performance, of three tuning strategies (manual tuning, grid search and Bayesian optimization) under a multi-trait Bayesian GBLUP model. We found that grid search and Bayesian optimization improved prediction accuracy relative to manual tuning by 2.1% and 3.1%, respectively, in the Japonica dataset; by 6.4% and 6.8%, respectively, in the Indica dataset; by 4.4% and 4.6%, respectively, in the Groundnut dataset; by 1.9% and 2.1%, respectively, in the Cotton dataset; and by 2.3% and 2.7%, respectively, in the Disease dataset.
Regarding the time needed to implement the tuning methods, we found that the grid search method required around 15% more time than the Bayesian optimization method, and around 20 times more computational resources than manual tuning. The tuning process is therefore considerably more expensive in terms of computational resources. The grid search approach was only slightly more costly than Bayesian optimization because the grid contained 26 values; in general, the larger the number of values in the grid, the larger the computational resources required by the grid search.
We found differences in prediction performance between environments, and larger differences were observed when the environments represented years, since year-to-year variability is often substantial. For example, in dataset 1 (Japonica), we observed a higher prediction error in year 2009 than in the other years, which can be attributed mostly to the effect of years and less to the imbalance in the number of genotypes evaluated in each environment. In addition, relevant differences in prediction performance between environments were found in dataset 3 (Groundnut), where environments ALIYARNAGAR_R15 and ICRISAT_PR15-16 resulted in the worst prediction performance and environments ICRISAT_R15 and JALGOAN_R15 in the best. These results point out that, even within the same year, the location-to-location variability is considerably high.
In addition, for some datasets we found significant differences in prediction performance between traits. For example, in dataset 1 (Japonica), the best predictions were observed in traits GC and PH and the worst in traits GY and PHR, while in dataset 2 (Indica), the best predictions were observed in traits GY and PH and the worst in traits GC and PHR. In dataset 3 (Groundnut), the four traits showed similar prediction performance.
The improvement in prediction accuracy of grid search and Bayesian optimization over manual tuning is not dramatic; however, this is data dependent. By data dependent we mean that, if the dataset contains complex non-linear patterns, then using kernels with an appropriate implementation and tuning method will yield important improvements in prediction accuracy with respect to not using kernels; if the data contain only linear patterns, no improvement in prediction accuracy can be expected from non-linear kernels. We also need to be aware that we are working under a multi-trait framework, which makes it more difficult to select a bandwidth hyperparameter that works simultaneously for all traits under study. In the context of tuning the bandwidth of Gaussian kernels, a greater gain in prediction accuracy using grid search and Bayesian optimization was shown by Montesinos-López et al. [20].

Conclusions
In this study, we showed the importance of carefully tuning the Gaussian kernel to improve prediction accuracy. We found that prediction accuracy can be increased by between 1.9% and 6.8% by tuning the Gaussian kernel using Bayesian optimization or the grid search method, and we did not find any relevant differences between these two strategies. In general, the results indicated that modest gains in prediction accuracy were obtained for some datasets, while major improvements were achieved in others. We encourage researchers to dedicate sufficient time to the tuning process. It is also important to point out that the degree of improvement in prediction accuracy can be influenced by the metric used for evaluating prediction performance in the validation set, and for this reason our results are not conclusive. We encourage future benchmark studies to assess the influence of the metric used.

Acknowledgments:
The authors are thankful to the administrative staff, technical field support and laboratory assistants who established the different experiments in the field and in the laboratory at the different institutions that generated the data used in this study.

Conflicts of Interest:
The authors declare no conflict of interest.