# A Comparison of Three Machine Learning Methods for Multivariate Genomic Prediction Using the Sparse Kernels Method (SKM) Library


## Abstract


## 1. Introduction

## 2. Materials and Methods

#### 2.1. Models

#### 2.1.1. Bayesian MT-GBLUP Model

#### 2.1.2. Random Forest (RF) Model

**Step 1.** From the training dataset, draw bootstrap samples of size ${N}_{train}$.

**Step 2**. With the bootstrapped data, grow a random forest tree (${T}_{b}$) with the specific splitting criterion (appropriate for each response variable) by recursively repeating the following steps for each terminal node of the tree until the minimum node size (minimum size of terminal nodes) is reached.

- Randomly draw $mtry$ out of the $p$ independent variables (IVs); $mtry$ is a user-specified parameter and must be less than or equal to $p$ (the total number of IVs);
- Select the best independent variable among the $mtry$ IVs.
- Split the node into two child nodes. The split ends when a stopping criterion is reached, for instance, when a node has less than a predetermined number of observations. No pruning is performed.

**Step 3**. The ensemble of trees is obtained as ${\left\{{T}_{b}\right\}}_{1}^{B}$.
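
The three steps above can be sketched in code. The following is an illustrative pure-Python implementation (ours, not the randomForestSRC implementation used later; the helper names `grow_tree`, `predict_tree` and `random_forest` are hypothetical) that, for brevity, uses depth-one regression trees:

```python
import random
from statistics import mean

def grow_tree(X, y, mtry, min_node_size=2):
    """Step 2 (simplified to a single node): draw mtry of the p IVs at random
    and pick the split that minimizes the within-child sum of squared errors."""
    p = len(X[0])
    if len(y) < min_node_size:
        return ("leaf", mean(y))
    best = None
    for j in random.sample(range(p), mtry):            # draw mtry of the p IVs
        for threshold in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[j] > threshold]
            if not left or not right:                  # stop: empty child
                continue
            sse = sum((yi - mean(left)) ** 2 for yi in left) + \
                  sum((yi - mean(right)) ** 2 for yi in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, mean(left), mean(right))
    if best is None:
        return ("leaf", mean(y))
    _, j, threshold, left_pred, right_pred = best
    return ("split", j, threshold, left_pred, right_pred)

def predict_tree(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, j, threshold, left_pred, right_pred = tree
    return left_pred if x[j] <= threshold else right_pred

def random_forest(X, y, B=50, mtry=1):
    """Steps 1 and 3: draw B bootstrap samples of size N_train, grow one tree
    per sample, and return the ensemble (average) predictor."""
    n = len(y)
    trees = []
    for _ in range(B):
        idx = [random.randrange(n) for _ in range(n)]   # bootstrap sample
        trees.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], mtry))
    return lambda x: mean(predict_tree(t, x) for t in trees)
```

In the full algorithm, each tree is grown recursively to the minimum node size without pruning; the stumps here only illustrate the bootstrap-plus-random-subspace mechanics.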

#### 2.1.3. Multi-Trait Partial Least Square (MT-PLS) Method

**Step 1**. Initialize two matrices, ${E}_{0}=\mathit{X}$ and ${F}_{0}=\mathit{Y}$, and center and normalize each column of ${E}_{0}$ and ${F}_{0}$.

**Step 2**. Form a cross-product matrix $(\mathit{S}={\mathit{X}}^{T}\mathit{Y})$ and determine its singular value decomposition (SVD). The first left and right singular vectors, $w$ and $q$, are used as weight vectors for $\mathit{X}$ and $\mathit{Y}$, respectively, to obtain scores $t$ and $u$:

**Step 3.** Next, the $\mathit{X}$ and $\mathit{Y}$ loadings are obtained by regressing each matrix against the same score vector ($t$):

**Step 4.** Having extracted the first latent vector and the corresponding loading vectors, the matrices $\mathit{E}$ and $\mathit{F}$ are deflated by subtracting the information related to the first latent vector. This produces the deflated matrices ${E}_{n+1}$ and ${F}_{n+1}$, as shown in the calculations below.

**Step 5**. Calculate the cross-product matrix of ${E}_{n+1}$ and ${F}_{n+1}$ as in Step 2. With this new cross-product matrix, repeat Steps 3 and 4 and save the resulting $w$, $t$, $p$ and $q$ vectors as the next columns of the matrices **W**, **T**, **P** and **Q**, respectively. This yields the next component. Repeat the above steps until the deflated matrices are empty or the required number of components has been extracted; then the algorithm stops.
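
The five steps above can be sketched numerically. The following is an illustrative pure-Python implementation (ours, not the pls::plsr() code used later; all function names are hypothetical) that approximates the leading singular vectors of Step 2 by power iteration:

```python
from math import sqrt

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def normalize(v):
    n = sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def first_singular_vectors(S, iters=100):
    """Approximate the first left/right singular vectors of S (Step 2's SVD)
    by power iteration on S S^T, then q proportional to S^T w."""
    w = normalize([1.0] * len(S))
    SST = matmul(S, transpose(S))
    for _ in range(iters):
        w = normalize([sum(r[k] * w[k] for k in range(len(w))) for r in SST])
    q = normalize([sum(S[i][j] * w[i] for i in range(len(S))) for j in range(len(S[0]))])
    return w, q

def pls_components(X, Y, n_components=2):
    """Steps 1-5: extract score vectors t, deflating E and F after each one.
    X and Y are assumed already centered (Step 1)."""
    E, F = [row[:] for row in X], [row[:] for row in Y]
    scores = []
    for _ in range(n_components):
        S = matmul(transpose(E), F)                    # Step 2: S = E^T F
        w, q = first_singular_vectors(S)
        t = [sum(E[i][j] * w[j] for j in range(len(w))) for i in range(len(E))]
        tt = sum(ti * ti for ti in t) or 1.0
        # Step 3: loadings from regressing E and F on the same vector t
        p_load = [sum(E[i][j] * t[i] for i in range(len(t))) / tt for j in range(len(w))]
        q_load = [sum(F[i][j] * t[i] for i in range(len(t))) / tt for j in range(len(F[0]))]
        for i in range(len(E)):                        # Step 4: deflation
            for j in range(len(w)):
                E[i][j] -= t[i] * p_load[j]
            for j in range(len(F[0])):
                F[i][j] -= t[i] * q_load[j]
        scores.append(t)                               # Step 5: next component
    return scores
```

A standard consequence of the deflation in Step 4 is that successive score vectors $t$ are mutually orthogonal, which this sketch reproduces.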

#### 2.2. Datasets

#### 2.2.1. Dataset 1: Indica

^{2}/day) calculated using Armstrong’s formula; (4) EfPpit denotes effective precipitation (mm) computed as the average of daily precipitation in mm that is actually added and stored in the soil; (5) DegDay denotes the mean of daily average temperature minus 10°; (6) RelH denotes relative humidity (h) computed as the sum of daily hours (0–24 h) with a relative humidity equal to 100%; (7) PpitDay denotes the precipitation days computed as the sum of days during which it rained; (8) MeanTemp denotes the mean temperature (°C) over 24 h (0–24 h); (9) AvTemp denotes the average temperature (°C) calculated as daily (Max + Min) / 2; (10) MaxTemp denotes the average maximum daily temperature (°C); (11) MinTemp denotes the average minimum daily temperature (°C); (12) TankEv denotes tank water evaporation (mm) computed as the amount of evaporated water under the influence of sun and wind; (13) Wind denotes wind speed (2 m/km/24 h) computed as the distance covered by wind (in km) at 2 m height in one day; (14) PicheEv denotes Piche evaporation (mm) computed as the amount of evaporated water without the influence of the sun; (15) MinRelH stands for the minimum relative humidity (%) computed as the lowest value of relative humidity for the day; (16) AccumPpit denotes the daily accumulated precipitation (mm); (17) Sunhs denotes sunshine duration computed as the sum of total hours of sunshine per day; and (18) MinT15 denotes the minimum temperature below 15° computed as the sum of the days on which the minimum temperature was below 15°. More details on how these environmental covariates were measured are presented by Monteverde et al. [27].
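
As an illustration of how a few of these covariates can be derived from daily weather records, consider the following sketch (the function and field names are hypothetical and not taken from Monteverde et al. [27]; DegDay follows one plausible reading of the definition above):

```python
def summarize_environment(daily):
    """Compute a few of the covariates listed above from daily weather
    records, given as dicts with 'tmax', 'tmin' (deg C) and 'precip' (mm)."""
    n = len(daily)
    av_temp = [(d["tmax"] + d["tmin"]) / 2 for d in daily]   # (9) daily (Max + Min) / 2
    return {
        "AvTemp": sum(av_temp) / n,                          # average of the daily means
        "DegDay": sum(av_temp) / n - 10,                     # (5) mean daily average temp minus 10
        "PpitDay": sum(1 for d in daily if d["precip"] > 0), # (7) days on which it rained
        "AccumPpit": sum(d["precip"] for d in daily),        # (16) accumulated precipitation (mm)
        "MinT15": sum(1 for d in daily if d["tmin"] < 15),   # (18) days with minimum temp below 15
        "MaxTemp": sum(d["tmax"] for d in daily) / n,        # (10) average daily maximum (deg C)
        "MinTemp": sum(d["tmin"] for d in daily) / n,        # (11) average daily minimum (deg C)
    }
```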

#### 2.2.2. Dataset 2: Japonica

#### 2.2.3. Dataset 3: Groundnut Data

#### 2.2.4. Dataset 4: Disease Data

#### 2.2.5. Datasets 5–6: Elite Wheat Yield Trial (EYT) Years 2013–2014 and 2014–2015

For EYT **dataset 5**, the environments were bed planting with five irrigations (Bed5IR), early heat (EHT), flat planting with five irrigations (Flat5IR) and late heat (LHT). For EYT **dataset 6**, the environments were bed planting with two irrigations (Bed2IR), Bed5IR, EHT, Flat5IR and LHT.

#### 2.3. Metrics for Evaluation of Prediction Accuracy

#### 2.4. Functions for Implementing the Multi-Trait Models Using the SKM Library

**bayesian_model()**: a wrapper of the BGLR::BGLR() and BGLR::Multitrait() functions, the latter being the function used to fit a multivariate Bayesian regression model. The main arguments used to fit this model are x, y and testing_indices, with which we specify the predictor variables, the response variables and the indices of the testing set, respectively. Unlike the other functions used to implement the seven machine learning algorithms offered by the SKM library, it is necessary to specify the indices of the testing set. The x argument must be a list of nested lists, wherein each inner list represents an effect of the predictor. To implement the GBLUP model in its Bayesian form, this argument must be specified as x = list(G = list(x = **G**, model = “BGBLUP”)), where **G** denotes the genomic relationship matrix. Use help(“bayesian_model”) in the R console to see more details about the parameters of this function.

**partial_least_squares()**: a wrapper of the pls::plsr() function, which is the function used to fit a multivariate partial least squares regression model for numerical responses. The main arguments used to fit this model are x and y, with which we specify the predictor variables and response variables, respectively. This function is also useful for implementing single-trait prediction models. Use help(“partial_least_squares”) in the R console to see more details about the parameters of this function.

**random_forest()**: a wrapper of the randomForestSRC::rfsrc() function, which is the function used to fit a random forest model. The main arguments used to fit this model are:

- **x**: the predictor (or independent) variables in matrix form;
- **y**: the response (or dependent) variables in a matrix or data frame (in the multivariate case) or in a vector (in the univariate case);
- **trees_number**: a **tunable** hyperparameter that specifies the number of regression trees used;
- **node_size**: a **tunable** hyperparameter that specifies the minimum size of the terminal nodes in each regression tree;
- **tune_type**: an argument that specifies the type of hyperparameter tuning to use (“Grid_search” by default). For the “Grid_search” tuning type, the proposed values for each hyperparameter must be specified through a vector, whereas for the “Bayesian_optimization” method, they must be specified through a list of two elements that indicate the range of proposed values for each hyperparameter.
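
Conceptually, the “Grid_search” tuning type evaluates every combination of the proposed hyperparameter values and keeps the best one. A minimal sketch of this idea (ours, not SKM’s internal implementation):

```python
from itertools import product

def grid_search(train_eval, grid):
    """Evaluate every combination of the proposed hyperparameter values with
    a user-supplied scoring function (e.g., cross-validated error, lower is
    better) and return the best combination."""
    best_params, best_score = None, float("inf")
    names = list(grid)
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = train_eval(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For example, a grid such as `{"trees_number": [100, 300, 500], "node_size": [5, 10]}` would be evaluated at all six combinations; Bayesian optimization instead searches inside a stated range for each hyperparameter.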

**cv_random()**: generates the folds to use under a random cross-validation framework. As input, we need to provide the number of observations (records_number), the number of folds (folds_number) and the proportion of observations to be included in the test set (testing_proportion). Each fold is built by random sampling with replacement, with the specified proportion of observations assigned to the test set and the rest to the training set.

**cv_kfold()**: generates the folds to be used under the k-fold cross-validation approach; the number of observations (records_number) and the number of desired folds (k) must be provided as input. If the number specified in the records_number argument corresponds to the number of environments and the k argument equals the number of environments, then the cross-validation scheme corresponds to leave-one-environment-out (LOEO).
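
The two fold-generation schemes can be sketched as follows (illustrative Python, not SKM’s implementation; in this sketch each random-CV fold draws its test set independently, so test sets may overlap across folds):

```python
import random

def cv_random(records_number, folds_number, testing_proportion, seed=0):
    """Random CV: each fold independently samples the given proportion of
    records for testing; the remaining records form the training set."""
    rng = random.Random(seed)
    n_test = max(1, round(records_number * testing_proportion))
    folds = []
    for _ in range(folds_number):
        testing = sorted(rng.sample(range(records_number), n_test))
        training = [i for i in range(records_number) if i not in testing]
        folds.append({"training": training, "testing": testing})
    return folds

def cv_kfold(records_number, k, seed=0):
    """k-fold CV: records are shuffled once and partitioned into k disjoint
    test sets. With records_number equal to the number of environments and
    k equal to that same number, this reduces to leave-one-environment-out."""
    rng = random.Random(seed)
    order = list(range(records_number))
    rng.shuffle(order)
    folds = []
    for j in range(k):
        testing = sorted(order[j::k])
        training = sorted(set(order) - set(testing))
        folds.append({"training": training, "testing": testing})
    return folds
```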

**gs_summaries()**: a function that helps evaluate the predictive capacity of each model by reporting summary statistics of the predictions made in the various generated folds. The main argument required by this function is a data frame containing the following columns: Fold, Line, Env, Observed and Predicted; the output is a list of the prediction performance with three summaries (by “line”, by “env” and by “fold”). Use help(“gs_summaries”) in the R console to see more details about this function. The SKM library can be installed from GitHub with the following lines of code:

```r
if (!require("devtools")) {install.packages("devtools")}
devtools::install_github("cran/randomForestSRC")
devtools::install_github("gdlc/BGLR-R")
devtools::install_github("rstudio/tensorflow")
devtools::install_github("brandonmosqueda/SKM")
```
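
The kind of per-group summary that gs_summaries() reports can be illustrated with a sketch (ours, not the SKM implementation) that groups prediction records by one column and computes the Pearson correlation between observed and predicted values within each group:

```python
from math import sqrt
from collections import defaultdict

def pearson(obs, pred):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = sqrt(sum((o - mo) ** 2 for o in obs))
    sp = sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp) if so and sp else float("nan")

def summarize_by(records, key):
    """Group prediction records (dicts with 'Fold', 'Env', 'Observed' and
    'Predicted' keys) by one column and report the within-group Pearson
    correlation between observed and predicted values."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return {g: pearson([r["Observed"] for r in rs], [r["Predicted"] for r in rs])
            for g, rs in groups.items()}
```

Calling `summarize_by(records, "Env")`, `summarize_by(records, "Fold")` and so on mimics the “env”, “fold” and “line” summaries described above.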

#### 2.5. Data Availability and Supplementary Materials

## 3. Results

#### 3.1. Dataset 1: Indica

**Heritability and variance components**

**With predictor = G**

Under the **sevenfold cross-validation (CV)** scheme, we observed that the relative efficiencies of the **GBLUP** model vs. the **PLS** model were 0.909, 0.964 and 0.916 for environments (years) 2010, 2011 and 2012, respectively; that is, in each of the environments, the prediction performance of the **PLS** model was lower than that of the **GBLUP** model, and the loss in prediction accuracy was 9.1% (2010), 3.6% (2011) and 8.4% (2012). In addition, across all environments (global), we observed that the **GBLUP** model achieved better performance than the **PLS** model, as the relative efficiency was equal to 0.928; that is, across all environments, the **GBLUP** model outperformed the **PLS** model by 7.8%, as $\frac{1}{RE\_PLS}=\frac{1}{0.928}=1.078$ (Figure 1 with predictor = G) (Table 2).
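
The relative-efficiency arithmetic used throughout this section can be made explicit. The sketch below (ours) assumes RE is the ratio of GBLUP’s NRMSE to the competing model’s NRMSE, so that RE < 1 favors GBLUP, consistent with the $1/RE$ expressions reported in the text:

```python
def relative_efficiency(nrmse_gblup, nrmse_other):
    """RE of GBLUP vs. another model in terms of NRMSE (lower error is
    better): RE < 1 means GBLUP achieved a lower error than the other model."""
    return nrmse_gblup / nrmse_other

def percent_advantage(re):
    """Percentage by which GBLUP outperforms the other model: (1/RE - 1) * 100."""
    return (1.0 / re - 1.0) * 100.0
```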

Regarding the **GBLUP** model vs. the **random forest** model under the **sevenfold CV** scheme, the relative efficiencies were 0.984, 0.994 and 0.949 for years 2010, 2011 and 2012, respectively; that is, the performance of the **random forest** model was 1.6% (2010), 0.6% (2011) and 5.1% (2012) lower than that of the **GBLUP** model. In addition, across all environments (global), we observed that the **GBLUP** model performed better than the random forest model, as the relative efficiency was equal to 0.978; that is, across all environments, the **GBLUP** model outperformed the **random forest** model by 2.2% (Figure 1 with predictor = G) (Table 2).

**With predictor = G + E**

Under the **sevenfold CV** scheme, we observed that the relative efficiencies of the **GBLUP** model vs. the **PLS** model were 0.904, 0.962 and 0.931 for environments (years) 2010, 2011 and 2012, respectively; that is, in each of the environments, the prediction performance of the **PLS** model was lower than that of the **GBLUP** model, as the loss in prediction accuracy was 9.6% (2010), 3.7% (2011) and 6.9% (2012). In addition, across all environments (global), we observed that the **GBLUP** model performed better than the **PLS** model, as the relative efficiency was equal to 0.925; that is, across all environments, the **GBLUP** model outperformed the **PLS** model by 8.1%, as $\frac{1}{RE\_PLS}=\frac{1}{0.925}=1.081$ (Figure 1 with predictor = G + E) (Table 2).

Regarding the **GBLUP** model vs. the **random forest** model under the **sevenfold CV** scheme, we observed that the relative efficiencies were 1.027, 1.027 and 1.009 for environments (years) 2010, 2011 and 2012, respectively; that is, in each of the environments (years), the prediction performance of the **random forest** model was higher than that of the **GBLUP** model, and the gains in prediction accuracy were 2.7% (2010), 2.7% (2011) and 0.9% (2012). In addition, across all environments (global), we observed that the **random forest** model performed better than the **GBLUP** model, as the relative efficiency was equal to 1.009; that is, across all environments, the **GBLUP** model was surpassed by the **random forest** model by 0.9% (Figure 1 with predictor = G + E) (Table 2).

**With predictor = E + G + GE**

Under the **sevenfold CV** scheme, we observed that the relative efficiencies of the **GBLUP** model vs. the **PLS** model were 0.898, 0.934 and 0.944 for the environments (years) 2010, 2011 and 2012, respectively; that is, in each of the environments, the prediction performance of the **PLS** model was lower than that of the **GBLUP** model, and the loss in prediction accuracy was 10.2% (2010), 6.6% (2011) and 5.6% (2012). In addition, across all environments (global), we observed that the **GBLUP** model performed better than the **PLS** model, as the relative efficiency was equal to 0.918; that is, across all environments, the **GBLUP** model outperformed the **PLS** model by 8.9%, as $\frac{1}{RE\_PLS}=\frac{1}{0.918}=1.089$ (Figure 1 with predictor = G + E + GE) (Table 2).

Regarding the **GBLUP** model vs. the **random forest** model under the **sevenfold CV** scheme, we observed that the relative efficiencies were 0.991, 0.982 and 0.990 for environments (years) 2010, 2011 and 2012, respectively; that is, in each of the environments (years), the prediction performance of the **random forest** model was lower than that of the **GBLUP** model, as the losses in prediction accuracy were 0.9% (2010), 1.8% (2011) and 1.0% (2012). In addition, across all environments (global), we observed that the **GBLUP** model performed better than the **random forest** model, as the relative efficiency was equal to 0.976; that is, across all environments, the **random forest** model was outperformed by the **GBLUP** model by 2.4% (Figure 1 with predictor = G + E + GE) (Table 2).

**With predictor = G + GE**

Under the **sevenfold CV** scheme, we observed that the relative efficiencies of the **GBLUP** model vs. the **PLS** model were 0.913, 0.895 and 0.931 for environments (years) 2010, 2011 and 2012, respectively; that is, in each of the environments, the prediction performance of the **PLS** model was lower than that of the **GBLUP** model, and the loss in prediction accuracy was 8.7% (2010), 10.5% (2011) and 6.9% (2012). In addition, across all environments (global), we observed that the **GBLUP** model performed better than the **PLS** model, as the relative efficiency was equal to 0.907; that is, across all environments, the **PLS** model was surpassed by the **GBLUP** model by 9.3% (Figure 1 with predictor = G + GE) (Table 2).

Regarding the **GBLUP** model vs. the **random forest** model under the **sevenfold CV** scheme, we observed that the relative efficiencies were 0.996, 0.983 and 0.990 for environments (years) 2010, 2011 and 2012, respectively; that is, in each of the environments (years), the prediction performance of the **random forest** model was lower than that of the **GBLUP** model, and the loss in prediction accuracy was 0.4% (2010), 1.7% (2011) and 1.0% (2012). In addition, across all environments (global), we observed that the **GBLUP** model performed better than the **random forest** model, as the relative efficiency was equal to 0.980; that is, across all environments, the **random forest** model was outperformed by the **GBLUP** model by 2.0% (Figure 1 with predictor = G + GE) (Table 2).

#### 3.2. Dataset 2: Japonica

**Heritability and variance components**

**With predictor = G**

Under the **sevenfold CV** scheme, we observed that the relative efficiencies of the **GBLUP** model vs. the **PLS** model were 0.983, 1.003, 1.000, 1.057 and 0.937 for environments (years) 2009, 2010, 2011, 2012 and 2013, respectively; that is, only in environments (years) 2010 and 2012 was the performance of the **PLS** regression predictions higher than that of the **GBLUP** model, with gains in prediction accuracy of 0.3% (2010) and 5.7% (2012), whereas the prediction performance of both models was the same in environment (year) 2011. However, across all environments (global), we observed that the **GBLUP** model performed better than the **PLS** model, as the relative efficiency was equal to 0.973; that is, across all environments, the **GBLUP** model outperformed the **PLS** model by 2.7% (Figure 2 with predictor = G) (Table 4).

Regarding the **GBLUP** model vs. the **random forest** model under the **sevenfold CV** scheme, we observed that the relative efficiencies were 0.884, 0.951, 0.934, 0.967 and 0.932 for environments (years) 2009, 2010, 2011, 2012 and 2013, respectively; that is, in each of the environments, the prediction performance of the **random forest** model was lower than that of the **GBLUP** model, and the losses in prediction accuracy were 11.6% (2009), 4.9% (2010), 6.6% (2011), 3.3% (2012) and 6.8% (2013). In addition, across all environments (global), we observed that the **GBLUP** model performed better than the **random forest** model, as the relative efficiency was equal to 0.906; that is, across all environments, the **GBLUP** model outperformed the **random forest** model by 10.4%, as $\frac{1}{RE\_RF}=\frac{1}{0.906}=1.104$ (Figure 2 with predictor = G).

**With predictor = G + E**

Under the **sevenfold CV** scheme, we observed that the relative efficiencies of the **GBLUP** model vs. the **PLS** model were 0.938, 0.954, 0.864, 0.904 and 0.803 for environments (years) 2009, 2010, 2011, 2012 and 2013, respectively; that is, in each of the environments, the prediction performance of the **PLS** model was lower than that of the **GBLUP** model, as the loss in prediction accuracy was 6.2% (2009), 4.6% (2010), 13.6% (2011), 9.6% (2012) and 19.7% (2013). In addition, across all environments (global), we observed that the **GBLUP** model performed better than the **PLS** model, as the relative efficiency was equal to 0.823; that is, across all environments, the **GBLUP** model outperformed the **PLS** model by 21.5%, as $\frac{1}{RE\_PLS}=\frac{1}{0.823}=1.215$ (Figure 2 with predictor = G + E) (Table 4).

Regarding the **GBLUP** model vs. the **random forest** model under the **sevenfold CV** scheme, we observed that the relative efficiencies were 0.977, 1.112, 0.928, 0.916 and 0.859 for environments (years) 2009, 2010, 2011, 2012 and 2013, respectively; that is, only in environment (year) 2010 was the performance of the predictions of the **random forest** model superior to that of the **GBLUP** model, and the gain in prediction accuracy was 11.2% (2010). However, across all environments (global), we observed that the **GBLUP** model performed better than the **random forest** model, as the relative efficiency was equal to 0.857; that is, across all environments, the **GBLUP** model outperformed the **random forest** model by 14.3% (Figure 2 with predictor = G + E) (Table 4).

**With predictor = E + G + GE**

Under the **sevenfold CV** scheme, we observed that the relative efficiencies of the **GBLUP** model vs. the **PLS** model were 0.943, 0.812, 0.862, 0.919 and 0.800 for environments (years) 2009, 2010, 2011, 2012 and 2013, respectively; that is, in each of the environments, the prediction performance of the **PLS** model was lower than that of the **GBLUP** model, and the loss in prediction accuracy was 5.7% (2009), 18.8% (2010), 13.8% (2011), 8.1% (2012) and 20.0% (2013). In addition, across all environments (global), we observed that the **GBLUP** model performed better than the **PLS** model, as the relative efficiency was equal to 0.828; that is, across all environments, the **GBLUP** model outperformed the **PLS** model by 17.2% (Figure 2 with predictor = G + E + GE) (Table 4).

Regarding the **GBLUP** model vs. the **random forest** model under the **sevenfold CV** scheme, we observed that the relative efficiencies were 1.024, 0.939, 0.928, 0.922 and 0.851 for environments (years) 2009, 2010, 2011, 2012 and 2013, respectively; that is, only in environment (year) 2009 was the performance of the predictions of the **random forest** model superior to that of the **GBLUP** model, as the gain in prediction accuracy was 2.4% (2009). However, across all environments (global), we observed that the **GBLUP** model performed better than the **random forest** model, as the relative efficiency was equal to 0.859; that is, across all environments, the **random forest** model was outperformed by the **GBLUP** model by 14.1% (Figure 2 with predictor = G + E + GE) (Table 4).

**With predictor = G + GE**

Under the **sevenfold CV** scheme, we observed that the relative efficiencies of the **GBLUP** model vs. the **PLS** model were 0.673, 0.473, 0.767, 0.675 and 0.684 for environments (years) 2009, 2010, 2011, 2012 and 2013, respectively; that is, in each of the environments, the prediction performance of the **PLS** model was lower than that of the **GBLUP** model, and the loss in prediction accuracy was 32.7% (2009), 52.7% (2010), 23.3% (2011), 32.5% (2012) and 31.6% (2013). In addition, across all environments (global), we observed that the **GBLUP** model performed better than the **PLS** model, as the relative efficiency was equal to 0.663; that is, across all environments, the **PLS** model was outperformed by the **GBLUP** model by 33.7% (Figure 2 with predictor = G + GE).

Regarding the **GBLUP** model vs. the **random forest** model under the **sevenfold CV** scheme, we observed that the relative efficiencies were 1.072, 0.968, 0.921, 0.922 and 0.833 for environments (years) 2009, 2010, 2011, 2012 and 2013, respectively; that is, only in environment (year) 2009 was the performance of the predictions of the **random forest** model superior to that of the **GBLUP** model, and the gain in prediction accuracy was 7.2% (2009). However, across all environments (global), we observed that the **GBLUP** model performed better than the **random forest** model, as the relative efficiency was equal to 0.859; that is, across all environments, the **random forest** model was outperformed by the **GBLUP** model by 14.1% (Figure 2 with predictor = G + GE) (Table 4).

## 4. Discussion

## 5. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

- Bassi, F.M.; Bentley, A.R.; Charmet, G.; Ortiz, R.; Crossa, J. Breeding schemes for the implementation of genomic selection in wheat (Triticum spp.). Plant Sci. **2015**, 242, 23–36. [Google Scholar] [CrossRef] [PubMed]
- Battenfield, S.D.; Guzmán, C.; Gaynor, R.C.; Singh, R.P.; Peña, R.J.; Dreisigacker, S.; Fritz, A.K.; Poland, J.A. Genomic selection for processing and end-use quality traits in the CIMMYT spring bread wheat breeding program. Plant Genome **2016**, 9. [Google Scholar] [CrossRef] [PubMed]
- Bhat, J.A.; Ali, S.; Salgotra, R.K.; Mir, Z.A.; Dutta, S.; Jadon, V.; Tyagi, A.; Mushtaq, M.; Jain, N.; Singh, P.K.; et al. Genomic selection in the era of next generation sequencing for complex traits in plant breeding. Front. Genet. **2016**, 7, 221. [Google Scholar] [CrossRef] [PubMed]
- Roorkiwal, M.; Rathore, A.; Das, R.R.; Singh, M.K.; Jain, A.; Srinivasan, S.; Gaur, P.M.; Chellapilla, B.; Tripathi, S.; Li, Y.; et al. Genome-enabled prediction models for yield related traits in Chickpea. Front. Plant Sci. **2016**, 7, 1666. [Google Scholar] [CrossRef]
- Crossa, J.; Pérez-Rodríguez, P.; Cuevas, J.; Montesinos-López, O.A.; Jarquín, D.; de Los Campos, G.; Burgueño, J.; González-Camacho, J.M.; Pérez-Elizalde, S.; Beyene, Y.; et al. Genomic Selection in Plant Breeding: Methods, Models, and Perspectives. Trends Plant Sci. **2017**, 22, 961–975. [Google Scholar] [CrossRef]
- Wolfe, M.D.; Del Carpio, D.P.; Alabi, O.; Ezenwaka, L.C.; Ikeogu, U.N.; Kayondo, I.S.; Lozano, R.; Okeke, U.G.; Ozimati, A.A.; Williams, E.; et al. Prospects for Genomic Selection in Cassava Breeding. Plant Genome **2017**, 10, 15. [Google Scholar] [CrossRef]
- Huang, M.; Balimponya, E.G.; Mgonja, E.M.; McHale, L.K.; Luzi-Kihupi, A.; Wang, G.-L.; Sneller, C.H. Use of genomic selection in breeding rice (Oryza sativa L.) for resistance to rice blast (Magnaporthe oryzae). Mol. Breed. **2019**, 39, 114. [Google Scholar] [CrossRef]
- Montesinos-López, O.A.; Montesinos-López, A.; Crossa, J. Multivariate Statistical Machine Learning Methods for Genomic Prediction; Montesinos López, O.A., Montesinos López, A., Crossa, J., Eds.; Springer International Publishing: Cham, Switzerland, 2022; ISBN 978-3-030-89010-0. [Google Scholar]
- Arojju, S.K.; Cao, M.; Trolove, M.; Barrett, B.A.; Inch, C.; Eady, C.; Stewart, A.; Faville, M.J. Multi-Trait Genomic Prediction Improves Predictive Ability for Dry Matter Yield and Water-Soluble Carbohydrates in Perennial Ryegrass. Front. Plant Sci. **2020**, 11, 1197. [Google Scholar] [CrossRef]
- Montesinos-López, O.A.; Montesinos-López, A.; Javier Luna-Vázquez, F.; Toledo, F.H.; Pérez-Rodríguez, P.; Lillemo, M.; Crossa, J. An R Package for Bayesian Analysis of Multi-environment and Multi-trait Multi-environment Data for Genome-Based Prediction. G3 Genes Genomes Genet. **2019**, 9, 355–1369. [Google Scholar] [CrossRef]
- Montesinos-López, O.A.; Montesinos-López, A.; Crossa, J.; Cuevas, J.; Montesinos-López, J.C.; Salas-Gutiérrez, Z.; Lillemo, M.; Philomin, J.; Singh, R. A Bayesian Genomic Multi-output Regressor Stacking Model for Predicting Multi-trait Multi-environment Plant Breeding Data. G3 Genes Genomes Genet. **2019**, 9, 3381–3393. [Google Scholar] [CrossRef]
- Henderson, C.R.; Quaas, R.L. Multiple trait evaluation using relatives records. J. Anim. Sci. **1976**, 43, 1188–1197. [Google Scholar] [CrossRef]
- Pollak, E.J.; van der Werf, J.; Quaas, R.L. Selection Bias and Multiple Trait Evaluation. J. Dairy Sci. **1984**, 67, 1590–1595. [Google Scholar] [CrossRef]
- Schaeffer, L.R. Sire and Cow Evaluation Under Multiple Trait Models. J. Dairy Sci. **1984**, 67, 1567–1580. [Google Scholar] [CrossRef]
- Montesinos-López, O.A.; Montesinos-López, A.; Gianola, D.; Crossa, J.; Hernández-Suárez, C.M. Multi-trait, multi-environment deep learning modeling for genomic-enabled prediction of plant. G3 Genes Genomes Genet. **2018**, 8, 3829–3840. [Google Scholar] [CrossRef]
- Montesinos-López, O.A.; Montesinos-López, A.; Tuberosa, R.; Maccaferri, M.; Sciara, G.; Ammar, K.; Crossa, J. Multi-Trait, Multi-Environment Genomic Prediction of Durum Wheat With Genomic Best Linear Unbiased Predictor and Deep Learning Methods. Front. Plant Sci. **2019**, 11, 1311. [Google Scholar] [CrossRef]
- Palermo, G.; Piraino, P.; Zucht, H.D. Performance of PLS regression coefficients in selecting variables for each response of a multivariate PLS for omics-type data. Adv. Appl. Bioinform. Chem. **2009**, 2, 57–70. [Google Scholar] [CrossRef]
- Montesinos López, O.A.; Mosqueda González, B.A.; Palafox González, A.; Montesinos López, A.; Crossa, J. A General-Purpose Machine Learning R Library for Sparse Kernels Methods With an Application for Genome-Based Prediction. Front. Genet. **2022**, 13, 887643. [Google Scholar] [CrossRef]
- VanRaden, P.M. Efficient methods to compute genomic predictions. J. Dairy Sci. **2008**, 91, 4414–4423. [Google Scholar] [CrossRef]
- Breiman, L. Random forests. Mach. Learn. **2001**, 45, 5–32. [Google Scholar] [CrossRef]
- Waldmann, P. Genome-wide prediction using Bayesian additive regression trees. Genet. Sel. Evol. **2016**, 48, 42. [Google Scholar] [CrossRef]
- Wold, H. Estimation of principal components and related models by iterative least squares. In Multivariate Analysis; Krishnaiah, P.R., Ed.; Academic Press: New York, NY, USA, 1966. [Google Scholar]
- Boulesteix, A.L.; Strimmer, K. Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Brief. Bioinform. **2006**, 8, 32–44. [Google Scholar] [CrossRef]
- Mevik, B.H.; Cederkvist, H.R. Mean squared error of prediction (MSEP) estimates for principal component regression (PCR) and partial least squares regression (PLSR). J. Chemometr. **2004**, 18, 422–429. [Google Scholar] [CrossRef]
- Pérez, P.; de los Campos, G. BGLR: A statistical package for whole genome regression and prediction. Genetics **2014**, 198, 483–495. [Google Scholar] [CrossRef]
- Mevik, B.-H.; Wehrens, R. The pls package: Principal component and partial least squares regression in R. J. Stat. Softw. **2007**, 18, 1–24. [Google Scholar] [CrossRef]
- Monteverde, E.; Gutierrez, L.; Blanco, P.; Pérez de Vida, F.; Rosas, J.E.; Bonnecarrère, V.; Quero, G.; McCouch, S. Integrating Molecular Markers and Environmental Covariates To Interpret Genotype by Environment Interaction in Rice (Oryza sativa L.) Grown in Subtropical Areas. G3 Genes Genomes Genet. **2019**, 9, 1519–1531. [Google Scholar] [CrossRef] [PubMed]
- Pandey, M.K.; Chaudhari, S.; Jarquin, D.; Janila, P.; Crossa, J.; Patil, S.C.; Sundravadana, S.; Khare, D.; Bhat, R.S.; Radhakrishnan, T.; et al. Genome-based trait prediction in multi-environment breeding trials in groundnut. Theor. Appl. Genet. **2020**, 133, 3101–3117. [Google Scholar] [CrossRef] [PubMed]
- Juliana, P.J.; Singh, R.P.; Poland, J.; Mondal, S.; Crossa, J.; Montesinos-López, O.A.; Dreisigacker, S.; Pérez-Rodríguez, P.; Huerta-Espino, J.; Crespo, L.; et al. Prospects and challenges of applied genomic selection-a new paradigm in breeding for grain yield in bread wheat. Plant Genome **2018**, 11, 180017. [Google Scholar] [CrossRef] [PubMed]
- Elshire, R.J.; Glaubitz, J.C.; Sun, Q.; Poland, J.A.; Kawamoto, K.; Buckler, E.S.; Mitchell, S.E. A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity Species. PLoS ONE **2011**, 6, e19379. [Google Scholar] [CrossRef]
- Poland, J.A.; Brown, P.J.; Sorrells, M.E.; Jannink, J.L. Development of high-density genetic maps for barley and wheat using a novel two-enzyme genotyping-by-sequencing approach. PLoS ONE **2012**, 7, e32253. [Google Scholar] [CrossRef]
- Money, D.; Gardner, K.; Migicovsky, Z.; Schwaninger, H.; Zhong, G.; Myles, S. LinkImpute: Fast and accurate genotype imputation for nonmodel organisms. G3 Genes Genomes Genet. **2015**, 5, 2383–2390. [Google Scholar] [CrossRef]
- Bradbury, P.J.; Zhang, Z.; Kroon, D.E.; Casstevens, T.M.; Ramdoss, Y.; Buckler, E.S. TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics **2007**, 23, 2633–2635. [Google Scholar] [CrossRef] [PubMed]
- Mockus, J. Bayesian Approach to Global Optimization: Theory and Applications; Springer: Dordrecht, The Netherlands, 2012. [Google Scholar]
- Montesinos-López, O.A.; Montesinos-López, A.; Kismiantini; Roman-Gallardo, R.; Gardner, K.; Lillemo, M.; Fritsche-Neto, R.; Crossa, J. Partial least square enhances genome-based prediction of new environments. Front. Genet. **2022**, 3, 3. [Google Scholar] [CrossRef]
- Montesinos-López, O.A.; Montesinos-López, A.; Bernal-Sandoval, D.A.; Mosqueda-González, B.A.; Valenzo-Jiménez, M.A.; Crossa, J. Multi-trait genome-based prediction of new environments with partial least squares. Front. Genet. 2022; accepted. [Google Scholar]
- Costa-Neto, G.; Fritsche-Neto, R.; Crossa, J. Nonlinear kernels, dominance, and envirotyping data increase the accuracy of genome-based prediction in multi-environment trials. Heredity **2021**, 126, 92–106. [Google Scholar] [CrossRef]
- Costa-Neto, G.; Galli, G.; Carvalho, H.F.; Crossa, J.; Fritsche-Neto, R. EnvRtype: A software to interplay enviromics and quantitative genomics in agriculture. G3 Genes Genomes Genet. **2021**, 11, jkab040. [Google Scholar] [CrossRef]

**Figure 1.** Prediction performance for each environment and across environments (Global) of dataset 1 (Indica) in terms of normalized root mean square error (NRMSE) under four predictors (G, genotypic information; E + G, environment plus genotypic information; E + G + GE, environment plus genotypic plus genotype by environment interaction information; and G + GE, genotypic plus genotype by environment interaction) and under the sevenfold cross-validation (CV) scheme.

**Figure 2.** Prediction performance for each environment and across environments (Global) of dataset 2 (Japonica) in terms of normalized root mean square error (NRMSE) under four predictors (G, genotypic information; E + G, environment plus genotypic information; E + G + GE, environment plus genotypic plus genotype by environment interaction information; and G + GE, genotypic plus genotype by environment interaction) and under the sevenfold cross-validation (CV) scheme.

**Table 1.** Variance components (variance) and heritability estimates for **dataset 1** (**Indica**) for each trait. CV denotes the coefficient of variation, and Locs denotes the average number of locations.

Trait | Component | Variance | Heritability | CV | Locs |
---|---|---|---|---|---|
GY | Loc:Hybrid | 394520.80 | 0.47 | 0.11 | 3 |
GY | Hybrid | 361259.05 | 0.47 | 0.11 | 3 |
GY | Loc | 496020.35 | 0.47 | 0.11 | 3 |
GY | Residual | 336143.27 | 0.47 | 0.11 | 3 |
PHR | Loc:Hybrid | 2.33 | 0.69 | 0.05 | 3 |
PHR | Hybrid | 3.74 | 0.69 | 0.05 | 3 |
PHR | Loc | 0.05 | 0.69 | 0.05 | 3 |
PHR | Residual | 2.65 | 0.69 | 0.05 | 3 |
GC | Loc:Hybrid | 1.73 | 0.54 | 0.63 | 3 |
GC | Hybrid | 1.48 | 0.54 | 0.63 | 3 |
GC | Loc | 0.06 | 0.54 | 0.63 | 3 |
GC | Residual | 1.96 | 0.54 | 0.63 | 3 |
PH | Loc:Hybrid | 1.96 | 0.76 | 0.06 | 3 |
PH | Hybrid | 9.66 | 0.76 | 0.06 | 3 |
PH | Loc | 4.81 | 0.76 | 0.06 | 3 |
PH | Residual | 2.58 | 0.76 | 0.06 | 3 |
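For readers who want to relate variance components like those in Table 1 to a heritability estimate, the sketch below uses the standard entry-mean (line-mean) broad-sense formula, H² = σ²g / (σ²g + σ²g×l/l + σ²e/(l·r)). This is an illustration only: the number of replicates per location is not reported here and the authors' exact formula may differ, so the function is not expected to reproduce the tabulated estimates.

```python
def entry_mean_heritability(var_g, var_gl, var_e, n_loc, n_rep=1):
    """Broad-sense heritability on an entry-mean basis (standard formula).

    var_g  : genotypic (Hybrid) variance
    var_gl : genotype-by-location (Loc:Hybrid) variance
    var_e  : residual variance
    n_loc  : number of locations
    n_rep  : replicates per location (assumed; not reported in Table 1)
    """
    return var_g / (var_g + var_gl / n_loc + var_e / (n_loc * n_rep))


# Illustrative call with the GY components from Table 1 and n_rep assumed 1;
# the result need not match the tabulated 0.47 because the replicate count
# and exact formula used by the authors are not given here.
h2_gy = entry_mean_heritability(361259.05, 394520.80, 336143.27, n_loc=3)
```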

**Table 2.** Prediction performance for each environment and across environments (Global) of **dataset 1 (Indica)** in terms of normalized root mean square error (NRMSE) and relative efficiency (RE) under four predictors (G, genotypic information; E + G, environment plus genotypic information; E + G + GE, environment plus genotypic plus genotype by environment interaction information; and G + GE, genotypic plus genotype by environment interaction) under sevenfold cross-validation. NRMSE_GBLUP, NRMSE_PLS and NRMSE_RF denote the NRMSE under the **GBLUP**, **PLS** and **random forest** models, respectively. RE_PLS and RE_RF denote the relative efficiency (RE) calculated with the NRMSE of the **PLS** and **random forest** models, respectively. RE was calculated by dividing the NRMSE of the **GBLUP** model by that of the **PLS** or **random forest** model; that is, the **GBLUP** model was considered the reference model.

Data | Predictor | Env | NRMSE_GBLUP | NRMSE_PLS | NRMSE_RF | RE_PLS | RE_RF |
---|---|---|---|---|---|---|---|
Indica | G | 2010 | 0.892 | 0.981 | 0.907 | 0.909 | 0.984 |
Indica | G | 2011 | 1.040 | 1.079 | 1.046 | 0.964 | 0.994 |
Indica | G | 2012 | 0.917 | 1.001 | 0.966 | 0.916 | 0.949 |
Indica | G | Global | 0.880 | 0.948 | 0.900 | 0.928 | 0.978 |
Indica | E + G | 2010 | 0.876 | 0.969 | 0.853 | 0.904 | 1.027 |
Indica | E + G | 2011 | 0.924 | 0.961 | 0.900 | 0.962 | 1.027 |
Indica | E + G | 2012 | 0.839 | 0.901 | 0.836 | 0.931 | 1.004 |
Indica | E + G | Global | 0.817 | 0.884 | 0.810 | 0.925 | 1.009 |
Indica | E + G + GE | 2010 | 0.861 | 0.959 | 0.869 | 0.898 | 0.991 |
Indica | E + G + GE | 2011 | 0.901 | 0.964 | 0.918 | 0.934 | 0.982 |
Indica | E + G + GE | 2012 | 0.840 | 0.890 | 0.849 | 0.944 | 0.990 |
Indica | E + G + GE | Global | 0.808 | 0.880 | 0.827 | 0.918 | 0.976 |
Indica | G + GE | 2010 | 0.874 | 0.957 | 0.877 | 0.913 | 0.996 |
Indica | G + GE | 2011 | 0.910 | 1.017 | 0.926 | 0.895 | 0.983 |
Indica | G + GE | 2012 | 0.851 | 0.914 | 0.859 | 0.931 | 0.990 |
Indica | G + GE | Global | 0.816 | 0.900 | 0.833 | 0.907 | 0.980 |
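The NRMSE and RE columns in Table 2 can be reproduced from their definitions. The sketch below assumes NRMSE is the RMSE normalized by the mean of the observed values (a common convention; the paper's exact normalization is not restated here), while RE is exactly the ratio described in the caption, with GBLUP as the reference.

```python
import math


def nrmse(y_observed, y_predicted):
    """RMSE normalized by the mean of the observed values (assumed convention)."""
    n = len(y_observed)
    mse = sum((o - p) ** 2 for o, p in zip(y_observed, y_predicted)) / n
    return math.sqrt(mse) / (sum(y_observed) / n)


def relative_efficiency(nrmse_reference, nrmse_alternative):
    """RE of an alternative model relative to the reference (GBLUP here).

    RE > 1 means the alternative achieved a lower NRMSE than the reference.
    """
    return nrmse_reference / nrmse_alternative


# Example with the Global values for predictor G in Table 2:
# RE_PLS = 0.880 / 0.948 ≈ 0.928, matching the tabulated value.
re_pls = relative_efficiency(0.880, 0.948)
```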

**Table 3.** Variance components (variance) and heritability estimates for **Japonica** (**dataset 2**) for each trait. CV denotes the coefficient of variation, and Locs denotes the average number of locations.

Trait | Component | Variance | Heritability | CV | Locs |
---|---|---|---|---|---|
GY | Loc:Hybrid | 186065.908 | 0.29 | 0.16 | 3.60 |
GY | Hybrid | 257287.998 | 0.29 | 0.16 | 3.60 |
GY | Loc | 1860782.427 | 0.29 | 0.16 | 3.60 |
GY | Residual | 272836.420 | 0.29 | 0.16 | 3.60 |
PHR | Loc:Hybrid | 0.0001 | 0.46 | 0.07 | 3.60 |
PHR | Hybrid | 0.0004 | 0.46 | 0.07 | 3.60 |
PHR | Loc | 0.0012 | 0.46 | 0.07 | 3.60 |
PHR | Residual | 0.0003 | 0.46 | 0.07 | 3.60 |
GC | Loc:Hybrid | 0.000 | 0.25 | 0.82 | 3.60 |
GC | Hybrid | 0.001 | 0.25 | 0.82 | 3.60 |
GC | Loc | 0.006 | 0.25 | 0.82 | 3.60 |
GC | Residual | 0.001 | 0.25 | 0.82 | 3.60 |
PH | Loc:Hybrid | 0.002 | 0.62 | 0.10 | 3.60 |
PH | Hybrid | 20.528 | 0.62 | 0.10 | 3.60 |
PH | Loc | 35.950 | 0.62 | 0.10 | 3.60 |
PH | Residual | 8.576 | 0.62 | 0.10 | 3.60 |

**Table 4.** Prediction performance for each environment and across environments (Global) of **dataset 2 (Japonica)** in terms of normalized root mean square error (NRMSE) and relative efficiency (RE) under four predictors (G, genotypic information; E + G, environment plus genotypic information; E + G + GE, environment plus genotypic plus genotype by environment interaction information; and G + GE, genotypic plus genotype by environment interaction) under sevenfold cross-validation. NRMSE_GBLUP, NRMSE_PLS and NRMSE_RF denote the NRMSE under the **GBLUP**, **PLS** and **random forest** models, respectively. RE_PLS and RE_RF denote the relative efficiency (RE) calculated with the NRMSE of the **PLS** and **random forest** models, respectively. RE was calculated by dividing the NRMSE of the **GBLUP** model by that of the **PLS** or **random forest** model; that is, the **GBLUP** model was considered the reference model.

Data | Predictor | Env | NRMSE_GBLUP | NRMSE_PLS | NRMSE_RF | RE_PLS | RE_RF |
---|---|---|---|---|---|---|---|
Japonica | G | 2009 | 2.469 | 2.511 | 2.793 | 0.983 | 0.884 |
Japonica | G | 2010 | 2.269 | 2.263 | 2.387 | 1.003 | 0.951 |
Japonica | G | 2011 | 1.350 | 1.349 | 1.445 | 1.000 | 0.934 |
Japonica | G | 2012 | 2.054 | 1.943 | 2.124 | 1.057 | 0.967 |
Japonica | G | 2013 | 1.263 | 1.348 | 1.356 | 0.937 | 0.932 |
Japonica | G | Global | 0.957 | 0.983 | 1.056 | 0.973 | 0.906 |
Japonica | E + G | 2009 | 0.975 | 1.040 | 0.998 | 0.938 | 0.977 |
Japonica | E + G | 2010 | 0.927 | 0.972 | 0.834 | 0.954 | 1.112 |
Japonica | E + G | 2011 | 0.790 | 0.914 | 0.851 | 0.864 | 0.928 |
Japonica | E + G | 2012 | 0.842 | 0.931 | 0.918 | 0.904 | 0.916 |
Japonica | E + G | 2013 | 0.778 | 0.969 | 0.906 | 0.803 | 0.859 |
Japonica | E + G | Global | 0.465 | 0.564 | 0.542 | 0.823 | 0.857 |
Japonica | E + G + GE | 2009 | 1.008 | 1.069 | 0.985 | 0.943 | 1.024 |
Japonica | E + G + GE | 2010 | 0.819 | 1.008 | 0.872 | 0.812 | 0.939 |
Japonica | E + G + GE | 2011 | 0.796 | 0.924 | 0.858 | 0.862 | 0.928 |
Japonica | E + G + GE | 2012 | 0.872 | 0.949 | 0.945 | 0.919 | 0.922 |
Japonica | E + G + GE | 2013 | 0.782 | 0.978 | 0.919 | 0.800 | 0.851 |
Japonica | E + G + GE | Global | 0.478 | 0.578 | 0.557 | 0.828 | 0.859 |
Japonica | G + GE | 2009 | 1.064 | 1.581 | 0.992 | 0.673 | 1.072 |
Japonica | G + GE | 2010 | 0.848 | 1.794 | 0.877 | 0.473 | 0.968 |
Japonica | G + GE | 2011 | 0.791 | 1.032 | 0.859 | 0.767 | 0.921 |
Japonica | G + GE | 2012 | 0.867 | 1.283 | 0.940 | 0.675 | 0.922 |
Japonica | G + GE | 2013 | 0.766 | 1.120 | 0.920 | 0.684 | 0.833 |
Japonica | G + GE | Global | 0.478 | 0.720 | 0.556 | 0.663 | 0.859 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Montesinos-López, O.A.; Montesinos-López, A.; Cano-Paez, B.; Hernández-Suárez, C.M.; Santana-Mancilla, P.C.; Crossa, J.
A Comparison of Three Machine Learning Methods for Multivariate Genomic Prediction Using the Sparse Kernels Method (SKM) Library. *Genes* **2022**, *13*, 1494.
https://doi.org/10.3390/genes13081494
