Development of a Genomic Prediction Pipeline for Maintaining Comparable Sample Sizes in Training and Testing Sets across Prediction Schemes Accounting for the Genotype-by-Environment Interaction

The growing global population poses challenges for satisfying the food supply chain in a world facing rapid changes in environmental conditions that complicate the development of stable cultivars. Emergent methodologies aided by molecular marker information, such as marker-assisted selection (MAS) and genomic selection (GS), have been widely adopted to assist the development of improved genotypes. In general, the implementation of GS is not straightforward, and it usually requires cross-validation studies to find the optimum set of factors (training set sizes, number of markers, quality controls, etc.) to use in real breeding applications. In most cases, these different scenarios (combinations of several factors) vary only in the levels of a single factor, keeping the levels of the other factors fixed and allowing the use of previously developed routines (code reuse). In this study, we present a set of structured modules that are easy to assemble for constructing complex genomic prediction pipelines from scratch. We also propose a novel method for selecting training-testing sets of similar sample sizes across different cross-validation schemes (CV2, predicting tested genotypes in observed environments; CV1, predicting untested genotypes in observed environments; CV0, predicting tested genotypes in novel environments; and CV00, predicting untested genotypes in novel environments). To show how our implementation works, we considered two real data sets. These correspond to selected samples of the USDA soybean collection (D1: 324 genotypes observed in 6 environments and scored for 9 traits) and of the Soybean Nested Association Mapping (SoyNAM) experiment (D2: 324 genotypes observed in 6 environments and scored for 6 traits).
In addition, three prediction models were considered: the effects of environments and lines (M1: E + L); environments, lines, and the main effect of markers (M2: E + L + G); and the further inclusion of the interaction between markers and environments (M3: E + L + G + G×E). The results confirm that under the CV2 and CV1 schemes, moderate improvements in predictive ability can be obtained by including the interaction component, while for CV0 mixed results were observed, and for CV00 no improvements were shown. However, for this last scenario, the inclusion of weather and soil data could potentially enhance the results of the interaction model.


Introduction
The world confronts several challenges in satisfying the increased demands to feed the growing human population, which is projected to reach close to 10 billion by 2050 [1,2]. Not only is the population increasing, but natural resources (e.g., forest, soil, land, and water availability) have also been drastically affected by environmental problems such as deforestation and land degradation [1]. In addition, in agriculture, elite genotypes (with high yield performance) have been negatively impacted by more frequent and more intense environmental perturbations. To guarantee food requirements and confront these challenges, the new varieties yet to be developed (in the very near future) will require better resilience for a wide range of adaptation [3]. For this, new strategies and methodologies should be developed for selecting genotypes with high yield potential that can face these environmental challenges [4].
Breeders have implemented traditional breeding methods for selecting the best phenotypes to increase genetic gains [5,6]. However, phenotyping all genotypes in a wide range of environmental conditions is challenging because it requires a large number of plots in field experiments, it is time-consuming and demands manual labor, and in general there is reduced availability of resources such as land, water, and seed. A more elaborate traditional breeding method considers the use of pedigree information derived from the genetic relationships between the genotypes in the population [7][8][9]. Pedigree-based selection has been successful in delivering predictions of the estimated breeding values of unobserved genotypes; however, its implementation presents some challenges [10]. For example, it requires keeping track of the genetic relationships between all genotypes in the training and testing sets. Also, it does not account for the Mendelian segregation within populations derived from a pair of genotypes, limiting the rates of genetic progress that can be accomplished in a given period of time [11].
Hence, the traditional selection methods based on phenotypic and pedigree information may not be the most suitable options for increasing genetic gains in short periods of time, especially because it is not easy to estimate a priori the amount of recombined genome that comes from each of the parental lines [6], complicating the selection process. The development of modern sequencing technologies offered the opportunity of characterizing genotypes based on their genomic information [11]. A widely used alternative to the traditional selection methods is marker-assisted selection (MAS), which makes use of genomic information [12]. It considers a reduced set of influential molecular markers, also known as quantitative trait loci (QTL) [13,14], to assist in the selection process. The main objective of MAS is to help in the selection of the best candidate genotypes by predicting their phenotypic performance using the most influential genomic variants. Furthermore, this methodology has been shown to be more effective than pedigree-based selection [12][13][14]. However, this method also presents some limitations, especially when the traits are controlled by a large number of genes with small effects (complex traits), such as yield [15], limiting the accuracy of the selection.
To overcome the limitations of MAS, the implementation of an emergent methodology called GS became popular in the last decade in plant and animal breeding applications across species and traits. Conceptually, this method uses the information on all available molecular markers for selection purposes [11]. This methodology was first proposed by Bernardo [16], and later on Meuwissen [17] introduced a new framework to confront the challenge of dealing with a large set of markers (p) and a reduced number of phenotypic records (n) available for model fitting.
GS enables the prediction of the performance of genotypes at early stages of breeding programs using abundant molecular marker information of new/untested genotypes and a relatively small number of genotypes with both phenotypic and genomic information. The predicted values can then be used for selecting the best candidate genotypes that would perform well in advanced phenotyping stages [18]. Another advantage of using GS is that breeders can reduce phenotyping costs by employing predictions as surrogates for phenotypes. At the beginning, GS was used in plant breeding for performing within-environment predictions [18][19][20][21]. In general, breeders establish extensive field evaluations for testing new cultivars in a wide range of environmental conditions in order to release stable genotypes that outperform current elite cultivars [4]. However, different response patterns are usually observed when the same genotypes are tested in different environments, with changes in the relative rankings from one environment to another complicating the selection process [22]. The occurrence of a change in the response patterns is also known as the presence of the genotype-by-environment interaction (G×E).
Several studies have highlighted the impact of accounting for G×E in GS models [23][24][25] when performing predictions in multiple environments. Several applications have been developed to conduct predictions of genotypes in single- and multi-environment settings [26][27][28][29][30][31][32]. However, to our knowledge, no comprehensive implementations/examples showing how genomic prediction pipelines are built have been released. Among the tasks that a GS pipeline comprises are the application of quality control to genomic data, the assignation of training and testing sets for different cross-validation schemes, the construction of the different linear predictors, and the model fitting.
The main objective of this study is to provide an example of how a genomic prediction pipeline is built when considering different cross-validation schemes while preserving comparable sample sizes in training and testing sets. For this, we considered two soybean data sets. The first one (D1) corresponds to a sample of the USDA soybean collection with information on 324 genotypes tested in 6 environments (not all genotypes tested in all environments) and scored for 9 traits. The second data set (D2) corresponds to a sample of the SoyNAM experiment with information on 324 genotypes tested in 6 environments and scored for 6 traits (all genotypes scored for all traits in all environments). The pipeline was built from elemental modules that perform simple tasks, and their implementation is controlled by changes in a parameter input file. Also, the outputs of the early stages of the pipeline become the inputs of more advanced stages, allowing the assembly of complex structures in an easy way. Potentially, users will be able to easily modify and adapt the proposed pipeline to conduct their own data analyses (data sets for a desired set of parameters).

Phenotypic and Genomic Data
In this research, two different soybean data sets were used to show how the pipeline is implemented; these correspond to a sample of the USDA soybean collection and a sample of the SoyNAM experiment.
Data set 1 (D1). Sample of the USDA soybean collection. The USDA soybean collection comprises 14,430 genotypes that were collected in many locations around the world and observed in 4 different locations (states: Illinois, Kentucky, Minnesota, and Missouri) in the USA from 1963 to 2003. Not all the genotypes were observed in all the location-by-year combinations (environments). Further details of the USDA soybean collection can be found in Bandillo [33]. The evaluation of the soybean genotypes in the US locations was conducted gradually. For this reason, the connectivity rate of genotypes across environments is very low. In this study, we conveniently selected a reduced set of 324 genotypes that were observed in 6 environments (MN945, IL945, IL0102, MS989, MS2000_2, and MN0102) and showed moderate levels of connectivity. We selected genotypes that were observed in at least 2 environments and with complete information on all 9 traits (grain yield, plant height in centimeters, lodging 1-5, days to physiological maturity -DysToR8-, oil content, protein content, seed weight of 100 seeds, early shattering 1-5, and stem term score 1-5). Figure 1 illustrates the levels of connectivity of the genotypes across environments for this data set (D1). Out of the total of potential cells (324 × 6 = 1944; all genotypes observed in all environments), only 33.6% (654) of these combinations were observed in fields (vertical gray lines in Figure 1).

Figure 1.
Graphical representation of the allocation of genotypes (x-axis) in environments (y-axis) of 324 soybean genotypes selected from the USDA soybean collection observed in 6 environments. Vertical gray lines represent the genotypes-in-environments combinations that were observed while the white lines correspond to unobserved combinations. The information of the observed combinations (vertical gray lines) is used for establishing training-testing partitions.
Data set 2 (D2). Sample of the SoyNAM project. A random sample of the SoyNAM project was selected for the second data set. The SoyNAM project comprises 40 biparental populations (140 individuals per population) sharing a common hub parent (IA3023) crossed with elite parents (17), plant introductions (8), and parents with exotic ancestry (15). [34,35] provide a detailed description of the SoyNAM data set. Briefly, the resulting 5400 accessions derived from the 40 biparental populations were observed in 18 location × year combinations (environments). In our case, we conveniently selected a random sample of 324 genotypes observed in 6 environments (IA_2012, IL_2011, IL_2012, IN_2012, NE_2011, and NE_2012) and measured for 6 traits (yield, moisture, protein, oil, fiber, and seed size). In this case, all 324 genotypes were observed in all 6 environments and were scored for all 6 traits.

Models
The main objective of this research is to provide a genomic prediction pipeline that is easy to adapt to different realistic scenarios of interest in breeding programs. Here, a set of 3 models was considered to show how to compose the linear predictors to be used in the four different cross-validation schemes. Alternative models can be obtained by changing the assumptions on the model terms. Later we describe how to perform these changes in an easy manner.

M1. E + L
Consider that y_ij represents the performance of the i-th genotype observed in the j-th environment for a given trait (e.g., grain yield). It can be described as the sum of a constant effect common across lines and environments (µ), a fixed effect E_j due to the environmental stimuli of the j-th environment, a random effect L_i corresponding to the i-th line such that L_i ∼ N(0, σ²_L), and a random error term ε_ij capturing the non-explained variability, with ε_ij ∼ N(0, σ²_ε):

y_ij = µ + E_j + L_i + ε_ij.

One disadvantage of this model is that it does not allow the borrowing of information between genotypes, complicating the prediction of untested materials. To overcome this issue, the genomic information of the individuals in the training and testing sets can be leveraged together with the phenotypic data from the observed genotypes (training set) to predict un-phenotyped individuals. Details of this approach are provided in model M2.

M2. E + L + G
In the previous model M1, the L_i term is used to describe the effect of the i-th genotype, and it relies on phenotypic information only. Now, consider that this term can also be described by a linear combination of p molecular markers and their corresponding marker effects, such that g_i = ∑_{k=1}^{p} x_ik b_k, where b_k corresponds to the marker effect of the k-th SNP (x_ik). When the number of molecular markers (p) surpasses the number of data points (n) available for model fitting, it is impossible to obtain a unique solution for the marker effects because it involves the inversion of non-full-rank matrices. In these cases, further assumptions about the marker effects should be considered under the statistical framework. Several alternatives have been proposed to overcome this issue, some of which are based on penalized regressions (ridge regression, LASSO, elastic net, etc.) and Bayesian approaches (Bayesian ridge regression, Bayesian LASSO, BayesA, BayesB, etc.). Meuwissen [17] proposed a set of models for cases where the number of genomic variants (p) is larger than the number of data points (n) available for model fitting. A comprehensive review of the available genomic models to deal with this issue can be found in [11].
In our case, the marker effects were considered independent and identically distributed (IID) outcomes from a normal distribution centered on zero with a common variance σ²_b [36,37], such that b_k ∼ N(0, σ²_b). From results of the multivariate normal distribution, the vector of genomic effects g = {g_i} ∼ N(0, Gσ²_g), where G = XX′/p, X is the standardized (by columns) matrix of marker SNPs, and σ²_g = pσ²_b is the corresponding variance component. To avoid model misspecification due to imperfect genomic data, the L_i term is also included in the model together with the genomic effect g_i. Considering the previous assumptions, the resulting model becomes

y_ij = µ + E_j + L_i + g_i + ε_ij.

An advantage of this model is that it allows the borrowing of information between tested and untested genotypes, permitting the prediction of materials yet to be observed. However, a disadvantage is that it returns the same genomic value g_i for the i-th genotype across environments. To allow genotype-specific genomic values in each environment, the reaction norm model [25] was also implemented. This model decomposes the genomic effect as the sum of a common effect (intercept) of the genotypes across environments plus specific effects (slopes) for each environment. Further details of this model are provided next.
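The construction of the genomic relationship matrix G = XX′/p from the standardized marker matrix can be sketched as follows. This is a minimal illustration in Python (the pipeline's own modules are written in R), and the function name and toy data are hypothetical:

```python
import numpy as np

def genomic_relationship(X):
    """Compute G = Xs Xs' / p from an n x p matrix of SNP codes (0/1/2),
    where Xs is X with each marker column centered and scaled."""
    Xc = X - X.mean(axis=0)        # center each marker column
    Xs = Xc / Xc.std(axis=0)       # scale each marker column to unit variance
    p = X.shape[1]
    return Xs @ Xs.T / p

# Toy example: 4 genotypes scored at 3 SNPs
X = np.array([[0, 1, 2],
              [1, 1, 0],
              [2, 0, 1],
              [0, 2, 2]], dtype=float)
G = genomic_relationship(X)
```

With column-standardized markers, G is symmetric and its diagonal averages to 1, which is the usual scaling for a kinship matrix.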

M3. E + L + G + G×E
Consider the inclusion of the gE_ij model term to describe the specific response of the i-th genotype in the j-th environment (g_i E_j). Jarquin [25] proposed to model the vector of genomic effects in interaction with environments via covariance structures as

gE ∼ N(0, (Z_g G Z_g′) ◦ (Z_E Z_E′) σ²_gE),

where Z_g and Z_E are the corresponding incidence matrices that connect phenotypes with genotypes and environments, respectively, and "◦" represents the cell-by-cell product between two matrices, also known as the Hadamard or Schur product. The resulting model becomes

y_ij = µ + E_j + L_i + g_i + gE_ij + ε_ij.
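The covariance of the interaction term above can be computed directly from the two incidence matrices and G. The following Python sketch (names and toy dimensions are hypothetical; the pipeline itself uses R) shows the Hadamard product of the two expanded structures:

```python
import numpy as np

def interaction_covariance(Zg, Ze, G):
    """Covariance structure of the G x E term in the reaction norm model:
    (Zg G Zg') * (Ze Ze'), where '*' is the Hadamard (cell-by-cell) product."""
    return (Zg @ G @ Zg.T) * (Ze @ Ze.T)

# Toy example: 2 genotypes x 2 environments, 4 phenotypic records
# Record order: (g1,e1), (g2,e1), (g1,e2), (g2,e2)
Zg = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], float)  # records -> genotypes
Ze = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], float)  # records -> environments
G = np.array([[1.0, 0.2], [0.2, 1.0]])                  # genomic relationships
K_ge = interaction_covariance(Zg, Ze, G)
```

The (Ze Ze′) factor zeroes out all covariances between records from different environments, so genomic relatedness only contributes to the interaction term within each environment.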

Cross-Validation Schemes
To assess the ability of the different prediction models to deliver accurate results, four prediction scenarios that are of interest for breeders were considered. These prediction scenarios attempt to mimic different realistic prediction problems that breeders might face [38] at different stages of the breeding pipeline. Figure 2 presents the four different cross-validation scenarios using as an example a hypothetical population of 48 genotypes to be observed in 6 environments. The different colors (vertical lines) correspond to a fivefold assignation of either phenotypes (CV2 and CV0) or genotypes (CV1 and CV00).
CV2 (tested genotypes in observed environments) corresponds to the scenario of predicting incomplete field trials where some genotypes have been observed in some environments but not in others. In this case, the genotypes of interest have probably been observed in other environments, and other genotypes have already been observed in the environment(s) of interest. A random fivefold assignation, represented with different colors (black, gray, red, yellow, and blue) in the top left panel of Figure 2, was considered.
Here the phenotypic data were randomly assigned to each of the five folds (colors), trying to maintain folds of similar size (~20% of the observations). Then four folds are employed for model calibration when predicting the remaining fold, and this procedure is sequentially repeated for all five folds (one at a time). CV1 (untested genotypes in observed environments) mimics the scenario of predicting genotypes that have not yet been observed in any environment; the goal is to predict the performance of these genotypes in environments where other genotypes were already observed. Here, a fivefold cross-validation was implemented by assigning around 20% of the genotypes to each fold (see bottom left panel in Figure 2) such that all the phenotypic records of each genotype across environments are assigned to the same fold (color), avoiding phenotypes of the same genotype appearing in different folds. In the bottom left panel of Figure 2, the phenotypes of the 48 genotypes have the same color across environments (horizontal lines), and genotypes with the same color belong to the same fold. As before, four folds were considered for model training when predicting the remaining fold. This prediction procedure is sequentially repeated for each of the five folds (one at a time).
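The difference between the two assignations (records to folds for CV2, whole genotypes to folds for CV1) can be sketched as follows. This is an illustrative Python sketch (the pipeline's modules are written in R); function names and the toy data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed, as in the pipeline's randomization

def assign_cv2(n_records, k=5):
    """CV2: phenotypic records are shuffled into k folds of near-equal size."""
    folds = np.resize(np.arange(k), n_records)  # balanced fold labels 0..k-1
    rng.shuffle(folds)
    return folds

def assign_cv1(genotype_ids, k=5):
    """CV1: whole genotypes are assigned to folds, so every record of a
    genotype (across all environments) shares the same fold label."""
    geno = np.unique(genotype_ids)
    labels = np.resize(np.arange(k), geno.size)
    rng.shuffle(labels)
    lookup = dict(zip(geno, labels))
    return np.array([lookup[g] for g in genotype_ids])

# Toy example: 6 genotypes each observed in 3 environments (18 records)
gids = np.repeat(np.arange(6), 3)
cv2_folds = assign_cv2(len(gids))
cv1_folds = assign_cv1(gids)
```

Under CV2 a genotype's records may be split across folds, while under CV1 every record of a genotype carries a single fold label, which is what prevents information leakage when predicting untested genotypes.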
CV0 (tested genotypes in unobserved environments) represents the scenario of predicting the mean performance of genotypes in hypothetical unobserved environments. It considers phenotypic information from the same and other genotypes observed in other environments (training set). In this case, the conventional prediction procedure consists of leaving one environment out and then using the remaining environments for model calibration when predicting the excluded environment. This procedure is sequentially repeated for each environment (one at a time). However, in our case we introduced an alternative way to conduct the prediction of unobserved environments in an attempt to preserve training and testing sample sizes similar to those of the previous schemes. In this way, it is possible to compare the results of the different cross-validation scenarios with similar sample sizes. The top right panel of Figure 2 illustrates an example that considers the prediction of the genotypes in gray color (horizontal lines) in environment 3. In this case, information is available for the same genotypes but observed in the remaining 5 environments. Here, the same fold assignation as in CV2 was preserved, such that it is possible to conduct a direct comparison of the predictive ability accomplished between these two cross-validation scenarios. The prediction procedure consists of sequentially predicting each of the five folds in each environment (one at a time). This procedure is repeated for each environment (one at a time).

Figure 2.
Graphical representation of four cross-validation schemes (CV2, predicting tested genotypes in observed environments; CV1, predicting untested genotypes in observed environments; CV0, predicting tested genotypes in unobserved environments; and CV00, predicting untested genotypes in unobserved environments) that preserve comparable training and testing set sizes. The colored horizontal lines correspond to a fivefold assignation of phenotypes (CV2; top left) and genotypes (CV1; bottom left) to compose training and testing sets (e.g., for the training set 4 folds are considered: black, blue, red, and yellow horizontal lines; while for the testing set only one fold is considered: gray lines). For CV0 (top right) and CV00 (bottom right), the genotypes represented with the horizontal gray lines in environment 3 (Env 3) correspond to the target prediction set. In addition, for CV00 the horizontal white lines in the remaining environments (1-2, 4-6) correspond to missing phenotypic information of the target genotypes in the other environments. Similar volumes of information for model training are employed for predicting comparable testing set sizes.

CV00 (untested genotypes in unobserved environments) corresponds to the case of predicting new genotypes in novel environments.
The conventional method for predicting untested genotypes in unobserved environments consists of discarding from the training set the phenotypic records of the genotypes in the target environment (testing set) and then predicting the performance of those genotypes in the novel environment using the phenotypic information available in the calibration environments. However, this procedure poses extra challenges as the number of common genotypes across environments increases. In our hypothetical example in Figure 2, all the genotypes were observed in all environments. Thus, if the phenotypic information for those genotypes is discarded across environments, no phenotypic records would be available for model calibration. However, with the proposed scheme (see bottom right panel in Figure 2), only the phenotypic information of the genotypes in the gray fold is deleted across environments. Then, the information of the remaining folds (colors) in the other environments is used for model calibration when predicting the gray fold in environment 3. The same procedure is repeated for the remaining four folds (one at a time) in environment 3, and then for the other environments (one at a time).
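The masking step of the proposed CV00 scheme can be sketched as follows. This Python sketch (names and the toy layout are hypothetical; the pipeline itself is in R) hides the target fold's genotypes in every environment and marks the testing cells in the target environment only:

```python
import numpy as np

def mask_cv00(y, env, fold, target_env, target_fold):
    """CV00: phenotypes of the target fold's genotypes are masked (set to NaN)
    in every environment, so those genotypes become untested; the testing set
    is that same fold within the target environment only."""
    y_train = y.astype(float)
    in_fold = fold == target_fold
    y_train[in_fold] = np.nan          # remove the fold's genotypes everywhere
    test_idx = np.where(in_fold & (env == target_env))[0]
    return y_train, test_idx

# Toy example: 4 genotypes x 2 environments, CV1-style fold labels per genotype
env = np.array([1, 1, 1, 1, 2, 2, 2, 2])
fold = np.array([0, 0, 1, 1, 0, 0, 1, 1])  # genotype-level folds repeated per env
y = np.arange(8, dtype=float)
y_train, test_idx = mask_cv00(y, env, fold, target_env=2, target_fold=0)
```

Records of the non-target folds in the other environments remain available for calibration, which is what keeps the training set size comparable to CV1 and CV2.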

Assessment of Predictive Ability within and across Environments
For all cross-validation schemes, the predictive ability was assessed as the within-environment correlation between predicted and observed values. The average correlation across environments was computed according to [39], accounting for the uncertainty and the sample size of the environments, as

r̄ = ∑_i w_i r_i / ∑_i w_i, with w_i = 1/V(r_i),

where r_i is the Pearson's correlation between predicted and observed values in the i-th environment, V(r_i) = (1 − r_i²)/(n_i − 2) corresponds to its sampling variance, and n_i is the number of observations in the i-th environment.
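Reading the description above as an inverse-variance weighted mean (an assumption on our part; [39] should be consulted for the exact estimator), the computation can be sketched in Python as:

```python
import numpy as np

def weighted_mean_correlation(r, n):
    """Average within-environment correlations, weighting each r_i by the
    inverse of its sampling variance V(r_i) = (1 - r_i^2) / (n_i - 2)."""
    r = np.asarray(r, float)
    n = np.asarray(n, float)
    v = (1.0 - r**2) / (n - 2.0)   # sampling variance of each correlation
    w = 1.0 / v                    # inverse-variance weights
    return np.sum(w * r) / np.sum(w)

# Correlations of 0.5 and 0.3 from environments with 50 and 20 records
r_bar = weighted_mean_correlation([0.5, 0.3], [50, 20])
```

Environments with more observations (and correlations estimated more precisely) pull the average harder, so here the result sits closer to 0.5 than a plain mean would.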

Variance Components
To assess the relative importance of the different model terms in each of the prediction models, a full data analysis (i.e., no missing values) was conducted to compute the corresponding variance components. Then, the percentage of variability explained by the z-th model term is computed as the ratio between the corresponding variance component and the sum of all the variance components of the model, times 100:

%VE_z = 100 × σ²_z / ∑_m σ²_m.

The percentage of explained variability was computed for each model and each trait.
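This ratio is a one-liner per term; the sketch below (in Python, with hypothetical variance-component values) makes the computation explicit:

```python
def percent_variance_explained(var_components):
    """Share of total variability captured by each model term:
    100 * sigma2_z / sum over all terms of sigma2."""
    total = sum(var_components.values())
    return {term: 100.0 * v / total for term, v in var_components.items()}

# Hypothetical variance components for model M3 (E + L + G + GxE + error)
pct = percent_variance_explained(
    {"E": 4.0, "L": 1.0, "G": 2.0, "GxE": 1.0, "error": 2.0}
)
```

By construction the percentages across all terms (including the residual) sum to 100 for each model-trait combination.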

Modules
One of the principal objectives of this manuscript is to provide a template for developing genomic prediction pipelines in an easy way. First, the different modules used for assembling the pipeline are presented. These modules work in such a way that their outputs become the inputs of other modules at more advanced stages of the pipeline. The pipeline presented here covers all stages, from the initial data sets to the prediction stage. Two examples of these pipelines are provided in the supplementary section.
The structure of the different modules is as follows: all of them comprise three folders, "code", "input", and "output". The "code" folder contains the "mainCode.R" script, which performs a specific routine depending on the module. The "input" folder contains a parameter file, "parameters.R", where the inputs of the module are specified. During the implementation of the different modules, this is the only file that is modified according to the different conditions (parameters) to consider in the routines. The "output" folder is used for storing the results, such that the outputs derived from these modules (routines) in early stages are used as the corresponding inputs at more advanced stages of the pipeline.
Initially, the phenotypic and genomic data are stored in a common repository/directory called "1.root.Data". The modules then refer to these data sets at the different stages of the pipeline. It is assumed that the genomic and phenotypic information (including missing values) is available for the same genotypes in both data sets. In our case, "Pheno.csv" and "SNPs.rda" correspond to the data sets containing the phenotypic and the molecular marker data. All of the modules used in our pipeline can be found in the 'modules' folder in the supplementary section. For assembling the pipeline at the different stages, these modules should be copied and only the "parameters.R" file should be modified.

SplittingSNPs: Module for Applying Quality Control on Marker Data
The "SplittingSNPs" module performs quality control based on the missingness of the marker data by discarding molecular markers that exceed a given proportion of missing values (PMV). In the "parameters.R" file, the path (mm.file) to the marker data set "SNPs.rda" and the highest proportion of missing values (NaN.freq.i) to tolerate for including a marker SNP in the analysis should be specified. After applying the QC, the resulting set of marker SNPs ("X.csv") is stored in the "output" folder.
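The filtering rule applied by this module can be sketched as follows (an illustrative Python version of the missingness QC; the module itself is R code, and the names here are hypothetical):

```python
import numpy as np

def filter_by_missingness(X, max_prop_missing):
    """Drop marker columns whose proportion of missing calls (NaN) exceeds
    the tolerated threshold, mirroring the missingness QC described above."""
    prop_nan = np.isnan(X).mean(axis=0)   # per-marker proportion of missing calls
    keep = prop_nan <= max_prop_missing
    return X[:, keep], keep

# 4 genotypes x 3 SNPs; the middle SNP is missing in half the genotypes
X = np.array([[0, np.nan, 2],
              [1, np.nan, 0],
              [2, 0, 1],
              [0, 1, 2]], float)
X_qc, kept = filter_by_missingness(X, max_prop_missing=0.25)
```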

Gmatrix: Module for Constructing the Covariance Matrices Using Genomic and Environmental Factors
The "Gmatrix" module is used for constructing the relationship matrices between pairs of genotypes and between pairs of environments using the matrix of marker SNPs or the name (or ID) of the environments, respectively. In the "parameters.R" file, the path to the matrix of phenotypes (phenotype.file) and the path to the matrix of molecular markers (mm.file) should be specified for computing the kinship matrix using genomic data. If the value of "mm.file" is declared as "NULL", the covariance structure between genotypes or between environments is computed using the incidence matrices only. The minimum allele frequency to tolerate for including markers in the analysis is specified with the "prop.MAF.j" option. The "colIDy" parameter indicates the column in the matrix of phenotypes that will be used to link genotypes with marker data or phenotypes with environments. The outputs of this module are stored in the "output" folder; these are the resulting kinship matrix (G.rda) and its corresponding eigenvalue decomposition (EVD.rda), which are needed to compute more elaborate model terms and to set up the linear predictor, respectively.
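The minor-allele-frequency filter controlled by "prop.MAF.j" can be sketched as follows. This Python sketch assumes 0/1/2 allele-dosage coding (so a column's allele frequency is its mean divided by 2); the function name and toy data are hypothetical:

```python
import numpy as np

def filter_by_maf(X, min_maf):
    """Keep markers whose minor allele frequency meets the threshold.
    X holds 0/1/2 allele-dosage codes, so the allele frequency of a
    column is its mean divided by 2."""
    p = X.mean(axis=0) / 2.0
    maf = np.minimum(p, 1.0 - p)   # minor allele frequency per marker
    keep = maf >= min_maf
    return X[:, keep], keep

# 4 genotypes x 3 SNPs; the last SNP is nearly monomorphic (MAF = 0.125)
X = np.array([[0, 1, 0],
              [1, 2, 0],
              [2, 1, 0],
              [1, 2, 1]], float)
X_qc, kept = filter_by_maf(X, min_maf=0.2)
```

Dropping near-monomorphic markers avoids the near-zero variances that would otherwise destabilize the column standardization used to build G.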

Z: Module for Constructing Incidence Matrices for Genotypes and Environments
The "Z" module is used to include the main effects of genotypes or environments. In the "parameters.R" file in the "input" folder, the path to the matrix containing the phenotypic information (phenotype.file) and the column (colIDy) containing the IDs of the genotypes or environments should be specified. In the "output" folder, the resulting incidence matrix "Z.rda" is stored, as well as a graphical representation of the distribution of phenotypes across genotypes or environments (exp.des.pdf).
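An incidence matrix of this kind simply places a 1 linking each phenotypic record to its genotype or environment level. A minimal Python sketch (hypothetical names; the module itself is R code):

```python
import numpy as np

def incidence_matrix(ids):
    """Build a 0/1 incidence matrix linking each phenotypic record (row)
    to its genotype or environment level (column)."""
    levels = sorted(set(ids))
    Z = np.zeros((len(ids), len(levels)))
    for row, ident in enumerate(ids):
        Z[row, levels.index(ident)] = 1.0
    return Z, levels

# Records from three environments, unbalanced
Z, levels = incidence_matrix(["IL_2011", "IL_2012", "IL_2011", "NE_2011"])
```

Each row sums to 1 (every record belongs to exactly one level), and the column sums recover the number of records per level, which is what the module's exp.des.pdf plot summarizes graphically.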

Imatrix: Module for Constructing the Interaction Matrix between Markers and Environments
The "Imatrix" module is used to compute the Hadamard product (or cell-by-cell product) between two covariance structures. The resulting matrix is needed for including the interaction between molecular markers and environments [25]. In the "parameters.R" file in the "input" folder, the paths to the resulting matrices of the two factors (G1.file and G2.file) to be considered in the interaction (G and E in our case) should be specified. The "output" folder will contain the resulting covariance matrix (G.rda) and its corresponding eigenvalue decomposition (EVD.rda).
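In R the Hadamard product is simply `*` between two conformable matrices. The toy sketch below, with hypothetical incidence and covariance matrices, shows how the marker-based and environment-based covariances (each expanded to the phenotypic records via their incidence matrices) combine into the G×E covariance.

```r
ZG <- kronecker(rep(1, 2), diag(3))      # 6 records: 3 genotypes in 2 environments
ZE <- kronecker(diag(2), rep(1, 3))
G  <- matrix(c(1, .5, .2,
               .5, 1, .3,
               .2, .3, 1), nrow = 3)     # toy genomic relationship matrix
E  <- diag(2)                            # identity covariance between environments

GxE <- (ZG %*% G %*% t(ZG)) * (ZE %*% E %*% t(ZE))  # Hadamard product
EVD <- eigen(GxE)
# save(GxE, file = "output/G.rda"); save(EVD, file = "output/EVD.rda")
```

With an identity E, the Hadamard product zeroes out the covariance between records from different environments, leaving a block-diagonal matrix with one copy of G per environment.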

Preparing.CV1.CV2: Module for Assigning Genotypes and Phenotypes to Folds
This module assigns phenotypes/genotypes to training-testing sets for the CV1 and CV2 cross-validation schemes. In the "parameters.R" file in the "input" folder, the path to the matrix of phenotypes (phenotype.file), the number of folds (folds) to consider in the cross-validation, the column in the matrix of phenotypes (colIDy) that contains the names (IDs) of the genotypes, and the type of cross-validation (CV1, CV2, or both) should be specified. It is also possible to fix the seed value used in the randomization; it should vary among the different replicates of the training-testing assignation. The resulting matrix (Y.csv) is stored in the "output" folder. This matrix is identical to the initial matrix of phenotypes except that two columns containing the fold assignation are added at the end.
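The key difference between the two schemes is the sampling unit: CV2 assigns individual records to folds, while CV1 assigns whole genotypes, so all records of a genotype land in the same fold. A hedged sketch (hypothetical column names) of this assignation:

```r
set.seed(123)                                   # seed varies among replicates
Y <- data.frame(env  = rep(c("E1", "E2", "E3"), each = 10),
                line = rep(paste0("L", 1:10), times = 3))
folds <- 5

# CV2: phenotypic records assigned at random to folds
Y$fold.CV2 <- sample(rep_len(1:folds, nrow(Y)))

# CV1: each genotype, with all of its records, assigned to one fold
lines   <- unique(Y$line)
line.fd <- setNames(sample(rep_len(1:folds, length(lines))), lines)
Y$fold.CV1 <- line.fd[Y$line]
# write.csv(Y, "output/Y.csv", row.names = FALSE)
```

Under CV1, masking a fold therefore removes every observation of the genotypes in that fold, which is what makes them "untested" at prediction time.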

Preparing.CV0 and CV00 Module
Since one of the goals of this research is to provide a cross-validation scheme that preserves comparable sample sizes across different cross-validation schemes, the results of the CV1 and CV2 assignations are used as inputs for the CV0 and CV00 assignations. For the CV0 scheme, the matrix resulting from the CV2 assignation in the previous module acts as the input argument. In the "parameters.R" file in the "input" folder, the path (phenotype.file) to the output file (Y.csv) derived from the "Preparing.CV2.CV1" module, the number of folds (the same number of folds as in the previous output), the column that contains the phenotypic information (colPhen), and the column that contains the fold assignation (colFolds) for the CV2 scheme are specified. The "output" folder will contain the resulting matrix of phenotypes. In this case, depending on the number of folds, the same number of extra columns is added, masking as missing values those phenotypes belonging to the different folds (one column for each fold). For the CV00 assignation, a similar procedure is performed, but the column with the fold assignation for the CV1 scheme is used instead. The resulting matrix of phenotypes is also stored in the "output" folder.
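The masking step described above can be sketched as follows: starting from the fold column produced for CV2 (or CV1), one extra phenotype column per fold is added, with the records of that fold set to NA. The data frame and column names here are hypothetical stand-ins for the Y.csv output.

```r
set.seed(42)
Y <- data.frame(yield = rnorm(30),
                fold  = sample(rep_len(1:5, 30)))  # CV2 (or CV1) fold assignation
colPhen <- 1; colFolds <- 2; folds <- 5

for (k in 1:folds) {
  yk <- Y[, colPhen]
  yk[Y[, colFolds] == k] <- NA          # mask fold k as the testing set
  Y[[paste0("y.fold", k)]] <- yk        # one masked phenotype column per fold
}
```

Because the folds come from the CV2/CV1 assignation, each masked column leaves a training set of the same size as in those schemes, which is what keeps the sample sizes comparable across CV2/CV0 and CV1/CV00.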

Fitting.Models: Module for Performing the Predictions of the Missing Values and Compute the Variance Components
This module is used for fitting the models, performing the predictions of missing values, and computing the variance components. In the "parameters.R" file in the "input" folder, the path to the matrix of phenotypes (phenotype.file) and the IDs of the partitions (folds) to be predicted (e.g., the IDs of the folds [1, 2, 3, 4, and 5] or the IDs of the environments [CV0 and CV00]) are specified. Since the BGLR [26,31] R package [40], which is based on the Bayesian framework, was used for model fitting, it is also necessary to specify the number of iterations (nIter) for the Gibbs sampler and the number of iterations to be discarded as burn-in (burnIn). Then the linear predictor is built by providing the different model terms and their corresponding assumptions.
For this, a list is started with "AB <- list()" to hold the different model matrices created in previous stages, such that the i-th element of the list, "AB[[i]] <-", corresponds to the i-th model term. It is also necessary to specify the type of each effect: "FIXED" for a fixed effect, "BRR" for a random main effect (RR-BLUP), or "RKHS" for the GBLUP model. In addition, the column numbers in the matrix of phenotypes that contain the IDs of the genotypes (colVAR), the phenotypic information (colPhen), and the different training-testing partitions (colCV, folds or environments) must be provided, and the seed should be fixed ("set.seed(i)") so that the Gibbs sampler replicates exactly the same results.
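As an illustration of how such a linear predictor for model M3 (E + L + G + G×E) might be assembled, the sketch below builds the AB list with tiny stand-in matrices for the objects produced in earlier stages; the final BGLR call is shown commented out so the sketch stays self-contained.

```r
set.seed(7)
ZE <- kronecker(diag(2), rep(1, 6))               # environments (fixed effect)
ZL <- kronecker(rep(1, 2), diag(6))               # lines (random, BRR)
G  <- tcrossprod(matrix(rnorm(6 * 20), 6)) / 20   # toy genomic relationship

EVD.G   <- eigen(ZL %*% G %*% t(ZL))              # main effect of markers
EVD.GxE <- eigen((ZL %*% G %*% t(ZL)) * tcrossprod(ZE))  # G x E interaction

AB <- list()
AB[[1]] <- list(X = ZE, model = "FIXED")                               # E
AB[[2]] <- list(X = ZL, model = "BRR")                                 # L
AB[[3]] <- list(V = EVD.G$vectors,   d = EVD.G$values,   model = "RKHS")  # G
AB[[4]] <- list(V = EVD.GxE$vectors, d = EVD.GxE$values, model = "RKHS")  # GxE
# set.seed(i)                                     # same seed per replicate
# fm <- BGLR::BGLR(y = y, ETA = AB, nIter = 12000, burnIn = 2000)
```

Passing the eigenvectors (V) and eigenvalues (d) to the "RKHS" terms is why the earlier modules store EVD.rda alongside each covariance matrix.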

Pipeline
Each of the different stages (2-6) of the pipeline is built using the modules stored in the repository folder. For this, it is necessary to copy them into the corresponding stages and only modify the "parameters.R" file in the "input" folder. The pipeline starts with the "1.root.Data" folder, where the files with phenotypic (Phenos) and genomic information (SNPs) are stored. The next stage, corresponding to the "2.splitingSNPs" folder, implements the quality control (QC) on the genomic data based on missing values. Here, the path to the matrix of marker SNPs and the PMV should be provided; the resulting matrix "X.csv" is stored in the "output" folder. The next stages compute the different model terms. The main and interaction effects are stored in folders "3.Gmatrices" and "4.Imatrices", respectively. In folder "3.Gmatrices", the covariance matrices for "G" and "E" are computed using the marker data and the IDs of the environments. In addition, the incidence matrices that connect phenotypes with genotypes "ZL" and with environments "ZE" are computed. In the "4.Imatrices" folder, the outputs of the covariance matrices "G" and "E" are used to compute the interaction matrix "G×E".
The 5th stage corresponds to the training-testing assignation and is divided into three sections. In the 5.1.CV2.CV1 section, for each replicate (1-10) the folds (5) are assigned for the cross-validation schemes CV1 (predicting new genotypes in observed environments) and CV2 (incomplete field trials). The resulting matrices are stored in the corresponding "output" folder. The configurations for the CV0 and CV00 cross-validations are very similar. Under CV0, in the folder "5.2.CV0", for each trait × replicate combination the training-testing assignation was performed by masking as missing values the corresponding observations derived from the CV2 scheme, while for CV00 the corresponding observations derived from the CV1 scheme were used.
The 6th stage corresponds to the prediction of missing values and also comprises three sections, corresponding to the three assignation schemes of the previous stage. Here, the "Fitting.Models" module stored in the "modules" folder was implemented. For CV1 and CV2, the prediction of the missing values is conducted for each trait × model × replicate combination, and the results are stored in the "output" folder according to the different folds (1-5). The model fitting for the CV0 and CV00 schemes was performed for each trait × model × replicate × fold combination, and the resulting predicted values are stored in the "output" folder.
Finally, the 7th stage considers the computation of the variance components. For this, the same "Fitting.Models" module as in the previous stage was implemented for each trait × model combination by performing a full-data analysis (i.e., no missing values were considered). Here, it is necessary to assign "−999" to the "folds" parameter in the "parameters.R" file. The resulting file "fm_full.RData", stored in the "output" folder, contains the obtained variance components among other objects.

Results
Since one of the main objectives of this manuscript is to provide a template for implementing genomic prediction pipelines in an easy way, the obtained results are only briefly described for both data sets. The supplementary materials section contains the full pipelines for data sets D1 and D2. The only difference between these two pipelines is that for data set D1 the main effect of the environments was considered as fixed, while for data set D2 it was treated as random. This was done to show the flexibility of the pipeline for considering different assumptions on the model terms.

D1: Sample of the USDA Collection
Percentage of variability explained by the different model terms. Table 1 presents, for each trait and model, the percentage of variability explained by each model term. For grain yield, with model M1 (E + L) the main effect of the environments (E) explains 55.7% of the total variability, while the line effect (L) captures 25.2% and the residual term (R) 19.2%. The main effect of markers (G) introduced in M2 (E + L + G) captured 14.2% of the variability, and the residual variance (R) increased to 25.5% compared with M1 (19.2%). The inclusion of the interaction between markers and environments (G×E) in M3 captured 9.0% of the total variability, with the residual term (R) at only 17.6%. Similar trends were observed for the remaining 8 traits. In general, for all traits, the model that includes the interaction between markers and environments (M3) returned the lowest residual variance. Also, as expected, as the different model terms were added to the linear predictor, the variability explained by the environmental term (E) was reduced.

Table 1. Percentage of explained variability for each model term for all traits (9) using phenotypic and genomic data from the USDA soybean collection for 324 genotypes observed in 6 environments. Three models were considered, constructed with the following components: E, main effect of environments; L, main effect of genotypes; G, main effect of markers; and G×E, the interaction between genotypes and environments using marker data.

Prediction Accuracy
Table 2 presents the mean (10 replicates) average correlation for 4 cross-validation schemes (CV2, CV1, CV0, and CV00) and 3 models (M1: E + L, M2: E + L + G, and M3: E + L + G + G×E). Under CV2, for grain yield the models M1, M2, and M3 returned a mean average correlation of 0.576, 0.670, and 0.718, respectively. For CV1, the models M1-M3 returned a mean average correlation of −0.121, 0.635, and 0.662, while under CV0 the respective values for these three models were 0.114, 0.135, and 0.163, and for CV00, −0.010, 0.088, and 0.114. The predictive ability in the CV2, CV1, CV0, and CV00 schemes benefited from including the interaction between marker genotypes and environments with model M3. Similar trends were observed for the remaining traits.

D2: Sample of the SoyNAM
Percentage of variability explained by the different model terms. Table 3 presents, for each trait and model, the percentage of variability explained by each model term. For grain yield, under model M1 (E + L) the main effect of the environments (E) explained 68.0% of the total variability, while the line effect (L) captured 7.8% and the residual term (R) 24.2%. When the main effect of the markers (G) was included with M2, it captured 5% of the phenotypic variability and the residual term (R) 26.3%. The genotype-by-environment interaction (G×E) from M3 captured 7.2% of the variability, and the residual term (R) accounted for 20.4%. Also, for all traits the model that included the interaction term (M3) returned the lowest unexplained variability captured by the residual term (R). As with the previous data set (D1), as the different model terms were added, the variability explained by the environmental term (E) was reduced.

Table 2. Average mean (10 replicates) correlation between predicted and observed values for four cross-validation scenarios, three models (M1: E + L, M2: E + L + G, and M3: E + L + G + G×E), and 9 traits from a sample of the USDA soybean collection comprised of 324 genotypes observed in 6 environments (not all genotypes in all environments).

Table 3. Percentage of explained variability for each model term for all traits (6) using phenotypic and genomic data from the SoyNAM experiment for 324 genotypes observed in 6 environments. Three models were considered, constructed with the following components: E, main effect of environments; L, main effect of genotypes; G, main effect of markers; and G×E, the interaction between genotypes and environments using marker data.

Prediction Accuracy
Table 4 presents the mean (10 replicates) average correlation for 4 cross-validation schemes (CV2, CV1, CV0, and CV00) and 3 models (M1: E + L, M2: E + L + G, and M3: E + L + G + G×E). Under CV2, for grain yield the models M1, M2, and M3 returned a mean average correlation of 0.342, 0.380, and 0.446, respectively. For CV1, the models M1-M3 returned a mean average correlation of −0.15, 0.296, and 0.373, while under CV0 the respective values for these three models were 0.197, 0.234, and 0.210, and for CV00, −0.014, 0.182, and 0.160. As expected, the predictive ability under the CV2 and CV1 schemes was slightly improved by including the interaction between marker genotypes and environments with model M3. However, for the remaining schemes (CV0 and CV00) the correlation between predicted and observed values was slightly reduced by the inclusion of the interaction effect (M3). Similar trends were observed for the remaining traits across all cross-validation schemes, with only a marginal improvement for oil content (0.647) under CV0.

Table 4. Average mean (10 replicates) correlation between predicted and observed values for four cross-validation scenarios, three models (M1: E + L, M2: E + L + G, and M3: E + L + G + G×E), and 6 traits from a sample of the SoyNAM experiment comprised of 324 genotypes observed in 6 environments (all genotypes in all environments).

Data Analysis
The main objective of this manuscript is to provide a template of a genomic prediction pipeline that is easy to implement, modify, and adapt to other data sets. For this reason, the results derived from these pipelines are only briefly discussed, focusing instead on the proposed implementation. These two data sets (D1: USDA soybean collection, and D2: SoyNAM) have already been studied in several manuscripts [41,42]. The results from our study (percentage of variability explained by the different model terms and prediction accuracy) are in line with the results of those studies.
In general, for both data sets (D1 and D2) and for all traits (15 = 9 + 6), the inclusion of the interaction term helped to reduce the percentage of variability explained by the environmental component. It also helped to decrease the unexplained variability captured by the residual term. However, while for grain yield in D1 the variability explained by the environmental term varied between 51% and 55.7%, for D2 it ranged between 63.8% and 68%, around 10% more variability captured by the environmental component in D2. A similar trend was observed for the residual term, whose unexplained variability ranged between 19.2% and 17.6% for D1 and between 24.2% and 20.4% for D2. Thus, there was less unexplained variability in D1, which potentially contributed to the higher correlations between predicted and observed values for this data set compared with D2.
Regarding the predictive ability, across all models, traits, and cross-validation schemes, different results were obtained for the two data sets. Under the CV2 and CV1 schemes, the correlation between predicted and observed values using molecular marker information (M2 and M3) was considerably higher for grain yield in D1 (0.670-0.718) than in D2 (0.380-0.446), while for protein and oil content the results were comparable in both data sets, ranging between 0.757 and 0.832 for D1 and between 0.657 and 0.737 for D2. Under the CV0 scheme, for D1 the M3 model outperformed the main effects model M2 in 8 out of the 9 traits (all except protein content), while for D2 the M3 model was superior to M2 only for oil content. Regarding CV00, mixed results were observed: for D1 the M3 model slightly outperformed M2 in 4 out of the 9 traits (grain yield, days to maturity, oil content, and seed weight-100), while for D2 the M2 model outperformed M3 for all 6 traits. Perhaps the larger genetic diversity in D1 helped to increase the predictive ability of the genomic models (M2 and M3) with respect to D2, where all genotypes were observed in all environments.

Flexibility of the Pipeline
There are many implementations for conducting genomic prediction studies, such as BGLR [26,31], rrBLUP [30], asreml-R [27,28], sommer [29], BWGS [43], and bWGR [32]. However, to our knowledge, no comprehensive examples of genomic prediction pipelines are available for conducting exhaustive studies that consider multiple factors, for example, providing full flexibility for selecting markers by applying quality control, different cross-validation schemes, model terms, and model development (different types of effects and assumptions on these). The pipeline presented here is based on a collection of modules that perform all the tasks required to conduct genomic prediction studies.

Potential Extensions of the Current Pipeline
The modules described here allow the construction of more elaborate pipelines in an easy way. For example, they can be used to find the optimal quality control [44,45] by considering different combinations of percentage of missing values (PMV) and minor allele frequency (MAF), to compose training and testing sets in different ways [4,41], to explore different features for sparse testing designs [46], to study the distribution of the variance components by considering random sets of molecular markers [47], to include weather data [38], to leverage the information of correlated traits [48], and to predict the performance of hybrid crosses using genomic inbred data [38], among other studies.

Conclusions
GS is a widely adopted methodology in plant and animal breeding programs, and conceptually its implementation is easy to follow. Initially, it requires a set of genomic and phenotypic data for calibrating a model that predicts the trait performance of candidate genotypes in target environments. However, in order to achieve the highest accuracy between predicted and observed values, many factors should be assessed through cross-validation studies to support the correct implementation of GS in real prediction problems.
For this reason, factors such as the quality control on marker covariates, suitable cross-validation schemes mimicking real prediction scenarios, and the choice of the prediction model, among others, should be evaluated. In most cases, the evaluation of these factors corresponds to minimal variations in the set of parameters evaluated in the pipeline. Thus, there is no need to restart the analyses from scratch when modifying the levels of the parameter(s) of interest: code already developed at different stages of the pipeline can be reused, and simple modifications are enough to adapt it to particular cases.
In this study, we provide a set of modules that can be easily assembled to build complex prediction pipelines where the outputs of the early stages become the inputs of the more advanced ones. One feature of the proposed modules is that they can be used in a black-box fashion: the specifics of the different analyses are controlled with a parameter file, and there is no need to modify the main script (mainCode). With respect to the different cross-validation schemes, we provide a novel framework that allows similar sample sizes in calibration and prediction sets, such that the results of the different prediction scenarios can be directly contrasted.
Finally, with respect to the results obtained for both data sets, we confirmed again the advantages of considering the genotype-by-environment interaction in prediction models under the cross-validation schemes CV2 and CV1. Under the CV0 scheme, mixed results were observed for the first data set (D1), while for D2 the main effects model was slightly superior in most cases. With the CV00 scheme, no significant differences were observed for either data set. We conclude that, using similar sample sizes in training sets, the genotype-by-environment interaction can be leveraged when a portion of the data in the target environment has been observed via other genotypes, while for novel environments other sources of information, such as soil and weather data, need to be incorporated to improve results.
Author Contributions: R.P., Conceptualization, data collection, data analysis, elaboration of the first draft. M.G., Conceptualization, provided oversight of the study. D.J., Conceptualization, development of the modules, provided oversight of the study. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.