Review Reports - Genomic Selection in Cereal Breeding

Round 1

Reviewer 1 Report

There is nothing really wrong with this review in terms of the topics covered, but in general I find the coverage of the area of "genomic selection in plant breeding" somewhat superficial. A lot of key references seem to be missing from the review.

It's a bit funny to denote genomic selection as a new method in plant breeding (L9). If you count Rex Bernardo's seminal work in the 1990s (not referenced in this review), this method has been around for about 25 years now. In a similar vein, it is somewhat outdated to say (see L384) "GS [...] could be a powerful tool in plant breeding" - It already is!

Regarding Bayesian analysis, there is a large thread of recent papers that takes Bayesian analysis to a new level, providing new packages for the approach. This thread is not covered at all. There is certainly more than just Bayes A and B, which were already proposed by Meuwissen et al. in 2001. In the meantime a lot has happened, including a substantial extension of the Bayesian alphabet.

It is not accurate to say (L366) that GBLUP requires no iterations. The method to estimate the variance is usually REML, which is clearly an iterative method.

Author Response

Our detailed responses and modifications are marked in italics and highlighted in yellow below each of the reviewer’s original comments.

We have updated the manuscript with substantially more references.

We believe most people will agree that Meuwissen et al. 2001 was the original source proposing genome-wide selection, and that the first plant breeding publications were papers like Bernardo 2007 and the review by Heffner et al. 2009, with the bulk of the plant breeding literature on GS being less than 5 years old. The work from the 1990s falls under our definition of ‘Marker Assisted Selection’, and many plant breeding authors agree with our distinction between MAS and GS (including Bernardo himself in the 2007 paper). To accommodate the reviewer’s comment, we have removed the word ‘new’, and to honor Bernardo’s work from the 1990s we have added some text at the end of the introduction on how it may be seen as preliminary work leading up to Meuwissen’s GS.

We have updated the methods section to give a more complete overview of most of the methods considered for genomic prediction. However, we did not want to turn this review into an extensive description and comparison of methods, because that subject is enough for a review of its own, and there is limited novelty in an extensive review of methods because several others have already done that. We hope we have now struck the right balance and sufficiently cover the approaches considered for genomic prediction.

It is not accurate to say (L366) that GBLUP requires no iterations. The method to estimate the variance is usually REML, which is clearly an iterative method.

In practice, most breeders will run GBLUP with pre-estimated variance components; then GBLUP needs no iteration to obtain the breeding values; we have made our text more precise on this point.
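
As a rough illustration of this point, the sketch below computes GBLUP breeding values with a few direct linear solves once the variance ratio is treated as known. The marker matrix, the phenotypes, the value `lam = 2` for the ratio of residual to genetic variance, and all function names are made-up assumptions for illustration and are not taken from the manuscript. Estimating `lam` itself (e.g. by REML) would be the iterative step the reviewer refers to.

```python
from fractions import Fraction


def solve(a, b):
    """Solve a @ x = b directly by Gauss-Jordan elimination (exact arithmetic)."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in a row with a non-zero pivot, then normalise it.
        pivot_row = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot_row] = m[pivot_row], m[col]
        pivot = m[col][col]
        m[col] = [v / pivot for v in m[col]]
        # Eliminate the pivot column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]


def genomic_relationship(z):
    """G = Z Z' / (number of markers), markers coded -1/0/1 (assumed coding)."""
    n_markers = len(z[0])
    return [[Fraction(sum(a * b for a, b in zip(zi, zj)), n_markers) for zj in z]
            for zi in z]


def gblup(y, g, lam):
    """GBLUP for y = mu + g + e with a known ratio lam = var_e / var_g."""
    n = len(y)
    v = [[g[i][j] + (lam if i == j else 0) for j in range(n)] for i in range(n)]
    # Generalised least squares mean: mu = (1'V^-1 y) / (1'V^-1 1).
    mu = sum(solve(v, y)) / sum(solve(v, [1] * n))
    # Breeding values: g_hat = G V^-1 (y - mu).
    c = solve(v, [yi - mu for yi in y])
    g_hat = [sum(gij * cj for gij, cj in zip(row, c)) for row in g]
    return mu, g_hat


# Toy data: 4 lines, 3 markers (all values invented for illustration).
Z = [[1, 0, -1], [1, 1, 0], [-1, 0, 1], [0, -1, 1]]
y = [5, 7, 3, 4]
lam = 2  # assumed, pre-estimated variance ratio

mu_hat, gebv = gblup(y, genomic_relationship(Z), lam)
print(float(mu_hat), [round(float(v), 3) for v in gebv])
```

No loop over successive estimates appears anywhere in `gblup`; the only repetition is inside the direct solver, whose cost is fixed by the number of lines.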

Reviewer 2 Report

Review on agronomy-422660
The review paper focused on the application of genomic selection in cereal breeding. It reviews the basic set-up of GS, the strategies for implementation of GS in breeding programs and the use of additive and non-additive genetic effects in GS. The application of GS in cereal breeding programs is still at the beginning stage. Therefore a comprehensive review of different strategies for implementing GS in applied breeding programs is very important and interesting for both researchers and breeders. However, a major drawback of this manuscript is that the authors ignored much of the recent literature on the application of GS in plant breeding. For a review paper this is not acceptable. In Section 3 which is in my opinion the most important part of the paper, many discussions were based on general theoretical thoughts and experiences from animal breeding, but one could easily find a few publications in plant breeding. In Section 5 discussing the use of non-additive genetic effects, the authors didn’t even mention the RKHS model which is the most commonly used non-additive model. Therefore I think the authors should try to have an overview of all relevant literature and carefully revise/rewrite the manuscript, especially Section 3. The following are some detailed suggestions/comments which may help the authors to improve the quality of their manuscript:
Line 42: GS models do not predict “marker-trait associations” but “marker effects”.
Line 68: Generally speaking, the research on GS in plant breeding has been extensively carried out for many years. I think it is not appropriate to say “first plant GS research papers”.
Line 89-90: “the variation in accuracy increased between CV rounds” This is difficult to understand: Yes there is variation in accuracy between CV rounds. What does “increased” mean here? Compared with what?
Section 2.5: There are certainly other types of GS model (e.g. non-parametric models, machine learning models such as support vector machine and neural networks) besides BLUP and Bayesian models. From the subtitle I understand that the author just wanted to discuss these two types of models. This is fine. But the discussion was limited to the three models introduced by Meuwissen et al. (RR-BLUP, Bayes A and Bayes B) in 2001. The difference between these models has been well-known. Is it really worth to repeat a detailed discussion here? On the other hand, there are many other Bayesian models (e.g. Bayes C, D, …) and variation of BLUP models (e.g. BLUP models incorporating biological information or results from GWAS). Maybe it is better to have an overview of these models and give a general discussion about the difference between BLUP and Bayesian models here. Again I think the authors should search for more literature.
Line 161-163: The BLUP method modeling all marker effects was commonly called RR-BLUP (Ridge/Random regression-BLUP). As the author stated, the BLUP model with a genomic relationship matrix (G-matrix) replacing the pedigree-matrix was called GBLUP. These two models are different but statistically equivalent (Habier et al. 2007). Please be precise.
Line 191: I think “across-generation” is not a proper term. In Figure 3, each column is a breeding cycle and in each cycle there are several generations. Looking at the arrows, I think it is “across breeding cycle” instead of “across-generation”.
Line 202-204: The use of PYT in GS has been discussed in the literature (e.g. Michel et al. 2017).
Line 212-214: GS across breeding cycles has been discussed in the literature (e.g. Michel et al. 2016).
Line 216-218: Please note that there are also studies using data from a real wheat breeding program that report relatively high prediction accuracy across years (e.g. He et al. 2016).
Section 3.3: It is not clear how to make the “within-generation” GS. First of all I assume the authors meant “within breeding cycle” GS looking at Figure 2. If one generation or several generations were used as training population, then the next generation will be predicted by GS. If the whole generation is predicted by GS, then this generation will not have any phenotype information. This will be the idea of the next subsection. Or did the authors mean to partly phenotype the next generation and partly use GEBV?
Line 229-230: The idea of selecting new parents purely based on GEBV has been investigated (e.g. Longin et al. 2015).
Line 270: While it is a good point that using pedigree information enables the prediction of non-genotyped lines with known pedigree, one has to consider that in plant breeding (e.g. wheat) it is usually difficult to have complete pedigree information for the individuals in the parent pool. This is one reason that prediction using pedigree information has not been as common as in animal breeding.
Line 293: I do not understand. Why is it interesting to predict early-stage breeding materials with good market value? They will not be released in market anyhow. Isn’t the breeding value of these materials more important?
Line 294-297: Indeed the Hadamard product of the G-matrix models the pairwise interactions among markers and ignores higher-order interactions. But the reproducing kernel Hilbert space (RKHS) regression model captures all levels of interactions (Jiang et al. 2015) and has been extensively used in the literature of plant breeding (e.g. Morota and Gianola 2014). There are also other models exploiting non-additive effects. Please search for more literature.
Line 298-299: It was mentioned without any reference that the authors’ own experience showed low prediction accuracy for the prediction of TGV. But many studies have reported that the RKHS model gives higher prediction accuracy than the GBLUP or Bayesian models (e.g. Pérez-Rodríguez et al. 2012).
Line 326-328: There have been a number of previous studies on GS accounting for GxE in plant breeding (e.g. Jarquín et al. 2014, Cuevas et al. 2017). The authors have to search for more literature!
References
Cuevas, J., Crossa, J., Montesinos-López, O. A., Burgueño, J., Pérez-Rodríguez, P., & de los Campos, G. (2017). Bayesian genomic prediction with genotype × environment interaction kernel models. G3: Genes, Genomes, Genetics, 7(1), 41-53.
Habier, D., Fernando, R. L., & Dekkers, J. C. (2007). The impact of genetic relationship information on genome-assisted breeding values. Genetics, 177(4), 2389-2397.
He, S., Schulthess, A. W., Mirdita, V., Zhao, Y., Korzun, V., Bothe, R., ... & Jiang, Y. (2016). Genomic selection in a commercial winter wheat population. Theoretical and Applied Genetics, 129(3), 641-651.
Jarquín, D., Crossa, J., Lacaze, X., Du Cheyron, P., Daucourt, J., Lorgeou, J., ... & Burgueño, J. (2014). A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theoretical and Applied Genetics, 127(3), 595-607.
Jiang, Y., & Reif, J. C. (2015). Modelling epistasis in genomic selection. Genetics, 201(2), 759-768.
Longin, C. F. H., Mi, X., & Würschum, T. (2015). Genomic selection in wheat: optimum allocation of test resources and comparison of breeding strategies for line and hybrid breeding. Theoretical and Applied Genetics, 128(7), 1297-1306.
Michel, S., Ametz, C., Gungor, H., Epure, D., Grausgruber, H., Löschenberger, F., & Buerstmayr, H. (2016). Genomic selection across multiple breeding cycles in applied bread wheat breeding. Theoretical and Applied Genetics, 129(6), 1179-1189.
Michel, S., Ametz, C., Gungor, H., Akgöl, B., Epure, D., Grausgruber, H., ... & Buerstmayr, H. (2017). Genomic assisted selection for enhancing line breeding: merging genomic and phenotypic selection in winter wheat breeding programs with preliminary yield trials. Theoretical and Applied Genetics, 130(2), 363-376.
Morota, G., & Gianola, D. (2014). Kernel-based whole-genome prediction of complex traits: a review. Frontiers in Genetics, 5, 363.
Pérez-Rodríguez, P., Gianola, D., González-Camacho, J. M., Crossa, J., Manès, Y., & Dreisigacker, S. (2012). Comparison between linear and non-parametric regression models for genome-enabled prediction in wheat. G3: Genes, Genomes, Genetics, 2(12), 1595-1605.

Author Response

We would like to thank the reviewer for reading our manuscript. In response to the reviewer’s comments, we have expanded the coverage of the literature, especially on subjects like statistical models, GxE, prediction of non-additive effects (using kernel methods), and on ideas to use GS in practical breeding.

Our detailed responses and modifications are marked in italics and highlighted in yellow below each of the reviewer’s original comments.

The review paper focused on the application of genomic selection in cereal breeding. It reviews the basic set-up of GS, the strategies for implementation of GS in breeding programs and the use of additive and non-additive genetic effects in GS. The application of GS in cereal breeding programs is still at the beginning stage. Therefore a comprehensive review of different strategies for implementing GS in applied breeding programs is very important and interesting for both researchers and breeders. However, a major drawback of this manuscript is that the authors ignored much of the recent literature on the application of GS in plant breeding. For a review paper this is not acceptable.

We have updated the manuscript with substantially more references.

In Section 3 which is in my opinion the most important part of the paper, many discussions were based on general theoretical thoughts and experiences from animal breeding, but one could easily find a few publications in plant breeding.

We indeed found some publications which are now included in the manuscript.

In Section 5 discussing the use of non-additive genetic effects, the authors didn’t even mention the RKHS model which is the most commonly used non-additive model.

Our experience is that RKHS may receive a lot of attention in the literature, but in practice we see breeders mostly using the GBLUP + G*G (Hadamard) version because it can be fitted in standard mixed model packages for breeding value estimation. To accommodate the reviewer’s comment we have added the RKHS model, both in section 2 on methods and in the section on non-additive effects.
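
To make the two model variants concrete, the sketch below builds the matrices they rely on from a toy marker matrix: the additive G matrix, its Hadamard (element-wise) product G#G used for pairwise interactions, and a Gaussian kernel of the kind often used in RKHS regression. The marker coding, the scaling of G, the bandwidth `h`, the kernel normalisation, and the function names are all assumptions for illustration only. Each matrix would then enter a mixed model as the covariance structure of a random effect.

```python
import math


def genomic_relationship(z):
    """Additive kernel G = Z Z' / m for m markers coded -1/0/1 (assumed coding)."""
    m = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / m for zj in z] for zi in z]


def hadamard_square(g):
    """Element-wise product G # G, used for pairwise (two-way) interactions."""
    return [[v * v for v in row] for row in g]


def gaussian_kernel(z, h=1.0):
    """K_ij = exp(-h * d_ij^2 / m), with d_ij the Euclidean marker distance."""
    m = len(z[0])

    def sq_dist(zi, zj):
        return sum((a - b) ** 2 for a, b in zip(zi, zj))

    return [[math.exp(-h * sq_dist(zi, zj) / m) for zj in z] for zi in z]


# Toy data: 4 lines, 3 markers (values invented for illustration).
Z = [[1, 0, -1], [1, 1, 0], [-1, 0, 1], [0, -1, 1]]
G = genomic_relationship(Z)
GG = hadamard_square(G)
K = gaussian_kernel(Z, h=0.5)

for name, mat in (("G", G), ("G#G", GG), ("Gaussian", K)):
    print(name)
    for row in mat:
        print("  ", [round(v, 3) for v in row])
```

G#G needs nothing beyond G itself, which is consistent with the practical convenience described above, whereas the Gaussian kernel introduces a bandwidth that has to be chosen or tuned.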

Therefore I think the authors should try to have an overview of all relevant literature and carefully revise/rewrite the manuscript, especially Section 3.

We have done so and hope the rewritten manuscript will be found improved.

The following are some detailed suggestions/comments which may help the authors to improve the quality of their manuscript:
Line 42: GS models do not predict “marker-trait associations” but “marker effects”.

We have changed it to ‘marker effects’.

Line 68: Generally speaking, the research on GS in plant breeding has been extensively carried out for many years. I think it is not appropriate to say “first plant GS research papers”.

We have removed ‘first papers’.

Line 89-90: “the variation in accuracy increased between CV rounds” This is difficult to understand: Yes there is variation in accuracy between CV rounds. What does “increased” mean here? Compared with what?

The increase in the variation in accuracy refers to a comparison between a small and a large training data set. We reworded the sentence and hope it is clear now.

Section 2.5: There are certainly other types of GS model (e.g. non-parametric models, machine learning models such as support vector machine and neural networks) besides BLUP and Bayesian models. From the subtitle I understand that the author just wanted to discuss these two types of models. This is fine. But the discussion was limited to the three models introduced by Meuwissen et al. (RR-BLUP, Bayes A and Bayes B) in 2001. The difference between these models has been well-known. Is it really worth to repeat a detailed discussion here? On the other hand, there are many other Bayesian models (e.g. Bayes C, D, …) and variation of BLUP models (e.g. BLUP models incorporating biological information or results from GWAS). Maybe it is better to have an overview of these models and give a general discussion about the difference between BLUP and Bayesian models here. Again I think the authors should search for more literature.

We were somewhat in doubt about how to address this comment. On the one hand, we agree that it is not really worthwhile to repeat a detailed discussion of methods here, because this is already addressed in several other reviews and it adds little to our aim of describing ways to use GS in cereal breeding. On the other hand, we have now expanded the section on methods to cover the most important methods ‘lightly’ but more comprehensively, including more of the Bayesian alphabet (sticking to the most widely used and known versions), machine learning, and kernel methods. We hope this is now satisfactory.

Line 161-163: The BLUP method modeling all marker effects was commonly called RR-BLUP (Ridge/Random regression-BLUP). As the author stated, the BLUP model with a genomic relationship matrix (G-matrix) replacing the pedigree-matrix was called GBLUP. These two models are different but statistically equivalent (Habier et al. 2007). Please be precise.

We made our wording more precise.
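
The equivalence noted by the reviewer can be checked numerically. The following sketch computes genomic values once via marker effects (RR-BLUP) and once via the relationship matrix K = ZZ' (GBLUP) and compares them. The toy data, the shrinkage parameter `lam`, and the function names are hypothetical, and the phenotypes are assumed to be centred already so that no mean has to be fitted.

```python
from fractions import Fraction


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot_row] = m[pivot_row], m[col]
        pivot = m[col][col]
        m[col] = [v / pivot for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_ridge(a, lam):
    """Return a + lam * I."""
    return [[v + (lam if i == j else 0) for j, v in enumerate(row)]
            for i, row in enumerate(a)]


def rr_blup(z, y, lam):
    """Marker effects b = (Z'Z + lam I)^-1 Z'y, then genomic values g = Z b."""
    zt = transpose(z)
    zty = [sum(zij * yi for zij, yi in zip(col, y)) for col in zt]
    effects = solve(add_ridge(matmul(zt, z), lam), zty)
    return [sum(zij * bj for zij, bj in zip(row, effects)) for row in z]


def gblup(z, y, lam):
    """Genomic values g = K (K + lam I)^-1 y with K = Z Z'."""
    k = matmul(z, transpose(z))
    c = solve(add_ridge(k, lam), y)
    return [sum(kij * cj for kij, cj in zip(row, c)) for row in k]


# Toy data: 4 lines, 3 markers; phenotypes already centred (sum to zero).
Z = [[1, 0, -1], [1, 1, 0], [-1, 0, 1], [0, -1, 1]]
y = [Fraction(1, 4), Fraction(9, 4), Fraction(-7, 4), Fraction(-3, 4)]
lam = 3  # assumed ratio of residual variance to marker-effect variance

g_rr = rr_blup(Z, y, lam)
g_gb = gblup(Z, y, lam)
print([float(v) for v in g_rr])
print(g_rr == g_gb)  # the two routes give the same genomic values
```

Dividing K by the number of markers to obtain the usual G matrix only rescales `lam` by the same factor, so the predicted genomic values do not change. The two routes differ in the size of the system solved: markers by markers for RR-BLUP, lines by lines for GBLUP.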
Line 191: I think “across-generation” is not a proper term. In Figure 3, each column is a breeding cycle and in each cycle there are several generations. Looking at the arrows, I think it is “across breeding cycle” instead of “across-generation”.

We agree that this is better called ‘across cycle’, and we have replaced it throughout the text.

Line 202-204: The use of PYT in GS has been discussed in the literature (e.g. Michel et al. 2017).

We have added this reference.

Line 212-214: GS across breeding cycles has been discussed in the literature (e.g. Michel et al. 2016).

We have added this reference.

Line 216-218: Please note that there are also studies using data from a real wheat breeding program that report relatively high prediction accuracy across years (e.g. He et al. 2016).

Here we were actually talking about breeding cycles (the yearly ‘breeding cohorts’). The good prediction accuracies across years in He et al. 2016 are based on using related material, which is much less the case across breeding cycles.

Section 3.3: It is not clear how to make the “within-generation” GS. First of all I assume the authors meant “within breeding cycle” GS looking at Figure 2. If one generation or several generations were used as training population, then the next generation will be predicted by GS. If the whole generation is predicted by GS, then this generation will not have any phenotype information. This will be the idea of the next subsection. Or did the authors mean to partly phenotype the next generation and partly use GEBV?

Yes, we had already described in our first manuscript that this would mainly apply to having incomplete data in a particular generation. It can apply to having expensive malting quality data on only a part of the progeny, or not having all progeny in all environments, etc. We have now expanded and clarified this, because we think it is a straightforward way for breeders to use GS and to save money by only partly phenotyping each generation.

Line 229-230: The idea of selecting new parents purely based on GEBV has been investigated (e.g. Longin et al. 2015).

We have added this reference.

Line 270: While it is a good point that using pedigree information enables the prediction of non-genotyped lines with known pedigree, one has to consider that in plant breeding (e.g. wheat) it is usually difficult to have complete pedigree information for the individuals in the parent pool. This is one reason that prediction using pedigree information has not been as common as in animal breeding.

In our within-cycle GS, pedigree information will be available, but we agree it will be more problematic in the across-cycle prediction. We have added this distinction.

Line 293: I do not understand. Why is it interesting to predict early-stage breeding materials with good market value? They will not be released in market anyhow. Isn’t the breeding value of these materials more important?

We have clarified this. Developing lines for the market will still take a breeder about 5 years for seed multiplication and at least 2 years of field trials. Hence we argue for putting promising candidates on the market track as fast as possible. If TGV estimates are not used, the alternatives are to use additive GEBV or to select candidates randomly (there is limited or no phenotypic information in the fast-cycle GS); both of these options are less interesting or less efficient.

Line 294-297: Indeed the Hadamard product of the G-matrix models the pairwise interactions among markers and ignores higher-order interactions. But the reproducing kernel Hilbert space (RKHS) regression model captures all levels of interactions (Jiang et al. 2015) and has been extensively used in the literature of plant breeding (e.g. Morota and Gianola 2014). There are also other models exploiting non-additive effects. Please search for more literature.

We have added references on the kernel methods.

Line 298-299: It was mentioned without any reference that the authors’ own experience showed low prediction accuracy for the prediction of TGV. But many studies have reported that the RKHS model gives higher prediction accuracy than the GBLUP or Bayesian models (e.g. Pérez-Rodríguez et al. 2012).

The reference given is based on analysis of CIMMYT data, which are much better designed than data from many practical breeders. In data sets from practical breeders, we see that the structure and use of parents is unbalanced and often does not allow very accurate estimation of TGV. However, as this is based on our unpublished experience, we have reworded the text and now state that it could be ‘promising’ to estimate TGV.

Line 326-328: There have been a number of previous studies on GS accounting for GxE in plant breeding (e.g. Jarquín et al. 2014, Cuevas et al. 2017). The authors have to search for more literature!

We have added more references on GxE.

Round 2

Reviewer 1 Report

This revision is a substantial improvement. I only have a few small editorial things to suggest.
L185: The BLUP method is not a model, but an estimation method.
L206: Check name: MacLullogh should be McCulloch, I think.
L339: high heritable => highly heritable
L568: Check the journal's guidelines on formatting requirements for references and make sure references are all formatted exactly according to these guidelines, including use of uppercase and lowercase letters.

Author Response

We would like to thank the reviewers for their feedback, which has improved the manuscript significantly.

This revision is a substantial improvement. I only have a few small editorial things to suggest. L185: The BLUP method is not a model, but an estimation method.

This has been changed.

L206: Check name: MacLullogh should be McCulloch, I think.

This has been changed.

L339: high heritable => highly heritable

This has been changed.

L568: Check the journal's guidelines on formatting requirements for references and make sure references are all formatted exactly according to these guidelines, including use of uppercase and lowercase letters.

The reference list has been thoroughly updated.

Reviewer 2 Report

All my comments have been carefully addressed in the revised manuscript. I think the quality of the paper has been significantly improved. I only have the following two additional comments/suggestions:

Lines 212-215: Ref [36] is the original paper introducing the RKHS model into the area of GS, but it did not explain/prove why the RKHS model is capable of capturing epistasis. The proof was given in [Y. Jiang and J. C. Reif, Modeling Epistasis in Genomic Selection (2015) Genetics 201:759-768]. In their proof, it was also shown why the Hadamard product of G captures two-way epistatic interactions, which was actually published earlier than Ref [46].

Lines 234-256: The authors have extended Section 2.5 to review many statistical models in GS, which is very nice. But now this part looks completely separated from the previous paragraphs. I suggest shortening this part and integrating it into the previous paragraphs. I am sure the authors will find a good solution.

Author Response

We would like to thank the reviewers for their feedback, which has improved the manuscript significantly.

This has been changed and the reference has been included.

We have chosen to put most of the detail in a text box. This makes the main text read well covering genomic prediction methods on a general level, and the interested reader can read the text box for further detail and references on specific methods. We hope this re-working accommodates the comment.