Towards the Modeling and Prediction of the Yield of Oilseed Crops: A Multi-Machine Learning Approach
Round 1
Reviewer 1 Report
The manuscript investigates the prediction of oilseed crop yields from different regions in Iran using different machine learning models. The manuscript is generally well written, and the critical traits it identifies as controlling yield support our typical agronomic knowledge of yield components. My comments are below:
Table 1 footnote: Are you sure “RF” refers to “radio frequency”? Reference #74, for example, uses a random forest model. Please check.
Table 1: Some other important studies should be included here and in general discussion. For example:
1) Dhaliwal, J. K., Panday, D., Saha, D., Lee, J., Jagadamma, S., Schaeffer, S., Mengistu, A. 2022. Predicting and interpreting cotton yield and its determinants under long-term conservation management practices using machine learning. Computers and Electronics in Agriculture, 199, 107107. doi.org/10.1016/j.compag.2022.107107.
2) Shahhosseini et al. 2020. Forecasting Corn Yield With Machine Learning Ensembles. https://doi.org/10.3389/fpls.2020.01120
3) Hoffman et al. Analysis of climate signals in the crop yield record of sub-Saharan Africa. https://doi.org/10.1111/gcb.13901
Page 7, L3: Please write “1% soil organic matter”
Page 7, L12-13: Did you use an average input parameter value obtained from the 10 plants, or were all 10 measurements treated as pseudo-replicates? If the latter, how did you avoid prediction bias due to data leakage, given that some of the replicated observations from a given treatment will be randomly assigned to the training data and the rest to the test data? In those instances, the model would already have seen partial data for a treatment during training. Please clarify.
Section 2 (2.1): It would be good to provide the total number of observations used to train and test the ML models. Section 3.1 mentions n of 135, which means the testing data include only 27 observations (20%). However, the number of observations in the train and test data seems to be much higher in Figure 5. Please clarify n for the train and test data in Figure 5 and also in Table 7. Is n = 135 a representative number to train and test a robust ML model?
Page 13, L2: Soil organic matter?
Page 13, L3: Replace “declared” with “found” or “observed”.
Other general comments: The model was developed based on the data from different regions in Iran. If the authors have access to data, it would be good to see the model’s predictability in other soil and climatic regions.
The discussion section is weak: it focuses mostly on model performance and gives little emphasis to insights from variable importance and to management guidance for improving sesame yield and quality. It would be helpful to see how the models predict the functional relationships between the critical input and output (SSY) parameters, and the interactions between the predictor variables influencing SSY.
Author Response
-Reviewer 1
Comments and Suggestions for Authors
The manuscript investigates the prediction of oilseed crop yields from different regions in Iran using different machine learning models. The manuscript is generally well written, and the critical traits it identifies as controlling yield support our typical agronomic knowledge of yield components. My comments are below:
Authors’ response: We would like to thank the esteemed reviewer for reviewing the manuscript and for the critical points of view and other comments, which certainly helped us to improve the quality of the manuscript.
Specific comments are listed below.
- Table 1 footnote: Are you sure “RF” refers to “radio frequency”? Reference #74, for example, uses a random forest model. Please check.
Authors’ response: Thank you for raising that. It was a typo and has been corrected. See Table 1.
- Table 1: Some other important studies should be included here and in general discussion. For example:
1) Dhaliwal, J. K., Panday, D., Saha, D., Lee, J., Jagadamma, S., Schaeffer, S., Mengistu, A. 2022. Predicting and interpreting cotton yield and its determinants under long-term conservation management practices using machine learning. Computers and Electronics in Agriculture, 199, 107107. doi.org/10.1016/j.compag.2022.107107.
2) Shahhosseini et al. 2020. Forecasting Corn Yield With Machine Learning Ensembles. https://doi.org/10.3389/fpls.2020.01120
3) Hoffman et al. Analysis of climate signals in the crop yield record of sub-Saharan Africa. https://doi.org/10.1111/gcb.13901
Authors’ response: It’s done. See Table 1.
- Page 7, L3: Please write “1% soil organic matter”
Authors’ response: As rightly noted. It’s done.
- Page 7, L12-13: Did you use an average input parameter value obtained from the 10 plants, or were all 10 measurements treated as pseudo-replicates? If the latter, how did you avoid prediction bias due to data leakage, given that some of the replicated observations from a given treatment will be randomly assigned to the training data and the rest to the test data? In those instances, the model would already have seen partial data for a treatment during training. Please clarify.
Authors’ response: Thank you for your critical comment. The following information was added to Section 2.1 of Materials and Methods (P6 L23) to clarify this issue. We used 378 observations in modeling: 80% of them (302) in the training phase and the remaining 76 for testing. K-fold cross-validation was also used to make sure that the models’ predictions do not suffer from overfitting (overtraining). Table 7 gives the mean and standard deviation of the prediction results, which makes it possible to check generalization and to confirm that there is no overfitting problem. Given the explanations provided in the article for these tables, one can be confident in the models’ prediction results.
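The leakage concern in this comment can be illustrated with a group-aware split. The following is only a minimal sketch, not the authors' procedure: it assumes one row per replicate and treats the genotype as the grouping key, and the function name `group_train_test_split` and the 135 x 3 layout (before any outlier removal) are hypothetical choices made for illustration. Every genotype is assigned wholly to training or wholly to testing, so no replicate of a test genotype is seen during training.

```python
import random

def group_train_test_split(groups, test_fraction=0.2, seed=0):
    """Split row indices so that every group lands entirely in train or in test."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # Hold out whole groups, not individual rows.
    n_test_groups = max(1, round(test_fraction * len(unique_groups)))
    test_groups = set(unique_groups[:n_test_groups])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# Hypothetical layout (assumption): 135 genotypes x 3 replicates, one row per replicate.
groups = [genotype for genotype in range(135) for _ in range(3)]
train_idx, test_idx = group_train_test_split(groups, test_fraction=0.2, seed=42)

train_groups = {groups[i] for i in train_idx}
test_groups = {groups[i] for i in test_idx}
# An empty intersection means no genotype contributes rows to both sets.
print(len(train_idx), len(test_idx), train_groups & test_groups)
```

A purely random 80/20 split over rows gives no such guarantee, which is the situation the reviewer describes. The same grouping idea carries over to K-fold cross-validation by assigning whole groups to folds.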
- Section 2 (2.1): It would be good to provide the total number of observations used to train and test the ML models. Section 3.1 mentions n of 135, which means the testing data include only 27 observations (20%). However, the number of observations in the train and test data seems to be much higher in Figure 5. Please clarify n for the train and test data in Figure 5 and also in Table 7. Is n = 135 a representative number to train and test a robust ML model?
Authors’ response: As explained in the response to comment No. 4, the manuscript has been modified.
- Page 13, L2: Soil organic matter?
Authors’ response: It’s done.
- Page 13, L3: Replace “declared” with “found” or “observed”.
Authors’ response: It’s done.
Other general comments: The model was developed based on the data from different regions in Iran. If the authors have access to data, it would be good to see the model’s predictability in other soil and climatic regions. The discussion section is weak: it focuses mostly on model performance and gives little emphasis to insights from variable importance and to management guidance for improving sesame yield and quality. It would be helpful to see how the models predict the functional relationships between the critical input and output (SSY) parameters, and the interactions between the predictor variables influencing SSY.
Authors’ response: As explained in Section 2.1, the data come from the cultivation of 135 different sesame genotypes under the same weather conditions. Therefore, the weather and the other conditions and variables affecting the growth and performance of the genotypes can be assumed to be the same. The order of the most effective variables in predicting SSY was discussed on the basis of the sensitivity analysis. The interaction effects of the variables on the model’s prediction error are significant, and this is also discussed. The points raised above are valid and should be evaluated in depth. We appreciate your suggestions. In future research, we will definitely plan to consider climate and chemical parameters.
Author Response File: Author Response.docx
Reviewer 2 Report
The article is about developing a model for predicting the yield of sesame using machine learning techniques.
The article is well written. However, I have some observations.
Title: OK
Abstract: Good
Introduction: Too lengthy, can be shortened
Explain what Flowering time 10% and 100% mean (as given in Table 2).
Specific comments:
1. Calculate and give a table of correlation of the independent variables with significance level (p value), before running PCA
2. Page 11, L23: Correct the statement: variables do not explain total variability in PCA; rather, the PCs explain variability.
3. Page 12, L18: The quadratic model with PC1 and PC5 was best. Please explain whether models were fitted with all PCs pairwise separately. The authors are requested to describe the methodology completely so that readers can understand the method used clearly.
4. Page 14, L2-9: The authors describe using the variable with maximum power (minimum p value, i.e., maximum SS) to find the most effective parameters.
Explain clearly how Table 7 was calculated, i.e., what is meant by excluding PCA and including PCA.
Improve the English in the article
The article can be accepted after incorporating the comments mentioned above.
Author Response
-Reviewer 2
The article is about developing a model for predicting the yield of sesame using machine learning techniques. The article is well written. However, I have some observations. Title: OK; Abstract: Good; Introduction: Too lengthy, can be shortened. Explain what Flowering time 10% and 100% mean (as given in Table 2).
Authors’ response: Thank you for reviewing the manuscript and for the general comments. The Introduction has been modified. Flowering time 10% and 100% are described at P7 L22. FT10% is the time taken for 10% of all plants to flower, and FT100% is the time taken for the plants to flower completely.
Specific comments:
- Calculate and give a table of correlation of the independent variables with significance level (p value), before running PCA.
Authors’ response: The significance of the inputs is evaluated in Table 6 using comparison criteria such as the sum of squares and the p-value, which can serve as a precise alternative analysis. The correlation values for all independent variables are also included in Table 3.
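The table requested in this comment pairs each correlation coefficient with a significance level. The sketch below shows one way such a table could be produced with the standard library only. It is not the authors' analysis: the three trait columns are synthetic, the names `trait_a`, `trait_b` and `trait_c` are hypothetical, and the p-value comes from a permutation test, used here in place of the usual t-based test so that no statistics package is needed.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(y_shuffled)
        if abs(pearson_r(x, y_shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement itself, so p is never zero.
    return (hits + 1) / (n_permutations + 1)

# Synthetic data (assumption): trait_b is built to correlate with trait_a,
# trait_c is generated independently of both.
rng = random.Random(1)
n = 60
trait_a = [rng.gauss(0, 1) for _ in range(n)]
trait_b = [a + 0.3 * rng.gauss(0, 1) for a in trait_a]
trait_c = [rng.gauss(0, 1) for _ in range(n)]

columns = {"trait_a": trait_a, "trait_b": trait_b, "trait_c": trait_c}
names = list(columns)
results = {}
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson_r(columns[first], columns[second])
        p = permutation_p_value(columns[first], columns[second])
        results[(first, second)] = (r, p)
        print(f"{first} vs {second}: r = {r:+.3f}, p = {p:.4f}")
```

With nine input variables the same loop would yield 36 pairs, so a correction for multiple comparisons may be worth considering when reading the p-values.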
- Page 11, L23: Correct the statement: variables do not explain total variability in PCA; rather, the PCs explain variability.
Authors’ response: As rightly noted, the principal components (PCs) explain the variability. It has been modified. See P12 L7.
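The distinction drawn in this comment can be stated with the standard definition of explained variance in PCA. The notation below is an assumption introduced for illustration: $\lambda_1 \ge \dots \ge \lambda_p$ are the eigenvalues of the covariance (or correlation) matrix of the $p$ input variables, and $m$ is the number of retained PCs.

```latex
\[
\mathrm{EVR}_k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j},
\qquad
\mathrm{CEVR}_m = \frac{\sum_{k=1}^{m} \lambda_k}{\sum_{j=1}^{p} \lambda_j}
\]
```

Here $\mathrm{EVR}_k$ is the share of total variance explained by the $k$-th PC and $\mathrm{CEVR}_m$ is the cumulative share of the first $m$ PCs. If the nine inputs were standardized before PCA (an assumption), the denominator equals $p = 9$, and the five retained PCs would explain $(\lambda_1 + \dots + \lambda_5)/9$ of the total variance.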
- Page 12, L18: The quadratic model with PC1 and PC5 was best. Please explain whether models were fitted with all PCs pairwise separately. The authors are requested to describe the methodology completely so that readers can understand the method used clearly.
Authors’ response: The highlighted values indicate the best performance of ML-PCA and ML-noPCA (using the original inputs, not PCs). In general, ML with the nine original inputs gave higher accuracy than ML with the five PCs obtained from PCA. Although ML performed better without PCA, computation is faster with ML-PCA. See P13 L13.
- Page 14, L2-9: The authors describe using the variable with maximum power (minimum p value, i.e., maximum SS) to find the most effective parameters.
Authors’ response: As mentioned in the response to comment No. 1, the significance of the independent variables is described in Table 6 based on the sum of squares and the p-value.
- Explain clearly how Table 7 was calculated, i.e., what is meant by excluding PCA and including PCA.
Authors’ response: Thanks for raising this point. We decided to examine the synergistic effect of combining the ML techniques with PCs created from the nine independent variables as inputs. Thus, the yield-prediction models were run once by ML with PCA (five PCs as inputs; “+, including PCA”) and once by ML without PCA (nine independent variables; “*-, excluding PCA”; see Table 2). The prediction performances are compared on multiple criteria (RMSE, TSSE, MAPE, etc.). Please see Table 4 and P13 L13.
- Improve the English in the article.
Authors’ response: It’s done. The manuscript was revised again by one of the authors who is a native English speaker.
The article can be accepted after incorporating the comments mentioned above.
Author Response File: Author Response.docx
Reviewer 3 Report
See the attachment
Comments for author File: Comments.pdf
Author Response
Reviewer 3
The authors are predicting the yield of oilseed crops; however, I have the following comments.
- The authors need to revise the abstract and use a proper scientific writing style.
Authors’ response: It’s done.
- The nomenclature and graphical abstract should be placed towards the end or after the introduction (a graphical abstract is not a journal requirement).
Authors’ response: The nomenclature and graphical abstract have been moved to the end of the manuscript.
- The authors must cite the following relevant articles in related work:
- Prediction of thymoquinone content in black seed oil using multivariate analysis: An efficient model for its quality assessment
- A convolution neural network-based seed classification system
- Rubber Seed Oil Epoxidation: Experimental Study and Soft Computational Prediction.
- A Deep Learning-Based Model for Date Fruit Classification
- Smart Seed Classification System Based on MobileNetV2 Architecture.
Authors’ response: It’s done. See P5 L8 and Table 1.
- The distribution of training and testing samples is not mentioned properly.
Authors’ response: The training and testing steps randomly used 80% and 20% of the total dataset, respectively. Please see P6 L23 and P10 L5.
- The authors need to mention the ratio of the training and test sets.
Authors’ response: Please see the response to comment No. 4. It is also noted at P10 L2.
- 135 samples have been used for training and testing. Is this sufficient to train the model? The authors should provide reasons to justify the sample size.
Authors’ response: Yes, thank you for the reminder. We apologize for the problem in this section. Based on comments 4 and 5 of Reviewer 1, the explanation regarding the number of data points has been made more complete, and the following explanation is given:
135 sesame genotypes were investigated in a randomized complete block design with 3 replications. Ten random plants were measured from each experimental unit, and their average was taken as the result of that replication. Therefore, a total of 135 × 3 = 405 data points were obtained. Based on the Grubbs test, 27 data points were identified as outliers and removed from the data set, leaving 378 data points. Of these, 80% (302) and 20% (76) were used for training and testing the machine learning methods, respectively.
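The counts in this explanation can be checked with a few lines of arithmetic. The variable names below are introduced for illustration, and truncating 80% of 378 (302.4) down to 302 is an assumption about how the split size was rounded.

```python
n_genotypes = 135
n_replicates = 3
n_outliers = 27        # removed after the Grubbs test, as stated in the response
train_fraction = 0.8

n_total = n_genotypes * n_replicates       # replicate-level averages
n_clean = n_total - n_outliers             # rows left after outlier removal
n_train = int(train_fraction * n_clean)    # 302.4 truncated (rounding rule assumed)
n_test = n_clean - n_train

print(n_total, n_clean, n_train, n_test)
```

The resulting figures (405, 378, 302 and 76) match those quoted in the response.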
- An ablation study would make it clearer how the ML models perform.
Authors’ response: Yes, this is a good suggestion. However, the importance of the independent variables in predicting oilseed performance was already investigated using the results of Figure 5 for the MLR model, and it was addressed for the machine learning models in the sensitivity analysis section (results of Figure 8). Therefore, an ablation study was not used.
- The future scope of this work is missing from the conclusion.
Authors’ response: It’s done. See P20 L11.
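For readers unfamiliar with the term, a feature-ablation study retrains the model with one input removed at a time and compares the test error against the full model. The sketch below illustrates the idea on synthetic data only. The k-nearest-neighbour regressor is a stand-in chosen for brevity, not one of the models in the manuscript, and the feature names `informative`, `weak` and `noise` are hypothetical.

```python
import random

def knn_predict(train_X, train_y, x, k=5):
    """Average the targets of the k nearest training rows (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), target)
        for row, target in zip(train_X, train_y)
    )
    nearest = dists[:k]
    return sum(t for _, t in nearest) / len(nearest)

def rmse(train_X, train_y, test_X, test_y):
    """Root-mean-square error of the k-NN predictions on the test rows."""
    errors = [(knn_predict(train_X, train_y, x) - y) ** 2
              for x, y in zip(test_X, test_y)]
    return (sum(errors) / len(errors)) ** 0.5

def drop_column(X, j):
    """Return a copy of X without column j."""
    return [row[:j] + row[j + 1:] for row in X]

# Synthetic data (assumption): the target depends strongly on the first feature,
# weakly on the second, and not at all on the third.
rng = random.Random(0)
feature_names = ["informative", "weak", "noise"]
X = [[rng.uniform(0, 1) for _ in feature_names] for _ in range(300)]
y = [3.0 * r[0] + 0.5 * r[1] + rng.gauss(0, 0.05) for r in X]
train_X, test_X = X[:240], X[240:]
train_y, test_y = y[:240], y[240:]

baseline = rmse(train_X, train_y, test_X, test_y)
ablation = {}
for j, name in enumerate(feature_names):
    ablation[name] = rmse(drop_column(train_X, j), train_y,
                          drop_column(test_X, j), test_y)

print(f"all features: RMSE = {baseline:.3f}")
for name, err in ablation.items():
    print(f"without {name}: RMSE = {err:.3f}")
```

A large increase in error when a feature is removed suggests that the model relies on it. This is complementary to a sensitivity analysis, which perturbs the inputs of a fitted model instead of retraining it.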
Author Response File: Author Response.docx
Round 2
Reviewer 3 Report
The authors have incorporated all the previous comments, but I don't understand why they put something like "highlight" followed by the contributions at the beginning of the paper.
This should be mentioned in the last paragraph of the introduction, where they should start by writing: "This research work is predicting ..... The contribution of this work is as follows: four ML models..., first use of coupled..., use of DPR..., etc."
Author Response
- The authors have incorporated all the previous comments, but I don't understand why they put something like "highlight" followed by the contributions at the beginning of the paper. This should be mentioned in the last paragraph of the introduction, where they should start by writing: "This research work is predicting ..... The contribution of this work is as follows: four ML models..., first use of coupled..., use of DPR..., etc."
Authors’ response: It’s done.
Author Response File: Author Response.doc