Article
Peer-Review Record

Field-Aware and Explainable Modelling for Early-Season Crop Yield Prediction Using Satellite-Derived Phenology

Remote Sens. 2026, 18(6), 890; https://doi.org/10.3390/rs18060890
by Ignacio Fuentes 1,2 and Dhahi Al-Shammari 3,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 5 February 2026 / Revised: 7 March 2026 / Accepted: 12 March 2026 / Published: 14 March 2026
(This article belongs to the Special Issue Advances in Multi-Sensor Remote Sensing for Vegetation Monitoring)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper is generally well written, but some major changes need to be made and some points need to be explained. Please find my comments below.

  1. Why does incorporating features such as clay and sand fractions degrade model performance? Is it due to the scale, or do these datasets have inherent issues? Would high-resolution imagery resolve this? If yes, please add some points on this to the Discussion section.
  2. I am very concerned about the selection of models, as only tree-based models are used in this study along with an MLP. Some models that could have been used in this study were deferred to future studies. This needs a clear explanation.
  3. It was mentioned that the red-edge-based VIs perform very well under different conditions. The study should discuss in detail why this is so. The main reason could be the saturation of other indices, which red-edge data generally address, particularly in high-biomass regions. The paper lacks this explanation, which needs to be added.
  4. Also, please work on the readability of your figures and their text.

Best wishes for your revision

Author Response

The paper is generally well written, but some major changes need to be made and some points need to be explained. Please find my comments below.

Response: We thank you for the careful reading and constructive feedback. We have substantially revised the manuscript to improve methodological clarity, internal consistency, and figure readability. In particular, we:

  • Corrected vegetation index formulas and ensured consistency between equations and band definitions.
  • Clarified the sample structure, rasterization procedure, and cutoff DOY logic.
  • Expanded explanations regarding model selection and explainability methods.
  • Improved figure resolution, axis labelling, and caption precision.

All changes are described in detail below.

Comment 1: Why does incorporating features such as clay and sand fractions degrade model performance? Is it due to the scale, or do these datasets have inherent issues? Would high-resolution imagery resolve this? If yes, please add some points on this to the Discussion section.

Response 1: The degradation observed when including soil variables likely reflects scale incompatibility and limited within-field variability rather than inherent dataset flaws. The soil layers (250 m resolution) are substantially coarser than the Sentinel-2 pixel size (10–20 m). When downscaled to pixel resolution, these soil variables become nearly constant within individual fields. Under a leave-one-field-out (LOFO) validation scheme, such variables contribute little to spatial generalisation because they do not explain within-field variability in yield.

We have now expanded the Discussion section to clarify this (lines 652-661):

“The limited contribution of soil variables (clay, sand fraction, and field capacity) is likely explained by spatial scale incompatibility. The soil layers were derived from OpenLandMap at 250 m resolution, substantially coarser than the 20 m resolution used for yield and spectral predictors. After spatial resampling, soil properties exhibited minimal within-field variability. Under a LOFO CV framework, predictors that primarily encode between-field differences contribute less to spatial generalisation performance. In contrast, high-resolution phenological metrics capture sub-field canopy variability directly linked to yield heterogeneity. The results suggest that higher-resolution soil surveys or proximal sensing approaches may be required for soil properties to meaningfully improve pixel-level yield prediction [70,71].”

Comment 2: I am very concerned about the selection of models, as only tree-based models are used in this study along with an MLP. Some models that could have been used in this study were deferred to future studies. This needs a clear explanation.

Response 2: The modelling framework prioritised algorithms capable of capturing nonlinear relationships and handling multicollinearity among spectral and phenological predictors, because several studies have reported non-linear relationships between remote sensing data and yield. Tree-based ensemble methods are well suited to these conditions and have been widely adopted in remote sensing yield modelling. This is already stated in lines 280-283: “These models were selected because they are well suited to high-dimensional and potentially collinear predictors and can represent complex non-linear interactions between crop development and environmental conditions [54–56]”.

In addition, recurrent neural networks (GRU networks) were explored in a preliminary analysis to exploit the temporal structure of the spectral time series. They exhibited lower predictive performance and reduced stability under leave-one-field-out validation, likely due to the limited number of samples. We have clarified this justification in Section 2.5 (lines 285-289):

“In addition, we conducted an exploratory experiment using a recurrent neural network (GRU) trained on truncated multi-date sequences under the same LOFO protocol; however, its performance was consistently inferior to the ensemble models across cutoff dates (e.g., RMSE ≈ 1.47 - 1.60 t ha⁻¹; R² ≈ 0.09 - 0.23), and it was therefore not retained for subsequent analyses focused on the best-performing model families.”

Comment 3: It was mentioned that the red-edge-based VIs perform very well under different conditions. The study should discuss in detail why this is so. The main reason could be the saturation of other indices, which red-edge data generally address, particularly in high-biomass regions. The paper lacks this explanation, which needs to be added.

Response 3: We thank the reviewer for highlighting the need to further justify the dominance of red-edge metrics in the modelling framework. We have expanded the Discussion to provide a clearer physiological and radiative-transfer explanation of why NDRE-based phenological descriptors outperformed NDVI-derived metrics. Specifically, we now emphasise that red-edge reflectance is less prone to saturation at moderate-to-high canopy densities and is more directly linked to chlorophyll concentration and nitrogen status, which are key determinants of yield potential. We also clarify that the superiority of NDRE metrics in our analysis was consistently supported by both permutation-based importance and SHAP attribution across cutoff DOYs. The revised paragraph appears in lines 615-629 of the Discussion:

“Across models and explainability analyses, RE-based phenological metrics, especially peak NDRE, emerged as the most influential predictors of yield variability (Figures 7 and 8). This dominance was observed in both LOFO permutation importance, reflecting spatial generalization, and SHAP analyses, reflecting contribution to fitted predictions. This finding aligns with growing evidence that RE reflectance is more sensitive to chlorophyll concentration [51], nitrogen status [65], and canopy structure [53] than traditional red-NIR indices. In dense winter wheat canopies NDVI is prone to saturation under moderate-to-high biomass conditions [17,22], reducing its sensitivity to physiological differences during peak growth. In contrast, the red-edge region remains responsive across a broader range of canopy densities [51,53], enabling detection of subtle variations in crop vigour that are directly linked to yield formation. Phenological summaries such as maximum value, area under the curve, and temporal slope further enhance this sensitivity by integrating information across time, reducing noise from individual acquisitions and aligning predictors with biologically meaningful crop development phases [66,67]. These results support a shift away from single-date vegetation indices toward temporally aggregated, phenology-aware representations for yield modelling.”

Comment 4: Also, please work on the readability of your figures and their text. Best wishes for your revision.

Response 4: Figures 1, 2, 9, and 10 have been revised to improve readability, with increased font sizes and enhanced axis labelling.

Reviewer 2 Report

Comments and Suggestions for Authors

This paper highlights that accurate and early crop yield prediction at the subfield scale is essential for precision agriculture and food system planning. The authors evaluated a phenology-based machine learning framework for winter wheat yield prediction using Sentinel-2 satellite imagery, climate reanalysis data, and field-level yield data. Phenological metrics derived from NDVI, NDWI, and NDRE were combined with cumulative seasonal precipitation and seasonal potential evapotranspiration, and multiple modeling strategies were evaluated using a leave-one-field-out cross-validation (LOFO CV) scheme to ensure spatial generalization. Of the models evaluated, the Random Forest (RF) algorithm achieved the highest overall performance, explaining up to 73% of yield variability with a root mean square error (RMSE) of 0.88 t ha⁻¹ at the optimal prediction time. This reinforces the good results RF has achieved in most applications worldwide.

The article is of regular interest to the scientific community and would be publishable once the following important considerations are addressed:
The location map needs substantial improvement, establishing a clearer and more precise structure, improving the resolution, and adding graphic elements that provide greater clarity.
The methodology needs to be more precise and explicit. This involves including a very clear methodological framework, outlining the sequence of steps for selection, preprocessing, treatment, and the application of algorithms, statistics, and other methods.
Greater precision is required regarding the structure of the training and sampling data.
The other items in the paper are, in my opinion, well presented.

Author Response

Comment 1: This paper highlights that accurate and early crop yield prediction at the subfield scale is essential for precision agriculture and food system planning. The authors evaluated a phenology-based machine learning framework for winter wheat yield prediction using Sentinel-2 satellite imagery, climate reanalysis data, and field-level yield data. Phenological metrics derived from NDVI, NDWI, and NDRE were combined with cumulative seasonal precipitation and seasonal potential evapotranspiration, and multiple modeling strategies were evaluated using a leave-one-field-out cross-validation (LOFO CV) scheme to ensure spatial generalization. Of the models evaluated, the Random Forest (RF) algorithm achieved the highest overall performance, explaining up to 73% of yield variability with a root mean square error (RMSE) of 0.88 t ha⁻¹ at the optimal prediction time. This reinforces the good results RF has achieved in most applications worldwide.

The article is of regular interest to the scientific community and would be publishable once the following important considerations are addressed:

Response 1: We thank you for the positive assessment and have addressed the specific methodological clarifications below.

Comment 2: The location map needs substantial improvement, establishing a clearer and more precise structure, improving the resolution, and adding graphic elements that provide greater clarity.

Response 2: Suggestion accepted. We revised the location map to include the administrative boundaries of Germany and Bavaria as references, adjusted the caption and visual properties of the fields, and increased the font size for readability.

Comment 3: The methodology needs to be more precise and explicit. This involves including a very clear methodological framework, outlining the sequence of steps for selection, preprocessing, treatment, and the application of algorithms, statistics, and other methods.

Response 3: We appreciate the reviewer’s suggestion. The Materials and Methods section has been revised to provide a clearer and more explicit description of the methodological framework, including the sequence of data acquisition, preprocessing, feature construction, model training, and validation steps. In particular, additional details were added regarding sampling of covariates, feature construction from truncated time series, and the cross-validation framework used for model evaluation.

To further improve clarity, we also introduced a new workflow diagram (new Figure 2) that summarises the full methodological pipeline, from data sources and preprocessing to feature engineering, model training, validation, and explainability and uncertainty analyses.

Comment 4: Greater precision is required regarding the structure of the training and sampling data. The other items in the paper are, in my opinion, well presented.

Response 4: We thank the reviewer for this important comment. We have substantially revised Section 2.5.2 to explicitly define the modelling unit and sampling structure. The revised manuscript now clarifies that yield observations were rasterised to the Sentinel-2 grid, resulting in a unique pixel-level yield value per spatial unit. For each cutoff DOY, one feature vector per pixel was constructed using truncated time series, and separate models were trained independently for each cutoff. We also explicitly state that LOFO CV was used to ensure strict spatial separation between training and testing data. These clarifications eliminate ambiguity regarding the structure of the training samples.
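
The per-cutoff feature construction described in this response can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy NDRE series, and the exact definitions of the summaries (trapezoidal area under the curve, least-squares slope against DOY) are assumptions, since the response names the metric families but does not spell out their formulas.

```python
def truncate(series, cutoff_doy):
    """Keep (doy, value) observations acquired on or before the cutoff DOY."""
    return [(doy, value) for doy, value in series if doy <= cutoff_doy]


def phenology_features(series, cutoff_doy):
    """Build one feature vector for one pixel from its truncated time series.

    Assumes distinct acquisition dates (otherwise the slope is undefined).
    """
    obs = sorted(truncate(series, cutoff_doy))
    if len(obs) < 2:
        raise ValueError("need at least two observations before the cutoff")
    doys = [doy for doy, _ in obs]
    vals = [value for _, value in obs]

    # Trapezoidal area under the index curve (assumed AUC definition).
    auc = sum(
        (doys[i + 1] - doys[i]) * (vals[i] + vals[i + 1]) / 2
        for i in range(len(obs) - 1)
    )

    # Ordinary least-squares slope of the index against DOY (assumed definition).
    mean_doy = sum(doys) / len(doys)
    mean_val = sum(vals) / len(vals)
    sxx = sum((d - mean_doy) ** 2 for d in doys)
    sxy = sum((d - mean_doy) * (v - mean_val) for d, v in zip(doys, vals))

    return {"max": max(vals), "auc": auc, "slope": sxy / sxx}


# Hypothetical NDRE time series for a single pixel: (DOY, index value).
ndre_series = [(100, 0.30), (120, 0.40), (140, 0.50), (160, 0.45), (180, 0.35)]

# One feature vector per pixel per cutoff; each cutoff feeds its own model.
features_by_cutoff = {
    cutoff: phenology_features(ndre_series, cutoff) for cutoff in (140, 175)
}
for cutoff, feats in features_by_cutoff.items():
    print(cutoff, feats)
```

Because every cutoff produces its own table with one row per pixel, a pixel never appears twice within the training data of a single model.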

Reviewer 3 Report

Comments and Suggestions for Authors

Overall, the manuscript addresses an application-driven question, namely early-season pixel-level yield prediction with field-level generalization. It combines Sentinel-2 phenological metrics with climate reanalysis data, adopts a leave-one-field-out cross-validation scheme to emphasize cross-field transferability, and incorporates permutation importance, SHAP, and ensemble-based uncertainty estimates. The topic is relevant and practically meaningful. However, in its current form, several key methodological components are insufficiently defined or internally inconsistent, critical formulas contain substantive errors, and aspects of the data structure and evaluation protocol raise reproducibility concerns. These issues directly affect the credibility and generalizability of the reported results. 

(1) The definitions of vegetation indices contain clear errors and internal inconsistencies and must be corrected as a priority. In Section 2.3, the NDVI and NDRE formulas are written in the form “(ρNIR + ρred)/(ρNIR − ρred)” and similar variants, which contradict the standard definitions and would fundamentally alter the sign and value range of the indices. In addition, Equations (2) and (3) for NDRE appear duplicated, and no explicit formula is provided for NDWI, despite references to ρgreen in the text. As presented, the cited NDWI definition, the band combinations actually used, and the mathematical expressions are not aligned. Until these inconsistencies are resolved, subsequent interpretations regarding NDWI slope, NDRE_max importance, and related findings lack a rigorous mathematical foundation.

(2) The definition of the modeling unit and label construction is unclear, raising the possibility of label leakage or optimistic bias due to repeated samples. The manuscript states that harvester yield data were rasterized to Sentinel-2 resolution and modeled at pixel level, but also mentions that “each pixel inheriting the observed yield of its corresponding field.” These two descriptions imply different label structures. If field-level mean yield is assigned to all pixels, the claimed sub-field prediction should be reconsidered conceptually. If pixel-level rasterized yield is used, the rasterization procedure, handling of overlapping tracks, and outlier filtering must be clearly described. More critically, when constructing phenological features under multiple cutoff DOYs, it is unclear whether the same pixel contributes multiple samples at different temporal truncation points. LOFO partitioning by field alone does not automatically prevent information leakage across temporal representations of the same spatial unit. The manuscript should explicitly define the sample table structure and the training–testing split logic across cutoffs.

(3) The technical logic regarding feature construction and scale matching remains incomplete. CHIRPS and ERA5-Land data have substantially coarser spatial resolution than Sentinel-2, yet the manuscript does not specify the resampling strategy or how grid boundaries intersecting field polygons are handled. In a pixel-level modeling framework, it is important to demonstrate that coarse-resolution climate variables are not merely encoding field-level or regional gradients. A variance decomposition analysis comparing within-field and between-field variability, as well as baseline models using climate variables alone, would help clarify whether the model is learning meaningful agronomic signals rather than spatial structure.

(4) The cross-validation, sampling, and weighting procedures lack sufficient detail for reproducibility. Although the manuscript states that the number of pixels per field was capped and field-balanced weights were applied, no explicit limits, sampling strategies, or weighting formulas are provided. It is also unclear whether the same strategy was consistently applied within the nested cross-validation framework. Furthermore, tuning hyperparameters at DOY175 and fixing them for earlier cutoffs may introduce unfair comparisons across temporal windows, as feature distributions differ with truncation date. A sensitivity analysis in which representative early cutoffs are independently tuned would strengthen the justification for the fixed-parameter approach.

(5) The interpretation of explainability and uncertainty requires a more rigorous methodological framing. SHAP values are computed on a model refitted to the full dataset, without clear indication that explanations are derived under strict out-of-field conditions. This risks interpreting in-sample associations as robust mechanisms. Out-of-fold explanations or fold-wise aggregation under the LOFO scheme would provide stronger support. Regarding uncertainty, intervals derived from the distribution of individual trees in a random forest do not constitute formal statistical confidence intervals. The manuscript should avoid overinterpretation and clarify that these represent model disagreement, or consider more principled interval estimation approaches.

(6) The claimed novelty would benefit from clearer articulation of measurable technical contributions. At present, the framework largely combines established components. To substantiate innovation, the authors should isolate and quantify the incremental contribution of early-season truncation design, cross-field generalization strategy, or handling of scale-mismatched covariates through targeted ablation or comparative experiments.

(7) The reporting of figures and performance metrics requires clearer definition of statistical aggregation. It is not specified whether R² and RMSE are calculated over all pooled pixels or averaged across fields after per-field evaluation. The statistical unit underlying scatter or density plots should be clearly stated in figure captions. In addition, information regarding changes in sample size before and after yield rasterization, as well as any outlier filtering procedures, should be reported to enhance transparency and reproducibility.

In summary, the manuscript would substantially benefit from correcting the vegetation index formulas and ensuring consistency in band definitions, clarifying the label construction and sample structure, and providing detailed documentation of the LOFO and cutoff-specific data partitioning procedures. Only after these foundational issues are resolved can the reported results be considered fully verifiable.

Author Response

Overall, the manuscript addresses an application-driven question, namely early-season pixel-level yield prediction with field-level generalization. It combines Sentinel-2 phenological metrics with climate reanalysis data, adopts a leave-one-field-out cross-validation scheme to emphasize cross-field transferability, and incorporates permutation importance, SHAP, and ensemble-based uncertainty estimates. The topic is relevant and practically meaningful. However, in its current form, several key methodological components are insufficiently defined or internally inconsistent, critical formulas contain substantive errors, and aspects of the data structure and evaluation protocol raise reproducibility concerns. These issues directly affect the credibility and generalizability of the reported results. 

Response: We thank you for the detailed and constructive critique. The manuscript has been substantially revised to correct formula errors, clarify sample structure, improve reproducibility, and strengthen methodological rigor. Each point is addressed below.

Comment 1: The definitions of vegetation indices contain clear errors and internal inconsistencies and must be corrected as a priority. In Section 2.3, the NDVI and NDRE formulas are written in the form “(ρNIR + ρred)/(ρNIR − ρred)” and similar variants, which contradict the standard definitions and would fundamentally alter the sign and value range of the indices. In addition, Equations (2) and (3) for NDRE appear duplicated, and no explicit formula is provided for NDWI, despite references to ρgreen in the text. As presented, the cited NDWI definition, the band combinations actually used, and the mathematical expressions are not aligned. Until these inconsistencies are resolved, subsequent interpretations regarding NDWI slope, NDRE_max importance, and related findings lack a rigorous mathematical foundation.

Response 1: We sincerely thank the reviewer for identifying this issue. The NDVI, NDRE, and NDWI formulas have been corrected to their standard definitions. The duplicated NDRE equation has been removed and internal consistency has been verified. All downstream interpretations remain valid because the correct implementation was used in the code; only the manuscript formulas contained typographical errors.
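
For reference, the standard normalised-difference forms can be sketched as below. The NDVI and NDRE definitions are the conventional ones. The green/NIR form of NDWI is an assumption based on the reviewer's mention of ρgreen; the response does not state which NDWI variant was used. The reflectance values are hypothetical.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    denominator = a + b
    if denominator == 0:
        return None
    return (a - b) / denominator


def ndvi(nir, red):
    # Standard NDVI: (NIR - red) / (NIR + red)
    return normalized_difference(nir, red)


def ndre(nir, red_edge):
    # Standard NDRE: (NIR - red edge) / (NIR + red edge)
    return normalized_difference(nir, red_edge)


def ndwi(green, nir):
    # Green/NIR NDWI: (green - NIR) / (green + NIR). Assumed variant,
    # chosen because the review refers to a green band.
    return normalized_difference(green, nir)


# Hypothetical surface reflectances for one vegetated pixel.
pixel = {"green": 0.08, "red": 0.05, "red_edge": 0.20, "nir": 0.45}

indices = {
    "NDVI": ndvi(pixel["nir"], pixel["red"]),
    "NDRE": ndre(pixel["nir"], pixel["red_edge"]),
    "NDWI": ndwi(pixel["green"], pixel["nir"]),
}

# The inverted form flagged by Reviewer 3, (NIR + red) / (NIR - red),
# is not bounded by [-1, 1].
inverted_ndvi = (pixel["nir"] + pixel["red"]) / (pixel["nir"] - pixel["red"])

for name, value in indices.items():
    print(f"{name}: {value:.3f}")
print(f"inverted NDVI form: {inverted_ndvi:.3f}")
```

With these inputs the three standard indices stay within [-1, 1], while the inverted expression exceeds 1, which illustrates the reviewer's point about the value range.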

Comment 2: The definition of the modeling unit and label construction is unclear, raising the possibility of label leakage or optimistic bias due to repeated samples. The manuscript states that harvester yield data were rasterized to Sentinel-2 resolution and modeled at pixel level, but also mentions that “each pixel inheriting the observed yield of its corresponding field.” These two descriptions imply different label structures. If field-level mean yield is assigned to all pixels, the claimed sub-field prediction should be reconsidered conceptually. If pixel-level rasterized yield is used, the rasterization procedure, handling of overlapping tracks, and outlier filtering must be clearly described. More critically, when constructing phenological features under multiple cutoff DOYs, it is unclear whether the same pixel contributes multiple samples at different temporal truncation points. LOFO partitioning by field alone does not automatically prevent information leakage across temporal representations of the same spatial unit. The manuscript should explicitly define the sample table structure and the training–testing split logic across cutoffs.

Response 2: We appreciate the reviewer’s detailed and insightful comments regarding the modelling unit and potential information leakage. We have revised Sections 2.3 and 2.5.2 to clarify that yield was rasterised to the Sentinel-2 grid by averaging harvester observations within each pixel, resulting in a unique pixel-level yield value. Thus, the response variable reflects sub-field variability rather than field-level mean yield.

Regarding temporal truncation, models were trained independently for each cutoff DOY. For a given cutoff, each pixel contributed only one sample, derived from observations up to that date. No pixel contributed multiple representations within the same model. LOFO CV ensured that all pixels from a given field were held out simultaneously, preventing spatial leakage between training and testing sets.
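
The split logic described above can be illustrated with a small sketch. The sample table, field labels, and function name are hypothetical; scikit-learn's `LeaveOneGroupOut` produces equivalent splits when the field identifier is passed as the group, but plain Python is used here to keep the example self-contained.

```python
def lofo_splits(field_ids):
    """Yield (held_out_field, train_indices, test_indices), one field at a time."""
    for held_out in sorted(set(field_ids)):
        train = [i for i, f in enumerate(field_ids) if f != held_out]
        test = [i for i, f in enumerate(field_ids) if f == held_out]
        yield held_out, train, test


# Hypothetical sample table for ONE cutoff DOY: exactly one row per pixel.
samples = [
    {"pixel": "p1", "field": "A"},
    {"pixel": "p2", "field": "A"},
    {"pixel": "p3", "field": "B"},
    {"pixel": "p4", "field": "B"},
    {"pixel": "p5", "field": "C"},
]

# Each pixel appears once in the table for this cutoff.
pixel_ids = [s["pixel"] for s in samples]
one_row_per_pixel = len(pixel_ids) == len(set(pixel_ids))

field_ids = [s["field"] for s in samples]
splits = list(lofo_splits(field_ids))

for held_out, train, test in splits:
    train_fields = {field_ids[i] for i in train}
    # No pixel from the held-out field may appear in the training set.
    print(held_out, "train fields:", sorted(train_fields), "test rows:", test)
```

Since a separate table of this kind exists for each cutoff and models are trained per cutoff, truncated representations of the same pixel never share a model.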

Comment 3: The technical logic regarding feature construction and scale matching remains incomplete. CHIRPS and ERA5-Land data have substantially coarser spatial resolution than Sentinel-2, yet the manuscript does not specify the resampling strategy or how grid boundaries intersecting field polygons are handled. In a pixel-level modeling framework, it is important to demonstrate that coarse-resolution climate variables are not merely encoding field-level or regional gradients. A variance decomposition analysis comparing within-field and between-field variability, as well as baseline models using climate variables alone, would help clarify whether the model is learning meaningful agronomic signals rather than spatial structure.

Response 3: We thank the reviewer for highlighting the importance of scale compatibility between predictors. We have clarified the spatial alignment procedure in Section 2.4.2, where coarse-resolution climate and soil datasets are sampled at the Sentinel-2 pixel centroid using nearest-neighbour lookup. As a consequence, multiple Sentinel-2 pixels may share identical climate values when located within the same CHIRPS or ERA5-Land grid cell (Lines 251-260).
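
A minimal sketch of nearest-neighbour sampling at pixel centroids follows, assuming a regular, north-up coarse grid in projected metres. The grid origin, cell size, rainfall values, and centroid coordinates are all hypothetical, and on a regular grid the nearest cell centre is simply the cell that contains the point.

```python
import math


def grid_cell_index(x, y, x_min, y_max, cell_size, n_rows, n_cols):
    """Return (row, col) of the coarse cell containing the point (x, y)."""
    col = math.floor((x - x_min) / cell_size)
    row = math.floor((y_max - y) / cell_size)  # rows count down from the top edge
    if not (0 <= row < n_rows and 0 <= col < n_cols):
        raise IndexError("point lies outside the coarse grid")
    return row, col


def sample_nearest(grid, x, y, x_min, y_max, cell_size):
    """Look up the coarse-grid value at a fine-pixel centroid."""
    row, col = grid_cell_index(
        x, y, x_min, y_max, cell_size, len(grid), len(grid[0])
    )
    return grid[row][col]


# Hypothetical 2 x 2 coarse grid of cumulative rainfall (mm), 5 km cells.
rainfall = [
    [310.0, 325.0],
    [298.0, 305.0],
]
X_MIN, Y_MAX, CELL = 0.0, 10_000.0, 5_000.0

# Hypothetical Sentinel-2 pixel centroids (metres); the first two are 20 m apart.
centroids = [(1010.0, 9010.0), (1030.0, 9010.0), (6010.0, 4000.0)]

sampled = [sample_nearest(rainfall, x, y, X_MIN, Y_MAX, CELL) for x, y in centroids]
print(sampled)
```

The first two centroids fall in the same coarse cell and therefore receive identical values, which is the behaviour the response describes for pixels sharing a CHIRPS or ERA5-Land cell.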

To explicitly assess the spatial support of predictors, we conducted a variance decomposition analysis separating within-field and between-field variability, which is now described in Section 2.6.1 (lines 380-386). Results (included in the Supplementary Materials, Figure S1) show that cumulative rainfall and PET exhibit negligible within-field variability (median within-field standard deviation ≈ 0 for both variables), with most variance occurring between fields (between-field fraction ≈ 0.72 for rainfall and ≈ 0.94 for PET). In contrast, phenological metrics derived from Sentinel-2 show substantial within-field variability (e.g., NDRE_max median within-field standard deviation ≈ 0.039), confirming their ability to capture sub-field canopy heterogeneity.
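
One way to compute such a decomposition is sketched below. The definitions are assumptions: the between-field fraction is taken as the between-field sum of squares divided by the total sum of squares, and the within-field spread as the median of per-field population standard deviations. The response does not give the authors' exact formulas, and the data are hypothetical.

```python
from statistics import mean, median, pstdev


def variance_decomposition(values_by_field):
    """Summarise within-field spread and the between-field share of variance."""
    all_values = [v for vals in values_by_field.values() for v in vals]
    grand_mean = mean(all_values)

    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(vals) * (mean(vals) - grand_mean) ** 2
        for vals in values_by_field.values()
    )
    within_sd = [pstdev(vals) for vals in values_by_field.values()]

    fraction = ss_between / ss_total if ss_total > 0 else float("nan")
    return {"median_within_sd": median(within_sd), "between_fraction": fraction}


# Hypothetical coarse climate covariate: constant inside each field.
rainfall = {"A": [300.0] * 3, "B": [320.0] * 3, "C": [340.0] * 3}

# Hypothetical fine-resolution phenology metric: varies inside each field.
ndre_max = {"A": [0.50, 0.60, 0.70], "B": [0.55, 0.65, 0.75]}

rain_stats = variance_decomposition(rainfall)
ndre_stats = variance_decomposition(ndre_max)
print("rainfall:", rain_stats)
print("NDRE_max:", ndre_stats)
```

In this toy case the rainfall covariate has zero within-field spread and all of its variance lies between fields, whereas most of the NDRE_max variance is within fields.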

In addition, we evaluated baseline models using climate covariates alone, which explained only a limited fraction of yield variability under LOFO cross-validation (24% and 28% with the RF and LGBM models, respectively; lines 449-456). These results confirm that coarse-resolution climate variables primarily encode between-field environmental gradients rather than intra-field spatial structure. Their contribution therefore lies in modulating field-level yield potential, while high-resolution phenological predictors drive pixel-scale spatial discrimination. These analyses have been incorporated in the revised manuscript.

Comment 4: The cross-validation, sampling, and weighting procedures lack sufficient detail for reproducibility. Although the manuscript states that the number of pixels per field was capped and field-balanced weights were applied, no explicit limits, sampling strategies, or weighting formulas are provided. It is also unclear whether the same strategy was consistently applied within the nested cross-validation framework. Furthermore, tuning hyperparameters at DOY175 and fixing them for earlier cutoffs may introduce unfair comparisons across temporal windows, as feature distributions differ with truncation date. A sensitivity analysis in which representative early cutoffs are independently tuned would strengthen the justification for the fixed-parameter approach.

Response 4: We agree and have expanded the description of sampling and weighting for reproducibility (Section 2.5.2). We now explicitly state that a maximum cap of 100 pixels per field was defined as a safeguard against dominance by very large fields; however, no field exceeded this threshold in our dataset, so no pixel subsampling was applied and all pixels were retained. We also added the exact field-balanced weighting scheme used during training, $w_i = 1/n_{f(i)}$, where $n_{f(i)}$ is the number of training pixels in the same field as pixel $i$. Importantly, weights are computed within each outer LOFO training fold using training data only, and the same weighting is passed through the nested tuning procedure to ensure consistency.
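
A field-balanced weighting of this kind can be sketched as follows, assuming that "field-balanced" means each pixel is weighted by the inverse of its field's training pixel count (our reading, not a confirmed detail of the authors' implementation). The field labels are hypothetical.

```python
from collections import Counter


def field_balanced_weights(train_field_ids):
    """Weight each training pixel by 1 / (number of training pixels in its field)."""
    counts = Counter(train_field_ids)
    return [1.0 / counts[field] for field in train_field_ids]


# Hypothetical training fold: field A contributes four pixels, field B two.
train_fields = ["A", "A", "A", "A", "B", "B"]
weights = field_balanced_weights(train_fields)

# With inverse-count weights, every field carries the same total weight.
total_by_field = {}
for field, weight in zip(train_fields, weights):
    total_by_field[field] = total_by_field.get(field, 0.0) + weight

print(weights)
print(total_by_field)
```

Computing the counts from the training fold only, as the response states, keeps the held-out field from influencing the weights.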

In our framework, hyperparameters were tuned using nested LOFO cross-validation at cutoff DOY 175 and subsequently fixed for earlier cutoff dates. This design ensures that changes in model performance across cutoff DOYs reflect the progressive availability of seasonal information rather than differences arising from repeated hyperparameter optimisation. Cutoff DOY 175 corresponds to a late developmental stage where spectral and climatic predictors contain the most complete information about crop conditions, providing a stable basis for model configuration. Tree-based ensemble methods such as Random Forest and gradient boosting are generally robust to moderate hyperparameter variation; therefore, a single tuned configuration yields a consistent comparison across temporal truncation levels. We have clarified this rationale in Section 2.5.4 (lines 357-365).

Comment 5: The interpretation of explainability and uncertainty requires a more rigorous methodological framing. SHAP values are computed on a model refitted to the full dataset, without clear indication that explanations are derived under strict out-of-field conditions. This risks interpreting in-sample associations as robust mechanisms. Out-of-fold explanations or fold-wise aggregation under the LOFO scheme would provide stronger support. Regarding uncertainty, intervals derived from the distribution of individual trees in a random forest do not constitute formal statistical confidence intervals. The manuscript should avoid overinterpretation and clarify that these represent model disagreement, or consider more principled interval estimation approaches.

Response 5: Thank you for the comment. We agree that explainability should reflect the model behaviour under spatially independent validation. In the revised manuscript, SHAP values are explicitly described as being derived from predictions generated during the leave-one-field-out (LOFO) cross-validation procedure, ensuring that explanations correspond to out-of-sample model behaviour. The relevant methodological description has been clarified in Section 2.6.1, and the wording in the Results section and Figure 8 caption has been revised accordingly to avoid ambiguity.

Additionally, following the reviewer’s suggestion, we replaced the term “confidence interval” with “ensemble prediction interval” throughout the manuscript to more accurately describe the dispersion derived from the Random Forest ensemble.
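
The distinction can be made concrete with a short sketch of an ensemble prediction interval taken from the spread of per-tree predictions. The 5th and 95th percentile bounds and the per-tree values are assumptions for illustration; the response does not state which percentiles were used. With scikit-learn, per-tree predictions could be collected from the fitted forest's `estimators_` list, but plain numbers are used here.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile of pre-sorted values, q in [0, 100]."""
    if not sorted_values:
        raise ValueError("no values supplied")
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def ensemble_interval(tree_predictions, lower_q=5, upper_q=95):
    """Return (low, high) bounds describing disagreement among ensemble members."""
    ordered = sorted(tree_predictions)
    return percentile(ordered, lower_q), percentile(ordered, upper_q)


# Hypothetical per-tree yield predictions (t/ha) for a single pixel.
tree_preds = [6.8, 7.1, 7.4, 7.0, 7.9, 6.5, 7.2, 7.3, 7.6, 7.0]

point_prediction = sum(tree_preds) / len(tree_preds)
low, high = ensemble_interval(tree_preds)
print(f"prediction {point_prediction:.2f}, ensemble interval [{low:.3f}, {high:.3f}]")
```

The interval describes how much the trees disagree for this pixel. It carries no coverage guarantee, which is why "confidence interval" would overstate it.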

Comment 6: The claimed novelty would benefit from clearer articulation of measurable technical contributions. At present, the framework largely combines established components. To substantiate innovation, the authors should isolate and quantify the incremental contribution of early-season truncation design, cross-field generalization strategy, or handling of scale-mismatched covariates through targeted ablation or comparative experiments.

Response 6: We thank the reviewer for this suggestion. We have revised the Introduction and Results to more explicitly articulate the measurable technical contributions of this work. In particular, we now frame the comparison of covariate sets and cutoff DOYs as a structured ablation-style experimental design that quantifies the incremental effects of (i) early-season phenological truncation, (ii) inclusion of cumulative climate covariates, and (iii) strict field-level cross-validation for spatial generalisation.

The Introduction now explicitly states:

“Unlike prior studies that rely on random cross-validation or full-season predictors [17], this work systematically evaluates early-season truncation under strict field-level generalisation and explicitly examines the scale compatibility of heterogeneous covariates within a unified modelling framework.”

In the Results (lines 441–448), we now quantify these incremental effects. For example, under Random Forest at DOY 175, inclusion of cumulative climate variables reduced RMSE from 1.00 to 0.87 t ha⁻¹ (13% reduction) and increased R² from 0.64 to 0.73 (14% relative improvement) relative to phenology-only models. In contrast, adding soil properties increased RMSE to 0.96 t ha⁻¹ and reduced R² to 0.67, indicating no additional transferable performance gain under field-level validation.

These results demonstrate that between-field climatic gradients meaningfully enhance spatial generalisation, whereas soil variability does not provide additional predictive value at this modelling scale. We believe this explicit quantification of incremental predictor contributions under strict spatial separation clarifies the measurable methodological advances of the proposed framework.
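The percentages quoted above for the climate covariates follow directly from the reported metrics. A minimal check of that arithmetic, using only the RMSE and R² values stated in this response (the helper `relative_change` is a hypothetical name):

```python
def relative_change(before, after):
    """Signed change of `after` relative to `before`, in percent."""
    return (after - before) / before * 100.0

# Random Forest at DOY 175: phenology-only versus phenology plus cumulative climate
rmse_change = relative_change(1.00, 0.87)
r2_change = relative_change(0.64, 0.73)

print(f"RMSE change: {rmse_change:.1f}%")
print(f"R2 change: {r2_change:.1f}%")
```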

Comment 7: In addition, information regarding changes in sample size before and after yield rasterization, as well as any outlier filtering procedures, should be reported to enhance transparency and reproducibility.

Response 7: Thank you for the comment. We have clarified that pooled performance metrics (R², RMSE, MAE) were computed across all held-out pixels concatenated across LOFO folds. Per-field metrics were calculated separately and summarised using boxplots. Figure captions now explicitly state the statistical unit. We also report that 917 rasterised pixels remained after boundary buffering and quality filtering.
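To make the distinction between the two statistical units concrete, the sketch below computes a pooled RMSE over concatenated held-out pixels and a separate RMSE per held-out field. The field names and the observed and predicted values are invented for illustration and are not taken from the study.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (observed, predicted) pairs."""
    return math.sqrt(sum((obs - pred) ** 2 for obs, pred in pairs) / len(pairs))

# Hypothetical held-out results per LOFO fold: field id -> [(observed, predicted), ...]
held_out = {
    "field_A": [(3.0, 3.2), (3.5, 3.1), (2.8, 2.9)],
    "field_B": [(4.1, 3.6), (4.4, 4.0)],
}

# Pooled metric: concatenate all held-out pixels across folds, then compute once
pooled_pairs = [pair for pairs in held_out.values() for pair in pairs]
pooled_rmse = rmse(pooled_pairs)

# Per-field metric: one value per held-out field (the unit summarised in boxplots)
per_field_rmse = {field: rmse(pairs) for field, pairs in held_out.items()}

print(f"pooled RMSE: {pooled_rmse:.3f}")
for field, value in per_field_rmse.items():
    print(f"{field} RMSE: {value:.3f}")
```

Because fields contribute different numbers of pixels, the pooled value is weighted towards larger fields, whereas the per-field values treat each field as one unit.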

Comment 8: In summary, the manuscript would substantially benefit from correcting the vegetation index formulas and ensuring consistency in band definitions, clarifying the label construction and sample structure, and providing detailed documentation of the LOFO and cutoff-specific data partitioning procedures. Only after these foundational issues are resolved can the reported results be considered fully verifiable.

Response 8: We have carefully revised the vegetation index formulas to ensure consistency in band definitions and corrected typographical formatting issues. Section 2.3 now clearly describes the rasterisation procedure (averaging harvester points within Sentinel-2 pixels) and confirms that pixel-level yield labels were used throughout.
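A simplified sketch of averaging harvester points within pixels is given below. It assumes projected coordinates in metres, a 10 m square grid anchored at the coordinate origin, and invented point values; none of these details are taken from the manuscript, and `rasterise_mean` is a hypothetical helper.

```python
from collections import defaultdict

def rasterise_mean(points, pixel_size=10.0):
    """Average point yields that fall within the same square grid cell.

    points: iterable of (x, y, yield_value) in projected metres.
    Returns a dict mapping (column, row) cell indices to the mean yield.
    """
    cells = defaultdict(list)
    for x, y, value in points:
        # Floor division assigns each point to the cell that contains it
        key = (int(x // pixel_size), int(y // pixel_size))
        cells[key].append(value)
    return {key: sum(values) / len(values) for key, values in cells.items()}

# Hypothetical harvester points: two in the first cell, one in its neighbour
points = [(2.0, 3.0, 3.0), (8.0, 9.0, 4.0), (12.0, 1.0, 5.0)]
pixel_yield = rasterise_mean(points)
print(pixel_yield)
```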

We also clarified that for each cutoff DOY, a separate dataset snapshot was constructed, and LOFO partitioning was applied independently within each cutoff, preventing temporal leakage between representations. Lines 228-231 of Section 2.4.1 now state: “For each cutoff DOY, phenological features were recomputed independently from truncated time series, resulting in distinct modelling datasets. Cross-validation was applied separately within each cutoff, ensuring no leakage across temporal snapshots.”
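The partitioning logic described here can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the samples, the vegetation index values, the second cutoff (DOY 145), and the "on or before the cutoff" truncation rule are all hypothetical, as are the helper names `truncate` and `lofo_splits`.

```python
def truncate(series, cutoff_doy):
    """Keep only observations acquired on or before the cutoff day of year."""
    return [(doy, value) for doy, value in series if doy <= cutoff_doy]

def lofo_splits(field_ids):
    """Yield (train_idx, test_idx), holding out every sample of one field at a time."""
    for held_out in sorted(set(field_ids)):
        train = [i for i, f in enumerate(field_ids) if f != held_out]
        test = [i for i, f in enumerate(field_ids) if f == held_out]
        yield train, test

# Hypothetical pixel samples: (field id, [(DOY, vegetation index value), ...])
samples = [
    ("A", [(100, 0.31), (130, 0.52), (160, 0.74), (190, 0.60)]),
    ("A", [(100, 0.28), (130, 0.49), (160, 0.70), (190, 0.58)]),
    ("B", [(100, 0.35), (130, 0.57), (160, 0.79), (190, 0.66)]),
    ("C", [(100, 0.30), (130, 0.50), (160, 0.72), (190, 0.61)]),
]
field_ids = [field for field, _ in samples]

for cutoff in (145, 175):  # example cutoffs; only DOY 175 is named in the responses
    # Build the snapshot for this cutoff from scratch, then split it on its own
    snapshot = [truncate(series, cutoff) for _, series in samples]
    for train_idx, test_idx in lofo_splits(field_ids):
        train_fields = {field_ids[i] for i in train_idx}
        test_fields = {field_ids[i] for i in test_idx}
        # No field may appear on both sides of a split
        assert train_fields.isdisjoint(test_fields)
```

Because each snapshot is rebuilt from the truncated series before splitting, no feature computed with later observations can enter a model evaluated at an earlier cutoff.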

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The authors' efforts are appreciated. This version has incorporated the recommendations made, and I consider it publishable.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have revised the manuscript according to my comments carefully. The revised manuscript has been greatly improved compared with the original version. I have no further comments.
