Article
Peer-Review Record

Generalization Enhancement Strategies to Enable Cross-Year Cropland Mapping with Convolutional Neural Networks Trained Using Historical Samples

Remote Sens. 2025, 17(3), 474; https://doi.org/10.3390/rs17030474
by Sam Khallaghi 1,2, Rahebeh Abedi 1, Hanan Abou Ali 3, Hamed Alemohammad 1,2, Mary Dziedzorm Asipunu 4, Ismail Alatise 1, Nguyen Ha 1, Boka Luo 1, Cat Mai 1, Lei Song 1,5, Amos Olertey Wussah 4, Sitian Xiong 1, Yao-Ting Yao 1,2, Qi Zhang 1 and Lyndon D. Estes 1,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 7 November 2024 / Revised: 23 December 2024 / Accepted: 13 January 2025 / Published: 30 January 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes a method aimed at improving the generalization ability of cross-year farmland mapping models. Although the research design is rigorous and the experimental section is fairly detailed, there is still room for improvement, particularly in the detail of the experimental description and interpretation of the results, which is crucial for agricultural monitoring.

1. While different normalization methods are compared, the paper does not provide sufficient explanation of the reasons for choosing each method or their specific impact on model performance. It is recommended to further elaborate on the rationale for these choices and provide a deeper analysis of the effects of each method.

2. Although various data augmentation techniques are mentioned, their integration with the training process and specific contributions to model performance are not clearly explained. It is suggested to provide more details on the criteria for selecting these techniques and how they enhance model performance.

3. The paper mentions using 4,977 labeled images for training but does not specify details such as data splitting, training cycles, or learning rate adjustments. More information about the training process should be provided to ensure the reliability of the results.

4. While confusion matrices and probability maps are provided, there is insufficient visualization to comprehensively assess model performance. Additional visualizations, such as farmland mapping results for different years and regions, and the impact of different techniques on these results, should be included.

5. Although the paper mentions omission errors in some years' farmland mapping, it does not analyze the underlying causes of these errors. A more detailed error analysis is recommended to identify performance bottlenecks in the model.

6. The paper discusses the improvement in model generalization, but lacks a comparison with existing methods. It is suggested to include a discussion of the advantages and limitations of the proposed method in comparison to current techniques.

7. The paper mentions the use of the U-Net model but does not provide sufficient details about its configuration. It is recommended to include a detailed description of the U-Net architecture, including the number of layers, filter sizes, and activation functions, and to explain why U-Net was chosen.

8. The paper mentions the use of hyperparameters but does not explain the selection process in detail. It is advised to elaborate on the hyperparameter optimization strategy and discuss the impact of these parameters on model performance.

Comments on the Quality of English Language

Overall OK, some parts need to be optimized.

Author Response

This paper proposes a method aimed at improving the generalization ability of cross-year farmland mapping models. Although the research design is rigorous and the experimental section is fairly detailed, there is still room for improvement, particularly in the detail of the experimental description and interpretation of the results, which is crucial for agricultural monitoring.

Comment 1. While different normalization methods are compared, the paper does not provide sufficient explanation of the reasons for choosing each method or their specific impact on model performance. It is recommended to further elaborate on the rationale for these choices and provide a deeper analysis of the effects of each method.

Response 1. We appreciate the reviewer’s feedback and the opportunity to clarify our rationale for exploring different input normalization methods and their impact on model performance.

The effect of input normalization on model predictions is not yet fully understood in the context of remote sensing or even computer vision, and it is often an overlooked aspect of the pre-processing pipeline. Many existing studies do not specify their normalization procedures, and among those that do, per-band z-score standardization is the most commonly reported approach. However, one of our goals was to highlight the importance of normalization and demonstrate its potential impact on model performance, encouraging the community to be more mindful of this critical but under-discussed component of the learning framework.

To achieve this, we conducted an empirical analysis of the performance of several commonly used normalization techniques, as presented in Table 1. Our analysis provides a comprehensive evaluation from both temporal and spatial perspectives, ensuring that the reported effects are robust across time and space. We also explored the impact of normalization on key properties of remote sensing data, such as pairwise band correlation and brightness value distribution, as both of these properties can influence the learning dynamics of segmentation models.

However, we acknowledge that neither of these properties alone can provide a conclusive basis for selecting a normalization method. In our newly added analysis (Section 3.3 of the Results), we show that the spectral properties are not strongly linked to how the model assigns each pixel to the crop or non-crop class over the years, which makes the impact of input normalization on model performance all the more intriguing. Ultimately, we recommend that practitioners treat normalization as a tunable parameter within their recognition frameworks. Exploring different normalization techniques as part of a broader parameter-tuning strategy can help tailor the preprocessing pipeline to the specific characteristics of their data and task, especially for tasks other than cropland mapping.
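For illustration, a minimal sketch of the per-band z-score standardization mentioned above as the most commonly reported approach, assuming image chips stored as NumPy arrays of shape (bands, height, width); the band statistics shown are placeholder values, not those used in our pipeline:

```python
import numpy as np

def per_band_zscore(chip, band_means, band_stds, eps=1e-6):
    """Standardize each band of a (bands, H, W) chip with dataset-level statistics."""
    means = np.asarray(band_means, dtype=np.float32).reshape(-1, 1, 1)
    stds = np.asarray(band_stds, dtype=np.float32).reshape(-1, 1, 1)
    return (chip.astype(np.float32) - means) / (stds + eps)

# Placeholder usage with hypothetical per-band statistics for a 4-band chip:
chip = np.random.randint(0, 10000, size=(4, 224, 224))
normalized = per_band_zscore(chip,
                             band_means=[450.0, 600.0, 700.0, 2500.0],
                             band_stds=[120.0, 150.0, 180.0, 600.0])
```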

Comment 2. Although various data augmentation techniques are mentioned, their integration with the training process and specific contributions to model performance are not clearly explained. It is suggested to provide more details on the criteria for selecting these techniques and how they enhance model performance.

Response 2. In our study, we used two types of augmentations: geometric and photometric. Geometric augmentations, such as flipping, rotation, and resizing, are widely used in recognition frameworks, and their benefits have been extensively documented in prior research. As a result, we considered it unnecessary to elaborate on their contributions to the model's performance, as their value is already well established in the literature.

However, we have added further clarification regarding the implementation of these augmentations in lines 293-305 (updated version):

“To further bolster the model’s resilience against overfitting, and to enhance its adaptability to varying crop patterns and image reflectance artifacts and domain shift, we expanded our training dataset with a combination of spatial and photometric transformations, thereby augmenting data diversity and robustness. The augmentations were applied on-the-fly with a 50% chance in the order of flip, rotation between ±90°, uniform resize, and photometric transformations. Flip was randomly selected from one of the horizontal, vertical, or diagonal types, and for photometric augmentation, one of the gamma correction, Gaussian noise, additive, and multiplicative noise gets randomly selected with equal probability and applied in each epoch. We used flip, rotation and resize to increase the input diversity and make the model invariant to size and orientations [71], as these properties are arbitrary and can vary substantially for crop fields, but, given their existing widespread and standard use [15, 25], did not further analyze their effects on the model performance.”

However, we did conduct a thorough evaluation of photometric augmentations because they directly address the temporal domain shift problem. We reported the results of these experiments in Table 2 and Figure 5, as well as additional details and visualizations in Appendix Figures A2, A3, and A4 (A4 added in the update).
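For illustration, a minimal sketch of the on-the-fly augmentation procedure quoted above (flip, rotation, resize, then one photometric transform, each step applied with 50% probability); the parameter ranges and the simplified 90° rotation are illustrative assumptions rather than our exact implementation, and the uniform resize step is omitted for brevity:

```python
import random
import numpy as np

def random_photometric(img):
    """Apply one of gamma correction, Gaussian, additive, or multiplicative noise,
    selected with equal probability (ranges are illustrative)."""
    choice = random.choice(["gamma", "gaussian", "additive", "multiplicative"])
    if choice == "gamma":
        return np.clip(img, 0, None) ** random.uniform(0.8, 1.2)
    if choice == "gaussian":
        return img + np.random.normal(0.0, 0.05 * img.std(), img.shape)
    if choice == "additive":
        return img + random.uniform(-0.05, 0.05) * img.mean()
    return img * random.uniform(0.95, 1.05)

def augment(img, mask, p=0.5):
    """img: (bands, H, W) float array; mask: (1, H, W) label array."""
    if random.random() < p:  # flip: horizontal, vertical, or diagonal
        axes = random.choice([(1,), (2,), (1, 2)])
        img, mask = np.flip(img, axes), np.flip(mask, axes)
    if random.random() < p:  # rotation within ±90° (simplified here to ±90° steps)
        k = random.choice([-1, 1])
        img, mask = np.rot90(img, k, (1, 2)), np.rot90(mask, k, (1, 2))
    if random.random() < p:  # photometric transform applied to the image only
        img = random_photometric(img)
    return img.copy(), mask.copy()
```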

Comment 3. The paper mentions using 4,977 labeled images for training but does not specify details such as data splitting, training cycles, or learning rate adjustments. More information about the training process should be provided to ensure the reliability of the results.

Response 3. Thank you for your comment. We believe these details were already provided in the paper. Below, we highlight the relevant sections (which we edited slightly for readability):

Lines 254-262 (updated version; data splitting for training and validation):

“To train our cropland mapping model, we assembled a set of 4,977 labeled images developed through manual digitization of field boundaries in the 2018 imagery, primarily as part of a prior mapping initiative [11]. This dataset includes 4,229 labeled samples encompassing four different areas across the Ghanaian landscape, where annual crops such as maize are primarily produced. The dataset was further enriched with an additional 100 samples from Nigeria, 70 samples from Congo, and 578 samples from Tanzania derived from a similar procedure in 2020 to broaden the range of agronomic diversity. The samples were divided into 4,781 samples for training, with the remaining 196 (4%), which have the highest label quality, reserved for model validation in 2018 (Figure 1).”

Lines 355-359 (updated version; test dataset):

“We evaluated model performance at a larger extent used for map production. To do so, we randomly selected 4 tiles of size 2358x2358 from the 5 years and manually annotated the crop fields in the resulting 20 scenes, providing an independent test set in which each scene was equivalent in extent to nearly 111 contiguous training/validation chips (see Appendix A2).”

Lines 343-348 (updated version; training process):

“We developed our pipeline using the PyTorch 1.9.0 library and trained our large network (157 M parameters) on an A30 GPU machine for 120 epochs with a batch size of 32. After running initial experiments on SGD, SGD with momentum, Nesterov, Adam, and Sharpness-Aware Minimization (SAM) optimizers [79], we adopted Nesterov as the optimizer in our pipeline. The initial learning rate was set to 0.003, which was updated with a polynomial learning rate decay policy with a power of 0.8.”
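For illustration, a minimal sketch of this training setup, assuming a PyTorch model named `model`; the momentum value and the use of `LambdaLR` to express the polynomial decay are assumptions, since the exact scheduler implementation is not stated:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

EPOCHS = 120
model = torch.nn.Conv2d(4, 3, kernel_size=3)  # stand-in for the 157M-parameter U-Net

# Nesterov momentum SGD with the reported initial learning rate of 0.003
# (momentum=0.9 is an assumed value; Nesterov requires a non-zero momentum).
optimizer = torch.optim.SGD(model.parameters(), lr=0.003, momentum=0.9, nesterov=True)

# Polynomial learning-rate decay with power 0.8: lr_t = lr_0 * (1 - t / T) ** 0.8
scheduler = LambdaLR(optimizer, lr_lambda=lambda epoch: (1.0 - epoch / EPOCHS) ** 0.8)

for epoch in range(EPOCHS):
    # ... one training pass over batches of size 32 would run here ...
    scheduler.step()
```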

Comment 4. While confusion matrices and probability maps are provided, there is insufficient visualization to comprehensively assess model performance. Additional visualizations, such as farmland mapping results for different years and regions, and the impact of different techniques on these results, should be included.

Response 4. Thank you, we understand the importance of providing comprehensive visualizations to better assess model performance across different years, regions, and model configurations. To address this, we have already included several relevant visualizations and quantitative comparisons in the manuscript. Specifically:

Table 2 presents the quantitative results for different years, comparing model performance with and without key techniques such as photometric augmentation, dropout, and MC-dropout. This table provides a clear comparison of how these techniques influence model performance over time.

Figure 5 and the Appendix Figures (A2, A3, A4) present spatial confusion matrices for each tile (covering different regions in Ghana) across multiple years. These confusion matrices visualize the model’s predictive accuracy for each tile, and we have repeated this analysis for the different model configurations (with and without photometric augmentation, dropout, and MC-dropout) to highlight the effect of these techniques on model performance.

Additionally, in response to this suggestion, we have added new visualizations (Figures 6 and A5, and Table 4) to the manuscript to illustrate the impact of different normalization techniques.

Comment 5. Although the paper mentions omission errors in some years' farmland mapping, it does not analyze the underlying causes of these errors. A more detailed error analysis is recommended to identify performance bottlenecks in the model.

Response 5. Explaining how the model makes decisions is not trivial and remains an unsolved problem in computer vision. We believe a comprehensive analysis would require the incorporation of additional methods, such as those designed for explainable AI, which is beyond the scope of this paper. However, to investigate the potential causes of the omission error, we added a new section to the Results (Section 3.3, lines 482-542). In it, we compared the persistent crop masks for all years (2018-2022) between our reference labels and the model predictions. We then computed the average per-band reflectance of all pixels in the FP, TP, TN, and FN categories and tried to find a relationship between the omission/commission errors and spectral reflectance over the years. Unfortunately, these results were inconclusive.
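For illustration, a minimal sketch of this persistent-crop comparison, assuming the yearly reference and predicted cropland masks are available as binary NumPy arrays; the function name is hypothetical:

```python
import numpy as np

def persistent_confusion(ref_by_year, pred_by_year):
    """Build masks of pixels that are consistently crop (or non-crop) across all years
    (2018-2022) in the reference and the prediction, and intersect them into the
    TP / TN / FP / FN categories used in Section 3.3."""
    ref = [np.asarray(r, dtype=bool) for r in ref_by_year]
    pred = [np.asarray(p, dtype=bool) for p in pred_by_year]

    ref_crop = np.logical_and.reduce(ref)                  # always crop in the reference
    ref_noncrop = np.logical_and.reduce([~r for r in ref])
    pred_crop = np.logical_and.reduce(pred)                 # always crop in the predictions
    pred_noncrop = np.logical_and.reduce([~p for p in pred])

    return {
        "TP": ref_crop & pred_crop,
        "TN": ref_noncrop & pred_noncrop,
        "FP": ref_noncrop & pred_crop,   # consistent hallucination
        "FN": ref_crop & pred_noncrop,   # consistent omission
    }
```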

Comment 6. The paper discusses the improvement in model generalization, but lacks a comparison with existing methods. It is suggested to include a discussion of the advantages and limitations of the proposed method in comparison to current techniques.

Response 6. We appreciate the reviewer’s suggestion to include a comparison with existing methods. Many current approaches for multi-year cropland or crop type mapping rely on domain adaptation techniques, which typically require access to SITS (Satellite Image Time Series) data. Unfortunately, such data were not available to us in this study. We also avoided a comparison of different architectures alone, as the aim of this paper is not the architecture itself but to highlight the impact of designing and optimizing key components of the learning framework. We argue that any well-established semantic segmentation model can be enhanced by carefully incorporating these components. Specifically, for our segmentation task of binary cropland mapping using historical data (e.g., using a model pre-trained on one year to predict subsequent years within the same region), we demonstrate that incorporating techniques such as MC-dropout, input normalization, photometric augmentations (to mitigate domain shift in multi-year data), and carefully selected loss functions significantly improves performance. This approach enabled us to generate reliable multi-year crop maps without the need for domain adaptation.

While we acknowledge that a comparison of model architectures — with and without the proposed framework components — could provide additional insights, we opted not to include it in this paper to maintain a clear and focused narrative. Including such a comparison would have significantly increased the length of the paper. However, we recognize its potential value and view it as an opportunity for future research, which could form the basis of a dedicated follow-up study.

However, we updated the introduction with a clearer explanation of the paper’s aim:

Lines 207-222 (updated version):

“In this study, we introduce a novel workflow that leverages input normalization, Monte Carlo Dropout (MC-dropout), and a task-specific loss function to enhance the temporal generalization capabilities of field boundary masks at a national scale. We applied these techniques to a single model that is well-understood and widely deployed for large-scale production mapping [e.g. Brandt papers, find other U-net papers], making this study highly relevant to practitioners whose goal is to repeatedly produce reliable maps with an effective model, and who may lack the time to test the broad and expanding array of architectural variations. Our focus was to enhance our ability to produce yearly cropland masks for annual crops, excluding woody crops, aligning with common practices in the literature [67, 68]. The cropland masks are particularly designed to distinguish between the field interior, field edge, and non-field background classes, which improves the ability to perform post-hoc instance segmentation using the score maps for the field interior class. The approach we demonstrate here significantly reduces the reliance on extensive multi-year sample collection, or complex transfer learning strategies, marking a meaningful improvement toward cost-effective, large-scale and annually repeatable agricultural monitoring.”

Comment 7. The paper mentions the use of the U-Net model but does not provide sufficient details about its configuration. It is recommended to include a detailed description of the U-Net architecture, including the number of layers, filter sizes, and activation functions, and to explain why U-Net was chosen.

Response 7. We added a figure for the architecture in the appendix (A1) and referred to it in the text.

Comment 8. The paper mentions the use of hyperparameters but does not explain the selection process in detail. It is advised to elaborate on the hyperparameter optimization strategy and discuss the impact of these parameters on model performance.

Response 8. We did not use a grid search to find the hyperparameters, as the number of parameters to tune was too large; instead, they were set and refined by trial and error while designing the pipeline. We now mention this explicitly in the updated paper.

Lines 348-349 (updated version):

“All the free parameters of the model are chosen based on trial and error.”

Reviewer 2 Report

Comments and Suggestions for Authors

1. The abstract effectively summarizes the key challenges, methods, and findings. However, consider including specific metrics or results to provide a clearer snapshot of the model's performance improvements.

2. It will be better to provide a more detailed rationale for choosing specific techniques (e.g., TFL, dropout) over others.

3. For the Model Architecture part, please include a diagram of the U-Net architecture to aid understanding of the model structure and workflow.

4. Please provide a detailed discussion of the limitations faced during the study. For example, how did cloud cover affect data availability and model performance?

Comments on the Quality of English Language

1. In the Abstract "However, effective DL models often require large, expensive label datasets..."

Please change "label datasets" to "labeled datasets" for grammatical accuracy.

2. "Our approach combines several techniques, including an area-based loss function, Tversky-focal loss (TFL), data augmentation, and the use of regularization techniques like dropout."

Consider rephrasing to "Our approach integrates several techniques, such as an area-based loss function, TFL, data augmentation, and regularization methods like dropout."

3. Ensure consistent terminology throughout the manuscript (e.g., "deep learning" vs. "DL").

4. Use more transitional phrases to improve the flow between sentences and sections.

5. A thorough proofreading session is recommended to catch any remaining grammatical or typographical errors.

Author Response

Comment 1. The abstract effectively summarizes the key challenges, methods, and findings. However, consider including specific metrics or results to provide a clearer snapshot of the model's performance improvements.

Response 1. We have updated the abstract with your recommendation.

Lines 25-27 (updated version):

“Our U-Net-based workflow successfully generated multi-year crop maps over large areas, outperforming the base model without photometric augmentation and MC-dropout by 17 IoU points.”

Comment 2. It will be better to provide a more detailed rationale for choosing specific techniques (e.g., TFL, dropout) over others.

Response 2. We chose TFL because it was specifically developed for semantic segmentation: the Tversky part of the loss gives control over FP and FN, and the focal part puts more weight on hard examples. We used dropout as it is an easy way to create an ensemble model, improving the robustness and generalization power of the model, which is one of the main aspects of the paper.

lines 328-335 (updated version):

“To refine the training of our model we adopted several strategies. We employed Focal Tversky loss [77], developed for segmentation tasks and known for its efficacy in handling imbalanced datasets and small object sizes. This was done by setting a weighting scheme (α and β hyperparameters) that controls the trade-off between false positives (FP) and false negatives (FN) and the focal hyperparameter (γ), which controls the model's focus on hard-to-classify examples. We experimentally set α and γ hyperparameters to 0.65 and 0.9 respectively, optimizing the model's ability to learn from challenging cases and reducing the impact of easy negatives.”
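For illustration, a minimal sketch of a Tversky-focal loss with the reported α = 0.65 and γ = 0.9; β is taken as 1 − α and the exponent convention (1 − TI)^γ is an assumption, since conventions differ between implementations:

```python
import torch

def tversky_focal_loss(probs, target, alpha=0.65, beta=0.35, gamma=0.9, eps=1e-7):
    """probs:  (N, C, H, W) softmax scores; target: (N, C, H, W) one-hot labels.
    alpha/beta weight false negatives/false positives per class; gamma focuses the
    loss on harder (low Tversky index) classes."""
    dims = (0, 2, 3)
    tp = (probs * target).sum(dims)
    fn = ((1.0 - probs) * target).sum(dims)
    fp = (probs * (1.0 - target)).sum(dims)
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return ((1.0 - tversky) ** gamma).mean()
```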

Lines 316-320:

“We further applied MC-dropout [60] to make model prediction ensembles, which besides providing an uncertainty measure, improved the model’s generalization power and robustness, as shown in Table 2 in the results section. Through experimentation, we set the number of MC trials to 10 and used a fixed dropout rate of 0.15 for the training phase and 0.1 for the inference phase.”
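For illustration, a minimal sketch of MC-dropout at inference time, assuming a PyTorch segmentation model; the helper name is hypothetical, and the idea is simply to keep the (spatial) dropout layers stochastic while averaging several forward passes:

```python
import torch

def mc_dropout_predict(model, image, n_trials=10, p_infer=0.1):
    """Average n_trials stochastic forward passes with dropout kept active at inference."""
    model.eval()
    for module in model.modules():
        if isinstance(module, (torch.nn.Dropout, torch.nn.Dropout2d)):
            module.p = p_infer   # inference dropout rate (0.1), vs. 0.15 during training
            module.train()       # keep this layer stochastic while the rest stays in eval mode
    with torch.no_grad():
        runs = torch.stack([torch.softmax(model(image), dim=1) for _ in range(n_trials)])
    return runs.mean(dim=0), runs.std(dim=0)  # mean score map and a per-pixel uncertainty proxy
```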

Comment 3. For the Model Architecture part, please include a diagram of the U-Net architecture to aid understanding of the model structure and workflow.

Response 3. We added a figure for the architecture in the appendix (A1) and referred to it in the text.

Comment 4. Please provide a detailed discussion of the limitations faced during the study. For example, how did cloud cover affect data availability and model performance?

Response 4. We did not have cloud cover issues because we used the Planet analytic basemap, which is generally a cloud-free temporal mosaic. The 2018 imagery was developed using a temporal compositing process designed to create cloud-free seasonal composites, as described in Estes et al. (2022). These details are given in Section 2.1 of the manuscript. However, the biggest limitation is how to adaptively harden the probability maps when spatial dropout is used, and, as the consistency analysis shows, model predictions still need to be improved. This limitation is explained in the text:

lines 561-581 (updated version):

“The implementation of MC-dropout with a 0.15 training rate and 0.1 prediction rate, combined with 30 MC trials, showed marked improvements in model performance compared to using dropout only during training. Notably, spatial dropout proved more effective at reducing false positives compared to traditional dropout layers, though it generated probability maps with substantially higher variation in confidence levels. This increased variation presents challenges for determining optimal hardening thresholds, as we observed significant fluctuations in threshold values both across different tiles and years, which hinders the usage of spatial MC-dropout in large-scale mapping. The spatio-temporal consistency analysis also showed that our methodology substantially improved temporal generalizability, but the low metric values suggest that the improvements are insufficient to fully capture inter-annual cropland dynamics. To overcome this limitation, one strategy would be to fine-tune the model for several epochs on a small labeled dataset for the prediction year. We also need to better understand the reason behind the omission error. Our evaluation of the spectral characteristics of consistently missed or hallucinated crop pixels suggests that omission error cannot be reliably explained by spectral reflectance alone, and spatial context also needs to be considered to understand the model’s decision-making process, which is a direction for further work. This is also an expected behavior, as the photometric augmentations are specifically designed to reduce the model’s reliance on pixel-level reflectance but at the same time make it harder to explain the impact of input normalization on model performance.”

And the mechanism we used for adaptive hardening is explained through lines 321 to 327:

“As the output probability maps have significant variation between tiles and time points, we used the class masks to find the optimal threshold for hardening the probability maps. We iterated through a range of potential threshold values and evaluated the difference between the number of field and background pixels at each threshold, seeking to maximize this difference. We also set a condition that the background count does not exceed 10% of the total field class count. This approach helps balance maximizing TP against keeping FP within acceptable limits.”
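For illustration, one plausible reading of this mechanism as a minimal sketch; the exact counting scheme and the threshold grid are not fully specified in the text, so the values below are assumptions:

```python
import numpy as np

def adaptive_threshold(field_probs, thresholds=np.arange(0.10, 0.91, 0.05)):
    """Search for a hardening threshold that maximizes the difference between field and
    background pixel counts while keeping background below 10% of the field count."""
    best_t, best_diff = None, -np.inf
    for t in thresholds:
        field = int((field_probs >= t).sum())
        background = int((field_probs < t).sum())
        if field == 0 or background > 0.10 * field:
            continue  # enforce the 10% background condition
        diff = field - background
        if diff > best_diff:
            best_t, best_diff = t, diff
    return best_t
```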

Comment 5. In the Abstract "However, effective DL models often require large, expensive label datasets..."

Please change "label datasets" to "labeled datasets" for grammatical accuracy.

Response 5. We have updated the abstract accordingly.

Comment 6. "Our approach combines several techniques, including an area-based loss function, Tversky-focal loss (TFL), data augmentation, and the use of regularization techniques like dropout."

Consider rephrasing to "Our approach integrates several techniques, such as an area-based loss function, TFL, data augmentation, and regularization methods like dropout."

Response 6. We have updated the abstract accordingly.

Comment 7. Ensure consistent terminology throughout the manuscript (e.g., "deep learning" vs. "DL").

Response 7. We have updated the manuscript for consistent terminology.

Comment 8. Use more transitional phrases to improve the flow between sentences and sections.

Response 8. Done.

Comment 9. A thorough proofreading session is recommended to catch any remaining grammatical or typographical errors.

Response 9. Done

Reviewer 3 Report

Comments and Suggestions for Authors

This paper introduces and evaluates several techniques, including an area-based loss function, Tversky-focal loss (TFL), data augmentation, and regularization techniques such as dropout, to enhance model generalization for cross-year cropland mapping. Overall, this work lacks technical innovation and significant contributions, as it employs widely used deep learning architectures (e.g., U-Net) and common techniques for model enhancement (e.g., data normalization, augmentation and loss functions).

Below please find a few specific comments for potential improvement:

1. More advanced deep learning methods and architectures, such as transfer learning, should be explored, designed, and evaluated to improve model generalizability.

2. A broader range of loss functions should be explored and evaluated.

3. For each technical enhancement, visual aids (e.g., cropland maps or figures similar to Figure 5) should be provided to illustrate how each technique impacts crop classification results.

4. Some content in the Introduction section, e.g., the introduction of image normalization and augmentation, can be moved to Section 2 Materials and Methods. Instead, this section can focus more on relevant existing research on enhancing cropland classification and mapping, and how this work advanced existing works.

5. Section 2.1 should include more details about the sample distribution across different crop types and geographic locations, ideally summarized in a table.

6. In Section 2.1, the allocation of only 4% of samples for model validation may be insufficient. Additionally, there is no detailed description of the datasets used for model testing.

7. The Results section should be divided into subsections for better structure and easier readability.

Author Response

This paper introduces and evaluates several techniques, including an area-based loss function, Tversky-focal loss (TFL), data augmentation, and regularization techniques such as dropout, to enhance model generalization for cross-year cropland mapping. Overall, this work lacks technical innovation and significant contributions, as it employs widely used deep learning architectures (e.g., U-Net) and common techniques for model enhancement (e.g., data normalization, augmentation and loss functions).

Below please find a few specific comments for potential improvement:

Comment 1. More advanced deep learning methods and architectures, such as transfer learning, should be explored, designed, and evaluated to improve model generalizability.

Response 1. We respectfully disagree with the notion that technical innovation must always stem from the introduction of new architectures or training frameworks. Novelty can also arise from the creative and effective use of existing tools applied to a single architecture and its associated pipeline, particularly within the context of workflows that are geared towards consistent and reliable map production in resource-constrained environments, where resource constraints might include a lack of time or R&D budget needed to explore different architectures and then rebuild the necessary scaffolding to make them production-ready. Our work demonstrates how thoughtful design choices and the repurposing and combination of well-established techniques can address key challenges in cross-year cropland mapping, using an existing and widely understood model that is valued for its relative ease of implementation and consistent performance on large-area semantic segmentation tasks. For instance, while MC-dropout is traditionally used to quantify model uncertainty, we repurposed it to enhance model robustness in multi-year crop mapping by exploiting its properties as an ensemble of decision-makers, which can be very helpful as inter-annual variations in crop appearance commonly introduce uncertainty into model predictions.

Additionally, we considered the practical constraints of data availability and the study region. For instance, in West African countries such as Ghana, where tropical weather conditions lead to persistent cloud cover, it is challenging to obtain continuous, cloud-free time series imagery. To address this limitation, we based our approach on mono-temporal basemap imagery from Planet, which offers consistent, cloud-free coverage. This data constraint inherently limits the applicability of many transfer learning frameworks that require time-series inputs to capture the phenological cycles needed for domain adaptation. We also did not use model transfer learning, as we wanted full flexibility to examine different image normalization procedures and, to some extent, because of the limited availability of geospatial pre-trained models at the time the project was started in 2019.

Instead, we asked a more practical question: What can be done given the constraints and limitations we face to be able to produce multi-year cropland maps? This shift in perspective led us to develop a framework that demonstrates the value of thoughtful design choices considering our data and compute resources. Our paper aims to share this experience with the broader community, as we believe it provides valuable insights for others working with similar constraints.

We further highlighted typically overlooked aspects of deep learning framework design, such as the impact of input normalization. While input normalization is often considered a trivial preprocessing step, we demonstrated that it can significantly affect model performance. By systematically analyzing and reporting its effects, we shed light on an important design consideration that is often ignored in the literature.

We also evaluated the model's predictions with and without the added generalization components for spatio-temporal consistency and investigated the relationship between spectral reflectance and commission/omission errors in delineating cropland pixels. We believe the combination of these factors constitutes a technical innovation, and our findings make a valuable contribution to the literature, particularly for practitioners whose goal is to reliably produce maps.

Comment 2.  A broader range of loss functions should be explored and evaluated.

Response 2. The aim here was first to use a task-specific loss function, so we decided to compare CE, as the candidate frequency-based loss, with TFL, as the area-based one. CE was chosen because it is the most widely used loss function. The reason why, among all the loss functions designed for semantic segmentation, we chose TFL is explained in the paper on the basis of previous studies.

Lines 328-335:

“We employed Focal Tversky loss [77], developed for segmentation tasks and known for its efficacy in handling imbalanced datasets and small object sizes. This was done by setting a weighting scheme (α and β hyperparameters) that controls the trade-off between false positives (FP) and false negatives (FN) and the focal hyperparameter (γ), which controls the model's focus on hard-to-classify examples. We experimentally set α and γ hyperparameters to 0.65 and 0.9 respectively, optimizing the model's ability to learn from challenging cases and reducing the impact of easy negatives.”

We also tested the Tanimoto and Dice losses while we were designing the framework for a well-behaved model for in-domain prediction within the same year. Admittedly, we never tested these losses with the latest settings of our pipeline, but, as explained, we think a comprehensive evaluation of existing loss functions is beyond the scope of this paper and would be better pursued in a separate paper.

Comment 3. For each technical enhancement, visual aids (e.g., cropland maps or figures similar to Figure 5) should be provided to illustrate how each technique impacts crop classification results.

Response 3. Figures similar to Figure 5 are provided in the appendix (A2, A3, A4, A5, A6). For input normalization, Table 1 explores the quantitative aspects of the impact comprehensively, and visual aids would not add much more information. However, to improve understanding of the model's behavior, we added a new sub-section to the Results in which the spatio-temporal consistency of the multi-temporal predictions for each tile is compared to the multi-temporal reference annotation of that tile. This analysis is supported with both visual graphics and quantitative metrics. It is followed by another analysis that creates a spatial confusion matrix by comparing the “persistent crop” masks between the reference and the model prediction. We then extracted the pixels that fall into the TP (crop in both reference and prediction), TN (non-crop in both reference and prediction), FP (model hallucination; what the model consistently considered crop but the reference always labeled non-crop), and FN (always crop in the reference but consistently missed by the model) categories from the input Planet imagery (i.e., the image chips), and computed the average value per band and site to investigate the spectral pattern of these accuracy categories and how the model differentiates between crop and non-crop.
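For illustration, a minimal sketch of this per-band averaging step, assuming the image chip is a (bands, H, W) NumPy array and the category masks come from the persistent-crop comparison (e.g., as sketched in our Response 5 to Reviewer 1); names are hypothetical:

```python
import numpy as np

def mean_reflectance_by_category(chip, category_masks):
    """chip: (bands, H, W) Planet image; category_masks: dict of boolean (H, W) masks
    keyed by 'TP', 'TN', 'FP', 'FN'. Returns the average per-band reflectance of the
    pixels falling in each accuracy category."""
    means = {}
    for name, mask in category_masks.items():
        if mask.any():
            means[name] = chip[:, mask].mean(axis=1)   # per-band mean over selected pixels
        else:
            means[name] = np.full(chip.shape[0], np.nan)
    return means
```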

Comment 4. Some content in the Introduction section, e.g., introduction of image normalization, and augmentation can be moved to Section 2 Materials and Methods. Instead, this section can focus more on relevant existing research on enhancing cropland classification and mapping, and how this work advanced existing works.

Response 4. We designed the introduction to divide its focus among the different design components in a systematic way, so that the audience gains the required familiarity before reaching the Methods section. This approach is more in tune with our overall message about the design process under the project's limitations, including data and computational resources. We agree there are many works on cropland mapping in general, but not many when it comes to using a model pre-trained on a specific year to predict cropland in subsequent years, and we believe we have covered the existing work.

Comment 5. Section 2.1 should include more details about the sample distribution across different crop types and geographic locations, ideally summarized in a table.

Response 5. We are not focused on crop types. Our annotation has no information on crop type as the paper is focused on mapping annual crop fields. In the introduction, we clearly mention what we consider to be a crop field:

Lines 214-216 (updated version):

“Our focus is on producing cropland masks for annual crops, excluding woody crops, aligning with common practices in the literature”

Comment 6. In Section 2.1, the allocation of only 4% of samples for model validation may be insufficient. Additionally, there is no detailed description of the datasets used for model testing.

Response 6. The 4% of samples we used for validation have the highest label quality and are well distributed (please see Figure 1). These validation chips are from the same year (2018) as the training set and were used only for in-domain monitoring of the loss while we were training on 2018.

You are correct that Section 2.1 does not describe the test dataset; it is explained in the Methods section:

Lines 367-371 (updated version):

“After completing the training phase, in which the model was trained using the samples that were predominantly collected from 2018 imagery, without fine-tuning on samples collected in subsequent years, we conducted a multi-faceted evaluation of the model’s predictions for the years 2018-2022. We evaluated model performance at a larger extent used for map production. To do so, we randomly selected 4 tiles of size 2358x2358 from the 5 years and manually annotated the crop fields in the resulting 20 scenes, providing an independent test set in which each scene was equivalent in extent to nearly 111 contiguous training/validation chips (see Appendix A2).”

Comment 7. The Results section should be divided into subsections for better structure and easier readability.

Response 7. Done. We divided the result into three subsections to improve clarity.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript has been significantly improved.
