Review Reports - Enhanced Cropland SOM Prediction via LEW-DWT Fusion of Multi-Temporal Landsat 8 Images and Time-Series NDVI Features

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This manuscript is devoted to the development of new approaches to monitoring soil organic carbon content based on digital soil mapping and remote sensing data. This topic is not new. Numerous similar studies are being conducted in various regions of the world. The authors proposed improving the quality of modeling by incorporating multi-temporal satellite data into the list of predictors.

The manuscript requires revision:

At a minimum, the names of the soils found in the study area should be provided. After all, the humus content in soils and its specific spatial variation are largely associated with soils.
Reducing all data to a single spatial resolution of 30 meters seems odd. Of course, 5 km pixels can be represented as a 30-meter raster, but this is meaningless. For example, the information in the pixels is generalized over an area of 5 x 5 km. Within such a pixel, the variation in both soil organic matter content and all predictors is very large. It is incorrect to correlate such a pixel with point soil sampling data or Landsat pixels. A more methodologically correct approach is to generalize all information to 5x5 km pixels, rather than vice versa. This is most likely the source of modeling errors, and this needs to be discussed.
It is unclear why the authors do not want to split their sample set into training and validation. After all, using 10-fold cross-validation in modeling only allows us to assess the quality of the model, not the quality of the map. To assess the quality of a map based on the developed model, it is necessary to have a validation sample set of field data.
The obtained quality of the best model does not allow us to construct a map with an error lower than the actual spatial variation in organic matter content in the soils of the site. That is, the variation coefficient based on field survey data is 12%, and the error of the model and map will a priori be higher than this value. Thus, simply showing a single value on the map equal to the average soil organic matter content in the study area will yield a more accurate result than using the proposed approach.
The conclusion about the low influence of topography and vegetation type on the quality of the models is due precisely to the poor quality of the initial data, in which information on relief and vegetation is highly generalized. This should be discussed in the Discussion.
For a more accurate comparison of the maps in Figures 9 and 10, they should be presented using a common color scale (common legend).

Author Response

Comments 1: [At a minimum, the names of the soils found in the study area should be provided. After all, the humus content in soils and its specific spatial variation are largely associated with soils.]

Response 1: We sincerely thank the reviewer for pointing out this critical omission. We agree that providing specific soil classification names is fundamental, as soil genesis and properties dictate the spatial variability of humus and SOM content. In the revised manuscript (Section 2.1, Study Area), we have explicitly specified the soil types based on the Chinese Soil Taxonomy (CST), with cross-references to the WRB and USDA systems to ensure international readability.

(1) Soil Classification: We clarified that the soils in Yucheng are developed from Yellow River alluvium. The dominant soil type is Fluvo-aquic soil (according to CST), which generally corresponds to Cambisols or Fluvisols in the WRB system, and Inceptisols (specifically Aquic Inceptisols) in USDA Soil Taxonomy.

(2) Impact on SOM: We added a description of how these soils influence SOM dynamics. Specifically, the varying texture (ranging from sandy loam to silt loam) and the specific hydrological conditions of Fluvo-aquic soils play a significant role in carbon sequestration. We also noted the presence of Saline-alkali soils in certain parts of the study area, where soil salinity acts as a constraint on vegetation growth (input to SOM) and microbial activity (decomposition of SOM).

Changes Made:

We have cited relevant soil geography literature and local soil survey data to support these descriptions. [Page 4, paragraph 2.1, lines 144-158.]

Comments 2: [Reducing all data to a single spatial resolution of 30 meters seems odd. Of course, 5 km pixels can be represented as a 30-meter raster, but this is meaningless. For example, the information in the pixels is generalized over an area of 5 x 5 km. Within such a pixel, the variation in both soil organic matter content and all predictors is very large. It is incorrect to correlate such a pixel with point soil sampling data or Landsat pixels. A more methodologically correct approach is to generalize all information to 5x5 km pixels, rather than vice versa. This is most likely the source of modeling errors, and this needs to be discussed.]

Response 2: We greatly appreciate the reviewer for highlighting this critical methodological issue regarding scale effects and the Modifiable Areal Unit Problem (MAUP). We fully acknowledge the theoretical validity of your concern: simply resampling coarse-resolution data (e.g., 5 km climate/soil moisture products) to 30 m does not increase their actual information content. Furthermore, correlating a 5-km pixel (representing a regional average) with a point-scale soil sample or a 30-m Landsat pixel indeed creates a support mismatch, which introduces uncertainty into the modeling process. However, we respectfully retained the 30-m target resolution for the prediction mapping based on the following practical and methodological considerations:

(1) Application Requirement: The primary objective of this study is to support precision agriculture at a regional scale. A 5-km resolution (where one pixel covers 2,500 hectares) is too coarse to depict the intra-field spatial heterogeneity of SOM required for precise soil management.

(2) Role of Multi-scale Predictors: In the Digital Soil Mapping (DSM) framework, different predictors serve different roles. The fine-resolution data (Landsat 8, 30 m) capture the local spatial variation and details, while the coarse-resolution data (climate, soil moisture) represent the regional background trends. The machine learning models (RF/CNN) are designed to learn the relationship between local SOM content and this combination of "regional trend + local detail." Upscaling all data to 5 km would discard the critical high-frequency spatial information provided by the Landsat imagery, which is the core predictor in this study.

(3) Sample Size Constraint: The study area (Yucheng City) is approximately 990 km². If we generalized all information to 5×5 km pixels, the entire study area would be represented by only ~40 pixels. This sample size is insufficient for training and validating complex machine learning models like CNN and RF.
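The pixel-count arithmetic behind points (1) and (3) can be checked with a few lines of Python. This is only a sketch: it uses the 990 km² area and the 5 km and 30 m pixel sizes quoted above, and it assumes square pixels that tile the study area exactly, which a real raster grid would not do.

```python
# Figures quoted in the response; exact tiling by square pixels is an assumption.
study_area_km2 = 990.0
coarse_side_km = 5.0
fine_side_km = 0.030

coarse_pixel_km2 = coarse_side_km ** 2
coarse_pixel_ha = coarse_pixel_km2 * 100          # 1 km^2 = 100 ha
n_coarse_pixels = study_area_km2 / coarse_pixel_km2
n_fine_pixels = study_area_km2 / fine_side_km ** 2

print(coarse_pixel_ha)             # 2500.0
print(round(n_coarse_pixels, 1))   # 39.6
print(round(n_fine_pixels))        # 1100000
```

Under these assumptions a 5 km grid leaves roughly 40 cells for the whole city, against about 1.1 million cells at 30 m.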

Changes Made:

Following your suggestion, we have added content to the "6. Limitations and Outlook" section, discussing in detail the "scale effects and uncertainties in multi-source data fusion." We openly discuss that the scale mismatch between point samples and coarse covariates is a limitation of the current multisource fusion approach and contributes to the modeling uncertainty. [Page 24, paragraph 6, lines 860-871.]

Comments 3: [It is unclear why the authors do not want to split their sample set into training and validation. After all, using 10-fold cross-validation in modeling only allows us to assess the quality of the model, not the quality of the map. To assess the quality of a map based on the developed model, it is necessary to have a validation sample set of field data.]

Response 3: We appreciate the reviewer’s rigorous perspective on model validation. We fully agree that an independent field validation dataset (external validation) is the "gold standard" for assessing the final mapping quality. However, we chose the 10-fold cross-validation (CV) strategy instead of a fixed "train/validation" split (hold-out method) based on the following considerations regarding sample size and model stability:

(1) Constraint of Sample Size (N=198): The total sample size in this study is limited to 198 points. Machine learning models, especially the Convolutional Neural Network (CNN) used in this study, are data-hungry and require sufficient examples to learn complex non-linear spatial features effectively. If we employed a traditional hold-out method (e.g., 70% training / 30% validation), the training set would be reduced to approximately 138 samples. This reduction would significantly increase the risk of model overfitting or underfitting, failing to demonstrate the true potential of the proposed method.

(2) Robustness of Accuracy Estimation: With a small dataset, a single random split into training and validation sets can be highly sensitive to exactly which points are selected (i.e., the result may vary largely depending on the random seed). In contrast, 10-fold CV ensures that every sample is used for validation exactly once while preserving maximal data for training in each fold. This approach provides a more statistically stable and unbiased estimate of the model's predictive performance and, by extension, the mapping quality for unvisited locations.

(3) Standard Practice in DSM: Using k-fold CV is a widely accepted validation strategy in Digital Soil Mapping (DSM) studies when the sample size is small to moderate (e.g., Wadoux et al., 2019; Biswas & Zhang, 2018). The error metrics derived from CV (RMSE, R²) represent the expected error of the map at any given location within the study domain.
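The difference in training-set size between the two strategies in points (1) and (2) can be illustrated with a small standard-library sketch. The function name `k_fold_indices` and the random seed are illustrative assumptions, not the authors' code; only N = 198, k = 10 and the 70/30 split come from the response.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

n_samples, k = 198, 10          # sample size and fold count quoted in the response
folds = k_fold_indices(n_samples, k)

# Every sample appears in exactly one validation fold.
assert sorted(i for fold in folds for i in fold) == list(range(n_samples))

cv_train_sizes = sorted({n_samples - len(fold) for fold in folds})
holdout_train_size = int(0.7 * n_samples)   # 70/30 split, rounded down

print(cv_train_sizes)       # [178, 179]
print(holdout_train_size)   # 138
```

Each cross-validation fold therefore trains on 178 or 179 samples, compared with 138 under a single 70/30 hold-out split.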

Changes Made:

We have clarified this justification in Section “3.6 Modeling and evaluation” of the revised manuscript. We explicitly stated that due to the limited sample size, 10-fold CV was adopted to maximize data utilization for training the deep learning model while providing a robust assessment of predictive accuracy. [Page 12, paragraph 3.6, lines 393-400.]

Comments 4: [The obtained quality of the best model does not allow us to construct a map with an error lower than the actual spatial variation in organic matter content in the soils of the site. That is, the variation coefficient based on field survey data is 12%, and the error of the model and map will a priori be higher than this value. Thus, simply showing a single value on the map equal to the average soil organic matter content in the study area will yield a more accurate result than using the proposed approach.]

Response 4: We appreciate the reviewer’s rigorous statistical scrutiny regarding the practical utility of the model. The reviewer correctly points out that if a model's error exceeds the natural variability (Coefficient of Variation, CV) of the site, a simple mean value would indeed be a better predictor.

However, based on the statistical results presented in our manuscript, we respectfully demonstrate that our model significantly outperforms the "mean value" baseline.

(1) Statistical Comparison (Model RMSE vs. Natural Variation):

Natural Variation: As shown in Table 7, the Standard Deviation (SD) of the observed SOM is 2.42 g/kg, with a CV of 12%. Using a "single mean value" for the entire map would result in a prediction error (RMSE) approximately equal to the SD (≈2.42 g/kg).

Model Error: As shown in Table 9, our proposed CNN model achieved an RMSE of 1.29 g/kg.

Relative Error: The relative RMSE (R-RMSE) of our model is calculated as R-RMSE = RMSE / Mean = 1.29 / 20.12 ≈ 6.4%.

Conclusion: The model's relative error (6.4%) is approximately half of the natural spatial variation (12%). This indicates a substantial improvement in accuracy compared to using a simple mean.

(2) Interpretation of R²:

The Coefficient of Determination (R²) is 0.62 for the CNN model. By definition, R² compares the model's Mean Squared Error (MSE) against the variance of the data (which represents the error of the mean model). An R² of 0.62 implies that our model explains 62% of the spatial variability that a simple mean value cannot capture.

(3) Practical Value for Mapping:

A simple mean value would produce a homogeneous map, which offers no information for site-specific management. In contrast, our model successfully delineates the spatial heterogeneity of SOM (e.g., higher values in the west, lower in the east), which is essential for determining variable-rate fertilization strategies in precision agriculture.
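The comparison against a "mean value" map in points (1) and (2) can be sketched as follows. The five SOM values are hypothetical and serve only to show that a constant-mean prediction has an RMSE equal to the population standard deviation and an R² of zero. The last lines reuse the figures quoted above (RMSE 1.29, SD 2.42, mean 20.12 g/kg). The helper names `rmse` and `r2` are illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between observations and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    mean = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - sse / sst

# Hypothetical SOM observations (g/kg), for illustration only.
som = [17.5, 19.0, 20.0, 21.5, 23.0]
mean_som = sum(som) / len(som)
mean_map = [mean_som] * len(som)          # the "single mean value" map

population_sd = math.sqrt(sum((v - mean_som) ** 2 for v in som) / len(som))
baseline_rmse = rmse(som, mean_map)       # equals population_sd
baseline_r2 = r2(som, mean_map)           # zero by construction

# Figures quoted in the response.
model_rmse, observed_sd, observed_mean = 1.29, 2.42, 20.12
relative_rmse_pct = 100 * model_rmse / observed_mean
cv_pct = 100 * observed_sd / observed_mean

print(round(relative_rmse_pct, 1))   # 6.4
print(round(cv_pct, 1))              # 12.0
```

The reported SD may be a sample SD with an n - 1 denominator, whereas the RMSE of the mean map matches the population SD. With 198 samples the difference between the two is small.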

Changes Made:

We have also added supplementary discussion at the end of the "4.5 Prediction Accuracy" section in the manuscript. [Page 18, paragraph 4.5, lines 574-581.]

Comments 5: [The conclusion about the low influence of topography and vegetation type on the quality of the models is due precisely to the poor quality of the initial data, in which information on relief and vegetation is highly generalized. This should be discussed in the Discussion.]

Response 5: We sincerely appreciate this insightful comment. We acknowledge that the resolution and quality of the input covariates can significantly influence the feature importance analysis. We have expanded the discussion on this point in the revised manuscript. We argue that the low importance of topographic and vegetation type variables is attributed to a combination of environmental homogeneity and data limitations:

(1) Geographical Reality: The study area (Yucheng City) is located in the Yellow River Alluvial Plain, characterized by extremely flat terrain and a uniform agricultural system (intensive winter wheat-summer maize rotation). The lack of significant topographic relief and the dominance of a single cropping system naturally limit the explanatory power of topography and categorical vegetation type variables in the model.

(2) Data Generalization: We fully agree with the reviewer that the "generalization" of initial data plays a role. The SRTM DEM (30 m) used may lack the vertical accuracy to capture micro-topographic variations (e.g., small ridges or depressions) that influence local soil moisture and SOM redistribution. Similarly, the categorical vegetation type map is too coarse to reflect intra-specific crop growth differences (which is why the time-series NDVI features performed much better than the static vegetation type variable).

Changes Made:

We have added a discussion in Section “5.1 Relative importance of environmental features” to clarify that the minor influence of these factors is likely a result of both the flat terrain context and the insufficient resolution of the covariates to detect micro-scale variations. [Page 20, paragraph 5.1.1. and line 682-692.]

Comments 6: [For a more accurate comparison of the maps in Figures 9 and 10, they should be presented using a common color scale (common legend).]

Response 6: We fully accept this constructive suggestion. We agree that using a unified color scale (legend) is essential for a direct and accurate visual comparison between the different modeling scenarios. We have redrawn Figure 9 and Figure 10 in the revised manuscript to ensure visual consistency:

(1) For Figure 9 (Spatial distribution of SOM): We determined the global minimum and maximum values across both prediction maps (Ev and Ev-Tn-Mm) and applied a unified color scale range (set to 14–29 g/kg). This unified scale now clearly highlights that the multisource fusion model (Fig. 9b) captures a wider range of SOM variability compared to the single-source model (Fig. 9a).

(2) For Figure 10 (Uncertainty distribution): We applied a unified color scale range (set to 0–2.4) for both uncertainty maps. With the fixed legend, the visual comparison now distinctively shows that the uncertainty values in the Ev-Tn-Mm model (Fig. 10b) are overall lower (depicted by cooler colors) than those in the Ev model (Fig. 10a), visually confirming the improved model stability.

The captions for Figure 9 and Figure 10 have been updated to reflect that a uniform color scale was applied. [Page 19, paragraph 4.6. line 624-626.]
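The shared-legend procedure described in this response amounts to computing one minimum and one maximum over all maps and then normalizing every map with those same limits. A minimal sketch, with hypothetical 2×2 grids standing in for the two prediction maps and illustrative helper names:

```python
def shared_color_limits(*grids):
    """Global (min, max) over several 2-D grids, skipping missing cells (None)."""
    values = [v for grid in grids for row in grid for v in row if v is not None]
    return min(values), max(values)

def normalize(value, vmin, vmax):
    """Map a value to [0, 1] on the shared scale, ready for a colormap lookup."""
    return (value - vmin) / (vmax - vmin)

# Hypothetical SOM grids (g/kg); not the authors' maps.
map_a = [[16.0, 18.5], [20.0, 22.0]]
map_b = [[14.0, 19.0], [24.5, 29.0]]

vmin, vmax = shared_color_limits(map_a, map_b)
print(vmin, vmax)                    # 14.0 29.0
print(normalize(20.0, vmin, vmax))   # same colour position in both maps
```

In most plotting libraries the equivalent step is to pass the same lower and upper limits to every panel, so that a given colour means the same SOM value in each map.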

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

1. The manuscript employs LEW-DWT as one of the core methods; however, the distinction between the present study and existing research is not clearly articulated. The innovative contribution of the study is insufficiently described. The authors are advised to further elaborate on the necessity and advantages of applying LEW-DWT for soil organic matter inversion.

2. The dataset consists of 198 soil samples, which are used to train a CNN model with a relatively large number of parameters. Although 10-fold cross-validation is adopted, the risk of overfitting cannot be fully excluded. It is recommended to supplement the analysis with model stability assessments or to explicitly discuss the applicability and limitations of the conclusions.

3. The selection of CNN parameters lacks sufficient theoretical justification. The authors are encouraged to conduct parameter sensitivity analysis or ablation experiments to validate the rationality and robustness of the adopted network architecture.

4. The manuscript uses RF-based variable importance to interpret the prediction mechanism of the CNN model. However, such an approach does not directly reflect the feature learning process of CNNs and may lead to over-interpretation. Further clarification and discussion are required.

5. The Ev–Tn–Mm feature combination integrates environmental factors, temporal NDVI, and multispectral information simultaneously, making it difficult to determine the independent contribution of each feature category to performance improvement. Additional comparative experiments are recommended.

6. The discussion section mainly reiterates the results, with insufficient analysis of the underlying reasons for performance differences under varying conditions.

7. The temporal coverage of some environmental variables does not align with the soil sampling period in 2020. Further clarification and justification are necessary.

8. Most figures suffer from quality degradation or distortion. The authors are advised to re-upload high-resolution figures.

9. Some equations in the manuscript are incorrectly numbered or inconsistently referenced. The authors should carefully revise and correct these issues.

Author Response

Comments 1: [The manuscript employs LEW-DWT as one of the core methods; however, the distinction between the present study and existing research is not clearly articulated. The innovative contribution of the study is insufficiently described. The authors are advised to further elaborate on the necessity and advantages of applying LEW-DWT for soil organic matter inversion.]

Response 1: We thank the reviewer for pointing out that the innovation and necessity of our method need to be articulated more clearly. We agree that the Introduction should demonstrate the specific advantages of LEW-DWT over other methods more explicitly. In the revised version, we elaborate further on the reasons for choosing LEW-DWT.

We explicitly compare it with the following methods: (1) Spatio-temporal fusion (such as STARFM): we note that although these methods are useful for data reconstruction, they are not well suited to enhancing the features of existing real observations. (2) Traditional DWT: we emphasize that standard DWT typically uses simple fusion rules (mean/maximum), which results in edge blurring.

We emphasize that SOM exhibits high spatial heterogeneity in fragmented agricultural landscapes. In our proposed method, "local energy" serves as a proxy variable for feature richness (texture/edge). By performing weighted fusion based on local energy, LEW-DWT avoids the smoothing effect of traditional methods, which is crucial for accurate field-scale mapping.
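The contrast between a fixed mean rule and local-energy weighting can be sketched on a single sub-band of detail coefficients. This is an illustration under stated assumptions, not the authors' LEW-DWT implementation. The 3×3 window, the weight formula (each source weighted by its share of the local energy), and the function names are all assumed here, and the wavelet decomposition and reconstruction steps are omitted.

```python
def local_energy(coeffs, radius=1):
    """Sum of squared coefficients in a (2*radius+1)^2 window, clipped at the borders."""
    rows, cols = len(coeffs), len(coeffs[0])
    energy = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    energy[r][c] += coeffs[rr][cc] ** 2
    return energy

def energy_weighted_fusion(a, b, radius=1, eps=1e-12):
    """Fuse two coefficient grids, weighting each by its share of local energy."""
    ea, eb = local_energy(a, radius), local_energy(b, radius)
    fused = []
    for r in range(len(a)):
        row = []
        for c in range(len(a[0])):
            w = (ea[r][c] + eps) / (ea[r][c] + eb[r][c] + 2 * eps)
            row.append(w * a[r][c] + (1.0 - w) * b[r][c])
        fused.append(row)
    return fused

def mean_fusion(a, b):
    """Fixed rule: plain average of the two coefficient grids."""
    return [[(x + y) / 2.0 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy detail coefficients: image A contains a vertical edge, image B is featureless.
detail_a = [[0.0, 4.0, 0.0], [0.0, 4.0, 0.0], [0.0, 4.0, 0.0]]
detail_b = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

fused_energy = energy_weighted_fusion(detail_a, detail_b)
fused_mean = mean_fusion(detail_a, detail_b)
print(round(fused_energy[1][1], 6), fused_mean[1][1])   # 4.0 2.0
```

In this toy case the mean rule halves the edge coefficient, while the energy-weighted rule keeps it almost unchanged. This is the smoothing effect that the response attributes to fixed fusion rules.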

Changes in Manuscript:

Please refer to pages 2-3, lines 75-97, where we have added the supporting arguments you requested. In Section "4.2 Analysis of fused multitemporal multispectral images", we have added a paragraph discussing the quality evaluation of the fused images, and Table 9 has been included to present the quantitative comparison of SAM, IE, and AG. In Section "4.5 Prediction accuracy", a new paragraph details the comparative results of these three fusion methods.

[Pages 2-3, paragraph 1, and lines 75-97]

[Page 18, paragraph 4.5, and line 543-556.]

[Page 14, paragraph 4.2, and line 414-427, 435-436]

Comments 2: [The dataset consists of 198 soil samples, which are used to train a CNN model with a relatively large number of parameters. Although 10-fold cross-validation is adopted, the risk of overfitting cannot be fully excluded. It is recommended to supplement the analysis with model stability assessments or to explicitly discuss the applicability and limitations of the conclusions.]

Response 2: We appreciate the reviewer's rigorous attention to the relationship between sample size and model complexity. We acknowledge the inherent risk of overfitting when applying a CNN to a small dataset (N=198). To address this issue and evaluate model stability, we conducted the following analysis and discussion:

1. Lightweight architecture and regularization:

Unlike deep architectures that require massive datasets, such as ResNet or VGG, we designed a shallow and lightweight convolutional neural network specifically tailored for this study (see Table 6 for details). It consists of only three convolutional layers with small filter sizes (3×3, 2×2), significantly limiting the total number of trainable parameters. We actively incorporated Dropout layers after the convolutional and fully connected layers, with neuron dropout rates of 0.2 and 0.1, respectively. This method of randomly dropping neurons forces the network to learn robust features rather than memorizing specific samples, effectively mitigating overfitting.

2. Model stability evaluation: We analyzed the stability of the model through the results of 10-fold cross-validation.

(1) Generalization gap: The difference between the training RMSE (approximately 1.12 g·kg⁻¹) and the validation RMSE (1.29 g·kg⁻¹) remains within a reasonable range, indicating that the model has not simply memorized the training data.

(2) Cross-validation variance: The standard deviation (SD) of the RMSE for 10-fold cross-validation is relatively low (SD≈0.15), indicating that the model's performance remains consistent regardless of the division of the training/validation sets.

3. Discussion on limitations:

We have revised Section 6 (Limitations and Outlook). This section explicitly points out the limited sample size and discusses the applicability of the research results. We clarify that although this lightweight Convolutional Neural Network (CNN) performs well in this specific area, larger-scale applications require the expansion of the soil spectral library. [Page 24, paragraph 6, lines 818-829.]
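The two stability indicators in point 2, the train/validation gap and the spread of per-fold errors, can be computed as in the sketch below. The training and validation RMSE values are those quoted in the response. The per-fold RMSE list is made up purely to show the calculation and does not reproduce the authors' fold results.

```python
import statistics

# Values quoted in the response (g/kg).
train_rmse, validation_rmse = 1.12, 1.29
gap = validation_rmse - train_rmse
relative_gap_pct = 100 * gap / train_rmse

# Made-up per-fold validation RMSEs, for illustration only.
fold_rmse = [1.0, 1.2, 1.4, 1.1, 1.3]
fold_mean = statistics.mean(fold_rmse)
fold_sd = statistics.stdev(fold_rmse)     # sample SD across folds

print(round(gap, 2))                # 0.17
print(round(relative_gap_pct, 1))   # 15.2
print(round(fold_mean, 2), round(fold_sd, 2))
```

A small fold-to-fold SD relative to the mean RMSE suggests that the error estimate does not hinge on one particular split. It does not by itself rule out overfitting.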

Comments 3: [The selection of CNN parameters lacks sufficient theoretical justification. The authors are encouraged to conduct parameter sensitivity analysis or ablation experiments to validate the rationality and robustness of the adopted network architecture.]

Response 3: We are grateful to the reviewer for the valuable feedback on the rationale behind our CNN architecture. We acknowledge that parameter selection should be grounded in theoretical reasoning and empirical validation.

1. Theoretical basis: We chose a 4-layer CNN with mixed filter sizes (3×3 and 2×2) to suit the specific characteristics of our input data:

(1) Input size limitation: As described in Section 3.5.2, we use a small 7×7 pixel window as input to capture local neighborhood information. Using standard large filters (e.g., 5×5) would rapidly reduce the spatial dimension, making deep feature extraction impossible.

(2) Use of 2×2 filters: We therefore start with a 3×3 filter to capture the immediate context, followed by three layers of 2×2 filters. The smaller filter size allows us to maintain a deeper network (4 layers) and learn nonlinear relationships even with small input patches.

(3) Parameter Efficiency: Furthermore, 2×2 filters have significantly fewer parameters compared to 3×3 filters, which is crucial for preventing overfitting with our sample size (N=198).
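The size argument in points (1) to (3) can be checked by tracking the spatial dimension of a 7×7 patch through successive convolutions. The sketch assumes stride 1, no padding, and no pooling, none of which is stated in the response. The helper names are illustrative.

```python
def valid_conv_sizes(input_size, kernels):
    """Spatial size after each stride-1, unpadded convolution, stopping when a kernel no longer fits."""
    sizes = [input_size]
    for k in kernels:
        if k > sizes[-1]:
            break  # kernel larger than the feature map: no deeper layer is possible
        sizes.append(sizes[-1] - k + 1)
    return sizes

def weights_per_channel_pair(kernel):
    """Weights in one square kernel linking one input channel to one output channel."""
    return kernel * kernel

print(valid_conv_sizes(7, [3, 2, 2, 2]))   # [7, 5, 4, 3, 2]
print(valid_conv_sizes(7, [3, 3, 3, 3]))   # [7, 5, 3, 1]
print(valid_conv_sizes(7, [5, 5]))         # [7, 3]
print(weights_per_channel_pair(2), weights_per_channel_pair(3))   # 4 9
```

Under these assumptions the mixed 3×3 and 2×2 stack fits four layers into a 7×7 window, an all-3×3 stack fits only three, and a second 5×5 layer does not fit at all. A 2×2 kernel also carries 4 weights per channel pair against 9 for a 3×3 kernel.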

2. Parameter sensitivity analysis (via grid search):

To ensure the robustness of the architecture, we conducted a parameter sensitivity analysis using a grid search. We evaluated:

(1) Filter combination: We compared "All 3×3" with "Mixed 3×3 & 2×2". The mixed strategy yielded a lower root mean square error (RMSE), because the feature map size reduction of "All 3×3" was too aggressive.

(2) Network depth: We tested depths of [3, 4, 5] layers. A 4-layer structure achieved the best balance between feature abstraction and gradient propagation.

(3) Dropout rate: We tested values of [0.1, 0.2, 0.5]. The optimal value for the convolutional layers was 0.2.
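The mechanics of a grid search over the three settings listed above can be sketched as follows. Only the candidate values come from the response. The response does not say whether the full grid was searched or each factor was varied separately, and `toy_score` is a stand-in that simply encodes the configuration reported as best so that the example runs. A real search would train and cross-validate the CNN for every configuration.

```python
from itertools import product

# Candidate values listed in the response.
search_space = {
    "filters": ["all_3x3", "mixed_3x3_2x2"],
    "depth": [3, 4, 5],
    "dropout": [0.1, 0.2, 0.5],
}

def grid(space):
    """All combinations of the candidate values, one dict per configuration."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in product(*(space[k] for k in keys))]

def grid_search(space, score_fn):
    """Return the configuration with the lowest score (e.g. cross-validated RMSE)."""
    return min(grid(space), key=score_fn)

def toy_score(cfg):
    """Stand-in for a cross-validated RMSE; NOT a real training run."""
    return (abs(cfg["depth"] - 4)
            + abs(cfg["dropout"] - 0.2)
            + (1.0 if cfg["filters"] == "all_3x3" else 0.0))

configs = grid(search_space)
best = grid_search(search_space, toy_score)
print(len(configs))   # 18
print(best)
```

A full grid over these values contains 2 × 3 × 3 = 18 configurations, each of which would require its own 10-fold cross-validation run.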

Changes in Manuscript:

The final architecture presented in Table 6 represents the optimal configuration determined through this process. We have incorporated these details into the revised manuscript in Section 3.5.2. [Page 12, paragraph 3.5.2, lines 378-388.]

Comments 4: [The manuscript uses RF-based variable importance to interpret the prediction mechanism of the CNN model. However, such an approach does not directly reflect the feature learning process of CNNs and may lead to over-interpretation. Further clarification and discussion are required.]

Response 4: We sincerely thank the reviewer for pointing out this important methodological distinction. We agree that the feature importance derived from Random Forest (RF), which is based on node impurity reduction at the pixel level, cannot directly represent the internal feature learning process of the Convolutional Neural Network (CNN), which relies on spatial convolution filters. To address this and avoid overinterpretation, we have refined the manuscript in the Abstract, Section 4.4, and Section 5.1:

Clarified Objective: We explicitly stated that the RF-based analysis is intended to identify the dominant environmental drivers of SOM distribution in the study area, serving as an ecological analysis tool rather than a mechanistic explanation of the CNN.

Discussion of Differences: In the Discussion section, we added a paragraph explaining that while RF identifies which variables are statistically important (e.g., Soil Moisture), the CNN’s superior performance (R² =0.62) is attributed to its ability to extract the spatial texture and context of these variables—a feature that RF cannot quantify.

These revisions ensure that the distinction between the "explanatory tool" (RF) and the "predictive model" (CNN) is clear.

[Pages 16, paragraph 4.4, and lines 491-496.]

[Pages 20, paragraph 5.1, and lines 630-639.]

Comments 5: [The Ev–Tn–Mm feature combination integrates environmental factors, temporal NDVI, and multispectral information simultaneously, making it difficult to determine the independent contribution of each feature category to performance improvement. Additional comparative experiments are recommended.]

Response 5: We thank the reviewer for this constructive suggestion. We agree that disentangling the independent contributions of environmental factors, time-series NDVI, and multispectral information is crucial for validating the model's robustness. To address this comprehensively, we adopted a dual-validation strategy: we conducted an additional ablation experiment (as requested) and corroborated the results with the Variable Importance Analysis in the revised manuscript.

1. Additional Comparative Experiment (Ablation Study)

We conducted additional tests using the CNN model to evaluate the performance of different feature combinations. The results are summarized below:

Model Input	MAE (g/kg)	RMSE (g/kg)	R²
Ev	1.13	1.50	0.54
Ev + Mm	1.05	1.43	0.57
Ev + Tn	1.02	1.40	0.58
Ev + Tn + Mm	0.91	1.29	0.62

As shown in the table, adding either Mm or Tn individually improves accuracy compared to the baseline. Specifically, the Ev+Tn combination (R²=0.58) contributes slightly more than Ev+Mm (R²=0.57), while the full combination yields the highest accuracy (R²=0.62), confirming a synergistic effect.
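The gains over the Ev baseline can be read directly from the ablation table. The sketch below only re-expresses the tabulated values as differences. It adds no new results, and the variable names are illustrative.

```python
# Values copied from the ablation table above (MAE and RMSE in g/kg).
ablation = {
    "Ev":           {"MAE": 1.13, "RMSE": 1.50, "R2": 0.54},
    "Ev + Mm":      {"MAE": 1.05, "RMSE": 1.43, "R2": 0.57},
    "Ev + Tn":      {"MAE": 1.02, "RMSE": 1.40, "R2": 0.58},
    "Ev + Tn + Mm": {"MAE": 0.91, "RMSE": 1.29, "R2": 0.62},
}

baseline = ablation["Ev"]
gains = {}
for name, metrics in ablation.items():
    gains[name] = {
        "d_r2": round(metrics["R2"] - baseline["R2"], 2),
        "rmse_drop_pct": round(100 * (baseline["RMSE"] - metrics["RMSE"]) / baseline["RMSE"], 1),
    }

for name, g in gains.items():
    print(f"{name:14s} dR2={g['d_r2']:+.2f}  RMSE reduction={g['rmse_drop_pct']:.1f}%")

sum_of_single_gains = round(gains["Ev + Mm"]["d_r2"] + gains["Ev + Tn"]["d_r2"], 2)
print(sum_of_single_gains, gains["Ev + Tn + Mm"]["d_r2"])   # 0.07 0.08
```

The joint R² gain (0.08) exceeds the sum of the two single-group gains (0.07) by 0.01. Whether a difference of this size is meaningful depends on the fold-to-fold variability of R², which the table does not report.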

2. Alignment with Variable Importance Analysis (Section 4.4, 5.1)

These experimental results are consistent with the Variable Importance Analysis that we added to the revised manuscript (see Sections 4.4 and 5.1 and Figure 8). Our RF-based analysis quantified the contributions as follows:

(1) Time-series NDVI (Tn): Contributed 6.74% to the total importance (Line 535).

(2) Multispectral Data (Mm): Contributed 6.10% (Line 535).

The fact that Tn (6.74%) is slightly higher than Mm (6.10%) in importance explains why the Ev+Tn model (R²=0.58) performs slightly better than Ev+Mm (R²=0.57).

Conclusion:

By combining the ablation experiment results with the internal feature importance analysis, we have confirmed that each feature category provides independent and complementary information, justifying the integration of the Ev-Tn-Mm framework.

[Pages 16, paragraph 4.4, and lines 491-496.]

[Pages 20, paragraph 5.1, and lines 630-639.]

Comments 6: [The discussion section mainly reiterates the results, with insufficient analysis of the underlying reasons for performance differences under varying conditions.]

Response 6: We sincerely thank the reviewer for this insightful comment. In the revised manuscript, we have significantly expanded the Discussion section to go beyond reiterating results and delve into the root causes of performance variations. Specifically:

(1) Mechanism of Fusion Algorithms (Section 5.2): We added a detailed analysis explaining why LEW-DWT outperforms Traditional DWT and Simple Splicing. We clarified that Traditional DWT often fails due to the "smoothing effect" of fixed fusion rules on high-frequency details, whereas our LEW-DWT strategy effectively preserves local texture and edge information by utilizing local energy weighting.

(2) Impact of Environmental Context (Section 6): We conducted a comparative analysis with previous high-accuracy studies (e.g., Meng et al.). We explained that the performance difference is rooted in the geographical distinctiveness (Yellow River alluvial plain with low SOM vs. Black soil region) and data resolution (Multispectral vs. Hyperspectral). We argue that our method provides a cost-effective solution that compensates for lower spectral resolution by enhancing spatial features.

(3) Variable Importance Attributes (Section 5.1.1): We elaborated on why topographic variables showed low importance, attributing it to the flat terrain of the alluvial plain and the resolution limits of the DEM in capturing micro-topography.

[Pages 23, paragraph 5.2., and lines 764-785.]

[Pages 25, paragraph 6, and lines 843-871.]

[Pages 21, paragraph 5.1.1., and lines 682-692.]

Comments 7: [The temporal coverage of some environmental variables does not align with the soil sampling period in 2020. Further clarification and justification are necessary.]

Response 7: We thank the reviewer for their careful observation regarding the temporal alignment of the datasets. We acknowledge that the temporal coverage of certain environmental covariates (specifically Soil Moisture, 2003–2018, and Climate data) does not perfectly overlap with the 2020 soil sampling. However, we justify this selection based on the pedogenic mechanism of Soil Organic Matter (SOM). SOM accumulation is a slow, cumulative process driven by long-term environmental conditions rather than instantaneous fluctuations.

In the revised manuscript Section 2.3.3, we have added a clarification to explain our rationale:

(1) Long-term Drivers vs. Instantaneous States: We utilized the multi-year average (e.g., for soil moisture) to represent the long-term stable spatial distribution of water content. This serves as a climatic background factor that controls regional SOM formation over decades, which is more scientifically relevant than the transient moisture conditions on the specific sampling date.

(2) Data Stability: Variables like Mean Annual Precipitation (MAP) and Mean Soil Moisture (MSM) define the fundamental spatial patterns of the region's ecology, which remain relatively stable compared to the high temporal variability of vegetation (NDVI).

Therefore, while we used precise, time-synchronized satellite imagery (2014–2023) to capture dynamic vegetation features, we intentionally used long-term aggregate environmental data to capture the background constraints on SOM storage. [Pages 6, paragraph 2.3.3., and lines 207-213.]

Comments 8: [Most figures suffer from quality degradation or distortion. The authors are advised to re-upload high-resolution figures.]

Response 8: Thank you very much for pointing out this issue. We have re-uploaded the figures in the revised manuscript. In case any of them are still unclear, we have also packaged all the original images in a compressed archive and uploaded them to the editorial office.

Comments 9: [Some equations in the manuscript are incorrectly numbered or inconsistently referenced. The authors should carefully revise and correct these issues.]

Response 9: Thank you for the suggestion. We have revised the manuscript accordingly.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This study focuses on the regional-scale precise prediction of soil organic matter (SOM) in farmland, proposing a multi-source data fusion framework based on local energy weighted discrete wavelet transform (LEW-DWT). It integrates multi-temporal Landsat 8 images, MODIS time series NDVI features, and 41 types of environmental variables. Through comparative analysis of random forest (RF) and convolutional neural network (CNN) models, it provides a new technical solution for digital soil mapping. The research design is logically clear, the data sources are standardized, and the variable importance analysis is in-depth. The results have practical application value for precision agriculture and soil carbon management. However, there are certain problems, and the specific suggestions are as follows:

In the introduction, the performance of mainstream multi-temporal fusion techniques such as STARFM and ESTARFM in SOM prediction is not compared, and the reason why LEW-DWT is superior to these mature methods is not explained. The references related to LEW-DWT are insufficient, and there is a lack of literature support for the method itself. It is suggested to add comparative explanations and supplement two application literatures of LEW-DWT or improved DWT in soil property prediction.
The limitations of traditional methods are expressed vaguely. Only the experience dependence of RF is mentioned, and the accuracy bottlenecks of traditional models such as PLSR and single-temporal remote sensing models in SOM prediction are not quantified, making it difficult to highlight the improvement of this study. It is suggested to appropriately supplement relevant discussions.
In the materials, the resolutions of MODIS NDVI (250m), Landsat 8 (30m), and other covariates (soil moisture 5km) have scale differences. The impact of scale conversion methods during fusion on accuracy has not been evaluated. It is suggested to evaluate the impact of scale conversion on prediction accuracy. The soil moisture data only mentions "2003-2018", and the sensor type of the data source is not specified. It is suggested to supplement it.
In the methods, the key parameters of LEW-DWT are not explained, the window size for local energy calculation is not specified, and the selection basis is missing. The reason for the number of wavelet decomposition layers is insufficient, and the impact of different decomposition layers on the fusion effect has not been evaluated. It is suggested to supplement the details of LEW-DWT parameters and add a parameter sensitivity analysis chart.
Traditional fusion methods are not set as controls, making it impossible to quantify the fusion advantages of LEW-DWT. It is suggested to set traditional DWT and simple splicing as two control experiments, compare the prediction accuracy of LEW-DWT with them, and quantify the improvement of the fusion method.
In the results, the quality evaluation of the fused images is missing. Only the prediction accuracy is indirectly verified to validate the fusion effect, and the spectral fidelity and spatial detail retention of the fused images are not directly evaluated. It is suggested to add indicators such as spectral angle matching degree, information entropy, and average gradient between the fused images and the original Landsat 8 images to verify the spectral and spatial retention capabilities of LEW-DWT.
In the discussion, the comparison with similar studies is not in-depth. Only the R²=0.86 of Meng et al. (2022) is mentioned, and the reason for the lower R²=0.62 of this study is not analyzed. The performance of other fusion methods in multi-temporal SOM prediction is not compared. It is suggested to supplement in-depth comparisons with similar studies and compare the performance of models such as CNN-LSTM in multi-temporal SOM prediction to explain the competitiveness of the model in this study.
The advantages of LEW-DWT are not quantified, and the improvement in accuracy of LEW-DWT compared to traditional DWT, STARFM, and other fusion methods in SOM prediction is not clear. It is suggested to quantify the advantages of LEW-DWT.
Format issues: Some chart legends are too brief, such as the variable names in Figure 8 are not fully labeled; some references are not formatted uniformly.

Author Response

Comments 1: [In the introduction, the performance of mainstream multi-temporal fusion techniques such as STARFM and ESTARFM in SOM prediction is not compared, and the reason why LEW-DWT is superior to these mature methods is not explained. The references related to LEW-DWT are insufficient, and there is a lack of literature support for the method itself. It is suggested to add comparative explanations and supplement two application literatures of LEW-DWT or improved DWT in soil property prediction.]

Response 1: We thank the reviewer for this critical observation regarding the methodology selection. We agree that STARFM and ESTARFM are benchmark techniques in spatiotemporal fusion.

1. Comparison and Justification for LEW-DWT: We have revised the Introduction to clarify why LEW-DWT was preferred over STARFM-like algorithms for this specific study:

(1) Different Objectives: STARFM/ESTARFM are primarily designed for image reconstruction (generating synthetic images for missing dates). Our goal, however, was feature fusion—specifically, integrating multi-temporal spectral features with spatial details to construct a composite predictor for SOM.

(2) Preservation of Spatial Details: SOM variation in fragmented croplands relies heavily on spatial texture. STARFM can sometimes introduce smoothing effects or prediction errors in highly heterogeneous areas due to the "mixed pixel" problem in coarse MODIS data. In contrast, LEW-DWT (Discrete Wavelet Transform with Local Energy Weighting) excels at multiresolution analysis. By decomposing images into high-frequency (details/edges) and low-frequency (approximation) components and fusing them based on local energy, LEW-DWT maximally preserves the edge information and spatial texture of the original Landsat images, which is critical for mapping SOM at the field scale.

2. Additional References:

Following your suggestion, we have added three new references to support the application of Wavelet Transform in soil property prediction and image fusion:

Changes Made:

We have rewritten the relevant paragraph in the Introduction (Page 2) to explicitly compare these methods and cite the new references. [Page 2, paragraph 1, and line 75-97.]
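The split into low-frequency (approximation) and high-frequency (detail) components described in Response 1 can be illustrated with a minimal sketch. It assumes a one-level Haar wavelet and plain Python lists purely for illustration; the wavelet basis and implementation used in the manuscript are not stated in this response, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical.

```python
def haar_dwt2(img):
    """One-level 2D Haar transform of a grid with even height and width.

    Returns (approx, horiz, vert, diag): the low-frequency approximation and
    three high-frequency detail sub-bands, each half the input size.
    """
    approx, horiz, vert, diag = [], [], [], []
    for r in range(0, len(img), 2):
        row_a, row_h, row_v, row_d = [], [], [], []
        for c in range(0, len(img[0]), 2):
            p, q = img[r][c], img[r][c + 1]          # top-left, top-right
            s, t = img[r + 1][c], img[r + 1][c + 1]  # bottom-left, bottom-right
            row_a.append((p + q + s + t) / 2.0)  # scaled local average
            row_h.append((p + q - s - t) / 2.0)  # top vs. bottom (horizontal edges)
            row_v.append((p - q + s - t) / 2.0)  # left vs. right (vertical edges)
            row_d.append((p - q - s + t) / 2.0)  # diagonal differences
        approx.append(row_a)
        horiz.append(row_h)
        vert.append(row_v)
        diag.append(row_d)
    return approx, horiz, vert, diag


def haar_idwt2(approx, horiz, vert, diag):
    """Invert haar_dwt2 and return the reconstructed grid."""
    rows, cols = len(approx), len(approx[0])
    img = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for r in range(rows):
        for c in range(cols):
            a, h, v, d = approx[r][c], horiz[r][c], vert[r][c], diag[r][c]
            img[2 * r][2 * c] = (a + h + v + d) / 2.0
            img[2 * r][2 * c + 1] = (a + h - v - d) / 2.0
            img[2 * r + 1][2 * c] = (a - h + v - d) / 2.0
            img[2 * r + 1][2 * c + 1] = (a - h - v + d) / 2.0
    return img


# Hypothetical 4x4 band with a vertical edge inside the right-hand blocks
image = [[2.0, 2.0, 2.0, 8.0] for _ in range(4)]
approx, horiz, vert, diag = haar_dwt2(image)
restored = haar_idwt2(approx, horiz, vert, diag)
```

In this toy band the edge appears only in the `vert` sub-band, while `approx` carries the smoothed background. A fusion rule then decides how the sub-bands of different dates are combined before the inverse transform.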

Comments 2: [The limitations of traditional methods are expressed vaguely. Only the experience dependence of RF is mentioned, and the accuracy bottlenecks of traditional models such as PLSR and single-temporal remote sensing models in SOM prediction are not quantified, making it difficult to highlight the improvement of this study. It is suggested to appropriately supplement relevant discussions.]

Response 2: We appreciate this constructive suggestion. We agree that a more specific discussion on the limitations of traditional methods is necessary to establish a clear baseline for our study. We have revised the Introduction to explicitly address the quantitative bottlenecks of methods like Partial Least Squares Regression (PLSR) and single-date remote sensing models.

(1) Limitations of PLSR: We highlighted that while PLSR is a standard tool for spectral analysis, it is fundamentally a linear model. Previous studies indicate that PLSR often struggles to capture the complex, non-linear relationships between SOM and environmental covariates, with prediction accuracies typically plateauing (e.g., R² often ranges between 0.4 and 0.6 in complex landscapes) [Ref].

(2) Limitations of Single-date Models: We added a discussion on how single-date imagery is highly susceptible to transient environmental noise (e.g., soil moisture anomalies, straw cover). This reliance often leads to unstable predictions and lower accuracy compared to multi-temporal approaches.

These additions explicitly demonstrate why the non-linear learning capability of CNN and the stability of the LEW-DWT fusion framework are required to overcome these specific bottlenecks. [Page 3, paragraph 1, and line 105-116.]

Comments 3: [In the materials, the resolutions of MODIS NDVI (250m), Landsat 8 (30m), and other covariates (soil moisture 5km) have scale differences. The impact of scale conversion methods during fusion on accuracy has not been evaluated. It is suggested to evaluate the impact of scale conversion on prediction accuracy. The soil moisture data only mentions "2003-2018", and the sensor type of the data source is not specified. It is suggested to supplement it.]

Response 3: We appreciate the key issues raised by the reviewers regarding differences in data scales and details of data sources.

1. Regarding the impact assessment of scale conversion: We acknowledge that fusing data of different resolutions (from 30 meters to 5 kilometers) does introduce uncertainty. To evaluate and mitigate this impact, we have adopted the following strategies:

(1) In the preprocessing stage, we did not use a single interpolation method for all variables. For continuous environmental variables such as soil moisture at 5km and MODIS NDVI at 250m, we used cubic convolution interpolation instead of nearest neighbor interpolation.

(2) Cubic convolution interpolation fits a smooth surface based on the values of surrounding pixels, thereby avoiding the "mosaic effect" or "block artifacts" that occur when coarse pixels (5 km) are subdivided into 30 m grids. This ensures that low-resolution variables enter the model as smooth "background trends", while high-resolution Landsat data provides "local details".

We have added a detailed description of this evaluation and method in the "2.4 Data Preprocessing" section of the revised manuscript, clearly stating that this processing method is to maintain spatial continuity and minimize the abrupt errors caused by scale conversion.
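The contrast between nearest-neighbor and cubic convolution resampling can be sketched in one dimension. This is an illustrative example only: it assumes the common Keys kernel with a = -0.5, ignores pixel-centre alignment, and uses hypothetical sample values. Real raster resampling applies the same kernel along rows and columns, and the software used in the manuscript is not specified here.

```python
def keys_kernel(x, a=-0.5):
    """Keys cubic convolution kernel; a = -0.5 is a commonly used setting."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0


def cubic_interp(samples, x):
    """Interpolate at position x (in coarse-pixel units, x >= 0) from 4 neighbours."""
    i = int(x)  # index of the coarse sample at or just left of x
    value = 0.0
    for k in range(i - 1, i + 3):
        k_clamped = min(max(k, 0), len(samples) - 1)  # replicate border values
        value += samples[k_clamped] * keys_kernel(x - k)
    return value


def nearest_interp(samples, x):
    """Pick the closest coarse sample, which yields blocky output."""
    return samples[min(int(x + 0.5), len(samples) - 1)]


def upsample(samples, factor, interp):
    """Resample a 1D profile onto a grid `factor` times finer."""
    n_out = (len(samples) - 1) * factor + 1
    return [interp(samples, j / factor) for j in range(n_out)]


# Hypothetical coarse soil-moisture profile (values are illustrative only)
coarse = [0.20, 0.30, 0.25, 0.35]
blocky = upsample(coarse, 4, nearest_interp)
smooth = upsample(coarse, 4, cubic_interp)
```

The `blocky` profile contains only the original coarse values repeated in steps, whereas `smooth` passes through the coarse samples and varies continuously between them. Smoothing removes the block artifacts but does not add information below the native 5 km resolution.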

2. Regarding the Soil Moisture Data Source: Thank you for your suggestion. We have added detailed information in the article:

(1) Sensor type: This dataset is generated based on passive microwave remote sensing data, with source data including AMSR-E and AMSR2 from JAXA and SMOS sensors from ESA.

(2) Time selection (2003-2018): We used the multi-year average of this period. Because soil organic matter (SOM) changes slowly, the multi-year average soil moisture better represents the long-term hydroclimatic background of the region. This long-term background is the key factor driving the spatial differentiation of SOM, rather than the instantaneous moisture on the day of sampling.

Changes in Manuscript:

Please refer to the revised Sections 2.3.3 [Page 6, paragraph 2.3.3, and line 207-213.] and 2.4 [Page 7, paragraph 2.4, and line 218-224.], where we have added sensor descriptions and specific resampling methods.

Comments 4: [In the methods, the key parameters of LEW-DWT are not explained, the window size for local energy calculation is not specified, and the selection basis is missing. The reason for the number of wavelet decomposition layers is insufficient, and the impact of different decomposition layers on the fusion effect has not been evaluated. It is suggested to supplement the details of LEW-DWT parameters and add a parameter sensitivity analysis chart.]

Response 4: We greatly appreciate the reviewer's attention to the implementation details of the algorithm. Indeed, clearly stated parameter settings and the basis for selecting them are crucial for the reproducibility and scientific rigor of the method. Based on your suggestion, we conducted a parameter sensitivity analysis and added detailed information and charts to the revised manuscript.

1. Regarding the Decomposition Level: The original text used a multi-level decomposition but did not explain the choice of level. In the revised version, we have added a sensitivity analysis based on SOM prediction accuracy (RMSE) (see Figure 3). The results show that 2-level decomposition provides the best prediction accuracy. A lower level (Level 1) cannot fully separate high-frequency spatial details, while higher levels (Level 3 or 4), although they extract more detail, also introduce significant noise and cause spectral distortion. Therefore, we confirm that Level 2 is the best choice.

2. Regarding the Local Energy Window Size: We have added the window parameter to the fusion rule section. We used a 3×3 sliding window to calculate local energy. The sensitivity analysis shows that the 3×3 window performs well in capturing fine, pixel-scale textures in farmland, while the 5×5 and 7×7 windows perform worse: a larger window can over-smooth local features, thereby reducing the ability of the fused image to reflect the spatial variation of SOM.
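A local-energy-weighted fusion rule with a 3×3 window can be sketched as follows. The exact weighting formula used in the manuscript is not given in this response, so the rule below (weights proportional to the windowed sum of squared coefficients) is an assumption, and the function names are hypothetical. A simple mean rule is included for comparison, matching the "Traditional DWT" baseline described in Response 8.

```python
def local_energy(coeffs, window=3):
    """Sum of squared coefficients in a window x window neighbourhood.

    The window is truncated at the borders of the sub-band.
    """
    half = window // 2
    rows, cols = len(coeffs), len(coeffs[0])
    energy = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for rr in range(max(0, r - half), min(rows, r + half + 1)):
                for cc in range(max(0, c - half), min(cols, c + half + 1)):
                    energy[r][c] += coeffs[rr][cc] ** 2
    return energy


def fuse_local_energy(c1, c2, window=3, eps=1e-12):
    """Blend two detail sub-bands with weights proportional to local energy."""
    e1, e2 = local_energy(c1, window), local_energy(c2, window)
    fused = []
    for r in range(len(c1)):
        row = []
        for c in range(len(c1[0])):
            total = e1[r][c] + e2[r][c]
            w1 = 0.5 if total < eps else e1[r][c] / total  # equal weights if both flat
            row.append(w1 * c1[r][c] + (1.0 - w1) * c2[r][c])
        fused.append(row)
    return fused


def fuse_mean(c1, c2):
    """Fixed mean rule: every coefficient is averaged regardless of content."""
    return [[(a + b) / 2.0 for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]


# Hypothetical detail sub-bands from two dates: only the first contains an edge
band_a = [[0.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 0.0]]
band_b = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
fused_lew = fuse_local_energy(band_a, band_b)
fused_avg = fuse_mean(band_a, band_b)
```

In this toy case the mean rule halves the edge coefficient, while the energy-weighted rule keeps it at full strength. Enlarging `window` spreads each coefficient's energy over more neighbours, so the weights respond less sharply to fine-scale texture.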

Changes in Manuscript:

In Section 3.3, we have explicitly described the 3×3 window size and the reason for the 2-level decomposition. We have also added a new figure (Figure 3) showing the sensitivity of the SOM prediction RMSE to these two parameters, in order to support our selection. [Page 10, paragraph 3.3, and line 298-304, 312-317.]

Comments 5: [Traditional fusion methods are not set as controls, making it impossible to quantify the fusion advantages of LEW-DWT. It is suggested to set traditional DWT and simple splicing as two control experiments, compare the prediction accuracy of LEW-DWT with them, and quantify the improvement of the fusion method.]

Response 5: We greatly appreciate the highly constructive suggestion from the reviewer. Indeed, introducing a benchmark control group is crucial for objectively quantifying the performance improvement of LEW-DWT. Based on your suggestion, we have added comparative experiments, using Simple Splicing and Traditional DWT as control groups, and calculated their prediction accuracy based on the optimal model (CNN).

We have added comparative results in the revised manuscript (see Section 4.5/Table 10):

(1) Simple Splicing: performed the worst (RMSE=1.45, R²=0.53), indicating that simply concatenating multi-source data cannot effectively utilize multi-scale spatial features.

(2) Traditional DWT: performed better than Simple Splicing (RMSE=1.37, R²=0.57), but was not as good as LEW-DWT in preserving high-frequency texture details.

(3) LEW-DWT achieved the best performance (RMSE=1.29, R²=0.62).

Quantification of improvement level:

Compared with Traditional DWT, LEW-DWT reduces RMSE by about 5.8% and increases R² by about 8.7%. This indicates that introducing the "local energy weighting" strategy can more sensitively capture small texture changes on the surface of cultivated land (such as micro-topography and vegetation differences), thereby improving the prediction accuracy of SOM.
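The percentages follow from the relative change between the reported metrics. A short check using the rounded RMSE and R² values quoted in this response (the helper name `relative_change` is hypothetical):

```python
def relative_change(baseline, new):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# Rounded values quoted in Response 5
rmse = {"splicing": 1.45, "dwt": 1.37, "lew_dwt": 1.29}
r2 = {"splicing": 0.53, "dwt": 0.57, "lew_dwt": 0.62}

rmse_drop_vs_dwt = -relative_change(rmse["dwt"], rmse["lew_dwt"])
rmse_drop_vs_splicing = -relative_change(rmse["splicing"], rmse["lew_dwt"])
r2_gain_vs_dwt = relative_change(r2["dwt"], r2["lew_dwt"])

print(f"RMSE reduction vs Traditional DWT: {rmse_drop_vs_dwt:.1f}%")
print(f"RMSE reduction vs Simple Splicing: {rmse_drop_vs_splicing:.1f}%")
print(f"R2 increase vs Traditional DWT:    {r2_gain_vs_dwt:.1f}%")
```

With these rounded inputs the RMSE reductions come to about 5.8% and 11.0%, and the R² gain to roughly 8.8%, close to the "about 8.7%" quoted above. The small gap may come from rounding of the reported metrics.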

Changes in Manuscript:

We have added a new paragraph in Section 4.5 (Prediction accuracy) that provides a detailed description of the comparative results of these three fusion methods. We have updated the discussion section to quantify the extent of improvement of LEW-DWT compared to traditional methods. [Page 19, paragraph 4.5, and line 574-595.]

Comments 6: [In the results, the quality evaluation of the fused images is missing. Only the prediction accuracy is indirectly verified to validate the fusion effect, and the spectral fidelity and spatial detail retention of the fused images are not directly evaluated. It is suggested to add indicators such as spectral angle matching degree, information entropy, and average gradient between the fused images and the original Landsat 8 images to verify the spectral and spatial retention capabilities of LEW-DWT.]

Response 6: Thank you for the professional suggestions from the reviewers. We fully agree that verifying the fusion effect solely through the final SOM prediction accuracy is not comprehensive enough. Adding direct quality assessment of the fused image itself is crucial for verifying the effectiveness of LEW-DWT. Based on your suggestion, we selected benchmark images from 2020 as references and calculated three key indicators for the fusion results of Traditional DWT and LEW-DWT:

(1) Spectral Angle Mapper (SAM): used to evaluate spectral fidelity.

(2) Information Entropy (IE): used to evaluate the amount of information contained in an image.

(3) Average Gradient (AG): used to evaluate the clarity of spatial details and texture.

Evaluation results (see newly added Table 9 for details): The results indicate that LEW-DWT outperforms Traditional DWT in all three indicators:

(1) Spectral fidelity: The SAM value of LEW-DWT is lower (0.038 vs. 0.056), indicating that it better preserves the spectral characteristics of the original bands and reduces spectral distortion.

(2) Spatial details: The AG value of LEW-DWT is significantly higher (4.92 vs. 3.85), demonstrating that the weighting strategy based on local energy effectively preserves high-frequency edge information, resulting in clearer image textures.

(3) Information richness: The increase in IE value (5.65 vs. 5.12) indicates that the fused image contains richer multi-temporal comprehensive information.

These direct image quality indicators further confirm the advantages of LEW-DWT in preserving spectral information and enhancing spatial details, laying a data foundation for subsequent high-precision SOM prediction.
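For reference, the three indicators can be computed as in the sketch below. It uses their textbook definitions on plain Python lists; the exact variants used for Table 9 (for example the angle unit for SAM, the grey-level quantization for IE, or the gradient normalization for AG) are not stated in this response and are assumptions here.

```python
import math
from collections import Counter


def spectral_angle(ref, test):
    """Angle in radians between two spectra; 0 means identical spectral shape."""
    dot = sum(a * b for a, b in zip(ref, test))
    norm = math.sqrt(sum(a * a for a in ref)) * math.sqrt(sum(b * b for b in test))
    return math.acos(max(-1.0, min(1.0, dot / norm)))  # clamp against rounding


def information_entropy(img):
    """Shannon entropy in bits of the grey-level histogram of a quantized band."""
    values = [v for row in img for v in row]
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())


def average_gradient(img):
    """Mean magnitude of horizontal and vertical first differences."""
    rows, cols = len(img), len(img[0])
    total = 0.0
    for r in range(rows - 1):
        for c in range(cols - 1):
            dx = img[r][c + 1] - img[r][c]
            dy = img[r + 1][c] - img[r][c]
            total += math.sqrt((dx * dx + dy * dy) / 2.0)
    return total / ((rows - 1) * (cols - 1))


# Hypothetical inputs: two pixel spectra and two tiny single-band images
angle_same_shape = spectral_angle([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
flat = [[5, 5], [5, 5]]
edged = [[0, 2], [0, 2]]
entropy_flat, entropy_edged = information_entropy(flat), information_entropy(edged)
ag_flat, ag_edged = average_gradient(flat), average_gradient(edged)
```

For a whole image, SAM is usually averaged over all pixels. A lower SAM indicates less spectral distortion, while higher IE and AG indicate more information content and sharper detail, which is the direction of the comparison reported above.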

Changes in Manuscript:

In Section 4.2 (Analysis of fused multitemporal multispectral images), we have added a paragraph specifically discussing the quality evaluation of fused images. Additionally, a new Table 9 has been included, presenting quantitative comparison results for SAM, IE, and AG. [Page 14, paragraph 4.2, and line 432-445.]

Comments 7: [In the discussion, the comparison with similar studies is not in-depth. Only the R²=0.86 of Meng et al. (2022) is mentioned, and the reason for the lower R²=0.62 of this study is not analyzed. The performance of other fusion methods in multi-temporal SOM prediction is not compared. It is suggested to supplement in-depth comparisons with similar studies and compare the performance of models such as CNN-LSTM in multi-temporal SOM prediction to explain the competitiveness of the model in this study.]

Response 7: We are grateful to the reviewers for their valuable suggestions on in-depth discussions of model performance and comparative analysis.

1. Regarding the R² difference analysis with Meng et al. [18]:

We acknowledge that the accuracy of this study (R²=0.62) is lower than that of Meng et al.'s study in the Northeast region (R²=0.86). This is mainly attributed to two objective differences:

(1) Geographical differences: Meng et al.'s study was conducted in the black soil region of Northeast China, where the SOM content is high and exhibits significant variation. However, the study area (Yucheng) is characterized by fluvo-aquic soil, with low SOM content and a narrow range, which statistically limits the upper limit of R².

(2) Data source differences (key): Meng et al. integrated GF-5 hyperspectral data, utilizing its hundreds of continuous bands to capture the fine spectral absorption characteristics of SOM. In contrast, this study uses Landsat 8 multispectral data. Although the spectral resolution is lower, Landsat data is freely available and has a long time series, making it more suitable for large-scale, low-cost operational monitoring.

Therefore, achieving an accuracy of 0.62 solely with multispectral data demonstrates the effectiveness of the LEW-DWT fusion strategy proposed in this study in extracting spatial texture information.
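The statistical point in (1), that a narrow SOM range caps R², can be made concrete. If R² is computed as 1 − SSE/SST on the same samples as the RMSE, then R² = 1 − (RMSE/SD)², where SD is the standard deviation of the observed SOM. The sketch below relies on that assumption (cross-validated statistics may be computed differently), and the "wider range" scenario is hypothetical.

```python
import math


def r2_from_rmse(rmse, sd):
    """R² implied by an RMSE and the standard deviation of the observations."""
    return 1.0 - (rmse / sd) ** 2


def implied_sd(rmse, r2):
    """Standard deviation of the observations implied by an RMSE and R²."""
    return rmse / math.sqrt(1.0 - r2)


# Values reported for LEW-DWT + CNN in this response
sd_here = implied_sd(1.29, 0.62)

# Hypothetical region with twice the SOM spread but the same absolute error
r2_wide = r2_from_rmse(1.29, 2.0 * sd_here)

print(f"implied SD of observed SOM: {sd_here:.2f}")
print(f"R2 with doubled SD:         {r2_wide:.3f}")
```

Under these assumptions, the same absolute error that yields R² = 0.62 here would yield an R² of about 0.9 in a region with twice the SOM variability. R² values from study areas with very different SOM ranges are therefore not directly comparable.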

2. Regarding comparison with other models (such as CNN-LSTM):

Based on your suggestion, we have added a discussion on models such as CNN-LSTM. The advantage of CNN-LSTM (as in Zhang et al. [75]) lies in extracting time-dependent features from long time series (such as NDVI curves). The advantage of our study (LEW-DWT + CNN) lies in enhancing spatial texture details through multi-temporal fusion. Because farmland in Yucheng is fragmented, spatial details at the parcel scale are crucial for SOM prediction. Although we did not use an LSTM structure, our 2D-CNN, coupled with feature fusion, effectively captures this spatial heterogeneity, achieving a prediction capability comparable to that of complex temporal models and, in this setting, arguably better suited to the task.

Changes in Manuscript:

We have rewritten the relevant paragraphs in Section 5.3 (Model Performance), providing a detailed explanation of the geographical reasons for the accuracy discrepancies, and incorporating a comparative discussion on the strengths and weaknesses of the CNN-LSTM model. [Page 23, paragraph 5.3, and line 802-826.]

Comments 8: [The advantages of LEW-DWT are not quantified, and the improvement in accuracy of LEW-DWT compared to traditional DWT, STARFM, and other fusion methods in SOM prediction is not clear. It is suggested to quantify the advantages of LEW-DWT.]

Response 8: We sincerely thank the reviewer for this insightful comment. We acknowledge the need to quantitatively benchmark LEW-DWT against established methods to clarify its specific improvements in accuracy.

1. Clarification on STARFM:

Regarding STARFM (Spatial and Temporal Adaptive Reflectance Fusion Model), we would like to clarify that it is primarily designed for generating synthetic high-resolution images from coarse-resolution data (spatiotemporal fusion). In contrast, our study focuses on the pixel-level feature fusion of real observed multi-temporal Landsat images to enhance spectral information richness. Due to this fundamental difference in application scenarios, STARFM was not included as a direct comparator.

2. Quantitative Comparison with Standard Methods:

However, to fully address the reviewer’s request for quantification against "Traditional DWT" and "Other Fusion Methods," we conducted a comparative analysis against two standard baselines using the same CNN model:

Simple Splicing (Other Method): Directly stacking bands without mathematical transformation.

Traditional DWT: Using standard discrete wavelet transform with a simple mean fusion rule.

3. Quantified Improvements:

As presented in the newly added Table 11 (Section 4.5), the results clearly quantify the advantages of LEW-DWT:

vs. Simple Splicing (R²=0.53): LEW-DWT (R²=0.62) reduced the RMSE by 11.0%.

vs. Traditional DWT (R²=0.57): LEW-DWT (R²=0.62) reduced the RMSE by 5.8%.

These comparisons confirm that the proposed LEW-DWT strategy offers a statistically significant improvement over traditional fusion techniques by more effectively preserving local texture and spectral details.

Changes in Manuscript:

We have provided additional argumentation in the new section titled "5.2 Comparative Analysis of Applicability and Fusion Strategies of Wavelet Transform". [Page 23, paragraph 5.2, and line 764-785.]

Comments 9: [Format issues: Some chart legends are too brief, such as the variable names in Figure 8 are not fully labeled; some references are not formatted uniformly.]

Response 9: Thank you very much for pointing out these issues. We fully accept them, but we are unsure which specific items are meant, for example in Figure 8. We would be grateful if you could point them out specifically. Thank you!

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

After revision, the manuscript has been significantly improved and can be recommended for publication.

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have addressed all the reviewers’ comments.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have completed a systematic revision in response to the concerns raised in the first review. The key parameters of the LEW-DWT fusion and the sensitivity analysis have been clearly defined, the structure and hyperparameters of the CNN model have been refined, the variable screening (RFE method) and multi-source data pre-processing procedures have been improved, and the classification and quantitative analysis of variable importance has been strengthened. These efforts have significantly enhanced the scientific rigor and reproducibility of the paper. The paper now addresses key issues such as the lack of core technical details, the ambiguity of model design, and the insufficiency of result interpretation. The innovation points are clear, the experimental data are solid, and the results are highly credible. Overall, it meets the requirements for publication in the journal.