
Review Reports

Appl. Sci. 2026, 16(1), 54; https://doi.org/10.3390/app16010054
by
  • Xiaohui Cheng 1,2,
  • Zifeng Liu 1 and
  • Yanping Kang 1,2,*
  • et al.

Reviewer 1: Anonymous; Reviewer 2: Anonymous; Reviewer 3: Ivan Matveev; Reviewer 4: Anonymous; Reviewer 5: Anonymous

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper presents a new method for soil pH prediction. The method combines many modules, such as a multiscale convolutional network, an attention mechanism, and wavelet decomposition.

The paper is quite OK, but some improvements are necessary.

First, a table with the specific structure of the network is necessary, i.e., how many layers each module has, how many neurons, or the filter size, and so on for the complete network.

Also, please give an idea of the number of parameters to train in this network, and of the time necessary to train it and to apply it. Is it possible to use such a big network in real time?

Also, in the comparison procedure, what parameters are used in the different models, i.e., the CNN-Transformer, the SVR, XGBoost, etc.? Did you optimize the parameters of these models for this application?

Author Response

Comments 1: First, a table with the specific structure of the network is necessary, i.e., how many layers each module has, how many neurons, or the filter size, and so on for the complete network.

Response 1: Thank you for your valuable suggestion. We fully agree that providing a detailed architectural description enhances the clarity and reproducibility of our method. Accordingly, we [have added a new table (Table 2) in the revised manuscript], which presents the complete configuration of MAWC-Net, including layer types, input/output dimensions, convolutional kernel sizes, number of filters, grouping and padding settings, activation and normalization layers, multi-branch structures in MSBCM and MSCDM, HWDM decomposition outputs, and all fully connected layers. This comprehensive table offers a clear, step-by-step overview of the entire network architecture and its components. [page 5-6, and L195-227]

Comments 2: Also, please give an idea of the number of parameters to train in this network, and of the time necessary to train it and to apply it. Is it possible to use such a big network in real time?

Response 2: Thank you for the reviewer’s insightful comment. We [have included detailed information on model size, training time, and computational cost in Section 2.5 of the revised manuscript]. MAWC-Net contains 8.60 million trainable parameters (≈32.8 MB), which represents a moderately sized architecture. Under our hardware configuration (Intel i7-13620H CPU and NVIDIA RTX 4060 GPU), each training epoch requires approximately 1.21 s and utilizes about 3.6 GB of GPU memory. These results indicate that the proposed network is computationally efficient and can be deployed on standard GPU platforms. Inference for a single spectrum can be completed within milliseconds, suggesting that MAWC-Net is feasible for near–real-time or real-time soil property prediction applications. [page 14, and L453-469]

Comments 3: Also, in the comparison procedure, what parameters are used in the different models, i.e., the CNN-Transformer, the SVR, XGBoost, etc.? Did you optimize the parameters of these models for this application?

Response 3: Thank you for raising this important point. To ensure a fair and rigorous comparison, we did not rely on default settings and [have explicitly stated this in the corresponding sections of the manuscript]. For the deep learning baselines, we applied grid search and validation-based selection on key hyperparameters such as learning rate, batch size, number of convolutional filters, attention heads, and hidden dimensions. For the traditional machine learning methods (e.g., SVR, Random Forest, XGBoost), the major hyperparameters were optimized using five-fold cross-validation, including kernel parameters, tree depth, learning rate, and regularization terms. Through this tuning process, each baseline model was trained under its best-performing configuration for this task, ensuring an objective and fair comparison with the proposed MAWC-Net. [page 15-16, and L508-518]
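For illustration, the validation-based tuning described above can be sketched as follows (assuming scikit-learn; the SVR parameter grid and placeholder data are illustrative, not the exact search space or dataset used in the paper):

```python
# A minimal sketch of grid search with five-fold cross-validation for one baseline (SVR).
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((200, 128))             # placeholder PAA-reduced spectra
y = rng.uniform(3.0, 9.0, 200)         # placeholder soil pH values

param_grid = {"C": [1, 10, 100], "gamma": ["scale", 0.01], "epsilon": [0.05, 0.1]}
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)   # best configuration and its RMSE
```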

Reviewer 2 Report

Comments and Suggestions for Authors

For the publication, the following comments should be solved carefully:

1) Define every abbreviation at first mention and use one term consistently throughout. For example, choose one name for the wavelet module and keep it everywhere.

 

2) Justify the reduction to 128 bands. Explain how 4,200 to 128 was selected, and add a small ablation that compares no-PAA or higher dimensions to show any information loss.

 

3) Authors adopt a multi-scale convolutional architecture with wavelet decomposition, but many alternatives exist—notably transformer-based and attention-based models. Please justify the choice of this CNN-wavelet design against those alternatives with empirical or theoretical evidence. While no single architecture is universally superior, the manuscript should explain, scientifically and concretely, why this model is appropriate for the task and data. For discussion of attention-based methods, authors may cite and analyze “Evaluating Cross-Building Transferability of Attention-Based Automated Fault Detection and Diagnosis for Air Handling Units: Auditorium and Hospital Case Study.” For transformer approaches, please cite and discuss “A hybrid SMOTE and Trans-CWGAN for data imbalance in real operational AHU AFDD: A case study of an auditorium building.”

 

4) Ensure baseline fairness and clarify the split. Confirm that all baselines used the same preprocessing and similar tuning budgets, and state the exact train, validation, and test percentages with the stratification method in one sentence.

 

5) Discuss dataset limitations and future work. Note the reliance on LUCAS 2009, outline plans to validate on additional datasets or regions, and double-check that key numbers in the abstract and the main results match exactly.

Author Response

Comments 1: Define every abbreviation at first mention and use one term consistently throughout. For example, choose one name for the wavelet module and keep it everywhere.

Response 1: We thank the reviewer for this valuable suggestion. We [have carefully revised the manuscript to define all abbreviations at their first occurrence]. In addition, the naming of the wavelet module has been standardized throughout the paper, and we consistently refer to it as the Haar wavelet decomposition. [page 1-20, and L21-630]

Comments 2: Justify the reduction to 128 bands. Explain how 4,200 to 128 was selected, and add a small ablation that compares no-PAA or higher dimensions to show any information loss.

Response 2: We appreciate the reviewer’s insightful comment. The reduction of the spectral dimensionality from 4,200 bands to 128 was primarily motivated by hardware constraints and model scalability considerations. We [have explained the reason in the corresponding sections of the manuscript]. In our framework, the input shape is (B, 1, L); thus, the number of parameters increases dramatically with the spectral length L. As shown below, even moderate increases in L lead to roughly quadratic growth in model size:

PAA → 64 bands: 2,207,849 parameters (8.42 MB)

PAA → 128 bands: 8,599,657 parameters (32.81 MB)

PAA → 256 bands: 34,162,793 parameters (130.32 MB)

PAA → 512 bands: 136,407,145 parameters (520.35 MB)

No PAA (full 4,200 bands): 9,173,010,281 parameters (34,992.26 MB)

Given the limitations of our computational environment, training on the full-length (4,200-band) spectra or performing ablation experiments at that scale was not feasible. Therefore, we adopted Piecewise Aggregate Approximation (PAA) to reduce the spectral length. Visual comparison shows that the global spectral shape and major absorption characteristics are well preserved after downsampling, indicating limited loss of essential information.
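For illustration, PAA downsampling of a single spectrum can be sketched as follows (a minimal NumPy sketch; the array contents are placeholders):

```python
import numpy as np

def paa(spectrum: np.ndarray, n_segments: int = 128) -> np.ndarray:
    """Piecewise Aggregate Approximation: average the signal inside
    n_segments (nearly) equal-width windows to shorten it."""
    edges = np.linspace(0, len(spectrum), n_segments + 1).astype(int)
    return np.array([spectrum[a:b].mean() for a, b in zip(edges[:-1], edges[1:])])

full_spectrum = np.random.rand(4200)   # placeholder for one 4,200-band spectrum
reduced = paa(full_spectrum, 128)
print(reduced.shape)                   # (128,)
```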

Moreover, the manuscript has been updated to include relevant literature supporting the downsampling of the original 4,200-band spectra, such as hyperspectral soil studies that also employ dimensionality reduction as a preprocessing step (e.g., “Residual Attention Network with Atrous Spatial Pyramid Pooling for Soil Element Estimation in LUCAS Hyperspectral Data”). [page 3, and L129-133]

 

Comments 3: Authors adopt a multi-scale convolutional architecture with wavelet decomposition, but many alternatives exist—notably transformer-based and attention-based models. Please justify the choice of this CNN-wavelet design against those alternatives with empirical or theoretical evidence. While no single architecture is universally superior, the manuscript should explain, scientifically and concretely, why this model is appropriate for the task and data. For discussion of attention-based methods, authors may cite and analyze “Evaluating Cross-Building Transferability of Attention-Based Automated Fault Detection and Diagnosis for Air Handling Units: Auditorium and Hospital Case Study.” For transformer approaches, please cite and discuss “A hybrid SMOTE and Trans-CWGAN for data imbalance in real operational AHU AFDD: A case study of an auditorium building.”

Response 3: We appreciate the reviewer’s suggestion to justify this design choice against transformer-based and attention-based alternatives. We [primarily discussed the advantages and limitations of MAWC-Net].

First, the multi-scale convolution strategy is well suited for one-dimensional spectral data without an explicit temporal dimension. Parallel convolutions with different kernel sizes (e.g., 3, 5, 7, 9) allow the network to extract features at multiple receptive fields, capturing both narrow absorption peaks and broader spectral patterns. In contrast, Transformer architectures are primarily designed for sequential data with temporal dependencies, where self-attention mechanisms dynamically model long-range interactions. In the attention-based study “Evaluating Cross-Building Transferability of Attention-Based Automated Fault Detection and Diagnosis for Air Handling Units”, the core architecture relies on heavy self-attention mechanisms. In the Transformer-based “A hybrid SMOTE and Trans-CWGAN for data imbalance in real operational AHU AFDD”, the primary modeling component is again the Transformer with multi-head self-attention. When using a Transformer architecture, the multi-head self-attention mechanism is an essential component. In self-attention, the input features must be projected into query (Q), key (K), and value (V) matrices through separate learned linear transformations. The attention operation then computes QKᵀ followed by a softmax normalization and a weighted multiplication with V. These steps involve multiple large matrix multiplications, and in the multi-head setting, they are repeated across several attention heads. As a result, the computational cost grows significantly, leading to substantial memory usage and increased runtime complexity. During our preliminary experiments, we also evaluated self-attention modules, but the computational overhead was prohibitive given the limitations of our hardware. Therefore, we focused on computationally efficient alternatives that still retain strong representational capacity.
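For concreteness, a minimal PyTorch-style sketch of the parallel multi-scale convolution idea is shown below; the branch widths and normalization are illustrative and do not reproduce the exact MSBCM/MSCDM configuration reported in the manuscript:

```python
import torch
import torch.nn as nn

class MultiScaleConv1d(nn.Module):
    """Parallel 1-D convolutions with kernel sizes 3/5/7/9, concatenated along
    the channel axis. A generic sketch of the multi-scale idea only."""
    def __init__(self, in_channels: int, branch_channels: int = 16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_channels, branch_channels, k, padding=k // 2),
                nn.BatchNorm1d(branch_channels),
                nn.ReLU(),
            )
            for k in (3, 5, 7, 9)
        ])

    def forward(self, x):                       # x: (B, C, L)
        return torch.cat([branch(x) for branch in self.branches], dim=1)

x = torch.randn(8, 1, 128)                      # a batch of PAA-reduced spectra
print(MultiScaleConv1d(1)(x).shape)             # torch.Size([8, 64, 128])
```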

Second, the attention modules used in our model—Path Attention Module (PAM) and CBAM—are lightweight and computationally inexpensive. PAM shares conceptual similarity with the channel reweighting strategy in Squeeze-and-Excitation Networks. CBAM follows the design of Convolutional Block Attention Module. The efficiency considerations also align with the principles of ECA-Net: Efficient Channel Attention. These attention modules perform simple weighted operations, avoiding the quadratic complexity characteristic of self-attention. As such, they provide effective feature refinement with minimal computational burden.
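A minimal sketch of this kind of lightweight, SE-style channel reweighting is given below; it illustrates the concept behind PAM-like modules, not the actual PAM/CBAM implementation used in MAWC-Net:

```python
import torch
import torch.nn as nn

class ChannelAttention1d(nn.Module):
    """Squeeze-and-Excitation-style channel reweighting for (B, C, L) features."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, L)
        weights = self.fc(x.mean(dim=-1))       # global average pool -> (B, C)
        return x * weights.unsqueeze(-1)        # linear-cost reweighting, no QK^T

x = torch.randn(8, 64, 128)
print(ChannelAttention1d(64)(x).shape)          # torch.Size([8, 64, 128])
```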

Third, the Haar Wavelet Decomposition Module (HWDM) is adopted because wavelet transforms are widely used in signal processing and hyperspectral analysis for denoising, decomposition, and feature enhancement. Soil hyperspectral reflectance data share essential properties with classical signals—continuity, smooth trends, and localized variations—making wavelet analysis a natural fit. HWDM separates the spectrum into low-frequency trends and high-frequency details, functioning somewhat similarly to pooling while preserving frequency-domain information. This enhances the model’s ability to capture multi-scale spectral variations that are important for soil property prediction. [page 19-20, and L612-661]
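For illustration, a one-level Haar split of a 1-D feature map can be sketched as follows; the paper's HWDM may differ in normalization and in how the two outputs are fused:

```python
import torch

def haar_decompose_1d(x: torch.Tensor):
    """One-level Haar decomposition of a (B, C, L) signal with even L:
    pairwise averages give the low-frequency trend, pairwise differences
    give the high-frequency detail, each of length L/2."""
    even, odd = x[..., 0::2], x[..., 1::2]
    low = (even + odd) / 2 ** 0.5               # approximation coefficients
    high = (even - odd) / 2 ** 0.5              # detail coefficients
    return low, high

x = torch.randn(8, 64, 128)
low, high = haar_decompose_1d(x)
print(low.shape, high.shape)                    # (8, 64, 64) and (8, 64, 64)
```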

 

Comments 4: Ensure baseline fairness and clarify the split. Confirm that all baselines used the same preprocessing and similar tuning budgets, and state the exact train, validation, and test percentages with the stratification method in one sentence.

Response 4: We thank the reviewer for the helpful suggestion. [All baseline models were trained using exactly the same preprocessing procedures and comparable hyperparameter tuning budgets to ensure fairness]. In addition, we applied three-fold stratified cross-validation, dividing the dataset into three non-overlapping folds, with each fold serving once as the test set while the remaining two folds were used for training. Within each training portion, we further employed a Stratified Shuffle Split, allocating 80% of the data for model training and 20% for validation. [page 4, 16, and L143-161, L514-516]
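A minimal sketch of this splitting scheme is shown below (assuming scikit-learn; binning the continuous pH into quartiles for stratification is an assumption made here purely for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

rng = np.random.default_rng(0)
y = rng.uniform(3.0, 9.0, 300)                         # placeholder pH values
# Stratifying a continuous target requires binning; quartile bins are an assumption here.
strata = np.digitize(y, np.quantile(y, [0.25, 0.5, 0.75]))

outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
inner = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for fold, (trainval, test) in enumerate(outer.split(np.zeros((len(y), 1)), strata)):
    tr, val = next(inner.split(np.zeros((len(trainval), 1)), strata[trainval]))
    train_idx, val_idx = trainval[tr], trainval[val]
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}, test={len(test)}")
```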

 

Comments 5: Discuss dataset limitations and future work. Note the reliance on LUCAS 2009, outline plans to validate on additional datasets or regions, and double-check that key numbers in the abstract and the main results match exactly.

Response 5: We thank the reviewer for this valuable suggestion. We acknowledge that our current study relies solely on the LUCAS 2009 dataset, which may limit the generalizability of our findings. In future work, we intend to explore the validation of MAWC-Net on additional soil spectral datasets and across diverse geographical regions to further assess its robustness and applicability. Additionally, we [have carefully cross-checked the key numbers reported in the abstract and the main results to ensure consistency throughout the manuscript].

Reviewer 3 Report

Comments and Suggestions for Authors

The problem of predicting soil acidity from visible-infrared reflectance spectra is considered. The authors provide an elaborate method combining a convolutional neural network, attention blocks, and Haar decomposition, which reportedly outperforms existing analogues. There are the following drawbacks.

1. It is not clear why the Savitzky-Golay filter is applied for preprocessing. Please explain your choice. Why is a 21-point window selected?

2. Some samples (each 1 of 20) were treated as outliers prior to processing (lines 135-138). This is not honest practice, at least without explaining why each excluded sample is condemned to be an outlier. Being far from the main distribution does not mean being wrong. Maybe the 'outliers' represent rare types of soil. Excluding outliers by simple, thoughtless thresholding ruins the performance of a method on real data and results in a cozy laboratory system with polished results, irrelevant to the real problem. Please repeat with real data or provide a carefully designed method of outlier exclusion that really targets wrong samples rather than infrequent ones.

3. Table 1. "pH in CaCl2", "pH in H2O" were not encountered before. Please explain the meaning in advance. 

4. Comparison with other methods lacks one necessary aspect. Size of the models (number of trainable parameters) should be matched as well. Maybe the proposed architecture wins by its greater parameter number rather than smart design. 

5. Also, please mention how much training the rival solutions received. The authors' model ran 500 epochs to saturation. Were the rival models saturated by training?

 

Author Response

Comments 1: It is not clear why the Savitzky-Golay filter is applied for preprocessing. Please explain your choice. Why is a 21-point window selected?

Response 1: We have clarified in the revised manuscript that the Savitzky–Golay (SG) filter is applied for smoothing and noise reduction of the spectral signals prior to feature extraction. While previous studies in soil spectroscopy often use a window length of 11 points [Refs: A Novel Transformer-CNN Approach for Predicting Soil Properties from LUCAS Vis-NIR Spectral Data; Spectral Fusion Modeling for Soil Organic Carbon by a Parallel Input-Convolutional Neural Network], we empirically selected a window length of 21 points to achieve stronger noise suppression while preserving the main spectral features. This choice was validated to maintain the integrity of the spectral curves and ensure robust model performance.
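For reference, the smoothing step can be sketched with SciPy as below; the 21-point window follows the manuscript, while the polynomial order shown is an assumption for illustration:

```python
import numpy as np
from scipy.signal import savgol_filter

spectrum = np.random.rand(4200)        # placeholder raw reflectance spectrum
# 21-point window as in the manuscript; polynomial order 2 is an assumption here.
smoothed = savgol_filter(spectrum, window_length=21, polyorder=2)
print(smoothed.shape)                  # (4200,)
```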

 

Comments 2: Some samples (each 1 of 20) were treated as outliers prior to processing (lines 135-138). This is not honest practice, at least without explaining why each excluded sample is condemned to be an outlier. Being far from the main distribution does not mean being wrong. Maybe the 'outliers' represent rare types of soil. Excluding outliers by simple, thoughtless thresholding ruins the performance of a method on real data and results in a cozy laboratory system with polished results, irrelevant to the real problem. Please repeat with real data or provide a carefully designed method of outlier exclusion that really targets wrong samples rather than infrequent ones.

Response 2: Regarding the treatment of outliers, we have revised the manuscript to clarify that samples identified as outliers were determined using the Mahalanobis distance, which accounts for multivariate variability across the spectral features. Only a small proportion of samples (approximately 5%) were flagged as outliers. We carefully inspected these points and performed experiments both with and without Mahalanobis-based outlier removal. The results show minimal differences: for pH in CaCl2, R² changed from 0.950 to 0.953 and RMSE from 0.319 to 0.307; for pH in H2O, R² changed from 0.943 to 0.945 and RMSE from 0.322 to 0.315. These results demonstrate that removing the outliers does not compromise model performance. This approach is consistent with standard practice in hyperspectral soil studies [Estimating Forest Soil Properties for Humus Assessment—Is Vis-NIR the Way to Go?; Data fusion of XRF and Vis-NIR using outer product analysis, Granger–Ramanathan, and least squares for prediction of key soil attributes], ensuring the model learns from representative data while mitigating the impact of extreme measurement noise.
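For illustration, Mahalanobis-based flagging can be sketched as follows; the feature dimensionality and the chi-square cutoff shown are assumptions, not the exact settings of the study:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))         # placeholder low-dimensional scores (e.g., PCA of spectra)

mu = X.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared Mahalanobis distances

threshold = chi2.ppf(0.975, df=X.shape[1])           # illustrative chi-square cutoff
print(int((d2 > threshold).sum()), "samples flagged out of", len(X))
```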

Comments 3: Table 1. "pH in CaCl2", "pH in H2O" were not encountered before. Please explain the meaning in advance. 

Response 3: We appreciate the reviewer’s comment. In the revised manuscript, we [have ensured that the meanings of pH(CaCl₂) (pH measured in a CaCl₂ solution) and pH(H₂O) (pH measured in a suspension of soil in water) are clearly stated at their first appearance]. These definitions were already provided in Section 2.1, and we have now additionally clarified them at their initial mention in the Abstract and the main text to avoid any ambiguity. [page 3, and L119-120]

 

Comments 4: Comparison with other methods lacks one necessary aspect. Size of the models (number of trainable parameters) should be matched as well. Maybe the proposed architecture wins by its greater parameter number rather than smart design. 

Response 4: We appreciate the reviewer’s insightful comment regarding model size and its potential influence on comparative performance. In the revised manuscript, we [have added a detailed discussion of parameter counts to ensure that the comparison with baseline models is fair and scientifically justified].

First, the traditional machine learning baselines (PLSR, Ridge, SVR, and XGBoost) are fundamentally non–deep learning methods that do not rely on large numbers of trainable parameters; therefore, matching parameter counts with deep neural networks is not applicable. Their inclusion serves to provide a classical benchmark commonly used in soil spectroscopy studies.

Second, for the deep learning baselines (VGG16, ResNet18, and CNN-Transformer), these architectures inherently contain far more parameters than MAWC-Net. For clarity, the trainable parameter counts are summarized in the revised manuscript:

VGG16: 138,357,544 params, 527.79 MB

ResNet18: 11,689,512 params, 44.59 MB

CNN-Transformer: 11,708,481 params, 44.66 MB

MAWC-Net: 8,599,657 params, 32.81 MB

Thus, MAWC-Net does not outperform the baselines simply because it has more parameters; in fact, it achieves superior accuracy despite having fewer parameters than most deep-learning baselines. This supports the conclusion that the performance gain comes from the proposed architectural design—multi-scale convolution, lightweight attention, and wavelet decomposition—rather than from model size alone.

We have added these comparisons and the corresponding discussion to the revised manuscript to ensure transparency and fairness in baseline evaluation. [page 17, and L545-550]

 

Comments 5: Also, please mention how much training the rival solutions received. The authors' model ran 500 epochs to saturation. Were the rival models saturated by training?

Response 5: To ensure a fair and consistent comparison, all baseline methods were trained using the same preprocessing procedures and comparable hyperparameter tuning budgets. We have also reported the number of training epochs and the early-stopping criteria for each baseline in the revised manuscript. As noted in Section 3.2 (Comparative Experiment), all models reached convergence well within 500 epochs, confirming that the baselines were sufficiently trained.

Reviewer 4 Report

Comments and Suggestions for Authors

The paper under consideration deals with a very topical subject of improving the quality and volume of information gathered (or extracted) from electromagnetic spectra. Apart from increasing the amount of information by itself, this would increase the information crucial for decision making, both in scientific research and in practical, everyday applications. In general, a really principal increase in the information content of spectral data may be obtained using computer-based techniques, and machine learning and neural networks have now probably proved to be the best solution.

This situation is especially topical and crucial in the chemical analysis of complex samples. In fact, the mainstream is to use sample preparation, either chemical or physical (or both together), to simplify the sample composition and obtain feature spectra. However, this approach undoubtedly changes the sample and is not always possible (very large sets of samples or distant analysis). Classical spectral feature extraction by physico-mathematical approaches is robust but may result in loss of information. Thus, machine learning (ML) may be a principal approach for complex samples, especially soil samples, which are among the most complicated samples in analysis, exceeding in this respect even biomedical materials.

The authors made a very logical and clear application of the ML approach in revealing the features of soil samples, and especially challenging is the use of Vis-NIR-SWIR spectra, which are the most featureless (compared to UV, MIR-FIR, or Raman spectra). The workability and reliable results obtained here show the possibilities of this approach. The research design is correct and is described in due detail in the large experimental part of the study. The selection of the samples and the control of ML-based results by external and relevant non-spectral features (like pH and matrix components) is fully justified. All the findings in the paper are discussed well and supported by illustrations. The discussion of the data and findings is made in due form and is very clear. All the conclusions are supported with data.

In my opinion, this study is topical and well-done, and requires just correcting some minor formatting issues in the text and other minor issues given below

  1. Significant digits. The authors correctly use 3 significant digits throughout the text, which, in my opinion, correspond to the precision of the initial data and calculations. However, the data in Figure 8 show mean values of 4 significant digits, which probably needs to be corrected.
  2. Significant digits. The same problem is Table 1 (Statistical characteristics of soil pH datasets). Some data (mean values again) show 4 significant digits, the precision that cannot be achieved for pH measurements, especially in complex samples like soil. I recommend correct all the feature values in this table to the correct 3 significant digits.
  3. Although Figure 1 shows that some spectra and features change upon data handling, no specific spectra can be discussed (and this is not needed due to the ML application). Thus, the whole figure, or at least panels a through c (or b through d), would be better placed in the Supplementary Material.

Author Response

Comments 1: Significant digits. The authors correctly use 3 significant digits throughout the text, which, in my opinion, correspond to the precision of the initial data and calculations. However, the data in Figure 8 show mean values of 4 significant digits, which probably needs to be corrected.

Response 1: We thank the reviewer for pointing this out. We [have revised Figure 8 to address the concern regarding the number of digits]. Ambiguous digits in the figure have been removed, and the values are now described in detail within the text using 3 significant digits, consistent with the realistic precision of pH measurements in soil samples. [page 17, and L558-559]

 

Comments 2: Significant digits. The same problem is Table 1 (Statistical characteristics of soil pH datasets). Some data (mean values again) show 4 significant digits, the precision that cannot be achieved for pH measurements, especially in complex samples like soil. I recommend correct all the feature values in this table to the correct 3 significant digits.

Response 2: We appreciate the reviewer’s comment. [All feature values in Table 1 have been revised to 3 significant digits] to accurately reflect the measurement precision achievable for soil pH, and all other numerical values throughout the manuscript have also been adjusted to maintain 3 significant digits. [page 4, and L162-163]

 

Comments 3: Although Figure 1 shows that some spectra and features change upon data handing, any specific spectra cannot be discussed (and this is not needed due to ML application). Thus, the whole Figure or at least panels a through c (or b through d) should better be in the Supplementary.

Response 3: We appreciate the reviewer’s suggestion. After careful consideration, we agree that Figure 1 is not essential for the main text due to the nature of the ML application. Therefore, we [have removed it from the manuscript]. [page 3, and L110-141]

 

Reviewer 5 Report

Comments and Suggestions for Authors
  1. The current introduction is overly long and some content is repetitive. I strongly recommend restructuring the current introduction to: (1) clearly define the specific limitations of existing Vis–NIR modeling approaches; (2) avoid listing unrelated deep learning architectures; (3) explicitly state how MAWC-Net fills the identified research gap.
  2. Important training and model-complexity details are missing, such as parameter count, FLOPs, training time per epoch, hyperparameter selection criteria, and sensitivity to PAA dimensionality. I believe it will be better if the authors can add these details to enhance reproducibility.
  3. You should include statistical tests to verify that MAWC-Net significantly outperforms baseline models, rather than relying solely on point estimates.
  4. This manuscript repeatedly claims improved feature extraction but does not provide interpretability analyses. I strongly recommend adding wavelength-importance visualizations (e.g., attention heatmaps, Grad-CAM-style maps, or SHAP) and highlighting the spectral regions contributing most to predictions.
  5. Please improve the quality of your figures and simplify the tables, making the manuscript easier for readers to read.
  6. The Discussion section mainly restates experimental results. Please deepen the discussion by: (1) comparing MAWC-Net’s structure with related multi-scale and wavelet-enhanced models; (2) analyzing failure cases or prediction biases; (3) clarifying practical implications for real-world soil sensing.

Author Response

Comments 1: The current introduction is overly long and some content is repetitive. I strongly recommend restructuring the current introduction to: (1) clearly define the specific limitations of existing Vis–NIR modeling approaches; (2) avoid listing unrelated deep learning architectures; (3) explicitly state how MAWC-Net fills the identified research gap.

Response 1: We appreciate the reviewer’s comment. We have revised the Introduction to clearly articulate both the limitations of existing Vis–NIR modeling approaches and how MAWC-Net addresses these gaps. Specifically, we emphasize that most previous models operate with a single receptive field, which constrains their ability to capture fine-grained absorption features and broader spectral patterns simultaneously. In response to this limitation, MAWC-Net introduces two distinct multi-scale convolutional modules, enabling the model to perceive spectral information at multiple scales and thereby substantially enhancing feature extraction capability.

In addition, we [have carefully streamlined the Introduction] by removing redundant descriptions and ensuring that all referenced deep learning architectures are directly relevant to soil spectral analysis. The retained studies represent recent and closely related work in using hyperspectral or Vis–NIR/Vis–NIR–SWIR data for soil property prediction. [page 1-3, and L30-109]

 

Comments 2: Important training and model-complexity details are missing, such as parameter count, FLOPs, training time per epoch, hyperparameter selection criteria, and sensitivity to PAA dimensionality. I believe it will be better if the authors can add these details to enhance reproducibility.

Response 2: We sincerely appreciate the reviewer’s valuable comments. In this revision, we [have supplemented all the requested training and model-complexity details] to enhance the reproducibility of our study. Specifically, we added the parameter count of MAWC-Net (8,599,657 parameters; 32.8 MB), the training time per epoch (approximately 1.21 s), the GPU memory consumption (about 3.6 GB), and the complete hardware/software configuration. We also clarified that all hyperparameters were selected using a grid-search strategy. These additions have been incorporated into the revised manuscript (Section 2.5, Experimental Setup). [page 14, and L453-469]

 

Comments 3: You should include statistical tests to verify that MAWC-Net significantly outperforms baseline models, rather than relying solely on point estimates.

Response 3: We appreciate the reviewer’s suggestion regarding statistical tests. To assess the robustness of MAWC-Net, we conducted 30 independent runs with different random seeds and reported the mean and variation of RMSE values. The results consistently show that MAWC-Net outperforms all baseline models across both pH(CaCl₂) and pH(H₂O) prediction tasks.

Given that traditional machine learning baselines (PLSR, Ridge, SVR, XGBoost) exhibit negligible variability across runs, the performance difference is already evident from the reported metrics. For deep learning baselines with inherent randomness, one could perform paired statistical tests; however, the RMSE differences are substantial and consistent across all seeds, making the superiority of MAWC-Net clear without formal significance testing.

We believe that the combination of multiple-seed experiments and reported performance metrics sufficiently demonstrates both robustness and statistical reliability of our model.
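For reference, the paired, non-parametric test mentioned above could be run on matched per-seed RMSE values as in the sketch below; the numbers are placeholders, not results from the paper:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Placeholder per-seed RMSE values for 30 matched runs (not results from the paper).
rmse_mawc = rng.normal(0.31, 0.01, 30)
rmse_baseline = rng.normal(0.36, 0.02, 30)

# One-sided paired test: is MAWC-Net's RMSE systematically lower across seeds?
stat, p = wilcoxon(rmse_mawc, rmse_baseline, alternative="less")
print(f"Wilcoxon statistic = {stat:.1f}, p = {p:.3g}")
```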

 

Comments 4: This manuscript repeatedly claims improved feature extraction but does not provide interpretability analyses. I strongly recommend adding wavelength-importance visualizations (e.g., attention heatmaps, Grad-CAM-style maps, or SHAP) and highlighting the spectral regions contributing most to predictions.

Response 4: We thank the reviewer for the suggestion regarding visualization of wavelength importance. As discussed in Section 4.2 “Method Advantages and Limitations,” although MAWC-Net incorporates multiple attention mechanisms that highlight critical spectral regions, the network’s structural complexity—particularly the coupling of CBAM and PAM modules—makes it challenging to explicitly identify dominant spectral bands.

Consequently, generating attention heatmaps or SHAP/Grad-CAM style visualizations may not accurately reflect the internal decision-making process of MAWC-Net and could be misleading. We have acknowledged this limitation in the manuscript and note that enhancing interpretability is a potential direction for future work.

 

Comments 5: Please improve the quality of your figures and simplify the tables, making the manuscript easier for readers to read.

Response 5: We thank the reviewer for this valuable suggestion. In response, we have improved the quality of all figures by increasing resolution and ensuring consistent font sizes and line widths for better readability. Additionally, we have simplified the tables by removing redundant information and reorganizing the data presentation to make the manuscript clearer and easier to read.

 

Comments 6: The Discussion section mainly restates experimental results. Please deepen the discussion by: (1) comparing MAWC-Net’s structure with related multi-scale and wavelet-enhanced models; (2) analyzing failure cases or prediction biases; (3) clarifying practical implications for real-world soil sensing.

Response 6: Thank you very much for this constructive comment. According to the reviewer’s suggestions, we [have substantially enriched the Discussion section in three aspects]. First, we added a structural comparison between MAWC-Net and existing multi-scale and wavelet-enhanced spectral models. The revised discussion highlights how the proposed MSBCM and MSCDM modules differ from conventional multi-scale CNNs or wavelet-based feature extractors in terms of receptive-field diversity, kernel sharing, computational efficiency, and frequency-domain representation. Second, we expanded the analysis of failure cases and prediction biases. Specifically, we discuss how deviations at extreme pH values or abnormal spectral patterns may be caused by a combination of intrinsic soil heterogeneity, hyperspectral acquisition variability (illumination, moisture, sample preparation), and occasional measurement noise, and how these insights can guide future data collection and preprocessing improvements. Third, we elaborated on the practical implications of MAWC-Net for real-world field sensing, emphasizing its potential for rapid pH estimation in precision agriculture and soil fertility assessment, while also noting deployment considerations such as model lightweighting. These revisions collectively deepen the discussion by linking architectural design with empirical behaviors and application scenarios. [page 19-20, and L612-661]

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

This paper presents a new method for soil pH prediction. The method combines many modules, such as a multiscale convolutional network, an attention mechanism, and wavelet decomposition.

The authors have answered my previous questions, so the paper can be published.

Author Response

We sincerely thank the reviewer for their positive evaluation of our work. We are pleased that the proposed method and the revisions have addressed your previous concerns. We greatly appreciate your time and constructive comments, which have helped improve the quality of the manuscript.

Reviewer 2 Report

Comments and Suggestions for Authors

1) Authors should clearly highlight all revised text in the manuscript. In the current version, it is difficult to verify whether the comments have been adequately addressed.

2) The authors provided the following response to Comment 3, but the statement discussed here should also be incorporated into the main text, and the corresponding papers should be cited:

“Response 3: We appreciate the reviewer’s suggestion to justify this design choice against transformer-based and attention-based alternatives. We [primarily discussed the advantages and limitations of MAWC-Net].
First, the multi-scale convolution strategy is well suited to one-dimensional spectral data without an explicit temporal dimension. Parallel convolutions with different kernel sizes (e.g., 3, 5, 7, 9) allow the network to extract features at multiple receptive fields, capturing both narrow absorption peaks and broader spectral patterns. In contrast, Transformer architectures are primarily designed for sequential data with temporal dependencies, where self-attention mechanisms dynamically model long-range interactions. In the attention-based study “Evaluating Cross-Building Transferability of Attention-Based Automated Fault Detection and Diagnosis for Air Handling Units,” the core architecture relies heavily on self-attention mechanisms. In the Transformer-based study “A Hybrid SMOTE and Trans-CWGAN for Data Imbalance in Real Operational AHU AFDD,” the primary modeling component is again a Transformer with multi-head self-attention.
When using a Transformer architecture, the multi-head self-attention mechanism is an essential component. In self-attention, the input features must be projected into query (Q), key (K), and value (V) matrices through separate learned linear transformations. The attention operation then computes QKᵀ followed by a softmax normalization and a weighted multiplication with V. These steps involve multiple large matrix multiplications, and in the multi-head setting they are repeated across several attention heads. As a result, the computational cost grows significantly, leading to substantial memory usage and increased runtime complexity. During our preliminary experiments, we also evaluated self-attention modules, but the computational overhead was prohibitive given our hardware limitations. Therefore, we chose to focus on computationally efficient alternatives that still retain strong representational capacity.”

Author Response

Comments 1:  Authors should clearly highlight all revised text in the manuscript. In the current version, it is difficult to verify whether the comments have been adequately addressed.

Response 1:  Thank you for this helpful suggestion. We fully understand the reviewer’s concern regarding the clarity of revisions. In the revised manuscript, we have highlighted all newly added, removed, and modified text in red to ensure that every change can be easily identified. This includes revisions related to terminology consistency (e.g., the unified naming of the wavelet module), explanations for using PAA, and clarifications regarding the fairness of baseline comparisons. We have carefully checked the entire document to ensure that all revisions corresponding to the reviewers’ comments are clearly marked. We hope this improves the transparency and readability of the revised manuscript.

 

Comments 2: 

The authors provided the following response to Comment 3, but the statement discussed here should also be incorporated into the main text, and the corresponding papers should be cited:

“Response 3: We appreciate the reviewer’s suggestion to justify this design choice against transformer-based and attention-based alternatives. We [primarily discussed the advantages and limitations of MAWC-Net].
First, the multi-scale convolution strategy is well suited to one-dimensional spectral data without an explicit temporal dimension. Parallel convolutions with different kernel sizes (e.g., 3, 5, 7, 9) allow the network to extract features at multiple receptive fields, capturing both narrow absorption peaks and broader spectral patterns. In contrast, Transformer architectures are primarily designed for sequential data with temporal dependencies, where self-attention mechanisms dynamically model long-range interactions. In the attention-based study “Evaluating Cross-Building Transferability of Attention-Based Automated Fault Detection and Diagnosis for Air Handling Units,” the core architecture relies heavily on self-attention mechanisms. In the Transformer-based study “A Hybrid SMOTE and Trans-CWGAN for Data Imbalance in Real Operational AHU AFDD,” the primary modeling component is again a Transformer with multi-head self-attention.
When using a Transformer architecture, the multi-head self-attention mechanism is an essential component. In self-attention, the input features must be projected into query (Q), key (K), and value (V) matrices through separate learned linear transformations. The attention operation then computes QKᵀ followed by a softmax normalization and a weighted multiplication with V. These steps involve multiple large matrix multiplications, and in the multi-head setting they are repeated across several attention heads. As a result, the computational cost grows significantly, leading to substantial memory usage and increased runtime complexity. During our preliminary experiments, we also evaluated self-attention modules, but the computational overhead was prohibitive given our hardware limitations. Therefore, we chose to focus on computationally efficient alternatives that still retain strong representational capacity.”

Response 2: Thank you for this insightful comment. We agree that the discussion provided in our previous response should also be integrated into the manuscript. Accordingly, the full rationale behind our design choice—particularly the comparison between the multi-scale convolution strategy and attention- or Transformer-based alternatives—has now been incorporated into the Discussion section of the revised manuscript. All newly added text associated with this revision is highlighted in red for easy verification. We hope these additions further improve the clarity and completeness of the manuscript. [page 19-20, L617-637]

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have responded to all of the comments and revised the paper accordingly. It can be published now. 

Author Response

We sincerely thank the reviewer for their positive evaluation of our work. We are pleased that the proposed method and the revisions have addressed your previous concerns. We greatly appreciate your time and constructive comments, which have helped improve the quality of the manuscript.

Reviewer 5 Report

Comments and Suggestions for Authors

The authors addressed all my questions and revised as my comments. I have no further questions or comments. Recommend to accept in current form.

Author Response

We sincerely thank the reviewer for their positive evaluation of our work. We are pleased that the proposed method and the revisions have addressed your previous concerns. We greatly appreciate your time and constructive comments, which have helped improve the quality of the manuscript.

Round 3

Reviewer 2 Report

Comments and Suggestions for Authors

I recommend that this paper be published, but at the proof stage the authors need to check the main text for readiness. For example, the full name of HWDM appears twice, at lines 192 and 216.