Article
Peer-Review Record

UniHSFormer X for Hyperspectral Crop Classification with Prototype-Routed Semantic Structuring

Agriculture 2025, 15(13), 1427; https://doi.org/10.3390/agriculture15131427
by Zhen Du 1, Senhao Liu 2, Yao Liao 3, Yuanyuan Tang 4, Yanwen Liu 5, Huimin Xing 6, Zhijie Zhang 7 and Donghui Zhang 8,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 20 May 2025 / Revised: 26 June 2025 / Accepted: 1 July 2025 / Published: 2 July 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

In this study, the authors used deep learning to predict classes in agricultural terrains. The topic is interesting, and the manuscript is well organized. However, several concerns should be addressed.

1. An analysis of the number of parameters, the computational complexity, and the processing time of each method should be provided.

2. It is still unclear what the features are for the SVM and RF methods.

3. The method of training and testing the methods is unclear.

4. The separate analysis of the three regions seems odd. The authors should provide a method of presenting the performance of the machines regardless of the regions.

5. More discussion on Table 5 is required. And why are only nine classes analysed in this table?

Author Response

Reviewer Comment 1:

“The authors should analyze the parameter count, computational complexity, and even the processing time of each method.”

Response to Comment 1

(Line 797-879)

We thank the reviewer for this insightful suggestion. In response, we have added a dedicated Section 4.5 titled “Complexity and Runtime Analysis”, which provides a comprehensive evaluation of each compared model from three perspectives:

Hyperparameter Design – We enumerate and compare the number of core hyperparameters for each model to reflect their architectural flexibility and tuning burden. As clarified, a higher count does not imply excessive tuning complexity, since many parameters operate within standard, interpretable bounds.

Theoretical Computational Complexity – We provide symbolic complexity expressions to characterize the inference burden of each model. Convolutional networks scale with local operations, transformers with token length squared, and our proposed UniHSFormer-X achieves linear complexity through prototype-routed attention.

Empirical Inference Time – Using a controlled benchmarking setup (PyTorch 2.1.0, CUDA 12.1, RTX 4090 GPU, 7×7×30 input blocks), we report per-image inference latency for all methods under unified precision (FP32), excluding I/O or augmentation overhead.
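The scaling claim in the complexity item can be illustrated with a small counting sketch. It is not the authors' analysis: it assumes that full self-attention scores every token against every other token, while prototype-routed attention scores each of N tokens against K class prototypes of width d. The function names and sizes are hypothetical.

```python
def self_attention_score_macs(n_tokens: int, dim: int) -> int:
    """Multiply-accumulates for an N x N score matrix (token vs. token)."""
    return n_tokens * n_tokens * dim


def prototype_attention_score_macs(n_tokens: int, n_prototypes: int, dim: int) -> int:
    """Multiply-accumulates for an N x K score matrix (token vs. prototype).

    Assumption: K is fixed by the number of classes, so cost grows linearly in N.
    """
    return n_tokens * n_prototypes * dim


def growth_factor(cost_fn, n_tokens: int, **kwargs) -> float:
    """How much the cost grows when the token count is doubled."""
    return cost_fn(2 * n_tokens, **kwargs) / cost_fn(n_tokens, **kwargs)
```

Under these assumptions, doubling the token count quadruples the first cost and doubles the second, which is the sense in which a fixed prototype set gives linear scaling.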

In addition, we introduce Table 8 to summarize hyperparameter counts, complexity notations, and inference times, and Table 9 to provide real training/testing time, FLOPs, and parameter size across three datasets (LongKou, HanChuan, and HongHu). These metrics collectively demonstrate that UniHSFormer-X achieves competitive or superior efficiency among transformer-based models, balancing scalability, semantic richness, and deployment readiness—an essential consideration for agricultural remote sensing under resource constraints.
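A latency measurement of the kind described above is usually taken with warm-up calls followed by repeated timed calls. The following is a generic sketch, not the authors' benchmarking script: `measure_latency_ms` and its parameters are hypothetical, and the model is represented by any Python callable. On a GPU, kernels run asynchronously, so the device would also need to be synchronized before each clock read (for example with `torch.cuda.synchronize()` in PyTorch); that step is omitted here.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def measure_latency_ms(fn: Callable[[object], object],
                       inputs: Sequence[object],
                       warmup: int = 10,
                       repeats: int = 100) -> float:
    """Return the mean wall-clock time per call in milliseconds.

    Only the call itself is timed, so data loading and augmentation
    performed before this function is invoked are excluded.
    """
    # Warm-up calls are not timed; they absorb one-off setup costs.
    for i in range(warmup):
        fn(inputs[i % len(inputs)])

    timings = []
    for i in range(repeats):
        sample = inputs[i % len(inputs)]
        start = time.perf_counter()
        fn(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)
```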

We believe this thorough analysis fully addresses the reviewer’s concern.

Reviewer Comment 2

“It is still unclear what the features are for the SVM and RF methods.”

Response to Comment 2

(Line 380-387)

We appreciate the reviewer’s observation. In response, we have clarified in Section 4.1 that both SVM and RF models were trained using per-pixel spectral vectors directly extracted from the hyperspectral cube. Each input corresponds to the full-band reflectance profile of a single pixel, without the inclusion of spatial context or handcrafted descriptors. This design aligns with standard practice in hyperspectral remote sensing and ensures a fair comparison between classical and deep learning models. The feature normalization procedure (z-score) is also applied uniformly across models. The added clarification now makes the feature construction process for SVM and RF fully transparent.
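The feature construction described here can be sketched as follows. This is an illustration only: the response does not say whether the z-score is computed per band or globally, nor on which pixels the statistics are estimated, so the sketch assumes per-band statistics over all supplied pixels. The helper names are hypothetical.

```python
from statistics import mean, pstdev


def cube_to_pixel_vectors(cube):
    """Flatten an H x W x B nested list into H*W spectra of length B.

    Each output row is the full-band profile of one pixel; no spatial
    neighbourhood or handcrafted descriptor is attached.
    """
    return [list(pixel) for row in cube for pixel in row]


def zscore_per_band(vectors):
    """Standardise every band to zero mean and unit standard deviation.

    Assumption: statistics are taken per band; constant bands are left at 0.
    """
    n_bands = len(vectors[0])
    means = [mean([v[b] for v in vectors]) for b in range(n_bands)]
    stds = [pstdev([v[b] for v in vectors]) or 1.0 for b in range(n_bands)]
    return [[(v[b] - means[b]) / stds[b] for b in range(n_bands)]
            for v in vectors]
```

The resulting rows could then be passed to any per-pixel classifier such as an SVM or a random forest.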

Reviewer Comment 3

“The method of training and testing the methods is unclear.”

Response to Comment 3

(Line 387-390)

Thank you for highlighting this point. We have now clarified the training and evaluation procedure in Section 4.1 to enhance transparency and reproducibility. Specifically, all models were trained using a consistent 80/20 split strategy, where 100 labeled pixels per class were randomly sampled to form the training set (Train100), and the remaining labeled pixels were reserved for testing. This fixed partitioning was applied identically across all experiments.
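A fixed per-class sampling scheme of this kind can be sketched as below. The code is a hypothetical illustration, not the authors' script: it draws a fixed number of labeled pixels per class for training and keeps the rest for testing, with a seed so that the same partition can be reused across experiments. The function name and seed handling are assumptions.

```python
import random
from collections import defaultdict


def split_fixed_per_class(labels, n_train_per_class=100, seed=0):
    """Return (train_idx, test_idx) over the positions of `labels`.

    `labels` holds one class id per labeled pixel. Classes with fewer than
    `n_train_per_class` pixels contribute all of them to the training set.
    """
    rng = random.Random(seed)  # local generator keeps the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        train_idx.extend(indices[:n_train_per_class])
        test_idx.extend(indices[n_train_per_class:])
    return sorted(train_idx), sorted(test_idx)
```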

No cross-validation was performed, and each model was independently trained from scratch. To ensure a fair comparison, all methods adopted consistent optimization strategies with bounded hyperparameter ranges. While architectural constraints required some variation in batch sizes or training epochs, these settings were harmonized under comparable computational budgets. Core training configurations are described in Section 4.1, and model-specific details are summarized in later sections.

This revision ensures that the training/testing setup is clear and that comparisons between models are conducted on a unified and reproducible basis.

Reviewer Comment 4

“The separate analysis of the three regions seems odd. The authors should provide a method of presenting the performance of the machines regardless of the regions.”

Response to Comment 4

(Line 591-603)

Thank you for this insightful suggestion. We fully agree that a unified evaluation across regions is critical for assessing the generalizability of classification models. In the revised manuscript, we have added a new cross-region performance summary (Table 5) that aggregates the Overall Accuracy (OA), Average Accuracy (AA), and Kappa Coefficient for each model across the three WHU-Hi datasets (LongKou, HanChuan, and HongHu). This table provides an averaged metric-based comparison regardless of specific regional context, thereby addressing the concern regarding isolated regional evaluations.
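For reference, the three metrics named above can be computed from a confusion matrix using their standard definitions. The sketch below is not taken from the manuscript; the function name is hypothetical, and rows are assumed to index the true class.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) for a square confusion matrix (rows = true class)."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    diagonal = [confusion[i][i] for i in range(n_classes)]
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[r][c] for r in range(n_classes))
                for c in range(n_classes)]

    # Overall Accuracy: fraction of all samples classified correctly.
    overall_accuracy = sum(diagonal) / total

    # Average Accuracy: mean of per-class recalls, skipping empty classes.
    recalls = [diagonal[i] / row_sums[i]
               for i in range(n_classes) if row_sums[i] > 0]
    average_accuracy = sum(recalls) / len(recalls)

    # Cohen's kappa: agreement corrected for chance agreement.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (overall_accuracy - expected) / (1.0 - expected)
    return overall_accuracy, average_accuracy, kappa
```

A region-independent summary could then average these triples over the three scenes; whether the authors used an unweighted mean is not stated in the response.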

We have also included an accompanying paragraph at the end of Section 4.2 to interpret these unified results and position them in relation to the model’s robustness and generalization behavior. Notably, UniHSFormer-X consistently achieves the highest cross-region scores, supporting its effectiveness beyond localized conditions.

The new content appears in Section 4.2, just before the original concluding reflection, and is clearly marked in the revised manuscript.

Reviewer Comment 5

“More discussion on Table 5 is required. And why are only nine classes analysed in this table?”

Response to Comment 5

(Line 627-673)

Thank you for your valuable comment. We have revised the manuscript to address this concern in three aspects:

(1) Additional ablation configurations (C10 and C11):

To provide a more comprehensive assessment of architectural robustness, we have extended Table 6 and the corresponding analysis by including two additional ablation configurations:

C10: Enables only multi-scale supervision and the backbone, omitting both semantic routing and projection;

C11: Disables both multi-scale supervision and the backbone, retaining routing and projection only.

These additions enrich the factorial design and allow a more precise understanding of interaction effects among key modules. The updated results are reported in Table 6, and the main text in Section 4.3 has been revised accordingly.

(2) Justification of class selection in Table 5:

The nine categories reported in Table 5 were not arbitrarily chosen but selected based on their semantic and structural representativeness across datasets. Specifically, they include:

Spectrally ambiguous classes (e.g., narrow-leaf soybean, mixed weed);

Structurally irregular or fragmented classes (e.g., grass, roads and houses);

Background or interference categories (e.g., tree, plastic);

Boundary-sensitive horticultural crops (e.g., Lactuca sativa, Brassica chinensis).

These classes were identified as the most informative for diagnosing the differential effects of structural ablation. Rather than listing all classes—which may dilute interpretability—we opted for a focused yet representative subset that captures the major failure modes. To support transparency, we have added a clarifying sentence in the caption of Table 7 and now provide full class-wise accuracy results in the supplementary material.

(3) Expanded analysis in the main text:

We have further enriched the paragraph accompanying Table 7 by discussing in more detail the per-class variation across architectural variants, emphasizing the semantic, spatial, and structural reasons why specific classes are more vulnerable under ablation. This addresses the request for more comprehensive discussion and provides deeper insight into robustness mechanisms.

We appreciate this insightful suggestion, which has led to a more thorough and better-structured robustness evaluation in the revised manuscript.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The research paper presents respectable findings and is well written, but it needs to be strengthened in the following ways before it can be accepted:

The author should discuss the challenges in hyperspectral crop classification, the differences between UniHSFormer-X and traditional models, the concept of "prototype-aligned paths" in UniHSFormer-X, and the improvement in segmentation performance achieved through semantic routing.

The author should cite additional literature on hyperspectral classification accuracy; the following source may be useful: DOI: 10.1109/ICCCIS48478.2019.8974502

The author should discuss the role of class-aware cues in the UniHSFormer-X architecture, evaluate it using UAV-based benchmark datasets, handle structured and fragmented planting patterns, and conduct an ablation study to analyze the architectural components.

The author should explore UniHSFormer-X's resilience, architectural grammar for agricultural AI, and its balance of precision and interpretability in real-world cropland analysis.

Comments on the Quality of English Language

The paper requires improvement in its English presentation, addressing grammatical issues, typos, and poorly written sentences.

Author Response

Reviewer Comment 1:

“The author should discuss the challenges in hyperspectral crop classification, the differences between UniHSFormer-X and traditional models, the concept of "prototype-aligned paths" in UniHSFormer-X, and the improvement in segmentation performance achieved through semantic routing.”

Response to Comment 1:

(Lines 84–107)

Thank you for your thoughtful and constructive comment. In the revised manuscript, we have addressed this concern by refining the final paragraph of the Introduction section. The revised text now explicitly discusses:

(1) The core challenges in hyperspectral crop classification, including spectral ambiguity, morphological irregularity, and inter-class similarity;

(2) The differences between UniHSFormer-X and traditional methods, emphasizing how our model departs from uniform or static token processing by introducing dynamic, class-informed mechanisms;

(3) The concept of “prototype-aligned paths”, which are implemented via a semantic routing mechanism that aligns token propagation with learnable class prototypes to improve structural consistency and suppress noise;

(4) The empirical improvements in segmentation performance resulting from this design, particularly in handling dense, fragmented, or spectrally entangled regions.

These clarifications are provided in a focused and concise manner to enhance clarity while maintaining narrative cohesion. We appreciate your suggestion, which helped us improve the conceptual framing and presentation of our contributions.
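The routing idea in item (3) can be illustrated with a toy sketch. It is not the UniHSFormer-X implementation: it assumes that each token is scored against one learnable prototype per class by a dot product and that a softmax turns the scores into routing weights. The function names and the temperature parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(tokens, prototypes, temperature=1.0):
    """Return, for every token, soft routing weights over class prototypes.

    The score matrix has shape N x K (tokens x prototypes), so the cost
    grows linearly with the number of tokens when K is fixed.
    """
    routing = []
    for token in tokens:
        scores = [sum(t * p for t, p in zip(token, proto)) / temperature
                  for proto in prototypes]
        routing.append(softmax(scores))
    return routing


def hard_assignment(routing):
    """Index of the highest-weight prototype for each token."""
    return [max(range(len(w)), key=w.__getitem__) for w in routing]
```

In this toy form, tokens that align with a prototype receive a weight close to 1 on that path, while ambiguous tokens spread their weight across several prototypes.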

Reviewer Comment 2:

“The author should provide additional literature on Hyperspectral, accuracy, the following source may be useful: DOI: 10.1109/ICCCIS48478.2019.8974502.”

Response to Comment 2:

(Lines 118-123)

We sincerely thank the reviewer for this insightful recommendation. The referenced work by Borana et al. (2019) provides a valuable contribution to the field of hyperspectral image classification, especially in the context of arid vegetation species. The authors skillfully integrated four classical supervised learning algorithms—Spectral Angle Mapper (SAM), Minimum Distance (MD), Spectral Information Divergence (SID), and Support Vector Machine (SVM)—and performed a comprehensive comparative analysis using high-resolution field-based hyperspectral imagery with 240 bands. Their rigorous assessment demonstrated the superior performance of SVM (achieving 81.2% overall accuracy on full-band data) and SAM (with 76.6% accuracy using MNF-transformed bands), highlighting the effectiveness of classical methods when appropriately tailored to local vegetation conditions. Moreover, the study emphasized the importance of data dimensionality reduction using MNF transformation, which is highly relevant to hyperspectral data processing and model performance enhancement.

Incorporating this study significantly strengthens the methodological context of our own work. Accordingly, we have cited this reference in Section 2: Related Works, where we discuss the evolution of hyperspectral classification techniques. The citation emphasizes how classical machine learning methods, including SVM and SAM, have been successfully applied to real-world hyperspectral vegetation mapping, thus providing foundational insights that complement our deep learning-based framework. We deeply appreciate the reviewer’s suggestion, which has allowed us to further contextualize our model development within the broader landscape of hyperspectral classification literature.

Reviewer Comment 3:

“The author should discuss the role of class-aware cues in the UniHSFormer-X architecture, evaluate it using UAV-based benchmark datasets, handle structured and fragmented planting patterns, and conduct an ablation study to analyze the architectural components.”

Response to Comment 3:

We greatly appreciate the reviewer’s detailed suggestions. In response, we have taken the following steps to enhance the manuscript:

  • Class-aware cues in UniHSFormer-X:

We have clarified the mechanism and importance of class-aware cues in Section 4.3 (“Architectural Dissection and the Structural Grammar of Robustness”). Specifically, the semantic routing mechanism and prototype projection module enable the model to dynamically guide feature tokens through class-specific semantic pathways. This encourages disentangled feature encoding and improves the model’s ability to handle inter-class ambiguity in heterogeneous field conditions. These components form the backbone of our class-aware design.

  • Evaluation on UAV-based benchmark datasets:

The proposed method was thoroughly validated using the WHU-Hi UAV-captured benchmark datasets (LongKou, HanChuan, and HongHu), which encompass diverse real-world crop scenes with varying resolutions, backgrounds, and noise levels. These datasets are representative of practical agricultural monitoring scenarios and are commonly used for benchmarking hyperspectral models.

  • Structured and fragmented planting patterns:

We explicitly addressed the challenges of structured (row-based) and fragmented (irregular) planting patterns in Section 5 (Discussion), noting how the model’s hybrid attention mechanism and patchwise encoding strategy allow it to generalize across spatially discontinuous patterns and avoid overfitting to linear textures.

  • Ablation study:

A comprehensive ablation study was presented in Table 5 and Section 4.3. Eleven configurations (C1-C11) were designed to systematically disable or isolate key architectural modules. Results demonstrate that class-aware modules, particularly prototype routing and semantic alignment, significantly contribute to robustness and accuracy. This study directly supports the architectural claims and validates each component’s role in the proposed model.

We thank the reviewer again for highlighting these key aspects, which have been clarified and emphasized in the revised manuscript.

Reviewer Comment 4:

“The author should explore UniHSFormer-X's resilience, architectural grammar for agricultural AI, and its balance of precision and interpretability in real-world cropland analysis.”

Response to Comment 4:

(Line 622-626, 1050-1053)

We thank the reviewer for this thoughtful and forward-looking suggestion. The architectural design of UniHSFormer-X indeed centers on the pursuit of not only classification accuracy, but also structural resilience and semantic transparency in complex agricultural environments. To better reflect this in our manuscript, we have made the following revisions:

  • Clarified model resilience and architectural grammar in Section 4.3:

In the first paragraph of Section 4.3, we originally described the component-wise ablation design. To explicitly highlight the model’s robustness and structural logic, we added the following sentence at the end of that paragraph:

“This structural dissection not only validates the resilience of UniHSFormer-X against modular perturbation but also reveals an underlying architectural grammar, wherein each component operates in a semantically coordinated role—balancing model complexity with robustness.”

This enhancement directly addresses the reviewer’s concern regarding resilience and the architectural grammar underlying UniHSFormer-X.

  • Refined the conclusion of Section 5.4 to reflect the precision–interpretability balance:

In the final paragraph of Section 5.4, we revised the last sentence to more clearly express how UniHSFormer-X balances performance with transparency and field-readiness:

“they edge closer to becoming deployable systems that work with, rather than in spite of, the complexity of real-world farming—embodying an architectural grammar that balances precision with interpretability, and structural resilience with semantic fidelity.”

This revision incorporates the reviewer’s suggested keywords and clarifies that interpretability is not a secondary outcome but a built-in objective of the architectural design, especially important in operational agricultural contexts.

We sincerely thank the reviewer again for pointing us toward this dimension, which helped us refine both the scientific expression and the conceptual framing of our model’s design philosophy.

Reviewer Comment 5:

“The paper requires improvement in its English presentation, addressing grammatical issues, typos, and poorly written sentences.”

Response to Comment 5:

We thank the reviewer for this helpful comment. In response, we have carefully addressed all linguistic and grammatical concerns by employing a professional English editing service. Specifically, we submitted the full manuscript (10,887 words) to MDPI’s Rapid English Editing service, which provides expert-level academic proofreading by native English editors.

The editing was completed on 14 June 2025, and the full payment was processed on 15 June 2025.

We are confident that the language of the revised manuscript now meets the standards expected for scholarly publication. We sincerely appreciate the reviewer’s attention to detail, which helped us improve the clarity and readability of the paper.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The paper presents a unified transformer-based architecture (UniHSFormer-X) designed to decode agricultural scenes by routing semantic tokens through prototype-aligned paths. This architecture represents an evolution over current techniques, as it dynamically selects, propagates, and aligns information based on class cues, reinforcing coherence across scale and spectrum. The model achieved very interesting performance in structured and fragmented planting contexts, demonstrating resilience under spectral ambiguity and field irregularity.

The paper presents the necessary and appropriate tables and maps. The figures are very illustrative and didactic, as are the tables that present the data in a clear and synthetic way.

The abstract presents an ambitious and technically promising model aimed at a nuanced and practical challenge—semantic segmentation of hyperspectral imagery in agriculture. However, its effectiveness would be better communicated with more clarity, quantitative backing, and concise articulation of novelty. Furthermore, I recommend revising the abstract to begin with a broader context about the importance of applying hyperspectral images in agriculture, before talking about the semantic segmentation of hyperspectral images.

The introduction is well-structured, technically rich, and conceptually forward-looking, placing UniHSFormer-X in a meaningful agricultural context. The text does an interesting job of diagnosing the limitations of previous work and articulating the motivation for a new class of models. However, some parts of the introduction are a bit verbose, especially in the opening paragraphs. Some sentences could be simplified without losing meaning. Furthermore, as reported, there are sections that need to be better written, with explanations based on examples. It would also be interesting to include quantitative data or global statistics on the benefits of using spectral imaging in agriculture, to reinforce the relevance of the topic. In the final part of the introduction, it is stated that the model “adapts seamlessly across datasets”, I think this statement needs to be more assertive. What makes it generalizable? Is it the architectural design, the training regime, or inductive biases?

The methodology is one of the strengths of the paper, offering a clear and detailed explanation of the modeling approach. The choice of a publicly available, diverse, and well-annotated dataset facilitates reproducibility and comparability. The inclusion of three different scenes ensures that the model is tested in homogeneous vs. heterogeneous field conditions and at varying spatial resolutions. 

The choice of dataset allows us to test the model’s ability to be applied at different spatial scales and to deal with spectral ambiguity, a central challenge in crop classification using hyperspectral images.

The proposed architecture combines two interesting and complementary techniques, semantically-attentioned transformers and spectro-spatial tokenization, which allows the simultaneous capture of fine-grained spectral information and local spatial context. The loss formulation is rich and well-justified, with multiple objectives that complement each other to promote good separation between classes and consistency in the tokens assigned to each class. However, the approach assumes that a prototype vector can represent each class well, which can be problematic for classes with high intra-class variability.

The Results and Discussion chapters are well described and confirm the issues raised in the objectives. The selection of heterogeneous environments with different types of crops, resolutions and levels of complexity ensures a robust test of the model's generalization capacity. The diversity of metrics also provides a multifaceted view of the model's performance. The high overall performance of UniHSFormer-X, with values above 99%, surpasses all competitors. High consistency between classes is also observed; even species that are typically difficult to classify showed high accuracy, which demonstrates the model's ability to deal with spectral ambiguity.

However, it is recommended that the results be compared and discussed with data obtained from other investigations, not only described parameter by parameter. References 54 and 55 appear in the methodology section (line 166), and the next references (56 and 57) appear only in the discussion section (line 775); that is, the entire presentation of the results and part of the discussion contain no comparison with studies carried out in other regions of the world. Furthermore, only two more references (58 and 59) follow, which represents a very limited scope. Therefore, to increase the robustness and generalizability of the discussion, it is recommended to incorporate comparisons with similar research from other regions around the world. Such comparisons would not only contextualize the results but also highlight the broader relevance and potential applicability of the study.

The conclusion of the study presents a well-articulated perspective on crop classification in hyperspectral imagery, positioning the proposed UniHSFormer-X model not merely as a technical improvement but as a profound shift towards semantic reasoning in agricultural AI. The conclusion does not focus solely on accuracy, but also values structural resilience and contextual adaptability. However, it describes the model as representing a “paradigm shift”, which seems a bit premature. Although UniHSFormer-X outperforms its competitors, the transition from classification to interpretation has not yet been fully realized.

Some suggestions or observations are also described in the pdf document, please see the document.

Comments for author File: Comments.pdf

Author Response

Reviewer Comment 1:

“The abstract presents an ambitious and technically promising model... However, its effectiveness would be better communicated with more clarity, quantitative backing, and concise articulation of novelty. Furthermore, I recommend revising the abstract to begin with a broader context about the importance of applying hyperspectral images in agriculture, before talking about the semantic segmentation of hyperspectral images.”

Response to Comment 1:

(Line 23-40)

We sincerely appreciate the reviewer’s insightful comments. In response, we have substantially revised the abstract to improve its clarity, logical coherence, and academic tone. Specifically, we now begin with a broader contextualization of the significance of hyperspectral imaging in agriculture, outlining both its potential and its prevailing challenges. We then introduce UniHSFormer-X by clearly articulating its core innovations—namely, prototype-guided semantic routing and class-aware hierarchical encoding. To better convey model effectiveness, we now include quantitative performance metrics (e.g., up to 99.80% OA and 99.28% AA across UAV-based benchmarks), alongside ablation and sensitivity analyses that validate the robustness and adaptability of our design. Finally, we emphasize the architectural interpretability and real-world applicability of the model, aiming to present not only technical novelty but also domain relevance. We believe this revised abstract more effectively communicates the motivation, contribution, and impact of our work, and we thank the reviewer for prompting this improvement.

Reviewer Comment 2:

“The introduction is well-structured... However, some parts of the introduction are a bit verbose... It would also be interesting to include quantitative data or global statistics on the benefits of using spectral imaging in agriculture... In the final part of the introduction, it is stated that the model ‘adapts seamlessly across datasets’, I think this statement needs to be more assertive...”

Response to Comment 2:

(Line 45-57, 96-98)

Thank you for the constructive feedback. Following your suggestions, we have revised the introduction in three key ways:

(1) We streamlined the opening paragraphs to reduce redundancy and enhance clarity, ensuring conciseness while retaining essential background.

(2) We incorporated quantitative evidence highlighting the benefits of hyperspectral imaging in agriculture. Specifically, we noted that recent meta-analyses report classification accuracies exceeding 90% across diverse crop types using HSI, outperforming traditional multispectral methods by up to 15% in complex scenes.

(3) We refined the statement regarding cross-dataset adaptability to make it more assertive and mechanistically grounded. We now explicitly attribute this robustness to the model’s architectural inductive bias—namely, the prototype-aligned semantic routing mechanism, class-conditioned token propagation, and multi-scale encoding—all of which contribute to consistent structural generalization under heterogeneous acquisition conditions.

We believe these changes have improved both the academic rigor and communicative clarity of the introduction, and we appreciate your insightful suggestions.

Reviewer Comment 3:

“However, the approach assumes that a prototype vector can represent each class well, which can be problematic for classes with high intra-class variability.”

Response to Comment 3:

(Line 99-107)

We appreciate the reviewer’s critical observation. In response, we have refined the closing paragraph of the Introduction to directly acknowledge this limitation and clarify how the model adapts to intra-class variation. Specifically, we explain that although a single prototype per class is employed, the token–prototype interactions are dynamically adjusted during training through class-conditioned routing. This mechanism offers practical flexibility in real-world agricultural scenarios, while preserving architectural interpretability. We also indicate this as a promising direction for future extension.

Reviewer Comment 4:

“However, it is recommended that the results be compared and discussed with data obtained from other investigations and not only with the description of each parameter... to increase the robustness and generalizability of the discussion, it is recommended to incorporate comparisons with similar research from other regions around the world.”

Response to Comment 4:

(Line 880-1053)

Thank you for this valuable suggestion. In response, we have revised the Results and Discussion section to incorporate comparisons with related studies from other regions, including benchmark investigations conducted in Europe, North America, and Southeast Asia. These additions contextualize our findings within the broader landscape of hyperspectral crop classification and help demonstrate the generalizability of our method. Specifically, we now reference models that adopt transformer-based or hybrid architectures and evaluate them under similar challenges—such as intra-class ambiguity, fragmented planting, and spatial heterogeneity. This comparative framing strengthens the interpretive depth and cross-regional relevance of our conclusions.

Reviewer Comment 5:

“The conclusion... describes the model as representing a ‘paradigm shift’, which seems a bit premature... the transition from classification to interpretation has not yet been fully realized.”

Response to Comment 5:

(Line 1070-1081)

We appreciate the reviewer’s thoughtful observation. While we maintain that UniHSFormer-X introduces a distinctive modeling perspective, we agree that the term “paradigm shift” may overstate the current stage of adoption and maturity. We have accordingly revised the conclusion to characterize our contribution as a conceptual framework that supports a broader transition toward structure-aware and interpretable modeling in hyperspectral agriculture. This wording acknowledges both the model’s novel direction and the need for continued development to realize its full potential.

Author Response File: Author Response.pdf
