UniHSFormer X for Hyperspectral Crop Classification with Prototype-Routed Semantic Structuring

Du, Zhen; Liu, Senhao; Liao, Yao; Tang, Yuanyuan; Liu, Yanwen; Xing, Huimin; Zhang, Zhijie; Zhang, Donghui

doi:10.3390/agriculture15131427


by Zhen Du ¹, Senhao Liu ², Yao Liao ³, Yuanyuan Tang ⁴, Yanwen Liu ⁵, Huimin Xing ⁶, Zhijie Zhang ⁷ and Donghui Zhang ⁸,*

¹ School of Economics and Management, East China University of Technology, Nanchang 330013, China
² Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
³ Guizhou Ecological Meteorology and Agrometeorology Center, Guiyang 550002, China
⁴ Changsha Natural Resources Comprehensive Survey Center, China Geological Survey, Changsha 410000, China
⁵ School of Resources and Environment Science and Engineering, Hubei University of Science and Technology, Xianning 437100, China
⁶ College of Surveying and Planning, Shangqiu Normal University, Shangqiu 476000, China
⁷ School of Geography, Development and Environment, The University of Arizona, Tucson, AZ 85719, USA
⁸ Institute of Remote Sensing Satellite, China Academy of Space Technology, Beijing 100094, China
* Author to whom correspondence should be addressed.

Agriculture 2025, 15(13), 1427; https://doi.org/10.3390/agriculture15131427

Submission received: 20 May 2025 / Revised: 26 June 2025 / Accepted: 1 July 2025 / Published: 2 July 2025

(This article belongs to the Topic Advances in Smart Agriculture with Remote Sensing as the Core and Its Applications in Crops Field)


Abstract

Hyperspectral imaging (HSI) plays a pivotal role in modern agriculture by capturing fine-grained spectral signatures that support crop classification, health assessment, and land-use monitoring. However, the transition from raw spectral data to reliable semantic understanding remains challenging—particularly under fragmented planting patterns, spectral ambiguity, and spatial heterogeneity. To address these limitations, we propose UniHSFormer-X, a unified transformer-based framework that reconstructs agricultural semantics through prototype-guided token routing and hierarchical context modeling. Unlike conventional models that treat spectral–spatial features uniformly, UniHSFormer-X dynamically modulates information flow based on class-aware affinities, enabling precise delineation of field boundaries and robust recognition of spectrally entangled crop types. Evaluated on three UAV-based benchmarks—WHU-Hi-LongKou, HanChuan, and HongHu—the model achieves up to 99.80% overall accuracy and 99.28% average accuracy, outperforming state-of-the-art CNN, ViT, and hybrid architectures across both structured and heterogeneous agricultural scenarios. Ablation studies further reveal the critical role of semantic routing and prototype projection in stabilizing model behavior, while parameter surface analysis demonstrates consistent generalization across diverse configurations. Beyond high performance, UniHSFormer-X offers a semantically interpretable architecture that adapts to the spatial logic and compositional nuance of agricultural imagery, representing a forward step toward robust and scalable crop classification.

Keywords: hyperspectral imaging; crop classification; agricultural remote sensing; transformer architecture; semantic routing; prototype projection; semantic segmentation; UAV-based monitoring

1. Introduction

Accurate crop type classification remains a foundational task in agricultural remote sensing, supporting precision farming, yield estimation, and land-use monitoring [1,2,3]. However, real-world planting systems—with interleaved plots, visually similar crops, and complex backgrounds—have challenged the representational capacity of traditional paradigms, particularly in horticultural and mixed cropping contexts where class boundaries are spectrally ambiguous [4,5,6,7].

Hyperspectral imaging (HSI) offers a way to resolve such ambiguity, capturing fine spectral signatures across hundreds of contiguous bands to distinguish subtle physiological differences [8,9,10]. Its growing role in global agricultural monitoring reflects this: recent meta-analyses report classification accuracies exceeding 90% across diverse crop types when using HSI-based models, outperforming traditional multispectral methods by up to 15% in complex scenes [11,12]. However, spectral richness alone is insufficient under spatial irregularity or ambiguous class transitions.

Early deep learning approaches, most notably 3D-CNNs and their spatial–spectral variants, improved over traditional machine learning by jointly encoding spectral and spatial features [10,13]. These methods proved effective in structured croplands with clear geometric regularity. However, their fixed receptive fields and limited context-awareness hinder generalization in fields with fragmented planting patterns, non-uniform illumination, or mixed vegetative cover [14,15]. Such limitations are amplified in agricultural environments where crop plots are irregularly shaped, interspersed with non-crop elements like mulch, water bodies, or infrastructure, and where class adjacency blurs semantic coherence [16].

Transformer-based architectures, by contrast, offer the potential to model agricultural scenes as holistic semantic fields [17,18]. Their self-attention mechanisms enable each spatial–spectral token to contextualize its identity based on distant dependencies, allowing, for instance, a lettuce patch partially occluded by shade to be aligned with similar instances across the scene. However, most existing transformer models, when directly applied to HSI, inherit three limitations from their vision or language origins: token sequences are treated uniformly, spatial hierarchy is flattened, and semantic structure is often diluted [17,19]. This is particularly problematic in agricultural settings where semantic function is tied to both local cues and spatial arrangement, for example, when distinguishing between classes that differ more by field layout than by spectral reflectance [20].

Recent hybrid models have attempted to combine the locality-preserving strengths of CNNs with the global abstraction capacity of transformers [21,22,23]. While such architectures achieve progress in scene-level accuracy, they often do not explicitly address the semantic disambiguation and spatial logic intrinsic to crop distributions. Static fusion schemes, redundant token flows, and uniform projection strategies fall short in maintaining class continuity and field coherence, especially under conditions of low inter-class variance or high structural density [24,25].

To address these limitations, we propose UniHSFormer-X, a unified transformer framework specifically tailored for hyperspectral crop classification under real-world agricultural constraints. Unlike traditional models that treat spectral features uniformly or apply static spatial kernels, UniHSFormer-X introduces a prototype-aligned semantic routing mechanism that dynamically guides token propagation based on class-informed cues [26,27]. This allows the model to emphasize semantically meaningful structures while suppressing redundant or noisy activations [28]. By aligning spectral–spatial tokens with learnable semantic anchors (prototypes), the model captures fine-grained agricultural semantics, especially in morphologically irregular or spectrally ambiguous regions. Empirical results across three benchmark scenes demonstrate that this routing strategy not only improves boundary sharpness and inter-class separability but also enhances performance on challenging subclasses with dense spatial entanglement. We attribute these gains to the model’s architectural inductive bias: class-conditioned routing, prototype alignment, and multi-scale encoding collectively promote structural generalization across heterogeneous datasets [29,30]. Through this design, UniHSFormer-X bridges the gap between global abstraction and local structure, offering a robust, interpretable solution to long-standing challenges in hyperspectral agricultural segmentation.

While UniHSFormer-X employs a single prototype per class to anchor semantic alignment, we acknowledge that this inductive bias may be suboptimal for classes with high intra-class variability. To mitigate this, our routing mechanism allows token–prototype interactions to adapt dynamically based on class-conditioned cues during training, thus preserving flexibility while maintaining interpretability. This also opens avenues for future extension toward multi-prototype or instance-aware routing schemes.

2. Related Works

Hyperspectral imaging (HSI) has become increasingly pivotal in agricultural monitoring due to its capacity to capture detailed spectral characteristics across narrow and contiguous bands. This enables fine-grained crop classification, particularly in environments with subtle spectral variability and spatial complexity. Yet, real-world agricultural fields present a number of challenges—such as class adjacency, morphological irregularity, and background interference (e.g., shadows, plastic, bare soil)—that demand not only precise spectral discrimination but also structural resilience across spatial scales [31,32]. The transition from shallow statistical models to deep learning frameworks marks a fundamental shift in how such complexity is managed. Early deep models focused predominantly on convolutional neural networks (CNNs), leveraging their capacity to extract spatial context and local spectral patterns. Traditional machine learning classifiers, such as SVM and SAM, have also shown promising performance in hyperspectral vegetation mapping tasks—for instance, Borana et al. achieved over 81% accuracy in arid vegetation species classification using SVM on high-resolution field hyperspectral data [33]. Approaches such as 1D-CNN and 2D-CNN emphasized either spectral vectorization or spatial neighborhood encoding, but their separation limited comprehensive feature learning [34,35,36]. The introduction of 3D-CNNs bridged this gap by applying spectral–spatial convolutions over data cubes, significantly improving joint feature extraction [37,38]. Nonetheless, despite their dense connectivity, these models typically operate within constrained receptive fields and fixed kernel structures, which can be insufficient in modeling remote dependencies and long-range contextual cues—features often critical in heterogeneous croplands where spectrally similar classes are interleaved across irregular field boundaries.

To overcome these limitations, the transformer architecture, originally devised for sequence modeling in language tasks, has been adapted for remote sensing applications [39,40]. Its self-attention mechanism enables dynamic weighting of feature contributions across entire sequences, facilitating the learning of global dependencies within high-dimensional hyperspectral cubes. In agricultural contexts, this global capacity is especially valuable [15]: it allows the model to suppress local spectral noise and integrate class-consistent information across disjoint planting zones, where identical crop types may appear under different lighting, soil, or occlusion conditions. Initial attempts in this direction relied on direct application of vision transformers (ViT), which proved adept at capturing global semantics but often lacked inductive biases for spatial structure, leading to suboptimal performance in scenarios requiring fine boundary delineation [41,42]. Subsequent refinements, including spectral tokenization and spatial–spectral fusion, aimed to better align transformer modules with the structural characteristics of HSI data, particularly where agricultural boundaries are implicit or ill-defined [43].

Hybrid models emerged as a response to the complementary strengths of CNNs and transformers [44,45]. These models seek to embed the locality-preserving advantages of convolutional modules within the globally aware context of attention mechanisms. Notably, CMTNet and CTDBNet adopt dual-branch architectures, where convolutional layers capture shallow spatial details while transformer encoders extract long-range spectral dependencies [46,47]. Similarly, MST-SSSNet introduces separable convolutional streams coupled with transformer layers to sequentially refine spectral–spatial embeddings [18]. Such architectures demonstrate improved robustness to class imbalance and inter-class spectral similarity, yet often suffer from rigid fusion schemes and token redundancy during inference—limitations that become particularly pronounced in high-density horticultural fields [46,47]. More recent advances emphasize semantic disentanglement, frequency-aware decomposition, and multiscale encoding [48,49]. For instance, dual-frequency transformers, octave convolution networks, and frequency-domain filters have been explored to selectively enhance discriminative patterns while suppressing noise [50]. Although these techniques contribute to improved representation diversity, their reliance on fixed decomposition strategies can limit adaptability across different crop types, maturity stages, or resolution scales. Moreover, many existing methods still employ static projection heads or uniform supervision strategies, which fail to account for the semantic hierarchy inherent in agricultural land cover—where crops may differ more by spatial arrangement than by pure reflectance [51].

These observations highlight a persistent gap in achieving scale-invariant, class-adaptive, and semantically unified representation learning for agriculture [52,53]. Addressing this challenge necessitates a framework that integrates not only flexible attention and spectral–spatial interaction but also dynamic token selection and prototype-guided alignment—while explicitly reflecting the hierarchical, irregular, and high-ambiguity nature of agricultural imagery. It is in this context that UniHSFormer-X is introduced—a unified architecture designed to route informative tokens via learned prototype attention, reinforce discriminability through contrastive semantic projection, and adaptively balance local-global fusion across multi-resolution crop scenes.

3. Materials and Methods

3.1. Benchmark Datasets

To comprehensively evaluate the proposed UniHSFormer-X framework in the context of real-world agricultural scenarios, we adopt the publicly available WHU-Hi benchmark, a UAV-borne hyperspectral image collection curated by the RSIDEA group at Wuhan University. The dataset comprises three distinct scenes—WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu—captured over representative farmland regions in central China [54,55]. These scenes differ significantly in terms of spatial layout, crop diversity, and environmental complexity, offering a robust testbed for validating cross-scale generalization and fine-grained class discrimination, both of which are central to modern hyperspectral crop classification. All data were acquired using Headwall Nano-Hyperspec sensors onboard UAV platforms, covering 270+ contiguous spectral bands in the 400–1000 nm range, with spatial resolutions ranging from 0.043 m to 0.463 m. The combination of high spectral granularity and ultra-fine spatial detail enables pixel-level modeling of crop structure, physiological variation, and inter-class spectral overlap—challenges that UniHSFormer-X is explicitly designed to address via semantic routing and prototype-guided learning (Figure 1).

WHU-Hi-LongKou depicts a relatively homogeneous agricultural layout, with six dominant crop categories (corn, cotton, sesame, broad-leaf soybean, narrow-leaf soybean, and rice) and a few background classes. The image covers 550 × 400 pixels at a spatial resolution of 0.463 m and was acquired under clear midsummer illumination. As a baseline scene, it enables controlled evaluation of spectral–spatial encoding strategies. WHU-Hi-HanChuan corresponds to a rural-urban fringe region, featuring mixed vegetation, artificial surfaces, and complex shadow interference due to lower solar elevation. This scene comprises 16 annotated land-cover types and offers a higher spatial resolution of ~0.109 m across 274 bands. Its structural heterogeneity and spectral ambiguity provide a compelling test case for UniHSFormer-X’s token routing mechanism.

WHU-Hi-HongHu is the most complex of the three, consisting of 22 crop classes, including multiple cultivars of cabbage, lettuce, and brassica, often exhibiting minimal spectral separation. Captured at a resolution of 0.043 m and spanning 940 × 475 pixels, this scene poses considerable challenges in terms of intra-class variability and shadow normalization, making it ideal for testing the proposed model’s fine-grained semantic alignment capabilities. All datasets underwent radiometric calibration and geometric correction using Headwall’s proprietary preprocessing pipeline. Ground-truth annotations were manually digitized with field validation by domain experts. Table 1 summarizes the class-wise label distributions used in our 100-sample-per-class training protocol.

3.2. Proposed Architecture

Despite the remarkable progress in hyperspectral crop classification, most existing frameworks still face challenges in balancing spectral fidelity, spatial context, and semantic alignment. Shallow CNN-based encoders often overlook spectral locality, while purely transformer-based methods may suffer from insufficient class-level structure without explicit inductive guidance. Moreover, token redundancy and inconsistent routing across layers hinder efficiency and generalizability. To address these gaps, we propose UniHSFormer-X, a unified architecture that integrates spectral–spatial tokenization, prototype-guided semantic routing, hierarchical transformer encoding, and multi-objective optimization. As illustrated in Figure 2, the framework is designed to establish class-aware information flow across the full pipeline, enabling robust, interpretable, and scalable learning from high-dimensional hyperspectral data.

A. Spectral–Spatial Tokenization with Multi-Resolution Patching

Hyperspectral imagery (HSI) is typically modeled as a tensor $X \in \mathbb{R}^{H \times W \times B}$, where $H$ and $W$ denote the spatial dimensions and $B$ is the number of spectral bands, typically exceeding 270. Each pixel $X_{h,w,:} \in \mathbb{R}^{B}$ contains a fine-grained reflectance vector but lacks spatial context when treated independently.

To jointly encode spectral and spatial information, we divide the HSI into non-overlapping patches $P_{i} \in \mathbb{R}^{p \times p \times B}$, forming the following:

$$X = \bigcup_{i=1}^{N} P_{i}, \quad N = \frac{H \cdot W}{p^{2}} \tag{1}$$

Each patch is embedded into a token $t_{i} \in \mathbb{R}^{d}$ via the following:

$$t_{i} = W \cdot \mathrm{vec}(P_{i}) + b \tag{2}$$

where, in this equation only, $W$ and $b$ denote the projection matrix and bias of the embedding layer rather than the image width and a band index.

Optionally, a spectral attention kernel $\gamma \in \mathbb{R}^{B}$ is used to reweight band contributions before embedding as follows:

$$\tilde{P}_{i}(h,w,b) = \frac{\exp(\gamma_{b} \cdot P_{i}(h,w,b))}{\sum_{b'=1}^{B} \exp(\gamma_{b'} \cdot P_{i}(h,w,b'))} \tag{3}$$

We add relative positional encoding $e_{i}$ to yield final tokens $z_{i} = t_{i} + e_{i}$, forming the input sequence $\{z_{i}\}_{i=1}^{N}$.
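As a minimal sketch of Eqs. (1) and (2), the patching and linear embedding steps can be written in NumPy as below. The sizes, the random weights, and the names `patchify`, `embed`, `W_e`, and `b_e` are illustrative assumptions, and the additive term `e` is only a random placeholder for the relative positional encoding, whose exact form is not specified here.

```python
import numpy as np

def patchify(X, p):
    """Split an (H, W, B) cube into N = H*W / p^2 non-overlapping p x p x B patches (Eq. 1)."""
    H, W, B = X.shape
    assert H % p == 0 and W % p == 0, "sketch assumes H and W are multiples of p"
    # (H/p, p, W/p, p, B) -> (H/p, W/p, p, p, B) -> (N, p, p, B)
    blocks = X.reshape(H // p, p, W // p, p, B).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(-1, p, p, B)

def embed(patches, W_e, b_e):
    """Linear embedding t_i = W vec(P_i) + b for all patches at once (Eq. 2)."""
    flat = patches.reshape(patches.shape[0], -1)   # vec(P_i), shape (N, p*p*B)
    return flat @ W_e.T + b_e                      # shape (N, d)

rng = np.random.default_rng(0)
H, W, B, p, d = 8, 12, 5, 4, 16                    # toy sizes, not the paper's settings
X = rng.random((H, W, B))
patches = patchify(X, p)
W_e = 0.01 * rng.standard_normal((d, p * p * B))   # hypothetical embedding weights
b_e = np.zeros(d)
e = 0.01 * rng.standard_normal((patches.shape[0], d))  # placeholder positional term e_i
z = embed(patches, W_e, b_e) + e                   # tokens z_i = t_i + e_i
print(patches.shape, z.shape)
```

The reshape and transpose pair keeps each patch spatially contiguous, so patch index `i` maps back to a fixed block of the original cube.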

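To standardize physical coverage, patch size $p$ is adapted to the spatial resolution $r$ so that the physical footprint $p \cdot r$ is as close as possible to a reference length $L$:

$$p = \arg\min_{p'} \left| \frac{p' \cdot r}{L} - 1 \right| \tag{4}$$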
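The selection rule in Eq. (4) can be illustrated with a few lines of Python. The resolutions are those reported in Section 3.1, while the reference length `L = 1.0` m and the candidate patch sizes are assumptions chosen only for this example.

```python
def select_patch_size(r, L, candidates):
    """Return the candidate p' that minimises |p' * r / L - 1| (Eq. 4)."""
    return min(candidates, key=lambda p: abs(p * r / L - 1.0))

# Candidate sizes and L are illustrative assumptions, not values from the paper.
candidates = [1, 2, 3, 5, 7, 9, 11, 15, 23, 31]
resolutions = {"LongKou": 0.463, "HanChuan": 0.109, "HongHu": 0.043}
chosen = {name: select_patch_size(r, 1.0, candidates) for name, r in resolutions.items()}
print(chosen)
```

Under these assumptions, finer-resolution scenes receive larger patches, so that each token covers a comparable ground area.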

Comparing the hypothesis class of CNNs with that of token-based attention,

$$\mathcal{H}_{\mathrm{CNN}} = \left\{ h(X) = \sum_{k} \sigma(w_{k} * X + b_{k}) \right\} \quad \text{vs.} \quad \mathcal{H}_{\mathrm{Token}} = \left\{ h(X) = \mathrm{MHSA}(\{z_{i}\}) \right\} \tag{5}$$

we have $\mathcal{H}_{\mathrm{CNN}} \subsetneq \mathcal{H}_{\mathrm{Token}}$, which indicates broader representational capacity. This module provides semantically consistent input tokens for downstream modeling, preserving spectral integrity and spatial alignment. The full tokenization process is illustrated in Figure 3, showing how spectral–spatial patches are converted into learnable tokens compatible with transformer-based modeling. These spectral–spatial tokens form the raw feature basis for semantic structuring via prototype-driven routing.

B. Prototype-Guided Semantic Routing Mechanism

To impose semantic regularity on these tokens and guide their downstream processing, we introduce a set of learnable class prototypes $\{P_{k}\}_{k=1}^{K}$, each representing a spectral–spatial class centroid in $\mathbb{R}^{d}$. For each token $z_{i}$, its similarity to each prototype is computed via scaled dot-product attention as follows:

$$a_{ik} = \frac{\exp(z_{i}^{\top} P_{k} / \tau)}{\sum_{k'=1}^{K} \exp(z_{i}^{\top} P_{k'} / \tau)} \tag{6}$$

These weights serve the following two roles: (1) soft assignment of tokens to semantic classes, and (2) routing gate controlling participation in deeper encoding. High-confidence tokens are retained, while uncertain ones may shortcut or be downweighted. The semantic routing process is illustrated in Figure 4. Each token $z_{i}$ computes attention scores to all prototypes $P_{k}$, forming soft assignments via a scaled dot-product mechanism. High-confidence tokens contribute to prototype refinement, while low-confidence tokens may be bypassed. The prototypes are updated using weighted residual aggregation, enabling class-level alignment across layers. To enable prototype adaptation, we apply a gated residual update as follows:

$$P_{k}^{(l+1)} = \mathrm{LayerNorm}\left(P_{k}^{(l)} + \sum_{i} a_{ik}^{(l)} \cdot z_{i}^{(l)}\right) \tag{7}$$

This allows prototypes to evolve toward their semantic centers across layers. Semantic routing constrains token dynamics, reduces distraction from background pixels, and provides inductive guidance for class-aware representation learning. It also lays the groundwork for downstream attention integration in the transformer backbone.
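A minimal NumPy sketch of the routing weights in Eq. (6) and the prototype update in Eq. (7) is given below. It follows the equations as written. The toy shapes, the value of `tau`, and the LayerNorm without learnable scale or shift are assumptions made for illustration.

```python
import numpy as np

def routing_weights(Z, P, tau=1.0):
    """Eq. (6): softmax over prototypes of z_i^T P_k / tau. Z: (N, d), P: (K, d)."""
    logits = Z @ P.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(logits)
    return A / A.sum(axis=1, keepdims=True)                # rows sum to 1

def layer_norm(x, eps=1e-5):
    """LayerNorm over the feature axis (no learnable scale/shift in this sketch)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def update_prototypes(Z, P, A):
    """Eq. (7): P_k <- LayerNorm(P_k + sum_i a_ik z_i)."""
    return layer_norm(P + A.T @ Z)

rng = np.random.default_rng(0)
Z = rng.standard_normal((10, 8))      # 10 toy tokens of width d = 8
P = rng.standard_normal((3, 8))       # K = 3 toy prototypes
A = routing_weights(Z, P, tau=0.5)
P_next = update_prototypes(Z, P, A)
print(A.shape, P_next.shape)
```

Each row of `A` is a soft class assignment for one token, and each column weights that token's contribution to one prototype.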

C. Hierarchical Transformer with Prototype-Aware Attention

After routing, the selected tokens undergo layered refinement within a hierarchical transformer, where semantic alignment and contextual reasoning co-evolve. To enhance the representation capacity of routed tokens, we adopt a multi-stage transformer encoder, where each stage performs multi-head self-attention (MHSA) followed by feedforward layers. Tokens interact not only with other tokens but also with semantic prototypes, forming a joint attention space as follows:

$$Z^{(l+1)} = \mathrm{MHSA}\left([Z^{(l)}; P^{(l)}]\right) + \mathrm{FFN}(\cdot) \tag{8}$$

This allows token updates to incorporate both spatial–spectral context and class-level semantics. After each stage, spatial resolution is reduced via token merging (e.g., 2 × 2 pooling), while the full spectral dimension is retained. Prototypes are refined layer-wise through gated aggregation of token features as follows:

$$P_{k}^{(l+1)} = \mathrm{LayerNorm}\left(P_{k}^{(l)} + \sum_{i} a_{ik}^{(l)} \cdot z_{i}^{(l)}\right) \tag{9}$$

This bidirectional flow enables the encoder to maintain semantic consistency across depth. The hierarchical structure enlarges receptive fields, reduces computational complexity, and regularizes learning by reinforcing class-aware attention. In effect, the transformer backbone becomes both a spectral–spatial encoder and a semantic alignment mechanism.
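The joint attention space and the token merging step can be sketched as follows. This is a simplified single-head version under stated assumptions: the projection matrices `Wq`, `Wk`, `Wv` are random placeholders, the feedforward term of Eq. (8) and any residual connections are omitted, and 2 × 2 merging is implemented as average pooling over a regular token grid that keeps the feature width unchanged.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(Z, P, Wq, Wk, Wv):
    """Single-head self-attention over [Z; P]; returns the updated token rows only."""
    S = np.concatenate([Z, P], axis=0)                 # (N + K, d)
    Q, K_, V = S @ Wq, S @ Wk, S @ Wv
    attn = softmax(Q @ K_.T / np.sqrt(Q.shape[-1]))    # (N + K, N + K)
    return (attn @ V)[: Z.shape[0]]                    # (N, d)

def merge_tokens(Z, h, w):
    """2 x 2 average pooling on an (h*w, d) row-major token grid; d is kept."""
    d = Z.shape[1]
    grid = Z.reshape(h // 2, 2, w // 2, 2, d)
    return grid.mean(axis=(1, 3)).reshape(-1, d)

rng = np.random.default_rng(0)
h, w, d, K = 4, 4, 8, 3                                # toy grid and widths
Z = rng.standard_normal((h * w, d))
P = rng.standard_normal((K, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Z_att = joint_attention(Z, P, Wq, Wk, Wv)
Z_merged = merge_tokens(Z_att, h, w)
print(Z_att.shape, Z_merged.shape)
```

Because prototypes sit in the same sequence as tokens, every token can attend to class-level anchors as well as to its spatial neighbours.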

D. Unified Semantic Projection and Multi-Objective Optimization

Following hierarchical refinement, each token $z_{i}$ encodes class-relevant spectral–spatial information aligned with learned prototypes. To finalize classification, we map token features into a unified semantic space where similarity to class prototypes directly informs the prediction.

Let $\phi : \mathbb{R}^{d} \to \mathbb{R}^{q}$ be the projection head. The embedded token is transformed as $h_{i} = \phi(z_{i})$ and compared with projected class prototypes $\phi(P_{k})$ to compute logits. Classification is supervised via cross-entropy loss as follows:

$$L_{cls} = -\sum_{i} \log \frac{\exp\left(\phi(z_{i})^{\top} \phi(P_{c_{i}})\right)}{\sum_{k} \exp\left(\phi(z_{i})^{\top} \phi(P_{k})\right)} \tag{10}$$

To further enhance feature discriminability, we impose a prototype-based contrastive loss as follows:

$$L_{proto} = -\log \frac{\exp\left(\cos(h_{i}, P_{c_{i}}) / \beta\right)}{\sum_{k=1}^{K} \exp\left(\cos(h_{i}, P_{k}) / \beta\right)} \tag{11}$$

where $\cos(h_{i}, P_{c_{i}})$ denotes cosine similarity and $\beta$ is a temperature factor. This encourages same-class tokens to concentrate around their respective prototype and repels them from others. To ensure semantic routing consistency, we introduce an alignment loss between the prototype attention weights $a_{ik}$ and the ground-truth label $c_{i}$ as follows:

$$L_{align} = -\log a_{i c_{i}} \tag{12}$$

We also apply a diversity regularizer to encourage prototype separation as follows:

$$L_{div} = \sum_{j \neq k} \cos(P_{j}, P_{k}) \tag{13}$$

The total training objective is a weighted combination:

$$L_{total} = \lambda_{1} L_{cls} + \lambda_{2} L_{proto} + \lambda_{3} L_{align} + \lambda_{4} L_{div} \tag{14}$$
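The four terms of Eq. (14) can be sketched in NumPy as follows. Several choices here are assumptions rather than details from the text: the inputs `Hq` and `Pq` are taken to be already-projected tokens and prototypes (so the cosine terms compare vectors of equal width), the per-token losses of Eqs. (11) and (12) are averaged over tokens, and the weights `lam` and temperature `beta` are placeholder values.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def cosine(a, b):
    """Pairwise cosine similarity between rows of a (N, q) and rows of b (K, q)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def losses(Hq, Pq, A, labels, beta=0.1, lam=(1.0, 1.0, 1.0, 1.0)):
    idx = np.arange(len(labels))
    l_cls = -log_softmax(Hq @ Pq.T)[idx, labels].sum()                   # Eq. (10)
    l_proto = -log_softmax(cosine(Hq, Pq) / beta)[idx, labels].mean()    # Eq. (11), token mean
    l_align = -np.log(A[idx, labels]).mean()                             # Eq. (12), token mean
    C = cosine(Pq, Pq)
    l_div = C.sum() - np.trace(C)                                        # Eq. (13), j != k
    total = lam[0] * l_cls + lam[1] * l_proto + lam[2] * l_align + lam[3] * l_div
    return {"cls": l_cls, "proto": l_proto, "align": l_align, "div": l_div, "total": total}

# Toy case: three tokens lying exactly on three orthogonal prototypes,
# with uniform (uninformative) routing weights.
Hq = np.eye(3)
Pq = np.eye(3)
A = np.full((3, 3), 1.0 / 3.0)
labels = np.array([0, 1, 2])
out = losses(Hq, Pq, A, labels)
print(out)
```

In this toy case the diversity term vanishes because the prototypes are orthogonal, while the alignment term equals log 3 because the routing weights carry no class information.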

During inference, we exploit routing scores $a_{ik}$ as semantic confidence indicators. Tokens with maximum affinity exceeding a threshold $\delta$ are retained for final prediction; others may be pruned to accelerate computation. This final module unifies the model’s predictions under a semantically interpretable embedding, while contrastive and routing-aligned objectives reinforce the geometry and robustness of the feature space. Together, they ensure that UniHSFormer-X produces accurate, explainable, and generalizable outputs for hyperspectral crop classification.

4. Experiment and Analysis

4.1. Experimental Setup

To validate the effectiveness of UniHSFormer-X across varying spectral resolutions and agricultural structures, we conduct extensive experiments on the WHU-Hi benchmark. This benchmark comprises three UAV-based hyperspectral scenes—LongKou, HanChuan, and HongHu—each characterized by distinct crop types, acquisition resolutions, and label granularity. These heterogeneous conditions provide a challenging and diverse testbed for evaluating both discriminative capacity and generalizability. Following the Train100 protocol, 100 pixels per class are randomly selected for training, while all remaining labeled pixels are used exclusively for testing. This strict, class-balanced sampling ensures that the evaluation focuses on semantic learning rather than data volume. To improve robustness under local appearance shifts, standard augmentations including random flips and rotations are applied.
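A class-balanced split in the spirit of the Train100 protocol can be sketched as follows. Treating label `0` as unlabeled background and the fixed seed are assumptions made for this example, and the function name `train100_split` is hypothetical.

```python
import random
from collections import defaultdict

def train100_split(labels, per_class=100, ignore_label=0, seed=0):
    """Pick `per_class` random pixels per class for training; the rest go to testing.

    labels: flat sequence of integer class ids; `ignore_label` marks unlabeled pixels.
    A class with fewer than `per_class` pixels ends up entirely in the training set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        if c != ignore_label:
            by_class[c].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        train.extend(idxs[:per_class])
        test.extend(idxs[per_class:])
    return sorted(train), sorted(test)

# Toy label map: 50 unlabeled pixels, 300 of class 1, 150 of class 2.
labels = [0] * 50 + [1] * 300 + [2] * 150
train, test = train100_split(labels)
print(len(train), len(test))
```

Sampling per class rather than over the whole label map keeps small classes represented in training regardless of how few pixels they occupy.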

The UniHSFormer-X model is implemented in PyTorch and trained on a single NVIDIA RTX 4090 GPU (NVIDIA Corporation, headquartered in Santa Clara, CA, USA). Patch sizes are adaptively selected to maintain spatial comparability across datasets of different resolutions, ensuring that each patch encodes a consistent physical footprint. Unlike conventional CNN pipelines, our spectral–spatial tokenization module preserves the full spectral fidelity of each patch and transforms it into a semantically enriched token set via lightweight 3D projections. The number of semantic prototypes is matched to the number of valid classes and remains fixed during training.

We design our optimization schedule to reflect the staged information flow of the network. In early epochs, lower routing thresholds and soft gate margins encourage broader token participation, allowing the model to stabilize its prototype attention patterns. As training proceeds, the gate hardness is gradually increased, effectively pruning noisy or low-contribution tokens and promoting high-confidence semantic routing. The model is trained for 300 epochs using the AdamW optimizer with a warm-start schedule and implicit annealing. Token dropout and token-prototype alignment regularization are applied throughout to enhance stability. Training supervision combines three main loss components: categorical cross-entropy for basic class discrimination (Equation (10)), a prototype-level contrastive loss that introduces semantic margin among classes (Equation (11)), and a routing-alignment loss that encourages consistency between prototype attention scores and ground-truth labels (Equation (12)). The diversity regularizer of Equation (13) enters the total objective as in Equation (14). These losses jointly drive the semantic calibration of both token and prototype representations.

To assess performance, we adopt Overall Accuracy (OA), Average Accuracy (AA), per-class accuracy, and the Kappa Coefficient (κ). We further report Kappa@100 (K@100), a variant that isolates agreement under fixed-sample constraints and is particularly sensitive to small-class misalignment. Inference efficiency and token retention ratios are also measured to quantify the impact of routing and pruning under real-time constraints. The formal definitions of these metrics are given as follows:

Overall Accuracy (OA) evaluates the proportion of correctly classified pixels across the entire dataset:

$$\mathrm{OA} = \frac{\sum_{i=1}^{n} x_{ii}}{N} \tag{15}$$

where $x_{ii}$ denotes the number of pixels correctly predicted as class $i$, $n$ is the number of classes, and $N$ is the total number of labeled pixels. Average Accuracy (AA) reflects the mean class-wise recall, emphasizing balance across categories:

$$\mathrm{AA} = \frac{1}{n} \sum_{i=1}^{n} \frac{x_{ii}}{\sum_{j=1}^{n} x_{ij}} \tag{16}$$

Kappa Coefficient (κ) measures agreement between predicted and ground-truth labels, normalized by expected agreement by chance:

$$\kappa = \frac{N \sum_{i=1}^{n} x_{ii} - \sum_{i=1}^{n} (x_{i+} \cdot x_{+i})}{N^{2} - \sum_{i=1}^{n} (x_{i+} \cdot x_{+i})} \tag{17}$$

where $x_{i+}$ and $x_{+i}$ are the row and column sums of the confusion matrix for class $i$.

Kappa@100 (K@100) computes the same κ metric under the Train100 constraint, i.e., using only 100 labeled pixels per class for training, to reflect model agreement under minimal supervision. We compare UniHSFormer-X against a comprehensive set of baselines, spanning classical classifiers (RF, SVM), spectral–spatial CNNs (3D-CNN, ResNet), vision transformer variants (ViT, SSFTT), and state-of-the-art hybrid architectures (CTMixer, CTDBNet). All models are re-trained using identical Train100 splits and evaluation metrics for fair benchmarking.
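The metric definitions in Equations (15) to (17) can be checked with a small pure-Python sketch. The confusion matrix below is a made-up two-class example, with rows taken as ground-truth classes and columns as predictions.

```python
def classification_metrics(cm):
    """Return (OA, AA, kappa) from a square confusion matrix cm[i][j].

    Rows are ground-truth classes, columns are predicted classes.
    """
    n = len(cm)
    N = sum(sum(row) for row in cm)
    diag = sum(cm[i][i] for i in range(n))
    row_sum = [sum(cm[i]) for i in range(n)]
    col_sum = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    oa = diag / N                                              # Eq. (15)
    aa = sum(cm[i][i] / row_sum[i] for i in range(n)) / n      # Eq. (16)
    chance = sum(row_sum[i] * col_sum[i] for i in range(n))
    kappa = (N * diag - chance) / (N * N - chance)             # Eq. (17)
    return oa, aa, kappa

cm = [[50, 0],
      [10, 40]]
oa, aa, kappa = classification_metrics(cm)
print(oa, aa, kappa)
```

For this example OA and AA are both 0.9 while κ is 0.8, which shows how κ discounts the agreement that would be expected by chance.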

For classical machine learning baselines such as SVM and RF, input features were constructed as raw per-pixel spectral vectors, comprising reflectance values across all available hyperspectral bands. No spatial features or handcrafted descriptors were used, and no feature fusion or augmentation was applied. This ensures consistency with canonical use of these models in hyperspectral classification, and it allows for direct performance comparison with spectral–spatial deep learning models. All models were optimized under consistent strategies and bounded hyperparameter ranges, ensuring fairness without exhaustive tuning. Core settings such as optimizer, batch size, and training epochs are described in later sections alongside model complexity metrics.

4.2. Model Performance Across Different Agricultural Scenarios

Across varied agricultural terrains, the interplay between spatial composition, class diversity, and spectral ambiguity generates markedly different pressures on classification systems. As scene structure shifts from monoculture regularity to horticultural density, the assumptions embedded in traditional spatial–spectral modeling begin to erode. Architectures optimized for uniformity may find themselves mismatched with fragmentation; models trained to emphasize global coherence may overlook small, semantically critical fluctuations. In such settings, the ability to modulate attention—not merely expand it—becomes central. The question is no longer whether global or local modeling prevails, but how representation is routed, pruned, or withheld in response to complexity that resists uniform encoding.

The WHU-Hi-LongKou dataset offers a clean and structured agricultural landscape, composed of well-separated crop bands and minimal background interference. This setting allows for a focused evaluation of models’ ability to preserve field geometry, distinguish spectrally similar crops, and maintain classification stability across large homogeneous zones and sparse minor classes. The outputs shown in Figure 5 reveal the segmentation behaviors across nine representative models. At first glance, differences in spatial coherence and class boundary precision are apparent. Traditional methods such as RF and SVM (Figure 5a,b) generate overly fragmented predictions, particularly within narrow-leaf soybean (C5), mixed weed (C9), and transitional zones between roads and crops. This outcome is expected, given their reliance on isolated spectral signatures and lack of spatial encoding. CNN-based architectures (Figure 5c,d) introduce substantial improvements in intra-field continuity. The 3D-CNN, benefiting from joint spectral–spatial convolution, achieves stable segmentation within dominant crops like corn (C1) and sesame (C3), though its performance remains uneven on low-sample or morphologically irregular classes. ResNet sharpens field edges but shows inconsistencies near class boundaries—especially where texture and structure subtly vary, as seen in the C2–C5 transitions.

Vision Transformer models (ViT, SSFTT; Figure 5e,f) further enhance inter-class separation by capturing long-range dependencies. ViT suppresses small-scale noise but occasionally oversmooths fine-grained boundaries, especially in the presence of spectral redundancy. SSFTT, which augments attention with spectral–spatial fusion modules, performs more reliably across both major and minor classes, particularly in structurally ambiguous regions such as roads (C8) and weed (C9), where context-based reasoning proves beneficial. Hybrid models (Figure 5g,h) such as CTMixer and CTDBNet display even more stable behavior, especially in spatially intricate regions. CTMixer balances local and global features through depth-wise mixing, whereas CTDBNet introduces multi-depth supervision to reinforce semantic consistency. Both models maintain strong coherence across broad crop zones, though occasional fluctuations in small, spectrally entangled classes still emerge—suggesting limits in static fusion and token uniformity.

Against this backdrop, the predictions shown in Figure 5i offer a subtly different impression. Without explicit emphasis, one notices that class boundaries are tightly aligned to field edges, that transitional zones show minimal confusion, and that small, spectrally ambiguous categories—such as narrow-leaf soybean and mixed weed—appear unusually well resolved. These visual outcomes are mirrored in the quantitative results of Table 2, where the final model achieves 99.80% OA, 99.28% AA, and 99.49 Kappa—all metrics surpassing competing baselines, yet without dramatic deviation from the upper-tier performers. The underlying design may offer some insight into this behavior. Unlike other models that treat all tokens or patches uniformly, UniHSFormer-X introduces a semantic routing mechanism that gates feature propagation through learned class prototypes. This has the effect of selectively amplifying tokens that carry structurally or semantically relevant information, while attenuating background noise or class-ambiguous cues. When combined with a hierarchical transformer backbone that preserves spatial resolution while refining context layer-by-layer, the model seems to naturally align its attention with field structure and class boundaries, even in challenging regions.

Interestingly, the model does not outperform its competitors through brute-force overfitting or spatial over-smoothing. On the contrary, its strength lies in balanced precision across the class spectrum—as evidenced by the uniformly high per-class accuracies, including in traditionally unstable categories such as C2 (cotton, 99.93%), C5 (narrow-leaf soybean, 99.40%), and C9 (mixed weed, 96.18%). This consistency suggests that the model’s improvements are less about peak values and more about structural robustness and semantic alignment.
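
The contrast drawn here between peak values and balance across classes depends on how OA, AA, and Kappa weight errors. As a minimal sketch, assuming the standard confusion-matrix definitions of these metrics (the function name and the toy matrix are illustrative and are not taken from the paper's code), the three scores can be computed as follows:

```python
def classification_metrics(cm):
    """Compute OA, AA and Cohen's kappa from a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n_classes))

    # Overall accuracy: fraction of all samples classified correctly.
    oa = correct / total

    # Average accuracy: mean per-class recall, so small classes count equally.
    recalls = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]
    aa = sum(recalls) / n_classes

    # Kappa: observed agreement corrected for agreement expected by chance.
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[i][j] for i in range(n_classes)) for j in range(n_classes)]
    pe = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa


# Toy two-class example with hypothetical counts, for illustration only.
toy_cm = [[50, 0],
          [10, 40]]
oa, aa, kappa = classification_metrics(toy_cm)
print(f"OA={oa:.3f}  AA={aa:.3f}  Kappa={kappa:.3f}")
```

Because AA averages per-class recall without regard to class size, a model that sacrifices a minor class loses AA much faster than OA, which is why the two are reported side by side.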

In contrast to the ordered agricultural geometry of LongKou, the WHU-Hi-HanChuan dataset features a spatial composition shaped by mixed-function landscapes—urban edges, orchards, vegetable plots, and various artificial surfaces—interlaced with cast shadows and irregular textures. With 16 annotated classes, including multiple legumes, synthetic coverings, and background structures, HanChuan presents a more demanding test for models aiming to balance spectral discrimination with structural generalization. Displayed in Figure 6 are the predicted maps from all models, exposing divergences in handling spatial complexity. While most models are able to maintain the overall layout of the scene, notable differences emerge in how they handle spatial irregularity, spectral redundancy, and class imbalance. These discrepancies underscore a broader observation: high performance in highly structured environments does not necessarily translate to consistency under spatial or material complexity.

Among the traditional methods, RF and SVM (Figure 6a,b) show difficulties in preserving semantic continuity. Their predictions fragment notably in mid-density classes such as soybean (C2), grass (C8), and plastic (C12), where similar spectral responses intersect with blurred boundaries. The absence of spatial reasoning mechanisms leads to scattered predictions even in prominent regions like strawberry fields (C1). CNN-based approaches (3D-CNN and ResNet; Figure 6c,d) improve upon this, especially in dominant crop regions. 3D-CNN effectively captures compact zones such as C1 and C10, supported by its spectral–spatial feature stacking. However, its outputs begin to soften near occluded or edge-blurred objects like watermelon (C6), while cowpea (C7) and tree crowns (C11) remain inconsistently segmented. ResNet shows improved balance on flat classes like bare soil and plastic but exhibits occasional class bleeding where vegetation interfaces are ambiguous. Transformer-based architectures (ViT and SSFTT; Figure 6e,f) demonstrate stronger contextual awareness, helping recover fragmented classes and sharpen semantic clusters. ViT’s self-attention offers a broader spatial lens but occasionally smooths over localized transitions—noticeable in interwoven classes such as narrow-leaf legumes (C5) and cabbage variants (C13). SSFTT, with its integrated spectral–spatial attention, mitigates this effect and shows more stable behavior in tree-grass and road-vegetation junctions. However, both models remain sensitive to token redundancy when local patterns lack contrast.

Hybrid models (CTMixer and CTDBNet; Figure 6g,h) aim to reconcile local detail with global abstraction. CTMixer, through depth-wise mixing, enhances class separation in synthetic materials (e.g., C12), while CTDBNet benefits from hierarchical supervision, improving minority-class reliability. Both handle fringe cases better than their predecessors, though they occasionally inherit the instability of either branch when confronting structurally ambiguous zones. By comparison, the map in Figure 6i offers a more controlled reconstruction of the scene. Tree outlines are retained with minimal bleed, road segments remain continuous across gaps, and marginal categories such as plastic and grass show fewer outliers. The difference is not radical in form but rather subtle in distribution: fewer misclassifications at boundaries, less class drift in occluded areas, and more regular transitions in spatially tangled zones. The model appears to maintain focus where structure is sparse, and restraint where tokens might otherwise overwhelm the representation.

As reported in Table 3, this behavioral distinction corresponds with measurable consistency across classes. With 97.50% OA, 94.36% AA, and 95.23 Kappa, the model avoids the typical trade-offs between dominant and fringe classes, performing robustly not only on high-frequency crops like C1 (99.84%) but also on structurally complex or low-sample categories such as plastic (C12: 91.03%) and watermelon (C6: 88.94%). This suggests a capacity for stable representation under both spatial disorder and spectral similarity. Rather than relying on depth or scale alone, the model’s architectural decisions—particularly in the use of class-informed token filtering and guided projection—may help mitigate the kind of over-aggregation often seen in standard attention blocks. Such mechanisms seem to discourage indiscriminate feature propagation and instead favor structured alignment, especially where material heterogeneity and occlusion complicate the classification landscape.

Compared with the structured field layout of LongKou and the spatial heterogeneity of HanChuan, WHU-Hi-HongHu introduces a new level of complexity. It is not merely a matter of spatial irregularity or spectral overlap—but of semantic granularity. The dataset comprises 22 annotated classes, many of which represent subtle botanical variants (e.g., different Brassica species, lettuce types, or hybrid vegetable rows) distributed within dense and spectrally entangled plots. In this context, models are challenged to demonstrate not only class-level discrimination, but also continuity in spatially narrow, interleaved planting regimes. Figure 7 reveals how different models respond to this heightened semantic density. For most models, structural segmentation deteriorates noticeably when transitioning from broadfield crops to fragmented horticultural zones. RF and SVM (Figure 7a,b) produce scattered outputs with substantial class bleeding, particularly in the lower half of the scene where visually similar crops are densely interleaved. These methods—lacking spatial regularization—are quickly overwhelmed by the subtle variation in reflectance and spatial adjacency.

CNN-based models (Figure 7c,d) attempt to enforce local coherence. 3D-CNN preserves the general contour of fields and suppresses pixel-level noise, reaching 85.14% OA and 82.12% AA, yet still suffers in micro-structured categories such as Class 13 and 22, where field boundaries are both spectrally and spatially ambiguous. ResNet improves local smoothness in some dominant classes but shows inconsistency in capturing fine spatial rhythms, likely due to its fixed-size receptive field. ViT (Figure 7e) introduces global receptive awareness and recovers large-area coherence, especially in monoculture zones. However, it appears less confident in discriminating visually close varieties—e.g., Classes 5 through 7, or Classes 12 through 15—frequently merging semantically distinct crops under similar spectral expressions. SSFTT corrects part of this tendency by integrating spatial information into the attention mechanism, showing improved edge discipline and better recovery of tree cover and bare soil. Still, both methods tend to blur boundaries in areas where class separability is not strictly driven by colorimetric cues.

CTMixer and CTDBNet (Figure 7g,h) continue the trend toward structured outputs, with clearer delineation of elongated field blocks and a marked reduction in speckle noise. CTMixer maintains clean segmentation for dominant crops, while CTDBNet, aided by multi-scale fusion, recovers thin classes and background boundaries. Nonetheless, even these models occasionally falter under extreme inter-class proximity, where neither pixel distance nor global context is sufficient to resolve the ambiguity. In this highly entangled context, the output in Figure 7i stands out for its restraint rather than overreach. The transitions are sharper but not exaggerated; class boundaries are aligned to planting structure without artificial smoothing; and confusion between subclasses of leafy vegetables is significantly suppressed. Notably, the model does not erase class boundaries in the name of regularity; it retains complexity where necessary and abstains from oversimplification.

This balance is mirrored in Table 4. UniHSFormer-X attains 98.42% OA, 97.06% AA, and a Kappa score of 98.02, leading across all metrics. More importantly, it demonstrates reliable per-class accuracy even in traditionally unstable classes: Class 5 (98.24%), Class 13 (97.86%), and Class 22 (98.58%)—all of which have historically challenged both CNNs and Transformers due to their visual similarity and narrow margins. The model’s architecture may help explain this stability. The semantic routing mechanism limits token redundancy while preserving task-relevant features—an advantage in settings where spatial density and spectral proximity co-occur. Meanwhile, hierarchical attention layers allow it to reconcile global field layout with class-local detail. But perhaps the more significant outcome lies in what is not observed: the absence of category collapse, the rarity of excessive smoothing, and the steady confidence across all tiers of class frequency.

To complement the region-wise analysis, we computed cross-region averages of OA, AA, and Kappa for each model, as presented in Table 5. This unified perspective enables a fairer assessment of generalization across heterogeneous scenes. UniHSFormer-X remains consistently top-ranked, achieving average OA, AA, and Kappa values of 98.57%, 96.90%, and 97.58%, respectively. The performance advantage also extends to minor classes and ambiguous regions, indicating a robustness that transcends specific landscapes. Such cross-region synthesis not only reinforces the model’s superiority but also provides context for interpreting localized inconsistencies observed in individual benchmarks.
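
The cross-region figures quoted for UniHSFormer-X can be checked directly from the per-dataset scores given in the preceding paragraphs. The sketch below assumes that the average is an unweighted mean over the three datasets, which is an assumption about how Table 5 was built; the variable and function names are illustrative.

```python
# Per-dataset scores reported in the text for UniHSFormer-X (Tables 2 to 4).
scores = {
    "LongKou":  {"OA": 99.80, "AA": 99.28, "Kappa": 99.49},
    "HanChuan": {"OA": 97.50, "AA": 94.36, "Kappa": 95.23},
    "HongHu":   {"OA": 98.42, "AA": 97.06, "Kappa": 98.02},
}


def cross_region_average(scores, metric):
    """Unweighted mean of one metric over all datasets (assumed averaging rule)."""
    values = [per_dataset[metric] for per_dataset in scores.values()]
    return sum(values) / len(values)


averages = {m: round(cross_region_average(scores, m), 2)
            for m in ("OA", "AA", "Kappa")}
print(averages)
```

Under this assumption the three means come out at 98.57, 96.90, and 97.58, matching the values stated above.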

What emerges is not always best captured in numerical precision. Some models resist local distortion but falter when confronted with near-class redundancy; others respond well to spatial heterogeneity yet overcommit in zones of weak semantic contrast. The most telling behaviors are not those that dominate a metric, but those that sustain structure without imposing artificial order: transitions are managed without erasure, and ambiguity is acknowledged but not propagated. Here, consistency becomes less about convergence and more about restraint, that is, about how selectively a model aligns features to meaning. Architectures that embed this restraint not as post-processing but as structural grammar may begin to demonstrate not just better results, but more appropriate errors.

4.3. Architectural Dissection and the Structural Grammar of Robustness

To disentangle the contribution of architectural components to model stability, we designed a series of controlled structure variants by selectively disabling key modules in UniHSFormer-X. Specifically, the following five core components were considered: (1) the semantic routing mechanism, which regulates token propagation through class-aware paths; (2) prototype projection, responsible for aligning features to learned semantic anchors; (3) multi-scale supervision, providing auxiliary gradient signals from different encoder depths; (4) the transformer backbone, enabling global spatial abstraction; and (5) encoder depth, controlling the vertical capacity for hierarchical representation. This structural dissection not only validates the resilience of UniHSFormer-X against modular perturbation but also reveals an underlying architectural grammar, wherein each component operates in a semantically coordinated role—balancing model complexity with robustness.

By systematically toggling these modules, we constructed eleven representative configurations (C1–C11), ranging from a fully disabled scaffold (C1) to the complete UniHSFormer-X model (C9). Single-module ablations (e.g., C2, C3, C4) isolate the effect of the core components (routing, projection, and supervision), while C5 and C6 evaluate the influence of the architectural backbone and encoder depth. Dual-removal configurations (C7 and C8) further examine how inter-module dependencies shape representational robustness. To extend this dissection, two additional combinations were included: C10 preserves supervision and backbone while removing both semantic routing and projection, revealing how loss of semantic anchoring affects residual convergence; C11, by contrast, disables routing and the Transformer backbone together, simulating a low-capacity and low-context regime. Each configuration was evaluated on three structurally diverse datasets (LongKou, HanChuan, and HongHu), with overall accuracy (OA) results reported in Table 6 and visualized in Figure 8.

The degradation trends are notably non-uniform. In C1, where all higher-level modules are absent, OA drops to 91.5% in LongKou, 78.5% in HanChuan, and 76.82% in HongHu, marking a collapse in spatial coherence and semantic alignment. Importantly, recovery from such collapse is nonlinear. For instance, activating projection alone (C3) elevates performance in LongKou to 99.0% but still leaves fragmentation in HongHu (94.32%), where categories are narrow, spectrally entangled, and spatially nested. Conversely, C2 enables routing but omits projection, yielding asymmetric gains—suggesting that directional flow without semantic anchoring struggles to stabilize higher-entropy scenes. Multi-scale supervision, removed in C4, shows moderate standalone impact, implying a regularization rather than foundational role.

Backbone-related ablations (C5 and C6) show that structural abstraction and vertical depth remain indispensable in complex scenarios. C5, without a Transformer backbone, suffers notable OA drops, particularly in HanChuan and HongHu, highlighting the central role of global attention in resolving spatial fragmentation. C6, with reduced encoder depth, exhibits subtler yet consistent degradation, especially in fine-scale boundary classes, revealing that shallow hierarchies limit vertical expressiveness. In C7 and C8, which remove encoder depth together with routing and projection, respectively, the performance collapse accelerates. These cases suggest that robustness is not merely additive but highly interaction-dependent: the absence of both directional semantics and deep representation in C7 causes compounding ambiguity, while the absence of projection in C8 destabilizes feature anchoring under shallow encoding.

C10 and C11 further validate these dependencies. C10 preserves global structure and multi-scale gradients but lacks any semantic scaffolding; its performance lags behind single-module ablations, implying that even strong backbone and supervision cannot fully compensate for missing semantic anchors. C11, lacking both routing and global abstraction, suffers the sharpest decline outside of C1, confirming the synergy between spatial semantics and context-aware modeling.
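
The ablation variants discussed above can be written down as sets of feature flags, which makes the single-, dual-, and full-removal cases easy to compare. The sketch below is illustrative only: the class and field names are hypothetical, C8 follows our reading of the text (projection removed under shallow encoding), and C2 and C3 are omitted because the text does not state their composition unambiguously.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical switchboard for the five components named in the text."""
    routing: bool = True        # (1) semantic routing mechanism
    projection: bool = True     # (2) prototype projection
    supervision: bool = True    # (3) multi-scale supervision
    transformer: bool = True    # (4) transformer backbone
    deep_encoder: bool = True   # (5) full encoder depth

    def disabled(self):
        """Names of the components switched off in this configuration."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


CONFIGS = {
    "C1": AblationConfig(False, False, False, False, False),  # bare scaffold
    "C4": AblationConfig(supervision=False),
    "C5": AblationConfig(transformer=False),
    "C6": AblationConfig(deep_encoder=False),
    "C7": AblationConfig(routing=False, deep_encoder=False),
    "C8": AblationConfig(projection=False, deep_encoder=False),  # our reading
    "C9": AblationConfig(),                                    # full model
    "C10": AblationConfig(routing=False, projection=False),
    "C11": AblationConfig(routing=False, transformer=False),
}

for name, cfg in CONFIGS.items():
    print(name, "disables:", cfg.disabled() or "nothing")
```

Keeping the flags in one immutable record means every variant is trained from the same code path, so accuracy differences can be attributed to the toggled components rather than to incidental implementation drift.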

As visualized in Figure 8, robustness in UniHSFormer-X emerges not from any single component but from their negotiated interdependence. Routing informs where features matter; projection aligns how they are represented; supervision regulates how fast they converge. Backbone and depth, in turn, determine how far and how deep that information propagates. Remove one, and the system bends; remove several, and it fractures—especially under complex visual entanglement. Generalization, then, is less about adding capacity than about distributing functional logic across architecture.

While overall accuracy provides a broad assessment of model performance, it often masks the uneven sensitivities that different categories exhibit under structural ablations. To uncover how architectural modules influence specific class behaviors, we performed a focused comparison across three representative datasets—LongKou, HanChuan, and HongHu—each featuring distinct landscape regularity, object density, and inter-class ambiguity. Four critical categories were selected from each dataset, reflecting small-sample instability, semantic overlap, structural fragmentation, and non-agricultural interference.

Table 7 summarizes the per-class accuracy across three structural variants: the full UniHSFormer-X model and two reduced versions (configurations C7 and C8) lacking combinations of semantic routing, prototype projection, and hierarchical encoder depth. As expected, the complete model achieves consistently high accuracy across all categories, serving as the structural baseline. In contrast, C7 and C8 introduce varying levels of performance degradation, but not uniformly so: each class responds differently depending on its semantic properties and structural context. In LongKou, narrow-leaf soybean (class C5) and mixed weed (class C9) show the most pronounced decline under structural removal. These classes are both morphologically irregular and spectrally ambiguous, relying heavily on the model’s ability to distinguish subtle spatial cues. The removal of routing and hierarchical refinement (configuration C7) reduces their precision by over 5%, while the absence of prototype projection (configuration C8) amplifies this decline further. Roads and houses (class C8), a background class often misclassified as crop margins, also exhibits sensitivity to attention degradation, though its structured layout still provides spatial anchors that mitigate complete collapse.

The HanChuan dataset presents a more fragmented environment, with categories such as grass (C9) and plastic (C12) displaying higher dependence on context-aware attention. Without hierarchical encoding, the model fails to consolidate long-range coherence, resulting in scattered predictions. Interestingly, cowpea (C2) and water spinach (C5), both mid-frequency crops, also exhibit notable accuracy drops, suggesting that in disordered scenes, even dominant crops require feature routing to resist background drift. In HongHu, where visual similarity between horticultural crops is particularly high, the distinction becomes even sharper. Brassica chinensis (C12) and Celtuce (C15), both nested within dense planting grids, show marked declines under configurations C7 and C8, with accuracy falling below 90% in the latter. The Tree class (C22), often located at plot boundaries, remains relatively stable, but even here, minor precision losses reveal how background structures benefit from depth-enhanced alignment. Most telling is Lactuca sativa (C14), whose boundaries are consistently preserved under the full model but become blurred as projection modules are removed. These results suggest that the benefits of structural components do not distribute evenly across the class space. Semantic routing and prototype projection not only enhance global modeling, but also act as class-specific filters, modulating attention in semantically ambiguous or spatially irregular regions. Their removal disproportionately affects classes that depend on local structure, class context, or spatial continuity. In contrast, dominant or spectrally isolated classes exhibit resilience, though not immunity. Taken together, this analysis reinforces the need to evaluate structural robustness not solely by aggregate scores, but also by its ability to preserve semantic fidelity under complexity. As datasets grow more heterogeneous, architectural design must shift from uniform modeling toward class-aware mechanisms that respond adaptively to variability in both space and semantics.

Patterns observed across datasets and configurations reveal that robustness in structured semantic modeling is not a consequence of scale or resolution, but of how architectural elements negotiate ambiguity. The capacity to preserve spatial regularity, to suppress class interference, and to recover fine-grained categories appears less tied to individual modules than to the relationships they encode. Where these relationships weaken, so too does the model’s sense of structure—not catastrophically, but perceptibly. The challenge is not just architectural adequacy, but compositional fitness.

4.4. Parameter Configuration and Behavioral Stability

While architectural design fundamentally shapes a model’s capacity to extract semantic structure, the role of parameter configuration in stabilizing this capacity is often overlooked. In UniHSFormer-X, two hyperparameters play a uniquely structural role: the number of semantic prototypes (P), which determines the resolution of concept-space anchoring; and the Top-K routing size (K), which controls the selectivity of attention propagation. Together, these parameters govern how features are filtered, grouped, and relayed through the network—imposing a latent grammar on information flow.
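
To make the roles of P and K concrete, the following is a minimal sketch of top-K prototype routing in plain Python. It is not the authors' implementation: cosine affinity, softmax weighting over the selected prototypes, and all function names are assumptions chosen for illustration, and the toy example uses P = 4 prototypes with K = 2.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def route_tokens(tokens, prototypes, k):
    """For each token, keep its k most similar prototypes.

    Returns one dict per token mapping prototype index -> routing weight,
    where the weights are a softmax over the k retained affinities.
    """
    routes = []
    for tok in tokens:
        sims = [cosine(tok, p) for p in prototypes]
        top = sorted(range(len(prototypes)), key=lambda j: sims[j], reverse=True)[:k]
        peak = max(sims[j] for j in top)          # subtract for numerical stability
        exps = {j: math.exp(sims[j] - peak) for j in top}
        z = sum(exps.values())
        routes.append({j: e / z for j, e in exps.items()})
    return routes


# Toy data (hypothetical): two 3-dimensional tokens, P = 4 prototypes, K = 2.
tokens = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
routes = route_tokens(tokens, prototypes, k=2)
for i, r in enumerate(routes):
    print(f"token {i} routed to prototypes {sorted(r)}")
```

In this picture, raising P adds more candidate anchors per token, while raising K lets each token draw on more of them; K = 1 forces a hard assignment to a single prototype.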

To investigate their joint influence, we conducted a grid-based analysis over P and K, fixing the token dimension to 128. Prototype numbers were varied from 4 to 16 in steps of 2, and routing sizes from K = 1 to K = 8. Each (P, K) configuration was independently trained and evaluated on three datasets (LongKou, HanChuan, and HongHu), covering structured, fragmented, and entangled spectral–spatial landscapes. The resulting OA surfaces are shown in Figure 9. The effects of these parameters are neither monotonic nor dataset-invariant. Across all three datasets, two behavioral patterns emerge: (1) extremely low routing sizes (K ≤ 2) consistently lead to performance instability, suggesting that insufficient spatial aggregation prevents effective structural alignment; (2) overly dense prototype sets (P ≥ 14) tend to introduce semantic redundancy, increasing confusion in fine-grained class regions. In contrast, a mid-range prototype count (P = 8–12) and moderate routing (K = 3–6) define a stability basin, where classification accuracy remains both high and robust to minor parameter variation.
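
The search grid and the regimes described above can be enumerated explicitly. The sketch below only labels each (P, K) pair using the thresholds quoted in the text; it does not train anything, and the regime names, the function name, and the precedence given to the routing rule where two regimes overlap are illustrative choices.

```python
from itertools import product

PROTOTYPES = range(4, 17, 2)   # P = 4, 6, ..., 16
ROUTING_SIZES = range(1, 9)    # K = 1, ..., 8


def regime(p, k):
    """Label a (P, K) pair with the behavioral regime reported in the text."""
    if k <= 2:
        return "unstable routing"        # checked first where regimes overlap
    if p >= 14:
        return "redundant prototypes"
    if 8 <= p <= 12 and 3 <= k <= 6:
        return "stability basin"
    return "intermediate"


grid = {(p, k): regime(p, k) for p, k in product(PROTOTYPES, ROUTING_SIZES)}
basin = [pk for pk, label in grid.items() if label == "stability basin"]
print(len(grid), "configurations,", len(basin), "inside the basin")
```

The grid contains 56 configurations per dataset, of which 12 fall inside the reported basin, so roughly one setting in five lies in the region where accuracy is insensitive to small parameter changes.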

This basin is the most expansive in LongKou, where field structure is regular, margins are clean, and class boundaries exhibit low entropy. Under such conditions, the model displays graceful degradation even under suboptimal settings. In HanChuan and HongHu, however, the margin for error narrows: structural clutter and spectral overlap amplify the cost of poor routing or excessive semantic partitioning. For instance, at K = 1 with P = 16, OA in HongHu drops by over 2% relative to its local optimum, indicating that excessive semantic fragmentation cannot compensate for the absence of contextual guidance. What these surfaces reveal is that parameter selection does not merely optimize performance—it conditions representational behavior. Prototype granularity defines the semantic resolution at which features are interpreted, while routing determines the structural scope over which relevance is computed. Misaligning these elements can induce failure modes that are not reflected in aggregate metrics alone—such as over-smoothing, class coalescence, or loss of boundary fidelity.

More fundamentally, these results suggest that parameter tuning in semantically modular architectures like UniHSFormer-X should move beyond grid search. The presence of dataset-dependent basins indicates that optimal values are not universal, but emergent from the interplay between model logic and landscape structure. Designing routing strategies or prototype sets that adapt to visual density, class topology, or contextual noise may therefore prove more effective than fixed settings alone. The question is not whether a model works under ideal settings, but whether it remains intelligible and stable as its semantic and structural parameters are perturbed. This view reframes parameter tuning not as a peripheral task, but as a lens into model generalization and the latent structure of semantic encoding.

4.5. Complexity and Runtime Analysis

In real-world agricultural deployments, particularly under constrained hardware or real-time demands, model accuracy alone is insufficient. Practical deployment also depends on the complexity of model design, the computational cost of inference, and the overhead introduced by hyperparameter configuration. To this end, we conducted a three-part evaluation across representative models, covering (1) the number of critical hyperparameters, (2) theoretical computational complexity per inference, and (3) the empirical runtime per image.

We emphasize that a higher hyperparameter count does not necessarily imply higher practical burden. Many parameters—such as learning rate, dropout, or patch size—are typically initialized with standard values and seldom tuned beyond default. What hyperparameter count instead reflects is the potential for architectural flexibility and task adaptability. For example, while our UniHSFormer-X exposes nine configurable options, including spectral and token dimensions, routing structure, and optimization variables, all were bounded within interpretable ranges, requiring no exhaustive search and enabling stable convergence without manual tuning.

Theoretical complexity is expressed using symbolic notation, allowing for model-agnostic comparison across diverse architectural families. Convolutional models (e.g., 3D-CNN, ResNet) scale with local kernel operations, while transformer-based architectures (e.g., ViT, SSFTT, CTDBNet) scale quadratically with token length. Our method, in contrast, replaces global attention with a prototype-guided routing mechanism, achieving linear complexity with respect to token count and spectral prototype size, thus ensuring scalability without sacrificing semantic depth.
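
The difference between quadratic and linear scaling can be illustrated with leading-order operation counts. The sketch below counts only the dominant affinity computation (token-to-token for full self-attention, token-to-prototype for routing) and ignores projections and feed-forward layers. The token count of 49 (one token per pixel of a 7 × 7 block) and the choice P = 8 are assumptions made for illustration; d = 128 is the token dimension used in the parameter study.

```python
def attention_cost(n, d):
    """Leading-order multiply-adds for full self-attention scores: n * n * d."""
    return n * n * d


def routing_cost(n, d, num_prototypes):
    """Leading-order multiply-adds for token-to-prototype affinities: n * P * d."""
    return n * num_prototypes * d


n = 7 * 7   # assumption: one token per pixel of a 7 x 7 input block
d = 128     # token dimension fixed in the parameter study
P = 8       # assumption: a mid-range prototype count

full = attention_cost(n, d)
routed = routing_cost(n, d, P)
print(f"self-attention: {full}  routing: {routed}  ratio: {full / routed}")
```

The ratio of the two counts reduces to n / P, so the saving grows with the number of tokens and shrinks as more prototypes are added.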

To complement the theoretical analysis, we measured per-image inference latency on an NVIDIA RTX 4090 GPU under PyTorch 2.1.0 and CUDA 12.1, using 7 × 7 × 30 input blocks and FP32 precision. All methods were benchmarked in isolation, excluding data I/O and augmentation steps, to ensure forward-pass comparability. For clarity and reproducibility, the key variables used in the complexity formulas are defined as follows:

n: number of patches/sequence length;
d: embedding dimension or input feature dimension;
k: mixing kernel size (specific to CTMixer);
c: number of channels;
V_{l}: number of voxels in layer l (3D-CNN only);
C_{in} \cdot C_{out}: input/output channels per convolution layer;
sv: number of support vectors (SVM);
ntrees: number of trees (Random Forest);
P: number of routing prototypes (UniHSFormer-X).

Table 8 summarizes the results. As expected, SVM and RF incur the lowest computational cost but lack capacity for high-dimensional representation. Convolution-based methods offer a moderate balance. Among transformer models, UniHSFormer-X achieves a competitive latency (~20.3 ms/image) while maintaining structured modularity and interpretability. This balance is rooted in its token-semantic decoupling and prototype-routed attention, which avoids costly full self-attention while preserving context-aware encoding. The results demonstrate that our model is not only accurate and generalizable but also computationally efficient and modularly tunable—qualities crucial for agricultural deployments under resource constraints.
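
The latency protocol described above (isolated forward passes, with data I/O and augmentation excluded) can be sketched as a small timing harness. This is a generic illustration rather than the benchmarking script used for Table 8: the function names, the warm-up and repeat counts, and the dummy workload are all hypothetical.

```python
import statistics
import time


def measure_latency_ms(forward, sample, warmup=10, repeats=50):
    """Median wall-clock time of forward(sample), in milliseconds."""
    for _ in range(warmup):              # warm-up calls are not timed
        forward(sample)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        forward(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Stand-in workload (hypothetical): reduce a 7 x 7 x 30 block of zeros.
dummy_block = [[[0.0] * 30 for _ in range(7)] for _ in range(7)]


def dummy_forward(block):
    return sum(v for row in block for pixel in row for v in pixel)


latency = measure_latency_ms(dummy_forward, dummy_block)
print(f"{latency:.4f} ms per call")
```

When the forward pass runs on a GPU, kernels are launched asynchronously, so a device synchronization (for example `torch.cuda.synchronize()` in PyTorch) is needed before each clock reading; otherwise the timer measures launch overhead rather than compute time.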

While inference efficiency is crucial for real-time tasks, practical deployment in large-scale agricultural scenarios also requires consideration of training overhead, model scalability, and parameter footprint. Table 9 extends the comparison along these dimensions, reporting training time, test time, theoretical FLOPs, and parameter count across three datasets: LongKou (LK), HanChuan (HC), and HongHu (HH).

Our UniHSFormer-X exhibits consistently low training and test latency across datasets. Relative to the transformer baselines (ViT, SSFTT, and CTDBNet), it reduces training time by 17.2–30.4% and testing time by 15.5–21.3%. The streamlined token-prototype routing and modular block composition contribute to this acceleration by avoiding costly full-attention computation while maintaining representational flexibility.

In terms of FLOPs, UniHSFormer-X requires 962.84 M operations, ranking among the most efficient in the transformer group and notably outperforming ViT (1462.44 M) and CTDBNet (1482.7 M). Despite having 9 tunable hyperparameters (see Table 8), its parameter count remains at 0.48 M, lower than most Transformer variants and only moderately above shallow models like 3D-CNN (0.16 M) or SSFTT (0.16 M). This reflects a balanced architectural compactness, where semantic expressiveness is achieved without redundancy.
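
The FLOPs and parameter figures quoted above can be turned into relative savings with a few lines of arithmetic. The sketch uses only the numbers stated in this paragraph; the function and variable names are illustrative.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is smaller than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Theoretical FLOPs in millions, as quoted in the text.
FLOPS_M = {"UniHSFormer-X": 962.84, "ViT": 1462.44, "CTDBNet": 1482.7}

savings = {
    name: relative_reduction(FLOPS_M[name], FLOPS_M["UniHSFormer-X"])
    for name in ("ViT", "CTDBNet")
}
for name, saving in savings.items():
    print(f"{name}: {saving:.1f}% fewer FLOPs for UniHSFormer-X")

# Parameter count relative to the 0.16 M models (3D-CNN, SSFTT).
param_ratio = 0.48 / 0.16
print(f"parameter ratio: {param_ratio:.1f}x")
```

On these figures UniHSFormer-X needs about 34% fewer operations than ViT and about 35% fewer than CTDBNet, while carrying about three times the parameters of the 0.16 M models.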

Classical machine learning models such as RF and SVM offer minimal training/testing time and negligible parameter size, but lack the expressive capacity required for fine-grained spectral–spatial classification, as shown in previous sections. CNN-based backbones like ResNet and 3D-CNN strike a moderate balance but incur significantly higher FLOPs due to stacked convolutional operations and volumetric encoding.

Collectively, these results affirm that UniHSFormer-X not only achieves high accuracy and generalization but also demonstrates efficient training, moderate memory demands, and scalable inference, positioning it as a deployable and semantically robust solution for hyperspectral agricultural monitoring under practical constraints.

5. Discussions

5.1. Structured Landscapes and Recognition Breakdown in Cropland Settings

In agricultural remote sensing, classification accuracy is only as meaningful as the model’s ability to accommodate the spatial and semantic complexity of real-world cropland structures [56]. Across our three evaluation sites—LongKou, HanChuan, and HongHu—clear patterns emerged that reflect not merely algorithmic competence but the alignment (or misalignment) between model logic and landscape regularity [57].

LongKou represents a prototypical case of structurally regulated crop organization. The distinct field boundaries, consistent planting orientation, and limited class interference allow even relatively conventional models to produce coherent predictions. In this setting, spatial smoothness and boundary retention are sufficient to maintain semantic fidelity, and the differences between models primarily manifest as variations in edge sharpness or minor class leakage. This is precisely the type of environment in which general-purpose architectures—CNNs or global attention models—tend to perform well.

By contrast, the HanChuan and HongHu datasets illustrate the real constraints of model robustness under structural fragmentation. HanChuan includes urban margins, shadowed zones, and partial occlusions from artificial surfaces, while HongHu features dense interleaved horticultural plots, subtle phenotypic variation, and minimal spatial redundancy. In these environments, class boundaries are not just irregular; they are often absent in a spectral sense. Here, the assumption of local semantic coherence breaks down, and models lacking spatial reasoning or adaptive structure control begin to fail in more systemic ways [58]. Errors are no longer isolated but cascade across visually entangled areas: strawberry blends with cowpea, lettuce bleeds into mustard, and plastic interferes with wet soils [38]. These breakdowns highlight a crucial distinction: misclassification in cropland mapping is rarely caused by a lack of capacity. Rather, it arises from the absence of architectural constraints that reflect how agricultural categories are spatially and phenotypically organized. Fields are not simply grids of classes but compositional structures defined by planting strategy, crop maturity, and landscape heterogeneity. A model that overlooks these factors may achieve nominal accuracy, but its outputs lack semantic structure and field-level interpretability.

What distinguishes structurally aware models such as UniHSFormer-X is not their depth or scale, but their ability to preserve spatial logic under irregularity. Rather than imposing smoothness or overfitting to global features, they enable selective focus: identifying where class boundaries emerge, when to preserve ambiguity, and how to filter irrelevant local signals. This behavior is particularly evident in HongHu, where the capacity to distinguish densely packed botanical variants reflects not just spectral modeling, but a grammar of spatial interpretation embedded in the architecture itself. In essence, field-level classification must go beyond per-pixel accuracy. It demands that models operate within, and respond to, the constraints of agricultural form. Structural robustness, then, is not merely a desirable property; it is a prerequisite for applying hyperspectral modeling in heterogeneous cropland environments [59,60].

5.2. Routed Semantics and Class Disambiguation in Agricultural Contexts

In complex agricultural environments, where class boundaries are ambiguous and spatial continuity is irregular, improving performance requires more than expanding perceptual range or deepening the network. What matters is not only what features are aggregated, but how they are routed and anchored semantically. Within UniHSFormer-X, the introduction of a semantic routing mechanism serves precisely this role—it defines a dynamic path through which tokens interact with learned class prototypes, selectively reinforcing features that conform to meaningful structural expectations [61].
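To make the idea of routing tokens through class prototypes concrete, the following is a minimal sketch of one plausible form of top-K prototype routing. It is not the UniHSFormer-X implementation: the cosine affinity, the softmax temperature, the function names, and the toy three-dimensional vectors are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def route_token(token, prototypes, k, temperature=0.1):
    """Return {prototype_index: weight} for the k prototypes closest to the token.

    Assumed scheme: cosine affinity, keep the top k, softmax over the survivors.
    """
    sims = [cosine(token, p) for p in prototypes]
    top = sorted(range(len(prototypes)), key=lambda i: sims[i], reverse=True)[:k]
    peak = max(sims[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp((sims[i] - peak) / temperature) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Toy prototypes (hypothetical): three "pure" classes and one mixed class.
prototypes = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
]
token = [0.9, 0.3, 0.05]

weights = route_token(token, prototypes, k=2)
for index, weight in sorted(weights.items()):
    print(f"prototype {index}: weight {weight:.2f}")
```

With K = 2, the token is routed only to the two most similar prototypes (indices 0 and 3 here), and the remaining prototypes receive no weight. This is the sense in which routing is selective rather than a dense, all-to-all attention.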

This mechanism has particular resonance in crop recognition scenarios where classes are spectrally similar yet functionally distinct. For example, in LongKou, the difference between broad-leaf and narrow-leaf soybean is subtle, both in reflectance and morphology. Similarly, in HanChuan, plastic coverings and certain leafy crops may produce near-identical spectral profiles. Traditional models tend to either blur these categories together or rely on edge-based heuristics prone to fragmentation. Semantic routing, by contrast, allows for selective reinforcement: it promotes feature paths that align with consistent class behavior while attenuating paths that introduce semantic noise [62]. Moreover, this routing is context-aware—not in the sense of merely looking wider, but in recognizing when spatial information is structurally meaningful. For instance, in HongHu’s high-density horticultural plots, the model must differentiate between celtuce, mustard, and multiple lettuce types grown in immediate proximity. These are not easily separable through spectrum alone. The routing mechanism enhances separability by aligning tokens with semantic anchors that have been shaped across broader scenes, enabling the model to treat subtle edge shifts or texture transitions as class-significant rather than dismissible variance. Prototype projection plays a complementary role. It translates the high-dimensional token space into a condensed semantic manifold, where ambiguous features are drawn toward stable representations. This is especially beneficial for small-sample or boundary classes—such as mixed weed, bare soil, or edge-adjacent water—where raw spectral signals are insufficient to define class identity. Together, routing and projection function not as classifiers, but as semantic filters, suppressing spatial noise and reinforcing class-coherent continuity.
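The description of prototype projection, in which ambiguous features are drawn toward stable representations, can be illustrated with a small stand-in: a soft prototype assignment followed by a residual pull toward the weighted prototype mean. This is a sketch under stated assumptions rather than the projection used in the paper; the distance-based softmax, the mixing factor `alpha`, and the two-dimensional toy data are hypothetical.

```python
import math


def project(token, prototypes, alpha=0.5, temperature=1.0):
    """Pull a token toward the prototypes it most resembles.

    Assumed scheme: softmax over negative squared distances gives soft
    assignment weights; the output mixes the token with the weighted
    prototype mean (the "anchor") using factor alpha.
    """
    d2 = [sum((t - p) ** 2 for t, p in zip(token, proto)) for proto in prototypes]
    nearest = min(d2)  # shift by the minimum for numerical stability
    raw = [math.exp(-(d - nearest) / temperature) for d in d2]
    total = sum(raw)
    weights = [r / total for r in raw]
    anchor = [
        sum(weights[j] * prototypes[j][i] for j in range(len(prototypes)))
        for i in range(len(token))
    ]
    projected = [(1 - alpha) * t + alpha * a for t, a in zip(token, anchor)]
    return projected, weights


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two hypothetical class prototypes and one off-center token.
prototypes = [[0.0, 0.0], [4.0, 0.0]]
token = [1.5, 1.0]

projected, weights = project(token, prototypes)
before = distance(token, prototypes[0])
after = distance(projected, prototypes[0])
print(f"distance to nearest prototype: {before:.2f} -> {after:.2f}")
```

In this toy case the token ends up closer to its dominant prototype after projection, which mirrors the stabilizing role described above for small-sample and boundary classes.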

The ablation results further confirm this logic. When routing is disabled, even with projection and attention intact, the model begins to exhibit erratic transitions in cluttered zones. Conversely, when projection is removed but routing remains active, performance becomes more consistent but less precise—suggesting that routing defines where representation flows, and projection defines how it converges [63]. Their interaction is not additive, but compositional. From an agricultural perspective, this suggests a shift in how classification systems should be designed, not merely to categorize pixels, but to also organize spatial semantics [64]. A model that routes semantic information through learned agricultural hierarchies—recognizing not just what a crop looks like, but how its presence fits into spatial planting logic—is better equipped to handle the nuance and ambiguity of real cropland environments. Ultimately, semantic routing is not a cosmetic module—it is a structural response to the core challenges of crop type recognition. It replaces brute-force attention with directed flow, and substitutes statistical association with contextual reasoning [14]. As agricultural AI systems move toward higher autonomy and broader deployment, such mechanisms may become not just helpful, but foundational.

5.3. Parameter Elasticity and Landscape-Aware Adaptation

Beyond structural modules, the effectiveness of a classification model in agricultural settings also depends on its elasticity to parameter configuration, namely, how well it maintains stability when critical hyperparameters shift in response to data complexity. In architectures like UniHSFormer-X, where semantic prototypes and routing paths play essential roles, small changes in the number of prototypes (P) or routing breadth (K) can produce nonlinear effects, especially in visually entangled or spatially heterogeneous cropland scenes [65].

As observed in Figure 9, the model’s performance exhibits configuration basins—regions in the P–K parameter space where accuracy remains stable and well-behaved. LongKou, with its structured planting and wide row spacing, offers a broad and forgiving basin. Most configurations within moderate prototype counts (P = 8–12) and routing sizes (K = 3–6) yield high overall accuracy. This elasticity reflects the low intra-class variation and spatial separability inherent to the field structure itself. In contrast, HanChuan and HongHu demonstrate markedly narrower basins and higher sensitivity. In HanChuan, slight under-provisioning of routing (K = 2) or overexpansion of prototype count (P = 16) results in rapid degradation, especially near object edges or background transitions. In HongHu, where class distinctions are subtle and plots are densely packed, even moderate parameter misalignment leads to semantic drift—leafy vegetables collapse into one another, and spectral regularity fails to serve as a reliable organizing cue.
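One simple way to quantify a configuration basin is to count the (P, K) settings whose accuracy stays within a tolerance of the best setting. The sketch below does this for two small grids. The accuracy values are synthetic placeholders invented for illustration and are not taken from Figure 9; the tolerance of 0.5 percentage points and the function name `basin` are likewise assumptions.

```python
def basin(accuracy_by_config, tolerance=0.5):
    """Return the configurations within `tolerance` points of the best accuracy."""
    best = max(accuracy_by_config.values())
    return sorted(
        config
        for config, accuracy in accuracy_by_config.items()
        if best - accuracy <= tolerance
    )


# Synthetic overall accuracy (%) keyed by (P, K). NOT measured results.
structured_scene = {
    (8, 3): 99.6, (8, 6): 99.7,
    (12, 3): 99.7, (12, 6): 99.8,
    (16, 2): 99.4, (16, 6): 99.5,
}
fragmented_scene = {
    (8, 3): 97.1, (8, 6): 98.9,
    (12, 3): 98.2, (12, 6): 99.0,
    (16, 2): 95.0, (16, 6): 96.8,
}

for name, grid in [("structured", structured_scene), ("fragmented", fragmented_scene)]:
    stable = basin(grid)
    share = len(stable) / len(grid)
    print(f"{name}: {len(stable)} of {len(grid)} configurations in basin ({share:.0%})")
```

A broad basin shows up as a large share of configurations surviving the tolerance test, while a narrow basin leaves only a few. Applied to a real P–K sweep, the same count gives a single number for comparing sensitivity across sites.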

What these differences suggest is that parameter tuning cannot be decoupled from landscape structure. The same configuration that performs robustly in one field layout may be inadequate in another, not because of noise or label imbalance, but due to changes in how semantic categories are distributed, bounded, and interleaved. Agricultural scenes are not fixed domains; they are dynamic systems shaped by crop rotation, cultivation practices, and environmental conditions [66]. A static set of parameters, even when carefully optimized, may fall short when models are deployed at scale across varied regions or seasons. This points toward the need for landscape-aware parameter adaptation. Rather than treating P and K as global constants, future systems may benefit from dynamic tuning based on input characteristics, such as adjusting routing sparsity in response to field fragmentation or modulating prototype granularity in proportion to estimated inter-class proximity [67]. Such adaptation could be achieved through lightweight predictors or meta-learning modules that infer optimal configurations from low-level spatial cues [18].
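As a rough illustration of what a lightweight, landscape-aware predictor might look like, the sketch below estimates field fragmentation from a coarse segment map and maps it to a routing breadth K. Everything here is hypothetical: the fragmentation index (share of 4-neighbor pixel pairs with different segment IDs), the linear mapping, the assumption that more fragmented scenes call for a larger K, and the default range of 3 to 6, which simply echoes the range mentioned for LongKou.

```python
def fragmentation(segments):
    """Share of horizontally or vertically adjacent cell pairs with different IDs."""
    rows, cols = len(segments), len(segments[0])
    differing = total = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # right-hand neighbor
                total += 1
                differing += int(segments[r][c] != segments[r][c + 1])
            if r + 1 < rows:  # neighbor below
                total += 1
                differing += int(segments[r][c] != segments[r + 1][c])
    return differing / total if total else 0.0


def choose_k(frag, k_min=3, k_max=6):
    """Map a fragmentation score in [0, 1] linearly onto an integer K."""
    frag = min(max(frag, 0.0), 1.0)
    return k_min + round(frag * (k_max - k_min))


# Two toy segment maps: two wide strips versus a checkerboard of tiny plots.
regular = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]

for name, seg in [("regular", regular), ("checker", checker)]:
    score = fragmentation(seg)
    print(f"{name}: fragmentation {score:.2f}, K = {choose_k(score)}")
```

A learned predictor or meta-learning module could replace the linear rule, but even this heuristic shows how a cheap spatial cue can drive a per-scene configuration instead of a single global constant.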

More generally, parameter sensitivity analysis provides a diagnostic lens into model design. A model that performs well under ideal conditions but collapses under small shifts is unlikely to generalize in real-world agro-ecological deployments. Conversely, models that exhibit broad basins of stability suggest internal regularization—not merely overfitting resistance, but an intrinsic alignment between architectural logic and the structural semantics of cropland imagery. In this sense, the goal is not to find perfect hyperparameters, but to design systems that degrade gracefully—that maintain interpretability and structural coherence even as environmental complexity increases. For agricultural intelligence to scale beyond test fields and benchmarks, robustness must be understood not just as statistical performance, but as behavioral continuity under change.

5.4. Toward Transferable Semantic Grammars for Agricultural AI

As agricultural classification models evolve from research prototypes to operational systems, the nature of what constitutes robustness begins to shift. It is no longer sufficient for a model to perform well within curated datasets or narrow geographic bounds; instead, robustness must be recast as a form of semantic continuity across spatial, temporal, and agronomic variability [20,68]. In this context, UniHSFormer-X provides not just a high-performing architecture, but a blueprint for how semantic understanding in agriculture might be structured, routed, and generalized [69].

At the heart of this model lies a premise that extends beyond attention or feature fusion: the recognition of crop types, especially in fragmented, diverse planting environments, requires more than pixel-level inference. It calls for the construction of a latent grammar, a rule-based system through which semantic categories are assembled, aligned, and maintained as coherent entities [70]. The routing mechanism, with its selective token propagation, begins to approximate this grammar. By defining where class-relevant information should flow and how ambiguous patterns should be filtered or suppressed, the model learns not just to classify, but to regulate semantic behavior. Such behavior becomes particularly valuable when the target environment deviates from the training distribution [67]. Cropland geometry shifts with seasonal rotation; class inventories evolve with hybrid cultivars; imaging conditions vary with cloud cover, irrigation, and soil disturbance. Yet within these changes, certain structural invariants persist: row continuity, boundary tension, and contextual adjacency between crops and infrastructure. Models that encode these invariants structurally, through modular semantics and hierarchical attention, are more likely to remain interpretable and responsive even as surface variability increases. More importantly, this grammar-like organization allows for partial knowledge transfer [24,61]. When deployed in unseen regions, a model may not recognize every class precisely, but it can still maintain boundary structure, suppress background confusion, and approximate unknown crops through aligned prototypes. This is not simply domain generalization; it is semantic modularity under constraint, a capacity to adapt without collapsing into uniform prediction or incoherent noise.

What emerges is a vision of agricultural AI not as a single optimized pipeline, but as a network of composable semantic tools, each tuned not just to crops, but to the organizational logic of agriculture itself. From seedbed to satellite, the demands placed on classification models are not static but structural. As models begin to reflect this logic internally through routing, projection, and context-aware abstraction, they edge closer to becoming deployable systems that work with, rather than in spite of, the complexity of real-world farming. Such systems embody an architectural grammar that balances precision with interpretability, and structural resilience with semantic fidelity.

6. Conclusions

The classification of crops in hyperspectral imagery is not merely an exercise in pattern recognition; it is a semantic task shaped by the spatial grammar of agriculture. In this study, we proposed UniHSFormer-X, a unified transformer-based framework designed to align hyperspectral tokens with structurally meaningful semantics across diverse cropland settings. By integrating semantic routing, prototype-guided projection, and a hierarchical transformer encoder, the model moves beyond conventional spatial–spectral fusion, establishing a dynamic mechanism for information flow that mirrors the organizational logic of field planting and phenotypic structure. Comprehensive experiments on the three WHU-Hi benchmark datasets demonstrate the model’s superiority in both overall accuracy and class-wise fidelity, particularly in challenging horticultural environments where conventional models tend to degrade. Ablation studies reveal that no single module suffices: robustness stems from the interaction between routing, supervision, and depth-aware abstraction. Moreover, the parameter stability analysis highlights a critical insight: model behavior is conditioned not only by internal architecture but also by the external structure of the landscape itself.

These findings point to a broader paradigm shift: from fixed, generic pipelines to semantically modular, context-adaptive systems capable of reasoning within the constraints of real-world agriculture. In this view, a model’s strength is not defined by its peak performance, but by its ability to maintain structure under uncertainty, to disambiguate meaning where boundaries blur, and to adapt gracefully to variation across time, region, and crop taxonomy. What emerges is not a finished solution, but a template, an architectural grammar for agricultural AI. Future work may refine this blueprint through self-adaptive routing, continual learning under seasonal dynamics, or cross-modal integration with soil, weather, and management data. The direction, however, is clear: from classification to interpretation, from segmentation to semantics, from pixels to patterns of agriculture.

Author Contributions

Z.D.: methodology, software, project administration, and writing—original draft; S.L.: methodology, software, writing—original draft, and funding acquisition; Y.L. (Yao Liao): conceptualization, writing—review and editing; Y.T. and Y.L. (Yanwen Liu): investigation, data curation, and visualization; H.X. and Z.Z.: writing—review and editing, supervision, and project administration; D.Z.: data curation, writing—review and editing, visualization, and software. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the JiangXi Provincial Social Science Fund (22YJ19).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data and algorithm code presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Agrawal, N.; Govil, H.; Kumar, T. Agricultural Land Suitability Classification and Crop Suggestion Using Machine Learning and Spatial Multicriteria Decision Analysis in Semi-Arid Ecosystem. Environ. Dev. Sustain. 2024, 27, 13689–13726.
Allu, A.; Mesapam, S. Fusion of Different Multispectral Band Combinations of Sentinel-2A with UAV Imagery for Crop Classification. J. Appl. Remote Sens. 2024, 18, 016511.
Amherdt, S.; Di Leo, N.; Balbarani, S.; Pereira, A.; Cornero, C.; Pacino, M. Exploiting Sentinel-1 Data Time-Series for Crop Classification and Harvest Date Detection. Int. J. Remote Sens. 2021, 42, 7313–7331.
Khosravi, I. Advancements in Crop Mapping through Remote Sensing: A Comprehensive Review of Concept, Data Sources, and Procedures over Four Decades. Remote Sens. Appl.-Soc. Environ. 2025, 38, 101527.
Wang, D.; Cao, W.; Zhang, F.; Li, Z.; Xu, S.; Wu, X. A Review of Deep Learning in Multiscale Agricultural Sensing. Remote Sens. 2022, 14, 559.
Das, P.; Kumar, T.; Barman, D.; Kar, M.; Chunduri, S.; Mandal, K.; Dash, D.; Nalini, J.; Chamundeswari, D.; Mitra, S.; et al. Field-Scale Estimation of Phenotypic Parameters for Jute and Allied Fibre Crops: An Unmanned Aerial Vehicle Remote Sensing Approach. J. Indian Soc. Remote Sens. 2025, 53, 1439–1456.
Di Tommaso, S.; Wang, S.; Vajipey, V.; Gorelick, N.; Strey, R.; Lobell, D. Annual Field-Scale Maps of Tall and Short Crops at the Global Scale Using GEDI and Sentinel-2. Remote Sens. 2023, 15, 4123.
Aneece, I.; Thenkabail, P.; McCormick, R.; Alifu, H.; Foley, D.; Oliphant, A.; Teluguntla, P. Machine Learning and New-Generation Spaceborne Hyperspectral Data Advance Crop Type Mapping. Photogramm. Eng. Remote Sens. 2024, 90, 687–698.
Ashraf, M.; Chen, L.; Innab, N.; Umer, M.; Baili, J.; Kim, T.; Ashraf, I. Novel 3-D Deep Neural Network Architecture for Crop Classification Using Remote Sensing-Based Hyperspectral Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 12649–12665.
Bhosle, K.; Musande, V. Evaluation of Deep Learning CNN Model for Land Use Land Cover Classification and Crop Identification Using Hyperspectral Remote Sensing Images. J. Indian Soc. Remote Sens. 2019, 47, 1949–1958.
Chamundeeswari, G.; Srinivasan, S.; Bharathi, S. Efficient Urban Green Space Destruction and Crop Stress Yield Assessment Model. Intell. Autom. Soft Comput. 2022, 33, 515–534.
Luciano, A.; Picoli, M.; Rocha, J.; Franco, H.; Sanches, G.; Leal, M.; le Maire, G. Generalized Space-Time Classifiers for Monitoring Sugarcane Areas in Brazil. Remote Sens. Environ. 2018, 215, 438–451.
Babu, V.; Ram, N. Deep Residual CNN with Contrast Limited Adaptive Histogram Equalization for Weed Detection in Soybean Crops. Trait. Du Signal 2022, 39, 717–722.
Barriere, V.; Claverie, M.; Schneider, M.; Lemoine, G.; d’Andrimont, R. Boosting Crop Classification by Hierarchically Fusing Satellite, Rotational, and Contextual Data. Remote Sens. Environ. 2024, 305, 114110.
Lodato, F.; Pennazza, G.; Santonico, M.; Vollero, L.; Grasso, S.; Pollino, M. In-Depth Analysis and Characterization of a Hazelnut Agro-Industrial Context through the Integration of Multi-Source Satellite Data: A Case Study in the Province of Viterbo, Italy. Remote Sens. 2024, 16, 1227.
Yin, Q.; Gao, L.; Zhou, Y.; Li, Y.; Zhang, F.; Lopez-Martinez, C.; Hong, W. Coherence Matrix Power Model for Scattering Variation Representation in Multi-Temporal PolSAR Crop Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 9797–9810.
Gao, J.; Ji, X.; Chen, G.; Guo, R. Main-Sub Transformer With Spectral-Spatial Separable Convolution for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 2747–2762.
Li, K.; Zhao, W.; Peng, R.; Ye, T. Multi-Branch Self-Learning Vision Transformer (MSViT) for Crop Type Mapping with Optical-SAR Time-Series. Comput. Electron. Agric. 2022, 203, 107497.
Farmonov, N.; Esmaeili, M.; Abbasi-Moghadam, D.; Sharifi, A.; Amankulova, K.; Mucsi, L. HypsLiDNet: 3-D-2-D CNN Model and Spatial-Spectral Morphological Attention for Crop Classification With DESIS and LiDAR Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 11969–11996.
Niu, B.; Feng, Q.; Chen, B.; Ou, C.; Liu, Y.; Yang, J. HSI-TransUNet: A Transformer Based Semantic Segmentation Model for Crop Mapping from UAV Hyperspectral Imagery. Comput. Electron. Agric. 2022, 201, 107297.
Asadi, B.; Shamsoddini, A. Crop Mapping through a Hybrid Machine Learning and Deep Learning Method. Remote Sens. Appl.-Soc. Environ. 2024, 33, 101090.
Chaudhury, B.; Sahadevan, A.; Mitra, P. Multi-Task Hybrid Spectral-Spatial Temporal Convolution Networks for Classification of Agricultural Crop Types and Growth Stages Using Drone-Borne Hyperspectral and Multispectral Images. J. Appl. Remote Sens. 2023, 17, 038503.
Emmi, L.; Le Flécher, E.; Cadenat, V.; Devy, M. A Hybrid Representation of the Environment to Improve Autonomous Navigation of Mobile Robots in Agriculture. Precis. Agric. 2021, 22, 524–549.
Hoppe, H.; Dietrich, P.; Marzahn, P.; Weiss, T.; Nitzsche, C.; von Lukas, U.; Wengerek, T.; Borg, E. Transferability of Machine Learning Models for Crop Classification in Remote Sensing Imagery Using a New Test Methodology: A Study on Phenological, Temporal, and Spatial Influences. Remote Sens. 2024, 16, 1493.
Chabalala, Y.; Adam, E.; Ali, K. Identifying the Optimal Phenological Period for Discriminating Subtropical Fruit Tree Crops Using Multi-Temporal Sentinel-2 Data and Google Earth. S. Afr. J. Geomat. 2023, 12, 262–283.
Cai, Z.; Xu, B.; Yu, Q.; Zhang, X.; Yang, J.; Wei, H.; Li, S.; Song, Q.; Xiong, H.; Wu, H.; et al. A Cost-Effective and Robust Mapping Method for Diverse Crop Types Using Weakly Supervised Semantic Segmentation with Sparse Point Samples. ISPRS J. Photogramm. Remote Sens. 2024, 218, 260–276.
Ebrahimi, S.; Kumar, S. What Helps to Detect What? Explainable AI and Multisensor Fusion for Semantic Segmentation of Simultaneous Crop and Land Cover Land Use Delineation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 5423–5444.
Huang, L.; Miao, B.; She, B.; Zhang, A.; Zhao, J.; Ruan, C. Rapid Mapping of Soybean Planting Areas under Complex Crop Structures: A Modified GWCCI Approach. Comput. Electron. Agric. 2025, 235, 110326.
Huang, X.; Wang, H.; Li, X. A Multi-Scale Semantic Feature Fusion Method for Remote Sensing Crop Classification. Comput. Electron. Agric. 2024, 224, 109185.
Shen, Z.; Liu, W.; Xu, S. DS-SwinUNet: Redesigning Skip Connection With Double Scale Attention for Land Cover Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 4382–4395.
Hasituya; Chen, Z. Mapping Plastic-Mulched Farmland with Multi-Temporal Landsat-8 Data. Remote Sens. 2017, 9, 557.
Metwaly, M.; Metwalli, M.; Abd-Elwahed, M.; Zakarya, Y. Digital Mapping of Soil Quality and Salt-Affected Soil Indicators for Sustainable Agriculture in the Nile Delta Region. Remote Sens. Appl.-Soc. Environ. 2024, 36, 101318.
Borana, S.L.; Yadav, S.K.; Parihar, S.K. Hyperspectral Data Analysis for Arid Vegetation Species: Smart & Sustainable Growth. In Proceedings of the 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), Greater Noida, India, 18–19 October 2019; pp. 495–500.
Liao, C.; Wang, J.; Xie, Q.; Al Baz, A.; Huang, X.; Shang, J.; He, Y. Synergistic Use of Multi-Temporal RADARSAT-2 and VENμS Data for Crop Classification Based on 1D Convolutional Neural Network. Remote Sens. 2020, 12, 832.
Kou, W.; Shen, Z.; Liu, D.; Liu, Z.; Li, J.; Chang, W.; Wang, H.; Huang, L.; Jiao, S.; Lei, Y.; et al. Crop Classification Methods and Influencing Factors of Reusing Historical Samples Based on 2D-CNN. Int. J. Remote Sens. 2023, 44, 3278–3305.
Gallo, I.; Ranghetti, L.; Landro, N.; La Grassa, R.; Boschetti, M. In-Season and Dynamic Crop Mapping Using 3D Convolution Neural Networks and Sentinel-2 Time Series. ISPRS J. Photogramm. Remote Sens. 2023, 195, 335–352.
Alotaibi, Y.; Rajendran, B.; Rani, K.; Rajendran, S. Dipper Throated Optimization with Deep Convolutional Neural Network-Based Crop Classification for Remote Sensing Image Analysis. PeerJ Comput. Sci. 2024, 10, e1828.
Bhosle, K.; Musande, V. Evaluation of CNN Model by Comparing with Convolutional Autoencoder and Deep Neural Network for Crop Classification on Hyperspectral Imagery. Geocarto Int. 2022, 37, 813–827.
Bagci, R.; Acar, E.; Türk, Ö. Identification of Cotton and Corn Plant Areas by Employing Deep Transformer Encoder Approach and Different Time Series Satellite Images: A Case Study in Diyarbakir, Turkey. Comput. Electron. Agric. 2023, 209, 107838.
Gao, Q.; Wu, T.; Tang, H.; Yang, J.; Wang, S. Large Area Crops Mapping by Phenological Horizon Attention Transformer (PHAT) Method Using MODIS Time-Series Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 10995–11013.
Papadopoulou, E.; Mallinis, G.; Siachalou, S.; Koutsias, N.; Thanopoulos, A.; Tsaklidis, G. Agricultural Land Cover Mapping through Two Deep Learning Models in the Framework of EU’s CAP Activities Using Sentinel-2 Multitemporal Imagery. Remote Sens. 2023, 15, 4657.
Poudel, U.; Stephen, H.; Ahmad, S. Evaluating Irrigation Performance and Water Productivity Using EEFlux ET and NDVI. Sustainability 2021, 13, 7967.
Kopec, D.; Zakrzewska, A.; Halladin-Dabrowska, A.; Wylazlowska, J.; Slawik, L. The Essence of Acquisition Time of Airborne Hyperspectral and On-Ground Reference Data for Classification of Highly Invasive Annual Vine Echinocystis Lobata (Michx.) Torr. & A. Gray. GIScience Remote Sens. 2023, 60, 2204682.
Jenifer, A.; Natarajan, S. CocoNet: A Hybrid Machine Learning Framework for Coconut Farm Identification and Its Cyclonic Damage Assessment on Bitemporal SAR Images. J. Appl. Remote Sens. 2021, 15, 042408.
Kong, J.; Wang, H.; Wang, X.; Jin, X.; Fang, X.; Lin, S. Multi-Stream Hybrid Architecture Based on Cross-Level Fusion Strategy for Fine-Grained Crop Species Recognition in Precision Agriculture. Comput. Electron. Agric. 2021, 185, 106134.
Abbasi, R.; Martinez, P.; Ahmad, R. Crop Diagnostic System: A Robust Disease Detection and Management System for Leafy Green Crops Grown in an Aquaponics Facility. Artif. Intell. Agric. 2023, 10, 1–12.
Gao, Z.; Guo, D.; Ryu, D.; Western, A. Training Sample Selection for Robust Multi-Year within-Season Crop Classification Using Machine Learning. Comput. Electron. Agric. 2023, 210, 107927.
Martinez, J.; Oliveira, H.; dos Santos, J.; Feitosa, R. Open Set Semantic Segmentation for Multitemporal Crop Recognition. IEEE Geosci. Remote Sens. Lett. 2021, 19, 2501905.
Tang, Z.; Wang, X.; Jiang, Q.; Pan, H.; Deng, G.; Chen, H.; You, Y.; Li, S.; Hou, H. Parcel-Scale Crop Planting Structure Extraction Combining Time-Series of Sentinel-1 and Sentinel-2 Data Based on a Semantic Edge-Aware Multi-Task Neural Network. Int. J. Digit. Earth 2025, 18, 2497487.
Song, B.; Min, S.; Yang, H.; Wu, Y.; Wang, B. A Fourier Frequency Domain Convolutional Neural Network for Remote Sensing Crop Classification Considering Global Consistency and Edge Specificity. Remote Sens. 2023, 15, 4788.
Zhen, Z.; Chen, S.; Yin, T.; Gastellu-Etchegorry, J. Improving Crop Mapping by Using Bidirectional Reflectance Distribution Function (BRDF) Signatures with Google Earth Engine. Remote Sens. 2023, 15, 2761.
Ajadi, O.; Barr, J.; Liang, S.; Ferreira, R.; Kumpatla, S.; Patel, R.; Swatantran, A. Large-Scale Crop Type and Crop Area Mapping across Brazil Using Synthetic Aperture Radar and Optical Imagery. Int. J. Appl. Earth Obs. Geoinf. 2021, 97, 102294.
Chen, R.; Xiong, S.; Zhang, N.; Fan, Z.; Qi, N.; Fan, Y.; Feng, H.; Ma, X.; Yang, H.; Yang, G.; et al. Fine-Scale Classification of Horticultural Crops Using Sentinel-2 Time-Series Images in Linyi Country, China. Comput. Electron. Agric. 2025, 236, 110425.
Zhong, Y.; Hu, X.; Luo, C.; Wang, X.; Zhao, J.; Zhang, L. WHU-Hi: UAV-Borne Hyperspectral with High Spatial Resolution (H2) Benchmark Datasets and Classifier for Precise Crop Identification Based on Deep Convolutional Neural Network with CRF. Remote Sens. Environ. 2020, 250, 112012.
Zhong, Y.; Wang, X.; Xu, Y.; Wang, S.; Jia, T.; Hu, X.; Zhao, J.; Wei, L.; Zhang, L. Mini-UAV-Borne Hyperspectral Remote Sensing From Observation and Processing to Applications. IEEE Geosci. Remote Sens. Mag. 2018, 6, 46–62.
Kafy, A.A.; Bakshi, A.; Saha, M.; Al Faisal, A.; Almulhim, A.; Rahaman, Z.; Mohammad, P. Assessment and Prediction of Index Based Agricultural Drought Vulnerability Using Machine Learning Algorithms. Sci. Total Environ. 2023, 867, 161394.
Abernethy, J.; Beeson, P.; Boryan, C.; Hunt, K.; Sartore, L. Preseason Crop Type Prediction Using Crop Sequence Boundaries. Comput. Electron. Agric. 2023, 208, 107768.
Patel, U.; Pathan, M.; Kathiria, P.; Patel, V. Crop Type Classification with Hyperspectral Images Using Deep Learning: A Transfer Learning Approach. Model. Earth Syst. Environ. 2023, 9, 1977–1987.
Tang, P.; Chanussot, J.; Guo, S.; Zhang, W.; Qie, L.; Zhang, P.; Fang, H.; Du, P. Deep Learning with Multi-Scale Temporal Hybrid Structure for Robust Crop Mapping. ISPRS J. Photogramm. Remote Sens. 2024, 209, 117–132.
Chabalala, Y.; Adam, E.; Ali, K. Machine Learning Classification of Fused Sentinel-1 and Sentinel-2 Image Data towards Mapping Fruit Plantations in Highly Heterogenous Landscapes. Remote Sens. 2022, 14, 2621.
Che, H.; Pan, Y.; Xia, X.; Zhu, X.; Li, L.; Huang, Y.; Zheng, X.; Wang, L. A New Transferable Deep Learning Approach for Crop Mapping. GIScience Remote Sens. 2024, 61, 2395700.
Zhang, Z.; Wang, L.; Chen, Y.; Zheng, C. Crop Identification of UAV Images Based on an Unsupervised Semantic Segmentation Method. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6004405.
Arun, P.; Karnieli, A. Reinforced Deep Learning Approach for Analyzing Spaceborne-Derived Crop Phenology. Int. J. Appl. Earth Obs. Geoinf. 2024, 131, 103984.
Cai, K.; Zhang, X.; Zhang, M.; Ge, Q.; Li, S.; Qiao, B.; Liu, Y. Improving Air Pollutant Prediction in Henan Province, China, by Enhancing the Concentration Prediction Accuracy Using Autocorrelation Errors and an Informer Deep Learning Model. Sustain. Environ. Res. 2023, 33, 13.
Bueno, I.; Antunes, J.; Reis, A.; Werner, J.; Toro, A.; Figueiredo, G.; Esquerdo, J.; Lamparelli, R.; Coutinho, A.; Magalha, P. Mapping Integrated Crop-Livestock Systems in Brazil with Planetscope Time Series and Deep Learning. Remote Sens. Environ. 2023, 299, 113886.
Wang, Y.; Feng, L.; Sun, W.; Wang, L.; Yang, G.; Chen, B. A Lightweight CNN-Transformer Network for Pixel-Based Crop Mapping Using Time-Series Sentinel-2 Imagery. Comput. Electron. Agric. 2024, 226, 109370.
Zhou, X.; Wang, J.; Shan, B.; He, Y. Early-Season Crop Classification Based on Local Window Attention Transformer with Time-Series RCM and Sentinel-1. Remote Sens. 2024, 16, 1376.
Wang, H.; Zhang, L.; Wu, R. MSAFormer: A Transformer-Based Model for PM2.5 Prediction Leveraging Sparse Autoencoding of Multi-Site Meteorological Features in Urban Areas. Atmosphere 2023, 14, 1294.
Ramathilagam, A.; Natarajan, S.; Kumar, A. TransCropNet: A Multichannel Transformer with Feature-Level Fusion for Crop Classification in Agricultural Smallholdings Using Sentinel Images. J. Appl. Remote Sens. 2023, 17, 024501.
Xie, J.; Hua, J.; Chen, S.; Wu, P.; Gao, P.; Sun, D.; Lyu, Z.; Lyu, S.; Xue, X.; Lu, J. HyperSFormer: A Transformer-Based End-to-End Hyperspectral Image Classification Method for Crop Classification. Remote Sens. 2023, 15, 3491. [Google Scholar] [CrossRef]

Figure 1. Representative examples of the WHU-Hi benchmark datasets used in this study. (a) WHU-Hi-LongKou: top left, RGB composite of the hyperspectral cube; top right, ground-truth land-cover map; bottom, sample photographs of major crops, including corn, sesame, and rice. (b) WHU-Hi-HanChuan: left, RGB composite of the hyperspectral cube; middle, ground-truth annotation; right, field photographs of classes such as strawberry, cowpea, sorghum, and soybean. (c) WHU-Hi-HongHu: left, RGB cube rendering; middle, ground-truth land-use map; right, typical crop samples, including tuber mustard, pakchoi, lettuce, and cabbage. Together, these scenes represent the spatial and spectral diversity of agricultural regions across the study areas.

Figure 2. Overall architecture of UniHSFormer-X, a unified transformer architecture that integrates spectral–spatial tokenization, prototype-guided routing, and hierarchical encoding to address spectral, spatial, and semantic imbalances in hyperspectral crop classification and to support robust, interpretable learning.

Figure 3. Spectral–spatial tokenization process in UniHSFormer-X. Input patches are converted into semantically consistent tokens while preserving spectral–spatial integrity, forming the foundation for prototype-driven semantic structuring.

Figure 4. Prototype-guided semantic routing mechanism in UniHSFormer-X. The learnable prototype set in UniHSFormer-X enables semantic routing via scaled dot-product attention, guiding token processing through soft class assignments and adaptive prototype refinement for class-aware representation learning.
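
The routing step named in the Figure 4 caption can be illustrated with a minimal NumPy sketch. It assumes tokens and prototypes share one embedding dimension and that routing weights are a softmax over scaled dot-product scores, optionally restricted to the Top-K prototypes per token. The function name `prototype_routing`, the array shapes, and the masking strategy are illustrative assumptions, not the authors' implementation; adaptive prototype refinement is omitted.

```python
import numpy as np

def prototype_routing(tokens, prototypes, top_k=None):
    """Softly assign tokens to prototypes with scaled dot-product attention.

    tokens:     (n, d) array of token embeddings
    prototypes: (P, d) array of class prototypes (learnable in a real model)
    top_k:      if set, keep only the k highest-scoring prototypes per token
    Returns (weights, routed): weights is (n, P), routed is (n, d).
    """
    n, d = tokens.shape
    scores = tokens @ prototypes.T / np.sqrt(d)          # (n, P) affinities
    if top_k is not None and top_k < scores.shape[1]:
        # Mask every score below each row's k-th largest value.
        kth = np.partition(scores, -top_k, axis=1)[:, -top_k][:, None]
        scores = np.where(scores >= kth, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over prototypes
    routed = weights @ prototypes                        # prototype-weighted context
    return weights, routed

# Hypothetical sizes: 6 tokens, 4 prototypes, 8-dimensional embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
prototypes = rng.normal(size=(4, 8))
weights, routed = prototype_routing(tokens, prototypes, top_k=2)
```

Each row of `weights` is a soft class assignment that sums to one; with `top_k=2`, only two prototypes per token receive non-zero weight (ties could admit more).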

Figure 5. Classification visualization maps of nine models on the WHU-Hi-LongKou dataset. (a–i) Predicted maps from RF, SVM, 3D-CNN, ResNet, ViT, SSFTT, CTMixer, CTDBNet, and UniHSFormer-X, respectively. (j) Ground truth reference. UniHSFormer-X exhibits superior boundary definition and semantic coherence across classes.

Figure 6. Classification results of nine models on the WHU-Hi-HanChuan dataset. (a–i) Predicted maps from RF, SVM, 3D-CNN, ResNet, ViT, SSFTT, CTMixer, CTDBNet, and UniHSFormer-X. (j) Ground truth. Compared to other models, UniHSFormer-X maintains clearer class separation under spatial irregularity and mixed material interference.

Figure 7. Classification maps of nine models on the WHU-Hi-HongHu dataset. (a–i) Predicted outputs from RF, SVM, 3D-CNN, ResNet, ViT, SSFTT, CTMixer, CTDBNet, and UniHSFormer-X. (j) Ground truth reference. Compared to other models, UniHSFormer-X produces fine-grained boundaries and balanced segmentation across dense horticultural plots and visually similar crop subclasses.

Figure 8. Accuracy variation across datasets under different module configurations. Model robustness emerges from interdependent module interactions: semantic routing guides feature importance, projection ensures proper representation, and multi-scale supervision regulates convergence. When components are ablated, the performance degradation varies with scene complexity.

Figure 9. Overall accuracy surfaces of UniHSFormer-X across different Prototype Numbers (P) and Top-K Routing Sizes (K). Each subplot corresponds to one of the three datasets: LongKou (left), HanChuan (middle), and HongHu (right). Performance basins emerge where architectural parameters harmonize with scene structure. The different colors in the 3D surface plots represent varying levels of overall accuracy (OA), helping to visually distinguish performance peaks and valleys across combinations of prototype numbers and Top-K routing sizes.

Table 1. Sample distribution per class in WHU-Hi-LongKou, HanChuan, and HongHu. Table 1 presents the class-specific label distributions adhering to the 100-sample-per-class training protocol, ensuring equitable representation across all categories while maintaining dataset balance for reliable classification performance assessment.

No.	LongKou Class Name	LongKou Samples	HanChuan Class Name	HanChuan Samples	HongHu Class Name	HongHu Samples
C1	Corn	34,511	Strawberry	44,735	Red roof	14,041
C2	Cotton	8374	Cowpea	22,753	Road	3512
C3	Sesame	3031	Soybean	10,287	Bare soil	21,821
C4	Broad-leaf soybean	63,212	Sorghum	5353	Cotton	163,285
C5	Narrow-leaf soybean	4151	Water spinach	1200	Cotton firewood	6218
C6	Rice	11,854	Watermelon	4533	Rape	44,557
C7	Water	67,056	Greens	5903	Chinese cabbage	24,103
C8	Roads and houses	7124	Trees	17,978	Pakchoi	4054
C9	Mixed weed	5229	Grass	9469	Cabbage	10,819
C10			Red roof	10,516	Tuber mustard	12,394
C11			Gray roof	16,911	Brassica parachinensis	11,015
C12			Plastic	3679	Brassica chinensis	8954
C13			Bare soil	9116	Small Brassica chinensis	22,507
C14			Road	18,560	Lactuca sativa	7356
C15			Bright object	1136	Celtuce	1002
C16			Water	75,401	Film covered lettuce	7262
C17					Romaine lettuce	3010
C18					Carrot	3217
C19					White radish	8712
C20					Garlic sprout	3486
C21					Broad bean	1328
C22					Tree	4040
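
The 100-sample-per-class training protocol mentioned in the Table 1 caption could be reproduced along the following lines. This is a sketch under stated assumptions: the ground truth is a 2-D integer label map, label 0 marks unlabelled pixels, and all remaining labelled pixels form the test set. The function name `sample_per_class` and the synthetic `demo` map are hypothetical.

```python
import numpy as np

def sample_per_class(labels, n_per_class=100, background=0, seed=0):
    """Pick n_per_class labelled pixels per class for training; the rest is test.

    labels: 2-D integer ground-truth map, with `background` marking
            unlabelled pixels (an assumption about the label encoding).
    Returns two 1-D arrays of flat pixel indices: (train_idx, test_idx).
    """
    rng = np.random.default_rng(seed)
    flat = labels.ravel()
    train, test = [], []
    for cls in np.unique(flat):
        if cls == background:
            continue
        idx = rng.permutation(np.flatnonzero(flat == cls))
        train.append(idx[:n_per_class])   # first n_per_class shuffled pixels
        test.append(idx[n_per_class:])    # everything else of this class
    return np.concatenate(train), np.concatenate(test)

# Synthetic 60 x 60 label map with classes 1..3 and 0 as background.
demo = np.random.default_rng(1).integers(0, 4, size=(60, 60))
train_idx, test_idx = sample_per_class(demo, n_per_class=100)
```

With a fixed per-class budget, a class with 1002 labelled pixels (Celtuce) and one with 163,285 (Cotton) contribute the same number of training samples, so the test set retains the natural class imbalance.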

Table 2. Quantitative comparison of classification performance on the WHU-Hi-LongKou dataset. Metrics include Overall Accuracy (OA), Average Accuracy (AA), and Kappa Coefficient (K × 100). UniHSFormer-X achieves the highest OA and AA, and its Kappa (99.49) is second only to that of SSFTT (99.53), indicating robust class-wise consistency and effective structural segmentation.

Class	RF	SVM	3D-CNN	ResNet	ViT	SSFTT	CTMixer	CTDBNet	UniHSFormer-X
C1	87.98	92.8	97.1	89.35	93.97	98.32	98.16	99.94	99.96
C2	44.3	46.28	64.44	90.26	72.4	99.54	99.92	98.61	99.93
C3	91.2	89.71	98.08	87.06	97.51	98.63	98.88	98.45	99.34
C4	89.16	88.18	92.54	91.67	89.02	99.46	99.85	99.94	99.94
C5	32.87	42.49	74.03	88.41	90.8	97.21	98.02	98.66	99.4
C6	84.67	86.3	94.78	86.4	87.2	100	99.92	99.92	99.92
C7	88.16	86.11	98.96	86.27	85.47	99.73	99.97	99.98	99.97
C8	63.85	66.5	81.43	83.16	83.18	97.15	94.05	98.87	98.91
C9	43.93	66.88	89.88	84.83	87.05	94.92	95.74	97.52	96.18
OA (%)	83.39	84.25	93.64	88.54	87.81	99.14	99.25	99.75	99.8
AA (%)	69.57	73.92	87.92	87.49	87.4	98.33	98.28	99.1	99.28
K × 100	84.89	88.89	93.32	90.16	90.99	99.53	99.08	99.41	99.49
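
For reference, the three reported metrics follow from a confusion matrix using their standard definitions. The sketch below assumes rows index the true class and columns the predicted class; the function name `classification_metrics` and the two-class example are illustrative and not taken from the paper.

```python
import numpy as np

def classification_metrics(conf):
    """Return (OA, AA, kappa) from a confusion matrix.

    conf[i, j] counts test pixels of true class i predicted as class j.
    """
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                      # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)     # accuracy of each true class
    aa = per_class.mean()                            # average accuracy
    # Chance agreement from row (true) and column (predicted) marginals.
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

# Toy two-class example: 50 + 40 correct out of 100 pixels.
oa, aa, kappa = classification_metrics([[50, 0], [10, 40]])
```

Because AA averages per-class accuracies without weighting by class size, a model can lead on OA while trailing on AA when its errors concentrate in small classes.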

Table 3. Quantitative results on the WHU-Hi-HanChuan dataset. The final model achieves the highest OA and Kappa, while CTDBNet attains a higher AA (96.79 vs. 94.36). Per-class performance is stable in structurally dominant categories and less uniform in spectrally ambiguous ones.

Class	RF	SVM	3D-CNN	ResNet	ViT	SSFTT	CTMixer	CTDBNet	UniHSFormer-X
C1	77.49	72.01	87.19	87.5	89.35	95.97	94.03	94.86	99.84
C2	72.21	49.46	94.85	78.09	76.23	98.2	94.18	98.26	98.82
C3	39.49	72.36	88.44	87.16	89.86	95.14	93.49	95.25	97.24
C4	86.35	92.3	93.45	90.18	92.09	94.62	96.66	97.35	98.58
C5	14	80.09	64.18	90.48	83.61	82.84	97.2	95.32	92.25
C6	23.1	47.48	75.65	86.39	82.9	84.03	90.57	96.6	88.94
C7	49.06	87.48	75.76	87.1	86.86	94.64	96.16	97.13	95.32
C8	35.55	60.27	90.41	78.69	77.48	94.97	93.51	94.22	95.71
C9	87.28	62.44	88.5	86.38	89.21	94	95.75	98.18	95.9
C10	83.19	93.5	93.74	90.53	89.49	95.34	96.82	97.48	98.55
C11	48.46	93.88	89.1	91.75	86.63	97.68	93.75	96.26	98.46
C12	26.42	61.42	75.16	84.69	83.83	79.26	94.87	96.61	91.03
C13	69.21	56.41	83.6	84.2	94.42	85.93	87.87	93.89	87.32
C14	94.63	64.35	94.11	84.91	86.21	97.6	98.19	98.16	93.99
C15	37.87	72.06	68.53	88.85	86.57	96.64	97.53	97.51	92.96
C16	91.47	92.68	97.33	91.13	92.4	98.55	96.33	97.58	99.91
OA (%)	73.67	76.35	91.23	87.17	87.82	96.91	94.98	96.64	97.5
AA (%)	58.49	72.39	85	86.75	86.7	93.03	94.81	96.79	94.36
K × 100	71.12	74.52	89.78	88.23	85.12	93.04	94.23	94.89	95.23

Table 4. Quantitative results on WHU-Hi-HongHu. While overall performance improves across all methods, UniHSFormer-X maintains a notably consistent per-class accuracy across both dominant and minor classes, highlighting its robustness in high-density, multi-class environments.

Class	RF	SVM	3D-CNN	ResNet	ViT	SSFTT	CTMixer	CTDBNet	UniHSFormer-X
C1	91	83.14	94.18	97.01	96.74	95.17	97.81	98.61	98.34
C2	49.7	96.13	86.5	78.95	99.26	82.5	89.77	94.65	95.96
C3	97.24	73.23	97.11	98.08	98.6	91.18	93.45	95.62	98.85
C4	95.13	77.95	98.39	98.57	98.92	99.36	97.01	98.5	98.97
C5	22.56	77.32	80.14	51.09	72.08	84.17	91.83	98.65	98.24
C6	23.23	81.85	77.55	78.66	87.08	99.39	99	98.5	98.69
C7	47.56	59.65	93.67	89.3	94.34	90.59	88.65	95.73	98.52
C8	14.35	41.85	40.14	39.35	63.12	88.88	87.72	91.49	89.92
C9	81.84	91.96	99.02	98.96	99.1	95.95	97.32	99.93	98.69
C10	29.87	53.83	58.49	76.33	86.38	93.22	92.57	98.16	98.58
C11	14.16	47.45	84.7	82.25	83.08	89.51	91.02	97.64	97.36
C12	14.26	61.88	66.69	38	48.92	89.27	86.85	98.06	97.73
C13	21.45	50.32	31.16	30	38.47	89.86	89.38	98.36	97.86
C14	57.52	64.84	59.89	93.04	93.7	98.15	96.8	98.63	98.33
C15	9.76	86.7	77.86	98.25	94.58	90.68	96.68	97.21	95.32
C16	79.66	76.97	95.71	99.22	97.31	98.4	96.36	95.68	98.53
C17	57.1	69.89	92.82	84.3	99.34	81.32	91.87	98.41	97.28
C18	18.21	80.16	67.12	57.61	62.68	96.68	94.12	96.24	97.64
C19	48.59	68.57	53.17	71.11	62.3	94.4	94.37	98.23	96.82
C20	26.44	78.01	72.78	69.14	81.31	83.58	86.03	99.54	97.82
C21	16.2	75.26	48.81	55.46	87.11	85.77	65.76	92.09	94.28
C22	10.08	79.63	54.84	47.85	46.16	94.48	89.01	98.42	98.58
OA (%)	65.5	73.24	85.14	86.23	88.91	95.79	94.89	97.96	98.42
AA (%)	42.09	71.66	82.12	74.3	81.8	91.61	91.65	97.2	97.06
K × 100	53.32	70.12	87.23	85.52	89.42	94.53	95.21	97.51	98.02

Table 5. Cross-region average classification performance across WHU-Hi datasets. Table 5 reports the average classification results of each model across three benchmark datasets (LongKou, HanChuan, and HongHu). Metrics include Overall Accuracy (OA), Average Accuracy (AA), and Kappa Coefficient, each computed as the arithmetic mean across the three regions.

Model	Avg OA (%)	Avg AA (%)	Avg Kappa (%)
RF	74.19	56.72	69.44
SVM	77.95	72.66	77.84
3D-CNN	90	85.01	90.11
ResNet	87.31	82.85	87.97
ViT	88.18	85.3	88.51
SSFTT	97.28	94.32	95.7
CTMixer	96.37	94.91	96.17
CTDBNet	98.12	97.7	97.27
UniHSFormer-X	98.57	96.9	97.58
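
As a worked check of the averaging described in the Table 5 caption, the UniHSFormer-X row can be recomputed from the OA, AA, and K × 100 rows of Tables 2 to 4. The dictionary layout and the function name `cross_region_mean` are illustrative choices.

```python
# OA, AA, and Kappa x 100 for UniHSFormer-X, copied from Tables 2-4.
scores = {
    "LongKou":  {"OA": 99.80, "AA": 99.28, "Kappa": 99.49},
    "HanChuan": {"OA": 97.50, "AA": 94.36, "Kappa": 95.23},
    "HongHu":   {"OA": 98.42, "AA": 97.06, "Kappa": 98.02},
}

def cross_region_mean(scores):
    """Arithmetic mean of each metric over the regions, rounded to 2 places."""
    metrics = next(iter(scores.values())).keys()
    return {m: round(sum(r[m] for r in scores.values()) / len(scores), 2)
            for m in metrics}

averages = cross_region_mean(scores)
```

The result matches the UniHSFormer-X row of Table 5 (98.57, 96.9, 97.58). The mean is unweighted, so each region counts equally regardless of its number of labelled pixels.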

Table 6. Module configuration and corresponding OA (%) across three datasets. The ablation shows that model robustness comes from the synergy of the routing mechanism, projection alignment, and multi-scale supervision. In complex scenes, removing components leads to nonlinear performance degradation, with semantic routing and the Transformer backbone playing a key role in maintaining classification accuracy. Configuration IDs C1–C11 in this table denote ablation variants and are unrelated to the class IDs used in Tables 1–4.

ID	Semantic Routing	Prototype Projection	Multi-Scale Supervision	Transformer Backbone	Hierarchical Encoder Depth	LongKou OA	HanChuan OA	HongHu OA
C1	×	×	×	×	×	91.5	78.5	76.82
C2	✓	×	✓	✓	✓	99.2	95	95.22
C3	×	✓	✓	✓	✓	99	93.9	94.32
C4	✓	✓	×	✓	✓	99.3	95.9	96.62
C5	✓	✓	✓	×	✓	95.4	89.7	89.3
C6	✓	✓	✓	✓	×	96.9	92.4	92.72
C7	×	✓	✓	✓	×	96.1	88.8	88.62
C8	✓	×	✓	✓	×	95.9	89.9	89.52
C9	✓	✓	✓	✓	✓	99.8	97.5	98.42
C10	×	×	✓	✓	✓	94.8	90.1	90.62
C11	✓	✓	×	×	✓	92.5	87.3	86.47
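
To make the ablation easier to read, the OA lost when a single component is removed can be computed against the full configuration (row C9). The sketch below copies the single-ablation rows C2 to C6 from Table 6; the variable names and the choice to report the drop in percentage points are illustrative.

```python
# OA (%) of the full model, row C9 of Table 6.
full = {"LongKou": 99.8, "HanChuan": 97.5, "HongHu": 98.42}

# Rows C2-C6: exactly one component removed, keyed by that component.
single_ablation = {
    "prototype projection":       {"LongKou": 99.2, "HanChuan": 95.0, "HongHu": 95.22},  # C2
    "semantic routing":           {"LongKou": 99.0, "HanChuan": 93.9, "HongHu": 94.32},  # C3
    "multi-scale supervision":    {"LongKou": 99.3, "HanChuan": 95.9, "HongHu": 96.62},  # C4
    "transformer backbone":       {"LongKou": 95.4, "HanChuan": 89.7, "HongHu": 89.3},   # C5
    "hierarchical encoder depth": {"LongKou": 96.9, "HanChuan": 92.4, "HongHu": 92.72},  # C6
}

# OA drop in percentage points relative to the full model.
drops = {
    name: {ds: round(full[ds] - oa[ds], 2) for ds in full}
    for name, oa in single_ablation.items()
}
most_costly = max(drops, key=lambda name: drops[name]["HongHu"])
```

Under these numbers every removal costs less on LongKou than on HanChuan or HongHu, which is consistent with the caption's point that degradation grows with scene complexity.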

Table 7. Per-class accuracy comparison under structural variants across three benchmark datasets. The complete UniHSFormer-X model maintains high accuracy across all listed classes, whereas the degraded variants from Table 6, C7 (no semantic routing, no hierarchical encoder depth) and C8 (no prototype projection, no hierarchical encoder depth), show non-uniform performance drops. On LongKou, the spectrally ambiguous narrow-leaf soybean (class C5) and the morphologically irregular mixed weed (class C9) are among the most sensitive categories, losing between 4.43 and 7.93 percentage points.

Dataset	Class	C7	C8	UniHSFormer-X
LongKou	C2 Cotton	95.02	91.78	99.93
LongKou	C5 Narrow-leaf soybean	94.97	91.47	99.4
LongKou	C8 Roads and houses	94.11	91.9	98.91
LongKou	C9 Mixed weed	90.56	88.7	96.18
HanChuan	C2 Cowpea	93.76	92.35	98.82
HanChuan	C5 Water spinach	88.24	85.99	92.25
HanChuan	C9 Grass	89.89	89.71	95.9
HanChuan	C12 Plastic	85.04	82.29	91.03
HongHu	C12 Brassica chinensis	93.41	91.07	97.73
HongHu	C14 Lactuca sativa	92.14	91.64	98.33
HongHu	C15 Celtuce	89.6	88.4	95.32
HongHu	C22 Tree	92.71	90.34	98.58

Table 8. Comparative Summary of Model Complexity, Hyperparameter Design, and Inference Efficiency.

Model	Hyperparameter Count	Theoretical Complexity	Inference Time (ms/Image)
RF	3 (n_estimators, max_depth, min_samples_leaf)	$O(n_{\mathrm{trees}} \cdot \mathrm{max\_depth})$	3.47
SVM	3 (kernel, C, gamma)	$O(n_{\mathrm{sv}} \cdot d)$	1.94
3D-CNN	5 (patch_size, conv_depth, kernel_size, lr, batch_size)	$O(\sum_{l} V_{l} \cdot k^{3} \cdot c_{in} \cdot c_{out})$	16.4
ResNet	5 (resnet_depth, patch_size, dropout, lr, batch_size)	$O(\sum_{i} H_{i} \cdot W_{i} \cdot k^{2} \cdot C_{in} \cdot C_{out})$	12.8
ViT	7 (patch_size, dim, depth, heads, mlp_ratio, dropout, lr)	$O(n^{2} \cdot d)$	18.2
SSFTT	7 (spectral_dim, token_dim, depth, heads, routing_depth, dropout, lr)	$O(n^{2} \cdot d)$	25.1
CTMixer	6 (token_dim, mixer_depth, patch_size, dropout, mlp_ratio, lr)	$O(n \cdot d \cdot k)^{*}$	19.1
CTDBNet	7 (token_dim, head_count, attention_depth, dropout, lr, decoder_depth, fusion_ratio)	$O(n^{2} \cdot d)$	20.4
UniHSFormer-X	9 (spectral_dim, token_dim, routing_depth, routing_heads, global_depth, channel_ratio, dropout, lr, batch_size)	$O(n \cdot P \cdot d)$	20.3
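
The difference between the $O(n^{2} \cdot d)$ and $O(n \cdot P \cdot d)$ entries in Table 8 can be illustrated with a leading-term count of multiply-accumulate operations. The values of `n`, `P`, and `d` below are hypothetical, chosen only to show the ratio; constants, projection layers, and any global attention layers in the actual model are ignored.

```python
def attention_cost(n, d):
    """Leading-term cost of token-to-token scores, O(n^2 * d)."""
    return n * n * d

def prototype_routing_cost(n, P, d):
    """Leading-term cost of token-to-prototype scores, O(n * P * d)."""
    return n * P * d

# Hypothetical sizes: 13 x 13 patch (169 tokens), 16 prototypes, 64-dim tokens.
n, P, d = 169, 16, 64
ratio = attention_cost(n, d) / prototype_routing_cost(n, P, d)  # equals n / P
```

The ratio reduces to `n / P`, so the saving from routing against prototypes grows with the number of tokens as long as the prototype count stays small relative to it.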

Table 9. Training and testing time, FLOPs, and parameter count of compared models.

Model	Train (s) LK	Test (s) LK	Train (s) HC	Test (s) HC	Train (s) HH	Test (s) HH	FLOPs (M)	Params (M)
RF	–	3.47	–	3.84	–	4.21	–	0.012
SVM	–	1.94	–	2.03	–	2.17	–	0.003
3D-CNN	44.4	53.6	58.9	65	60.8	67.3	1728.97	0.16
ResNet	26.3	22.5	30.2	25.3	33.6	27.9	1104.21	0.72
ViT	24.8	22.2	27.9	24.5	30.1	26.7	1462.44	0.91
SSFTT	9.2	9.2	8.7	19.3	5.6	15.8	781.38	0.16
CTMixer	20.4	20.9	23	21.9	24.6	23.1	1024.53	0.59
CTDBNet	21.7	21.4	24.6	23.3	27.4	24.7	1482.7	0.84
UniHSFormer-X	18.7	18.3	20.4	19.8	22.4	21.1	962.84	0.48

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Du, Z.; Liu, S.; Liao, Y.; Tang, Y.; Liu, Y.; Xing, H.; Zhang, Z.; Zhang, D. UniHSFormer X for Hyperspectral Crop Classification with Prototype-Routed Semantic Structuring. Agriculture 2025, 15, 1427. https://doi.org/10.3390/agriculture15131427

