1. Introduction
With the acceleration of global urbanization and the continuous aggravation of ecological and environmental pressures, accurate land-use and land-cover (LULC) classification has become an indispensable technical basis for urban planning, disaster response, and decision-making on sustainable development [1,2,3,4,5]. Traditional remote sensing methods rely mainly on single-sensor data (e.g., hyperspectral images (HSI) or multispectral images), with classifiers such as support vector machines (SVMs), random forests, and morphological profiles (MPs) [1] used to learn the relevant features. However, in complex surface scenes these methods encounter well-known bottlenecks such as "the same object exhibiting different spectra" and "different objects exhibiting similar spectra," which limit further improvement of classification accuracy.
In recent years, deep learning has significantly improved classification performance. One-, two-, and three-dimensional CNN architectures have achieved good results by jointly extracting spectral and spatial features [6]. Zhou et al. [7] proposed a shallow convolutional neural network (two convolutional layers and two fully connected layers) that significantly outperforms traditional methods; complex-valued convolutional networks [8] and 3D convolutional architectures [9] further improved performance, but their ability to model long-range dependencies remains limited. Recurrent neural networks (RNNs) have certain advantages in modeling spectral sequences [9], but their complex training mechanism limits their wide application. The seminal review by Ahmad [4] provides a critical analysis of spectral band confusion mechanisms.
The introduction of the Transformer brought new breakthroughs in remote sensing classification: SpectralFormer [10] models spectral information from neighboring bands, which effectively improves classification performance but at the cost of a large parameter count; GLT-Net [11] introduces a global–local attention mechanism to handle long-range dependencies; and LIIT [12] improves the fusion of HSI and LiDAR data through local information interaction.
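To make the neighboring-band idea concrete, the following minimal sketch tokenizes a pixel's spectrum by sliding a window over adjacent bands before linear embedding, in the spirit of SpectralFormer's group-wise spectral embedding [10]. The class name, group size, and shared projection are illustrative assumptions, not the published implementation.

```python
# Sketch: group neighboring spectral bands into overlapping tokens (assumed
# simplification of SpectralFormer-style spectral tokenization, not its code).
import torch
import torch.nn as nn

class GroupSpectralEmbedding(nn.Module):
    def __init__(self, group_size: int, d_model: int):
        super().__init__()
        self.group_size = group_size
        # One linear projection shared across all band groups.
        self.proj = nn.Linear(group_size, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_bands) spectral vector per pixel.
        # unfold extracts sliding groups of neighboring bands.
        groups = x.unfold(dimension=1, size=self.group_size, step=1)
        # groups: (batch, n_tokens, group_size) -> (batch, n_tokens, d_model)
        return self.proj(groups)

tokens = GroupSpectralEmbedding(group_size=8, d_model=64)(torch.randn(2, 144))
print(tokens.shape)  # torch.Size([2, 137, 64])
```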
With the increasing abundance of multimodal remote sensing data, the fusion of multi-source information has become an important direction for improving classification performance. Synthetic aperture radar (SAR) extracts structural features by analyzing the amplitude and phase of signals reflected from the ground surface; LiDAR provides three-dimensional information by measuring surface and target heights with high accuracy [13,14,15]; and multispectral sensors observe ground objects through their reflectance in different wavelength bands, characterizing physical attributes by constructing spectral indices [16,17,18,19,20,21,22,23,24,25].
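As one concrete illustration of how such spectral indices encode physical attributes, the short sketch below computes the widely used normalized difference vegetation index (NDVI) from near-infrared and red reflectance. This is a standard definition rather than anything specific to the cited works; the function name and epsilon guard are our own.

```python
# NDVI = (NIR - Red) / (NIR + Red); values near 1 indicate dense vegetation.
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # eps avoids division by zero over dark pixels (an assumption of this sketch).
    return (nir - red) / (nir + red + eps)

print(ndvi(np.array([0.5]), np.array([0.1])))  # [0.6666...] -> vegetated pixel
```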
The fusion of heterogeneous data from multiple sources achieves information complementarity and provides technical support for the comprehensive characterization and high-precision classification of ground objects. Among early fusion methods, MP/AP-style models [26] manually construct feature combinations to improve boundary separability but easily generate redundant information; the RF model [27] realizes multimodal voting decisions but relies on hand-crafted rules; Cao et al. [28] used concatenation/pooling fusion, which suffers from feature conflicts; Guo et al. [29] implemented POI fusion with density maps but ignored spatial distribution patterns; Li et al. [30] modeled the interaction between SAR phase and optical texture via covariance matrices, introducing bilinear pooling to improve robustness in complex scenes; Kang et al. [31] proposed a cross-gate fusion module that balances multimodal contributions with shared gating weights; and Li et al. [32] designed a collaborative attention gating unit to effectively model long-range dependencies.
Attention mechanisms are increasingly used in multimodal remote sensing classification. Wang et al. [33] constructed a multilevel attention model but did not explicitly model cross-modal associations; Xu et al. [34] introduced SENet to adjust channel weights and enhance feature complementarity; and Liu et al. [35] proposed a pyramid attention mechanism to optimize multiscale feature alignment.
Although the above methods significantly improve classification performance, three core challenges remain: (1) traditional CNNs and ViTs encounter performance saturation and struggle to accurately classify spectrally ambiguous and structurally distinctive classes that are crucial for urban planning and environmental monitoring; (2) a heterogeneous feature fusion bottleneck hampers the effective integration of spectral and spatial information across modalities, and for cross-modal data such as LiDAR and optical imagery the absence of robust fusion mechanisms leads to suboptimal performance; and (3) increased model parameters raise computational overhead, limiting adaptability to edge computing devices.
The goal of this work is to improve the heterogeneous feature fusion capability for land imagery while reducing computational overhead. To this end, we design MixtureRS, a Mixture-of-Experts network for remote sensing land classification. Our main contributions are as follows: (1) We propose a sparse Mixture-of-Experts (MoE) land classification network that improves convergence speed by 40% through a Top-k routing mechanism, reduces test error by 7.2% through adaptive depth regularization, and realizes expert-specialized characterization of spectral–spatial features. (2) We construct a lightweight multimodal fusion framework that combines heterogeneous convolution with a channel-split tokenization strategy, models the complementarity of LiDAR and optical data through cross-modal attention, and decouples the MoE parameter count from computational cost to meet the demands of real-time onboard/UAV processing; a simplified sketch of these two fusion ideas follows.
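The sketch below is our own simplified rendering of the two fusion components named above, not the MixtureRS implementation: the shapes, single-head attention, and mean-pooling choice are illustrative assumptions.

```python
# Sketch: channel-split tokenization + cross-modal attention between
# hyperspectral (HSI) and LiDAR feature maps (illustrative only).
import torch
import torch.nn as nn

def channel_split_tokens(feat: torch.Tensor, n_tokens: int) -> torch.Tensor:
    """Split channels into n_tokens groups; pool each group into one token."""
    b, c, h, w = feat.shape
    groups = feat.reshape(b, n_tokens, c // n_tokens, h * w)
    return groups.mean(dim=-1)          # (b, n_tokens, c // n_tokens)

class CrossModalAttention(nn.Module):
    """HSI tokens query LiDAR tokens (single head kept for brevity)."""
    def __init__(self, d: int):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.scale = d ** -0.5

    def forward(self, hsi_tok: torch.Tensor, lidar_tok: torch.Tensor) -> torch.Tensor:
        attn = torch.softmax(
            self.q(hsi_tok) @ self.k(lidar_tok).transpose(-2, -1) * self.scale,
            dim=-1)
        return hsi_tok + attn @ self.v(lidar_tok)   # residual fusion

hsi = channel_split_tokens(torch.randn(2, 128, 9, 9), n_tokens=16)    # (2, 16, 8)
lidar = channel_split_tokens(torch.randn(2, 128, 9, 9), n_tokens=16)
fused = CrossModalAttention(d=8)(hsi, lidar)
print(fused.shape)  # torch.Size([2, 16, 8])
```

Pooling each channel group into a single token keeps the token count, and hence the attention cost, independent of patch size, which is consistent with the lightweight design goal stated above.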
4. Discussion
The empirical evidence presented above prompts three central questions: why does the MoE-augmented transformer generalize better, where does it still underperform, and how might future research extend these findings? We address each point in turn.
4.1. Why Does MoE Help?
From an optimization standpoint, the Top-k gating produces sparse, expert-wise gradients that reduce co-adaptation among feed-forward sub-modules. Such sparsity mitigates gradient interference, an issue particularly acute when distinct spectral–spatial patterns (e.g., grass vs. asphalt) co-exist within a mini-batch. Moreover, conditional computation implicitly regularizes depth: tokens routed to fewer than all experts traverse shallower effective subnetworks, acting as a form of adaptive DropPath that has been shown to curb overfitting. The large gains in confused categories (water vs. shadowed asphalt) support this theoretical lens, as the router can delegate shadow handling to an “illumination” expert while reserving another specialist for true water bodies with high near-infrared absorption.
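To make these routing mechanics concrete, the following minimal sketch implements Top-k gating over a pool of feed-forward experts. The renormalized softmax over the selected logits and all hyperparameters are assumptions for illustration, not the paper's exact design.

```python
# Sketch: Top-k expert routing. Each token updates only its top-k experts,
# so gradients flow through a sparse subset of feed-forward sub-modules.
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (n_tokens, d)
        logits = self.router(tokens)                     # (n, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)       # (n, k)
        weights = torch.softmax(weights, dim=-1)         # renormalize over top-k
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                            # (n, k) hits for expert e
            rows = mask.any(dim=-1)
            if rows.any():
                gate = (weights * mask).sum(dim=-1, keepdim=True)[rows]
                out[rows] += gate * expert(tokens[rows])
        return tokens + out                              # residual connection

moe = TopKMoE(d=64)
print(moe(torch.randn(100, 64)).shape)  # torch.Size([100, 64])
```

Because each token activates only k of the experts, the parameter count grows with the expert pool while per-token compute stays fixed, which is the parameter–compute decoupling discussed above.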
4.2. Failure Modes and Limitations
Despite overall success, MixtureRS underperforms RF on Healthy Grass. Visual inspection shows that these pixels are uniformly textured, leading the router to allocate minimal capacity while conventional decision trees still benefit from bagging many weak learners. Similarly, the standard deviation for Stressed Grass remains high (7.27%), reflecting sensitivity to seasonal phenology. These observations suggest that the current gating policy could be augmented with a curriculum mechanism that allocates more experts to ambiguous low-variance spectra.
4.3. Broader Implications
The proposed architecture exemplifies a trend toward conditional computation in remote sensing analytics. By dynamically modulating depth and width, the model adapts to local scene complexity, a property of paramount importance for large-scale, multi-sensor Earth observation pipelines where resource budgets fluctuate across orbital passes. Furthermore, the MoE paradigm opens the door to lifelong learning: new experts could be appended to accommodate novel land-cover categories without catastrophic forgetting.
4.4. Future Work
Three avenues appear promising. First, incorporating uncertainty-aware routing could further stabilize high-variance classes by deferring ambiguous tokens to ensembles of experts. Second, coupling the MoE router with graph-based spatial regularizers may suppress salt-and-pepper artifacts commonly observed in transformer outputs. Third, extending the framework to tri-modal settings (e.g., HSI + LiDAR + SAR) would test the scalability of conditional computation under even richer sensor fusion scenarios.
4.5. Concluding Remarks
In sum, the experimental study demonstrates that a carefully designed mixture-of-experts transformer not only eclipses conventional and convolutional counterparts but also advances the state of the art over homogeneous transformer baselines. The gains are most pronounced in spectrally ambiguous or structurally distinctive classes, validating the central premise that adaptive model capacity, informed by multimodal cues, is key to next-generation land-use and land-cover classification.
5. Conclusions
This study illustrates that a sparse Mixture-of-Experts (MoE) transformer, implemented in the MixtureRS framework, can significantly enhance multimodal land-use and land-cover classification beyond the capabilities of traditional convolutional and homogeneous vision transformer models. By combining hyperspectral imagery with LiDAR-derived height data, MixtureRS achieves an overall accuracy of 88.64%, an average accuracy of 90.23%, and a Cohen's kappa of 87.67, surpassing the strongest non-conditional baseline by more than 12 percentage points on key metrics. Notably, the approach yields substantial improvements in classifying spectrally ambiguous or structurally distinctive categories such as water, railways, and parking lots, which are critical for urban planning and environmental monitoring.
The analysis highlights four mechanistic advantages driven by conditional computation: (1) sparse expert activation via Top-k routing reduces gradient interference, promoting faster convergence; (2) adaptive depth regularizes the model akin to DropPath without stochastic instability; (3) expert specialization facilitates a disentangled representation space that effectively fuses heterogeneous modalities; and (4) the scalable architecture enables growth in parameters without significant computational overhead, supporting real-time deployment on spaceborne or airborne platforms.
However, limitations remain. MixtureRS underperforms traditional random forests for homogeneous grass surfaces, indicating a need for better capacity allocation for low-variance classes, perhaps via curriculum routing or ensemble techniques. The model’s sensitivity to phenological shifts and its memory footprint also pose challenges for edge deployment, especially on lightweight UAVs. Furthermore, the assumption of perfect co-registration between hyperspectral and LiDAR data may not hold in practical scenarios, potentially diminishing cross-attention performance.
Looking ahead, promising directions include integrating uncertainty-aware gating to adaptively allocate expert capacity to uncertain tokens, applying graph-based spatial regularizers to reduce noise artifacts, and extending the framework to incorporate additional modalities such as SAR or ultra-high-resolution imagery. These advancements will further test the scalability and robustness of conditional computation in complex remote sensing applications.
In conclusion, this work substantiates that adaptive model capacity guided by multimodal cues is crucial for future remote sensing analytics. MixtureRS sets a new benchmark, providing a flexible and efficient architecture that effectively balances data complexity with computational constraints—marking a significant step toward more intelligent and scalable Earth observation systems.