Article

MixtureRS: A Mixture of Expert Network Based Remote Sensing Land Classification

1 Department of Information Science and Technology, Ocean University of China, Qingdao 266100, China
2 School of Artificial Intelligence, Shenzhen Polytechnic University, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(14), 2494; https://doi.org/10.3390/rs17142494
Submission received: 20 May 2025 / Revised: 7 July 2025 / Accepted: 14 July 2025 / Published: 17 July 2025

Abstract

Accurate land-use classification is critical for urban planning and environmental monitoring, yet effectively integrating heterogeneous data sources such as hyperspectral imagery and laser radar (LiDAR) remains challenging. To address this, we propose MixtureRS, a compact multimodal network that effectively integrates hyperspectral imagery and LiDAR data for land-use classification. Our approach employs a 3-D plus heterogeneous convolutional stack to extract rich spectral–spatial features, which are then tokenized and fused via a cross-modality transformer. To enhance model capacity without incurring significant computational overhead, we replace conventional dense feed-forward blocks with a sparse Mixture-of-Experts (MoE) layer that selectively activates the most relevant experts for each token. Evaluated on a 15-class urban benchmark, MixtureRS achieves an overall accuracy of 88.6%, an average accuracy of 90.2%, and a Kappa coefficient of 0.877, outperforming the best homogeneous transformer by over 12 percentage points. Notably, the largest improvements are observed in water, railway, and parking categories, highlighting the advantages of incorporating height information and conditional computation. These results demonstrate that conditional, expert-guided fusion is a promising and efficient strategy for advancing multimodal remote sensing models.

Graphical Abstract

1. Introduction

With the acceleration of global urbanization and the continuous aggravation of ecological and environmental pressures, accurate land-use and land-cover (LULC) classification has become an indispensable technical basis for urban planning, disaster response, and decision-making on sustainable development [1,2,3,4,5]. Traditional remote sensing methods mainly rely on single-sensor data (e.g., hyperspectral images (HSI) or multispectral images), and methods such as support vector machines (SVM), Random Forest (RF), and morphological profiles (MP) [1] are often used to learn the relevant features. However, when facing complex surface scenes, these methods often encounter technical bottlenecks such as "the same object exhibiting different spectral features" or "different objects exhibiting similar spectral characteristics," which limit further improvement of classification accuracy.
In recent years, the development of deep learning has significantly improved classification performance. One-, two-, and three-dimensional CNN architectures have achieved good results by combining spectral and spatial feature extraction [6]. Zhou et al. [7] proposed a shallow convolutional neural network (consisting of two convolutional layers and two fully connected layers) that significantly outperforms traditional methods, followed by a complex-valued convolutional neural network [8] and a 3-D convolutional architecture [9] that further improved performance; however, the ability of these models to capture long-range dependencies remains limited. Recurrent neural networks (RNNs) have certain advantages in modeling spectral sequences [9], but their complex training mechanism limits their wide application. The review by Ahmad et al. [4] provides a critical analysis of spectral band confusion mechanisms.
The introduction of the Transformer has brought new breakthroughs to remote sensing classification: SpectralFormer [10] models spectral information from neighbouring bands, which effectively improves classification performance but requires a large number of parameters; GLT-Net [11] introduces a global–local attention mechanism to capture long-range dependencies; and LIIT [12] improves the fusion of HSI and LiDAR data through local information interaction.
With the increasing abundance of multimodal remote sensing data, fusion of multi-source information has become an important direction to improve classification performance. Synthetic aperture radar (SAR) extracts structural features by analysing the amplitude and phase of the reflected signals from the ground surface; LiDAR provides three-dimensional information by measuring the ground surface and the target height with high accuracy [13,14,15]; and multispectral sensors observe the features of the ground based on the information of the reflectance of different wavelength bands, and reflect the physical attributes of the features by constructing spectral indices [16,17,18,19,20,21,22,23,24,25].
The fusion of heterogeneous data from multiple sources achieves information complementarity and provides technical support for comprehensive characterization and high-precision classification of land-cover features. Among early fusion methods, MP/AP-based models [26] manually constructed feature combinations to improve boundary separability but tend to generate redundant information; the RF model [27] realized multimodal voting decisions but relies on hand-crafted rules; Cao et al. [28] used concatenation/pooling fusion, which suffers from feature conflict; Guo et al. [29] implemented POI fusion with density maps but ignored the spatial distribution pattern; Li et al. [30] modeled the interaction between SAR phase and optical texture with a covariance matrix and introduced bilinear pooling to improve robustness in complex scenes; Kang et al. [31] proposed a cross-gate fusion module to balance multimodal contributions with shared gating weights; and Li et al. [32] designed a collaborative attention gating unit to effectively model long-range dependencies.
Attention mechanisms are increasingly used in multimodal remote sensing classification. Wang et al. [33] constructed a multilevel attention model but did not explicitly model cross-modal associations; Xu et al. [34] introduced SENet to adjust the channel weights to enhance feature complementarity; Liu et al. [35] proposed a pyramid attention mechanism to optimize multiscale feature alignment.
Although the above methods significantly improve classification performance, three core challenges remain in existing research: (1) Traditional CNNs and ViTs encounter performance saturation and struggle to accurately classify spectrally ambiguous and structurally distinctive classes that are crucial for urban planning and environmental monitoring; (2) The heterogeneous feature fusion bottleneck hampers effective integration of spectral and spatial information across modalities; in particular, for cross-modal data such as LiDAR and optical imagery, the absence of robust fusion mechanisms leads to suboptimal performance; and (3) Increased model parameters lead to higher computational overhead, limiting adaptability to edge computing devices.
The goal of this work is to improve the heterogeneous feature fusion capability for land imagery while reducing the computational overhead. To this end, we design MixtureRS, a Mixture-of-Experts network for remote sensing land classification. Our contributions are twofold: (1) We propose a sparse Mixture-of-Experts (MoE) based land classification network that improves convergence speed by 40% through a Top-k routing mechanism, reduces the test error by 7.2% using adaptive depth regularization, and realizes expert-specialized characterization of spectral–spatial features. (2) We construct a lightweight multimodal fusion framework that combines heterogeneous convolution with a channel-split tokenization strategy, models the complementary characteristics of LiDAR and optical data through cross-modal attention, and decouples the MoE parameter count from the computational cost to meet the demands of real-time onboard/UAV processing.

2. Materials and Methods

2.1. Dataset Description

In this section, we use three datasets, Houston (UH), Trento, and MUUFL, to demonstrate the effectiveness of our models. The models use HSI together with the associated multimodal data, LiDAR (Light Detection and Ranging) and MS (multispectral imagery). The three datasets are described in detail below. (1) Houston Dataset: The Houston dataset, jointly released by the University of Houston and multiple research institutions, covers typical urban landscapes across the university campus and surrounding areas in Texas, USA. It comprises three core data types: hyperspectral images (HSI), LiDAR point clouds, and spectral response curves, encompassing 15 land-cover categories (e.g., buildings, roads, vegetation, and water bodies). All images have a resolution of 340 × 1905 pixels, with the HSI containing 144 spectral bands and the multispectral (MS) image featuring 8 bands. The dataset serves as a standardized benchmark for remote sensing research. Figure 1 presents a remote sensing case study of the University of Houston campus and surrounding areas in Texas, USA, showcasing the fusion of hyperspectral imagery (HSI), LiDAR, and multispectral imagery (MS). The detailed class distribution and sample counts of the Houston dataset are summarized in Table 1.
(2) Trento Dataset: The Trento dataset is a benchmark in remote sensing hyperspectral image classification, containing hyperspectral imagery (HSI) and LiDAR-derived digital surface models (DSMs), and is commonly used in multimodal remote sensing classification studies. The data were collected in the Trento region of Northern Italy, covering a typical mixed agro-forest landscape, with a spatial resolution of 1 m and an image extent of 166 × 600 pixels. The hyperspectral data were acquired by the AISA Eagle sensor with 63 spectral bands (0.40–0.98 µm) for fine-grained feature classification, and the topographic data were acquired by the Optech ALTM 3100EA LiDAR system to improve classification accuracy. Figure 2 depicts an agricultural–forestry mixed landscape case study in Trento, Northern Italy, focusing on fragmented terrain features. The detailed class distribution and sample counts of the Trento dataset are summarized in Table 2.
(3) MUUFL Dataset: The MUUFL dataset was acquired by the University of Mississippi using the Reflectance Optical System for Imaging Spectroscopy (ROSIS) sensor, focusing on fine classification of complex forest and urban land cover. The HSI data have spatial dimensions of 325 × 220 pixels with 72 original spectral bands, and the LiDAR data contain elevation information in two rasters. The dataset covers 11 land-cover types with a total of 53,687 annotated pixels. Figure 3 illustrates a complex forest and urban land-cover case study collected at the University of Mississippi, integrating hyperspectral data (72 bands) with LiDAR-derived elevation information. The detailed class distribution and sample counts of the MUUFL dataset are summarized in Table 3.

2.2. Experimental Setup

2.2.1. Hyperparameter Settings

Our experiments were conducted on a single NVIDIA A100 GPU with 40,536 MiB of VRAM. For model training and evaluation, we employed batch sizes of 64 and 500, respectively. Input data consisted of 11 × 11 × B patches extracted from the hyperspectral imagery (HSI) and 11 × 11 × C patches from the auxiliary multimodal sources, where B and C denote the respective numbers of bands. The model was optimized using the Adam optimizer with a learning rate of 5 × 10⁻⁴ and a weight decay of 5 × 10⁻³. All comparative methods employed the identical batch size and learning rate to ensure equitable experimental conditions.
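For concreteness, the following is a minimal sketch of this training configuration, assuming a PyTorch implementation; the stand-in model and the random patch tensors are purely illustrative placeholders, not the actual MixtureRS code.

```python
import torch

# Stand-in module; the real MixtureRS network is described in Section 2.4.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.LazyLinear(15))

# Training configuration reported above
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=5e-3)
train_batch_size, eval_batch_size = 64, 500

# Illustrative 11 x 11 x B HSI patches (B = 144 bands for Houston)
# and single-channel 11 x 11 LiDAR patches
hsi_patch = torch.randn(train_batch_size, 144, 11, 11)
lidar_patch = torch.randn(train_batch_size, 1, 11, 11)
```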

2.2.2. Evaluation Metrics

In this study, samples correctly matching the ground truth labels are classified as positive samples, whereas those with mismatched predictions are considered negative samples.
Classification performance was assessed using three standard metrics:
  • Overall Accuracy (OA): Percentage of correctly classified samples
    $\mathrm{OA} = \frac{TP + TN}{N} \times 100\%$
  • Average Accuracy (AA): Mean per-class accuracy
    $\mathrm{AA} = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i}{TP_i + FN_i} \times 100\%$
  • Cohen’s Kappa (K): Chance-corrected agreement
    $K = \frac{p_o - p_e}{1 - p_e}$
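As a companion to these definitions, a minimal sketch of how the three metrics can be computed from predicted and true label arrays is given below; the use of scikit-learn here is an implementation choice for illustration, not the authors' original code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def classification_metrics(y_true, y_pred):
    """Return OA (%), AA (%), and Cohen's kappa from label arrays."""
    cm = confusion_matrix(y_true, y_pred)
    oa = np.trace(cm) / cm.sum()                 # correctly classified / all samples
    per_class = np.diag(cm) / cm.sum(axis=1)     # per-class recall (producer's accuracy)
    aa = per_class.mean()
    kappa = cohen_kappa_score(y_true, y_pred)
    return 100 * oa, 100 * aa, kappa

# Toy example with dummy labels
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]
print(classification_metrics(y_true, y_pred))
```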

2.3. Problem Formulation

Our goal is to utilize hyperspectral imagery (HSI) and LiDAR data for land classification. Let $X_{\mathrm{HSI}}$ and $X_{\mathrm{LiDAR}}$ denote minibatches of co-registered HSI and LiDAR patches centred at target pixels, where B is the batch size, $C_{\mathrm{HSI}}$ and $C_{\mathrm{LiDAR}}$ are the band counts of HSI and LiDAR, and K is the number of semantic classes in all experiments.
The objective is to learn a mapping
$f_\theta : (X_{\mathrm{HSI}}, X_{\mathrm{LiDAR}}) \mapsto \hat{y} \in \Delta^{K}$
where $f_\theta$ is a parameterized mapping function that combines the HSI and LiDAR inputs to predict a probability distribution over the classification labels, K is the number of semantic classes, $\Delta^{K}$ is the probability simplex over the K classes, and $\hat{y}$ is the predicted class-probability vector. The network parameters are optimized by minimizing the categorical cross-entropy with one-hot ground-truth labels
$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}$
where $\mathcal{L}_{\mathrm{CE}}$ is the loss function and B is the number of samples. The logarithmic form in $\mathcal{L}_{\mathrm{CE}}$ provides strong gradients for misclassified samples, promoting faster error correction, while maintaining information-theoretic consistency as it measures the divergence between the predicted and true distributions. This makes it particularly effective for probabilistic classification tasks.
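A minimal sketch of this objective in PyTorch follows; `nn.CrossEntropyLoss` folds the softmax and the log term of $\mathcal{L}_{\mathrm{CE}}$ into one call, and the tensors here are dummy placeholders.

```python
import torch
import torch.nn as nn

B, K = 64, 15                        # batch size and number of classes
logits = torch.randn(B, K)           # un-normalized class scores from the network
labels = torch.randint(0, K, (B,))   # integer ground-truth labels (implicit one-hot)

# CrossEntropyLoss = softmax + negative log-likelihood, averaged over the batch
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)
print(float(loss))
```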

2.4. Proposed Method

An overview of the architecture, i.e., the main framework of MixtureRS, is depicted in Figure 4. The proposed framework employs a four-stage cascaded architecture: (1) Multimodal Feature Extraction—heterogeneous convolutional networks extract spectral–spatial features and elevation attributes; (2) Tokenization and Sequence Construction—feature maps are partitioned into patch tokens with positional encoding; (3) Cross-modal Feature Refinement—a transformer module integrating cross-attention and Mixture-of-Experts (MoE) layers enables inter-modal interaction and expert-specialized representations; and (4) Classification—an MLP classifier head generates the final land-cover probability distributions.

2.5. Spectral–Spatial Feature Extraction

(1) 3D Spectral Convolution: To achieve robust feature extraction from the raw hyperspectral images (HSI) and dimensionality reduction along the spectral bands, we leverage a CNN-based framework for joint spectral–spatial representation learning. The raw hyperspectral patch is first processed by a 3-D convolution layer, yielding
$F^{(1)} = \sigma\left(\mathrm{BN}\left(\mathrm{Conv3D}(X_{\mathrm{HSI}})\right)\right)$
The Conv3D operations use Kaiming (He) initialization with a ReLU nonlinearity; $\sigma$ denotes the ReLU activation and $\mathrm{BN}$ batch normalization.
(2) Heterogeneous Convolution (HetConv): To preserve heterogeneous feature representations while reducing computational complexity, we employ HetConv2D with a hybrid grouped convolution strategy, establishing a hierarchical feature refinement framework for extracting discriminative high-level features from hyperspectral data. Group-wise Convolution (ConvGWC) decomposes feature learning into parallel channel groups, enabling heterogeneous pattern extraction with reduced computational complexity. Point-wise Convolution (ConvPWC) employs 1 × 1 kernels to establish cross-channel correlations and achieve compact feature embedding, effectively balancing representational capacity and parameter efficiency. The group-wise branch captures local anisotropy, whereas the point-wise branch ensures channel mixing. Two convolutions of disparate receptive fields are summed:
$F^{(2)} = \mathrm{HetConv}(F^{(1)}) = \mathrm{Conv}_{\mathrm{GWC}}(F^{(1)}) + \mathrm{Conv}_{\mathrm{PWC}}(F^{(1)})$
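A minimal sketch of this heterogeneous convolution block in PyTorch is shown below; the kernel size, group count, and channel sizes are illustrative assumptions rather than the exact values used in the paper.

```python
import torch
import torch.nn as nn

class HetConv2D(nn.Module):
    """Sum of a group-wise 3x3 convolution and a point-wise 1x1 convolution (sketch)."""
    def __init__(self, in_ch, out_ch, groups=4):
        super().__init__()
        # Group-wise branch: local spatial patterns within channel groups (Conv_GWC)
        self.gwc = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, groups=groups)
        # Point-wise branch: cross-channel mixing with 1x1 kernels (Conv_PWC)
        self.pwc = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.gwc(x) + self.pwc(x)

# Example: features from the 3-D spectral convolution, reshaped to a 2-D map
f1 = torch.randn(8, 64, 11, 11)          # (batch, channels, height, width)
f2 = HetConv2D(in_ch=64, out_ch=64)(f1)  # F^(2): same spatial size, refined features
```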

2.6. LiDAR Tokenization

For LiDAR single-channel elevation and intensity data, we design shallow 2D convolutional encoders to achieve feature embedding:
$L = \mathrm{GELU}\left(\mathrm{BN}\left(\mathrm{Conv2D}(X_{\mathrm{LiDAR}})\right)\right)$
which is then reshaped into a token sequence. The LiDAR data processing branch employs dual learnable projection matrices to dynamically parameterize feature affinity estimation and semantic value aggregation, while preserving a global contextual token for cross-modal interaction.
$A = \mathrm{softmax}(L W_A), \qquad T = A (L W_V)$
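The following sketch illustrates this learnable tokenization in PyTorch; the flattening of spatial positions, the transpose used to keep the matrix shapes consistent, and the token count are assumptions made for the example.

```python
import torch
import torch.nn as nn

class Tokenizer(nn.Module):
    """Project a feature map into a small set of learned tokens (sketch)."""
    def __init__(self, channels, n_tokens):
        super().__init__()
        self.w_a = nn.Linear(channels, n_tokens, bias=False)  # affinity projection W_A
        self.w_v = nn.Linear(channels, channels, bias=False)  # value projection W_V

    def forward(self, feat):                        # feat: (B, C, H, W)
        l = feat.flatten(2).transpose(1, 2)         # (B, H*W, C) spatial positions
        a = torch.softmax(self.w_a(l), dim=1)       # (B, H*W, n_tokens) affinities A
        t = a.transpose(1, 2) @ self.w_v(l)         # (B, n_tokens, C) tokens T
        return t

lidar_feat = torch.randn(8, 64, 11, 11)                       # encoded LiDAR features
lidar_token = Tokenizer(channels=64, n_tokens=1)(lidar_feat)  # one global contextual token
```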

2.7. HSI Channel Tokenization

Analogously, the HSI features are flattened and projected to obtain four spectral tokens
$T_{\mathrm{HSI}} = A V, \quad \text{where} \quad A = \mathrm{softmax}\left(F^{(2)} W_A\right)$

2.8. Cross-Modality Transformer MOE Encoder

The concatenated token set
$Z_0 = \left[\, T ;\, T_{\mathrm{HSI}} \,\right] + P$
with learnt positional encodings, is passed through two identical transformer blocks. Each block comprises (i) multi-head cross-attention and (ii) a mixture-of-experts feed-forward network (Figure 5).
(1) Cross-Attention: The attention operator maps a query token (LiDAR) to key–value pairs (HSI):
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_h}}\right) V$
$\hat{X}^{HL} = [\, X_{cls}^{L} \,\|\, X_{patch}^{H} \,]$
$Q = X_{cls}^{L} W_q, \qquad K = \hat{X}^{HL} W_k, \qquad V = \hat{X}^{HL} W_v$
where $d_h$ is the per-head dimensionality. $X_{cls}^{L}$ is the multimodal class token derived from complementary data sources (e.g., LiDAR/SAR/DSM) via tokenization, replacing randomly initialized CLS tokens to inject elevation or structural features into the transformer encoder. $\hat{X}^{HL}$ is the fused token sequence formed by concatenating the multimodal token $X_{cls}^{L}$ and the HSI tokens $X_{patch}^{H}$, augmented with positional embeddings (PE) to preserve spatial relationships for cross-modal attention. Linear projections are applied separately within each head, and the results are concatenated and re-projected.
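The following is a minimal single-head sketch of this cross-attention step in PyTorch; the dimensions, the single-head simplification, and the random projection matrices are illustrative assumptions.

```python
import torch

def cross_attention(x_cls_l, x_patch_h, w_q, w_k, w_v):
    """Single-head cross-attention: the LiDAR class token queries the fused sequence."""
    x_hl = torch.cat([x_cls_l, x_patch_h], dim=1)   # fused token sequence X^HL
    q = x_cls_l @ w_q                               # query from the LiDAR class token
    k = x_hl @ w_k                                  # keys from the fused sequence
    v = x_hl @ w_v                                  # values from the fused sequence
    d_h = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d_h ** 0.5, dim=-1)
    return attn @ v                                 # refined class token, (B, 1, d_h)

B, n_hsi, d = 8, 4, 64
x_cls_l = torch.randn(B, 1, d)                      # tokenized LiDAR summary token
x_patch_h = torch.randn(B, n_hsi, d)                # four HSI spectral tokens
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = cross_attention(x_cls_l, x_patch_h, w_q, w_k, w_v)
```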
(2) Mixture-of-Experts Feed-Forward: Given hidden state h, the MoE layer computes
$G = \mathrm{TopK}_k\left(h W_g\right)$
$o = \sum_{e=1}^{E} G_e \odot \phi\left(h W_1^{(e)}\right) W_2^{(e)}$
where E is the number of experts per node, k is the router sparsity, $\phi$ is the ReLU activation, and $\odot$ denotes element-wise multiplication with the binary gating mask. Load-balancing losses are added but omitted here for conciseness. In the above equations, $h \in \mathbb{R}^{B \times N \times d}$ denotes the hidden representations of B samples, each comprising N tokens of dimension d. The matrix $W_g \in \mathbb{R}^{d \times E}$ is a routing projection whose columns parameterize the E expert routes, and $\mathrm{TopK}_k(\cdot)$ returns a binary gating mask $G \in \{0, 1\}^{B \times N \times E}$ that keeps the k largest logits per token while zeroing the rest. In the second line, $\phi(\cdot)$ is the element-wise activation function (ReLU in our experiments); each expert e possesses its own pair of weight matrices $W_1^{(e)} \in \mathbb{R}^{d \times h}$ and $W_2^{(e)} \in \mathbb{R}^{h \times d}$ that form a two-layer feed-forward network. The symbol $\odot$ denotes Hadamard (element-wise) multiplication, so $G_e \odot \phi(h W_1^{(e)})$ ensures that only the tokens routed to expert e contribute to its output. Finally, all expert outputs are summed to yield $o \in \mathbb{R}^{B \times N \times d}$, preserving the original tensor shape while allowing conditional computation across the E specialists.
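To make the routing concrete, here is a minimal PyTorch sketch of a sparse MoE feed-forward layer; for differentiability it softmax-weights the Top-k gates rather than using a strictly binary mask, and it evaluates every expert densely for clarity, both of which are simplifying assumptions relative to the formulation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Sparse Mixture-of-Experts feed-forward layer with Top-k routing (sketch)."""
    def __init__(self, d_model, d_hidden, n_experts=5, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # plays the role of W_g
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, h):                                  # h: (B, N, d_model)
        logits = self.router(h)                            # (B, N, E) routing logits
        topk_val, topk_idx = logits.topk(self.k, dim=-1)   # keep the k largest per token
        gates = torch.zeros_like(logits)
        gates.scatter_(-1, topk_idx, F.softmax(topk_val, dim=-1))
        out = torch.zeros_like(h)
        for e, expert in enumerate(self.experts):          # dense loop kept for readability
            out = out + gates[..., e:e + 1] * expert(h)
        return out, gates

h = torch.randn(8, 5, 64)                                  # (batch, tokens, d_model)
out, gates = MoEFeedForward(d_model=64, d_hidden=128)(h)
```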
To prevent expert collapse in Mixture-of-Experts models, we integrate a load-balancing loss with the standard routing mechanism. The joint optimization objective combines task loss and balancing regularization:
$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{task}} + \lambda \mathcal{L}_{\mathrm{balance}}$
where $\mathcal{L}_{\mathrm{balance}}$ is defined as the coefficient of variation across expert utilization rates with an alignment constraint:
$\mathcal{L}_{\mathrm{balance}} = \frac{\sigma_s}{\mu_s} + \alpha \sum_{i=1}^{N} P_i \cdot f_i$
Here, $P_i$ represents the mean routing probability for expert i across all tokens in a batch, while $f_i$ denotes the actual fraction of tokens assigned to expert i.
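A minimal sketch of this balancing term, continuing the MoE example above, is given below; estimating $P_i$ from the sparse gates and the values of $\alpha$ and $\lambda$ are illustrative assumptions.

```python
import torch

def load_balancing_loss(gates, alpha=0.01):
    """Coefficient-of-variation term plus the P_i * f_i alignment term (sketch)."""
    # gates: (B, N, E) routing weights; non-zero entries mark tokens sent to each expert
    p = gates.mean(dim=(0, 1))                    # approximate P_i per expert
    f = (gates > 0).float().mean(dim=(0, 1))      # f_i: fraction of tokens per expert
    cv = p.std() / (p.mean() + 1e-8)              # coefficient of variation, sigma_s / mu_s
    return cv + alpha * (p * f).sum()

# total loss = task loss + lambda * balance loss, as in the equation above
task_loss = torch.tensor(1.0)
total_loss = task_loss + 0.1 * load_balancing_loss(torch.rand(8, 5, 5))
```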

2.9. Output Layer

The class logits are produced by a linear projection of the first token (conventionally the LiDAR summary):
$\hat{y} = \mathrm{Softmax}\left(Z_{\mathrm{enc}}^{(0)} W_o + b_o\right)$
In this equation, the vector $\hat{y} \in \mathbb{R}^{B \times K}$ contains the predicted class-posterior probabilities for each of the B input samples over the K land-use classes. The term $Z_{\mathrm{enc}}^{(0)} \in \mathbb{R}^{B \times d}$ denotes the first classification token produced by the transformer encoder; its d-dimensional embedding summarizes the multimodal information of a sample. The weight matrix $W_o \in \mathbb{R}^{d \times K}$ projects this embedding into the class logit space, while $b_o \in \mathbb{R}^{K}$ is an additive bias that shifts the logits. Finally, the Softmax operator normalizes the logits across the K classes for each sample, yielding a valid probability distribution that satisfies $\sum_{k=1}^{K} \hat{y}_{ik} = 1$ for every $i \in \{1, \dots, B\}$.
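For completeness, a minimal sketch of this classification head follows; the tensor shapes are placeholders consistent with the notation above.

```python
import torch
import torch.nn as nn

B, n_tokens, d, K = 8, 6, 64, 15
z_enc = torch.randn(B, n_tokens, d)       # encoder output: (batch, tokens, d)
head = nn.Linear(d, K)                    # holds W_o and b_o
logits = head(z_enc[:, 0])                # project the first (LiDAR summary) token
y_hat = torch.softmax(logits, dim=-1)     # rows sum to 1 over the K classes
```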

3. Results

3.1. Comparison of Different Methods

Table 4, Table 5 and Table 6 report the class-specific recall, overall accuracy (OA), average accuracy (AA), and Cohen's κ for ten competitors across three methodological families—Conventional Classifiers, Classical Convolutional Networks, and Transformer Networks—as well as the proposed MixtureRS on the Houston Remote Sensing Dataset. The following paragraphs provide a detailed, category-by-category narration of the numerical evidence, followed by an aggregated view of the global metrics. We use the Adam optimizer with a batch size of 64, employing 5 experts and merging the predictions of the Top-2 experts.

3.1.1. Vegetated Surfaces (Classes 1–4)

For Healthy Grass (Class 1), all deep models exceed 80% recall; RF leads slightly at 82.81%, with MixtureRS close behind at 81.42%, showing no penalty on easy classes. In Stressed Grass (Class 2), MixtureRS matches SpectralFormer’s top recall (89.15%), demonstrating expert specialization preserves sensitivity to subtle spectral differences. For Synthetic Grass (Class 3), MixtureRS improves recall from 97.43% (ViT) to 98.22% while reducing variance, indicating more stable expert routing. On Trees (Class 4), MixtureRS achieves 95.99%, a 4.3 pp gain over SpectralFormer, highlighting the benefits of routing structural LiDAR features.

3.1.2. Hard Materials (Classes 5–9)

Notably, Water (Class 6) recall jumps from 73.89% (CNN3D) to 95.57% with MixtureRS, cutting the error rate by roughly a factor of six. The largest gain (+30.62 pp over SpectralFormer) occurs in Parking Lot 2 (Class 11), where MixtureRS effectively disentangles noisy asphalt spectra. For Residential and Commercial (Classes 7–8), MixtureRS outperforms ViT by 1.65–1.81 pp, handling intra-class variability well. Highway (Class 10) remains challenging, but MixtureRS improves recall to 61.71%, a 13.5% relative increase over the best baseline.

3.1.3. Man-Made Linear Structures (Classes 10–15)

Transformer-based models excel on elongated structures like Railway (Class 11) and Tennis Court (Class 14). MixtureRS boosts recall to 94.43% and nearly saturates tennis court recognition at 99.87%. Remarkably, Running Track (Class 15) achieves perfect recall (100%), indicating confident routing to experts specialized in circular patterns.

3.1.4. Overall Performance

Across all 15 classes, MixtureRS attains an OA of 88.64%, surpassing SpectralFormer by 12.29 pp. AA reaches 90.23%, confirming balanced improvements beyond dominant classes. Cohen’s k rises from 81.88 to 87.67, reflecting reduced chance agreement. Low standard deviations (<0.3 pp) over three runs demonstrate the robustness of the MoE gating.
Table 7, Table 8 and Table 9 reports the class-wise accuracies (%) of various models on the target dataset, including conventional classifiers, classical convolutional networks, and Transformer-based methods, with MixtureRS as the proposed approach.
From a categorical perspective, MixtureRS consistently achieves superior performance in challenging classes characterized by complex spectral–spatial features and high intra-class variability. For instance, in Class 2, MixtureRS attains 92.12%, significantly outperforming the best conventional classifier RF (74.03%) and RNN (81.93%). Similarly, in Class 6, MixtureRS reaches 86.53%, far exceeding CNN3D’s 2.86% and ViT’s 82.02%. In the most difficult Class 10, MixtureRS achieves 62.50%, nearly doubling the accuracy of ViT (31.99%) and greatly surpassing all other baselines.
For relatively easier classes such as Class 1 and Class 5, MixtureRS achieves competitive accuracies of 97.19% and 93.54%, respectively, closely matching or slightly below ViT (97.85% and 94.73%). This indicates that MixtureRS maintains strong generalization without compromising performance on simpler categories. Moreover, MixtureRS demonstrates robustness in classes with limited or ambiguous samples, such as Class 9, where it achieves 86.53%, substantially higher than ViT (57.83%) and RNN (60.54%).
Overall, MixtureRS attains an overall accuracy (OA) of 88.79% and an average accuracy (AA) of 75.84%, outperforming all classical convolutional networks and conventional classifiers. Its Kappa coefficient reaches 0.8518, indicating strong agreement with ground truth labels and reliable classification performance across diverse classes.
Table 10, Table 11 and Table 12 presents the classification accuracies (%) of different models on six background classes in the target dataset, including conventional classifiers (KNN, RF, SVM), classical convolutional networks (CNN1D, CNN2D, CNN3D, RNN), and Transformer-based networks (ViT, SpectralFormer, and our proposed MixtureRS).
Overall, MixtureRS achieves the best performance across key metrics: overall accuracy (OA) of 98.06%, average accuracy (AA) of 96.90%, and Kappa coefficient ( κ ) of 0.9740, significantly outperforming all compared methods and demonstrating superior classification capability and stability.
Regarding individual background classes:
For classes 3 and 6, where conventional methods and some deep models perform poorly, MixtureRS attains accuracies of 90.48% and 89.39%, respectively, markedly surpassing CNN3D (93.85% and 2.86%) and RF (70.94% and 72.63%), highlighting its advantage in handling complex background features. In classes 4 and 5, MixtureRS achieves high accuracies of 93.30% and 93.33%, respectively. Although slightly lower than some traditional methods (e.g., RF’s 99.73% in class 4), MixtureRS offers a more balanced overall performance. For classes 1 and 2, MixtureRS’s accuracies (92.39% and 92.74%) are marginally lower than certain traditional classifiers and convolutional networks but remain at a high level, ensuring stable classification results. MixtureRS, by integrating multimodal information and multi-scale features, significantly enhances recognition of complex background classes, exhibiting stronger generalization and robustness. Figure 6 shows the classification results from different data sources and machine learning models. Figure 7 presents the overall training performance metrics, including the accuracy and loss curves of the three datasets.
In summary, MixtureRS not only leads in overall metrics but also shows clear advantages in multiple challenging background classes, validating its effectiveness and advancement in hyperspectral image background classification tasks.

3.2. Ablation Study of Multimodal Fusion

To evaluate the contribution of each modality to the model’s performance, we conducted systematic ablation experiments on three datasets, assessing the classification impact of single-modality inputs and their various combinations. Specifically, we trained and tested models using only the HSI modality on all datasets and compared the results with those of the full multimodal fusion model. As shown in Figure 8, across all three datasets, the single-modality HSI model consistently underperforms the multimodal fusion model in terms of overall accuracy (OA), average accuracy (AA), and Kappa coefficient (k). This indicates that spectral information alone is insufficient to fully capture the complex and diverse characteristics of land cover. In contrast, incorporating LiDAR or other auxiliary modalities provides complementary spatial and textural information, significantly enhancing classification performance. Furthermore, the ablation results demonstrate that removing any modality leads to a performance decline, underscoring the critical role of each modality within the model. In summary, the ablation study confirms the necessity and effectiveness of multimodal fusion strategies in improving the precision and robustness of land-cover classification.

3.3. Ablation Study on MoE Layers

In this section, we investigate the architectural design and performance of Mixture-of-Experts (MoE) layers by replacing the conventional feed-forward network (FFN) in Transformer-based models with MoE structures and systematically evaluating the impact on model accuracy. The MoE layer consistently achieves higher OA, AA, and Kappa than the MLP (FFN) layer, indicating a consistent advantage of the proposed design. As illustrated in Figure 9, the classification results on the three datasets are visualized, demonstrating the effectiveness of the proposed method under various conditions.

3.4. Ablation Study on the Number of Experts (k)

This study systematically investigates the impact of Top-k expert selection in Mixture-of-Experts (MoE) models by varying k from 1 to 5 and quantifying its effect on classification accuracy. For the Trento dataset, a significant accuracy gain from k = 1 to k = 2 (+0.99%, p < 0.05) demonstrates the effectiveness of moderate sparsity (Top-2 experts). However, accuracy drops unexpectedly at k = 3 (−0.89%), likely due to feature interference from redundant experts or gating instability. Beyond k = 3, marginal gains diminish (+0.54% at k = 4, +0.78% at k = 5), indicating rapidly saturating benefits of additional experts. For the MUUFL dataset, accuracy improves significantly from k = 1 to k = 2 (+1.17%, p < 0.05) due to the enhanced model capacity afforded by moderate sparsity, with a smaller gain of +0.28% at k = 3. Performance plateaus beyond k = 3 (Δ at k = 4: −0.13%, Δ at k = 5: −0.19%), indicating redundancy effects. For the Houston dataset, the model reaches peak accuracy at k = 3 (90.23%), outperforming the baseline (k = 1, 88.07%) by 2.45% (p < 0.05), confirming the benefits of moderate sparsity (Top-3 experts). However, accuracy declines at k = 2 (−0.74%) and drops sharply at k = 4 (−3.96%), indicating redundancy or gating instability beyond the optimal threshold. Beyond k = 3, marginal gains diminish significantly, reflecting saturation effects. Figure 10, Figure 11 and Figure 12 present the corresponding results for the Houston, MUUFL, and Trento datasets, respectively.

4. Discussion

The empirical evidence presented above prompts three central questions: why does the MoE-augmented transformer generalize better, where does it still underperform, and how may future research extend these findings? We address each point in turn.

4.1. Why Does MoE Help?

From an optimization standpoint, the Top-k gating produces sparse, expert-wise gradients that reduce co-adaptation among feed-forward sub-modules. Such sparsity mitigates gradient interference, an issue particularly acute when distinct spectral–spatial patterns (e.g., grass vs. asphalt) co-exist within a mini-batch. Moreover, conditional computation implicitly regularizes depth: tokens routed to fewer than all experts traverse shallower effective subnetworks, acting as a form of adaptive DropPath that has been shown to curb overfitting. The large gains in confused categories (water vs. shadowed asphalt) support this theoretical lens, as the router can delegate shadow handling to an “illumination” expert while reserving another specialist for true water bodies with high near-infrared absorption.

4.2. Failure Modes and Limitations

Despite overall success, MixtureRS underperforms RF on Healthy Grass. Visual inspection shows that these pixels are uniformly textured, leading the router to allocate minimal capacity while conventional decision trees still benefit from bagging many weak learners. Similarly, the standard deviation for Stressed Grass remains high (7.27%), reflecting sensitivity to seasonal phenology. These observations suggest that the current gating policy could be augmented with a curriculum mechanism that allocates more experts to ambiguous low-variance spectra.

4.3. Broader Implications

The proposed architecture exemplifies a trend toward conditional computation in remote sensing analytics. By dynamically modulating depth and width, the model adapts to local scene complexity, a property of paramount importance for large-scale, multi-sensor Earth observation pipelines where resource budgets fluctuate across orbital passes. Furthermore, the MoE paradigm opens the door to lifelong learning: new experts could be appended to accommodate novel land-cover categories without catastrophic forgetting.

4.4. Future Work

Three avenues appear promising. First, incorporating uncertainty-aware routing could further stabilize high-variance classes by deferring ambiguous tokens to ensembles of experts. Second, coupling the MoE router with graph-based spatial regularizers may suppress salt-and-pepper artefacts commonly observed in transformer outputs. Third, extending the framework to tri-modal settings (e.g., HSI + LiDAR + SAR) would test the scalability of conditional computation under even richer sensor fusion scenarios.

4.5. Concluding Remarks

In sum, the experimental study demonstrates that a carefully designed mixture-of-experts transformer not only eclipses conventional and convolutional counterparts but also advances the state of the art over homogeneous transformer baselines. The gains are most pronounced in spectrally ambiguous or structurally distinctive classes, validating the central premise that adaptive model capacity, informed by multimodal cues, is key to next-generation land-use and land-cover classification.

5. Conclusions

This study illustrates that a sparse Mixture-of-Experts (MoE) transformer, implemented in the MixtureRS framework, can significantly enhance multimodal land-use and land-cover classification beyond the capabilities of traditional convolutional and homogeneous vision transformer models. Combining hyperspectral imagery with LiDAR-derived height data, MixtureRS achieved an overall accuracy of 88.64%, an average accuracy of 90.23%, and a Cohen’s Kappa of 87.67—surpassing the strongest non-conditional baseline by over 12 percentage points across key metrics. Notably, the approach yields substantial improvements in classifying spectrally ambiguous or structurally distinctive categories such as water, railway, and parking lots, which are critical for urban planning and environmental monitoring.
The analysis highlights four mechanistic advantages driven by conditional computation: (1) sparse expert activation via Top-k routing reduces gradient interference, promoting faster convergence; (2) adaptive depth regularizes the model akin to DropPath without stochastic instability; (3) expert specialization facilitates a disentangled representation space that effectively fuses heterogeneous modalities; and (4) the scalable architecture enables growth in parameters without significant computational overhead, supporting real-time deployment on spaceborne or airborne platforms.
However, limitations remain. MixtureRS underperforms traditional random forests for homogeneous grass surfaces, indicating a need for better capacity allocation for low-variance classes, perhaps via curriculum routing or ensemble techniques. The model’s sensitivity to phenological shifts and its memory footprint also pose challenges for edge deployment, especially on lightweight UAVs. Furthermore, the assumption of perfect co-registration between hyperspectral and LiDAR data may not hold in practical scenarios, potentially diminishing cross-attention performance.
Looking ahead, promising directions include integrating uncertainty-aware gating to adaptively allocate expert capacity to uncertain tokens, applying graph-based spatial regularizers to reduce noise artifacts, and extending the framework to incorporate additional modalities such as SAR or ultra-high-resolution imagery. These advancements will further test the scalability and robustness of conditional computation in complex remote sensing applications.
In conclusion, this work substantiates that adaptive model capacity guided by multimodal cues is crucial for future remote sensing analytics. MixtureRS sets a new benchmark, providing a flexible and efficient architecture that effectively balances data complexity with computational constraints—marking a significant step toward more intelligent and scalable Earth observation systems.

Author Contributions

Y.L.: conceptualization, methodology, software development, validation, formal analysis, investigation, and writing—original draft preparation; C.W.: software development, validation, investigation, and writing—original draft preparation; M.G.: methodology, investigation, writing—original draft preparation, and writing—review and editing; J.W.: writing—review and editing, supervision, project administration, and funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This work was jointly supported by the Guangdong Special Fund for Cultivating Scientific and Technological Innovation among College Students (No. pdjh2025ak373), the Fundamental Research Funds for the Central Universities (No. 202461010), the Guangdong Basic and Applied Basic Research Foundation (No. 2023A1515011273), the Specific Innovation Program of the Department of Education of Guangdong Province (No. 2023KTSCX315), the Shenzhen Polytechnic University Research Fund (No. 6025310064K), the Nature Science Foundation of Shenzhen City (No. RCBS20221008093252090), the Guangdong Basic and Applied Basic Research Foundation Project (No. 2025A1515011370), and the Open Research Fund Program of the MNR Key Laboratory for Geo-Environmental Monitoring of Great Bay Area (No. GEMLab-2023014).

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Dadashpoor, H.; Azizi, P.; Moghadasi, M. Land use change, urbanization, and change in landscape pattern in a metropolitan area. Sci. Total Environ. 2019, 655, 707–719. [Google Scholar] [CrossRef] [PubMed]
  2. Faisal, A.-A.; Kafy, A.-A.; Al Rakib, A.; Akter, K.S.; Jahir, D.M.A.; Sikdar, M.S.; Ashrafi, T.J.; Mallik, S.; Rahman, M.M. Assessing and predicting land use/land cover, land surface temperature and urban thermal field variance index using Landsat imagery for Dhaka Metropolitan area. Environ. Chall. 2021, 4, 100192. [Google Scholar] [CrossRef]
  3. Prasad, P.; Loveson, V.J.; Chandra, P.; Kotha, M. Evaluation and comparison of the earth observing sensors in land cover/land use studies using machine learning algorithms. Ecol. Inform. 2022, 68, 101522. [Google Scholar] [CrossRef]
  4. Ahmad, M.; Shabbir, S.; Roy, S.K.; Hong, D.; Wu, X.; Yao, J.; Khan, A.M.; Mazzara, M.; Distefano, S.; Chanussot, J. Hyperspectral Image Classification—Traditional to Deep Models: A Survey for Future Prospects. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2022, 15, 968–999. [Google Scholar] [CrossRef]
  5. Roy, S.K.; Kar, P.; Hong, D.; Wu, X.; Plaza, A.; Chanussot, J. Revisiting Deep Hyperspectral Feature Extraction Networks via Gradient Centralized Convolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19. [Google Scholar] [CrossRef]
  6. Zhou, Y.; Wang, H.; Xu, F.; Jin, Y.-Q. Polarimetric SAR Image Classification Using Deep Convolutional Neural Networks. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1935–1939. [Google Scholar] [CrossRef]
  7. Zhang, Z.; Wang, H.; Xu, F.; Jin, Y.-Q. Complex-Valued Convolutional Neural Network and Its Application in Polarimetric SAR Image Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 7177–7188. [Google Scholar] [CrossRef]
  8. Dong, H.; Zhang, L.; Zou, B. PolSAR Image Classification with Lightweight 3D Convolutional Networks. Remote Sens. 2020, 12, 396. [Google Scholar] [CrossRef]
  9. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking Hyperspectral Image Classification With Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  10. Ding, K.; Lu, T.; Fu, W.; Li, S.; Ma, F. Global–Local Transformer Network for HSI and LiDAR Data Joint Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  11. Zhang, Y.; Peng, Y.; Tu, B.; Liu, Y. Local Information Interaction Transformer for Hyperspectral and LiDAR Data Classification. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2023, 16, 1130–1143. [Google Scholar] [CrossRef]
  12. Bakuła, K.; Kupidura, P.; Jełowicki, Ł. Testing of Land Cover Classification from Multispectral Airborne Laser Scanning Data. Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. 2016, XLI-B7, 161–169. [Google Scholar] [CrossRef]
  13. Matikainen, L.; Karila, K.; Hyyppä, J.; Litkey, P.; Puttonen, E.; Ahokas, E. Object-based Analysis of Multispectral Airborne Laser Scanner Data for Land Cover Classification and Map Updating. ISPRS J. Photogramm. Remote Sens. 2017, 128, 298–313. [Google Scholar] [CrossRef]
  14. Shaker, A.; Yan, W.Y.; LaRocque, P.E. Automatic land-water classification using multispectral airborne LiDAR data for near-shore and river environments. ISPRS J. Photogramm. Remote Sens. 2019, 152, 94–108. [Google Scholar] [CrossRef]
  15. Curtis, P.G.; Slay, C.M.; Harris, N.L.; Tyukavina, A.; Hansen, M.C. Classifying drivers of global forest loss. Science 2018, 361, 1108–1111. [Google Scholar] [CrossRef] [PubMed]
  16. DeLancey, E.R.; Kariyeva, J.; Bried, J.T.; Hird, J.N. Large-scale probabilistic identification of boreal peatlands using Google Earth Engine, open-access satellite data, and machine learning. PLoS ONE 2019, 14, e0218165. [Google Scholar] [CrossRef] [PubMed]
  17. Ludwig, C.; Walli, A.; Schleicher, C.; Weichselbaum, J.; Riffler, M. A highly automated algorithm for wetland detection using multi-temporal optical satellite data. Remote Sens. Environ. 2019, 224, 333–351. [Google Scholar] [CrossRef]
  18. Calderón-Loor, M.; Hadjikakou, M.; Bryan, B.A. High-resolution wall-to-wall land-cover mapping and land change assessment for Australia from 1985 to 2015. Remote Sens. Environ. 2021, 252, 112148. [Google Scholar] [CrossRef]
  19. Masolele, R.N.; De Sy, V.; Herold, M.; Marcos, D.; Verbesselt, J.; Gieseke, F.; Mullissa, A.G.; Martius, C. Spatial and temporal deep learning methods for deriving land-use following deforestation: A pan-tropical case study using Landsat time series. Remote Sens. Environ. 2021, 264, 112600. [Google Scholar] [CrossRef]
  20. Nguyen, L.H.; Joshi, D.R.; Clay, D.E.; Henebry, G.M. Characterizing land cover/land use from multiple years of Landsat and MODIS time series: A novel approach using land surface phenology modeling and random forest classifier. Remote Sens. Environ. 2020, 238, 111017. [Google Scholar] [CrossRef]
  21. Lacerda Silva, A.; Salas Alves, D.; Pinheiro Ferreira, M. Landsat-based land use change assessment in the Brazilian Atlantic Forest: Forest transition and sugarcane expansion. Remote Sens. 2018, 10, 996. [Google Scholar] [CrossRef]
  22. Xu, P.; Tsendbazar, N.-E.; Herold, M.; Clevers, J.G.P.W.; Li, L. Improving the characterization of global aquatic land cover types using multi-source earth observation data. Remote Sens. Environ. 2022, 278, 113103. [Google Scholar] [CrossRef]
  23. Azedou, A.; Amine, A.; Kisekka, I.; Lahssini, S.; Bouziani, Y.; Moukrim, S. Enhancing land cover/land use (LCLU) classification through a comparative analysis of hyperparameters optimization approaches for deep neural network (DNN). Ecol. Inform. 2023, 78, 102333. [Google Scholar] [CrossRef]
  24. Ghamisi, P.; Benediktsson, J.A.; Phinn, S. Land-cover classification using both hyperspectral and LiDAR data. Int. J. Image Data Fusion 2015, 6, 189–215. [Google Scholar] [CrossRef]
  25. Dalla Mura, M.; Benediktsson, J.A.; Waske, B.; Bruzzone, L. Morphological attribute profiles for the analysis of very high resolution images. IEEE Trans. Geosci. Remote Sens. 2010, 48, 3747–3762. [Google Scholar] [CrossRef]
  26. Merentitis, A.; Debes, C.; Heremans, R.; Frangiadakis, N. Automatic fusion and classification of hyperspectral and LiDAR data using random forests. In Proceedings of the 2014 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Quebec City, QC, Canada, 13–18 July 2014; pp. 1245–1248. [Google Scholar] [CrossRef]
  27. Cao, R.; Tu, W.; Yang, C.; Li, Q.; Liu, J.; Zhu, J.; Zhang, Q.; Li, Q.; Qiu, G. Deep learning-based remote and social sensing data fusion for urban region function recognition. ISPRS J. Photogramm. Remote Sens. 2020, 163, 82–97. [Google Scholar] [CrossRef]
  28. Guo, Z.; Wen, J.; Xu, R. A Shape and Size Free-CNN for Urban Functional Zone Mapping With High-Resolution Satellite Images and POI Data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–17. [Google Scholar] [CrossRef]
  29. Li, X.; Lei, L.; Sun, Y.; Li, M.; Kuang, G. Multimodal Bilinear Fusion Network With Second-Order Attention-Based Channel Selection for Land Cover Classification. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2020, 13, 1011–1026. [Google Scholar] [CrossRef]
  30. Kang, W.; Xiang, Y.; Wang, F.; You, H. CFNet: A Cross Fusion Network for Joint Land Cover Classification Using Optical and SAR Images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2022, 15, 1562–1574. [Google Scholar] [CrossRef]
  31. Li, X.; Lei, L.; Sun, Y.; Li, M.; Kuang, G. Collaborative Attention-Based Heterogeneous Gated Fusion Network for Land Cover Classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 3829–3845. [Google Scholar] [CrossRef]
  32. Wang, Z.; Li, H.; Rajagopal, R. Urban2Vec: Incorporating Street View Imagery and POIs for Multi-Modal Urban Neighborhood Embedding. arXiv 2020, arXiv:2001.11101. [Google Scholar] [CrossRef]
  33. Xu, Z.; Zhu, J.; Geng, J.; Deng, X.; Jiang, W. Triplet Attention Feature Fusion Network for SAR and Optical Image Land Cover Classification. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Brussels, Belgium, 11–16 July 2021; pp. 4256–4259. [Google Scholar] [CrossRef]
  34. Li, K.; Yu, H.; Li, S.; Chen, S.; Wang, B. PFARN: Pyramid Fusion Attention and Refinement Network for Multiscale Ship Detection in SAR Images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2025, 18, 9821–9836. [Google Scholar] [CrossRef]
  35. Wang, L.; Liu, Z.; Zhang, Z. Scattering and Optical Cross-Modal Attention Distillation Framework for SAR Target Recognition. IEEE Sens. J. 2025, 25, 3126–3137. [Google Scholar] [CrossRef]
Figure 1. The case of Houston.
Figure 2. The case of Trento.
Figure 3. The cases of MUUFL.
Figure 4. The main framework of MixtureRS.
Figure 5. The main framework of Cross-Modality Transformer MOE Encoder.
Figure 6. Comparison of classification results from different data sources and machine learning models. Each subfigure presents results from one method or data type, including: (a) LiDAR, (b) HSI, (c) Ground Truth, (d) KNN, (e) CNN1D, (f) CNN2D, (g) CNN3D, (h) RF, (i) RNN, (j) SVM, (k) SpectralFormer, (l) ViT, (m) MixtureRS.
Figure 7. Overall training performance metrics.
Figure 8. Three classification datasets used in the experiments.
Figure 9. Three classification datasets used in the experiments.
Figure 10. Houston Dataset.
Figure 11. MUUFL Dataset.
Figure 12. Trento Dataset.
Table 1. The detail of Houston Dataset.
Land Cover | Train | Test | Land Cover | Train | Test
Background | 662,013 | 652,648 | Grass—healthy | 198 | 1053
Grass—stressed | 190 | 1064 | Grass—synthetic | 192 | 505
Tree | 188 | 1056 | Soil | 186 | 1056
Water | 182 | 143 | Residential | 196 | 1072
Commercial | 191 | 1053 | Road | 193 | 1059
Highway | 191 | 1036 | Railway | 181 | 1054
Parking-lot1 | 192 | 1041 | Parking-lot2 | 184 | 285
Tennis-court | 181 | 247 | Running-track | 187 | 473
Table 2. The detail of Trento Dataset.
Land Cover | Train | Test
Background | 98,781 | 70,205
Buildings | 125 | 2778
Woods | 154 | 8969
Roads | 122 | 3052
Apples | 129 | 3905
Ground | 105 | 374
Vineyard | 184 | 10,317
Table 3. The detail of MUUFL dataset.
Land Cover | Train | Test | Land Cover | Train | Test
Background | 68,817 | 20,496 | Buildings | 312 | 5928
Grass (Pure) | 214 | 4056 | Grass (Ground) | 344 | 6538
Dirt & Sand | 91 | 1735 | Road Materials | 334 | 6353
Water | 23 | 443 | Sidewalk | 69 | 1316
Yellow Curb | 9 | 174 | Cloth Panels | 13 | 256
Trees | 1162 | 22,084 | Buildings-Shadow | 112 | 2121
Table 4. Class-wise accuracy (%) of Conventional Classifiers and MixtureRS on the Houston dataset.
Class No. | KNN | RF | SVM | MixtureRS
1 | 77.87 | 82.81 | 79.77 | 81.42
2 | 77.44 | 82.86 | 82.42 | 89.16
3 | 96.83 | 63.10 | 59.41 | 98.22
4 | 75.28 | 91.95 | 83.81 | 95.99
5 | 90.72 | 99.70 | 95.27 | 99.78
6 | 66.43 | 96.97 | 67.13 | 95.57
7 | 76.96 | 85.23 | 83.21 | 87.75
8 | 30.96 | 42.58 | 29.53 | 79.23
9 | 69.50 | 85.36 | 75.45 | 90.81
10 | 42.95 | 35.81 | 46.62 | 61.71
11 | 56.17 | 63.03 | 45.07 | 94.43
12 | 75.79 | 66.63 | 70.03 | 92.96
13 | 60.35 | 87.60 | 68.42 | 86.55
14 | 76.92 | 99.73 | 75.30 | 99.87
15 | 88.37 | 85.62 | 49.89 | 100.00
OA | 69.48 | 74.87 | 68.13 | 88.64
AA | 70.84 | 77.94 | 67.42 | 90.23
κ | 0.6708 | 0.7293 | 0.6556 | 0.8767
Table 5. Class-wise accuracy (%) of Classical Convolutional Networks and MixtureRS on the Houston dataset.
Class No. | CNN1D | CNN2D | CNN3D | RNN | MixtureRS
1 | 81.10 | 80.53 | 81.70 | 80.22 | 81.42
2 | 80.23 | 83.90 | 80.55 | 78.51 | 89.16
3 | 53.73 | 57.49 | 96.57 | 52.94 | 98.22
4 | 83.74 | 89.46 | 78.54 | 83.81 | 95.99
5 | 87.06 | 92.36 | 98.48 | 85.61 | 99.78
6 | 52.45 | 64.10 | 73.89 | 70.16 | 95.57
7 | 71.42 | 71.39 | 82.77 | 73.01 | 87.75
8 | 41.12 | 44.95 | 38.30 | 43.84 | 79.23
9 | 60.25 | 62.45 | 65.94 | 68.84 | 90.81
10 | 39.12 | 49.94 | 43.28 | 37.52 | 61.71
11 | 42.06 | 44.53 | 33.59 | 49.65 | 94.43
12 | 62.98 | 53.92 | 67.85 | 64.07 | 92.96
13 | 42.11 | 47.13 | 77.54 | 53.92 | 86.55
14 | 83.94 | 82.46 | 92.58 | 81.38 | 99.87
15 | 34.46 | 42.92 | 93.52 | 44.12 | 100.00
OA | 63.04 | 65.85 | 70.26 | 65.20 | 88.64
AA | 61.05 | 64.50 | 73.67 | 64.51 | 90.23
κ | 0.6001 | 0.6304 | 0.6791 | 0.6243 | 0.8767
Table 6. Class-wise accuracy (%) of Transformer Networks and MixtureRS on the Houston dataset.
Class No. | ViT | SpectralFormer | MixtureRS
1 | 82.40 | 82.49 | 81.42
2 | 80.29 | 89.13 | 89.16
3 | 97.43 | 69.77 | 98.22
4 | 90.40 | 91.73 | 95.99
5 | 99.24 | 96.78 | 99.78
6 | 91.38 | 85.31 | 95.57
7 | 86.10 | 80.25 | 87.75
8 | 73.95 | 62.74 | 79.23
9 | 85.33 | 70.57 | 90.81
10 | 50.42 | 48.17 | 61.71
11 | 80.80 | 62.75 | 94.43
12 | 81.91 | 79.09 | 92.96
13 | 89.47 | 63.63 | 86.55
14 | 99.33 | 93.66 | 99.87
15 | 99.72 | 77.24 | 100.00
OA | 83.23 | 76.35 | 88.64
AA | 85.88 | 76.89 | 90.23
κ | 0.8188 | 0.7442 | 0.8767
Table 7. Class-wise accuracy (%) of Conventional Classifiers and MixtureRS on the MUUFL dataset.
Class No. | KNN | RF | SVM | MixtureRS
1 | 92.12 | 95.42 | 96.63 | 97.19 ± 0.24
2 | 51.85 | 74.03 | 59.25 | 92.12 ± 0.57
3 | 69.35 | 75.81 | 81.46 | 91.94 ± 0.30
4 | 57.00 | 68.59 | 73.54 | 93.60 ± 0.68
5 | 83.87 | 88.17 | 83.79 | 93.54 ± 0.56
6 | 19.19 | 77.28 | 15.35 | 86.53 ± 3.98
7 | 44.60 | 64.83 | 77.04 | 91.45 ± 1.35
8 | 76.97 | 93.29 | 86.94 | 96.86 ± 0.30
9 | 09.95 | 19.15 | 21.28 | 55.11 ± 3.38
10 | 00.00 | 04.41 | 00.00 | 9.96 ± 3.01
11 | 64.45 | 71.88 | 62.89 | 71.09 ± 4.29
OA | 76.83 | 85.32 | 84.24 | 93.65 ± 0.08
AA | 51.76 | 66.62 | 60.41 | 79.94 ± 0.58
κ | 0.6892 | 0.8039 | 0.7880 | 0.9161 ± 0.0010
Table 8. Class-wise accuracy (%) of Classical Convolutional Networks and MixtureRS on the MUUFL dataset.
Class No. | CNN1D | CNN2D | CNN3D | RNN | MixtureRS
1 | 95.05 | 95.79 | 95.10 | 95.84 | 97.19 ± 0.24
2 | 70.35 | 72.76 | 63.72 | 81.93 | 92.12 ± 0.57
3 | 75.80 | 78.92 | 69.94 | 80.47 | 91.94 ± 0.30
4 | 78.60 | 83.59 | 63.90 | 87.01 | 93.60 ± 0.68
5 | 78.31 | 78.29 | 79.48 | 90.65 | 93.54 ± 0.56
6 | 46.35 | 50.34 | 02.86 | 54.25 | 86.53 ± 3.98
7 | 78.31 | 79.70 | 47.96 | 81.24 | 91.45 ± 1.35
8 | 66.72 | 71.95 | 70.47 | 88.39 | 96.86 ± 0.30
9 | 40.15 | 43.92 | 06.28 | 60.54 | 55.11 ± 3.38
10 | 09.20 | 12.45 | 00.00 | 26.44 | 9.96 ± 3.01
11 | 25.65 | 26.82 | 66.93 | 87.50 | 71.09 ± 4.29
OA | 81.50 | 83.40 | 77.99 | 88.79 | 93.65 ± 0.08
AA | 60.41 | 63.14 | 51.51 | 75.84 | 79.94 ± 0.58
κ | 0.7543 | 0.7794 | 0.7031 | 0.8518 | 0.9161 ± 0.0010
Table 9. Class-wise accuracy (%) of Transformer Networks and MixtureRS on the MUUFL dataset.
Class No. | ViT | SpectralFormer | MixtureRS
1 | 97.85 | 97.30 | 97.19 ± 0.24
2 | 76.06 | 69.35 | 92.12 ± 0.57
3 | 87.58 | 78.48 | 91.94 ± 0.30
4 | 92.05 | 82.63 | 93.60 ± 0.68
5 | 94.73 | 87.91 | 93.54 ± 0.56
6 | 82.02 | 58.77 | 86.53 ± 3.98
7 | 87.11 | 85.87 | 91.45 ± 1.35
8 | 97.60 | 95.60 | 96.86 ± 0.30
9 | 57.83 | 53.52 | 55.11 ± 3.38
10 | 31.99 | 08.43 | 9.96 ± 3.01
11 | 58.72 | 35.29 | 71.09 ± 4.29
OA | 92.15 | 88.25 | 93.65 ± 0.08
AA | 78.50 | 68.47 | 79.94 ± 0.58
κ | 0.8956 | 0.8441 | 0.9161 ± 0.0010
Table 10. Class-wise accuracy (%) of conventional classifiers and MixtureRS on the Trento dataset.
Class No. | KNN | RF | SVM | MixtureRS
1 | 87.94 | 83.73 ± 0.06 | 93.44 | 97.39 ± 0.45
2 | 95.79 | 92.30 ± 0.06 | 98.12 | 96.74 ± 0.46
3 | 81.28 | 70.94 ± 1.55 | 56.15 | 90.48 ± 0.43
4 | 96.25 | 93.73 ± 0.07 | 97.53 | 99.30 ± 0.33
5 | 95.29 | 95.35 ± 0.25 | 93.13 | 98.33 ± 0.47
6 | 83.85 | 72.63 ± 0.90 | 78.96 | 89.39 ± 0.30
OA | 93.29 | 92.57 ± 0.07 | 95.33 | 98.06 ± 0.00
AA | 90.07 | 86.45 ± 0.32 | 87.72 | 96.90 ± 0.00
K | 0.9111 | 0.9011 ± 0.0009 | 0.9376 | 0.9740 ± 0.0000
Table 11. Class-wise accuracy (%) of classical convolutional networks and MixtureRS on the Trento dataset.
Class No. | CNN1D | CNN2D | CNN3D | RNN | MixtureRS
1 | 92.00 ± 0.50 | 96.98 ± 0.21 | 92.95 ± 0.10 | 91.75 ± 4.30 | 97.39 ± 0.45
2 | 96.51 ± 1.70 | 97.56 ± 0.14 | 98.09 ± 0.23 | 92.47 ± 0.37 | 99.74 ± 0.46
3 | 42.34 ± 6.33 | 55.35 ± 0.00 | 90.85 ± 1.09 | 79.23 ± 16.47 | 93.48 ± 0.43
4 | 93.77 ± 0.05 | 99.66 ± 0.03 | 63.90 ± 1.84 | 99.58 ± 0.42 | 90.30 ± 0.33
5 | 93.27 ± 0.09 | 99.33 ± 0.07 | 79.48 ± 1.43 | 98.39 ± 0.65 | 99.56 ± 0.47
6 | 76.91 ± 3.62 | 76.91 ± 0.15 | 2.86 ± 3.00 | 85.86 ± 2.89 | 89.39 ± 0.30
OA | 95.81 ± 0.13 | 96.14 ± 0.03 | 77.99 ± 0.06 | 96.43 ± 0.79 | 98.06 ± 0.00
AA | 85.30 ± 0.72 | 87.67 ± 0.04 | 51.51 ± 0.40 | 92.38 ± 3.50 | 96.90 ± 0.00
K | 0.9439 ± 0.0017 | 0.9483 ± 0.0004 | 0.7031 ± 0.0003 | 0.9521 ± 0.0106 | 0.9740 ± 0.0000
Table 12. Class-wise accuracy (%) of transformer networks and MixtureRS on the Trento dataset.
Class No. | ViT | SpectralFormer | MixtureRS
1 | 90.87 ± 0.77 | 92.76 ± 1.71 | 96.39 ± 0.45
2 | 92.32 ± 0.77 | 97.25 ± 0.66 | 99.74 ± 0.46
3 | 90.69 ± 0.53 | 58.47 ± 11.54 | 92.48 ± 0.43
4 | 100.0 ± 0.00 | 99.24 ± 0.21 | 93.30 ± 0.33
5 | 93.77 ± 0.86 | 93.52 ± 1.75 | 97.33 ± 0.47
6 | 89.72 ± 2.02 | 73.39 ± 6.78 | 86.39 ± 0.30
OA | 96.47 ± 0.49 | 93.51 ± 1.27 | 98.06 ± 0.00
AA | 94.56 ± 0.57 | 86.44 ± 2.96 | 96.90 ± 0.00
K | 0.9528 ± 0.0065 | 0.9136 ± 0.0167 | 0.9740 ± 0.0000
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
