1. Introduction
With the acceleration of global urbanization and the continuous aggravation of ecological and environmental pressures, accurate land-use and land-cover (LULC) classification has become an indispensable technical basis for urban planning, disaster response, and decision-making on sustainable development [1,2,3,4,5]. Traditional remote sensing methods rely mainly on single-sensor data (e.g., hyperspectral images (HSI) or multispectral images), with classifiers such as support vector machines (SVMs), random forests, and morphological profiles (MPs) [1] used to learn the relevant features. However, in complex surface scenes these methods encounter well-known bottlenecks such as "the same object exhibiting different spectra" and "different objects exhibiting similar spectra," which limit further improvement of classification accuracy.
In recent years, deep learning has significantly improved classification performance. One-, two-, and three-dimensional CNN architectures have achieved good results by jointly extracting spectral and spatial features [6]. Zhou et al. [7] proposed a shallow convolutional neural network (two convolutional layers and two fully connected layers) that significantly outperforms traditional methods; complex-valued convolutional networks [8] and 3D convolutional architectures [9] further improved performance, but their ability to model long-range dependencies remains limited. Recurrent neural networks (RNNs) have certain advantages in modeling spectral sequences [9], but their complex training mechanism limits their wide application. The seminal review by Ahmad [4] provides a critical analysis of spectral band confusion mechanisms.
The introduction of the Transformer brought new breakthroughs in remote sensing classification: SpectralFormer [10] models spectral information from neighboring bands, which effectively improves classification performance but at the cost of a large parameter count; GLT-Net [11] introduces a global–local attention mechanism to handle long-range dependencies; and LIIT [12] improves the fusion of HSI and LiDAR data through local information interaction.
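To make the neighboring-band idea concrete, the following minimal sketch tokenizes a pixel's spectrum by sliding a window over adjacent bands before linear embedding, in the spirit of SpectralFormer's group-wise spectral embedding [10]. The class name, group size, and shared projection are illustrative assumptions, not the published implementation.

```python
# Sketch: group neighboring spectral bands into overlapping tokens (assumed
# simplification of SpectralFormer-style spectral tokenization, not its code).
import torch
import torch.nn as nn

class GroupSpectralEmbedding(nn.Module):
    def __init__(self, group_size: int, d_model: int):
        super().__init__()
        self.group_size = group_size
        # One linear projection shared across all band groups.
        self.proj = nn.Linear(group_size, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_bands) spectral vector per pixel.
        # unfold extracts sliding groups of neighboring bands.
        groups = x.unfold(dimension=1, size=self.group_size, step=1)
        # groups: (batch, n_tokens, group_size) -> (batch, n_tokens, d_model)
        return self.proj(groups)

tokens = GroupSpectralEmbedding(group_size=8, d_model=64)(torch.randn(2, 144))
print(tokens.shape)  # torch.Size([2, 137, 64])
```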
With the increasing abundance of multimodal remote sensing data, the fusion of multi-source information has become an important direction for improving classification performance. Synthetic aperture radar (SAR) extracts structural features by analyzing the amplitude and phase of signals reflected from the ground surface; LiDAR provides three-dimensional information by measuring surface and target heights with high accuracy [13,14,15]; and multispectral sensors observe ground objects through their reflectance in different wavelength bands, characterizing physical attributes by constructing spectral indices [16,17,18,19,20,21,22,23,24,25].
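As one concrete illustration of how such spectral indices encode physical attributes, the short sketch below computes the widely used normalized difference vegetation index (NDVI) from near-infrared and red reflectance. This is a standard definition rather than anything specific to the cited works; the function name and epsilon guard are our own.

```python
# NDVI = (NIR - Red) / (NIR + Red); values near 1 indicate dense vegetation.
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # eps avoids division by zero over dark pixels (an assumption of this sketch).
    return (nir - red) / (nir + red + eps)

print(ndvi(np.array([0.5]), np.array([0.1])))  # [0.6666...] -> vegetated pixel
```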
The fusion of heterogeneous data from multiple sources achieves information complementarity and provides technical support for the comprehensive characterization and high-precision classification of ground objects. Among early fusion methods, MP/AP-style models [26] manually construct feature combinations to improve boundary separability but easily generate redundant information; the RF model [27] realizes multimodal voting decisions but relies on hand-crafted rules; Cao et al. [28] used concatenation/pooling fusion, which suffers from feature conflicts; Guo et al. [29] implemented POI fusion with density maps but ignored spatial distribution patterns; Li et al. [30] modeled the interaction between SAR phase and optical texture via covariance matrices, introducing bilinear pooling to improve robustness in complex scenes; Kang et al. [31] proposed a cross-gate fusion module that balances multimodal contributions with shared gating weights; and Li et al. [32] designed a collaborative attention gating unit to effectively model long-range dependencies.
Attention mechanisms are increasingly used in multimodal remote sensing classification. Wang et al. [33] constructed a multilevel attention model but did not explicitly model cross-modal associations; Xu et al. [34] introduced SENet to adjust channel weights and enhance feature complementarity; and Liu et al. [35] proposed a pyramid attention mechanism to optimize multiscale feature alignment.
Although the above methods significantly improve classification performance, three core challenges remain: (1) traditional CNNs and ViTs encounter performance saturation and struggle to accurately classify spectrally ambiguous and structurally distinctive classes that are crucial for urban planning and environmental monitoring; (2) a heterogeneous feature fusion bottleneck hampers the effective integration of spectral and spatial information across modalities, and for cross-modal data such as LiDAR and optical imagery the absence of robust fusion mechanisms leads to suboptimal performance; and (3) increased model parameters raise computational overhead, limiting adaptability to edge computing devices.
The goal of this work is to improve the heterogeneous feature fusion capability for land imagery while reducing computational overhead. To this end, we design MixtureRS, a Mixture-of-Experts network for remote sensing land classification. Our main contributions are as follows: (1) We propose a sparse Mixture-of-Experts (MoE) land classification network that improves convergence speed by 40% through a Top-k routing mechanism, reduces test error by 7.2% through adaptive depth regularization, and realizes expert-specialized characterization of spectral–spatial features. (2) We construct a lightweight multimodal fusion framework that combines heterogeneous convolution with a channel-split tokenization strategy, models the complementarity of LiDAR and optical data through cross-modal attention, and decouples the MoE parameter count from computational cost to meet the demands of real-time onboard/UAV processing; a simplified sketch of these two fusion ideas follows.
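The sketch below is our own simplified rendering of the two fusion components named above, not the MixtureRS implementation: the shapes, single-head attention, and mean-pooling choice are illustrative assumptions.

```python
# Sketch: channel-split tokenization + cross-modal attention between
# hyperspectral (HSI) and LiDAR feature maps (illustrative only).
import torch
import torch.nn as nn

def channel_split_tokens(feat: torch.Tensor, n_tokens: int) -> torch.Tensor:
    """Split channels into n_tokens groups; pool each group into one token."""
    b, c, h, w = feat.shape
    groups = feat.reshape(b, n_tokens, c // n_tokens, h * w)
    return groups.mean(dim=-1)          # (b, n_tokens, c // n_tokens)

class CrossModalAttention(nn.Module):
    """HSI tokens query LiDAR tokens (single head kept for brevity)."""
    def __init__(self, d: int):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.scale = d ** -0.5

    def forward(self, hsi_tok: torch.Tensor, lidar_tok: torch.Tensor) -> torch.Tensor:
        attn = torch.softmax(
            self.q(hsi_tok) @ self.k(lidar_tok).transpose(-2, -1) * self.scale,
            dim=-1)
        return hsi_tok + attn @ self.v(lidar_tok)   # residual fusion

hsi = channel_split_tokens(torch.randn(2, 128, 9, 9), n_tokens=16)    # (2, 16, 8)
lidar = channel_split_tokens(torch.randn(2, 128, 9, 9), n_tokens=16)
fused = CrossModalAttention(d=8)(hsi, lidar)
print(fused.shape)  # torch.Size([2, 16, 8])
```

Pooling each channel group into a single token keeps the token count, and hence the attention cost, independent of patch size, which is consistent with the lightweight design goal stated above.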
4. Discussion
The empirical evidence presented above prompts three central questions: why does the MoE-augmented transformer generalize better, where does it still underperform, and how might future research extend these findings? We address each point in turn.
4.1. Why Does MoE Help?
From an optimization standpoint, the Top-k gating produces sparse, expert-wise gradients that reduce co-adaptation among feed-forward sub-modules. Such sparsity mitigates gradient interference, an issue particularly acute when distinct spectral–spatial patterns (e.g., grass vs. asphalt) co-exist within a mini-batch. Moreover, conditional computation implicitly regularizes depth: tokens routed to fewer than all experts traverse shallower effective subnetworks, acting as a form of adaptive DropPath that has been shown to curb overfitting. The large gains in confused categories (water vs. shadowed asphalt) support this theoretical lens, as the router can delegate shadow handling to an “illumination” expert while reserving another specialist for true water bodies with high near-infrared absorption.
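To make these routing mechanics concrete, the following minimal sketch implements Top-k gating over a pool of feed-forward experts. The renormalized softmax over the selected logits and all hyperparameters are assumptions for illustration, not the paper's exact design.

```python
# Sketch: Top-k expert routing. Each token updates only its top-k experts,
# so gradients flow through a sparse subset of feed-forward sub-modules.
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (n_tokens, d)
        logits = self.router(tokens)                     # (n, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)       # (n, k)
        weights = torch.softmax(weights, dim=-1)         # renormalize over top-k
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                            # (n, k) hits for expert e
            rows = mask.any(dim=-1)
            if rows.any():
                gate = (weights * mask).sum(dim=-1, keepdim=True)[rows]
                out[rows] += gate * expert(tokens[rows])
        return tokens + out                              # residual connection

moe = TopKMoE(d=64)
print(moe(torch.randn(100, 64)).shape)  # torch.Size([100, 64])
```

Because each token activates only k of the experts, the parameter count grows with the expert pool while per-token compute stays fixed, which is the parameter–compute decoupling discussed above.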
4.2. Failure Modes and Limitations
Despite overall success, MixtureRS underperforms RF on Healthy Grass. Visual inspection shows that these pixels are uniformly textured, leading the router to allocate minimal capacity while conventional decision trees still benefit from bagging many weak learners. Similarly, the standard deviation for Stressed Grass remains high (7.27%), reflecting sensitivity to seasonal phenology. These observations suggest that the current gating policy could be augmented with a curriculum mechanism that allocates more experts to ambiguous low-variance spectra.
4.3. Broader Implications
The proposed architecture exemplifies a trend toward conditional computation in remote sensing analytics. By dynamically modulating depth and width, the model adapts to local scene complexity, a property of paramount importance for large-scale, multi-sensor Earth observation pipelines where resource budgets fluctuate across orbital passes. Furthermore, the MoE paradigm opens the door to lifelong learning: new experts could be appended to accommodate novel land-cover categories without catastrophic forgetting.
4.4. Future Work
Three avenues appear promising. First, incorporating uncertainty-aware routing could further stabilize high-variance classes by deferring ambiguous tokens to ensembles of experts. Second, coupling the MoE router with graph-based spatial regularizers may suppress salt-and-pepper artifacts commonly observed in transformer outputs. Third, extending the framework to tri-modal settings (e.g., HSI + LiDAR + SAR) would test the scalability of conditional computation under even richer sensor fusion scenarios.
4.5. Concluding Remarks
In sum, the experimental study demonstrates that a carefully designed mixture-of-experts transformer not only eclipses conventional and convolutional counterparts but also advances the state of the art over homogeneous transformer baselines. The gains are most pronounced in spectrally ambiguous or structurally distinctive classes, validating the central premise that adaptive model capacity, informed by multimodal cues, is key to next-generation land-use and land-cover classification.
5. Conclusions
This study illustrates that a sparse Mixture-of-Experts (MoE) transformer, implemented in the MixtureRS framework, can significantly enhance multimodal land-use and land-cover classification beyond the capabilities of traditional convolutional and homogeneous vision transformer models. By combining hyperspectral imagery with LiDAR-derived height data, MixtureRS achieves an overall accuracy of 88.64%, an average accuracy of 90.23%, and a Cohen's kappa of 87.67, surpassing the strongest non-conditional baseline by more than 12 percentage points on key metrics. Notably, the approach yields substantial improvements in classifying spectrally ambiguous or structurally distinctive categories such as water, railways, and parking lots, which are critical for urban planning and environmental monitoring.
The analysis highlights four mechanistic advantages driven by conditional computation: (1) sparse expert activation via Top-k routing reduces gradient interference, promoting faster convergence; (2) adaptive depth regularizes the model akin to DropPath without stochastic instability; (3) expert specialization facilitates a disentangled representation space that effectively fuses heterogeneous modalities; and (4) the scalable architecture enables growth in parameters without significant computational overhead, supporting real-time deployment on spaceborne or airborne platforms.
However, limitations remain. MixtureRS underperforms traditional random forests for homogeneous grass surfaces, indicating a need for better capacity allocation for low-variance classes, perhaps via curriculum routing or ensemble techniques. The model’s sensitivity to phenological shifts and its memory footprint also pose challenges for edge deployment, especially on lightweight UAVs. Furthermore, the assumption of perfect co-registration between hyperspectral and LiDAR data may not hold in practical scenarios, potentially diminishing cross-attention performance.
Looking ahead, promising directions include integrating uncertainty-aware gating to adaptively allocate expert capacity to uncertain tokens, applying graph-based spatial regularizers to reduce noise artifacts, and extending the framework to incorporate additional modalities such as SAR or ultra-high-resolution imagery. These advancements will further test the scalability and robustness of conditional computation in complex remote sensing applications.
In conclusion, this work substantiates that adaptive model capacity guided by multimodal cues is crucial for future remote sensing analytics. MixtureRS sets a new benchmark, providing a flexible and efficient architecture that effectively balances data complexity with computational constraints—marking a significant step toward more intelligent and scalable Earth observation systems.