Deep Fuzzy Fusion Network for Joint Hyperspectral and LiDAR Data Classification

Liu, Guangen; Song, Jiale; Chu, Yonghe; Zhang, Lianchong; Li, Peng; Xia, Junshi

doi:10.3390/rs17172923

by Guangen Liu ¹, Jiale Song ², Yonghe Chu ¹, Lianchong Zhang ³, Peng Li ²,* and Junshi Xia ⁴

¹ College of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China

² Institute for Complexity Science, Henan University of Technology, Zhengzhou 450001, China

³ Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China

⁴ RIKEN Center for Advanced Intelligence Project (AIP), RIKEN, Tokyo 103-0027, Japan

* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(17), 2923; https://doi.org/10.3390/rs17172923

Submission received: 8 July 2025 / Revised: 10 August 2025 / Accepted: 21 August 2025 / Published: 22 August 2025

Abstract

Recently, Transformers have made significant progress in the joint classification of hyperspectral image (HSI) and light detection and ranging (LiDAR) data, owing to their efficient modeling of long-range dependencies and adaptive feature learning mechanisms. However, existing methods face two key challenges: first, the feature extraction stage does not explicitly model category ambiguity; second, the feature fusion stage lacks a mechanism for dynamically perceiving inter-modal differences and uncertainties. To address these challenges, this paper proposes a Deep Fuzzy Fusion Network (DFNet) for the joint classification of HSI and LiDAR data. DFNet adopts a dual-branch architecture that integrates CNN and Transformer structures to extract multi-scale spatial–spectral features from the HSI and LiDAR data. To enhance the model’s discriminative robustness in ambiguous regions, both branches incorporate fuzzy learning modules that model class uncertainty through learnable Gaussian membership functions. In the modality fusion stage, a Fuzzy-Enhanced Cross-Modal Fusion (FECF) module combines membership-aware attention mechanisms with fuzzy inference operators to dynamically adjust modality feature weights and efficiently integrate complementary information. Through this hierarchical design, DFNet represents uncertainty within each modality and controls fusion between modalities. DFNet is evaluated on three public datasets, and extensive experimental results indicate that it considerably outperforms other state-of-the-art methods.

Keywords: hyperspectral image (HSI); light detection and ranging (LiDAR); fuzzy learning; fuzzy fusion; joint classification

1. Introduction

Hyperspectral images (HSIs), a form of remote sensing data that integrates imaging and spectral technologies, capture reflectance information across hundreds of contiguous narrow bands ranging from visible to near-infrared wavelengths [1]. Compared to traditional multispectral imagery, HSIs possess higher spectral resolution, enabling detailed characterization of the physical attributes and biochemical characteristics of ground objects. They are widely applied in fields such as agricultural pest and disease monitoring [2], soil pollution analysis [3], forestry resource investigation [4], and ecosystem assessment [5]. With the maturation of HSI technology and the widespread availability of satellite data, efficient and accurate classification of HSI data has become a core task in intelligent remote sensing processing.

In the early stages of HSI classification, researchers primarily used traditional machine learning methods such as support vector machines (SVMs) [6,7], multiple kernel learning (MKL) [8], and principal component analysis (PCA) [9] to model high-dimensional spectral data. However, these methods relied on handcrafted features and struggled to adapt to complex remote sensing scenarios, showing significant shortcomings in particular when addressing high-dimensional redundancy and mixed pixels. With the development of deep learning, Convolutional Neural Networks (CNNs) [10,11,12,13,14,15] have become mainstream in HSI classification owing to their ability to automatically extract local spatial and spectral features, significantly improving accuracy. Hu et al. [16] pioneered a 1D CNN classification method based on five convolutional layers, which successfully extracted spectral features by using spectral information as input but did not consider spatial information. To address this deficiency, Zhao and Du [17] proposed a 2D CNN HSI classification model that extracts spatial features from the data after PCA dimensionality reduction. To further enhance spatial information modeling, Meng et al. [18] proposed a multi-scale fusion network (FDMFN), which achieves multi-scale feature integration through cross-layer connections; Zhu et al. [19] introduced RSSAN, which employs an attention mechanism to improve feature selection; and Meng et al. [20] designed a lightweight module (LSSCM) that reduces parameter size while maintaining accuracy.

Currently, HSI classification methods rely mainly on spectral data and can achieve good results. However, in urban areas, spectrally similar roads and rooftops are easily confused. Light Detection and Ranging (LiDAR), on the other hand, provides accurate elevation and structural features that complement hyperspectral data at the information level. Combining the two effectively breaks through the performance bottleneck of a single modality and improves the accuracy and reliability of land cover classification in urban areas [21,22,23,24,25,26]. In the early stages of joint classification, most studies adopted traditional machine learning methods that simply concatenated and jointly modeled multimodal features. For example, Colgan et al. [27] used HSI and LiDAR data to construct a two-stage SVM classifier for tree species identification. Huang and Zhu [28] fed spectral, elevation, and texture features into a random forest (RF) classifier for ensemble learning. While these methods demonstrated the effectiveness of fusing the two modalities, they are limited by shallow features, simple fusion strategies, and insufficient modeling of nonlinear relationships, which makes it difficult for them to meet the demands of high-dimensional semantic understanding in complex remote sensing scenarios. To address these issues, subsequent research applied CNNs to the joint classification of HSI and LiDAR data. For example, Xu et al. [29] constructed a dual-branch CNN architecture comprising a dual-channel CNN module and a cascaded CNN module. The dual-channel CNN module mines the spatial and spectral features of the HSI, while the cascaded CNN module extracts elevation features from the LiDAR data. Furthermore, Hang et al. [30] proposed a coupled CNN framework that uses weight-sharing CNN modules to extract features from HSI and LiDAR data separately and combines feature-level and decision-level fusion during the fusion stage. Although these methods have achieved good classification performance, the resulting classification maps appear overly smooth in certain areas due to insufficient feature richness and inadequate use of contextual information.

To address the aforementioned challenges, researchers have introduced Transformer models into this task. The Transformer can efficiently model long-range dependencies, which alleviates the over-smoothing observed in certain regions of the classification maps. Consequently, it shows promise in HSI and LiDAR joint classification tasks [31,32,33,34,35]. Feng et al. [36] proposed a linear self-attention fusion model (LSAF) that leverages a linear self-attention module to enrich contextual feature representation between hyperspectral and LiDAR data and integrates classification results through an adaptive decision fusion module. Ding et al. [37] proposed a global-local Transformer network to learn discriminative spectral–spatial features. Yao et al. [38] extended the traditional vision Transformer, employing a cross-modal attention module to facilitate the exchange of heterogeneous information. The method of Wang et al. [21] combines multi-scale features with a Swin Transformer, achieving non-local feature fusion through layer-by-layer expansion of the receptive field while preserving the spatial features of images. Feng et al. [39] proposed the Dynamic Scale Hierarchical Fusion Network (DSHFNet), which dynamically selects and fuses features at different scales based on their similarity in scale space. This effectively reduces feature dimensionality and addresses the unreliable single-scale features and excessive dimensionality of multi-scale features found in traditional methods. As research has deepened, the integration of CNNs and Transformers has become a new trend for extracting spatial and spectral contextual information from multimodal data more effectively. For example, the Hierarchical CNN-Transformer (HCT) proposed by Zhao et al. [40] uses a cross-token attention mechanism to achieve deeper and more efficient cross-modal information fusion at the token level. The multi-scale 3D–2D hybrid CNN and lightweight attention-free Transformer (M2FNet) proposed by Sun et al. [41] uses a Feature Enhancement (FE) module and a Depthwise Dilated Convolution module (DConvformer) to achieve deeper and more efficient cross-modal information fusion at the feature level. However, these methods often incur high computational costs. To address this challenge, the hybrid self-attention and convolutional network (MACN) proposed by Li et al. [42] redesigns the convolutional and self-attention structures to achieve local–global feature extraction from multi-source remote sensing data while effectively reducing computational overhead. Wang et al. [43] recently proposed a multi-scale cross-attention network (MS2CANet), which improves classification accuracy in complex information regions through spatial–spectral cross-modal attention and enhances useful features while suppressing noise using a feature recalibration module. Although existing Transformer-based methods have achieved significant progress in HSI and LiDAR data fusion and classification, they still have limitations in both feature learning and feature fusion:

(1) The feature extraction stage does not explicitly model class ambiguity, making it difficult to effectively capture high-resolution features of hyperspectral and LiDAR data in local areas such as land cover boundaries and fine-grained spatial structures. This leads to limited boundary discrimination and transition zone modeling capabilities in complex scenes.

(2) The fusion process lacks a mechanism to dynamically perceive differences and uncertainties between modalities, which results in fluctuating cross-modal feature fusion performance and makes it difficult to effectively leverage the complementary advantages of multi-source data.

Fuzzy logic and its neural extensions have provided a compelling framework for modeling uncertainty through membership functions, which are essential for transforming crisp inputs into fuzzy representations [44,45]. More recently, hybrid models combining deep learning and fuzzy logic have shown promising results in improving interpretability and robustness across complex datasets [46,47]. Among different membership function shapes, Gaussian membership functions (GMFs) are particularly attractive due to their smoothness and infinite differentiability, which facilitate stable optimization in gradient-based deep architectures. Empirical studies have shown that GMFs can better capture nonlinear and heterogeneous patterns in high-dimensional spaces compared to triangular or trapezoidal functions, making them well suited for multimodal and ambiguous data modeling [48,49]. In this work, we integrate GMFs into the proposed method to bridge crisp deep feature maps with fuzzy representations.

To address the aforementioned limitations, this paper proposes a hierarchical Deep Fuzzy Fusion Network to jointly process hyperspectral and LiDAR data. Specifically, we employ a dual-branch architecture, leveraging the strengths of both CNNs and Transformers to extract multi-scale features from HSI and LiDAR data, respectively. Each branch integrates a fuzzy learning module, which models category uncertainty through learnable Gaussian membership functions, significantly enhancing the model’s response and representation capabilities in fuzzy regions at class boundaries. Furthermore, in the fusion stage, we design a Fuzzy-Enhanced Cross-Modal Fusion module (FECF), based on membership-aware attention mechanisms and fuzzy inference operators, to strengthen the model’s ability to model uncertainty in boundary and fuzzy regions. This module can more effectively mine and utilize the complementary information between hyperspectral and LiDAR data in fuzzy regions, thereby improving overall classification accuracy. The main contributions of this paper are as follows:

(1) We propose a novel dual-branch cross-modal feature fusion framework for HSI and LiDAR data classification. By introducing fuzzy learning modules in each branch and using Gaussian membership functions, the model’s discriminative ability for boundary-ambiguous regions is enhanced.

(2) A Fuzzy-Enhanced Cross-Modal Fusion encoding module is proposed to enhance the information complementarity between HSI and LiDAR features, thus improving the ability to recognize object boundaries and mixed pixels.

The remainder of this paper is organized as follows: Section 2 provides a detailed description of the proposed method. Section 3 first describes the dataset and experimental settings, followed by the experimental validation of the proposed method. Section 4 presents the conclusions and future work.

2. Methods

2.1. Overall Framework

The overall workflow of the proposed method is illustrated in Figure 1 and comprises four stages: data preprocessing, fuzzy-enhanced feature extraction, Fuzzy-Enhanced Cross-Modal Fusion encoding, and classification. First, PCA is applied to reduce the dimensionality of the HSI data, followed by spatial alignment and normalization of both the HSI and LiDAR data. Second, the Fuzzy-enhanced Feature Extraction Module (FFEM) employs CNNs to extract per-modality features and then applies a fuzzy learning module (FLM) designed to enhance the modeling capability for blurred boundaries and uncertainty. Then, the extracted features undergo deep interaction through a Fuzzy-Enhanced Cross-Modal Fusion (FECF) module, combined with a Fuzzy Fusion Module (FFM), to enhance inter-modal correlation and information complementarity. Finally, the fused multimodal features are input into a classifier to complete land cover classification.

2.2. Data Preprocessing

For a given scene, let $X_{H} \in \mathbb{R}^{m \times n \times l}$ represent the HSI data and $X_{L} \in \mathbb{R}^{m \times n}$ represent the corresponding LiDAR data covering the same geographical area, where $m$ and $n$ indicate the spatial dimensions and $l$ is the number of HSI spectral bands. The class label of each pixel can be represented as a one-hot encoded vector. Although the rich spectral information in HSI data is valuable, it also results in large data sizes and computationally expensive processing. To reduce the spectral dimensionality, PCA is applied to extract the top $b$ principal components from $X_{H}$, preserving the spatial dimensions but reducing the number of spectral bands to $b$. After PCA dimensionality reduction, the HSI data $X_{H}$ is transformed into $X_{H}^{pca} \in \mathbb{R}^{m \times n \times b}$. Next, for each pixel, both 3D and 2D patches are extracted, resulting in a 3D patch cube $X_{H}^{p} \in \mathbb{R}^{s \times s \times b}$ and a 2D patch $X_{L}^{p} \in \mathbb{R}^{s \times s}$, where $s \times s$ denotes the patch size. Each patch is assigned the label of its central pixel. For edge pixels, zero-padding is performed with a padding width of $(s - 1) / 2$. Thus, $m \times n$ patches are obtained from each of the HSI and LiDAR data. After removing patches whose labels are zero, the remaining sample patches are divided into training and test sets.
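The preprocessing steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `pca_reduce` and `extract_patches`, the toy array sizes, and the SVD-based PCA are assumptions, and the patch size s is assumed to be odd.

```python
import numpy as np

def pca_reduce(hsi, b):
    """Project an (m, n, l) HSI cube onto its top-b principal components."""
    m, n, l = hsi.shape
    flat = hsi.reshape(-1, l).astype(np.float64)   # astype returns a copy
    flat -= flat.mean(axis=0)                      # center each band
    # Rows of vt are the principal axes of the centered data.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return (flat @ vt[:b].T).reshape(m, n, b)

def extract_patches(img, labels, s):
    """Return s x s patches centered on every labeled pixel (label > 0)."""
    if img.ndim == 2:                              # LiDAR: add a channel axis
        img = img[:, :, None]
    pad = (s - 1) // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    rows, cols = np.nonzero(labels)                # drop unlabeled (zero) pixels
    patches = np.stack([padded[r:r + s, c:c + s] for r, c in zip(rows, cols)])
    return patches, labels[rows, cols]

rng = np.random.default_rng(0)
hsi = rng.random((6, 5, 20))           # toy X_H with m=6, n=5, l=20
lidar = rng.random((6, 5))             # toy X_L
labels = rng.integers(0, 3, (6, 5))    # 0 marks unlabeled pixels

hsi_pca = pca_reduce(hsi, b=4)                              # X_H^pca
hsi_patches, y = extract_patches(hsi_pca, labels, s=3)      # X_H^p cubes
lidar_patches, _ = extract_patches(lidar, labels, s=3)      # X_L^p patches
```

Each patch inherits the label of its central pixel, so `y` can be split directly into training and test sets alongside the two patch arrays.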

2.3. Fuzzy-Enhanced Feature Extraction Module

In hyperspectral image classification tasks, CNNs often struggle to accurately delineate boundary regions, especially in areas with a high proportion of mixed pixels, such as the transition zones between vegetation and soil. They are insufficient in handling pixel-wise uncertainty and class transitions. These regions are typically forced into a single category, ignoring the natural transitional characteristics between land cover types, which weakens the model’s discriminative ability in boundary areas and affects the overall accuracy of classification results. To address this issue, as shown in Figure 1, a fuzzy learning module (FLM) is introduced during the feature extraction process of each branch. This module constructs multiple learnable Gaussian-shaped fuzzy membership functions for each channel to perform fuzzy encoding and soft clustering modeling on the intermediate features extracted by the CNN, thereby capturing the uncertainty distribution structure of the features. Compared to traditional convolutional features, the fuzzy-enhanced features generated by the FLM are more robust in representing fuzzy regions, boundary transitions, and heterogeneous mixed pixels, thus significantly improving the model’s classification performance.

In detail, the HSI data is processed using two consecutive convolutional layers, Conv3-D and Conv2-D, to extract its spatial and spectral features, respectively. Each HSI patch $X_{H}^{p}$ of size $s \times s \times b$ is first input into the Conv3-D layer, which has 8 convolutional kernels of size $3 \times 3 \times 3$. The resulting feature tensors are unfolded along the spatial dimension and used as input for the Conv2-D layer, where 64 convolutional kernels of size $3 \times 3$ produce 64 two-dimensional feature maps. For the LiDAR branch, two consecutive Conv2-D layers are used to extract features. Each LiDAR patch $X_{L}^{p}$ of size $s \times s$ passes through 16 convolutional kernels of size $3 \times 3$ and then 64 convolutional kernels of size $3 \times 3$ to extract high-level features. Owing to sensor accuracy limitations and the influence of complex surface environments, HSI data is often affected by spectral noise, whereas LiDAR data is prone to coherent speckle noise. Such noise often exhibits spatial structure that CNNs can mistakenly learn as “useful features”, leading to blurred classification boundaries and local overfitting. To address this problem, we introduce a fuzzy learning module (FLM) after the convolution operations.

For convenience, the subscripts denoting the HSI and LiDAR branches are temporarily omitted in the following discussion. As shown in Figure 2, the features $X \in \mathbb{R}^{B \times C \times H \times W}$ extracted by the CNNs within the FFEM are transformed into fuzzy representations by the proposed FLM, where $B$ is the batch size, $C$ is the number of feature channels, and $H$ and $W$ are the spatial dimensions (height and width) of the features. From the perspective of fuzzy set theory [44], each scalar feature value $X_{b, c, h, w}$ can be regarded as a crisp observation whose degree of membership in several fuzzy sets is determined by a set of learnable membership functions. For each feature channel $c$, the FLM applies $N$ GMFs to every spatial location $(h, w)$, mapping the original feature value into $N$ fuzzy membership degrees. Each Gaussian membership function is parameterized by a center $\mu_{c, j}$ and a standard deviation $\sigma_{c, j}$, which are the entries of the parameter matrices $\mu \in \mathbb{R}^{C \times N}$ and $\sigma \in \mathbb{R}^{C \times N}$. Both are learned during training, enabling adaptive and data-driven fuzzy partitioning of the feature space. The formula is as follows:

X'_{b, c, h, w, j} = \exp\left(-\left(\frac{X_{b, c, h, w} - \mu_{c, j}}{\sigma_{c, j}}\right)^{2}\right)

(1)

where $j \in \{1, 2, \dots, N\}$ indexes the fuzzy sets for channel $c$. GMFs are chosen for their smoothness, locality, and infinite differentiability, which support stable gradient-based learning in deep architectures [47,48]. Moreover, they provide a localized, nonlinear mapping that is well suited for modeling heterogeneous and ambiguous multimodal features, as demonstrated in recent neuro-fuzzy models [46].

Intuitively, Equation (1) transforms each raw convolutional activation $X_{b, c, h, w}$ into a graded fuzzy membership value with respect to multiple semantic prototypes (determined by the learnable $\mu_{c, j}$ and $\sigma_{c, j}$). This graded representation enables the model to retain partial belonging information rather than making hard, binary decisions, thereby capturing subtle semantic variations and uncertainty in the feature space. Such a soft representation is particularly beneficial in multimodal scenarios where data distributions are heterogeneous and boundary regions between classes are not crisply defined. The parameter $\sigma_{c, j}$ controls the fuzziness of membership. A larger $\sigma_{c, j}$ implies a broader response, representing higher uncertainty. The learnable parameters allow the model to adapt partition granularity to data statistics.
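To make the role of the center and width parameters concrete, the following NumPy sketch evaluates Equation (1) for a single activation under a narrow and a broad width. The function name `gaussian_memberships` and all numeric values are illustrative assumptions, not values from the paper.

```python
import numpy as np

def gaussian_memberships(x, mu, sigma):
    """Eq. (1): x is (B, C, H, W); mu and sigma are (C, N). Returns (B, C, H, W, N)."""
    # Broadcast the per-channel parameters over batch and spatial axes.
    z = (x[..., None] - mu[None, :, None, None, :]) / sigma[None, :, None, None, :]
    return np.exp(-z ** 2)

x = np.full((1, 1, 1, 1), 0.5)               # one activation with value 0.5
mu = np.array([[0.0, 0.5, 1.0]])             # C=1 channel, N=3 centers
narrow = gaussian_memberships(x, mu, np.full((1, 3), 0.25))
broad = gaussian_memberships(x, mu, np.full((1, 3), 1.0))

print(narrow.ravel())   # sharp response: high only for the matching center
print(broad.ravel())    # broad response: neighboring sets also respond
```

With the narrow width, only the fuzzy set centered at 0.5 responds strongly; with the broad width, the activation retains substantial membership in all three sets, which is the behavior the text describes as higher uncertainty.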

To aggregate the fuzzy membership responses while retaining differentiability, the FLM applies the LogSumExp operator. Because each membership in Equation (1) is an exponential, taking the logarithm of the summed memberships is a LogSumExp over their exponents. The aggregation is formally defined as

X'_{b, c, h, w} = \log \sum_{j = 1}^{N} X'_{b, c, h, w, j}

(2)

where $X'_{b, c, h, w}$ denotes the fuzzy feature information at each spatial location. LogSumExp acts as a smooth approximation to the maximum operation. Moreover, both the GMFs and the LogSumExp operator are infinitely differentiable, which is crucial for stable gradient propagation in end-to-end training via backpropagation. This design aligns with recent advances in deep fuzzy architectures, where smooth aggregation operators help retain gradient flow and improve robustness in multimodal tasks [48]. Consequently, the feature map $X$ extracted by the CNNs is processed by the FLM, yielding the fuzzy feature map $X' \in \mathbb{R}^{B \times C \times H \times W}$ with improved robustness to uncertainty and noise.
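The smooth-maximum property mentioned above can be checked numerically: for N exponents, LogSumExp always lies between the largest exponent and that value plus log N. The sketch below uses only the standard library; the helper name `logsumexp` and the exponent values are illustrative assumptions.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

# Exponents -z_j^2 of N = 3 Gaussian memberships at one location (illustrative).
exponents = [-4.0, -0.1, -2.5]
aggregated = logsumexp(exponents)      # Eq. (2): log of the summed memberships

lower = max(exponents)                 # hard maximum
upper = lower + math.log(len(exponents))
print(lower, aggregated, upper)
```

The aggregated value is dominated by the best-matching fuzzy set while still receiving a smooth contribution from the others, which is why the operator remains differentiable everywhere.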

Finally, residual connections integrate the feature maps $X$ and $X'$ via element-wise addition, as formalized in Equation (3).

\hat{X} = \mathrm{BN}(X') + X

(3)

where batch normalization (BN) is applied to constrain the dynamic range of the fuzzy feature map $X'$, and $\hat{X}$ denotes the output of the FFEM. Accordingly, let $\hat{X}_{H}$ and $\hat{X}_{L}$ denote the outputs of the FFEM modules of the HSI and LiDAR branches, respectively.
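Putting Equations (1) to (3) together, a forward pass of the FLM can be sketched as below. This is a simplified NumPy illustration under stated assumptions: the name `flm_forward` is hypothetical, batch normalization uses batch statistics without learned scale and shift, a small constant is added inside the logarithm as a numerical safeguard, and the parameter values are arbitrary rather than learned.

```python
import numpy as np

def flm_forward(x, mu, sigma, eps=1e-5):
    """x: (B, C, H, W); mu, sigma: (C, N). Returns X_hat with the same shape as x."""
    z = (x[..., None] - mu[None, :, None, None, :]) / sigma[None, :, None, None, :]
    member = np.exp(-z ** 2)                          # Eq. (1)
    fuzzy = np.log(member.sum(axis=-1) + 1e-12)       # Eq. (2); constant avoids log(0)
    # Batch normalization with per-channel batch statistics (no affine terms).
    mean = fuzzy.mean(axis=(0, 2, 3), keepdims=True)
    var = fuzzy.var(axis=(0, 2, 3), keepdims=True)
    return (fuzzy - mean) / np.sqrt(var + eps) + x    # Eq. (3): BN(X') + X

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 5, 5))                     # B=2, C=4, H=W=5
mu = np.tile(np.linspace(-1.0, 1.0, 3), (4, 1))       # (C, N) centers, N=3
sigma = np.ones((4, 3))                               # (C, N) widths
x_hat = flm_forward(x, mu, sigma)
```

The residual term keeps the original convolutional features intact, so the fuzzy branch only has to learn a correction on top of them.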

By introducing fuzzy membership functions, the FLM performs soft partitioning of CNN features and dynamically adjusts weights for boundary regions. It preserves spatial perception capabilities while enhancing robustness against mixed pixels and fuzzy boundaries, thus alleviating class overlap and feature confusion and improving classification reliability. The FFEM therefore addresses the difficulty CNNs have in modeling nonlinear spectral distributions and fuzzy boundaries in hyperspectral classification. Next, the outputs $\hat{X}_{H}$ and $\hat{X}_{L}$ serve as the inputs to the Fuzzy-Enhanced Cross-Modal Fusion encoding module, providing higher-quality and more semantically hierarchical feature representations for the subsequent fusion encoding.

2.4. Fuzzy-Enhanced Cross-Modal Fusion Module

Existing Transformer-based methods primarily rely on data-driven approaches to automatically allocate weights to features from different modalities, but they fall short in effectively modeling the disparity and complementarity across modalities. This leads to insufficient extraction and utilization of complementary information from hyperspectral and LiDAR data, especially in ambiguous regions, such as land boundaries. To address this issue, we designed a Fuzzy-Enhanced Cross-Modal Fusion (FECF) module that combines membership-aware attention mechanisms with fuzzy inference operators to achieve dynamic adjustment of modal feature weights and the efficient integration of complementary information.

First, to adapt to the Transformer architecture, the features extracted by the dual-branch convolutional network need to be flattened and represented as tokens. Therefore, the features $\hat{X}_{H}$ and $\hat{X}_{L}$ are flattened into sets of vectors. The flattened HSI and LiDAR feature maps are denoted as $\hat{X}_{H} \in \mathbb{R}^{u_{h} v_{h} \times z_{h}}$ and $\hat{X}_{L} \in \mathbb{R}^{u_{l} v_{l} \times z_{l}}$, where $u_{h}$ and $v_{h}$ represent the height and width of the HSI feature map, $u_{l}$ and $v_{l}$ represent the height and width of the LiDAR feature map, and $z_{h}$ and $z_{l}$ denote the number of channels of the HSI and LiDAR feature maps, respectively. Inspired by [50], we employ two learnable weight matrices, $W^{a}$ and $W^{b}$, to derive the HSI tokens $T_{hsi} \in \mathbb{R}^{n_{h} \times z_{h}}$ and the LiDAR tokens $T_{lidar} \in \mathbb{R}^{n_{l} \times z_{l}}$, respectively, via the following formulas:

T_{hsi} = \mathrm{softmax}(\hat{X}_{H} W^{a})^{T} \hat{X}_{H}

(4)

T_{lidar} = \mathrm{softmax}(\hat{X}_{L} W^{b})^{T} \hat{X}_{L}

(5)

where $n_{h}$ and $n_{l}$ represent the number of HSI and LiDAR tokens, respectively, and $\hat{X}_{H} W^{a}$ and $\hat{X}_{L} W^{b}$ denote $1 \times 1$ point-wise products. The softmax operation emphasizes the relatively important semantic parts.
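The tokenization in Equations (4) and (5) can be illustrated with a short NumPy sketch. The text does not state the axis over which the softmax is taken; the sketch assumes it normalizes over spatial positions, so that each token is a weighted average of pixel features. The names `tokenize`, `x_hat_h`, and `w_a`, and all sizes, are illustrative.

```python
import numpy as np

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def tokenize(x_flat, w):
    """x_flat: (u*v, z) flattened features; w: (z, n). Returns (n, z) tokens."""
    scores = x_flat @ w                        # (u*v, n) point-wise projection
    attn = softmax(scores, axis=0).T           # (n, u*v); each row sums to 1 (assumed axis)
    return attn @ x_flat                       # Eq. (4) / Eq. (5)

rng = np.random.default_rng(0)
x_hat_h = rng.normal(size=(7 * 7, 64))         # flattened 7x7 map with 64 channels
w_a = rng.normal(scale=0.1, size=(64, 4))      # n_h = 4 tokens (illustrative)
t_hsi = tokenize(x_hat_h, w_a)                 # (4, 64)
```

Under this assumption, the number of tokens is set by the second dimension of the weight matrix, independent of the spatial size of the feature map.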

Subsequently, as illustrated in Figure 1, we employ Transformer encoders based on the Multi-Head Self-Attention (MHSA) mechanism on the two separate branches to model the semantic correlations between the feature tokens. Specifically, the input tokens fed into the Transformer encoder are first concatenated with a trainable classification token (CLS). Positional information is then integrated into the token embeddings to preserve sequential-order information. Applying this processing to the input tokens $T_{hsi}$ and $T_{lidar}$ yields the corresponding embeddings $T_{hsi}^{in}$ and $T_{lidar}^{in}$, respectively. The embeddings $T_{hsi}^{in}$ and $T_{lidar}^{in}$ are then each processed by a Transformer encoder, which comprises an MHSA layer followed by a Feed-Forward Network (FFN), both wrapped with residual connections and Layer Normalization (LN). Taking the $T_{hsi}^{in}$ tokens as an example, the process can be described as follows:

T_{hsi}^{o} = \mathrm{MHSA}(\mathrm{LN}(T_{hsi}^{in})) + T_{hsi}^{in}

(6)

T'_{hsi} = \mathrm{FFN}(\mathrm{LN}(T_{hsi}^{o})) + T_{hsi}^{o}

(7)

where $T'_{hsi}$ is the output of the Transformer encoder module. By applying the same procedure to the LiDAR feature tokens, we obtain the output $T'_{lidar}$ for the LiDAR branch.
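As an illustration of Equations (6) and (7), the sketch below implements one pre-normalization encoder block in NumPy. It is a simplified sketch under stated assumptions: layer normalization has no learned scale or shift, the FFN uses a ReLU activation and a hidden width of twice the embedding size, and all weights are random placeholders. None of these choices are specified in the text.

```python
import numpy as np

def layer_norm(t, eps=1e-5):
    return (t - t.mean(-1, keepdims=True)) / np.sqrt(t.var(-1, keepdims=True) + eps)

def softmax(a):
    e = np.exp(a - a.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def mhsa(t, wq, wk, wv, wo, heads):
    """Multi-head self-attention over tokens t of shape (n, d)."""
    n, d = t.shape

    def split(m):                                  # (n, d) -> (heads, n, d/heads)
        return m.reshape(n, heads, d // heads).transpose(1, 0, 2)

    q, k, v = split(t @ wq), split(t @ wk), split(t @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d // heads))
    return (attn @ v).transpose(1, 0, 2).reshape(n, d) @ wo

def encoder_block(t_in, p, heads=4):
    t_o = mhsa(layer_norm(t_in), p["wq"], p["wk"], p["wv"], p["wo"], heads) + t_in  # Eq. (6)
    hidden = np.maximum(layer_norm(t_o) @ p["w1"], 0.0)    # FFN with ReLU (assumed)
    return hidden @ p["w2"] + t_o                           # Eq. (7)

rng = np.random.default_rng(0)
d, n_tokens = 8, 4
tokens = rng.normal(size=(n_tokens, d))
cls = np.zeros((1, d))                                  # CLS token (zero-initialized here)
pos = rng.normal(scale=0.02, size=(n_tokens + 1, d))    # positional embedding
t_in = np.concatenate([cls, tokens]) + pos              # prepend CLS, add positions
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in
          [("wq", (d, d)), ("wk", (d, d)), ("wv", (d, d)), ("wo", (d, d)),
           ("w1", (d, 2 * d)), ("w2", (2 * d, d))]}
t_out = encoder_block(t_in, params)
```

Because both sub-layers are wrapped in residual connections, the block reduces to the identity when the output projections are zero, which is one reason this layout tends to train stably.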

After completing intra-modal modeling, a cross-attention module is introduced to achieve semantic alignment and information fusion between heterogeneous modalities [40,51]. Each modality’s CLS token acts as an interaction bridge: the CLS token of one modality is concatenated with the feature tokens of the other modality and subjected to attention projection, thereby explicitly modeling the complementary relationships between modalities. The cross-attention module for the HSI token branch is illustrated in Figure 3. For the HSI branch, it first combines the LiDAR patch tokens with its own CLS token through concatenation, which can be formulated as follows:

t_{hsi}^{cls'} = f_{hsi}(t_{hsi}^{cls})

(8)

T_{hsi}^{ct} = \mathrm{Concat}(t_{hsi}^{cls'}, T'_{lidar} \setminus t_{lidar}^{cls})

(9)

where $t_{hsi}^{cls}$ and $t_{lidar}^{cls}$ denote the class tokens from $T'_{hsi}$ and $T'_{lidar}$, respectively, and $T'_{lidar} \setminus t_{lidar}^{cls}$ denotes the LiDAR tokens with the class token removed. $f_{hsi}(\cdot)$ serves as a projection function for dimension alignment, which makes the projected $t_{hsi}^{cls'}$ share the same dimensionality as the patch tokens in $T'_{lidar}$. $T_{hsi}^{ct}$ denotes the combined feature tokens.

The module subsequently performs cross-attention (CA) between $t_{hsi}^{cls'}$ and $T_{hsi}^{ct}$, using the CLS token as the sole query since the information from patch tokens has already been fused into it. In mathematical terms, the CA is represented as

Q = t_{hsi}^{cls'} W^{Q}, \quad K = T_{hsi}^{ct} W^{K}, \quad V = T_{hsi}^{ct} W^{V}

(10)

\mathrm{CA}(T_{hsi}^{ct}) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{c / h}}\right) V

(11)

Here, $W^{Q}$, $W^{K}$, and $W^{V} \in \mathbb{R}^{c \times (c / h)}$ are learnable weights, where $c$ is the embedding dimension and $h$ is the number of attention heads. As with self-attention, we incorporate multiple heads in the CA. Let $T_{HSI}$ and $T_{LiDAR}$ denote the outputs of the HSI and LiDAR branches after the CA module, respectively.
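The CLS-driven cross-attention of Equations (8) to (11) can be sketched for the HSI branch as follows. This is a single-head simplification (h = 1) in NumPy, with the projection function modeled as a plain linear map; the function name `cls_cross_attention`, the convention that row 0 holds the CLS token, and all sizes and weights are assumptions for illustration.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def cls_cross_attention(t_hsi, t_lidar, w_proj, wq, wk, wv):
    """Single-head sketch of Eqs. (8)-(11). Row 0 of each token matrix is its CLS token."""
    cls_proj = t_hsi[:1] @ w_proj                       # Eq. (8): align CLS dimension
    t_ct = np.concatenate([cls_proj, t_lidar[1:]])      # Eq. (9): drop the LiDAR CLS
    q, k, v = cls_proj @ wq, t_ct @ wk, t_ct @ wv       # Eq. (10): CLS is the only query
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # Eq. (11) with h = 1
    return weights @ v, weights

rng = np.random.default_rng(0)
c_h, c_l = 8, 6
t_hsi = rng.normal(size=(5, c_h))        # CLS + 4 HSI tokens
t_lidar = rng.normal(size=(4, c_l))      # CLS + 3 LiDAR tokens
w_proj = rng.normal(scale=0.1, size=(c_h, c_l))
wq, wk, wv = (rng.normal(scale=0.1, size=(c_l, c_l)) for _ in range(3))
out, weights = cls_cross_attention(t_hsi, t_lidar, w_proj, wq, wk, wv)
```

Using a single query keeps the attention cost linear in the number of tokens of the other modality, rather than quadratic as in full self-attention.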

Although the transformer and CA perform excellently in modeling intra- and inter-modal relationships, in complex regions such as those with noise interference, mixed pixels, and blurred boundaries, the uncertainty in feature distribution hinders the effective exploitation of complementarity between HSI and LiDAR. This results in the loss of critical information and difficulties in class differentiation. To address these issues, a Fuzzy Fusion Module is introduced to enhance the informational complementarity between HSI and LiDAR features and strengthen the recognition capability of land cover boundaries and mixed pixels.

Figure 4 illustrates the processing flow of the Fuzzy Fusion Module (FFM). Specifically, for the HSI branch, the input feature tensor $T_{HSI} \in \mathbb{R}^{B \times N \times D}$ is structured with batch size $B$, token count $N$, and feature dimension $D$. The mean $\mu$ and standard deviation $\sigma$ are calculated across the batch and sequence dimensions (i.e., along axes $B$ and $N$), aggregating statistics for each feature channel. Furthermore, the FFM employs a residual connection to retain the original features while fusing fuzzy representations based on the Gaussian smoothing of global statistics. This design explicitly retains critical information and improves feature robustness by fusing local details with contextual fuzzy membership information, effectively capturing inherent uncertainties. This operation can be formulated as

f(T_{HSI}) = \frac{1}{\sigma \sqrt{2 \pi}} \cdot \exp\left(-\frac{(T_{HSI} - \mu)^{2}}{2 \sigma^{2}}\right) + T_{HSI}

(12)

Here, the function $f(T_{HSI})$ implements a Gaussian-based fuzzy membership transformation that combines raw input features with their globally contextualized fuzzy representations.
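Equation (12) can be written in a few lines of NumPy. In the sketch below, the function name `fuzzy_transform` and the small constant added to the standard deviation (a safeguard against zero variance) are assumptions; the input is random placeholder data.

```python
import numpy as np

def fuzzy_transform(t, eps=1e-6):
    """Eq. (12) for tokens t of shape (B, N, D); statistics are per feature channel."""
    mu = t.mean(axis=(0, 1), keepdims=True)             # mean over batch and tokens
    sigma = t.std(axis=(0, 1), keepdims=True) + eps     # eps guards against zero variance
    density = np.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return density + t                                  # Gaussian term plus residual

rng = np.random.default_rng(0)
t = rng.normal(size=(2, 5, 8))      # B=2, N=5 tokens, D=8 features
f_t = fuzzy_transform(t)
```

The added Gaussian term is largest for feature values close to the channel mean and decays for outlying values, so typical features receive the strongest fuzzy contribution.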

Subsequently, adaptive feature fusion is applied to combine raw input features

T_{H S I}

with their fuzzy-transformed counterparts

f (T_{H S I})

using a residual coefficient

m \in [0, 1]

. This fusion operation captures contextual uncertainty while preserving local details. Next, a linear transformation is used to map the fused features to an attention space via learnable parameters

W_{H S I}

(weight matrix) and

b_{H S I}

(bias vector). Finally, the FFM applies a sigmoid activation

σ (z) = 1 / (1 + e^{- z})

followed by a dropout regularization, producing the final attention weights. Thus, the attention weights

A_{H S I}

can be formulated as follows:

A_{HSI} = \mathrm{Dropout} (σ (W_{HSI} (m \cdot T_{HSI} + f (T_{HSI})) + b_{HSI}))

(13)

Furthermore, the FFM employs the attention weight

A_{H S I}

to dynamically balance the contributions of the original and transformed features. This adaptive fusion mechanism enables the model to focus on the most salient information and enhances its ability to adapt to complex data, formulated as

T_{HSI}^{f} = A_{HSI} ⊙ T_{HSI} + (1 - A_{HSI}) ⊙ f (T_{HSI})

(14)

where ⊙ denotes element-wise multiplication, and

T_{H S I}^{f}

is the fused features. For the LiDAR features, an identical fuzzy transformation process is applied, which yields the output

T_{L i D A R}^{f}

. Subsequently, the HSI features

T_{H S I}^{f}

and LiDAR features

T_{L i D A R}^{f}

are fused through a concatenation operation, achieving effective integration of multimodal data. The formula is as follows:

T_{fuse} = \mathrm{Concat} (T_{HSI}^{f}, T_{LiDAR}^{f})

(15)

where

T_{f u s e}

represents the fused features of the multimodal data.
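The FFM computations in Equations (12)–(15) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the batch and token axes are flattened into a list of rows, dropout is omitted (as at inference time), the population standard deviation and the small constant `eps` are assumptions, the weight matrix is assumed to be D × D so that the attention weights match the feature shape, `m = 0.5` is a placeholder value, and concatenation along the feature axis is assumed. All function names are hypothetical.

```python
import math


def fuzzy_transform(tokens, eps=1e-6):
    """Eq. (12): Gaussian membership from per-channel statistics plus a residual term."""
    n, d = len(tokens), len(tokens[0])
    # Mean and (population) standard deviation of every channel over all tokens.
    mu = [sum(row[j] for row in tokens) / n for j in range(d)]
    sigma = [
        math.sqrt(sum((row[j] - mu[j]) ** 2 for row in tokens) / n) + eps  # eps: assumption
        for j in range(d)
    ]
    return [
        [
            math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi)) + x
            for x, m, s in zip(row, mu, sigma)
        ]
        for row in tokens
    ]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fuzzy_fusion(tokens, weight, bias, m=0.5):
    """Eqs. (13)-(14) for one modality, without dropout."""
    fuzzy = fuzzy_transform(tokens)
    fused = []
    for t_row, f_row in zip(tokens, fuzzy):
        mixed = [m * t + f for t, f in zip(t_row, f_row)]
        # A linear map followed by a sigmoid gives one attention weight per channel.
        att = [
            sigmoid(sum(w * v for w, v in zip(w_row, mixed)) + b)
            for w_row, b in zip(weight, bias)
        ]
        # Attention-weighted blend of the original and fuzzy-transformed features.
        fused.append([a * t + (1 - a) * f for a, t, f in zip(att, t_row, f_row)])
    return fused


def concat_features(hsi_rows, lidar_rows):
    """Eq. (15): concatenate the two modalities token by token (feature axis assumed)."""
    return [h + l for h, l in zip(hsi_rows, lidar_rows)]


# Toy inputs: 2 tokens with 2 channels per modality (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
hsi = fuzzy_fusion([[0.0, 1.0], [2.0, 3.0]], identity, [0.0, 0.0])
lidar = fuzzy_fusion([[0.5, 0.1], [0.2, 0.9]], identity, [0.0, 0.0])
fused = concat_features(hsi, lidar)  # 2 tokens, 4 channels each
```

Because the attention weights lie in (0, 1), every fused value in Equation (14) is a convex combination of the original feature and its fuzzy-transformed counterpart.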

The model then applies a linear transformation to project

T_{f u s e}

into a lower-dimensional space that matches the shapes of

T_{H S I}

and

T_{L i D A R}

. The dimensionally reduced representation

T_{f u s e}^{'}

is then separately added to

T_{H S I}

and

T_{L i D A R}

through residual connections, yielding the final representations

T_{h s i}^{e n d}

and

T_{l i d a r}^{e n d}

, respectively, as illustrated in Figure 1.

Although both FLM and FFM leverage fuzzy principles, they address distinct stages of the multimodal learning pipeline. The FLM operates at the feature extraction stage, transforming convolutional feature maps into fuzzy membership representations to improve the feature extraction capability for boundary regions of hyperspectral imagery and LiDAR data. This method effectively alleviates the fuzziness caused by mixed pixels, thereby enhancing the model’s classification performance in category transition areas. In contrast, the FFM functions at the feature integration stage, where it dynamically weights and combines heterogeneous fuzzy-enhanced features across modalities using an attention mechanism guided by membership degrees. This adaptive approach not only optimizes the feature fusion process but also strengthens cross-modal correlations, enabling more robust alignment.

2.5. Classification

After the FECF module completes the cross-modal information exchange, the cls tokens from the HSI and LiDAR branches, each serving as a compact representation of global semantics, are fed into independent multilayer perceptron (MLP) layers. Each MLP consists of two fully connected layers, with the first layer embedding a GELU activation function to introduce nonlinearity, and the second layer outputting class probabilities through a softmax operation. The dimension of the output layer matches the number of land cover classes, and the softmax function normalizes the activation values for each class, producing a probability distribution as output.

p_{i} = \frac{exp (z_{i})}{\sum_{j = 1}^{C} exp (z_{j})}, i = 1, \dots, C

(16)

Here,

z_{i}

denotes the predicted score for the i-th class, C is the total number of classes, and

p_{i}

is the predicted probability for this class. The final probability vector is obtained by adding the two output probability vectors, and the pixel category is identified by the label associated with the highest probability value.
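The decision rule described above can be illustrated with a short sketch: each branch applies the softmax of Equation (16), the two probability vectors are added, and the class with the largest summed value is selected. The function names and the example scores are hypothetical, and class indices are zero-based here.

```python
import math


def softmax(scores):
    """Eq. (16): convert class scores into a probability distribution."""
    shift = max(scores)  # subtracting the maximum avoids overflow and leaves the result unchanged
    exps = [math.exp(z - shift) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_predict(hsi_scores, lidar_scores):
    """Add the two branch probability vectors and return (predicted class, summed vector)."""
    p_hsi = softmax(hsi_scores)
    p_lidar = softmax(lidar_scores)
    summed = [a + b for a, b in zip(p_hsi, p_lidar)]
    return summed.index(max(summed)), summed


# Illustrative scores for a three-class problem (not taken from the paper).
label, summed = fuse_and_predict([2.0, 1.0, 0.1], [0.5, 2.5, 0.3])
```

The summed vector is not renormalized in this sketch; dividing by two would give an averaged distribution without changing the selected class.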

The integrated network is trained with the cross-entropy loss, defined as

L = - \sum_{i = 1}^{C} y_{i} log (p_{i})

(17)

where

y_{i}

is the i-th element of the one-hot encoded label, and

p_{i}

denotes the predicted probability for class i. The overall training process of the proposed deep fuzzy fusion method is described in Algorithm 1.
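For a single sample, the loss in Equation (17) reduces to the negative log-probability assigned to the true class. A minimal sketch follows; the function name and the small constant `eps`, which guards against taking the logarithm of zero, are assumptions added for illustration.

```python
import math


def cross_entropy(one_hot, probs, eps=1e-12):
    """Eq. (17): cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot, probs))


# Illustrative three-class example in which the true class is the second one.
loss = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
```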

Algorithm 1 Deep Fuzzy Fusion Network for Joint Hyperspectral and LiDAR Data Classification

1: Input: HSI data $X_{H} \in R^{m \times n_{H}}$, LiDAR data $X_{L} \in R^{m \times n_{L}}$, ground-truth data $Y \in R^{m \times n_{y}}$,
2: PCA bands number $b = 30$, patch size $s = 11$, training sample rate $\phi\%$
3: Output: Predicted labels of the test set
4: Set batch size $= 64$, optimizer Adam (learning rate: $1 \times 10^{-3}$), epochs number $e = 100$, initialize all weights
5: Obtain $X_{H}^{PCA}$ after PCA transform
6: Create all sample patches from $X_{H}^{PCA}$ and $X_{L}$, divide into training set and test set
7: Generate training loader and test loader
8: for $i = 1$ to $e$ do
9:     Extract spatial–spectral features from $X_{H}^{PCA}$ using 3D and 2D CNN layers, extract elevation features from $X_{L}$
10:     Enhance features through FLM to obtain ${\hat{X}}_{H}$ and ${\hat{X}}_{L}$ by calculating Equations (1)–(3)
11:     Transform the feature vector to generate tokens by calculating Equations (4) and (5)
12:     Input to the transformer encoder for feature learning to obtain features $T_{HSI}$ by calculating Equations (6) and (7)
13:     Achieve semantic alignment and information fusion to obtain $T_{HSI}$ and $T_{LiDAR}$ by calculating Equations (8)–(11)
14:     Fuse features input to FFM by calculating Equations (12)–(15)
15:     Compute to obtain classification prediction labels by calculating Equations (16) and (17)
16: end for
17: Use the trained model to predict labels for the test set
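The patch creation step of Algorithm 1 can be illustrated for a single-band image such as the LiDAR raster. The sketch below extracts the s × s neighbourhood centred on a pixel; zero padding at the image border is an assumption, since the padding strategy is not stated, and the function name is hypothetical.

```python
def extract_patch(image, row, col, size=11):
    """Return the size x size neighbourhood centred on (row, col).

    Pixels that fall outside the image are filled with zeros (assumed padding).
    """
    half = size // 2
    height, width = len(image), len(image[0])
    patch = []
    for r in range(row - half, row + half + 1):
        patch_row = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < height and 0 <= c < width
            patch_row.append(image[r][c] if inside else 0.0)
        patch.append(patch_row)
    return patch


# Toy 3 x 3 raster; a 3 x 3 patch around the top-left pixel needs padding.
raster = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
corner_patch = extract_patch(raster, 0, 0, size=3)
```

For the multi-band PCA-reduced HSI, the same window would be applied to every retained band.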

3. Experiments

To validate the effectiveness of the proposed framework, experiments are conducted on three publicly available multimodal datasets. First, we present detailed information about the three datasets used in the experiments. Subsequently, we describe the specific experimental setup and demonstrate the role of each module within the framework through ablation studies. Finally, the experimental results show a significant advantage compared to existing methods.

3.1. Experimental Setup

3.1.1. Comparison with Other Classification Methods

To verify the effectiveness in classification tasks, we conducted comparative experiments with several representative classification methods, such as MSA-GCN [52], S3F2Net [53], AMSSENet [54], MACN [42], DSHFNet [39], Exvit [38], S2ATNet [55], and HCT [40]. For a fair comparison, the network parameters of these methods were set according to the descriptions in their respective references.

3.1.2. Evaluation Indicators

To evaluate the classification performance of the proposed framework and the compared methods, four widely used quantitative criteria were calculated: overall accuracy (OA), average accuracy (AA), the Kappa coefficient (K), and per-class accuracy. For each metric, a higher value indicates more accurate classification.
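These criteria follow their standard definitions and can all be derived from a confusion matrix, as in the sketch below. The function name and the example matrix are hypothetical, and every class is assumed to have at least one test sample.

```python
def classification_metrics(confusion):
    """Compute OA, AA, Kappa, and per-class accuracy.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]

    # Overall accuracy: fraction of all samples on the diagonal.
    oa = sum(confusion[i][i] for i in range(n_classes)) / total
    # Per-class accuracy and its unweighted mean (average accuracy).
    per_class = [confusion[i][i] / row_sums[i] for i in range(n_classes)]
    aa = sum(per_class) / n_classes
    # Kappa corrects OA for the agreement expected by chance.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa, per_class


# Illustrative two-class confusion matrix (not taken from the paper).
oa, aa, kappa, per_class = classification_metrics([[8, 2], [1, 9]])
```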

3.2. Dataset Description

3.2.1. Houston2013

This public remote sensing dataset was released for the IEEE GRSS 2013 Data Fusion Contest and covers the University of Houston and surrounding urban areas in the United States. The data include a 144-band hyperspectral image and a single-band LiDAR image, both with a spatial resolution of 2.5 m and a size of 349 × 1905 pixels, and contain 15 classes of urban land cover. Figure 5 shows the dataset.

3.2.2. Trento

Acquired in a rural area south of Trento, Italy, this dataset comprises a hyperspectral image (63 bands, 420–989 nm) and LiDAR data, both with a spatial resolution of 1 m and an image size of 600 × 166 pixels, covering 6 rural land cover classes. Figure 6 shows the dataset.

3.2.3. MUUFL

Collected at the University of Southern Mississippi’s Gulf Park campus, this dataset includes a preprocessed 64-band HSI and LiDAR data in DSM format. The HSI resolution is 0.54 × 1.0 m, the LiDAR resolution is 0.60 × 0.78 m, and the image size is 325 × 220 pixels, covering 11 campus land cover classes. Figure 7 shows the dataset.

The datasets are available at (accessed on 1 November 2024): https://github.com/AnkurDeria/MFT. The names of land categories, along with the numbers of training and testing samples used in the experiments for the three datasets mentioned above, are presented in Table 1.

3.2.4. Configurations

All experiments in this study were implemented using the PyTorch (2.0.1) framework and conducted on a server equipped with an Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10 GHz and a 24 GB Nvidia GeForce RTX 4090 GPU. In our method, we use the Adam optimizer to update the network parameters with a learning rate of 1 × 10⁻³ and a batch size of 64. The Houston2013 dataset is trained for 500 epochs, while the Trento and MUUFL datasets are trained for 100 epochs. The number of GMFs N is set to 30 for all experiments. The mean (

μ

) and the standard deviation (

σ

) are randomly initialized from the standard normal distribution (zero mean, unit variance).

3.3. Classification Results and Analysis

(1) For the Houston2013 dataset, the classification results of various models are presented in Table 2, with corresponding visualized results shown in Figure 8. As seen from the table, the proposed method achieves the best results across the three key metrics—overall accuracy (OA), average accuracy (AA), and Kappa coefficient—reaching 93.82%, 94.68%, and 93.32%, respectively, significantly outperforming other compared methods in overall performance.

Specifically, compared with MSA-GCN, S3F2Net, AMSSENet, MACN, DSHFNet, Exvit, S2ATNet, and HCT, the proposed method improves OA by 4.38%, 0.41%, 1.65%, 4.22%, 6.77%, 3.22%, 0.50%, and 2.15%, respectively, fully demonstrating the effectiveness of the proposed model in fusing hyperspectral and LiDAR data. From the classification maps in Figure 8, it is evident that other models generally suffer from issues such as blurred boundaries and insufficient fine-grained recognition. For example, MSA-GCN shows misclassification and severe boundary confusion in Residential (C7) and Road (C9). S3F2Net and S2ATNet perform steadily in multiple categories but exhibit local noise in structurally complex categories. MACN and AMSSENet have certain recognition capabilities for primary categories like Healthy Grass (C1) but show unsatisfactory performance in secondary categories such as Tennis Court (C14). DSHFNet and Exvit lack effective handling of fine spatial boundaries, resulting in block-like mixed regions in some categories. Although HCT performs strongly overall, misclassifications still occur in categories like Highway (C10).

In contrast, the proposed method achieves clearer and more accurate distinctions in multiple complex categories, particularly excelling in boundary-complex classes such as Stressed Grass (C2), Tennis Court (C14), and Running Track (C15), further proving the advantage of the fuzzy enhancement mechanism in improving feature discriminability and robustness.

(2) The classification results on the Trento dataset are shown in Table 3. From the detailed classification results, the proposed method achieves an overall accuracy (OA) of 97.53%, an average accuracy (AA) of 96.18%, and a Kappa coefficient of 96.71%. For specific categories such as Woods (C4) and Vineyard (C5), the proposed model also demonstrates very high accuracy, reaching 100% and

99.54 \pm 1.17

%, respectively, indicating its strong classification ability in these particular categories. From the classification map in Figure 9, it can be observed that MSA-GCN and HCT exhibit some confusion at category boundaries, while MACN and AMSSENet show stronger recognition capabilities for certain classes. DSHFNet and Exvit do not show significant effectiveness in handling spatial boundaries. Although S3F2Net and S2ATNet perform well overall, there is still slight noise at complex boundaries. By introducing FLM and FFM, the proposed model demonstrates superior classification performance. The model design achieves a more effective integration of features extracted by multiple membership functions, forming more comprehensive and accurate generalized features, thereby improving classification accuracy and robustness.

(3) As shown in Table 4, the classification performance of each model on the MUUFL dataset further validates the effectiveness of the proposed method. The proposed method achieves the best results across three key metrics: overall accuracy (OA), average accuracy (AA), and Kappa coefficient, reaching 87.90%, 89.05%, and 84.51%, respectively, outperforming other state-of-the-art approaches. Specifically, compared with MSA-GCN, S3F2Net, AMSSENet, MACN, DSHFNet, Exvit, S2ATNet, and HCT, the proposed method improves OA by approximately 0.18%, 0.53%, 1.02%, 1.62%, 1.17%, 1.04%, 0.73%, and 2.24%, demonstrating stable and leading classification performance.

From a category-level perspective, the proposed method achieves the best results in multiple key classes. For example, it attains 91.07% accuracy in the majority Trees (C1) class, which is significantly higher than other models (such as MSA-GCN’s 84.25%). Additionally, it obtains accuracies of 83.10%, 76.75%, and 95.73% for Mixed Ground Surface (C3), Sidewalk (C9), and Yellow Curb (C10), respectively, exhibiting strong discriminative ability for spectrally similar classes. It is also noteworthy that a 100% recognition rate was achieved for the Cloth Panels (C11) category, demonstrating outstanding ability to detect this prominent target type. The classification map comparisons in Figure 10 show that MSA-GCN and AMSSENet have more noise and confusion at boundaries; MACN and Exvit exhibit discontinuities in complex regions, indicating that the classification accuracy of these networks is challenged when dealing with intricate backgrounds or detail-rich scenarios; S3F2Net and S2ATNet show certain classification confusions; and HCT contains more noise. By contrast, the proposed method’s classification maps have relatively smooth boundaries and significantly reduced noise. Combined quantitative and visual results indicate that the proposed method has clear advantages in spectral–spatial information fusion and noise suppression.

(4) With regard to feature distribution analysis, Figure 11, Figure 12 and Figure 13 present t-SNE feature visualizations of three datasets (Houston, Trento, and MUUFL), comparing the distribution performance of four methods (HCT, Exvit, DSHFNet, and the proposed method) in the feature space. Different colors are used in each subplot to represent different categories, allowing for a clear observation of the classification results. The results indicate that the proposed method exhibits the best intra-class compactness, clear boundaries, and more distinct category distributions across all datasets. This suggests that the proposed method can effectively cluster samples from the same class together and reduce the overlap between classes. In contrast, HCT, Exvit, and DSHFNet exhibit varying degrees of inter-class mixing or weaker clustering effects, leading to relatively inadequate feature separability. This may be because these methods, when dealing with complex data, fail to effectively capture the important differences between features, resulting in suboptimal classification performance. Overall, the proposed method consistently demonstrates superior feature separability and clustering performance across different datasets. This result indicates that the proposed method is more capable of distinguishing between different land cover categories and can more clearly reflect the features and attributes of different geographical regions.

4. Discussion

4.1. Parameter Analysis

(1) Patch Size: When all other parameters are fixed, an appropriate patch size can ensure the model captures sufficient local features, thereby achieving the best overall accuracy (OA). Therefore, we selected patch sizes from the candidate set comprising 7, 9, 11, 13, and 15 to evaluate their effects. Figure 14a shows the impact of different patch sizes on OA. When the patch size is set to 11, the OA values for all three datasets reach their optimum.

(2) Spectral Dimension: To evaluate the effect of different spectral dimensions, other hyperparameters were fixed, and values were selected from the candidate set comprising 10, 15, 20, 25, and 30. Figure 14b shows the variation in overall accuracy on the three datasets by changing the spectral dimension. The best classification performance was achieved when we chose 30.

(3) Token Number: Figure 14c shows the impact of different token numbers on the overall accuracy (OA) of the three datasets. In the parameter experiment, the token number was selected from the candidate set comprising 2, 4, 6, 8, and 10. The analysis results indicate that setting the token number to 4 achieves the highest overall accuracy across all datasets.

(4) Batch Size and Learning Rate: To explore the network’s performance under different batch size and learning rate settings, we conducted experiments with various combinations. As shown in Figure 15, on the Houston2013 dataset, decreasing the batch size and increasing the learning rate generally lead to improved overall accuracy (OA); with a large batch size (256), OA is relatively low, but it improves as the batch size decreases. On the Trento dataset, performance declines with either very large or very small batch sizes and with low learning rates; a moderate batch size and a relatively high learning rate are beneficial for OA improvement. The MUUFL dataset exhibits a similar trend: overly large batch sizes or very low learning rates may cause overfitting and degrade classification. When the batch size is 64 and the learning rate is 1 × 10⁻³, the best OA values are achieved across all three datasets.

(5) Fuzzy Set Quantity Impact Analysis: In the proposed method, the number of fuzzy membership sets N determines the granularity of modeling mixed pixel uncertainty. Therefore, we conducted comparative experiments on three datasets, setting N to {10, 20, 30, 40, 50}. As shown in Figure 16, on the Houston2013 dataset, when N is 20, the overall accuracy (OA) reaches 93.67%, a significant improvement compared to 93.29% when using 10 sets. When N is 30, the OA further improves to 93.82%, fully demonstrating the effectiveness of a moderate number of fuzzy sets. On the Trento dataset, the best performance is achieved when N is 30, with an OA of 97.53%, average accuracy (AA) of 96.71%, and Kappa coefficient of 98.05%, which is an increase of 1.32%, 3.44%, and 3.11%, respectively, compared to using N of 10, and avoids the performance degradation caused by excessive fuzzy sets (40,50). Similarly, for the MUUFL dataset, when N is 30, the OA slightly increases to 87.90%, but when the number of sets increases to 50, the Kappa coefficient dramatically decreases to 80.54%, which is a decrease of 3.97%, indicating that the spectral–spatial features of this dataset are very sensitive to the over-parameterization of fuzzy rules. Therefore, setting the number of fuzzy sets to 30 achieves the best balance of OA, AA, and Kappa coefficients across all datasets, providing sufficient expressiveness to model complex uncertainties while avoiding overfitting due to over-parameterization.

4.2. Performance Analysis of Different Training Samples

To evaluate the stability and robustness of the proposed method under different training sample percentages, we compared it with MSA-GCN, S3F2Net, AMSSENet, MACN, DSHFNet, Exvit, S2ATNet, and HCT using various training sample sizes. For the Houston2013, Trento, and MUUFL datasets, we randomly selected 10, 20, 30, 40, and 50 labeled samples for training. The experimental results are shown in Figure 17. Clearly, the overall accuracy (OA) increases with the number of training samples. The proposed method maintains strong performance even with a limited number of samples, demonstrating excellent feature extraction capabilities. In particular, fuzzy membership and fuzzy fusion techniques help the model handle uncertainty and improve classification accuracy in few-shot scenarios.

4.3. Robustness Against Noise

Figure 18 illustrates the classification results on the Houston2013 hyperspectral dataset under different signal-to-noise ratio (SNR) conditions: SNR values of 20, 40, and 60 dB and a noise-free control group. In this experiment, Gaussian noise is added only to the original hyperspectral data, after data loading but before PCA dimensionality reduction and normalization. The signal power is denoted as

P_{s}

, and the noise is zero-mean Gaussian noise with noise variance

σ^{2}

. The relationship between the signal-to-noise ratio and noise variance is expressed as

\mathrm{SNR}_{dB} = 10 \log_{10} (\frac{P_{s}}{σ^{2}})

. Therefore, by adjusting the variance of the Gaussian noise, different noise intensities under various SNR conditions are realized, enabling the evaluation of the model’s robustness and performance under different noise levels. The magnified detail comparison on the right side of the figure shows that as the SNR increases, the classification results gradually approach the ideal performance under noiseless conditions. At a low SNR of 20 dB, noise has a slight impact on classification, resulting in a small number of misclassifications near some object boundaries; however, as the SNR increases to 40 and 60 dB, classification accuracy improves and misclassifications are greatly reduced. At 60 dB in particular, the classification results are nearly identical to those of the noise-free scenario. This outcome indicates that the proposed method has strong noise suppression capability and robustness, maintaining excellent classification performance in noisy environments and supporting the stability and anti-interference ability of the DFNet model when processing noisy hyperspectral data.
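The relationship between SNR and noise variance translates directly into a noise-injection routine: solving the SNR definition for the variance gives σ² = P_s / 10^(SNR/10). The sketch below is a minimal illustration on a flat list of values; estimating the signal power as the mean squared value over all entries is an assumption (the paper does not state whether the power is computed globally or per band), and the function name is hypothetical.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, rng=None):
    """Add zero-mean Gaussian noise so that the result has the requested SNR in dB."""
    rng = rng or random.Random()
    # Signal power P_s, estimated as the mean squared value (assumption).
    power = sum(x * x for x in signal) / len(signal)
    # Invert SNR_dB = 10 * log10(P_s / sigma^2) to obtain the noise variance.
    noise_var = power / (10 ** (snr_db / 10))
    noise_std = math.sqrt(noise_var)
    return [x + rng.gauss(0.0, noise_std) for x in signal]


# Illustrative signal; the seed only makes the example repeatable.
clean = [math.sin(0.01 * i) + 2.0 for i in range(20000)]
noisy = add_gaussian_noise(clean, snr_db=20, rng=random.Random(0))
```

A 20 dB increase in SNR reduces the noise variance by a factor of 100, which is consistent with the 60 dB results being close to the noise-free case.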

4.4. Welch’s t-Test

To comprehensively evaluate the significant differences between our proposed method and eight existing methods, we employed Welch’s t-test, which is a statistical method used for significance analysis. In this method, the t-value and p-value are two key indicators. The t-value serves as the standardized measure of the difference between two group means, indicating that the larger the t-value, the more significant the difference between the two sample means. The p-value represents the statistical significance of the observed difference and is used to determine whether the difference might be caused by random error. In this study, we set the significance level at 0.05. If the p-value is below this threshold, we have sufficient reason to reject the null hypothesis, demonstrating the existence of a significant difference. To ensure test validity, normality and homogeneity of variance tests were conducted: all data approximately follow a normal distribution and the group variances are largely homogeneous. Therefore, the data meet the assumptions for Welch’s t-test, guaranteeing the rationality and stability of the statistical analysis. Table 5 presents the detailed results of Welch’s t-test significance analysis, based on the overall accuracy calculated from 10 experimental runs. In the Houston dataset, our proposed method shows significant statistical differences compared to all eight comparison methods; among them, the p-values relative to MSA-GCN, AMSSENet, MACN, and DSHFNet are relatively small, confirming the superiority of our method. Meanwhile, in the Trento dataset, five out of eight comparison methods exhibit significant differences. Although MSA-GCN, S3F2Net, and AMSSENet did not reach significance, all their t-values remain positive. Finally, in the MUUFL dataset, the p-values of MSA-GCN, AMSSENet, Exvit, and HCT are all below 0.05, indicating highly significant differences. 
Overall, the statistical analysis results strongly demonstrate the effectiveness and superiority of our proposed method.
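For reference, the t-value used in Welch's test and its Welch–Satterthwaite degrees of freedom can be computed from two sets of per-run accuracies as sketched below. The function name and the sample values are hypothetical and are not taken from Table 5. The p-value additionally requires the Student t distribution; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the full test.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard errors of the two sample means (sample variance / n).
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1))
    return t, df


# Hypothetical OA values (in %) from five runs of two methods.
ours = [93.1, 93.9, 94.2, 93.5, 94.0]
baseline = [91.2, 92.5, 90.8, 92.0, 91.6]
t_value, dof = welch_t(ours, baseline)
```

A positive t-value here means the first method has the higher mean accuracy; significance is then judged by comparing the p-value with the 0.05 threshold.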

4.5. Training Time Efficiency Comparison of Models

As shown in Figure 19, the classification performance and training efficiency of the different models are compared on the three datasets: Houston2013, Trento, and MUUFL. The vertical and horizontal coordinates of each bubble center give the overall accuracy and average accuracy, respectively, while the bubble size reflects the training time of the model (measured in seconds). Larger bubbles indicate longer training times, and all training times were measured under the same hardware conditions. The bubble chart illustrates the trade-off between model performance and training efficiency. The proposed model attains the highest classification accuracy on all three datasets with a short training time. Although the HCT model requires less training time on the Houston2013 and MUUFL datasets, its accuracy is significantly lower than that of our method, indicating that our model offers a favorable balance between efficiency and performance.

4.6. Ablation Analysis

Adding the fuzzy learning module (FLM) on the Houston2013 dataset improves the overall accuracy (OA) to

92.64 \pm 0.23 %

, average accuracy (AA) to

93.75 \pm 0.14 %

, and Kappa coefficient to

92.04 \pm 0.29 %

. Incorporating the Fuzzy Fusion Module (FFM) optimizes OA to

93.82 \pm 0.47 %

, AA to

94.68 \pm 0.49 %

, and Kappa to

93.32 \pm 0.51 %

. Regardless of the dataset, the inclusion of FLM and FFM significantly enhances the model’s performance. By introducing fuzzy learning, the ability to model boundary ambiguity in the data is effectively strengthened, enabling refined processing of fuzzy and uncertain information, thereby achieving performance improvements across all datasets. More advanced feature fusion techniques further enhance the model’s classification accuracy and stability. As summarized in Table 6, the performance metrics further validate the effectiveness of the fuzzy learning module (FLM) and Fuzzy Fusion Module (FFM).

5. Conclusions

This paper proposes a Deep Fuzzy Fusion Network (DFNet) designed for the joint classification of hyperspectral imaging (HSI) and LiDAR data. DFNet employs a dual-branch architecture that integrates Convolutional Neural Networks (CNNs) and Transformers to extract multi-scale spatial–spectral features from both hyperspectral and LiDAR data. To enhance the model’s discriminative robustness in ambiguous regions, both branches incorporate fuzzy learning modules that model class uncertainty using learnable Gaussian membership functions, thereby improving the responsiveness in boundary areas. In the modality fusion stage, a Fuzzy-Enhanced Cross-Modal Fusion (FECF) module is designed, which combines membership-aware attention mechanisms with fuzzy inference operators to dynamically adjust modality feature weights and efficiently integrate complementary information. Through a hierarchical design, DFNet effectively represents intra-modal uncertainty and regulates inter-modal fusion, significantly improving classification accuracy in complex structures and ambiguous boundary regions. Future work will explore more efficient fuzzy learning and nonlinear feature extraction techniques to strengthen the model’s discriminative capability in boundary-ambiguous areas and enhance the modeling accuracy of fine-grained features.

Author Contributions

Conceptualization, G.L., J.S. and Y.C.; Formal analysis, G.L. and J.S.; Methodology, G.L., J.S. and Y.C.; Software, J.S.; Supervision, G.L., Y.C., P.L. and J.X.; Writing—original draft, G.L. and J.S.; Writing—review and editing, G.L., Y.C., L.Z., P.L. and J.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62206087 and by the High-level Talent Program of Henan University of Technology, China (Grant No. 31401586).

Data Availability Statement

Data are available in Section 3.2.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Badidová, B.; Forgáč, R.; Očkay, M.; Javurek, M.; Krammer, P.; Hluchý, L. A Dual-Camera Analysis of PCA Coefficients for Hyperspectral Classification of Tree Species. In Proceedings of the 2025 Cybernetics & Informatics (K&I), Mikulov na Morave, Czech Republic, 2–5 February 2025; pp. 1–5.
Murugan, D.; Garg, A.; Singh, D. Development of an Adaptive Approach for Precision Agriculture Monitoring with Drone and Satellite Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 5322–5328.
Huang, X.; Liu, H.; Zhang, L. Spatiotemporal Detection and Analysis of Urban Villages in Mega City Regions of China Using High-Resolution Remotely Sensed Imagery. IEEE Trans. Geosci. Remote Sens. 2015, 53, 3639–3657.
Kempeneers, P.; Sedano, F.; Seebach, L.; Strobl, P.; San-Miguel-Ayanz, J. Data Fusion of Different Spatial Resolution Remote Sensing Images Applied to Forest-Type Mapping. IEEE Trans. Geosci. Remote Sens. 2011, 49, 4977–4986.
Kendler, S.; Ron, I.; Cohen, S.; Raich, R.; Mano, Z.; Fishbain, B. Detection and Identification of Sub-Millimeter Films of Organic Compounds on Environmental Surfaces Using Short-Wave Infrared Hyperspectral Imaging: Algorithm Development Using a Synthetic Set of Targets. IEEE Sens. J. 2019, 19, 2657–2664.
Li, H. An Overview on Remote Sensing Image Classification Methods with a Focus on Support Vector Machine. In Proceedings of the 2021 International Conference on Signal Processing and Machine Learning (CONF-SPML), Stanford, CA, USA, 14 November 2021; pp. 50–56.
Liu, G.; Wang, L.; Fei, L.; Liu, D.; Yang, J. Hyperspectral Image Classification Based On Fuzzy Nonparallel Support Vector Machine. In Proceedings of the 2022 Global Conference on Robotics, Artificial Intelligence and Information Technology (GCRAIT), Chicago, IL, USA, 30–31 July 2022; pp. 242–246.
Gu, Y.; Chanussot, J.; Jia, X.; Benediktsson, J.A. Multiple Kernel Learning for Hyperspectral Image Classification: A Review. IEEE Trans. Geosci. Remote Sens. 2017, 55, 6547–6565.
Lim, K.M.; Lee, C.P.; Zahisham, Z.; Lim, J.Y.; Mogan, J.N. PCA-ViT: Hyperspectral Image Classification using Principal Component Analysis and Vision Transformer. In Proceedings of the 2024 IEEE 12th Conference on Systems, Process & Control (ICSPC), Malacca, Malaysia, 7 December 2024; pp. 30–34.
Zhu, X.-H.; Li, K.-R.; Deng, Y.-J.; Long, C.-F.; Wang, W.-Y.; Tan, S.-Q. Center-Highlighted Multiscale CNN for Classification of Hyperspectral Images. Remote Sens. 2024, 16, 4055.
Shi, K.; Liu, Q.; Zheng, Z.; Xiao, L. Efficient Implementation for Composite CNN-Based HSI Classification Algorithm with Huawei Ascend Framework. In Proceedings of the 2023 13th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), Athens, Greece, 31 October–2 November 2023; pp. 1–5.
Chhapariya, K.; Buddhiraju, K.M.; Kumar, A. Spectral-Spatial Classification of Hyperspectral Images with Multi-Level CNN. In Proceedings of the 2022 12th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), Rome, Italy, 13–16 September 2022; pp. 1–5.
Kang, B.; Kim, S. Analysis of the influence of 3D-CNN on spatial random information in hyperspectral image classification. In Proceedings of the 2021 21st International Conference on Control, Automation and Systems (ICCAS), Jeju, Republic of Korea, 12–15 October 2021; pp. 1123–1126.
He, X.; Chen, Y. Transferring CNN Ensemble for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2021, 18, 876–880.
Yadav, P.P.; Shetty, A.; Raghavendra, B.S.; Narasimhadhan, A.V. 1-D CNN for Mineral Classification using Hyperspectral Data. In Proceedings of the 2023 IEEE India Geoscience and Remote Sensing Symposium (InGARSS), Bangalore, India, 10–13 December 2023; pp. 1–4.
Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep Convolutional Neural Networks for Hyperspectral Image Classification. J. Sens. 2015, 2015, 258619.
Zhao, W.; Du, S. Spectral–Spatial Feature Extraction for Hyperspectral Image Classification: A Dimension Reduction and Deep Learning Approach. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4544–4554.
Meng, Z.; Li, L.; Jiao, L.; Feng, Z.; Tang, X.; Liang, M. Fully Dense Multiscale Fusion Network for Hyperspectral Image Classification. Remote Sens. 2019, 11, 2718.
Zhu, M.; Jiao, L.; Liu, F.; Yang, S.; Wang, J. Residual Spectral–Spatial Attention Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 449–462.
Meng, Z.; Jiao, L.; Liang, M.; Zhao, F. A Lightweight Spectral-Spatial Convolution Module for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 5505105.
Wang, A.; Lei, G.; Dai, S.; Wu, H.; Iwahori, Y. Multiscale Attention Feature Fusion Based on Improved Transformer for Hyperspectral Image and LiDAR Data Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 4124–4140.
Li, Z.; Liu, R.; Sun, L.; Zheng, Y. Multi-Feature Cross Attention-Induced Transformer Network for Hyperspectral and LiDAR Data Classification. Remote Sens. 2024, 16, 2775.
Feng, S.; Deng, H.; Hu, Y.; Zhao, C.; Li, W.; Tao, R. Fractional Fourier-Enhanced Fusion Network Based on Pareto Optimization for Hyperspectral and LiDAR Data Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5513416.
Yang, J.X.; Wang, J.; Li, Z.; Sui, C.; Long, Z.; Zhou, J. HSLiNets: Evaluating Band Ordering Strategies in Hyperspectral and LiDAR Fusion. IEEE Geosci. Remote Sens. Lett. 2025, 22, 5505605.
Pan, H.; Li, X.; Ge, H.; Wang, L.; Shi, C. A Hierarchical Coarse–Fine Adaptive Fusion Network for the Joint Classification of Hyperspectral and LiDAR Data. Remote Sens. 2024, 16, 4029.
Wang, S.; Hou, C.; Chen, Y.; Liu, Z.; Zhang, Z.; Zhang, G. Classification of Hyperspectral and LiDAR Data Using Multi-Modal Transformer Cascaded Fusion Net. Remote Sens. 2023, 15, 4142. [Google Scholar] [CrossRef]
Colgan, M.S.; Baldeck, C.A.; Féret, J.-B.; Asner, G.P. Mapping Savanna Tree Species at Ecosystem Scales Using Support Vector Machine Classification and BRDF Correction on Airborne Hyperspectral and LiDAR Data. Remote Sens. 2012, 4, 3462–3480. [Google Scholar] [CrossRef]
Huang, R.; Zhu, J. Using Random Forest to Integrate LiDAR Data and Hyperspectral Imagery for Land Cover Classification. In Proceedings of the 2013 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Melbourne, VIC, Australia, 21–26 July 2013; pp. 3978–3981. [Google Scholar] [CrossRef]
Xu, X.; Li, W.; Ran, Q.; Du, Q.; Gao, L.; Zhang, B. Multisource Remote Sensing Data Classification Based on Convolutional Neural Network. IEEE Trans. Geosci. Remote Sens. 2018, 56, 937–949. [Google Scholar] [CrossRef]
Hang, R.; Li, Z.; Ghamisi, P.; Hong, D.; Xia, G.; Liu, Q. Classification of Hyperspectral and LiDAR Data Using Coupled CNNs. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4939–4950. [Google Scholar] [CrossRef]
Bhavanam, S.R. Vision Transformer-Driven LiDAR Data Fusion for Enhanced Hyperspectral Image Classification. In Proceedings of the 2024 IEEE India Geoscience and Remote Sensing Symposium (InGARSS), Goa, India, 2–5 December 2024; pp. 1–4. [Google Scholar] [CrossRef]
Zhu, F.; Shi, C.; Shi, K.; Wang, L. Joint Classification of Hyperspectral and LiDAR Data Using Hierarchical Multimodal Feature Aggregation-Based Multihead Axial Attention Transformer. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5503817. [Google Scholar] [CrossRef]
Yang, B.; Wang, X.; Xing, Y.; Cheng, C.; Jiang, W.; Feng, Q. Modality Fusion Vision Transformer for Hyperspectral and LiDAR Data Collaborative Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 17052–17065. [Google Scholar] [CrossRef]
Wang, Z.; Wang, Q.; Zhang, J.; Liang, X. Joint Classification of Hyperspectral and LiDAR Data Based on Heterogeneous Attention Feature Fusion Network. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 8616–8619. [Google Scholar] [CrossRef]
Huang, W.; Wu, T.; Zhang, X.; Li, L.; Lv, M.; Jia, Z.; Zhao, X.; Ma, H.; Vivone, G. MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 12803–12818. [Google Scholar] [CrossRef]
Feng, M.; Gao, F.; Fang, J.; Dong, J. Hyperspectral and Lidar Data Classification Based on Linear Self-Attention. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Brussels, Belgium, 11–16 July 2021; pp. 2401–2404. [Google Scholar] [CrossRef]
Ding, K.; Lu, T.; Fu, W.; Li, S.; Ma, F. Global–Local Transformer Network for HSI and LiDAR Data Joint Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5541213. [Google Scholar] [CrossRef]
Yao, J.; Zhang, B.; Li, C.; Hong, D.; Chanussot, J. Extended Vision Transformer (ExViT) for Land Use and Land Cover Classification: A Multimodal Deep Learning Framework. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5514415. [Google Scholar] [CrossRef]
Feng, Y.; Song, L.; Wang, L.; Wang, X. DSHFNet: Dynamic Scale Hierarchical Fusion Network Based on Multiattention for Hyperspectral Image and LiDAR Data Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5522514. [Google Scholar] [CrossRef]
Zhao, G.; Ye, Q.; Sun, L.; Wu, Z.; Pan, C.; Jeon, B. Joint Classification of Hyperspectral and LiDAR Data Using a Hierarchical CNN and Transformer. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5500716. [Google Scholar] [CrossRef]
Sun, L.; Wang, X.; Zheng, Y.; Wu, Z.; Fu, L. Multiscale 3-D–2-D Mixed CNN and Lightweight Attention-Free Transformer for Hyperspectral and LiDAR Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 2100116. [Google Scholar] [CrossRef]
Li, K.; Wang, D.; Wang, X.; Liu, G.; Wu, Z.; Wang, Q. Mixing Self-Attention and Convolution: A Unified Framework for Multisource Remote Sensing Data Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5523216. [Google Scholar] [CrossRef]
Wang, X.; Zhu, J.; Feng, Y.; Wang, L. MS2CANet: Multiscale Spatial–Spectral Cross-Modal Attention Network for Hyperspectral Image and LiDAR Classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5501505. [Google Scholar] [CrossRef]
Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353. [Google Scholar] [CrossRef]
Gu, X.; Han, J.; Shen, Q.; Angelov, P.P. Autonomous learning for fuzzy systems: A review. Artif. Intell. Rev. 2023, 56, 7549–7595. [Google Scholar] [CrossRef]
Ding, W.; Zhou, T.; Huang, J.; Jiang, S.; Hou, T.; Lin, C.-T. FMDNN: A Fuzzy-Guided Multigranular Deep Neural Network for Histopathological Image Classification. IEEE Trans. Fuzzy Syst. 2024, 32, 4709–4723.
Talpur, N.; Abdulkadir, S.J.; Alhussian, H.; Hasan, M.H.; Aziz, N.; Bamhdi, A. Deep Neuro-Fuzzy System application trends, challenges, and future perspectives: A systematic survey. Artif. Intell. Rev. 2023, 56, 865–913.
Cui, Y.; Wu, D.; Xu, Y. Curse of Dimensionality for TSK Fuzzy Neural Networks: Explanation and Solutions. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8.
Apiecionek, L. Fuzzy Neural Networks—A Review with Case Study. Appl. Sci. 2025, 15, 6980.
Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–Spatial Feature Tokenization Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5522214.
Chen, C.-F.R.; Fan, Q.; Panda, R. CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 347–356.
Alowonou, K.C.; Han, J.-H. MSA-GCN: Exploiting Multi-Scale Temporal Dynamics With Adaptive Graph Convolution for Skeleton-Based Action Recognition. IEEE Access 2024, 12, 193552–193563.
Wang, X.; Song, L.; Feng, Y.; Zhu, J. S3F2Net: Spatial-Spectral-Structural Feature Fusion Network for Hyperspectral Image and LiDAR Data Classification. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 4801–4815.
Gao, H.; Feng, H.; Zhang, Y.; Xu, S.; Zhang, B. AMSSE-Net: Adaptive Multiscale Spatial–Spectral Enhancement Network for Classification of Hyperspectral and LiDAR Data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5531317.
Ni, K.; Li, Z.; Yuan, C.; Zheng, Z.; Wang, P. Selective Spectral–Spatial Aggregation Transformer for Hyperspectral and LiDAR Classification. IEEE Geosci. Remote Sens. Lett. 2025, 22, 5501205.

Figure 1. Overall architecture of the proposed DFNet.

Figure 2. Fuzzy learning module (FLM) structure.

Figure 3. The cross-attention module for the HSI token branch.

Figure 4. Fuzzy Fusion Module (FFM) structure.

Figure 5. Houston2013. (a) Pseudo-color image of the HSI. (b) Gray image of the LiDAR-based DSM. (c) Ground-truth map.

Figure 6. Trento. (a) Pseudo-color image of HSI. (b) Gray image of the LiDAR-based DSM. (c) Ground-truth map.

Figure 7. MUUFL. (a) Pseudo-color image of the HSI. (b) Gray image of the LiDAR-based DSM. (c) Ground-truth map.

Figure 8. Classification visualization of different methods on the Houston2013 dataset. (a) Ground-truth labels. (b) MSA-GCN. (c) S3F2Net. (d) AMSSENet. (e) MACN. (f) DSHFNet. (g) Exvit. (h) S2ATNet. (i) HCT. (j) Proposed.

Figure 9. Classification visualization of different methods on the Trento dataset. (a) Ground-truth labels. (b) MSA-GCN. (c) S3F2Net. (d) AMSSENet. (e) MACN. (f) DSHFNet. (g) Exvit. (h) S2ATNet. (i) HCT. (j) Proposed.

Figure 10. Classification visualization of different methods on the MUUFL dataset. (a) Ground-truth labels. (b) MSA-GCN. (c) S3F2Net. (d) AMSSENet. (e) MACN. (f) DSHFNet. (g) Exvit. (h) S2ATNet. (i) HCT. (j) Proposed.

Figure 11. Feature visualization of different methods on Houston2013: (a) HCT. (b) Exvit. (c) DSHFNet. (d) DFNet.

Figure 12. Feature visualization of different methods on Trento: (a) HCT. (b) Exvit. (c) DSHFNet. (d) DFNet.

Figure 13. Feature visualization of different methods on MUUFL: (a) HCT. (b) Exvit. (c) DSHFNet. (d) DFNet.

Figure 14. Impact of different parameters on multiple datasets: (a) Patch size. (b) Spectral dimension. (c) Token number.

Figure 15. Impact of learning rate and batch size on performance: (a) Houston2013, (b) Trento, and (c) MUUFL.

Figure 16. Impact of the number of fuzzy sets on performance: (a) Houston2013, (b) Trento, and (c) MUUFL.

Figure 17. The impact of different methods using varying amounts of training samples on classification accuracy (OA): (a) Houston2013, (b) Trento, and (c) MUUFL.

Figure 18. Classification maps of the Houston2013 dataset under different signal-to-noise ratio (SNR) conditions: (a) SNR = 20; (b) SNR = 40; (c) SNR = 60; (d) no noise added.

Figure 19. Comparison of different models in terms of OA, AA, and training time. (a) Houston2013. (b) Trento. (c) MUUFL. This figure consists of three bubble charts, each showing OA values on the y-axis and AA values on the x-axis. It is important to note that the center coordinates of each bubble represent the model’s OA and AA values, while the size of the bubble indicates the training time of different models.

Table 1. Class distribution of training and testing samples on the Houston2013, MUUFL, and Trento datasets.

No.	Houston2013			Trento			MUUFL
No.	Class Name	Train	Test	Class Name	Train	Test	Class Name	Train	Test
C1	Healthy Grass	20	1231	Apple Trees	10	4024	Trees	120	23,126
C2	Stressed Grass	20	1234	Buildings	10	2893	Mostly Grass	120	4150
C3	Synthetic Grass	20	677	Ground	10	469	Mixed Ground Surface	120	6762
C4	Trees	20	1224	Wood	10	9113	Dirt and Sand	120	1706
C5	Soil	20	1222	Vineyard	10	10,491	Road	120	6767
C6	Water	20	305	Roads	10	3164	Water	120	346
C7	Residential	20	1248				Building Shadows	120	2113
C8	Commercial	20	1224				Buildings	120	6120
C9	Road	20	1232				Sidewalk	120	1265
C10	Highway	20	1207				Yellow Curb	120	63
C11	Railway	20	1215				Cloth Panels	120	149
C12	Parking Lot 1	20	1213
C13	Parking Lot 2	20	449
C14	Tennis Court	20	408
C15	Running Track	20	640
Total		300	14,729		60	30,154		1320	52,558
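
The counts in Table 1 correspond to a fixed number of training samples per class (20 for Houston2013, 10 for Trento, 120 for MUUFL), with the other labeled pixels used for testing. The following is a minimal sketch of such a split, assuming uniform random sampling within each class; the sampling procedure is not described in this table, and the function name `split_per_class`, the seed, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train, seed=0):
    """Return (train_idx, test_idx) with n_train random samples per class.

    labels: sequence of class ids, one per labeled pixel (hypothetical input).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, cls in enumerate(labels):
        by_class[cls].append(idx)

    train_idx, test_idx = [], []
    for cls in sorted(by_class):
        indices = by_class[cls][:]
        if len(indices) <= n_train:
            raise ValueError(f"class {cls} has only {len(indices)} samples")
        rng.shuffle(indices)
        # First n_train shuffled indices go to training, the rest to testing.
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (assumed): class 0 has 30 pixels, class 1 has 12 pixels.
toy_labels = [0] * 30 + [1] * 12
train, test = split_per_class(toy_labels, n_train=10)
print(len(train), len(test))  # 20 22
```

With 10 training samples per class, the toy example leaves 20 and 2 test pixels for the two classes, which mirrors how small classes such as Yellow Curb end up with very few test samples.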

Table 2. Performance of various classifiers with the Houston2013 dataset (best results are in boldface).

No.	MSA-GCN [52]	S3F2Net [53]	AMSSENet [54]	MACN [42]	DSHFNet [39]	Exvit [38]	S2ATNet [55]	HCT [40]	Proposed
1	95.90 ± 1.77	98.06 ± 0.73	96.36 ± 1.72	92.80 ± 6.13	90.58 ± 0.61	90.56 ± 3.98	97.41 ± 0.91	95.45 ± 2.75	98.33 ± 0.67
2	93.95 ± 2.21	95.23 ± 0.56	96.91 ± 1.99	94.23 ± 5.52	98.71 ± 0.82	95.54 ± 1.56	92.29 ± 0.89	98.01 ± 1.52	98.53 ± 0.61
3	99.99 ± 0.01	99.63 ± 0.45	99.99 ± 0.01	99.49 ± 0.61	100 ± 0	99.28 ± 0.33	99.96 ± 0.56	99.16 ± 0.97	99.58 ± 0.34
4	97.72 ± 2.22	97.41 ± 2.31	98.45 ± 1.72	98.80 ± 0.91	95.01 ± 0.93	92.78 ± 0.45	97.05 ± 0.37	98.27 ± 1.68	97.56 ± 1.91
5	99.91 ± 0.19	97.56 ± 1.12	99.84 ± 0.32	99.65 ± 0.32	99.97 ± 0.04	98.28 ± 0.26	99.91 ± 0.25	99.48 ± 0.29	99.31 ± 0.62
6	96.90 ± 2.97	94.02 ± 2.71	96.27 ± 2.39	99.08 ± 0.63	98.80 ± 0.84	97.73 ± 0.52	99.01 ± 0.53	98.75 ± 0.84	98.44 ± 1.65
7	85.57 ± 3.13	96.75 ± 1.71	94.20 ± 1.85	89.66 ± 4.62	93.40 ± 1.64	92.87 ± 1.17	87.56 ± 1.53	89.59 ± 2.96	93.77 ± 3.22
8	80.65 ± 3.45	82.66 ± 1.78	76.92 ± 5.42	83.39 ± 6.20	92.37 ± 1.36	82.39 ± 2.72	73.15 ± 0.89	87.82 ± 2.28	88.36 ± 3.45
9	60.12 ± 7.23	81.84 ± 0.89	71.60 ± 6.63	86.26 ± 1.99	60.95 ± 2.71	87.52 ± 1.01	82.87 ± 1.23	82.45 ± 2.41	85.76 ± 1.66
10	94.07 ± 4.17	94.42 ± 3.37	95.80 ± 2.16	90.35 ± 8.91	89.90 ± 1.98	91.88 ± 5.61	97.42 ± 0.72	98.22 ± 2.06	95.85 ± 2.61
11	82.15 ± 6.76	96.39 ± 1.07	94.32 ± 1.93	97.46 ± 2.54	96.70 ± 0.45	93.72 ± 1.67	96.37 ± 1.12	93.99 ± 2.77	97.55 ± 1.41
12	87.49 ± 4.11	87.85 ± 0.99	87.75 ± 0.50	92.12 ± 3.17	88.96 ± 1.16	93.59 ± 2.92	95.31 ± 0.43	96.39 ± 2.72	93.55 ± 1.28
13	89.50 ± 4.59	94.98 ± 0.46	85.47 ± 4.20	96.79 ± 0.77	94.54 ± 0.41	94.18 ± 0.36	94.21 ± 0.36	96.44 ± 0.91	93.33 ± 0.82
14	100 ± 0	100 ± 0	99.99 ± 0.01	100 ± 0	99.84 ± 0.24	99.60 ± 0.92	100 ± 0	99.98 ± 0.12	100 ± 0
15	100 ± 0	99.99 ± 0.11	99.99 ± 0.01	99.99 ± 0.02	100 ± 0	100 ± 0	100 ± 0	100 ± 0	100 ± 0
OA (%)	89.44 ± 1.17	93.41 ± 0.93	92.17 ± 0.52	93.16 ± 0.73	92.19 ± 0.36	92.83 ± 0.69	93.32 ± 0.16	91.11 ± 1.17	93.82 ± 0.47
AA (%)	91.01 ± 1.03	94.21 ± 0.27	92.93 ± 0.51	94.31 ± 0.62	93.45 ± 0.31	93.49 ± 0.27	94.08 ± 0.12	92.29 ± 0.88	94.68 ± 0.49
Kappa (%)	88.58 ± 1.27	92.81 ± 0.76	91.53 ± 0.57	92.60 ± 0.57	91.56 ± 0.39	92.59 ± 0.88	92.43 ± 0.27	90.39 ± 1.27	93.32 ± 0.51

Table 3. Performance of various classifiers with the Trento dataset (best results are in boldface).

No.	MSA-GCN [52]	S3F2Net [53]	AMSSENet [54]	MACN [42]	DSHFNet [39]	Exvit [38]	S2ATNet [55]	HCT [40]	Proposed
1	73.74 ± 23.71	99.31 ± 0.57	97.86 ± 1.65	98.33 ± 0.96	99.64 ± 0.04	99.16 ± 0.32	96.15 ± 0.82	98.86 ± 0.58	97.97 ± 1.00
2	92.00 ± 5.36	91.36 ± 1.78	90.77 ± 2.72	92.94 ± 4.88	96.29 ± 0.24	87.54 ± 0.97	93.21 ± 0.56	88.89 ± 2.17	94.57 ± 2.52
3	94.19 ± 1.41	96.53 ± 1.17	90.19 ± 4.72	96.37 ± 1.34	96.97 ± 0.43	85.76 ± 0.29	98.44 ± 1.33	97.65 ± 0.19	96.63 ± 1.41
4	99.98 ± 0.01	100 ± 0	99.99 ± 0.01	100 ± 0	97.84 ± 0.57	99.79 ± 0.71	100 ± 0	100 ± 0	100 ± 0
5	98.86 ± 0.87	99.51 ± 0.32	98.37 ± 1.68	93.11 ± 2.75	98.22 ± 0.75	99.39 ± 0.50	96.55 ± 1.09	97.17 ± 1.27	99.54 ± 1.17
6	94.95 ± 1.20	93.54 ± 1.61	85.29 ± 2.21	89.80 ± 3.73	79.57 ± 0.83	87.74 ± 1.51	92.41 ± 0.79	91.42 ± 0.11	91.95 ± 2.30
OA (%)	94.70 ± 3.21	97.36 ± 0.89	96.56 ± 0.55	95.58 ± 0.92	96.14 ± 0.21	96.21 ± 0.32	97.07 ± 0.76	95.94 ± 1.21	97.53 ± 0.78
AA (%)	92.29 ± 4.32	95.97 ± 0.51	93.74 ± 1.11	95.09 ± 0.96	94.76 ± 0.16	93.62 ± 0.11	95.71 ± 0.56	94.71 ± 2.31	96.18 ± 1.09
Kappa (%)	92.92 ± 4.28	96.59 ± 0.65	95.41 ± 0.74	94.15 ± 1.19	94.86 ± 0.32	96.05 ± 0.07	96.23 ± 1.21	94.59 ± 1.63	96.71 ± 1.04

Table 4. Performance of various classifiers with the MUUFL dataset (best results are in boldface).

No.	MSA-GCN [52]	S3F2Net [53]	AMSSENet [54]	MACN [42]	DSHFNet [39]	Exvit [38]	S2ATNet [55]	HCT [40]	Proposed
1	84.52 ± 4.19	90.86 ± 1.01	85.72 ± 3.75	86.60 ± 2.14	88.12 ± 0.81	90.14 ± 0.76	88.79 ± 0.12	89.53 ± 1.23	91.07 ± 1.18
2	72.35 ± 6.97	86.86 ± 0.27	82.06 ± 7.38	82.15 ± 1.13	83.80 ± 1.21	84.91 ± 6.14	84.89 ± 0.14	87.14 ± 0.65	83.25 ± 1.32
3	68.50 ± 9.06	72.66 ± 1.82	72.65 ± 5.29	75.51 ± 2.93	69.69 ± 2.87	67.69 ± 9.29	75.33 ± 0.32	78.32 ± 1.45	83.30 ± 1.71
4	88.51 ± 6.34	97.36 ± 0.96	88.86 ± 10.24	95.70 ± 1.72	92.77 ± 2.59	88.96 ± 0.43	94.31 ± 0.07	95.66 ± 2.35	95.13 ± 1.23
5	79.38 ± 5.95	87.21 ± 1.85	84.96 ± 1.18	85.73 ± 2.70	85.00 ± 2.06	84.70 ± 1.44	77.21 ± 1.21	80.59 ± 1.56	83.58 ± 1.06
6	98.84 ± 1.06	100 ± 0	98.98 ± 0.48	99.71 ± 0.44	100 ± 0	97.68 ± 1.22	100 ± 0	98.45 ± 1.75	99.71 ± 0.65
7	84.90 ± 5.03	93.87 ± 1.75	87.67 ± 4.78	92.29 ± 0.96	94.14 ± 1.81	96.60 ± 3.98	91.52 ± 0.34	93.81 ± 1.89	92.04 ± 0.83
8	92.45 ± 2.70	95.33 ± 1.81	89.49 ± 4.20	94.73 ± 1.17	95.27 ± 0.75	94.55 ± 1.56	94.15 ± 0.06	94.54 ± 0.65	95.93 ± 1.58
9	51.87 ± 5.25	74.42 ± 2.23	72.32 ± 3.37	75.66 ± 2.44	60.21 ± 2.90	76.14 ± 3.16	70.83 ± 1.56	73.75 ± 1.12	76.75 ± 0.96
10	78.09 ± 7.99	95.12 ± 2.25	88.56 ± 3.31	88.88 ± 3.61	85.34 ± 2.22	90.43 ± 6.73	96.82 ± 0.63	92.06 ± 1.76	95.23 ± 1.69
11	95.16 ± 3.87	98.65 ± 0.94	99.19 ± 0.72	99.46 ± 0.50	99.59 ± 0.11	98.43 ± 0.31	97.98 ± 0.31	97.98 ± 1.56	99.32 ± 0.49
OA (%)	87.72 ± 1.11	87.37 ± 0.78	86.88 ± 1.47	86.28 ± 0.72	86.73 ± 0.21	87.18 ± 0.59	87.17 ± 0.71	86.65 ± 0.60	87.90 ± 0.68
AA (%)	87.65 ± 1.44	88.36 ± 0.26	88.35 ± 1.41	88.32 ± 0.63	88.76 ± 1.86	87.95 ± 2.28	87.90 ± 1.55	88.75 ± 0.48	89.05 ± 2.17
Kappa (%)	84.35 ± 1.41	83.69 ± 0.65	82.89 ± 1.78	81.98 ± 0.84	82.54 ± 0.22	83.46 ± 0.79	81.72 ± 0.94	82.26 ± 0.73	84.51 ± 0.59
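
Tables 2 to 4 report overall accuracy (OA), average accuracy (AA), and the Kappa coefficient. As a reading aid, the sketch below computes the three scores from a confusion matrix using their standard definitions; it is an illustration under that assumption, not the authors' evaluation code, and the function name and the toy matrix are hypothetical.

```python
def classification_scores(confusion):
    """Compute OA, AA, and Cohen's kappa (all in %) from a confusion matrix.

    confusion[i][j] = number of test samples of true class i predicted as j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    # OA: fraction of all test samples that are classified correctly.
    oa = correct / total

    # AA: mean of the per-class accuracies (recall of each class).
    per_class = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
    aa = sum(per_class) / n_classes

    # Kappa: agreement corrected for the agreement expected by chance.
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]
    pe = sum(r * c for r, c in zip(row_sums, col_sums)) / total**2
    kappa = (oa - pe) / (1 - pe)

    return 100 * oa, 100 * aa, 100 * kappa


# Toy two-class confusion matrix (assumed values, not taken from the paper).
toy = [[45, 5],
       [10, 40]]
oa, aa, kappa = classification_scores(toy)
print(f"OA = {oa:.2f}, AA = {aa:.2f}, Kappa = {kappa:.2f}")
# OA = 85.00, AA = 85.00, Kappa = 70.00
```

Because AA weights every class equally, it can differ noticeably from OA on imbalanced test sets such as MUUFL, where one class (Trees) holds a large share of the test samples.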

Table 5. Experimental results of Welch’s t-test significance analysis.

Dataset	Parameter	MSA-GCN	S3F2Net	AMSSENet	MACN	DSHFNet	Exvit	S2ATNet	HCT
Houston	t-value	9.195624	3.01439	5.071625	3.632233	6.415514	5.887311	2.82812	5.609908
	p-value	0.00029	0.029832	0.002629	0.010786	0.000365	0.000646	0.02116	0.003009
Trento	t-value	2.962538	1.526042	3.203359	3.501311	3.720645	2.528394	2.033683	2.45583
	p-value	0.01824	0.171371	0.017157	0.008792	0.014491	0.048236	0.090201	0.041923
MUUFL	t-value	5.162606	2.332814	8.535332	2.704986	2.878544	7.386312	2.16467	3.683253
	p-value	0.005847	0.079994	0.000049	0.041651	0.021543	0.001693	0.062353	0.010282
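
Table 5 lists Welch's t-test results for the proposed method against each baseline. The sketch below shows how the t statistic and the Welch-Satterthwaite degrees of freedom follow from summary statistics of two sets of runs. The number of runs is not stated in this table, so the run count and the input values in the example are hypothetical, and the output is not meant to reproduce any entry of Table 5.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Inputs are sample means, sample standard deviations, and run counts
    of two independent groups with possibly unequal variances.
    """
    var_a = std_a**2 / n_a  # squared standard error of group A
    var_b = std_b**2 / n_b  # squared standard error of group B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a**2 / (n_a - 1) + var_b**2 / (n_b - 1)
    )
    return t, df


# Hypothetical example: two methods, 10 runs each (the run count is assumed).
t, df = welch_t(95.0, 1.0, 10, 93.0, 2.0, 10)
print(f"t = {t:.3f}, df = {df:.2f}")  # t = 2.828, df = 13.24
```

The p-value is then obtained from the Student's t distribution with `df` degrees of freedom. When per-run accuracies are available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test directly on the raw values.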

Table 6. Performance comparison on three datasets with Base, fuzzy learning, and Fuzzy Fusion modules.

Datasets	Base	Fuzzy Learn	Fuzzy Fusion	OA	AA	Kappa
Houston2013	✔			91.11 ± 1.17	92.29 ± 0.88	90.39 ± 1.27
	✔	✔		92.64 ± 0.23	93.75 ± 0.14	92.04 ± 0.29
	✔		✔	91.31 ± 0.15	92.62 ± 0.12	90.61 ± 0.15
	✔	✔	✔	93.82 ± 0.47	94.68 ± 0.49	93.32 ± 0.51
Trento	✔			95.94 ± 1.21	94.71 ± 2.31	94.59 ± 1.63
	✔	✔		96.33 ± 0.39	94.98 ± 0.35	95.11 ± 0.41
	✔		✔	97.11 ± 0.67	95.04 ± 0.45	95.25 ± 0.88
	✔	✔	✔	97.53 ± 0.78	96.18 ± 1.09	96.71 ± 1.04
MUUFL	✔			86.65 ± 0.60	88.75 ± 0.48	82.26 ± 0.73
	✔	✔		86.69 ± 0.51	88.62 ± 0.54	82.81 ± 0.43
	✔		✔	86.79 ± 0.33	88.94 ± 0.26	82.56 ± 0.37
	✔	✔	✔	87.90 ± 0.68	89.05 ± 2.17	84.51 ± 0.59

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
