1. Introduction
Hyperspectral images (HSIs), a form of remote sensing data that integrates imaging and spectral technologies, capture reflectance information across hundreds of contiguous narrow bands ranging from visible to near-infrared wavelengths [1]. Compared to traditional multispectral imagery, HSIs possess higher spectral resolution, enabling detailed characterization of the physical attributes and biochemical characteristics of ground objects. They are widely applied in fields such as agricultural pest and disease monitoring [2], soil pollution analysis [3], forestry resource investigation [4], and ecosystem assessment [5]. With the maturation of HSI technology and the widespread availability of satellite data, the efficient and accurate classification of HSI data has become a core task in intelligent remote sensing processing.
In the early stages of HSI classification, researchers primarily utilized traditional machine learning methods such as support vector machines (SVMs) [6,7], multiple kernel learning (MKL) [8], and principal component analysis (PCA) [9] to model high-dimensional spectral data. However, these methods relied on handcrafted features and struggled to adapt to complex remote sensing scenarios, showing significant shortcomings when addressing high-dimensional redundancy and mixed pixels. With the development of deep learning, Convolutional Neural Networks (CNNs) [10,11,12,13,14,15] have become mainstream in hyperspectral image classification owing to their ability to automatically extract local spatial and spectral features, significantly improving accuracy. Hu et al. [16] pioneered a 1D CNN classification method based on five convolutional layers, which successfully extracted spectral features by using spectral information as input but did not consider spatial information. To address this deficiency, Zhao and Du [17] proposed a 2D CNN HSI classification model that extracts spatial features from the data after PCA dimensionality reduction. To further enhance spatial information modeling, Meng et al. [18] proposed a multi-scale fusion network (FDMFN), which achieves multi-scale feature integration through cross-layer connections; Zhu et al. [19] introduced RSSAN, employing an attention mechanism to improve feature selection; and Meng et al. [20] also designed a lightweight module (LSSCM) that reduces parameter size while maintaining accuracy.
Currently, HSI classification methods mainly rely on spectral data, which can achieve good results. However, in urban areas, spectrally similar roads and rooftops are easily confused. Light Detection and Ranging (LiDAR), on the other hand, provides accurate elevation and structural features that complement hyperspectral data at the information level. Combining the two breaks through the performance bottleneck of a single modality and improves the accuracy and reliability of land cover classification in urban areas [21,22,23,24,25,26]. In the early stages of joint classification, most studies adopted traditional machine learning methods to implement simple concatenation and joint modeling of multimodal features. For example, Colgan et al. [27] utilized HSI and LiDAR data to construct a two-stage SVM classifier for tree species identification. Huang and Zhu [28] fed spectral, elevation, and texture features into a random forest (RF) classifier for ensemble learning. While these methods demonstrated the effectiveness of fusing the two modalities, they are limited by shallow features, simple fusion strategies, and insufficient modeling of nonlinear relationships, making it difficult for them to cope with the demands of high-dimensional semantic understanding in complex remote sensing scenarios. To address these issues, researchers have applied CNNs to the joint classification of HSI and LiDAR data. For example, Xu et al. [29] constructed a dual-branch CNN architecture comprising a dual-channel CNN module and a cascaded CNN module. The dual-channel CNN module mines the spatial and spectral features of the HSI, while the cascaded CNN module extracts the elevation features from the LiDAR data. Furthermore, Hang et al. [30] proposed a coupled CNN framework that utilizes weight-sharing CNN modules to extract features from HSI and LiDAR data separately and combines feature-level and decision-level fusion during the fusion stage. Although the aforementioned methods achieve relatively good classification performance, the resulting classification maps appear overly smooth in certain areas due to insufficient feature richness and inadequate utilization of contextual information.
To address the aforementioned challenges, researchers have turned to the Transformer model. The Transformer is capable of efficiently modeling long-range dependencies, which alleviates the over-smoothing observed in certain regions of the classification map. Consequently, it shows promise in HSI and LiDAR joint classification tasks [31,32,33,34,35]. Feng et al. [36] proposed a linear self-attention fusion model (LSAF) that leverages a linear self-attention module to enrich contextual feature representation between hyperspectral and LiDAR data and integrates classification results through an adaptive decision fusion module. Ding et al. [37] proposed a global-local Transformer network to learn discriminative spectral–spatial features. Yao et al. [38], on the other hand, extended the traditional visual Transformer, employing a cross-modal attention module to facilitate the exchange of heterogeneous information. The method of Wang et al. [21] combines multi-scale features with a Swin Transformer, achieving non-local feature fusion through layer-by-layer expansion of the receptive field while preserving the spatial features of images. Feng et al. [39] proposed the Dynamic Scale Hierarchical Fusion Network (DSHFNet), which dynamically selects and fuses features at different scales based on their similarity in scale space. This reduces feature dimensionality and addresses the unreliable single-scale features and the excessive dimensionality of multi-scale features found in traditional methods. As research deepens, the integration of CNNs and Transformers has become a new research trend for extracting spatial and spectral contextual information from multimodal data more effectively. For example, the Hierarchical CNN-Transformer (HCT) proposed by Zhao et al. [40] designed a cross-token attention mechanism, achieving deeper and more efficient cross-modal information fusion at the token level. The multi-scale 3D–2D hybrid CNN and lightweight attention-free Transformer (M2FNet) proposed by Sun et al. [41] designed a Feature Enhancement (FE) module and a Depthwise Dilated Convolution module (DConvformer) to achieve deeper and more efficient cross-modal information fusion at the feature level. However, these methods often incur high computational costs. To address this challenge, the hybrid self-attention and convolutional network (MACN) proposed by Li et al. [42] redesigned the convolutional and self-attention structures to achieve local–global feature extraction from multi-source remote sensing data while reducing computational overhead. Wang et al. [43] recently proposed a multi-scale cross-attention network (MS2CANet), which improves classification accuracy in regions with complex information through spatial–spectral cross-modal attention and enhances useful features while suppressing noise using a feature recalibration module. Although existing Transformer-based methods have made significant progress in the fusion and classification of hyperspectral and LiDAR data, they still have limitations in both feature learning and feature fusion:
(1) The feature extraction stage does not explicitly model class ambiguity, making it difficult to effectively capture high-resolution features of hyperspectral and LiDAR data in local areas such as land cover boundaries and fine-grained spatial structures. This leads to limited boundary discrimination and transition zone modeling capabilities in complex scenes.
(2) The fusion process lacks a mechanism to dynamically perceive differences and uncertainties between modalities, which results in fluctuating cross-modal feature fusion performance and makes it difficult to effectively leverage the complementary advantages of multi-source data.
Fuzzy logic and its neural extensions have provided a compelling framework for modeling uncertainty through membership functions, which are essential for transforming crisp inputs into fuzzy representations [44,45]. More recently, hybrid models combining deep learning and fuzzy logic have shown promising results in improving interpretability and robustness across complex datasets [46,47]. Among different membership function shapes, Gaussian membership functions (GMFs) are particularly attractive due to their smoothness and infinite differentiability, which facilitate stable optimization in gradient-based deep architectures. Empirical studies have shown that GMFs can better capture nonlinear and heterogeneous patterns in high-dimensional spaces compared to triangular or trapezoidal functions, making them well suited for multimodal and ambiguous data modeling [48,49]. In this work, we integrate GMFs into the proposed method to bridge crisp deep feature maps with fuzzy representations.
To address the aforementioned limitations, this paper proposes a hierarchical Deep Fuzzy Fusion Network to jointly process hyperspectral and LiDAR data. Specifically, we employ a dual-branch architecture, leveraging the strengths of both CNNs and Transformers to extract multi-scale features from HSI and LiDAR data, respectively. Each branch integrates a fuzzy learning module, which models category uncertainty through learnable Gaussian membership functions, significantly enhancing the model’s response and representation capabilities in fuzzy regions at class boundaries. Furthermore, in the fusion stage, we design a Fuzzy-Enhanced Cross-Modal Fusion module (FECF), based on membership-aware attention mechanisms and fuzzy inference operators, to strengthen the model’s ability to model uncertainty in boundary and fuzzy regions. This module can more effectively mine and utilize the complementary information between hyperspectral and LiDAR data in fuzzy regions, thereby improving overall classification accuracy. The main contributions of this paper are as follows:
(1) We propose a novel dual-branch cross-modal feature fusion framework for HSI and LiDAR data classification. By introducing fuzzy learning modules in each branch and using Gaussian membership functions, the model’s discriminative ability for boundary-ambiguous regions is enhanced.
(2) A Fuzzy-Enhanced Cross-Modal Fusion encoding module is proposed to enhance the information complementarity between HSI and LiDAR features, thus improving the ability to recognize object boundaries and mixed pixels.
The remainder of this paper is organized as follows:
Section 2 provides a detailed description of the proposed method.
Section 3 first describes the dataset and experimental settings, followed by the experimental validation of the proposed method.
Section 4 presents the conclusions and future work.
2. Methods
2.1. Overall Framework
The overall workflow of the proposed method is illustrated in Figure 1, which mainly includes four stages: data preprocessing, fuzzy-enhanced feature extraction, Fuzzy-Enhanced Cross-Modal Fusion encoding, and classification. First, PCA is applied to reduce the dimensionality of HSI data, followed by spatial alignment and normalization of both HSI and LiDAR data. Second, the Fuzzy-enhanced Feature Extraction Module (FFEM) employs CNNs to extract per-modality features and then applies a fuzzy learning module (FLM) designed to enhance the modeling capability for blurred boundaries and uncertainty. Then, the extracted features undergo deep interaction through a Fuzzy-Enhanced Cross-Modal Fusion module (FECF), combined with a Fuzzy Fusion Module (FFM), to enhance inter-modal correlation and information complementarity. Finally, the fused multimodal features are input into a classifier to complete land cover classification.
2.2. Data Preprocessing
Let $X_H \in \mathbb{R}^{m \times n \times l}$ represent the HSI data and $X_L \in \mathbb{R}^{m \times n}$ represent the corresponding LiDAR data covering the same geographical area, where $m$ and $n$ indicate the spatial dimensions and $l$ is the number of HSI spectral bands. The label of each pixel can be represented as a one-hot encoded vector. Although the rich spectral information in HSI data is valuable, it also results in large data sizes and computationally expensive processing. To reduce the spectral dimensionality, PCA is applied to extract the top $b$ principal components from $X_H$, preserving the spatial dimensions but reducing the number of spectral bands to $b$. After PCA dimensionality reduction, the HSI data is transformed into $X_H^{pca} \in \mathbb{R}^{m \times n \times b}$. Next, for each pixel, both 3D and 2D patches are extracted, resulting in a 3D patch cube of size $s \times s \times b$ from the HSI data and a 2D patch of size $s \times s$ from the LiDAR data, where $s$ denotes the patch size. Each patch is labeled with the class of its central pixel. For edge pixels, zero-padding is performed with a padding width of $(s-1)/2$. Thus, the patches of HSI and LiDAR data share the same spatial size $s \times s$. After removing patches whose labels are zero, the remaining sample patches are divided into training and test sets.
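The preprocessing steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `pca_reduce` and `extract_patches`, the SVD-based PCA, and the toy array sizes are all hypothetical choices made for the example.

```python
import numpy as np

def pca_reduce(hsi, b):
    """Project an (m, n, l) cube onto its top-b principal components."""
    m, n, l = hsi.shape
    flat = hsi.reshape(-1, l).astype(np.float64)   # astype returns a copy
    flat -= flat.mean(axis=0)                      # center each band
    # Right singular vectors of the centered data are the principal axes.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return (flat @ vt[:b].T).reshape(m, n, b)

def extract_patches(img, labels, s):
    """Return s x s patches centered on every labeled pixel (label > 0)."""
    pad = (s - 1) // 2
    if img.ndim == 2:                              # LiDAR: add a channel axis
        img = img[..., None]
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    rows, cols = np.nonzero(labels)                # drop unlabeled pixels
    patches = np.stack([padded[r:r + s, c:c + s] for r, c in zip(rows, cols)])
    return patches, labels[rows, cols]

rng = np.random.default_rng(0)
hsi = rng.random((6, 5, 20))                       # toy cube: m=6, n=5, l=20
lidar = rng.random((6, 5))
gt = rng.integers(0, 3, size=(6, 5))               # 0 marks unlabeled pixels

hsi_pca = pca_reduce(hsi, b=4)
hsi_patches, y = extract_patches(hsi_pca, gt, s=3)
lidar_patches, _ = extract_patches(lidar, gt, s=3)
print(hsi_patches.shape, lidar_patches.shape, y.shape)
```

Because the padding width is $(s-1)/2$, the central element of each patch is the original pixel, so patches at the image border are still centered on their labeled pixel.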
2.3. Fuzzy-Enhanced Feature Extraction Module
In hyperspectral image classification tasks, CNNs often struggle to accurately delineate boundary regions, especially in areas with a high proportion of mixed pixels, such as the transition zones between vegetation and soil, and they handle pixel-wise uncertainty and class transitions poorly. These regions are typically forced into a single category, ignoring the natural transitional characteristics between land cover types, which weakens the model’s discriminative ability in boundary areas and affects the overall accuracy of classification results. To address this issue, as shown in Figure 1, a fuzzy learning module (FLM) is introduced during the feature extraction process of each branch. This module constructs multiple learnable Gaussian-shaped fuzzy membership functions for each channel to perform fuzzy encoding and soft clustering on the intermediate features extracted by the CNN, thereby capturing the uncertainty distribution structure of the features. Compared to traditional convolutional features, the fuzzy-enhanced features generated by the FLM are more robust in representing fuzzy regions, boundary transitions, and heterogeneous mixed pixels, thus improving the model’s classification performance.
In detail, the HSI data is processed using two consecutive convolutional layers, Conv3-D and Conv2-D, to extract its spatial and spectral features. Each HSI patch of size $s \times s \times b$ is first input into the Conv3-D layer, which uses 8 three-dimensional convolutional kernels. The resulting feature tensors are unfolded along the spatial dimension and used as input for the Conv2-D layer, where 64 two-dimensional convolutional kernels are used to obtain 64 two-dimensional feature maps. For the LiDAR branch, two consecutive Conv2-D layers are used to extract features: each LiDAR patch of size $s \times s$ passes through 16 and then 64 two-dimensional convolutional kernels to extract high-level features. In practice, due to sensor accuracy limitations and the influence of complex surface environments, HSI data is often affected by spectral noise, whereas LiDAR data is prone to coherent speckle noise. Such noise often exhibits spatial structure that CNNs can mistakenly learn as “useful features”, leading to blurred classification boundaries and local overfitting. To address this problem, we introduce a fuzzy learning module (FLM) after the convolution operation.
For the sake of narrative convenience, the subscripts used to denote the HSI and LiDAR branches are temporarily omitted in the following discussion. As shown in Figure 2, the features $X \in \mathbb{R}^{B \times C \times H \times W}$ extracted by CNNs within the FFEM module are transformed into fuzzy representations by the proposed FLM, where $B$ is the batch size, $C$ is the number of feature channels, and $H$ and $W$ indicate the spatial dimensions (height and width) of the features. From the perspective of fuzzy set theory [44], each scalar feature value $x_{b,c,h,w}$ can be regarded as a crisp observation whose degree of membership to several fuzzy sets is determined by a set of learnable membership functions. For each feature channel $c$, the FLM applies $N$ GMFs to every spatial location $(h, w)$, mapping the original feature value into $N$ fuzzy membership degrees. Each Gaussian membership function is parameterized by a center $\mu_{c,k}$ and standard deviation $\sigma_{c,k}$, both learned during training, enabling adaptive and data-driven fuzzy partitioning of the feature space. The formula is as follows:

$$\phi_{c,k}(x_{b,c,h,w}) = \exp\left(-\frac{(x_{b,c,h,w} - \mu_{c,k})^{2}}{2\sigma_{c,k}^{2}}\right) \quad (1)$$

where $k \in \{1, \dots, N\}$ indexes the fuzzy sets for channel $c$. GMFs are chosen for their smoothness, locality, and infinite differentiability, which support stable gradient-based learning in deep architectures [47,48]. Moreover, they provide a localized, nonlinear mapping that is well suited for modeling heterogeneous and ambiguous multimodal features, as demonstrated in recent neuro-fuzzy models [46].
Intuitively, Equation (1) transforms each raw convolutional activation into a graded fuzzy membership value with respect to multiple semantic prototypes (determined by the learnable centers $\mu_{c,k}$ and standard deviations $\sigma_{c,k}$). This graded representation enables the model to retain partial belonging information rather than making hard, binary decisions, thereby capturing subtle semantic variations and uncertainty in the feature space. Such a soft representation is particularly beneficial in multimodal scenarios where data distributions are heterogeneous and boundary regions between classes are not crisply defined. The parameter $\sigma_{c,k}$ controls the fuzziness of membership: a larger $\sigma_{c,k}$ implies a broader response, representing higher uncertainty. Because both parameters are learnable, the model can adapt the partition granularity to the data statistics.
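The effect of the spread parameter can be illustrated numerically. The sketch below assumes the standard Gaussian form with a factor of 2 in the denominator; the three centers and the two spread values are hypothetical numbers chosen only for the example.

```python
import numpy as np

def gaussian_membership(x, centers, sigmas):
    """Membership degree of x in each Gaussian fuzzy set (last axis)."""
    x = np.asarray(x, dtype=np.float64)[..., None]   # broadcast over fuzzy sets
    return np.exp(-((x - centers) ** 2) / (2.0 * sigmas ** 2))

centers = np.array([-1.0, 0.0, 1.0])   # hypothetical prototypes: low / mid / high
narrow = np.full(3, 0.5)               # crisp partition
wide = np.full(3, 2.0)                 # fuzzy partition

x = 0.5                                # activation halfway between "mid" and "high"
m_narrow = gaussian_membership(x, centers, narrow)
m_wide = gaussian_membership(x, centers, wide)
print("narrow:", np.round(m_narrow, 3))
print("wide:  ", np.round(m_wide, 3))
```

With the narrow spread the activation belongs almost exclusively to its two nearest sets, whereas the wide spread assigns substantial membership to all three, which is the graded behaviour the paragraph describes.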
To aggregate the fuzzy membership responses while retaining differentiability, the FLM applies the LogSumExp operator over the $N$ membership degrees of each activation. The aggregation is formally defined as

$$\tilde{x}_{b,c,h,w} = \log \sum_{k=1}^{N} \exp\left(\phi_{c,k}(x_{b,c,h,w})\right) \quad (2)$$

where $\phi_{c,k}(\cdot)$ is the $k$-th membership degree from Equation (1) and $\tilde{x}_{b,c,h,w}$ denotes the fuzzy feature information at each spatial location. LogSumExp acts as a smooth approximation to the maximum operation. Both the GMFs and the LogSumExp operator are infinitely differentiable, which is crucial for stable gradient propagation in end-to-end training via backpropagation. This design aligns with recent advances in deep fuzzy architectures, where smooth aggregation operators help retain gradient flow and improve robustness in multimodal tasks [48]. Consequently, the feature map $X$ extracted by CNNs undergoes processing through the FLM, yielding the fuzzy feature map $\tilde{X}$ with improved robustness to uncertainty and noise.
Finally, residual connections integrate the feature maps $X$ and $\tilde{X}$ via element-wise addition, as formalized in Equation (3):

$$F = X + \mathrm{BN}(\tilde{X}) \quad (3)$$

where batch normalization (BN) is applied to constrain the dynamic range of the fuzzy feature map $\tilde{X}$, and $F$ denotes the output of the FFEM module. Accordingly, let $F_H$ and $F_L$ denote the outputs of the FFEM modules of the HSI and LiDAR branches, respectively.
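The three FLM steps (Gaussian membership, LogSumExp aggregation, and the normalized residual connection) can be sketched as a single forward pass. This is a hedged NumPy illustration rather than the authors' code: `flm_forward` is a hypothetical name, LogSumExp is assumed to act directly on the membership degrees, and batch normalization is replaced by a per-channel standardization without learnable scale or shift.

```python
import numpy as np

def flm_forward(x, centers, sigmas, eps=1e-5):
    """x: (B, C, H, W); centers, sigmas: (C, N), learnable in a real model."""
    # Gaussian membership: N degrees per activation -> (B, C, H, W, N)
    diff = x[..., None] - centers[None, :, None, None, :]
    memb = np.exp(-(diff ** 2) / (2.0 * sigmas[None, :, None, None, :] ** 2))
    # LogSumExp over the N fuzzy sets, stabilised by subtracting the max
    mx = memb.max(axis=-1, keepdims=True)
    fuzzy = (mx + np.log(np.exp(memb - mx).sum(axis=-1, keepdims=True)))[..., 0]
    # Stand-in for BN (assumption): per-channel standardisation, no affine part
    mean = fuzzy.mean(axis=(0, 2, 3), keepdims=True)
    var = fuzzy.var(axis=(0, 2, 3), keepdims=True)
    fuzzy_bn = (fuzzy - mean) / np.sqrt(var + eps)
    # Residual connection by element-wise addition
    return x + fuzzy_bn, memb

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 5, 5))              # toy CNN features
centers = np.tile(np.linspace(-1.0, 1.0, 3), (4, 1))   # (C=4, N=3)
sigmas = np.ones((4, 3))                           # must stay positive
out, memb = flm_forward(x, centers, sigmas)
print(out.shape, memb.shape)
```

In a trainable version the spreads would need a positivity constraint (for example a softplus reparameterization), which is omitted here for brevity.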
By introducing fuzzy membership functions, the FLM performs soft partitioning of CNN features and dynamically adjusts weights for boundary regions. It preserves spatial perception capabilities while enhancing robustness against mixed pixels and fuzzy boundaries, thus alleviating class overlap and feature confusion and improving classification reliability. In this way, the FFEM addresses the difficulty CNNs have in modeling nonlinear spectral distributions and fuzzy boundaries in hyperspectral classification. Next, the outputs $F_H$ and $F_L$ of the two FFEM branches serve as the inputs to the Fuzzy-Enhanced Cross-Modal Fusion encoding module, providing higher-quality and more semantically hierarchical feature representations for the subsequent fusion encoding.
2.4. Fuzzy-Enhanced Cross-Modal Fusion Module
Existing Transformer-based methods primarily rely on data-driven approaches to automatically allocate weights to features from different modalities, but they fall short in effectively modeling the disparity and complementarity across modalities. This leads to insufficient extraction and utilization of complementary information from hyperspectral and LiDAR data, especially in ambiguous regions, such as land boundaries. To address this issue, we designed a Fuzzy-Enhanced Cross-Modal Fusion (FECF) module that combines membership-aware attention mechanisms with fuzzy inference operators to achieve dynamic adjustment of modal feature weights and the efficient integration of complementary information.
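First, to adapt to the transformer architecture, the features extracted by the dual-branch convolutional network need to be flattened and represented as tokens. Therefore, the FFEM outputs $F_H$ and $F_L$ are flattened into sets of vectors. The flattened HSI and LiDAR feature maps are denoted as $\bar{F}_H \in \mathbb{R}^{h_H w_H \times c_H}$ and $\bar{F}_L \in \mathbb{R}^{h_L w_L \times c_L}$, where $h_H$ and $w_H$ represent the height and width of the HSI feature map, respectively, $h_L$ and $w_L$ represent the height and width of the LiDAR feature map, respectively, and $c_H$ and $c_L$ denote the number of channels of the HSI and LiDAR feature maps, respectively. Inspired by [50], we employ two learnable weight matrices, $W_H \in \mathbb{R}^{c_H \times t_H}$ and $W_L \in \mathbb{R}^{c_L \times t_L}$, to derive the HSI tokens $T_H \in \mathbb{R}^{t_H \times c_H}$ and LiDAR tokens $T_L \in \mathbb{R}^{t_L \times c_L}$, respectively, via the following formulas:

$$T_H = \mathrm{softmax}\left(\bar{F}_H W_H\right)^{\top} \bar{F}_H \quad (4)$$

$$T_L = \mathrm{softmax}\left(\bar{F}_L W_L\right)^{\top} \bar{F}_L \quad (5)$$

where $t_H$ and $t_L$ represent the number of HSI and LiDAR tokens, respectively, and $\bar{F}_H W_H$ and $\bar{F}_L W_L$ denote the point-wise products. The softmax operation is used to emphasize the relatively important semantic part.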
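The tokenization step can be sketched as follows. The code assumes the common "softmax over pixels, then weighted sum" form of a learned tokenizer; the function name `tokenize`, the random weights, and the toy dimensions are hypothetical.

```python
import numpy as np

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)        # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def tokenize(feat, w):
    """feat: (B, C, H, W) feature map; w: (C, T) weights -> (B, T, C) tokens."""
    b, c, h, wd = feat.shape
    flat = feat.reshape(b, c, h * wd).transpose(0, 2, 1)   # (B, HW, C)
    attn = softmax(flat @ w, axis=1)                       # (B, HW, T), over pixels
    return attn.transpose(0, 2, 1) @ flat                  # (B, T, C)

rng = np.random.default_rng(0)
feat = rng.standard_normal((2, 16, 5, 5))          # toy FFEM output, C=16
w = rng.standard_normal((16, 4))                   # T=4 tokens (illustrative)
tokens = tokenize(feat, w)
print(tokens.shape)
```

Each token is a convex combination of pixel features, so the learned weights decide which spatial locations contribute to which token; with all-zero weights every token reduces to the mean pixel feature.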
Subsequently, as illustrated in Figure 1, we employ transformer encoders based on the Multi-Head Self-Attention (MHSA) mechanism on two separate branches to model the semantic correlations between the feature tokens. Specifically, the input tokens fed into the transformer encoder are first concatenated with a trainable classification token (CLS). Positional information is then integrated into the token embeddings to preserve sequential-order information. Consequently, for the input tokens $T_H$ and $T_L$, we obtain the corresponding processed embeddings $Z_H$ and $Z_L$, respectively. The embeddings $Z_H$ and $Z_L$ are then each processed through a transformer encoder, which comprises an MHSA layer followed by a Feed-Forward Network (FFN), both wrapped with residual connections and Layer Normalization (LN). Taking the HSI tokens as an example, the embedding $Z_H$ passes through the MHSA and FFN sublayers in turn (Equations (6) and (7)), and $\hat{Z}_H$ denotes the output of the transformer encoder module. By applying the same processing procedure to the LiDAR feature tokens, we obtain the output $\hat{Z}_L$ for the LiDAR branch.
After completing intra-modal modeling, a cross-attention module is introduced to achieve semantic alignment and information fusion between heterogeneous modalities [40,51]. Specifically, each modality’s CLS token acts as an interaction bridge: the CLS token of one modality is concatenated with the feature tokens of the other modality and subjected to attention projection, thereby explicitly modeling the complementary relationships between modalities. The cross-attention module for the HSI token branch is illustrated in Figure 3. For the HSI branch, it first combines the LiDAR patch tokens with its own CLS token through concatenation, which can be formulated as follows:

$$\tilde{z}^{cls}_H = f\left(z^{cls}_H\right), \qquad Z'_H = \left[\, \tilde{z}^{cls}_H \,\|\, Z^{patch}_L \,\right] \quad (8)$$

where $z^{cls}_H$ and $Z^{patch}_L$ denote the class token from the HSI encoder output $\hat{Z}_H$ and the patch tokens from the LiDAR encoder output $\hat{Z}_L$, respectively. $f(\cdot)$ serves as a projection function for dimension alignment, which is applied so that the projected $\tilde{z}^{cls}_H$ shares the same dimensionality as the patch tokens in $\hat{Z}_L$. $Z'_H$ denotes the combined feature tokens.
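The module subsequently performs cross-attention (CA) between the projected CLS token $\tilde{z}^{cls}_H$ and the combined tokens $Z'_H$, using the CLS token as the sole query since the information from patch tokens has already been fused into it. In mathematical terms, the CA is represented as

$$q = \tilde{z}^{cls}_H W_q, \qquad k = Z'_H W_k, \qquad v = Z'_H W_v \quad (9)$$

$$A = \mathrm{softmax}\left(\frac{q k^{\top}}{\sqrt{c/h}}\right) \quad (10)$$

$$\mathrm{CA}(Z'_H) = A v \quad (11)$$

Here, $W_q$, $W_k$, and $W_v \in \mathbb{R}^{c \times (c/h)}$ are learnable weights, where $c$ is the embedding dimension and $h$ is the number of attention heads. As with self-attention, we incorporate multiple heads in the CA. Let $\bar{Z}_H$ and $\bar{Z}_L$ denote the outputs of the HSI and LiDAR branches after the CA module, respectively.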
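A multi-head cross-attention step with the CLS token as the only query can be sketched in NumPy. This is an illustration under assumptions: the projection for dimension alignment is taken to be the identity (both branches share the embedding size), weights are random, and the function name `cross_attention` is hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(cls_tok, patch_other, wq, wk, wv):
    """cls_tok: (B, 1, c); patch_other: (B, n, c); wq/wk/wv: (h, c, c // h)."""
    z = np.concatenate([cls_tok, patch_other], axis=1)        # (B, 1+n, c)
    q = np.einsum("bic,hcd->bhid", cls_tok, wq)               # (B, h, 1, d)
    k = np.einsum("bnc,hcd->bhnd", z, wk)                     # (B, h, 1+n, d)
    v = np.einsum("bnc,hcd->bhnd", z, wv)                     # (B, h, 1+n, d)
    d = wq.shape[-1]
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(d))  # (B, h, 1, 1+n)
    out = attn @ v                                            # (B, h, 1, d)
    # Concatenate the heads back into one embedding of size c = h * d
    return out.transpose(0, 2, 1, 3).reshape(cls_tok.shape[0], 1, -1), attn

rng = np.random.default_rng(0)
B, n, c, h = 2, 5, 8, 2
cls_hsi = rng.standard_normal((B, 1, c))           # HSI CLS token
patch_lidar = rng.standard_normal((B, n, c))       # LiDAR patch tokens
wq, wk, wv = (rng.standard_normal((h, c, c // h)) for _ in range(3))
out, attn = cross_attention(cls_hsi, patch_lidar, wq, wk, wv)
print(out.shape, attn.shape)
```

Because only one query is used, the attention map has a single row per head, so the cost grows linearly with the number of tokens from the other modality rather than quadratically.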
Although the transformer and CA perform excellently in modeling intra- and inter-modal relationships, in complex regions such as those with noise interference, mixed pixels, and blurred boundaries, the uncertainty in feature distribution hinders the effective exploitation of complementarity between HSI and LiDAR. This results in the loss of critical information and difficulties in class differentiation. To address these issues, a Fuzzy Fusion Module is introduced to enhance the informational complementarity between HSI and LiDAR features and strengthen the recognition capability of land cover boundaries and mixed pixels.
Figure 4 illustrates the processing flow of the Fuzzy Fusion Module (FFM). Specifically, for the HSI branch, the input feature tensor $\bar{Z}_H \in \mathbb{R}^{B \times N \times D}$ is structured with batch size $B$, token count $N$, and feature dimension $D$. The mean $\mu$ and standard deviation $\sigma$ are calculated across the batch and sequence dimensions (i.e., along axes $B$ and $N$), aggregating statistics for each feature channel. Furthermore, the FFM employs a residual connection to retain the original features while fusing fuzzy representations based on the Gaussian smoothing of global statistics. This design explicitly retains critical information and improves feature robustness by fusing local details with contextual fuzzy membership information, capturing inherent uncertainties. This operation, denoted $G(\cdot)$ and given in Equation (12), implements a Gaussian-based fuzzy membership transformation that combines raw input features with their globally contextualized fuzzy representations.
Subsequently, adaptive feature fusion is applied to combine the raw input features $\bar{Z}_H$ with their fuzzy-transformed counterparts $G(\bar{Z}_H)$ using a residual coefficient $\alpha$. This fusion operation captures contextual uncertainty while preserving local details. Next, a linear transformation is used to map the fused features to an attention space via learnable parameters $W$ (weight matrix) and $b$ (bias vector). Finally, the FFM applies a sigmoid activation followed by dropout regularization, producing the final attention weights $A_H$, as given in Equation (13).
Furthermore, the FFM employs the attention weights $A_H$ to dynamically balance the contributions of the original and transformed features. This adaptive fusion mechanism enables the model to focus on the most salient information and enhances its ability to adapt to complex data. It is formulated in Equation (14), where ⊙ denotes element-wise multiplication and $M_H$ denotes the fused features. For the LiDAR features, an identical fuzzy transformation process is applied, which yields the output $M_L$. Subsequently, the HSI features $M_H$ and LiDAR features $M_L$ are fused through a concatenation operation, achieving effective integration of multimodal data. The formula is as follows:

$$M = \mathrm{Concat}(M_H, M_L) \quad (15)$$

where $M$ represents the fused features of the multimodal data.
The model then applies a linear transformation to project the concatenated features into a lower-dimensional space that matches the shape of the per-branch features. The dimensionally reduced representation is then separately added to the HSI and LiDAR branch features through residual connections, yielding the final representations of the two branches, as illustrated in Figure 1.
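The FFM data flow described above can be sketched as follows. Several details are not fully specified in the text, so the code makes loudly labeled assumptions: the fuzzy transform is taken as "input plus Gaussian membership", the residual coefficient scales the transformed features before the linear layer, the attention weights interpolate between original and transformed features, dropout is omitted, and the shared projection is added back to each branch input. All function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ffm_branch(x, w, b, alpha=0.5, eps=1e-6):
    """x: (B, N, D) tokens of one branch; w: (D, D); b: (D,)."""
    mu = x.mean(axis=(0, 1), keepdims=True)        # per-channel mean over B and N
    sigma = x.std(axis=(0, 1), keepdims=True) + eps
    memb = np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
    g = x + memb                                   # ASSUMED form of the fuzzy transform
    fused = x + alpha * g                          # ASSUMED use of the residual coefficient
    a = sigmoid(fused @ w + b)                     # attention weights in (0, 1); dropout omitted
    return a * x + (1.0 - a) * g                   # ASSUMED balance of original vs. transformed

def ffm(x_h, x_l, params_h, params_l, w_proj):
    m_h = ffm_branch(x_h, *params_h)
    m_l = ffm_branch(x_l, *params_l)
    m = np.concatenate([m_h, m_l], axis=-1)        # (B, N, 2D)
    shared = m @ w_proj                            # project back to (B, N, D)
    return x_h + shared, x_l + shared              # ASSUMED residual targets

rng = np.random.default_rng(0)
B, N, D = 2, 5, 8
x_h, x_l = rng.standard_normal((B, N, D)), rng.standard_normal((B, N, D))
params_h = (rng.standard_normal((D, D)) * 0.1, np.zeros(D))
params_l = (rng.standard_normal((D, D)) * 0.1, np.zeros(D))
w_proj = rng.standard_normal((2 * D, D)) * 0.1
y_h, y_l = ffm(x_h, x_l, params_h, params_l, w_proj)
print(y_h.shape, y_l.shape)
```

The sketch assumes both branches have the same token count and feature dimension so that one projected tensor can be added to each; if the dimensions differ, two separate projections would be needed.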
Although both FLM and FFM leverage fuzzy principles, they address distinct stages of the multimodal learning pipeline. The FLM operates at the feature extraction stage, transforming convolutional feature maps into fuzzy membership representations to improve the feature extraction capability for boundary regions of hyperspectral imagery and LiDAR data. This method effectively alleviates the fuzziness caused by mixed pixels, thereby enhancing the model’s classification performance in category transition areas. In contrast, the FFM functions at the feature integration stage, where it dynamically weights and combines heterogeneous fuzzy-enhanced features across modalities using an attention mechanism guided by membership degrees. This adaptive approach not only optimizes the feature fusion process but also strengthens cross-modal correlations, enabling more robust alignment.
2.5. Classification
After the FECF module completes the cross-modal information exchange, the CLS tokens from the HSI and LiDAR branches, each serving as a compact representation of global semantics, are fed into independent multilayer perceptron (MLP) layers. Each MLP consists of two fully connected layers: the first is followed by a GELU activation function to introduce nonlinearity, and the second outputs class probabilities through a softmax operation. The dimension of the output layer matches the number of land cover classes, and the softmax function normalizes the activation values for each class, producing a probability distribution as output:

$$p_i = \frac{\exp(z_i)}{\sum_{j=1}^{C} \exp(z_j)} \quad (16)$$

Here, $z_i$ denotes the predicted score for the $i$-th class, $C$ is the total number of classes, and $p_i$ is the predicted probability for this class. The final probability vector is obtained by adding the two output probability vectors, and the pixel category is identified by the label associated with the highest probability value.
Cross-entropy is adopted as the loss function of the integrated network:

$$\mathcal{L} = -\sum_{i=1}^{C} y_i \log(p_i) \quad (17)$$

where $y_i$ is the $i$-th element of the one-hot encoding of the label, and $p_i$ denotes the predicted probability for class $i$. The overall training process of the proposed deep fuzzy fusion method is described in Algorithm 1.
Algorithm 1 Deep Fuzzy Fusion Network for Joint Hyperspectral and LiDAR Data Classification
1: Input: HSI data $X_H$, LiDAR data $X_L$, ground-truth data
2: Number of PCA bands $b$, patch size $s$, training sample rate
3: Output: Predicted labels of the test set
4: Set the batch size, the optimizer (Adam) and its learning rate, and the number of epochs $e$; initialize all weights
5: Obtain the reduced HSI data $X_H^{pca}$ after the PCA transform
6: Create all sample patches from $X_H^{pca}$ and $X_L$, divide them into a training set and a test set
7: Generate the training loader and the test loader
8: for $i = 1$ to $e$ do
9:     Extract spatial–spectral features from the HSI patches using 3D and 2D CNN layers, extract elevation features from the LiDAR patches
10:     Enhance the features through the FLM to obtain the fuzzy-enhanced features $F_H$ and $F_L$ by calculating Equations (1)–(3)
11:     Transform the feature vectors to generate tokens by calculating Equations (4) and (5)
12:     Input the tokens to the transformer encoder for feature learning by calculating Equations (6) and (7)
13:     Achieve semantic alignment and information fusion to obtain the cross-attended features by calculating Equations (8)–(11)
14:     Fuse the features input to the FFM by calculating Equations (12)–(15)
15:     Compute the classification prediction labels by calculating Equations (16) and (17)
16: end for
17: Use the trained model to predict labels for the test set
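The classification step and loss of Section 2.5 can be illustrated with a small numerical sketch. The logits and labels below are made-up toy values, and applying the cross-entropy to each branch and summing the two losses is an assumption, since the text does not state whether the loss is computed per branch or on the combined output.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p, y_onehot, eps=1e-12):
    """Mean cross-entropy between probabilities p and one-hot labels."""
    return float(-(y_onehot * np.log(p + eps)).sum(axis=-1).mean())

# Toy MLP outputs for two pixels and three classes (illustrative values).
logits_hsi = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
logits_lidar = np.array([[1.0, 1.5, 0.0], [0.0, 0.3, 2.0]])
labels = np.eye(3)[[0, 2]]                         # one-hot ground truth

p_hsi, p_lidar = softmax(logits_hsi), softmax(logits_lidar)
p_final = p_hsi + p_lidar                          # decision fusion by addition
pred = p_final.argmax(axis=-1)                     # label with highest fused score

# ASSUMPTION: one cross-entropy term per branch, summed.
loss = cross_entropy(p_hsi, labels) + cross_entropy(p_lidar, labels)
print(pred, round(loss, 4))
```

For the first pixel the LiDAR branch alone would favour class 1, but the more confident HSI branch dominates the summed probabilities, which shows how the additive fusion lets one modality correct the other.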