FE-MCFN: Fuzzy-Enhanced Multi-Scale Cross-Modal Fusion Network for Hyperspectral and LiDAR Joint Data Classification

Wei, Shuting; Jia, Mian; Duan, Junyi

doi:10.3390/a18080524

Open AccessArticle

FE-MCFN: Fuzzy-Enhanced Multi-Scale Cross-Modal Fusion Network for Hyperspectral and LiDAR Joint Data Classification

by

Shuting Wei

^*

,

Mian Jia

and

Junyi Duan

College of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China

^*

Author to whom correspondence should be addressed.

Algorithms 2025, 18(8), 524; https://doi.org/10.3390/a18080524

Submission received: 30 June 2025 / Revised: 7 August 2025 / Accepted: 14 August 2025 / Published: 18 August 2025

(This article belongs to the Section Databases and Data Structures)

Download

Browse Figures

Versions Notes

Abstract

With the rapid advancement of remote sensing technologies, the joint classification of hyperspectral image (HSI) and LiDAR data has become a key research focus in the field. To address the impact of inherent uncertainties in hyperspectral images on classification—such as the “same spectrum, different materials” and “same material, different spectra” phenomena, as well as the complexity of spectral features. Furthermore, existing multimodal fusion approaches often fail to fully leverage the complementary advantages of hyperspectral and LiDAR data. We propose a fuzzy-enhanced multi-scale cross-modal fusion network (FE-MCFN) designed to achieve joint classification of hyperspectral and LiDAR data. The FE-MCFN enhances convolutional neural networks through the application of fuzzy theory and effectively integrates global contextual information via a cross-modal attention mechanism. The fuzzy learning module utilizes a Gaussian membership function to assign weights to features, thereby adeptly capturing uncertainties and subtle distinctions within the data. To maximize the complementary advantages of multimodal data, a fuzzy fusion module is designed, which is grounded in fuzzy rules and integrates multimodal features across various scales while taking into account both local features and global information, ultimately enhancing the model’s classification performance. Experimental results obtained from the Houston2013, Trento, and MUUFL datasets demonstrate that the proposed method outperforms current state-of-the-art classification techniques, thereby validating its effectiveness and applicability across diverse scenarios.

Keywords:

hyperspectral image (HSI); LiDAR; fuzzy learning; fuzzy fusion; joint classification

1. Introduction

Remote sensing technology utilizes platforms such as satellites, aircraft, or drones equipped with sensors to receive electromagnetic wave reflections or radiation signals from the Earth’s surface, enabling monitoring and analysis of the Earth and its atmosphere. With the rapid advancement of remote sensing, hyperspectral imaging (HSI) technology has emerged prominently. Hyperspectral images capture continuous spectral information, allowing for precise identification of subtle features of objects, and are widely applied in agricultural monitoring [1], environmental monitoring [2], and resource exploration [3]. Image classification is a critical component of remote sensing data processing, as it assigns unique labels to different land cover types. However, hyperspectral image classification faces challenges, including “same spectrum, different objects” [4], “same object, different spectra” [5], spectral feature complexity, and noise interference. Fortunately, radar data can provide the three-dimensional spatial information lacking in hyperspectral images, enhancing object recognition. Additionally, it complements spectral information, effectively addressing the issues of “same spectrum, different objects” and “same object, different spectra”. Therefore, combining HSI and LiDAR data leverages the spectral advantages of hyperspectral data and the spatial advantages of radar data, achieving complementary benefits and improving the accuracy of land cover classification [6,7]. However, LiDAR data may have limitations in capturing spectral information for different land covers.

In recent years, deep neural networks (DNNs) have demonstrated significant achievements in various domains, including image recognition, speech recognition, and natural language processing, and have been extensively employed in the joint classification of hyperspectral and LiDAR data. Prominent deep learning frameworks include convolutional neural networks (CNNs), graph convolutional networks (GCNs), and transformer networks. For instance, Hang et al. [8] proposed a framework utilizing two coupled CNNs for the fusion of hyperspectral and LiDAR data, which notably enhanced classification accuracy by leveraging the complementary information from both data sources through feature-level and decision-level fusion methods. Chen et al. [9] introduced a feature fusion framework based on deep CNNs, extracting spectral–spatial and spatial–elevation features and integrating them using a fully connected DNN, leading to substantial improvements in classification accuracy. Cai et al. [10] developed a graph attention-based multimodal fusion network (GAMF) that employed parameter sharing and Gaussian tokenization for feature extraction, utilizing a graph attention mechanism to fuse the semantic relationships of multi-source data, resulting in significant advancements in classification performance. Zhang et al. [11] proposed a transformer-based LIIT model that dynamically integrated HSI and LiDAR features through multi-branch feature embedding, a local multi-source feature interactor (L-MSFI), and a multi-source feature selection module (MSFSM), effectively addressing the challenges associated with collaborative feature extraction and fusion in multi-source data. Xue et al. [12] presented a novel architecture known as the deep hierarchical vision transformer (DHViT), which extracted features using spectral sequence transformer and spatial hierarchical transformer and employed a cross-attention mechanism to fuse heterogeneous features, thereby enhancing classification performance effectively. Furthermore, to capitalize on the combined strengths of CNNs and GCNs, Wang et al. [13] introduced an innovative deep learning model termed S3F2Net, which effectively extracted multimodal data features from multiple angles by integrating the properties of both CNNs and GCNs. In an effort to unify the advantages of CNNs and transformers, Zhao et al. [14] proposed a dual-branch approach that integrated a hierarchical CNN and a transformer, extracting spectral–spatial and elevation features through the CNN and subsequently employing the self-attention mechanism of the transformer for feature fusion, significantly improving classification accuracy. Additionally, the recently proposed Mamba network has also been utilized for the joint classification of hyperspectral and LiDAR data. For example, Li et al. [15] proposed the AFA-Mamba model, which addresses the challenges of complex information capture and effective fusion of multi-source data in the joint classification of hyperspectral and LiDAR data through adaptive feature alignment and a global–local Mamba design. He et al. [16] developed a multi-source remote sensing data classification method grounded in the mamba architecture, utilizing the LatSS and LonSS mechanisms to extract spatial–spectral features from hyperspectral and LiDAR data, followed by the CIF module for heterogeneous feature fusion and classification. In essence, these studies [17,18,19,20,21] predominantly focused on feature extraction from hyperspectral and LiDAR data using deep neural networks (DNNs), followed by feature fusion to achieve improved feature representation, thereby enhancing classification accuracy. However, due to the significant scale differences of various objects in remote sensing images, single-scale feature extraction often proves inadequate in comprehensively capturing the spatial and spectral features of different target types [22,23].

To overcome this limitation, researchers have explored multi-scale feature extraction, aiming to effectively address the issue of significant scale differences among object types in remote sensing images by extracting features at multiple scales [24]. Specifically, Liu et al. [25] proposed a multi-scale and multi-directional feature extraction network (MSMD-Net) that integrated multi-scale spatial features, multi-directional spatial features, and spectral feature modules to address the challenges of insufficient utilization of multi-source information. Ni et al. [26] introduced a multi-scale head selection transformer (MHST) network that extracted spectral–spatial features from HSI and elevation features from LiDAR using multi-scale convolutional layers, reducing redundant information with a head selection pooling transformer, thus significantly enhancing classification performance. Ge et al. [27] proposed a cross-attention-based multi-scale convolution fusion network (CMCN), which extracted spatial–spectral-elevation features and integrated semantic information from multi-source data to achieve high-accuracy land cover classification. Feng et al. [28] proposed a dynamic scale hierarchical fusion network (DSHFNet) that used a dynamic scale feature extraction module (DSFE) to select appropriate scale features and reduce dimensionality, employing a multi-attention mechanism for hierarchical fusion to significantly enhance classification performance. Similar works [29,30,31,32,33] share the core idea of extracting local spectral information and global spatial context information from HSI and LiDAR at multiple scales and then fusing them to effectively address the issue of significant scale differences in feature types, thereby improving classification accuracy.

In addition to multi-scale feature extraction, data fusion methods can also be fully utilized, which play a crucial role in the accurate classification of multi-source remote sensing images [34]. Traditional fusion methodologies primarily concentrate on strategies at the feature, decision, or pixel levels. For instance, Zhu et al. [35] introduced the hierarchical multi-attribute transformer (HMAT), which facilitated feature-level fusion via the hierarchical multi-feature aggregation (HMFA) module, thereby significantly enhancing the joint classification performance of hyperspectral and LiDAR data. Similarly, Li et al. [36] developed a fusion classification approach for hyperspectral and LiDAR data utilizing a local-pixel-neighborhood-preserving embedding (SSLPNPE) method based on pixel segmentation. That approach effectively improved classification performance by extracting spatial and spectral features while optimizing spatial neighborhoods through pixel segmentation. Additionally, Jia et al. [37] proposed a cooperative comparative learning (CCL) method that enhanced classification performance in scenarios with limited sample sizes through multi-level fusion of collaborative feature extraction during both the pre-training and fine-tuning stages. Despite the advancements offered by these methods in classification accuracy, their effectiveness remains constrained due to an inadequate exploration of the spatial–spectral relationships and global contextual information inherent in the data. Furthermore, traditional methods exhibit sensitivity to noise and uncertainty within the data, which can lead to inaccurate classification outcomes. Consequently, deep neural networks have been employed for the fusion classification of hyperspectral and LiDAR data. Specifically, Sun et al. [38] introduced a spectral–spatial feature transformer (SSFTT) that integrated convolutional neural networks (CNNs) with transformer architectures, effectively extracting shallow spectral–spatial features and modeling high-level semantic features, thus significantly enhancing classification accuracy. Li et al. [39] proposed a depth feature fusion technique for hyperspectral image classification utilizing a double-stream CNN, which simultaneously extracted spectral, local, and global spatial features while incorporating channel correlation to identify the most informative features, resulting in a marked improvement in classification accuracy. Wang et al. [40] developed a multi-scale spatial–spectral multimodal attention network (MS2CANet), which achieved significant enhancements in classification accuracy through the implementation of multi-scale pyramid convolution and an effective feature recalibration module. Despite the significant successes achieved by the aforementioned methods in the joint classification of hyperspectral and LiDAR data, several limitations and challenges remain: (1) Existing joint classification methods for hyperspectral and LiDAR data inadequately address the modeling of data uncertainty, particularly in handling the phenomena of “same spectrum, different objects” and “same object, different spectra,” which adversely affects classification accuracy. (2) Existing multimodal fusion methods often fail to effectively integrate the complementary information of spatial and spectral features across modalities. This results in insufficient enhancement of classification models’ recognition capabilities and accuracy in complex scenes.

To address these challenges, this paper proposes a Fuzzy-Enhanced Multi-scale Cross-modal Fusion Network (FE-MCFN) for the joint classification of hyperspectral image (HSI) and LiDAR data. We innovatively incorporate fuzzy logic to enhance the feature representation capabilities of multimodal data, enabling the model to robustly handle uncertainties and redundancies within the data. Specifically, an FLM is constructed, which utilizes fuzzy membership functions to weight the input features. This approach captures subtle differences and uncertainty information in the data, enhancing the model’s ability to manage uncertainty effectively. Subsequently, an FFM is developed, which employs fuzzy rules to eliminate redundancy and interference, thereby optimizing feature representation and ensuring that the fused features are more focused on regions relevant to the classification task. The proposed method effectively addresses the limitations of existing multimodal fusion techniques in handling fuzzy modality boundaries and feature uncertainty, demonstrating greater robustness and accuracy in complex scenarios.

The primary contributions of this study are outlined as follows:

(1): We propose a fuzzy-enhanced multi-scale cross-modal fusion network that integrates global contextual information through fuzzy logic and CNN. This approach effectively addresses the inherent uncertainties in hyperspectral and LiDAR data while leveraging their complementary nature, thereby significantly enhancing the efficacy of feature extraction and data fusion processes.
(2): To address the uncertainty between classes in HSI, we propose an FLM. This module employs Gaussian fuzzy membership functions to weight the features, effectively addressing issues of spectral mixing and noise interference in hyperspectral data.
(3): To address the limitations of existing networks in feature fusion strategies, we propose a fuzzy fusion module (FFM). This module applies fuzzy rules to compute the membership degrees of features, enabling more effective weighted fusion and focusing on regions critical for classification.

The structure of the subsequent sections of this paper is organized as follows: Section 2 provides a comprehensive overview of the proposed model’s framework and explains the operational principles of each component module in detail. Section 3 systematically presents the experimental results and offers a comprehensive analysis, including ablation studies, quantitative metric comparisons, and visual effect evaluations. Finally, Section 4 summarizes the paper.

2. Methods

2.1. Overall Framework

The overall framework of the proposed FE-MCFN model is illustrated in Figure 1. This model innovatively incorporates fuzzy theory, enhancing its robustness against uncertainty and noise present in hyperspectral and LiDAR data. It effectively leverages the complementarity between the spectral features of HSI and the spatial features of LiDAR, significantly improving classification accuracy. In the preprocessing stage, PCA is applied to reduce the dimensionality of the HSI data, while the LiDAR data are simultaneously input into the model. In the second stage, shallow features of hyperspectral and LiDAR data are extracted through a dual-branch convolutional architecture, and a fuzzy learning module is designed. The membership degree of each feature in fuzzy clustering of different categories is calculated using Gaussian membership function, thereby obtaining the feature–category membership matrix; then, the log-sum-exp aggregation strategy is used to nonlinearly fuse the membership information to generate a feature representation that can characterize the fuzzy belonging relationship; finally, the fuzzy features and the original features are fused through residual connections. Subsequently, a cross-modal attention mechanism is used to enhance feature extraction while facilitating shallow fusion of the HSI and LiDAR features. In the third stage, the model utilizes pyramid convolution to extract multi-scale features and perform feature enhancement. The fuzzy fusion module uses fuzzy logic to process the complementary information of HSI and LiDAR data, and the input features are made fuzzy through the Gaussian function to achieve fuzzy fusion of features, thereby improving the discrimination ability of the fusion features. Finally, in the classification stage, the fused features are mapped to class labels using a softmax function, producing the classification results. The algorithm flow of FE-MCFN is shown in Algorithm 1.

Algorithm 1 FE-MCFN (taking the Houston2013 dataset as an example.)

1:: Input: Hyperspectral image data $X_{H}$ , LiDAR data $X_{L}$ , test label
2:: Output: Classification result Y
3:: For the hyperspectral data $X_{H}$ with input shape $X \in R^{B \times 144 \times H \times W}$ , apply PCA to reduce the 144 spectral bands to 30, yielding an output shape of $X \in R^{B \times 30 \times H \times W}$ . Preprocess the LiDAR data $X_{L}$ so it can be directly fed into the network;
4:: Initialize weight parameters;
5:: while epoch < epochs do
6:: Shallow Feature Extraction
7:: Extract local features of $X_{H 1}, X_{H 2}, X_{H 3}, X_{L 1}, X_{L 2}, X_{L 3}$ through the CNN (Equations (1)–(5)); the resulting feature maps have shape $X \in R^{B \times 64 \times H \times W}$ ;
8:: Fuzzy Learning Model
9:: Compute fuzzy membership functions according to Equations (6) and (7) to model feature uncertainty.
10:: Dynamically adjust attention scores via Equations (8)–(10) to enhance representation.
11:: The structure of the output features is denoted as $X \in R^{B \times 64 \times H \times W}$
12:: Cross-Modal Attention Module
13:: Achieve information complementarity between HSI and LiDAR data through the cross-modal attention module (Equation (11)) to obtain fused features;
14:: FERM
15:: Extract multi-scale features from F using pyramid convolution structure to obtain multi-scale features $X_{F i}$ ;
16:: Apply the FERM module to dynamically adjust feature weights through Equations (12)–(15);
17:: The structure of the output features is denoted as $X \in R^{B \times 64 \times H \times W}$
18:: Fuzzy Fusion Model
19:: Compute attention weights and aggregate features via Equation (16) to obtain updated features.
20:: Integrate multimodal features using the fuzzy fusion strategy described in Equations (17)–(20).
21:: The structure of the output features is denoted as $X \in R^{B \times 192 \times H \times W}$
22:: Classification
23:: Map the fused features to class labels using a softmax classifier through Equation (21);
24:: Output classification result Y with shape $X \in R^{B \times N_c l a s s}$ ;
25:: end while

2.2. Shallow Feature Extraction and Preliminary Fusion Stage

In the second stage of the model, to effectively extract features from HSI and LiDAR data, we designed a fuzzy-guided feature extraction module that combines CNN with fuzzy theory, enabling efficient extraction of both spectral and spatial features. Specifically, local semantic features of HSI and LiDAR are first extracted using CNN to capture the shallow features of the two modalities. To address potential uncertainties that may be irrelevant to the classification task within pixel regions, such as soil pixels mixed with dense vegetation—pixels that can introduce redundant information and noise during the feature extraction process and affect the model’s learning efficacy and classification accuracy—we designed a fuzzy learning module. This module utilizes a fuzzy membership function to weight the features, effectively distinguishing between important and interfering features, thus enhancing the model’s robustness against noise. Subsequently, we employ a cross-modal attention module to capture global context information and facilitate information complementarity between HSI and LiDAR data, further improving the model’s performance.

2.2.1. Shallow Feature Extraction

In each pixel region, the convolutional neural network (CNN) systematically extracts local features through the application of convolutional and pooling layers, thereby effectively capturing both spatial and spectral information inherent in the data. For clarity, let the hyperspectral image be denoted by

X_{H} \in R^{B \times C \times H \times W}

and the corresponding LiDAR data for the same geographic area be represented by

X_{L} \in R^{H \times W}

, where H and W signify the height and width of the image, respectively, B is the batch size, and C indicates the number of spectral bands. The following takes as an example the Houston2013 dataset, and the structure of the original hyperspectral image is represented by

X \in R^{B \times 144 \times H \times W}

. Initially, dimensionality reduction is applied to the HSI data to decrease their dimensionality and mitigate computational complexity. The original 144 band characteristics are reduced to 30 bands, and the output structure is expressed as

X \in R^{B \times 30 \times H \times W}

. Subsequently, the reduced HSI data and the LiDAR data are separately fed into the CNN for feature extraction. Specifically, both the HSI and LiDAR datasets undergo two-dimensional convolutional processing to facilitate the initial extraction of features:

F (i, j) = Conv 2 d (X_{k})

(1)

where i and j represent the indices of the output feature maps, while

X_{k}

denotes the input feature map, and

F (i, j)

signifies the output resulting from the two-dimensional convolution operation. The hyperspectral imaging data, which have been processed by the convolutional layer, along with the LiDAR data, subsequently undergo batch normalization. This procedure is intended to enhance the training efficiency and improve the overall stability of the model:

μ_{B} = \frac{1}{m} \sum_{i = 1}^{m} F_{i}

(2)

σ_{B}^{2} = \frac{1}{m} \sum_{i = 1}^{m} {(F_{i} - μ_{B})}^{2}

(3)

F = γ \cdot \frac{F_{k} - μ_{B}}{\sqrt{σ_{B}^{2} + ϵ}} + β

(4)

where m represents the batch size,

μ_{B}

denotes the batch mean,

σ_{B}

signifies the batch variance, and

ϵ

is a constant introduced to avert division by zero. Additionally,

γ

and

β

are parameters that can be learned, facilitating the adjustment and scaling of the normalized output. Ultimately, the normalized outputs derived from HSI and LiDAR are processed through the LeakyReLU activation function, which incorporates nonlinearity and enhances the model’s capacity to learn intricate features.

f (x) = \{\begin{matrix} x, & if x \geq 0 \\ α x, & if x < 0 \end{matrix}

(5)

where the parameter

α

represents the slope that regulates negative values, while x pertains to the normalized features. A two-layer lightweight CNN architecture was developed for each modal branch, with each layer comprising three blocks. These blocks consist of two-dimensional convolution operations utilizing a kernel size of 3 × 3, followed sequentially by batch normalization and ReLU activation functions. Initially, both HSI and LiDAR data are converted into a standardized feature representation comprising 32 feature channels through the convolutional operation of the first layer. Subsequently, the second layer convolution facilitates the extraction of multi-scale spatial features, resulting in an increase in the number of feature channels to 64, the output feature is obtained, and its structure is

X \in R^{B \times 64 \times H \times W}

.

As illustrated in the preceding example of dense vegetation, the pixel regions within hyperspectral imaging (HSI) may contain a mixture of non-target pixels (such as soil) alongside target objects (such as plants). The presence of these non-target pixels can introduce extraneous information and noise during the feature extraction process, thereby hindering the model’s ability to learn essential features effectively. Consequently, relying solely on convolutional neural networks (CNNs) for feature extraction is insufficient, resulting in inaccuracies in the extraction process.

2.2.2. Fuzzy Learning Model

To address the aforementioned issues, we designed an FLM to effectively process spectral mixing and noise interference in hyperspectral data, thereby enhancing the model’s ability to capture uncertainty information. Classical mathematics depicts objects with precise axioms, while realistic concepts often have blurred boundaries. When accuracy and effectiveness are inconsistent, traditional modeling is prone to failure; fuzzy logic uses membership to characterize gradual semantics to achieve robust decision-making at low cost, taking into account uncertainty and interpretability, and significantly improving generalization capabilities. Unlike traditional binary logic, fuzzy logic allows variables to have a degree of partially belonging to a set, i.e., a degree of membership between 0 and 1. This logic is perfect for dealing with fuzzy, inaccurate concepts and data. The core of fuzzy logic lies in the construction of membership functions, which define the degree to which elements belong to a fuzzy set. Membership functions used for fuzzy logic include triangular type, trapezoidal type, Gaussian type, generalized clock type, Sigmoid type, Z type, S type, etc. The Gaussian function is the density function of a normal distribution, which is naturally associated with probability theory. In scenarios where statistical uncertainty (such as noise) and ambiguity are required, Gaussian membership can seamlessly bridge the two frameworks. Its infinitely differentiable properties make it excellent in gradient optimization (such as neural fuzzy systems, backpropagation algorithms). In contrast, triangular or trapezoidal functions are not derivable at the vertex, which may cause optimization algorithms to oscillate. Therefore, we chose the Gaussian membership function to calculate the membership of each feature channel. Specifically, after the HSI data are processed through two layers of a lightweight CNN, a feature matrix

X \in R^{B \times C \times H \times W}

is generated. The input feature structure is

X \in R^{B \times 64 \times H \times W}

. The FLM utilizes a Gaussian fuzzy membership function to compute the membership degree for each feature channel, mathematically expressed as:

F_{K} (x_{i}) = exp (- \frac{{(x_{i} - μ_{i, K})}^{2}}{2 σ_{i, K}^{2}})

(6)

where

x_{i}

represents the feature matrix corresponding to the ith channel, whereas

μ_{i, K}

and

σ_{i, K}

denote the mean and standard deviation, respectively, of the kth fuzzy membership function associated with the ith channel.

We employ a broadcast-mechanism-based methodology to determine the membership degree of each input feature within each fuzzy cluster. In particular, for each feature denoted as

F_{i}

, we compute its corresponding membership degree

F_{n}

in the fuzzy cluster to derive a normalized membership value. The formula utilized for this calculation is as follows:

F_{n} = (\sum_{i = 1}^{N} exp (- \frac{{(F_{i} - μ_{i})}^{2}}{2 σ_{i}^{2}}))

(7)

where N represents the total number of fuzzy clusters,

μ_{i}

denotes the mean of the ith channel, and

σ_{i}

signifies the standard deviation of the ith channel.

The logsumexp operator is employed to ensure numerical stability and to mitigate the issue of gradient vanishing commonly encountered in conventional fuzzy systems:

F = log (F_{n})

(8)

The output is transformed back to its original scale through an exponential transformation, enabling batch normalization to be applied. This approach serves to stabilize the training process and enhance convergence rates. Specifically, we implement batch normalization on the results following both logarithmic and exponential transformations:

{\hat{Y}}_{i} = B N (F_{i})

(9)

where

B N

denotes the batch normalization operation. This architectural choice enables the network to dynamically modify the fuzzy rules, facilitating a seamless integration of fuzzy reasoning with deep learning methodologies. Consequently, this integration improves the model’s capacity to effectively manage complex data distributions.

In conclusion, we have developed a feature fusion mechanism that incorporates a residual connection. This approach involves combining the normalized features derived from the batch with the original input features to produce the final output:

Y_{i} = {\hat{Y}}_{i} + F_{i}

(10)

where

{\hat{Y}}_{i}

represents the features that have undergone fuzzy logic processing and batch normalization, and

F_{i}

represents the original input features. Specifically, FLM uses the fuzzy theory to deeply model the uncertainty and fuzziness of features through the Gaussian membership function, and its structure is shown in Figure 2. First, the input feature sequence is expanded, and the membership degree of each feature in the fuzzy clustering of different categories is calculated using the Gaussian membership function to obtain the feature–category membership matrix; then, the log-sum-exp aggregation strategy is used to nonlinearly fuse the membership information to generate a feature representation that can characterize the fuzzy attribute relationship; finally, the fuzzy features are fused with the original features through residual connections, effectively capturing the fuzzy transition areas in complex scenes such as mixed pixels and multi-category target boundaries, and an output feature is obtained, with the structure representation

X \in R^{B \times 64 \times H \times W}

.

Through the above design, while leveraging the advantages of convolutional layers in extracting local context features, a fuzzy learning module is innovatively introduced. This addresses issues such as spectral mixing and noise interference in hyperspectral data, and enhances the model’s adaptability to uncertainty and complex scenarios.

2.2.3. Cross-Modal Attention Module

To capture the global contextual information of multimodal data and perform initial fusion, we designed the cross-modal attention module. First, the features processed by the convolutional layers, denoted as

Y_{H}

and

Y_{L}

, are defined as the target branch and the auxiliary branch, respectively. This design allows the model to establish cross-attention between the features of different modalities, enabling a richer contextual representation. The input features are initially transformed into three different representations, Query, Key, and Value, using convolution with a kernel size of

1 \times 1

. Subsequently, the original attention weights are calculated through matrix multiplication between the Q and the K. The computed attention weights f are then normalized.

f_{norm} = softmax (f)

(11)

Subsequently, a weighted sum is performed with the V features to generate a feature representation that incorporates global contextual information. Finally, these features are processed through a zero-initialized

1 \times 1

convolution to restore the number of channels, completing the feature enhancement through a residual connection.

The design of this module not only enhances the information interaction between different modalities but also improves the understanding of global context, thereby contributing to better performance of the model in handling complex tasks.

2.3. Multi-Scale Deep Feature Extraction and Fusion Stage

In the third stage, to extract deeper features from multimodal data, we employ a pyramid convolutional structure combined with a Feature Enhancement and Recalibration Module (FERM), capturing multi-scale feature information. Additionally, a fuzzy fusion module (FFM) is introduced to perform weighted fusion of multi-scale features, enabling the model to focus more precisely on feature regions critical to the classification task, thereby effectively improving classification accuracy. The pyramid convolution integrated with the feature enhancement module extracts rich multi-scale features from the initially fused multimodal data and refines these features through detailed calibration and enhancement, further boosting their expressive power. However, this approach can introduce noise and redundancy, limiting the ability to fully capture all essential data information. To address this, we incorporate the fuzzy fusion module, which leverages the complementarity between data sources to dynamically assess the importance of features at different scales, resulting in fused features that are more concentrated on regions relevant to the classification task, thus significantly enhancing classification performance.

2.3.1. FERM

During multi-scale feature extraction, substantial redundant information is introduced, adversely impacting the model’s computational efficiency and classification precision. To address this, we have developed a feature enhancement module that dynamically modulates feature weights, minimizes redundancy, and amplifies the discriminative capacity of salient features. Specifically, the LiDAR data feature patch at each scale is denoted as

F_{L} \in L_{w \times w \times C}

, while the HSI data feature patch is denoted as

F_{H} \in L_{w \times w \times C}

, with C representing the number of channels. The enhanced features are computed as follows:

F_{1} = F_{H} \otimes sigmoid (conv (N^{avg} \oplus N^{\max}))

(12)

where

F_{1}

represents the output feature map,

F_{H}

denotes the input feature map,

N^{avg}

is the spectrum-based average pooled feature map,

N^{\max}

is the spectrum-based max pooled feature map, and ⊗ signifies element-wise multiplication. Subsequently, nonlinear activation is applied via the Sigmoid function to enable the model to learn more intricate feature representations:

\hat{F} = sigmoid (conv (cat [M^{avg}, M^{\max}]))

(13)

where

M^{avg}

represents the feature map derived from spatial average pooling, and

M^{\max}

denotes the feature map obtained through max pooling. Subsequently, the activation values of these feature maps are recalibrated via an adjusted activation function to amplify salient features and attenuate less relevant ones:

{\hat{F}}_{i, j} = \{\begin{matrix} {\hat{F}}_{i, j}, & if {\hat{F}}_{i, j} \geq β \\ 0, & if {\hat{F}}_{i, j} < β \end{matrix}

(14)

where

{\hat{F}}_{i, j} \in \hat{F}

denotes the value at position

(i, j)

;

β

represents the similarity threshold used to assess whether a pixel label corresponds to the central pixel. Ultimately, the refined feature map is obtained by element-wise multiplication of the recalibrated attention weights with the original feature map to generate a spatially attentive feature representation:

F = F_{L} \otimes {\hat{F}}_{i j}

(15)

2.3.2. Fuzzy Fusion Model

To efficiently fuse multimodal features of different scales, after adopting the pyramid convolution structure combined with the feature enhancement module, obtain three different scale features, we introduced a fuzzy fusion strategy, as shown in Figure 3. This module uses Gaussian membership function to model the uncertainty existing in the feature distribution, thereby effectively capturing the continuous change trend between features and enhancing the representation reliability of features in the fuzzy subspace. This strategy utilizes fuzzy rules to eliminate redundant and interference information, enhancing feature representation and optimizing the overall feature representation. In addition, the FFM can dynamically adjust the fusion weight according to the complementarity between multimodal data, promoting deep interaction and information enhancement between features. Therefore, this further enhances the model’s ability to capture cross-modal dynamic information correlations in complex and uncertain scenarios. Specifically, three-branch multi-scale enhancement features extracted by pyramid convolution, denoted as

X_{F_{i}} \in R^{B \times 64 \times H \times W}, i = {1, 2, 3}

, is derived through the application of a Gaussian membership function to perform fuzzification.

F (X_{F}) = \frac{1}{σ \sqrt{2 π}} exp (- \frac{(X_{F} - μ^{2})}{2 σ^{2}}) + X_{F}

(16)

where

μ

represents the mean vector of the input feature set x,

σ

denotes the standard deviation vector, and the exponential component models the Gaussian membership function of the feature distribution. Subsequently, the features derived through fuzzy inference are subjected to a linear transformation to facilitate the extraction of higher-order feature representations:

X_{F}^{'} = Linear (X_{F} + F (X_{F})) + b

(17)

where b represents the bias parameter. The features derived from the linear transformation are subsequently subjected to batch normalization to enhance training stability. The mathematical formulation is as follows:

B N (X_{F}^{'}) = \frac{X_{F}^{'} - μ_{X_{F}^{'}}}{\sqrt{σ_{X_{F}^{'}}^{2} + ϵ}} \times γ + β

(18)

where

μ_{X_{F}^{'}}

and

σ_{X_{F}^{'}}

denote the mean and standard deviation parameters of the batch normalization layer, respectively;

ϵ

represents a small constant to ensure numerical stability by preventing division by zero; and

γ

and

β

are trainable scale and shift parameters.

The features extracted post-batch normalization are subjected to a Dropout transformation, with a subset of these features randomly omitted to mitigate overfitting during model training. The mathematical formulation is as follows:

Dropout (B N (X_{F}^{'})) = B N (X_{F}^{'}) ⊙ mask

(19)

where the mask is a stochastic binary matrix, and the features post-Dropout are integrated with the original input features through element-wise addition to produce the resultant output:

F_{f} = Dropout (B N (X_{F}^{'})) + X_{F}^{'}

(20)

The residual connections alleviate the issue of gradient vanishing. Based on the principle of fuzzy logic, the FFM performs fuzzy modeling and interactive processing of complementary information from HSI and LiDAR data, thereby effectively capturing the correlation and uncertainty between multimodal features. Specifically, first, the three-branch multi-scale enhancement features of the pyramid convolution output are expanded, and the membership of each channel-space position in different categories is calculated using the Gaussian membership function. After backpropagation and update, the optimal set of Gaussian parameters

{μ, σ}

is obtained. The hyperspectrum with a proportion of

μ

is fused with the LiDAR with a proportion of

σ

, thereby obtaining the feature–category membership matrix; then, nonlinear aggregation of the membership information is performed through linear transformation and batch normalization to generate a fuzzy feature representation that can characterize the uncertainty of mixed pixels and category boundaries in complex scenes; finally, residual connections are used to fuse the fuzzy features with the original features, effectively alleviating the gradient vanishing problem while retaining higher-order discriminant information and realizing accurate capture of the fuzzy transition area, the output feature is obtained, and its structure is expressed as

X \in R^{B \times 192 \times H \times W}

.

2.4. Classification Stage

In the final stage of the model, the optimized fuzzy features are input into the classifier to produce the final classification results. Specifically, the fused features

F_{f}

are mapped via a fully connected layer, and the softmax function is applied to convert the mapped features into class probabilities:

Y = Softmax (W F_{f} + b)

(21)

where W denotes the weight matrix, b the bias vector,

F_{f}

the fused feature representations, and Y the model’s predicted output. The optimized fuzzy features are effectively projected onto the class label space, enabling the derivation of classification results for all pixels within the hyperspectral image (HSI) and LiDAR datasets. The structure of the classification output is expressed as

X \in R^{B \times N_c l a s s}

.

3. Experiments

3.1. Dataset Description

In our experiments, we evaluated the performance of the proposed method on three widely used hyperspectral datasets: Houston2013 dataset, Trento dataset, and MUUFL dataset:

(1): Houston2013 Dataset: The Houston2013 dataset was acquired in 2013 in the agricultural and urban areas in the northwest of Houston, USA. The dataset contains images with $349 \times 1905$ pixels and 144 bands, covering a spectral range from 0.4 to 2.5 $μ$ m, with a spatial resolution of 2.5 m. The ground truth contains 15,029 labeled samples from 15 land-cover classes, mainly representing various crops, grasslands, forests, residential areas, commercial areas, roads, and water bodies. In the experiment, 20 samples from each class were selected for training, and the rest were used as test samples.
(2): Trento Dataset: The Trento dataset was acquired in the rural area south of Trento, Italy, in 2007. The dataset contains images with $600 \times 166$ pixels and 63 bands, covering a spectral range from 0.42 to 0.99 $μ$ m, with a spatial resolution of 1 m. The ground truth contains 30,214 labeled samples from six land-cover classes, mainly representing apple trees, buildings, ground, wood, vineyards, and roads. In the experiment, five samples from each class were selected for training, and the rest were used as test samples.
(3): MUUFL Dataset: The MUUFL dataset was acquired in November 2010 at the Gulf Park campus of the University of Southern Mississippi in Long Beach, Mississippi. The dataset contains images with $325 \times 220$ pixels and 64 bands, covering a spectral range from 0.38 to 1.05 $μ$ m, with a spatial resolution of 0.54 × 1.0 m. The ground truth includes 53,687 labeled samples from 11 land-cover classes, mainly representing trees, grasslands, mixed ground, and sand. In the experiment, 20 samples from each class were selected for training, and the rest were used as test samples.

The datasets are available at: https://github.com/AnkurDeria/MFT ( accessed on 9 January 2025). The names of land categories, along with the numbers of training and testing samples used in the experiments for the three datasets mentioned above, are presented in Table 1. Figure 4, Figure 5 and Figure 6 illustrate the hyperspectral image in false color, the LiDAR intensity image in grayscale, and the corresponding ground-truth classification map.

3.2. Experimental Setup

(1) Evaluation Metrics: To evaluate the performance of the proposed method and the comparative algorithms, three widely utilized metrics in the hyperspectral image classification domain were employed: overall accuracy (OA), average accuracy (AA), and Kappa coefficient. By comprehensively analyzing OA, AA, and the Kappa coefficient, this study offers a holistic and balanced assessment of classification performance. This approach not only emphasizes the overall classification effectiveness but also considers the performance across various classes and the statistical significance of the results.

(2) Environment Configuration: The computational framework employed in this investigation was the PyTorch v2.3.0 deep learning library. The experimental hardware comprised an Intel Xeon Gold 5320 (2.20 GHz)-Intel Corporation, Santa Clara, CA, USA and an NVIDIA A40 (48 GB VRAM)-NVIDIA Corporation, Santa Clara, CA, USA, with all processing accelerated via the CUDA 11.8 parallel computing architecture. Model training was conducted using the Adam optimization algorithm, with a learning rate of

4 \times 10^{- 4}

and a maximum of 500 epochs. Multi-scale feature extraction was achieved through convolutional kernels of sizes

3 \times 3

,

5 \times 5

, and

7 \times 7

. Batch size was maintained at 64, and the fuzzy set counts were configured at 30, 50, and 50, respectively. To ensure statistical robustness and result stability, each experimental procedure was repeated ten times, and the mean of the outcomes was reported as the definitive result.

Additionally, to verify the effectiveness of the proposed method, we compared it against several state-of-the-art hyperspectral image classification algorithms, including MS2CANet [40], CoupledCNNs [41], MSA-GCN [42], ExViT [43], S3F2Net [13], GLTNet [44], CALC [45], and DSHFNet [28]. It is important to note that for all methods, the same training and test splits described in the data section were utilized to ensure a fair comparison. All experiments were conducted within an identical computing environment, and the hyperparameters for each method were fine-tuned following the recommendations provided in their respective publications.

3.3. Quantitative Results and Analysis

(1) Houston2013 Dataset: For the Houston2013 dataset, The classification results of each model on the Houston2013 dataset are presented in Table 2, accompanied by the corresponding visual representations in Figure 7. As indicated in Table 2, our model outperformed all comparative methods across the three key metrics, OA, AA, and Kappa coefficient, for the Houston2013 dataset. Notably, in terms of OA, our model achieved improvements of 1.52%, 1.76%, 1.10%, 3.7%, 2.63%, 0.7%, 3.07%, and 0.90% over CALC, Coupled-CNNs, ExViT, MSA-GCN, S3F2Net, GLT, DSHFNet, and MS2CANet, respectively. This substantial difference illustrates that our method effectively captured both the spectral and spatial characteristics of various categories.

Examining Figure 7 reveals that alternative models experienced significant challenges with edge classification and noise within the classification diagrams. For instance, while CALC performed adequately in the C5 category, it presented notable noise and erroneous boundary categorizations in the C6 category. ExViT demonstrated considerable misclassification when addressing the C6 and C9 categories, highlighting its limitations in processing local geographical data. MSA-GCN and DSHFNet excelled in classifying C14 and C15 categories but struggled with accurate boundary classification in C9. Although Coupled-CNNs and S3F2Net were effective in most categories, they failed to optimally capture the local structure of the C8 category, resulting in decreased classification accuracy. GLTNet and MS2CANet displayed stability across several categories, yet issues of noise and blurred boundaries persisted in the challenging C6 category. In contrast, our model consistently outperformed the comparative models across the majority of categories, effectively reducing noise and accurately delineating the boundaries of complex categories, such as C5.

This series of results clearly indicates that our model can effectively capture the spectral and spatial features of various categories, achieving accurate classification of different types of ground objects when processing the Houston2013 dataset.

(2) Trento Dataset: On the Trento dataset, the classification results of each model are shown in Table 3, and the corresponding visual effects are shown in Figure 8. Figure 8a is a standard label diagram, and Figure 8b–j show the classification results of different algorithms. Table 3 shows that our model outperformed other comparison models in three important indexes, OA, AA, and Kappa, with respective values of 98.28%, 96.73%, and 98.28%. This outcome demonstrates how our approach may more precisely capture the spatial and spectral information in the overall classification problem. Our model attained notable advantages in numerous categories when viewed from different angles. Our model’s accuracy of 91.77% in the C6 category, for instance, was much higher than MS2CANet’s accuracy of 85.83%. This suggests that the model can more effectively balance the fitting degree and generalization ability, fully utilize spectral and spatial characteristics, and accurately distinguish objects that are very similar.

Other models clearly fell short when it came to handling boundary classification and noise issues, as seen by the classification results in Figure 8b–j. For instance, ExViT and CALC performed well in C4 and C5 categories, but while working with C3 and C6 categories, there were more sounds, which caused the borders to become hazy. S3F2Net and MS2CANet performed well in the majority of categories; however, they still misclassified samples from the C6 category, suggesting that they lacked in their capacity to capture global context. The accuracy of the classification decreased as a result of Coupled-CNNs and DSHFNet’s inability to manage the C2 class border. In contrast, the classification result of our model was better than that of the contrast model in most categories, which fully verifies the effectiveness of our method in local feature enhancement and context space processing and explains its excellent performance and robustness in complex scenes and boundary classification tasks.

(3) MUUFL Dataset: On the MUUFL dataset, the classification results of each model are shown in Table 4, and the corresponding visual effects are shown in Figure 9. Figure 9a is a standard label diagram, and Figure 9b–j show the classification results of different algorithms. As illustrated in Table 4, the coefficients of overall accuracy (OA), average accuracy (AA), and Kappa for our model were 85.32%, 84.81%, and 85.32%, respectively. These results underscore the model’s significant advantages in addressing complex land object classification tasks. Furthermore, our model demonstrated superior performance compared to the benchmark model across the majority of categories. For instance, in the C1 category, our model effectively captured the characteristics of trees, achieving a classification accuracy of 90.25%, which markedly exceeded ExViT’s accuracy of 64.42%. This finding further corroborates the model’s exceptional capability in managing challenges associated with overlapping categories.

The classification results presented in Figure 9b–j indicate that several models exhibited significant limitations in addressing the boundary regions of complex categories. Specifically, the IF_CALC and GLTNet models demonstrated increased noise levels when classifying the C9 category, resulting in indistinct boundaries. Although Coupled-CNNs and MS2CANet showed commendable performance across most categories, their handling of the C3 category’s boundaries was suboptimal, which adversely affected classification accuracy. In contrast, the FE-MCFN model demonstrated superior accuracy in classifying boundary regions of complex categories, with a notable reduction in noise and achieving a classification accuracy of 100% for the C6 category. The incorporation of fuzzy learning and fuzzy fusion within this model effectively captures the spectral–spatial complexities and mitigates boundary ambiguities between categories, thereby enhancing performance in the boundary regions of complex categories and improving overall classification consistency.

(4) Feature Distribution Analysis: To illustrate the visual contrast effect of each model classification more effectively, Figure 10. presents the T-SNE feature distribution for S3F2Net, CALC, MS2CANet, and FE-MCFN across three datasets. The experimental findings indicate that the boundaries between different categories in our proposed method are distinctly defined, with minimal overlap, thereby demonstrating a strong feature discrimination capability. In contrast, the other methods exhibit considerable variability in the distribution of within-class features, highlighting their limitations in addressing complex spectral mixed scenarios. Our approach dynamically adjusts feature weights through the membership function inherent in fuzzy logic, which significantly mitigates the variability in features within each class and results in the formation of more compact clusters. Additionally, the implementation of a fuzzy fusion strategy facilitates the establishment of clearer inter-class boundaries within the feature space. This innovative methodology, which integrates fuzzy theory with deep learning, offers a novel perspective for addressing issues of class ambiguity and mixed pixels in hyperspectral data, thereby enhancing the stability and generalization capacity of the classification system.

3.4. Ablation Analysis

To validate the effectiveness of the various modules in the proposed FE-MCFN model, we conducted ablation experiments on the Houston2013, Trento, and MUUFL datasets. The FE-MCFN model comprises two key modules: the fuzzy learning module (FLM) and the fuzzy fusion module (FFM). By integrating individual modules and then combining both modules, we analyzed the specific contributions of each module to overall classification performance. Table 5 presents the performance under different module combinations.

The experimental results indicate that integrating each module significantly enhanced the classification performance of the model. For instance, on the Houston2013 dataset, the overall accuracy (OA) of the baseline model was

94.31 %

, demonstrating the CNN’s capability to capture local spectral and spatial relationships. Following the integration of the FLM, the OA improved to

94.57 %

, indicating that the fuzzy learning module effectively enhances the model’s ability to process uncertainty information. With the addition of the FFM, the OA reached

94.73 %

, highlighting the crucial role of the fuzzy fusion module in integrating multi-scale features and optimizing feature representation. Similar performance improvements were observed in the Trento and MUUFL datasets. On the Trento dataset, the OA of the baseline model was

97.32 %

, while the OA of the complete model with all modules integrated increased to

98.28 %

, demonstrating the collaborative effect of each module in enhancing classification accuracy. For the MUUFL dataset, the model’s performance improved from

81.29 %

to

85.32 %

, further validating the model’s robustness in handling complex data.

These experimental results indicate that the fuzzy learning module effectively addresses issues of spectral mixing and noise interference, thereby enhancing the model’s feature extraction capability. Meanwhile, the fuzzy fusion module significantly improves classification accuracy and robustness through dynamically weighted integration of multi-source features. The synergistic interaction of these modules greatly enhances the classification performance of the model.

3.5. Parameter Sensitivity Analysis

In this study, we analyzed the influence of learning rate and batch size on model performance, aiming to find the best parameter configuration to optimize model training. Different learning rates

{1 \times 10^{- 3}, 1 \times 10^{- 4}, 2 \times 10^{- 4}, 4 \times 10^{- 4}}

and batch sizes

{16, 32, 64, 128}

were selected to determine the best learning rate and batch size. The result is shown in Figure 11. For the Houston2013 dataset, a higher learning rate improved the model performance. For MUUFL dataset, when the learning rate was

4 \times 10^{- 4}

or

1 \times 10^{- 3}

, better accuracy was achieved with a large batch. On the Trento dataset, a higher learning rate combined with a larger batch size led to better classification performance and helped the model to be more stable. Therefore, the final determination of a learning rate of

4 \times 10^{- 4}

and batch size of 64 was the best choice for that model.

3.6. Fuzzy Set Quantity Impact Analysis

In this study, we discussed the influence of fuzzy membership set number on the performance of hyperspectral image classification model. By choosing different numbers of fuzzy sets

{10, 20, 30, 40, 50}

, we determined the best setting on the three datasets. The experimental results are shown in Figure 12. In the Houston2013 dataset, when the number of fuzzy clusters was set to 30, the model reached its best performance, and OA was increased to

95.39 %

, AA to

95.45 %

, and Kappa to

95.39 %

, which was significantly higher than that in the 10 sets. Compared with the suboptimal value, the OA of the MUUFL dataset increased by

3.88 %

in 50 episodes. In 50 episodes of the Trento dataset, OA increased by

0.42 %

compared with the suboptimal value. The results show that when the number of fuzzy clusters was set to 30, 50, and 50, respectively, the three datasets all reached an optimal balance among OA, AA and Kappa coefficients, indicating that the number of fuzzy sets is very important to the model performance.

3.7. Performance Analysis of Different Training Samples

To assess the robustness of the proposed method across varying training sample ratios, we conducted experiments utilizing three distinct datasets. The results of these experiments are illustrated in Figure 13. On the Houston2013 and MUUFL datasets, the training samples for each category were set at

{20, 40, 60, 80, 100}

, while on the Trento dataset, the training samples were

{5, 10, 20, 30, 40}

. The findings indicate that the FE-MCFN method exhibits significant robustness and adaptability, particularly in scenarios characterized by limited sample sizes. Notably, as the number of training samples increases, the model’s classification performance consistently improves, reflecting its strong learning capabilities. This trend suggests that FE-MCFN effectively balances the extraction of both local and global features. Importantly, even with a reduced number of samples, the model maintains a high level of accuracy, demonstrating its adaptability in addressing data scarcity challenges. Across all three datasets, despite the limited training samples, FE-MCFN successfully captured essential feature information through its fuzzy learning and fusion mechanisms. This capability enhances the model’s comprehension of complex spectral and spatial relationships, thereby significantly improving classification accuracy and further substantiating its robust generalization ability.

3.8. Model Stability Analysis

To assess the classification robustness of the model across diverse scenarios, a stability analysis was performed on three datasets. Figure 14 illustrates the OA, AA, and Kappa coefficients for each class. On the Houston2013 dataset, the OA, AA, and Kappa coefficients for FE-MCFN were

95.21 \pm 0.11 %

,

95.38 \pm 0.10 %

, and

95.21 \pm 0.19 %

, respectively, demonstrating high stability and consistency. Notably, the C14 and C15 classes achieved 100% classification accuracy, with standard deviations across 11 classes being ≤3%. On the MUUFL dataset, the model maintained a stable OA of

85.32 \pm 0.22 %

under complex ground object conditions. On the Trento dataset, OA reached

98.28 \pm 0.07 %

, with all class standard deviations being ≤0.21%, indicating excellent robustness. The experimental findings confirm that FE-MCFN attains synergistic improvements in accuracy and stability for cross-scene and multi-class classification tasks by integrating multi-scale feature extraction and adaptive optimization mechanisms, thereby demonstrating strong robustness and generalization capabilities.

3.9. Comparison of Computational Efficiency

To comprehensively evaluate the computational efficiency of the FE-MCFN model, we conducted a systematic comparison of various classification models across three datasets, focusing on metrics such as training time, testing time, number of parameters, and computational complexity (FLOPs). As shown in Table 6, although FE-MCFN did not achieve the shortest training and testing times, it demonstrated excellent performance in terms of computational complexity, ranking third among nine comparative models with a complexity of

7.45

G. Furthermore, FE-MCFN maintained a balance between training and testing times across different datasets. For instance, on the Houston2013 dataset, while the training time was slightly longer, the testing time was only

4.78

s, showcasing its efficient inference capability. The experimental results indicate that FE-MCFN optimizes feature extraction and fusion strategies, ensuring efficient inference speed while reducing computational complexity. This adaptability meets the demands of diverse datasets and achieves a favorable balance between accuracy and resource consumption.

4. Conclusions

In this paper, we proposed a Fuzzy-Enhanced Multi-scale Cross-modal Fusion Network (FE-MCFN) for the classification of hyperspectral image (HSI) and LiDAR data. FE-MCFN synergistically integrates the local feature extraction capabilities of convolutional neural networks (CNNs) with fuzzy logic principles to effectively address data uncertainty. Specifically, CNNs are employed to extract local spectral and spatial features from HSI and LiDAR modalities. The cross-modal attention mechanism models spatial relationships across different modalities, capturing global contextual information and enhancing overall classification performance. Additionally, the pyramid convolution structure, combined with the FERM, facilitates the dynamic selection of salient features while reducing redundancy, thereby augmenting the model’s expressive capacity. The incorporation of fuzzy learning and fuzzy fusion modules further enhances the model’s ability to manage inter-class uncertainty and exploit the complementary nature of the data. Experimental validations across three datasets of varying scales demonstrated that FE-MCFN significantly outperformed existing state-of-the-art methods in terms of classification accuracy and computational efficiency, confirming its robustness and applicability in diverse remote sensing scenarios. Future research will focus on optimizing the feature extraction methodologies of FE-MCFN to further enhance its robustness and classification efficacy.

Author Contributions

Conceptualization, S.W. and M.J.; methodology, S.W. and M.J.; software, S.W.; validation, S.W., M.J. and J.D.; formal analysis, S.W.; investigation, M.J.; resources, J.D.; data curation, S.W.; writing—original draft preparation, S.W.; writing—review and editing, S.W., M.J., and J.D.; visualization, M.J.; supervision, J.D.; project administration, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Suggested Data Availability Statements are available in Section 3.1 Dataset Description.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Faisal, S.; Ooi, M.P.-L.; Abeysekera, S.K.; Kuang, Y.-C.; Fletcher, D. Roadmap for Measurement and Applications: Uncertainty Quantification and Visualization for Optimal Decision-Making in Hyperspectral Imaging-Based Precision Agriculture. IEEE Instrum. Meas. Mag. 2025, 28, 23–32. [Google Scholar] [CrossRef]
Liu, H.; He, J.; Li, Y.; Bi, Y. Multilevel Prototype Alignment for Cross-Domain Few-Shot Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4400115. [Google Scholar] [CrossRef]
Qin, B.; Feng, S.; Zhao, C.; Li, W.; Tao, R.; Zhou, J. Language-Enhanced Dual-Level Contrastive Learning Network for Open-Set Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5508114. [Google Scholar] [CrossRef]
Zhang, Y.; Du, B.; Zhang, L.; Wang, S. A Low-Rank and Sparse Matrix Decomposition-Based Mahalanobis Distance Method for Hyperspectral Anomaly Detection. IEEE Trans. Geosci. Remote Sens. 2016, 54, 1376–1389. [Google Scholar] [CrossRef]
Xia, J.; Yokoya, N.; Iwasaki, A. Fusion of Hyperspectral and LiDAR Data With a Novel Ensemble Classifier. IEEE Geosci. Remote Sens. Lett. 2018, 15, 957–961. [Google Scholar] [CrossRef]
Zhang, M.; Li, W.; Zhang, Y.; Tao, R.; Du, Q. Hyperspectral and LiDAR Data Classification Based on Structural Optimization Transmission. IEEE Trans. Cybern. 2023, 53, 3153–3164. [Google Scholar] [CrossRef] [PubMed]
Wang, M.; Gao, F.; Dong, J.; Li, H.-C.; Du, Q. Nearest Neighbor-Based Contrastive Learning for Hyperspectral and LiDAR Data Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5501816. [Google Scholar] [CrossRef]
Hang, R.; Li, Z.; Ghamisi, P.; Hong, D.; Xia, G.; Liu, Q. Classification of Hyperspectral and LiDAR Data Using Coupled CNNs. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4939–4950. [Google Scholar] [CrossRef]
Chen, Y.; Li, C.; Ghamisi, P.; Jia, X.; Gu, Y. Deep Fusion of Remote Sensing Data for Accurate Classification. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1253–1257. [Google Scholar] [CrossRef]
Cai, J.; Zhang, M.; Yang, H.; He, Y.; Yang, Y.; Shi, C.; Zhao, X.; Xun, Y. A novel graph-attention based multimodal fusion network for joint classification of hyperspectral image and LiDAR data. Expert Syst. Appl. 2024, 249, 123587. [Google Scholar] [CrossRef]
Zhang, Y.; Peng, Y.; Tu, B.; Liu, Y. Local Information Interaction Transformer for Hyperspectral and LiDAR Data Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 130–143. [Google Scholar] [CrossRef]
Xue, Z.; Tan, X.; Yu, X.; Liu, B.; Yu, A.; Zhang, P. Deep Hierarchical Vision Transformer for Hyperspectral and LiDAR Data Classification. IEEE Trans. Image Process. 2022, 31, 3095–3110. [Google Scholar] [CrossRef] [PubMed]
Wang, X.; Song, L.; Feng, Y.; Zhu, J. S3F2Net: Spatial-Spectral-Structural Feature Fusion Network for Hyperspectral Image and LiDAR Data Classification. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 4801–4815. [Google Scholar] [CrossRef]
Zhao, G.; Ye, Q.; Sun, L.; Wu, Z.; Pan, C.; Jeon, B. Joint Classification of Hyperspectral and LiDAR Data Using a Hierarchical CNN and Transformer. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5500716. [Google Scholar] [CrossRef]
Li, S.; Huang, S. AFA–Mamba: Adaptive Feature Alignment with Global–Local Mamba for Hyperspectral and LiDAR Data Classification. Remote Sens. 2024, 16, 4050. [Google Scholar] [CrossRef]
He, Y.; Tu, B.; Jiang, P.; Liu, B.; Li, J.; Plaza, A. Classification of Multisource Remote Sensing Data Using Slice Mamba. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5505414. [Google Scholar] [CrossRef]
Roy, S.K.; Deria, A.; Hong, D.; Ahmad, M.; Plaza, A.; Chanussot, J. Hyperspectral and LiDAR Data Classification Using Joint CNNs and Morphological Feature Learning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5530416. [Google Scholar] [CrossRef]
Zhao, X.; Tao, R.; Li, W.; Li, H.-C.; Du, Q.; Liao, W.; Philips, W. Joint Classification of Hyperspectral and LiDAR Data Using Hierarchical Random Walk and Deep CNN Architecture. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7355–7370. [Google Scholar] [CrossRef]
Zhang, M.; Li, W.; Tao, R.; Li, H.; Du, Q. Information Fusion for Classification of Hyperspectral and LiDAR Data Using IP-CNN. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5506812. [Google Scholar] [CrossRef]
Song, H.; Yang, Y.; Gao, X.; Zhang, M.; Li, S.; Liu, B.; Wang, Y.; Kou, Y. Joint Classification of Hyperspectral and LiDAR Data Using Binary-Tree Transformer Network. Remote Sens. 2023, 15, 2706. [Google Scholar] [CrossRef]
Liao, D.; Wang, Q.; Lai, T.; Huang, H. Joint Classification of Hyperspectral and LiDAR Data Based on Mamba. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5530915. [Google Scholar] [CrossRef]
Huang, S.; Lu, Y.; Wang, W.; Sun, K. Multi-scale guided feature extraction and classification algorithm for hyperspectral images. Sci. Rep. 2021, 11, 18396. [Google Scholar] [CrossRef] [PubMed]
Bai, H.; Xu, T.; Chen, H.; Liu, P.; Li, J. Content-Driven Magnitude-Derivative Spectrum Complementary Learning for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5524914. [Google Scholar] [CrossRef]
Bai, L.; Liu, Q.; Li, C.; Ye, Z.; Hui, M.; Jia, X. Remote Sensing Image Scene Classification Using Multiscale Feature Fusion Covariance Network With Octave Convolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5620214. [Google Scholar] [CrossRef]
Liu, Y.; Ye, Z.; Xi, Y.; Liu, H.; Li, W.; Bai, L. Multiscale and Multidirection Feature Extraction Network for Hyperspectral and LiDAR Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 9961–9973. [Google Scholar] [CrossRef]
Ni, K.; Wang, D.; Zheng, Z.; Wang, P. MHST: Multiscale Head Selection Transformer for Hyperspectral and LiDAR Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 5470–5483. [Google Scholar] [CrossRef]
Ge, H.; Wang, L.; Pan, H.; Liu, Y.; Li, C.; Lv, D.; Ma, H. Cross Attention-Based Multi-Scale Convolutional Fusion Network for Hyperspectral and LiDAR Joint Classification. Remote Sens. 2024, 16, 4073. [Google Scholar] [CrossRef]
Feng, Y.; Song, L.; Wang, L.; Wang, X. DSHFNet: Dynamic Scale Hierarchical Fusion Network Based on Multiattention for Hyperspectral Image and LiDAR Data Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5522514. [Google Scholar] [CrossRef]
Chang, J.; He, X.; Li, P.; Tian, T.; Cheng, X.; Qiao, M.; Zhou, T.; Zhang, B.; Chang, Z.; Fan, T. Multi-Scale Attention Network for Building Extraction from High-Resolution Remote Sensing Images. Sensors 2024, 24, 1010. [Google Scholar] [CrossRef]
Tu, B.; Li, N.; Fang, L.; He, D.; Ghamisi, P. Hyperspectral Image Classification with Multi-Scale Feature Extraction. Remote Sens. 2019, 11, 534. [Google Scholar] [CrossRef]
Wang, A.; Lei, G.; Dai, S.; Wu, H.; Iwahori, Y. Multiscale Attention Feature Fusion Based on Improved Transformer for Hyperspectral Image and LiDAR Data Classification. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 2025, 18, 4124–4140. [Google Scholar] [CrossRef]
Yang, J.X.; Wang, J.; Li, Z.F.; Sui, C.H.; Long, Z.K.; Zhou, J. HSLiNets: Evaluating Band Ordering Strategies in Hyperspectral and LiDAR Fusion. IEEE Geosci. Remote Sens. Lett. 2025, 22, 5505605. [Google Scholar] [CrossRef]
Gao, F.; Jin, X.; Zhou, X.; Dong, J.; Du, Q. MSFMamba: Multiscale Feature Fusion State Space Model for Multisource Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5504116. [Google Scholar] [CrossRef]
Zhao, K.; Gao, Q.; Hao, S.; Sun, J.; Zhou, L. Credible Remote Sensing Scene Classification Using Evidential Fusion on Aerial-Ground Dual-View Images. Remote Sens. 2023, 15, 1546. [Google Scholar] [CrossRef]
Zhu, F.; Shi, C.; Shi, K.; Wang, L. Joint Classification of Hyperspectral and LiDAR Data Using Hierarchical Multimodal Feature Aggregation-Based Multihead Axial Attention Transformer. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5503817. [Google Scholar] [CrossRef]
Li, Y.; Ge, C.; Sun, W.; Peng, J.; Du, Q.; Wang, K. Hyperspectral and LiDAR Data Fusion Classification Using Superpixel Segmentation-Based Local Pixel Neighborhood Preserving Embedding. Remote Sens. 2019, 11, 550. [Google Scholar] [CrossRef]
Jia, S.; Zhou, X.; Jiang, S.; He, R. Collaborative Contrastive Learning for Hyperspectral and LiDAR Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5507714. [Google Scholar] [CrossRef]
Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–Spatial Feature Tokenization Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5522214. [Google Scholar] [CrossRef]
Li, X.; Ding, M.; Pižurica, A. Deep Feature Fusion via Two-Stream Convolutional Neural Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 2615–2629. [Google Scholar] [CrossRef]
Wang, X.; Zhu, J.; Feng, Y.; Wang, L. MS2CANet: Multiscale Spatial–Spectral Cross-Modal Attention Network for Hyperspectral Image and LiDAR Classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5501505. [Google Scholar] [CrossRef]
Neil, D.; Liu, S. Minitaur, an Event-Driven FPGA-Based Spiking Network Accelerator. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2014, 22, 2621–2628. [Google Scholar] [CrossRef]
Yin, Y.; Jing, L.; Huang, F.; Yang, G.; Wang, Z. MSA-GCN: Multiscale Adaptive Graph Convolution Network for gait emotion recognition. Pattern Recognit. 2024, 147, 110117. [Google Scholar] [CrossRef]
Yao, J.; Zhang, B.; Li, C.; Hong, D.; Chanussot, J. Extended Vision Transformer (ExViT) for Land Use and Land Cover Classification: A Multimodal Deep Learning Framework. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5514415. [Google Scholar] [CrossRef]
Ding, K.; Lu, T.; Fu, W.; Li, S.; Ma, F. Global–Local Transformer Network for HSI and LiDAR Data Joint Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5541213. [Google Scholar] [CrossRef]
Lu, T.; Ding, K.; Fu, W.; Li, S.; Guo, A. Coupled adversarial learning for fusion classification of hyperspectral and LiDAR data. Inf. Fusion 2023, 93, 118–131. [Google Scholar] [CrossRef]

Figure 1. Overall framework of FE-MCFN.

Figure 2. Fuzzy learning module (FLM) structure.

Figure 3. Fuzzy fusion module (FFM) structure.

Figure 4. Houston2013 dataset. (a) HSI false-color image, (b) LiDAR grayscale image, (c) ground-truth map.

Figure 5. Trento dataset. (a) HSI false-color image, (b) LiDAR grayscale image, (c) ground-truth map.

Figure 6. MUUFL dataset. (a) HSI false-color image, (b) LiDAR grayscale image, (c) ground-truth map.

Figure 7. Classification result diagrams obtained by different methods on the Houston2013 dataset. (a) Ground truth, (b) IF_CALC, (c) Coupled-CNNs, (d) ExViT, (e) MSA-GCN, (f) S3F2Net, (g) GLTNet, (h) DSHFNet, (i) MS2CANet, (j) proposed.

Figure 8. Classification result diagrams obtained by different methods on the Trento dataset. (a) Ground truth, (b) IF_CALC, (c) Coupled-CNNs, (d) ExViT, (e) MSA-GCN, (f) S3F2Net, (g) GLTNet, (h) DSHFNet, (i) MS2CANet, (j) proposed.

Figure 9. Classification result diagrams obtained by different methods on the MUUFL dataset. (a) Ground truth, (b) IF_CALC, (c) Coupled-CNNs, (d) ExViT, (e) MSA-GCN, (f) S3F2Net, (g) GLTNet, (h) DSHFNet, (i) MS2CANet, (j) proposed.

Figure 10. T-SNE visualization of features in three datasets. Houston2013: (a) S3F2Net, (b) IF_CALC, (c) MS2CANet, (d) proposed. MUUFL: (e) S3F2Net, (f) IF_CALC, (g) MS2CANet, (h) proposed. Trento: (i) S3F2Net, (j) IF_CALC, (k) MS2CANet, (l) proposed.

Figure 11. Impact of learning rate and batch size on performance: (a) Houston2013, (b) MUUFL, (c) Trento.

Figure 12. Impact of fuzzy cluster number on performance: (a) Houston2013, (b) MUUFL, (c) Trento.

Figure 13. Classification performance of all methods under different samples: (a) Houston2013, (b) MUUFL, (c) Trento.

Figure 14. Stability analysis: (a) Houston2013, (b) MUUFL, (c) Trento.

Table 1. Class distribution of training and testing samples on the Houston2013, MUUFL, and Trento datasets.

No.	Houston2013			MUUFL			Trento
No.	Class Name	Train	Test	Class Name	Train	Test	Class Name	Train	Test
C1	Healthy grass	20	1231	Trees	20	23,226	Apple trees	5	4029
C2	Stressed grass	20	1234	Mostly grass	20	4250	Buildings	5	2898
C3	Synthetic grass	20	677	Mixed ground surface	20	6862	Ground	5	474
C4	Trees	20	1224	Dirt and sand	20	1806	Woods	5	9118
C5	Soil	20	1222	Road	20	6667	Vineyard	5	10,496
C6	Water	20	305	Water	20	446	Roads	5	3169
C7	Residential	20	1248	Building shadow	20	2213
C8	Commercial	20	1224	Building	20	6220
C9	Road	20	1232	Sidewalk	20	1365
C10	Highway	20	1207	Yellow curb	20	163
C11	Railway	20	1215	Cloth panels	20	249
C12	Parking Lot 1	20	1213
C13	Parking Lot 2	20	449
C14	Tennis court	20	408
C15	Runway track	20	640
Total		300	14,729		220	53,467		30	30,184

Table 2. Classification performance (%) of different methods on the Houston2013 dataset. The best values are displayed in bold.

No.	IF_CALC [45]	Coupled-CNNs [41]	ExViT [43]	MSA-GCN [42]	S3F2Net [13]	GLTNet [44]	DSHFNet [28]	MS2CANet [40]	Proposed
C1	89.12	92.49	94.38	96.85	94.25	91.38	92.82	90.06	93.37
C2	96.96	93.36	95.37	89.11	93.56	96.43	96.76	95.85	97.03
C3	99.31	97.42	99.70	99.85	99.24	97.79	99.01	97.91	97.80
C4	93.23	97.36	93.65	98.41	94.48	95.83	95.97	97.76	98.70
C5	99.64	100.00	99.60	99.92	98.50	100.00	99.81	99.77	100.00
C6	85.64	85.66	97.02	98.14	86.43	86.89	86.89	87.16	86.55
C7	93.79	91.04	95.91	94.63	87.20	91.85	88.11	92.45	94.88
C8	88.33	78.06	86.10	81.86	72.99	89.03	78.95	80.82	89.39
C9	85.03	91.40	86.60	65.66	90.06	94.00	80.75	88.97	91.41
C10	93.14	91.21	92.16	93.67	96.65	91.93	93.17	97.16	93.49
C11	97.81	96.59	93.80	91.44	97.01	97.28	97.28	97.15	97.54
C12	91.72	93.02	90.71	88.46	89.81	90.83	89.84	93.75	89.88
C13	94.70	98.88	95.06	88.86	97.33	99.33	95.99	99.39	99.30
C14	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
C15	99.77	100.00	99.86	100.00	99.62	99.84	100.00	99.84	99.84
OA	93.69	93.45	94.11	91.34	92.58	94.44	92.14	94.31	95.21
AA	93.99	93.72	95.14	92.17	92.71	94.77	92.85	94.74	95.38
Kappa	93.19	92.99	93.67	90.63	92.16	93.99	91.60	93.87	95.21

Table 3. Classification performance (%) of different methods on the Trento dataset. The best values are displayed in bold.

No.	IF_CALC [45]	Coupled-CNNs [41]	ExViT [43]	MSA-GCN [42]	S3F2Net [13]	GLTNet [44]	DSHFNet [28]	MS2CANet [40]	Proposed
C1	96.79	97.47	99.21	90.42	98.75	99.29	98.56	99.26	99.30
C2	91.55	87.05	94.84	94.05	93.25	90.11	85.12	91.96	93.85
C3	80.00	91.33	77.59	92.85	97.23	85.33	83.44	94.38	95.32
C4	100.00	99.96	100.00	97.73	99.86	100.00	98.70	99.96	100.00
C5	100.00	100.00	100.00	94.27	99.97	100.00	98.62	99.40	99.62
C6	85.74	89.71	85.74	87.18	85.83	88.75	85.46	85.83	91.77
OA	97.08	97.19	97.65	94.23	97.62	96.82	96.40	97.28	98.28
AA	93.30	94.23	93.07	93.33	95.85	93.36	91.07	95.08	96.73
Kappa	96.64	96.25	96.88	93.92	96.80	96.25	94.86	96.42	98.28

Table 4. Classification performance (%) of different methods on the MUUFL dataset. The best values are displayed in bold.

No.	IF_CALC [45]	Coupled-CNNs [41]	ExViT [43]	MSA-GCN [42]	S3F2Net [13]	GLTNet [44]	DSHFNet [28]	MS2CANet [40]	Proposed
C1	88.51	89.53	86.75	83.66	90.95	87.71	80.75	89.23	90.25
C2	67.49	83.20	65.41	71.62	75.40	70.08	71.63	81.37	75.39
C3	72.71	58.25	67.25	64.56	73.53	75.00	72.86	66.49	76.35
C4	64.43	78.40	83.08	88.37	85.75	70.42	74.13	79.65	77.61
C5	74.57	74.66	77.22	71.40	80.86	80.65	86.63	83.75	77.41
C6	100.00	100.00	98.32	99.55	99.78	99.69	99.23	100.00	100.00
C7	80.77	78.73	80.63	85.36	84.84	83.87	87.83	85.61	90.17
C8	90.66	94.41	91.47	83.73	92.29	92.05	92.05	93.92	93.10
C9	32.48	54.84	61.74	42.56	45.76	41.75	53.47	57.49	52.10
C10	61.23	70.51	59.36	73.62	70.21	69.23	64.76	71.67	72.23
C11	97.49	98.64	99.53	99.60	98.84	99.59	99.19	97.37	97.19
OA	82.05	82.38	81.21	77.61	84.92	83.00	80.90	84.79	85.32
AA	76.31	80.42	79.23	76.22	81.91	80.30	77.33	82.51	84.81
Kappa	76.51	78.47	75.94	71.28	80.38	78.06	75.75	80.60	85.32

Table 5. Performance comparison on three datasets with base, fuzzy learning, and fuzzy fusion modules.

Datasets	Base	Fuzzy Learning	Fuzzy Fusion	OA	AA	Kappa
Houston2013	✓			94.31	94.74	93.87
	✓	✓		94.57	95.13	94.57
	✓		✓	94.73	95.11	94.73
	✓	✓	✓	95.21	95.38	95.21
MUUFL	✓			81.29	79.24	81.29
	✓	✓		84.68	81.92	84.68
	✓		✓	83.36	80.15	83.36
	✓	✓	✓	85.32	84.81	85.32
Trento	✓			97.32	94.81	97.32
	✓	✓		97.95	96.11	97.95
	✓		✓	97.76	95.11	97.76
	✓	✓	✓	98.28	96.73	98.28

Table 6. Efficiency analysis of different methods.

Method	Running Time (s)						Params (M)	Complexity (FLOPs)
	Houston2013		Trento		Muul
	Train	Test	Train	Test	Train	Test
IF_CALC	1091.29	7.41	1393.69	13.87	3441.23	34.38	0.95	16.70 G
Coupled-CNNs	71.11	4.18	72.44	3.02	150.49	9.20	0.41	2.26 G
ExViT	372.06	5.26	443.99	9.76	526.57	13.94	0.92	100.46 G
MSA-GCN	133.20	8.79	50.20	16.55	176.10	49.33	2.15	35.76 G
S3F2Net	983.91	3.93	315.20	2.94	831.38	12.48	1.11	191.26 G
GLTNet	1474.32	7.41	1626.81	13.87	1779.32	34.38	1.59	85.04 G
DSHFNet	1174.73	21.50	767.83	22.62	824.50	62.43	1.24	42.36 G
MS2CANet	87.44	3.33	140.46	3.63	258.36	8.73	2.75	7.25 G
Ours	1411.00	4.78	119.46	3.15	450.87	8.78	3.53	7.45 G

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wei, S.; Jia, M.; Duan, J. FE-MCFN: Fuzzy-Enhanced Multi-Scale Cross-Modal Fusion Network for Hyperspectral and LiDAR Joint Data Classification. Algorithms 2025, 18, 524. https://doi.org/10.3390/a18080524

AMA Style

Wei S, Jia M, Duan J. FE-MCFN: Fuzzy-Enhanced Multi-Scale Cross-Modal Fusion Network for Hyperspectral and LiDAR Joint Data Classification. Algorithms. 2025; 18(8):524. https://doi.org/10.3390/a18080524

Chicago/Turabian Style

Wei, Shuting, Mian Jia, and Junyi Duan. 2025. "FE-MCFN: Fuzzy-Enhanced Multi-Scale Cross-Modal Fusion Network for Hyperspectral and LiDAR Joint Data Classification" Algorithms 18, no. 8: 524. https://doi.org/10.3390/a18080524

APA Style

Wei, S., Jia, M., & Duan, J. (2025). FE-MCFN: Fuzzy-Enhanced Multi-Scale Cross-Modal Fusion Network for Hyperspectral and LiDAR Joint Data Classification. Algorithms, 18(8), 524. https://doi.org/10.3390/a18080524

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

FE-MCFN: Fuzzy-Enhanced Multi-Scale Cross-Modal Fusion Network for Hyperspectral and LiDAR Joint Data Classification

Abstract

1. Introduction

2. Methods

2.1. Overall Framework

2.2. Shallow Feature Extraction and Preliminary Fusion Stage

2.2.1. Shallow Feature Extraction

2.2.2. Fuzzy Learning Model

2.2.3. Cross-Modal Attention Module

2.3. Multi-Scale Deep Feature Extraction and Fusion Stage

2.3.1. FERM

2.3.2. Fuzzy Fusion Model

2.4. Classification Stage

3. Experiments

3.1. Dataset Description

3.2. Experimental Setup

3.3. Quantitative Results and Analysis

3.4. Ablation Analysis

3.5. Parameter Sensitivity Analysis

3.6. Fuzzy Set Quantity Impact Analysis

3.7. Performance Analysis of Different Training Samples

3.8. Model Stability Analysis

3.9. Comparison of Computational Efficiency

4. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI