1. Introduction
The advancement of remote sensing technology has enabled various sensors to capture surface features from complementary dimensions, making image classification pivotal across diverse applications [
1,
2,
3]. This technological progress has significantly enhanced the accuracy and efficiency of feature extraction from multi-source data [
4,
5,
6]. Consequently, these developments have expanded the applicability of remote sensing in various domains, including environmental monitoring, urban planning, and agricultural management [
7,
8,
9]. Key data sources include hyperspectral imagery (HSI), which provides rich spectral and spatial information for fine-grained classification [
10]; LiDAR and DSM, delivering accurate elevation and 3D structural information for terrain and flood modeling [
11]; and Synthetic Aperture Radar (SAR), offering all-weather, all-time observation capabilities for deformation monitoring and target recognition [
12]. Each sensor has unique strengths and inherent limitations. Multimodal data fusion therefore offers a promising pathway to achieve information complementarity, to construct more robust and discriminative feature representations [
13,
14], and to significantly enhance classification accuracy, thereby supporting scientific, governmental, and industrial applications with more reliable geospatial intelligence. The field of remote sensing image classification has evolved significantly from traditional shallow models to deep networks. While early methods like Support Vector Machine (SVM) [
15] and Random Forest (RF) [
16] primarily used spectral information, efforts to incorporate spatial context introduced feature engineering techniques such as Morphological Profiles [
17] and PCA [
18]. However, their reliance on handcrafted features and shallow architectures [
19] limits their ability to adaptively extract discriminative cross-modal features, especially in complex, heterogeneous scenarios.
To overcome these fundamental limitations, researchers have increasingly turned to deep learning. Through end-to-end training, deep learning models can automatically learn highly discriminative hierarchical feature representations from raw data, providing a new technical path for remote sensing data classification. The 1-D CNN proposed by Hu et al. was the first to apply convolutional networks to spectral-dimension feature extraction [
20]. Zhao achieved effective mining of spatial features through 2-D CNN [
21]. The 3-D CNN developed by Chen synchronously captures spatial-spectral features through three-dimensional convolutional kernels, significantly enhancing the joint characterization ability [
22]. Recurrent neural networks (RNN) [
23], with their recurrent connection structure, can effectively capture the long-range dependencies in spectral sequences, significantly enhancing the modeling ability of spectral features. Graph convolutional networks (GCN) [
24] represent samples and their relationships through nodes and edges and can effectively model semantic associations and spatial dependencies among ground objects. They are particularly suitable for handling irregular regions and global semantic reasoning.
Transformers have gained significant attention in HSI analysis due to their remarkable capacity for modeling long-range dependencies, with research evolving from spectral modeling to spatial–spectral and multimodal feature learning. Roy [
25] proposed a spectral–spatial morphological attention Transformer (MorphFormer) to improve the classification accuracy of hyperspectral images. Concurrently, Hong developed SpectralFormer, introducing a cross-layer transformer encoder that extracts group-wise spectral features from adjacent bands [
26]. Sun proposed the Spectral–Spatial Feature Tokenization Transformer (SSFTT), which models local spatial relationships through innovative tokenization of image patches [
27]. Xu et al. [
28] proposed a cross spatial–spectral dense Transformer network (CS2DT), enhancing the performance of hyperspectral classification. Jiang proposed the Graph Generative Structure-Aware Transformer (GraphGST), which dynamically learns graph topologies and injects structural priors into vision transformers for HSI classification [
29]. Although deep learning models have made significant progress in hyperspectral image classification, HSI alone still has inherent limitations: it is easily affected by illumination changes and shadow occlusion [
30,
31], which leads to challenges in classification accuracy and robustness in complex scenarios.
To break through this bottleneck, researchers have begun to explore the joint analysis of HSI and LiDAR data. LiDAR actively acquires the elevation of ground objects, effectively compensating for the lack of spatial geometric information in hyperspectral imagery. In recent years, deep learning has advanced rapidly in the joint classification of HSI and LiDAR data, driving continuous innovation in multimodal fusion. Many networks are dedicated to extracting local spatial and spectral features. Li et al. [
32] proposed CMFNet, a cross Mamba fusion network for hyperspectral and LiDAR data classification, which leverages a Mamba-based sequence modeling mechanism to facilitate efficient cross-modal feature interaction and long-range dependency modeling. He et al. [
33] proposed a lightweight fusion vision Mamba network, which achieves efficient collaborative classification of hyperspectral and LiDAR data through jump sampling and dual-path fusion. Wang [
34] proposed a spatial–spectral-structural feature fusion network (S3F2Net), which achieves multi-view feature collaborative classification of hyperspectral and LiDAR through a CNN-GCN hybrid architecture and dynamic node updates. Zhao et al. [
35] proposed a hybrid framework combining deep CNN and hierarchical random walk to optimize the preliminary classification results of CNN by utilizing multi-scale spatial context information of HSI and LiDAR data. Pan et al. [
36] proposed a multi-scale hierarchical cross-fusion network, which achieves spatial–spectral cross-fusion and classification optimization of hyperspectral and LiDAR data through multi-scale cascaded feature extraction and hierarchical fusion modules. Lu et al. [
37] proposed a cross-modal fusion network guided by assimilation modal mapping, improving the joint classification performance of hyperspectral and LiDAR data. Ni et al. [
38] proposed a coarse-fine high-order network, which achieved hierarchical feature enhancement and classification optimization of multi-source remote sensing data. Roy et al. [
39] proposed a cross hyperspectral and LiDAR attention Transformer, which achieves collaborative fusion of the spatial–spectral and elevation features of multi-source remote sensing data through cross-modal attention interaction and heterogeneous convolutional feature extraction. Jing et al. [
40] proposed a heterogeneous contrast image fusion network, which achieves contrast alignment and adaptive fusion of HSI and LiDAR through dual-flow image attention and dynamic structure learning. The global-local Transformer network proposed by Ding [
41] achieves efficient joint classification of HSI and LiDAR data by integrating the global modeling capability of Transformer and the local feature extraction advantage of CNN. Roy [
42] proposed a multimodal fusion Transformer that models cross-modal global dependencies through a self-attention mechanism, significantly enhancing the joint classification performance of HSI and LiDAR data. Xue [
43] proposed a depth-level vision Transformer that effectively integrates and classifies multi-level semantic information of HSI and LiDAR data through a hierarchical cross-scale attention mechanism. Qin [
44] proposed a spectral–spatial graph convolutional network and constructed an end-to-end semi-supervised classification framework by concatenating CNN with GCN. Zhao [
45] constructed a new hierarchical CNN and transformer network (HCT), and its cross-token attention encoder established a pixel-level spectral-elevation correlation model in the spatial dimension, significantly enhancing the long-range dependency modeling capability. Wang [
46] developed a multi-scale spatial-spectral cross-attention network (MS2CANet), which extracted multi-scale detail features through pyramid convolutional grouping and introduced a feature recalibration module to enhance key information. Wei et al. [
47] proposed adaptive frequency-domain sparse enhancement (AFDSE) for multimodal data fusion classification, which adaptively enhances discriminative frequency-domain sparse representations, effectively suppressing redundant information and improving robustness across heterogeneous data sources. Ni [
48] proposed a frequency-domain-based network framework (FDNet), which extracts multi-scale frequency features through discrete wavelet transform and captures global semantic information by combining the self-attention mechanism of fast Fourier transform.
Although deep-learning-based fusion of HSI and LiDAR data has made significant progress, the overall framework still faces a series of interrelated challenges. First, conventional PCA preprocessing, owing to its global operation, tends to dilute local spatial details, discards high-order statistical information, and struggles to provide physically interpretable multi-scale feature representations. Second, for HSI spectral feature extraction, CNNs face multiple structural constraints [
22]. Specifically, three-dimensional convolutions incur significant computational bottlenecks, and fixed-scale kernels adapt poorly to the drastic scale variations of ground objects in remote sensing scenes. Notably, limited by their local receptive fields, CNNs can neither establish long-range dependencies between pixels nor effectively express global topological associations. Although GCNs can compensate for this deficiency by constructing global context relationships [
49], traditional GCN architectures have clear shortcomings: graph structures built on simple spectral similarity often ignore local spatial details, and the neighborhood aggregation process lacks an effective feature-screening mechanism, limiting node representation capability. These problems prevent traditional GCN and CNN features from complementing each other effectively, which is the main obstacle to spectral feature extraction. Furthermore, for spatial feature extraction, the raw HSI lacks explicit position encoding, and the translation invariance of standard convolution further limits the model's ability to perceive the spatial distribution and geometric relationships of ground objects. For LiDAR elevation feature extraction, fixed-scale kernels struggle to adaptively capture cross-scale elevation features ranging from microscopic terrain undulations to macroscopic landform structures. Finally, existing multimodal attention mechanisms are often limited to unidirectional architectures, which trigger competition among heterogeneous features within a limited representation bandwidth and, owing to the one-way information flow, fail to achieve true semantic collaboration.
To address the challenges of feature extraction and cross-modal interaction in HSI and LiDAR data fusion, this paper proposes an integrated deep learning framework with a well-defined processing pipeline. The framework begins by employing Symlets wavelet transform for preprocessing the raw data, extracting physically meaningful hierarchical feature representations in the frequency domain. During the feature extraction stage, the designed Spectral Graph Mixer Block module (SGMB) utilizes a collaborative architecture combining multi-dimensional convolutional neural networks with an enhanced gated graph convolutional network to simultaneously extract joint spectral–spatial features from HSI while establishing long-range spatial dependencies. The Spatial Coordinate Block module (SCB) incorporates advanced positional encoding through coordinate convolution technology, explicitly embedding spatial coordinates into the feature representation to enhance the model’s awareness of object contours and distribution patterns. For LiDAR data, the Multi-scale Elevation Feature Extraction Module (MSFE) employs a multi-scale dilated convolution structure coupled with a feature enhancement module to achieve cross-scale feature capture. Finally, all features are integrated through a Bidirectional Frequency Attention Encoder (BiFAE) for cross-modal fusion; this module introduces a novel bidirectional attention mechanism that operates in the frequency domain, enabling efficient and deep interaction between multimodal features. This design allows the model to selectively enhance complementary information from hyperspectral and LiDAR data while suppressing redundant features, resulting in more discriminative fused representations. Together, these components form a complete processing chain from preprocessing to feature extraction and multimodal integration.
The contributions of this study are threefold, summarized as follows:
We design a dedicated Symlets wavelet transform module for multisource remote sensing data, which generates hierarchically organized features with inherent physical interpretability while systematically preserving the fine-grained spatial information often lost in conventional PCA-based preprocessing.
We propose a coordinated feature extraction architecture that systematically integrates the processing of hyperspectral and LiDAR data. For 3D HSI, the SGMB combines multi-dimensional CNNs with an improved gated graph convolutional network to extract local spectral–spatial features while establishing long-range spatial dependencies. For 2D HSI, the SCB employs coordinate convolution to explicitly encode spatial positions, overcoming the translation-invariance limitation of conventional convolutions. For LiDAR elevation data, the MSFE uses a multi-branch dilated convolutional architecture with feature enhancement mechanisms to achieve effective cross-scale terrain feature extraction. This architecture enables organic fusion and coordinated processing of heterogeneous remote sensing data while preserving the characteristics of each modality.
We propose the BiFAE to address the limitations of unidirectional fusion architectures. Its core innovation is a bidirectional cross-attention mechanism that enables interactive learning between parallel spectral branches through Fourier operations, achieving adaptive feature-distribution compensation across modalities.
The organization of this article is as follows. The proposed framework is described in
Section 2.
Section 3 provides the experimental datasets, parameter settings, and a comprehensive analysis of the classification results. The discussion is presented in
Section 4. Conclusions and potential future research directions are discussed in
Section 5.
2. Methods
The proposed framework for joint classification of HSI and LiDAR data is illustrated in
Figure 1. The framework comprises four stages: preprocessing of HSI and LiDAR data, spatial-spectral feature extraction from HSI, elevation feature extraction from LiDAR data, and the bidirectional frequency attention encoder.
2.1. Preprocessing of HSI and LiDAR Data
For the HSI and co-aligned LiDAR data, where H and W denote the spatial height and width dimensions, respectively, and C represents the number of spectral bands in the HSI, edge-pixel padding is applied to the input datasets, followed by extraction of local patches centered on every pixel position from each padded dataset. This process generates HSI sample cubes and LiDAR sample matrices, with the parameter p defining the spatial window size of the extracted patches.
In contrast to traditional Principal Component Analysis (PCA), which often dilutes local spatial details and lacks multi-scale representation, and to existing wavelet-based methods that use the transform primarily as a denoising or downsampling alternative [
46], we preprocess the data with a Discrete Wavelet Transform (DWT) using the Symlets 5 wavelet. This approach decomposes each p × p local patch into a set of frequency-domain subbands while preserving spatial-spectral integrity and physical interpretability. The choice of the Symlets family is driven by balanced properties that are particularly advantageous for multimodal remote sensing data. While retaining essential wavelet properties such as orthogonality and compact support for efficient computation, the near-symmetry of Symlets offers a superior phase response compared with asymmetric wavelets such as Daubechies; this characteristic is crucial for minimizing edge distortion and preserving spatial integrity in hyperspectral imagery. Symlets also overcome the limitations of simpler bases: they provide smoother waveforms than the discontinuous Haar wavelet, enabling better frequency localization without introducing artifacts, and they strike a more computationally favorable balance between vanishing moments and filter length than Coiflets. These features make Symlets well suited to extracting discriminative, multi-scale features from the complex spectral-spatial-elevation relationships in fused HSI and LiDAR data.
In the decomposition stage, HSI is processed through two branches: one branch applies a 2D wavelet transform individually to each spectral band, decomposing each into one low-frequency component and three high-frequency components, capturing approximate structures and detail features along horizontal, vertical, and diagonal directions per band. The other branch performs a 3D wavelet transform concurrently along two spatial dimensions and one spectral dimension, producing one 3D low-frequency component and seven 3D high-frequency components that effectively extract joint spectral–spatial frequency features across bands and space. The LiDAR data undergoes only a 2D wavelet transform, with each patch decomposed into one low-frequency component and three high-frequency components. All resulting components serve as discriminative inputs for downstream deep learning models.
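The band-wise 2D decomposition can be sketched in a few lines of numpy. The Haar filter pair below is a dependency-free stand-in: in practice the Symlets 5 analysis filters (e.g. from `pywt.Wavelet('sym5')`) would take its place, and the patch size and band count here are illustrative, not values from the paper.

```python
import numpy as np

# Orthonormal Haar analysis filters as a stand-in for the sym5 pair.
LO = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass (approximation)
HI = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass (detail)

def filter_downsample(x, f, axis):
    """Periodic convolution along `axis`, then keep every 2nd sample."""
    x = np.moveaxis(x, axis, 0)
    out = np.zeros_like(x, dtype=float)
    for k, c in enumerate(f):
        out += c * np.roll(x, -k, axis=0)   # out[n] += c_k * x[n + k]
    out = out[::2]                          # dyadic downsampling
    return np.moveaxis(out, 0, axis)

def dwt2(patch):
    """One-level 2D DWT of a (p, p) patch -> LL, LH, HL, HH subbands."""
    lo_r = filter_downsample(patch, LO, axis=0)
    hi_r = filter_downsample(patch, HI, axis=0)
    return (filter_downsample(lo_r, LO, axis=1),   # LL: approximation
            filter_downsample(lo_r, HI, axis=1),   # LH: horizontal detail
            filter_downsample(hi_r, LO, axis=1),   # HL: vertical detail
            filter_downsample(hi_r, HI, axis=1))   # HH: diagonal detail

# Band-wise decomposition of an HSI patch cube (p, p, C), as in the
# 2D branch described above (illustrative sizes).
patch = np.random.rand(8, 8, 4)
subbands = [dwt2(patch[:, :, b]) for b in range(patch.shape[2])]
```

With orthonormal filters and periodic boundaries the transform conserves energy, which is one way to sanity-check an implementation before swapping in the sym5 coefficients.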
2.2. Spatial-Spectral Feature Extraction of HSI
This module aims to comprehensively extract rich spectral–spatial information from HSI through a well-structured dual-branch processing framework. The system architecture mainly consists of two key components: SGMB focuses on the extraction and refinement of spectral features and SCB focuses on spatial relationship modeling and coordinate information preservation. High-frequency components are processed using the high-frequency feature learning (HFL) module, while low-frequency components undergo corresponding feature extraction. This integrated framework effectively captures detailed and structural information while maintaining computational efficiency.
In HSI classification, while CNNs effectively capture fine-grained spectral–spatial patterns through multi-scale convolutional kernels, their inherent limitation lies in the local nature of convolutional operations. This results in a limited receptive field, making it difficult to model long-range dependencies and global contextual relationships within the scene. GCNs, on the other hand, overcome this by representing pixels as nodes connected by edges based on spectral similarity, explicitly modeling long-range spatial relationships. However, GCNs may overlook the localized, fine-grained details that CNNs excel at extracting, as their graph construction relies on potentially oversimplified spectral-similarity metrics. To harness the strengths of both paradigms while mitigating their individual weaknesses, our SGMB adopts a parallel integration strategy: the GCN branch establishes meaningful global context relationships while the CNN branch preserves discriminative local features. The parallel fusion of these complementary features ensures a more robust and comprehensive representation for precise classification.
The low-frequency components of the preprocessed HSI spectral information
are simultaneously input into the CNN branch and the GCN branch for training. The CNN branch tackles three fundamental limitations in HSI classification: the computational intensity of 3D convolutions restricts model depth and efficiency, fixed-scale kernels struggle with the significant size variations among ground objects, and inherent channel redundancy degrades classification accuracy. This hybrid architecture begins with a shallow 3D CNN that processes the input spectral components with a 3 × 3 × 3 kernel (stride 1, padding 1) to capture intrinsic spatial-spectral correlations. The output features of the 3D CNN serve as input to a parallel 2D CNN, which employs three complementary convolution strategies to enhance feature representation. First, grouped convolution with a 3 × 3 kernel (stride 1, padding 1) significantly reduces computational complexity while retaining feature discriminability. Second, grouped convolution with a 5 × 5 kernel (stride 1, padding 2) expands the receptive field to capture wide-area context across multi-scale ground objects. Third, 1 × 1 pointwise convolution performs cross-channel feature reconstruction and dimension compression. The outputs of the parallel streams are processed through weighted summation, normalization, and ReLU activation to obtain
. The process can be formulated as follows:
where
represents the weighted summation.
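The motivation for the grouped and pointwise branches can be made concrete by counting convolution weights. The channel width and group count below are illustrative assumptions, not values from the paper.

```python
def conv2d_params(c_in, c_out, k, groups=1):
    """Weight count of a 2D convolution (bias omitted):
    each output channel sees only c_in / groups input channels."""
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k * k

c_in = c_out = 64                                       # illustrative width
standard_3x3 = conv2d_params(c_in, c_out, 3)            # dense baseline
grouped_3x3  = conv2d_params(c_in, c_out, 3, groups=8)  # branch 1
grouped_5x5  = conv2d_params(c_in, c_out, 5, groups=8)  # branch 2, wider RF
pointwise    = conv2d_params(c_in, c_out, 1)            # branch 3, 1x1

# Grouping by 8 cuts the 3x3 weight count by 8x relative to a dense conv,
# and even the 5x5 grouped branch stays well below the dense 3x3 cost.
print(standard_3x3, grouped_3x3, grouped_5x5, pointwise)
```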
It should be noted that the Squeeze-and-Excitation Network (SENet) dynamically adjusts the weight of each feature channel: by enhancing the response of important channels and suppressing redundant or noisy ones, it improves the representational power of the features. Concretely, we first apply average pooling to the input features along the spectral channel dimension, then learn channel weights through two fully connected layers with nonlinear activation, and finally multiply the learned weights channel-wise with the original feature map to obtain the features of the CNN branch
,
where
represents the Sigmoid function, L1 represents the dimension-reduction matrix, L2 represents the dimension-recovery matrix, and
represents element multiplication.
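The SENet recalibration described above (average pool, two fully connected layers, Sigmoid, channel-wise rescaling) can be sketched as follows; the shapes and the reduction ratio are illustrative assumptions.

```python
import numpy as np

def se_block(x, L1, L2):
    """Squeeze-and-Excitation over a (C, H, W) feature map:
    global average pool -> FC (reduce) + ReLU -> FC (recover) + Sigmoid,
    then channel-wise rescaling of the input."""
    z = x.mean(axis=(1, 2))                  # squeeze: (C,)
    s = np.maximum(z @ L1, 0.0)              # dimension reduction + ReLU
    w = 1.0 / (1.0 + np.exp(-(s @ L2)))      # dimension recovery + Sigmoid
    return x * w[:, None, None]              # excitation: rescale channels

rng = np.random.default_rng(0)
C, r = 8, 2                                  # channels, reduction ratio
x = rng.standard_normal((C, 5, 5))
L1 = rng.standard_normal((C, C // r)) * 0.1  # dimension-reduction matrix
L2 = rng.standard_normal((C // r, C)) * 0.1  # dimension-recovery matrix
y = se_block(x, L1, L2)
```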
Simultaneously, the low-frequency components
serve as input to the GCN module. The input feature tensors are efficiently converted into graph-structured data. The node feature matrix
XG is composed of the concatenation of normalized coordinate features and bilinear interpolation spectral information. The sparse edge matrix
E constructs spatial relationships through four-neighborhood connections. The
XG and
E are used as input to the two-level gated GCN layer, whose detailed architecture is depicted in
Figure 2.
Each layer maps the node feature to the target dimension
H through a linear transformation and generates feature selection weights
G using independent linear transformations and Sigmoid activation functions, thereby overcoming the limitation of the original GCN layer, which treats all features equally. These two mappings can be written as H = KXG + b and G = Sigmoid(VXG + c),
where
K is the learnable weight matrix,
V is the gated weight matrix, and
b and
c are bias terms. Subsequently, based on the connection relationship in the edge matrix
E, the features of the source node (
src) and the target node (
dst) are extracted and encoded to generate edge feature vectors
M with semantic information, which can be written as M = U[Xsrc ∥ Xdst] + d;
where
U is the edge transformation matrix,
d is the bias term. The edge feature vectors
M are weighted by the corresponding gating values and aggregated through residual connections, yielding node features
Y enriched with topological information, thereby breaking through the constraint of the original GCN layer, which relies only on the adjacency matrix for simple aggregation.
The node feature
Y after gridification is extracted by the post-processing module to obtain the features
of the GCN branches.
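A minimal numpy sketch of one gated GCN layer, consistent with the symbols K, V, U, b, c, and d defined above; the dimensions and the exact form of the edge encoding are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gated_gcn_layer(X, edges, K, b, V, c, U, d):
    """One gated GCN layer. X: (N, F) node features; edges: (src, dst)
    pairs from the four-neighbourhood graph. K/b project nodes to the
    target dimension, V/c produce feature-selection gates, U/d encode
    edge features from the concatenated endpoint projections."""
    H = X @ K + b                        # node projection
    G = sigmoid(X @ V + c)               # per-node gates in (0, 1)
    Y = H.copy()                         # residual connection
    for src, dst in edges:
        M = np.concatenate([H[src], H[dst]]) @ U + d   # edge feature
        Y[dst] += G[dst] * M             # gate-weighted aggregation
    return Y

rng = np.random.default_rng(1)
N, F, D = 4, 6, 5                        # nodes, input dim, target dim
X = rng.standard_normal((N, F))
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]  # chain graph
K = rng.standard_normal((F, D)); b = np.zeros(D)
V = rng.standard_normal((F, D)); c = np.zeros(D)
U = rng.standard_normal((2 * D, D)); d = np.zeros(D)
Y = gated_gcn_layer(X, edges, K, b, V, c, U, d)
```

The gate G multiplies each incoming edge message before aggregation, which is what distinguishes this layer from plain adjacency-based averaging.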
The low-frequency spectral features
are obtained by weighted summation of
and
. The
and spectral high-frequency features
are channel-wise concatenated and fused via a 1 × 1 convolutional layer for cross-channel integration and dimensionality reduction, outputting the final HSI spectral features
. The specific values of SGMB module parameters are shown in
Table 1.
Although the SGMB described above effectively extracts joint spectral–spatial features, two key challenges remain. First, the raw 3D hyperspectral data has an inherent flaw: it lacks explicit spatial position encoding, which limits the model's understanding of spatial relationships among ground objects. Second, traditional convolution is translation invariant by construction and cannot perceive absolute position, making it difficult to model precise spatial geometric relationships between pixels. Spatial location information plays a decisive role in precisely locating ground-object targets, recognizing boundaries, and understanding spatial context. The SCB module addresses these limitations. CoordConv concatenates standardized height and width coordinate maps with the original hyperspectral image along the channel dimension, injecting absolute-position prior knowledge into the raw data to compensate for its missing spatial encoding. This mechanism also alters the computation of traditional convolution, enabling the kernel to perceive the absolute spatial position of each pixel and fundamentally breaking through the translation-invariance limitation of conventional convolution.
As shown in
Figure 3, diagrams (a) and (b) illustrate a structural comparison between the traditional convolutional layer and the CoordConv layer, respectively. The traditional convolutional layer directly maps input features to output features, whereas the CoordConv layer additionally introduces coordinate information on top of the input data. By concatenating these coordinate maps with the original data along the channel dimension, the convolution kernel is enabled to perceive the absolute spatial position of each pixel, thereby enhancing the model’s ability to capture spatial-geometric relationships. Specifically, the preprocessed low-frequency components
of HSI spatial information are first augmented through channel-wise concatenation with normalized height-axis
and width-axis
coordinate maps, forming enhanced features
Fcoord. The
Fcoord then undergoes feature transformation through successive convolutional layers. The first stage applies CoordConv with a 1 × 1 kernel (stride 1, padding 0), followed by batch normalization and ReLU activation, extracting position-sensitive features while preserving spatial resolution.
The second stage further refines the features through the same structure to obtain the transformed features. The transformed features are processed by the spatial attention mechanism to obtain low-frequency spatial features
and high-frequency spatial features
, which are concatenated and then pass through a 1 × 1 convolutional layer to obtain the HSI spatial features
.
Then, the feature
FH can be represented as,
where ⊕ represents element-wise summation operation.
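The coordinate augmentation at the heart of CoordConv reduces to concatenating two normalized coordinate maps along the channel axis, which can be sketched as follows (the [-1, 1] normalization range is a common convention, assumed here rather than stated in the paper):

```python
import numpy as np

def add_coord_channels(x):
    """Concatenate normalized height- and width-axis coordinate maps to a
    (C, H, W) feature tensor: the two extra channels let every subsequent
    convolution kernel observe the absolute position of each pixel."""
    C, H, W = x.shape
    ys = np.linspace(-1.0, 1.0, H)[:, None].repeat(W, axis=1)  # height axis
    xs = np.linspace(-1.0, 1.0, W)[None, :].repeat(H, axis=0)  # width axis
    return np.concatenate([x, ys[None], xs[None]], axis=0)     # (C+2, H, W)

x = np.zeros((3, 4, 5))        # illustrative (C, H, W) feature map
f = add_coord_channels(x)      # two coordinate channels appended
```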
2.3. Elevation Feature Extraction of LiDAR
When dealing with LiDAR data, CNN-based methods usually stack standard convolutions directly, but this approach has clear limitations. First, the inherent sampling non-uniformity of LiDAR point clouds makes it difficult for fixed-stride convolution to stably capture representative local patterns. Second, ground-object scales vary greatly, from fine surface textures to vast terrain undulations, all of which demand multi-scale perception; traditional single-scale kernels, limited by their fixed receptive fields, cannot capture these markedly different features simultaneously. We therefore design the Multi-Scale Elevation Feature Extraction (MSFE) module to extract elevation features. In the basic feature-extraction stage, rather than simply increasing network depth, the low-frequency components of the LiDAR data pass through two 3 × 3 convolutional layers (stride 1, padding 1), each followed by batch normalization and ReLU activation. The first layer robustly extracts basic terrain features from the sparse, uneven LiDAR data; the second deepens the representation, enhancing the network's ability to express complex terrain structures and laying a more robust foundation for subsequent multi-scale analysis.
The output features
are passed through a multi-branch dilated convolutional layer, which consists of three dilated convolutional layers with the same convolutional kernel size 3 × 3 but different dilation rates (
d = 1, 2, 3) in parallel. Dilated convolution controls the sampling interval of the kernel through the dilation rate, effectively expanding the receptive field without increasing the number of parameters. The effective kernel size is computed as k_eff = k + (k − 1)(d − 1),
where
k is the convolutional kernel size,
d is the dilation rate. The effective receptive fields of the three branches target different scales: the 3 × 3 receptive field captures microscopic features such as surface texture; the equivalent 5 × 5 receptive field extracts medium-scale features such as building outlines; and the equivalent 7 × 7 receptive field models large-scale spatial patterns such as terrain undulations. Learnable weights are introduced to fuse the multi-scale features dynamically. The process is as follows,
where
is a dilated convolution with a dilation rate of
k,
is a learnable weight. The result of the residual connection between
Fp and
Fm is input to the feature enhancement module, whose structure is shown in
Figure 4.
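Before the enhancement step, the multi-branch dilated fusion above can be sketched as follows. The kernel is shared across branches purely to keep the example short, and the fusion weights are fixed rather than learned; both are simplifying assumptions.

```python
import numpy as np

def dilated_conv2d(x, w, d):
    """'Same'-padded 2D convolution of an (H, W) map with a (k, k) kernel
    whose taps are spaced d pixels apart (zero padding at the border)."""
    k = w.shape[0]
    k_eff = k + (k - 1) * (d - 1)          # effective kernel size
    pad = k_eff // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):                  # shift-and-add over kernel taps
            out += w[i, j] * xp[i * d : i * d + x.shape[0],
                                j * d : j * d + x.shape[1]]
    return out

def msfe_multibranch(x, w, alphas):
    """Weighted fusion of three parallel dilated branches (d = 1, 2, 3)."""
    return sum(a * dilated_conv2d(x, w, d)
               for a, d in zip(alphas, (1, 2, 3)))

x = np.random.rand(8, 8)                    # illustrative elevation patch
w = np.ones((3, 3)) / 9.0                   # shared 3x3 averaging kernel
y = msfe_multibranch(x, w, alphas=(0.5, 0.3, 0.2))
```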
The module first performs efficient feature transformation through an improved depthwise separable convolution, consisting of a grouped convolution with a 3 × 3 kernel and a pointwise convolution with a 1 × 1 kernel;
then
is modulated in parallel by dual-mode features. In the channel dimension, dynamic recalibration through a squeeze-and-excitation mechanism generates channel weights
. In the spatial dimension, a 7 × 7 large-receptive-field convolution captures contextual relationships and outputs spatial weights
. The two weights are combined through an outer-product operation to achieve dual-mode synergy, and the enhanced elevation features are output in residual form; the process is as follows,
where
represents the outer product and
represents the inner product. The high-frequency components also pass through the HFL module to obtain high-frequency features
. The
and
are concatenated along the channel dimension, and a grouped CNN with a 3 × 3 kernel extracts local spatial features within each channel group. A 1 × 1 pointwise convolution then reorganizes the independent features output by the grouped convolution, establishes cross-channel associations, and outputs the LiDAR elevation features
.
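The dual-mode modulation at the core of the enhancement module, an outer product of channel and spatial weights applied with a residual path, can be sketched as follows; the weights are given directly here rather than produced by the squeeze-and-excitation and 7 × 7 convolution branches.

```python
import numpy as np

def dual_mode_enhance(x, wc, ws):
    """Modulate a (C, H, W) feature map by the outer product of channel
    weights wc (C,) and spatial weights ws (H, W), with a residual path,
    mirroring the enhancement module described above."""
    joint = wc[:, None, None] * ws[None, :, :]   # outer product -> (C, H, W)
    return x + x * joint                          # residual output

C, H, W = 4, 3, 3
x = np.ones((C, H, W))                 # illustrative constant feature map
wc = np.array([0.0, 0.5, 1.0, 0.25])   # per-channel weights
ws = np.full((H, W), 1.0)              # uniform spatial weights
y = dual_mode_enhance(x, wc, ws)
```

A zero channel weight leaves a channel untouched (residual path only), while a unit weight doubles its activation, which makes the gating behavior easy to verify.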
2.4. Bidirectional Frequency Attention Encoder
The modeling capacity of existing attention-based multimodal fusion methods is often intrinsically constrained by their unidirectional architecture. The single-branch, serial processing design not only leads to competition between complementary features within a limited representational bandwidth but also, and more importantly, results in a rigid information flow that is fundamentally incapable of resolving inter-modal conflicts. This bottleneck severely undermines the robustness and effectiveness of fusion, especially under conditions of significant feature distribution asymmetry or when certain sensor modalities are unreliable or missing.
Therefore, we design the BiFAE through an innovative bidirectional cross-attention mechanism in the frequency domain. A key distinction from existing frequency-aware fusion methods lies in the fusion mechanism: whereas existing methods employ a unidirectional feature aggregation process, our BiFAE introduces a bidirectional interactive attention mechanism in the frequency domain, enabling mutual adaptive enhancement between modalities rather than mere feature concatenation or weighting. The multimodal fusion feature
Ffusion fuses with position embedding and frequency embedding to form F
p as the input of this encoder. The workflow of the encoder is illustrated in
Figure 5. The input feature F
p is first normalized. The normalized feature is then fed into two parallel paths. Each path utilizes a series of deformable convolutional (DeformConv) layers to generate its own set of Query (
Q), Key (
K), and Value (
V) feature tensors. Specifically, Path 1 produces
Q1,
K1, and
V1, while Path 2 produces
Q2,
K2, and
V2,
where
is the weight coefficient and can be manually adjusted.
At the core of BiFAE is a dual-path processing system, with two parallel branches maintaining continuous bidirectional interaction through carefully designed frequency domain operations. Branch 1 focuses on high-frequency details and branch 2 preserves low-frequency context. The innovation of this design is reflected in the interaction mode between branches. Through the bidirectional attention mechanism, the query features of branch 1 interact with the key-value features of branch 2, and at the same time, the query features of branch 2 also interact with the key-value features of branch 1. The specific process is as follows:
where
Q,
K, and
V are query, key, and value, respectively,
is Fourier transform,
is inverse Fourier transform, and
is an attention-based gating mechanism that can automatically detect and compensate for the differences in feature distribution. The final output feature
Fout is obtained through the residual network.
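A simplified sketch of frequency-domain bidirectional cross-attention is given below. Identity Q/K/V projections, real-part attention scores, and the omission of the gating mechanism are simplifying assumptions, so this illustrates the information flow of BiFAE rather than reproducing it.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def freq_cross_attention(Q, K, V):
    """Cross-attention in the frequency domain: tokens are mapped with an
    FFT along the feature axis, scores come from the real part of the
    spectral inner products, and attended values are mapped back with the
    inverse FFT."""
    Fq, Fk, Fv = (np.fft.fft(t, axis=-1) for t in (Q, K, V))
    scores = np.real(Fq @ Fk.conj().T) / np.sqrt(Q.shape[-1])
    out = softmax(scores) @ Fv               # attend in the spectral domain
    return np.real(np.fft.ifft(out, axis=-1))

def bifae_step(F1, F2):
    """Bidirectional interaction: branch 1 queries branch 2's keys/values
    and vice versa, each with a residual connection."""
    return (F1 + freq_cross_attention(F1, F2, F2),   # path 1
            F2 + freq_cross_attention(F2, F1, F1))   # path 2

F1 = np.random.rand(6, 8)   # 6 tokens, 8-dim features (branch 1)
F2 = np.random.rand(6, 8)   # branch 2
O1, O2 = bifae_step(F1, F2)
```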
For the output feature Fout, we construct a lightweight CNN-based classification module that transforms and compresses this feature and outputs a probability distribution vector whose dimension equals the number of categories. The category index with the maximum probability is the land-cover type predicted by the model.