1. Introduction
The advancement of remote sensing technology has enabled various sensors to capture surface features from complementary dimensions, making image classification pivotal across diverse applications [
1,
2,
3]. This technological progress has significantly enhanced the accuracy and efficiency of feature extraction from multi-source data [
4,
5,
6]. Consequently, these developments have expanded the applicability of remote sensing in various domains, including environmental monitoring, urban planning, and agricultural management [
7,
8,
9]. Key data sources include hyperspectral imagery (HSI), which provides rich spectral and spatial information for fine-grained classification [
10]; LiDAR and DSM, delivering accurate elevation and 3D structural information for terrain and flood modeling [
11]; and Synthetic Aperture Radar (SAR), offering all-weather, all-time observation capabilities for deformation monitoring and target recognition [
12]. Each sensor has unique strengths and inherent limitations. Multimodal data fusion therefore offers a promising pathway to achieve information complementarity, to construct more robust and discriminative feature representations [
13,
14], and to significantly enhance classification accuracy, thereby supporting scientific, governmental, and industrial applications with more reliable geospatial intelligence. The field of remote sensing image classification has evolved significantly from traditional shallow models to deep networks. While early methods like Support Vector Machine (SVM) [
15] and Random Forest (RF) [
16] primarily used spectral information, efforts to incorporate spatial context introduced feature engineering techniques such as Morphological Profiles [
17] and PCA [
18]. However, their reliance on handcrafted features and shallow architectures [
19] limits their ability to adaptively extract discriminative cross-modal features, especially in complex, heterogeneous scenarios.
To overcome these fundamental limitations, researchers have increasingly turned to deep learning. Through end-to-end training, deep learning models can automatically learn highly discriminative hierarchical feature representations from raw data, providing a new technical path for remote sensing data classification. The 1-D CNN proposed by Hu et al. was the first to apply convolutional networks to spectral-dimension feature extraction [
20]. Zhao achieved effective mining of spatial features through 2-D CNN [
21]. The 3-D CNN developed by Chen synchronously captures spatial-spectral features through three-dimensional convolutional kernels, significantly enhancing the joint characterization ability [
22]. Recurrent neural networks (RNN) [
23], with their recurrent connection structure, can effectively capture the long-range dependencies in spectral sequences, significantly enhancing the modeling ability of spectral features. Graph convolutional networks (GCN) [
24] represent samples and their relationships through nodes and edges and can effectively model semantic associations and spatial dependencies among ground objects. They are particularly suitable for handling irregular regions and global semantic reasoning.
Transformers have gained significant attention in HSI analysis due to their remarkable capacity for modeling long-range dependencies, with research evolving from spectral modeling to spatial–spectral and multimodal feature learning. Roy [
25] proposed a spectral–spatial morphological attention Transformer (MorphFormer) to improve the classification accuracy of hyperspectral images. Concurrently, Hong developed SpectralFormer, introducing a cross-layer transformer encoder that extracts group-wise spectral features from adjacent bands [
26]. Sun proposed the Spectral–Spatial Feature Tokenization Transformer (SSFTT), which models local spatial relationships through innovative tokenization of image patches [
27]. Xu et al. [
28] proposed a cross spatial–spectral dense Transformer network (CS2DT), enhancing the performance of hyperspectral classification. Jiang proposed the Graph Generative Structure-Aware Transformer (GraphGST), which dynamically learns graph topologies and injects structural priors into vision transformers for HSI classification [
29]. Although deep learning models have made significant progress in hyperspectral image classification, HSI alone still has inherent limitations: it is easily affected by illumination changes and shadow occlusion [
30,
31], which leads to challenges in classification accuracy and robustness in complex scenarios.
To break through this bottleneck, researchers have begun to explore the joint analysis of HSI and LiDAR data. LiDAR actively acquires the elevation of ground objects, effectively compensating for the lack of spatial geometric information in hyperspectral imagery. In recent years, deep learning has advanced rapidly in the joint classification of HSI and LiDAR data, driving continuous innovation in multimodal fusion. Many networks are dedicated to extracting local spatial and spectral features. Li et al. [
32] proposed CMFNet, a cross Mamba fusion network for hyperspectral and LiDAR data classification, which leverages a Mamba-based sequence modeling mechanism to facilitate efficient cross-modal feature interaction and long-range dependency modeling. He et al. [
33] proposed a lightweight fusion vision Mamba network, which achieves efficient collaborative classification of hyperspectral and LiDAR data through jump sampling and dual-path fusion. Wang [
34] proposed a spatial–spectral-structural feature fusion network (S3F2Net), which achieves multi-view feature collaborative classification of hyperspectral and LiDAR through a CNN-GCN hybrid architecture and dynamic node updates. Zhao et al. [
35] proposed a hybrid framework combining deep CNN and hierarchical random walk to optimize the preliminary classification results of CNN by utilizing multi-scale spatial context information of HSI and LiDAR data. Pan et al. [
36] proposed a multi-scale hierarchical cross-fusion network, which achieves spatial–spectral cross-fusion and classification optimization of hyperspectral and LiDAR data through multi-scale cascaded feature extraction and hierarchical fusion modules. Lu et al. [
37] proposed a cross-modal fusion network guided by assimilation modal mapping, improving the joint classification performance of hyperspectral and LiDAR data. Ni et al. [
38] proposed a coarse-fine high-order network, which achieved hierarchical feature enhancement and classification optimization of multi-source remote sensing data. Roy et al. [
39] proposed a cross hyperspectral and LiDAR attention Transformer, which achieves collaborative fusion of the spatial–spectral and elevation features of multi-source remote sensing data through cross-modal attention interaction and heterogeneous convolutional feature extraction. Jing et al. [
40] proposed a heterogeneous contrast image fusion network, which achieves contrast alignment and adaptive fusion of HSI and LiDAR through dual-flow image attention and dynamic structure learning. The global-local Transformer network proposed by Ding [
41] achieves efficient joint classification of HSI and LiDAR data by integrating the global modeling capability of Transformer and the local feature extraction advantage of CNN. Roy [
42] proposed a multimodal fusion Transformer that models cross-modal global dependencies through a self-attention mechanism, significantly enhancing the joint classification performance of HSI and LiDAR data. Xue [
43] proposed a depth-level vision Transformer that effectively integrates and classifies multi-level semantic information of HSI and LiDAR data through a hierarchical cross-scale attention mechanism. Qin [
44] proposed a spectral–spatial graph convolutional network and constructed an end-to-end semi-supervised classification framework by concatenating CNN with GCN. Zhao [
45] constructed a new hierarchical CNN and transformer network (HCT), and its cross-token attention encoder established a pixel-level spectral-elevation correlation model in the spatial dimension, significantly enhancing the long-range dependency modeling capability. Wang [
46] developed a multi-scale spatial-spectral cross-attention network (MS2CANet), which extracted multi-scale detail features through pyramid convolutional grouping and introduced a feature recalibration module to enhance key information. Wei et al. [
47] proposed adaptive frequency-domain sparse enhancement (AFDSE) for multimodal data fusion classification, which adaptively enhances discriminative frequency-domain sparse representations, effectively suppressing redundant information and improving robustness across heterogeneous data sources. Ni [
48] proposed a frequency-domain-based network framework (FDNet), which extracts multi-scale frequency features through discrete wavelet transform and captures global semantic information by combining the self-attention mechanism of fast Fourier transform.
Although deep-learning-based fusion of HSI and LiDAR data has made significant progress, the overall framework still faces a series of interrelated challenges. First, conventional PCA preprocessing, owing to its global operation, tends to dilute local spatial details, discards high-order statistical information, and struggles to provide physically interpretable multi-scale feature representations. Second, for HSI spectral feature extraction, CNNs face multiple structural constraints [
22]. Specifically, three-dimensional convolutions incur significant computational bottlenecks, and fixed-scale kernels adapt poorly to the drastic scale variations of ground objects in remote sensing scenes. Notably, limited by their local receptive fields, CNNs can neither establish long-range dependencies between pixels nor effectively express global topological associations. Although GCNs can compensate for this deficiency by constructing global context relationships [
49], traditional GCN architectures have clear shortcomings: graph structures built on simple spectral similarity often ignore local spatial details, and the neighborhood aggregation process lacks an effective feature-screening mechanism, limiting node representation capability. These problems prevent traditional GCN and CNN features from complementing each other effectively, which is the main obstacle to spectral feature extraction. Furthermore, for spatial feature extraction, the raw HSI lacks explicit position encoding, and the translation invariance of standard convolution further limits the model's ability to perceive the spatial distribution and geometric relationships of ground objects. For LiDAR elevation feature extraction, fixed-scale kernels struggle to adaptively capture cross-scale elevation features ranging from microscopic terrain undulations to macroscopic landform structures. Finally, existing multimodal attention mechanisms are often limited to unidirectional architectures, which trigger competition among heterogeneous features within a limited representation bandwidth and, owing to the one-way information flow, fail to achieve true semantic collaboration.
To address the challenges of feature extraction and cross-modal interaction in HSI and LiDAR data fusion, this paper proposes an integrated deep learning framework with a well-defined processing pipeline. The framework begins by employing Symlets wavelet transform for preprocessing the raw data, extracting physically meaningful hierarchical feature representations in the frequency domain. During the feature extraction stage, the designed Spectral Graph Mixer Block module (SGMB) utilizes a collaborative architecture combining multi-dimensional convolutional neural networks with an enhanced gated graph convolutional network to simultaneously extract joint spectral–spatial features from HSI while establishing long-range spatial dependencies. The Spatial Coordinate Block module (SCB) incorporates advanced positional encoding through coordinate convolution technology, explicitly embedding spatial coordinates into the feature representation to enhance the model’s awareness of object contours and distribution patterns. For LiDAR data, the Multi-scale Elevation Feature Extraction Module (MSFE) employs a multi-scale dilated convolution structure coupled with a feature enhancement module to achieve cross-scale feature capture. Finally, all features are integrated through a Bidirectional Frequency Attention Encoder (BiFAE) for cross-modal fusion; this module introduces a novel bidirectional attention mechanism that operates in the frequency domain, enabling efficient and deep interaction between multimodal features. This design allows the model to selectively enhance complementary information from hyperspectral and LiDAR data while suppressing redundant features, resulting in more discriminative fused representations. Together, these components form a complete processing chain from preprocessing to feature extraction and multimodal integration.
The contributions of this study are threefold, summarized as follows:
We design a dedicated Symlets wavelet transform module for multisource remote sensing data, which generates hierarchically organized features with inherent physical interpretability while systematically preserving the fine-grained spatial information often lost in conventional PCA-based preprocessing.
We propose a coordinated feature extraction architecture that systematically integrates the processing of hyperspectral and LiDAR data. For 3D HSI, the SGMB combines multi-dimensional CNNs with an improved gated graph convolutional network to extract local spectral–spatial features while establishing long-range spatial dependencies. For 2D HSI, the SCB employs coordinate convolution to explicitly encode spatial positions, overcoming the translation-invariance limitation of conventional convolutions. For LiDAR elevation data, the MSFE uses a multi-branch dilated convolutional architecture with feature enhancement mechanisms to achieve effective cross-scale terrain feature extraction. This architecture enables organic fusion and coordinated processing of heterogeneous remote sensing data while preserving the characteristics of each modality.
We propose the BiFAE to address the limitations of unidirectional fusion architectures. Its core innovation is a bidirectional cross-attention mechanism that enables interactive learning between parallel spectral branches through Fourier operations, achieving adaptive feature-distribution compensation across modalities.
The organization of this article is as follows. The proposed framework is described in
Section 2.
Section 3 provides the experimental datasets, parameter settings, and a comprehensive analysis of the classification results. The discussion is presented in
Section 4. Conclusions and potential future research directions are discussed in
Section 5.
2. Methods
The proposed framework for joint classification of HSI and LiDAR data is illustrated in
Figure 1. The framework comprises four stages: preprocessing of HSI and LiDAR data, spatial-spectral feature extraction from HSI, elevation feature extraction from LiDAR data, and the bidirectional frequency attention encoder.
2.1. Preprocessing of HSI and LiDAR Data
For the HSI and co-aligned LiDAR data, where H and W denote the spatial height and width dimensions, respectively, and C represents the number of spectral bands in the HSI, edge-pixel padding is applied to the input datasets, followed by extraction of local patches centered on every pixel position from each padded dataset. This process generates HSI sample cubes and LiDAR sample matrices, with the parameter p defining the spatial window size of the extracted patches.
In contrast to traditional Principal Component Analysis (PCA), which often dilutes local spatial details and lacks multi-scale representation, and to existing wavelet-based methods that use the transform primarily as a denoising or downsampling alternative [
46], we preprocess the data with a Discrete Wavelet Transform (DWT) using the Symlets 5 wavelet. This approach decomposes each p × p local patch into a set of frequency-domain subbands while preserving spatial-spectral integrity and physical interpretability. The choice of the Symlets family is driven by balanced properties that are particularly advantageous for multimodal remote sensing data. While retaining essential wavelet properties such as orthogonality and compact support for efficient computation, the near-symmetry of Symlets offers a superior phase response compared with asymmetric wavelets such as Daubechies; this characteristic is crucial for minimizing edge distortion and preserving spatial integrity in hyperspectral imagery. Symlets also overcome the limitations of simpler bases: they provide smoother waveforms than the discontinuous Haar wavelet, enabling better frequency localization without introducing artifacts, and they strike a more computationally favorable balance between vanishing moments and filter length than Coiflets. These features make Symlets well suited to extracting discriminative, multi-scale features from the complex spectral-spatial-elevation relationships in fused HSI and LiDAR data.
In the decomposition stage, HSI is processed through two branches: one branch applies a 2D wavelet transform individually to each spectral band, decomposing each into one low-frequency component and three high-frequency components, capturing approximate structures and detail features along horizontal, vertical, and diagonal directions per band. The other branch performs a 3D wavelet transform concurrently along two spatial dimensions and one spectral dimension, producing one 3D low-frequency component and seven 3D high-frequency components that effectively extract joint spectral–spatial frequency features across bands and space. The LiDAR data undergoes only a 2D wavelet transform, with each patch decomposed into one low-frequency component and three high-frequency components. All resulting components serve as discriminative inputs for downstream deep learning models.
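The band-wise 2D decomposition can be sketched in a few lines of numpy. The Haar filter pair below is a dependency-free stand-in: in practice the Symlets 5 analysis filters (e.g. from `pywt.Wavelet('sym5')`) would take its place, and the patch size and band count here are illustrative, not values from the paper.

```python
import numpy as np

# Orthonormal Haar analysis filters as a stand-in for the sym5 pair.
LO = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass (approximation)
HI = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass (detail)

def filter_downsample(x, f, axis):
    """Periodic convolution along `axis`, then keep every 2nd sample."""
    x = np.moveaxis(x, axis, 0)
    out = np.zeros_like(x, dtype=float)
    for k, c in enumerate(f):
        out += c * np.roll(x, -k, axis=0)   # out[n] += c_k * x[n + k]
    out = out[::2]                          # dyadic downsampling
    return np.moveaxis(out, 0, axis)

def dwt2(patch):
    """One-level 2D DWT of a (p, p) patch -> LL, LH, HL, HH subbands."""
    lo_r = filter_downsample(patch, LO, axis=0)
    hi_r = filter_downsample(patch, HI, axis=0)
    return (filter_downsample(lo_r, LO, axis=1),   # LL: approximation
            filter_downsample(lo_r, HI, axis=1),   # LH: horizontal detail
            filter_downsample(hi_r, LO, axis=1),   # HL: vertical detail
            filter_downsample(hi_r, HI, axis=1))   # HH: diagonal detail

# Band-wise decomposition of an HSI patch cube (p, p, C), as in the
# 2D branch described above (illustrative sizes).
patch = np.random.rand(8, 8, 4)
subbands = [dwt2(patch[:, :, b]) for b in range(patch.shape[2])]
```

With orthonormal filters and periodic boundaries the transform conserves energy, which is one way to sanity-check an implementation before swapping in the sym5 coefficients.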
2.2. Spatial-Spectral Feature Extraction of HSI
This module aims to comprehensively extract rich spectral–spatial information from HSI through a well-structured dual-branch processing framework. The system architecture mainly consists of two key components: SGMB focuses on the extraction and refinement of spectral features and SCB focuses on spatial relationship modeling and coordinate information preservation. High-frequency components are processed using the high-frequency feature learning (HFL) module, while low-frequency components undergo corresponding feature extraction. This integrated framework effectively captures detailed and structural information while maintaining computational efficiency.
In HSI classification, while CNNs effectively capture fine-grained spectral–spatial patterns through multi-scale convolutional kernels, their inherent limitation lies in the local nature of convolutional operations. This results in a limited receptive field, making it difficult to model long-range dependencies and global contextual relationships within the scene. GCNs, on the other hand, overcome this by representing pixels as nodes connected by edges based on spectral similarity, explicitly modeling long-range spatial relationships. However, GCNs may overlook the localized, fine-grained details that CNNs excel at extracting, as their graph construction relies on potentially oversimplified spectral-similarity metrics. To harness the strengths of both paradigms while mitigating their individual weaknesses, our SGMB adopts a parallel integration strategy: the GCN branch establishes meaningful global context relationships while the CNN branch preserves discriminative local features. The parallel fusion of these complementary features ensures a more robust and comprehensive representation for precise classification.
The low-frequency components of the preprocessed HSI spectral information
are simultaneously input into the CNN branch and the GCN branch for training. The CNN branch tackles three fundamental limitations in HSI classification: the computational intensity of 3D convolutions restricts model depth and efficiency, fixed-scale kernels struggle with the significant size variations among ground objects, and inherent channel redundancy degrades classification accuracy. This hybrid architecture begins with a shallow 3D CNN that processes the input spectral components with a 3 × 3 × 3 kernel (stride 1, padding 1) to capture intrinsic spatial-spectral correlations. The output features of the 3D CNN serve as input to a parallel 2D CNN, which employs three complementary convolution strategies to enhance feature representation. First, grouped convolution with a 3 × 3 kernel (stride 1, padding 1) significantly reduces computational complexity while retaining feature discriminability. Second, grouped convolution with a 5 × 5 kernel (stride 1, padding 2) expands the receptive field to capture wide-area context across multi-scale ground objects. Third, 1 × 1 pointwise convolution performs cross-channel feature reconstruction and dimension compression. The outputs of the parallel streams are processed through weighted summation, normalization, and ReLU activation to obtain
. The process can be formulated as follows:
where
represents the weighted summation.
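The motivation for the grouped and pointwise branches can be made concrete by counting convolution weights. The channel width and group count below are illustrative assumptions, not values from the paper.

```python
def conv2d_params(c_in, c_out, k, groups=1):
    """Weight count of a 2D convolution (bias omitted):
    each output channel sees only c_in / groups input channels."""
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k * k

c_in = c_out = 64                                       # illustrative width
standard_3x3 = conv2d_params(c_in, c_out, 3)            # dense baseline
grouped_3x3  = conv2d_params(c_in, c_out, 3, groups=8)  # branch 1
grouped_5x5  = conv2d_params(c_in, c_out, 5, groups=8)  # branch 2, wider RF
pointwise    = conv2d_params(c_in, c_out, 1)            # branch 3, 1x1

# Grouping by 8 cuts the 3x3 weight count by 8x relative to a dense conv,
# and even the 5x5 grouped branch stays well below the dense 3x3 cost.
print(standard_3x3, grouped_3x3, grouped_5x5, pointwise)
```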
It should be noted that the Squeeze-and-Excitation Network (SENet) dynamically adjusts the weight of each feature channel: by enhancing the response of important channels and suppressing redundant or noisy ones, it improves the representational power of the features. Concretely, we first apply average pooling to the input features along the spectral channel dimension, then learn channel weights through two fully connected layers with nonlinear activation, and finally multiply the learned weights channel-wise with the original feature map to obtain the features of the CNN branch
,
where
represents the Sigmoid function, L1 represents the dimension-reduction matrix, L2 represents the dimension-recovery matrix, and
represents element multiplication.
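The SENet recalibration described above (average pool, two fully connected layers, Sigmoid, channel-wise rescaling) can be sketched as follows; the shapes and the reduction ratio are illustrative assumptions.

```python
import numpy as np

def se_block(x, L1, L2):
    """Squeeze-and-Excitation over a (C, H, W) feature map:
    global average pool -> FC (reduce) + ReLU -> FC (recover) + Sigmoid,
    then channel-wise rescaling of the input."""
    z = x.mean(axis=(1, 2))                  # squeeze: (C,)
    s = np.maximum(z @ L1, 0.0)              # dimension reduction + ReLU
    w = 1.0 / (1.0 + np.exp(-(s @ L2)))      # dimension recovery + Sigmoid
    return x * w[:, None, None]              # excitation: rescale channels

rng = np.random.default_rng(0)
C, r = 8, 2                                  # channels, reduction ratio
x = rng.standard_normal((C, 5, 5))
L1 = rng.standard_normal((C, C // r)) * 0.1  # dimension-reduction matrix
L2 = rng.standard_normal((C // r, C)) * 0.1  # dimension-recovery matrix
y = se_block(x, L1, L2)
```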
Simultaneously, the low-frequency components
serve as input to the GCN module. The input feature tensors are efficiently converted into graph-structured data. The node feature matrix
XG is composed of the concatenation of normalized coordinate features and bilinear interpolation spectral information. The sparse edge matrix
E constructs spatial relationships through four-neighborhood connections. The
XG and
E are used as input to the two-level gated GCN layer, whose detailed architecture is depicted in
Figure 2.
Each layer maps the node feature to the target dimension
H through a linear transformation and generates feature selection weights
G using independent linear transformations and Sigmoid activation functions, thereby overcoming the limitation of the original GCN layer, which treats all features equally. These two mappings can be written as H = KXG + b and G = Sigmoid(VXG + c),
where
K is the learnable weight matrix,
V is the gated weight matrix, and
b and
c are bias terms. Subsequently, based on the connection relationship in the edge matrix
E, the features of the source node (
src) and the target node (
dst) are extracted and encoded to generate edge feature vectors
M with semantic information, which can be written as M = U[Xsrc ∥ Xdst] + d;
where
U is the edge transformation matrix,
d is the bias term. The edge feature vectors
M are weighted by the corresponding gating values and aggregated through residual connections, yielding node features
Y enriched with topological information, thereby breaking through the constraint of the original GCN layer, which relies only on the adjacency matrix for simple aggregation.
The node feature
Y after gridification is extracted by the post-processing module to obtain the features
of the GCN branches.
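A minimal numpy sketch of one gated GCN layer, consistent with the symbols K, V, U, b, c, and d defined above; the dimensions and the exact form of the edge encoding are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gated_gcn_layer(X, edges, K, b, V, c, U, d):
    """One gated GCN layer. X: (N, F) node features; edges: (src, dst)
    pairs from the four-neighbourhood graph. K/b project nodes to the
    target dimension, V/c produce feature-selection gates, U/d encode
    edge features from the concatenated endpoint projections."""
    H = X @ K + b                        # node projection
    G = sigmoid(X @ V + c)               # per-node gates in (0, 1)
    Y = H.copy()                         # residual connection
    for src, dst in edges:
        M = np.concatenate([H[src], H[dst]]) @ U + d   # edge feature
        Y[dst] += G[dst] * M             # gate-weighted aggregation
    return Y

rng = np.random.default_rng(1)
N, F, D = 4, 6, 5                        # nodes, input dim, target dim
X = rng.standard_normal((N, F))
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]  # chain graph
K = rng.standard_normal((F, D)); b = np.zeros(D)
V = rng.standard_normal((F, D)); c = np.zeros(D)
U = rng.standard_normal((2 * D, D)); d = np.zeros(D)
Y = gated_gcn_layer(X, edges, K, b, V, c, U, d)
```

The gate G multiplies each incoming edge message before aggregation, which is what distinguishes this layer from plain adjacency-based averaging.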
The low-frequency spectral features
are obtained by weighted summation of
and
. The
and spectral high-frequency features
are channel-wise concatenated and fused via a 1 × 1 convolutional layer for cross-channel integration and dimensionality reduction, outputting the final HSI spectral features
. The specific values of SGMB module parameters are shown in
Table 1.
Although the SGMB described above effectively extracts joint spectral–spatial features, two key challenges remain. First, the raw 3D hyperspectral data has an inherent flaw: it lacks explicit spatial position encoding, which limits the model's understanding of spatial relationships among ground objects. Second, traditional convolution is translation invariant by construction and cannot perceive absolute position, making it difficult to model precise spatial geometric relationships between pixels. Spatial location information plays a decisive role in precisely locating ground-object targets, recognizing boundaries, and understanding spatial context. The SCB module addresses these limitations. CoordConv concatenates standardized height and width coordinate maps with the original hyperspectral image along the channel dimension, injecting absolute-position prior knowledge into the raw data to compensate for its missing spatial encoding. This mechanism also alters the computation of traditional convolution, enabling the kernel to perceive the absolute spatial position of each pixel and fundamentally breaking through the translation-invariance limitation of conventional convolution.
As shown in
Figure 3, diagrams (a) and (b) illustrate a structural comparison between the traditional convolutional layer and the CoordConv layer, respectively. The traditional convolutional layer directly maps input features to output features, whereas the CoordConv layer additionally introduces coordinate information on top of the input data. By concatenating these coordinate maps with the original data along the channel dimension, the convolution kernel is enabled to perceive the absolute spatial position of each pixel, thereby enhancing the model’s ability to capture spatial-geometric relationships. Specifically, the preprocessed low-frequency components
of HSI spatial information are first augmented through channel-wise concatenation with normalized height-axis
and width-axis
coordinate maps, forming enhanced features
Fcoord. The
Fcoord then undergoes feature transformation through successive convolutional layers. The first stage applies CoordConv with a 1 × 1 kernel (stride 1, padding 0), followed by batch normalization and ReLU activation, extracting position-sensitive features while preserving spatial resolution.
The second stage further refines the features through the same structure to obtain the transformed features. The transformed features are processed by the spatial attention mechanism to obtain low-frequency spatial features
and high-frequency spatial features
, which are concatenated and then pass through a 1 × 1 convolutional layer to obtain the HSI spatial features
.
Then, the feature
FH can be represented as,
where ⊕ represents element-wise summation operation.
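The coordinate augmentation at the heart of CoordConv reduces to concatenating two normalized coordinate maps along the channel axis, which can be sketched as follows (the [-1, 1] normalization range is a common convention, assumed here rather than stated in the paper):

```python
import numpy as np

def add_coord_channels(x):
    """Concatenate normalized height- and width-axis coordinate maps to a
    (C, H, W) feature tensor: the two extra channels let every subsequent
    convolution kernel observe the absolute position of each pixel."""
    C, H, W = x.shape
    ys = np.linspace(-1.0, 1.0, H)[:, None].repeat(W, axis=1)  # height axis
    xs = np.linspace(-1.0, 1.0, W)[None, :].repeat(H, axis=0)  # width axis
    return np.concatenate([x, ys[None], xs[None]], axis=0)     # (C+2, H, W)

x = np.zeros((3, 4, 5))        # illustrative (C, H, W) feature map
f = add_coord_channels(x)      # two coordinate channels appended
```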
2.3. Elevation Feature Extraction of LiDAR
When dealing with LiDAR data, CNN-based methods usually stack standard convolutions directly, but this approach has clear limitations. First, the inherent sampling non-uniformity of LiDAR point clouds makes it difficult for fixed-stride convolution to stably capture representative local patterns. Second, ground-object scales vary greatly, from fine surface textures to vast terrain undulations, all of which demand multi-scale perception; traditional single-scale kernels, limited by their fixed receptive fields, cannot capture these markedly different features simultaneously. We therefore design the Multi-Scale Elevation Feature Extraction (MSFE) module to extract elevation features. In the basic feature-extraction stage, rather than simply increasing network depth, the low-frequency components of the LiDAR data pass through two 3 × 3 convolutional layers (stride 1, padding 1), each followed by batch normalization and ReLU activation. The first layer robustly extracts basic terrain features from the sparse, uneven LiDAR data; the second deepens the representation, enhancing the network's ability to express complex terrain structures and laying a more robust foundation for subsequent multi-scale analysis.
The output features
are passed through a multi-branch dilated convolutional layer, which consists of three dilated convolutional layers with the same convolutional kernel size 3 × 3 but different dilation rates (
d = 1, 2, 3) in parallel. Dilated convolution controls the sampling interval of the kernel through the dilation rate, effectively expanding the receptive field without increasing the number of parameters. The effective kernel size is computed as k_eff = k + (k − 1)(d − 1),
where
k is the convolutional kernel size,
d is the dilation rate. The effective receptive fields of the three branches target different scales: the 3 × 3 receptive field captures microscopic features such as surface texture; the equivalent 5 × 5 receptive field extracts medium-scale features such as building outlines; and the equivalent 7 × 7 receptive field models large-scale spatial patterns such as terrain undulations. Learnable weights are introduced to fuse the multi-scale features dynamically. The process is as follows,
where
is a dilated convolution with a dilation rate of
k,
is a learnable weight. The result of the residual connection between
Fp and
Fm is input to the feature enhancement module, whose structure is shown in
Figure 4.
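Before the enhancement step, the multi-branch dilated fusion above can be sketched as follows. The kernel is shared across branches purely to keep the example short, and the fusion weights are fixed rather than learned; both are simplifying assumptions.

```python
import numpy as np

def dilated_conv2d(x, w, d):
    """'Same'-padded 2D convolution of an (H, W) map with a (k, k) kernel
    whose taps are spaced d pixels apart (zero padding at the border)."""
    k = w.shape[0]
    k_eff = k + (k - 1) * (d - 1)          # effective kernel size
    pad = k_eff // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):                  # shift-and-add over kernel taps
            out += w[i, j] * xp[i * d : i * d + x.shape[0],
                                j * d : j * d + x.shape[1]]
    return out

def msfe_multibranch(x, w, alphas):
    """Weighted fusion of three parallel dilated branches (d = 1, 2, 3)."""
    return sum(a * dilated_conv2d(x, w, d)
               for a, d in zip(alphas, (1, 2, 3)))

x = np.random.rand(8, 8)                    # illustrative elevation patch
w = np.ones((3, 3)) / 9.0                   # shared 3x3 averaging kernel
y = msfe_multibranch(x, w, alphas=(0.5, 0.3, 0.2))
```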
The module first performs efficient feature transformation through an improved depthwise separable convolution, consisting of a grouped convolution with a 3 × 3 kernel and a pointwise convolution with a 1 × 1 kernel;
then
is modulated in parallel by dual-mode features. In the channel dimension, dynamic recalibration through a squeeze-and-excitation mechanism generates channel weights
. In the spatial dimension, a 7 × 7 large-receptive-field convolution captures contextual relationships and outputs spatial weights
. The two weights are combined through an outer-product operation to achieve dual-mode synergy, and the enhanced elevation features are output in residual form; the process is as follows,
where
represents the outer product and
represents the inner product. The high-frequency components also pass through the HFL module to obtain high-frequency features
. The
and
are concatenated along the channel dimension, and a grouped CNN with a 3 × 3 kernel extracts local spatial features within each channel group. A 1 × 1 pointwise convolution then reorganizes the independent features output by the grouped convolution, establishes cross-channel associations, and outputs the LiDAR elevation features
.
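The dual-mode modulation at the core of the enhancement module, an outer product of channel and spatial weights applied with a residual path, can be sketched as follows; the weights are given directly here rather than produced by the squeeze-and-excitation and 7 × 7 convolution branches.

```python
import numpy as np

def dual_mode_enhance(x, wc, ws):
    """Modulate a (C, H, W) feature map by the outer product of channel
    weights wc (C,) and spatial weights ws (H, W), with a residual path,
    mirroring the enhancement module described above."""
    joint = wc[:, None, None] * ws[None, :, :]   # outer product -> (C, H, W)
    return x + x * joint                          # residual output

C, H, W = 4, 3, 3
x = np.ones((C, H, W))                 # illustrative constant feature map
wc = np.array([0.0, 0.5, 1.0, 0.25])   # per-channel weights
ws = np.full((H, W), 1.0)              # uniform spatial weights
y = dual_mode_enhance(x, wc, ws)
```

A zero channel weight leaves a channel untouched (residual path only), while a unit weight doubles its activation, which makes the gating behavior easy to verify.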
2.4. Bidirectional Frequency Attention Encoder
The modeling capacity of existing attention-based multimodal fusion methods is often intrinsically constrained by their unidirectional architecture. The single-branch, serial processing design not only leads to competition between complementary features within a limited representational bandwidth but also, and more importantly, results in a rigid information flow that is fundamentally incapable of resolving inter-modal conflicts. This bottleneck severely undermines the robustness and effectiveness of fusion, especially under conditions of significant feature distribution asymmetry or when certain sensor modalities are unreliable or missing.
Therefore, we design the BiFAE through an innovative bidirectional cross-attention mechanism in the frequency domain. A key distinction from existing frequency-aware fusion methods lies in the fusion mechanism: whereas existing methods employ a unidirectional feature aggregation process, our BiFAE introduces a bidirectional interactive attention mechanism in the frequency domain, enabling mutual adaptive enhancement between modalities rather than mere feature concatenation or weighting. The multimodal fusion feature
Ffusion fuses with position embedding and frequency embedding to form F
p as the input of this encoder. The workflow of the encoder is illustrated in
Figure 5. The input feature F
p is first normalized. The normalized feature is then fed into two parallel paths. Each path utilizes a series of deformable convolutional (DeformConv) layers to generate its own set of Query (
Q), Key (
K), and Value (
V) feature tensors. Specifically, Path 1 produces
Q1,
K1, and
V1, while Path 2 produces
Q2,
K2, and
V2,
where
is the weight coefficient and can be manually adjusted.
At the core of BiFAE is a dual-path processing system, with two parallel branches maintaining continuous bidirectional interaction through carefully designed frequency domain operations. Branch 1 focuses on high-frequency details and branch 2 preserves low-frequency context. The innovation of this design is reflected in the interaction mode between branches. Through the bidirectional attention mechanism, the query features of branch 1 interact with the key-value features of branch 2, and at the same time, the query features of branch 2 also interact with the key-value features of branch 1. The specific process is as follows:
where
Q,
K, and
V are query, key, and value, respectively,
is Fourier transform,
is inverse Fourier transform, and
is an attention-based gating mechanism that can automatically detect and compensate for the differences in feature distribution. The final output feature
Fout is obtained through the residual network.
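A simplified sketch of frequency-domain bidirectional cross-attention is given below. Identity Q/K/V projections, real-part attention scores, and the omission of the gating mechanism are simplifying assumptions, so this illustrates the information flow of BiFAE rather than reproducing it.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def freq_cross_attention(Q, K, V):
    """Cross-attention in the frequency domain: tokens are mapped with an
    FFT along the feature axis, scores come from the real part of the
    spectral inner products, and attended values are mapped back with the
    inverse FFT."""
    Fq, Fk, Fv = (np.fft.fft(t, axis=-1) for t in (Q, K, V))
    scores = np.real(Fq @ Fk.conj().T) / np.sqrt(Q.shape[-1])
    out = softmax(scores) @ Fv               # attend in the spectral domain
    return np.real(np.fft.ifft(out, axis=-1))

def bifae_step(F1, F2):
    """Bidirectional interaction: branch 1 queries branch 2's keys/values
    and vice versa, each with a residual connection."""
    return (F1 + freq_cross_attention(F1, F2, F2),   # path 1
            F2 + freq_cross_attention(F2, F1, F1))   # path 2

F1 = np.random.rand(6, 8)   # 6 tokens, 8-dim features (branch 1)
F2 = np.random.rand(6, 8)   # branch 2
O1, O2 = bifae_step(F1, F2)
```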
For the output feature Fout, we construct a lightweight CNN-based classification module that transforms and compresses this feature and outputs a probability distribution vector whose dimension equals the number of categories. The category index with the maximum probability is the land-cover type predicted by the model.