In this section, we present the proposed LGNet architecture, which addresses key challenges in RSSC through Lie group feature extraction and MD-CAM. First, we provide an overview of the overall model structure, followed by comprehensive explanations of the LGML branch, deep learning branch, MD-CAM, and feature fusion strategy design principles and implementation details. Finally, we describe our experimental setup, including datasets, implementation environment, and parameter configurations, establishing the foundation for subsequent results analysis and discussion.
3.1. Overall Model Architecture
Figure 1 illustrates our proposed RSSC framework based on the Lie group multi-dimensional cross-attention mechanism. The symmetric dual-branch architecture consists of (1) an LGML branch for extracting low-level features, (2) a deep learning branch for capturing high-level semantic features, (3) a feature-level fusion module that combines both feature types into fused features, and (4) the MD-CAM, which processes these three feature streams to highlight key information. The final classification is performed by applying a Softmax function to the enhanced features.
Before feeding images into the model, we performed systematic preprocessing and data augmentation on the input HRRSI to improve training sample quality and model generalization. First, all remote sensing images were resampled to a standard input size to ensure data format consistency. Subsequently, we applied normalization using the canonical ImageNet mean and standard deviation (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]), accelerating model convergence and enhancing training stability.
For the training dataset, we implemented diverse data augmentation techniques to enrich sample diversity: random horizontal and vertical flipping to enhance the model’s invariance to directional changes; random rotation to cultivate the model’s ability to recognize rotation-invariant features; brightness, contrast, and saturation jittering to enable robust performance under varying illumination conditions; and subtle random affine translations to strengthen the model’s adaptability to object position shifts. This augmentation protocol was tailored to HRRSI-specific characteristics such as multi-angle views, complex illumination conditions, and temporal variations, effectively improving the model’s classification performance and generalization when confronting diverse and complex scenes.
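For illustration, the torchvision-style pipeline below sketches the preprocessing and augmentation steps described above; the target image size, rotation range, jitter strengths, and translation range are placeholder values chosen for illustration rather than the exact settings used in our experiments.

```python
import torchvision.transforms as T

# Canonical ImageNet normalization statistics.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Training pipeline: resize, geometric and photometric augmentation, normalize.
# All magnitudes below (224, 15 degrees, 0.2 jitter, 5% translation) are
# illustrative placeholders, not the exact values of our configuration.
train_transform = T.Compose([
    T.Resize((224, 224)),                       # resample to a fixed input size
    T.RandomHorizontalFlip(p=0.5),              # directional invariance
    T.RandomVerticalFlip(p=0.5),
    T.RandomRotation(degrees=15),               # rotation-invariant features
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),   # illumination robustness
    T.RandomAffine(degrees=0, translate=(0.05, 0.05)),             # small position shifts
    T.ToTensor(),
    T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])

# Evaluation pipeline: deterministic preprocessing only.
eval_transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
```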
3.2. LGML Branch
Based on our team’s previous research in the field of Lie group feature extraction [1,2,51], we first map each sample onto the Lie group manifold space to obtain the corresponding Lie group sample. In this mapping, the input is the j-th sample within the i-th class of the dataset, and the output is its counterpart in the Lie group manifold space. All subsequent operations are based on these processed Lie group samples.
Through extensive comparative experiments, we found that different scene classes often share similarities in texture and color, making it difficult to distinguish these classes effectively by relying on a single shallow feature. To extract the low-level features of the image more comprehensively and avoid the limitations and errors caused by a single feature, building upon our previous research [2,19], we introduce the Lie group feature covariance matrix to jointly capture multiple low-level features in the image, thus improving the characterization ability of the features across different scenes. After extensive experimental optimization, the feature vector we adopted comprises the following components. The pixel position coordinates $x$ and $y$ preserve the spatial distribution information of the features. For the color features, through experimental comparisons of the RGB, YCbCr, and Lab spaces, we ultimately selected the $L$, $a$, and $b$ components of the Lab color space, which represent the luminance, green-red, and blue-yellow color contrasts, respectively. Experiments have shown that the Lab space, based on a human vision model, has better color consistency under lighting changes and reflects color differences more naturally [52]. The mean color of a local image block summarizes the overall color distribution of the region and is advantageous for identifying areas of uniform color (e.g., grass, rivers, etc.).
Local Binary Patterns (LBP) [53] encode the grayscale variations around each pixel to generate binary texture features, effectively reflecting the local texture patterns of images. The Histogram of Oriented Gradients (HOG) [54] computes the gradient directions of pixels within a local region and constructs a directional histogram, capturing edge and contour information. Based on previous research [19,51], we employed Gabor filters [55] to extract spatial and directional information from the image through filtering operations at multiple scales and orientations.
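To make these texture descriptors concrete, the snippet below extracts LBP, HOG, and Gabor responses for a single grayscale patch using scikit-image; the radius, orientation, and frequency settings are illustrative assumptions rather than the configuration used in our experiments.

```python
import numpy as np
from skimage.feature import local_binary_pattern, hog
from skimage.filters import gabor

def texture_features(gray_patch: np.ndarray) -> dict:
    """Extract LBP, HOG, and multi-orientation Gabor responses for one patch.

    gray_patch: 2-D float array in [0, 1]. All parameters below are
    illustrative defaults, not the exact settings of our experiments.
    """
    # LBP: binary encoding of grayscale variation around each pixel.
    lbp = local_binary_pattern(gray_patch, P=8, R=1, method="uniform")

    # HOG: histogram of gradient orientations over local cells.
    hog_vec = hog(gray_patch, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2), feature_vector=True)

    # Gabor: filter magnitude responses at several orientations (one scale here).
    gabor_mags = []
    for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
        real, imag = gabor(gray_patch, frequency=0.3, theta=theta)
        gabor_mags.append(np.sqrt(real ** 2 + imag ** 2))

    return {"lbp": lbp, "hog": hog_vec, "gabor": np.stack(gabor_mags)}
```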
Our feature covariance matrix also integrates image shape features, including the area $S$, compactness, and elongation (aspect ratio). The area $S$ reflects the size of an object and is an important cue for distinguishing objects at different scales. To describe the complexity of object shapes, we incorporate the compactness metric proposed in previous studies [56], which relates the area $S$ to the perimeter $P$ of the object’s silhouette. Objects with high structural regularity, such as buildings and roads, tend to have higher compactness, while natural elements like forests and lakes exhibit lower values. The elongation measures how stretched an object’s shape is, making it useful for distinguishing elongated features such as rivers and roads from more compact objects, thereby enhancing the accuracy and efficiency of geographical object classification in HRRSI.
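For reference, a conventional normalization of compactness consistent with this description (largest for regular, circle-like silhouettes and decreasing as the boundary becomes more irregular) is the isoperimetric ratio; we state it here as an assumption, since [56] may adopt a different scaling:
$$\mathcal{C} \;=\; \frac{4\pi S}{P^{2}}, \qquad 0 < \mathcal{C} \le 1,$$
which equals 1 for a perfect circle and, for example, $\pi/4 \approx 0.785$ for a square, while ragged natural boundaries with long perimeters yield much smaller values.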
By extracting the texture, color, and shape features described above, our model obtains richer low-level features from images, thereby acquiring multi-scale remote sensing image information and more accurate classification results. Based on these feature vectors, we calculate the Lie group feature covariance matrix $C$:
$$C = \frac{1}{n-1}\sum_{i=1}^{n}\left(f_{i}-\mu\right)\left(f_{i}-\mu\right)^{T},$$
where $f_{i}$ represents the $i$-th feature vector, $\mu$ is the mean of all feature vectors, and $n$ is the total number of feature vectors.
Compared with traditional feature extraction methods, our Lie group feature covariance matrix has three key advantages: (1) low matrix dimensionality, effectively reducing redundancy in the feature space; (2) high computational efficiency, suitable for real-time or large-scale data processing; and (3) strong noise resistance, providing robust performance against noise and distortions in images. This method effectively captures low-level features in HRRSI and significantly reduces computational complexity during subsequent feature fusion and classification. For additional details on the Lie group feature covariance matrix, please refer to [35].
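As a minimal sketch of this computation, the function below builds a covariance descriptor from per-pixel feature vectors; the feature dimensionality and any subsequent Lie group (e.g., log-Euclidean) mapping of the resulting matrix are omitted or assumed for illustration.

```python
import numpy as np

def covariance_descriptor(features: np.ndarray) -> np.ndarray:
    """Region covariance descriptor from per-pixel feature vectors.

    features: (n, d) array, one d-dimensional low-level feature vector
    (position, Lab color, LBP/HOG/Gabor responses, shape cues, ...) per pixel.
    Returns a (d, d) symmetric positive semi-definite covariance matrix.
    """
    mu = features.mean(axis=0, keepdims=True)      # mean feature vector
    centered = features - mu
    n = features.shape[0]
    return centered.T @ centered / (n - 1)         # C = 1/(n-1) * sum (f_i - mu)(f_i - mu)^T

# Toy usage: a 32x32 patch with 7 hypothetical low-level features per pixel.
patch_features = np.random.rand(32 * 32, 7)
C = covariance_descriptor(patch_features)
print(C.shape)  # (7, 7)
```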
3.3. Deep Learning Branch
In the RSSC task, HRRSI contains rich textures and multi-scale target features, which place higher demands on the model’s feature extraction capability. However, traditional deep neural networks often incur a sharp increase in computational complexity as classification performance improves, reducing inference efficiency and making it difficult to meet the demands of large-scale remote sensing data processing [57,58].
To address the challenges mentioned above, we designed an innovative deep learning branch (Figure 2). In this study, ResNet50 [30] serves as the backbone network of our deep learning branch. To maintain high accuracy while significantly reducing computational complexity, we implemented two key optimizations to the ResNet50 architecture: (1) a lightweight network structure that preserves only the initial convolution layer, the max-pooling layer, and the first three residual blocks, substantially reducing parameters while retaining essential feature extraction capabilities; and (2) transfer learning with pre-trained weights to efficiently capture high-level semantic features in remote sensing images.
Based on this optimized architecture, we introduced depthwise separable convolution to replace standard convolution, reducing computation by 8–9 times while preserving feature representation ability [59]. More importantly, we designed a novel parallel multi-scale depthwise separable convolution module (PMDS Conv, Figure 3) that processes convolutions of three different kernel sizes in parallel and decomposes each standard convolution into a depthwise convolution followed by a $1\times1$ pointwise convolution, effectively expanding the feature extraction range while reducing computational overhead.
Specifically, PMDS Conv consists of three parallel branches, employing $1\times1$, $3\times3$, and $5\times5$ convolutions, respectively, to extract multi-scale features. The $1\times1$ convolution quickly integrates channel information; the $3\times3$ convolution extracts local features; and the $5\times5$ convolution expands the receptive field to capture larger-scale spatial contextual information. This parallel design enables the model to process features at different scales simultaneously, making it well suited to the complex and diverse ground objects found in HRRSI.
After the convolution operations in each branch, we introduce the Lie group Sigmoid (LG-Sigmoid) activation function [1], whose definition involves the matrix trace $\mathrm{tr}(\cdot)$, a constant $c$, and the Hermitian transpose $y^{H}$ of the input matrix $y$. This activation function is based on the Lie group manifold space and provides a smooth normalization and probabilistic mapping of the processed matrix samples, allowing the model to obtain a more balanced, continuous, and interpretable feature distribution in subsequent decision processes [60,61]. Subsequently, we fuse the features of different scales obtained from the three branches to capture multi-scale receptive field information. Direct concatenation would significantly increase the number of channels and reduce computational efficiency; therefore, we employ a $1\times1$ pointwise convolution for feature refinement, combined with Batch Normalization (BN) [62] and the ReLU activation function [63], to stabilize feature distributions and enhance the model’s nonlinear representation capability.
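For illustration, the following PyTorch sketch assembles a PMDS Conv-style block from the components described above. The branch kernel sizes, channel widths, and the use of a plain sigmoid in place of the LG-Sigmoid activation (which operates on Lie group matrix samples and is not reproduced here) are simplifying assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class PMDSConvSketch(nn.Module):
    """Illustrative parallel multi-scale depthwise separable convolution block.

    Three parallel branches (1x1, 3x3, 5x5), a placeholder activation standing
    in for LG-Sigmoid, then 1x1 fusion with BN + ReLU.
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.branch3 = DepthwiseSeparableConv(in_ch, out_ch, kernel_size=3)
        self.branch5 = DepthwiseSeparableConv(in_ch, out_ch, kernel_size=5)
        self.act = nn.Sigmoid()                      # placeholder for LG-Sigmoid
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * out_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        feats = [self.act(b(x)) for b in (self.branch1, self.branch3, self.branch5)]
        return self.fuse(torch.cat(feats, dim=1))

# Toy usage.
y = PMDSConvSketch(in_ch=256, out_ch=128)(torch.randn(2, 256, 28, 28))
print(y.shape)  # torch.Size([2, 128, 28, 28])
```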
Although PMDS Conv effectively captures multi-scale features in remote sensing images, our experimental analysis revealed room for improvement in distinguishing complex scenes. To address this limitation, we developed two complementary modules that further enhance model performance.
The Feature Enhancement Module was designed around the complex relationship between targets and backgrounds in remote sensing images. It employs a standard convolution combined with BN and the LG-Sigmoid activation function to highlight key target characteristics while effectively suppressing background interference. Through comparative experimentation, we selected the convolution kernel size that achieves an ideal balance between effective local receptive field coverage and computational efficiency.
Remote sensing images typically exhibit distinctive channel-wise feature distributions across different categories. Based on this observation, we developed a lightweight Channel Attention Mechanism (CAM) that adaptively adjusts the importance of different channels. Unlike traditional SENet [39], which requires a complete multilayer perceptron for feature transformation, our approach employs lightweight $1\times1$ convolutions, significantly reducing computational load while maintaining robust feature modeling capability. The mechanism first performs Global Average Pooling (GAP) on the input features to obtain a global channel response, then processes it through ReLU and LG-Sigmoid activations, enabling the network to focus precisely on the most discriminative feature channels for classification. The channel attention calculation is expressed as
$$A = \sigma\!\left(W_{2} * \mathrm{ReLU}\!\left(W_{1} * \mathrm{GAP}(X)\right)\right),$$
where $A$ represents the calculated channel attention weights, $X$ denotes the input features, $W_{1}$ and $W_{2}$ are the two $1\times1$ convolution kernels in the attention mechanism, $*$ denotes convolution, and $\sigma$ represents the LG-Sigmoid activation function. Finally, these attention weights recalibrate the input features to enhance important information:
$$X' = A \otimes X,$$
where $\otimes$ denotes channel-wise multiplication.
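A minimal PyTorch sketch of this lightweight channel attention is given below; the reduction ratio and the standard sigmoid standing in for LG-Sigmoid are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LightweightChannelAttention(nn.Module):
    """GAP -> 1x1 conv -> ReLU -> 1x1 conv -> gate -> channel recalibration.

    The reduction ratio and the plain sigmoid in place of LG-Sigmoid are
    illustrative assumptions.
    """
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                 # global channel response
        self.w1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.w2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.gate = nn.Sigmoid()                           # placeholder for LG-Sigmoid

    def forward(self, x):
        a = self.gate(self.w2(self.relu(self.w1(self.gap(x)))))  # (B, C, 1, 1) weights
        return x * a                                              # recalibrate channels

# Toy usage.
out = LightweightChannelAttention(channels=128)(torch.randn(2, 128, 28, 28))
print(out.shape)  # torch.Size([2, 128, 28, 28])
```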
To further enhance the representational capacity of the model and accelerate training convergence, we also introduce residual connections in the deep learning branch. By adding skip connections, the residual structure effectively addresses the degradation problem in deep networks and promotes the learning of deep features, thereby improving training stability [30]. Additionally, for the processed feature maps obtained from the LGML branch and the deep learning branch, we adopt the feature-level fusion method proposed by Xu et al. [2,33], fusing the two branch-derived feature maps into a single fusion feature map with better representational capacity and lower dimensionality for subsequent processing.
3.4. MD-CAM
Traditional self-attention mechanisms generate attention weights by calculating dot products between input features. Despite remarkable success in various vision tasks, these approaches still face challenges of computational overhead and insufficient information fusion when processing multi-dimensional features or high-resolution images [64].
To overcome these limitations, we propose an innovative multi-dimensional cross-attention mechanism (MD-CAM), as shown in Figure 4. This mechanism enhances RSSC performance through two key innovations: first, it employs a local block self-attention strategy instead of the traditional global dot-product attention scheme, calculating interrelationships between features within local regions; second, it implements multi-dimensional interaction among low-level features, high-level features, and fused features, addressing the insufficient fusion of different-level features in existing methods.
In our model design, we simultaneously feed the fused feature map obtained from feature-level fusion, together with the feature maps from the LGML branch and the deep learning branch, into the MD-CAM for processing. For each feature map $F$, we generate separate Query ($Q$), Key ($K$), and Value ($V$) matrices through three independent $1\times1$ convolution operations. Unlike traditional attention mechanisms, our design ensures that the matrices generated from each feature layer capture information at their respective levels without mutual interference, while reducing feature dimensionality and lowering the complexity of subsequent computations [65,66]:
$$Q = W_{Q} * F, \qquad K = W_{K} * F, \qquad V = W_{V} * F,$$
where $W_{Q}$, $W_{K}$, and $W_{V}$ are independent $1\times1$ convolution kernels that serve as learnable projection matrices in our multi-dimensional attention framework.
Subsequently, within each feature layer, the key matrix $K$ and query matrix $Q$ generate the attention weight matrix through dot-product operations. To ensure numerical stability and accelerate model convergence, we perform L2 normalization [67] on $K$ and $Q$ before calculating the attention weights. Specifically, we normalize $K$ and $Q$ along the channel dimension so that the norm of each channel vector equals 1, thereby constraining the dot-product results to the range $[-1, 1]$. The normalization is expressed as
$$\hat{Q} = \frac{Q}{\left\lVert Q \right\rVert_{2}}, \qquad \hat{K} = \frac{K}{\left\lVert K \right\rVert_{2}}.$$
The normalized $\hat{Q}$ and $\hat{K}$ then measure similarity through their dot product, generating the attention weight matrix
$$A = \mathrm{Softmax}\!\left(\frac{\hat{Q}\,\hat{K}^{T}}{\tau}\right),$$
where $\tau$ is a learnable temperature parameter that determines the sharpness of the attention weight distribution. When $\tau$ is smaller, the attention weights tend toward a sharper distribution, focusing on fewer important features; when $\tau$ is larger, the attention distribution becomes more uniform, attending to more feature channels. We initialize $\tau$ to 1 and allow it to adapt during training.
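As a quick numerical illustration of this temperature behavior (using hypothetical similarity scores, since the actual values depend on the learned features), consider raw dot-product scores $(2, 1, 0)$:
$$\tau = 0.5:\ \mathrm{Softmax}\big((2,1,0)/0.5\big) \approx (0.867,\ 0.117,\ 0.016), \qquad
\tau = 2:\ \mathrm{Softmax}\big((2,1,0)/2\big) \approx (0.506,\ 0.307,\ 0.186).$$
Smaller $\tau$ concentrates the weights on the strongest match, while larger $\tau$ spreads attention more evenly.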
Using the attention weight matrix $A$ calculated in the previous step, we perform a weighted summation with the value matrix $V$, allowing the model to generate the final fused feature representation:
$$F_{\mathrm{out}} = A\,V.$$
This weighted summation implements efficient information fusion through an attention mechanism applied along the channel dimension. It enables the model to focus on key relationships between features, reduces computational complexity, and significantly enhances the modeling of feature dependencies.
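The following PyTorch sketch illustrates this attention step for a single feature stream; the L2-normalized, temperature-scaled channel attention follows the description above, while the local-block partitioning and the interaction across the three MD-CAM feature streams are omitted for brevity and the exact layout is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedCrossAttention(nn.Module):
    """Single-stream sketch of the attention step described for MD-CAM.

    Q/K/V come from independent 1x1 convolutions; each channel vector of Q and
    K is L2-normalized so dot products lie in [-1, 1]; a learnable temperature
    tau scales the channel-to-channel similarity before softmax.
    """
    def __init__(self, channels):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.tau = nn.Parameter(torch.ones(1))        # learnable temperature, init 1

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.to_q(x).flatten(2)                   # (B, C, H*W)
        k = self.to_k(x).flatten(2)
        v = self.to_v(x).flatten(2)
        q = F.normalize(q, p=2, dim=-1)               # unit-norm channel vectors
        k = F.normalize(k, p=2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.tau, dim=-1)  # (B, C, C)
        out = attn @ v                                # weighted summation with V
        return out.view(b, c, h, w)

# Toy usage.
y = NormalizedCrossAttention(channels=64)(torch.randn(2, 64, 28, 28))
print(y.shape)  # torch.Size([2, 64, 28, 28])
```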
3.5. Feature Fusion Strategy
After obtaining the cross-attention-enhanced representations from each feature stream, we propose a hierarchical feature fusion strategy that maximally preserves complementary information from multiple sources while enhancing the model’s ability to comprehend multi-scale features. Unlike the simple additive fusion widely adopted in existing research, we first employ feature concatenation as the initial fusion step:
$$F_{\mathrm{cat}} = \mathrm{Concat}\left(F_{1}, F_{2}, F_{3}\right) \in \mathbb{R}^{B \times 3C \times H \times W},$$
where $B$ represents the batch size, $C$ represents the number of channels in a single feature stream, and $H$ and $W$ represent the height and width of the feature map, respectively. This concatenation preserves the complete information from all three feature streams but introduces redundancy and increases the channel dimensionality. To address this issue, we design an efficient feature fusion module composed of a depthwise separable convolution and a standard convolution, in which the depthwise separable convolution is implemented as a depthwise convolution followed by a pointwise convolution, with batch normalization (BN) and the LG-Sigmoid activation applied to the result. Inspired by MobileNet V2 [68], the depthwise convolution uses a $3\times3$ kernel, while the $1\times1$ pointwise convolution reduces the channel dimensionality from $3C$ back to $C$.
To further ensure feature space consistency and enhance contextual modeling capability, we apply a final standard convolution to the fused features.
The innovation of this hierarchical feature fusion strategy is mainly reflected in three aspects: (1) the concatenation operation preserves the integrity of the original information in the different feature streams, avoiding the information loss that simple additive fusion may cause; (2) the depthwise separable convolution reduces computational complexity while providing a wide receptive field by separating the convolution across channel and spatial dimensions; and (3) the standard convolution, as the final fusion layer, strengthens the spatial consistency and contextual correlation of the features, enabling the model to capture the complex spatial relationships of ground objects in remote sensing images more accurately.
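A minimal PyTorch sketch of this fusion module is shown below; the kernel sizes, the plain sigmoid standing in for LG-Sigmoid, and the assumption that the three input streams share the same channel count $C$ are illustrative choices rather than the exact implementation.

```python
import torch
import torch.nn as nn

class HierarchicalFusionSketch(nn.Module):
    """Concatenate three feature streams, compress with a depthwise separable
    convolution (3C -> C), then refine with a standard convolution.
    """
    def __init__(self, channels):
        super().__init__()
        self.depthwise = nn.Conv2d(3 * channels, 3 * channels, kernel_size=3,
                                   padding=1, groups=3 * channels, bias=False)
        self.pointwise = nn.Conv2d(3 * channels, channels, kernel_size=1, bias=False)  # 3C -> C
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.Sigmoid()                       # placeholder for LG-Sigmoid
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # final standard conv

    def forward(self, f1, f2, f3):
        x = torch.cat([f1, f2, f3], dim=1)            # (B, 3C, H, W)
        x = self.act(self.bn(self.pointwise(self.depthwise(x))))
        return self.refine(x)                         # (B, C, H, W)

# Toy usage with three C=64 feature streams.
f = [torch.randn(2, 64, 28, 28) for _ in range(3)]
print(HierarchicalFusionSketch(64)(*f).shape)  # torch.Size([2, 64, 28, 28])
```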
After obtaining the fused features, we adopt a simple and efficient classification strategy: the feature map is compressed into a fixed-dimensional feature vector through GAP, and a single-layer linear mapping then projects this vector into the classification space,
$$z = W\,\mathrm{GAP}\!\left(F_{\mathrm{fuse}}\right) + b,$$
where $W$ and $b$ are the weight matrix and bias of the linear classifier. The model finally applies the Softmax function to $z$ to generate the class probability distribution. This classification head reduces the parameter count, lowers the risk of overfitting, and simplifies computation while enhancing the model’s generalization capability.
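For completeness, a minimal sketch of this classification head follows; the channel and class counts are placeholder values.

```python
import torch
import torch.nn as nn

# Minimal classification head: GAP -> single linear layer -> softmax.
# The channel count (64) and class count (30) are placeholders.
class ClassifierHead(nn.Module):
    def __init__(self, channels=64, num_classes=30):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)            # (B, C, H, W) -> (B, C, 1, 1)
        self.fc = nn.Linear(channels, num_classes)    # single-layer linear mapping

    def forward(self, x):
        z = self.fc(self.gap(x).flatten(1))           # class logits
        return torch.softmax(z, dim=1)                # class probability distribution

print(ClassifierHead()(torch.randn(2, 64, 28, 28)).shape)  # torch.Size([2, 30])
```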
Table 2 compares the computational complexity of different attention mechanisms. Traditional self-attention has a time complexity of $\mathcal{O}(N^{2})$ on HRRSI [69], where $N$ represents the number of tokens in the image. In contrast, the MD-CAM proposed in this study reduces the overall complexity to a cost that is linear in $N$ and governed by $d$, the number of feature channels after dimensionality reduction, by constraining global attention calculations within fixed-size local blocks. This design significantly reduces the computational overhead, preserves the independent semantic expression of the different feature layers, and enhances cross-layer feature fusion while avoiding interference between feature layers. Experimental results demonstrate that MD-CAM offers significant advantages in multi-feature-layer fusion and dynamic attention to key regions, while notably reducing the overall computational complexity.