1. Introduction
Object detection in remote sensing images (RSIs) has attracted significant research attention in recent years [1,2], primarily due to its critical role in identifying and localizing objects such as vehicles, vessels, and aircraft in aerial images. It has been extensively applied in domains such as satellite positioning and atmospheric monitoring. Compared with traditional horizontal bounding box generation, remote sensing object detection focuses on producing bounding boxes that are precisely aligned with the orientation of objects. As a result, numerous researchers have devoted efforts to developing oriented bounding box (OBB) detectors [3,4], aiming to improve OBB encodings [5,6,7] and enhance the accuracy of angle prediction [8,9,10,11]. However, with regard to feature extraction for object detection, the unique characteristics of remote sensing images have not yet been fully explored.
Remote sensing images are typically captured from an aerial perspective at high resolution, which results in a wide range of object scales, from small entities such as vehicles and ships to large structures such as football fields and ring roads. The accuracy of object recognition in these images depends not only on the appearance of the objects themselves but also on extensive contextual information: surrounding environments provide valuable cues regarding the shape, orientation, and other characteristics of an object. Thus, some studies employed explicit data augmentation techniques [12,13,14] to enhance the robustness of contextual feature representation, while others focused on multi-scale feature integration [15,16] to extract rich multi-scale contextual information. However, remote sensing images exhibit significant object scale variation and contain numerous high-aspect-ratio objects, i.e., targets with an aspect ratio significantly greater than 1 (e.g., ships and bridges), whose elongated form places special demands on receptive field design. These properties reduce the accuracy of CNN- and transformer-based methods in object recognition and detection, for the following reasons. (a) In remote sensing image recognition, CNNs are constrained by the limited size of their local receptive fields, which makes it difficult to capture contextual information for high-aspect-ratio objects in high-resolution remote sensing images; in addition, irrelevant information from surrounding areas is included, increasing the likelihood of misdetections and omissions and ultimately leading to misclassification. (b) In remote sensing object detection, balancing robust local contextual feature extraction with precise global context modeling is essential for optimal detection performance. Transformers can establish long-range dependencies through self-attention, but directly applying traditional transformer models to remote sensing object detection can result in slow convergence, with negative impacts on detection performance; their high computational complexity also significantly slows inference, reducing the FPS and limiting real-time applications.
In recent years, large-kernel convolutional networks have been introduced into remote sensing object detection. By expanding the receptive field, these networks effectively capture broader contextual information and have achieved promising results. A notable example is LSKNet [17], which combines a series of large convolutions with a spatial selection mechanism to capture long-range contextual information. PKINet [18] extends LSKNet with a parallel large-square-convolution structure, enhancing the model's ability to handle object scale variation. However, despite the expanded receptive field, these models primarily extract features within square windows, which makes it difficult to capture the contextual information of high-aspect-ratio objects. To address this issue, the large strip convolutions proposed in [19] have demonstrated strong capability in detecting objects with various aspect ratios and excellent feature representation learning in remote sensing object detection. Additionally, the dilated large-kernel convolutions used in [17,18] may overlook dependencies between distant objects within the receptive field. Transformer-based models mitigate this limitation through global self-attention, but the attention mechanism's high computational complexity reduces inference speed, making these models less suitable for real-time remote sensing applications.
Recently, the state space model (SSM) Mamba [20], equipped with a selective scanning mechanism, has demonstrated both superior performance in long-range interactions and linear computational complexity, enabling it to address the computational inefficiency of transformers in long-sequence state space modeling. Existing SSM-based vision models typically employ a multi-scan strategy, with forward, backward, horizontal, and vertical scans, to ensure that every part of an image can establish connections with the others. However, the repeated extraction of similar sequence patterns across these scanning paths exacerbates information redundancy. Encouragingly, a multi-granularity strategy, which captures spatial dependencies at different levels of granularity, offers a promising solution to this challenge. Exploring efficient selective scanning mechanisms is therefore crucial for improving SSM-based approaches.
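To make this redundancy concrete, the following PyTorch sketch illustrates the standard four-direction scan used by many vision SSMs (a VMamba-style cross-scan); it is a generic illustration, not this paper's exact implementation, and the function name is ours.

```python
import torch

def four_direction_scans(x: torch.Tensor) -> torch.Tensor:
    """Flatten a feature map (B, C, H, W) along four scan orders:
    row-major forward/backward and column-major forward/backward."""
    row_fwd = x.flatten(2)                  # (B, C, H*W), row-major order
    col_fwd = x.transpose(2, 3).flatten(2)  # column-major order
    return torch.stack(
        [row_fwd, row_fwd.flip(-1), col_fwd, col_fwd.flip(-1)], dim=1
    )                                       # (B, 4, C, H*W)

# Each sequence is fed through a selective scan (S6) and merged back to 2D.
# All four paths traverse the same features, so their outputs overlap
# heavily; this is the redundancy that a multi-granularity strategy reduces.
print(four_direction_scans(torch.randn(1, 8, 16, 16)).shape)  # (1, 4, 8, 256)
```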
Remote sensing images often exhibit significant object scale variation and high-aspect-ratio characteristics, which pose substantial challenges for accurate object detection. Effectively capturing both local details and long-range contextual information under these conditions has become a critical research problem. To address this, we propose LS-MambaNet, a general-purpose backbone network specifically designed to tackle the unique difficulties of remote sensing object detection, particularly those caused by large variations in object scales and elongated shapes.
In summary, the main contributions of this paper are as follows:
To address the issue that traditional CNN methods are limited by fixed square receptive fields and struggle to effectively capture the contextual information of high-aspect-ratio objects in remote sensing images, the LSF Block is proposed. This module enhances the model’s ability to extract context from high-aspect-ratio objects by combining a grouping fusion strategy based on large-kernel convolutions with large strip convolutions in orthogonal directions.
The MGSpa-Mamba Block is designed to reduce computational costs and increase inference speed (FPS), especially when processing high-resolution remote sensing images. It adopts a multi-granularity scanning strategy that reduces information redundancy across multiple scanning paths, thereby ensuring efficient modeling of spatial dependencies.
The proposed LS-MambaNet model achieves state-of-the-art detection performance while significantly improving the FPS, advancing remote sensing object detection to a higher level. On the DOTA1.0 and HRSC2016 datasets, LS-MambaNet demonstrates outstanding detection accuracy and inference efficiency.
4. Experiments
This section presents extensive experiments to evaluate the effectiveness and performance of the model in remote sensing object detection. First, the datasets used in the experiments are briefly introduced. Next, the experimental setup and evaluation metrics are explained. Finally, the results of ablation studies and comparative experiments are provided, followed by an analysis of the observed phenomena and trends.
4.1. Datasets
DOTA-v1.0 [1] is a large-scale optical remote sensing image object detection dataset consisting of 2806 remote sensing images with sizes ranging from 800 × 800 to 4000 × 4000 pixels. It includes 188,282 instances across 15 categories, such as 'airplane' (PL), 'baseball diamond' (BD), 'bridge' (BR), and 'ground track field' (GTF), each annotated with horizontal bounding boxes (HBBs) and oriented bounding boxes (OBBs). Half of the images form the training set, 1/6 the validation set, and the remaining 1/3 the test set.
HRSC2016 [2] is a high-resolution remote sensing dataset collected for ship detection. It consists of 1061 images, of which 436 are used for training, 444 for testing, and the remaining 181 for validation, and contains 2976 ship instances.
4.2. Implementation Details and Evaluation Metrics
In our experiments, we evaluate the oriented object detection model on the DOTA-v1.0 and HRSC2016 datasets. To ensure a fair comparison, we follow the dataset processing methods used in other mainstream studies [6,17]. For DOTA-v1.0, we use a multi-scale approach during training and testing, resizing images to three scales (0.5, 1.0, 1.5) and then dividing each resized image into overlapping 1024 × 1024 patches with a 500-pixel overlap. For HRSC2016, images are resized so that the longer edge is 800 pixels while preserving the aspect ratio. The backbone network is first pre-trained on the ImageNet-1K dataset and subsequently fine-tuned on the remote sensing benchmarks. To compare with SOTA results, we adopt a 300-epoch pretraining strategy for the backbone to improve the accuracy of the primary results; for ablation studies, we reduce pretraining to 100 epochs to improve experimental efficiency. Unless otherwise mentioned, LS-MambaNet is built by default within the Oriented R-CNN framework. Models are trained on the combined training and validation sets and evaluated on the test set. We employ the AdamW optimizer with a weight decay of 0.05, training for 36 epochs on HRSC2016 and 12 epochs on DOTA-v1.0. The initial learning rate is 0.0004 for HRSC2016 and 0.0002 for DOTA-v1.0; on both datasets, the learning rate is tuned using cosine scheduling and a warm-up strategy. LS-MambaNet is implemented under the MMRotate framework. Training is conducted on four RTX 4090 GPUs (batch size: 8) in an Ubuntu 22.04 environment, while testing uses a single RTX 4090 GPU. All reported FLOPs are calculated with 1024 × 1024 input images, and mean average precision (mAP) is used to measure the accuracy of the compared methods.
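As a concrete illustration of the cropping step, the sketch below computes the top-left corners of overlapping 1024 × 1024 windows; the actual experiments use the standard DOTA/MMRotate image-splitting tools, so this is only a minimal re-implementation of the windowing logic.

```python
def patch_origins(height, width, patch=1024, overlap=500):
    """Top-left (y, x) corners of overlapping crops covering an image.

    Stride = patch - overlap (524 px for the settings above); the last
    window on each axis is clamped so the image border is always covered.
    """
    stride = patch - overlap
    ys = list(range(0, max(height - patch, 0) + 1, stride))
    xs = list(range(0, max(width - patch, 0) + 1, stride))
    if ys[-1] + patch < height:
        ys.append(height - patch)
    if xs[-1] + patch < width:
        xs.append(width - patch)
    return [(y, x) for y in ys for x in xs]

# Multi-scale training/testing: resize first, then crop each rescaled image.
for scale in (0.5, 1.0, 1.5):
    h = w = int(4000 * scale)
    print(scale, len(patch_origins(h, w)), "patches")
```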
Table 1 lists the detailed configuration of LS-MambaNet used in this paper.
Our experimental evaluation metrics include mean average precision (mAP), frames per second (FPS), model parameter count (Params), and model floating point operations (FLOPs).
mAP is a comprehensive metric obtained by averaging the per-class AP values, where each AP is computed by integrating the area under the precision-recall curve. mAP can therefore be calculated as
\[ \mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} p_i(r) \, \mathrm{d}r \]
Here, p represents precision, r represents recall, and N is the number of categories.
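For reference, a minimal NumPy sketch of the VOC-style all-points AP computation (the integral above) follows; averaging the per-class APs over the N categories gives the mAP. This is the standard formulation, not code from the paper.

```python
import numpy as np

def average_precision(precision, recall):
    """All-points AP: area under the precision-recall curve.

    precision, recall: per-detection values sorted by descending confidence.
    """
    p = np.concatenate(([0.0], precision, [0.0]))
    r = np.concatenate(([0.0], recall, [1.0]))
    for i in range(len(p) - 2, -1, -1):     # monotone precision envelope
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]      # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

aps = [average_precision(np.array([1.0, 1.0, 0.67]),
                         np.array([0.33, 0.67, 0.67]))]
print(sum(aps) / len(aps))                  # mAP = mean of per-class APs
```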
FPS refers to the number of frames the model processes in one second; a higher FPS indicates faster inference speed. It is calculated as
\[ \mathrm{FPS} = \frac{1}{\mathrm{Latency}} \]
Here, Latency refers to the model’s computation time, which is the average time required for the model to process a single image.
The Params value and the FLOPs are calculated by feeding the model the same batch of images. Params refers to the total number of parameters in the model, while FLOPs measure computational complexity, i.e., the number of floating-point operations required to process the given input.
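The sketch below shows one common way to obtain these numbers: per-image latency averaged over repeated forward passes (with warm-up and CUDA synchronization), the derived FPS, and the parameter count. It assumes a CUDA device and uses a dummy model; the paper's exact benchmarking script is not shown.

```python
import time
import torch

@torch.no_grad()
def measure(model, inputs, warmup=10, iters=100):
    """Return (average per-image latency in seconds, FPS = 1 / latency)."""
    model.eval()
    for _ in range(warmup):              # warm-up to stabilize clocks/caches
        model(inputs)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(inputs)
    torch.cuda.synchronize()             # wait for all queued CUDA kernels
    latency = (time.perf_counter() - start) / iters
    return latency, 1.0 / latency

model = torch.nn.Conv2d(3, 16, 3, padding=1).cuda()     # dummy stand-in
x = torch.randn(1, 3, 1024, 1024, device="cuda")        # 1024 x 1024 input
lat, fps = measure(model, x)
params_m = sum(p.numel() for p in model.parameters()) / 1e6  # Params (M)
print(f"{lat * 1000:.2f} ms/img, {fps:.1f} FPS, {params_m:.3f} M params")
```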
4.3. Comparison with the State-of-the-Art
Results for DOTA-v1.0: We compare LS-MambaNet with 12 state-of-the-art methods on the DOTA-v1.0 dataset. The reported DOTA results are obtained by submitting the predicted outcomes to the official evaluation server. As shown in Table 2, our LS-MambaNet achieves state-of-the-art performance with an mAP of 79.89%, improving on PKINet-S and LSKNet-S by 1.5% and 1.81%, respectively. This demonstrates the effectiveness and efficiency of the proposed LS-MambaNet model and its high recognition capability for complex objects in remote sensing images.
Figure 4 and Figure 5 display the line charts of detection accuracy by category and the bar charts of overall detection accuracy, respectively. Figure 6 presents the confusion matrix generated by the model on the DOTA-v1.0 test set, where rows represent the true categories and columns the predicted categories; the diagonal elements give the accuracy with which the model correctly classifies each category.
Results for HRSC2016: The ship instances in HRSC2016 exhibit significant variations in aspect ratio, orientation, and size within the same category, posing a great challenge to the detector's generalization and accuracy. We evaluate LS-MambaNet on the HRSC2016 dataset against nine state-of-the-art methods. As shown in Table 3, the mAP values under the PASCAL VOC 2007 and VOC 2012 metrics are 90.62% and 98.46%, respectively, demonstrating that LS-MambaNet is more robust than other methods in recognizing objects with large scale variations. Figure 7 shows the bar charts of the mAP (07) and mAP (12) metrics for the compared methods on the HRSC2016 dataset.
Overall, the experimental results demonstrate that the proposed method not only achieves significant performance improvements on the DOTA-v1.0 dataset but also delivers excellent detection performance on the HRSC2016 dataset. LS-MambaNet enhances the accuracy of remote sensing object detection under normal conditions primarily for the following reasons: (1) the LSF Block combines a grouping fusion strategy with large strip convolutions to adaptively adjust the receptive field, enhancing spatial feature extraction for high-aspect-ratio objects and effectively capturing their contextual information; (2) the MGSpa-Mamba Block optimizes computational efficiency through a multi-granularity strategy and reduces functional redundancy across scanning paths, thereby modeling the global contextual information of detection targets more efficiently. Detection visualizations for the DOTA-v1.0 and HRSC2016 datasets are shown in Figure 8 and Figure 9, respectively.
4.4. Ablation Study
Our proposed LS-MambaNet consists of two key components: the LSF Block and the MGSpa-Mamba Block. We therefore conduct the following experiments, using O-RCNN [6] as the baseline model.
4.4.1. Analysis of the Grouping Fusion Strategy in the LSF Block
As shown in Table 4, we investigate the kernel design of the grouping fusion strategy in the LSF Block. The results indicate that using only small kernels (1 × 1 and 3 × 3) leads to suboptimal detection performance due to limited texture information extraction. With the (1,3,5,7) grouping kernel structure, covering kernel sizes from 1 × 1 to 7 × 7, the model achieves the best performance, with an mAP of 79.89% and the highest FPS of 22.9. When the grouped kernels are enlarged to (3,5,7,9), the additional large kernels increase the computational complexity and parameter count (by 1.06 M parameters), drop the FPS from 22.9 to 19.2, and decrease the mAP by 0.56%. This suggests that an excessively large grouped kernel configuration may introduce background noise, degrading performance while also slowing inference.
These findings demonstrate that the (1,3,5,7) grouping fusion strategy achieves the best trade-off between speed and accuracy, maximizing both the FPS and mAP. It effectively captures multi-scale contextual information while avoiding excessive computational overhead, making it the optimal choice for balancing detection precision and inference efficiency.
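As an illustration of this grouping fusion idea, the following PyTorch sketch splits the channels into four groups, applies depthwise convolutions of sizes 1, 3, 5, and 7, and fuses the results with a pointwise convolution; it is our minimal reading of the strategy rather than the authors' exact module.

```python
import torch
import torch.nn as nn

class GroupedKernelFusion(nn.Module):
    """Sketch of a (1,3,5,7) grouping fusion over the channel dimension."""

    def __init__(self, channels: int, kernels=(1, 3, 5, 7)):
        super().__init__()
        assert channels % len(kernels) == 0
        g = channels // len(kernels)            # channels per group
        self.branches = nn.ModuleList(
            nn.Conv2d(g, g, k, padding=k // 2, groups=g) for k in kernels
        )                                       # one depthwise conv per group
        self.fuse = nn.Conv2d(channels, channels, 1)  # pointwise fusion

    def forward(self, x):
        groups = x.chunk(len(self.branches), dim=1)
        out = torch.cat([b(g) for b, g in zip(self.branches, groups)], dim=1)
        return self.fuse(out)

y = GroupedKernelFusion(64)(torch.randn(1, 64, 128, 128))
print(y.shape)  # torch.Size([1, 64, 128, 128]), spatial size preserved
```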
4.4.2. Analysis of the Strip Convolution in the LSF Block
Firstly, we analyze the impact of different strip convolution kernel sizes on model performance, as shown in Table 5. The two parameters in the first column denote the kernel lengths of the horizontal and vertical strip convolutions, respectively. The results indicate that smaller kernels (e.g., lengths 3 and 5) struggle to capture long-range dependencies, especially for elongated objects in the vertical and horizontal directions, leading to a decline in detection accuracy. In contrast, larger kernels (e.g., lengths 7 and 11) improve detection accuracy by capturing more distant contextual information, enhancing overall model performance. Specifically, a kernel size of (11,11) achieves the highest mAP (79.89%) while also maintaining the best inference speed (22.9 FPS), demonstrating its efficiency in both accuracy and computational performance.
Notably, while increasing the kernel size from (3,3) to (5,5) improves FPS from 20.5 to 22.6, the FPS fluctuates slightly when moving to (7,7) (21.5 FPS) before reaching its peak at (11,11) (22.9 FPS). This trend suggests that moderately increasing the kernel size enhances both inference speed and detection performance by efficiently capturing long-range dependencies without introducing excessive computational overhead. Based on these findings, we adopt a strip convolution with a kernel size of (11,11) as it provides the optimal trade-off between accuracy and speed, ensuring robust performance in remote sensing object detection.
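A minimal sketch of the adopted (11,11) strip convolution pair follows, assuming the horizontal and vertical strips are depthwise and applied sequentially (the exact composition inside the LSF Block may differ); stacking a 1 × 11 and an 11 × 1 kernel covers an 11 × 11 region at roughly the cost of two strips.

```python
import torch
import torch.nn as nn

class StripConvPair(nn.Module):
    """Orthogonal large strip convolutions: 1 x k (horizontal) then
    k x 1 (vertical), both depthwise to keep the cost low."""

    def __init__(self, channels: int, k: int = 11):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, (1, k),
                                    padding=(0, k // 2), groups=channels)
        self.vertical = nn.Conv2d(channels, channels, (k, 1),
                                  padding=(k // 2, 0), groups=channels)

    def forward(self, x):
        return self.vertical(self.horizontal(x))

x = torch.randn(1, 64, 128, 128)
print(StripConvPair(64)(x).shape)  # (1, 64, 128, 128), resolution preserved
```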
4.4.3. Analysis of the MGSpa-Mamba Block
Table 6 presents the impact of the number of downsampling paths on detection performance on the DOTA-v1.0 dataset. The results show that a mixed configuration of two original-resolution paths and two downsampling paths achieves the optimal detection performance, with parameter, FPS, and mAP values of 1.79 M, 22.9, and 79.89%, respectively. This configuration strikes the best balance between feature redundancy and efficient retention of key details, preserving fine-grained semantic information in the high-resolution feature maps while leveraging the downsampling paths to capture long-range spatial dependencies, thereby validating the effectiveness of the multi-granularity scanning strategy. In contrast, when the number of downsampling paths exceeds two, detection performance systematically declines: the mAP drops to 78.17% with three paths and to 77.33% with four. The FPS shows a similar pattern, decreasing slightly from 22.9 to 21.2 when moving from two to three paths and rebounding to 22.1 at four paths, indicating that aggressive downsampling does not consistently yield faster inference. This reveals a dynamic balance between model complexity and feature fidelity: excessive downsampling reduces the computational load but causes irreversible loss of high-frequency feature information, which harms small object detection accuracy.
Under limited computational resources, an optimal number of downsampling paths exists, which in this experiment is two. This balance avoids wasting dense computational resources while maintaining the richness of the feature space, achieving both high detection accuracy and the best inference speed (22.9 FPS). Additionally, we find that the effect of the multi-granularity fusion strategy on the parameter count is non-linear: when the number of downsampling paths exceeds two, the reduction in parameters does not translate proportionally into performance gains, offering new optimization directions for future lightweight model designs.
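The sketch below illustrates the 2 + 2 path mixture that Table 6 finds optimal: two paths scan the full-resolution map and two scan a 2x-downsampled copy. Here `selective_scan` is an identity placeholder for the S6 kernel, each real path would additionally use a different scan direction, and the averaging merge is our simplification.

```python
import torch
import torch.nn.functional as F

def selective_scan(seq):
    """Identity placeholder for the Mamba S6 selective scan."""
    return seq

def multi_granularity_paths(x, n_full=2, n_down=2):
    """Mix of full-resolution and 2x-downsampled scanning paths."""
    B, C, H, W = x.shape
    outs = []
    for _ in range(n_full):                     # fine-granularity paths
        outs.append(selective_scan(x.flatten(2)).view(B, C, H, W))
    small = F.avg_pool2d(x, 2)                  # coarse-granularity copy
    for _ in range(n_down):
        out = selective_scan(small.flatten(2)).view(B, C, H // 2, W // 2)
        outs.append(F.interpolate(out, size=(H, W), mode="bilinear",
                                  align_corners=False))
    return torch.stack(outs).mean(0)            # merge all paths

print(multi_granularity_paths(torch.randn(1, 8, 64, 64)).shape)  # (1, 8, 64, 64)
```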
4.4.4. Synergy Analysis Between LSF and MGSpa-Mamba
In addition to intra-module parameter tuning, we further investigate the individual and combined contributions of the LSF Block and the MGSpa-Mamba Block to overall performance. Three experimental configurations are compared: (1) baseline + LSF only, (2) baseline + MGSpa-Mamba only, and (3) the full LS-MambaNet (both modules). The results on the DOTA-v1.0 dataset are shown in Table 7.
From the results, we observe that introducing the LSF Block alone improves the baseline by +1.94% mAP, while the MGSpa-Mamba Block alone brings a +2.18% gain. With both modules incorporated, the final performance reaches 79.89% mAP, a cumulative improvement of +4.02% over the baseline. Specifically, the LSF Block effectively expands the local receptive field through large strip convolutions, which is especially suited to feature extraction for high-aspect-ratio targets, while the MGSpa-Mamba Block improves long-range contextual modeling through multi-granularity state space modeling. The two complement each other, allowing the model to extract local and global information in remote sensing images more comprehensively and thus detect objects more accurately.
Although our LS-MambaNet significantly improves the detection accuracy, we observe a moderate decrease in the FPS compared to the baseline model. This phenomenon is mainly attributed to the additional computational complexity introduced by the LSF Block and the MGSpa-Mamba Block.
Specifically, the LSF Block incorporates multiple grouped convolutions with different kernel sizes and employs strip convolutions (e.g., 11 × 1 and 1 × 11) to enhance the receptive field, resulting in a slight increase in the number of per-layer operations. The MGSpa-Mamba Block adopts a multi-granularity spatial modeling strategy, where features are processed across multiple downsampled paths with selective state space scanning, thereby increasing the processing steps and memory access overhead. While these modules introduce more computations, they enable the network to capture richer spatial context and long-range dependencies, which are crucial for remote sensing object detection.
Overall, despite a decrease in the FPS (from 25.4 FPS to 22.9 FPS), the trade-off is reasonable given the significant improvement in the mAP (+4.02%), and LS-MambaNet still maintains competitive inference efficiency compared to other state-of-the-art methods.
4.4.5. Performance Analysis of the LS-MambaNet Backbone in Different Detection Frameworks
For a fair comparison, all detection frameworks are trained with the same hyperparameter settings, including the optimizer, learning rate schedule, batch size, and data preprocessing methods, following the settings described in Section 4.2.
To validate the generalizability and effectiveness of the proposed LS-MambaNet backbone, we conduct evaluations across various remote sensing object detection frameworks, including two-stage frameworks (RoI Transformer, Rotated Faster R-CNN, and O-RCNN) and single-stage frameworks (Rotated FCOS, R3Det, and S2A-Net). As shown in Table 8, the experimental results demonstrate that LS-MambaNet significantly improves detection performance compared to the classic ResNet-50 backbone. This shows that LS-MambaNet is widely applicable across detection frameworks rather than limited to a specific one, and generalizes to various detection tasks.
Figure 10 further illustrates the mAP and parameter results of our LS-MambaNet across these detection frameworks.
5. Conclusions
Due to the significant scale variation and high-aspect-ratio characteristics of objects in remote sensing images, designing a network architecture that efficiently captures comprehensive contextual information is an important and challenging research topic. To address this, this paper has introduced a universal backbone network, LS-MambaNet, to tackle the challenges posed by large object scale variations and high aspect ratios in remote sensing object detection. We proposed the Large Strip Convolution Fusion Block, which uses a grouping fusion strategy and large strip convolutions to adaptively adjust the receptive field for feature extraction, effectively capturing contextual information from multi-scale features and excelling in remote sensing images that contain many high-aspect-ratio objects. Additionally, to alleviate the performance bottleneck caused by functional redundancy in selective scanning strategies, we proposed the Multi-Granularity Spatial Mamba Block, which adopts a multi-granularity scanning strategy that effectively reduces computational costs and mitigates functional redundancy across scanning paths, thus efficiently modeling the global contextual information of detection targets.
Experimental results demonstrate that LS-MambaNet achieves a 79.89% mAP and 22.9 FPS on the DOTA-v1.0 dataset, outperforming the baseline by +4.02% in accuracy while maintaining efficient inference. On the HRSC2016 dataset, our model reaches a 90.62% mAP under the VOC 2007 metric, demonstrating strong generalization across diverse remote sensing scenarios. Although the proposed network achieves satisfactory results in addressing insufficient contextual information extraction for high-aspect-ratio objects in remote sensing images, limitations remain, and several directions merit future exploration:
The experiments have only been validated on general remote sensing datasets such as DOTA-v1.0 and HRSC2016; the model's robustness has not been tested in special scenarios (such as cloud or fog occlusion or nighttime infrared imagery) or on new sensor data (such as SAR and hyperspectral images).
The current multi-granularity scanning strategy uses fixed granularity window divisions. Future work could dynamically adjust the scanning paths based on target size or local image complexity, which would lead to a more refined modeling of small targets or dense areas.
Future work could explore the application of the method in 3D remote sensing object detection frameworks, incorporating Digital Surface Model (DSM) data and extending strip convolutions to the height dimension for building height estimation and 3D object detection.
There is potential for constructing a multi-modal Mamba architecture, integrating multi-source data such as visible light, SAR, and LiDAR, and enhancing detection robustness in complex environments through cross-modal interaction modules (such as cross-attention).
LS-MambaNet may still face challenges in extreme scenarios such as those involving dense small objects or cloud occlusion, which we aim to address through dynamic receptive field adjustment and multi-modal feature integration in future work.