CMNet: Global–Local Feature Fusion CNN-Mamba Network for Remote Sensing Object Detection

Liu, Jin; Li, Liangliang; Zhao, Xiaobin; Lv, Ming; Jia, Zhenhong; Zhang, Xueyu; Vivone, Gemine; Ma, Hongbing

doi:10.3390/rs18040591

by Jin Liu ¹, Liangliang Li ², Xiaobin Zhao ³, Ming Lv ⁴, Zhenhong Jia ⁴, Xueyu Zhang ⁵, Gemine Vivone ⁶ and Hongbing Ma ⁷,*

¹ School of Intelligence Science and Technology, Xinjiang University, Urumqi 830017, China

² School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China

³ School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China

⁴ School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China

⁵ School of Computer and Electronic Information, Guangxi University, Nanning 530004, China

⁶ Institute Methodologies for Environmental Analysis, National Research Council, CNR-IMAA, 85050 Tito, Italy

⁷ Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(4), 591; https://doi.org/10.3390/rs18040591

Submission received: 14 January 2026 / Revised: 10 February 2026 / Accepted: 11 February 2026 / Published: 13 February 2026

(This article belongs to the Special Issue Advances in Detection-Oriented Multi-Sensor Fusion Beyond the Visible Spectrum)

Highlights

What are the main findings?

A novel CNN-Mamba network (CMNet) is proposed, synergizing both architectures to generate complementary global–local features for remote sensing object detection, while an FCC module addresses feature representational differences spatially and channel-wise.
Experimental results show that CMNet achieves 79.38% mAP50 on the DOTA v1.0 dataset and 90.60% mAP50 on the HRSC2016 dataset, outperforming other state-of-the-art approaches.

What are the implications of the main findings?

CMNet, which integrates CNN and Mamba, proposes a new paradigm for remote sensing object detection. It overcomes the limitations of single-architecture models and validates the value of fusing local feature extraction capabilities with global receptive fields.
The FCC module effectively addresses the disparity of heterogeneous features and exhibits strong extensibility for other hybrid networks. CMNet’s superior performance not only facilitates the practical application of remote sensing object detection but also highlights the great potential of Mamba in this field.

Abstract

Remote sensing object detection (RSOD) still faces significant challenges, including the vast field of view of remote sensing images, the wide variety of target categories, and complex backgrounds. Traditional methods for processing remote sensing images are limited in this context. Convolutional neural networks (CNNs) can expand the receptive field by using kernels of different sizes, but larger kernels increase the number of parameters and introduce noise. Vision Transformers (ViTs) achieve global receptive fields through global attention, but their quadratic computational complexity makes high-resolution images costly to process. Recently, Mamba has gained prominence in image processing. Its four-directional scanning mechanism attends to regions of interest from multiple directions while maintaining linear complexity and a global receptive field. In this work, we propose a new CNN–Mamba network (CMNet) that exploits the advantages of both architectures. Specifically, we employ VMamba (VM) to extract global semantic features from images. Moreover, we design a multi-scale local feature extraction (MLFE) module, which captures local texture information and edge details through a local feature extraction (LFE) module and a global attention module (GAM). Together, VMamba and MLFE produce complementary global–local features. To address the representational differences between these two kinds of features, we further design a feature cross-complementary (FCC) module, which lets each feature stream complement the other and reduces the disparity between them. CMNet achieves 79.38% mAP50 on the DOTA v1.0 dataset and 90.60% mAP50 on the HRSC2016 dataset, outperforming existing state-of-the-art approaches.

Keywords:

remote sensing object detection (RSOD); VMamba; convolutional neural networks (CNNs); feature fusion

1. Introduction

In recent years, remote sensing object detection (RSOD) has garnered significant attention. Detection tasks in this field focus on identifying specific objects within remote sensing images and determining their categories and locations [1,2]. Targets in remote sensing images typically exhibit arbitrary orientations and thus contain rich angular information that horizontal bounding boxes cannot express [3]. Consequently, the RSOD task involves generating bounding boxes precisely aligned with an object’s orientation. This technology finds extensive applications across various domains, including plant protection, wildlife conservation, urban monitoring, and maritime rescue [4]. Moreover, recent advances in deep learning have substantially driven progress in this research area. However, while previous studies have improved detection accuracy through techniques such as Oriented Bounding Box (OBB) detectors [5,6] and neck-level feature fusion within the Feature Pyramid Network (FPN) [7], feature extraction in RSOD remains under-explored.

Remote sensing images primarily consist of bird’s-eye views captured from high altitudes by satellites and drones, providing rich surface object information [8]. Due to their elevated acquisition positions, these images typically feature expansive fields of view and high resolution, resulting in highly complex categories. Object scales range from small vehicles to aircraft, and many objects possess intricate background contexts—such as athletic fields, soccer fields, and bridges. This presents significant challenges in RSOD, requiring models to balance attention to capture information at both local and global scales. To address the challenges of large object scale variations and complex backgrounds, previous studies have introduced data augmentation techniques. These methods enhance model robustness to feature variations by transforming object scales [9,10]. Other approaches fuse features across scales and leverage feature pyramid hierarchies to integrate multi-scale object information, extracting rich scale-dependent features to enhance detection accuracy [11,12]. Regarding remote sensing image classification, existing studies enhance the accuracy of classification and detection by fusing features [13,14,15,16,17]. However, the loss of spatial information is often irreversible. For instance, while the FPN restores feature map dimensions through upsampling, bilinear interpolation may introduce false textures, potentially leading to localization errors.

Recent RSOD research has addressed challenges by employing multi-scale convolutional kernels and Transformers. PKINet leverages a set of multi-scale convolution kernels to represent targets appearing at various spatial scales [18]. Small kernels detect fine details, while large kernels cover broader areas, thereby overcoming the limitations of single-scale kernels. Furthermore, convolutional kernels of varying sizes correspond to distinct receptive fields, enabling the effective capture of both local target textures and surrounding contextual information. This integration of multi-scale local contexts enhances feature richness [19,20,21]. However, features extracted by multi-scale kernels exhibit semantic and spatial scale differences, which necessitate the design of effective fusion mechanisms to prevent feature redundancy or information conflicts.

Transformers [22] excel at modeling long-range dependencies and capturing global contextual information, but their heavy computational cost constrains their deployment on high-resolution remote sensing imagery [23,24]. To address this issue, a variety of hybrid architectures integrating CNNs with Transformers have been developed. For example, DConvTrans-LKA combines the strengths of both CNNs and Transformers [25]. DConvTrans-LKA addresses the limitations of traditional CNNs in associating small targets with global context and the insufficient detail extraction of Transformers by introducing a dynamic convolution (DConv) module for local fine-grained feature extraction and simultaneously employing Transformer-based self-attention to extract long-range contextual dependencies. This achieves feature acquisition across local to global scales. While DConvTrans-LKA demonstrates significant advantages in detection accuracy and feature fusion capability within the RSOD domain, further optimization is needed in terms of computational efficiency and generalization performance.

Despite the remarkable advances achieved by these approaches, issues related to computational efficiency and precise feature alignment remain unresolved. Although PKINet balances local texture and contextual information through multi-scale convolutions, it fails to capture global long-range dependencies [18]. While DConvTrans-LKA combines both CNN and Transformer architectures, its quadratic computational complexity limits its applicability to high-resolution remote sensing images, and the fusion of these heterogeneous features lacks a unified alignment mechanism [25]. In contrast, Mamba’s linear complexity is well-suited for high-resolution remote sensing images [26,27,28], and VMamba’s four-directional scanning mechanism effectively captures global semantic information [29,30]. Therefore, we propose a novel CNN–Mamba network (CMNet), where Mamba provides global feature extraction, MLFE enhances the acquisition of local details, and the FCC module integrates complementary dual features. More specifically, the MLFE module consists of three convolutional kernels of varying sizes for local feature extraction. The VMamba module effectively captures contextual information and extracts global semantic features with high computational efficiency and long-sequence processing capabilities. Finally, the FCC module combines these features. Our innovative contributions are summarized as follows:

We leverage VMamba’s linear complexity and four-directional scanning mechanism to efficiently capture long-range context and global dependencies in remote sensing images. Moreover, we design an MLFE module to accurately extract local structural details, such as edges and textures, through multi-scale parallel depthwise-separable convolutions and gated attention, thereby establishing an initial complementarity between global and local features.
To address the potential representational discrepancies and redundancy between Mamba’s global sequence modeling and CNN modeling, we propose an FCC module. This module aims to reduce representational differences between the global contextual features extracted by VMamba and the local detail features extracted by the proposed MLFE. More importantly, this module removes redundant information from features, refining them into more concise, complementary, and discriminative representations that provide a solid foundation for subsequent high-precision object detection.
We conduct extensive experiments on two widely used and challenging public benchmarks for remote sensing object detection, namely the DOTA v1.0 [8] and HRSC2016 [31] datasets. The experimental results demonstrate the effectiveness of the proposed CMNet architecture and its core components, showing that it achieves competitive performance on RSOD tasks.

2. Related Work

2.1. Architectures for RSOD

In early studies, two-stage methods adopted a core workflow that first generated candidate boxes and then refined them with fine-grained predictions. For example, Ma et al. [32] proposed a rotated Region Proposal Network (RPN) that pioneered the integration of rotated anchor boxes into the RPN module of Faster R-CNN [33], laying the foundational architecture for adapting two-stage methods to rotated objects. The RoI Transformer [34] addressed feature misalignment by converting horizontal RoIs into rotated RoIs via a spatial transformation network. Oriented R-CNN [35] systematically optimized the RPN-RoI alignment process, establishing a standard two-stage detection paradigm. As the demand for faster and more efficient detection grew, research shifted towards single-stage methods that offer a better balance between detection accuracy and speed. Driven by this need, single-stage approaches based on RetinaNet [36] emerged as a research hotspot. RetinaNet’s focal loss and FPN architecture provided the core framework for rotated single-stage detectors. Among these, R3Det [37], as the first rotated single-stage architecture built upon RetinaNet, achieved multi-stage optimization and end-to-end detection of rotated boxes through cascaded refinement modules and feature refinement modules. Han et al. [38] proposed S²A-Net, which further designed a single-pass alignment mechanism based on RetinaNet and R3Det. This approach simultaneously generates rotated anchor boxes and performs feature alignment, enhancing its adaptability to arbitrarily oriented targets.

Multi-scale feature fusion is pivotal for addressing scale variations in object detection. The FPN [7] establishes a milestone by introducing a top-down pathway and lateral connections to construct a feature pyramid with enriched semantics. To further enhance information flow, PANet [39] integrates an additional bottom-up path, effectively shortening the propagation distance of low-level localization cues. BiFPN [40] optimizes this structure via weighted feature fusion and repeated bidirectional blocks, achieving superior efficiency. Shifting from manual design to automated search, NAS-FPN [41] leverages neural architecture search to discover optimal cross-scale topologies. Recent advancements focus on refining fusion quality; for instance, AugFPN [42] mitigates semantic gaps and information loss through consistency supervision and residual feature augmentation. More recently, HS-FPN [43] incorporates a hierarchical scale-aware mechanism to selectively refine multi-scale contexts, further bolstering the robustness of feature representations.

As the core module for extracting discriminative features, the backbone network has attracted considerable attention. For instance, LSKNet [44] employs convolutions with very large kernels to enlarge the effective receptive field for detecting large objects. However, this design may introduce more background interference when handling small targets. Dilated convolutions effectively expand the receptive field but often result in sparse feature representations. To address the drawbacks of dilated convolutions, PKINet [18] proposes dilation-free inception-style multi-scale depthwise convolutions, balancing local texture and contextual information. ARC [45] employs adaptive rotated convolutions to handle orientation variations, while DecoupleNet [46] and LWGANet [47] focus on efficient feature decoupling for lightweight deployment. To specifically address issues such as low contrast, structural discontinuities, and blurred feature responses in low-quality remote sensing images, LEGNet [48] introduces an Edge-Gaussian Aggregation (EGA) module, which fuses deep learning with traditional hand-crafted filters to effectively enhance feature representations of low-quality targets.

2.2. The State Space Models and Mamba

Mamba [49,50,51], an advanced architecture based on state-space models (SSMs), achieves parameter adaptation through selective scanning and hardware-aware optimization. Its computational efficiency enables efficient high-resolution remote sensing image processing [26], overcoming the computational bottlenecks of Transformers in handling high-resolution data. Mamba successfully circumvents the quadratic complexity barrier of Transformers [52], emerging as a key alternative for modeling long-range dependencies.

In recent years, methods based on the Mamba architecture have demonstrated significant advantages across various tasks in remote sensing and computer vision. To mitigate modality discrepancies in multispectral oriented object detection, Zhou et al. [53] and Ren et al. [54] leveraged the Mamba framework and independently introduced two modules: Disparity-guided Cross-modal Fusion Mamba (DCFM) and Cross-modal Fusion Mamba (CFM) to achieve efficient cross-modal feature fusion in complex remote sensing scenarios. However, these approaches lack a fusion mechanism tailored to the intrinsic differences between global and local features of remote sensing targets. In contrast, our CMNet achieves spatial complementarity between Mamba and multi-scale CNN local features through the synergy of the MLFE and the FCC modules. In hyperspectral target detection, HTD-Mamba [55] pioneered Mamba’s application by integrating the Spatial Encoding Spectral Augmentation (SESA) technique with the pyramidal SSMs. Similarly, for hyperspectral image classification, ConvMamba [56] presents a hybrid architecture that combines CNNs with Mamba blocks. By leveraging the local inductive bias of convolutions alongside the long-range sequence modeling of Mamba, it effectively captures the intricate spectral-spatial relationships inherent in hyperspectral cubes. In image segmentation, Mamba-UNet [57] introduces a Dual-branch Mamba Fusion Module (DMFM) and a multi-scale spatio-temporal attention module. Combined with the Dynamic Quantile Weighting Loss (DQWL), it optimizes the prediction performance of radar echo sequences in precipitation nowcasting, demonstrating exceptional capability in capturing extreme precipitation events.

3. Methodology

This section describes the overall architecture of the proposed model. As illustrated in Figure 1, CMNet comprises three main parts: the MLFE module that employs multi-kernel CNNs to capture local fine-grained structures, the VM block responsible for modeling long-range global dependencies, and the FCC module designed to properly combine these different kinds of features.

3.1. Mamba

SSMs represent system states as a set of linear equations. They can be modeled as linear time-invariant (LTI) dynamical systems, where $x(t) \in \mathbb{R}$ is the input to the system. After processing through the hidden state $h(t) \in \mathbb{R}^{N}$, the output $y(t) \in \mathbb{R}$ is obtained, where $N$ is the dimension of the hidden state. The state transition matrix and the projection parameters are $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$, and $D \in \mathbb{C}$. This system can be described by the following set of equations:

$$h'(t) = A h(t) + B x(t), \tag{1}$$

$$y(t) = C h(t) + D x(t). \tag{2}$$

Since SSMs are continuous in time, they cannot be applied directly to deep learning tasks such as image recognition, which operate on discrete data, and must first be discretized. Discretization amounts to solving the linear differential equation in (1) over one sampling interval. A common discretization method in control theory is the zero-order hold (ZOH). For the time-scale parameter $\Delta$, the parameters $A$ and $B$ of the continuous system are converted to their discrete form as follows:

$$\bar{A} = \exp(\Delta A), \tag{3}$$

$$\bar{B} = (\Delta A)^{-1} (\exp(\Delta A) - I) \cdot \Delta B, \tag{4}$$

where $(\cdot)^{-1}$ is the matrix inverse, $\exp(\cdot)$ is the matrix exponential, and $I$ is the identity matrix. Hence, Equation (1) can be rewritten as:

$$h_{t} = \bar{A} h_{t-1} + \bar{B} x_{t}, \tag{5}$$

where $h_{t}$, $h_{t-1}$, and $x_{t}$ are the discretized versions of $h(t)$, $h(t-1)$, and $x(t)$.
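As an illustration of the discretization step, the following minimal sketch applies the standard zero-order-hold form, $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B$, and then runs the discrete recurrence. It assumes a diagonal $A$ with nonzero entries, so the matrix exponential and inverse reduce to element-wise operations. The function names and the numerical values are hypothetical and are not taken from the CMNet implementation.

```python
import math

def zoh_discretize(a_diag, b, delta):
    """Zero-order-hold discretization for a diagonal state matrix A.

    Assumes every entry of a_diag is nonzero, so that
      A_bar[n] = exp(delta * a[n])
      B_bar[n] = (exp(delta * a[n]) - 1) / (delta * a[n]) * delta * b[n]
    """
    a_bar = [math.exp(delta * a) for a in a_diag]
    b_bar = [(math.exp(delta * a) - 1.0) / (delta * a) * delta * bn
             for a, bn in zip(a_diag, b)]
    return a_bar, b_bar

def ssm_scan(a_bar, b_bar, c, d, xs):
    """Run h_t = A_bar h_{t-1} + B_bar x_t and y_t = C h_t + D x_t."""
    h = [0.0] * len(a_bar)
    ys = []
    for x in xs:
        h = [ab * hn + bb * x for ab, hn, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(cn * hn for cn, hn in zip(c, h)) + d * x)
    return ys

# Hypothetical two-state system driven by a constant input.
a_diag = [-1.0, -2.0]
b = [1.0, 1.0]
c = [1.0, 0.5]
a_bar, b_bar = zoh_discretize(a_diag, b, delta=0.1)
ys = ssm_scan(a_bar, b_bar, c, d=0.0, xs=[1.0] * 200)
```

For a constant input the zero-order hold is exact, so the output settles at the continuous steady state $-C A^{-1} B = 1.25$ for these values.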

In computer vision, images are typically represented as two-dimensional data. To overcome the limitations of the 1D processing of SSMs, VM introduces the 2D selective scanning (SS2D) mechanism as shown in Figure 2. Specifically, the SS2D mechanism generates four sets of one-dimensional sequences by scanning along the four directions of the image. These generated sequences are then processed by the S6 module, which serves as a critical auxiliary component within VM’s SS2D mechanism. S6 is responsible for feature enhancement and dimensionality alignment of the one-dimensional sequences. It adopts a lightweight residual structure to fuse local and global feature information, effectively mitigating the loss of spatial details during the 2D-to-1D scanning process. Moreover, S6 optimizes the feature representation of the sequences by introducing adaptive normalization, making the subsequent SSM processing more efficient and accurate in capturing image context features. Following this enhancement, the sequences are fed into the SSM for processing. Finally, the processed sequences are merged and reconstructed into a 2D image, restoring the 2D spatial structure. This method addresses the limitations of Mamba’s application in computer vision.
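The scan-and-merge idea can be sketched on a small grid. The sketch below assumes the four directions are row-major, column-major, and the reverses of both, and it replaces the per-sequence processing with the identity so that only the flattening and the reconstruction are shown. The function names are hypothetical.

```python
def scan_four_directions(grid):
    """Flatten an H x W grid into four 1D sequences.

    Assumed directions: row-major, column-major, and the reverses of both.
    """
    h, w = len(grid), len(grid[0])
    row_major = [grid[r][c] for r in range(h) for c in range(w)]
    col_major = [grid[r][c] for c in range(w) for r in range(h)]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]

def merge_four_directions(seqs, h, w):
    """Undo each scan order and sum the four sequences into an H x W grid."""
    row_major, col_major, row_rev, col_rev = seqs
    row_rev = row_rev[::-1]  # restore row-major order
    col_rev = col_rev[::-1]  # restore column-major order
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            out[r][c] = (row_major[r * w + c] + row_rev[r * w + c]
                         + col_major[c * h + r] + col_rev[c * h + r])
    return out

grid = [[1, 2, 3],
        [4, 5, 6]]
seqs = scan_four_directions(grid)
# In a real model each sequence would be processed before merging;
# here the processing step is the identity.
merged = merge_four_directions(seqs, 2, 3)
```

Because the processing step is the identity here, every position of the merged grid is simply four times its input value, which confirms that the four scan orders are inverted correctly.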

3.2. VMamba

As shown in Figure 2, the VM backbone serves as the global feature extraction module. Specifically, the input image $F \in \mathbb{R}^{B \times H \times W}$ is first processed by the $\mathrm{Stem}$ block, which divides it into patches $F^{in} \in \mathbb{R}^{B \times \frac{H}{p} \times \frac{W}{p}}$, where $p$ is a parameter related to the patch size, and $B$, $W$, and $H$ denote the number of spectral bands, the width, and the height of the input image, respectively. The feature maps produced by $\mathrm{Stem}$ are fed into the VM block for feature extraction, followed by the $\mathrm{FCC}$ block and the $\mathrm{DS}$ block, which downsamples by a factor of two. This process is repeated across four stages, as shown in the upper network of Figure 1. Finally, the feature map obtained at each of the four stages is added to the corresponding feature map provided by the $\mathrm{MLFE}$ block, and the resulting sums are used as the inputs to the $\mathrm{FPN}$ block. Thus, the VM block receives a different input at each stage. More specifically, we have:

$$F_{i}^{VM_{in}} = \begin{cases} F^{in}, & \text{if } i = 0, \\ \mathrm{DS}(F_{i}^{fuse}), & \text{if } i > 0, \end{cases} \tag{6}$$

where $F_{i}^{fuse}$ is the output of the FCC block at stage $i$.
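To make the four-stage layout concrete, the sketch below computes the spatial size of the feature map entering each stage. The patch parameter is not given in the text, so `patch=4` is an assumption chosen only for illustration, and `stage_resolutions` is a hypothetical helper.

```python
def stage_resolutions(height, width, patch=4, num_stages=4):
    """Spatial size of the feature map entering each stage.

    The stem reduces the input by `patch` (assumed value); each later
    stage input is downsampled by a factor of two.
    """
    h, w = height // patch, width // patch
    sizes = []
    for _ in range(num_stages):
        sizes.append((h, w))
        h, w = h // 2, w // 2
    return sizes

sizes = stage_resolutions(1024, 1024)
```

Under this assumption, a $1024 \times 1024$ input yields stage inputs of $256$, $128$, $64$, and $32$ pixels per side.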

The VM backbone can be described as follows:

$$G_{i}^{1} = \mathrm{SiLU}(\mathrm{DWConv}(\mathrm{Linear}(\mathrm{LN}(F_{i}^{VM_{in}})))), \tag{7}$$

$$G_{i}^{2} = \mathrm{Linear}(\mathrm{LN}(\mathrm{SS2D}(G_{i}^{1}))), \tag{8}$$

$$G_{i}^{3} = G_{i}^{2} + F_{i}^{VM_{in}}, \tag{9}$$

$$F_{i}^{VM_{out}} = \mathrm{FFN}(\mathrm{Linear}(G_{i}^{3})) + G_{i}^{3}, \tag{10}$$

where $\mathrm{Linear}$ is a linear layer, $\mathrm{LN}$ denotes layer normalization, $\mathrm{FFN}$ stands for a feed-forward network, $\mathrm{DWConv}$ denotes the depth-wise separable convolution, and $\mathrm{SiLU}$ is the sigmoid-weighted linear unit activation function. $\mathrm{SS2D}$ denotes 2D selective scanning, the core mechanism introduced in the VM architecture to address the sequential processing constraints of state space models (SSMs) and to adapt them to two-dimensional image tasks in computer vision. At the $i$-th stage, the input feature map $F_{i}^{VM_{in}}$ undergoes the VM processing to generate the global feature map $F_{i}^{VM_{out}}$.

3.3. MLFE

As shown in Figure 3, the MLFE module is inspired by PKINet, with targeted modifications. It comprises two submodules: the LFE module for local feature extraction and the GAM. As with the VM block, the MLFE block receives a different input at each stage, namely $F_{i}^{MLFE_{in}} = F_{i}^{VM_{in}}$.

Starting from $F_{i}^{MLFE_{in}}$, we have:

$$L_{i}^{1} = \mathrm{Conv}(F_{i}^{MLFE_{in}}), \tag{11}$$

$$L_{i}^{2} = L_{i}^{1} + F_{i}^{LFE} \odot F_{i}^{GAM}, \tag{12}$$

$$F_{i}^{MLFE_{out}} = \mathrm{SiLU}(\mathrm{BN}(L_{i}^{2})), \tag{13}$$

where $\mathrm{Conv}$ denotes a $1 \times 1$ convolution layer, $\mathrm{BN}$ denotes batch normalization, $F_{i}^{LFE}$ is the output of the LFE module, $F_{i}^{GAM}$ is the output of the GAM, $\odot$ denotes the point-wise multiplication operator, and $F_{i}^{MLFE_{out}}$ is the output of the MLFE block at the $i$-th stage.

The LFE module is designed to capture rich local feature information within images. To this end, inspired by PKINet, it applies several small-kernel depth-wise separable convolutions in parallel to extract local texture features at multiple scales. Specifically, as shown in the upper part of Figure 3, the LFE module builds three independent convolutional branches with three different kernel sizes. Mathematically, we have:

$$F_{i}^{LFE} = \mathrm{BN}\left(\mathrm{Conv}\left(\sum_{k=1}^{3} \mathrm{DWConv}_{k}(\mathrm{BN}(L_{i}^{1}))\right)\right), \tag{14}$$

where $k$ indexes the kernel size ($3 \times 3$ if $k = 1$, $5 \times 5$ if $k = 2$, and $7 \times 7$ if $k = 3$). Running these three convolutions in parallel yields feature maps with three different receptive fields. The three feature maps are then summed for a preliminary fusion of the information across scales. To further refine the representation, a $1 \times 1$ convolution and $\mathrm{BN}$ are applied after the summation, yielding $F_{i}^{LFE}$.
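A rough parameter count illustrates why parallel depth-wise branches are affordable. The sketch below treats each $\mathrm{DWConv}_{k}$ as a purely depth-wise $k \times k$ convolution without bias, followed by the shared $1 \times 1$ convolution, and compares this with three standard convolutions of the same kernel sizes. Both this reading of the branches and the channel width of 64 are assumptions made for illustration.

```python
def depthwise_params(channels, k):
    """Weights in a depthwise k x k convolution (one filter per channel, no bias)."""
    return channels * k * k

def standard_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (no bias)."""
    return c_in * c_out * k * k

channels = 64        # hypothetical channel width
kernels = (3, 5, 7)  # kernel sizes of the three LFE branches

lfe_branches = sum(depthwise_params(channels, k) for k in kernels)
pointwise = standard_params(channels, channels, 1)
dense_equivalent = sum(standard_params(channels, channels, k) for k in kernels)
```

With these values the three depth-wise branches plus the $1 \times 1$ convolution use 9408 weights, against 339,968 for three standard convolutions.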

The architecture of the GAM is shown in the lower part of Figure 3. Starting from the input feature maps, batch normalization is first carried out, followed by parallel average pooling and max pooling operations. To capture contextual cues at different scales and enlarge the receptive field, we then apply two dilated convolution layers with $3 \times 3$ and $5 \times 5$ kernels, which model long-range feature interactions. A subsequent $1 \times 1$ convolution aggregates these responses, and a sigmoid activation finally produces the attention weights that modulate the original features. Mathematically, we have:

$$A_{i} = \mathrm{MP}(\mathrm{BN}(L_{i}^{1})) + \mathrm{AP}(\mathrm{BN}(L_{i}^{1})), \tag{15}$$

$$F_{i}^{GAM} = \sigma(\mathrm{Conv}(\mathrm{DWConv}_{5 \times 5}^{d}(\mathrm{DWConv}_{3 \times 3}^{d}(A_{i})))), \tag{16}$$

where $\mathrm{MP}$ denotes the max pooling operator, $\mathrm{AP}$ indicates the average pooling operator, $\sigma$ is the sigmoid activation function, and $d$ represents the dilation factor of the convolution kernel (set to two in our experiments). The module first uses $\mathrm{BN}$ to compensate for variations in the batch data distribution. Average pooling then preserves global statistical properties and background information, while max pooling enhances locally salient features. The two pooled outputs are summed to obtain $A_{i}$, which contains both global and local features. Subsequently, the dilated convolutions expand the receptive field to capture long-range features without increasing the computational load relative to undilated kernels of the same size. We fuse the resulting features with a $1 \times 1$ convolution. The response is then passed through a sigmoid activation to obtain the attention weights $F_{i}^{GAM}$.
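The effect of the dilation factor can be checked with a short calculation. A $k$-tap kernel with dilation $d$ spans $d(k - 1) + 1$ positions, and stride-1 layers add their spans minus one to the receptive field. The sketch assumes stride 1 for both dilated convolutions and ignores the pooling windows, whose sizes are not stated in the text. The helper names are hypothetical.

```python
def effective_kernel(k, dilation):
    """Span covered by a k-tap kernel with the given dilation."""
    return dilation * (k - 1) + 1

def stacked_receptive_field(layers):
    """Receptive field of stride-1 layers given as (kernel, dilation) pairs."""
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf

# The two dilated convolutions of the GAM with d = 2.
gam_layers = [(3, 2), (5, 2)]
spans = [effective_kernel(k, d) for k, d in gam_layers]
rf = stacked_receptive_field(gam_layers)
```

Under these assumptions the $3 \times 3$ and $5 \times 5$ kernels span 5 and 9 positions, giving a stacked receptive field of $13 \times 13$, compared with $7 \times 7$ without dilation.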

3.4. FCC Module

CMNet uses two feature extractors: VM, whose long-sequence modeling captures global features, and the MLFE module, which ensures that local features are not neglected. To address the disparity between the two kinds of features, we developed the FCC module. This module corrects both feature sets, fully exploiting their complementary information while suppressing redundant information. The FCC module is illustrated in Figure 4.

As shown in Figure 4, the spatial cross-complementary module refines spatial features. The input local and global features are concatenated and then processed through a feed-forward network followed by a sigmoid activation function to generate attention weights. These weights are split into two components via the $\mathrm{Split}$ function. Each component is multiplied by the features of one stream to yield a correction, which is then added to the uncorrected features of the other stream. Thus, we have:

$$[W_{i}^{sg}, W_{i}^{sl}] = \mathrm{Split}(\sigma(\mathrm{FFN}(\mathrm{Cat}(F_{i}^{VM_{out}}, F_{i}^{MLFE_{out}})))), \tag{17}$$

$$\begin{cases} F_{i}^{scl} = F_{i}^{MLFE_{out}} + F_{i}^{VM_{out}} \odot W_{i}^{sg}, \\ F_{i}^{scg} = F_{i}^{VM_{out}} + F_{i}^{MLFE_{out}} \odot W_{i}^{sl}, \end{cases} \tag{18}$$

where $\mathrm{Cat}$ denotes the concatenation of the two kinds of features along the channel dimension, $\mathrm{Split}$ represents the splitting of the generated attention weights along the channel dimension, and $F_{i}^{scl}$ and $F_{i}^{scg}$ are the spatially refined local and global features at stage $i$, respectively.
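The gating in Equations (17) and (18) can be sketched for a single spatial position. The sketch below is a simplification under stated assumptions: the feed-forward network is replaced by one linear layer, features are plain Python lists of $C$ channel values, and all names and weights are hypothetical rather than taken from the CMNet code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(vec, weight, bias):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def spatial_cross_complement(f_global, f_local, weight, bias):
    """Apply the cross-complementary gating at one position with C channels."""
    c = len(f_global)
    # Concatenate along channels, project, and squash to (0, 1).
    gates = [sigmoid(z) for z in linear(f_global + f_local, weight, bias)]
    # Split the 2C gates into the global and local weight vectors.
    w_sg, w_sl = gates[:c], gates[c:]
    # Each stream keeps its own features and adds the gated other stream.
    f_scl = [l + g * w for l, g, w in zip(f_local, f_global, w_sg)]
    f_scg = [g + l * w for g, l, w in zip(f_global, f_local, w_sl)]
    return f_scl, f_scg

f_global = [1.0, 2.0]
f_local = [0.5, -1.0]
zero_w = [[0.0] * 4 for _ in range(4)]  # zero weights give gates of 0.5
zero_b = [0.0] * 4
f_scl, f_scg = spatial_cross_complement(f_global, f_local, zero_w, zero_b)
```

With zero weights every gate equals 0.5, so each refined stream is its own feature plus half of the other stream, which makes the cross-wise structure of the update easy to verify by hand.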

3.5. Overall Network Architecture

As described above, CMNet is a dual-stream feature extraction network. The upper branch of the backbone extracts global features, while the lower branch extracts local features. The FCC module then performs cross-complementary processing on the two sets of features, making global and local features complement each other and eliminating redundant ones. CMNet is independent of the detection head and can replace the backbone of existing state-of-the-art methods. In this paper, we employ the single-stage detector S²A-Net [38] as the detector for CMNet, preserving its detection head configuration and modifying only the feature extraction component.

4. Experiments

4.1. Datasets

We evaluate the proposed model on two widely recognized benchmarks for oriented object detection: DOTA v1.0 [8] and HRSC2016 [31]. These datasets were selected because they provide complementary challenges that comprehensively validate the robustness of our model.

Launched by Wuhan University in 2018, DOTA v1.0 is a large-scale dataset specifically designed for object detection in aerial remote sensing imagery. It contains 2806 high-resolution images with sizes ranging from $800 \times 800$ to $4000 \times 4000$ pixels, comprising 15 object categories and 188,282 annotated instances. These categories include Plane (PL), Baseball-diamond (BD), Bridge (BR), Ground-track-field (GTF), Small vehicle (SV), Large vehicle (LV), Ship (SH), Tennis court (TC), Basketball court (BC), Storage tank (ST), Soccer-ball-field (SBF), Roundabout (RA), Harbor (HA), Swimming pool (SP), and Helicopter (HC). All instances are annotated using the Oriented Bounding Box (OBB) protocol [5] with 8 degrees of freedom. We chose DOTA v1.0 because its diverse categories, varying object scales, and highly dense distributions provide a rigorous test for the model’s ability to handle complex spatial layouts and feature interference in large-scene remote sensing images. During training, images were cropped and resized to $1024 \times 1024$ pixels.

HRSC2016 [31] is a high-resolution benchmark released by Northwestern Polytechnical University in 2016 for ship detection in optical remote sensing images. The dataset contains 1061 images with sizes ranging from $300 \times 300$ to $1500 \times 900$ pixels. Following the official split, 436 images are used for training, 181 for validation, and 444 for testing. The dataset employs the OBB annotation format. Additionally, it provides bounding box annotations and pixel-based segmentation annotations, enabling a more comprehensive description of vessel characteristics, such as position, shape, and orientation.

4.2. Quality Metrics

To comprehensively evaluate the performance of the proposed network, this study employs a series of widely recognized quantitative metrics, including recall [58], average precision (AP) [18], precision [59], F1 score [59], and mean intersection over union (mIOU) [60,61]. AP is the area under the precision-recall curve and reflects the average precision of a class across recall levels, while mAP50 [18,62] is the arithmetic mean of the per-class AP values computed at an IoU threshold of 0.5. Higher precision indicates fewer false alarms, and higher recall indicates fewer missed targets. The F1 score is the harmonic mean of precision and recall, so a higher value indicates a better balance between the two. Finally, mIOU measures the overlap between predicted and ground-truth bounding boxes, with larger values implying better localization. We additionally report the numbers of ground-truth boxes (gts) and detected boxes (dts), which indicate whether the model tends to under-detect targets or produce redundant detections.
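The relationship between these metrics can be made concrete with a small sketch. It computes precision, recall, and F1 from counts of true positives, false positives, and false negatives, and computes AP with all-point interpolation. The interpolation scheme is an assumption: evaluation toolkits may instead use, for example, the 11-point VOC07 variant. The function names and the toy inputs are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def average_precision(matches, num_gt):
    """All-point interpolated AP for one class.

    `matches` lists detections sorted by descending confidence; True means
    the detection matched an unused ground-truth box at IoU >= 0.5.
    """
    tp = 0
    precisions, recalls = [], []
    for rank, hit in enumerate(matches, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    # Make precision monotonically non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
ap = average_precision([True, False, True, True], num_gt=4)
```

mAP50 would then be the mean of `average_precision` over all classes, with matches decided at an IoU threshold of 0.5.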

4.3. Runtime Environment

All experiments were run on a single NVIDIA RTX 3090 GPU. The software environment was built on CUDA 11.8 and the PyTorch 2.1.2 deep learning framework, with the MMRotate toolkit used to implement the model training workflow. The VM block was initialized with its pre-trained weights. All input images were uniformly resized to $1024 \times 1024$ pixels, and the batch size was set to 2. We employed AdamW [63] as the optimizer under a cosine-annealing learning rate schedule. The initial learning rate was set to $1 \times 10^{-4}$, and the weight decay was fixed at $5 \times 10^{-3}$. The model was trained for 25 epochs until convergence. Data processing and augmentation followed the default settings of MMRotate, consistent with the comparison models.
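For readers unfamiliar with cosine annealing, the following sketch shows how the learning rate would decay over the 25 epochs from the initial value of $1 \times 10^{-4}$. It is a plain-Python illustration of the standard formula, not the MMRotate configuration used in the experiments; the minimum learning rate of zero, the absence of warm-up, and per-epoch (rather than per-iteration) updates are assumptions.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=25, base_lr=1e-4, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing.

    min_lr and the lack of a warm-up phase are assumptions for illustration.
    """
    cos_term = 1 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term

# Learning rate at the start of each epoch, plus the final value.
schedule = [cosine_annealing_lr(e) for e in range(26)]
```

The schedule starts at the initial learning rate, decreases slowly at first, fastest around the midpoint, and flattens again as it approaches the minimum value.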

4.4. Performance Assessment on the HRSC2016 Dataset

The quantitative results on the HRSC2016 dataset [31] are shown in Table 1. CMNet outperforms existing mainstream methods, such as PKINet [18], LEGNet [48], and AMMBA [64], across five key detection metrics: recall, AP, precision, F1 score, and mIOU. Specifically, CMNet achieves a high recall of 0.977, surpassed only by Oriented R-CNN (0.987) and RoI Transformer (0.985), while producing one of the lowest numbers of detected boxes (dts), 1467, among all compared methods. This combination of high recall and few detections supports the effectiveness of modules such as FCC and MLFE in mitigating feature representation discrepancies and enhancing localization accuracy. The results demonstrate that CMNet achieves low redundancy, high precision, and strong robustness in high-resolution remote sensing ship detection tasks.

To corroborate the numerical assessment, we visually compare the results on the HRSC2016 dataset, as shown in Figure 5. For complex scenarios where ship boundaries blend with the background in port environments (areas marked by yellow dashed boxes), CMNet demonstrates the highest detection box localization accuracy. Due to the high visual similarity between ship edges and port facilities, most comparison models (e.g., Oriented R-CNN [35] and RoI Transformer [34]) fail to accurately distinguish target boundaries. They misclassify portions of the port area as ship decks, resulting in noticeable boundary shifts in the detection boxes. In contrast, CMNet precisely segments ships from the port background through its feature fusion mechanism, enabling detection boxes to tightly align with the actual target contours. Regarding full image detection completeness, while the AMMBA model achieves comparable localization accuracy to CMNet in localized areas (yellow dashed boxes), it exhibits significant missed detections, failing to identify some naval targets within the scene. In contrast, CMNet successfully detects all ship instances, demonstrating superior target recall capability. Although CMNet slightly underperforms Oriented R-CNN on the mIOU metric, its overall detection performance is superior when considering positioning accuracy, false negative rate, and adaptability to complex boundary scenarios.

4.5. Performance Assessment on the DOTA v1.0 Dataset

The effectiveness of the proposed approach is demonstrated through quantitative comparisons with competitive state-of-the-art methods on the DOTA v1.0 dataset [8]. The quantitative results are shown in Table 2. The proposed CMNet, leveraging its VM + MLFE + FCC architecture, demonstrates superior overall detection performance compared to existing state-of-the-art methods such as AMMBA, PKINet-S, and LEGNet-T. Its mAP50 reaches 79.38%, surpassing PKINet-S (78.39%), LEGNet-T (78.96%), and AMMBA (74.77%) by 0.99, 0.42, and 4.61 percentage points, respectively, while achieving the best result in five categories. VM’s quad-directional scanning mechanism efficiently captures global semantic features in high-resolution remote sensing images. The MLFE module extracts local textures and edge details through parallel $3 \times 3$, $5 \times 5$, and $7 \times 7$ depthwise separable convolutions and the GAM, enabling CMNet to perform well in complex scenarios. The FCC module bridges the representational gap between VM’s global features and MLFE’s local features by cross-complementing spatial features while eliminating redundant information. This further enhances feature fusion and supports CMNet’s advantages in processing remote sensing imagery.
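One reason parallel branches with kernels as large as $7 \times 7$ remain affordable is that a depthwise separable convolution needs far fewer weights than a standard convolution of the same kernel size. The sketch below compares the two weight counts for the three kernel sizes. It is an illustrative calculation only: the channel width of 64 is hypothetical, biases are ignored, and the exact layer layout inside MLFE is not reproduced here.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights of a k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

channels = 64  # hypothetical channel width, chosen only for illustration
comparison = {
    k: (standard_conv_params(k, channels, channels),
        depthwise_separable_params(k, channels, channels))
    for k in (3, 5, 7)
}
```

Under these assumptions the standard convolution grows with $k^2 \cdot C_{in} \cdot C_{out}$, whereas the depthwise separable variant grows only with $k^2 \cdot C_{in}$ plus a fixed pointwise term, so enlarging the kernel adds comparatively little cost.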

Qualitative results for dense object detection are shown in Figure 6, where CMNet demonstrates clear advantages. It is one of only two models among all comparators that do not misclassify truck cargo compartments as the Large vehicle (LV) category. The other is AMMBA, but CMNet exhibits the fewest false negatives across the entire image. CMNet thus achieves precise recognition of subcategories. Among all compared models, including LEGNet, PKINet, and AMMBA, it also produces the fewest false and redundant detections, along with the highest spatial matching accuracy between detection boxes and targets. These results demonstrate CMNet’s accuracy and completeness in detecting small objects within complex backgrounds.

Another example of a complex scenario is reported in Figure 7. Focusing on the complex case where Harbor and Ship overlap, this scene features high target-background fusion and significant differences in target scale. Results show that CMNet accurately identifies both Harbor and Ship categories simultaneously, while other models exhibit single-category recognition or dual-category detection failures. CMNet achieves the lowest false negative rate in this scenario, effectively overcoming detection biases caused by background interference and target scale variations. This demonstrates CMNet’s superior feature discrimination capability and large-object detection performance.

We also compare feature attention heatmaps in Figure 8. CMNet’s heatmap exhibits more concentrated responses to targets such as vessels in port and more pronounced suppression of background interference. This stems from the spatial feature cross-complementation performed by the FCC module, which focuses features on multi-scale targets and suppresses background redundancy. The result provides intuitive validation, at the feature representation level, of CMNet’s advantages in feature focus and detection robustness.

Figures 6–8 demonstrate that our model achieves the best performance among the compared models in classification accuracy, detection completeness, and robustness in complex scenarios. This highlights the advantages of the proposed CMNet. By utilizing VM for global feature extraction and MLFE for local texture details, we reduce the probability of missed detections and false positives across various complex scenarios. Finally, the FCC module allows the model to focus more effectively on regions of interest.

4.6. Ablation Study

To assess the effectiveness of the proposed CMNet and its key components, we performed ablation experiments on the DOTA v1.0 dataset using S2ANet [38] as the detection framework. Following previous work, mAP50 (%) and mAP75 (%) [18] are adopted as evaluation metrics, and the results are summarized in Table 3. Five configurations were evaluated. The baseline model with the ResNet-101 backbone achieved an mAP50 (%) of 72.26 and an mAP75 (%) of 43.77. When ResNet-101 was replaced with the VM backbone, the mAP50 (%) increased to 78.27 and the mAP75 (%) to 51.19, yielding gains of 6.01 and 7.42 percentage points, respectively, which demonstrates the effectiveness of VM for remote sensing object detection. Building on this, the VM + MLFE configuration attained an mAP50 (%) of 78.87 and an mAP75 (%) of 52.18, improving on the baseline by 6.61 and 8.41 percentage points and surpassing the VM-only setting by 0.60 and 0.99 percentage points, which indicates that MLFE enhances local features through multi-scale convolutions and gated attention. In the fourth configuration, in which the two features are simply concatenated, the mAP50 (%) and mAP75 (%) reached 78.44 and 51.66, respectively. Finally, the full VM + MLFE + FCC configuration achieved an mAP50 (%) of 79.38 and an mAP75 (%) of 50.14, corresponding to overall improvements of 7.12 and 6.37 percentage points over the baseline and a gain of 0.51 percentage points in mAP50 (%) compared with VM + MLFE, although its mAP75 (%) is lower than that of VM + MLFE (50.14 versus 52.18). Taken together, these results support the effectiveness and complementarity of the three modules.

To verify the effectiveness of the designed MLFE, a comparative experiment was also conducted on the number of layers in the MLFE network. As shown in Table 4, the layer settings [1, 1, 2, 1] and [2, 4, 8, 2] yielded an mAP50 (%) of 78.57 and 78.43, respectively, while [2, 2, 5, 2] achieved the best mAP50 (%) of 79.38.

A category-wise comparison between our proposed method and the baseline models is presented in Table 5. First, performance improves across all categories when the baseline is compared with VMamba, with the mAP50 (%) reaching 78.27, which demonstrates VMamba’s distinct advantages in remote sensing image processing. Second, over half of the categories achieve performance gains when VMamba is compared with VMamba + MLFE. The gains are not universal because the two frameworks extract disparate information, and simple summation or concatenation fails to capitalize on their respective strengths. Third, integrating the FCC module into VMamba + MLFE boosts the AP of most hard-to-detect categories (e.g., Helicopter and Bridge), verifying that the proposed FCC module can resolve architectural discrepancies between the two frameworks, enable feature complementarity, and thus enhance detection accuracy.

To further validate the effectiveness of the ablation experiments, we visualize the ablation results in Figure 9. The detection maps are consistent with the corresponding heatmap responses, jointly confirming the effectiveness of the proposed combination of the VM, MLFE, and FCC modules. When only the baseline model is used, significant missed and false detections occur for small, densely distributed ship targets. For example, in the red dashed box of the Baseline column in Figure 9, both missed targets and false alarms can be observed. After replacing the ResNet-101 [65] backbone with the VM, the false detections in this region are largely suppressed and a single ship is correctly detected; however, the remaining missed ships reveal VM’s limitations in local feature representation. With the subsequent introduction of the MLFE module, three ships are successfully detected within the same red dashed box, demonstrating that MLFE effectively enhances dense-target capture by fusing local features with global semantic information and thus markedly reduces missed detections. Finally, the full configuration of VM + MLFE + FCC yields a comprehensive improvement in detection performance, further suppressing missed detections, accurately identifying smaller ships, and improving recall. These results fully verify that the FCC module, through feature cross-complementation, efficiently filters redundant information, enables the network to focus on regions of interest, and strengthens the discrimination between targets and background clutter.

4.7. Computational Complexity

We evaluate the computational complexity of the compared approaches in Table 6. For this analysis, we consider the number of parameters in millions (M) and the number of giga floating-point operations (GFLOPs). The results are obtained on the DOTA v1.0 dataset [8]. Our method achieves a well-balanced trade-off among accuracy, efficiency, and model complexity. Compared to AMMBA, our model improves mAP50 by 4.61 percentage points with fewer parameters and reduced computational overhead. Furthermore, when evaluated against mainstream models such as Gliding Vertex and Rotated RTMDet, our approach is comparable or superior in terms of parameter count and computational complexity. Overall, the proposed model achieves 79.38% mAP50 under a well-balanced trade-off between model complexity and computational burden, surpassing ten state-of-the-art approaches including AMMBA and LEGNet.

4.8. Discussion

The cross-dataset experiments provide preliminary visual evidence of robustness. As shown in Figure 10, the DOTA-trained model produces satisfactory detections on DIOR [1] test set images, indicating that the feature extraction capabilities of CMNet remain stable across different data distributions. A comparison with ten advanced models shows that the proposed CMNet obtains the best detection performance in both dense object detection and complex scenes. Heatmap feature analysis reveals that CMNet focuses most intensely on the regions of interest, thanks to the FCC module, which enables the cross-complementation of global and local features and thereby eliminates redundancy. Despite this detection performance, the parameter count (M) and FLOPs (G) reveal some disadvantages of CMNet. On the one hand, the multi-branch convolution structure of the MLFE module requires multi-path feature computation when extracting fine-grained features, which consumes more computational resources. On the other hand, the 2D alignment operation of the FCC module further increases the computational burden. In future work, we will employ quantization techniques to reduce the numerical precision of model parameters, thus reducing storage and computational costs. Alternatively, we will explore modifying the network structure and its modules to reduce network redundancy, thereby alleviating the computational overhead of CMNet.

5. Conclusions

In this study, we introduce a CNN–Mamba Network (CMNet). Its architecture comprises a global semantic information extraction module (VM) responsible for modeling global contextual cues and an MLFE module that compensates for the VM’s limitations in capturing fine-grained local features. In addition, CMNet integrates an FCC module to further strengthen the network’s attention to salient regions, thereby enhancing robustness and detection accuracy in complex scenes. We evaluate CMNet on two widely used remote sensing benchmarks and compare it with ten state-of-the-art models, including LEGNet-T and AMMBA, demonstrating its clear advantages. The qualitative analysis shows that our method delivers superior performance in cluttered backgrounds and for densely distributed small objects. Furthermore, the ablation study conducted on the DOTA v1.0 dataset shows that the full CMNet setting yields the highest mAP50 among all configurations. Although CMNet surpasses ten current state-of-the-art methods in detection accuracy, it remains less competitive than some models with respect to parameter count and GFLOPs, suggesting that there is still room for model compression and efficiency improvement in future work.

Author Contributions

Conceptualization, J.L., H.M. and L.L.; methodology, J.L. and L.L.; software, J.L. and H.M.; validation, M.L., H.M. and Z.J.; formal analysis, G.V. and X.Z. (Xiaobin Zhao); investigation, J.L. and H.M.; resources, H.M. and L.L.; data curation, Z.J. and X.Z. (Xueyu Zhang); writing—original draft preparation, X.Z. (Xiaobin Zhao) and G.V.; writing—review and editing, J.L., H.M. and L.L.; visualization, J.L. and L.L.; supervision, H.M. and G.V.; project administration, X.Z. (Xueyu Zhang); funding acquisition, J.L., L.L. and H.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant Nos. 62261053 and 62401062; the Tianshan Talent Training Project-Xinjiang Science and Technology Innovation Team Program (2023TSYCTD0012); and the Research Project of Xinjiang Sky-Ground Integrated Intelligent Computing Technology Laboratory under grant No. 2025A05-1.

Data Availability Statement

All data generated or analyzed during this study are included in the published article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307.
2. Sun, X.; Wang, P.; Yan, Z.; Xu, F.; Wang, R.; Diao, W.; Chen, J.; Li, J.; Feng, Y.; Xu, T.; et al. FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 184, 116–130.
3. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 2849–2858.
4. Wang, K.; Wang, Z.; Li, Z.; Su, A.; Teng, X.; Pan, E.; Liu, M.; Yu, Q. Oriented object detection in optical remote sensing images using deep learning: A survey. arXiv 2023, arXiv:2302.10473.
5. Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote sensing object detection in the deep learning era—A review. Remote Sens. 2024, 16, 327.
6. Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented reppoints for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1829–1838.
7. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
8. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983.
9. Chen, Y.; Zhang, P.; Li, Z.; Zhang, X.; Meng, G. Stitcher: Feedback-driven data provider for object detection. arXiv 2020, arXiv:2004.12432.
10. Shamsolmoali, P.; Zareapoor, M.; Chanussot, J.; Zhou, H.; Yang, J. Rotation equivariant feature image pyramid network for object detection in optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5608614.
11. Hou, L.; Lu, K.; Xue, J. Refined one-stage oriented object detection method for remote sensing images. IEEE Trans. Image Process. 2022, 31, 1545–1558.
12. Zhang, W.; Jiao, L.; Li, Y.; Huang, Z.; Wang, H. Laplacian feature pyramid network for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5604114.
13. Li, L.; Zhao, X.; Hou, H.; Zhang, X.; Lv, M.; Jia, Z.; Ma, H. Fractal dimension-based multi-focus image fusion via coupled neural p systems in nsct domain. Fractal Fract. 2024, 8, 554.
14. Lv, M.; Jia, Z.; Li, L.; Ma, H. Fractal dimension-based multi-focus image fusion via AGPCNN and consistency verification in NSCT domain. Fractal Fract. 2026, 10, 1.
15. Lv, M.; Song, S.; Jia, Z.; Li, L.; Ma, H. Multi-focus image fusion based on dual-channel rybak neural network and consistency verification in NSCT domain. Fractal Fract. 2025, 9, 432.
16. Li, L.; Song, S.; Lv, M.; Jia, Z.; Ma, H. Multi-focus image fusion based on fractal dimension and parameter adaptive unit-linking dual-channel PCNN in curvelet transform domain. Fractal Fract. 2025, 9, 157.
17. Vivone, G.; Deng, L.-J.; Deng, S.; Hong, D.; Jiang, M.; Li, C.; Li, W.; Shen, H.; Wu, X.; Xiao, J.-L.; et al. Deep Learning in Remote Sensing Image Fusion: Methods, protocols, data, and future perspectives. IEEE Geosci. Remote Sens. Mag. 2025, 13, 269–310.
18. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly kernel inception network for remote sensing detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 27706–27716.
19. Dai, Y.; Li, C.; Su, X.; Liu, H.; Li, J. Multi-scale depthwise separable convolution for semantic segmentation in street-road scenes. Remote Sens. 2023, 15, 2649.
20. Guo, M.; Lu, C.; Hou, Q.; Liu, Z.; Cheng, M.; Hu, S. Segnext: Rethinking convolutional attention design for semantic segmentation. arXiv 2022, arXiv:2209.08575.
21. Li, L.; Shi, Y.; Lv, M.; Jia, Z.; Liu, M.; Zhao, X.; Zhang, X.; Ma, H. Infrared and visible image fusion via sparse representation and guided filtering in Laplacian pyramid domain. Remote Sens. 2024, 16, 3804.
22. Huang, W.; Wu, T.; Zhang, X.; Li, L.; Lv, M.; Jia, Z.; Zhao, X.; Ma, H.; Vivone, G. MCFTNet: Multimodal cross-layer fusion transformer network for hyperspectral and LiDAR data classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2025, 18, 12803–12818.
23. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
24. Aleissaee, A.; Kumar, A.; Anwer, R.; Khan, S.; Cholakkal, H.; Xia, G.; Khan, F. Transformers in remote sensing: A survey. Remote Sens. 2023, 15, 1806.
25. Huang, Y.; Jiao, D.; Huang, X.; Tang, T.; Gui, G. A hybrid CNN-transformer network for object detection in optical remote sensing images: Integrating local and global feature fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2025, 18, 241–254.
26. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
27. Wang, F.; Wang, J.; Ren, S. Mamba-Reg: Vision mamba also needs registers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 5–10 June 2025; pp. 14944–14953.
28. Ma, X.; Zhang, X.; Pun, M.-O. RS3Mamba: Visual state space model for remote sensing image semantic segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6011405.
29. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual state space model. arXiv 2024, arXiv:2401.10166.
30. Xu, Y.; Wang, H.; Zhou, F.; Luo, C.; Sun, X.; Rahardja, S.; Ren, P. MambaHSISR: Mamba hyperspectral image super-resolution. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5511216.
31. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the International Conference on Pattern Recognition Applications and Methods (ICPRAM), Porto, Portugal, 27 February–1 March 2017; pp. 324–331.
32. Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122.
33. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 11–17 December 2015; pp. 1440–1448.
34. Ding, J. Object detection in aerial images: A large-scale benchmark and challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7778–7796.
35. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented r-cnn for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3520–3529.
36. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal loss for dense object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
37. Yang, X.; Yan, J.; Feng, Z.; He, T. R3Det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Orlando, FL, USA, 2–9 February 2021; pp. 3163–3171.
38. Han, J.; Ding, J.; Li, J.; Xia, G.-S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5602511.
39. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206.
40. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790.
41. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 7036–7045.
42. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. Augfpn: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 12595–12604.
43. Shi, Z.; Hu, J.; Ren, J.; Ye, H.; Yuan, X.; Ouyang, Y.; Guo, J. HS-FPN: High frequency and spatial perception FPN for tiny object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Austin, TX, USA, 7–14 April 2025; pp. 6896–6904.
44. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.-M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. arXiv 2023, arXiv:2303.09030.
45. Pu, Y.; Wang, Y.; Xia, Z.; Han, Y.; Wang, Y.; Gan, W.; Wang, Z.; Song, S.; Huang, G. Adaptive rotated convolution for rotated object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–10 October 2023; pp. 6589–6600.
46. Lai, X. DecoupleNet: Decoupled network for domain adaptive semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 369–387.
47. Lu, W.; Chen, S.-B.; Ding, C.H.Q.; Tang, J.; Luo, B. LWGANet: A lightWeight group attention backbone for remote sensing visual tasks. arXiv 2025, arXiv:2501.10040.
48. Lu, W.; Chen, S.B.; Li, H.D.; Shu, Q.L.; Ding, C.H.; Tang, J.; Luo, B. LEGNet: Lightweight edge-gaussian driven network for low-quality remote sensing image object detection. arXiv 2025, arXiv:2503.14012.
49. Dao, T.; Gu, A. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024; pp. 10041–10071.
50. Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. Rsmamba: Remote sensing image classification with state space model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 8002605.
51. Wu, T.; Zhao, R.; Lv, M.; Jia, Z.; Li, L.; Liu, M.; Zhao, X.; Ma, H.; Vivone, G. Efficient Mamba-Attention Network for Remote Sensing Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5627814.
52. Huang, Z.; Zou, Y.; Bhagavatula, V.; Huang, D. Comprehensive attention self-distillation for weakly-supervised object detection. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 6–12 December 2020; pp. 16797–16807.
53. Zhou, M.; Li, T.; Qiao, C.; Xie, D.; Wang, G.; Ruan, N. DMM: Disparity-guided multispectral mamba for oriented object detection in remote sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5404913.
54. Ren, K.; Wu, X.; Xu, L.; Wang, L. Remotedet-mamba: A hybrid mamba-CNN network for multi-modal object detection in remote sensing images. arXiv 2024, arXiv:2410.13532.
55. Shen, D.; Zhu, X.; Tian, J.; Liu, J.; Du, Z.; Wang, H.; Ma, X. HTD-Mamba: Efficient hyperspectral target detection with pyramid state space model. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5507315.
56. Zhang, H.; Liu, H.; Shi, Z.; Mao, S.; Chen, N. ConvMamba: Combining Mamba with CNN for hyperspectral image classification. Neurocomputing 2025, 131016.
57. Zhao, S. Mamba-UNet: Dual-branch mamba fusion U-Net with multiscale spatio-temporal attention for precipitation nowcasting. IEEE Trans. Ind. Informat. 2025, 21, 4466–4475.
58. Guo, F.; Ma, H.; Li, L.; Lv, M.; Jia, Z. Multi-attention pyramid context network for infrared small ship detection. J. Mar. Sci. Eng. 2024, 12, 345.
59. Li, L.; Ma, H.; Zhang, X.; Zhao, X.; Lv, M.; Jia, Z. Synthetic aperture radar image change detection based on principal component analysis and two-level clustering. Remote Sens. 2024, 16, 1861.
60. Guo, F.; Ma, H.; Li, L.; Lv, M.; Jia, Z. FCNet: Flexible convolution network for infrared small ship detection. Remote Sens. 2024, 16, 2218.
61. Cao, Z.; Liang, Y.; Deng, L.; Vivone, G. An efficient image fusion network exploiting unifying language and mask guidance. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 9845–9862.
62. Ma, J.; Wang, G.; Yin, R.; He, G.; Zhou, D.; Long, T.; Adam, E.; Zhang, Z. Wind turbines small object detection in remote sensing images based on CGA-YOLO: A case study in Shandong Province, China. Remote Sens. 2026, 18, 324.
63. Shi, Y.; Yang, R.; Yin, C.; Lu, Y.; Huang, B.; Tao, Y.; Zhong, Y. Two-stage fine-tuning of large vision-language models with hierarchical prompting for few-shot object detection in remote sensing images. Remote Sens. 2026, 18, 266.
64. Lin, Q.; Chen, N.; Huang, H.; Zhu, D.; Fu, G.; Chen, C.; Yu, Y. Attention-based mean-max balance assignment for oriented object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5609215.
65. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
66. Yang, X.; Yang, X.; Yang, J.; Ming, Q.; Wang, W.; Tian, Q.; Yan, J. Learning high-precision bounding box for rotated object detection via kullback-leibler divergence. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 6–14 December 2021; pp. 18381–18394.
67. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.-S.; Bai, X. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1452–1459.
68. Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking rotated object detection with gaussian wasserstein distance loss. In Proceedings of the 38th International Conference on Machine Learning (ICML), Vienna, Austria, 18–24 July 2021; pp. 11830–11841.
69. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An empirical study of designing real-Time object detectors. arXiv 2022, arXiv:2212.07784.

Figure 1. Overall structure of the proposed dual-stream feature extraction network combining CNN and VM. The architecture consists of three main components: the VM block for global feature extraction, the MLFE module for local feature extraction, and the FCC module for feature fusion.

Figure 2. Overview of the VM block. The SS2D mechanism generates four sets of 1D sequences by scanning along four image directions. These sequences are then processed within the S6 block. Finally, the processed sequences are merged and reconstructed into a 2D image.
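
The scan-and-merge flow described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `cross_scan`, `cross_merge`, and `process_sequence` are hypothetical, the four directions are assumed to be row-major, column-major, and their two reverses, the merge is assumed to be a sum, and the S6 block is replaced by an identity placeholder.

```python
import numpy as np

def cross_scan(x):
    """Unfold an (H, W, C) feature map into four (H*W, C) sequences.

    Assumed directions: row-major, column-major, and the reverse of each.
    """
    h, w, c = x.shape
    row_major = x.reshape(h * w, c)                     # left-to-right, then down
    col_major = x.transpose(1, 0, 2).reshape(h * w, c)  # top-to-bottom, then right
    return [row_major, col_major, row_major[::-1], col_major[::-1]]

def process_sequence(seq):
    """Placeholder for the S6 block (assumption: identity, for illustration only)."""
    return seq

def cross_merge(seqs, h, w):
    """Undo each scan order and sum the four sequences back into (H, W, C)."""
    row_f, col_f, row_b, col_b = seqs
    c = row_f.shape[1]
    out = row_f.reshape(h, w, c)
    out = out + col_f.reshape(w, h, c).transpose(1, 0, 2)
    out = out + row_b[::-1].reshape(h, w, c)
    out = out + col_b[::-1].reshape(w, h, c).transpose(1, 0, 2)
    return out

# Toy feature map: 2 rows, 3 columns, 2 channels.
x = np.arange(2 * 3 * 2, dtype=float).reshape(2, 3, 2)
sequences = [process_sequence(s) for s in cross_scan(x)]
merged = cross_merge(sequences, 2, 3)
```

With the identity placeholder, every position is recovered four times, so `merged` equals `4 * x`. This confirms that the merge step correctly inverts each of the four scan orders.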

Figure 3. The three key modules proposed in our approach: the multi-scale local feature extraction (MLFE) module, the local feature extraction (LFE) module, and the global attention module (GAM).

Figure 4. The FCC sub-module realizes feature fusion along the spatial dimension within the feature extraction modules.

Figure 5. Visual results on the HRSC2016 dataset. Note that, to ensure fairness, all model visualizations were configured with consistent parameters, including bounding box line width and IOU threshold.

Figure 6. Visual results on a complex scene with dense small targets from the DOTA v1.0 dataset.

Figure 7. Visual results on a complex scene from the DOTA v1.0 dataset in which large and small targets are intermingled.

Figure 8. Heatmaps for feature analysis on a complex scene containing large objects from the DOTA v1.0 dataset. Red indicates the areas on which the model focuses.

Figure 9. Visual results related to the ablation experiments.

Figure 10. Visual results obtained by applying the weights trained on the DOTA dataset to the DIOR dataset.

Table 1. Quantitative results comparing the proposed approach with state-of-the-art methods on the HRSC2016 dataset. Best results are highlighted in red.

Model	Backbone	gts	dts	Recall (%)	Precision (%)	F1-Score (%)	mIOU (%)	mAP50 (%)
Roi-Trans [34]	Swin [23]	1227	1586	98.50	76.20	86.00	84.00	90.10
O-Rcnn [35]	R50 [65]	1227	3442	98.70	35.20	51.90	85.80	90.50
KLD [66]	R50 [65]	1227	4331	97.10	27.50	42.90	80.80	90.20
R3Det [37]	R50 [65]	1227	5349	93.00	21.30	34.70	76.40	88.00
G-Vertex [67]	R50 [65]	1227	1589	91.10	70.40	79.40	77.50	86.60
KFIOU [68]	R50 [65]	1227	4599	97.60	26.00	41.10	80.80	89.10
R-RTMdet [69]	CSPNeXt [69]	1227	2171	97.30	55.00	70.30	84.40	89.90
O-Rcnn [35]	PKINet-S [18]	1227	2658	96.40	44.50	60.90	82.20	89.30
O-Rcnn [35]	LEGNet-T [48]	1227	1754	93.60	65.50	77.10	83.50	89.20
AMMBA [64]	R101 [65]	1227	1892	95.50	61.90	75.20	82.40	89.80
S2ANet [38]	Proposed	1227	1467	97.70	81.70	89.00	85.40	90.60
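
As a consistency check on Table 1, the sketch below recomputes precision and F1-score from the other columns. It assumes the standard definitions (precision = TP/dts, recall = TP/gts, F1 = 2PR/(P + R)) and assumes that `gts` and `dts` count ground-truth boxes and detections, respectively. Neither the definitions nor the column meanings are stated in this section, and the helper names are hypothetical.

```python
def f1_score(precision_pct, recall_pct):
    """Harmonic mean of precision and recall, both given in percent."""
    return 2 * precision_pct * recall_pct / (precision_pct + recall_pct)

def precision_from_counts(recall_pct, gts, dts):
    """Estimate precision (%) from recall, assuming TP = recall * gts."""
    true_positives = recall_pct / 100 * gts
    return 100 * true_positives / dts

# Row "S2ANet [38] / Proposed" of Table 1.
proposed_f1 = f1_score(81.70, 97.70)
proposed_precision = precision_from_counts(97.70, gts=1227, dts=1467)

# Row "O-Rcnn [35] / R50 [65]" of Table 1.
orcnn_f1 = f1_score(35.20, 98.70)
orcnn_precision = precision_from_counts(98.70, gts=1227, dts=3442)
```

Under these assumptions, the recomputed values agree with the tabulated precision and F1-score to within rounding. This also shows why models with a similar recall but far more detections (`dts`) end up with a much lower precision.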

Table 2. Quantitative results comparing the proposed approach with state-of-the-art methods on the DOTA v1.0 dataset. Best results are highlighted in red.

Method	PL	BD	BR	GTF	SV	LV	SH	TC	BC	ST	SBF	RA	HA	SP	HC	mAP50 (%)
Roi-Trans [34]	89.08	83.60	54.84	72.10	78.97	84.45	87.97	90.90	87.14	86.65	64.74	66.50	76.67	72.28	66.90	77.52
O-Rcnn [35]	89.35	81.39	52.62	75.02	79.04	82.42	87.82	90.90	86.46	85.30	63.28	65.68	68.27	70.47	57.21	75.68
KLD [66]	89.20	75.60	48.30	73.02	76.88	75.26	86.32	90.90	84.52	83.46	60.93	62.10	66.56	64.90	43.85	72.12
R3Det [37]	89.30	75.22	45.42	69.24	74.56	72.83	79.28	90.89	81.02	83.25	58.78	63.16	63.42	62.24	37.41	69.73
G-Vertex [67]	89.21	75.77	51.28	69.56	78.14	75.62	86.88	90.90	85.40	84.77	53.48	66.65	66.31	69.98	54.43	73.22
KFIOU [68]	89.06	75.17	49.05	69.67	78.09	75.40	86.69	90.90	83.66	84.48	62.08	62.85	66.73	65.96	50.20	72.66
R-RTMdet [69]	89.42	84.08	55.12	75.32	80.77	84.36	88.95	90.90	87.35	87.28	62.91	67.74	78.02	81.10	68.86	78.81
PKINet-S [18]	89.72	84.20	55.81	77.63	80.25	84.45	88.12	90.88	87.57	86.07	66.86	70.23	77.47	73.62	62.94	78.38
LEGNet-T [48]	89.45	86.49	55.76	76.38	80.59	85.40	88.42	90.90	88.72	86.42	65.24	67.81	77.93	73.49	71.39	78.96
AMMBA [64]	89.11	81.44	51.10	70.29	79.96	78.04	87.25	90.87	82.93	85.54	63.30	64.02	66.47	71.65	59.57	74.77
Proposed	88.95	84.40	56.53	74.77	80.79	85.11	88.68	90.73	86.70	87.25	65.15	71.10	78.62	76.66	77.22	79.38

Table 3. Ablation experiments using S2ANet as detector.

Method	mAP50 (%)	mAP75 (%)	Parameters (M)	FLOPs (G)
BaseLine	72.26	43.77	55.23	275.50
Vmamba	78.27	51.19	39.37	200.19
Vmamba + MLFE + Add	78.87	52.18	39.83	211.80
Vmamba + MLFE + Cat	78.44	51.66	53.94	255.28
ALL	79.38	51.65	41.41	216.68
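
To make the trade-off in Table 3 concrete, the short sketch below computes the accuracy gain and the relative savings of the full model (row "ALL") over the BaseLine row. The numbers are copied from the table, and the variable and function names are illustrative only.

```python
# Values copied from Table 3 (rows "BaseLine" and "ALL").
baseline = {"map50": 72.26, "params_m": 55.23, "flops_g": 275.50}
full_model = {"map50": 79.38, "params_m": 41.41, "flops_g": 216.68}

def relative_reduction(before, after):
    """Relative reduction in percent when going from `before` to `after`."""
    return 100.0 * (before - after) / before

map50_gain = full_model["map50"] - baseline["map50"]
param_cut = relative_reduction(baseline["params_m"], full_model["params_m"])
flops_cut = relative_reduction(baseline["flops_g"], full_model["flops_g"])

print(f"mAP50 gain: {map50_gain:.2f} points")
print(f"Parameter reduction: {param_cut:.1f}%")
print(f"FLOPs reduction: {flops_cut:.1f}%")
```

On these figures, the full model gains 7.12 mAP50 points while using roughly a quarter fewer parameters and about a fifth fewer FLOPs than the BaseLine.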

Table 4. Impact of the number of MLFE layers on the network. S2ANet serves as the detector of our model.

Layers	mAP50 (%)	Parameters (M)	FLOPs (G)
[1, 1, 2, 1]	78.57	41.17	214.97
[2, 2, 4, 2]	79.38	41.41	216.68
[2, 4, 8, 2]	78.43	41.65	218.29

Table 5. Quantitative results comparing the proposed approach with the BaseLine method and the ablation variants on the DOTA v1.0 dataset. Best results are highlighted in red.

Method	PL	BD	BR	GTF	SV	LV	SH	TC	BC	ST	SBF	RA	HA	SP	HC	mAP50 (%)
BaseLine	87.95	76.84	52.82	67.99	78.65	79.81	87.36	90.82	81.91	84.49	55.45	59.18	72.23	68.41	40.08	72.26
Vmamba	88.70	84.18	53.05	73.56	80.72	84.25	88.49	90.88	84.89	87.18	61.78	71.57	78.58	74.78	71.41	78.27
Vmamba + MLFE	88.66	83.53	55.11	75.29	80.88	85.77	88.56	90.77	87.44	86.61	66.79	72.09	78.54	73.09	69.87	78.86
Proposed	88.95	84.40	56.53	74.77	80.79	85.11	88.68	90.73	86.70	87.25	65.15	71.10	78.62	76.66	77.22	79.38

Table 6. Computational analysis of the compared approaches. S2ANet is used as the detector of our model. Red indicates the best result.

Model	Backbone	FLOPs (G)	Parameters (M)	mAP50 (%)
Roi-Trans [34]	Swin [23]	229.53	58.75	77.52
O-Rcnn [35]	R50 [65]	211.43	41.14	75.68
KLD [66]	R50 [65]	335.74	41.90	72.12
R3Det [37]	R50 [65]	335.74	41.90	69.73
G-Vertex [67]	R50 [65]	211.30	41.14	73.23
KFIOU [68]	R50 [65]	355.74	41.90	72.67
R-RTMdet [69]	CSPNeXt [69]	204.21	52.27	78.81
O-Rcnn [35]	PKINet-S [18]	184.44	30.86	78.39
O-Rcnn [35]	LEGNet-T [48]	184.46	20.65	75.68
AMMBA [64]	R101 [65]	287.52	57.14	74.77
S2ANet [38]	Proposed	216.68	41.41	79.38

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

