CMNet: Global–Local Feature Fusion CNN-Mamba Network for Remote Sensing Object Detection
Highlights
- A novel CNN-Mamba network (CMNet) is proposed, synergizing both architectures to generate complementary global–local features for remote sensing object detection, while an FCC module addresses feature representational differences spatially and channel-wise.
- Experiments show that CMNet achieves 79.38% mAP50 on the DOTA v1.0 dataset and 90.60% mAP50 on the HRSC2016 dataset, outperforming the other state-of-the-art approaches it is compared against.
- By integrating CNN and Mamba, CMNet offers a new paradigm for remote sensing object detection. It overcomes the limitations of single-architecture models and validates the value of fusing local feature extraction with global receptive fields.
- The FCC module effectively addresses the disparity of heterogeneous features and exhibits strong extensibility for other hybrid networks. CMNet’s superior performance not only facilitates the practical application of remote sensing object detection but also highlights the great potential of Mamba in this field.
Abstract
1. Introduction
- We leverage VMamba’s linear complexity and four-directional scanning mechanism to efficiently capture long-range context and global dependencies in remote sensing images. Moreover, we design an MLFE module to accurately extract local structural details, such as edges and textures, through multi-scale parallel depthwise-separable convolutions and gated attention, thereby establishing an initial complementarity between global and local features.
- To address the representational discrepancies and redundancy between Mamba’s global sequence modeling and CNN modeling, we propose an FCC module. It reduces the representational differences between the global contextual features extracted by VMamba and the local detail features extracted by the proposed MLFE. More importantly, it removes redundant information, refining the features into more concise, complementary, and discriminative representations that provide a solid foundation for subsequent high-precision object detection.
- We conduct extensive experiments on two widely used and challenging public remote sensing object detection (RSOD) datasets, DOTA v1.0 [8] and HRSC2016 [31]. The results demonstrate the effectiveness of the proposed CMNet architecture and its core components, and show that it achieves competitive performance on RSOD tasks.
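
The four-directional scanning mentioned in the first contribution can be illustrated with a minimal pure-Python sketch. It assumes the usual VMamba-style cross-scan, in which a 2D feature grid is flattened along row-major and column-major paths and their reverses. The helper names `four_direction_scan` and `merge_scans` are hypothetical, and the selective-scan step that would process each sequence in the real model is omitted, so this only shows the traversal and merge bookkeeping.

```python
from typing import List, Sequence


def four_direction_scan(grid: Sequence[Sequence[float]]) -> List[List[float]]:
    """Flatten an H x W grid along four traversal paths (illustrative helper)."""
    h, w = len(grid), len(grid[0])
    row_major = [grid[i][j] for i in range(h) for j in range(w)]
    col_major = [grid[i][j] for j in range(w) for i in range(h)]
    # The two remaining paths are the same traversals read backwards.
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def merge_scans(seqs: List[List[float]], h: int, w: int) -> List[List[float]]:
    """Map each sequence back to its grid position and sum the four views."""
    row_major, col_major, row_rev, col_rev = seqs
    n = h * w
    merged = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            r = i * w + j  # index of (i, j) in the row-major sequence
            c = j * h + i  # index of (i, j) in the column-major sequence
            merged[i][j] = (row_major[r] + col_major[c]
                            + row_rev[n - 1 - r] + col_rev[n - 1 - c])
    return merged


grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
scans = four_direction_scan(grid)
# No sequence model is applied here, so merging simply sums four copies.
merged = merge_scans(scans, h=2, w=3)
print(scans[1])
print(merged)
```

Because each position is visited once per path, the cost of the scan grows linearly with the number of pixels, which is the property the contribution refers to as linear complexity.
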
2. Related Work
2.1. Architectures for RSOD
2.2. The State Space Models and Mamba
3. Methodology
3.1. Mamba
3.2. VMamba
3.3. MLFE
3.4. FCC Module
3.5. Overall Network Architecture
4. Experiments
4.1. Datasets
4.2. Quality Metrics
4.3. Runtime Environment
4.4. Performance Assessment on the HRSC2016 Dataset
4.5. Performance Assessment on the DOTA v1.0 Dataset
4.6. Ablation Study
4.7. Computational Complexity
4.8. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307.
2. Sun, X.; Wang, P.; Yan, Z.; Xu, F.; Wang, R.; Diao, W.; Chen, J.; Li, J.; Feng, Y.; Xu, T.; et al. FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 184, 116–130.
3. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 2849–2858.
4. Wang, K.; Wang, Z.; Li, Z.; Su, A.; Teng, X.; Pan, E.; Liu, M.; Yu, Q. Oriented object detection in optical remote sensing images using deep learning: A survey. arXiv 2023, arXiv:2302.10473.
5. Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote sensing object detection in the deep learning era—A review. Remote Sens. 2024, 16, 327.
6. Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented RepPoints for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1829–1838.
7. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
8. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983.
9. Chen, Y.; Zhang, P.; Li, Z.; Zhang, X.; Meng, G. Stitcher: Feedback-driven data provider for object detection. arXiv 2020, arXiv:2004.12432.
10. Shamsolmoali, P.; Zareapoor, M.; Chanussot, J.; Zhou, H.; Yang, J. Rotation equivariant feature image pyramid network for object detection in optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5608614.
11. Hou, L.; Lu, K.; Xue, J. Refined one-stage oriented object detection method for remote sensing images. IEEE Trans. Image Process. 2022, 31, 1545–1558.
12. Zhang, W.; Jiao, L.; Li, Y.; Huang, Z.; Wang, H. Laplacian feature pyramid network for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5604114.
13. Li, L.; Zhao, X.; Hou, H.; Zhang, X.; Lv, M.; Jia, Z.; Ma, H. Fractal dimension-based multi-focus image fusion via coupled neural P systems in NSCT domain. Fractal Fract. 2024, 8, 554.
14. Lv, M.; Jia, Z.; Li, L.; Ma, H. Fractal dimension-based multi-focus image fusion via AGPCNN and consistency verification in NSCT domain. Fractal Fract. 2026, 10, 1.
15. Lv, M.; Song, S.; Jia, Z.; Li, L.; Ma, H. Multi-focus image fusion based on dual-channel Rybak neural network and consistency verification in NSCT domain. Fractal Fract. 2025, 9, 432.
16. Li, L.; Song, S.; Lv, M.; Jia, Z.; Ma, H. Multi-focus image fusion based on fractal dimension and parameter adaptive unit-linking dual-channel PCNN in curvelet transform domain. Fractal Fract. 2025, 9, 157.
17. Vivone, G.; Deng, L.-J.; Deng, S.; Hong, D.; Jiang, M.; Li, C.; Li, W.; Shen, H.; Wu, X.; Xiao, J.-L.; et al. Deep learning in remote sensing image fusion: Methods, protocols, data, and future perspectives. IEEE Geosci. Remote Sens. Mag. 2025, 13, 269–310.
18. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly kernel inception network for remote sensing detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 27706–27716.
19. Dai, Y.; Li, C.; Su, X.; Liu, H.; Li, J. Multi-scale depthwise separable convolution for semantic segmentation in street-road scenes. Remote Sens. 2023, 15, 2649.
20. Guo, M.; Lu, C.; Hou, Q.; Liu, Z.; Cheng, M.; Hu, S. SegNeXt: Rethinking convolutional attention design for semantic segmentation. arXiv 2022, arXiv:2209.08575.
21. Li, L.; Shi, Y.; Lv, M.; Jia, Z.; Liu, M.; Zhao, X.; Zhang, X.; Ma, H. Infrared and visible image fusion via sparse representation and guided filtering in Laplacian pyramid domain. Remote Sens. 2024, 16, 3804.
22. Huang, W.; Wu, T.; Zhang, X.; Li, L.; Lv, M.; Jia, Z.; Zhao, X.; Ma, H.; Vivone, G. MCFTNet: Multimodal cross-layer fusion transformer network for hyperspectral and LiDAR data classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 12803–12818.
23. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
24. Aleissaee, A.; Kumar, A.; Anwer, R.; Khan, S.; Cholakkal, H.; Xia, G.; Khan, F. Transformers in remote sensing: A survey. Remote Sens. 2023, 15, 1806.
25. Huang, Y.; Jiao, D.; Huang, X.; Tang, T.; Gui, G. A hybrid CNN-transformer network for object detection in optical remote sensing images: Integrating local and global feature fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 241–254.
26. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
27. Wang, F.; Wang, J.; Ren, S. Mamba-Reg: Vision mamba also needs registers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 5–10 June 2025; pp. 14944–14953.
28. Ma, X.; Zhang, X.; Pun, M.-O. RS3Mamba: Visual state space model for remote sensing image semantic segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6011405.
29. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual state space model. arXiv 2024, arXiv:2401.10166.
30. Xu, Y.; Wang, H.; Zhou, F.; Luo, C.; Sun, X.; Rahardja, S.; Ren, P. MambaHSISR: Mamba hyperspectral image super-resolution. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5511216.
31. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the International Conference on Pattern Recognition Applications and Methods (ICPRAM), Porto, Portugal, 27 February–1 March 2017; pp. 324–331.
32. Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122.
33. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 11–17 December 2015; pp. 1440–1448.
34. Ding, J. Object detection in aerial images: A large-scale benchmark and challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7778–7796.
35. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3520–3529.
36. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
37. Yang, X.; Yan, J.; Feng, Z.; He, T. R3Det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Orlando, FL, USA, 2–9 February 2021; pp. 3163–3171.
38. Han, J.; Ding, J.; Li, J.; Xia, G.-S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5602511.
39. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. PANet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206.
40. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790.
41. Ghiasi, G.; Lin, T.Y.; Le, Q.V. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 7036–7045.
42. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. AugFPN: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 12595–12604.
43. Shi, Z.; Hu, J.; Ren, J.; Ye, H.; Yuan, X.; Ouyang, Y.; Guo, J. HS-FPN: High frequency and spatial perception FPN for tiny object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Austin, TX, USA, 7–14 April 2025; pp. 6896–6904.
44. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.-M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. arXiv 2023, arXiv:2303.09030.
45. Pu, Y.; Wang, Y.; Xia, Z.; Han, Y.; Wang, Y.; Gan, W.; Wang, Z.; Song, S.; Huang, G. Adaptive rotated convolution for rotated object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–10 October 2023; pp. 6589–6600.
46. Lai, X. DecoupleNet: Decoupled network for domain adaptive semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 369–387.
47. Lu, W.; Chen, S.-B.; Ding, C.H.Q.; Tang, J.; Luo, B. LWGANet: A lightweight group attention backbone for remote sensing visual tasks. arXiv 2025, arXiv:2501.10040.
48. Lu, W.; Chen, S.B.; Li, H.D.; Shu, Q.L.; Ding, C.H.; Tang, J.; Luo, B. LEGNet: Lightweight edge-Gaussian driven network for low-quality remote sensing image object detection. arXiv 2025, arXiv:2503.14012.
49. Dao, T.; Gu, A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024; pp. 10041–10071.
50. Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. RSMamba: Remote sensing image classification with state space model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 8002605.
51. Wu, T.; Zhao, R.; Lv, M.; Jia, Z.; Li, L.; Liu, M.; Zhao, X.; Ma, H.; Vivone, G. Efficient Mamba-attention network for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5627814.
52. Huang, Z.; Zou, Y.; Bhagavatula, V.; Huang, D. Comprehensive attention self-distillation for weakly-supervised object detection. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 6–12 December 2020; pp. 16797–16807.
53. Zhou, M.; Li, T.; Qiao, C.; Xie, D.; Wang, G.; Ruan, N. DMM: Disparity-guided multispectral mamba for oriented object detection in remote sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5404913.
54. Ren, K.; Wu, X.; Xu, L.; Wang, L. RemoteDet-Mamba: A hybrid Mamba-CNN network for multi-modal object detection in remote sensing images. arXiv 2024, arXiv:2410.13532.
55. Shen, D.; Zhu, X.; Tian, J.; Liu, J.; Du, Z.; Wang, H.; Ma, X. HTD-Mamba: Efficient hyperspectral target detection with pyramid state space model. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5507315.
56. Zhang, H.; Liu, H.; Shi, Z.; Mao, S.; Chen, N. ConvMamba: Combining Mamba with CNN for hyperspectral image classification. Neurocomputing 2025, 131016.
57. Zhao, S. Mamba-UNet: Dual-branch mamba fusion U-Net with multiscale spatio-temporal attention for precipitation nowcasting. IEEE Trans. Ind. Informat. 2025, 21, 4466–4475.
58. Guo, F.; Ma, H.; Li, L.; Lv, M.; Jia, Z. Multi-attention pyramid context network for infrared small ship detection. J. Mar. Sci. Eng. 2024, 12, 345.
59. Li, L.; Ma, H.; Zhang, X.; Zhao, X.; Lv, M.; Jia, Z. Synthetic aperture radar image change detection based on principal component analysis and two-level clustering. Remote Sens. 2024, 16, 1861.
60. Guo, F.; Ma, H.; Li, L.; Lv, M.; Jia, Z. FCNet: Flexible convolution network for infrared small ship detection. Remote Sens. 2024, 16, 2218.
61. Cao, Z.; Liang, Y.; Deng, L.; Vivone, G. An efficient image fusion network exploiting unifying language and mask guidance. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 9845–9862.
62. Ma, J.; Wang, G.; Yin, R.; He, G.; Zhou, D.; Long, T.; Adam, E.; Zhang, Z. Wind turbines small object detection in remote sensing images based on CGA-YOLO: A case study in Shandong Province, China. Remote Sens. 2026, 18, 324.
63. Shi, Y.; Yang, R.; Yin, C.; Lu, Y.; Huang, B.; Tao, Y.; Zhong, Y. Two-stage fine-tuning of large vision-language models with hierarchical prompting for few-shot object detection in remote sensing images. Remote Sens. 2026, 18, 266.
64. Lin, Q.; Chen, N.; Huang, H.; Zhu, D.; Fu, G.; Chen, C.; Yu, Y. Attention-based mean-max balance assignment for oriented object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5609215.
65. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
66. Yang, X.; Yang, X.; Yang, J.; Ming, Q.; Wang, W.; Tian, Q.; Yan, J. Learning high-precision bounding box for rotated object detection via Kullback-Leibler divergence. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 6–14 December 2021; pp. 18381–18394.
67. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.-S.; Bai, X. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1452–1459.
68. Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking rotated object detection with Gaussian Wasserstein distance loss. In Proceedings of the 38th International Conference on Machine Learning (ICML), Vienna, Austria, 18–24 July 2021; pp. 11830–11841.
69. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An empirical study of designing real-time object detectors. arXiv 2022, arXiv:2212.07784.

| Model | Backbone | gts | dts | Recall (%) | Precision (%) | F1-Score (%) | mIOU (%) | mAP50 (%) |
|---|---|---|---|---|---|---|---|---|
| Roi-Trans [34] | Swin [23] | 1227 | 1586 | 98.50 | 76.20 | 86.00 | 84.00 | 90.10 |
| O-Rcnn [35] | R50 [65] | 1227 | 3442 | 98.70 | 35.20 | 51.90 | 85.80 | 90.50 |
| KLD [66] | R50 [65] | 1227 | 4331 | 97.10 | 27.50 | 42.90 | 80.80 | 90.20 |
| R3Det [37] | R50 [65] | 1227 | 5349 | 93.00 | 21.30 | 34.70 | 76.40 | 88.00 |
| G-Vertex [67] | R50 [65] | 1227 | 1589 | 91.10 | 70.40 | 79.40 | 77.50 | 86.60 |
| KFIOU [68] | R50 [65] | 1227 | 4599 | 97.60 | 26.00 | 41.10 | 80.80 | 89.10 |
| R-RTMdet [69] | CSPNeXt [69] | 1227 | 2171 | 97.30 | 55.00 | 70.30 | 84.40 | 89.90 |
| O-Rcnn [35] | PKINet-S [18] | 1227 | 2658 | 96.40 | 44.50 | 60.90 | 82.20 | 89.30 |
| O-Rcnn [35] | LEGNet-T [48] | 1227 | 1754 | 93.60 | 65.50 | 77.10 | 83.50 | 89.20 |
| AMMBA [64] | R101 [65] | 1227 | 1892 | 95.50 | 61.90 | 75.20 | 82.40 | 89.80 |
| S2ANet [38] | Proposed | 1227 | 1467 | 97.70 | 81.70 | 89.00 | 85.40 | 90.60 |
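
The precision and F1-score columns of the HRSC2016 table can be cross-checked from the other columns. The sketch below assumes that `gts` is the number of ground-truth boxes, `dts` the number of detections, recall = TP / gts, and precision = TP / dts. These definitions are not stated in this excerpt, and `precision_and_f1` is a hypothetical helper.

```python
from typing import Tuple


def precision_and_f1(gts: int, dts: int, recall_pct: float) -> Tuple[float, float]:
    """Recover precision and F1 (both in %) from gts, dts, and recall (%)."""
    recall = recall_pct / 100.0
    true_positives = recall * gts      # assumed: recall = TP / gts
    precision = true_positives / dts   # assumed: precision = TP / dts
    f1 = 2 * precision * recall / (precision + recall)
    return 100.0 * precision, 100.0 * f1


# Row for S2ANet with the proposed backbone: gts = 1227, dts = 1467, recall = 97.70%.
precision, f1 = precision_and_f1(gts=1227, dts=1467, recall_pct=97.70)
print(f"precision = {precision:.2f}%, F1 = {f1:.2f}%")
```

Under these assumptions the result agrees with the tabulated 81.70% precision and 89.00% F1 to within rounding. It also shows why detectors with similar recall but many more detections, such as the rows with more than 4000 detections, end up with much lower precision and F1.
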

| Method | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP50 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Roi-Trans [34] | 89.08 | 83.60 | 54.84 | 72.10 | 78.97 | 84.45 | 87.97 | 90.90 | 87.14 | 86.65 | 64.74 | 66.50 | 76.67 | 72.28 | 66.90 | 77.52 |
| O-Rcnn [35] | 89.35 | 81.39 | 52.62 | 75.02 | 79.04 | 82.42 | 87.82 | 90.90 | 86.46 | 85.30 | 63.28 | 65.68 | 68.27 | 70.47 | 57.21 | 75.68 |
| KLD [66] | 89.20 | 75.60 | 48.30 | 73.02 | 76.88 | 75.26 | 86.32 | 90.90 | 84.52 | 83.46 | 60.93 | 62.10 | 66.56 | 64.90 | 43.85 | 72.12 |
| R3Det [37] | 89.30 | 75.22 | 45.42 | 69.24 | 74.56 | 72.83 | 79.28 | 90.89 | 81.02 | 83.25 | 58.78 | 63.16 | 63.42 | 62.24 | 37.41 | 69.73 |
| G-Vertex [67] | 89.21 | 75.77 | 51.28 | 69.56 | 78.14 | 75.62 | 86.88 | 90.90 | 85.40 | 84.77 | 53.48 | 66.65 | 66.31 | 69.98 | 54.43 | 73.22 |
| KFIOU [68] | 89.06 | 75.17 | 49.05 | 69.67 | 78.09 | 75.40 | 86.69 | 90.90 | 83.66 | 84.48 | 62.08 | 62.85 | 66.73 | 65.96 | 50.20 | 72.66 |
| R-RTMdet [69] | 89.42 | 84.08 | 55.12 | 75.32 | 80.77 | 84.36 | 88.95 | 90.90 | 87.35 | 87.28 | 62.91 | 67.74 | 78.02 | 81.10 | 68.86 | 78.81 |
| PKINet-S [18] | 89.72 | 84.20 | 55.81 | 77.63 | 80.25 | 84.45 | 88.12 | 90.88 | 87.57 | 86.07 | 66.86 | 70.23 | 77.47 | 73.62 | 62.94 | 78.38 |
| LEGNet-T [48] | 89.45 | 86.49 | 55.76 | 76.38 | 80.59 | 85.40 | 88.42 | 90.90 | 88.72 | 86.42 | 65.24 | 67.81 | 77.93 | 73.49 | 71.39 | 78.96 |
| AMMBA [64] | 89.11 | 81.44 | 51.10 | 70.29 | 79.96 | 78.04 | 87.25 | 90.87 | 82.93 | 85.54 | 63.30 | 64.02 | 66.47 | 71.65 | 59.57 | 74.77 |
| Proposed | 88.95 | 84.40 | 56.53 | 74.77 | 80.79 | 85.11 | 88.68 | 90.73 | 86.70 | 87.25 | 65.15 | 71.10 | 78.62 | 76.66 | 77.22 | 79.38 |
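
The mAP50 column in the DOTA v1.0 table summarizes the 15 per-category AP50 values. The sketch below assumes that mAP50 is their unweighted mean and checks this against the LEGNet-T row. The variable names are illustrative only.

```python
from statistics import mean

# Category abbreviations as they appear in the table header.
categories = ["PL", "BD", "BR", "GTF", "SV", "LV", "SH", "TC",
              "BC", "ST", "SBF", "RA", "HA", "SP", "HC"]

# Per-category AP50 (%) copied from the LEGNet-T row.
legnet_t_ap50 = [89.45, 86.49, 55.76, 76.38, 80.59, 85.40, 88.42, 90.90,
                 88.72, 86.42, 65.24, 67.81, 77.93, 73.49, 71.39]

ap_by_category = dict(zip(categories, legnet_t_ap50))
map50 = mean(ap_by_category.values())  # unweighted mean over the 15 classes
print(f"mAP50 = {map50:.2f}%")
```

Because every category carries equal weight, a large gain on a single hard class such as HC moves mAP50 by only one fifteenth of that gain.
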

| Method | mAP50 (%) | mAP75 (%) | Parameters (M) | FLOPs (G) |
|---|---|---|---|---|
| Baseline | 72.26 | 43.77 | 55.23 | 275.50 |
| VMamba | 78.27 | 51.19 | 39.37 | 200.19 |
| VMamba + MLFE + Add | 78.87 | 52.18 | 39.83 | 211.80 |
| VMamba + MLFE + Cat | 78.44 | 51.66 | 53.94 | 255.28 |
| All | 79.38 | 51.65 | 41.41 | 216.68 |
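
The accuracy and cost trade-off in the ablation table can be made explicit by comparing the baseline row with the full-model row. This is a small arithmetic sketch over the tabulated values. The dictionary keys and the `relative_change` helper are hypothetical names.

```python
def relative_change(new: float, old: float) -> float:
    """Signed percentage change from old to new."""
    return 100.0 * (new - old) / old


# Values copied from the baseline row and the full-model row of the ablation table.
baseline = {"map50": 72.26, "params_m": 55.23, "flops_g": 275.50}
full = {"map50": 79.38, "params_m": 41.41, "flops_g": 216.68}

map50_gain = full["map50"] - baseline["map50"]  # percentage points
params_change = relative_change(full["params_m"], baseline["params_m"])
flops_change = relative_change(full["flops_g"], baseline["flops_g"])

print(f"mAP50 gain: {map50_gain:+.2f} points")
print(f"Parameters: {params_change:+.2f}%")
print(f"FLOPs: {flops_change:+.2f}%")
```

With these numbers the full model gains about 7 points of mAP50 while using roughly a quarter fewer parameters and about a fifth fewer FLOPs than the baseline.
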

| Layers | mAP50 (%) | Parameters (M) | FLOPs (G) |
|---|---|---|---|
| [1, 1, 2, 1] | 78.57 | 41.17 | 214.97 |
| [2, 2, 4, 2] | 79.38 | 41.41 | 216.68 |
| [2, 4, 8, 2] | 78.43 | 41.65 | 218.29 |

| Method | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP50 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 87.95 | 76.84 | 52.82 | 67.99 | 78.65 | 79.81 | 87.36 | 90.82 | 81.91 | 84.49 | 55.45 | 59.18 | 72.23 | 68.41 | 40.08 | 72.26 |
| VMamba | 88.70 | 84.18 | 53.05 | 73.56 | 80.72 | 84.25 | 88.49 | 90.88 | 84.89 | 87.18 | 61.78 | 71.57 | 78.58 | 74.78 | 71.41 | 78.27 |
| VMamba + MLFE | 88.66 | 83.53 | 55.11 | 75.29 | 80.88 | 85.77 | 88.56 | 90.77 | 87.44 | 86.61 | 66.79 | 72.09 | 78.54 | 73.09 | 69.87 | 78.86 |
| Proposed | 88.95 | 84.40 | 56.53 | 74.77 | 80.79 | 85.11 | 88.68 | 90.73 | 86.70 | 87.25 | 65.15 | 71.10 | 78.62 | 76.66 | 77.22 | 79.38 |

| Model | Backbone | FLOPs (G) | Parameters (M) | mAP50 (%) |
|---|---|---|---|---|
| Roi-Trans [34] | Swin [23] | 229.53 | 58.75 | 77.52 |
| O-Rcnn [35] | R50 [65] | 211.43 | 41.14 | 75.68 |
| KLD [66] | R50 [65] | 335.74 | 41.90 | 72.12 |
| R3Det [37] | R50 [65] | 335.74 | 41.90 | 69.73 |
| G-Vertex [67] | R50 [65] | 211.30 | 41.14 | 73.23 |
| KFIOU [68] | R50 [65] | 355.74 | 41.90 | 72.67 |
| R-RTMdet [69] | CSPNeXt [69] | 204.21 | 52.27 | 78.81 |
| O-Rcnn [35] | PKINet-S [18] | 184.44 | 30.86 | 78.39 |
| O-Rcnn [35] | LEGNet-T [48] | 184.46 | 20.65 | 75.68 |
| AMMBA [64] | R101 [65] | 287.52 | 57.14 | 74.77 |
| S2ANet [38] | Proposed | 216.68 | 41.41 | 79.38 |

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Liu, J.; Li, L.; Zhao, X.; Lv, M.; Jia, Z.; Zhang, X.; Vivone, G.; Ma, H. CMNet: Global–Local Feature Fusion CNN-Mamba Network for Remote Sensing Object Detection. Remote Sens. 2026, 18, 591. https://doi.org/10.3390/rs18040591

