Abstract
Small object detection in remote sensing imagery remains challenging because small objects offer limited feature representation and are easily obscured by complex backgrounds. Methods that rely exclusively on either the visible or the infrared modality often fail to achieve both accuracy and robustness, and effectively integrating cross-modal information is still an open problem. To address this issue, we propose a novel Multispectral Sparse Cross-Attention Guided Mamba Network (MSCGMN) for small object detection in remote sensing. MSCGMN comprises three key components: a Multispectral Sparse Cross-Attention Guidance Module (MSCAG), a Dynamic Grouped Mamba Block (DGMB), and a Gated Enhanced Attention Module (GEAM). Specifically, MSCAG selectively fuses visible (RGB) and infrared (IR) features using sparse cross-modal attention, capturing complementary information across modalities while suppressing redundancy. The DGMB introduces a dynamic grouping strategy that improves the computational efficiency of Mamba while enabling effective global context modeling. Because small objects occupy limited areas in remote sensing images, their critical features are difficult to capture; we therefore design the GEAM to enhance both global and local feature representations. Experiments on the VEDAI and DroneVehicle datasets show that MSCGMN achieves mAP50 scores of 83.9% and 84.4%, respectively, outperforming existing state-of-the-art methods in small object detection.
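
To make the idea of sparse cross-modal attention concrete, the following is a minimal NumPy sketch, not the MSCAG implementation itself. The abstract does not state how sparsity is obtained, so top-k selection of IR tokens per RGB query is an assumption here, as are the residual fusion step, the function name `sparse_cross_attention`, and the parameter `top_k`.

```python
import numpy as np

def sparse_cross_attention(rgb, ir, top_k=4):
    """Illustrative top-k sparse cross-attention (hypothetical helper).

    rgb: (N, d) array of RGB tokens used as queries.
    ir:  (M, d) array of IR tokens used as keys and values.
    top_k: number of IR tokens each RGB token attends to (assumed, top_k <= M).
    """
    d = rgb.shape[1]
    # Scaled dot-product similarity between every RGB and IR token: (N, M).
    scores = rgb @ ir.T / np.sqrt(d)
    # Per-row threshold: the top_k-th largest score for each RGB token.
    kth = np.partition(scores, -top_k, axis=1)[:, -top_k][:, None]
    # Sparsify: discard all but the top_k strongest cross-modal matches.
    masked = np.where(scores >= kth, scores, -np.inf)
    # Numerically stable softmax over the surviving entries only.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=1, keepdims=True)
    # Assumed fusion rule: add the attended IR context to the RGB tokens.
    fused = rgb + weights @ ir
    return fused, weights
```

Under these assumptions, each RGB token receives a non-zero weight for exactly `top_k` IR tokens (barring ties in the scores), which is one simple way to keep complementary cross-modal responses while discarding weak, redundant ones.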