Accurate and efficient vehicle detection from unmanned aerial vehicle (UAV) imagery is essential for intelligent transportation, urban monitoring, and public safety, yet the task remains challenging due to high target density, extreme scale variation, complex backgrounds, and stringent onboard computational constraints. Existing DETR-based detectors model global context through self-attention but incur quadratic complexity that is prohibitive for high-resolution UAV images, while CNN-based methods lack the long-range contextual awareness needed for dense small-object scenarios. We propose RSMamDet, an efficient end-to-end detection framework built upon RT-DETR that replaces quadratic self-attention with linear State Space Model scanning. The framework integrates a MobileMamba backbone with a Selective Feature Scanning module for efficient global context modeling, a Dimension-Aware Selective Integration module for adaptive cross-scale feature fusion, a Poly Kernel Inception Network encoder for multi-receptive-field feature enrichment, and an Adaptive Multi-Level Feature Fusion module for content-aware dynamic upsampling, complemented by an Uncertainty-Minimal Composite loss for stable query selection in cluttered aerial scenes. Experiments on DroneVehicle and VisDrone2019 demonstrate that RSMamDet achieves mAP50 of 72.6% and 40.2%, surpassing state-of-the-art methods by 4.1% and 2.2%, respectively, while maintaining real-time inference at 186.2 FPS with only 19.8M parameters and 42.3 GFLOPs, representing a reduction in both computational cost and model parameters compared to the strongest baseline.
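
To illustrate the complexity argument behind replacing self-attention with State Space Model scanning, here is a minimal sketch of a one-dimensional linear state space recurrence. It is not the RSMamDet or MobileMamba implementation: the function names (`ssm_scan`, `ssm_unrolled`) and the fixed scalar parameters `a`, `b`, `c` are assumptions made for illustration, whereas selective (Mamba-style) scanning makes such parameters input-dependent. The sketch only shows that a recurrent scan visits each token once, while the equivalent unrolled form mixes every pair of positions.

```python
def ssm_scan(xs, a, b, c):
    """Linear-time scan of h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    Illustrative only: a, b, c are fixed scalars (an assumption);
    a selective SSM would compute them from the input at each step.
    """
    h = 0.0
    ys = []
    for x in xs:  # one state update per token: O(L) work
        h = a * h + b * x
        ys.append(c * h)
    return ys


def ssm_unrolled(xs, a, b, c):
    """Same outputs via the unrolled sum y_t = c * sum_k a^(t-k) * b * x_k.

    Every output looks back at all earlier inputs, so the work grows
    quadratically with sequence length, like pairwise token mixing.
    """
    ys = []
    for t in range(len(xs)):
        acc = 0.0
        for k in range(t + 1):  # O(L) work per output: O(L^2) total
            acc += (a ** (t - k)) * b * xs[k]
        ys.append(c * acc)
    return ys


if __name__ == "__main__":
    tokens = [1.0, 0.0, 0.0, 2.0]
    fast = ssm_scan(tokens, a=0.5, b=1.0, c=2.0)
    slow = ssm_unrolled(tokens, a=0.5, b=1.0, c=2.0)
    print(fast)
    print(slow)
```

Both functions produce the same sequence; the scan carries a running state forward instead of revisiting earlier positions, which is the property that lets cost grow linearly with the number of tokens.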