AMSRDet: An Adaptive Multi-Scale UAV Infrared-Visible Remote Sensing Vehicle Detection Network
Abstract
1. Introduction
- AMSRDet fuses infrared-visible modalities through state-space models and adaptive attention, achieving higher detection accuracy at lower computational cost for UAV vehicle detection. Compared to the baseline RT-DETR, AMSRDet improves mean Average Precision (mAP) by 1.1 percentage points while achieving 1.41× faster inference with 48.9% fewer Floating Point Operations (FLOPs).
- A MobileMamba dual-stream encoder with SS2D blocks extracts hierarchical features at linear complexity, while Cross-Modal Global Fusion captures global dependencies via spatial-channel attention with adaptive gating that suppresses modality-specific noise.
- Scale-Coordinate Attention Fusion adaptively integrates multi-scale features through coordinate attention and learned scale weights, improving small object detection by 2.5 percentage points. A Separable Dynamic Decoder generates scale-adaptive predictions via content-aware dynamic convolution.
- Experiments on DroneVehicle show that AMSRDet outperforms twenty state-of-the-art detectors (YOLOv12, DEIM, Mamba-YOLO, etc.), achieving 45.8% mAP@0.5:0.95 (81.2% mAP@0.5) at 68.3 Frames Per Second (FPS) with 28.6 M parameters and 47.2 GFLOPs. Cross-dataset evaluation yields 52.3% mAP on Camera-vehicle without fine-tuning, a 2.5–9.1 percentage-point improvement over the baselines.
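The adaptive gating idea in the Cross-Modal Global Fusion bullet above — a gate that weighs RGB against infrared evidence element by element and suppresses the noisier modality — can be sketched in a few lines. This is a minimal illustrative sketch only: the function name and the scalar gate parameters are assumptions, standing in for the learned spatial-channel attention gate the paper describes.

```python
import math


def sigmoid(x):
    """Logistic function, used here as the gating nonlinearity."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(rgb, ir, w_rgb=0.8, w_ir=0.6, bias=0.0):
    """Fuse two aligned feature vectors with an adaptive gate.

    For each position, g = sigmoid(w_rgb*r + w_ir*t + bias) in (0, 1),
    and the fused value is the convex combination g*r + (1-g)*t, so the
    output always lies between the two modality responses. The scalar
    weights are hypothetical stand-ins for CMGF's learned gating.
    """
    fused = []
    for r, t in zip(rgb, ir):
        g = sigmoid(w_rgb * r + w_ir * t + bias)
        fused.append(g * r + (1.0 - g) * t)
    return fused


# Example: where the RGB response is strong the gate opens toward it;
# where it is weak (e.g. at night) the IR response dominates.
out = gated_fusion([1.0, 0.0, 0.5], [0.2, 0.9, 0.5])
```

Because the gate is a convex weight, a zeroed-out (noisy) modality can never push the fused feature outside the range spanned by the two inputs, which is the noise-suppression property the bullet claims.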
2. Related Work
2.1. UAV Vehicle Detection
2.2. Multi-Modal Fusion for Object Detection
2.3. State-Space Models for Vision
2.4. Attention Mechanisms for Multi-Scale Detection
3. Methodology
3.1. Overall Architecture
3.2. MobileMamba Encoder with SS2D Blocks
3.3. Cross-Modal Global Fusion (CMGF)
3.4. Scale-Coordinate Attention Fusion (SCAF)
3.5. Separable Dynamic Decoder
3.6. Loss Function
4. Experiments
4.1. Datasets
4.2. Implementation Details
4.3. Evaluation Metrics
5. Results and Analysis
5.1. Comparison with State-of-the-Art Methods
5.2. Cross-Dataset Generalization
5.3. Multi-Scale Detection Analysis
5.4. Ablation Studies
5.5. Analysis of Multi-Modal Fusion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Agarwal, S.; Mustavee, S.; Contreras-Castillo, J.; Guerrero-Ibañez, J. Sensing and Monitoring of Smart Transportation Systems. In The Rise of Smart Cities; Elsevier: Amsterdam, The Netherlands, 2022; pp. 495–522. [Google Scholar]
- Collado, J.M.; Hilario, C.; De la Escalera, A.; Armingol, J.M. Model Based Vehicle Detection for Intelligent Vehicles. In Proceedings of the IEEE Intelligent Vehicles Symposium, Parma, Italy, 14–17 June 2004; pp. 572–577. [Google Scholar]
- Zhang, P.; Zhong, Y.; Li, X. Lightweight Object Detection for UAV Aerial Images. IEEE Access 2023, 11, 42384–42397. [Google Scholar] [CrossRef]
- Qu, J.; Tang, Z.; Zhang, L.; Zhang, Y.; Zhang, Z. Remote Sensing Small Object Detection Network Based on Attention Mechanism and Multi-Scale Feature Fusion. Remote Sens. 2023, 15, 2728. [Google Scholar] [CrossRef]
- Song, G.; Du, H.; Zhang, X.; Bao, F.; Zhang, Y. Small Object Detection in Unmanned Aerial Vehicle Images Using Multi-Scale Hybrid Attention. Eng. Appl. Artif. Intell. 2024, 128, 107455. [Google Scholar] [CrossRef]
- Li, J.; Fan, C.; Ou, C.; Zhang, H. Infrared and Visible Image Fusion Techniques for UAVs: A Comprehensive Review. Drones 2025, 9, 811. [Google Scholar] [CrossRef]
- Ikram, S.; Sarwar, I.; Ikram, A.; Abdullah-Al-Wahud, M. A Transformer-Based Multimodal Object Detection System for Real-World Applications. IEEE Access 2025, 13, 29162–29176. [Google Scholar] [CrossRef]
- Zhao, W.; Xie, S.; Zhao, F.; He, Y.; Lu, H. MetaFusion: Infrared and Visible Image Fusion via Meta-Feature Embedding from Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13955–13965. [Google Scholar]
- Wu, A.; Zheng, W.S.; Yu, H.X.; Gong, S.; Lai, J. RGB-Infrared Cross-Modality Person Re-Identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5380–5389. [Google Scholar]
- Defaoui, M.; Koutti, L.; El Ansari, M.; Lahmyed, R.; Masmoudi, L. A Novel Hybrid Deep Learning Framework for Pedestrian Detection Based on Thermal Infrared and Visible Spectrum Images. Multimed. Tools Appl. 2025, 1–27. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
- Godase, V.V.; Takale, S.R.; Ghodake, R.G.; Mulani, A. Attention Mechanisms in Semantic Segmentation of Remote Sensing Images. J. Adv. Electron. Signal Process. 2025, 2, 45–58. [Google Scholar]
- Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
- Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
- Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
- Ma, J.; Li, F.; Wang, B. U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
- Zhou, T.; Wang, W.; Konukoglu, E.; Van Gool, L. Mamba-Based Vision Models: A Comprehensive Survey. arXiv 2024, arXiv:2404.15956. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
- Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
- Li, X.; Wang, W.; Hu, X.; Yang, J. Coordinate Attention for Efficient Feature Extraction. Pattern Recognit. 2024, 145, 109912. [Google Scholar] [CrossRef]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Li, J.; Shi, Y.; Hong, Q.; Jia, Y. A Scale-Aware Multi-Domain DETR for Small Object Detection in UAV Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4421520. [Google Scholar]
- Hao, X.; Diao, Y.; Wei, M.; Yang, Y.; Hao, P.; Yin, R.; Zhang, H.; Li, W.; Zhao, S.; Liu, Y. MapFusion: A Novel BEV Feature Fusion Network for Multi-Modal Map Construction. Inf. Fusion 2025, 119, 103018. [Google Scholar] [CrossRef]
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16788–16797. [Google Scholar]
- Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. Vision Mamba for Dense Prediction Tasks. arXiv 2024, arXiv:2405.14604. [Google Scholar]
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-Attention Mask Transformer for Universal Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
- Lou, M.; Yu, Y. OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–17 June 2025; pp. 128–138. [Google Scholar]
- Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
- Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving Into High Quality Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
- Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
- Zhou, Y.; Li, J.; Ou, C.; Yan, D.; Zhang, H.; Xue, X. Open-Vocabulary Object Detection in UAV Imagery: A Review and Future Perspectives. Drones 2025, 9, 557. [Google Scholar] [CrossRef]
- Chen, S.; Ye, M.; Huang, Y.; Du, B. Towards Effective Rotation Generalization in UAV Object Re-Identification. IEEE Trans. Inf. Forensics Secur. 2025, 20, 2593–2606. [Google Scholar] [CrossRef]
- Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Context-Aware Feature Pyramid Network for Multi-Scale Object Detection in UAV Imagery. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 1123–1137. [Google Scholar] [CrossRef]
- Zhong, H.; Zhang, Y.; Shi, Z.; Zhang, Y.; Zhao, L. PS-YOLO: A Lighter and Faster Network for UAV Object Detection. Remote Sens. 2025, 17, 1641. [Google Scholar] [CrossRef]
- Zhang, Y.; Li, X.; Wang, H.; Chen, J. RSW-YOLO: A Vehicle Detection Model for Urban UAV Remote Sensing Images. Sensors 2025, 25, 4335. [Google Scholar] [CrossRef]
- Guo, H.; Wu, Q.; Wang, Y. AUHF-DETR: A Lightweight Transformer with Spatial Attention and Wavelet Convolution for Embedded UAV Small Object Detection. Remote Sens. 2025, 17, 1920. [Google Scholar] [CrossRef]
- Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Prakash, I.V.; Palanivelan, M. A Study of YOLO (You Only Look Once) to YOLOv8. In Algorithms in Advanced Artificial Intelligence; CRC Press: Boca Raton, FL, USA, 2024; pp. 257–266. [Google Scholar]
- Shi, P.; Yang, L.; Dong, X.; Qi, H.; Yang, A. Research Progress on Multi-Modal Fusion Object Detection Algorithms for Autonomous Driving: A Review. Comput. Mater. Contin. 2025, 83, 3877. [Google Scholar] [CrossRef]
- Liu, Z.; Cheng, J.; Fan, J.; Lin, S.; Wang, Y.; Zhao, X. Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection. IEEE Trans. Multimed. 2023, 27, 707–717. [Google Scholar] [CrossRef]
- Liu, Y.; Chen, Z.; Hu, C.; Li, S.E.; Zhang, X. Semantic-Guided Illumination-Aware Deformable Transformer for RGB-T Object Detection. IEEE Robot. Autom. Lett. 2025, 10, 11936–11943. [Google Scholar] [CrossRef]
- Qu, Y.; Kim, J. Efficient Multi-Task Training with Adaptive Feature Alignment for Universal Image Segmentation. Sensors 2025, 25, 359. [Google Scholar] [CrossRef]
- Ma, X.; Zhang, X.; Pun, M.O.; Liu, M. A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5403215. [Google Scholar] [CrossRef]
- Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. LocalMamba: Visual State Space Model with Windowed Selective Scan. arXiv 2024, arXiv:2403.09338. [Google Scholar] [CrossRef]
- Yang, C.; Chen, Z.; Espinosa, M.; Ericsson, L.; Wang, Z.; Liu, J.; Crowley, E.J. PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition. arXiv 2024, arXiv:2403.17695. [Google Scholar]
- Wang, Z.; Li, X.; Chen, Y.; Zhao, Y. Mamba YOLO: A Simple Baseline for Object Detection with State Space Model. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 5832–5840. [Google Scholar]
- Huang, L.; Zhang, W.; Liu, Y.; Chen, X. MambaODet: Efficient Mamba-Based Object Detection for Real-Time Applications. arXiv 2024, arXiv:2410.08923. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Chowdhury, A.; Jiang, Y.; Wang, X. Bandit-Based Attention Mechanism in Vision Transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, AZ, USA, 28 February–4 March 2025; pp. 3245–3254. [Google Scholar]
- DeAlcala, D.; Kim, S.; Lee, J. AttZoom: Attention Zoom for Better Visual Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Honolulu, HI, USA, 19–23 October 2025; pp. 1823–1832. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
- Dai, X.; Chen, Y.; Yang, J.; Zhang, P.; Yuan, L.; Zhang, L. Dynamic DETR: End-to-End Object Detection with Dynamic Attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2988–2997. [Google Scholar]
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
- Jocher, G. YOLOv5: A State-of-the-Art Real-Time Object Detection System; Zenodo: Geneva, Switzerland, 2021. [Google Scholar]
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
- Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
- Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
- Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768. [Google Scholar]
- Huang, S.; Lu, Z.; Cun, X.; Yu, Y.; Zhou, X.; Shen, X. DEIM: DETR with Improved Matching for Fast Convergence. arXiv 2024, arXiv:2412.04234. [Google Scholar] [CrossRef]
- Zong, Z.; Song, G.; Liu, Y. DETRs with Collaborative Hybrid Assignments Training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6748–6758. [Google Scholar]

| Method | Type | Multi-Modal | Real-Time | Small Object | Year |
|---|---|---|---|---|---|
| Faster R-CNN [37] | Two-stage CNN | No | No | Moderate | 2017 |
| YOLOv8 [48] | One-stage CNN | No | Yes | Moderate | 2023 |
| DETR [13] | Transformer | No | No | Poor | 2020 |
| RT-DETR [31] | Hybrid | No | Yes | Good | 2023 |
| VMamba [20] | SSM | No | Yes | Good | 2024 |
| AMSRDet (Ours) | SSM + Attention | Yes | Yes | Excellent | 2025 |
| Component | RT-DETR | AMSRDet (Ours) |
|---|---|---|
| Backbone | ResNet-50/101 (single-stream) | MobileMamba (dual-stream) |
| Input Modality | RGB only | RGB + IR |
| Encoder | AIFI + CCFM | AIFI + CMGF + SCAF |
| Cross-modal Fusion | None | CMGF module |
| Multi-scale Fusion | CCFM (fixed weights) | SCAF (adaptive weights) |
| Decoder | Standard cross-attention | Separable Dynamic Decoder |
| Complexity | Quadratic (global self-attention) | Linear (SS2D state-space scan) |
| Dataset | Images | Instances | Categories | Viewpoint |
|---|---|---|---|---|
| DroneVehicle | 56,878 | 389,779 | 5 | Aerial |
| Camera-vehicle | 12,483 | 87,421 | 4 | Ground-level |
| Type | Method | Modality | P (%) | R (%) | F1 (%) | mAP@0.5:0.95 (%) | mAP@0.5 (%) | FPS | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|---|
| CNN-based | Faster R-CNN | Early | 62.3 | 58.7 | 60.4 | 38.2 | 68.5 | 12.5 | 41.3 | 207.4 |
| CNN-based | RetinaNet | Early | 64.1 | 59.3 | 61.6 | 39.7 | 70.2 | 18.3 | 36.2 | 145.2 |
| CNN-based | ATSS | Early | 65.8 | 61.2 | 63.4 | 41.3 | 72.8 | 22.1 | 32.1 | 128.6 |
| CNN-based | YOLOv5 | Early | 68.2 | 63.5 | 65.8 | 42.1 | 74.3 | 58.7 | 46.5 | 109.3 |
| CNN-based | YOLOv7 | Early | 70.5 | 65.8 | 68.1 | 43.6 | 76.8 | 61.2 | 37.2 | 105.8 |
| CNN-based | YOLOv8 | Early | 72.1 | 67.3 | 69.6 | 44.2 | 78.2 | 64.5 | 43.6 | 165.3 |
| CNN-based | YOLOv9 | Early | 71.8 | 66.9 | 69.3 | 43.9 | 77.6 | 59.3 | 51.8 | 238.9 |
| CNN-based | YOLOv10 | Early | 72.8 | 68.1 | 70.4 | 44.5 | 78.9 | 67.2 | 29.4 | 98.7 |
| CNN-based | YOLOv11 | Early | 73.2 | 68.6 | 70.8 | 44.8 | 79.4 | 65.8 | 31.2 | 102.3 |
| CNN-based | YOLOv12 | Early | 73.6 | 69.2 | 71.3 | 45.1 | 79.8 | 63.4 | 33.8 | 108.5 |
| Transformer | DETR | Early | 68.5 | 62.3 | 65.3 | 41.8 | 73.6 | 15.6 | 41.5 | 186.4 |
| Transformer | Deformable DETR | Early | 71.3 | 66.2 | 68.7 | 43.2 | 76.2 | 24.3 | 40.1 | 173.2 |
| Transformer | RT-DETR | Early | 73.4 | 68.5 | 70.9 | 44.7 | 79.1 | 48.6 | 32.8 | 92.4 |
| Transformer | DINO | Early | 74.2 | 69.1 | 71.6 | 45.3 | 80.2 | 35.2 | 47.5 | 265.7 |
| Transformer | DEIM | Early | 73.8 | 69.3 | 71.5 | 45.0 | 79.6 | 52.7 | 35.6 | 98.3 |
| Transformer | Co-DETR | Early | 73.9 | 69.8 | 71.8 | 45.1 | 80.4 | 28.4 | 62.3 | 341.2 |
| Mamba-based | VMamba | Early | 71.6 | 67.4 | 69.4 | 43.7 | 77.1 | 52.3 | 44.2 | 156.8 |
| Mamba-based | Mamba-YOLO | Early | 72.4 | 68.2 | 70.2 | 44.3 | 78.4 | 61.5 | 38.7 | 132.4 |
| Mamba-based | MambaODet | Early | 72.9 | 68.7 | 70.7 | 44.6 | 78.8 | 58.9 | 41.3 | 145.6 |
| Ours | AMSRDet | Ours | 75.6 | 71.2 | 73.3 | 45.8 | 81.2 | 68.3 | 28.6 | 47.2 |
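As a quick consistency check, the headline efficiency claims from the introduction can be recomputed directly from the RT-DETR and AMSRDet rows of the table above; the dictionary names below are just for readability.

```python
# Values taken from the RT-DETR and AMSRDet rows of the comparison table.
rtdetr = {"map": 44.7, "fps": 48.6, "flops": 92.4}
ours = {"map": 45.8, "fps": 68.3, "flops": 47.2}

# mAP@0.5:0.95 gain in percentage points.
map_gain_pp = round(ours["map"] - rtdetr["map"], 1)
# Inference speedup as a ratio of frames per second.
speedup = round(ours["fps"] / rtdetr["fps"], 2)
# Relative FLOPs reduction in percent.
flops_cut_pct = round((rtdetr["flops"] - ours["flops"]) / rtdetr["flops"] * 100, 1)

print(map_gain_pp, speedup, flops_cut_pct)  # 1.1 1.41 48.9
```

The recomputed values (1.1 pp, 1.41×, 48.9%) match the figures stated in the contribution bullets, so the headline claims and the table are internally consistent.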
| Method | P (%) | R (%) | F1 (%) | mAP (%) |
|---|---|---|---|---|
| YOLOv5 | 58.3 | 52.1 | 55.0 | 43.2 |
| YOLOv7 | 61.2 | 54.8 | 57.8 | 45.7 |
| YOLOv8 | 63.5 | 57.2 | 60.2 | 47.3 |
| RT-DETR | 64.8 | 58.9 | 61.7 | 48.6 |
| DINO | 66.1 | 60.3 | 63.1 | 49.8 |
| Co-DETR | 65.7 | 59.7 | 62.6 | 49.2 |
| AMSRDet (Ours) | 68.4 | 62.8 | 65.5 | 52.3 |
| MobileMamba | CMGF | SCAF | Sep. Decoder | mAP (%) | FPS | FLOPs (G) |
|---|---|---|---|---|---|---|
| RT-DETR baseline (RGB only) | | | | | | |
| | | | | 44.7 | 48.6 | 92.4 |
| Our modifications (RGB + IR dual-stream) | | | | | | |
| | | | | 39.2 | 72.5 | 38.4 |
| ✓ | | | | 41.8 | 71.3 | 42.1 |
| ✓ | ✓ | | | 43.5 | 69.8 | 44.6 |
| ✓ | ✓ | ✓ | | 44.7 | 68.9 | 46.3 |
| ✓ | ✓ | ✓ | ✓ | 45.8 | 68.3 | 47.2 |
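Each module's incremental contribution can be read off the ablation rows above by differencing consecutive mAP@0.5:0.95 values; the stage labels below are shorthand for the cumulative component combinations in the table.

```python
# Cumulative mAP@0.5:0.95 from the ablation table: dual-stream base,
# then +MobileMamba, +CMGF, +SCAF, +Separable Decoder in turn.
stages = ["+MobileMamba", "+CMGF", "+SCAF", "+Sep. Decoder"]
map_values = [39.2, 41.8, 43.5, 44.7, 45.8]

# Per-module gain = difference between consecutive rows.
gains = [round(b - a, 1) for a, b in zip(map_values, map_values[1:])]
total = round(map_values[-1] - map_values[0], 1)

print(dict(zip(stages, gains)), "total:", total)
# {'+MobileMamba': 2.6, '+CMGF': 1.7, '+SCAF': 1.2, '+Sep. Decoder': 1.1} total: 6.6
```

The differences show a diminishing but consistent contribution from each component, with the full stack recovering the RT-DETR baseline's 44.7% mAP by the third addition and surpassing it at 45.8%.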
| Modality | mAP (%) | Day mAP (%) | Night mAP (%) |
|---|---|---|---|
| RGB only | 42.3 | 44.8 | 35.2 |
| IR only | 40.1 | 38.7 | 43.9 |
| RGB + IR (Early fusion) | 43.7 | 45.9 | 39.8 |
| RGB + IR (Late fusion) | 44.2 | 46.3 | 40.6 |
| RGB + IR (AMSRDet) | 45.8 | 47.5 | 44.2 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Yan, Z.; Li, Y. AMSRDet: An Adaptive Multi-Scale UAV Infrared-Visible Remote Sensing Vehicle Detection Network. Sensors 2026, 26, 817. https://doi.org/10.3390/s26030817