DCAM-DETR: Dual Cross-Attention Mamba Detection Transformer for RGB–Infrared Anti-UAV Detection
Abstract
1. Introduction
- We propose DCAM-DETR, a multimodal detection framework that integrates Mamba-based state space models with the RT-DETR architecture for efficient anti-UAV detection, achieving linear computational complexity while retaining robust global context modeling through selective scan mechanisms (a minimal scan sketch follows this list).
- We design Cross-Dimensional Attention (CDA) and Cross-Path Attention (CPA) modules that explicitly capture intermodal correlations across spatial and channel dimensions through parallel attention streams, enabling fine-grained multimodal feature alignment and complementary information extraction.
- We introduce an Adaptive Feature Fusion Module (AFFM) that dynamically weights the two modalities through scene-adaptive gating, and a Dual-Attention Decoupling Module (DADM) that strengthens the detection head through hierarchical dilated convolutions and attention decomposition (an illustrative fusion sketch also follows this list).
- Comprehensive experiments on the Anti-UAV300, FLIR-ADAS, and KAIST datasets demonstrate that DCAM-DETR achieves state-of-the-art performance with 94.7% mAP@0.5 on Anti-UAV300, outperforming existing methods by substantial margins while maintaining a real-time inference speed of 42 FPS.
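To make the linear-complexity selective-scan claim concrete, the following is a minimal PyTorch sketch of an S6-style selective scan recurrence. The tensor shapes, parameter names, and the explicit sequential loop are assumptions chosen for clarity; they do not reproduce the paper's optimized SS2D/S6 blocks (Section 3.2), which rely on a hardware-aware parallel scan.

```python
# Minimal, illustrative S6-style selective scan in PyTorch (assumed shapes/names).
import torch

def selective_scan(x, A, B, C, delta):
    """Sequential selective scan over a 1D token sequence.

    x:     (batch, length, d_model)   input tokens
    A:     (d_model, d_state)         state matrix (kept negative in practice)
    B, C:  (batch, length, d_state)   input-dependent projections
    delta: (batch, length, d_model)   input-dependent step sizes
    Returns y of shape (batch, length, d_model).
    """
    bsz, length, d_model = x.shape
    h = x.new_zeros(bsz, d_model, A.shape[-1])         # hidden state
    ys = []
    for t in range(length):                            # O(length) recurrence
        dt = delta[:, t].unsqueeze(-1)                 # (bsz, d_model, 1)
        A_bar = torch.exp(dt * A)                      # discretized transition
        B_bar = dt * B[:, t].unsqueeze(1)              # (bsz, d_model, d_state)
        h = A_bar * h + B_bar * x[:, t].unsqueeze(-1)  # selective state update
        ys.append((h * C[:, t].unsqueeze(1)).sum(-1))  # read-out, (bsz, d_model)
    return torch.stack(ys, dim=1)
```

Because delta, B, and C are functions of the input, the state update is content-selective while the cost still grows only linearly with sequence length.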
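In the same spirit, the sketch below illustrates the general pattern described for the cross-attention and adaptive fusion contributions: parallel channel- and spatial-attention streams over the RGB and infrared features, followed by a scene-adaptive gate. The module name, layer sizes, and exact operations are assumptions for illustration only; the actual CDA, CPA, AFFM, and DADM definitions are given in Sections 3.3–3.5.

```python
# Illustrative cross-modal fusion sketch (assumed structure, not the paper's modules).
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Channel stream: squeeze-and-excitation over the concatenated modalities.
        self.channel_mlp = nn.Sequential(
            nn.Linear(2 * channels, channels // 4), nn.ReLU(inplace=True),
            nn.Linear(channels // 4, 2 * channels), nn.Sigmoid())
        # Spatial stream: 7x7 conv over pooled descriptors of both modalities.
        self.spatial_conv = nn.Conv2d(4, 1, kernel_size=7, padding=3)
        # Scene-adaptive gate producing a per-pixel RGB/IR mixing weight.
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, 1, 1), nn.Sigmoid())

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, ir], dim=1)                        # (B, 2C, H, W)
        # Channel attention re-weights both modalities' channels jointly.
        w = self.channel_mlp(x.mean(dim=(2, 3)))               # (B, 2C)
        x = x * w[:, :, None, None]
        # Spatial attention from mean/max maps of both modalities.
        r, i = x.chunk(2, dim=1)
        desc = torch.cat([r.mean(1, keepdim=True), r.amax(1, keepdim=True),
                          i.mean(1, keepdim=True), i.amax(1, keepdim=True)], dim=1)
        s = torch.sigmoid(self.spatial_conv(desc))             # (B, 1, H, W)
        r, i = r * s, i * s
        # Adaptive gate decides, per location, how much to trust each modality.
        g = self.gate(torch.cat([r, i], dim=1))                # (B, 1, H, W)
        return g * r + (1.0 - g) * i                           # fused (B, C, H, W)
```

The sigmoid gate acts as a soft per-location modality selector, so the fused map can lean on infrared features at night and on RGB features in daylight.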
2. Related Work
2.1. Vision-Based UAV Detection
2.2. Multimodal Fusion for Object Detection
2.3. State Space Models and Mamba
2.4. Detection Transformers
3. Methodology
3.1. Overall Architecture
3.2. MobileMamba Backbone
3.2.1. Selective State Space Model Formulation
3.2.2. SS2D Block for 2D Visual Processing
3.2.3. S6 Block Architecture
3.3. Cross-Dimensional Attention Modules
3.3.1. Cross-Dimensional Attention (CDA)
3.3.2. Cross-Path Attention (CPA)
3.3.3. Why Cross-Dimensional Attention Helps for Small UAVs
3.4. Adaptive Feature Fusion Module (AFFM)
3.5. Dual-Attention Decoupling Module (DADM)
3.6. Loss Function
4. Experiments
4.1. Datasets
4.2. Implementation Details
4.3. Evaluation Metrics
- mAP@0.5: AP computed at an IoU threshold of 0.5, a lenient criterion that mainly measures detection capability.
- mAP@0.5:0.95: AP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which additionally requires precise localization (see the sketch below).
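A minimal sketch of the threshold averaging behind mAP@0.5:0.95 is given below; `average_precision_at` is a hypothetical helper that stands in for the full per-class precision–recall computation.

```python
# COCO-style averaging over IoU thresholds (illustrative; helper is hypothetical).
import numpy as np

def map_50_95(average_precision_at) -> float:
    """Average AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = np.linspace(0.50, 0.95, 10)
    return float(np.mean([average_precision_at(t) for t in thresholds]))

# mAP@0.5 corresponds to average_precision_at(0.50) alone, averaged over classes.
```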
4.4. Comparison with State-of-the-Art Methods
4.5. Qualitative Results
4.6. Performance Under Different Environmental Conditions
4.7. Ablation Studies
4.7.1. Component-Wise Ablation
4.7.2. Modality Ablation
4.7.3. Backbone Architecture Comparison
4.7.4. Fusion Strategy Comparison
4.7.5. DADM Component Analysis
4.7.6. Performance by Target Size and Scene Condition
4.7.7. Per-Class AP Analysis
4.8. Attention Visualization
4.9. Cross-Dataset Evaluation
4.10. Computational Efficiency Analysis
4.11. Edge Device Deployment
4.12. Failure Case Analysis
5. Discussion
Limitations and Societal Impact
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Shakhatreh, H.; Sawalmeh, A.H.; Al-Fuqaha, A.; Dou, Z.; Almaita, E.; Khalil, I.; Othman, N.S.; Khreishah, A.; Guizani, M. Unmanned aerial vehicles (UAVs): A survey on civil applications and key research challenges. IEEE Access 2019, 7, 48572–48634. [Google Scholar] [CrossRef]
- Mohsan, S.A.H.; Khan, M.A.; Noor, F.; Ullah, I.; Alsharif, M.H. Towards the unmanned aerial vehicles (UAVs): A comprehensive review. Drones 2022, 6, 147. [Google Scholar] [CrossRef]
- Güvenç, İ.; Koohifar, F.; Singh, S.; Sichitiu, M.L.; Matolak, D. Detection, tracking, and interdiction for amateur drones. IEEE Commun. Mag. 2018, 56, 75–81. [Google Scholar] [CrossRef]
- Shi, X.; Yang, C.; Xie, W.; Liang, C.; Shi, Z.; Chen, J. Anti-drone system with multiple surveillance technologies: Architecture, implementation, and challenges. IEEE Commun. Mag. 2018, 56, 68–74. [Google Scholar] [CrossRef]
- Federal Aviation Administration. FAA Unmanned Aircraft Systems (UAS) Traffic Management. 2024. Available online: https://www.faa.gov/uas (accessed on 15 October 2025).
- Wu, X.; Dong, J.; Bao, W.; Zou, B.; Wang, L.; Wang, H. Augmented intelligence of things for emergency vehicle secure trajectory prediction and task offloading. IEEE Internet Things J. 2024, 11, 36030–36043. [Google Scholar] [CrossRef]
- Ezuma, M.; Erden, F.; Anjinappa, C.K.; Ozdemir, O.; Güvenç, İ. Radar cross section based statistical recognition of UAVs at microwave frequencies. IEEE Trans. Aerosp. Electron. Syst. 2020, 58, 27–46. [Google Scholar] [CrossRef]
- Coluccia, A.; Fascista, A.; Schumann, A.; Sommer, L.; Ghenescu, M.; Piatrik, T.; De Cubber, G.; Nalamati, M.; Kapoor, A.; Saqib, M.; et al. Drone-vs-bird detection challenge at IEEE AVSS2019. arXiv 2019, arXiv:1910.07360. [Google Scholar]
- Jiang, N.; Wang, K.; Peng, X.; Yu, X.; Wang, Q.; Xing, J.; Li, G.; Zhao, J.; Guo, Z.; Han, Z.; et al. Anti-UAV: A large-scale benchmark for vision-based UAV tracking. IEEE Trans. Multimed. 2022, 25, 486–500. [Google Scholar] [CrossRef]
- Huang, B.; Chen, J.; Xu, T.; Wang, Y.; Jiang, S.; Wang, Y.; Wang, L.; Li, J. Anti-UAV410: A thermal infrared benchmark and customized scheme for tracking drones in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2852–2865. [Google Scholar] [CrossRef]
- Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar] [CrossRef]
- Rozantsev, A.; Lepetit, V.; Fua, P. Detecting flying objects using a single moving camera. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 879–892. [Google Scholar] [CrossRef]
- Li, J.; Fan, C.; Ou, C.; Zhang, H. Infrared and Visible Image Fusion Techniques for UAVs: A Comprehensive Review. Drones 2025, 9, 811. [Google Scholar] [CrossRef]
- Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5802–5811. [Google Scholar] [CrossRef]
- Chen, Y.T.; Shi, J.; Ye, Z.; Mertz, C.; Ramanan, D.; Kong, S. Multimodal object detection via probabilistic ensembling. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2022. [Google Scholar] [CrossRef]
- Wang, S.; Wang, C.; Shi, C.; Liu, Y.; Lu, M. Mask-guided mamba fusion for drone-based visible-infrared vehicle detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5005712. [Google Scholar] [CrossRef]
- Li, H.; Wu, X.J. DenseFuse: A fusion approach to infrared and visible images. IEEE Trans. Image Process. 2019, 28, 2614–2623. [Google Scholar] [CrossRef] [PubMed]
- Zhang, J.; Liu, H.; Yang, K.; Hu, X.; Liu, R.; Stiefelhagen, R. CMX: Cross-modal fusion for RGB-X semantic segmentation with transformers. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14679–14694. [Google Scholar] [CrossRef]
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar] [CrossRef]
- Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
- El Ahmar, W.; Massoud, Y.; Kolhatkar, D.; AlGhamdi, H.; Alja’Afreh, M.; Hammoud, R.; Laganiere, R. Enhanced thermal-RGB fusion for robust object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 365–374. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar] [CrossRef]
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar] [CrossRef]
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar] [CrossRef]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar] [CrossRef]
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2024, arXiv:2312.00752. [Google Scholar]
- Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
- Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
- Wang, G.; Song, M.; Hwang, J.N. Recent Advances in Embedding Methods for Multi-Object Tracking: A Survey. arXiv 2022, arXiv:2205.10766. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
- Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; NanoCode012; Kwon, Y.; Michael, K.; TaoXie; Fang, J.; imyhxy; et al. YOLOv5 by Ultralytics. GitHub Repos. 2022. Available online: https://github.com/ultralytics/yolov5 (accessed on 15 October 2025).
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
- Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
- Zhou, L.; Liu, Z.; Zhao, H.; Hou, Y.E.; Liu, Y.; Zuo, X.; Dang, L. A multi-scale object detector based on coordinate and global information aggregation for UAV aerial images. Remote Sens. 2023, 15, 3468. [Google Scholar] [CrossRef]
- Zhang, X.; Ye, P.; Xiao, G. Image fusion meets deep learning: A survey and perspective. Inf. Fusion 2021, 76, 323–336. [Google Scholar] [CrossRef]
- Feng, D.; Haase-Schütz, C.; Rosenbaum, L.; Hertlein, H.; Glaeser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1341–1360. [Google Scholar] [CrossRef]
- Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 502–518. [Google Scholar] [CrossRef] [PubMed]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar] [CrossRef]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
- Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransFuse: Fusing transformers and CNNs for medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021; pp. 14–24. [Google Scholar] [CrossRef]
- Yang, P.; Gao, J.; Chen, W. Multimodal learning with transformers: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15612–15631. [Google Scholar] [CrossRef]
- Dong, A.; Wang, L.; Liu, J.; Xu, J.; Zhao, G.; Zhai, Y.; Lv, G.; Cheng, J. Co-enhancement of multi-modality image fusion and object detection via feature adaptation. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 12624–12637. [Google Scholar] [CrossRef]
- Peng, S.; Zhu, X.; Cao, X.; Deng, C. FusionMamba: Efficient Remote Sensing Image Fusion with State Space Model. arXiv 2024, arXiv:2404.07932. [Google Scholar] [CrossRef]
- Cai, Q.; Pan, Y.; Yao, T.; Ngo, C.W.; Mei, T. Objectfusion: Multi-modal 3d object detection with object-centric fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 18067–18076. [Google Scholar]
- Wang, X.; Wang, S.; Ding, Y.; Li, Y.; Wu, W.; Rong, Y.; Kong, W.; Huang, J.; Li, S.; Yang, H.; et al. State space model for new-generation network alternative to transformers: A survey. arXiv 2024, arXiv:2404.09516. [Google Scholar] [CrossRef]
- Gupta, A.; Gu, A.; Berant, J. Diagonal state spaces are as effective as structured state spaces. Adv. Neural Inf. Process. Syst. 2022, 35, 22982–22994. [Google Scholar]
- Zhao, H.; Yan, L.; Hou, Z.; Lin, J.; Zhao, Y.; Ji, Z.; Wang, Y. Error Analysis Strategy for Long-term Correlated Network Systems: Generalized Nonlinear Stochastic Processes and Dual-Layer Filtering Architecture. IEEE Internet Things J. 2025, 12, 33731–33745. [Google Scholar] [CrossRef]
- Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. LocalMamba: Visual state space model with windowed selective scan. arXiv 2024, arXiv:2403.09338. [Google Scholar] [CrossRef]
- Hatamizadeh, A.; Kautz, J. MambaVision: A hybrid Mamba-Transformer vision backbone. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025; pp. 25261–25270. [Google Scholar]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar] [CrossRef]
- Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3651–3660. [Google Scholar] [CrossRef]
- Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv 2022, arXiv:2201.12329. [Google Scholar] [CrossRef]
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar] [CrossRef]
- FLIR Systems. FLIR Thermal Dataset for Algorithm Training. 2019. Available online: https://www.flir.com/oem/adas/adas-dataset-form/ (accessed on 15 October 2025).
- Hwang, S.; Park, J.; Kim, N.; Choi, Y.; Kweon, I.S. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1037–1045. [Google Scholar] [CrossRef]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar] [CrossRef]
- Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar] [CrossRef]
- Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar] [CrossRef]
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. GitHub Repos. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 15 October 2025).
- Feng, Y.; Luo, E.; Lu, H.; Zhai, S. Cross-modality feature fusion for night pedestrian detection. Front. Phys. 2024, 12, 1356248. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
| Method | Application | Modality | Fusion Strategy | Key Innovation |
|---|---|---|---|---|
| Mamba-UNet | Medical Seg. | Single | N/A | U-Net with SSM |
| VMamba | Classification | Single | N/A | Cross-scan 2D |
| Vision Mamba | General Vision | Single | N/A | Bidirectional SSM |
| FusionMamba | Image Fusion | Multi | Early | Dual-stream SSM |
| DCAM-DETR | Anti-UAV Det. | RGB + IR | Adaptive | CDA/CPA + AFFM |
| Method | Modality | mAP@0.5 | mAP@0.5:0.95 | FPS | Params(M) |
|---|---|---|---|---|---|
| Single-Modality Methods (RGB) | |||||
| YOLOv5-L † [35] | RGB | 78.3 | 54.2 | 68 | 46.5 |
| YOLOv7 † [37] | RGB | 81.5 | 57.8 | 52 | 37.2 |
| YOLOv8-L † [66] | RGB | 82.8 | 58.9 | 58 | 43.7 |
| YOLOv9-C † [38] | RGB | 83.2 | 59.4 | 48 | 51.8 |
| YOLOv10-L † [39] | RGB | 84.1 | 60.7 | 55 | 44.3 |
| RT-DETR-L † [26] | RGB | 85.7 | 62.1 | 45 | 42.3 |
| Single-Modality Methods (Infrared) | |||||
| YOLOv5-L † [35] | IR | 74.6 | 51.3 | 68 | 46.5 |
| YOLOv7 † [37] | IR | 77.8 | 53.9 | 52 | 37.2 |
| YOLOv8-L † [66] | IR | 79.4 | 55.6 | 58 | 43.7 |
| RT-DETR-L † [26] | IR | 82.4 | 58.7 | 45 | 42.3 |
| Multimodal Fusion Methods | |||||
| DenseFuse † [17] | RGB + IR | 84.9 | 60.3 | 35 | 52.1 |
| U2Fusion † [44] | RGB + IR | 86.2 | 61.8 | 32 | 48.7 |
| TarDAL † [14] | RGB + IR | 88.5 | 64.2 | 28 | 55.3 |
| M3FD † [14] | RGB + IR | 90.3 | 67.5 | 38 | 49.8 |
| CFT † [67] | RGB + IR | 91.2 | 69.8 | 31 | 58.2 |
| TransFuse † [47] | RGB + IR | 92.1 | 71.2 | 25 | 63.4 |
| Recent State-of-the-Art (CVPR/ICCV/ECCV 2023-2024) | |||||
| MBNet † (ICCV’23) | RGB + IR | 90.5 | 68.9 | 32 | 51.2 |
| BAANet † (ICCV’23) | RGB + IR | 91.2 | 69.8 | 30 | 54.6 |
| CAT-Det † (CVPR’24) | RGB + IR | 91.8 | 70.4 | 28 | 56.8 |
| SuperYOLO † (ECCV’24) | RGB + IR | 92.3 | 72.1 | 35 | 48.9 |
| CIAN † (CVPR’24) | RGB + IR | 92.8 | 73.5 | 26 | 61.3 |
| DCAM-DETR (Ours) | RGB + IR | 94.7 | 78.3 | 42 | 47.6 |
| Configuration | mAP@0.5 | mAP@0.5:0.95 | FPS | Params(M) |
|---|---|---|---|---|
| Baseline (RT-DETR + Concat) | 87.2 | 63.5 | 44 | 45.1 |
| + MobileMamba Backbone | 90.3 | 68.7 | 43 | 46.8 |
| + CDA & CPA | 92.5 | 73.1 | 42 | 47.2 |
| + AFFM | 93.8 | 76.4 | 42 | 47.5 |
| + DADM (Full Model) | 94.7 | 78.3 | 42 | 47.6 |
| Configuration | mAP@0.5 | mAP@0.5:0.95 | FPS | Params(M) |
|---|---|---|---|---|
| RGB-only (Standard) | 85.7 | 62.1 | 45 | 42.3 |
| RGB-only (2× Capacity) | 87.3 | 64.1 | 38 | 85.2 |
| IR-only (Standard) | 82.4 | 58.7 | 45 | 42.3 |
| IR-only (2× Capacity) | 84.6 | 60.8 | 38 | 85.2 |
| DCAM-DETR (RGB + IR) | 94.7 | 78.3 | 42 | 47.6 |
| Backbone | mAP@0.5 | mAP@0.5:0.95 | FPS | Params(M) |
|---|---|---|---|---|
| ResNet-50 [68] | 88.4 | 64.8 | 48 | 44.2 |
| ResNet-101 [68] | 89.1 | 66.2 | 42 | 63.1 |
| Swin-T [24] | 90.8 | 69.4 | 35 | 48.6 |
| Swin-S [24] | 91.5 | 71.2 | 28 | 69.3 |
| VMamba-T [30] | 92.3 | 73.8 | 40 | 46.2 |
| MobileMamba (Ours) | 94.7 | 78.3 | 42 | 47.6 |
| Fusion Strategy | mAP@0.5 | mAP@0.5:0.95 | Params(M) |
|---|---|---|---|
| Early Fusion (Concat) | 87.2 | 63.5 | 45.1 |
| Late Fusion (Ensemble) | 88.6 | 65.4 | 90.2 |
| Attention Fusion [45] | 90.4 | 68.9 | 46.3 |
| Vanilla Cross-Attention | 90.8 | 70.2 | 48.5 |
| Cross-Attention [47] | 91.8 | 71.5 | 52.8 |
| Simple Weighted Average | 89.4 | 67.3 | 45.8 |
| CDA + CPA + AFFM (Ours) | 94.7 | 78.3 | 47.6 |
| Detection Head Configuration | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|
| Standard Head | 92.5 | 73.1 |
| + Dilated Convolutions only | 93.2 | 74.8 |
| + Dual-Attention Module (DAM) | 94.1 | 76.9 |
| + Full DADM (Ours) | 94.7 | 78.3 |
| Category | DCAM-DETR | RT-DETR (RGB) | Improvement |
|---|---|---|---|
| By Target Size (mAP@0.5) | |||
| Small (<32 × 32 pixels) | 89.4 | 71.2 | +18.2 |
| Medium (32 × 32–96 × 96 pixels) | 95.8 | 86.4 | +9.4 |
| Large (>96 × 96 pixels) | 97.3 | 93.1 | +4.2 |
| By Scene Condition (mAP@0.5) | |||
| Daytime Clear | 96.2 | 91.2 | +5.0 |
| Daytime Cloudy | 95.4 | 88.7 | +6.7 |
| Dusk/Dawn | 94.1 | 82.3 | +11.8 |
| Nighttime Clear | 93.8 | 68.5 | +25.3 |
| Nighttime Foggy | 91.2 | 58.4 | +32.8 |
| UAV Type | Count | AP@0.5 | AP@0.5:0.95 |
|---|---|---|---|
| Quadcopter | 156 | 95.2 | 79.1 |
| Fixed-wing | 78 | 93.8 | 76.8 |
| Helicopter | 42 | 94.1 | 77.5 |
| Micro UAV | 24 | 91.3 | 72.4 |
| Overall | 300 | 94.7 | 78.3 |
| Method | FLIR-ADAS mAP@0.5 | KAIST mAP@0.5 | Average |
|---|---|---|---|
| RT-DETR [26] | 72.3 | 68.7 | 70.5 |
| M3FD [14] | 75.8 | 71.2 | 73.5 |
| CFT [67] | 77.2 | 73.8 | 75.5 |
| TransFuse [47] | 78.4 | 74.6 | 76.5 |
| DCAM-DETR (Ours) | 81.2 | 77.9 | 79.6 |
| Method | mAP@0.5 | FPS | Params(M) | FLOPs(G) | Memory(GB) |
|---|---|---|---|---|---|
| YOLOv8-L [66] | 82.8 | 58 | 43.7 | 165.2 | 4.2 |
| RT-DETR-L [26] | 85.7 | 45 | 42.3 | 136.8 | 5.1 |
| TransFuse [47] | 92.1 | 25 | 63.4 | 248.6 | 8.7 |
| DCAM-DETR (Ours) | 94.7 | 42 | 47.6 | 142.3 | 5.8 |
| Configuration | mAP@0.5 | FPS | Power (W) |
|---|---|---|---|
| DCAM-DETR (FP32) | 94.7 | 16 | 55 |
| DCAM-DETR (FP16) | 94.5 | 24 | 48 |
| DCAM-DETR (INT8) | 93.2 | 31 | 44 |
| DCAM-DETR (Pruned 50%) | 92.1 | 28 | 46 |
| YOLOv8-L (INT8) | 81.2 | 52 | 38 |
| RT-DETR-L (INT8) | 84.1 | 35 | 42 |
| Failure Category | mAP@0.5 | Failure Rate |
|---|---|---|
| Extreme Occlusion (>70%) | 67.3 | 18.2% |
| Severe Motion Blur | 72.1 | 12.4% |
| Thermal Camouflage | 69.8 | 8.7% |
| Very Small Targets (<10 pixels) | 58.4 | 15.3% |
| CDA Spatial Misalignment | - | 12.8% |
| CPA Channel Confusion | - | 15.4% |
| AFFM Wrong Modality Weight | - | 10.9% |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

