Research on Multi-Modal Fusion Detection Method for Low-Slow-Small UAVs Based on Deep Learning
Highlights
- This study proposes a novel hierarchical visible-infrared fusion framework that integrates feature-level fusion with an Environment-Aware Dynamic Weighting (EADW) mechanism and decision-level fusion with Dempster–Shafer (D-S) evidence theory for uncertainty management.
- The proposed framework delivers significant improvements in detection capability and system robustness for Low-Slow-Small (LSS) UAV clusters in complex environments, particularly under challenging conditions such as nighttime and haze.
- The work provides an efficient and reliable technical solution for LSS-UAV cluster detection, which is critical for enhancing low-altitude security systems.
- The success of the EADW-DS fusion architecture offers a new paradigm for multi-modal information fusion, highlighting the importance of combining adaptive feature fusion with uncertainty-aware decision fusion.
Abstract
1. Introduction
2. Deep Learning-Based Multi-Modal Feature Extraction
2.1. Relevant Deep Learning Theories and Methods
2.2. Photoelectric Target Feature Extraction Based on Deep Learning
2.2.1. Deep Learning Network Design
2.2.2. Dense Target Decoupling Strategy
2.3. Infrared Target Feature Extraction Based on Deep Learning
2.3.1. Deep Learning Network Design
2.3.2. Multi-Scale Detection and Group Target Decoupling
3. Multi-Modal Fusion Architecture Design
3.1. Feature Fusion Layer Design
3.1.1. Dynamic Environment Awareness
3.1.2. Dynamic Weight Generation
3.1.3. Feature Alignment and Fusion
3.2. Decision Fusion Layer Design
3.2.1. Evidence Space Modeling
3.2.2. Basic Probability Assignment (BPA) Construction
3.2.3. D-S Evidence Combination Rule
3.2.4. Conflict-Adaptive Handling
3.2.5. Decision Generation and Uncertainty Management
4. Experiments and Results Analysis
4.1. Experimental Setup
4.2. Overall Performance Comparison
4.3. Environmental Adaptability Analysis
4.4. Ablation Study
5. Conclusions and Outlook
- (1) Dedicated deep feature extraction networks tailored to the visible and infrared modalities were designed: an improved ResNet-50 with DPAM for the visible branch and a dual-branch ConvNeXt-Tiny with UNet for the infrared branch. These networks address blurred texture features, complex scene interference, and varying target scales in visible and infrared images, effectively enhancing the feature discrimination capability of each single modality under complex interference.
- (2) An innovative EADW-DS hierarchical feature-level and decision-level fusion framework was proposed. The EADW mechanism dynamically adjusts fusion weights by perceiving real-time environmental states (illumination, weather, time) and single-modal confidences, significantly improving the environmental adaptability of feature fusion. D-S evidence theory then fuses the feature fusion layer’s decision with the independent single-modal decisions; by quantifying uncertainty and introducing a Conflict-Adaptive (CA) handling mechanism, it effectively resolves issues of missing information interaction and decision conflicts, substantially enhancing overall system robustness. Illustrative sketches of both fusion stages follow the outlook items below.
- (1) Incorporating Additional Modalities: Introduce other modalities such as radar, RF (Radio Frequency), and acoustic signals to explore detection methods for LSS targets under more modalities and complex scene interference conditions. The proposed hierarchical fusion architecture is designed to be extensible for integrating more than two modalities.
- (2) Enhancing Dynamic Weighting: Refine the dynamic weighting mechanism within EADW by incorporating reinforcement learning or online learning techniques, enabling more intelligent and fine-grained adjustment of modality contributions based on real-time scene changes (e.g., sudden interference, target maneuvers). This could further improve the system’s autonomy and long-term adaptation in dynamic environments.
- (3) Robustness to Adversarial Attacks: Investigate the vulnerability of the proposed multi-modal system to adversarial attacks and develop corresponding defense mechanisms [35], enhancing the security and reliability of the detection system in potentially contested environments.
- (4) Theoretical Analysis: Conduct deeper theoretical analysis of the relationship between environmental parameters, fusion weights, and final performance to provide stronger theoretical foundations for the EADW mechanism.
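To make the framework in contribution (2) concrete, the sketch below illustrates the feature-level (EADW) stage in PyTorch. It is a minimal sketch under stated assumptions, not the authors’ implementation: stock ResNet-50 and ConvNeXt-Tiny backbones from torchvision stand in for the paper’s improved extractors (the DPAM module and UNet branch are omitted), the environment descriptor is reduced to three scalar cues (illumination, haze, night), and the `EADWFusion` module, its feature dimensions, and the small weight-generation MLP are hypothetical.

```python
# Hedged sketch of the feature-level (EADW) fusion stage. Backbones, tensor
# shapes, and the environment descriptor are illustrative assumptions; the
# paper's improved ResNet-50 (with DPAM) and dual-branch ConvNeXt-Tiny/UNet
# extractors are not reproduced here.
import torch
import torch.nn as nn
from torchvision.models import convnext_tiny, resnet50


class EADWFusion(nn.Module):
    def __init__(self, feat_dim=256, env_dim=3):
        super().__init__()
        # Stand-in single-modality feature extractors (truncated stock backbones).
        vis = resnet50(weights=None)
        self.vis_backbone = nn.Sequential(*list(vis.children())[:-2])  # -> (B, 2048, h, w)
        self.ir_backbone = convnext_tiny(weights=None).features        # -> (B, 768, h, w)
        # Project both modalities into a shared space (feature alignment).
        self.vis_proj = nn.Conv2d(2048, feat_dim, kernel_size=1)
        self.ir_proj = nn.Conv2d(768, feat_dim, kernel_size=1)
        # EADW: environment descriptor + two single-modal confidences -> softmax weights.
        self.weight_net = nn.Sequential(
            nn.Linear(env_dim + 2, 32), nn.ReLU(), nn.Linear(32, 2)
        )

    def forward(self, vis_img, ir_img, env_state, confidences):
        f_vis = self.vis_proj(self.vis_backbone(vis_img))
        f_ir = self.ir_proj(self.ir_backbone(ir_img))
        # Spatially align the infrared map to the visible map if sizes differ.
        f_ir = nn.functional.interpolate(
            f_ir, size=f_vis.shape[-2:], mode="bilinear", align_corners=False
        )
        # Dynamic weights conditioned on the environmental cues (illumination,
        # weather, time surrogates) and the two single-modal confidences.
        w = torch.softmax(
            self.weight_net(torch.cat([env_state, confidences], dim=1)), dim=1
        )
        fused = w[:, 0, None, None, None] * f_vis + w[:, 1, None, None, None] * f_ir
        return fused, w


if __name__ == "__main__":
    model = EADWFusion()
    vis = torch.randn(2, 3, 256, 256)
    ir = torch.randn(2, 3, 256, 256)  # IR replicated to 3 channels for the stock backbone
    env = torch.tensor([[0.9, 0.1, 0.0], [0.1, 0.8, 1.0]])  # illumination, haze, night cues
    conf = torch.tensor([[0.85, 0.60], [0.40, 0.90]])       # visible / infrared confidences
    fused, weights = model(vis, ir, env, conf)
    print(fused.shape, weights)
```

The point mirrored here is that the fusion weights are produced per sample from the environmental state and the single-modal confidences, so the visible branch can dominate in clear daylight while the infrared branch takes over at night or in haze.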
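The decision-level stage can likewise be illustrated with a small numerical sketch: Dempster’s combination rule over the frame {target, background} with an explicit uncertainty mass, plus a conflict-adaptive fallback that switches to simple evidence averaging (in the spirit of Murphy [29]) when the conflict coefficient K grows large. The BPA values, the 0.8 conflict threshold, and the averaging fallback are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of the decision-level (D-S + conflict-adaptive) fusion stage.
# The BPA construction, the conflict threshold, and the averaging fallback are
# illustrative assumptions only.

def dempster_combine(m1, m2):
    """Combine two BPAs over {target, background} given as dicts with keys
    'T', 'B', and 'Theta' (the uncertainty mass assigned to the whole frame)."""
    # Conflict coefficient K: mass assigned to contradictory hypotheses.
    k = m1["T"] * m2["B"] + m1["B"] * m2["T"]
    if k >= 1.0:
        raise ValueError("Total conflict: Dempster's rule is undefined.")
    norm = 1.0 - k
    combined = {
        "T": (m1["T"] * m2["T"] + m1["T"] * m2["Theta"] + m1["Theta"] * m2["T"]) / norm,
        "B": (m1["B"] * m2["B"] + m1["B"] * m2["Theta"] + m1["Theta"] * m2["B"]) / norm,
        "Theta": (m1["Theta"] * m2["Theta"]) / norm,
    }
    return combined, k


def conflict_adaptive_fuse(bpas, conflict_threshold=0.8):
    """Sequentially combine a list of BPAs; when the pairwise conflict K exceeds
    the threshold, fall back to averaging the two bodies of evidence instead of
    applying the raw Dempster combination for that step."""
    fused = bpas[0]
    for m in bpas[1:]:
        candidate, k = dempster_combine(fused, m)
        if k > conflict_threshold:
            fused = {key: 0.5 * (fused[key] + m[key]) for key in fused}  # CA fallback
        else:
            fused = candidate
    return fused


if __name__ == "__main__":
    # Three bodies of evidence: the feature-fusion branch and the two independent
    # single-modal branches. Values are invented for a hazy scene in which the
    # visible branch is unsure.
    bpa_fusion = {"T": 0.80, "B": 0.10, "Theta": 0.10}
    bpa_visible = {"T": 0.45, "B": 0.35, "Theta": 0.20}
    bpa_infrared = {"T": 0.70, "B": 0.15, "Theta": 0.15}
    fused = conflict_adaptive_fuse([bpa_fusion, bpa_visible, bpa_infrared])
    decision = "target" if fused["T"] > fused["B"] else "background"
    print(fused, "->", decision)
```

A final decision is read out by comparing the fused masses m(T) and m(B), while m(Θ) is retained as an explicit measure of remaining uncertainty that the decision layer can use to flag low-confidence detections.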
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Na, Z.; Cheng, L.; Sun, H.; Lin, B. Survey of UAV detection and recognition research based on deep learning. Signal Process. 2024, 40, 609–624. [Google Scholar] [CrossRef]
- Adnan, W.H.; Khamis, M.F. Drone use in military and civilian application: Risk to national security. J. Media Inf. Warf. 2022, 15, 60–70. [Google Scholar]
- Qiu, X.; Luo, B.; Fu, Z.; Tan, X.; Xiong, C. Review of anti-UAV technology development at home and abroad. Tact. Missile Technol. 2024, 63–73, 98. [Google Scholar] [CrossRef]
- Liu, L.; Liu, D.; Wang, X.; Wang, F.; Li, Y.; He, Y.; Liao, M. Development status and prospect of UAV swarm and anti-UAV swarm. Acta Aeronaut. Astronaut. Sin. 2022, 43, 4–20. [Google Scholar]
- Cheng, Y.; Zou, R.; Chen, J.; Wu, H.; Hua, X. LSS-Ku-1.0: A radar low-slow-small UAV detection dataset under ground clutter background. Signal Process. 2025, 41, 807–820. [Google Scholar] [CrossRef]
- Zhao, J.; Zhang, J.; Li, D.; Wang, D. Vision-based anti-UAV detection and tracking. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25323–25334. [Google Scholar] [CrossRef]
- Yang, F.; Wang, M. Infrared small target detection method for low-altitude surveillance systems. Opt. Tech. 2024, 50, 120–128. [Google Scholar] [CrossRef]
- Xu, S.; Chen, X.; Li, H.; Liu, T.; Chen, Z.; Gao, H.; Zhang, Y. Airborne small target detection method based on multimodal and adaptive feature fusion. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5637215. [Google Scholar] [CrossRef]
- Song, H. Small target detection based on multimodal data fusion. In Proceedings of the IEEE International Symposium on Computer Applications and Information Technology (ISCAIT), Xi’an, China, 21–23 March 2025; pp. 537–540. [Google Scholar] [CrossRef]
- Ouyang, J.; Jin, P.; Wang, Q. Multimodal feature-guided pretraining for RGB-T perception. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2024, 17, 16041–16050. [Google Scholar] [CrossRef]
- Li, M.; Zhou, M.; Zhi, R. Survey of UAV recognition research based on multi-modal fusion. Comput. Eng. Appl. 2025, 61, 1–14. Available online: http://kns.cnki.net/kcms/detail/11.2127.TP.20250523.1539.008.html (accessed on 10 April 2025).
- Han, Z.; Yue, M.; Zhang, C.; Gao, Q. Multi-modal fusion detection for UAV targets based on Siamese network. Infrared Technol. 2023, 45, 739–745. [Google Scholar]
- Hu, N.; Tian, X. OFDM-MIMO radar signal design and processing method for small UAVs. Signal Process. 2024, 40, 878–886. [Google Scholar]
- Kaleem, Z. Lightweight and computationally efficient YOLO for rogue UAV detection in complex backgrounds. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 5362–5366. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
- Li, R.; Peng, Y.; Yang, Q. Fusion enhancement: UAV target detection based on multi-modal GAN. In Proceedings of the IEEE Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 15–17 September 2023; pp. 1953–1957. [Google Scholar] [CrossRef]
- Gao, M.; Lin, S. Deep learning LSS target detection algorithm based on radar signal and remote sensing map fusion. Signal Process. 2024, 40, 82–93. [Google Scholar] [CrossRef]
- Guo, R.; Sun, B.; Sun, X.; Bu, D.; Su, S. Multi-modal fusion target detection method for UAVs under low-light conditions. Chin. J. Sci. Instrum. 2025, 46, 338–350. [Google Scholar] [CrossRef]
- Hengy, S.; Laurenzis, M.; Schertzer, S.; Hommes, A.; Kloeppel, F.; Shoykhetbrod, A.; Geibig, T.; Johannes, W.; Rassy, O.; Christnacher, F. Multimodal UAV detection: Study of various intrusion scenarios. In SPIE Electro-Optical Remote Sensing XI; SPIE: Bellingham, WA, USA, 2017; Volume 10434, pp. 203–212. [Google Scholar]
- Feng, D.; Haase-Schütz, C.; Rosenbaum, L.; Hertlein, H.; Glaeser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep Multi-Modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges. IEEE Trans. Intell. Transp. Syst. 2021, 22, 1341–1360. [Google Scholar] [CrossRef]
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. Available online: https://arxiv.org/abs/1706.03762 (accessed on 8 December 2025).
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154. [Google Scholar]
- Chi, W.; Liu, J.; Wang, X.; Feng, R.; Cui, J. DBGNet: Dual-branch gate-aware network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5003714. [Google Scholar] [CrossRef]
- Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11966–11976. [Google Scholar] [CrossRef]
- Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
- Zhao, G.; Chen, A.; Lu, G.; Liu, W. Data fusion algorithm based on fuzzy sets and D-S theory of evidence. Tsinghua Sci. Technol. 2020, 25, 12–19. [Google Scholar] [CrossRef]
- Murphy, C.K. Combining belief functions when evidence conflicts. Decis. Support Syst. 2000, 29, 1–9. [Google Scholar] [CrossRef]
- Jiang, N.; Wang, K.; Peng, X.; Yu, X.; Wang, Q.; Xing, J.; Li, G.; Guo, G.; Ye, Q.; Jiao, J.; et al. Anti-UAV: A large-scale benchmark for vision-based UAV tracking. IEEE Trans. Multimed. 2023, 25, 486–500. [Google Scholar] [CrossRef]
- Xu, K.; Wang, B.; Zhu, Z.; Jia, Z.; Fan, C. A contrastive learning enhanced adaptive multimodal fusion network for hyperspectral and LiDAR data classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4700319. [Google Scholar] [CrossRef]
- Liu, Z.; Shen, Y.; Lakshminararasimhan, V.B.; Liang, P.P.; Zadeh, A.B.; Morency, L.-P. Efficient low-rank multimodal fusion with modality-specific factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia, 15–20 July 2018; pp. 2247–2256. [Google Scholar]
- Perez-Rua, J.-M.; Vielzeuf, V.; Pateux, S.; Baccouche, M.; Jurie, F. MFAS: Multimodal fusion architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 6959–6968. [Google Scholar] [CrossRef]
- Li, H.; Zhang, S.; Kong, X. CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5003713. [Google Scholar] [CrossRef]
- Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]

| Method | Accuracy (%) | Precision (%) | Recall (%) | FAR (%) | F1-Score (%) |
|---|---|---|---|---|---|
| Visible Only | 83.1 | 85.4 | 81.0 | 8.5 | 83.1 |
| Infrared Only | 78.9 | 76.8 | 81.5 | 10.3 | 79.1 |
| Cross-Modal Attention [31] | 90.1 | 89.3 | 90.8 | 5.4 | 90.0 |
| Feature Weighted Avg [32] | 89.2 | 88.7 | 89.5 | 5.9 | 89.1 |
| Voting Decision Fusion [33] | 87.5 | 86.0 | 89.0 | 7.1 | 87.5 |
| Dynamic Feature Fusion [33] | 88.7 | 87.5 | 89.9 | 6.3 | 88.7 |
| TransFuser [23] | 91.5 | 90.9 | 92.0 | 4.9 | 91.4 |
| CMX [34] | 91.8 | 91.2 | 92.3 | 4.7 | 91.7 |
| Proposed Method | 93.5 | 92.8 | 94.1 | 4.2 | 93.4 |
| Method | Params (M) | FLOPs (G) | FPS |
|---|---|---|---|
| Visible Branch | 25.1 | 45.3 | 45 |
| Infrared Branch | 18.7 | 32.1 | 52 |
| Cross-Modal Attention [31] | 46.5 | 82.1 | 31 |
| CMX [34] | 55.3 | 98.7 | 26 |
| TransFuser [34] | 62.1 | 110.5 | 22 |
| Proposed Method | 46.2 | 81.9 | 28 |
| Environmental Condition | Visible Only | IR Only | Weighted Avg | Cross-Modal Attn | TransFuser | CMX | Proposed Method |
|---|---|---|---|---|---|---|---|
| Daytime | 82.1 | 72.8 | 88.3 | 90.2 | 91.5 | 91.8 | 94.8 |
| Nighttime | 46.3 | 76.5 | 80.1 | 83.7 | 88.5 | 90.8 | 92.3 |
| Haze | 51.7 | 65.2 | 78.4 | 81.5 | 85.9 | 89.7 | 89.7 |
| Average Performance | 60.0 | 71.5 | 82.3 | 85.1 | 88.6 | 90.8 | 92.3 |
| Method | Accuracy (%) | Precision (%) | Recall (%) | FAR (%) | F1-Score (%) |
|---|---|---|---|---|---|
| Baseline (FeatAvg + Vote) | 87.5 | 86.0 | 89.0 | 7.1 | 87.5 |
| EADW | 90.3 | 89.6 | 91.0 | 5.5 | 90.3 |
| EADW + D-S | 92.1 | 91.3 | 92.8 | 4.8 | 92.0 |
| EADW + D-S + CA (ours) | 93.5 | 92.8 | 94.1 | 4.2 | 93.4 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Liu, Z.; Zou, Y.; Hu, Z.; Xue, H.; Li, M.; Rao, B. Research on Multi-Modal Fusion Detection Method for Low-Slow-Small UAVs Based on Deep Learning. Drones 2025, 9, 852. https://doi.org/10.3390/drones9120852