YOLO-ViT-Based Method for Unmanned Aerial Vehicle Infrared Vehicle Target Detection
Abstract
1. Introduction
- We adopt the lightweight MobileViT network as the backbone. It combines the spatial inductive bias of CNNs with the global modeling capability of transformers while reducing the number of model parameters and the model complexity.
- We propose a multi-scale C3-PANet structure. For upsampling it uses CARAFE (Content-Aware ReAssembly of FEatures), which predicts upsampling weights from the feature map and then reassembles the features with those weights, giving a larger receptive field and better perception of small targets. The neck is further improved by stacking C3 modules, which extract more effective features while reducing the number of parameters, improving the detection accuracy for small targets.
- We apply the K-means++ clustering algorithm to the dataset samples and redesign the anchor box sizes from the resulting clusters, which improves detection efficiency.
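The anchor redesign in the last contribution can be illustrated with a minimal sketch. It assumes that boxes are clustered by width and height only and that the distance is 1 − IoU, a common choice for YOLO anchors that the text above does not confirm. The function names and the toy boxes are hypothetical. Only the seeding rule, in which each new center is drawn with probability proportional to its squared distance from the nearest existing center, is specific to K-means++.

```python
import random


def iou_wh(a, b):
    """IoU of two (width, height) boxes that share the same top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeanspp_seed(boxes, k, rng):
    """K-means++ seeding: draw each new center with probability proportional to D^2."""
    centers = [rng.choice(boxes)]
    while len(centers) < k:
        # Squared distance (1 - IoU) from every box to its nearest chosen center.
        d2 = [min(1.0 - iou_wh(b, c) for c in centers) ** 2 for b in boxes]
        if sum(d2) == 0.0:  # every box already coincides with a center
            centers.append(rng.choice(boxes))
        else:
            centers.append(rng.choices(boxes, weights=d2, k=1)[0])
    return centers


def cluster_anchors(boxes, k, iters=100, seed=0):
    """Return k anchor (width, height) pairs, sorted by area."""
    rng = random.Random(seed)
    centers = kmeanspp_seed(boxes, k, rng)
    for _ in range(iters):
        # Assignment step: each box joins the center it overlaps most.
        groups = [[] for _ in range(k)]
        for b in boxes:
            nearest = max(range(k), key=lambda i: iou_wh(b, centers[i]))
            groups[nearest].append(b)
        # Update step: move each center to the mean width and height of its group.
        updated = [
            (sum(b[0] for b in g) / len(g), sum(b[1] for b in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if updated == centers:
            break
        centers = updated
    return sorted(centers, key=lambda c: c[0] * c[1])


# Toy box sizes in pixels (hypothetical, not taken from HIT-UAV).
boxes = [(10, 10)] * 3 + [(50, 50)] * 3 + [(200, 100)] * 3
print(cluster_anchors(boxes, k=3))
```

The toy input yields one anchor per cluster. On a real dataset, `boxes` would hold the labelled box sizes scaled to the network input resolution, and `k` would match the number of anchors the detection head expects.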
2. Related Work
3. Proposed Methodology, Tools, and Techniques
3.1. YOLO-ViT
3.1.1. Improved Backbone Network Based on MobileViT
3.1.2. Content-Aware Multi-Scale-Structure-Based C3-PANet
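The CARAFE upsampling summarized in the contributions (predict upsampling weights from the feature map, then reassemble the features with them) can be sketched for a single channel as follows. This is a simplified illustration, not the authors' implementation. The kernel prediction module is replaced by a caller-supplied `kernel_logits` argument, and the function names, the zero padding at the borders, and the defaults `scale=2` and `k_up=3` are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def carafe_reassemble(feature, kernel_logits, scale=2, k_up=3):
    """Upsample one H x W channel by `scale` with a separate kernel per output pixel.

    feature:       H x W nested list of floats.
    kernel_logits: (scale*H) x (scale*W) nested list; each entry is a flat list
                   of k_up*k_up unnormalized weights. In CARAFE these come from
                   a kernel prediction module, which is not implemented here.
    """
    h, w = len(feature), len(feature[0])
    r = k_up // 2
    out = [[0.0] * (w * scale) for _ in range(h * scale)]
    for oi in range(h * scale):
        for oj in range(w * scale):
            ci, cj = oi // scale, oj // scale  # source location for this output pixel
            weights = softmax(kernel_logits[oi][oj])  # normalized kernel, sums to 1
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    si, sj = ci + di, cj + dj
                    if 0 <= si < h and 0 <= sj < w:  # zero padding outside the map
                        acc += weights[(di + r) * k_up + (dj + r)] * feature[si][sj]
            out[oi][oj] = acc
    return out


# Kernels peaked on the center tap reduce the operator to nearest-neighbor upsampling.
feature = [[1.0, 2.0], [3.0, 4.0]]
peaked = [-50.0] * 9
peaked[4] = 50.0
logits = [[list(peaked) for _ in range(4)] for _ in range(4)]
for row in carafe_reassemble(feature, logits):
    print([round(v, 3) for v in row])
```

Because the weights depend on the output location, the operator can aggregate a k_up × k_up neighborhood differently at every pixel, which is where the larger receptive field mentioned in the contributions comes from. For a multi-channel feature map, the same kernel would be applied to every channel.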
3.1.3. K-Means++ Clustering Algorithm
3.2. Datasets
3.3. Assessment Indicators
4. Experimental Results
4.1. Experimental Platform and Parameter Settings
4.2. Comparison Algorithms
4.3. Ablation Studies and Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Layer | Output Size | Stride | Number | Output Channels |
---|---|---|---|---|
Image | 640 × 640 | 1 | | |
Conv-3 × 3, ↓2 | 320 × 320 | 2 | 1 | 16 |
MV2 | 320 × 320 | 2 | 1 | 32 |
MV2, ↓2 | 160 × 160 | 4 | 1 | 48 |
MV2 | 160 × 160 | 4 | 2 | 48 |
MV2, ↓2 | 80 × 80 | 8 | 1 | 64 |
MobileViTBlock (L = 2) | 80 × 80 | 8 | 1 | 64 |
MV2, ↓2 | 40 × 40 | 16 | 1 | 80 |
MobileViTBlock (L = 4) | 40 × 40 | 16 | 1 | 80 |
MV2, ↓2 | 20 × 20 | 32 | 1 | 96 |
MobileViTBlock (L = 3) | 20 × 20 | 32 | 1 | 96 |
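The output sizes in the backbone table follow from the cumulative stride: each ↓2 layer halves the spatial resolution of the 640 × 640 input. The sketch below reproduces that arithmetic, assuming that layers without ↓2 keep the resolution of the preceding layer. `BACKBONE` and `trace_shapes` are illustrative names, not part of any library.

```python
# (layer, downsamples by 2?, number of repeats, output channels), copied from the table.
BACKBONE = [
    ("Conv-3x3", True, 1, 16),
    ("MV2", False, 1, 32),
    ("MV2", True, 1, 48),
    ("MV2", False, 2, 48),
    ("MV2", True, 1, 64),
    ("MobileViTBlock (L=2)", False, 1, 64),
    ("MV2", True, 1, 80),
    ("MobileViTBlock (L=4)", False, 1, 80),
    ("MV2", True, 1, 96),
    ("MobileViTBlock (L=3)", False, 1, 96),
]


def trace_shapes(input_size=640, layers=BACKBONE):
    """Return (layer, output size, cumulative stride, repeats, channels) per layer."""
    stride = 1
    rows = []
    for name, downsample, repeats, channels in layers:
        if downsample:
            stride *= 2  # every ↓2 layer doubles the cumulative stride
        rows.append((name, input_size // stride, stride, repeats, channels))
    return rows


for name, size, stride, repeats, channels in trace_shapes():
    print(f"{name:<22} {size:>3} x {size:<3} stride {stride:>2}  x{repeats}  {channels} ch")
```

The three MobileViT blocks end at 80 × 80, 40 × 40, and 20 × 20, that is, at strides 8, 16, and 32.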
Dataset | Small (0, 32 × 32) | Medium (32 × 32, 96 × 96) | Large (96 × 96, 640 × 512) |
---|---|---|---|
HIT-UAV | 17,118 | 7249 | 268 |
Train set | 12,045 | 5205 | 268 |
Test set | 3331 | 1379 | 70 |
Validation set | 1742 | 665 | 46 |
Names | Related Configurations |
---|---|
Graphics processing unit | NVIDIA Quadro GV100 |
Central processing unit | Intel Xeon Platinum 8151 |
GPU memory size | 32 GB |
Operating system | Windows 10 |
Computing platform | CUDA 10.2 |
Deep learning framework | PyTorch |
Model | Parameters | GFLOPs | Precision (%) | Recall (%) | F1 (%) | Vehicle AP (%) | mAP (%) |
---|---|---|---|---|---|---|---|
YOLOv7s | 36.5 M | 103.2 | 90.2 | 88.4 | 89.3 | 97.6 | 93.6 |
YOLO-ViT | 17.3 M | 33.1 | 90.0 | 91.3 | 90.6 | 98.1 | 94.5 |
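The F1 column can be checked against the precision and recall columns with the standard definition F1 = 2PR / (P + R). The helper name `f1_score` below is illustrative, and the inputs are the rounded percentages from the table, so the check is only accurate to one decimal place.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) copied from the comparison table.
rows = {"YOLOv7s": (90.2, 88.4), "YOLO-ViT": (90.0, 91.3)}
for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.1f}%")
# YOLOv7s: F1 = 89.3%
# YOLO-ViT: F1 = 90.6%
```

Both values agree with the reported F1 column. YOLO-ViT gives up 0.2 points of precision but gains 2.9 points of recall, which accounts for its higher F1.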
Model | Size | Parameters | F1 (%) | Person AP (%) | Vehicle AP (%) | Bicycle AP (%) | mAP (%) |
---|---|---|---|---|---|---|---|
YOLOv5s | 640 | 7.0 M | 90.6 | 92.8 | 97.1 | 91.0 | 93.7 |
YOLOv5m | 640 | 20.9 M | 90.8 | 92.2 | 96.6 | 90.9 | 93.2 |
YOLOv5l | 640 | 46.1 M | 91.2 | 93.2 | 96.9 | 90.6 | 93.6 |
YOLOv7s | 640 | 36.5 M | 89.3 | 92.1 | 97.6 | 91.2 | 93.6 |
YOLOv8s | 640 | 11.2 M | 90.3 | 92.6 | 96.3 | 91.5 | 93.5 |
YOLO-ViT | 640 | 17.3 M | 90.6 | 93.3 | 98.1 | 92.1 | 94.5 |
YOLOv7 | MobileViT | CARAFE | C3 | K-Means++ | Parameters | F1 (%) | Person AP50 (%) | Vehicle AP50 (%) | Bicycle AP50 (%) | mAP (%) | FPS |
---|---|---|---|---|---|---|---|---|---|---|---|
√ | | | | | 36.2 M | 89.3 | 92.1 | 97.6 | 91.2 | 93.6 | 54 |
√ | √ | | | | 23.8 M | 87.5 | 90.3 | 97.2 | 88.9 | 92.1 | 39 |
√ | | √ | | | 36.4 M | 89.4 | 91.6 | 97.9 | 91.4 | 93.7 | 51 |
√ | | | √ | | 29.9 M | 89.9 | 92.4 | 97.7 | 92.4 | 94.2 | 60 |
√ | | | | √ | 36.5 M | 91.5 | 93.4 | 98.1 | 92.6 | 94.7 | 54 |
√ | √ | √ | | | 26.7 M | 89.2 | 90.0 | 97.4 | 90.3 | 92.6 | 37 |
√ | √ | √ | √ | | 17.3 M | 89.6 | 90.6 | 97.6 | 90.5 | 92.9 | 40 |
√ | √ | √ | √ | √ | 17.3 M | 90.5 | 93.3 | 98.1 | 92.0 | 94.5 | 41 |
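The mAP column of the ablation table appears to be the arithmetic mean of the three per-class AP50 values. That reading is an assumption drawn from the numbers, not a definition stated here, and `mean_ap` is a hypothetical helper. Because the inputs are already rounded to one decimal, a recomputed value can differ from the reported one by 0.1.

```python
def mean_ap(per_class_ap):
    """Arithmetic mean of per-class average precision values (in percent)."""
    return sum(per_class_ap) / len(per_class_ap)


# (Person, Vehicle, Bicycle) AP50 and the reported mAP, copied from the first
# row (YOLOv7 only) and the last row (all five components) of the table.
rows = {
    "YOLOv7 baseline": ((92.1, 97.6, 91.2), 93.6),
    "All components": ((93.3, 98.1, 92.0), 94.5),
}
for name, (aps, reported) in rows.items():
    print(f"{name}: recomputed {mean_ap(aps):.2f}, reported {reported}")
```

The recomputed means are 93.63 and 94.47, which round to the reported 93.6 and 94.5.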
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhao, X.; Xia, Y.; Zhang, W.; Zheng, C.; Zhang, Z. YOLO-ViT-Based Method for Unmanned Aerial Vehicle Infrared Vehicle Target Detection. Remote Sens. 2023, 15, 3778. https://doi.org/10.3390/rs15153778