A Real-Time Improved YOLOv10 Model for Small and Multi-Scale Ground Target Detection in UAV LiDAR Range Images of Complex Scenes
Abstract
1. Introduction
- To address the sparsity of long-range target features in depth images and the difficulty of capturing them, we design a Swin-Conv [6] hybrid module that introduces sparse attention and deformable convolution into the backbone. This enables the network to focus adaptively on informative feature regions and to better model target geometric structure, which improves feature extraction and representation for small targets;
- To address multi-scale target variations and complex background interference, an Attentional Feature Fusion [7] module is introduced in the model neck. It employs a multi-scale channel attention mechanism to adaptively fuse features from different levels and effectively coordinate detail with semantic information. This process improves the feature-pyramid fusion efficiency and enhances robustness in detecting multi-scale targets;
- Considering the inherent viewpoint characteristics of low-altitude UAV sensing, we construct a LiDAR range-image dataset. We then systematically study how different viewpoint proportions in the training data affect model performance, aiming to optimize generalization from a data perspective.
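
For orientation, the attentional feature fusion of [7] is usually written as below. This is a sketch of the formulation in that reference, not of the modified module used in this work, whose details may differ. The symbols $X$, $Y$, $L$, $g$ and $M$ are notation introduced here for illustration.

```latex
% X, Y : two feature maps to be fused (e.g., shallow detail and deep semantics)
% L(.) : local channel context (point-wise convolution bottleneck)
% g(.) : global channel context (global average pooling, then the same bottleneck)
% \sigma : sigmoid, \oplus : broadcasting addition, \otimes : element-wise product
M(X) = \sigma\bigl(L(X) \oplus g(X)\bigr), \qquad
Z = M(X + Y) \otimes X + \bigl(1 - M(X + Y)\bigr) \otimes Y
```

Because the attention weights $M(\cdot)$ lie in $(0, 1)$, the output $Z$ is a soft, per-element selection between the two inputs rather than a fixed sum or concatenation.
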
2. Related Work
2.1. Progress in 2D Image-Based Object Detection for UAV Perception
2.2. Progress in LiDAR Technologies for UAV Perception
3. Method
3.1. Swin Transformer-Conv
3.2. Attentional Feature Fusion
4. Experiments and Results
4.1. Experimental Environment and Training Parameters
4.2. Objective Evaluation Indicators
4.3. Construction of an All-View Range-Image Dataset
4.3.1. Dataset Construction
4.3.2. Dataset Optimization Experiment for Model Training Efficiency
4.4. Analysis of Experimental Results
4.4.1. Ablation Study
4.4.2. Comparative Experiment
4.4.3. Real-Time Performance Evaluation via Edge-Device Deployment
4.4.4. Visual Comparative Analysis of Interference Robustness
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Guan, X.; Shi, H.; Xu, D.; Zhang, B.; Wei, J.; Chen, J. The exploration and practice of low-altitude airspace flight service and traffic management in China. Green Energy Intell. Transp. 2024, 3, 100149.
- Liu, H.; Ma, R. Sky’s-Eye Perspective: A Multidimensional Review of UAV Applications in Highway Systems. Appl. Sci. 2025, 15, 11199.
- Gariepy, G.; Krstajic, N.; Henderson, R.; Li, C.Y.; Thomson, R.R.; Buller, G.S.; Heshmat, B.; Raskar, R.; Leach, J.; Faccio, D. Single-photon sensitive light-in-flight imaging. Nat. Commun. 2015, 6, 6021.
- Seidaliyeva, U.; Ilipbayeva, L.; Utebayeva, D.; Smailov, N.; Matson, E.T.; Tashtay, Y.; Turumbetov, M.; Sabibolda, A. LiDAR Technology for UAV Detection: From fundamentals and operational principles to advanced detection and classification techniques. Sensors 2025, 25, 2757.
- Sapkota, R.; Karkee, M. Ultralytics YOLO Evolution: An Overview of YOLO26, YOLO11, YOLOv8 and YOLOv5 Object Detectors for Computer Vision and Pattern Recognition. arXiv 2025, arXiv:2510.09653.
- Zhang, K.; Li, Y.; Liang, J.; Cao, J.; Zhang, Y.; Tang, H.; Fan, D.-P.; Timofte, R.; Van Gool, L. Practical blind image denoising via Swin-Conv-UNet and data synthesis. Mach. Intell. Res. 2023, 20, 822–836.
- Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Online, 5–9 January 2021; pp. 3560–3569.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
- Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
- Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
- Lin, J.; Sun, S.; Gong, S. Gridclip: One-stage object detection by grid-level clip representation learning. Pattern Recognit. 2023, 171, 112187.
- Zhang, X.; Yuan, D.; Hu, Y.; Wu, Z.; Zhang, X.; Yu, B.; Bai, X.; Cao, S.-Y.; Jin, Y.; Yang, B. SADet: A Semantic-Aware Tiny Object Detection Network Against Missed Detection. Pattern Recognit. 2025, 172, 112624.
- Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162.
- Cao, J.; Cholakkal, H.; Anwer, R.M.; Khan, F.S.; Pang, Y.; Shao, L. D2det: Towards high quality object detection and instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11485–11494.
- Li, W.; Zhao, D.; Yuan, B.; Gao, Y.; Shi, Z. PETDet: Proposal enhancement for two-stage fine-grained object detection. IEEE Trans. Geosci. Remote Sens. 2023, 62, 1–14.
- Kong, F.; Shan, X.; Hu, Y.; Li, J. Automated UAV Object Detector Design Using Large Language Model-Guided Architecture Search. Appl. Sci. 2025, 9, 803.
- Li, Y.; Wang, J.; Zhang, K.; Yi, J.; Wei, M.; Zheng, L.; Xie, W. Lightweight object detection networks for UAV aerial images based on YOLO. Chin. J. Electron. 2024, 33, 997–1009.
- Zhang, Y.; Wu, C.; Zhang, T.; Zheng, Y. Full-scale feature aggregation and grouping feature reconstruction-based UAV image target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11.
- Fang, Q.; Han, D.; Wang, Z. Cross-modality fusion transformer for multispectral object detection. arXiv 2021, arXiv:2111.00273.
- Wang, W.; Peng, Y.; Cao, G.; Guo, X.; Kwok, N. Low-illumination image enhancement for night-time UAV pedestrian detection. IEEE Trans. Ind. Inform. 2020, 17, 5208–5217.
- Cortinhal, T.; Tzelepis, G.; Erdal Aksoy, E. Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds. In Proceedings of the International Symposium on Visual Computing, San Diego, CA, USA, 5–7 October 2020; pp. 207–222.
- He, X.; Li, X.; Xu, Q.; Hu, Y.; Sun, Z. Radial awareness with adaptive hybrid CNN-Transformer range-view representation for outdoor LiDAR point cloud semantic segmentation. Expert Syst. Appl. 2025, 271, 126572.
- Fan, L.; Xiong, X.; Wang, F.; Wang, N.; Zhang, Z. Rangedet: In defense of range view for lidar-based 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 2918–2927.
- Sun, P.; Wang, W.; Chai, Y.; Elsayed, G.; Bewley, A.; Zhang, X.; Sminchisescu, C.; Anguelov, D. Rsn: Range sparse net for efficient, accurate lidar 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 20–25 June 2021; pp. 5725–5734.
- Bhat, S.F.; Alhashim, I.; Wonka, P. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 20–25 June 2021; pp. 4009–4018.
- Wang, Z.; Liao, Z.; Zhou, B.; Yu, G.; Luo, W. SwinURNet: Hybrid transformer-cnn architecture for real-time unstructured road segmentation. IEEE Trans. Instrum. Meas. 2024, 73, 16.
- Zhang, H.; Mao, H.; Zheng, J.; Jin, L.; Guo, B. CR-Pillars: A Three-Dimensional Object Detection Model Based on Enhanced PointPillars. In Proceedings of the International Conference on Green Intelligent Transportation System and Safety, Qinhuangdao, China, 16–18 September 2022; pp. 533–546.
- Bultmann, S.; Quenzel, J.; Behnke, S. Real-time multi-modal semantic fusion on unmanned aerial vehicles with label propagation for cross-domain adaptation. Robot. Auton. Syst. 2023, 159, 104286.
- Manduhu, M.; Dow, A.; Trslic, P.; Dooly, G.; Blanck, B.; Riordan, J. Airborne Sense and Detect of Drones using LiDAR and adapted PointPillars DNN. arXiv 2023, arXiv:2310.09589.
- Tzutalin. LabelImg. GitHub. 2015. Available online: https://github.com/HumanSignal/labelImg (accessed on 28 December 2025).









| Parameter | Setting |
|---|---|
| Batch size | 8 |
| Number of epochs | 200 |
| Image resolution | 512 × 512 |
| Optimizer | SGD |
| Initial learning rate | 0.001 |
| Momentum | 0.9 |
| Weight decay | 0.0005 |
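
The training settings in the table above can be captured in a small configuration object. The following is a minimal sketch, not the authors' training script: the class name `TrainConfig`, the helper `steps_per_epoch`, and the dataset size of 10,000 images are hypothetical and used only for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainConfig:
    """Training hyperparameters copied from the table above."""
    batch_size: int = 8
    epochs: int = 200
    image_size: tuple = (512, 512)  # height, width in pixels
    optimizer: str = "SGD"
    initial_lr: float = 0.001
    momentum: float = 0.9
    weight_decay: float = 0.0005


def steps_per_epoch(num_images: int, batch_size: int) -> int:
    """Optimizer steps per epoch, counting a final partial batch."""
    return -(-num_images // batch_size)  # ceiling division


cfg = TrainConfig()
# num_images=10_000 is a hypothetical dataset size, not a figure from the paper.
total_steps = cfg.epochs * steps_per_epoch(10_000, cfg.batch_size)
print(asdict(cfg))
print(total_steps)
```

Keeping the values in one frozen object makes it harder for a later experiment to silently change a setting that the reported results depend on.
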

| Configuration Parameter | Setting |
|---|---|
| Embedding dimensions | [64, 128, 256, 480] |
| Number of attention heads | [2, 4, 8, 12] |
| Window size | [7, 7, 7, 7] |
| Number of blocks | [2, 4, 4, 2] |
| Number of deformable convolution groups | [1, 2, 3, 4] |
| Sparse threshold factor | 0.01 |
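
As a quick consistency check on the stage-wise settings above, the sketch below derives the per-head channel width and the number of tokens per attention window. It assumes the usual multi-head attention convention that channels are split evenly across heads; whether the authors' implementation does exactly this is not stated in the table. The function names are hypothetical.

```python
# Stage-wise Swin-Conv settings copied from the table above.
EMBED_DIMS = [64, 128, 256, 480]
NUM_HEADS = [2, 4, 8, 12]
WINDOW_SIZES = [7, 7, 7, 7]
NUM_BLOCKS = [2, 4, 4, 2]


def head_dims(embed_dims, num_heads):
    """Per-head channel width, assuming an even split across heads."""
    dims = []
    for channels, heads in zip(embed_dims, num_heads):
        if channels % heads != 0:
            raise ValueError(f"{channels} channels cannot be split across {heads} heads")
        dims.append(channels // heads)
    return dims


def tokens_per_window(window_sizes):
    """Number of positions attended to inside one square window."""
    return [w * w for w in window_sizes]


print(head_dims(EMBED_DIMS, NUM_HEADS))   # per-head width for each stage
print(tokens_per_window(WINDOW_SIZES))    # tokens per 7 x 7 window
print(sum(NUM_BLOCKS))                    # total number of blocks
```
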

| Module | Hyperparameter | Setting |
|---|---|---|
| MSDCK | Parallel kernel sizes | [3, 5] |
| MSDCK | Internal channels | C/16 |
| Scale selector | MLP structure | GAP → C/32 → ReLU → 2 |
| Scale selector | Weight normalization | Softmax |
| Global branch | Channel compression ratio r | 16 |
| Basic configuration | Activation and normalization | ReLU, Sigmoid, BN |
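
The scale-selector row above (GAP → C/32 → ReLU → 2, followed by softmax) can be illustrated with a small pure-Python sketch. It rests on several assumptions that the table does not confirm: that the two outputs weight the 3 × 3 and 5 × 5 branches, that the layers have no bias terms, and that C = 64. The weights are random placeholders rather than trained values, and `scale_selector` is a hypothetical name.

```python
import math
import random


def scale_selector(channel_means, w_hidden, w_out):
    """Pooled channel vector -> hidden layer (ReLU) -> 2 logits -> softmax weights."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, channel_means)))
              for row in w_hidden]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


C = 64                   # hypothetical channel count
rng = random.Random(0)   # fixed seed; weights below are untrained placeholders
gap = [rng.random() for _ in range(C)]  # stands in for the global-average-pooled features
w_hidden = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(C // 32)]
w_out = [[rng.uniform(-1, 1) for _ in range(C // 32)] for _ in range(2)]

w3, w5 = scale_selector(gap, w_hidden, w_out)
print(w3, w5)  # two non-negative weights that sum to 1
```

The softmax guarantees that the two branch weights form a convex combination, so the selector can shift emphasis between kernel sizes without changing the overall feature scale.
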

| Target-Related Parameter | Range | Pose Angle Parameter | Range | Image-Related Parameter | Range |
|---|---|---|---|---|---|
| Number of Targets | 1~12 | Azimuth Angle | 0~359° | Overall Position | 0~2 |
| Rotation Angle | 0~359° | Viewing Angle | 0~80° | Field of View | 40~80% |
| Relative Position | 0~8 | Spin Angle | −45~45° | Noise Intensity | 20~40 dB |
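
One way to read the parameter ranges above is as a sampling space for scene generation. The sketch below draws random scene configurations from those ranges. Uniform integer sampling is an assumption made here for illustration; the table gives only the ranges, not the distribution or step size, and the dictionary keys and `sample_scene` are hypothetical names.

```python
import random

# Ranges copied from the table above; units as given there.
PARAM_RANGES = {
    "num_targets": (1, 12),
    "rotation_angle_deg": (0, 359),
    "relative_position": (0, 8),
    "azimuth_angle_deg": (0, 359),
    "viewing_angle_deg": (0, 80),
    "spin_angle_deg": (-45, 45),
    "overall_position": (0, 2),
    "field_of_view_pct": (40, 80),
    "noise_intensity_db": (20, 40),
}


def sample_scene(rng):
    """Draw one scene configuration (uniform integer sampling is an assumption)."""
    return {name: rng.randint(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}


rng = random.Random(42)  # fixed seed so the draw is repeatable
scenes = [sample_scene(rng) for _ in range(5)]
for scene in scenes:
    print(scene)
```
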

| Ratio | Overall | 0–10° | 10–20° | 20–30° | 30–40° | 40–50° | 50–60° | 60–70° | 70–80° |
|---|---|---|---|---|---|---|---|---|---|
| 5:5 | 86.92% | 89.41% | 89.69% | 87.93% | 86.88% | 86.34% | 85.78% | 85.13% | 82.03% |
| 4:6 | 87.17% | 89.90% | 89.51% | 88.34% | 88.53% | 87.17% | 84.84% | 85.63% | 82.36% |
| 3:7 | 87.51% | 90.24% | 89.76% | 88.05% | 88.83% | 86.53% | 87.15% | 86.90% | 81.94% |
| 2:8 | 87.49% | 89.86% | 89.09% | 89.02% | 88.49% | 85.64% | 87.39% | 86.28% | 83.71% |
| 1:9 | 87.29% | 89.16% | 88.98% | 88.18% | 89.14% | 87.44% | 86.02% | 86.65% | 81.22% |
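
The table above can be summarized programmatically to see which ratio scores highest overall and in each viewing-angle bin. This is a sketch over the reported numbers only; it makes no claim about what metric the percentages represent, since that is not stated alongside the table. The variable names are hypothetical.

```python
# Rows copied from the table above: overall score, then the eight
# viewing-angle bins (0-10 deg ... 70-80 deg), all in percent.
BINS = ["0-10", "10-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-80"]
RESULTS = {
    "5:5": (86.92, [89.41, 89.69, 87.93, 86.88, 86.34, 85.78, 85.13, 82.03]),
    "4:6": (87.17, [89.90, 89.51, 88.34, 88.53, 87.17, 84.84, 85.63, 82.36]),
    "3:7": (87.51, [90.24, 89.76, 88.05, 88.83, 86.53, 87.15, 86.90, 81.94]),
    "2:8": (87.49, [89.86, 89.09, 89.02, 88.49, 85.64, 87.39, 86.28, 83.71]),
    "1:9": (87.29, [89.16, 88.98, 88.18, 89.14, 87.44, 86.02, 86.65, 81.22]),
}

# Ratio with the highest overall score.
best_overall = max(RESULTS, key=lambda ratio: RESULTS[ratio][0])

# Ratio with the highest score in each viewing-angle bin.
best_per_bin = {
    name: max(RESULTS, key=lambda ratio: RESULTS[ratio][1][i])
    for i, name in enumerate(BINS)
}

print(best_overall)
for name, ratio in best_per_bin.items():
    print(f"{name} deg: {ratio}")
```

No single ratio wins every bin, which is why the overall column is the more useful basis for choosing a training mix.
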

| Method | AP (Pedestrian)/% | AP (Tank)/% | AP (Truck)/% | AP (Car)/% | mAP/% | Pre/% | Rec/% | F1/% |
|---|---|---|---|---|---|---|---|---|
| YOLOv10 | 73.51 | 88.16 | 88.82 | 86.24 | 84.18 | 82.94 | 81.95 | 82.44 |
| YOLOv10 + Swin-Conv | 76.48 | 91.47 | 90.32 | 89.73 | 87.06 | 85.80 | 84.85 | 85.32 |
| YOLOv10 + AFF | 78.36 | 92.29 | 89.64 | 89.61 | 87.48 | 86.68 | 84.53 | 85.62 |
| Ours | 80.17 | 93.18 | 91.53 | 90.97 | 88.96 | 88.52 | 86.47 | 87.47 |
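
The ablation figures above can be cross-checked with a few lines of code. The sketch below recomputes F1 as the harmonic mean of precision and recall (the standard definition, assumed here to be the one used) and reports each variant's mAP gain over the YOLOv10 baseline. Small differences from the reported F1 values are expected, since the inputs are already rounded to two decimals.

```python
# Values copied from the ablation table above (all in percent).
ROWS = {
    "YOLOv10":             {"mAP": 84.18, "pre": 82.94, "rec": 81.95, "f1": 82.44},
    "YOLOv10 + Swin-Conv": {"mAP": 87.06, "pre": 85.80, "rec": 84.85, "f1": 85.32},
    "YOLOv10 + AFF":       {"mAP": 87.48, "pre": 86.68, "rec": 84.53, "f1": 85.62},
    "Ours":                {"mAP": 88.96, "pre": 88.52, "rec": 86.47, "f1": 87.47},
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


baseline = ROWS["YOLOv10"]
for name, row in ROWS.items():
    recomputed = f1_score(row["pre"], row["rec"])
    gain = row["mAP"] - baseline["mAP"]
    print(f"{name:22s} F1 reported {row['f1']:.2f}, "
          f"recomputed {recomputed:.2f}, mAP gain {gain:+.2f}")
```
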

| Method | Pre/% | Rec/% | F1/% | mAP/% | FPS | Params/M | Size/MB | FLOPs/B |
|---|---|---|---|---|---|---|---|---|
| FastR-CNN | 71.56 | 73.44 | 72.49 | 74.71 | 12.1 | 25.6 | 97.28 | 142 |
| DETR | 76.23 | 78.12 | 77.16 | 78.91 | 23.4 | 41.2 | 156.56 | 154 |
| YOLOv7 | 82.11 | 80.37 | 81.23 | 82.56 | 53.6 | 36.9 | 140.22 | 70.5 |
| YOLOv8 | 81.26 | 80.93 | 81.09 | 83.73 | 60.9 | 25.7 | 97.66 | 53.2 |
| YOLOv10 | 82.94 | 81.95 | 82.44 | 84.18 | 63.3 | 15.4 | 58.52 | 37.9 |
| Ours | 88.52 | 86.47 | 87.47 | 88.96 | 54.2 | 28.9 | 109.82 | 46.7 |

| Hardware Platform | Model | Computational Precision | mAP/% | FPS | Mean Latency/ms | Inference Time/ms |
|---|---|---|---|---|---|---|
| RTX 3080 Ti | YOLOv10 | FP16 | 83.76 | 114.5 | 8.8 | 7.3 |
| RTX 3080 Ti | YOLOv10 | INT8 | 82.52 | 148.2 | 6.8 | 5.2 |
| RTX 3080 Ti | Ours | FP16 | 88.41 | 97.5 | 10.3 | 8.9 |
| RTX 3080 Ti | Ours | INT8 | 87.19 | 126.7 | 7.9 | 6.4 |
| Jetson Orin Nano | YOLOv10 | FP16 | 83.05 | 27.4 | 36.5 | 8.6 |
| Jetson Orin Nano | YOLOv10 | INT8 | 81.48 | 38.5 | 26.0 | 5.8 |
| Jetson Orin Nano | Ours | FP16 | 87.85 | 23.4 | 42.7 | 10.2 |
| Jetson Orin Nano | Ours | INT8 | 87.06 | 32.7 | 30.6 | 6.8 |
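
To make the precision trade-off in the deployment table easier to read, the sketch below computes, for each platform and model, the FPS speedup and the mAP drop when moving from FP16 to INT8. It uses only the reported figures; the helper name `int8_tradeoff` is hypothetical.

```python
# (platform, model, precision) -> (mAP %, FPS), copied from the table above.
RESULTS = {
    ("RTX 3080 Ti", "YOLOv10", "FP16"): (83.76, 114.5),
    ("RTX 3080 Ti", "YOLOv10", "INT8"): (82.52, 148.2),
    ("RTX 3080 Ti", "Ours", "FP16"): (88.41, 97.5),
    ("RTX 3080 Ti", "Ours", "INT8"): (87.19, 126.7),
    ("Jetson Orin Nano", "YOLOv10", "FP16"): (83.05, 27.4),
    ("Jetson Orin Nano", "YOLOv10", "INT8"): (81.48, 38.5),
    ("Jetson Orin Nano", "Ours", "FP16"): (87.85, 23.4),
    ("Jetson Orin Nano", "Ours", "INT8"): (87.06, 32.7),
}


def int8_tradeoff(platform, model):
    """Return (FPS speedup factor, mAP drop in points) for FP16 -> INT8."""
    map16, fps16 = RESULTS[(platform, model, "FP16")]
    map8, fps8 = RESULTS[(platform, model, "INT8")]
    return fps8 / fps16, map16 - map8


for platform in ("RTX 3080 Ti", "Jetson Orin Nano"):
    for model in ("YOLOv10", "Ours"):
        speedup, drop = int8_tradeoff(platform, model)
        print(f"{platform:17s} {model:8s} speedup x{speedup:.2f}, mAP drop {drop:.2f}")
```

On both platforms the reported INT8 results trade a mAP loss of roughly one point or less for a clear throughput gain, which is the relevant comparison for edge deployment.
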
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Zhai, Y.; Zhang, Z.; Xie, S.; Tong, C.; Luo, X.; Li, X.; Wang, L.; Zhao, Y. A Real-Time Improved YOLOv10 Model for Small and Multi-Scale Ground Target Detection in UAV LiDAR Range Images of Complex Scenes. Electronics 2026, 15, 211. https://doi.org/10.3390/electronics15010211
