A Real-Time Improved YOLOv10 Model for Small and Multi-Scale Ground Target Detection in UAV LiDAR Range Images of Complex Scenes

Zhai, Yu; Zhang, Ziyi; Xie, Sen; Tong, Chunsheng; Luo, Xiuli; Li, Xuan; Wang, Liming; Zhao, Yingliang

doi:10.3390/electronics15010211


by Yu Zhai ^1,2, Ziyi Zhang ^1,2, Sen Xie ^1,2, Chunsheng Tong ^1,2, Xiuli Luo ^1,2, Xuan Li ^1,2, Liming Wang ^1,2 and Yingliang Zhao ^1,2,*

¹ State Key Laboratory of Extreme Environment Optoelectronic Dynamic Measurement Technology and Instrument, Taiyuan 030051, China

² Shanxi Province Key Laboratory of Intelligent Detection Technology & Equipment, Taiyuan 030051, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(1), 211; https://doi.org/10.3390/electronics15010211

Submission received: 21 November 2025 / Revised: 29 December 2025 / Accepted: 29 December 2025 / Published: 1 January 2026

(This article belongs to the Special Issue Image Processing for Intelligent Electronics in Multimedia Systems)


Abstract

Ground-target detection from low-altitude unmanned aerial vehicles (UAVs) using LiDAR range images faces persistent challenges. These include sparse features for long-range targets, large scale variations caused by viewpoint changes, and severe interference from complex backgrounds. To address these issues, we propose an improved detection framework based on YOLOv10. First, we design a Swin-Conv hybrid module that combines sparse attention with deformable convolution. This module enables the network to focus on informative regions and adapt to target geometry. These capabilities jointly strengthen feature extraction for sparse, long-range targets. Second, we introduce Attentional Feature Fusion (AFF) in the neck to replace naïve feature concatenation. AFF employs multi-scale channel attention to softly select and adaptively weight features from different levels, improving robustness to multi-scale targets. In addition, we systematically study how the viewpoint distribution in the training set affects performance. The results show that moderately increasing the proportion of low-elevation-view samples significantly improves detection accuracy. Experiments on a self-built simulated LiDAR range-image dataset demonstrate that our method achieves 88.96% mAP at 54.2 FPS, which is 4.78 percentage points higher than the baseline. Deployment on the Jetson Orin Nano edge device further validates the model’s potential for real-time applications. The proposed method remains robust under noise and complex backgrounds and achieves an effective balance between detection accuracy and computational efficiency, providing a reliable solution for real-time target detection in complex low-altitude environments.

Keywords: object detection; YOLOv10; LiDAR; UAV detection; feature fusion

1. Introduction

With the rapid growth of the low-altitude economy, fine-grained perception is increasingly demanded in applications such as urban governance, environmental monitoring, emergency response, and infrastructure inspection [1]. UAVs have become an important platform for real-time, high-resolution data acquisition because of their compact size, low-altitude operation, and strong maneuverability. By flying at low altitude and adjusting viewpoints flexibly, UAVs can provide wide-area coverage and maintain effective sensing even in complex environments and adverse weather conditions [2]. Therefore, developing detection methods that are both accurate at long range and suitable for UAV deployment is crucial for practical low-altitude applications.

Among sensing modalities that can be integrated into UAVs, LiDAR is particularly attractive because it actively acquires high-resolution range information. Airborne LiDAR emits laser beams and measures the reflected signals to generate high-quality single-channel range images. This imaging mechanism is insensitive to illumination changes and provides accurate radial distance measurements, making it advantageous for target perception and recognition in complex scenarios [3]. Moreover, as LiDAR sensors become smaller and more efficient, they show strong potential for detecting small and distant targets [4]. In this work, LiDAR data are represented as 2D range images—also known as depth images in a 2D grid—where each pixel encodes radial distance. This representation is used instead of 3D point clouds or voxel grids. This representation allows the use of mature 2D detection frameworks. However, due to LiDAR imaging physics and reflectance properties, range images are often sparse and non-uniformly distributed. In practical low-altitude UAV ground-sensing tasks, the scenes are typically complex in three aspects: (1) Variable viewpoints: the UAV viewpoint changes continuously during flight, leading to diverse target scales and poses and making partial occlusion likely; (2) Complex backgrounds: ground scenes often contain dense clutter such as trees, buildings, and vehicles, resulting in high semantic similarity between targets and backgrounds; (3) Adverse-weather interference: conditions such as rain, fog, and dust introduce coupled Gaussian noise and speckle noise, severely reducing the signal-to-noise ratio. The combined effects of these factors make the extracted range-image features more fragile. Particularly for distant small targets, extracting robust features and achieving accurate recognition remain technical challenges.

As a core research direction in computer vision, object detection algorithms serve as the “brain” of UAV sensing systems. In recent years, driven by deep learning, the YOLO series has become one of the widely adopted detection frameworks on UAV edge-computing platforms because of its excellent balance between detection speed and accuracy [5]. This architecture has continued to evolve in feature extraction, multi-scale fusion, and detection-head design, with performance being steadily improved. However, directly applying standard YOLO architectures to LiDAR range images acquired on UAV platforms still faces many challenges. On the one hand, the model’s ability to capture subtle features of long-range small targets is insufficient; on the other hand, the concatenation operation in the feature pyramid network (FPN) ignores differences in importance among features of different scales and semantic levels.

To address these challenges—feature sparsity, scale variability, and noise interference in complex scenarios—this paper proposes an improved YOLOv10 model for low-altitude UAV sensing, tailored to the characteristics of single-channel LiDAR range images. The main contributions of this paper are as follows:

To address the sparsity and difficulty of capturing long-range target features in depth images, we design a Swin-Conv [6] hybrid module. We introduce sparse attention and deformable convolution into the backbone. This enables the network to adaptively focus on effective feature regions and enhance the modeling of target geometric structures. Consequently, the feature extraction and representation for small targets are improved;
To address multi-scale target variations and complex background interference, an Attentional Feature Fusion [7] module is introduced in the model neck. It employs a multi-scale channel attention mechanism to adaptively fuse features from different levels and effectively coordinate detail with semantic information. This process improves the feature-pyramid fusion efficiency and enhances robustness in detecting multi-scale targets;
Considering the inherent viewpoint characteristics of low-altitude UAV sensing, we construct a LiDAR range-image dataset. We then systematically study how different viewpoint proportions in the training data affect model performance, aiming to optimize generalization from a data perspective.

2. Related Work

2.1. Progress in 2D Image-Based Object Detection for UAV Perception

In recent years, deep learning has driven rapid progress in 2D image-based object detection. Convolutional neural network (CNN) models such as VGGNet [8], ResNet [9], and GoogLeNet [10] have established a strong foundation for feature extraction. Current mainstream approaches are broadly categorized into one-stage and two-stage methods [11]. One-stage methods, represented by the YOLO series [12,13,14], EfficientDet [15], and the Transformer-based DETR [16], are characterized by concise architectures and efficient inference. Subsequent innovations in this direction include works such as GridCLIP [17] and SADet [18]. Two-stage methods, exemplified by Cascade R-CNN [19], D2Det [20], and PETDet [21], generally achieve higher detection accuracy. These advances in general-purpose detection algorithms have laid the groundwork for visual perception in complex scenarios such as UAV applications.

To enable efficient application of object detection in UAV aerial-imaging scenarios, balancing a lightweight model against detection accuracy remains a core research challenge [22]. Research has thus evolved along two main paths. The first prioritizes lightweight design to enhance deployment efficiency. Li et al. [23] adopted a GhostNet backbone with a feature reuse mechanism, which significantly reduces model parameters and enhances compatibility with edge devices. This approach often comes at the cost of reduced accuracy in small object detection. The second path focuses on improving detection performance through enhancement modules. Zhang et al. [24] introduced feature fusion and grouped reconstruction to handle scale variation, Fang et al. [25] leveraged multimodal fusion to learn cross-spectral relationships between visible and infrared data, and Wang et al. [26] applied image enhancement techniques to improve robustness under varying illumination. However, these performance-oriented methods usually increase model complexity and computational overhead.

2.2. Progress in LiDAR Technologies for UAV Perception

For LiDAR object detection on range images, the data distribution—where pixel values represent radial distance—differs significantly from natural images. Early works, such as SalsaNext [27], applied conventional convolutions directly to range images. In contrast, recent research has shifted focus towards specialized architectures. These are designed to address the unique challenges of range images, particularly continuous scale variations and sparse small-target features. For instance, He et al. [28] designed a radially aware hybrid CNN–Transformer to reconstruct lost radial context, improving segmentation. In object detection, RangeDet [29] tackles geometric representation and scale variation, and RSN [30] combines range-image foreground segmentation with 3D sparse convolution to balance efficiency and accuracy, validating the potential of range-image-based approaches.

Effectively interpreting depth information in range images is crucial, which has driven the development of specialized deep learning models. On the one hand, related studies focus on depth completion. For example, the AdaBins framework proposed by Bhat et al. [31] improves monocular depth estimation accuracy through an adaptive depth-bin partitioning strategy. On the other hand, they explore 3D geometric architectures to better process projected or point-cloud data. SwinURNet [32] uses a hybrid Transformer–CNN architecture to perform real-time segmentation on projected range images. CR-Pillars [33] enhances feature extraction for pseudo-images generated from point clouds by introducing the convolutional block attention module (CBAM).

Towards integrating LiDAR into UAV platforms, research emphasizes lightweight, task-specific systems. A key trend is multimodal fusion. An onboard system [34] fuses LiDAR, RGB, and thermal imaging via label propagation for real-time semantic mapping. Another study [35] demonstrates a dedicated UAV LiDAR framework for detecting nearby drone swarms, highlighting its applicability in complex low-altitude scenarios.

Although existing work has made important progress in general object detection, low-altitude platform adaptation, and range-image data processing, a key research gap remains for detection tasks on UAV LiDAR range images as a specific data form. The core challenge lies in designing a dedicated detection architecture that combines highly targeted feature extraction with efficient multi-scale fusion, in order to address inherent challenges such as sparse small-target features and special geometric distributions.

3. Method

UAV object detection on LiDAR depth images faces combined issues: feature sparsity, multi-scale variations, complex background interference, and noise under adverse weather. To address these challenges, we propose an optimized framework based on an improved YOLOv10, as illustrated in Figure 1.

Specifically, this study makes two key improvements to the YOLOv10 network. First, the Swin-Conv module is used to replace the original C2f module in the backbone to enhance modeling capability for sparse features and complex backgrounds. Second, the Attentional Feature Fusion (AFF) module replaces simple concatenation operations in the neck, thereby improving the model’s robustness to scale changes and noise interference.

3.1. Swin Transformer-Conv

The C2f module in the original YOLOv10 backbone improves gradient flow and feature reuse through cross-layer connections. However, it remains fundamentally based on standard convolution operations. Due to its limited receptive field, standard convolution has difficulty establishing long-range dependencies, which is particularly inadequate for sparse depth images. Compared with RGB images, depth images lack texture information and have sparse feature distributions, which is highly unfavorable for UAV ground-target detection. Conventional convolution is limited to local operations and struggles to capture semantic correlations among distant pixels, which affects global scene understanding. In addition, from a UAV viewpoint, ground targets occupy only a small pixel area in depth images, and their subtle features are easily diluted by convolutional structures. Therefore, a hybrid architecture that can integrate local details and global context is needed.

To this end, we introduce the Swin-Conv hybrid module, whose structure is shown in Figure 2. Based on an original parallel two-branch design, this module is optimized for the sparsity characteristics of LiDAR range images to improve feature modeling for sparse structures and long-range targets. The module splits the input features into two groups, which are then fed into a sparse Swin Transformer block and a deformable residual convolution block, respectively.

The sparse Swin Transformer block constructs a global context modeling mechanism through the collaboration of window-based multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA). W-MSA partitions the feature map into multiple non-overlapping regular $M \times M$ windows and independently performs self-attention computation within each window, as shown in Figure 3a. This mechanism introduces relative position bias to enhance the spatial awareness of attention weights, while reducing computational complexity from $O(n^{2})$ to $O(n)$. To further overcome the limitation of local windows, SW-MSA adopts a window-shifting strategy, as shown in Figure 3b. After W-MSA computation, subsequent layers perform cyclic shifting of the window partition, moving $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$ pixels toward the lower-right direction. Through a masking mechanism, self-attention is ensured to be computed only among spatially adjacent features, thereby establishing cross-window dependencies while maintaining linear complexity. To adapt to the sparsity of LiDAR data and suppress noise under adverse weather, we introduce a sparse masking mechanism into attention computation. Attention is computed only over non-zero or salient feature regions, which significantly reduces computational overhead and strengthens the focus on effective structures. This mechanism enables the model to focus on real target structures under noise interference. The sparse mask $S$ is generated based on the $L_{2}$ norm of the input feature map $X$ and is binarized using an adaptive threshold $\tau$, as shown below:

$$S_{i,j} = \begin{cases} 1, & \text{if } \lVert X_{:,i,j} \rVert_{2} > \tau \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

Here, the threshold $\tau$ is adaptively computed according to statistics of feature magnitudes.
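The thresholding rule for the sparse mask can be sketched in a few lines of plain Python. The source states only that the threshold is derived from statistics of the feature magnitudes, so the rule used here (mean plus `k` times the standard deviation of the per-location norms) and the names `sparse_mask` and `k` are illustrative assumptions, not the authors' implementation.

```python
import math


def sparse_mask(features, k=0.5):
    """Binarize per-location L2 norms of a [C][H][W] nested-list feature map.

    Returns (mask, tau), where mask[i][j] is 1 if the channel-wise L2 norm
    at (i, j) exceeds tau, else 0.
    """
    channels = len(features)
    height = len(features[0])
    width = len(features[0][0])

    # L2 norm over the channel axis at every spatial location.
    norms = [
        [
            math.sqrt(sum(features[c][i][j] ** 2 for c in range(channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    flat = [v for row in norms for v in row]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((v - mean) ** 2 for v in flat) / len(flat))

    # ASSUMPTION: tau = mean + k * std; the paper does not give the exact rule.
    tau = mean + k * std

    mask = [[1 if v > tau else 0 for v in row] for row in norms]
    return mask, tau
```

With a two-channel 2 × 2 input whose only non-zero location has the channel vector (3, 4), the norm there is 5 and all other norms are 0, so only that location survives the mask.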

The deformable residual convolution block focuses on extracting local structural features. It leverages the dynamic receptive-field adaptability of deformable convolution, which allows the model to adaptively adjust the sampling locations and receptive-field shape of convolution kernels according to the geometric structure and edge contours of targets in depth images. The model thus gains a stronger inductive bias for geometric structures. This mechanism better captures edge details and spatial layouts of targets in depth images, effectively compensating for the Transformer’s weakness in local feature representation. Furthermore, it significantly improves the model’s ability to distinguish target contours in complex backgrounds, as well as its adaptability to target deformation caused by varying viewpoints.

The outputs of the two branches are dynamically weighted and fused through a cross-branch gated-attention module. This module generates a spatial weight map based on the input features, enabling adaptive soft selection between local details and global semantics. The fused features are then added to the input via a sparsity-preserving residual connection, which uses a sparse mask to restrict updates only to effective regions. This design prevents the introduction of noise in invalid areas while preserving their original input information.
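A minimal single-channel sketch of this fusion step follows. The paper does not give the gate formula, so the sigmoid gate with scalar parameters `w` and `b`, and the function name `gated_sparse_fusion`, are hypothetical stand-ins. The sketch shows only the two ideas described above: a soft per-location choice between the two branches, and a residual update that is masked out in invalid regions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_sparse_fusion(x, local_feat, global_feat, mask, w=1.0, b=0.0):
    """Fuse two branch outputs and add them to x only where mask is 1.

    All arguments except w and b are [H][W] nested lists of equal shape.
    """
    out = []
    for i, row in enumerate(x):
        out_row = []
        for j, x_ij in enumerate(row):
            # ASSUMPTION: the spatial gate is a sigmoid of a scaled input value.
            gate = sigmoid(w * x_ij + b)
            # Soft selection between global semantics and local details.
            fused = gate * global_feat[i][j] + (1.0 - gate) * local_feat[i][j]
            # Sparsity-preserving residual: masked-out locations keep x as is.
            out_row.append(x_ij + mask[i][j] * fused)
        out.append(out_row)
    return out
```

Locations where the mask is 0 pass the input through unchanged, which is the property that keeps noise from being written into invalid areas.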

By embedding this module into the YOLOv10 detector, the model’s perception of long-range targets and sparse features is effectively enhanced. This leads to improved detection accuracy and robustness in challenging scenarios—such as those with feature sparsity, complex backgrounds, and noise interference—while high computational efficiency is maintained.

3.2. Attentional Feature Fusion

The neck network of YOLOv10 performs multi-scale feature fusion through feature pyramids (e.g., FPN/PAN). Although its core operation, Concat, can integrate features from different sources, it has an inherent limitation: it treats all input channels uniformly and ignores the intrinsic differences in semantics and details among different feature levels. To achieve efficient fusion with dynamic weighting according to input content, we introduce Attentional Feature Fusion (AFF). This module is mainly designed for the core challenge of multi-scale variation, and its dynamic weighting mechanism also helps select more discriminative features under complex backgrounds and noise interference. The specific structure is shown in Figure 4.

The module first integrates low-level detail features $F_{1}$ with high-level semantic features $F_{2}$. It then analyzes these features using its core component, MS-CAM. Different from the original AFF, this paper makes a key improvement to MS-CAM: fixed-size pointwise convolution is replaced with a Multi-scale Dynamic Convolution Kernel (MSDCK), as shown in Figure 5. By adaptively adjusting the receptive-field size according to the scale characteristics of the input features, this directly strengthens the module’s core capability to handle multi-scale targets.

The improved MS-CAM generates attention weights through a dual-pathway design. The global branch compresses the features to obtain image-level statistical descriptions. The local branch uses the MSDCK structure to extract multi-scale features via a set of parallel convolution kernels with different sizes and dynamically fuses these features using a lightweight selector. This enables the module to adaptively handle targets of different scales. For small targets, it increases the weight of local details; for large targets, it strengthens broader contextual information. Finally, the module outputs a spatial attention weight map $M \in \mathbb{R}^{C \times H \times W}$ and uses it to weight and fuse the input features. The process can be expressed as:

$$F_{fusion} = M(F_{1} + F_{2}) \otimes F_{1} + (1 - M(F_{1} + F_{2})) \otimes F_{2} \tag{2}$$

Here, $F_{fusion}$ is the fused output feature, and $\otimes$ denotes element-wise multiplication. This module dynamically assigns spatial weights to features $F_{1}$ and $F_{2}$ according to the input content, thereby replacing the fixed Concat operation.
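The weighting step can be illustrated with a small sketch, assuming the standard AFF form in which the attention map weights $F_{1}$ and its complement weights $F_{2}$. The function `toy_ms_cam` is a deliberately simplified placeholder for the improved MS-CAM (it has no convolutions and no MSDCK); only `aff_fuse` reflects the fusion rule itself, and both names are hypothetical.

```python
import math


def toy_ms_cam(summed):
    """Placeholder attention: sigmoid of (local value + global mean).

    ASSUMPTION: stands in for the MS-CAM/MSDCK module, which is not
    reproduced here.
    """
    flat = [v for row in summed for v in row]
    global_ctx = sum(flat) / len(flat)
    return [[1.0 / (1.0 + math.exp(-(v + global_ctx))) for v in row]
            for row in summed]


def aff_fuse(f1, f2, attention_fn=toy_ms_cam):
    """Return M * f1 + (1 - M) * f2 element-wise, with M = attention_fn(f1 + f2)."""
    summed = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(f1, f2)]
    m = attention_fn(summed)
    return [
        [m_ij * a + (1.0 - m_ij) * b for m_ij, a, b in zip(rm, r1, r2)]
        for rm, r1, r2 in zip(m, f1, f2)
    ]
```

Because every weight lies between 0 and 1, each fused value is a convex combination of the two inputs: a weight of 1 returns the low-level feature, and a weight of 0 returns the high-level one.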

We replace all nodes in the YOLOv10 neck that use Concat for feature fusion with the improved AFF modules. This change turns feature fusion from a simple concatenation operation into a content-aware and scale-aware dynamic selection process driven by multi-scale dynamic convolution kernels. When facing multi-scale variations, viewpoint differences, and background interference, the model can autonomously focus on more discriminative feature levels. This capability significantly enhances its robustness for target detection in complex scenarios.

4. Experiments and Results

4.1. Experimental Environment and Training Parameters

The experiments were conducted on Ubuntu with Python 3.10, PyTorch 2.6, CUDA 12.4, and an RTX 3080 Ti (12 GB). The specific training configuration parameters used in the experiments are listed in Table 1.

During model training, we first employed data augmentation techniques such as Mosaic, random flipping, and random scaling to enhance generalization capability. To balance training efficiency and model representational capacity, a two-stage training strategy was adopted. In the first 100 epochs, the backbone network was frozen, and only the neck and head were trained. In the subsequent 100 epochs, the entire network was unfrozen for end-to-end fine-tuning. After training, the weights corresponding to the lowest validation loss were selected for final evaluation.
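The two-stage schedule and the checkpoint selection rule can be written down compactly. This is a framework-free sketch: the part names, the zero-based epoch indexing, and the helper names `trainable_parts` and `best_epoch` are assumptions for illustration.

```python
def trainable_parts(epoch, freeze_epochs=100, total_epochs=200):
    """Return which network parts are updated at a given zero-based epoch."""
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch outside the training schedule")
    parts = ["neck", "head"]
    if epoch >= freeze_epochs:
        # Stage 2: the backbone is unfrozen for end-to-end fine-tuning.
        parts.insert(0, "backbone")
    return parts


def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)
```

In a real training loop, the list returned by `trainable_parts` would decide which parameter groups have gradients enabled for that epoch.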

The structural parameters of the model’s core components are set as follows. For the Swin-Conv module used in the backbone, detailed configurations such as per-stage dimensions, number of heads, and convolution groups are shown in Table 2.

For the AFF module used for multi-scale feature fusion, key hyperparameter configurations such as internal convolution kernels and channel compression ratio are shown in Table 3.

The above modules and parameters constitute the model basis for this study. Ablation experiments and performance comparisons are conducted in subsequent sections based on these settings.

4.2. Objective Evaluation Indicators

For object detection and recognition tasks, this paper adopts a series of metrics for comprehensive evaluation, including Precision, Recall, F1-score, and mean Average Precision (mAP).

Precision measures the accuracy of the model’s positive predictions, and is computed as:

$$P_{re} = \frac{TP}{TP + FP} \tag{3}$$

Here, $TP$ denotes true positives (targets correctly detected), and $FP$ denotes false positives (negatives incorrectly identified as positives).

Recall reflects the model’s coverage of positive samples, and is computed as:

$$R_{ec} = \frac{TP}{TP + FN} \tag{4}$$

Here, $FN$ denotes false negatives, i.e., positive samples that are not detected.

Because Precision and Recall typically constrain each other, the F1-score is introduced as a comprehensive metric; it is the harmonic mean of Precision and Recall:

$$F_{1} = \frac{2 \cdot P_{re} \cdot R_{ec}}{P_{re} + R_{ec}} = \frac{2TP}{2TP + FP + FN} \tag{5}$$
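As a quick numerical check of these three definitions, the sketch below computes them from raw counts. The function name and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def detection_scores(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1
```

For example, with 80 true positives, 20 false positives, and 10 false negatives, Precision is 0.8, Recall is 8/9, and both forms of the F1 expression give 16/19.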

In addition, mAP is used to evaluate overall performance. In this paper, mAP50 is used as the evaluation metric, where the intersection-over-union (IoU) threshold between predicted and ground-truth boxes is set to 50%. For each class, average precision (AP) is computed from its precision–recall curve, and mAP is obtained by averaging AP over all classes, defined as:

$$\mathrm{mAP50} = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P_{re}^{(i)}(R_{ec}) \, dR_{ec} \tag{6}$$

where $N$ is the number of classes and $P_{re}^{(i)}(R_{ec})$ is the precision–recall curve of class $i$.
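One common way to evaluate the area under a precision-recall curve numerically is shown below. It assumes that each detection has already been matched to ground truth at IoU 0.5, and it uses the all-point interpolation with a monotone precision envelope familiar from PASCAL VOC style evaluation. The paper does not state which interpolation it uses, so that choice and the function names are assumptions.

```python
def average_precision(scored_hits, num_gt):
    """AP for one class.

    scored_hits: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth boxes of this class.
    """
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]
    for _, hit in ranked:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))

    # Make precision non-increasing in recall (monotone envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas wherever recall increases.
    return sum(
        (recalls[i] - recalls[i - 1]) * precisions[i]
        for i in range(1, len(recalls))
    )


def mean_average_precision(per_class):
    """per_class maps class name -> (scored_hits, num_gt); returns the mean AP."""
    aps = [average_precision(hits, n) for hits, n in per_class.values()]
    return sum(aps) / len(aps)
```

For a class with two ground-truth boxes and ranked detections hit, miss, hit, the AP is 0.5 · 1 + 0.5 · (2/3) = 5/6.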

Subsequent experiments will evaluate the detection and recognition performance of different network models using the above metrics.

4.3. Construction of an All-View Range-Image Dataset

To address the core challenge of uncontrollable target poses in UAV ground sensing, and to overcome the limitation that most existing public range-image datasets are from fixed-viewpoint scenarios, this study constructs a large-scale, multi-view LiDAR range-image dataset.

4.3.1. Dataset Construction

The depth-image dataset used for training and testing in this study is generated by a self-developed laser imaging LiDAR simulation system. This system can simulate large-scale complex scenes containing multiple target classes, omnidirectional poses, different noise levels, and resolutions, providing rich and controllable data sources for deep learning methods. Examples of generated range images are shown in Figure 6.

The core of dataset construction is parameterized 3D scene generation and imaging simulation. We select cars, trucks, pedestrians, and tanks as detection targets. These four classes are nearly evenly distributed in the dataset, each accounting for approximately 25%. Houses and trees are added as background elements to enhance scene realism. All 3D models are represented as triangular meshes. Using Euler-angle rotation matrices and translation vectors, we independently adjust the 3D pose (azimuth, viewing angle, and spin angle) and position of each target. This process enables the construction of diverse scenes with arbitrary layouts and occlusion relationships. The sampling principle of the simulation algorithm is illustrated in Figure 7. Depth images are generated using a projection sampling algorithm. This algorithm projects the 3D scene onto the imaging plane and performs discrete sampling at a preset resolution. The depth value for each pixel is obtained by solving the plane equation of the corresponding 3D triangular facet. A maximum-depth criterion is applied to handle occlusion between targets and self-occlusion. To approximate measurement errors of real LiDAR, a Gaussian–impulsive mixed noise model is introduced in the simulation. This model aims to simulate Gaussian random errors caused by sensor thermal noise, as well as impulsive noise induced by multipath effects and other factors. By adjusting the noise-intensity parameter, simulated images under different signal-to-noise ratios can be generated. This makes the data statistically closer to real acquisition results, thereby enhancing the generalization ability of algorithms trained on this dataset.
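The Gaussian–impulsive mixed noise model can be sketched as follows. The paper does not report the noise parameters or how impulsive outliers are valued, so `sigma`, `impulse_prob`, `max_range`, and the choice of replacing an impulsive pixel with either 0 or the maximum range are all hypothetical.

```python
import random


def add_mixed_noise(depth, sigma=0.05, impulse_prob=0.01, max_range=100.0, seed=0):
    """Apply Gaussian jitter plus sparse impulsive outliers to a [H][W] depth map."""
    rng = random.Random(seed)
    noisy = []
    for row in depth:
        new_row = []
        for d in row:
            if rng.random() < impulse_prob:
                # ASSUMPTION: impulsive noise replaces the pixel with an extreme value.
                value = rng.choice((0.0, max_range))
            else:
                # Zero-mean Gaussian error, e.g. from sensor thermal noise.
                value = d + rng.gauss(0.0, sigma)
            # Keep the result inside the valid sensing range.
            new_row.append(min(max(value, 0.0), max_range))
        noisy.append(new_row)
    return noisy
```

Raising `sigma` or `impulse_prob` lowers the effective signal-to-noise ratio, which corresponds to the adjustable noise-intensity parameter described above.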

This study uses a random generation mode to construct the dataset in batches. As shown in Table 4, all key simulation parameters—including target type, quantity, pose angles, spatial position, imaging field of view, and noise parameters—are randomly and independently selected within preset reasonable ranges. This design ensures diversity and uniform distribution of generated samples in the feature space. Based on this process, this study generates a total of 6500 simulated depth images with a resolution of 512 × 512.

To provide ground-truth labels for supervised learning, we annotated all images using the LabelImg tool [36]. The YOLO annotation format is used, where the label information includes the minimum bounding rectangle for each target of interest in the image and its corresponding class label. This format can be directly used for training YOLO-series models.
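For readers unfamiliar with the YOLO annotation format, each label line stores a class index followed by the box center and size, all normalized by the image dimensions. The helper below converts a pixel-space bounding rectangle into such a line for a 512 × 512 image. The function name and the class index used in the example are assumptions; the paper does not list its class-to-index mapping.

```python
def to_yolo_line(class_id, xmin, ymin, xmax, ymax, img_w=512, img_h=512):
    """Format one bounding box as 'class x_center y_center width height' (normalized)."""
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
```

A box spanning pixels (128, 256) to (384, 512) with class index 2 becomes `2 0.500000 0.750000 0.500000 0.500000`.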

For dataset splitting, among the 6500 annotated range images, 500 consecutively generated range images are selected as the test set for subsequent ablation and comparative experiments, while the remaining 6000 images are used as the training and validation sets, split at a ratio of 9:1.
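The split sizes follow directly from these numbers: 500 test images, and 6000 remaining images divided 9:1 into 5400 training and 600 validation images. The sketch below reproduces that arithmetic on image indices. Which 500 consecutive images form the test set, and whether the 9:1 split is shuffled, are not stated in the paper, so taking the last 500 indices and shuffling with a fixed seed are assumptions.

```python
import random


def split_dataset(num_images=6500, test_size=500, train_ratio=0.9, seed=0):
    """Return (train, val, test) lists of image indices."""
    indices = list(range(num_images))
    # ASSUMPTION: the consecutive test block is the last `test_size` images.
    test = indices[-test_size:]
    remaining = indices[:-test_size]
    random.Random(seed).shuffle(remaining)
    n_train = round(len(remaining) * train_ratio)
    return remaining[:n_train], remaining[n_train:], test
```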

4.3.2. Dataset Optimization Experiment for Model Training Efficiency

After obtaining the initial data, we further investigate the impact of data distribution on model performance and attempt to optimize detection performance by adjusting the composition of the training set. To ensure sufficient training data while controlling the experimental cycle and computational cost, we randomly selected 3000 images from the generated 6500 range images for the dataset optimization experiment. The training, validation, and test sets are split at a ratio of 8:1:1.

This experiment is motivated by an initial observation of UAV range-image data. In the randomly generated dataset, the ratio of high- to low-viewpoint samples is approximately 5:5. We further observe that the learnable features in range images vary significantly with the acquisition viewpoint. For images acquired at a small viewpoint angle (defined in this paper as a pitch angle of 0–40°, approximately top-down; these are the high-viewpoint samples), the overall image features change little despite variations in other sampling angles or relative target positions. Consequently, the learnable features are relatively limited, as illustrated in Figure 8a,b. When the acquisition viewpoint angle is large (i.e., a pitch angle of 40–80°, approximately horizontal; these are the low-viewpoint samples), changes in other sampling angles or relative target positions lead to significant variations in overall image features. This results in a richer set of learnable features, as shown in Figure 8c,d. Based on this observation, we speculate that the original balanced distribution may limit the model’s ability to learn from samples with different information content, and therefore hypothesize that adjusting this ratio may further improve model recognition accuracy.

To verify this hypothesis, while keeping the total number of images unchanged, we construct five training–validation sets with high-to-low viewpoint ratios of 5:5, 4:6, 3:7, 2:8, and 1:9 to train the improved YOLOv10 model.
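Building a fixed-size set with a prescribed viewpoint ratio amounts to stratified sampling over the two pitch ranges. The sketch below treats samples with a pitch angle below 40° as high-viewpoint (approximately top-down) and the rest as low-viewpoint, following the definitions in the text. The function name, the boundary handling at exactly 40°, and the seed are assumptions.

```python
import random


def sample_by_view_ratio(pitches, total, high_parts, low_parts, boundary=40.0, seed=0):
    """Pick `total` sample indices with a high:low viewpoint ratio of high_parts:low_parts."""
    rng = random.Random(seed)
    high = [i for i, p in enumerate(pitches) if p < boundary]
    low = [i for i, p in enumerate(pitches) if p >= boundary]

    n_high = round(total * high_parts / (high_parts + low_parts))
    n_low = total - n_high
    if n_high > len(high) or n_low > len(low):
        raise ValueError("not enough samples in one viewpoint group")

    # Sample without replacement inside each group, keeping the total fixed.
    return rng.sample(high, n_high) + rng.sample(low, n_low)
```

Calling it with ratios 5:5, 4:6, 3:7, 2:8, and 1:9 on the same pool yields the five candidate sets while the total number of images stays unchanged.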

During testing, in addition to the overall test set, the test set is divided into eight subsets at 10° intervals to evaluate model performance across different viewpoint ranges in detail. As shown in Table 5, the model achieves the best performance in overall mAP and on multiple subsets when trained with a 3:7 ratio. In contrast, the 5:5 ratio yields the worst performance on both the overall test set and across multiple viewpoint ranges. This phenomenon is consistent with our in-depth analysis of data characteristics. The experiments show that high-viewpoint range images exhibit minimal overall feature changes under pose variations, resulting in limited learnable information. In contrast, low-viewpoint range images are highly sensitive to pose variations and contain richer discriminative features. Therefore, appropriately increasing the proportion of low-viewpoint samples helps the model focus on more discriminative features, thereby improving overall performance.
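The per-viewpoint evaluation relies on assigning each test image to one of eight 10° bins over the 0–80° pitch range. A minimal sketch of that binning follows; placing a pitch of exactly 80° in the last bin is an assumption, as are the function names.

```python
def view_bin(pitch_deg, bin_width=10.0, num_bins=8):
    """Map a pitch angle in [0, 80] degrees to a bin index 0..7."""
    upper = bin_width * num_bins
    if not 0.0 <= pitch_deg <= upper:
        raise ValueError("pitch angle outside the simulated range")
    # The upper edge (80 degrees) is folded into the last bin.
    return min(int(pitch_deg // bin_width), num_bins - 1)


def group_by_view(pitches):
    """Group sample indices by their viewpoint bin."""
    groups = {b: [] for b in range(8)}
    for idx, pitch in enumerate(pitches):
        groups[view_bin(pitch)].append(idx)
    return groups
```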

Based on these results, we finally choose the training–validation set with a high-to-low viewpoint ratio of 3:7 for subsequent experiments, aiming to effectively improve model detection accuracy without changing the total amount of data.

4.4. Analysis of Experimental Results

After model training, this subsection conducts a systematic verification and analysis of the effectiveness of the proposed improvements. First, ablation experiments are conducted under the same experimental settings. These experiments are designed to investigate the individual contributions of the Swin-Conv and AFF modules, as well as their synergistic effect, thereby clarifying the sources of performance improvement. Subsequently, we conduct a comprehensive comparison between our final improved model and current mainstream object detection algorithms. This comparison serves to objectively evaluate the advancement and competitiveness of our method for UAV detection on LiDAR range images.

4.4.1. Ablation Study

To accurately evaluate the contributions of the Swin-Conv module and the AFF module, we conducted systematic ablation experiments on the self-built dataset, and the results are shown in Table 6.

The ablation results show that each improvement yields a clear performance gain. The baseline YOLOv10 achieves an mAP of 84.18%, with AP scores of 73.51%, 88.16%, 88.82%, and 86.24% for pedestrians, tanks, trucks, and cars, respectively. After introducing the Swin-Conv module, mAP increases to 87.06%, and AP increases for all four classes. Precision and Recall also improve, indicating that this module enhances feature representation and provides higher-quality base features for detection. Introducing the AFF module alone increases mAP to 87.48%. This module performs particularly well on pedestrian detection, with an AP improvement of 4.85 percentage points over the baseline, but its results on truck and car detection are slightly lower than those achieved by introducing Swin-Conv alone. This difference indicates that while the AFF module effectively improves recall for difficult targets such as pedestrians, its dynamic weighting mechanism may also interfere with the relatively stable and consistent feature representations of some structured targets (e.g., trucks and cars), resulting in a slight negative impact. Our combined scheme achieves the best performance. The mAP improves by 4.78 percentage points over the baseline, and the AP for pedestrians, tanks, trucks, and cars increases by 6.66, 5.02, 2.71, and 4.73 percentage points, respectively. These results suggest that the high-quality features provided by Swin-Conv suppress possible noise introduced by AFF, while AFF further enhances the model’s adaptability to multi-scale targets on this basis. Together, they achieve the highest Precision, Recall, and F₁ among the four configurations.
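
The gains quoted above follow directly from Table 6. As a quick check, the sketch below recomputes them in percentage points; the helper name `gains_over` is introduced here for illustration.

```python
# AP and mAP values (%) copied from Table 6.
COLUMNS = ["Pedestrian", "Tank", "Truck", "Car", "mAP"]
TABLE6 = {
    "YOLOv10": [73.51, 88.16, 88.82, 86.24, 84.18],
    "YOLOv10 + Swin-Conv": [76.48, 91.47, 90.32, 89.73, 87.06],
    "YOLOv10 + AFF": [78.36, 92.29, 89.64, 89.61, 87.48],
    "Ours": [80.17, 93.18, 91.53, 90.97, 88.96],
}

def gains_over(table, variant, reference="YOLOv10"):
    """Difference `variant` minus `reference`, in percentage points per column."""
    return {
        name: round(v - r, 2)
        for name, v, r in zip(COLUMNS, table[variant], table[reference])
    }

full_gain = gains_over(TABLE6, "Ours")
aff_vs_swin = gains_over(TABLE6, "YOLOv10 + AFF", "YOLOv10 + Swin-Conv")
print(full_gain)
print(aff_vs_swin)
```

The second comparison makes the trade-off explicit: relative to Swin-Conv alone, AFF alone gains on pedestrians and tanks but loses 0.68 points on trucks and 0.12 points on cars.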

4.4.2. Comparative Experiment

To systematically evaluate the overall performance of the proposed method, comparative experiments are conducted against mainstream detectors, including two-stage methods, Transformer-based architectures, and the YOLO family, and the results are shown in Table 7.

The proposed method achieves a favorable balance between detection accuracy and model efficiency. In terms of accuracy, it ranks first with an mAP of 88.96%, which is 14.25 percentage points higher than Fast R-CNN and 10.05 percentage points higher than DETR. It also improves on YOLOv10, YOLOv8, and YOLOv7 by 4.78, 5.23, and 6.40 percentage points, respectively. Moreover, the proposed method achieves the highest Recall and F1-score, indicating that the introduced Swin-Conv and AFF modules enhance feature representation and optimize multi-scale fusion. Consequently, the overall detection capability and robustness of the model are improved. In terms of model complexity and inference efficiency, the proposed method remains competitive on a mainstream high-performance computing platform. It has 28.9 M parameters, fewer than DETR’s 41.2 M and YOLOv7’s 36.9 M, although more than the 15.4 M of the baseline YOLOv10. It requires 46.7 B FLOPs, which is far lower than Fast R-CNN’s 142 B and DETR’s 154 B, and lower than YOLOv8’s 53.2 B and YOLOv7’s 70.5 B; only the baseline YOLOv10 (37.9 B) requires fewer. On this platform, the proposed method achieves an inference speed of 54.2 FPS. This frame rate is significantly higher than that of Fast R-CNN (12.1 FPS) and DETR (23.4 FPS) and is comparable to that of the YOLO-family detectors (53.6 to 63.3 FPS), meeting the requirement for real-time detection.

The experimental results confirm that the proposed method achieves high accuracy while successfully controlling both model parameter count and computational overhead. This balance gives it strong practical potential for high-accuracy, real-time detection in moderately resource-constrained scenarios.
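
One way to make this accuracy and speed balance concrete is to ask which detectors in Table 7 are not outperformed on both mAP and FPS by any other detector. The sketch below computes this Pareto front; it is an illustrative analysis added here, not part of the original experiments, and the function names are assumptions.

```python
# (mAP %, FPS) pairs copied from Table 7.
TABLE7 = {
    "Fast R-CNN": (74.71, 12.1),
    "DETR": (78.91, 23.4),
    "YOLOv7": (82.56, 53.6),
    "YOLOv8": (83.73, 60.9),
    "YOLOv10": (84.18, 63.3),
    "Ours": (88.96, 54.2),
}

def dominates(a, b):
    """True if `a` is at least as good as `b` on both axes and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(models):
    """Names of the models that no other model dominates."""
    return [
        name for name, score in models.items()
        if not any(dominates(other, score)
                   for other_name, other in models.items()
                   if other_name != name)
    ]

front = pareto_front(TABLE7)
print(front)  # ['YOLOv10', 'Ours']
```

Only the baseline YOLOv10 (fastest) and the proposed model (most accurate) remain on the front, which matches the reading that the proposed model trades some speed for accuracy rather than improving both.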

4.4.3. Real-Time Performance Evaluation via Edge-Device Deployment

To evaluate performance in practical scenarios, we benchmarked the baseline YOLOv10 and our proposed model on two heterogeneous platforms: an NVIDIA RTX 3080 Ti (12 GB), representing a high-performance server GPU, and an NVIDIA Jetson Orin Nano Super (8 GB, 15 W mode), representing a typical onboard edge computing unit for UAVs. We report results under both FP16 and INT8 precision to assess whether the improved model satisfies real-time requirements on resource-constrained edge hardware, thereby supporting engineering deployment.

We optimized and deployed the models using the TensorRT inference framework. Key metrics were recorded, including mean Average Precision (mAP@50), end-to-end throughput, mean latency, and inference time. All measurements were obtained in a streaming inference setting with batch size 1 to emulate practical UAV perception workloads.
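
A streaming, batch-size-1 measurement of this kind can be sketched with a small timing harness. The example below is an assumption-laden illustration, not the benchmarking code used for Table 8: it assumes that mean latency covers preprocessing, inference, and postprocessing while inference time covers the forward pass only, and the three `fake_*` functions are hypothetical stand-ins for a real deployed engine.

```python
import statistics
import time

def benchmark(preprocess, infer, postprocess, frames, warmup=10):
    """Stream frames one at a time (batch size 1) and collect timing statistics."""
    for frame in frames[:warmup]:
        postprocess(infer(preprocess(frame)))  # warm-up, excluded from timing
    latency_ms, inference_ms = [], []
    start = time.perf_counter()
    for frame in frames[warmup:]:
        t0 = time.perf_counter()
        tensor = preprocess(frame)
        t1 = time.perf_counter()
        raw = infer(tensor)
        t2 = time.perf_counter()
        postprocess(raw)
        t3 = time.perf_counter()
        latency_ms.append((t3 - t0) * 1000.0)    # end-to-end time per frame
        inference_ms.append((t2 - t1) * 1000.0)  # forward pass only
    elapsed_s = time.perf_counter() - start
    return {
        "fps": len(latency_ms) / elapsed_s,
        "mean_latency_ms": statistics.mean(latency_ms),
        "mean_inference_ms": statistics.mean(inference_ms),
    }

# Hypothetical stand-ins; a real deployment would call its own engine here.
def fake_preprocess(frame):
    return frame

def fake_infer(tensor):
    time.sleep(0.002)  # pretend the forward pass takes about 2 ms
    return []

def fake_postprocess(raw):
    return raw

stats = benchmark(fake_preprocess, fake_infer, fake_postprocess,
                  frames=[None] * 40)
print(stats)
```

Excluding warm-up iterations and using a monotonic clock such as `time.perf_counter` are common precautions, since the first few inferences on a GPU are usually slower than steady-state ones.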

According to Table 8, on the Jetson Orin Nano edge platform used for UAV deployment, the proposed model (Ours) achieves an end-to-end throughput of 32.7 FPS with INT8 quantization, with a mean latency of 30.6 ms. This exceeds the widely adopted 30 FPS real-time threshold in vision systems and meets the real-time processing requirements of UAV platforms; under FP16 precision, the same model reaches 23.4 FPS. Although its INT8 throughput is lower than that of the lightweight baseline YOLOv10 (38.5 FPS), the proposed model maintains a high mAP of 87.06% under INT8 precision. It trades roughly a 15% throughput reduction for an absolute accuracy gain of 5.58 percentage points, offering an effective solution for high-accuracy real-time perception on UAVs.
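
The trade-off figures above can be reproduced from the INT8 rows of Table 8 for the Jetson Orin Nano. The short calculation below is only a worked check of that arithmetic; the variable names are chosen here for illustration.

```python
# INT8 results on the Jetson Orin Nano, copied from Table 8.
baseline = {"map": 81.48, "fps": 38.5, "latency_ms": 26.0}
ours = {"map": 87.06, "fps": 32.7, "latency_ms": 30.6}

# Relative throughput reduction of the proposed model versus the baseline.
throughput_drop_pct = (baseline["fps"] - ours["fps"]) / baseline["fps"] * 100

# Absolute accuracy gain in percentage points.
map_gain_points = ours["map"] - baseline["map"]

# At batch size 1, throughput should be close to the reciprocal of latency.
fps_from_latency = 1000.0 / ours["latency_ms"]

print(f"{throughput_drop_pct:.1f}% {map_gain_points:.2f} {fps_from_latency:.1f}")
# prints: 15.1% 5.58 32.7
```

The reciprocal of the 30.6 ms mean latency reproduces the reported 32.7 FPS, which is what one would expect when frames are processed strictly one after another.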

4.4.4. Visual Comparative Analysis of Interference Robustness

To visually verify the robustness of the proposed method under noise interference, we selected two typical complex scenes from the test set for qualitative comparison. In the experiment, the proposed algorithm and the baseline YOLOv10 perform inference on the same batch of images under noise-free and noisy conditions. The results are shown in Figure 9.

Our analysis indicates that under noise-free conditions, the proposed method outperforms the baseline in two key aspects. First, it suppresses false positives induced by complex backgrounds—such as trees and buildings—more effectively. Second, it maintains higher recall and localization accuracy. After noise is introduced, the baseline exhibits pronounced missed detections and misclassifications. Specifically, under high-viewpoint conditions where target texture becomes scarce, the baseline misclassifies Tanks as Cars due to their similar shapes (Figure 9b,d). Moreover, under noise interference, the baseline fails to extract discriminative features for some targets, leading to missed detections (Figure 9d). In contrast, our method maintains stable detection performance and shows clear advantages under challenging conditions, including dense multi-target scenarios, occlusions, and the need to discriminate visually similar targets. These qualitative results corroborate the quantitative analysis. They demonstrate that the proposed method outperforms the baseline in noise robustness, adaptability to complex scenes, and small-target detection capability, thereby offering greater practical value.

5. Conclusions

This paper proposes an object detection method based on an improved YOLOv10 for UAV low-altitude sensing tasks. We introduce the Swin-Conv module to enhance global context modeling for sparse, long-range small targets, and the AFF module for adaptive cross-level feature fusion and multi-scale coordination. Together, they improve detection robustness for multi-scale targets, particularly sparse long-range ones. Experimental results show that, while maintaining a high inference speed (54.2 FPS), the method outperforms mainstream detection models on key metrics such as mAP (88.96%) and Recall, improving mAP by 4.78 percentage points over the baseline model. In addition, deployment on the Jetson Orin Nano edge device supports its feasibility for real-world deployment. This study thus provides an effective solution for object detection in complex scenarios.

It should be noted that this study validates the method using simulated data, primarily because real LiDAR data that meet the requirements of multi-viewpoint and complex-scene settings are difficult to acquire. Although strong performance is observed in simulation, the model’s practicality and generalization capability still require further verification on real-world data. Future work will continue along two directions—model lightweighting and real-world validation—to facilitate practical adoption of this technique on UAV platforms.

Author Contributions

Conceptualization, Y.Z. (Yu Zhai) and Y.Z. (Yingliang Zhao); methodology, Y.Z. (Yu Zhai), Z.Z. and S.X.; software, Z.Z. and X.L. (Xiuli Luo); validation, S.X., C.T. and X.L. (Xiuli Luo); formal analysis, Y.Z. (Yu Zhai) and X.L. (Xuan Li); investigation, Y.Z. (Yu Zhai), Z.Z., S.X., C.T. and X.L. (Xiuli Luo); resources, L.W. and Y.Z. (Yingliang Zhao); data curation, Y.Z. (Yu Zhai) and Z.Z.; writing—original draft preparation, Y.Z. (Yu Zhai); writing—review and editing, Y.Z. (Yu Zhai), Z.Z., S.X., C.T., X.L. (Xiuli Luo), X.L. (Xuan Li), L.W. and Y.Z. (Yingliang Zhao); visualization, Y.Z. (Yu Zhai) and Z.Z.; supervision, Y.Z. (Yingliang Zhao); project administration, Y.Z. (Yingliang Zhao); funding acquisition, Y.Z. (Yingliang Zhao). All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Foundation of the State Key Laboratory of Dynamic Measurement Technology, North University of China (grant number 2023-SYSJJ-06), and the Fundamental Research Program of Shanxi Province (grant number 202303021212207).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Improved Network Architecture for YOLOv10.

Figure 2. Network Architecture of the Swin-Conv Hybrid Module. (a) Swin-Conv Module; (b) Swin Transformer Block.

Figure 3. Window partitioning methods for different modules. (a) W-MSA; (b) SW-MSA.

Figure 4. Network Architecture of the AFF Module.

Figure 5. Architecture of Multi-scale Dynamic Convolution Kernel.

Figure 6. Randomly generated range-image data from the laser imaging LiDAR range-image simulation software. The images illustrate simulated data under varied pose angles, different noise levels, and complex scenes containing multiple target types.

Figure 7. Sampling principle of the simulation algorithm: (a) pose-angle-related parameters; (b) target- and image-related parameters.

Figure 8. Comparison of overall feature changes in range profiles acquired under different viewpoints, where θ denotes azimuth angle, β denotes spin angle, and φ denotes viewing angle. (a,b) Range profiles obtained at a viewing angle of 10° by changing sampling azimuth, spin angle, target position, and rotation angle; (c,d) range profiles obtained at a viewing angle of 70° from the scenarios in (a,b).

Figure 9. Visualization and Comparative Analysis of Robustness Against Interference. (a) Detection results by Ours under noise-free conditions; (b) Detection results by the Baseline under noise-free conditions; (c) Detection results by Ours under noisy conditions; (d) Detection results by the Baseline under noisy conditions.

Table 1. Experimental parameter settings.

Parameter	Setting
Batch size	8
Number of epochs	200
Image resolution	512 × 512
Optimizer	SGD
Initial learning rate	0.001
Momentum	0.9
Weight decay	0.0005

Table 2. Configuration parameters of the Swin-Conv module.

Configuration Parameter	Setting
Embedding dimensions	[64, 128, 256, 480]
Number of attention heads	[2, 4, 8, 12]
Window size	[7, 7, 7, 7]
Number of blocks	[2, 4, 4, 2]
Number of deformable convolution groups	[1, 2, 3, 4]
Sparse threshold factor	0.01

Table 3. Configuration parameters of the AFF module.

Module	Hyperparameter	Setting
MSDCK	Parallel kernel sizes	[3, 5]
MSDCK	Internal channels	C/16
Scale selector	MLP structure	GAP → C/32 → ReLU → 2
Scale selector	Weight normalization	Softmax
Global branch	Channel compression ratio r	16
Basic configuration	Activation and normalization	ReLU, Sigmoid, BN

Table 4. Parameter ranges for randomly generated range profiles.

Target-Related Parameters		Pose Angle Parameters		Image-Related Parameters
Parameter Name	Range	Parameter Name	Range	Parameter Name	Range
Number of Targets	1~12	Azimuth Angle	0~359°	Overall Position	0~2
Rotation Angle	0~359°	Viewing Angle	0~80°	Field of View	40~80%
Relative Position	0~8	Spin Angle	−45~45°	Noise Intensity	20~40 dB

Table 5. Evaluation results on the test set of the improved YOLOv10 model trained with training–validation sets containing different ratios of high- and low-viewpoint range images.

Ratio	Overall	0–10°	10–20°	20–30°	30–40°	40–50°	50–60°	60–70°	70–80°
5:5	86.92%	89.41%	89.69%	87.93%	86.88%	86.34%	85.78%	85.13%	82.03%
4:6	87.17%	89.90%	89.51%	88.34%	88.53%	87.17%	84.84%	85.63%	82.36%
3:7	87.51%	90.24%	89.76%	88.05%	88.83%	86.53%	87.15%	86.90%	81.94%
2:8	87.49%	89.86%	89.09%	89.02%	88.49%	85.64%	87.39%	86.28%	83.71%
1:9	87.29%	89.16%	88.98%	88.18%	89.14%	87.44%	86.02%	86.65%	81.22%

Table 6. Ablation study results.

Method	Pedestrian AP/%	Tank AP/%	Truck AP/%	Car AP/%	mAP/%	Pre/%	Rec/%	F₁/%
YOLOv10	73.51	88.16	88.82	86.24	84.18	82.94	81.95	82.44
YOLOv10 + Swin-Conv	76.48	91.47	90.32	89.73	87.06	85.80	84.85	85.32
YOLOv10 + AFF	78.36	92.29	89.64	89.61	87.48	86.68	84.53	85.62
Ours	80.17	93.18	91.53	90.97	88.96	88.52	86.47	87.47

Table 7. Comparison results with mainstream detection models.

Method	Pre/%	Rec/%	F₁/%	mAP/%	FPS	Params/M	Size/MB	FLOPs/B
Fast R-CNN	71.56	73.44	72.49	74.71	12.1	25.6	97.28	142
DETR	76.23	78.12	77.16	78.91	23.4	41.2	156.56	154
YOLOv7	82.11	80.37	81.23	82.56	53.6	36.9	140.22	70.5
YOLOv8	81.26	80.93	81.09	83.73	60.9	25.7	97.66	53.2
YOLOv10	82.94	81.95	82.44	84.18	63.3	15.4	58.52	37.9
Ours	88.52	86.47	87.47	88.96	54.2	28.9	109.82	46.7

Table 8. Comparison of real-time performance on edge-device deployment.

Hardware Platform	Model	Computational Precision	mAP/%	FPS	Mean Latency/ms	Inference Time/ms
RTX 3080 Ti	YOLOv10	FP16	83.76	114.5	8.8	7.3
RTX 3080 Ti	YOLOv10	INT8	82.52	148.2	6.8	5.2
RTX 3080 Ti	Ours	FP16	88.41	97.5	10.3	8.9
RTX 3080 Ti	Ours	INT8	87.19	126.7	7.9	6.4
Jetson Orin Nano	YOLOv10	FP16	83.05	27.4	36.5	8.6
Jetson Orin Nano	YOLOv10	INT8	81.48	38.5	26.0	5.8
Jetson Orin Nano	Ours	FP16	87.85	23.4	42.7	10.2
Jetson Orin Nano	Ours	INT8	87.06	32.7	30.6	6.8
