Article

Research on Lightweight Tracking of Small-Sized UAVs Based on the Improved YOLOv8N-Drone Architecture

1 School of Mechanical and Electrical Engineering, North University of China, Taiyuan 030051, China
2 Institute of Intelligent Weapons, North University of China, Taiyuan 030051, China
3 School of Mechanical Engineering, North University of China, Taiyuan 030051, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(8), 551; https://doi.org/10.3390/drones9080551
Submission received: 26 May 2025 / Revised: 30 July 2025 / Accepted: 1 August 2025 / Published: 5 August 2025

Abstract

Traditional unmanned aerial vehicle (UAV) detection and tracking methods have long faced the twin challenges of high cost and poor efficiency. In real-world battlefield environments with complex backgrounds, occlusions, and varying speeds, existing techniques struggle to track small UAVs accurately and stably. To tackle these issues, this paper presents an enhanced YOLOv8N-Drone-based algorithm for improved tracking of small UAV targets. First, a novel module named C2f-DSFEM (Depthwise-Separable and Sobel Feature Enhancement Module) is designed, integrating Sobel convolution with depthwise separable convolution across layers. Edge detail extraction and multi-scale feature representation are synchronized through a bidirectional feature enhancement mechanism, which significantly improves the discriminability of target features against complex backgrounds. To address feature confusion, an improved lightweight Context Anchor Attention (CAA) mechanism is integrated into the Neck network, effectively improving the system's adaptability to complex scenes. By employing a position-aware weight allocation strategy, this approach adaptively suppresses background interference and focuses precisely on the target region, thereby improving localization accuracy. At the loss-function level, the traditional classification loss is replaced by Focal Loss, which suppresses the contribution of easy-to-classify samples through a dynamic weight adjustment strategy while raising the priority of hard samples during training; the class imbalance between positive and negative samples is thus significantly mitigated. Experimental results show that the enhanced YOLOv8 boosts mean average precision (mAP@0.5) by 12.3%, reaching 99.2%. In terms of tracking performance, the proposed YOLOv8N-Drone algorithm achieves a 19.2% improvement in Multiple Object Tracking Accuracy (MOTA) under complex multi-scenario conditions. Additionally, the IDF1 score increases by 6.8%, and the number of ID switches is reduced by 85.2%, indicating significant improvements in both the accuracy and stability of UAV tracking. Compared with other mainstream algorithms, the proposed method demonstrates clear advantages in tracking performance, offering a more effective and reliable solution for small-target tracking tasks in UAV applications.

1. Introduction

In recent years, the application of unmanned aerial vehicles (UAVs) in scenarios such as military reconnaissance and civil logistics has been expanding, mainly due to their excellent maneuverability and precise control characteristics [1,2,3]. Within civilian contexts, UAVs have proven to be of significant value in areas such as agricultural protection, logistics, and transportation [4,5]. In military scenarios, their utilization has surged in various localized conflicts, driven by their capabilities in reconnaissance, surveillance, and precision strikes [6]. However, detecting and tracking small UAV targets remains challenging in complex environments, primarily due to their inherent characteristics, including a small radar cross-section and low thermal signature [7,8,9]. Therefore, developing effective methods for the identification, localization, and tracking of small UAV targets is of critical importance.
In the field of target detection, deep learning methods have demonstrated significant advantages owing to their strong representation and generalization capabilities, and have become an approach that promises to overcome the limitations of traditional techniques. Current deep learning detection frameworks fall into two main categories. The first comprises two-stage detection algorithms, represented by Fast R-CNN, Faster R-CNN, and Mask R-CNN [8,9,10], whose core pipeline consists of region proposal generation followed by category determination and bounding box refinement. Fast R-CNN improves detection efficiency by introducing the region-of-interest (ROI) pooling layer, enabling feature sharing across candidate regions, and supports joint optimization of classification and regression tasks. Faster R-CNN builds a unified detection architecture by integrating a region proposal network (RPN) with a classification and regression module; this design enables end-to-end training of region proposal generation and target detection, significantly improving both detection accuracy and efficiency. However, its high computational complexity limits its applicability in real-time scenarios, and its performance on small targets remains suboptimal. Building on the Faster R-CNN framework, Mask R-CNN constructs a hybrid architecture with synergistic detection and segmentation by introducing an independent mask prediction branch. This design enables the model to generate high-precision pixel-level instance masks while performing target localization, significantly improving spatial structure resolution in complex scenes.
The other category consists of single-stage algorithms (e.g., SSD, YOLO [11,12,13]), in which the entire image is treated as a candidate region. These methods directly regress the class labels and bounding box coordinates of targets in a single forward pass. Compared to two-stage approaches, single-stage detectors offer faster inference, making them better suited for real-time use. Since the introduction of the YOLO algorithm, such models have attracted extensive research interest due to their ability to maintain detection accuracy while enabling rapid deployment across various tasks. Hamid et al. [14] proposed a method to mitigate feature loss during information propagation by incorporating a dense connectivity module (DenseNet) and applying K-means clustering to assign scale-adapted anchor boxes to different feature maps. This approach achieved a detection accuracy of 95.6% and an inference speed of 60.3 frames per second (FPS), effectively addressing the high false alarm rates, low accuracy, and slow detection speeds commonly found in traditional methods. To address UAV detection challenges such as small target size, stealth materials, and complex environments, Zhai et al. [15] introduced the YOLO-Drone network, in which a high-resolution detection head complements the multi-scale feature fusion mechanism of SPD-Conv by enhancing spatial detail capture. This design improves precision, recall, and average precision to 111.9%, 115.2%, and 109.0% of the baseline model, respectively. The work in [16] optimized the YOLOv4 architecture through channel pruning and a reduction in shortcut layers, resulting in a 60.4% increase in processing speed while maintaining 90.5% mAP, allowing efficient real-time detection of fast-moving small UAVs. Zamri et al. [17] proposed the P2-YOLOv8n-ResCBAM model by integrating multiple attention mechanisms into the YOLOv8n framework and adding a high-resolution detection head, enabling effective UAV detection and bird differentiation under long-distance imaging conditions. Zhao et al. [18] presented the TGC-YOLOv5 model to address the challenges of large scale variations and potential information loss in detecting small UAV targets in complex environments. The model combined Transformer modules with GAM and CA attention mechanisms to achieve fast and accurate detection; however, the number of model parameters increased nearly twofold compared to YOLOv8s, indicating a need for further optimization of model complexity and computational efficiency. Although the aforementioned methods have proven effective in improving UAV detection performance, they still face limitations in extracting sufficient discriminative features from small targets. Future research should focus on enhancing feature representation and reducing redundancy to improve both accuracy and efficiency.
With significant advancements in target detection algorithms, most modern tracking approaches adopt a "detect-then-track" paradigm. This methodology enables efficient, accurate, and stable tracking of targets within video sequences by first identifying the UAV's location and then performing continuous tracking to capture its motion status. Addressing the dual challenges of stringent urban UAV regulations and the economic inefficiency of legacy monitoring systems, Hong et al. [19] designed a 5G-integrated deep learning architecture for real-time detection and tracking of low-altitude drones. By optimizing the YOLOv4 and DeepSORT models, they achieved multi-target tracking with 94.4% accuracy at a speed of 69.0 FPS, and the system's practicality was demonstrated through deployment on a ZCU104 hardware platform. Gandhi et al. [20] introduced a detection and tracking system based on YOLOv3 and DeepSORT. They constructed a custom dataset containing both standard and bird-shaped UAVs and applied transfer learning for model training. This approach enabled real-time detection (mAP@0.5 of 97.0%) and trajectory tracking for UAVs of various shapes. Ghazlane et al. [21] addressed the lack of comprehensive tracking solutions for diverse aerial targets such as birds and airplanes, along with the imbalance between detection speed and accuracy. Their method combined the YOLOv7 detector with an enhanced DeepSORT tracker, achieving cross-frame target association through appearance feature extraction and motion trajectory prediction. This effectively mitigated target loss under occlusion and fast motion, achieving high detection accuracy (mAP@0.5 of 0.982) and real-time performance (42.8 FPS). However, the increased complexity of feature extraction led to higher computational costs, indicating potential efficiency bottlenecks. Delleji et al. [22] presented a modified YOLOv7 tracking framework to overcome limited feature extraction capability for small targets, low tracking accuracy, and vulnerability to occlusion in complex scenarios. Xi et al. [23] addressed the high tracking error and high false-negative rates of existing small-UAV detection and tracking algorithms. By incorporating image enhancement preprocessing and ROI restriction, they introduced a detection and tracking algorithm integrating YOLOv5 and DeepSORT, which effectively resolved the poor detectability and unstable tracking of small targets in complex scenarios, yielding real-time and high-precision performance. Despite these advances, current research still falls short in tackling missed or false detections of small UAVs against complex backgrounds such as skies and forests, issues stemming from limited pixel information and weak feature representation. Further exploration of lightweight feature enhancement techniques or multimodal fusion strategies is needed to improve robustness in challenging scenarios.
In summary, low-altitude UAV detection and tracking technology holds significant research value in both military security and civilian monitoring applications. However, several core challenges—such as the accurate characterization of small targets, distractions caused by intricate surroundings, and correlation modeling in dynamic scenes—still require further investigation. To tackle these problems effectively, this study presents a joint detection and tracking algorithm utilizing an improved YOLOv8 framework integrated with DeepSORT. The proposed method enhances small-target feature representation by integrating modified depthwise separable convolution and edge-sensitive feature enhancement modules. Additionally, a contextual anchor attention mechanism is introduced to improve cross-frame target association. The effectiveness of the algorithm is validated using a self-constructed dataset encompassing complex real-world scenarios. The principal contributions of this work are detailed as follows:
(1)
The C2f-DSFEM module has been developed to handle the complexities introduced by the indistinctness of edge features and the inadequacy of multi-scale feature integration for diminutive UAV targets. This module has been designed to achieve bidirectional enhancement of edge details and efficient semantic features by integrating Sobel convolution and depth-separable convolution across multiple layers. This approach overcomes the limitations of the conventional single convolution module, which exhibits insufficient capability in capturing features of diminutive targets in complex backgrounds.
(2)
The existing attention mechanisms (e.g., SE, CBAM) have been criticized for their inability to focus sufficiently on the target area in lightweight scenarios. In response to this criticism, an improved CAA mechanism has been proposed. This mechanism achieves adaptive suppression of background interference while maintaining a low computational cost through a position-aware weight allocation strategy. Furthermore, it addresses the problem of balancing the accuracy and efficiency of small target localization in traditional attention mechanisms.
(3)
The synergistic framework of 'Feature Enhancement–Attention Focus–Loss Optimization' combines Focal Loss with the enhanced detector and the DeepSORT tracking algorithm. This combination is intended to solve the tracking drift and frequent ID switching caused by sample imbalance in highly dynamic scenes, and it yields a significant improvement in tracking stability compared with existing YOLO-series + DeepSORT solutions.
The remainder of this paper is organized as follows. Section 2 outlines the detection and tracking algorithms used in this study, together with the enhancement scheme implemented to address their limitations. Section 3 carries out UAV detection and tracking performance tests to validate the proposed YOLOv8N-Drone algorithm and presents the results and analyses. Finally, the main conclusions are summarized in Section 4.

2. Models and Methods

2.1. Modeling Framework

In UAV visual detection tasks, target objects often exhibit small-scale and low-resolution characteristics, and are further challenged by complex illumination conditions and dynamic occlusions. These factors result in insufficient feature representation and limited scale robustness in traditional object detection models, making accurate and stable UAV detection particularly difficult.
To address these challenges, this section proposes a UAV target detection and tracking method based on an improved YOLOv8 architecture. The model’s adaptability to small targets and complex environments is enhanced through three key improvements: (1) the design of a feature enhancement module to strengthen discriminative feature extraction, (2) the introduction of a contextual attention mechanism to improve feature alignment and localization accuracy, and (3) optimization of the training process using the Focal Loss function to alleviate class imbalance and enhance learning on hard-to-detect samples. Furthermore, an integrated detection-and-tracking framework is constructed by combining the improved detector with the DeepSORT algorithm, enabling long-term continuous tracking of UAV targets. The comprehensive structure of the improved YOLOv8N-Drone system is shown in Figure 1.

2.2. Depth Separable and Edge-Sensitive Feature Enhancement Module

The C2f (Cross-Stage-Partial) module employs a cross-stage partial linking technique to integrate features from disparate levels, thereby enhancing detection across multiple scales. However, in UAV detection scenarios the generic design of the C2f module proves ineffective at extracting the features of small targets, resulting in instances of misdetection and omission [24]. To mitigate these limitations, this paper proposes a lightweight Depthwise-Separable and Sobel Feature Enhancement Module (C2f-DSFEM). The proposed module leverages the respective advantages of Sobel convolution [25] for edge-sensitive feature extraction and depthwise-separable convolution [26] for efficient spatial filtering. By processing and fusing input features from complementary perspectives, the C2f-DSFEM module generates more expressive and information-rich feature representations, which benefit subsequent UAV detection and tracking tasks. The module's key concept is to integrate features extracted by different convolutional mechanisms, thus boosting the model's capacity to capture both structural and textural details in input images. This modification greatly enhances the feature representation ability of the architecture, especially for small and low-resolution targets in challenging environments.
Let the input feature map be $X \in \mathbb{R}^{H \times W \times C_{in}}$ and the output feature map be $Y \in \mathbb{R}^{H \times W \times C_{out}}$.
In the Sobel convolution branch, each channel of the input feature map is convolved with the horizontal and vertical Sobel kernels $\mathrm{filter}_x$ and $\mathrm{filter}_y$, respectively. For each channel $c$:
$$X_{c,x} = X_c \ast \mathrm{filter}_x$$
$$X_{c,y} = X_c \ast \mathrm{filter}_y$$
The Sobel response is then computed for each channel as $X_{c,\mathrm{Sobel}} = \sqrt{(X_{c,x})^2 + (X_{c,y})^2}$, and the per-channel results are combined to extract the edge and texture characteristics of the image, giving the Sobel branch output $X_{\mathrm{Sobel}} \in \mathbb{R}^{H \times W \times C_{in}}$.
At the same time, local and cross-channel features are extracted by depthwise convolution and pointwise convolution to reduce the computational effort. The Depthwise-Separable Convolution (DSC) branch breaks standard convolution down into these two steps. Depthwise convolution applies a separate convolution to each channel of the input feature map using a kernel $K_{DW} \in \mathbb{R}^{k \times k \times 1 \times C_{in}}$ (where $k$ denotes the kernel size), with the output given per channel as:
$$X_{DW,c} = \mathrm{DepthwiseConv}(X)_c = X_c \ast K_{DW,c}, \quad c = 1, \dots, C_{in}$$
where $X_c$ denotes the $c$-th channel of the input feature map and $X_{DW} \in \mathbb{R}^{H \times W \times C_{in}}$.
Pointwise convolution then uses a $1 \times 1$ convolution kernel $K_{PW} \in \mathbb{R}^{1 \times 1 \times C_{in} \times C_{in}}$ to linearly combine the outputs of the depthwise convolution, fusing feature information from different channels:
$$X_{Sep} = \mathrm{PointwiseConv}(X_{DW}) = X_{DW} \ast K_{PW}$$
The final output is then obtained through concatenation, convolution, and a residual connection:
$$Y = K_2 \ast \big( K_1 \ast \mathrm{Concat}(X_{Sobel}, X_{Sep}) + X \big)$$
where the convolution kernels are $K_1 \in \mathbb{R}^{1 \times 1 \times 2C_{in} \times C_{in}}$ and $K_2 \in \mathbb{R}^{1 \times 1 \times C_{in} \times C_{out}}$, and $X$ is the original input.
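To make this fusion concrete, the following PyTorch sketch implements one plausible reading of the equations above: a fixed per-channel Sobel branch, a depthwise-separable branch, and the residual fusion $Y = K_2(K_1\,\mathrm{Concat}(X_{\mathrm{Sobel}}, X_{\mathrm{Sep}}) + X)$. It is a minimal illustration rather than the authors' released code; the layer names, the ReLU placement, and the use of stride 1 in both branches (so the concatenation is well defined) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SobelDSFusion(nn.Module):
    """Sketch of the DSFEM fusion: Sobel branch + depthwise-separable branch,
    concatenated and fused with a residual connection, Y = K2(K1*Concat(.) + X)."""

    def __init__(self, c_in, c_out):
        super().__init__()
        # Fixed Sobel kernels applied per channel (grouped conv, not learned).
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gy = gx.t()
        self.register_buffer("kx", gx.expand(c_in, 1, 3, 3).clone())
        self.register_buffer("ky", gy.expand(c_in, 1, 3, 3).clone())
        # Depthwise-separable branch: per-channel 3x3 conv, then 1x1 pointwise conv.
        self.dw = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_in, 1, bias=False)
        # K1: 1x1 conv over the concatenated branches (2*c_in -> c_in); K2: 1x1 conv to c_out.
        self.k1 = nn.Conv2d(2 * c_in, c_in, 1, bias=False)
        self.k2 = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        c = x.shape[1]
        # Sobel branch: per-channel gradient magnitude (edge/texture response).
        ex = F.conv2d(x, self.kx, padding=1, groups=c)
        ey = F.conv2d(x, self.ky, padding=1, groups=c)
        x_sobel = torch.sqrt(ex ** 2 + ey ** 2 + 1e-6)
        # Depthwise-separable branch: spatial filtering, then cross-channel mixing.
        x_sep = self.pw(self.dw(x))
        # Fuse both views, add the residual input, and project to c_out channels.
        return self.k2(self.act(self.k1(torch.cat([x_sobel, x_sep], dim=1)) + x))


# Example: a 128x128 feature map with 64 channels.
if __name__ == "__main__":
    m = SobelDSFusion(64, 64)
    print(m(torch.randn(1, 64, 128, 128)).shape)  # torch.Size([1, 64, 128, 128])
```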
C2f-DSFEM introduces a Sobel convolutional branch on top of MobileNet's DSC: specific convolution kernels are convolved with the input feature maps along the channel dimension to extract edge and texture features, which are then fused with the output of the depthwise-separable branch. This retains the reduced computational cost of DSC while enhancing edge and texture information. In UAV detection scenarios, this is effective for small targets and for conditions in which motion blur, complex environments, and varying light make detection difficult, improving the extraction of small-target features and thereby the overall detection efficacy. In comparison with conventional edge enhancement modules, C2f-DSFEM integrates the Sobel convolution output with the depthwise-separable convolution output, preserving and enhancing useful information through concatenation, convolution, and residual connection. This augments the model's capability to handle complex backgrounds and small targets, boosting target detection precision and robustness.
The proposed module performs multi-faceted feature extraction and fusion on the input feature maps by integrating Sobel convolution and depthwise-separable convolution. To preserve the original feature information and suppress feature degradation, a residual concatenation strategy is employed. As a result, the module generates more expressive and information-rich feature representations, which significantly enhance the performance of the target detection model. The structure of the C2f-DSFEM module is depicted in Figure 2.
The C2f-DSFEM module is organized around an overall C2f structure and an S-Bottleneck sub-module, adapting to UAV detection requirements through a multi-branch feature processing mechanism. A feature map with $C_{in}$ channels is first compressed to $C_{out}$ channels by a 1 × 1 convolution (k = 1, s = 1, p = 0), which reduces computation and performs an initial integration of cross-channel information. The $C_{out}$ channels are then divided by a Split operation into two paths (each holding 0.5$C_{out}$ channels): one path is shorted directly to retain the original base features, while the other passes through a branch composed of n S-Bottleneck blocks in series.
Within the S-Bottleneck sub-module, the input features are processed in parallel by two branches: the Sobel Conv branch approximates the Sobel operator with a 3 × 3 convolution (k = 3, s = 1, p = 1) to extract edge and texture features, highlighting the UAV contour and coping with small targets and motion-blurred scenes; the DS Conv branch applies a 3 × 3 depthwise-separable convolution (k = 3, s = 2, p = 1), first extracting spatial local features with DW Conv and then fusing cross-channel information with PW Conv (1 × 1 pointwise convolution), which retains the key semantics while keeping the computation lightweight.
After the two branch outputs are concatenated by Concat, they are further integrated by a 3 × 3 convolution (Conv1) and then fused with the original input through an Add operation (shortcut = True enables the residual link), which retains the base information, mitigates gradient vanishing, and strengthens feature expression; the result is finally passed through a 3 × 3 convolution (Conv2) and ReLU activation. After the n S-Bottleneck blocks, the directly connected branch features are concatenated with the S-Bottleneck outputs (the channel count becomes 0.5(n + 2)$C_{out}$) and reduced back to $C_{out}$ channels by a 1 × 1 convolution. The ultimate output is a refined feature that fuses edge details with compact semantics, enabling the model to recognize UAV targets accurately in intricate scenarios and improving small-target detection performance.
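Under the same assumptions, the split/stack/concatenate structure described above could be wrapped as follows; the bottleneck count n, the even 0.5·$C_{out}$ split, and the reuse of the SobelDSFusion block from the previous sketch as the S-Bottleneck body are illustrative choices, not the authors' implementation.

```python
# (Relies on torch / torch.nn and the SobelDSFusion class from the previous sketch.)
class C2fDSFEM(nn.Module):
    """Sketch of the C2f-DSFEM wrapper: 1x1 compression, channel split,
    n S-Bottleneck blocks on one path, concatenation of all intermediate
    outputs, and a final 1x1 projection back to c_out channels."""

    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        c_mid = c_out // 2                      # each split path holds 0.5 * c_out channels
        self.cv1 = nn.Conv2d(c_in, c_out, 1, bias=False)
        # n S-Bottleneck blocks; here the SobelDSFusion sketch plays that role.
        self.blocks = nn.ModuleList(SobelDSFusion(c_mid, c_mid) for _ in range(n))
        # After concatenation the channel count is 0.5 * (n + 2) * c_out.
        self.cv2 = nn.Conv2d((n + 2) * c_mid, c_out, 1, bias=False)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # shortcut path + processed path
        for blk in self.blocks:
            y.append(blk(y[-1]))                # each block refines the previous output
        return self.cv2(torch.cat(y, dim=1))    # fuse shortcut + all bottleneck outputs
```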

2.3. Context Anchor Attention Mechanism Module

The Neck module in YOLOv8 adopts the PANet architecture, integrating a Feature Pyramid Network (FPN) and a Path Aggregation Network (PAN). By performing upsampling and channel concatenation, this structure enables the fusion of top-down and bottom-up feature flows through three output branches, which are then passed to the detection head. This design facilitates effective integration of multi-scale feature information, thereby addressing the challenge posed by large variations in target scales. However, the C2f Bottleneck module adopts a residual-free connection with a conventional stacked convolutional structure. This design restricts the receptive field and impairs the model’s capacity to tell apart target and background features effectively, especially in complex scenes with cluttered backgrounds. In particular, for small UAV targets, this structural limitation prevents the model from accurately capturing detailed target features. As a result, the background is often misclassified as the target, leading to increased confusion between foreground and background, which severely degrades both detection precision and localization accuracy [27].
To tackle the two key shortcomings inherent in the conventional Bottleneck architecture—specifically, its restricted receptive field and suboptimal feature extraction efficiency—it is essential to introduce an optimization strategy that enhances long-range feature dependencies without significantly increasing computational cost. In this context, we incorporate the Context Anchor Attention (CAA) mechanism [28], which focuses on establishing long-range pixel-wise relationships across the feature map. By emphasizing central feature representation through an efficient strip-based depthwise convolution design, CAA achieves improved feature modeling while maintaining computational efficiency. This ensures that the model retains both a lightweight structure and high detection performance.
The CAA module can be distinguished from attention mechanisms such as the Squeeze-and-Excitation (SE) module and the Convolutional Block Attention Module (CBAM). The SE module is well suited to scenarios that require enhancing channel features; while effective for general target detection, its usefulness is limited when detecting small targets against complex backgrounds. The CBAM module enhances features more comprehensively, yet its computational demands make it ill-suited to resource-constrained scenarios. The CAA module, by contrast, is designed expressly for target detection tasks, with particular proficiency in identifying small targets within intricate background contexts. Its lightweight design enhances features while maintaining computational efficiency, making it suitable for resource-limited settings and improving the feasibility and effectiveness of the model in practical applications.
Additionally, the CAA module reinforces central features through the establishment of a relational network between remote pixels, thereby compensating for the limitations of existing methods in long-distance dependency modelling. The structure of the CAA module is depicted in Figure 3.
Local area features are first obtained by average pooling and 1 × 1 convolution:
$$F_{l-1,n}^{\mathrm{pool}} = \mathrm{Conv}_{1 \times 1}\big( P_{\mathrm{avg}}( X_{l-1,n}^{(2)} ) \big), \quad n = 0, \dots, N_{l-1}$$
where $P_{\mathrm{avg}}$ denotes the average pooling operation; when $n = 0$, $X_{l-1,n}^{(2)} = X_{l-1}^{(2)}$.
In the following step, two depth-wise separable strip convolutions are utilized in order to approximate the standard large-kernel depth-wise separable convolutions. The specific approach employed is outlined below:
$$F_{l-1,n}^{w} = \mathrm{DWConv}_{1 \times k_b}\big( F_{l-1,n}^{\mathrm{pool}} \big)$$
$$F_{l-1,n}^{h} = \mathrm{DWConv}_{k_b \times 1}\big( F_{l-1,n}^{w} \big)$$
where $k_b = 11 + 2l$, i.e., the convolution kernel size increases with the depth of the block.
Finally, the CAA module generates an attention weight $A_{l-1,n} \in \mathbb{R}^{\frac{1}{2} C_l \times H_l \times W_l}$, which is used to enhance the output of the block module as follows:
$$A_{l-1,n} = \mathrm{Sigmoid}\big( \mathrm{Conv}_{1 \times 1}( F_{l-1,n}^{h} ) \big)$$
$$F_{l-1,n}^{\mathrm{attn}} = ( A_{l-1,n} \odot P_{l-1,n} ) \oplus P_{l-1,n}$$
In this context, ⊙ denotes element-wise multiplication, while ⊕ represents element-wise addition.
The employment of the CAA module facilitates the dynamic allocation of weights to local channels, thereby suppressing background noise and accentuating the salient local regions of small targets. This approach compensates for the absence of nuanced information that was previously overlooked by the original neck network during the feature fusion stage. Furthermore, the model’s lightweight characteristics enhance its overall expressiveness, while simultaneously avoiding the non-essential wastage of computational resources. This ensures the model’s high feasibility and effectiveness in practical application scenarios.
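A lightweight PyTorch sketch of the CAA computation described by the equations above is given below: average pooling, a 1 × 1 convolution, two strip depthwise convolutions, a sigmoid attention map, and the multiply-then-add fusion $(A \odot P) \oplus P$. The kernel-size schedule $k_b = 11 + 2l$ follows the text; the 7 × 7 pooling window and the absence of normalization layers are assumptions.

```python
import torch
import torch.nn as nn


class CAA(nn.Module):
    """Sketch of Context Anchor Attention: pooled local context -> 1x1 conv ->
    horizontal and vertical strip depthwise convolutions -> sigmoid weights,
    which modulate the block output P as (A ⊙ P) ⊕ P."""

    def __init__(self, channels, block_depth=0):
        super().__init__()
        kb = 11 + 2 * block_depth                      # kernel size grows with block depth (k_b = 11 + 2l)
        self.pool = nn.AvgPool2d(7, stride=1, padding=3)
        self.conv1 = nn.Conv2d(channels, channels, 1)
        # Two depthwise strip convolutions approximate a large kb x kb depthwise kernel.
        self.dw_h = nn.Conv2d(channels, channels, (1, kb), padding=(0, kb // 2), groups=channels)
        self.dw_v = nn.Conv2d(channels, channels, (kb, 1), padding=(kb // 2, 0), groups=channels)
        self.conv2 = nn.Conv2d(channels, channels, 1)

    def forward(self, p):
        f = self.conv1(self.pool(p))                   # local context features
        f = self.dw_v(self.dw_h(f))                    # long-range strip context
        a = torch.sigmoid(self.conv2(f))               # attention weights A
        return a * p + p                               # (A ⊙ P) ⊕ P


# Example usage on a Neck feature map.
if __name__ == "__main__":
    x = torch.randn(1, 128, 40, 40)
    print(CAA(128, block_depth=1)(x).shape)            # torch.Size([1, 128, 40, 40])
```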

2.4. Loss Function Improvement Study

The loss function in YOLOv8 primarily comprises two components: classification loss and localization loss. For the classification branch, YOLOv8 adopts the VariFocal Loss (VFL) formulation, while the regression branch combines Complete Intersection over Union (CIoU) loss and Distribution Focal Loss (DFL). The VariFocal Loss is formulated as follows:
$$\mathrm{VFL}(p, q) = \begin{cases} -q \big( q \log(p) + (1 - q) \log(1 - p) \big), & q > 0 \\ -\alpha p^{\gamma} \log(1 - p), & q = 0 \end{cases}$$
where $p$ denotes the predicted class score, $p \in [0, 1]$; $q$ denotes the target score ($q$ equals the predicted-to-ground-truth IoU for the true class and 0 for other classes); and $\alpha$ and $\gamma$ are the hyperparameters of the scaling focal term. VFL weights positive and negative samples asymmetrically, down-weighting only the negative samples so that foreground and background contribute to the loss in a balanced way.
In this paper, the Focal Loss function [29] is adopted to replace the conventional classification loss. Focal Loss is specifically designed to tackle sample imbalance in target detection tasks, including both positive-negative and easy-hard sample disparities. Traditional cross-entropy loss treats all samples equally, which may result in suboptimal model performance when positive and negative samples are significantly imbalanced in quantity: the model becomes overwhelmed by the prevalence of easy negative instances, hindering effective learning from scarce positive and challenging samples. Focal Loss alleviates this problem by incorporating a modulation factor that lessens the weight given to easily classified samples. This adjustment ensures that the model focuses more on hard samples and positive samples during training, enhancing its capability to recognize various types of samples. Consequently, Focal Loss improves the model's robustness and generalization ability, particularly for small or challenging targets where traditional loss functions may fall short. The mathematical formulation of Focal Loss is:
$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
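For reference, a standard binary-form implementation of this loss in PyTorch is shown below; the default values α = 0.25 and γ = 2.0 follow the original Focal Loss paper and are not taken from this article's experiments.

```python
import torch
import torch.nn.functional as F


def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits:  raw class predictions (any shape)
    targets: binary labels of the same shape (1 = positive, 0 = negative)
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # = -log(p_t)
    p_t = p * targets + (1 - p) * (1 - targets)          # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()    # down-weights easy samples


# Example: a batch of 8 predictions, mostly easy negatives plus one hard positive.
if __name__ == "__main__":
    logits = torch.tensor([-4.0, -5.0, -6.0, -3.0, -4.5, -5.5, -4.2, 0.1])
    labels = torch.tensor([0., 0., 0., 0., 0., 0., 0., 1.])
    print(focal_loss(logits, labels))
```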
The relationship between Focal Loss and predicted probability for different values of α and γ is shown in Figure 4:

2.5. Target Tracking Algorithm

After the target detection stage is completed, the tracking algorithm must further perform cross-frame target association. Considering the high dynamics and frequent occlusions typically encountered in UAV scenarios, this paper employs the DeepSORT algorithm [30] to enhance tracking performance. DeepSORT’s core strength is improving tracking stability in complex environments by fusing motion information with appearance features.
The execution pipeline of the DeepSORT algorithm can be conceptually divided into three logically coherent components that form an integrated framework through data-driven interactions, as follows:
The first is the trajectory prediction and dynamic estimation based on Kalman filter. This stage utilizes the Kalman filter to predict the trajectory of targets in the current frame. By employing a linear state transition model, the algorithm estimates the motion state of each target and subsequently refines the prediction using observed measurement data, thereby achieving accurate dynamic estimation of the target trajectories. The state prediction and update equations of the Kalman filter are formulated as follows:
Prediction step:
$$\hat{x}_k^- = F \hat{x}_{k-1}, \qquad P_k^- = F P_{k-1} F^{T} + Q$$
Update step:
$$y_k = z_k - H \hat{x}_k^-, \qquad K_k = P_k^- H^{T} \big( H P_k^- H^{T} + R \big)^{-1}$$
$$\hat{x}_k = \hat{x}_k^- + K_k y_k, \qquad P_k = ( I - K_k H ) P_k^-$$
where $F$ is the state transition matrix, $H$ is the observation matrix, $R$ is the measurement noise covariance, $P$ is the state estimate covariance, and $Q$ is the process noise covariance.
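The two steps can be written compactly in NumPy, as below. The constant-velocity state [x, y, vx, vy] is a simplification for illustration (DeepSORT tracks a higher-dimensional bounding-box state), and the noise covariances Q and R are placeholder values.

```python
import numpy as np


def kalman_predict(x, P, F, Q):
    """Prediction step: propagate state and covariance through the motion model."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred


def kalman_update(x_pred, P_pred, z, H, R):
    """Update step: correct the prediction with the measurement z."""
    y = z - H @ x_pred                                   # innovation
    S = H @ P_pred @ H.T + R                             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)                  # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(P_pred.shape[0]) - K @ H) @ P_pred
    return x_new, P_new


# Toy constant-velocity example: state [x, y, vx, vy], measurement [x, y].
if __name__ == "__main__":
    dt = 1.0
    F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
    Q, R = np.eye(4) * 1e-2, np.eye(2) * 1e-1
    x, P = np.zeros(4), np.eye(4)
    x, P = kalman_predict(x, P, F, Q)
    x, P = kalman_update(x, P, np.array([1.2, 0.9]), H, R)
    print(np.round(x, 3))
```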
Then, there is a dual matching mechanism that fuses motion and appearance information. To effectively associate the detection bounding box with the tracking target, the motion matching strategy utilizes the Mahalanobis distance to measure the motion consistency between the detection bounding box and the predicted state. This approach helps mitigate mismatches caused by sudden accelerations of the target. The formula for the Mahalanobis distance is as follows:
$$d^{(1)}(i, j) = ( z_j - H \hat{x}_i )^{T} S_i^{-1} ( z_j - H \hat{x}_i )$$
where $S_i = H P_i H^{T} + R$ is the residual (innovation) covariance matrix; normalizing by this covariance suppresses noise interference.
In the appearance matching strategy, the pre-trained ReID network is used to extract the target appearance features, and the similarity is measured by the cosine distance, which is robust to light changes and partial occlusion and can complement the lack of motion information. The cosine distance formula is as follows:
$$d^{(2)}(i, j) = 1 - \frac{ f_j^{T} f_i }{ \lVert f_j \rVert \, \lVert f_i \rVert }$$
The above two matching strategies establish the target association constraints from the dimensions of kinematic and visual features, respectively, to provide multimodal data support for cross-frame matching.
Finally, the optimal matching solution is based on the cost matrix. The algorithm fuses the Mahalanobis distance with the cosine distance to construct the cost matrix, and solves the optimal matching by the Hungarian algorithm, which is given by:
$$c_{i,j} = \lambda \, d^{(1)}(i, j) + ( 1 - \lambda ) \, d^{(2)}(i, j)$$
Here, $\lambda$ is a weight parameter that balances the contribution of motion matching and appearance matching to the overall cost, enabling the algorithm to adaptively adjust its matching strategy in complex scenes and achieve stable cross-frame target association.
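A sketch of this fusion and assignment step is shown below; the matrices `mahalanobis_cost` and `cosine_cost` are assumed to be precomputed from the formulas above, the gating threshold is a placeholder, and SciPy's `linear_sum_assignment` stands in for the Hungarian solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def associate(mahalanobis_cost, cosine_cost, lam=0.5, gate=1e5):
    """Fuse motion and appearance costs (c = lam*d1 + (1-lam)*d2) and solve the
    assignment between existing tracks (rows) and new detections (columns)."""
    cost = lam * mahalanobis_cost + (1.0 - lam) * cosine_cost
    rows, cols = linear_sum_assignment(cost)             # Hungarian algorithm
    # Discard pairings whose fused cost exceeds the gating threshold.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < gate]


# Example: 3 tracks vs 2 detections with toy cost matrices.
if __name__ == "__main__":
    d1 = np.array([[0.2, 3.1], [2.8, 0.4], [5.0, 4.7]])   # motion (Mahalanobis) costs
    d2 = np.array([[0.1, 0.9], [0.8, 0.2], [0.9, 0.8]])   # appearance (cosine) costs
    print(associate(d1, d2, lam=0.6, gate=2.0))           # [(0, 0), (1, 1)]
```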
Through the “trajectory prediction–dual matching–optimization solution” strategy, an effective integration is achieved among Kalman filter dynamic modeling, dual constraints from both motion and appearance features, and precise association through optimization algorithms. This approach successfully addresses the tracking challenges caused by the high dynamics and frequent occlusions of targets in UAV scenarios, thereby laying a solid foundation for subsequent target behavior analysis.

3. Experiment and Result Analysis

3.1. Experimental Basis

3.1.1. Experimental Condition

In order to rigorously validate the performance of the target detection model and guarantee the reproducibility of the experiments, this study adopts a standardized experimental environment: the Windows 10 Professional 64-bit operating system (Microsoft) is employed, alongside an Intel(R) Core(TM) i5-10600KF processor (Intel Corporation, Santa Clara, CA, USA). The graphics processing unit (GPU) is an NVIDIA GeForce GTX 1050 Ti (Colorful, Shenzhen, China) with 4 GB of GDDR5 memory, supporting CUDA 11.8. The experiments were executed in a Python 3.9 environment, with model construction and training carried out in the PyTorch 2.0.0 deep learning framework. The experiment-related parameters are set as follows: the number of training epochs is 200, at which point the training curve has converged; the number of data-loading workers is 16, which fully utilizes the CPU's computational resources so that data loading does not lag behind GPU computation; the initial learning rate is 0.00001, which is better suited to low-resolution, high-interference targets and ensures stable feature learning; the optimizer is SGD, which balances lightweight training and generalization; the input image size is fixed at 128 × 128, determined as the optimal setting given the characteristics of drone targets and the hardware limitations; and the random seed is set to 1 to eliminate the interference of random factors.
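Under these settings, a training run could be launched roughly as in the sketch below; it assumes the Ultralytics training interface, and the model and dataset configuration files (`yolov8n-drone.yaml`, `uav.yaml`) are hypothetical placeholders, as is any registration of the custom C2f-DSFEM and CAA modules.

```python
from ultralytics import YOLO

# Hypothetical model/data configs; the improved modules (C2f-DSFEM, CAA) would be
# registered in the model YAML, which is not reproduced here.
model = YOLO("yolov8n-drone.yaml")

model.train(
    data="uav.yaml",        # dataset description (paths + class names)
    epochs=200,             # training converges by 200 epochs per the paper
    imgsz=128,              # input resolution used in the experiments
    workers=16,             # data-loading workers
    optimizer="SGD",
    lr0=1e-5,               # low initial learning rate for low-resolution, high-interference targets
    seed=1,                 # fixed seed to remove random variation
)
```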

3.1.2. Dataset Construction

In practical UAV detection scenarios, dataset quality exerts a direct impact on algorithmic accuracy. Currently, several publicly available UAV datasets exist, such as DUT-Anti-UAV [31], MIDGARD [32], Real World [33], Det-Fly [34], and USC-Drone [35], which are characterized by their large scale and rich content diversity. In this paper, considering the specific application scenario of detecting and tracking small UAV targets, we construct a specialized UAV dataset. This dataset takes into account various UAV attitudes and sizes caused by different motion states, diverse camera shooting angles, and the influence of complex environmental conditions. The original data are primarily sourced from the aforementioned public datasets. In addition to these publicly available resources, a small portion of the dataset is composed of self-collected UAV images, along with another portion derived from self-recorded video sequences. The drone model used is the DJI Mini 2, with flight heights ranging from 10 m to 100 m. The recording locations include both campus and off-campus environments. A schematic diagram of the dataset section is shown in Figure 5.
The original dataset in this paper was collected using the aforementioned method. All videos had a resolution of 1920 × 1080 and a frame rate of 30 fps, with a total recording duration of 4 h and 50 min of drone footage. The videos were sliced into individual images; to avoid consecutive frames in which the drone's position and appearance change only slightly, one image was extracted every 20 frames, and a total of 20,262 drone images were collected after screening. Scale changes, light intensity, and related factors were fully considered during shooting. Subsequently, a conventional data augmentation strategy was applied to the acquired images to construct the low-altitude UAV dataset used in this paper, so that it meets the research needs of this study.
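The frame-slicing step (one image every 20 frames) can be reproduced with a short OpenCV script such as the sketch below; the input and output paths are placeholders.

```python
import cv2
from pathlib import Path


def slice_video(video_path, out_dir, step=20):
    """Save one frame every `step` frames so that consecutive samples differ
    noticeably in drone position and appearance."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(str(out / f"{Path(video_path).stem}_{idx:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved


# Example (placeholder paths):
# slice_video("videos/uav_clip.mp4", "dataset/images/raw", step=20)
```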

3.2. Algorithm Detection Performance Validation

3.2.1. Evaluation Metrics

Four commonly used evaluation metrics, including Precision (P), Recall (R), mean Average Precision (mAP) and FPS, are adopted in this study. These metrics are defined as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$AP = \int_{0}^{1} P(R)\, dR$$
$$mAP = \frac{1}{N} \sum_{n=1}^{N} AP(n)$$
Here, TP, FP, and FN stand for correctly detected, falsely detected, and missed targets, respectively. Recall is the proportion of correctly detected targets (TP) relative to all objects that ought to be detected (TP + FN); precision is the ratio of correctly detected targets (TP) to the total number of detected objects (TP + FP). P(R) denotes the precision-recall curve, with each IoU threshold yielding its own P(R) curve. N denotes the number of categories, and mAP is the mean of the AP values across all categories.
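As a concrete reference for how these quantities are derived from raw detection counts, the snippet below computes precision, recall, and an all-point AP for a single class; it is a generic illustration, not the evaluation code used in the experiments.

```python
import numpy as np


def average_precision(scores, is_tp, num_gt):
    """All-point AP = area under the precision-recall curve.

    scores: detection confidences; is_tp: 1 if the detection matched a ground truth;
    num_gt: total number of ground-truth objects (TP + FN).
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.cumsum(np.asarray(is_tp, dtype=float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=float)[order])
    recall = tp / num_gt                       # TP / (TP + FN)
    precision = tp / (tp + fp)                 # TP / (TP + FP)
    # Integrate precision over recall (the P(R) curve).
    return float(np.trapz(precision, recall))


# Example: 5 detections evaluated against 3 ground-truth UAVs.
if __name__ == "__main__":
    print(average_precision(scores=[0.9, 0.8, 0.7, 0.6, 0.5],
                            is_tp=[1, 1, 0, 1, 0], num_gt=3))
```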

3.2.2. Detection of Performance Ablation Experiments

The primary objective of this subsection is to quantitatively evaluate the performance improvements contributed by the proposed modules. To this end, a series of ablation tests is conducted to systematically assess the contribution of each component to the overall detection capability. Using YOLOv8n as the baseline framework, the experimental design incrementally integrates the optimized modules and examines their individual effects. First, the C2f-DSFEM module is introduced to enhance sensitivity to edge pixels during feature extraction. Empirical analysis reveals that the module integrates hierarchical spatial features, enabling simultaneous modeling of fine-grained details and global context within feature representations, while reducing computational cost through depthwise-separable convolution. Next, the CAA attention mechanism is integrated, which dynamically modulates the importance of different spatial regions within the feature maps, enhancing the model's capacity to prioritize target areas while attenuating contextual interference from background regions. Finally, the original classification loss is replaced with Focal Loss to optimize the training process and further improve detection accuracy. Uniform experimental configurations are maintained throughout to ensure a fair evaluation of individual module contributions. Table 1 presents the findings (where √ indicates an added module), and Figure 6 offers a schematic visualization of the performance improvements.
Following the integration of the C2f-DSFEM module into Model A, the mAP@0.5 increased from 86.9% to 93.0%, and the mAP@0.5:0.95 improved from 63.7% to 75.1%. The experimental data reveal that the module substantially improves the model's ability to discern the edge pixels of UAV targets, demonstrating its efficacy in capturing both local and global feature information; consequently, the model attains higher target recognition accuracy and a substantial improvement in detection performance. With respect to computational efficiency, the parameter count is reduced from 3.2 × 10^6 to 3.0 × 10^6, and the computational cost decreases from 8.9 GFLOPs to 8.2 GFLOPs. Despite a decline in frame rate from 76.0 frames·s−1 to 50.0 frames·s−1, the model remains compliant with the stipulated real-time detection criteria, illustrating an appropriate compromise between accuracy and efficiency.
After incorporating the CAA attention mechanism into Model B, the mAP@0.5 reaches 89.4%, and the mAP@0.5:0.95 achieves 70.0%. The experimental results demonstrate that the CAA mechanism dynamically calibrates feature map saliency across spatial regions, enabling the model to prioritize foreground regions while suppressing irrelevant contextual interference. This spatial attention strategy directly contributes to detection accuracy improvement. Regarding computational efficiency, the optimized architecture achieves a parameter count of 2.10M and computational overhead of 7.9 GFLOPs, maintaining real-time processing capability at 61.0 frames·s−1. This suggests that the CAA mechanism enhances model efficacy while simultaneously alleviating processing resource demands, all without compromising inference velocity.
After replacing the original loss function with the Focal Loss in Model C, the mAP@0.5 improves to 87.6%, and the mAP@0.5:0.95 increases to 69.0%. These results indicate that Focal Loss effectively optimizes the training process, facilitates better model learning, and enhances detection accuracy. Notably, both parameter count and computational overhead remain stable, whereas the frame rate reaches 79.0 frames·s−1. This shows that Focal Loss can enhance training efficiency and speed up detection without adding extra computational burden, proving itself a highly effective and practical approach to model optimization.
Model D incorporates both the C2f-DSFEM module and the CAA attention mechanism, achieving an mAP@0.5 of 92.7% and an mAP@0.5:0.95 of 76.6%, indicating a further improvement in detection accuracy. The number of parameters is 2.1 × 10^6, with the computational cost reduced to 7.5 GFLOPs, while the frame rate is maintained at 55.0 frames·s−1. These results demonstrate that the synergistic integration of the two modules enables more effective feature extraction and target focus, achieving higher detection performance with lower computational resource consumption. However, there is still room for improvement in terms of inference speed.
The refined model (Ours), incorporating the C2f-DSFEM module, CAA attentional component, and Focal loss formulation, achieves marked performance gains. Specifically, the mAP@0.5 reaches 99.2%, and the mAP@0.5:0.95 achieves 81.7%. The number of parameters and computational cost remain consistent with those of Model D, while the frame rate increases to 62.0 frames·s−1. These results indicate that the synergistic integration of these three modules not only markedly improves UAV target detection accuracy but also optimizes operational speed to some extent. This integration effectively strikes a balance between computational resource consumption, detection accuracy, and operational efficiency.
Collectively, the C2f-DSFEM module, CAA attentional component, and Focal loss formulation synergistically elevate the efficacy of UAV target detection frameworks. Furthermore, their combined use yields even better results, demonstrating the effectiveness of this multi-module approach.

3.2.3. Visualization Experiment

To further validate the effectiveness of the improved modules and provide a clearer illustration of their contribution to the enhancement of YOLOv8, this paper employs Gradient-weighted Class Activation Mapping (Grad-CAM) [36] to visualize the gradient values associated with the confidence scores of the output categories in both the original and the final improved models. In the resulting heatmaps, regions with higher gradients in the feature maps are represented by darker red shading, while areas with lower gradients are indicated by darker blue shading. Heat-maps are generated for corresponding test results to qualitatively compare model behavior before and after the improvements.
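A minimal hook-based Grad-CAM routine along these lines is sketched below; it assumes a generic convolutional `model`, a chosen `target_layer`, and a user-supplied scalar score function, and is not tied to the YOLOv8 head specifics.

```python
import torch
import torch.nn.functional as F


def grad_cam(model, target_layer, image, score_fn):
    """Compute a Grad-CAM heat map: weight the target layer's activations by the
    spatial average of the gradients of a scalar score, then ReLU and normalize."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    try:
        model.zero_grad()
        score = score_fn(model(image))          # e.g., the maximum objectness/class confidence
        score.backward()
        weights = grads["a"].mean(dim=(2, 3), keepdim=True)      # channel-wise gradient weights
        cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)  # normalize to [0, 1]
    finally:
        h1.remove()
        h2.remove()
```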
As shown in Figure 7, the Grad-CAM heat map (b) of the original YOLOv8 model exhibits inadequate focus on the UAV area, and the model frequently misidentifies the UAV target in complex backgrounds, suggesting that it is susceptible to environmental interference. Conversely, the heat map (c) of the enhanced YOLOv8 model precisely localizes UAV targets: the high-gradient regions are concentrated on the key structural components of the UAV, such as the fuselage and support components. Examination of the visualized outcomes reveals that the enhanced YOLOv8 model developed herein significantly mitigates background noise interference, allowing the network to direct attention more precisely toward the actual UAV target. This capability, validated through empirical testing, substantially improves detection robustness and operational reliability.

3.3. Algorithm Tracking Performance Validation

3.3.1. Tracking Performance Comparison Experiment

To assess the efficacy of the proposed method in UAV tracking scenarios, this study employs YOLOv3, YOLOv5, YOLOv8, and the enhanced YOLOv8 variant as detection backbones, each integrated with the DeepSORT framework for comparative benchmarking. The performance of these combinations is tested on two self-constructed datasets (video 1 and video 2) and a video from the Anti-UAV-RGBT [37] dataset (video 3). Specifically, the DarkLabel 2.4 annotation software (developed under the alias darkpgmr, South Korea) is used to annotate video 1 and video 2. Video 1 consists of 221 frames with backgrounds featuring buildings and sky, while video 2 comprises 2986 frames, including complex scenes with small UAV targets and tree occlusions. From the Anti-UAV-RGBT [38] dataset, video 3 is selected, which contains 999 frames of low-resolution images with challenging camera transitions. The MOTA [38] metric is used to assess the tracking efficacy of the enhanced YOLOv8 algorithm and is calculated as follows:
$$\mathrm{MOTA} = 1 - \frac{ \sum_t ( FN_t + FP_t + IDSW_t ) }{ \sum_t GT_t }$$
where $t$ is the frame index, $FN_t$ (false negatives) is the number of ground-truth targets that are missed in the output, $FP_t$ (false positives) is the number of output target boxes that do not correspond to any actual target, $IDSW_t$ is the number of ID switches, i.e., how often the tracking ID assigned to the same target changes during the tracking task, and $GT_t$ is the number of ground-truth target boxes.
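The MOTA computation itself reduces to a few lines; the per-frame counts in the example are placeholder values.

```python
def mota(fn_per_frame, fp_per_frame, idsw_per_frame, gt_per_frame):
    """MOTA = 1 - sum_t(FN_t + FP_t + IDSW_t) / sum_t(GT_t)."""
    errors = sum(fn + fp + sw for fn, fp, sw in zip(fn_per_frame, fp_per_frame, idsw_per_frame))
    return 1.0 - errors / sum(gt_per_frame)


# Example with placeholder per-frame counts over 4 frames.
if __name__ == "__main__":
    print(mota(fn_per_frame=[0, 1, 0, 0],
               fp_per_frame=[1, 0, 0, 1],
               idsw_per_frame=[0, 0, 1, 0],
               gt_per_frame=[2, 2, 2, 2]))      # 1 - 4/8 = 0.5
```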
The MOTA metric comprehensively reflects detection accuracy, false positive rate, and the number of ID switches. In videos 1 and 2, the proposed detector ("Ours") attains higher MOTA scores than contemporary state-of-the-art detectors, corroborating its enhanced tracking proficiency under standard imaging conditions. In video 3, the overall tracking performance is suboptimal owing to the low resolution of the video and the irregular frame-rate transitions that occur during camera switching; these issues lead to failures in motion prediction, particularly at abnormal time intervals, resulting in trajectory fragmentation or erroneous extrapolation. Notwithstanding these challenges, the proposed enhanced detector outperforms the baseline models, indicating that it remains robust and effective in detecting and tracking UAV targets under unfavorable conditions. This outcome underscores the enhanced detector's advantages in detection reliability and tracking robustness, especially in challenging operational environments where system stability is critical.
The experimental results are shown in Table 2, where, the IDF1 (Identification F1-Score) metric evaluates the accuracy of target identification during tracking. On video 1, the proposed detector (“Ours”) achieves an IDF1 score of 92.4, the highest among all tested models. In comparison, YOLOv3 and YOLOv5 achieve similar scores of 81.6 and 81.6, respectively, while YOLOv8 lags significantly with a score of only 64.8. On video 2, the proposed detector maintains superior performance, achieving an IDF1 score of 78.2, which also outperforms all other detectors. In video3, where image noise is more prominent, visual features are easily misinterpreted by tracking algorithms, leading to trajectory drift or false matching. As a result, the overall identification accuracy is lower compared to the other two test videos. Nevertheless, the proposed detector still demonstrates strong performance in both tracking precision and identification reliability, highlighting its effectiveness in handling challenging real-world conditions and limited-quality public datasets.
In terms of false negatives (FNt) and false positives (FPt) metrics, the proposed detector (“Ours”) demonstrates superior performance across all test videos. On video1, the proposed model achieves an FNt value of 9.0 and an FPt value of 20.0, significantly lower than those of other detectors. This indicates its high precision in detecting and distinguishing targets under relatively controlled conditions. On video 2, the proposed model maintains its edge with an FNt value of 652.0 and an FPt value of 532.0, compared to YOLOv5, which has an FNt value of 847.0 and an FPt value of 1370.0, and YOLOv8, which has an FNt value of 1112.0 and an FPt value of 1330.0. These results highlight the effectiveness of our model in distinguishing between targets and background, thereby reducing false detections. In video 3, despite the challenging conditions of low resolution and abrupt camera transitions, the proposed model still performs exceptionally well, achieving an FN value of 280 and an FP value of 375.0. In contrast, YOLO series detectors exhibit higher numbers of false negatives and false positives under these difficult conditions. The significant reduction in both false negatives and false positives for our model underscores its enhanced capability in extracting and recognizing target features accurately, even in complex scenarios characterized by low-resolution imagery and harsh camera transitions. This ability allows it to more accurately distinguish between targets and background, thereby minimizing misclassifications. These findings collectively demonstrate that the Ours model excels particularly in environments with low-resolution imagery and abrupt camera transitions, where it can effectively minimize false negatives and false positives, thus providing robust and reliable tracking performance.
In the IDSWt metric, the ‘Ours’ model has an IDSWt value of only 1 in video1, whereas several other detectors have values between 5 and 6. In video 2, the ‘Ours’ model has an IDSWt value of 2, and in video 3, it is only 5, which is much lower than the YOLO series. Overall, the 85.23% reduction in tracking ID switches indicates greater effectiveness in maintaining target identity consistency and tracking trajectory coherence in response to changes in target appearance and motion caused by camera switching.
The FPS metric measures the algorithm’s processing speed. On both test videos, the FPS values of all detectors are relatively close. Although YOLOv8 achieves a slightly higher frame rate than the proposed method, the “Ours” detector still satisfies real-time tracking requirements while delivering superior performance in terms of detection accuracy, tracking stability, and robustness.
In summary, in this UAV target tracking experiment, the Ours detector combined with the DeepSORT algorithm performs well across several key metrics, with high detection and tracking accuracy, fewer missed and false detections, and a clear overall advantage in tracking performance, together with stronger adaptability and robustness. The experimental findings demonstrate that the Ours detector outperforms existing models on the key tracking metrics while exhibiting a more pronounced adaptive capacity in addressing real-world challenges, including background complexity, scale variation, and camera instability. Consequently, the enhanced YOLOv8-based detector, used in conjunction with DeepSORT, presents a compelling solution for precise and rapid UAV tracking in dynamic and constrained environments.

3.3.2. Continuous Frame Visualization Experiment

To demonstrate the tracking results more intuitively, this study conducts visualization experiments based on YOLOv3, YOLOv5, YOLOv8, and the improved YOLOv8 detector. The experiments are performed using multiple continuous frame sequences: frames 22–152 from video1.mp4, frames 1–331 from video 2.mp4, and frames 1–13 from the Anti-UAV-RGBT [38] dataset. The comparative analysis is conducted across four dimensions: detection box stability, ID continuity, dynamic adaptability, and anti-interference capability. These experiments validate the improved algorithm’s effectiveness, meanwhile furnishing a theoretical basis for engineering UAV tracking technology applications.
As shown in Figure 8, in the scene corresponding to frames 22–152 of video 1.mp4, the UAV first flies horizontally at a constant speed and then experiences a sudden acceleration. In this scenario, traditional models exhibit limitations in dynamic adaptation. Specifically, YOLOv3, YOLOv5, and YOLOv8 all experience ID switching, with YOLOv8 failing to detect the UAV target at frame 22. Additionally, YOLOv3 shows repeated ID jumps throughout the sequence. In contrast, the improved YOLOv8 achieves stable single-ID tracking throughout the entire sequence (ID = 1), demonstrating strong adaptability to both regular motion and sudden acceleration scenarios, significantly outperforming other models.
In the scene corresponding to frames 1–331 of video 2.mp4, the UAV first ascends and then descends continuously. Among the baseline detectors, YOLOv3 performs relatively better than YOLOv5 and YOLOv8, with fewer ID switches. However, YOLOv5 and YOLOv8 fail to maintain stable tracking, showing frequent ID jumps. In contrast, the improved YOLOv8 successfully tracks the UAV target from its initial appearance through the entire descent phase, with only one ID switch observed. This indicates overall stability and reliability in tracking performance.
In the building scene depicted in frames 1–13 of the Anti-UAV-RGBT dataset, the YOLOv3, YOLOv5, and YOLOv8 algorithms repeatedly misidentified the target during tracking. In contrast, the enhanced YOLOv8 accurately detected and tracked the UAV target throughout, with the tracking ID settling at 3, indicating robust detection stability, dynamic adaptability, and anti-interference capability.
The comparative analysis of the tracking performance across multiple sequential image sequences reveals that YOLOv3 and YOLOv5 detectors suffer from varying degrees of ID switching issues, reflecting limitations in their target detection and cross-frame association algorithms. Although YOLOv8 demonstrates relatively stable overall performance, it still requires improvement when handling complex dynamic scenes. In contrast, the improved YOLOv8 detector introduces enhancements in feature extraction, context information utilization, and loss function optimization. These improvements lead to comprehensive advancements in detection accuracy, tracking stability, and generalization ability across various scenarios. As a result, the improved model provides a reliable and effective solution for UAV tracking applications.
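For completeness, the ID-continuity comparison above can be quantified directly from the tracker output. The short routine below counts how often the ID assigned to the single UAV changes between consecutive frames; the per-frame ID list is an assumed input format rather than the authors’ evaluation code.

```python
# Count ID switches for a single-target sequence from the per-frame track IDs.
# `ids_per_frame` is an assumed format: the track ID assigned to the UAV in each frame,
# or None when the target is missed in that frame.
def count_id_switches(ids_per_frame):
    switches = 0
    last_id = None
    for tid in ids_per_frame:
        if tid is None:            # a missed detection does not count as a switch by itself
            continue
        if last_id is not None and tid != last_id:
            switches += 1
        last_id = tid
    return switches

# A track that stays on ID 1 throughout (as the improved model does on video 1)
print(count_id_switches([1] * 131))     # -> 0
# A track whose ID jumps 1 -> 2 -> 2 -> 3
print(count_id_switches([1, 2, 2, 3]))  # -> 2
```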

4. Conclusions

The present study addresses the core challenges of real-world UAV surveillance scenarios: complex backgrounds, occlusion, high-frequency motion, and abrupt velocity variations. To address these issues, a joint detection and tracking algorithm based on an improved YOLOv8 combined with DeepSORT is proposed, establishing a complete technical pipeline spanning feature enhancement, scale awareness, hard-sample learning, and trajectory optimization. Firstly, the Sobel operator is fused with depthwise-separable convolution for multi-dimensional feature extraction and integration, strengthening the model’s feature acquisition capability and establishing an edge-sensitive detection mechanism. Secondly, a lightweight Context Anchor Attention (CAA) mechanism is applied to the Neck network, enhancing target localization accuracy and reducing errors caused by scale variations. Thirdly, Focal Loss replaces the original classification loss, dynamically adjusting sample weights to strengthen the model’s learning of challenging samples. Finally, a “trajectory prediction-double matching-optimization solution” strategy is employed to resolve trajectory discontinuities in highly dynamic scenes. Experimental results demonstrate marked gains in detection performance, with mAP@0.5 reaching 99.2% (a 12.3% gain) alongside a parameter reduction of 1.1 × 10^6, striking a balance between detection precision and computational efficiency. In terms of tracking, the YOLOv8N-Drone algorithm improves tracking accuracy (MOTA) by 19.2% across challenging multi-scenario environments, while target identification accuracy (IDF1) improves by 6.8% and ID switches decrease by 85.2%, demonstrating high-precision and consistently reliable tracking performance.
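As a point of reference for the loss replacement summarized above, the snippet below is a minimal PyTorch implementation of the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The alpha and gamma values shown are common defaults from the literature, not necessarily those used for YOLOv8N-Drone.

```python
# Minimal binary focal loss (Lin et al.), shown for reference; alpha/gamma are common
# defaults and not necessarily the values used in this paper.
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """logits and targets have the same shape; targets are in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance factor
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()       # (1 - p_t)^gamma down-weights easy samples

# Hard samples (low p_t) dominate the loss, easy samples contribute little.
logits = torch.tensor([3.0, -3.0, 0.1])
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))
```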
Although the improved YOLOv8N-Drone algorithm performs strongly in terms of ID-switch frequency, occasional ID jumps still occur, particularly during complex multi-scenario transitions. Future research will analyze the causes of these jumps, including target appearance changes, occlusions, and target similarity. On the one hand, optimizing feature extraction to obtain more discriminative target features can reduce ID confusion caused by similar appearances; on the other hand, incorporating additional contextual information, such as the target’s motion trajectory and velocity, could support more accurate association models and thereby enhance the continuity and stability of target identities.

Author Contributions

Conceptualization, validation, and methodology, Y.Z.; Methodology, formal analysis, writing—original draft preparation, writing—review and editing, Q.M.; writing—original draft preparation, G.L.; writing—review and editing, L.W. and C.G.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Program of Shanxi Province (Grant No. 202203021222063). This work was also supported by the Shanxi Science and Technology Innovation Leading Talent Team for Special Unmanned Systems and Intelligent Equipment (Grant No. 202204051002001) and the State Key Laboratory of Intelligent Mining Equipment Technology (Grant No. ZNCKKF20240110).

Data Availability Statement

The data used in this study are partially publicly available and can be accessed as indicated in the text.

Acknowledgments

The authors thank the “North China University Special Intelligent Unmanned Team” for providing technical support during this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DSFEM   Depthwise-Separable and Sobel Feature Enhancement Module
CAA     Context Anchor Attention
UAVs    unmanned aerial vehicles
mAP     mean average precision

References

  1. Hu, P.; Zhang, R.; Yang, J.; Chen, L. Development status and key technologies of plant protection UAVs in China: A review. Drones 2022, 6, 354. [Google Scholar] [CrossRef]
  2. Adnan, W.H.; Khamis, M.F. Drone use in military and civilian application: Risk to national security. J. Media Inf. Warf. (JMIW) 2022, 15, 60–70. [Google Scholar]
  3. Gonzalez-Jorge, H.; Aldao, E.; Fontenla-Carrera, G.; Veiga-López, F.; Balvís, E.; Ríos-Otero, E. Counter drone technology: A review. Preprints 2024. [Google Scholar] [CrossRef]
  4. AL-Dosari, K.; Hunaiti, Z.; Balachandran, W. Systematic review on civilian drones in safety and security applications. Drones 2023, 7, 210. [Google Scholar] [CrossRef]
  5. Moshref-Javadi, M.; Winkenbach, M. Applications and research avenues for drone-based models in logistics: A classification and review. Expert Syst. Appl. 2021, 177, 114854. [Google Scholar] [CrossRef]
  6. Brown, A.D. Radar challenges, current solutions, and future advancements for the counter unmanned aerial systems mission. IEEE Aerosp. Electron. Syst. Mag. 2023, 38, 34–50. [Google Scholar] [CrossRef]
  7. Yang, T.; De Maio, A.; Zheng, J.; Su, T.; Carotenuto, V.; Aubry, A. An adaptive radar signal processor for UAVs detection with super-resolution capabilities. IEEE Sens. J. 2021, 21, 20778–20787. [Google Scholar] [CrossRef]
  8. Olorunshola, O.; Jemitola, P.; Ademuwagun, A. Comparative study of some deep learning object detection algorithms: R-CNN, fast R-CNN, faster R-CNN, SSD, and YOLO. Nile J. Eng. Appl. Sci. 2023, 1, 70–80. [Google Scholar] [CrossRef]
  9. Zhu, H.; Qi, Y.; Shi, H.; Li, N.; Zhou, H. Human detection under UAV: An improved faster R-CNN approach. In Proceedings of the 2018 5th International Conference on Systems and Informatics (ICSAI), Nanjing, China, 10–12 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 367–372. [Google Scholar]
  10. Maske, S.R. Micro-UAV detection using Mask R-CNN. Ph.D. Dissertation, National College of Ireland, Dublin, Ireland, 2021. [Google Scholar]
  11. Hammer, M.; Hebel, M.; Borgmann, B.; Laurenzis, M.; Arens, M. Potential of lidar sensors for the detection of UAVs. In Laser Radar Technology and Applications XXIII; SPIE: Bellingham, WA, USA, 2018; Volume 10636, pp. 39–45. [Google Scholar]
  12. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  13. Kwan, C.; Budavari, B. Enhancing small moving target detection performance in Low-Quality and Long-Range infrared videos using optical flow techniques. Remote Sens. 2020, 12, 4024. [Google Scholar] [CrossRef]
  14. Alsanad, H.R.; Sadik, A.Z.; Ucan, O.N.; Ilyas, M.; Bayat, O. YOLO-V3 based real-time drone detection algorithm. Multimed. Tools Appl. 2022, 81, 26185–26198. [Google Scholar] [CrossRef]
  15. Zhai, X.; Huang, Z.; Li, T.; Liu, H.; Wang, S. YOLO-Drone: An optimized YOLOv8 network for tiny UAV object detection. Electronics 2023, 12, 3664. [Google Scholar] [CrossRef]
  16. Liu, H.; Fan, K.; Ouyang, Q.; Li, N. Real-time small drones detection based on pruned yolov4. Sensors 2021, 21, 3374. [Google Scholar] [CrossRef]
  17. Zamri FN, M.; Gunawan, T.S.; Yusoff, S.H.; Alzahrani, A.A.; Bramantoro, A.; Kartiwi, M. Enhanced small drone detection using optimized YOLOv8 with attention mechanisms. IEEE Access 2024, 12, 90629–90643. [Google Scholar] [CrossRef]
  18. Zhao, Y.; Ju, Z.; Sun, T.; Dong, F.; Li, J.; Yang, R.; Fu, Q.; Lian, C.; Shan, P. Tgc-yolov5: An enhanced yolov5 drone detection model based on transformer, gam & ca attention mechanism. Drones 2023, 7, 446. [Google Scholar] [CrossRef]
  19. Hong, T.; Liang, H.; Yang, Q.; Fang, L.; Kadoch, M.; Cheriet, M. A real-time tracking algorithm for multi-target UAV based on deep learning. Remote Sens. 2022, 15, 2. [Google Scholar] [CrossRef]
  20. Gandhi, R. UAV Object detection and tracking in video using YOLOv3 and DeepSORT. In Proceedings of the 2024 International Conference on Emerging Technologies in Computer Science for Interdisciplinary Applications (ICETCS), Bengaluru, India, 22–23 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
  21. Ghazlane, Y.; Hilali Alaoui, A.E.; Medomi, H.; Bnouachir, H. Real-Time airborne target tracking using DeepSort algorithm and Yolov7 Model. Int. J. Adv. Comput. Sci. Appl. 2024, 15. [Google Scholar] [CrossRef]
  22. Delleji, T.; Fkih, H.; Kallel, A.; Chtourou, Z. Visual tracking of mini-UAVs using modified YOLOv5 and improved DeepSORT algorithms. In Proceedings of the 2022 6th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), Sfax, Tunisia, 24–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar]
  23. Yuding, X.; Zhuang, S.; Wang, P.; Ling, Y.; Lv, Z.; Lu, W. Research on Real-time detection and tracking algorithm for low slow small targets based on the DeepSort. In Proceedings of the 2024 3rd Asia Conference on Algorithms, Computing and Machine Learning, Shanghai, China, 22–24 March 2024; pp. 135–142. [Google Scholar]
  24. Liu, S.; Zhu, M.; Tao, R.; Ren, H. Fine-grained feature perception for unmanned aerial vehicle target detection algorithm. Drones 2024, 8, 181. [Google Scholar] [CrossRef]
  25. Chang, Q.; Li, X.; Li, Y.; Miyazaki, J. Multi-directional sobel operator kernel on GPUs. J. Parallel Distrib. Comput. 2023, 177, 160–170. [Google Scholar] [CrossRef]
  26. Dai, Y.; Li, C.; Su, X.; Liu, H.; Li, J. Multi-Scale depthwise separable convolution for semantic segmentation in Street–Road scenes. Remote Sens. 2023, 15, 2649. [Google Scholar] [CrossRef]
  27. Chen, D.; Zhang, L. SL-YOLO: A stronger and lighter drone target detection model. arXiv 2024, arXiv:2411.11477. [Google Scholar]
  28. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly kernel inception network for remote sensing detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27706–27716. [Google Scholar]
  29. Chen, Y.; Shi, B. Enhanced heterogeneous graph attention network with a novel multilabel focal loss for Document-Level relation extraction. Entropy 2024, 26, 210. [Google Scholar] [CrossRef]
  30. Wang, P.-S.; Lin, C.-H.; Chuang, C.-T. Real-Time object localization using a fuzzy controller for a Vision-Based drone. Inventions 2024, 9, 14. [Google Scholar] [CrossRef]
  31. Zhao, J.; Zhang, J.; Li, D.; Wang, D. Vision-based anti-uav detection and tracking. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25323–25334. [Google Scholar] [CrossRef]
  32. Walter, V.; Vrba, M.; Saska, M. On training datasets for machine learning-based visual relative localization of micro-scale UAVs. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–4 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 10674–10680. [Google Scholar]
  33. Pawełczyk, M.; Wojtyra, M. Real world object detection dataset for quadcopter unmanned aerial vehicle detection. IEEE Access 2020, 8, 174394–174409. [Google Scholar] [CrossRef]
  34. Zheng, Y.; Chen, Z.; Lv, D.; Li, Z.; Lan, Z.; Zhao, S. Air-to-air visual detection of micro-uavs: An experimental evaluation of deep learning. IEEE Robot. Autom. Lett. 2021, 6, 1020–1027. [Google Scholar] [CrossRef]
  35. Chen, Y.; Aggarwal, P.; Choi, J.; Kuo CC, J. A deep learning approach to drone monitoring. In Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia, 12–15 December 2017; pp. 686–691. [Google Scholar]
  36. Liu, R.M.; Su, W.H. APHS-YOLO: A lightweight model for Real-Time detection and classification of stropharia Rugoso-Annulata. Foods 2024, 13, 1710. [Google Scholar] [CrossRef] [PubMed]
  37. Jiang, N.; Wang, K.; Peng, X.; Yu, X.; Wang, Q.; Xing, J.; Li, G.; Zhao, J.; Guo, G.; Han, Z. Anti-UAV: A large multi-modal benchmark for UAV tracking. arXiv 2021, arXiv:2101.08466. [Google Scholar]
  38. Shen, P.; Mei, K.; Xue, H.; Li, T.; Zhang, G.; Zhao, Y.; Luo, W.; Mao, L. Research on enhanced dynamic pig counting based on YOLOv8n and Deep SORT. Sensors 2025, 25, 2680. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Improved YOLOv8N-Drone structure diagram.
Figure 2. Structure of C2f-DSFEM.
Figure 3. Structure of CAA.
Figure 4. Image of Focal Loss function.
Figure 5. Schematic of the self-built UAV dataset.
Figure 6. Results of ablation experiments.
Figure 7. Visualization of the results, where (a) is the original image; (b) is the heat map of the YOLOv8 model; (c) is the heat map of the enhanced YOLOv8 model.
Figure 8. Tracking results of different detectors on various videos.
Table 1. Results of ablation experiments.

| Model | Backbone | Neck | Loss Function | mAP@0.5/% | mAP@0.5:0.95 | Params (M) | GFLOPs | FPS |
|-------|----------|------|---------------|-----------|--------------|------------|--------|------|
| base  |          |      |               | 86.9      | 63.7         | 3.2        | 8.9    | 76.0 |
| A     |          |      |               | 93.0      | 75.1         | 3.0        | 8.2    | 50.0 |
| B     |          |      |               | 89.4      | 70.0         | 2.1        | 7.9    | 61.0 |
| C     |          |      |               | 87.6      | 69.0         | 3.2        | 8.9    | 79.0 |
| D     |          |      |               | 92.7      | 76.6         | 2.1        | 7.5    | 55.0 |
| Ours  |          |      |               | 99.2      | 81.7         | 2.1        | 7.5    | 62.0 |
Table 2. Comparative experiments on tracking performance of different detectors + DeepSORT.

| Test Video | Detector | MOTA | IDF1 | FNt    | FPt    | IDSWt | FPS  |
|------------|----------|------|------|--------|--------|-------|------|
| video 1    | YOLOv3   | 59.6 | 81.6 | 11.0   | 73.0   | 5.0   | 27.9 |
| video 1    | YOLOv5   | 60.5 | 81.6 | 16.0   | 66.0   | 5.0   | 37.2 |
| video 1    | YOLOv8   | 59.1 | 64.8 | 47.0   | 123.0  | 6.0   | 41.8 |
| video 1    | Ours     | 85.0 | 92.4 | 9.0    | 20.0   | 1.0   | 38.6 |
| video 2    | YOLOv3   | 39.1 | 71.3 | 692.0  | 1102.0 | 24.0  | 29.6 |
| video 2    | YOLOv5   | 25.3 | 65.6 | 847.0  | 1370.0 | 12.0  | 37.6 |
| video 2    | YOLOv8   | 17.7 | 60.2 | 1112.0 | 1330.0 | 15.0  | 52.8 |
| video 2    | Ours     | 58.3 | 78.2 | 652.0  | 532.0  | 2.0   | 38.6 |
| video 3    | YOLOv3   | 35.3 | 55.2 | 492.0  | 660.0  | 15.0  | 28.6 |
| video 3    | YOLOv5   | 24.6 | 51.8 | 493.0  | 509.0  | 15.0  | 38.6 |
| video 3    | YOLOv8   | 16.1 | 52.2 | 490.0  | 513.0  | 13.0  | 49.8 |
| video 3    | Ours     | 45.3 | 60.3 | 280.0  | 375.0  | 5.0   | 38.6 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
