YOLO-Drone: An Optimized YOLOv8 Network for Tiny UAV Object Detection

Abstract: With the widespread use of UAVs in commercial and industrial applications, UAV detection is receiving increasing attention in areas such as public safety. As a result, object detection techniques for UAVs are also developing rapidly. However, the small size of drones, complex airspace backgrounds, and changing light conditions still pose significant challenges for research in this area. To address these problems, this paper proposes a tiny UAV detection method based on an optimized YOLOv8. First, in the detection head, a high-resolution detection head is added to improve the model's detection capability for small targets, while the large-target detection head and redundant network layers are removed to reduce the number of network parameters and improve UAV detection speed. Second, in the feature extraction stage, SPD-Conv is used instead of Conv to extract multi-scale features, reducing the loss of fine-grained information and enhancing the model's feature extraction capability for small targets. Finally, the GAM attention mechanism is introduced in the neck to enhance the model's fusion of target features and improve its overall performance in detecting UAVs. Relative to the baseline model, our method improves P (precision), R (recall), and mAP (mean average precision) by 11.9%, 15.2%, and 9%, respectively. Meanwhile, it reduces the number of parameters and the model size by 59.9% and 57.9%, respectively. In addition, our method demonstrates clear advantages in comparison experiments and self-built dataset experiments, making it more suitable for engineering deployment and practical applications of UAV object detection systems.


• Addressing the small size of UAV targets, a high-resolution detection branch is added to the detection head to enhance the model's ability to detect tiny targets. Simultaneously, the prediction layer for large targets and its related feature extraction and fusion layers are pruned, reducing network redundancy and lowering the model's parameter count.

• To improve multi-scale feature extraction, SPD-Conv is used instead of Conv, which better retains the features of tiny targets and reduces the probability of missed UAV detections. Additionally, the multi-scale fusion module incorporates the GAM attention mechanism to enhance the fusion of target features and reduce the probability of false detections. The combined use of SPD-Conv and GAM strengthens the model's ability to detect tiny targets.

Introduction
With the advancement of drone technology, drones are widely employed in various sectors, such as aerial photography, emergency response, and agricultural planning. However, the development of drones has also brought to the fore a series of management issues. These include illegal "rogue flights", the exploitation of drones for criminal and terrorist activities, and their potential to be transformed into dangerous weapons by carrying explosive materials [1][2][3]. Drones have become a new tool for terrorism, posing significant threats to public safety. In response to the increasingly severe UAV threat, it is urgent to establish anti-drone systems around restricted areas; thus, illegal UAV detection, as a critical component of the anti-drone system [4], has become a subject of widespread attention among researchers. Improving the accuracy and processing speed of detecting hostile UAV targets, conducting effective early warning detection, and then taking measures to intercept them is the key to maintaining air control, national security, and social stability. Most current early warning detection equipment suffers from fixed deployment locations, large size, and conspicuous exposure, meaning that it cannot be flexibly distributed in hidden forward positions; therefore, lightweight early warning equipment that is easy to deploy at scale is needed to fill the gap. The following problems exist in detecting UAV targets: (1) UAVs are characterized by their small size, the use of "stealth" materials, low-altitude flight, and flexible take-off platforms; (2) complex airspace environments are often affected by clouds, light, and object occlusion, so that the use of electromagnetic and other signals to detect UAV groups is prone to false and missed detections [5,6].
With the rapid development of computer vision technology and neural networks, methods based on video and image frames have been widely used to extract features such as target contours, colors, and shapes, enabling the real-time detection of target positions and motion behaviors. This approach has extensive applications in public security monitoring, intelligent transportation systems, national defense and security, human-computer interaction systems, and production safety. Applying computer vision technology to drone detection opens up a new avenue for airspace early warning, offering vast prospects for practical applications [7].
The rest of the paper is structured as follows: Section 2 summarizes the works related to UAV detection. Section 3 first introduces the YOLOv8 network structure and the details of its critical modules, then presents the improved tiny UAV target detection model and details the structure and role of each improved module. Section 4 first introduces the dataset and the experimental environment and then reports ablation experiments, comparison experiments on the publicly available TIB-Net dataset, and, finally, self-built dataset experiments to fully validate the feasibility of the proposed method. Section 5 summarizes the research results of the paper and provides an outlook on future research directions.

Related Work
In recent years, improving hardware device performance has enhanced computer data-processing capabilities, enabling rapid advancements in visual technologies that rely on deep learning with big data. Object detection based on computer vision technology has garnered significant attention from researchers. It has evolved from traditional manual feature extraction [8][9][10] using convolutional calculations for object detection to leveraging deep learning to improve recognition accuracy in visual object detection. Compared to traditional signal-based detection methods such as radar, laser, infrared, audio, and radio frequency, object detection using visual sensors, specifically cameras capturing video and image data of target groups, offers more intuitive detection and recognition of group information. It offers advantages such as the real-time and dynamic recording of sequential images of targets, low cost, fast detection speed, and immunity to interference from low-altitude clutter [11].
Object detection is an important research area in computer vision and is the foundation for numerous complex visual tasks. It has been widely applied in industry, agriculture, and other fields [12,13]. Since 2014, there has been a remarkable advancement in deep learning-based object detection techniques. The industry has introduced various algorithms, including Faster R-CNN [14], SSD [15], and the YOLO series [16], to improve object detection further. With the rapid development of target detection technology, several useful methods have emerged specifically for UAV target detection tasks [17][18][19][20][21]. For example, the authors of [17] argue that convolutional neural networks struggle to balance detection accuracy and model size. To address this issue, they introduced a recurrent pathway and spatial attention module into the original extremely tiny face detector (EXTD), enhancing its ability to extract features from small UAV targets. The model size is only 690.7 kb. However, this model exhibits a slow inference time and is unsuitable for deployment in practical engineering scenarios. Ref. [18] proposed a UAV target detection network based on multi-scale feature fusion, which first extracts multi-receptive-field features of the target using Res2Net, then improves the network performance in terms of both fine-grained multi-scale feature extraction and hierarchical multi-scale feature fusion, and finally achieves better results on a self-built UAV detection dataset. Ref. [19] created a new UAV detection method that overcomes the limitations of the UAV detection process in terms of parameters and computational environment by performing detection through a web application. In that work, the authors first screen an SSD pre-trained model suitable for deployment in the web application to improve detection accuracy and recall. The experimental results show that the web application method outperforms the on-board processing method.
Ref. [20] proposes a lightweight feature-enhanced convolutional neural network that is capable of the real-time and high-precision detection of low-flying objects. It effectively alerts against unauthorized drones in the airspace and provides guidance information. Ref. [21] introduces a novel deep learning method called the convolutional transformation network (CT-Net). The backbone of this network first incorporates an attention-enhanced transformation block, which establishes a feature-enhanced multi-head self-attention mechanism to improve the model's feature extraction capability. Then, a lightweight bottleneck module is employed to control computational load and reduce parameters. Finally, a direction feature fusion structure is proposed to enhance detection accuracy when dealing with multi-scale objects, especially small-sized objects. The approach achieves a mAP of 0.966 on a self-built low-altitude small-object dataset, demonstrating good detection accuracy. However, the FPS is only 37, indicating that there is room for improvement in detection speed.
Although significant progress has been made in UAV detection technology, existing detection methods still face challenges in balancing detection accuracy, model size, and detection speed. The YOLO series of detection networks has addressed these problems effectively. The YOLO series models have undergone eight official iterations and several branch versions, showcasing remarkable detection accuracy and speed. These models have extensive applications in various fields, including medicine, transportation, remote sensing, and industry [22]. Scholars have extensively researched the use of YOLO series models for UAV target detection, as evidenced by numerous studies [23][24][25][26][27]. For example, in Ref. [23], an attention mechanism module was incorporated into the PP-YOLO detection algorithm to improve its performance. Furthermore, introducing the Mish activation function addressed the issue of vanishing gradients during backpropagation, resulting in a significant boost in detection accuracy. In Ref. [24], a UAV detection algorithm for complex urban backgrounds was proposed, based on YOLOv3. It employed an FPN for multi-scale prediction, enhancing the system's detection performance for small targets. A lightweight Ghost network was also utilized to accelerate the model and make the network lightweight. Experimental results demonstrated that the algorithm effectively detected small UAV targets in complex scenes and exhibited strong robustness. In Ref. [25], a lightweight convolutional neural network, MobileNetv2, replaced the original CSPDarknet53 backbone of the high-performance YOLOv4 model. This substitution aimed to reduce the model's scale and simplify the computational operations. Experimental results demonstrated that Mob-YOLO could achieve accurate real-time monitoring of UAV targets with a smaller model size, making it deployable on onboard embedded processors.
In Ref. [26], a YOLOv5-based distributed anti-drone system was proposed. This system integrates airport defense capabilities to address UAV jamming scenarios by incorporating features such as automatic targeting and jamming signal broadcasting, enabling the interception of illegal UAVs. To cover the wide no-fly zone of the airport, the system is deployed around the airport using distributed clustering, effectively resolving the issues of blind spots and target loss. Experimental results demonstrated the high accuracy of automatic targeting based on the YOLOv5 algorithm, with the inference speed and model size meeting real-time hardware detection requirements. Although the system offers limited innovation in improving YOLOv5 itself, its successful application of UAV target detection technology to practical engineering scenarios is informative. Ref. [27] proposed YOLOX-drone, an improved target detection algorithm for UAS based on YOLOX-S. Building on the YOLOX-S target detection network, the authors first introduce a coordinated attention mechanism to highlight UAV targets in the image, enhance useful features, and suppress useless features. Secondly, they design a feature aggregation structure to improve the representation of useful features, suppress interference, and improve detection accuracy. The improved algorithm performs well on both the publicly available DUT-Anti-UAV dataset and a self-built dataset, demonstrating its strong obstacle-detection capability.
Drawing on the improvement ideas proposed in the above literature on the YOLO series, this paper improves the YOLOv8s model and offers a new model suitable for tiny UAV object detection, which achieves high detection accuracy and speed on a challenging small UAV dataset and substantially reduces the size of the model and the number of parameters. This study provides a new approach for model deployment in the field of tiny UAV object detection.

YOLOv8 Network Structure
YOLOv8 builds upon the success of previous versions of YOLO and introduces new features and improvements to further enhance performance and flexibility, achieving top performance and exceptional speed. YOLOv8 offers five models of different sizes: nano, small, medium, large, and extra-large. The nano model has a parameter count of only 3.2 million, making it convenient to deploy on mobile and CPU-only devices. In order to balance detection accuracy and speed, this paper employs YOLOv8s as the model for UAV detection, which is obtained by deepening and widening the nano network structure. YOLOv8 is divided into the backbone, neck, and head, which are used for feature extraction, multi-feature fusion, and prediction output, respectively. The design of the YOLOv8 network is shown in Figure 1.
The feature extraction network mainly extracts features at individual scales from images using the C2f and SPPF modules. The C2f module removes one convolutional layer from the original C3 module, making the model more lightweight. It also incorporates the strengths of the ELAN structure from YOLOv7, effectively expanding the gradient branches using bottleneck modules to obtain richer gradient flow information [28]. SPPF reduces the network layers of SPP (spatial pyramid pooling) [29] to eliminate redundant operations and perform feature fusion more rapidly. The multi-scale fusion module adopts a combination of an FPN (feature pyramid network) [30] and a PAN (path aggregation network) [31]. By bi-directionally fusing low-level and high-level features, it enhances low-level features with smaller receptive fields and improves the detection capability for targets at different scales. The detection layer predicts target positions, categories, confidence scores, and other information. The head of YOLOv8 switches from an anchor-based to an anchor-free approach. It abandons IoU matching and single-side scale assignment and uses the task-aligned assigner for positive and negative sample matching. Ultimately, it performs multi-scale predictions using 8×, 16×, and 32× down-sampled features to achieve accurate predictions for small, medium, and large targets. The detailed modules in the YOLOv8 network are illustrated in Figure 2.
Electronics 2023, 12, x FOR PEER REVIEW 5 of 21

Improved YOLOv8 UAV Detection Model
YOLOv8 extracts the target features by using a deep residual network and completes the multi-scale prediction using the PAN structure. However, YOLOv8 still performs three down-sampling iterations when extracting features to obtain the maximum feature map. As a result, much of the target feature information that could be useful for detecting tiny targets is lost. Therefore, this paper improves YOLOv8 and proposes a network model for UAV micro-target detection; the improved network structure is shown in Figure 3. The specific improvement schemes are as follows. (1) We enhanced the detection capability of the model for tiny targets by adding a high-resolution detection branch in the detection head; meanwhile, the detection layer for large target prediction and its related feature extraction and fusion layers were cut, reducing the model parameters. (2) The multi-scale feature extraction module was improved by using SPD-Conv [32] instead of Conv to extract multi-scale features. (3) The GAM attention mechanism [33] was introduced into the multi-scale fusion module to enhance the model's fusion of target features.



A. Adding a tiny-target detection head
In this paper, the detection object is a low-flying UAV. When using a camera to capture the UAV image, the camera generally maintains a large field of view in order to prevent the flying UAV from leaving the frame. Hence, the proportion of the image occupied by the UAV is usually small. The original YOLOv8 backbone network down-samples a total of five times to obtain five levels of feature representations (P1, P2, P3, P4, and P5), where Pi denotes a resolution of $1/2^i$ of the original image. Although multi-scale feature fusion is achieved in the neck network via top-down and bottom-up aggregation paths, this does not affect the scale of the feature maps, and the final detection heads operate on P3, P4, and P5, whose feature map scales are 80 × 80, 40 × 40, and 20 × 20, respectively. In the small target detection task, there are often tiny targets to be detected. The TIB-Net data used in this paper contains many tiny UAV targets, usually smaller than 10 × 10 pixels in scale. Such targets have lost most of their feature information after multiple down-sampling operations and remain challenging to detect even with the highest-resolution P3 layer detection head.
To identify such micro-targets and obtain a better detection effect, we introduced a new detection head on the YOLOv8 model based on P2 layer features, called the micro-target detection head; the structure is shown in Figure 4. The resolution of the P2 layer detection head is 160 × 160, which corresponds to only two down-sampling operations in the backbone network and therefore contains richer information on the underlying features of the target. The two P2 layer features, obtained from the top-down and bottom-up paths in the neck network, are fused with the same-scale features from the backbone network by concatenation, so that the output features are the fused result of the three input features, which makes the P2 layer detection head fast and effective when dealing with tiny targets. The P2 layer detection head, together with the original detection heads, can effectively mitigate the negative effects of scale variance. The added detection head is specific to the underlying features and is generated from low-level, high-resolution feature maps, which are more sensitive to small targets. Although adding this detection head increases the computation and memory overhead of the model, it significantly improves the detection of tiny targets.
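The relationship between pyramid level, stride, and feature map size described above can be checked with a few lines of arithmetic. The following minimal sketch assumes a 640 × 640 input, which is inferred from the 80 × 80 map at 8× down-sampling and is not stated explicitly in the text; the function name `pyramid_sizes` and the 10-pixel target are illustrative choices.

```python
def pyramid_sizes(input_size=640, levels=5):
    """Return {level: (stride, side)} for pyramid levels P1..P<levels>.

    Level Pi has stride 2**i, i.e. a resolution of 1/2**i of the input.
    input_size=640 is an assumption inferred from the 80x80 P3 map.
    """
    return {i: (2 ** i, input_size // 2 ** i) for i in range(1, levels + 1)}


sizes = pyramid_sizes()
target_px = 10  # illustrative side length of a tiny UAV target, in pixels

for level, (stride, side) in sizes.items():
    cells = target_px / stride  # how many grid cells the target spans per side
    print(f"P{level}: stride {stride:2d}, map {side}x{side}, "
          f"target spans {cells:.2f} cells per side")
```

Under these assumptions, a 10-pixel target spans 2.5 cells per side on P2 but only 1.25 cells on P3, which is consistent with the motivation for adding the P2 head.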


B. Removing the large-target detection head
The large-target detection head at the P5 layer is obtained by down-sampling the image by a factor of 32. When the target size is smaller than 32 pixels, it is likely that at most one point of the target is sampled, or none at all. Therefore, the YOLOv8 large-target detection layer is redundant when detecting small-sized UAV targets. Based on this observation, this paper cuts the large-target prediction layer and the related feature extraction and feature fusion layers from the YOLOv8 network structure. It retains only the 4-fold, 8-fold, and 16-fold down-sampled feature maps for UAV prediction. In the improved network structure shown in Figure 3, the 16-fold down-sampled feature maps of the third C2f layer are fed directly into SPPF for multi-scale feature extraction. The Upsample-Concat-C2f module that followed is removed, so the fused feature maps are connected directly to the next module, and all network layers after the medium-target detection layer are discarded. This improved network structure reduces the computational bottleneck by removing redundant calculations while maintaining accuracy. The improved detection head is shown in Figure 4.
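As a rough illustration of this trade-off, the sketch below counts the prediction locations of the original head configuration (strides 8, 16, and 32) and the modified one (strides 4, 8, and 16), and computes how much of a single 32×-level cell a tiny target covers. It again assumes a 640 × 640 input; `head_cells` is a hypothetical helper, and the counts say nothing about parameter counts or measured speed.

```python
def head_cells(strides, input_size=640):
    """Map each head stride to its number of grid cells.

    An anchor-free head makes one prediction per grid cell, so this is
    also the number of prediction locations. input_size=640 is assumed.
    """
    return {s: (input_size // s) ** 2 for s in strides}


original = head_cells((8, 16, 32))  # P3, P4, P5 heads of the baseline
improved = head_cells((4, 8, 16))   # P2, P3, P4 heads after the modification

print("original:", original, "total", sum(original.values()))
print("improved:", improved, "total", sum(improved.values()))

# Fraction of one 32x32-pixel cell area covered by a 10x10-pixel target.
coverage = (10 * 10) / (32 * 32)
print(f"10x10 target covers {coverage:.1%} of a stride-32 cell")
```

Under these assumptions the modified configuration evaluates four times as many locations (33,600 versus 8,400), in line with the earlier remark that the P2 head adds computation, while a 10 × 10 target fills under a tenth of one P5 cell.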

Improvement of the Feature Extraction Module
When the image has good resolution and the detection object is of moderate size, the image contains enough redundant pixel information that strided convolution (i.e., stride > 1) can conveniently skip it, and the model is still able to learn features efficiently. However, in more complex tasks involving blurry images and small objects, the assumption of redundant information no longer holds, and the model starts to suffer from a loss of fine detail, which significantly impairs its ability to learn features. Small objects are challenging to detect because they are characterized by low resolution and offer limited information from which to learn patterns. In YOLOv8, the feature extraction module Conv is a strided convolutional layer, and its detection performance degrades rapidly in tasks with low image resolution or small detection objects. For this reason, the current paper introduces a CNN building block, SPD-Conv, in the feature extraction stage to replace the strided convolution layer. SPD-Conv consists of an SPD (space-to-depth) layer and a non-strided convolution layer and can be applied to most CNN architectures. In an earlier study [32], the authors introduced SPD-Conv into the backbone and neck of YOLOv5. They experimentally demonstrated that the method significantly improved performance in complex tasks dealing with low-resolution images and small objects. Drawing on the improvement ideas of that study for YOLOv5, we found experimentally that it is sufficient to introduce SPD-Conv only in the feature extraction module (i.e., backbone) of YOLOv8 to improve the detection of tiny UAV targets without adding too much redundancy, as shown in Figure 3. The SPD-Conv structure is shown for scale = 2 in Figure 5.
The SPD-Conv operation consists of two steps. First, the input feature map undergoes a space-to-depth transformation; subsequently, the transformed feature map is passed through a standard convolution. Figure 5 illustrates the process of slicing an input feature map with C1 channels. After slicing, four sub-feature maps are obtained, each retaining the same number of channels as the input feature map. As the scale is set to 2, the width and height of each sub-feature map are halved compared to the input. The sub-feature maps are concatenated along the channel dimension and then combined through a standard convolution with a stride of one, which ensures that all sub-feature information is preserved.
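The space-to-depth step can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the paper: it operates on nested lists of shape [C][H][W], the function name `space_to_depth` is hypothetical, the order in which the sub-maps are stacked is an arbitrary choice, and the subsequent non-strided convolution is omitted. For scale = 2, a [C1][H][W] input becomes [4·C1][H/2][W/2].

```python
def space_to_depth(feature_map, scale=2):
    """Rearrange a [C][H][W] nested list into [C*scale**2][H/scale][W/scale].

    Every input value appears exactly once in the output, so no pixel is
    discarded (unlike a strided convolution that skips positions).
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    if height % scale or width % scale:
        raise ValueError("height and width must be divisible by scale")

    out = []
    # One group of sub-maps per (row offset, column offset) pair;
    # each group keeps all of the original channels.
    for dy in range(scale):
        for dx in range(scale):
            for c in range(channels):
                sub = [row[dx::scale] for row in feature_map[c][dy::scale]]
                out.append(sub)
    return out


# Single-channel 4x4 map holding the distinct values 0..15.
x = [[[r * 4 + c for c in range(4)] for r in range(4)]]
y = space_to_depth(x, scale=2)

print(len(y), len(y[0]), len(y[0][0]))  # channels, height, width of the output
for sub_map in y:
    print(sub_map)
```

The output has four 2 × 2 sub-maps that together contain every input value once; in SPD-Conv a stride-1 convolution would then mix these channels, which is why no fine-grained information is skipped at the down-sampling step.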

Improvement of the Feature Fusion Module
GAM is a lightweight, practical, and simple attention module that can be seamlessly integrated into CNN architectures. Its primary purpose is to enhance the performance of deep neural networks by reducing information loss and amplifying global interaction representations within a given feature map. GAM follows the channel-then-spatial ordering of the CBAM attention mechanism. In an earlier work [33], GAM was integrated into various models across different datasets and classification tasks and produced significant performance improvements, underscoring its efficacy. As a plug-and-play module, GAM has been widely adopted; for example, in [34] it was inserted into the backbone and head of YOLOv7, enabling the network to extract critical features by amplifying the interaction of global dimensional features. The GAM structure is shown in Figure 6. Given an input feature map F1, GAM produces an intermediate state F2 and an output F3.

Because small targets have few and inconspicuous features, adding the GAM attention module to the feature fusion network amplifies global interaction and helps the network retain small-target features, while directly improving feature fusion in the neck. In the detection task, the GAM attention module helps the model extract the attention region effectively and improves detection performance.
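For reference, the two-stage mapping from F1 to F3 can be written as below. This is a sketch based on the usual presentation of GAM rather than a formula recovered from the text above: the symbols $M_c$ (channel attention map), $M_s$ (spatial attention map), and $\otimes$ (element-wise multiplication) are assumed notation.

```latex
\begin{aligned}
F_2 &= M_c(F_1) \otimes F_1, \\
F_3 &= M_s(F_2) \otimes F_2 .
\end{aligned}
```

Read this way, the channel attention first reweights F1 and the spatial attention then reweights the result, which matches the channel-then-spatial order described above.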

Experimental Preparation and Results
In this paper, we use the public UAV dataset TIB-Net [19] to evaluate the model's performance and introduce the dataset, network setup and training, evaluation index, ablation experiment, comparison experiment, and self-built dataset experiment.

Dataset Introduction
The TIB-Net UAV dataset comprises 2850 images of various types of UAVs, including multi-rotor and fixed-wing UAVs. The images were captured by a fixed camera on the ground at a distance of about 500 m from the drones, at a resolution of 1920 × 1080 pixels. They cover several low-altitude scenes (sky, trees, buildings, etc.) and include samples taken at different times of day and in different weather. As Figure 7 shows, the UAV occupies less than 1% of each image. Some of the samples are shown in Figure 8.
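To give a sense of scale for the "less than 1%" figure, the following sketch computes the fraction of a frame covered by a bounding box. The helper name `box_area_fraction` and the example box sizes are hypothetical values chosen for illustration; only the 1920 × 1080 frame size comes from the dataset description.

```python
def box_area_fraction(box_w, box_h, image_w=1920, image_h=1080):
    """Return the fraction of the image area covered by a box (0.0 to 1.0)."""
    if min(box_w, box_h, image_w, image_h) <= 0:
        raise ValueError("all dimensions must be positive")
    return (box_w * box_h) / (image_w * image_h)


# A hypothetical 144 x 144 pixel box covers 1% of a 1920 x 1080 frame.
fraction = box_area_fraction(144, 144)
print(f"{fraction:.2%}")  # 1.00%
```

Any drone whose box is smaller than that, for example 40 × 30 pixels, falls well below the 1% mark.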

Network Setup and Training
This section details the training of YOLOv8 and the modified YOLOv8 on the TIB-Net dataset. The experiments were run on an NVIDIA GeForce RTX 3070 graphics card with 8 GB of memory, using PyTorch 1.13.1 as the deep learning framework, Python 3.7.15, CUDA 11.7, and Ubuntu 22.04 as the operating system.

Loss Function Setting
The loss function of the improved YOLOv8 is the same as that of YOLOv8 and comprises the bounding box loss ($Loss_{box}$), the distribution focal loss ($Loss_{dfl}$), and the classification loss ($Loss_{cls}$):

$$Loss = a \cdot Loss_{box} + b \cdot Loss_{dfl} + c \cdot Loss_{cls}$$

where a, b, and c are the weights of the corresponding loss terms in the total loss. In this experiment, the three weights are a = 7.5, b = 1.5, and c = 0.5, respectively.
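As a small worked example, the weighted sum of the three loss terms can be sketched as follows. The function name `total_loss` and the input values are hypothetical, and the pairing of a, b, and c with the box, distribution focal, and classification terms (in the order they are listed above) is an assumption.

```python
def total_loss(loss_box, loss_dfl, loss_cls, a=7.5, b=1.5, c=0.5):
    """Weighted sum of the three loss terms; defaults are the weights reported above."""
    return a * loss_box + b * loss_dfl + c * loss_cls


# Hypothetical per-batch loss values, for illustration only.
print(total_loss(loss_box=1.0, loss_dfl=1.0, loss_cls=1.0))  # 9.5
```

With these weights the box term dominates the total, so localization errors are penalized more heavily than classification errors of the same magnitude.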

Network Training
Before training, the dataset images and labels are divided into a training set, validation set, and test set in a ratio of 7:1:2. The maximum number of training epochs is set to 150, with the first three epochs used for warm-up training. The SGD optimization strategy is employed for learning rate adjustment, with an initial learning rate of 0.01. Considering the presence of numerous tiny objects in the sample images and the need to balance real-time performance with accuracy in the detection process, the sample size is normalized to 640 × 640. This size allows the model to be deployed on edge devices without losing too much useful information from the images. To ensure fairness and the comparability of the model's performance, no pre-trained weights are used in the ablation or comparative experiments. Additionally, all training processes share consistent hyperparameter settings. The most important parameter settings for the training process are shown in Table 1.
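The 7:1:2 split can be turned into concrete image counts with a short helper. This is only a sketch: the name `split_counts` and the choice to let the last subset absorb any rounding remainder are assumptions, since the exact per-subset counts are not reported above.

```python
def split_counts(total, ratios=(7, 1, 2)):
    """Split `total` items according to integer ratios; the last part absorbs rounding."""
    ratio_sum = sum(ratios)
    counts = [total * r // ratio_sum for r in ratios[:-1]]
    counts.append(total - sum(counts))
    return counts


# 2850 TIB-Net images divided into training, validation, and test sets.
train, val, test = split_counts(2850)
print(train, val, test)  # 1995 285 570
```

The same helper accepts other ratios, for example `split_counts(n, (7, 3))` for a two-way training/validation split.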

Evaluation Indicators
To validate the model performance, P, R, AP, mAP, the number of parameters, model size, and frames per second (FPS) [35] are chosen as experimental evaluation indicators.
(1) Precision and recall are calculated as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

where TP (true positives) denotes the number of targets detected correctly, FP (false positives) denotes the number of backgrounds detected as targets, and FN (false negatives) denotes the number of targets detected as backgrounds.

(2) The average precision (AP) and mean average precision (mAP) are calculated as follows:

$$AP = \int_0^1 P(R)\,dR, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

where N is the number of categories and $AP_i$ is the average precision of the i-th category. In our UAV detection task, N = 1.
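The definitions above translate directly into code. The sketch below uses hypothetical detection counts, and returning 0.0 when a denominator is zero is a convention assumed here, not one stated in the paper.

```python
def precision(tp, fp):
    """P = TP / (TP + FP); returns 0.0 when nothing was detected."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN); returns 0.0 when there are no ground-truth targets."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 missed targets.
print(precision(tp=90, fp=10))  # 0.9
print(recall(tp=90, fn=30))     # 0.75
# With a single class (N = 1), mAP equals that class's AP.
print(mean_average_precision([0.8]))  # 0.8
```

False alarms lower P while missed targets lower R, which is why the two metrics are reported separately in the experiments.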

Ablation Experiments
In this section, ablation experiments on the TIB-Net dataset explore the effect of each added or modified module on the overall model. Starting from the original YOLOv8s as the baseline, the detection head, backbone, and neck were improved in turn. Six models were defined to analyze the contribution of each module: the baseline Model 1; Model 2 (added tiny-head); Model 3 (added tiny-head and cropped large-head); Model 4 (added tiny-head, cropped large-head, and SPD-Conv); Model 5 (added tiny-head, cropped large-head, and GAM); and Model 6 (added tiny-head, cropped large-head, SPD-Conv, and GAM). The changes in the evaluation metrics across these six models were examined quantitatively, and the best result for each metric was highlighted. The experimental results on the TIB-Net dataset are shown in Table 2. Referring to Table 2, it can be seen that:

1. Adding the tiny detection head improved P, R, and mAP by 10.8%, 13.5%, and 8.3%, respectively, indicating that the added high-resolution detection head effectively enhances the detection of tiny targets. After the large-target detection layer was trimmed off, the number of parameters fell by 70.2% and the model size by 67%, while R remained unchanged, P fell by 0.3%, and mAP fell by 0.9%. This indicates that the low-resolution detection head contributes little to the detection of tiny UAV targets and produces a large amount of network redundancy.

2. The results for Models 3, 4, 5, and 6 show that SPD-Conv had the greater effect on recall R, indicating that replacing Conv with SPD-Conv in the backbone better retains the features of tiny targets and reduces the probability of missing them. Adding GAM had the greater effect on precision P, indicating that the GAM attention module in the neck benefits the network's feature fusion and reduces the probability of false detections. When both SPD-Conv and GAM were added, P, R, and mAP all improved, although the number of parameters and the model size increased slightly.

3. Comparing Model 6 (i.e., our model) with Model 1 (i.e., the base model), as shown in Figure 9, the tiny-head, SPD-Conv, and GAM modules added some inference time, so the FPS of the improved model is 221, lower than the 285 of the base model; however, this still meets the real-time requirement in actual deployment. In addition, our model is significantly better than the base model in P, R, mAP, number of parameters, and model size: P, R, and mAP improved by 11.9%, 15.2%, and 9%, respectively, while the number of parameters and the model size decreased by 59.9% and 57.9%, respectively, demonstrating the effectiveness and practicality of the improved model.
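To put the FPS figures from the ablation study in more tangible terms, they can be converted to an average per-frame latency. This is a back-of-the-envelope sketch: the helper name `latency_ms` is hypothetical, and the latencies are derived from the reported FPS values rather than measured.

```python
def latency_ms(fps):
    """Average time per frame in milliseconds for a given frame rate."""
    return 1000.0 / fps


baseline_fps, improved_fps = 285, 221  # values reported in the ablation study
print(f"{latency_ms(baseline_fps):.2f} ms")  # 3.51 ms
print(f"{latency_ms(improved_fps):.2f} ms")  # 4.52 ms

# Relative reduction in throughput caused by the added modules.
relative_drop = (baseline_fps - improved_fps) / baseline_fps
print(f"{relative_drop:.1%}")  # 22.5%
```

The added modules therefore cost roughly one extra millisecond per frame on the test hardware.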
To observe the detection effect of the improved model more intuitively, the base model YOLOv8s and the improved model proposed in this paper were both used for drone detection; the comparison results are shown in Figures 10 and 11. The comparison reveals that YOLOv8s exhibits missed detections when the UAVs are very small or blend into the background, as shown in Figure 10a,c,e, and false detections, as shown in Figure 11a,c,e, where they are highlighted by yellow boxes. In contrast, the improved model accurately detects small UAV targets against complex backgrounds such as buildings and trees. Our method also significantly improves the confidence of the detected UAVs: in Figure 10b the confidence reached 0.96, and between Figure 11e and Figure 11f it increased from 0.27 to 0.82. Therefore, the improved model effectively addresses missed and false detections of small UAV targets against complex backgrounds.

Comparative Experiments
To further verify the advantages of the proposed algorithm, it was compared on the TIB-Net dataset with four advanced YOLO-series algorithms (YOLOv5-S [36], YOLOX-S [37], YOLOv7 [38], and YOLOv7-tiny), selected with both lightweight model size and detection performance in mind. To fully reflect the superiority of our model, the TIB-Net [19] model was also included in the comparison. The training parameters of the comparison experiments followed Table 1, and the evaluation metrics were those listed in Table 3. All selected models are official versions. The results of the comparison experiments are shown in Table 3. According to Table 3, it can be seen that:

1. Comparing YOLOv7 and YOLOv7-tiny shows that, although the number of parameters and the model size of YOLOv7 are much higher than those of the other models, its P, R, and mAP are the worst. Conversely, YOLOv7-tiny achieves good detection accuracy with fewer parameters and a smaller model size. The reason is that the drones in the TIB-Net dataset are small and the images contain few drone features, so the more complex YOLOv7 network may learn many useless background features, which in turn leads to poorer detection results.

2. The TIB-Net detection network is at the other extreme: it maintains good detection accuracy with far fewer parameters and a much smaller model size than the other models. Its disadvantage is also apparent: the FPS is only 5, far from meeting the needs of real-time UAV detection.

3. YOLOv5-s yields the best overall performance apart from our model: its FPS of 256 is the highest of all models, and its P and R values are well balanced. YOLOX-S also detects well, but its R and FPS are slightly lower than those of YOLOv5-s, and its model size is too large.

4. The improved model proposed in this paper outperforms the other models in terms of P, R, and mAP. It also ranks near the top in number of parameters, model size, and FPS: its parameter count and model size are larger only than those of the TIB-Net network, and its FPS is slightly lower than that of YOLOv5-s and YOLOv7-tiny but still meets the deployment requirements of real-time detection.
Overall, the tiny UAV detection network proposed in this paper achieves better detection accuracy, model size, and detection speed and can meet the specifications of practical engineering applications.

Self-Built Dataset Experiment
To evaluate the generalization performance of the model, we captured UAV flight images with cameras in different scenes and at different times and collected a total of 1091 images of low-altitude scenes featuring various UAV models from major video sites such as YouTube and other web channels to build a new dataset. Figure 12 shows that most of the drones in the self-built dataset also occupy less than 1% of each image, although, compared with Figure 7, they are larger than the drones in the TIB-Net dataset. In addition, many new UAV images taken from high altitudes were added to increase the diversity of the dataset. Whereas most images in the TIB-Net dataset are set against the sky, the backgrounds in the self-built dataset are more complex, as shown in Figure 13, where the drone blends in with mountains or plants.

In the self-built dataset experiments, the new dataset was divided into training and validation sets in a ratio of 7:3. To be consistent with the TIB-Net experiments, the images were first resized to 640 × 640 for training, and the training parameters were those in Table 1. The experimental results are shown in Table 4 and Figure 14.

As can be seen from Table 4, the P, R, and mAP of the improved model on the new dataset were 97%, 89.5%, and 95.3%, respectively, about 8.2%, 15.6%, and 10.1% higher than those of the model before improvement. Comparing Tables 2 and 4, the P of the improved model was 3.7% higher on the new dataset because the UAV targets in the new dataset are generally larger than those in the TIB-Net dataset. However, the image backgrounds in the new dataset are more complex, so the R of the improved model was 3.8% lower on the new dataset. Overall, the improved model still achieves high detection accuracy, which shows that our method generalizes well. The actual detection results are shown in Figure 15.

Conclusions and Outlook
To address the problem that tiny UAV targets are challenging to detect, this paper proposes an improved YOLOv8 detection model that can accurately detect UAV targets in images while remaining suitable for deployment on edge devices. The model overcomes the adverse effects of UAV size, airspace background, light intensity, and other factors on the detection task. Specifically, first, in the detection head, a high-resolution detection head is added to improve the detection of tiny targets, while the large-target detection head and redundant network layers are cut off to effectively reduce the number of network parameters and improve the UAV detection speed. Second, in the feature extraction stage, SPD-Conv is used instead of Conv to reduce the loss of fine-grained information and enhance the model's feature extraction for small targets. Finally, the GAM attention mechanism is introduced in the neck to improve the model's fusion of target features, thus improving its overall performance for UAV detection. Ablation and comparison experiments were conducted on the complex TIB-Net dataset. Compared with the baseline model, our method improved P, R, and mAP by 11.9%, 15.2%, and 9%, respectively. Meanwhile, the number of parameters and the model size were reduced by 59.9% and 57.9%, respectively. In addition, the detection model achieved better results in the comparison experiments and the self-built dataset experiments. In conclusion, our method is more suitable for engineering deployment and the practical application of UAV target detection systems.
However, because the model adds an extra detection head and uses both the SPD-Conv and GAM modules, the inference time increased and the FPS decreased compared to the baseline model. In addition, the self-built dataset experiments show that R decreases when the airspace background is more complex, i.e., the probability of missed detections increases. Follow-up work will therefore be devoted to improving the detection accuracy in more complex airspace backgrounds while reducing the model inference time.
(2) The multiscale feature extraction module was improved by using SPD-Conv instead of Conv to extract multiscale features. (3) The GAM attention mechanism was introduced into the multiscale fusion module to enhance the model's fusion of target features.

Figure 4. Improvement scheme at the head.

Figure 7. Proportion of drone size in the image (darker colors mean more drones).

Figure 8. Display of dataset diversity. (a) Multi-rotor drone; (b) fixed-wing drone; (c-f) several difficult samples, which contain extremely small drones, blurred drones, or complex environments.

Figure 9. Comparison graph between our model and the YOLOv8s experiment (parameters, model size, and FPS are normalized separately).
In Figures 10 and 11, the detection results of YOLOv8s are shown on the left, and the detection results of the improved model are shown on the right. The UAV position and confidence level are indicated by rectangular boxes and text, respectively, and the details of the area where the UAV is located are shown in the upper right or lower right corner of each image.

Figure 10. Missed detections. The left side (a,c,e) shows some of the missed detection results of YOLOv8s; the right side (b,d,f) shows the detection results of the improved model on the same images.

Figure 11. False detections. (a,c,e) show some of the false detection results of YOLOv8s, marked by the yellow boxes; (b,d,f) show the detection results of the improved model on the same images.

Figure 12. Size of self-built dataset drones (darker colors mean more drones).

Figure 13. Selected samples from the self-built dataset. (a-c) show drone imagery from different time periods; (d-i) show several difficult samples, including very small drones, drones photographed from a high altitude, or complex environments.

Figure 14. Comparison graph between our model and the YOLOv8s experiment (self-built dataset).

Table 1. Important parameter settings.

Table 2. Results of the various ablation experiments.

Table 3. Comparison of experimental results.