Object detection algorithms have developed rapidly in recent years with the advancement of image processing technology and artificial intelligence. Owing to their light weight and speed, drones have been widely applied in fields such as agriculture, medicine, and urban inspection. However, the various objects of interest in unmanned aerial vehicle (UAV) images, such as pedestrians and clusters of flowers, are generally small in scale because these images are captured from high altitudes. Moreover, these small targets are easily affected by environmental interference, which hinders detection using conventional object-detection algorithms. Therefore, enhancing the capability to detect small targets in UAV aerial images [1,2,3] has become a challenging research direction in the field of object detection.
Currently, object detection algorithms are being continuously improved and refined [4]. Xie et al. [5] proposed Drone-YOLO, an improved YOLOv5-based algorithm for small object detection in UAV images. They added a detection branch and designed a feature pyramid network with multi-level information aggregation. They also introduced a feature fusion module to decouple the classification and regression tasks in the prediction head and adopted the Alpha-IoU loss function to improve the model's detection performance. The proposed algorithm outperformed other mainstream models in detecting small objects and could effectively handle small-target detection in UAV aerial images.
Zhang et al. [6] proposed an improved YOLOv5-based algorithm for vehicle and pedestrian identification that incorporates the Ghost Bottleneck module [7] to compress network parameters and reduce the model's overall computational workload. The algorithm also improved inference speed, yielding significantly higher detection accuracy and speed.
Li et al. [8] proposed an improved algorithm for real-time detection of small objects in UAV images to enhance the safety of autonomous landings and the capability for target identification. The algorithm added a detection head and replaced the PANet structure with BiFPN to improve detection performance across scales. They also replaced the CIoU loss with the EIoU loss to improve the overall performance of the model while accelerating bounding-box regression. The improved algorithm was applied to detect QR-code landing markers in UAV autonomous landing scenarios and exhibited stronger feature extraction capability and higher detection accuracy than the original algorithm.
Tian et al. [9] proposed a small-target recognition algorithm with lightweight improvements to the YOLOv4 network and tested it on the VisDrone dataset. The improved algorithm was 1.5% more accurate and 3.3 times faster, making it more effective and practical. The algorithm could also assess the KCF tracking state by analyzing the response value and update the template with an adaptive learning rate. Experiments showed that the algorithm could stably track small, distant targets.
Cheng et al. [10] proposed Fast-YOLOv4, a real-time UAV target detection algorithm based on edge computing. On the NVIDIA Jetson Nano edge-computing platform, Fast-YOLOv4 was used to analyze video intelligently and detect UAV targets rapidly. The model was obtained by improving YOLOv4 with the lightweight MobileNetV3 network, Multiscale-PANet, and soft merging; compared with the original model, its detection accuracy and speed improved significantly.
Li et al. [11] proposed a Densely Nested Attention Network (DNA-Net) for single-frame infrared small target (SIRST) detection, using a Densely Nested Interaction Module (DNIM) to achieve progressive interaction between high-level and low-level features. Building on the DNIM, a Cascaded Channel and Spatial Attention Module (CSAM) was proposed to adaptively enhance multi-level features, achieving better performance in terms of IoU.
Ibrokhimov et al. [12] proposed a new two-stage deep learning method. In the first stage, the target area is extracted and small squares are generated to narrow down the region of interest (RoI); in the second stage, the targets are detected and classified into BI-RADS categories. To improve within-class classification accuracy, a dedicated classification model was designed and its results were combined with the classification score of the detection model. The method improves the mean average precision (mAP) by 0.09 over the original model, and the performance of the two-stage model was further verified by comparison with existing work.
Šavc et al. [13] proposed SCN-EXT, a convolutional neural network for skull landmark detection based on SpatialConfiguration-Net, extended with a simpler local-appearance component and a replicated spatial-configuration component. By increasing CNN capacity without increasing the number of free parameters, the method achieved a significant improvement of about 3% on the AUDAX database.
Li et al. [14] proposed a residual convolutional neural network for pest recognition based on transfer learning. Data augmentation was performed with random cropping, color transformation, CutMix, and other operations. The classification accuracy of the ResNeXt-50 (32 × 4d) model was compared under different combinations of learning rate, transfer learning, and data augmentation, and the effect of augmentation on the classification performance for different sample classes was also examined. The results showed that models trained with transfer learning generally outperformed models trained from scratch.
This paper proposes an algorithm for small object detection in UAV images based on an improved YOLOv5s network, inspired by the literature discussed above. The proposed algorithm incorporates several improvements: a coordinate attention mechanism is introduced after the backbone convolution operations to enhance the feature extraction capability; the original convolution is replaced with SPD-Convolution to improve the detection of low-resolution and small objects; transposed convolution is used instead of nearest-neighbor interpolation in the neck to increase the network's receptive field and prevent issues such as decreased image resolution and loss of detail; and the CIoU loss is replaced with the Alpha-IoU loss to accelerate the model's gradient convergence. The experimental results demonstrate that the improved algorithm detects low-resolution and small objects better than the original algorithm.
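To make these components concrete, the following PyTorch sketch illustrates an SPD-Convolution block (space-to-depth rearrangement followed by a non-strided convolution), the basic Alpha-IoU loss, and a transposed-convolution upsampling layer. The scale factor of 2, the 3 × 3 kernel, α = 3, the channel width of 256, and the names SPDConv and alpha_iou_loss are illustrative assumptions, not the exact settings used in the experiments.

```python
import torch
import torch.nn as nn


class SPDConv(nn.Module):
    """Sketch of SPD-Convolution: space-to-depth followed by a non-strided conv.

    Each s x s spatial block is folded into the channel dimension, so the
    feature map is downsampled without discarding pixels (unlike a strided
    convolution), which helps preserve detail for small objects.
    """

    def __init__(self, in_channels: int, out_channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # Illustrative 3x3 non-strided convolution after the rearrangement.
        self.conv = nn.Conv2d(in_channels * scale * scale, out_channels,
                              kernel_size=3, stride=1, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.scale
        # (B, C, H, W) -> (B, C*s*s, H/s, W/s)
        x = torch.cat([x[..., i::s, j::s] for i in range(s) for j in range(s)], dim=1)
        return self.conv(x)


def alpha_iou_loss(iou: torch.Tensor, alpha: float = 3.0) -> torch.Tensor:
    """Basic Alpha-IoU loss, 1 - IoU^alpha (the CIoU penalty terms are omitted).

    With alpha > 1, high-IoU boxes are up-weighted, which tends to speed up
    convergence of the bounding-box regression.
    """
    return 1.0 - iou.clamp(min=1e-7) ** alpha


# Transposed convolution used in the neck in place of nearest-neighbor
# upsampling; a 2x2 kernel with stride 2 doubles the spatial resolution
# (the channel width of 256 is an illustrative choice).
upsample = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
```

Because the space-to-depth step keeps every pixel, fine-grained information about small targets survives downsampling, whereas a strided convolution or pooling layer would discard part of it; the learnable transposed convolution plays the analogous role on the upsampling path.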