Object Detection of UAV Images from Orthographic Perspective Based on Improved YOLOv5s

Lu, Feng; Li, Kewei; Nie, Yunfeng; Tao, Yejia; Yu, Yihao; Huang, Linbo; Wang, Xing

doi:10.3390/su151914564

Open AccessArticle

Object Detection of UAV Images from Orthographic Perspective Based on Improved YOLOv5s

by

Feng Lu

¹,

Kewei Li

^2,*,

Yunfeng Nie

¹,

Yejia Tao

¹,

Yihao Yu

¹,

Linbo Huang

¹ and

Xing Wang

³

¹

College of Information Engineering, Nanchang Hangkong University, Nanchang 330063, China

²

Institute of Digital Economy, Nanchang Hangkong University, Nanchang 330063, China

³

School of Atmosphere Science, Nanjing University, Nanjing 210023, China

^*

Author to whom correspondence should be addressed.

Sustainability 2023, 15(19), 14564; https://doi.org/10.3390/su151914564

Submission received: 24 August 2023 / Revised: 13 September 2023 / Accepted: 18 September 2023 / Published: 7 October 2023

Download

Browse Figures

Versions Notes

Abstract

:

Object detection methods of UAV (Unmanned Aerial Vehicle) images are greatly improved with the development of UAV technology. In comparison, the existing object detection methods of UAV images lack outstanding performance in the face of challenges such as small targets, dense scenes, sparse distribution, occlusion, and complex background, especially prominent in the task of vehicle detection. This paper proposed an improved YOLOv5s method to perform vehicle detection of UAV images. The CA (Coordinate Attention) is first applied to the neck of YOLOv5s to generate direction-aware and position-sensitive feature maps, respectively, to improve the detection accuracy of sparsely distributed vehicle targets in complex backgrounds. Then, an improved PAFPN (Path Aggregation Feature Pyramid Network) at the neck of YOLOv5s is proposed for more efficient detection of small and dense vehicle targets. Finally, the CIoU (Complete Intersection Over Union) loss function was used to calculate the bounding box regression to obtain a more comprehensive overlap measure to accommodate different shapes of vehicle targets. We conducted extensive experiments on the self-built UAV-OP (Unmanned Aerial Vehicle from Orthographic Perspective) dataset. The experimental results show that our method achieves the best detection accuracy with a small quantity of calculation increase compared with YOLOv5s. The mAP50 improved by 3%, and the mAP50:95 improved by 1.7% with a 0.3 GFlops increase.

Keywords:

UAV images; YOLOv5s; deep learning; coordinate attention; improved PAFPN; CIoU loss

1. Introduction

With the continuous evolution and maturation of UAV technology [1,2], the utilization of UAV-captured images is expanding into various domains, including, but not limited to, agriculture [3,4], forestry [5], public safety [6], and transportation [7]. Utilizing UAV-based vehicle detection technology, these UAVs can effectively pinpoint targets in need of rescue, including vehicles stranded in rugged terrain or disaster-stricken areas. Furthermore, UAVs can also be applied to traffic monitoring, including traffic flow surveillance, parking lot management, as well as monitoring traffic accidents and violations. Therefore, in-depth research into UAV-based vehicle detection technology is of paramount importance for search and rescue operations and the realization of intelligent traffic management systems. Meanwhile, intelligent traffic management systems can help cities manage traffic more efficiently, reduce resource wastage, and enhance traffic safety, thereby promoting sustainability. This represents an urgent and essential area of study.

However, vehicle detection of UAV images poses numerous challenges, including small and densely packed vehicles, sparse distribution, occlusion, and complex backgrounds. Figure 1 demonstrates those challenges mentioned above. In densely populated parking lots and on highways, vehicles in UAV images tend to be small and dense, with unclear details and low resolution, making accurate detection and identification more challenging. In addition, in certain towns or areas with low traffic volume, the limited and scattered presence of vehicles poses challenges for vehicle identification and location, potentially resulting in missed detections. Furthermore, the existence of various objects can cause vehicles to be partially obscured or truncated, leading to the incomplete appearance of targets and the loss of crucial features, thereby presenting a significant challenge for vehicle detection algorithms. Additionally, the intricate structures in the background, such as complex buildings, roads, and intersections, further increase the difficulty of vehicle detection. In addition to the above, there are challenges such as orientation, shadows, varying illuminations [8], etc. Undoubtedly, vehicle detection in UAV images does present distinctive challenges, given the varying viewing angles, heights, and background complexities that differ from those encountered with traditional ground-based cameras. This makes traditional object detection algorithms problematic to apply directly to UAV images and requires innovative solutions to overcome these challenges.

In recent years, the swift progress of deep learning technology has resulted in notable advancements within the realm of computer vision, particularly with respect to object detection algorithms that rely on convolutional neural networks (CNNs). Deep learning technology offers substantial potential for enhancing the accuracy and efficiency of object detection in UAV images. Currently, object detection algorithms based on deep learning can be broadly categorized into two main groups: two-stage and one-stage. The two-stage detection algorithms initially generate region proposals and subsequently conduct classification on these region proposals. The typical two-stage detection algorithm is R-CNN series algorithm, including R-CNN [9], SPP-net [10], Fast R-CNN [11], Faster R-CNN [12], Cascade R-CNN [13], Mask R-CNN [14], and R-FCN [15], etc. The one-stage detection algorithm is a regression-based target detection method that eliminates the necessity for region proposals and directly outputs the location and category information of the bounding box. Common one-stage object detection algorithms include YOLO [16,17,18,19], SSD [20], and RetinaNet [21]. Of the two types of object detection algorithms, the one-stage method offers faster prediction speed but lower accuracy than the two-stage method. Despite significant advancements in ground-based object detection, deep learning algorithms continue to encounter multiple challenges in the field of UAV images.

Numerous scholars have conducted extensive research on the object detection algorithms of UAV images. Building upon the two-stage method Faster R-CNN, some scholars proposed improved algorithms to address the challenges of object detection of UAV images [22,23,24,25]. To detect vehicles in UAV images using CNNs, Ammar A et al. [26] conducted a comparative analysis of Faster R-CNN, YOLOv3, and YOLOv4 algorithms on the PSU (Prince Sultan University) and Standford datasets. The experimental results demonstrated that YOLO outperformed Faster R-CNN in most configurations. Luo X et al. [27] employed the K-means++ algorithm to enhance the recognition performance of YOLOv3 algorithms, followed by utilizing Soft-NMS to mitigate the issue of error multi-frame suppression caused by NMS (non-maximum suppression). Luo X et al. [28] proposed YOLOD, which incorporates the Improved Efficient Channel Attention (IECA) module into the backbone of YOLOv4. The Spatial Pyramid Pooling (SPP) module is replaced by Group Spatial Pyramid Pooling (GSPP), and an Adaptive Spatial Feature Fusion (ASFF) module is added to the end of the model. These modifications significantly boost the detection performance for numerous small targets against complex backgrounds in UAV images. Based on the YOLOv5 object detection algorithm, Feng J et al. [29] mainly proposed the CSP BoT (cross-stage partial bottleneck transformer), Angle classification prediction branch, and ASFF-Head to detect vehicles with uncertain backgrounds and in any direction under different resolutions and illuminations. Wang D et al. [30] employed transfer learning to utilize the YOLOv5s algorithm for the detection of apple fruits. They applied a channel pruning algorithm to trim the YOLOv5s model and fine-tuned the pruned model to achieve fast and accurate detection of apple fruits. Jawaharlalnehru A et al. [31] proposed an improved YOLO algorithm, employing methods such as object box dimension clustering, classification of the pre-trained network, multi-scale detection training, and modification of candidate box filtering rules, which have the potential to better adapt to localization tasks and improve detection results. Zhang H et al. [32] proposed an improved YOLOv5-based object detection algorithm. Firstly, they utilized multiple improved Gabor convolutional kernels (filters) to enhance the edges of preprocessed objects from various directions. Secondly, they added Coordinate Attention (CA) to the backbone of YOLOv5, enabling the network to accurately capture long-range dependencies in positions. Thirdly, they replaced PANet with an improved BiFPN, removing nodes with minor contributions to feature fusion and introducing an additional feature branch to gain more features at a small computational cost. This balanced the contributions of each layer and contributed to faster and more efficient training.

For the above research, the scenes of most datasets are not abundant, and the perspective of UAV images is not orthographic. Vehicle detection and classification will be implemented from a higher and absolute orthographic perspective within diverse scenes. UAV images from the orthographic view are illustrated in Figure 1. Although there are more advanced variants in the YOLO series currently, such as YOLOv7, YOLOv5 is still chosen as the baseline model for our method. YOLOv5 surpasses YOLOv7 in terms of training and inference speed, along with lower memory usage. These qualities make it highly suitable for meeting the high real-time requirements of UAV applications. The lightweight model and simplicity of YOLOv5 also make it easier to deploy. This paper proposes an object detection method based on improved YOLOv5s to make it more suitable for vehicle object detection in UAV images. In this paper, the main contribution can be described as follows:

(1): CA (Coordinate Attention) was introduced into the neck of YOLOv5s to enable the network to pay more attention to the features from various positions, allowing the network to better comprehend the spatial information of objects and thereby enhancing the model’s perception of positional information.
(2): An improved PAFPN was proposed to strengthen the features yield by different levels in a cascading way so that the lower feature can communicate with the higher feature more directly through a short path, thus enhancing the localization information of the whole feature hierarchy.
(3): CIoU loss is applied for the network to better adapt to vehicles with different shapes while alleviating the issues caused by category imbalances. Simultaneously, it accelerates model convergence and leads to a more precise predicted bounding box.
(4): A self-built UAV-OP (Unmanned Aerial Vehicle from Orthographic Perspective) dataset, which has more than 3500 valid high-resolution UAV images labeled with five categories with 20,461 vehicle objects, was built to conduct experiments to validate the model’s performance, including detection accuracy, parameters, and GFlops.

The subsequent sections of this paper are structured as follows. Section 2 presents the architecture of the YOLOv5 model and the proposed method. Section 3 introduces the datasets, experimental environment, and experiment execution. The summary of this thesis is provided in Section 4. The architecture of this paper is presented in Figure 2.

2. Materials and Methods

2.1. The Overview of YOLOv5

In this section, YOLOv5 (You Only Look Once version 5), the fifth generation of the YOLO object detection algorithm series, released by Glenn Jocher in 2020, is succinctly introduced. YOLOv5 offers high detection speed and accuracy. Currently, there are five versions of the YOLOv5 object detection algorithm: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. YOLOv5 is composed of four primary components: input, backbone, neck, and head, as illustrated in Figure 3.

2.1.1. Input

As for data preprocessing, the YOLOv5 employs a crucial technique called Mosaic data augmentation. Specifically, it merges several images to generate a novel training image by applying random adjustments such as scaling, cropping, and arrangement. This approach significantly boosts data diversity, enriches image backgrounds, and enhances the network’s ability to detect small targets effectively. Additionally, during the batch normalization process, several pictures are computed simultaneously, allowing for achieving excellent detection results without requiring an excessively large batch size. Moreover, unlike certain YOLO series algorithms that relied on a separate program to pre-calculate initial anchor box values tailored to specific datasets, YOLOv5 dynamically computes the optimal anchor box sizes for various training sets during each training iteration, adapting as needed. Furthermore, YOLOv5 has the capability to automatically scale images in an adaptive manner, eliminating the need to resize images of different shapes to a standard size before inputting them into the network. These enhancements and adaptive features make YOLOv5 a versatile and highly effective object detection algorithm capable of adapting to different datasets and scenarios.

2.1.2. Backbone

The YOLOv5 adopts CSPDarknet53 as its backbone, which primarily consists of the Focus, CBS, C3, and SPPF modules. Images are initially processed by the Focus before being sequentially fed into the CBS and C3. Finally, the output is generated through the SPPF. The operation of the Focus resembles down-sampling as it utilizes slicing operations to divide high-resolution feature maps into multiple low-resolution feature maps, essentially employing interlaced sampling and splicing. The specific operation of Focus slicing is illustrated in Figure 4. After inputting the original 640 × 640 × 3 image into Focus, a slice operation is initially employed to reduce the size of the feature map by half while simultaneously expanding its output channel fourfold from 3 channels to 12 channels. The output feature map of size 320 × 320 × 12 undergoes a CBS convolutional operation to produce the final feature map of size 320 × 320 × 64. The CBS convolution operation comprises the convolution, batch normalization, and SiLU activation function. The C3 layer plays a crucial role in the YOLOv5 by increasing its depth and RF (receptive field), thereby enhancing feature extraction capabilities. The YOLOv5 incorporates different C3 in its backbone and neck components. By default, the C3 Bottleneck module in the backbone utilizes a shortcut, whereas the neck’s C3_F Bottleneck module does not employ such a shortcut. The SPPF is proposed based on SPP, which in turn takes inspiration from the space pyramid approach. By integrating local and global features, SPP enriches the expressive capability of features, thereby facilitating the detection of objects with significant size variations in images. The structures of SPP and SPPF are shown in Figure 5. SPPF (Spatial Pyramid Pooling-Fast) utilizes three max-pooling layers with a kernel size of 5 to replace the original max-pooling layers with kernel sizes of 5, 9, and 13 in SPP. This approach helps retain feature maps with various receptive fields, enhancing their expressive capabilities while also improving the overall processing speed to a certain extent.

2.1.3. Neck

The neck is positioned between the backbone and head in the YOLOv5. Within the neck, YOLOv5 adopts the PAFPN (Path Aggregation Feature Pyramid Network). The PAFPN introduces a bottom-up pathway enhancement called PAN, which is built upon the FPN (Feature Pyramid Network). The low-level features in a network contain more positional information, while high-level features contain more semantic information. The FPN propagates semantic information from high-level features to low-level features in a top-down manner, while PAN propagates positional information from low-level features to high-level features in a bottom-up manner. This bidirectional fusion of information allows the network to effectively utilize both semantic and positional information, enhancing the feature representation capabilities of the network. The detailed introduction of PAFPN can be found in Section 2.2.2.

2.1.4. Head

The head is mainly used for multi-scale object detection on the feature map obtained from the neck. The channel expansion of 1 × 1 convolution is performed on the feature map obtained from the neck, and 3 × (classes + 5) feature channels are obtained. Among them, the number 3 represents that there are three anchors on each detection layer, and anchor boxes of different proportions and sizes can adapt to targets of different sizes; the number 5 represents the horizontal coordinate of the center point, vertical coordinate of the center point, width, height, and confidence of the predicted targets; classes represent the number of categories. There are three detection layers, which produce feature maps of sizes 20 × 20, 40 × 40, and 80 × 80, respectively, corresponding to object scales of 32 × 32, 16 × 16, and 8 × 8 pixels. The three detection layers are responsible for detecting large targets, medium targets, and small targets, respectively.

2.2. Improved YOLOv5s

2.2.1. Coordinate Attention

In the context of UAV images, detecting vehicles poses challenges due to small target sizes, sparse distributions, and complex backgrounds. To mitigate those issues, the utilization of the Coordinate Attention mechanism has been employed. This mechanism is an effective strategy that considers both channel and position information simultaneously, enabling cross-channel information capture as well as direction-aware and position-aware. The Coordinate Attention mechanism is capable of capturing channel correlations between features more effectively, enabling the network to focus on relevant target features while disregarding those related to complex backgrounds when dealing with small targets. This enhances the ability of the network to distinguish between the target and background, leading to an overall improvement in detection performance. Accurate direction and location information are crucial for target localization of small vehicles that are sparsely distributed and have varying orientations. The direction-aware and position-aware capabilities of the Coordinate Attention mechanism enable more precise targeting, thereby enhancing recognition capabilities.

The structure of the Coordinate Attention mechanism is illustrated in Figure 6. To preserve precise positional information while globally encoding spatial information, CA attention decomposes global pooling into a pair of one-dimensional features encoding operations, separately encoding each channel of the input feature F along the horizontal coordinates and vertical coordinates. This operation generates a pair of outputs for the c-th channel at width (w) and height (h), as illustrated by Equations (1) and (2).

z_{c}^{h} (h) = \frac{1}{W} \underset{0 \leq i < W}{\sum x_{c}} (h, i)

(1)

z_{c}^{w} (w) = \frac{1}{H} \underset{0 \leq j < H}{\sum x_{c}} (j, w)

(2)

Next, the pair of feature maps are concatenated along the spatial dimension to produce a feature map with dimensions C × 1 × (H + W). This new feature map is then passed through a shared 1 × 1 convolution to reduce its dimensionality to C/r, where r represents the reduction ratio used to adjust the block size. The output is subsequently batch-normalized and processed using the nonlinear activation function, resulting in a final feature map (f), as shown in Equation (3).

f = δ (F_{1} ([z^{h}, z^{w}]))

(3)

where

f \in R^{C / r \times (H + W)}

,

δ

is the nonlinear function,

F_{1}

is the convolution of 1 × 1,

Z^{h}

is the feature map after the average pooling at height h, and

Z^{w}

is the feature map after the average pooling at height w.

Subsequently, the feature map f is divided along different spatial dimensions to generate two tensors,

f^{h}

and

f^{w}

. These two tensors are individually transformed through convolutional operations into tensors with the same number of channels as the input F. Then, they are separately passed through a Sigmoid activation function to produce the attention weights at height h and width w:

g^{h}

and

g^{w}

, respectively. The mathematical expressions are shown as Equations (4) and (5).

g^{h} = σ (F_{h} (f^{h}))

(4)

g^{w} = σ (F_{w} (f^{w}))

(5)

where σ is the Sigmoid function,

F_{h}

and

F_{w}

are the convolution of 1 × 1,

f^{h} \in R^{C / r \times H}

, and

f^{w} \in R^{C / r \times W}

.

The attention weights

g^{w}

and

g^{h}

obtained from the previous calculations are utilized for multiplicative weighting calculation on the input feature map F. Finally, feature maps with attention weights in the horizontal and vertical directions are obtained, effectively incorporating spatial information with attention. The mathematical expression is shown as Equation (6).

y_{c (i, j)} = x_{c (i, j)} \times g_{c}^{h} (i) \times g_{c}^{w} (j)

(6)

The Coordinate Attention mechanism has been integrated into the neck of YOLOv5s. The improved YOLOv5s with Coordinate Attention is illustrated in Figure 7.

2.2.2. Improved PAFPN

As the depth increases, the network becomes capable of acquiring more abstract and intricate feature representations, thereby improving the predictive performance of detection tasks. However, this deep network structure also poses certain challenges. Firstly, as the network deepens, the positional information of targets becomes increasingly ambiguous, potentially leading to a decrease in prediction accuracy. Secondly, the representation of small objects becomes weak in feature maps and susceptible to information loss during continuous convolution, resulting in a decline in the network’s ability to detect them. To alleviate these issues, targeted measures need to be taken, and one crucial strategy is to perform predictions at different levels of feature maps. Inspired by the PAFPN (Path Aggregation Feature Pyramid Network) [33], YOLOv5’s neck utilizes the combination of FPN (Feature Pyramid Network) and PAN to perform multi-scale feature fusion. The FPN transmits robust semantic features in a top-down manner, while PAN conveys strong positional features from the bottom-up. These two features are integrated to enable feature maps of varying sizes to contain both semantic and positional information better.

A schematic representation of the neck of YOLOv5s is depicted in Figure 8a. The FPN comprises a bottom-up backbone, a top-down pathway, and horizontal connections. The bottom-up backbone is a forward process of CNNs. With the forward process, it can be seen in Figure 8a that from C2 to C4, the sizes of the corresponding feature maps are 80 × 80, 40 × 40, and 20 × 20, respectively. The height and width of the feature maps are successively halved after three rounds of down-sampling. It can also be seen from the figure that FPN adopts a top-down approach to carry out information transfer and fusion from high-level features to low-level features. Low-level feature maps can obtain semantic context information from high-level feature maps, thereby enhancing the capability of detecting small targets. However, during the transmission, the FPN may gradually lose or blur detailed information, which particularly impacts target localization. Therefore, PAFPN adopts a bottom-up structure called PAN to enhance its localization capabilities. PAFPN was initially utilized in image segmentation. It was later integrated into YOLO for improved feature extraction capacity.

Since most vehicle targets of our dataset are small and medium, in order to improve the detection accuracy of those targets while avoiding noise caused by excessive feature fusion, the feature map C3 was fused directly with the second-layer output of PAN instead of fusing every feature map extracted from the backbone, as shown in Figure 8b.

2.2.3. Losses

The loss function is utilized to quantify the disparity between the predicted value and the true value, which greatly determines the performance of the model. YOLOv5s has three loss functions: classification loss, localization loss, and confidence loss. The formula of total loss is shown in Equation (7). Classification loss calculates whether the model can accurately identify objects in the image and classify them into the correct category. Localization loss measures the error between the bounding box and the ground truth and currently employs CIoU loss for bounding box regression. Confidence loss calculates the network confidence.

L o s s = L_{l o c} + L_{c o n f} + L_{c l a s s}

(7)

Initially, YOLOv5s employs GIoU as the loss for bounding box regression. The formula of GIoU loss is shown in Equations (8)–(10). The GIoU loss effectively addresses the issue of gradient vanishment in non-overlapping scenarios. However, it exhibits slow convergence. Moreover, when the bounding box and ground truth are in an “enclosed” state, GIoU degenerates into IoU, thereby hindering the distinction of relative positional relationships.

L_{G I o U} = 1 - G I o U

(8)

G I o U = I o U - \frac{|C - A \cup B|}{|C|}

(9)

I o U = \frac{|A \cap B|}{|A \cup B|}

(10)

where A is the bounding box, B is the ground truth, C is the smallest outer rectangle containing both the bounding box and ground truth,

A \cup B

is the union area of A and B, and

A \cap B

is the intersection area of A and B.

Because of the above problems, DIoU loss is proposed. DIoU loss can optimize the distance between the ground truth and the bounding box and converge faster than GIoU loss. The DIoU loss formula is shown in Equations (11) and (12).

L_{D I o U} = 1 - D I o U

(11)

D I o U = I o U - \frac{ρ^{2} (b, b^{g t})}{c^{2}} = I o U - \frac{d^{2}}{c^{2}}

(12)

where b and

b^{g t}

represent the center point of the bounding box and ground truth, respectively,

d = ρ (b, b^{g t})

stands for calculating the Euclidean distance between two central points, and c represents the diagonal length of the smallest outer rectangle containing both bounding box and ground truth.

Although DIoU considers the overlap area and center point distance, it overlooks the aspect ratio. We adopted CIoU as the bounding box for regression loss. CIoU added the influence factor αv based on DIoU. The CIoU loss formula is shown in Equations (13)–(16), while an illustration of CIoU loss is demonstrated in Figure 9.

L_{C I o U} = 1 - C I o U

(13)

C I o U = I o U - \frac{d^{2}}{c^{2}} - α v

(14)

α = \frac{v}{(1 - I o U) + v}

(15)

v = \frac{4}{π^{2}} {(a r c t a n \frac{w^{g t}}{h^{g t}} - a r c t a n \frac{w}{h})}^{2}

(16)

where α is the positive trade-off parameter, v is a parameter that measures the consistency of the aspect ratio, w and h, respectively, represent the width and height of the bounding box, and

w^{g t}

and

h^{g t}

represent the width and height of the ground truth, respectively.

3. Experiments and Results

3.1. Experimental Environment

In this experiment, the experimental hardware platform consists of CoreI7-12700K@3.61GHz CPU, 11 GB Nvidia GeForce RTX 2080Ti GPU hardware, and Windows 10 operating system. The experimental software configuration consists of Python 3.7.0, PyTorch 1.7.1, and Torchvision 0.7.2, as shown in Table 1. During the training process of the model, the optimizer SGD (stochastic gradient descent) is used to optimize the model. The specific parameters of training are shown in Table 2. The training epoch is set to 300, the batch size is set to 8, the initial learning rate is set to 0.01, the SGD optimizer’s weight decay is set to 0.0005, and the momentum is set to 0.937.

3.2. Dataset

Compared to conventional ground-based vehicle detection datasets, the dataset for vehicle detection from a UAV’s orthographic perspective is relatively smaller in scale. To assess the enhanced performance of the improved YOLOv5s, experiments were conducted on a self-constructed dataset of vehicles captured by UAVs. The dataset is named UAV-OP (Unmanned Aerial Vehicle from Orthographic Perspective). There are a total of 1200 drone images ranging from 4864 × 3648 to 8000 × 6000 pixels covering a variety of landscapes, including urban, rural, and mountainous areas. They were cut and filled into small blocks of 2560 × 2560. The images contain five categories of target vehicles: Cargo, Sedan, Truck, Excavator, and Bus. The images were carefully filtered to remove images that did not contain any of these five target categories. After screening, a total of 3563 valid UAV images were obtained, with each image measuring 2560 × 2560 pixels. Subsequently, the selected images were annotated using the image annotation tool “labelimg”. The UAV-OP dataset in YOLO format was retained for further analysis and utilization. Finally, the dataset was partitioned into two subsets with a 7:3 ratio, consisting of 2494 samples in the training set and 1069 samples in the test set. The number of targets in each subset is presented in Table 3. Figure 10 shows the details of the dataset. The statistical figure in Figure 10a illustrates the distribution of target counts for five vehicle types within the training set. It is evident from the graph that Sedan has the highest number of targets, followed by Cargo. For specific vehicle counts, please refer to Table 3. The target position and size of the label data were analyzed. Figure 10b shows the normalized target label location diagram, and Figure 10c shows the normalized target label size diagram. It can be seen from the figure that target vehicles are mostly distributed in the lower half of the whole Y-coordinate system in the vertical direction, and they are relatively evenly distributed in the horizontal direction. The width and height of the target vehicles mostly account for 1% to 6% of the size of the image.

The UAV-OP dataset has images of vehicles of different types, sizes, colors, and orientations. Moreover, the images in the UAV-OP dataset have complex geomorphic backgrounds, including cities, villages, highways, etc. These factors constitute a dataset with higher object variance, which increases the difficulty of vehicle detection and enables our method to effectively detect and identify a wide variety of vehicles with different types and features. Sample images from the UAV-OP dataset are shown in Figure 11.

3.3. Evaluation Indicators

In object detection, the classification of a bounding box as either a true or false target yields four potential outcomes: TP, FP, TN, and FN. These four types of prediction results represent the number of true positives (correctly predicting real targets as real targets), false positives (incorrectly predicting non-targets as real targets), true negatives (correctly predicting non-targets as non-targets), and false negatives (incorrectly predicting real targets as non-targets). If the IoU (Intersection over Union) between the algorithm’s detection box and the actual annotated box exceeds the threshold (in our experiments, the IoU threshold is set to 0.6), it is labeled as a TP. Otherwise, it is labeled as an FP. If there is no region that matches the actual annotated box, it is labeled as an FN.

The evaluation indicators for evaluating the performance of detection methods are usually Precision (P), Recall (R), Average Precision (AP), Mean Average Precision 50 (mAP50), and Mean Average Precision 50–95 (mAP50–95). The calculation formulas of Precision (P) and Recall (R) are shown in the Equations (17) and (18).

P = \frac{T P}{T P + F P}

(17)

R = \frac{T P}{T P + F N}

(18)

The relationship between precision and recall exhibits a significant correlation. Increasing precision tends to lead to lower recall rates and vice versa. The precision focuses on the correctness of the model’s predictions as positive examples, while the recall focuses on the coverage ability of the model for the positive example. AP is a comprehensive metric used to assess the trade-off between precision and recall at different classification thresholds for a model. AP is obtained by calculating the area under the P-R curve. The calculation formula is shown in the Equation (19). AP measures a model’s performance in each category, while mAP provides a comprehensive evaluation of the model’s overall performance across all categories. The calculation formula of mAP is shown in Equation (20).

A P = \int_{0}^{1} P (R) d R

(19)

m A P = \frac{1}{N} \sum_{i = 1}^{N} A P_{i}

(20)

3.4. Experiments and Results

3.4.1. Study of Input Resolution

In order to study the influence of the resolution of input images on detection accuracy, comparative experiments were first conducted on our dataset using input image resolutions of 640 × 640, 960 × 960, and 1280 × 1280 pixels, respectively. The experimental results are presented in Table 4. The Params measure the number of parameters in the model, while Flops (floating point operations per second) are utilized to gauge the complexity of the algorithm. Additionally, GFlops denote a billion floating-point operations executed within one second.

As shown in Table 4, the highest mAP50 is achieved during YOLOv5s training when using an input resolution parameter of 1280 × 1280, which is 14% higher than that achieved with a parameter of 640 × 640 and 1.9% higher than that achieved with a parameter of 960 × 960. The mAP50:95 is 0.497 when the input resolution is 1280 × 1280, which is 12.8% higher than that at 640 × 640 resolution and 2.9% higher than that at 960 × 960 resolution. With the increase in resolution of input images, mAP50 and mAP50:95 increase from fast to slow, and Params and GFlops remain unchanged. The resolution of the input image has a significant impact on the detection accuracy. In the following experiments, we choose 1280 × 1280 as the resolution of input images.

3.4.2. Evaluation of Attention Mechanisms

In this section, the performance evaluation of the YOLOv5s in conjunction with various attention mechanisms (including SE—Squeeze and Excitation block [34], CBAM—Convolutional Block Attention Module [35], and CA—Coordinate Attention Module [36]) was conducted on our dataset. The above three attention mechanisms were added into the neck of the YOLOv5s, respectively. The performance comparative analysis between YOLOv5s combined with different attention mechanisms and the original YOLOv5s algorithm is presented in Table 5.

The experimental results indicate that, in comparison with the YOLOv5s baseline, both YOLOv5s + SE and YOLOv5s + CBAM do not exhibit significant improvements in mAP50 and mAP50:95. However, the mAP50 and mAP50:95 of the YOLOv5s + CA are respectively 1.7% and 1.3% higher than those of the YOLOv5s baseline. CA attention is position-sensitive and direction-aware; position-sensitive aids in accurately locating the object, while direction-aware assists in handling the varying orientations of vehicles in the image. The improvement of YOLOv5s + CA resulted in a small increase in the number of parameters and computational complexity, but the mAPs were improved, which improved the accuracy of vehicle detection of the UAV images.

3.4.3. Evaluation of Feature Pyramid Networks

In this section, the performance comparison of different FPNs on the YOLOv5s + CA algorithm, including BiFPN (Bi-directional Feature Pyramid Network) and our improved PAFPN (Path Aggregation Feature Pyramid Network). Among them, BiFPN is divided into BiFPN_Add and BiFPN_Concat. BiFPN_Add performs feature maps addition operation with unchanged channel numbers, while BiFPN_Concat performs channel stacking. The experimental results are presented in Table 6. Our improved PAFPN performed best, with mAP50 reaching 0.779, which increased by 1.3% and mAP50:95 by 0.4% compared with YOLOv5s + CA. The GFlops is only increased by 0.2. The improved PAFPN used in our method fuses feature layer C3 directly with the second layer output of PAN. Such multi-scale feature fusion is conducive to the localization of vehicles with variable scales, better adapted to the detection of small and medium-sized objects, and avoids too much redundant information.

3.4.4. Evaluation of IoU Losses

In this section, the performance evaluation of five losses, namely GIoU_Loss [37], DIoU_Loss [38], CIoU_Loss [38], EIoU_Loss [39], and SIoU_Loss [40], on our proposed improved YOLOv5s algorithm. The experimental results are presented in Table 7. According to the table, CIoU_Loss demonstrates superior performance in terms of both mAP50 and mAP50:95 metrics. Therefore, the utilization of CIOU_Loss enhances the model’s accuracy and detection capabilities, ensuring efficient vehicle detection and classification in UAV images.

3.4.5. Ablation Experiments

To verify the effectiveness of the proposed method, the ablation experiments were conducted using the same training hyper-parameters and strategies. The experimental results of ablation experiments are shown in Table 8, while the visualization of mAP50, mAP50:95, and loss are shown in Figure 12. Compared with experiment YOLOv5s, the experiment YOLOv5s + CA adds CA attention, the experiment YOLOv5s + improved_PAFPN enhances PAFPN, the experiment YOLOv5s + CA + improved_PAFPN combines the two improvement methods. The experimental results of YOLOv5s + CA + improved_PAFPN demonstrate an enhancement, with a 3% increase in mAP50 and a 1.7% increase in mAP50:95. There is a small increase in Parmas, and GFlops has a small increase of 0.3. Therefore, the comprehensive performance of the proposed method is superior to that of the original YOLOv5s in the field of vehicle detection of UAV images.

The visualization of detection results is shown in Figure 13. Figure 13a shows the detection results of the YOLOv5s baseline, and Figure 13b shows the detection results of our proposed method. Three scenes were chosen for detection: small and dense, occlusion, and complex background. The first row of the figure shows a small and dense scene, and the proposed method improves the detection effect of vehicles in small and dense scenes in UAV images. The second row shows the occlusion. The proposed method can detect the occluded object better and has higher confidence, as shown in the second row of Figure 13b. The third row is the case of the complex background, and the proposed method can detect the missing error detection that easily occurs in a complex background. As can be seen from Figure 13, UAV images usually have a higher Angle of view, more complex background, variable scale, and occlusion. For the vehicles that are smaller, the shape of vehicles is either long strips or truncated. Therefore, the objects of UAV images and ground-based images have different shape and texture features. In this study, CA attention can help the model pay more attention to the position of the vehicles and adapt to varying orientations of vehicles. The improved PAFPN makes the model more adaptable to the variable scale, which is conducive to the detection of small and medium-sized objects.

4. Conclusions

Due to the higher perspective, complex backgrounds, and varying scales in UAV images compared to ground-based images, vehicle detection tasks in UAV images are more challenging. This paper addresses prominent issues affecting vehicle detection performance in UAV images, such as small and dense, sparse distribution, occlusion, and complex background. To improve the accuracy of vehicle detection in UAV images, this paper proposed an improved YOLOv5s algorithm for vehicle detection and classification and conducted experiments on a self-build dataset UAV-OP, which consists of 3563 high-resolution UAV images. Through experimental research, the following conclusions have been drawn:

(1) The CA attention mechanism aids the model in localizing objects better, adapting to the varying orientations of objects, and improving its understanding of spatial information about objects. Specifically, when compared to the baseline YOLOv5s model, the CA mechanism has resulted in an improvement of 1.7% in mAP50 and 1.3% in mAP50:95.

(2) The improved PAFPN adds a short path on the basis of PAFPN, enhancing multi-scale feature fusion and strengthening the model’s localization capabilities for small objects while avoiding noise caused by excessive feature fusion. Finally, our model achieved an improvement of 3% in mAP50 and 1.7% in mAP50:95 over the YOLOv5s baseline model.

Our method has improved the detection efficiency and accuracy of vehicles in UAV images, mitigating the challenges associated with vehicle detection in UAV images. Although our method has improved the detection effect of UAV vehicles and contributed to the detection of UAV vehicles to a certain extent, there are still some limitations of our study. Firstly, the parameter settings involved in this study are empirical, and there is no specific study on the impact of these parameters. Secondly, UAV datasets are limited in number and diversity, which can lead to limited generalization capability of models across different scenarios, weather conditions, and vehicle types. In the future, our method will be continuously optimized to make it more efficient, faster, and smaller. On the one hand, exploring lighter models through model pruning and knowledge distillation methods will be explored. On the other hand, our dataset will be enriched to gather a wider variety of scenarios and vehicle types and ensure a more balanced sample distribution. Furthermore, all these efforts will contribute to the subsequent research on deploying the vehicle detection model.

Author Contributions

F.L., Y.N. and Y.T. conceived, designed, and performed the algorithm and experiments; K.L. provided original UAV images; Y.Y. and L.H. annotated the dataset; X.W. provided experimental guidance and advice for the revision of the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (No. 41101426) the National Key Research and Development Program (No. 2019YFE0126600), and 03 Special Project and 5G Project of Jiangxi Science and Technology Department (No. 20212ABC03A03).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgments

The authors would like to thank the handing editor and anonymous reviewers for their valuable comments. This research was supported by the National Natural Science Foundation of China (No. 41101426) the National Key Research and Development Program (No. 2019YFE0126600), and 03 Special Project and 5G Project of Jiangxi Science and Technology Department (No. 20212ABC03A03).

Conflicts of Interest

The authors declare no conflict of interest.

References

Ganesan, R.; Raajini, X.M.; Nayyar, A.; Sanjeevikumar, P.; Hossain, E.; Ertas, A.H. Bold: Bio-inspired optimized leader election for multiple drones. Sensors 2020, 20, 3134. [Google Scholar] [CrossRef]
Yayli, U.C.; Kimet, C.; Duru, A.; Cetir, O.; Torun, U.; Aydogan, A.C.; Padmanaban, S.; Ertas, A.H. Design optimization of a fixed wing aircraft. Adv. Aircr. Spacecr. Sci. 2017, 4, 65. [Google Scholar] [CrossRef]
Tsouros, D.C.; Bibi, S.; Sarigiannidis, P.G. A review on UAV-based applications for precision agriculture. Information 2019, 10, 349. [Google Scholar] [CrossRef]
Radoglou-Grammatikis, P.; Sarigiannidis, P.; Lagkas, T.; Moscholios, I. A compilation of UAV applications for precision agriculture. Comput. Netw. 2020, 172, 107148. [Google Scholar] [CrossRef]
Torresan, C.; Berton, A.; Carotenuto, F.; Di Gennaro, S.F.; Gioli, B.; Matese, A.; Miglietta, F.; Vagnoli, C.; Zaldei, A.; Wallace, L. Forestry applications of UAVs in Europe: A review. Int. J. Remote Sens. 2017, 38, 2427–2447. [Google Scholar] [CrossRef]
Hildmann, H.; Kovacs, E. Review: Using unmanned aerial vehicles (UAVs) as mobile sensing platforms (MSPs) for disaster response, civil security and public safety. Drones 2019, 3, 59. [Google Scholar] [CrossRef]
Gupta, A.; Afrin, T.; Scully, E.; Yodo, N. Advances of UAVs toward future transportation: The state-of-the-art, challenges, and opportunities. Future Transp. 2021, 1, 326–350. [Google Scholar] [CrossRef]
Srivastava, S.; Narayan, S.; Mittal, S. A survey of deep learning techniques for vehicle detection from UAV images. J. Syst. Archit. 2021, 117, 102152. [Google Scholar] [CrossRef]
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 2016, 29, 379–387. [Google Scholar]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; Volume 14, pp. 21–37. [Google Scholar]
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
Wang, L.; Liao, J.; Xu, C. Vehicle detection based on drone images with the improved faster R-CNN. In Proceedings of the 2019 11th International Conference on Machine Learning and Computing, Zhuhai, China, 22–24 February 2019; pp. 466–471. [Google Scholar]
Hou, Z.; Yan, J.; Yang, B.; Ding, Z. A Novel UAV Aerial Vehicle Detection Method Based on Attention Mechanism and Multi-scale Feature Cross Fusion. In Proceedings of the 2021 2nd International Conference on Artificial Intelligence in Electronics Engineering, Phuket, Thailand, 15–17 January 2021; pp. 51–59. [Google Scholar]
Xu, Y.; Yu, G.; Wang, Y.; Wu, X.; Ma, Y. Car detection from low-altitude UAV imagery with the faster R-CNN. J. Adv. Transp. 2017, 2017, 2823617. [Google Scholar] [CrossRef]
Ji, H.; Gao, Z.; Mei, T.; Li, Y. Improved faster R-CNN with multiscale feature fusion and homography augmentation for vehicle detection in remote sensing images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1761–1765. [Google Scholar] [CrossRef]
Ammar, A.; Koubaa, A.; Ahmed, M.; Saad, A.; Benjdira, B. Vehicle detection from aerial images using deep learning: A comparative study. Electronics 2021, 10, 820. [Google Scholar] [CrossRef]
Luo, X.; Tian, X.; Zhang, H.; Hou, W.; Leng, G.; Xu, W.; Jia, H.; He, X.; Wang, M.; Zhang, J. Fast automatic vehicle detection in uav images using convolutional neural networks. Remote Sens. 2020, 12, 1994. [Google Scholar] [CrossRef]
Luo, X.; Wu, Y.; Zhao, L. YOLOD: A target detection method for UAV aerial imagery. Remote Sens. 2022, 14, 3240. [Google Scholar] [CrossRef]
Feng, J.; Yi, C. Lightweight detection network for arbitrary-oriented vehicles in UAV imagery via global attentive relation and multi-path fusion. Drones 2022, 6, 108. [Google Scholar] [CrossRef]
Wang, D.; He, D. Channel pruned YOLO V5s-based deep learning approach for rapid and accurate apple fruitlet detection before fruit thinning. Biosyst. Eng. 2021, 210, 271–281. [Google Scholar] [CrossRef]
Jawaharlalnehru, A.; Sambandham, T.; Sekar, V.; Ravikumar, D.; Loganathan, V.; Kannadasan, R.; Khan, A.A.; Wechtaisong, C.; Haq, M.A.; Alhussen, A.; et al. Target object detection from Unmanned Aerial Vehicle (UAV) images based on improved YOLO algorithm. Electronics 2022, 11, 2343. [Google Scholar] [CrossRef]
Zhang, H.; Shao, F.; He, X.; Zhang, Z.; Cai, Y.; Bi, S. Research on Object Detection and Recognition Method for UAV Aerial Images Based on Improved YOLOv5. Drones 2023, 7, 402. [Google Scholar] [CrossRef]
Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convottional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]

Figure 1. Different scenes from UAV images. (a) shows the small and dense scenes, (b) shows the sparse distribution of vehicles, (c) shows the occlusion of vehicle targets, and (d) shows object detection tasks suffered from the complex background.

Figure 2. The architecture of the paper.

Figure 3. The structure of YOLOv5.

Figure 4. Focus module slicing operation.

Figure 5. The structures of SPP and SPPF. (a) The structure of SPP; (b) The structure of SPPF.

Figure 6. The structure of Coordinate Attention mechanism.

Figure 7. The structure of improved YOLOv5s with Coordinate Attention. The yellow blocks represent Coordinate Attention.

Figure 8. The structures of PAFPN and improved PAFPN. (a) The structure of PAFPN; (b) The structure of improved PAFPN.

Figure 9. CIoU loss for bounding box regression.

Figure 10. The overview diagram of label box. (a) Target label statistics; (b) Normalized target label location map; (c) Normalized target label size map.

Figure 11. Illustration of images in the UAV-OP dataset.

Figure 12. Visualization of mAPs and loss of YOLOv5s and improved YOLOv5s. (a) mAP50, (b) mAP50:95, and (c) loss.

Figure 13. Visual comparison of detection results. (1)–(3) represent three scenes of UAV images; (a,b) are visualization results of YOLOv5s and proposed method, respectively.

Table 1. Experimental environment.

Environment	Versions
CPU	CoreI7-12700K@3.61 GHz CPU
GPU	Nvidia GeForce RTX 2080Ti GPU
Memory	11 GB
Operating System	Windows 10
Python	3.7.0
PyTorch	1.7.1
Torchvision	0.7.2

Table 2. Hyper-parameters.

Epoch	Batch Size	Learning Rate	Weight Decay	Momentum
300	8	0.01	0.0005	0.937

Table 3. Details of training and test sets.

Category	Cargo	Sedan	Truck	Excavator	Bus	Total
Train	2457	10,542	654	278	130	14,061
Test	1131	4868	253	101	47	6400
Total	3588	15,410	907	379	177	20,461

Table 4. Experimental results of different input resolutions.

Input Resolution	mAP50	AP50 of Each Category					mAP50:95	Params	GFlops
Input Resolution	mAP50	Cargo	Sedan	Truck	Excavator	Bus	mAP50:95	Params	GFlops
640 × 640	0.609	0.666	0.853	0.503	0.638	0.384	0.369	7,023,610	15.8
960 × 960	0.730	0.739	0.925	0.629	0.699	0.659	0.468
1280 × 1280	0.749	0.755	0.940	0.669	0.727	0.656	0.497

Table 5. Experimental results of YOLOv5s combined with different attention mechanisms.

Method	mAP50	AP50 of Each Category					mAP50:95	Params	GFlops
Method	mAP50	Cargo	Sedan	Truck	Excavator	Bus	mAP50:95	Params	GFlops
YOLOv5s	0.749	0.755	0.940	0.669	0.727	0.656	0.497	7,023,610	15.8
YOLOv5s + SE	0.750	0.751	0.940	0.632	0.755	0.672	0.495	7,025,658	15.8
YOLOv5s + CBAM	0.749	0.757	0.939	0.649	0.735	0.664	0.497	7,040,067	15.9
YOLOv5s + CA	0.766	0.760	0.939	0.676	0.772	0.683	0.510	7,062,618	15.9

Table 6. Experimental results of YOLOV5s + CA-combined FPNs.

Method	mAP50	AP50 of Each Category					mAP50:95	Parmas	GFlops
Method	mAP50	Cargo	Sedan	Truck	Excavator	Bus	mAP50:95	Parmas	GFlops
YOLOv5s + CA	0.766	0.760	0.939	0.676	0.772	0.683	0.510	7,062,618	15.9
YOLOv5s + CA + BiFPN_Add	0.742	0.770	0.942	0.631	0.725	0.645	0.500	7,188,811	16.5
YOLOv5s + CA + BiFPN_Concat	0.764	0.751	0.943	0.646	0.747	0.734	0.499	7,147,131	16.1
YOLOv5s + CA + improved_PAFPN	0.779	0.752	0.941	0.66	0.771	0.773	0.514	7,147,122	16.1

Table 7. Experimental results of IoU losses based on YOLOv5s + CA + improved_PAFPN.

Method	mAP50	AP50 of Each Category					mAP50:95	Parmas	GFlops
Method	mAP50	Cargo	Sedan	Truck	Excavator	Bus	mAP50:95	Parmas	GFlops
GIoU_Loss	0.761	0.764	0.939	0.654	0.764	0.685	0.500	7,147,122	16.1
DIoU_Loss	0.739	0.755	0.941	0.647	0.760	0.593	0.485
CIoU_Loss	0.779	0.752	0.941	0.660	0.771	0.773	0.514
EIoU_Loss	0.758	0.761	0.941	0.684	0.724	0.682	0.505
SIoU_Loss	0.742	0.758	0.938	0.643	0.753	0.619	0.488

Table 8. Experimental results of ablation experiment.

Method	mAP50	AP50 of Each Category					mAP50:95	Parmas	Gflops
Method	mAP50	Cargo	Sedan	Truck	Excavator	Bus	mAP50:95	Parmas	Gflops
YOLOv5s	0.749	0.755	0.940	0.669	0.727	0.656	0.497	7,023,610	15.8
YOLOv5s + CA	0.766	0.760	0.939	0.676	0.772	0.683	0.510	7,062,618	15.9
YOLOv5s + improved_PAFPN	0.754	0.773	0.944	0.667	0.768	0.619	0.500	7,089,146	16.0
YOLOv5s + CA + improved_PAFPN	0.779	0.752	0.941	0.66	0.771	0.773	0.514	7,147,122	16.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lu, F.; Li, K.; Nie, Y.; Tao, Y.; Yu, Y.; Huang, L.; Wang, X. Object Detection of UAV Images from Orthographic Perspective Based on Improved YOLOv5s. Sustainability 2023, 15, 14564. https://doi.org/10.3390/su151914564

AMA Style

Lu F, Li K, Nie Y, Tao Y, Yu Y, Huang L, Wang X. Object Detection of UAV Images from Orthographic Perspective Based on Improved YOLOv5s. Sustainability. 2023; 15(19):14564. https://doi.org/10.3390/su151914564

Chicago/Turabian Style

Lu, Feng, Kewei Li, Yunfeng Nie, Yejia Tao, Yihao Yu, Linbo Huang, and Xing Wang. 2023. "Object Detection of UAV Images from Orthographic Perspective Based on Improved YOLOv5s" Sustainability 15, no. 19: 14564. https://doi.org/10.3390/su151914564

APA Style

Lu, F., Li, K., Nie, Y., Tao, Y., Yu, Y., Huang, L., & Wang, X. (2023). Object Detection of UAV Images from Orthographic Perspective Based on Improved YOLOv5s. Sustainability, 15(19), 14564. https://doi.org/10.3390/su151914564

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Object Detection of UAV Images from Orthographic Perspective Based on Improved YOLOv5s

Abstract

1. Introduction

2. Materials and Methods

2.1. The Overview of YOLOv5

2.1.1. Input

2.1.2. Backbone

2.1.3. Neck

2.1.4. Head

2.2. Improved YOLOv5s

2.2.1. Coordinate Attention

2.2.2. Improved PAFPN

2.2.3. Losses

3. Experiments and Results

3.1. Experimental Environment

3.2. Dataset

3.3. Evaluation Indicators

3.4. Experiments and Results

3.4.1. Study of Input Resolution

3.4.2. Evaluation of Attention Mechanisms

3.4.3. Evaluation of Feature Pyramid Networks

3.4.4. Evaluation of IoU Losses

3.4.5. Ablation Experiments

4. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI