Sensors
  • Article
  • Open Access

16 June 2023

Real-Time Vehicle Detection from UAV Aerial Images Based on Improved YOLOv5

College of Intelligent Equipment, Shandong University of Science and Technology, Taian 271019, China
* Author to whom correspondence should be addressed.
This article belongs to the Section Vehicular Sensing

Abstract

Aerial vehicle detection has significant applications in aerial surveillance and traffic control. Images captured by UAVs are characterized by many tiny objects and vehicles obscuring each other, which significantly increases the difficulty of detection. Missed and false detections are widespread problems in research on detecting vehicles in aerial images. Therefore, we customize a model based on YOLOv5 to make it more suitable for detecting vehicles in aerial images. Firstly, we add one additional prediction head to detect smaller-scale objects. Furthermore, to keep the original features involved in the training process of the model, we introduce a Bidirectional Feature Pyramid Network (BiFPN) to fuse feature information from various scales. Lastly, Soft-NMS (soft non-maximum suppression) is employed as the prediction box filtering method, alleviating the missed detections caused by closely spaced vehicles. The experimental findings on the self-made dataset in this research indicate that, compared with YOLOv5s, the mAP@0.5 and mAP@0.5:0.95 of YOLOv5-VTO increase by 3.7% and 4.7%, respectively, and precision and recall are also improved.

1. Introduction

The usage of small, low-altitude UAVs has grown rapidly in recent years [1,2,3,4]. Object detection techniques based on UAVs equipped with vision sensors have attracted much interest in areas such as unmanned vehicles and intelligent transportation systems [5,6,7,8]. UAV-based aerial vehicle detection techniques are less expensive than cameras installed at fixed locations and provide more extensive image views, greater flexibility, and broader coverage. UAVs can monitor road traffic over any range and provide critical information for subsequent intelligent traffic supervision tasks such as traffic flow calculation, unexpected accident detection, and traffic situational awareness. However, the vast percentage of vehicle targets have few feature points and small sizes [9,10], which presents a difficulty for precise and real-time vehicle detection in the UAV overhead view [11].
Existing vehicle detection approaches can be roughly divided into traditional and deep learning-based vehicle detection algorithms. Traditional vehicle detection algorithms must extract features [12,13] manually and then use SVM, AdaBoost, and other machine learning methods for classification. However, this approach is time-consuming and can only extract shallow features, which significantly limits its application to aerial scenes with small targets. In recent years, with the continuous development of deep learning techniques, various artificial intelligence algorithms based on convolutional neural networks have played a great role in different fields, such as autonomous driving [14], optimization of medicine policies [15], and wildlife census [16]. Deep learning-based target detection algorithms have also been extensively applied and mainly comprise two-stage and single-stage algorithms. Two-stage target detection algorithms first extract candidate regions and then perform regression localization and classification of targets; common examples include Fast R-CNN [17], Faster R-CNN [18], and R-FCN [19]. Singh et al. [20] used Fast R-CNN-optimized samples to design a real-time intelligent framework that performs well on vehicle detection tasks with complex backgrounds and many small targets. Nevertheless, the model may not fit well for cases where the object sizes vary widely. The authors of [21] conducted a study on vehicle detection based on Faster R-CNN, and the improved model reduced the latency and enhanced the detection performance for small targets. However, the model requires high computational resources in the detection process. Kong et al. [22] used a parallel RPN network combined with a density-based sample assigner to improve the detection of vehicle-dense areas in aerial images. However, the model structure is complex and requires two stages to complete the detection, which cannot meet the requirement of real-time detection. Since the two-stage detection algorithm requires the pre-generation of many pre-selected boxes, it is highly accurate but slow and cannot meet the needs of real-time detection [23]. The single-stage target detection algorithm directly transforms the localization and classification problem into a regression problem, which gives it an absolute speed advantage and accuracy potential compared with the two-stage approach. The mainstream single-stage target detection algorithms mainly include the YOLO (You Only Look Once) series [24,25,26,27] and the SSD series [28]. Yin et al. [29] obtained outstanding detection performance for small objects by improving the efficiency of SSD in using feature information at different scales. However, the default box needs to be selected manually, which may affect the performance of the model in detecting small targets. Lin et al. [30] detected oriented vehicles in aerial images based on YOLOv4, and the improved model significantly improved the detection performance in scenarios with densely arranged vehicles and buildings. However, further improvement studies are lacking for scenes with small targets. Ammar et al. [31] compared the detection performance of Faster R-CNN, YOLOv3, and YOLOv4 on a UAV aerial vehicle dataset but did not consider the impact of vehicle occlusion, shooting angle, and lighting conditions on the model. Zhang et al. [32] proposed a novel multi-scale adversarial network for improved vehicle detection in UAV imagery. The model performs well on images taken from different perspectives, heights, and imaging conditions; however, the vehicle classification is coarse, with only two categories: large vehicles and small vehicles.
Because of its excellent detection accuracy and quick inference, YOLOv5 [33] is applied extensively in various fields for practical applications. Niu et al. [34] used the Zero-DCE low-light enhancement algorithm to optimize the dataset and combined it with YOLOv5 and AlexNet for traffic light detection. Sun et al. [35] employed YOLOv5 to identify marks added to bolts and nuts, from which the relative rotation angle was calculated to determine whether the bolts were loose. Yan et al. [36] applied an enhanced model based on YOLOv5 to apple detection, which improved the detection speed and reduced the false detection rate for obscured targets.
To reduce the false and missed detection rates of vehicle detection tasks, this paper refines YOLOv5s, a lightweight network in YOLOv5 version 6.1. The details are outlined as follows:
(1)
In this paper, a smaller detection layer is added to the three detection layers of the original network. It makes the network more sensitive to small targets in high-resolution pictures and strengthens the multi-scale detection capability of the network.
(2)
We introduce the BiFPN structure [37] into YOLOv5, which strengthens the feature extraction and fusion process. BiFPN enables the model to utilize deep and shallow feature information more effectively and thus obtain more details about small and occluded objects.
(3)
YOLOv5s adopts the NMS algorithm, which directly deletes the lower-confidence box when two candidate boxes overlap too much, resulting in missed detections. Therefore, we use the Soft-NMS (soft non-maximum suppression) algorithm [38] to attenuate the confidence of overlapping candidate boxes instead, effectively alleviating the missed detections caused by vehicle occlusion; a minimal sketch of the idea follows this list.
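To make the third improvement concrete, below is a minimal NumPy sketch of Soft-NMS with the Gaussian decay of [38]. It is an illustration only: the sigma value, score threshold, and (x1, y1, x2, y2) box format are assumptions for the example, not the exact settings used in YOLOv5-VTO.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_box + area_boxes - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    """Instead of deleting overlapping boxes, decay their scores with a Gaussian penalty."""
    scores = scores.astype(float).copy()
    keep = []
    idxs = np.arange(len(scores))
    while idxs.size > 0:
        top = idxs[np.argmax(scores[idxs])]               # highest-scoring remaining box
        keep.append(top)
        idxs = idxs[idxs != top]
        if idxs.size == 0:
            break
        overlaps = iou(boxes[top], boxes[idxs])
        scores[idxs] *= np.exp(-(overlaps ** 2) / sigma)  # decay instead of hard removal
        idxs = idxs[scores[idxs] > score_thr]             # drop boxes whose score collapses
    return keep
```

By contrast, standard NMS simply discards every remaining box whose IoU with the selected box exceeds a fixed threshold, which is what causes missed detections for closely spaced vehicles.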

3. Experiments

3.1. Experimental Setup

In our experiments, the operating system was Linux, the CPU was an Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60 GHz, the GPU was an RTX A5000 with 24 GB of memory, and the framework was PyTorch. The experimental settings were based on the official YOLOv5 default parameters, including the adaptive anchor strategy and mosaic data enhancement. The training parameters are set as shown in Table 1.
Table 1. Parameters of training.

3.2. Dataset Description

Vehicles of four categories—car, van, truck, and bus—were selected for training, validation, and testing by collating the open-source dataset VisDrone2019-DET [42]. The number of labels for each category is shown in Figure 8.
Figure 8. Pie chart describing the proportion of instances of labels for each category.
There are ten categories in the VisDrone2019-DET dataset labels, several of which contain few vehicle targets. We therefore carefully selected 3650 photos from the original dataset as the experimental dataset of this paper to increase the training efficiency of the model. Figure 9 shows some of the images in this dataset. A dataset is usually divided into a training set, a validation set, and a test set: the training set is responsible for training the model, the validation set is used to tune the parameters, and the test set is used to evaluate the model. If the data distributions of these three sets differ greatly, the generalization ability of the model in real scenarios may suffer. Therefore, it is essential to allocate the dataset randomly; common ratios are 8:1:1 and 7:2:1. In this work, we randomly partition the dataset roughly according to the proportion of 8:1:1, obtaining 2800 images in the training set, 350 in the validation set, and 500 in the test set.
Figure 9. A few examples of the images in the dataset used in this article.
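As a concrete illustration of the split described above, the following Python sketch randomly partitions a list of image files into the 2800/350/500 subsets. The directory layout, file extension, and output files are assumptions made for the example, not the actual organization of our data.

```python
import random
from pathlib import Path

random.seed(0)                                    # fixed seed for a reproducible split
images = sorted(Path("images").glob("*.jpg"))     # the 3650 selected VisDrone images
random.shuffle(images)

n_train, n_val = 2800, 350                        # sizes used in this paper (test = 500)
splits = {
    "train": images[:n_train],
    "val":   images[n_train:n_train + n_val],
    "test":  images[n_train + n_val:],
}
for name, files in splits.items():
    # one image path per line, as commonly consumed by YOLOv5-style data configs
    Path(f"{name}.txt").write_text("\n".join(str(f) for f in files))
```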

3.3. Data Pre-Processing

We applied adaptive image scaling and mosaic data enhancement to pre-process the dataset. Because many original images have different aspect ratios, they need to be scaled and padded before being fed into the model. If the sides are padded with too many black borders, redundant information is introduced and the training speed is affected. Therefore, we use adaptive image scaling, which adds the smallest possible black border to each original image and thus speeds up model training. Mosaic data enhancement randomly selects four original images, which are scaled, cropped, and arranged, and then stitches them into a new image. This data enhancement method can effectively boost the ability to detect small targets. Figure 10 shows the two types of data preprocessing.
Figure 10. Effect of adaptive image scaling and mosaic data enhancement.
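The sketch below illustrates the idea behind adaptive image scaling: the image is resized with its aspect ratio preserved and then padded only up to the next multiple of the network stride, so as few border pixels as possible are added. The 640-pixel target size, stride of 32, and gray padding value are illustrative defaults rather than the exact settings of our experiments, and this is a simplified version of the preprocessing used by YOLOv5.

```python
import cv2
import numpy as np

def letterbox(img, new_size=640, stride=32, pad_value=114):
    """Resize keeping the aspect ratio, then pad only to the next stride multiple."""
    h, w = img.shape[:2]
    scale = min(new_size / h, new_size / w)        # fit the longer side into new_size
    nh, nw = round(h * scale), round(w * scale)
    resized = cv2.resize(img, (nw, nh))
    ph = (stride - nh % stride) % stride           # minimal padding needed in height
    pw = (stride - nw % stride) % stride           # minimal padding needed in width
    top, left = ph // 2, pw // 2
    canvas = np.full((nh + ph, nw + pw, 3), pad_value, dtype=img.dtype)
    canvas[top:top + nh, left:left + nw] = resized
    return canvas, scale, (left, top)              # scale and offsets map boxes back

# Example: an original 1080p frame becomes a 384x640 padded input.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
padded, scale, (dx, dy) = letterbox(frame)
```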

3.4. Evaluation Metrics

In this study, AP and mAP are used as the evaluation metrics of the model. The average precision takes into account both the precision (P) and recall (R) of the model. FLOPs, the number of parameters, and FPS are used to assess the model's size and speed. The equations for precision, recall, AP, and mAP are as follows.
P = \frac{TP}{TP + FP}
R = \frac{TP}{TP + FN}
AP = \int_{0}^{1} P(R) \, dR
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i
The terms TP, FP, and FN indicate the numbers of objects that were correctly detected, wrongly detected, and missed, respectively. P is the precision, which indicates how many of the vehicles predicted to belong to a certain category actually belong to that category. R is the recall, which shows the proportion of vehicles of a category in the dataset that are correctly detected. Precision thus focuses on the correctness of the detected vehicle categories, while recall pursues the detection of as many vehicles of a particular type as possible. AP is the area under the P-R curve for a single class. Finally, mAP is the average AP over all categories and is a comprehensive measure of detection performance.
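As a purely numerical illustration of these definitions (none of the values below come from our experiments), suppose one category has 80 correct detections, 20 false detections, and 20 missed vehicles; precision and recall are then both 0.8, and AP can be approximated by numerically integrating a P-R curve:

```python
import numpy as np

tp, fp, fn = 80, 20, 20                      # hypothetical counts for one category
precision = tp / (tp + fp)                   # 0.8: fraction of detections that are correct
recall = tp / (tp + fn)                      # 0.8: fraction of ground-truth vehicles found

# AP is the area under the P-R curve; here we integrate an illustrative curve.
recall_pts = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision_pts = np.array([1.0, 0.95, 0.90, 0.85, 0.70, 0.50])
ap = np.trapz(precision_pts, recall_pts)     # ~0.83 for this made-up curve

# mAP would then be the mean of the per-class AP values.
print(precision, recall, ap)
```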

4. Results

4.1. Ablation Experiment

Three structures are explored to improve the YOLOv5s algorithm in this paper. The first is the addition of a new detection layer, P2, to enhance the recognition capability for small target vehicles. The second is the introduction of a de-weighted BiFPN to make the feature fusion process more reasonable and effective. The third is the use of Soft-NMS as the prediction box filtering algorithm to improve the detection performance for overlapping and occluded vehicles. We designed the corresponding ablation experiments to verify the effectiveness of YOLOv5 after adding the different modules, and the results are shown in Table 2. As the data in the table show, the number of parameters and the computation of the model increase modestly compared with the baseline model after adding the P2 detection layer. By further introducing BiFPN, however, the number of parameters and calculations are reduced significantly while the accuracy is maintained. The three improvement strategies are combined to produce the improved model, YOLOv5-VTO. While the addition of Soft-NMS reduces the AP of “car” compared to using only P2 and BiFPN, the AP of the remaining categories is improved. Because the model has already achieved excellent detection performance for “car”, we consider that “van”, “truck”, and “bus” are in greater need of a boost in AP. In addition, the substantial improvement in mAP also indicates that the introduction of Soft-NMS plays a great role in enhancing the comprehensive performance of the model. It is also clear in Figure 7 that Soft-NMS does decrease the missed detection of closely arranged vehicles.
Table 2. The comparison of the performance with different modules.
Compared with the benchmark model, the two comprehensive indexes of mAP@0.5 and mAP@0.5:0.95 are improved by 3.7% and 4.7%, respectively, effectively improving the accuracy of aerial vehicle detection. Although there is a small increase in the number of parameters and computation compared with the benchmark model, it is discovered that the modified model can still satisfy the requirements of real-time detection in the following comparative experiments. The ablation experiments demonstrate that the approach used in this paper is excellent in the UAV aerial vehicle detection task, outperforming the base model in scenarios with tiny targets and more overlapping occluded objects.
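Since BiFPN is one of the modules evaluated in the ablation above, a minimal PyTorch sketch of its weighted feature fusion [37] is given here for reference. The two-input node, channel count, and feature-map size are illustrative; the actual node wiring and channel widths of YOLOv5-VTO are not reproduced.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion of same-shaped feature maps."""

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))   # learnable fusion weights
        self.eps = eps

    def forward(self, *feats: torch.Tensor) -> torch.Tensor:
        w = torch.relu(self.w)                 # keep the weights non-negative
        w = w / (w.sum() + self.eps)           # normalize so they sum to roughly 1
        return sum(wi * f for wi, f in zip(w, feats))

# Example: fuse a shallow and a deep feature map that share the same shape.
fuse = WeightedFusion(num_inputs=2)
p_shallow = torch.randn(1, 128, 80, 80)
p_deep = torch.randn(1, 128, 80, 80)
fused = fuse(p_shallow, p_deep)                # shape: (1, 128, 80, 80)
```

The learned weights let each fusion node favor whichever input scale carries more useful information, which is how BiFPN helps retain the shallow detail needed for small vehicles.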
Throughout the training procedure, the YOLOv5 and the YOLOv5-VTO models in this article use the same dataset and parameter settings. The mAP and loss comparison graphs of the two models are plotted according to the log files saved during the training process, as shown in Figure 11.
Figure 11. Comparison of the training curve between our model and YOLOv5s. (a) The loss function curve of the training set; (b) The loss function curve of the validation set; (c) The mAP curve.
Figure 11c shows that the model obtained a higher mAP after improvement, while Figure 11a,b illustrate that there is no obvious overfitting problem in the training process. Furthermore, compared to the baseline model, the overall loss values of the improved model on the training and validation sets are much lower.
We plotted the P-R curves shown in Figure 12 based on the precision and recall logs generated during the training of the models. Because Soft-NMS is a prediction box optimization method applied in the prediction phase, it is not involved in the model training process. Therefore, Figure 12B is the curve obtained by training the model with only P2 and BiFPN added. The area under the P-R curve indicates the AP of a category, so the closer the curve is to the upper right corner, the better the overall performance of the algorithm.
Figure 12. PR curve comparison: (A) PR curve of YOLOv5s and (B) PR curve of improved YOLOv5s.
It is not difficult to see from the P-R curve that precision and recall have an inverse relationship. This is because when the model pursues a high precision, it becomes more conservative in its predictions: some low-confidence samples are no longer predicted as positives, and recall decreases accordingly. The relative importance of these two metrics differs across scenarios, so a trade-off between precision and recall has to be made according to the needs of the specific problem.
By comparing the PR curves before and after the improvement in Figure 12, we can see that the model’s detection capability is enhanced, especially for the “truck” and “bus” categories. However, compared with the “car” category, the performance of the updated model on the “truck” and “van” categories still needs to be improved; the AP of the best and worst detected categories is 0.902 and 0.579, respectively. The reason is that there are fewer target instances of the “truck” and “van” categories in the dataset than of the “car” category. In addition, “truck” covers many vehicle shapes, resulting in complex and variable feature information, which increases the difficulty of detection. As a result, in the next stage we will continue looking for approaches to boost the detection performance of the model, such as data supplementation and enhancement for the relevant categories.

4.2. Comparative Experiment

We compare YOLOv5-VTO with a series of target detection algorithms, including YOLOv5s, Faster-RCNN, SSD, YOLOv3-tiny, YOLOv7-tiny [43], and EfficientDet-D0, to further evaluate the advantages of the proposed algorithm for the vehicle detection task. All models involved in the comparison were trained and validated using the same dataset, and the experimental data are presented in Table 3.
Table 3. Comparison of detection performance of different algorithms.
The experimental results of the different algorithms in Table 3 show that the YOLOv5-VTO algorithm proposed in this paper achieves the highest mAP among the compared mainstream detection models. Compared with the benchmark YOLOv5s, the proposed model significantly improves mAP@0.5, mAP@0.5:0.95, precision, and recall while keeping the detection speed nearly unchanged.
As a representative of the anchor-free detection models, the EfficientDet algorithm has room for improvement in mAP. On the other hand, the two-stage detection algorithm Faster-RCNN is slower owing to the need to extract feature vectors from candidate regions during the testing phase. The single-stage SSD algorithm achieves a high precision; however, the lack of low-level feature convolution layers in SSD leads to inadequate features being extracted from small target vehicles, resulting in many missed detections and low recall. Given the need for real-time detection, two lightweight versions, YOLOv3-tiny and YOLOv7-tiny, are selected for comparison in this study. According to the experimental data, YOLOv7-tiny achieved good results in terms of recall and FPS, yet the model proposed in this paper still has advantages in several other metrics, such as mAP, especially mAP@0.5:0.95. YOLOv3-tiny lags significantly behind YOLOv5-VTO in all indexes except FPS. Although our model is slower than these two algorithms in terms of detection speed, it may still satisfy the demand for real-time detection.
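For reference, the FPS figures compared above can be obtained with a simple timing loop such as the sketch below; the model object, input resolution, and run counts are placeholders, and measured speed depends on hardware, batch size, and post-processing such as NMS.

```python
import time
import torch

def measure_fps(model, img_size=640, n_warmup=10, n_runs=100, device="cuda"):
    """Average single-image inference throughput in frames per second."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, img_size, img_size, device=device)
    with torch.no_grad():
        for _ in range(n_warmup):              # warm-up iterations are not timed
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()           # make sure queued GPU work is finished
        start = time.time()
        for _ in range(n_runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return n_runs / (time.time() - start)
```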
In summary, compared with the other models, the proposed model achieves a remarkable overall performance and a good balance between detection accuracy and detection speed, which verifies its effectiveness.

4.3. Visualizing the Detection Performance of Different Models

To evaluate the model more intuitively, Figure 13 compares the detection results before and after the modification. Figure 13A shows that YOLOv5-VTO reduces the false detection of vehicles. Comparing the results of groups B and C shows that the revised model decreases the rate of missed detection and remains effective even in scenes with insufficient light. From the contrast results of groups D and E, it can be seen that the detection performance of the proposed model for tiny targets is improved. These visualization results show that our model achieves better detection performance for tiny and obscured vehicles in aerial images.
Figure 13. Comparison of YOLOv5s algorithm detection results before and after improvement. (A) Improved model reduces false detections; (B) Mitigates missed detections in low-light scenes; (C) Improved model reduces missed detections; (D) Improved model enhances the detection performance for small targets; (E) Improved model reduces missed detection of mutually obscuring vehicles.

5. Conclusions and Future Works

We propose an enhanced model, YOLOv5-VTO, based on YOLOv5s to improve the detection performance for obscured and tiny vehicles in aerial images. First, a new detection branch, P2, which can discover tiny targets accurately, is added to the three detection layers of the baseline model. Then, the bidirectional feature pyramid network (BiFPN) replaces the PAN structure of the original model to fuse feature information from multiple scales more effectively and to reduce conflicts between features of different scales. Finally, by visualizing the detection results, we find that the Soft-NMS algorithm works well in scenarios where vehicles occlude each other.
The experimental results indicate that the improved algorithm is more effective than the original YOLOv5s algorithm. Further, the detection speed can still reach 30 FPS, which meets the demands of real-time detection. Although Soft-NMS improves the detection of obscured vehicles, it also slightly reduces the AP of some categories, such as “car”. Therefore, our following research will focus on how to mitigate this side effect of introducing Soft-NMS. Furthermore, many factors still limit the detection speed of the model, such as the large number of vehicles in an image, changing lighting conditions, the selection of the anchor box, the setting of the confidence threshold, and the deployment of high-performance hardware devices. Therefore, in future work, we will explore how to satisfy real-time detection applications under such constraints.

Author Contributions

Conceptualization, S.L.; methodology, S.L.; software, X.Y.; validation, X.Y. and X.L.; formal analysis, X.Y.; investigation, S.L.; resources, S.L., X.Y. and X.L.; data curation, S.L. and X.L.; writing—original draft preparation, S.L.; writing—review and editing, S.L., X.Y. and X.L.; visualization, S.L. and Y.Z.; supervision, X.Y., J.W. and X.L.; project administration, S.L., X.Y., X.L., Y.Z. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data used in this paper were derived from the following sources available in the public domain [42]: VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results.

Acknowledgments

We are grateful to the reviewers for their suggestions for this paper.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this paper:
YOLO: You Only Look Once
IoU: Intersection over Union
HOG: Histogram of Oriented Gradients
SIFT: Scale-Invariant Feature Transform
FPN: Feature Pyramid Network
PANet: Path Aggregation Network
UAV: Unmanned Aerial Vehicle
NMS: Non-Maximum Suppression
AP: Average Precision
mAP: Mean Average Precision
SVM: Support Vector Machine
SSD: Single Shot Detector
FPS: Frames Per Second
FLOPs: Floating-Point Operations
CBS: Conv BN SiLU
TP: True Positives
FP: False Positives
FN: False Negatives

References

  1. Xiong, J.; Liu, Z.; Chen, S.; Liu, B.; Zheng, Z.; Zhong, Z.; Yang, Z.; Peng, H. Visual detection of green mangoes by an unmanned aerial vehicle in orchards based on a deep learning method. Biosyst. Eng. 2020, 194, 261–272.
  2. Byun, S.; Shin, I.-K.; Moon, J.; Kang, J.; Choi, S.-I. Road traffic monitoring from UAV images using deep learning networks. Remote Sens. 2021, 13, 4027.
  3. Peng, X.; Zhong, X.; Zhao, C.; Chen, A.; Zhang, T. A UAV-based machine vision method for bridge crack recognition and width quantification through hybrid feature learning. Constr. Build. Mater. 2021, 299, 123896.
  4. Jung, H.K.; Choi, G.S. Improved YOLOv5: Efficient object detection using drone images under various conditions. Appl. Sci. 2022, 12, 7255.
  5. Bouguettaya, A.; Zarzour, H.; Kechida, A.; Taberkit, A.M. Vehicle detection from UAV imagery with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6047–6067.
  6. Ali, B.S. Traffic management for drones flying in the city. Int. J. Crit. Infrastruct. Prot. 2019, 26, 100310.
  7. Srivastava, S.; Narayan, S.; Mittal, S. A survey of deep learning techniques for vehicle detection from UAV images. J. Syst. Architect. 2021, 117, 102152.
  8. Qu, Y.; Jiang, L.; Guo, X. Moving vehicle detection with convolutional networks in UAV videos. In Proceedings of the 2016 2nd International Conference on Control, Automation and Robotics (ICCAR), Hong Kong, China, 28–30 April 2016; pp. 225–229.
  9. Tang, T.; Zhou, S.; Deng, Z.; Zou, H.; Lei, L. Vehicle Detection in Aerial Images Based on Region Convolutional Neural Networks and Hard Negative Example Mining. Sensors 2017, 17, 336.
  10. Qu, T.; Zhang, Q.; Sun, S. Vehicle detection from high-resolution aerial images using spatial pyramid pooling-based deep convolutional neural networks. Multimed. Tools Appl. 2017, 76, 21651–21663.
  11. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-Captured Scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788.
  12. Xu, Y.; Yu, G.; Wang, Y.; Wu, X.; Ma, Y. A Hybrid Vehicle Detection Method Based on Viola-Jones and HOG plus SVM from UAV Images. Sensors 2016, 16, 1325.
  13. Moranduzzo, T.; Melgani, F. Detecting Cars in UAV Images With a Catalog-Based Approach. IEEE Trans. Geosci. Remote Sens. 2014, 52, 6356–6367.
  14. Jin, X.; Li, Z.; Yang, H. Pedestrian detection with YOLOv5 in autonomous driving scenario. In Proceedings of the 2021 5th CAA International Conference on Vehicular Control and Intelligence (CVCI), Tianjin, China, 29–31 October 2021; pp. 1–5.
  15. Tutsoy, O. Pharmacological, Non-Pharmacological Policies and Mutation: An Artificial Intelligence Based Multi-Dimensional Policy Making Algorithm for Controlling the Casualties of the Pandemic Diseases. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 9477–9488.
  16. Kellenberger, B.; Marcos, D.; Tuia, D. Detecting mammals in UAV images: Best practices to address a substantially imbalanced dataset with deep learning. Remote Sens. Environ. 2018, 216, 139–153.
  17. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
  18. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99.
  19. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. In Proceedings of the Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 5–10 December 2016.
  20. Singh, C.H.; Mishra, V.; Jain, K.; Shukla, A.K. FRCNN-Based Reinforcement Learning for Real-Time Vehicle Detection, Tracking and Geolocation from UAS. Drones 2022, 6, 406.
  21. Ou, Z.; Wang, Z.; Xiao, F.; Xiong, B.; Zhang, H.; Song, M.; Zheng, Y.; Hui, P. AD-RCNN: Adaptive Dynamic Neural Network for Small Object Detection. IEEE Internet Things J. 2023, 10, 4226–4238.
  22. Kong, X.; Zhang, Y.; Tu, S.; Xu, C.; Yang, W. Vehicle Detection in High-Resolution Aerial Images with Parallel RPN and Density-Assigner. Remote Sens. 2023, 15, 1659.
  23. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
  24. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  25. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
  26. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
  27. Bochkovskiy, A.; Wang, C.Y.; Liao, H. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
  28. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. Lect. Notes Comput. Sci. 2016, 9905, 21–37.
  29. Yin, Q.; Yang, W.; Ran, M.; Wang, S. FD-SSD: An improved SSD object detection algorithm based on feature fusion and dilated convolution. Signal Process. Image Commun. 2021, 98, 116402.
  30. Lin, T.; Su, C. Oriented Vehicle Detection in Aerial Images Based on YOLOv4. Sensors 2022, 22, 8394.
  31. Ammar, A.; Koubaa, A.; Ahmed, M.; Saad, A.; Benjdira, B. Vehicle Detection from Aerial Images Using Deep Learning: A Comparative Study. Electronics 2021, 10, 820.
  32. Zhang, R.; Newsam, S.; Shao, Z.; Huang, X.; Wang, J.; Li, D. Multi-scale adversarial network for vehicle detection in UAV imagery. ISPRS J. Photogramm. Remote Sens. 2021, 180, 283–295.
  33. Jocher, G. YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 8 November 2022).
  34. Niu, C.; Li, K. Traffic Light Detection and Recognition Method Based on YOLOv5s and AlexNet. Appl. Sci. 2022, 12, 10808.
  35. Sun, Y.; Li, M.; Dong, R.; Chen, W.; Jiang, D. Vision-Based Detection of Bolt Loosening Using YOLOv5. Sensors 2022, 22, 5184.
  36. Yan, B.; Fan, P.; Lei, X.; Liu, Z.; Yang, F. A Real-Time Apple Targets Detection Method for Picking Robot Based on Improved YOLOv5. Remote Sens. 2021, 13, 1619.
  37. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
  38. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS—Improving Object Detection with One Line of Code. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5562–5570.
  39. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
  40. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
  41. Lin, T.; Maire, M.; Belongie, S. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
  42. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 213–226.
  43. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696.
