Aerial Images Processing for Car Detection using Convolutional Neural Networks: Comparison between Faster R-CNN and YoloV3

In this paper, we address the problem of car detection from aerial images using Convolutional Neural Networks (CNN). This problem presents additional challenges as compared to car (or any object) detection from ground images because features of vehicles from aerial images are more difficult to discern. To investigate this issue, we assess the performance of two state-of-the-art CNN algorithms, namely Faster R-CNN, which is the most popular region-based algorithm, and YOLOv3, which is known to be the fastest detection algorithm. We analyze two datasets with different characteristics to check the impact of various factors, such as UAV's altitude, camera resolution, and object size. A total of 39 training experiments were conducted to account for the effect of different hyperparameter values. The objective of this work is to conduct the most robust and exhaustive comparison between these two cutting-edge algorithms on the specific domain of aerial images. By using a variety of metrics, we show that YOLOv3 yields better performance in most configurations, except that it exhibits a lower recall and less confident detections when object sizes and scales in the testing dataset differ largely from those in the training dataset.


Introduction
Unmanned aerial vehicles (UAVs) are nowadays a key enabling technology for a large number of applications such as surveillance [1], tracking [2], disaster management [3], smart parking [4], and Intelligent Transportation Systems [5], to name a few. Thanks to their versatility, UAVs offer unique capabilities in collecting visual data using high-resolution cameras from different locations, angles, and altitudes. These capabilities provide rich datasets of images that can be analyzed to extract useful information that serves the purpose of the underlying applications. Compared to ground images, UAV aerial imagery collection presents several advantages, including a large field of view, high spatial resolution, flexibility, and high mobility. Although satellite imagery also provides a bird's eye view of the earth, UAV-based aerial imagery presents several advantages as compared to satellite imagery. In fact, UAV imagery has a much lower cost and provides more updated views (many satellite maps are several months old and do not present recent changes). In addition, it can be used for studies applied to other object detections. Section 3 sets forth the theoretical background of the three algorithms. Section 4 describes the datasets and the obtained results. Finally, Section 5 draws the main conclusions of this study.

Related Works
Various techniques have been proposed in the literature to solve the problem of car detection in aerial images and similar related issues. The main challenge is the small size and the large number of objects in aerial views, which may lead to information loss when performing convolution operations, as well as a difficulty to discern features because of the angle of view. There are specific challenges for each type of aerial imagery (fixed CCTV cameras, satellite, or UAV), due to their disparate level of resolution. We present here the most recent, relevant works in object detection for each of these three imagery types, and we then highlight the value added of the present work.

Fixed Surveillance Cameras
Xi et al. [4] addressed the problem of vehicle detection from overhead surveillance images. They proposed a multi-task approach based on the Faster R-CNN algorithm to which they added a cost-sensitive loss. The main idea was to subdivide the object detection task into simpler subtasks with enlarged objects, thus improving the detection of small objects that are frequent in aerial views. In addition, the cost-sensitive loss gives more importance to objects that are difficult to detect or occluded because of a complex background and aims at improving the overall performance. Their method outperformed state-of-the-art techniques on their own specific, private dataset that was collected from surveillance cameras placed on top of buildings surrounding a parking lot. However, their approach has not been tested on other datasets, nor on UAV images.
In a similar application, Kim et al. [22] compared various implementations of CNN-based object detectors, namely YOLO (see Section 3.2.1), the Single Shot MultiBox Detector (SSD), the region-based convolutional neural network (R-CNN), the region-based Fully Convolutional Neural Network (R-FCN), and SqueezeDet [23] (based on a Fully Convolutional Neural Network). They applied these algorithms on the problem of person detection, and trained and tested them on their own in-house dataset composed of images that were captured by surveillance cameras in retail stores. They found that YOLOv3 (with a 416 input size) and SSD (with a VGG-500 feature extractor) [24] provide the best tradeoff between accuracy and response latency.
In [25], Hardjono et al. investigated the problem of automatic vehicle counting in CCTV images collected from four datasets with various resolutions. On the one hand, they tested two classical image processing techniques: Background Subtraction (which calculates a foreground mask by subtracting a background model from the image) and the Viola Jones Algorithm [26] (combining Haar-like Features, Integral Images, AdaBoost Algorithm [27], and Cascading Classifier), with Median or Gaussian Filters. On the other hand, they also applied deep learning neural networks, namely YOLOv2 [28] and FCRN (Fully Convolutional Regression Network) [29]. Their results show that deep learning techniques yield markedly better detection results (in terms of F1 score) when applied on higher resolution datasets.

Satellite Imagery
Chen et al. [30] applied a technique based on a Hybrid Deep Convolutional Neural Network (HDNN) and a sliding window search to solve the vehicle detection problem from Google Earth images. The maps of particular layers of the CNN (last convolutional layer and max-pooling layer) are split into blocks of variable field sizes, so as to be able to extract features of various scales. In addition, they modified the sliding windows to contain the main part of the vehicle to be detected. Thus, they obtained an improved detection rate compared to the traditional deep architectures at that time, but at the expense of a high execution time (7 s per image, using a GPU).
For the aim of car counting, Mundhenk et al. [6] built their own Cars Overhead with Context (COWC) dataset containing 32,716 unique cars and 58,247 negative targets, standardized to a resolution of 15 cm per pixel, and annotated using single pixel points. The authors used a Convolutional Neural Network that they called ResCeption, based on Inception synthesized with Residual Learning. The main modification to the Inception architecture is the substitution of 1 × 1 convolutions by residual projection shortcuts. The model was able to count the number of cars in test patches with a root mean square error of 0.66 at 1.3 FPS (frames per second).

UAV Imagery
Relatively few works have addressed the problem of car detection from UAV images. Ammour et al. [31] used a pre-trained CNN coupled with a linear support vector machine (SVM) classifier to detect and count cars in high-resolution UAV images of urban areas. First, the input image is segmented into candidate regions using the mean-shift algorithm. The VGG16 [32] CNN model is then applied to windows that are extracted around each candidate region to generate descriptive features that are subsequently classified using a linear SVM binary model. Finally, they applied a fine-tuning morphological dilation for smoothing the detected regions. This multi-stage technique achieved state-of-the-art performance on a reduced testing dataset (5 images containing 127 car instances), but it still falls short of real-time processing, mainly due to the high computational cost of the mean-shift segmentation stage.
Liu and Mattyus [8] focused on fine-grained car detection. They used a soft-cascade structure of integral channel features [33] to classify car orientations and types (car or truck) in a dataset of aerial images of the city of Munich consisting of 20 images taken at an altitude of 1000 m with a resolution of 5616 × 3744 and a GSD (Ground Sampling Distance) of 13 cm. They obtained an accuracy of 98% at a processing time of 4.4 s per image, which is faster than traditional techniques such as Viola Jones, but still far from real time. Such classification can be used for the urban planning, traffic management, census estimation, and sociological analysis of cities and countries. Table 1 summarizes the datasets, algorithms, and results of the most similar related works on car detection, compared to the present paper. The closest work to the present study is that of Benjedira et al. [1] who presented a performance evaluation of Faster R-CNN and YOLOv3 algorithms, on a reduced UAV imagery dataset of cars. The present paper is an improvement over this work from several aspects:

Our Contribution
(1) We use two datasets with different characteristics for training and testing, whereas most previous works described above tested their technique on a single proprietary dataset. We show that annotation errors in the dataset have an important effect on the detection performance. (2) We added a third algorithm (YOLOv4) to the comparative analysis.

Theoretical Overview of Faster R-CNN and YOLO Architectures
Object detection is an old fundamental problem in image processing, for which various approaches have been applied. However, since 2012, deep learning techniques have markedly outperformed classical ones. The object detection algorithms based on deep learning are classified into two large branches: two-stage detectors and one-stage detectors. From each of these two branches, we selected, in this study, the best performing algorithms. In the first branch, we selected Faster R-CNN [19], which is the most representative model from the two-stage family, according to [34]. In the second branch, we selected the YOLO algorithm and picked out its most recent versions: YOLOv3 [20] and YOLOv4 [21]. The selected algorithms have been proven successful in terms of accuracy and speed in a wide variety of applications.

Two-Stage Detector: Faster R-CNN
R-CNN, as coined by [35], is a Convolutional Neural Network (CNN) combined with a region-proposal algorithm that hypothesizes object locations. It initially extracts a fixed number of regions (2000), by means of a selective search. It then merges similar regions together, using a greedy algorithm, to obtain the candidate regions on which the object detection will be applied. Afterwards, the same authors proposed an enhanced algorithm called Fast R-CNN [36] by using a shared convolutional feature map that the CNN generates directly from the input image, and from which the regions of interest (RoI) are extracted. Finally, Ren et al. [19] proposed a Faster R-CNN algorithm that introduced a Region Proposal Network (RPN), which is a dedicated fully convolutional neural network that is trained end-to-end (Figure 1) to predict both object bounding boxes and objectness scores in an almost computationally cost-free manner (around 10 ms per image). This important algorithmic change thus replaced the selective search algorithm, which was very computationally expensive and represented a bottleneck for previous object detection deep learning systems. As a further optimization, the RPN ultimately shares the convolutional features with the Fast R-CNN detector, after first being independently trained. For training the RPN, Faster R-CNN kept the multi-task loss function already used in Fast R-CNN [36].
Faster R-CNN uses three scales and three aspect ratios for every sliding position, and is translation-invariant. In addition, it conserves the aspect ratio of the original image while resizing it, so that one of its dimensions is 1024 or 600.
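As a rough illustration of the two rules above, the following Python sketch resizes an image size while conserving its aspect ratio and enumerates the k = 9 anchor shapes at one sliding position. The function names are hypothetical, and the anchor scales (128, 256, and 512 pixels) and aspect ratios (0.5, 1, and 2) are assumed from the original Faster R-CNN paper [19]; the configuration actually used in this study may differ.

```python
import math


def resize_keep_aspect(width, height, min_dim=600, max_dim=1024):
    """Return (new_width, new_height) conserving the aspect ratio.

    The smaller side is scaled to min_dim, unless this would push the
    larger side beyond max_dim, in which case the larger side is scaled
    to max_dim instead.
    """
    small, large = min(width, height), max(width, height)
    scale = min_dim / small
    if large * scale > max_dim:
        scale = max_dim / large
    return round(width * scale), round(height * scale)


def make_anchor_shapes(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the (width, height) of every anchor at one sliding position.

    Each anchor keeps an area of scale**2 while its height/width ratio
    takes each of the given values (assumed values, not from this paper).
    """
    shapes = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            shapes.append((w, h))
    return shapes


print(resize_keep_aspect(1920, 1080))  # a wide frame is limited by max_dim
print(len(make_anchor_shapes()))       # 3 scales x 3 ratios
```

With these assumed defaults, a 1920 × 1080 frame is limited by its largest dimension, whereas an 800 × 600 image already satisfies the rule and is left unchanged.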

One-Stage Detectors
We have considered two networks of the one-stage category: YOLOv3 and YOLOv4. We first describe the architecture of YOLOv3 and then briefly enumerate the enhancements made in YOLOv4.

YOLOv3
Contrary to R-CNN variants, YOLO [37], which is an acronym for You Only Look Once, does not extract region proposals, but processes the complete input image only once using a Fully Convolutional Neural Network that predicts the bounding boxes and their corresponding class probabilities, based on the global context of the image. The first version was published in 2016. Later on in 2017, a second version, YOLOv2 [28], was proposed, which introduced batch normalization, a retuning phase for the classifier network, and dimension clusters as anchor boxes for predicting bounding boxes. Finally, in 2018, YOLOv3 [20] improved the detection further by adopting several new features:
• Replacing the mean squared error by cross-entropy for the loss function. The cross-entropy loss function is calculated as follows:
$$-\sum_{c=1}^{M} \delta_{x \in c} \, \log(p(x \in c))$$
where M is the number of classes, c is the class index, x is an observation, δ_{x∈c} is an indicator function that equals 1 when c is the correct class for the observation x, and log(p(x ∈ c)) is the natural logarithm of the predicted probability that observation x belongs to class c.
• Using logistic regression (instead of the softmax function) for predicting an objectness score for every bounding box.
• Using a significantly larger feature extractor network with 53 convolutional layers (Darknet-53 replacing Darknet-19). It consists mainly of 3 × 3 and 1 × 1 filters, with some skip connections (Figure 2) inspired by ResNet [38].
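The cross-entropy definition given in the list above (the negative sum, over the M classes, of the indicator multiplied by the log-probability) can be sketched in a few lines of Python. This is only a minimal illustration of that definition for a single observation; the function name and the example probabilities are hypothetical, and it is not the loss implementation of any particular YOLO framework.

```python
import math


def cross_entropy(probabilities, correct_class):
    """Cross-entropy of one observation x.

    probabilities[c] is the predicted probability p(x in c) for class c,
    and correct_class is the only index for which the indicator equals 1.
    """
    loss = 0.0
    for c, p in enumerate(probabilities):
        indicator = 1.0 if c == correct_class else 0.0
        if indicator:
            # Only the correct class contributes to the sum.
            loss -= indicator * math.log(p)
    return loss


# Hypothetical prediction over M = 3 classes, class 1 being the correct one.
print(cross_entropy([0.25, 0.5, 0.25], correct_class=1))
```

Since the indicator is zero for every class but the correct one, the loss reduces to the negative log-probability assigned to the correct class, so a confident correct prediction yields a loss close to zero.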
Contrary to Faster R-CNN's approach, each ground-truth object in YOLOv3 is assigned only one bounding box prior. These successive variants of YOLO were developed with the objective of obtaining a maximum mAP while keeping the fastest execution that makes it suitable for real-time applications. Special emphasis has been put on execution time, so that YOLOv3 is equivalent to state-of-the-art detection algorithms such as SSD [24] in terms of accuracy but with the advantage of being three times faster [20]. Figure 3 depicts the main stages of the YOLOv3 algorithm when applied to the car detection problem. Variable input sizes are allowed in YOLO. We have tested the three input sizes that are usually used (as in the original YOLOv3 paper [20]): 320 × 320, 416 × 416, and 608 × 608.
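To make the role of the input size more concrete, the sketch below computes the size of the detection grids for the three input sizes tested here. It assumes, following the YOLOv3 paper [20] rather than anything stated in this section, that predictions are made at three scales with strides of 32, 16, and 8 pixels, and with 3 box priors per grid cell; the function name is hypothetical.

```python
def yolo_grid_summary(input_size, strides=(32, 16, 8), priors_per_cell=3):
    """Return the grid sizes and the total number of predicted boxes.

    The strides and the number of priors per cell are assumptions taken
    from the YOLOv3 paper, not values reported in this study.
    """
    if input_size % 32 != 0:
        raise ValueError("the input size must be a multiple of 32")
    grids = [input_size // stride for stride in strides]
    total_boxes = sum(g * g * priors_per_cell for g in grids)
    return grids, total_boxes


for size in (320, 416, 608):
    grids, boxes = yolo_grid_summary(size)
    print(size, grids, boxes)
```

Under these assumptions, moving from 320 × 320 to 608 × 608 multiplies the number of candidate boxes by more than three, which is consistent with the higher computational cost of larger input sizes.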

YOLOv4
YOLOv4 [21] was introduced after two years of cumulative improvements over YOLOv3 [20], leveraging the recent advances in deep learning. It achieves an accuracy of 43.5% AP on the MS COCO dataset compared to 33.0% AP for YOLOv3. This high accuracy is made while keeping a very efficient inference time (65 FPS on Tesla V100). YOLOv4 aims to make object detection run efficiently and smoothly on the low-cost hardware provided on most edge devices.
Concerning the technical improvements made in YOLOv4, they are classified into two categories. The first category is named the Bag of Freebies (BoF) and designates improvements that can be made during training without affecting the inference time. This includes Cut-Mix [39] and Mosaic data augmentation techniques, DropBlock regularization [40], class label smoothing, Complete IoU (CIoU) loss [41], Cross mini-Batch Normalization (CmBN) [42], Self Adversarial Training (SAT), multiple anchors for a single ground truth, cosine annealing scheduler [43], and optimal hyper-parameters obtained through genetic algorithms.
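Among the items listed above, the cosine annealing scheduler [43] is easy to illustrate. The following sketch assumes a single cosine cycle without warm restarts or warm-up, and uses the initial learning rate of 10⁻³ reported in this paper for YOLO as the maximum value; the function name and the minimum rate of zero are hypothetical, and the schedule implemented in the YOLOv4 training code may differ in its details.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max=1e-3, lr_min=0.0):
    """Learning rate at a given step of a single cosine cycle."""
    # Clamp the progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


for step in (0, 2500, 5000, 7500, 10000):
    print(step, cosine_annealing_lr(step, total_steps=10000))
```

The rate starts at its maximum, decreases slowly at first, then faster around the middle of training, and flattens again near the end.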
On the other hand, the second category is named Bag of Specials (BoS) and represents improvements that slightly affect the inference time while making a considerable increase in accuracy. This includes the Mish activation function [44], Cross Stage Partial connections (CSP) [45], Multi-input Weighted Residual Connection (MiWRC) [46], the Spatial Pyramid Pooling (SPP) block [47], the Spatial Attention Module (SAM) block [48], the Path Aggregation Network (PAN) block [49], and the Distance IoU Loss (DIoU) [41] used as a factor in the Non-Maximum-Suppression (NMS) step.
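As an example of one Bag of Specials component, the Mish activation function [44] is defined as x · tanh(softplus(x)), with softplus(x) = ln(1 + eˣ). The scalar sketch below is only meant to show its shape; the function name is hypothetical and a real network would apply it element-wise on tensors.

```python
import math


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    # softplus(x) = log(1 + exp(x)), written in a numerically stable form
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return x * math.tanh(softplus)


for value in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(value, mish(value))
```

Unlike ReLU, the function is smooth and lets small negative values through, while behaving almost like the identity for large positive inputs.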
To summarize, Table 2 compares the features and parameters of Faster R-CNN, YOLOv3, and YOLOv4. While successive optimizations and mutual inspirations made the methodology of the two architectures relatively close, the main difference remains that Faster R-CNN has two separate phases of region proposals and classification (although now with shared features), whereas YOLO has always combined the classification and bounding-box regression processes. Other feature extractors can also be incorporated.
Anchor-based Anchor-based

Number of anchor boxes
• YOLOv3: only one bounding-box prior for each ground-truth object.
• YOLOv4: multiple anchors for a single ground truth.
• Faster R-CNN: 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position.

Input size
• YOLOv3: different possible input sizes (n × n with n multiple of 32).
• YOLOv4: different possible input sizes (n × n with n multiple of 32).
• Faster R-CNN: conserves the aspect ratio of the original image; either the smallest dimension is 600, or the largest dimension is 1024.

Experimental Comparison between Faster R-CNN, YOLOv3, and YOLOv4
In this section, we will first describe the two datasets used for training and testing, and the hyperparameters chosen for each algorithm, and then present and discuss the results obtained.

Datasets
In order to obtain a robust comparison, we tested the Faster R-CNN, YOLOv3, and YOLOv4 algorithms on two datasets of aerial images showing completely different characteristics.
• The Stanford dataset [50] consists of a large-scale collection of aerial images and videos of a university campus containing various agents (cars, buses, bicycles, golf carts, skateboarders, and pedestrians). It was obtained using a 3DR SOLO quadcopter (equipped with a 4k camera) that flew over various crowded campus scenes, at an altitude of around 80 m. It is originally composed of eight scenes, but since we are exclusively interested in car detection, we chose only the three scenes that contain the largest percentage of cars: Nexus (in which 29.51% of objects are cars), Gates (1.08%), and DeathCircle (4.71%). All other scenes contain less than 1% of cars. We used the first two scenes for training and the third one for testing. In addition, we removed images that contain no cars. Table 3 shows the number of images and instances in the training and testing datasets. The images in the selected scenes have variable sizes, as shown in Table 4, and contain cars of various sizes, as depicted in Figure 4. The average car size (calculated based on the ground-truth bounding boxes) is shown in Table 5. The discrepancy observed between the training and testing datasets in terms of car sizes is explained by the fact that we used different scenes for the training and testing datasets, as explained above. This discrepancy will constitute an additional challenge for the considered object detection algorithms. Furthermore, we noticed that the ground-truth bounding boxes in some images contain some mistakes (bounding boxes containing no objects) and imprecisions (many bounding boxes are much larger than the objects inside them), as can be seen in Figure 5, but we used them as they are in order to assess the impact of annotation errors on detection performance. In fact, the Stanford Drone Dataset was not primarily designed for object detection, but for trajectory forecasting and tracking.
• The PSU dataset was collected from two sources: an open dataset of aerial images available on Github [51] and our own images acquired after flying a 3DR SOLO drone equipped with a GoPro Hero 4 camera, in an outdoor environment at a PSU parking lot. The drone recorded videos from which frames were extracted and manually labeled.
Since we are only interested in a single class, images with no cars were removed from the dataset. The training/testing split was made randomly. Table 3 shows the number of images and instances in the training and testing datasets. The dataset thus obtained contains images of different sizes, as shown in Table 6, and contains cars of various sizes, as depicted in Figure 4. The average car size (calculated based on the ground-truth bounding boxes) in the training and testing datasets is shown in Table 5. We have made this dataset available on [52].
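The two preparation steps described above (discarding images that contain no cars, then splitting the remaining images randomly) can be sketched as follows. The annotation layout, the function names, the 20% test fraction, and the choice to keep only the car boxes of each retained image are all assumptions made for illustration; the split ratio and tooling actually used for the PSU dataset are not stated here.

```python
import random


def keep_images_with_cars(annotations, target_class="car"):
    """Drop images without any car and keep only the car boxes.

    annotations maps an image name to a list of (class_name, box) tuples
    (hypothetical layout).
    """
    return {
        name: [obj for obj in objects if obj[0] == target_class]
        for name, objects in annotations.items()
        if any(obj[0] == target_class for obj in objects)
    }


def random_split(names, test_fraction=0.2, seed=0):
    """Randomly split image names into (train, test) lists."""
    names = sorted(names)          # fixed starting order for reproducibility
    rng = random.Random(seed)
    rng.shuffle(names)
    n_test = int(round(len(names) * test_fraction))
    return names[n_test:], names[:n_test]


annotations = {
    "frame_001.jpg": [("car", (10, 20, 60, 50)), ("bus", (5, 5, 90, 40))],
    "frame_002.jpg": [("pedestrian", (30, 30, 40, 60))],
    "frame_003.jpg": [("car", (100, 120, 150, 160))],
}
car_images = keep_images_with_cars(annotations)
train, test = random_split(car_images, test_fraction=0.5)
print(sorted(car_images), len(train), len(test))
```

Note that a purely random split keeps the training and testing distributions close to each other, in contrast with the scene-based split used for the Stanford dataset.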

Hyperparameters
The main hyperparameter for YOLOv3 and YOLOv4 networks is the input size, for which we tested three values (320 × 320, 416 × 416, and 608 × 608), as explained in Section 3.2.1.
On the other hand, the main hyperparameter for Faster R-CNN is the feature extractor. We tested two different feature extractors: Inception-v2 [53] (also called BN-inception in the literature [54]) and Resnet50 [38]. As explained in Section 3.1, the default setting of Faster R-CNN conserves the aspect ratio of the original image while resizing it, so that one of its dimensions is 1024 or 600. However, to be able to fairly compare its precision and speed with YOLO algorithms, which use fixed input sizes, we also tested Faster R-CNN with a fixed input size of 608 × 608, for each of the two feature extractors. These settings make a total of 10 classifiers that we trained and tested on the two datasets described above, which amounts to 20 experiments, summarized in Table 7. In these experiments, we kept the default values for the momentum (0.9), weight decay (0.0005), learning rate (initial rate of 10⁻³ for YOLOv3 and YOLOv4, 2 × 10⁻⁴ for Faster R-CNN with Inception-v2, and 3 × 10⁻⁴ with Resnet50), batch size (64 for YOLOv3 and YOLOv4, and 1 for Faster R-CNN), and anchor sizes (see Table 2). Furthermore, we conducted additional experiments with different values of learning rates (10⁻⁵, 10⁻⁴, 10⁻³, and 10⁻²) for each of the main algorithms (Faster R-CNN with Inception-v2, Faster R-CNN with Resnet50, YOLOv3, and YOLOv4 with the input size 416 × 416), on each of the two datasets. We trained each network for the number of iterations necessary for its convergence. We notice, for example, in Table 7 that YOLOv3 necessitated a higher number of iterations when using the largest input size (608 × 608) on the Stanford dataset, while it reached convergence after much fewer iterations when using the medium input size (416 × 416) on the same dataset. Meanwhile, YOLOv4 converges much faster in all configurations due to the use of the cosine annealing scheduler described in Section 3.2.2.
Nevertheless, the number of steps needed to reach convergence is non-deterministic and depends on the initialization of the weights.
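The count of 10 classifiers and 20 default experiments follows directly from the settings listed above, as the short enumeration below shows. The labels used for the configurations are only illustrative names.

```python
from itertools import product

# YOLOv3 and YOLOv4, each with three fixed input sizes
yolo_configs = [
    (algorithm, f"{size}x{size}")
    for algorithm, size in product(("YOLOv3", "YOLOv4"), (320, 416, 608))
]

# Faster R-CNN with two feature extractors, each with the default
# (aspect-ratio-conserving) input size and the fixed 608x608 input size
frcnn_configs = [
    (f"Faster R-CNN/{extractor}", input_size)
    for extractor, input_size in product(
        ("Inception-v2", "Resnet50"), ("variable", "608x608")
    )
]

classifiers = yolo_configs + frcnn_configs
datasets = ("Stanford", "PSU")
experiments = list(product(classifiers, datasets))

print(len(classifiers), len(experiments))  # 10 classifiers, 20 experiments
```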

Results and Discussion
For the experimental setup, we used a workstation powered by an Intel core i7-8700K (3.7 GHz) processor, with 32 GB RAM, and an NVIDIA GeForce 1080 (8 GB) GPU, running on Linux. We will first explain the metrics used for the evaluation, then discuss the results of each metric for each algorithm on each testing dataset described above. We also tested different learning rates and anchor scales in order to assess the algorithms' sensitivity to these hyperparameters. A total of 52 trainings have been conducted (20 experiments with default hyperparameters, 28 experiments with different learning rates, and 4 experiments with different anchor scales).

Metrics
The following metrics have been used to assess the results: • IoU: Intersection over Union measuring the overlap between the predicted and the ground-truth bounding boxes.
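For reference, the IoU between two axis-aligned bounding boxes can be computed as in the following minimal sketch. The (x_min, y_min, x_max, y_max) box format and the function name are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped to zero when the boxes do not overlap
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
print(iou(predicted, ground_truth))
```

A detection is then typically counted as a true positive when its IoU with a ground-truth box exceeds a chosen threshold, which is why a ground-truth box much larger than the object it contains lowers the measured IoU even for a well-placed prediction.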

Average Precision
When analyzing the results, it appears that all three tested algorithms gave a much better AP on the PSU dataset than on the Stanford dataset (Figure 6). This is mainly due to the fact that, contrary to the PSU dataset, the characteristics of the Stanford dataset differ largely between the training and testing images, as detailed in Section 4.1. This is the well-known problem of domain adaptation in machine learning [16]. The Stanford dataset contains 20 times more car instances than the PSU dataset (Table 3), whereas the performance of Faster R-CNN, YOLOv3, and YOLOv4 algorithms was respectively four, seven, and five times better on the PSU dataset, in terms of AP. This highlights the fact that the clarity of the features, the quality of annotation, and the representativity of the learning dataset are more important than the actual size of the dataset. However, Figure 7 shows that the number of false negatives (non-detected cars) is much higher than the number of false positives on the Stanford dataset (3 times higher for Faster R-CNN, 73 times higher for YOLOv3, and 66 times higher for YOLOv4), and much higher than the number of true positives, which indicates that most cars go undetected in the Stanford dataset, most likely due to the different size and aspect ratio of the cars in the testing images, compared to the training images. This is also visible in Figure 8, which illustrates the trade-off between precision and recall for different score thresholds. While the precision is close to 1 for YOLOv3 and YOLOv4, but significantly lower for Faster R-CNN, all the algorithms have a recall lower than 0.25 on the Stanford dataset. On the contrary, Figure 9 shows high values of recall for YOLOv3 and YOLOv4, and a slightly lower precision compared to Faster R-CNN, on the PSU dataset.
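The relationship between these counts and the reported metrics can be made explicit with a short sketch. Precision and recall follow their usual definitions; the AP function below uses an all-point interpolation of the precision-recall curve, which is one common convention and is assumed here rather than taken from the evaluation code of this study. All function names and numbers are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(scored_detections, num_ground_truth):
    """Area under the interpolated precision-recall curve.

    scored_detections is a list of (confidence, is_true_positive) pairs.
    """
    ranked = sorted(scored_detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_ground_truth, tp / (tp + fp)))

    # Replace each precision by the best precision at any higher recall
    best = 0.0
    envelope = []
    for recall, precision in reversed(points):
        best = max(best, precision)
        envelope.append((recall, best))
    envelope.reverse()

    ap, previous_recall = 0.0, 0.0
    for recall, precision in envelope:
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Made-up counts: few false positives but many missed cars
print(precision_recall(tp=10, fp=1, fn=90))
print(average_precision([(0.9, True), (0.8, False), (0.7, True)], 2))
```

The first call illustrates the situation described for the Stanford dataset: with very few false positives and many false negatives, the precision stays high while the recall collapses.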
Even though all three algorithms performed poorly on the Stanford dataset as compared to the PSU dataset, with less than 20% of AP, there is still a statistically significant difference between Faster R-CNN and YOLOv3 on this dataset. In fact, a t-test between the two sets of AP values of the two algorithms (for different IoU and score thresholds) yielded a p-value of 0.0020, which means that the null hypothesis (equality of the means of the two sets of AP values) can be rejected with a confidence of 99.8%. Meanwhile, the p-value between YOLOv3 and YOLOv4 AP values is 0.72, which means that the difference in performance between these two algorithms is not statistically significant, as opposed to the large improvement that Bochkovskiy et al. [21] obtained on the COCO dataset. This result may indicate that YOLOv4 has been specifically tuned for the COCO dataset and does not perform as well on other datasets in terms of AP.
Figures 11 and 12 show examples of YOLOv3 and Faster R-CNN misclassifications (all of them false negatives) on a sample image of the PSU dataset, respectively. YOLOv4 yields almost equivalent misclassifications, compared to YOLOv3.
Table 8 shows the average recall for a given maximum number of detections (as described in the introduction of Section 4.3), on the Stanford dataset. YOLOv4 (with medium and high input size) shows the best results in this metric, while the small input size (320 × 320) shows a markedly inferior performance for both YOLOv3 and YOLOv4. The fact that the columns AR max=10 and AR max=100 in this table are identical can be explained by the fact that very few images in the Stanford testing dataset contain more than 10 car instances. Nevertheless, we have kept this duplicated column to compare it to Table 9, which shows the same metrics on the PSU dataset.
YOLOv4 (with any input size) is significantly better in terms of the three metrics on this dataset, which indicates that YOLOv4 is better at detecting a high number of objects in a single image.
Figure 13 depicts the inference speed measured in frames per second (FPS), for each of the tested algorithms on both datasets. It shows that all configurations of YOLOv3 and YOLOv4 are significantly faster than Faster R-CNN. Moreover, the input size has a direct impact on the inference time, as expected, since a larger input size produces larger feature maps, and hence a larger number of operations. In fact, the inference processing speed of both YOLOv3 and YOLOv4 largely depends on the input size (from 12 FPS for 608 × 608 up to 23 FPS for 320 × 320), with little variation between the two datasets. As for Faster R-CNN, the Inception-v2 feature extractor is 2.3 and 1.5 times faster than Resnet50 on the Stanford and PSU datasets, respectively. The difference in speed when applying Faster R-CNN on the two datasets is explained by the difference of image input size. In fact, we calculated that the average number of pixels in the input test images (after resizing) is 544,000 for the PSU dataset, and 265,000 for the Stanford dataset, whereas YOLOv3 and YOLOv4 are not affected by this difference because they resize the images to a fixed input size.

Average Recall
The inference speed of YOLOv3 and YOLOv4 is nearly real-time. Nevertheless, if we want to run these object detectors on embedded edge devices on UAVs, which have reduced capabilities compared to the GPU workstation used here, we should apply model optimizations after training, as explained in [55].
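For readers who wish to reproduce this kind of measurement on their own hardware, a frame rate can be estimated with a simple timing loop such as the one below. This is a generic sketch, not the measurement procedure used in this study; `infer` stands for any hypothetical callable that runs a detector on one frame, and the warm-up count is an arbitrary choice.

```python
import time


def measure_fps(infer, frames, warmup=1):
    """Estimate the number of frames processed per second by infer."""
    # Warm-up calls are excluded from timing (model loading, caches, etc.)
    for frame in frames[:warmup]:
        infer(frame)

    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def dummy_detector(frame):
    """Stand-in for a real detector: sleeps to simulate the inference time."""
    time.sleep(0.002)
    return []


print(measure_fps(dummy_detector, frames=list(range(20))))
```

On an embedded device, the same loop can be used before and after model optimization to quantify the gain in throughput.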

Effect of the Dataset Characteristics
YOLOv3 (and to a slightly lesser extent YOLOv4) shows the largest performance discrepancy between the two datasets. While both provide a very high AP on the PSU dataset (up to 0.965), their performance markedly decreases on the Stanford dataset (Figure 6). This is mainly due to the spatial constraints imposed by the YOLO family of algorithms. On the other hand, Faster R-CNN was designed to better deal with objects of various scales and aspect ratios [19].
Nevertheless, the contrary can be observed in terms of IoU (Figure 14). While the average IoU of Faster R-CNN decreases by half between the PSU dataset and the Stanford dataset, it decreases only by 9% for YOLOv4 and 11% for YOLOv3. The imprecision of the ground-truth bounding boxes in the Stanford dataset and the discrepancy between training and testing features could explain the difference between the two datasets in terms of IoU. In addition, Faster R-CNN shows a high disparity between the two datasets in terms of processing speed (2.7 times faster on the Stanford dataset), mainly due to the difference in image input size, as mentioned in Section 4.3.4.
Figures 15 and 16 show the Average Precision (AP) for each category of object size on the PSU and Stanford datasets respectively. We define small objects as objects having a surface less than 5000 pixel², medium objects as having a surface between 5000 and 10,000 pixel², and large objects as having a surface greater than 10,000 pixel². We notice that the pattern is the same for all the tested networks. On the PSU dataset, the best performance is always obtained on small objects, whereas the lowest performance is obtained for medium-size objects (with the exception of Faster R-CNN/Resnet50 that exhibits a slightly lower AP for large objects). By contrast, on the Stanford dataset, all the algorithms completely fail to detect small and medium-size cars, while showing a much better performance on large objects. In both cases, this can be explained by the distribution of car sizes in the training dataset (Figure 4). In fact, in the PSU training dataset, the category of small cars is the best represented (87% of all objects), while the categories of medium-size and large cars are much less represented (8% and 4%, respectively).
On the other hand, in the Stanford training dataset, the most represented category is large cars (58%), while small and medium-size cars are less represented (5% and 38%, respectively). In addition, large objects still have the additional advantage of possessing more discernible features, hence being easier to detect.
Figure 15. Average Precision (AP) for each category of object size: small (object surface < 5000 pixel²), medium-size (5000 pixel² ≤ object surface ≤ 10,000 pixel²), and large (object surface > 10,000 pixel²), on the PSU dataset.
Figure 16. AP for each category of object size: small (object surface < 5000 pixel²), medium-size (5000 pixel² ≤ object surface ≤ 10,000 pixel²), and large (object surface > 10,000 pixel²), on the Stanford dataset.
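The size categories used in Figures 15 and 16 can be reproduced from the ground-truth bounding boxes with the following sketch, which applies the thresholds of 5000 and 10,000 pixel² given in the captions. The (x_min, y_min, x_max, y_max) box format, the function names, and the example boxes are assumptions made for illustration.

```python
from collections import Counter


def size_category(box):
    """Classify a box (x_min, y_min, x_max, y_max) by its surface in pixels."""
    x_min, y_min, x_max, y_max = box
    surface = (x_max - x_min) * (y_max - y_min)
    if surface < 5000:
        return "small"
    if surface <= 10000:
        return "medium"
    return "large"


def category_shares(boxes):
    """Fraction of boxes falling in each size category."""
    counts = Counter(size_category(box) for box in boxes)
    total = sum(counts.values())
    return {name: counts[name] / total for name in ("small", "medium", "large")}


# Made-up ground-truth boxes
boxes = [(0, 0, 50, 40), (10, 10, 60, 70), (0, 0, 100, 80), (5, 5, 205, 105)]
print(category_shares(boxes))
```

Computing these shares separately on the training and testing annotations gives a quick way to check whether the two sets cover the same range of object sizes.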

Effect of the Feature Extractor
The effect of the feature extractor for Faster R-CNN on the AP is very limited, except for a high IoU threshold (0.9) on the Stanford dataset, as can be seen in Figures 17 and 18. Nevertheless, in terms of inference speed, the Inception-v2 feature extractor is significantly faster than Resnet50 (Figures 19 and 20), which is consistent with the findings of Bianco et al. [54], who also showed that Inception-v2 (also known as BN-Inception) is less computationally complex.

Effect of the Input Size
Figures 19 and 20 show a significant gain in YOLOv3's AP when moving from a 320 × 320 input size to 416 × 416, but the performance stagnates when we move further to 608 × 608, which means that the 416 × 416 resolution is sufficient to detect the objects of the two datasets, and a higher input size may lead to overfitting. A similar behavior can be observed for YOLOv4, except that the improvement between the 320 × 320 and 416 × 416 sizes is much lower on the PSU dataset, since the first input size already provides an excellent AP. Moreover, we observe a decrease in AP when we move to 608 × 608 on the PSU dataset. This reveals overfitting on this smaller dataset when using more complex networks.
Concerning Faster R-CNN, Tables 10 and 11 show that the default variable input size, which conserves the aspect ratio of the images, provides better precision and recall than the fixed-size configuration in all cases except with Inception-v2 on the Stanford dataset, where the fixed-size configuration results in significantly fewer false negatives (5215 compared to 6351). This is likely due to an exceptional congruence between the fixed input size and the anchor scales for Inception-v2 on this particular dataset. This configuration also gives a slightly better performance in terms of inference speed (21.1 FPS compared to 19.2 FPS), due to the smaller average input size. In fact, the image input size has a direct impact on the inference speed, as explained in Section 4.3.4.
In order to measure the sensitivity of each algorithm to the learning rate hyperparameter, we conducted additional experiments with different learning rates (10⁻⁵, 10⁻⁴, 10⁻³, and 10⁻²) for each of the main algorithms (Faster R-CNN with Inception-v2, Faster R-CNN with Resnet50, and YOLOv3 and YOLOv4, both with input size 416 × 416), on each of the two datasets. Figure 21 shows a high sensitivity of the AP (measured on the validation dataset) to the learning rate chosen during training, except for YOLOv4, which benefits from the cosine annealing scheduler described in Section 3.2.2. A learning rate of 10⁻³ yields the best performance in most cases, except that, on the Stanford dataset, Faster R-CNN with Inception-v2 and YOLOv4 show better results at lower learning rates (10⁻⁴ and 10⁻⁵, respectively). A learning rate of 10⁻² gives poor results in all cases except for YOLOv4 on both datasets and for Resnet50 on the PSU dataset. A learning rate of 10⁻¹ was also tested, but it led to a divergent loss. These results highlight the importance of trying different learning rates when comparing the performance of object detection algorithms. The results shown in Figure 21 confirm the better performance of YOLOv4/YOLOv3 and Faster R-CNN on the PSU and Stanford datasets, respectively, when the learning rate is well chosen.
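To illustrate why a cosine annealing scheduler may reduce the sensitivity to the initial learning rate, the following Python sketch implements the standard cosine annealing formula without warm restarts. Since Section 3.2.2 is not reproduced here, the exact variant used by YOLOv4 (warm-up, restarts, minimum learning rate) is an assumption, and the function and parameter names are hypothetical.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at a given training step under cosine annealing.

    Standard form (assumed, not taken from the paper):
        lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * step / total_steps))
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so the schedule stays within [lr_min, lr_max].
    step = min(max(step, 0), total_steps)
    cosine_term = 1.0 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cosine_term

if __name__ == "__main__":
    # The four initial learning rates compared in the experiments.
    for lr_max in (1e-5, 1e-4, 1e-3, 1e-2):
        schedule = [cosine_annealing_lr(s, 100, lr_max) for s in (0, 25, 50, 75, 100)]
        print(lr_max, ["%.2e" % lr for lr in schedule])
```

Under this schedule, the learning rate decays smoothly from its initial value towards the minimum, so a large initial value is only applied during the early part of training.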

Effect of the Anchor Scales
The anchor scales used for the tested algorithms are the default values specified in Table 2. We suspected that the anchor values could be the reason for the poor performance of the tested algorithms on the Stanford dataset, so we subsequently conducted four additional experiments with a different set of anchor scales. For YOLOv3 and YOLOv4, the new anchor scales were calculated using K-means clustering on the Stanford training dataset, and yielded smaller anchor sizes. Table 12 shows the results obtained after using these anchors, compared to the previous results obtained with the default anchors. The performance was markedly lower for YOLOv3 (and to a much lesser extent for YOLOv4), which indicates that the YOLOv3 algorithm is very sensitive to the change of anchor scales, whereas this sensitivity was mitigated in YOLOv4. As for Faster R-CNN with Resnet50 as a feature extractor, the AP was slightly lower (20.7%, down from 21.9%), while the average IoU dropped noticeably (25%, down from 47.7%). In contrast, Faster R-CNN with Inception-v2 as a feature extractor was the only algorithm that showed better results with the reduced anchor scales. The two rightmost columns in Table 12 show the average width and height of the predicted bounding boxes. We notice that the dependency between the anchor scales and the predicted sizes is not straightforward. The average predicted sizes are more affected by the size of the ground-truth bounding boxes in the training dataset (72 × 152 on average, as shown in Table 5) and adapt poorly to the different ground-truth car sizes.
Tables 10 and 11 present the detailed results of all tested configurations of the algorithms on the PSU and Stanford datasets, respectively. The best performance for each metric and each dataset is highlighted in bold.
We notice that YOLOv4 with a medium input size (416 × 416) and Faster R-CNN (with Inception-v2 feature extractor and a fixed input size) show the best results in terms of AP and recall on the PSU and Stanford datasets, respectively. In terms of precision, Faster R-CNN (with Resnet50 feature extractor and a variable input size) and YOLOv3/YOLOv4 with a medium input size (416 × 416) perform better on the PSU and Stanford datasets, respectively. Figures 19 and 20 summarize the main results of this comparison study. They compare the trade-off between AP and inference time for YOLOv3/YOLOv4 (with three different input sizes) and Faster R-CNN (with two different feature extractors) on the PSU and Stanford datasets, respectively, with the default hyperparameters specified in Section 4.2. It can be observed that, while Faster R-CNN (with Inception-v2 as feature extractor) gave the best trade-off in terms of AP and inference speed on the Stanford dataset (followed closely by YOLOv4 416 × 416), YOLOv4 (with input size 320 × 320) presented the best trade-off on the PSU dataset. This underscores that none of these algorithms outperforms the others in all cases, and that the best trade-off between AP and inference time depends on the characteristics of the dataset (object size, resolution, quality of annotation, representativity of the training dataset, etc.).
In addition, while YOLOv4 has shown a steep increase in AP on the COCO dataset (from 33% to 43%), no such gap has been observed in our experiments on the smaller PSU and Stanford datasets, which indicates that the new features introduced in YOLOv4 were mainly tailored for the COCO dataset and may not be equally beneficial on other datasets.
Finally, it should be noted that, although the present case study was restricted to only car objects, its conclusions can be easily generalized to similar types of objects in aerial images, since we did not use any specific feature of cars.
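As a complement to the anchor-scale experiments discussed above, the following Python sketch shows how anchor sizes can be recalculated from ground-truth boxes with K-means clustering. The text only states that K-means was applied to the Stanford training dataset; the 1 − IoU distance (popularized by YOLOv2), the deterministic initialization, and all function names are assumptions made for illustration, and the sketch assumes strictly positive box sizes.

```python
def wh_iou(box, anchor):
    """IoU of two (width, height) boxes aligned at the same corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, max_iter=100):
    """Cluster (width, height) boxes into k anchors using 1 - IoU as distance."""
    if not 0 < k <= len(boxes):
        raise ValueError("k must be between 1 and the number of boxes")
    # Deterministic initialization: k boxes evenly spaced in order of area.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    anchors = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        # Assignment step: each box joins the anchor with the highest IoU.
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda j: wh_iou(box, anchors[j]))
            clusters[best].append(box)
        # Update step: each anchor becomes the mean size of its cluster.
        new_anchors = []
        for anchor, members in zip(anchors, clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_anchors.append((mean_w, mean_h))
            else:
                new_anchors.append(anchor)  # keep an empty cluster's anchor
        if new_anchors == anchors:
            break
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])
```

Because the resulting anchors follow the size distribution of the clustered boxes, anchors computed on a training set whose object sizes differ from those of the testing set will not necessarily improve detection.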

Conclusions
In this study, we conducted a thorough experimental comparison of three leading object detection algorithms (YOLOv4, YOLOv3, and Faster R-CNN) on two UAV imaging datasets that present very different characteristics, which makes the comparison more robust. Furthermore, the performance of the three algorithms was assessed using several metrics (mAP, IoU, FPS, AR^{max=1}, AR^{max=10}, AR^{max=100}, etc.) in order to uncover their strengths and weaknesses. One of the main conclusions that we can draw from this comparative study is that the performance of these algorithms largely depends on the characteristics of the dataset and the representativity of the training images. In fact, while Faster R-CNN (with Inception-v2 as feature extractor) gave the best trade-off in terms of AP (52% higher than YOLOv4) and inference speed (only 10% slower than YOLOv4) on the Stanford dataset, YOLOv4 (with an input size of 320 × 320) presented the best trade-off on the PSU dataset (31% more accurate and 2.4 times faster than Faster R-CNN). The two tested feature extractors for Faster R-CNN yielded close results in terms of accuracy, while Inception-v2 was 1.5 to 2.6 times faster than Resnet50. On the other hand, the difference in accuracy between YOLOv3 and YOLOv4 was shown to be statistically insignificant on the Stanford and PSU datasets, while they both show a high dependency on the input size (up to 1.9 times slower when passing from 320 × 320 to 608 × 608). In addition, we have shown that a badly chosen learning rate can yield an extremely low AP (almost 0), and that the choice of the anchor scale values can impact the AP by up to 58% for YOLOv3 and 26% for Faster R-CNN. As future work, we intend to extend our results to the newly released EfficientDet [46] detector and to much larger datasets of aerial images.

Funding: This work is supported by the research grant SEED-2020-05 from Prince Sultan University.

Data Availability Statement:
The PSU dataset used in this study is available at: https://github.com/aniskoubaa/psu-car-dataset.