Detecting Objects from Space: An Evaluation of Modern Deep-Learning Approaches

Unmanned aircraft systems, or drones, enable us to record many scenes from a bird's-eye view, and they have been rapidly deployed in a wide range of practical domains, e.g., agriculture, aerial photography, fast delivery, and surveillance. Object detection is one of the core steps in understanding videos collected from drones. However, this task is very challenging due to the unconstrained viewpoints and low resolution of the captured videos. While modern deep-learning object detectors have recently achieved great success on general benchmarks, e.g., PASCAL-VOC and MS-COCO, the robustness of these detectors on aerial images captured by drones is not well studied. In this paper, we present an evaluation of state-of-the-art deep-learning detectors, including Faster R-CNN (Faster Regional CNN), RFCN (Region-based Fully Convolutional Networks), SNIPER (Scale Normalization for Image Pyramids with Efficient Resampling), Single-Shot Detector (SSD), YOLO (You Only Look Once), RetinaNet, and CenterNet, for object detection in videos captured by drones. We conduct experiments on the VisDrone2019 dataset, which contains 96 videos with 39,988 annotated frames, and provide insights into efficient object detectors for aerial images.


Introduction
Object detection is a fundamental yet difficult task in image processing and computer vision research. It has been an important research topic for decades, and its development in the past two decades can be regarded as an epitome of computer vision history [1]. Since it plays a principal role in understanding the contents of images, object detection is considered a prerequisite step that enables a computer to detect various objects. Given a test image, object detection localizes the coordinates of the objects and assigns each object a label corresponding to its category, e.g., human, dog, or cat. The coordinates of a detected object represent the object's bounding box [2,3]. Object detection has many applications in robot vision, autonomous driving, human-computer interaction, and intelligent video surveillance. Deep-learning technology has brought significant breakthroughs in recent years; in particular, these techniques have produced remarkable progress in object detection. Object detection can target a specific instance, e.g., Obama's face, the Eiffel Tower, or the Golden Gate Bridge, or objects of specific categories, e.g., humans, cars, or bicycles. Historically, object detection has mainly focused on the detection of a single category.

• To the best of our knowledge, we are among the first to investigate the impact of different deep-learning object detection methods on the given problem.
• Second, we double the number of benchmarked methods compared to the MITA paper [25].
• Last but not least, we also increase the number of benchmarked classes in the VisDrone dataset. In particular, we evaluate all 10 classes in this paper (vs. 2 human classes, namely pedestrian and people, in the MITA paper).
The remainder of the paper is organized as follows. In Section 2, the related works are presented. Section 3 and Section 4 present the benchmarked methods and the experimental results, respectively. Finally, Section 5 concludes our work.
The number of layers keeps increasing in CNN architectures, and CNNs require a lot of computational resources. Several problems arise: gradient vanishing, gradient exploding, and degradation. Degradation occurs when adding more layers to a deep network: the accuracy becomes saturated and then decreases quickly. To overcome this problem, ResNet [29] introduces residual blocks. In a residual block, each layer feeds directly into layers about 2-3 hops away through skip-connections.
DenseNet [30] was proposed by Huang et al. in 2017. It includes many dense blocks. A dense block consists of composite layers which are densely connected: the input of each layer is the output of all preceding layers, so feature information is reused.

Object Detection Methods
Object detection methods are mainly divided into one-stage frameworks and two-stage frameworks. Two-stage frameworks are more accurate than one-stage frameworks, but one-stage frameworks usually achieve real-time detection. The two-stage approach includes two steps: the first stage creates region proposals, the second stage classifies region proposals. The one-stage approach predicts object regions and object classes at the same time.
CNN-based Two-Stage frameworks: Two-stage frameworks mainly include R-CNN [15], SPP-Net [31], Fast R-CNN [16], Faster R-CNN [17], RFCN [19], Mask R-CNN [32], and SNIPER [24]. R-CNN [15] is a method for detecting objects based on an ImageNet pre-trained model. R-CNN uses the Selective Search algorithm to generate region proposals. Then, these regions are warped and fed into the pre-trained model to extract high-level features. Finally, several SVM classifiers are trained on these features to identify object classes. Fast R-CNN [16] was introduced to solve some of R-CNN's limitations, e.g., its slow computation. Fast R-CNN feeds the whole image into a ConvNet to create a convolutional feature map, instead of the roughly 2000 warped regions used by R-CNN.
Faster R-CNN [17] proposes a Region Proposal Network (RPN) to detect region proposals instead of the Selective Search used in R-CNN and Fast R-CNN. Faster R-CNN is 10× faster than Fast R-CNN, and 250× faster than R-CNN, in inference time. RFCN introduces the position-sensitive score map, which improves speed while maintaining accuracy compared to Faster R-CNN. Mask R-CNN extends Faster R-CNN with instance segmentation and introduces RoIAlign pooling.
CNN-based one-stage frameworks: The most common examples of one-stage frameworks are YOLO, YOLOv2, YOLOv3 [23], SSD [18], and RetinaNet. You Only Look Once (YOLO [20]) is one of the first approaches to build a one-stage detector. Unlike the R-CNN family, YOLO does not use a region proposal component. Instead, it learns to regress bounding-box coordinates and class probabilities directly from image pixels. This significantly boosts the speed of the detection process. Single-Shot MultiBox Detector (SSD [18]) is also a one-stage detector aimed at high-speed object detection. However, unlike YOLO, SSD adopts a multi-scale approach: it adds several convolutional layers of sequentially decreasing size on top of the base network. This can be regarded as a pyramid representation of an image, where earlier levels contain feature maps useful for detecting small objects and deeper levels are expected to detect larger objects. Each of these layers has a set of predefined anchor boxes (also known as default boxes or prior boxes) for every cell. The model learns to predict the offsets corresponding to the correct anchor boxes. This approach has made a successful attempt at creating an efficient detector for objects of various sizes while maintaining a low inference time.

Benchmarked State-of-the-Art Object Detection Methods
In this section, we provide further details of the benchmarked object detection methods. Figure 1 visualizes the methods in the chronological order. Meanwhile, Figure 2 depicts the framework structures of different state-of-the-art object detectors.

Faster R-CNN
Faster R-CNN [17] is an extension of the R-CNN [15] and Fast R-CNN [16] methods for object detection. R-CNN requires a forward pass of the CNN for each of around 2000 region proposals (ROIs) for every single image. Later, Fast R-CNN solved this problem of R-CNN by sharing the computation of convolution between different proposals through a shared feature map. The detection process is sped up but still depends on an external region proposal method: region proposals were generated by additional methods, e.g., Selective Search or Edge Boxes. To solve this problem, Ren et al. [17] introduced the Region Proposal Network (RPN).
Faster R-CNN consists of two main components, namely the RPN and the Fast R-CNN detector. RPN initializes reference boxes (anchors) of diverse aspect ratios and scales at each convolutional feature-map location. Each anchor is mapped to a feature vector, which is fed into two sibling fully connected layers: an object category classification layer and a box regression layer. Faster R-CNN enables highly efficient region proposal computation because RPN shares convolutional features with Fast R-CNN. With an image of arbitrary size as input, RPN is trained end-to-end to generate high-quality region proposals as output. The Fast R-CNN detector then uses the ROI pooling layer to extract features from each candidate box and performs object classification and bounding-box regression. The entire system is a single, unified network for object detection.
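To make the anchor mechanism concrete, the following is a minimal sketch (not the authors' code) of Faster R-CNN-style anchor generation for a single feature-map location. The base size of 16, three aspect ratios, and three scales follow commonly used defaults; the function name `generate_anchors` is a hypothetical helper introduced here for illustration.

```python
import numpy as np

def generate_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Generate reference anchor boxes (x1, y1, x2, y2) centered on one
    feature-map cell: one anchor per (aspect ratio, scale) pair, all
    sharing the same center, with area (base_size * scale)^2."""
    cx = cy = (base_size - 1) / 2.0
    anchors = []
    for ratio in ratios:
        for scale in scales:
            # Keep the anchor area fixed while shaping h/w to the ratio.
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([cx - 0.5 * (w - 1), cy - 0.5 * (h - 1),
                            cx + 0.5 * (w - 1), cy + 0.5 * (h - 1)])
    return np.array(anchors)

anchors = generate_anchors()
print(anchors.shape)  # (9, 4): 3 ratios x 3 scales
```

In the full RPN, this 9-anchor template is replicated at every feature-map location by adding the location's stride-scaled offset to all coordinates.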

RFCN: Region-Based Fully Convolutional Networks
A limitation of Faster R-CNN is that it does not share computations after ROI pooling, whereas the amount of computation should be shared as much as possible. Faster R-CNN overcomes the limitations of Fast R-CNN, but it still contains several non-shared fully connected layers that must be computed for each of hundreds of proposals. The Region-based Fully Convolutional Network [19] was proposed as an improvement to Faster R-CNN. It consists of shared, fully convolutional architectures. In RFCN, the fully connected layers after ROI pooling are removed, and all other layers are moved before the ROI pooling to generate the score maps. RFCN infers 2.5 to 20 times faster than Faster R-CNN, yet it still maintains competitive accuracy.
Backbone Architecture. The incarnation of RFCN in [19] is based on the ImageNet pre-trained ResNet-101 model. To compute feature maps, the average pooling layer and the fully connected layer are removed; RFCN only uses the convolutional layers and attaches a randomly initialized 1024-d 1 × 1 convolutional layer for dimension reduction after the last convolutional block of ResNet-101. In addition, RFCN uses a $k^2(C+1)$-channel convolutional layer to generate the score maps.
Position-sensitive score maps and position-sensitive ROI pooling. RFCN regularly divides each ROI rectangle into $k \times k$ bins; for an ROI of size $w \times h$, each bin has a size of approximately $\frac{w}{k} \times \frac{h}{k}$. For each category, the last convolutional layer produces $k^2$ score maps. Inside the $(i, j)$-th bin ($0 \le i, j \le k - 1$), a position-sensitive ROI pooling operation pools only over the $(i, j)$-th score map:

$$r_c(i, j \mid \Theta) = \frac{1}{n} \sum_{(x, y) \in \mathrm{bin}(i, j)} z_{i, j, c}(x + x_0, y + y_0 \mid \Theta),$$

where $r_c(i, j)$ is the aggregated response in the $(i, j)$-th bin for the $c$-th category; $z_{i, j, c}$ is one score map among the $k^2(C+1)$ maps; $(x_0, y_0)$ denotes the ROI's top-left corner; $n$ is the number of pixels in the bin; $\Theta$ is the set of learnable parameters; and the $(i, j)$-th bin spans $\lfloor i \frac{w}{k} \rfloor \le x < \lceil (i+1) \frac{w}{k} \rceil$ and $\lfloor j \frac{h}{k} \rfloor \le y < \lceil (j+1) \frac{h}{k} \rceil$. The $k^2$ position-sensitive scores are then averaged to obtain a $(C+1)$-dimensional vector for each ROI, $r_c(\Theta) = \sum_{i, j} r_c(i, j \mid \Theta)$, and the SoftMax responses over categories are computed as $s_c(\Theta) = e^{r_c(\Theta)} / \sum_{c'=0}^{C} e^{r_{c'}(\Theta)}$. RFCN handles bounding-box regression in a similar way to Fast R-CNN: besides the $k^2(C+1)$-d convolutional layer, it appends a sibling $4k^2$-d convolutional layer. The position-sensitive ROI pooling produces a $4k^2$-d vector for each ROI, which is aggregated into a 4-d vector $t = (t_x, t_y, t_w, t_h)$ by average voting.
Training. The loss function is defined on each ROI as the sum of the cross-entropy loss and the box regression loss:

$$L(s, t_{x, y, w, h}) = L_{cls}(s_{c^*}) + \lambda \, [c^* > 0] \, L_{reg}(t, t^*),$$

where $c^*$ is the ground-truth label of the ROI. For classification, $L_{cls}(s_{c^*}) = -\log(s_{c^*})$ denotes the cross-entropy loss; $L_{reg}$ denotes the bounding-box regression loss; and $t^*$ is the ground-truth box. The indicator $[c^* > 0]$ takes the value 1 if the argument is true, and 0 otherwise.
Inference. The feature maps are computed on an image with a single scale of 600, shared between the RPN and RFCN parts (as shown in Figure 2). Then, the RFCN part evaluates the score maps and regresses bounding boxes based on the ROIs proposed by the Region Proposal Network (RPN) part.

SNIPER: Scale Normalization for Image Pyramids with Efficient Resampling
SNIPER is an efficient multi-scale training method for instance-level recognition tasks such as object detection and instance segmentation [24]. Instead of processing every pixel in an image pyramid, as Scale Normalization (SN) does, SNIPER processes context regions around the ground-truth instances (called chips) at the appropriate scale. This greatly increases the training speed, since it operates on low-resolution chips. Thanks to its memory-efficient design, SNIPER can benefit from batch normalization during training without having to synchronize batch-normalization statistics across GPUs.

Chip Generation
SNIPER generates chips $C^i$ at multiple scales $\{s_1, s_2, \ldots, s_i, \ldots, s_n\}$ in the image. For each scale $s_i$, the image is first re-sized to width $W_i$ and height $H_i$. On this canvas, $K \times K$ pixel chips are placed at equal intervals of $d$ pixels.
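The chip placement at one scale can be sketched as follows. This is an illustrative reconstruction, not SNIPER's actual implementation; the function name `place_chips` and the default chip size and stride are assumptions chosen for the example.

```python
def place_chips(width, height, chip_size=512, stride=416):
    """Place chip_size x chip_size windows on a (re-sized) image at equal
    intervals of `stride` pixels, adding a clamped final row/column so the
    right and bottom borders are also covered. Returns (x1, y1, x2, y2)
    tuples. Assumes width and height are at least chip_size."""
    xs = list(range(0, max(width - chip_size, 0) + 1, stride))
    ys = list(range(0, max(height - chip_size, 0) + 1, stride))
    if xs[-1] + chip_size < width:
        xs.append(width - chip_size)   # cover the right border
    if ys[-1] + chip_size < height:
        ys.append(height - chip_size)  # cover the bottom border
    return [(x, y, x + chip_size, y + chip_size) for y in ys for x in xs]

chips = place_chips(1333, 800)
print(len(chips))  # 6 chips on a 1333 x 800 canvas
```

During training, only the chips that cover ground-truth boxes valid at that scale are kept, which is what makes the method efficient.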

Chip Selection
Positive chips are chosen greedily so as to cover the highest number of valid ground-truth boxes. A ground-truth box is said to be covered if it is completely enclosed inside a chip. Although positive chips cover all the positive instances, a significant portion of the background is not covered by them. In a naive multi-scale training architecture, every pixel of the picture is processed at all scales. A simple strategy to avoid this is to use object proposals to identify regions where objects are likely to be present; a region with no proposals is considered to be background.

SSD: Single-Shot Detector
SSD [18] is a one-stage solution with a tremendously reduced inference time, resulting in an accurate, high-speed detector that can be used for real-time video processing.

Base Network VGG-16
SSD is built on top of the VGG-16 base network [27], which focuses on simplicity and depth. In particular, the model uses 16 weight layers with only 3 × 3 convolutional filters to extract features. As the model goes deeper, the number of filters doubles after each max-pooling layer. Noticeably, consecutive convolutional layers of the same type are grouped into blocks, as shown in Figure 2. At the end come three fully connected layers: the first two with 4096 channels each, and the last with 1000 channels (one per class), connected to a SoftMax layer to return the classification results. The model works well on classification and localization tasks and achieved 89% mAP on the PASCAL-VOC 2007 dataset. VGG-16 has remained one of the most interesting models for research even though it is not as fast as newer ones; its architecture has been reused in many models because of its valuable extracted features.

Model Architecture
SSD extends the pre-trained VGG-16 model (on ImageNet [10]) by adding new convolutional layers conv8_2, conv9_2, conv10_2, and conv11_2, in addition to using the modified conv4_3 and fc_7 layers, to extract useful features. Each layer is designed to detect objects at a certain scale using k anchor boxes, where 4k offsets and c class probabilities are computed using 3 × 3 filters. Thus, given a feature map of size $m \times n$, the total number of filters to be used is $kmn(c + 4)$. The anchor boxes are chosen manually. Here, we use the original formula to calculate the anchor-box scale $s_k$ of the $k$-th of $m$ feature maps:

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m],$$

where $s_{min} = 0.2$ and $s_{max} = 0.9$.
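The scale formula above can be checked numerically; a minimal sketch, with `ssd_scales` as a hypothetical helper name:

```python
def ssd_scales(m, s_min=0.2, s_max=0.9):
    """Anchor-box scale for each of the m feature maps, k = 1..m, per the
    SSD formula: s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1).
    Scales are fractions of the input image size."""
    return [round(s_min + (s_max - s_min) * (k - 1) / (m - 1), 4)
            for k in range(1, m + 1)]

print(ssd_scales(6))  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

With the standard six feature maps, the scales are evenly spaced between 0.2 and 0.9, so each level specializes in a narrow band of object sizes.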

RetinaNet
RetinaNet [22] is another one-stage detector. It aims to tackle the class imbalance problem between foreground and background that remains in one-stage detectors. RetinaNet uses two main techniques: an FPN backbone and focal loss as the loss function. FPN is built on top of a convolutional neural network and is responsible for extracting convolutional feature maps from the entire image. By using focal loss, RetinaNet re-weights the loss function to focus on hard, misclassified examples, which improves prediction accuracy. With ResNet-FPN as a backbone for feature extraction and two task-specific subnetworks for classification and bounding-box regression, RetinaNet has achieved state-of-the-art performance.

Class Imbalance
As a one-stage detector, RetinaNet has a much larger set of candidate object locations regularly sampled across an image (~100k locations), densely covering spatial positions, scales, and aspect ratios. Easily classified background examples dominate the training procedure. Bootstrapping or hard example mining is typically used as a solution to this problem; however, these are not efficient enough. To solve this, RetinaNet proposes a new loss function which adaptively tunes the weight contributed by each example during training.

Focal Loss
Focal loss is computed by multiplying the cross-entropy loss by a modulating factor $(1 - p_t)^\gamma$:

$$FL(p_t) = -(1 - p_t)^\gamma \log(p_t),$$

where $p_t$ is the model's estimated probability for the ground-truth class and $\gamma \ge 0$ is a focusing parameter.
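A minimal NumPy sketch of the binary (α-balanced) focal loss follows; the defaults γ = 2 and α = 0.25 are the values reported in the RetinaNet paper, and `focal_loss` is a name chosen for this example:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Alpha-balanced binary focal loss, FL = -alpha_t * (1 - p_t)^gamma
    * log(p_t), where p is the predicted foreground probability and
    y in {0, 1} is the label. Well-classified examples (p_t near 1)
    contribute almost nothing, so easy background stops dominating."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy background example contributes far less than a hard one:
print(focal_loss(np.array([0.02]), np.array([0])))  # near zero
print(focal_loss(np.array([0.90]), np.array([0])))  # much larger
```

With γ = 0 and α = 1 the expression reduces to plain cross-entropy, which is a convenient sanity check.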

RetinaNet Detector Architecture
RetinaNet adopts ResNet for deep feature extraction and builds a rich multi-scale feature pyramid from a single-resolution input image by using the Feature Pyramid Network (FPN) [33]. FPN combines low-resolution, semantically strong features with high-resolution, semantically weak features through a top-down pathway and lateral connections.

YOLO: You Only Look Once
YOLO is an object detection system targeted at real-time processing. Recently, the third version of YOLO was published; YOLOv3 is extremely fast and accurate. In mAP measured at 0.5 IoU, YOLOv3 is on par with RetinaNet (focal loss) but about 4× faster.
YOLOv3 takes an input image and predicts 3D tensors corresponding to three scales, where each scale divides the image into N × N grid cells. During training, the grid cell containing an object's center is responsible for detecting that object's class. Simultaneously, each grid cell is assigned 3 prior boxes of various sizes. Finally, non-max suppression is applied to select the best boxes.

Feature Extraction
YOLOv3 uses a variant of Darknet, originally a 53-layer network trained on ImageNet. According to [23], Darknet-53 is better than ResNet-101 and 1.5× faster, and has performance similar to ResNet-152 while being 2× faster.

Detection at Three Scales
YOLOv3 is different from its predecessors in that it performs the detection process at three different scales. In YOLOv3, detection is done by applying 1 × 1 detection kernels on feature maps of three different sizes at three different places in the network. YOLOv3 makes predictions at three scales, obtained by downsampling the dimensions of the input image by 32, 16, and 8, respectively. Detection at different layers helps address the problem of detecting small objects, a common issue in YOLOv2.

Objective Score and Confidences
The objectness score represents the probability that an object is contained inside a bounding box; its value ranges from 0 to 1 and is computed by applying a sigmoid. Class confidences, in turn, depict the probabilities that the detected object belongs to a particular class. In YOLOv3, Non-Maximum Suppression (NMS) is used to decide the final detections; it alleviates the problem of multiple detections of the same object.
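Greedy NMS, as used here, can be sketched in a few lines of NumPy. This is a generic sketch rather than YOLOv3's own implementation; the default IoU threshold of 0.45 is an illustrative choice.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and drop remaining boxes that overlap it by more than iou_thresh.
    Boxes are (x1, y1, x2, y2) arrays; returns the kept indices."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # survivors for the next round
    return keep
```

In a detector, NMS is applied per class after thresholding the class confidences.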

CenterNet
Both one-stage and two-stage detection have a limitation: anchor boxes are designed with manually chosen shapes, which are sensitive to the dataset and fixed during training. This incurs a high computation cost, yet the anchors are not always accurate. To address this, a series of anchor-free methods [34][35][36] have recently been proposed. CenterNet [37] is a one-stage, anchor-free detector. Reference [37] proposed a new center-based framework built on a single Hourglass network without an FPN structure [38]. An object is represented by the center point of its bounding box; other properties, such as object size, dimension, and pose, are obtained by regression.

Object as Points
CenterNet considers the center point of an object as a prerequisite to localize the bounding box. As a result, Reference [37] uses a keypoint estimator $\hat{Y}$ to predict all center points, and a single size prediction for all object categories to alleviate the computational burden.

From Points to Bounding Boxes
CenterNet identifies the peak points in the heatmap by keeping the values that are greater than or equal to their 8-connected neighbors, and retains the top 100 points. Each detected keypoint is given by an integer coordinate $(x_i, y_i)$, with the keypoint estimator value used as its detection confidence, and produces the bounding box

$$\left( x_i + \delta x_i - \frac{w_i}{2},\; y_i + \delta y_i - \frac{h_i}{2},\; x_i + \delta x_i + \frac{w_i}{2},\; y_i + \delta y_i + \frac{h_i}{2} \right),$$

where $(\delta x_i, \delta y_i) = \hat{O}_{x_i, y_i}$ is the offset prediction and $(w_i, h_i) = \hat{S}_{x_i, y_i}$ is the size prediction [37].
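The peak-extraction step can be sketched with plain NumPy; CenterNet itself implements it as a 3 × 3 max-pooling on the GPU, and `heatmap_peaks` here is an illustrative stand-in assuming a float heatmap.

```python
import numpy as np

def heatmap_peaks(heat, top_k=100):
    """Keep only locations whose value is the maximum of their 3x3
    neighborhood (i.e., >= all 8 neighbors, as in CenterNet's peak
    extraction), then return the top_k peaks as (y, x, score)."""
    h, w = heat.shape
    padded = np.pad(heat, 1, mode='constant', constant_values=-np.inf)
    # Maximum over each 3x3 window (the center plus its 8 neighbors).
    neigh = np.max(np.stack([padded[dy:dy + h, dx:dx + w]
                             for dy in range(3) for dx in range(3)]), axis=0)
    ys, xs = np.nonzero(heat == neigh)  # local maxima survive
    scored = sorted(zip(heat[ys, xs], ys, xs), reverse=True)[:top_k]
    return [(int(y), int(x), float(s)) for s, y, x in scored]
```

Because this neighborhood test already suppresses adjacent responses, CenterNet needs no separate NMS step.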

Benchmark Experiments
In this section, we first introduce the benchmark dataset and the evaluation metrics. We then detail the model configuration and discuss the experimental results.

Dataset
The VisDrone2019 dataset [39] consists of 288 videos with 261,908 frames and 10,209 static images that do not overlap with the video frames. Data were collected from unmanned aerial vehicles such as the DJI Mavic and Phantom series (3, 3A, 3SE, 3P, 4, 4A, 4P). The videos and images were collected at different times of day. The video frames have a highest resolution of 3840 × 2160, and the still images 2000 × 1500. Some images from the dataset are shown in Figure 3. VisDrone2019 includes ten predefined object categories: pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle. Only the training data is released by the contest organizers. In this paper, we use 56 clips of VID-train data, with 24,313 frames of the VisDrone2019 dataset, as training data, and 7 clips of VID-val, with 2860 frames, for model evaluation. The class distribution of VisDrone VID is depicted in Figure 4. A quick glimpse through the training set reveals wide discrepancies in the weather, the light-source direction, and the time interval, as well as the drone motion, as shown in Figure 5.

Evaluation Metrics
In this work, we use the Average Precision (AP) measurement [3,40], the most commonly used metric to assess object detection accuracy. Given two bounding boxes, one for the ground truth (the actual object) and one for the detection result (the predicted object), we use the Intersection over Union (IoU) to measure the similarity between the two boxes. It is computed as the area of their intersection divided by the area of their union. An IoU threshold η decides whether the prediction counts as a detection: if the predicted class label matches the actual class label and IoU > η, the prediction is considered a positive; otherwise, it is considered a negative.
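The IoU computation is simple enough to state directly; a minimal sketch for axis-aligned (x1, y1, x2, y2) boxes, with `iou` as an illustrative function name:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes:
    the overlap area divided by the area of the union of the two boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)  # zero if boxes are disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / 150 = 1/3
```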
The AP computes the average precision value for recall over 0 to 1, and the mean Average Precision (mAP) is computed by averaging the AP over all classes. Precision is the proportion of predicted bounding boxes matching actual ground truth; recall is the proportion of ground-truth objects being correctly detected. For object detection, we report the performance results with AP (IoU = 0.50) and AP (IoU = 0.75). The AP [3] summarizes the shape of the precision/recall curve and is defined as the mean interpolated precision at a set of 11 equally spaced recall levels [0, 0.1, ..., 1]:

$$AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} p_{interp}(r), \quad \text{where} \quad p_{interp}(r) = \max_{\tilde{r}:\, \tilde{r} \ge r} p(\tilde{r}). \quad (8)$$
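The 11-point interpolation above can be sketched as follows, given paired recall/precision samples from a precision/recall curve; `average_precision_11pt` is a name chosen for this example.

```python
import numpy as np

def average_precision_11pt(recalls, precisions):
    """PASCAL-VOC 11-point interpolated AP: the mean, over recall levels
    r = 0, 0.1, ..., 1.0, of p_interp(r) = max of precision at any
    recall >= r. Precision is taken as 0 where no sample reaches r."""
    recalls = np.asarray(recalls)
    precisions = np.asarray(precisions)
    ap = 0.0
    for r in np.linspace(0, 1, 11):
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 11.0
```

For example, a detector that reaches recall 0.55 at precision 0.8 (and nothing beyond) scores 6 of the 11 recall levels, giving AP = 0.8 × 6/11 ≈ 0.436.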

Model Configuration
To make a fair comparison, namely evaluating each detection model at its best, we adjust the models' parameters following the recommendations of [41][42][43]. We provide the detailed configuration below. • For YOLOv3, we trained a new model which adopts darknet53 (YOLOv3 code is available at: https://github.com/AlexeyAB/darknet) as pre-trained weights for the convolutional layers. In its implementation, YOLOv3 uses a total of 9 anchors, three for each scale: it assigns the three biggest anchors to the first scale, the next three to the second scale, and the last three to the third. As we aim to detect effectively at 3 scales, we recalculate the nine anchors accordingly.
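The anchor recalculation follows the YOLO idea of clustering ground-truth box sizes with k-means under an IoU-based distance (darknet provides this as `calc_anchors`). A minimal NumPy sketch of that idea, with `kmeans_anchors` as a hypothetical helper name:

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=50, seed=0):
    """Re-estimate k anchor (width, height) pairs from ground-truth box
    sizes with k-means, using the YOLO distance d = 1 - IoU, where boxes
    are compared as if they shared the same top-left corner. Returns the
    anchors sorted by area (smallest first)."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # IoU between every box and every anchor (top-left aligned).
        inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0]) *
                 np.minimum(wh[:, None, 1], anchors[None, :, 1]))
        union = (wh[:, None, 0] * wh[:, None, 1] +
                 anchors[None, :, 0] * anchors[None, :, 1] - inter)
        assign = np.argmax(inter / union, axis=1)  # nearest anchor by IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = wh[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
```

Run on the VisDrone training boxes, this yields anchors matched to the mostly small objects seen from a drone, with the three smallest assigned to the finest detection scale.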

Results
As aforementioned, we benchmark the state-of-the-art methods: Faster R-CNN, RFCN, and SSD with default parameters, and SNIPER, YOLOv3, and RetinaNet with adjusted parameters, together with CenterNet, as presented in detail in Table 1. Table 2 shows the training time of the methods: RFCN and YOLOv3 take the least training time, whereas SSD requires a remarkably long time for training. Table 3 shows the runtime performance of the different methods: SSD and YOLOv3 achieve the fastest running times, while RFCN only processes 1.75 frames per second. One-stage object detectors clearly run faster than two-stage ones. Figure 6 visualizes the detection results of the benchmarked methods. Taking a closer look, Tables 4 and 5 show the detailed results of the methods and the average performance with the IoU threshold set to 0.5 and 0.75, respectively. According to Table 4 (IoU = 0.5), CenterNet, RetinaNet, and SNIPER are the only three algorithms achieving more than a 25% mAP score. YOLOv3 ranks fourth with more than a 20% mAP score. We observe that RFCN performs better than Faster R-CNN, and that SSD performs the worst, producing only a 10.80% mAP score; however, SSD ranks third with 9.10% on the bus class. As seen in Table 5 (IoU = 0.75), SNIPER, CenterNet, and RetinaNet are the only three algorithms achieving more than an 11% mAP score. RFCN ranks fourth with more than an 8% mAP score. We observe that YOLOv3 performs the worst (3.20% mAP score), while Faster R-CNN performs better than SSD. Regarding overall performance, YOLOv3 does well with 7.5 FPS and 25.08 mAP (IoU = 0.5), but its score drops rapidly to 3.2 mAP (IoU = 0.75) because YOLOv3 does not perform well at localization; instead, YOLOv3 is well known for its runtime performance. Running YOLOv3, we also observed small confidence scores for each object (<30%) due to the similarity of features.
Therefore, we set a low confidence threshold when using YOLOv3 for object detection. Meanwhile, CenterNet, RetinaNet, and SNIPER achieve better detection results. As also seen from Figure 7, SNIPER, RetinaNet, and CenterNet produce the best feature maps: the object shapes are well captured with larger values, whereas the background receives smaller values. This is because focal loss is adept at learning imbalanced classes (foreground/background), while chip mining relies on a proposal network trained with a short training schedule, which identifies regions where objects are likely to be present. Likewise, the keypoint estimation of CenterNet is considered the essential factor that facilitates finding center points and regressing all other object properties, such as size, location, and orientation.
Regarding YOLO and SSD: in the feature map obtained by SSD, the object shapes are not clear, and the edges of objects are not preserved. This explains why the method detects aerial objects inaccurately, which is ascribed to the shallow layers of its network. Meanwhile, the feature maps extracted by YOLO are better than SSD's, with object regions more prominent against the background; however, the edges of the objects are still blurred.
As far as the Faster R-CNN and RFCN feature maps are concerned, the object shapes are well preserved but not clearly distinguished from the background. This is due to the variance across the diverse viewing angles of the images, which is considered an obstacle for early feature extractors.

Discussion
SSD is a unified object detector which adopts a multi-scale approach. SSD uses a VGG-16 network as a feature extractor and adds extra convolutional layers on top, which progressively reduce spatial dimension and resolution. To detect multi-scale objects, SSD makes independent object detections from multiple feature maps. Since the aspect ratios in SSD are used as anchor-box scaling factors, we widened the ratio range to ensure most objects could be captured. The higher-resolution feature maps are responsible for detecting small objects; the first layer used for detection is conv4_3, which has a spatial dimension of 38 × 38, a considerable reduction from the original input image. Furthermore, small objects can only be detected in the earliest feature maps. However, those maps contain low-level features, like edges or color patches, that are less informative for classification; shallow layers in a neural network may not generate enough high-level features to predict small objects [27]. Therefore, SSD usually performs worse on small objects compared to other detection methods. Although SSD is the second-worst detector overall, it ranks second with 12.7% on the bicycle class, and third with 5.5% on the people class, 9.1% on the awning-tricycle class, 9.10% on the bus class, and 4.6% on the motor class.
SSD is competitive with Faster R-CNN and RFCN on larger objects, but it has poor performance on small objects. Faster R-CNN incorporates the Region Proposal Network (RPN) into Fast R-CNN. RPN produces box proposals based on the feature extractor; these box proposals are used to crop features from the same intermediate feature map, which are fed to the remainder of the feature extractor to predict a class and a box refinement for each proposal. RFCN is similar to Faster R-CNN, but it crops features from the last feature layer before prediction to reduce the amount of computation, and proposes a position-sensitive mechanism to keep translation variance in the localization representations. Faster R-CNN has a mAP of 17.51% and ranks third with 16.96% on the truck class. RFCN has much better performance than Faster R-CNN and SSD, producing 19.55% AP; it ranks first with 28.10% on the truck class and 13.28% on the bus class, and third with 19.45% on the people class and 15.21% on the awning-tricycle class.
As far as the outstanding detectors are concerned, CenterNet, RetinaNet, and SNIPER are the three algorithms that top the statistics at both IoU thresholds (0.5 and 0.75). CenterNet ranks first with 32.28%, followed by RetinaNet with 28.26%, at the IoU = 0.5 threshold. Interestingly, at the IoU = 0.75 threshold, which favors highly accurate results, SNIPER overtakes CenterNet and achieves the best performance.
In particular, CenterNet achieves outstanding results on the benchmark dataset at both IoU thresholds. Note that the CenterNet object detector builds on keypoint estimation networks, finds object centers, and regresses their sizes. The experimental results show that CenterNet works especially well at the smaller IoU threshold, 0.5. Regarding SNIPER, boxes whose square root of area lies within the valid range at a given scale are marked as valid at that scale; therefore, we increased the valid ranges to detect objects of various sizes (small, medium, and large), as shown in Table 1. Simultaneously, chip mining plays an important role in eliminating regions that are likely to contain only background, and this measure can adapt to each viewpoint, hence alleviating the drawback of diverse scales. As a result, these enhancements cooperate with the pyramidal feature maps to surpass other detectors in terms of average precision. In particular, at the 0.75 IoU threshold, SNIPER outperforms YOLOv3, with 16.96% and 3.2%, respectively; this is mainly because YOLOv3 is inferior in terms of localization. Regarding RetinaNet, by changing the anchors, RetinaNet gains an increase in AP of 2.3% for the people class. The scale adjustment widened the variety of scaling factors used per anchor location, which improves detection for objects of diverse sizes. Concurrently, the focal loss is designed to address the severe imbalance between foreground and background classes during training; as a result, this approach can tackle the problem of the unbalanced dataset, in which the number of training samples for the car and bus classes outnumbers those of other classes.

Conclusion and Future Work
In this paper, we experimented with state-of-the-art object detection methods, namely Faster R-CNN, RFCN, SSD, YOLO, SNIPER, RetinaNet, and CenterNet, on aerial images. Among them, CenterNet, SNIPER, and RetinaNet achieve the best performance in terms of average precision. Concurrently, YOLO is considered the optimal choice for real-time object detection applications, which require high FPS and moderate precision in detecting objects. We note the main challenges of the problem, for example, occlusion, scale, and class imbalance. From the aerial view, many objects are occluded, and their sizes vary widely. We also noticed the class imbalance issue during the training process; for example, most of the detectors perform much better on the car and pedestrian classes than on the awning-tricycle, tricycle, and bus classes because more instances are collected for the car and pedestrian classes.
In the future, we would like to investigate the fusion of different object detectors to further boost the state-of-the-art performance. In addition, we are interested in the task of aerial image segmentation; obviously, the bounding boxes provided by object detectors are very useful for the segmentation task. We also consider adopting transfer learning [44] to consolidate and enhance training efficiency.