A Feature-Enhanced Anchor-Free Network for UAV Vehicle Detection

Abstract: Vehicle detection in unmanned aerial vehicle (UAV) images is a challenging task. One reason is that the objects are small, low-resolution, and vary widely in scale, which results in weak feature representation. Another reason is the imbalance between positive and negative examples. In this paper, we propose a novel architecture for UAV vehicle detection to address these problems. Specifically, we use an anchor-free mechanism to eliminate predefined anchors, which reduces complicated computation and relieves the imbalance between positive and negative samples. Meanwhile, to enhance the features of vehicles, we design a multi-scale semantic enhancement block (MSEB) and an effective 49-layer backbone based on DetNet59. The proposed network offers receptive fields that match small-sized vehicles and incorporates precise localization information provided by high-resolution contexts. The MSEB strengthens discriminative feature representation at various scales without reducing the spatial resolution of the prediction layers. Experiments show that the proposed method achieves state-of-the-art performance. In particular, for small vehicles, which make up the main part of the datasets, the accuracy is about 2% higher than that of other existing methods.


Introduction
Vehicle detection in unmanned aerial vehicle (UAV) images has received significant attention due to its extensive applications in both military and civilian fields, such as disaster management [1], transportation surveillance [2][3][4], and smart parking [5]. However, UAV vehicle detection is a challenging task because of small-sized objects, low-resolution objects, large variations in object scale, and the imbalance between positive and negative examples. For example, on a 1920 × 1080 image a large truck occupies about 460 × 300 pixels, while a small bicycle occupies only about 20 × 20 pixels, so the problem of large scale variations persists in UAV aerial images. Therefore, accurate and fast vehicle detection in UAV images has both theoretical significance and practical application value.
Traditional vehicle detection in UAV images is mainly based on hand-crafted features followed by a classifier or a cascade of classifiers within a sliding window [6][7][8][9][10]. Hand-crafted features are low-level with weak semantics and cannot represent vehicles effectively. Recently, thanks to the powerful representation capability of deep convolutional neural networks (CNNs), object detection [11][12][13][14][15] has made significant breakthroughs on classical ground-level images, which has also inspired vehicle detection in UAV images.
In summary, we make the following main contributions: (1) We propose a feature-enhanced anchor-free network (FEAF) for UAV vehicle detection, which avoids the complex calculations related to anchor boxes and relieves the imbalance between positive and negative samples. (2) We adopt an effective 49-layer backbone that offers receptive fields matched to small-sized vehicles while keeping precise localization information. In addition, a multi-scale semantic enhancement block (MSEB) is proposed to strengthen discriminative feature representation for vehicles at various scales without changing the spatial resolution of the prediction layers. (3) Our method achieves state-of-the-art performance on two datasets, the UAVDT dataset [44] and the XDUAV dataset [45]. On the first dataset, it achieves 81.4% AP, which is 2.4%, 1.7%, 0.8%, and 2.2% higher than FPN [28], Mask R-CNN [15], FCOS [30], and RetinaNet [32], respectively. On the second, it achieves 73.5% AP, which is 1.6%, 1.7%, 1.9%, and 2.5% higher than FPN, Mask R-CNN, FCOS, and RetinaNet, respectively. In particular, small vehicles make up the main part of UAV datasets, and on them the accuracy $AP_S$ is about 2% higher than that of other existing methods. Moreover, the proposed detector runs at 22 frames per second on a single NVIDIA TITAN Xp GPU.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the technical design and theoretical analysis of the proposed method. Section 4 presents the experimental results and a detailed analysis. Section 5 draws the conclusion.

Anchor-Based UAV Vehicle Detection
With the impressive progress of deep learning in image processing, UAV vehicle detectors based on convolutional neural networks (CNNs) have been proposed in recent years. Many UAV vehicle detection methods [19,46] are based on two-stage detectors and achieve improved performance through enhanced feature representation for a single vehicle category. Sommer et al. [47] exploit Faster R-CNN [12] to extend the detection task to multiple vehicle categories. Zhang et al. [48] adopt Cascade R-CNN [49] to detect dense and small vehicles in UAV vision. However, these UAV vehicle detection methods cannot satisfy real-time requirements. In contrast, single-stage detectors can achieve real-time object detection. Radovic et al. [21] and Tang et al. [22] use YOLO [50] and YOLOv2 [51], respectively, for fast vehicle detection and tracking in UAV images. ShuffleDet [52] applies inception modules and deformable modules to account for the size and geometric shape of vehicles and achieve real-time vehicle detection. However, these vehicle detectors rely on pre-defined anchor boxes, which bring an imbalance between positive and negative samples as well as a large computation and memory footprint.

Anchor-Free UAV Vehicle Detection
To solve the problems caused by setting anchor boxes, anchor-free detectors [53][54][55][56][57][58][59][60][61] have been proposed to eliminate the pre-designed scales and aspect ratios of anchors and directly output bounding boxes from an image. CornerNet [62] directly detects a pair of corners of an object bounding box and groups them via the associative embedding [63] technique. CornerNet achieves superior performance but at a high post-processing cost. FCOS [30] and FoveaBox [64] treat locations inside the ground-truth box as positives and predict four distances. For UAV vehicle detection, RRNet [65] first uses an anchor-free detector to generate coarse boxes and then applies a re-regression module to produce accurate bounding boxes. Cai et al. [66] propose an anchor-free Guided Attention Network (GANet) for object detection and counting, but it only handles single-class UAV vehicle detection. In this paper, we use an anchor-free mechanism to achieve efficient and effective multi-class vehicle detection in UAV images.

Feature-Enhanced UAV Vehicle Detection
Vehicles in UAV images are small and low-resolution, resulting in weak feature representation of vehicle targets, so feature enhancement is necessary for vehicle detection. Intuitively, contextual information [67][68][69][70][71] is helpful for small and low-resolution objects. Many vehicle detectors for UAV images [72,73] use an FPN-based network to provide contextual information and achieve high accuracy in vehicle detection and counting tasks. SlimYOLOv3 [74] reduces the number of trainable parameters and floating point operations (FLOPs) by pruning feature channels to achieve real-time vehicle detection. Ammar et al. [75] present a performance comparison between Faster R-CNN [12] and YOLOv3 [25] for car detection in UAV images. Liang et al. [76] use a feature fusion and scaling-based SSD for small vehicle detection in UAV images. In this paper, we use an effective backbone that maintains high spatial resolution in deeper convolutional layers. The contextual information is thus generated through top-down and lateral connections in the network, which introduces precise localization information to enhance the features of small-sized vehicles. Moreover, a multi-scale semantic enhancement block (MSEB) is proposed to strengthen discriminative feature representation for vehicles at various scales.

Proposed Method
In this section, we describe the proposed feature-enhanced anchor-free network (FEAF) for UAV vehicle detection. The overall architecture is shown in Figure 1.

Architecture
We describe the proposed architecture for UAV vehicle detection in detail, covering feature extraction and the detection head.
Feature Extraction. The proposed FEAF network adopts the anchor-free mechanism for regression. If a location $(x, y)$ falls into any ground-truth box, it is considered a positive sample and directly outputs pixel-wise classification scores and an object bounding box. Therefore, it is very important for the FEAF network to maintain accurate location information. DetNet59 [40] employs small down-sampling factors (i.e., 4, 8, and 16) to maintain fine location information while employing dilated convolution [77] to increase the receptive field. Inspired by this, we adopt an effective 49-layer backbone based on DetNet59 to extract features, as shown in Figure 1. In Figure 1, H and W are the height and width of the feature maps (e.g., $C_3$, $C_4$, and $C_5$), respectively, and '/d' (d = 8, 16, 16) is the down-sampling factor of each feature level relative to the input image. We use the convolutional stages from $C_1$ to $C_5$ as the backbone, called DetNet-49, and remove the deeper convolutional stage, stage6, from DetNet59. Because deeper convolutional layers introduce receptive fields that are too large for small-sized vehicles, they bring too much background interference and compromise detection performance. Thus, the proposed backbone DetNet-49 maintains precise localization information in deeper convolutional layers while providing receptive fields matched to small-sized vehicles.
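The trade-off described above, where dilation enlarges the receptive field without further down-sampling, can be illustrated with the standard receptive-field recurrence. The following is a minimal sketch: the three-layer stacks are hypothetical examples chosen for illustration and are not the actual DetNet-49 layer configuration.

```python
def receptive_field(layers):
    """Return (receptive field, cumulative stride) of a stack of conv layers.

    Each layer is a (kernel, stride, dilation) tuple. Sizes are measured in
    cells of the feature map that feeds the stack.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        k_eff = dilation * (kernel - 1) + 1  # effective size of a dilated kernel
        rf += (k_eff - 1) * jump             # growth scaled by current stride
        jump *= stride                       # stride accumulates multiplicatively
    return rf, jump


# Hypothetical stacks (assumptions for illustration only).
plain = [(3, 1, 1)] * 3                    # three ordinary 3x3 convolutions
dilated = [(3, 1, 2)] * 3                  # three 3x3 convolutions, dilation 2
strided = [(3, 2, 1)] + [(3, 1, 1)] * 2    # down-sample once, then two 3x3

print(receptive_field(plain))    # (7, 1)
print(receptive_field(dilated))  # (13, 1)
print(receptive_field(strided))  # (11, 2)
```

In this toy comparison the dilated stack reaches a larger receptive field than the strided one while its cumulative stride stays at 1, which is the property that lets a DetNet-style stage keep spatial resolution. The same recurrence also shows why adding more stages keeps enlarging the receptive field, which the text argues is harmful for small vehicles.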
A top-down architecture with lateral connections is used to build a feature pyramid from $P_3$ to $P_5$ and to provide contextual information that enhances the features of vehicles. The contextual information includes semantic information and more precise localization information with high spatial resolution. The prediction layers $P_3$, $P_4$, and $P_5$ are produced from the feature maps $C_3$, $C_4$, and $C_5$, each followed by a multi-scale semantic enhancement block (MSEB) with a lateral connection. The MSEB is explained in detail in Section 3.2. Meanwhile, the channel dimension of the prediction layers is fixed at 256, as shown in Figure 1, which further reduces the computational cost.
Detection Head. Many studies [60,78] have shown that the detection head plays an important role in achieving high performance. As in FCOS [30] and RetinaNet [32], we append four 3 × 3 convolutional layers to each of the classification and regression branches of the detection head to improve detection accuracy. Meanwhile, we employ a center-ness branch [30] in parallel with the classification branch to suppress low-quality predicted bounding boxes without introducing any hyper-parameters, as described in detail in Section 3.3. The parameters of the head are shared across all prediction levels. For UAV vehicle detection, the final layers predict a C-dimensional vector p of classification labels (C is the number of categories) and a 4-dimensional vector t of bounding box coordinates.

Multi-Scale Semantic Enhancement Block
Although a network composed only of DetNet-49 is conducive to detecting small-sized vehicles, it is less suited to the classification and regression of large-sized vehicles. Intuitively, a deeper network creates larger receptive fields and stronger semantic information, which benefit the classification of large-sized vehicles. However, localization suffers from the absence of fine location information, and large receptive fields bring background interference for small-sized and low-resolution vehicles. To maintain precise location information and offer matched receptive fields for vehicles of all sizes, we widen the network instead of increasing its depth, which effectively strengthens the semantics of vehicles at various scales. Based on the DetNet-49 backbone, we design a multi-scale semantic enhancement block (MSEB), shown in Figure 2, to widen the network without changing the spatial resolution of the prediction layers.
The MSEB contains three branches built from convolution kernels of different sizes. Specifically, the first branch decomposes a 3 × 3 convolution kernel into a 1 × 3 kernel and a 3 × 1 kernel, the second branch uses a 3 × 3 convolutional kernel followed by a 1 × 1 kernel, and the third branch uses a single 1 × 1 convolutional kernel. We fix the number of channels in each layer of the MSEB to 256. Finally, the three branches are summed element-wise. Furthermore, we apply batch normalization (BN) [79] after each convolutional layer in the MSEB. In this way, the proposed network not only strengthens semantic information for vehicles, but also maintains precise location information and creates receptive fields of different scales from one feature map.
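To make the cost of the three branches concrete, the sketch below counts convolution weights for the branch layout described above. It is a rough illustration under stated assumptions: bias and batch-normalization parameters are ignored, and the number of input channels `c_in` is a hypothetical parameter, since the text only fixes the width inside the block at 256.

```python
def conv_weights(c_in, c_out, k_h, k_w):
    """Weights of one convolution, ignoring bias and batch-norm terms."""
    return c_in * c_out * k_h * k_w


CHANNELS = 256  # channel width inside the MSEB, as stated in the text

# A full 3x3 kernel versus its 1x3 + 3x1 factorization (256 -> 256 channels).
full_3x3 = conv_weights(CHANNELS, CHANNELS, 3, 3)
factorized = (conv_weights(CHANNELS, CHANNELS, 1, 3)
              + conv_weights(CHANNELS, CHANNELS, 3, 1))


def mseb_weights(c_in, c=CHANNELS):
    """Total weights of the three branches; c_in is an assumed input width."""
    branch_1 = conv_weights(c_in, c, 1, 3) + conv_weights(c, c, 3, 1)
    branch_2 = conv_weights(c_in, c, 3, 3) + conv_weights(c, c, 1, 1)
    branch_3 = conv_weights(c_in, c, 1, 1)
    return branch_1 + branch_2 + branch_3


print(full_3x3, factorized)  # 589824 393216
print(mseb_weights(256))     # 1114112
```

Under these assumptions the factorized pair uses two thirds of the weights of a full 3 × 3 kernel while covering the same 3 × 3 spatial extent. Because the branches are summed rather than concatenated, the output keeps 256 channels and the same spatial size as the input.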

Anchor-Free Mechanism
In this paper, we adopt the anchor-free mechanism [30] to regress the bounding boxes of UAV vehicles. Given a ground-truth box denoted as $(x_t, y_t, x_b, y_b)$, the points $(x_t, y_t)$ and $(x_b, y_b)$ refer to the top-left and bottom-right corners of the target bounding box, respectively. Each location $(x, y)$ on the prediction layer $P_i$ can be mapped back onto the input image as $(s/2 + xs, s/2 + ys)$, which is near the center of the receptive field of the location $(x, y)$. Here $s$ is the total stride, which is 8, 16, and 16 for $P_3$, $P_4$, and $P_5$, respectively. If the location $(x, y)$ falls into any ground-truth box, it is considered a positive sample; otherwise it is a negative one. If a location falls into multiple ground-truth boxes, we adopt multi-level prediction to resolve the ambiguity. Besides the classification label, the detection head directly regresses a 4-dimensional vector for each location on all prediction layers, which represents the distances between the current location and the four sides of the ground-truth box. For example, if the location $(x, y)$, expressed in input-image coordinates, falls into the given ground-truth box $(x_t, y_t, x_b, y_b)$, the regression target for the location is the 4D vector $t^*_{x,y} = (dx_t, dy_t, dx_b, dy_b)$, where

$$dx_t = x - x_t, \quad dy_t = y - y_t, \quad dx_b = x_b - x, \quad dy_b = y_b - y.$$

If a location satisfies $\max(dx_t, dy_t, dx_b, dy_b) > m_i$ or $\max(dx_t, dy_t, dx_b, dy_b) < m_{i-1}$, we take it as a negative sample and do not regress a box for it, where $m_i$ is the maximum distance that the prediction layer $P_i$ needs to regress. In this paper, $m_3$, $m_4$, and $m_5$ are set to 64, 256, and $\infty$, respectively. In other words, the size range is $[0, 64]$ for $P_3$, $[64, 256]$ for $P_4$, and $[256, \infty)$ for $P_5$. Compared with anchor-based detectors, which only consider anchor boxes with a sufficiently large IoU as positive samples, the anchor-free mechanism can therefore use more positive samples for training.
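The mapping, target computation, and level assignment described in this section can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the targets follow the FCOS-style distances to the four box sides, "falls into" is assumed to mean strictly inside the box, and the lower bound for $P_3$ is assumed to be 0.

```python
import math

STRIDES = {"P3": 8, "P4": 16, "P5": 16}
# (m_{i-1}, m_i) regression ranges per level, as given in the text.
RANGES = {"P3": (0, 64), "P4": (64, 256), "P5": (256, math.inf)}


def to_image(x, y, level):
    """Map a feature-map location to input-image coordinates."""
    s = STRIDES[level]
    return s // 2 + x * s, s // 2 + y * s


def regression_target(px, py, box):
    """Distances from an image point to the four sides of a ground-truth box."""
    x_t, y_t, x_b, y_b = box
    return (px - x_t, py - y_t, x_b - px, y_b - py)


def is_positive(target, level):
    """Positive if the point is inside the box and the level's range matches."""
    if min(target) <= 0:          # point lies outside (or on the edge of) the box
        return False
    lo, hi = RANGES[level]
    return lo <= max(target) <= hi


box = (100, 100, 140, 130)        # hypothetical 40 x 30 ground-truth box

p3_point = to_image(14, 14, "P3")                 # (116, 116)
p3_target = regression_target(*p3_point, box)     # (16, 16, 24, 14)
print(p3_target, is_positive(p3_target, "P3"))    # (16, 16, 24, 14) True

p4_point = to_image(7, 7, "P4")                   # (120, 120)
p4_target = regression_target(*p4_point, box)     # (20, 20, 20, 10)
print(p4_target, is_positive(p4_target, "P4"))    # (20, 20, 20, 10) False
```

In this example the small box is regressed only on $P_3$: the same box seen from $P_4$ has a maximum distance below 64, so the location there is treated as negative.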
Center-ness. Since positive samples far away from the target center tend to regress low-quality bounding boxes, a constraint called center-ness is used to suppress those boxes. For a regression target $(dx_t, dy_t, dx_b, dy_b)$ at the location $(x, y)$, the center-ness is defined as

$$\text{center-ness} = \sqrt{\frac{\min(dx_t, dx_b)}{\max(dx_t, dx_b)} \times \frac{\min(dy_t, dy_b)}{\max(dy_t, dy_b)}}.$$

The center-ness introduces no additional hyper-parameters and requires no fine-tuning during training. During testing, the final score used for ranking the bounding boxes is calculated by multiplying the predicted center-ness by the corresponding classification confidence. Thus, bounding boxes predicted far away from the center point receive smaller weights, and non-maximum suppression (NMS) can filter out those low-quality boxes and improve UAV vehicle detection performance.
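As a small worked example, the center-ness of FCOS [30], which this section adopts, is the square root of the product of the horizontal and vertical min/max distance ratios. The sketch below computes it and the resulting ranking score; the function names and the sample numbers are illustrative assumptions.

```python
import math


def centerness(target):
    """FCOS-style center-ness of a positive location.

    target = (dx_t, dy_t, dx_b, dy_b): distances to the left, top, right and
    bottom sides of the box; all four must be positive.
    """
    dx_t, dy_t, dx_b, dy_b = target
    horizontal = min(dx_t, dx_b) / max(dx_t, dx_b)
    vertical = min(dy_t, dy_b) / max(dy_t, dy_b)
    return math.sqrt(horizontal * vertical)


def ranking_score(class_confidence, predicted_centerness):
    """Score used to rank boxes before NMS at test time."""
    return class_confidence * predicted_centerness


center = (10, 10, 10, 10)    # location exactly at the box center
off_center = (1, 10, 19, 10)  # location close to the left side of the box

print(centerness(center))                          # 1.0
print(centerness(off_center) < centerness(center))  # True
print(ranking_score(0.9, centerness(off_center)) < 0.9)  # True
```

A location at the exact center scores 1, and the value decays toward 0 as the location approaches a side of the box, so an off-center prediction with the same classification confidence is ranked lower before NMS.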
Loss Function. During the training phase, we minimize a multi-task objective that combines a classification loss, a center-ness loss, and a regression loss. The classification loss $L_{cls}$ is the focal loss [32], the center-ness loss $L_{cn}$ is the binary cross-entropy loss [12], and the regression loss $L_{reg}$ is the GIoU loss [80]. $N_{pos}$ is the number of positive samples. The label $c^*_{x,y}$ is 1 if the location $(x, y)$ is positive and 0 otherwise. The summation is calculated over all locations on the pyramid levels $P_i$ (i = 3, 4, 5).
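For orientation, one plausible form of such an objective, following the FCOS [30] formulation, is written out below. This is a sketch under assumptions rather than the exact equation of this paper: unit weights on all three terms are assumed, and the symbols $p^*_{x,y}$ (class target), $cn_{x,y}$ (predicted center-ness), and $cn^*_{x,y}$ (center-ness target) are introduced here for illustration.

```latex
L\bigl(\{p_{x,y}\}, \{t_{x,y}\}, \{cn_{x,y}\}\bigr)
  = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}\bigl(p_{x,y}, p^{*}_{x,y}\bigr)
  + \frac{1}{N_{pos}} \sum_{x,y} c^{*}_{x,y}\, L_{reg}\bigl(t_{x,y}, t^{*}_{x,y}\bigr)
  + \frac{1}{N_{pos}} \sum_{x,y} c^{*}_{x,y}\, L_{cn}\bigl(cn_{x,y}, cn^{*}_{x,y}\bigr)
```

In this form the indicator $c^*_{x,y}$ restricts the regression and center-ness terms to positive locations, while the classification term is evaluated at every location, and all three sums are normalized by $N_{pos}$.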

Experimental Results and Analysis
In this section, we analyze the proposed network for vehicle detection. In the ablation study, backbone network, multi-scale semantic enhancement block (MSEB), and anchor-free mechanism are analyzed in detail.

Dataset Preparation and Training Implementation Details
XDUAV Dataset. The XDUAV dataset [45] was captured by a DJI Phantom 2 quadcopter flying over parts of the urban and suburban areas of Xi'an, China. The dataset consists of 11 vehicle videos from various traffic environments, such as congested and non-congested conditions and intersection scenarios, under different weather and lighting conditions. All videos were collected from the drone view at a height of approximately 100 meters. The resolution is 1920 × 1080, and one image is sampled every 30 frames. The whole dataset contains 4344 images, with 3475 images for training and 869 images for testing. The dataset contains a large number of small vehicles that are truncated, occluded, or seen from multiple angles. Figure 3a shows some example images in different scenarios and weather conditions. The training and testing samples are annotated with 6 categories of vehicles (i.e., car, bus, truck, tanker, motor, and bicycle). The number of objects in each category is shown in Table 1.
UAVDT Dataset. The UAV Detection and Tracking (UAVDT) benchmark [44] was captured by a DJI Inspire 2 flying at different altitudes in urban areas, under various weather and lighting conditions and in different scenarios such as arterial streets, highways, crossings, and T-junctions. The UAVDT benchmark consists of about 80,000 representative frames for three fundamental tasks, i.e., object detection, single object tracking, and multiple object tracking. In this paper, we only address the object detection task. The dataset for object detection contains 39,850 images, with 23,258 images for training and 16,592 images for testing, at a resolution of 1080 × 540 pixels. The dataset contains 3 annotated categories: car, truck, and bus. The number of vehicles in each category and some example images are shown in Table 1 and Figure 3b, respectively.
Metric. In this paper, we adopt the MS COCO metric to evaluate the results of the proposed detector on the two datasets. The MS COCO metric judges detection performance in a rigorous manner and includes $AP$, $AP_{50}$, $AP_{75}$, $AP_S$, $AP_M$, and $AP_L$. The mean average precision ($AP$) is calculated by averaging over all categories and over the 10 Intersection over Union (IoU) thresholds in the range [0.5, 0.95] with an interval of 0.05. The $AP$ value is the primary metric for ranking. $AP_{50}$ and $AP_{75}$ are calculated as the average over all categories at the single IoU values 0.5 and 0.75, respectively. In addition, the $AP_S$, $AP_M$, and $AP_L$ values are computed separately for small-sized (area < $32^2$), medium-sized ($32^2$ < area < $96^2$), and large-sized (area > $96^2$) objects in order to measure detection performance on targets of different sizes.
Training Implementation Details. The proposed UAV vehicle detector FEAF is trained end-to-end on 4 NVIDIA TITAN Xp GPUs with a total of 16 images per minibatch (4 images per GPU). Our network is optimized by stochastic gradient descent (SGD) with a weight decay of 0.0001 and a momentum of 0.9. Unless otherwise specified, all models are trained for 100 k iterations with an initial learning rate of 0.01, which is reduced by a factor of 10 at iterations 60 k and 80 k. We initialize our backbone network with weights pretrained on ImageNet [81]. We resize the input images so that the shorter side is 800 and the longer side is at most 1333 to avoid excessive memory cost.
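The resizing rule and the step learning-rate schedule above can be written down directly. The following is a minimal sketch under assumptions: the helper names are hypothetical, the aspect ratio is assumed to be preserved, rounding to the nearest pixel is assumed, and the learning rate is assumed to drop exactly at the stated iterations.

```python
def resized_shape(width, height, short_side=800, long_side_max=1333):
    """Scale so the shorter side is 800 unless the longer side would exceed 1333."""
    scale = short_side / min(width, height)
    if max(width, height) * scale > long_side_max:
        scale = long_side_max / max(width, height)
    return round(width * scale), round(height * scale)


def learning_rate(iteration, base_lr=0.01, milestones=(60_000, 80_000), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each milestone reached."""
    passed = sum(iteration >= m for m in milestones)
    return base_lr * gamma ** passed


print(resized_shape(1920, 1080))  # (1333, 750)
for it in (0, 60_000, 80_000):
    print(it, learning_rate(it))
```

For a 1920 × 1080 XDUAV frame the longer-side limit is the binding constraint, so the image is shrunk to about 1333 × 750 rather than enlarged, which makes the already small vehicles smaller still at the network input.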

Ablation Study
In this section, we take the XDUAV dataset as an example to conduct an ablative analysis for the proposed vehicle detector.

Backbone Network Analysis
To maintain precise spatial location information and demonstrate the effectiveness of the backbone DetNet-49 for small-sized vehicle detection, we run comparative experiments with different backbone networks, including ResNet-50, ResNet-101, and DetNet59, as shown in Table 2. All experiments use the anchor-free regression mechanism [30] to detect UAV vehicles. Firstly, to maintain precise spatial location information, we adopt DetNet59 [40] as the backbone, which employs small down-sampling factors (i.e., 4, 8, and 16). FCOS [30] uses ResNet-50 and ResNet-101 as backbones and adds two additional prediction layers with large down-sampling factors of 64 and 128. Lines 2 and 6 of Table 2 show that DetNet59 with the anchor-free regression mechanism is 0.4% and 1.9% higher in $AP$ and $AP_S$, respectively, than ResNet-50. Although the $AP$ value is the same for ResNet-101 and DetNet59, as shown in lines 4 and 6, the $AP_S$ and $AP_M$ values of DetNet59 are higher than those of ResNet-101. For a fair comparison, we remove the two additional layers with the large factors 64 and 128 from FCOS [30]. The results in lines 3 and 5, without the additional layers, show that the accuracy $AP_S$ for small-sized vehicles is higher than with the additional layers, but still lower than that of DetNet59 with the small factors (i.e., 8 and 16). These results demonstrate the importance of spatial location information for small-sized vehicles.
Secondly, to verify the effectiveness of the proposed DetNet-49, we analyze the role of the deeper convolutional stage, stage6, of DetNet59. As shown in lines 6 and 7 of Table 2, the comparison indicates that stage6 does not help small-sized vehicle detection. Because deeper convolutional layers introduce receptive fields that are too large for small-sized vehicles, they bring too much background interference and compromise detection performance. The experimental results show that the proposed backbone DetNet-49 not only reduces the parameters and computational cost of the model, making training more stable, but also offers precise location information and matched receptive fields that improve detection performance for small-sized vehicles.

Multi-Scale Semantic Enhancement Block (MSEB) Analysis
As shown in Table 2, the $AP_L$ value with DetNet-49 as the backbone is lower than that of the other backbones, which shows that the absence of deeper convolutional layers leads to insufficient semantics for large-sized vehicles. Therefore, we use the proposed MSEB to enhance the semantics of large-sized vehicles while maintaining spatial location information for small-sized vehicles.
To verify the effectiveness of the designed MSEB, we use the different blocks shown in Figure 4 to predict the targets and analyze their impact on detection performance. The original feature pyramid network (FPN) uses only a 1 × 1 convolutional layer, named the M_1 block, to perform the lateral connection, as shown in Figure 4a. For the proposed FEAF, the M_1 block cannot enhance the semantics of UAV vehicles. We therefore adopt two 3 × 3 convolutional layers (the M_2 block, shown in Figure 4b) instead of the M_1 block to enhance vehicle semantics. The experimental results in Table 3 show that the $AP$ and $AP_L$ of the M_2 block are 0.5% and 1.1% higher, respectively, than those of the M_1 block. Although the M_2 block achieves semantic enhancement for large-sized vehicles, the $AP_S$ value for small-sized vehicles decreases, which may be caused by the loss of precise location information in the deeper convolutional operation.
Inspired by GoogleNet [82], we widen the network instead of increasing its depth, as shown in Figure 4c,d, which can effectively improve the semantics of vehicles at various scales. Compared with the M_2 block, the M_3 block widens the network to extract richer semantics. However, the $AP_S$ and $AP_M$ of the M_3 block are still lower than those of the original M_1 block. To enhance the semantics and maintain location information for small-sized vehicles, we decompose a 3 × 3 convolution kernel into 1 × 3 and 3 × 1 kernels, as shown in Figure 4d. All AP values increase, as shown in line 5 of Table 3, and the number of parameters is also reduced. The proposed multi-scale semantic enhancement block (MSEB) not only strengthens semantic information for vehicles, but also retains fine location information and creates receptive fields of different scales.

Anchor-Free vs. Anchor-Based Vehicle Detectors
Qualitative comparisons between the anchor-based method and the proposed anchor-free method are shown in Figure 5. Compared with the anchor-based method [24] (Figure 5a), our anchor-free method (Figure 5b) clearly reduces false positives, missed detections, and redundant bounding boxes. We attribute this to two main reasons. Firstly, because the scales and aspect ratios of anchors are fixed in anchor-based vehicle detectors, even with careful design it is difficult to handle object candidates with large scale variations, particularly small-sized vehicles. Secondly, anchor-based vehicle detectors only consider anchor boxes with a sufficiently high IoU with ground-truth boxes as positive samples, whereas anchor-free methods consider all locations falling into any ground-truth box as positive samples. Compared with the anchor-based method, the anchor-free vehicle detector therefore has more positive samples for training, which reduces the imbalance between positive and negative samples.
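The second reason can be illustrated with a toy count of positive samples for one small vehicle. The sketch below is an illustration under assumptions, not the configuration of any detector compared in this paper: it uses a single square anchor of side 32 per location, an IoU threshold of 0.5, a stride of 8, and a hypothetical 20 × 20 ground-truth box.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def count_positives(gt, image_size=256, stride=8, anchor_side=32, iou_thr=0.5):
    """Return (anchor-free positives, anchor-based positives) for one box."""
    centers = [stride // 2 + k * stride for k in range(image_size // stride)]
    half = anchor_side / 2
    anchor_free = anchor_based = 0
    for cy in centers:
        for cx in centers:
            # Anchor-free rule: the location lies strictly inside the box.
            if gt[0] < cx < gt[2] and gt[1] < cy < gt[3]:
                anchor_free += 1
            # Anchor-based rule: the anchor overlaps the box with IoU >= threshold.
            anchor = (cx - half, cy - half, cx + half, cy + half)
            if iou(anchor, gt) >= iou_thr:
                anchor_based += 1
    return anchor_free, anchor_based


small_vehicle = (98, 98, 118, 118)       # hypothetical 20 x 20 box
print(count_positives(small_vehicle))    # (9, 0)
```

With these assumed settings no anchor reaches the IoU threshold, because a 20 × 20 box can cover at most 400/1024 of a 32 × 32 anchor, while nine grid locations fall inside the box and become positives under the anchor-free rule. Real anchor-based detectors mitigate this with several anchor scales and aspect ratios, which is the careful design the text refers to.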

Overall Performance
Owing to the anchor-free regression mechanism, the effective backbone DetNet-49, and the multi-scale semantic enhancement block, the proposed FEAF network achieves 73.5% AP on the XDUAV dataset. We compare it with single-stage detection methods (both anchor-based and anchor-free) and two-stage high-accuracy detection methods using the COCO metric, as shown in Table 4. Among single-stage methods, the proposed FEAF obtains the top AP, which is even better than that of the two-stage methods. Our method is 1.6%, 1.7%, 1.9%, and 2.5% higher in AP than FPN [28], Mask R-CNN [15], FCOS [30], and RetinaNet [32], respectively. All methods are trained under the same conditions, so the comparison is fair.
To further demonstrate the robustness of the proposed method, we also evaluate it on the UAVDT dataset. For the backbone network, the multi-scale semantic enhancement block (MSEB), and the anchor-free mechanism, we conduct the same ablation experiments as on the XDUAV dataset. The experimental results on the UAVDT dataset also confirm the effectiveness of the proposed detector. Table 5 shows the overall results of different detection methods on the UAVDT dataset. Our method achieves 81.4% AP, which is 2.4%, 1.7%, 0.8%, and 2.2% higher than FPN [28], Mask R-CNN [15], FCOS [30], and RetinaNet [32], respectively. Figure 6 shows detection results of the proposed method on the UAVDT dataset in different scenarios. Note that the UAVDT dataset only focuses on vehicles on the main road, and vehicles parked around the road are ignored.

Conclusions
This paper proposes a novel feature-enhanced anchor-free network for vehicle detection in unmanned aerial vehicle (UAV) vision, which achieves accurate detection. Firstly, to avoid the problems caused by setting anchors, we adopt the anchor-free mechanism to eliminate predefined anchor boxes, which relieves the imbalance between positive and negative samples and handles large variations in object scale. Secondly, to enhance the features of vehicles, we design a multi-scale semantic enhancement block (MSEB) and an effective backbone, DetNet-49. The backbone uses fewer layers rather than going deeper, in order to offer matched receptive fields and precise localization information for small-sized vehicles. The MSEB widens the network, which effectively strengthens the discriminative feature representation of vehicles at various scales. The experimental results on two publicly available vehicle datasets demonstrate that the proposed method achieves state-of-the-art detection performance, which confirms the effectiveness and robustness of the detector. In future work, we intend to integrate logical reasoning relationships and knowledge priors into the vehicle detection network to further improve detection accuracy for low-resolution vehicles.
Author Contributions: J.Y. and W.Y. contributed to the idea and the data collection of this study; J.Y. developed the algorithm, performed the experiments, analyzed the experimental results and wrote this paper; X.X. and G.S. supervised the study and reviewed this paper. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest: The authors declare no conflict of interest.