1. Introduction
Object detection has long been a fundamental problem in computer vision and plays an important role in various fields such as civilian and security applications [1]. The development of object detection algorithms over the past ten years can be roughly divided into two stages [2]. Before 2013, most algorithms relied on hand-crafted features; after that, algorithms have mainly been based on CNN features. The traditional detection method can be summarized in three steps: “Region Selection”, “Feature Extraction” and “Classification”. “Region Selection” is a coarse locating process for the target. Since targets may appear anywhere in the image and their sizes are uncertain, a sliding-window strategy [3] is used to traverse the image. To detect objects of different sizes, different scales and aspect ratios are set for the sliding windows. Although the sliding-window strategy can obtain a large number of candidate regions, it also generates many redundant windows, and its time complexity is high. “Feature Extraction” analyzes the candidate regions obtained in the previous step. Due to background diversity, illumination changes, object occlusions, etc., it is not easy to design a feature with decent robustness. Because of the lack of effective image feature representations before deep learning, researchers had to design more diversified detection algorithms (including the SIFT detection algorithm, the histogram of oriented gradients (HOG) detection algorithm and the DPM model [4,5,6]) to compensate for the defects of hand-crafted feature expression. “Classification” uses region classifiers to assign categorical labels to the covered regions. Support vector machines are commonly used here due to their good performance on small-scale training data. In addition, classification techniques such as bagging, cascade learning and AdaBoost are used in the region classification step, leading to further improvements in detection accuracy.
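The sliding-window step described above can be sketched as follows; the scales, aspect ratios, stride, and function name are illustrative choices, not settings taken from any of the cited papers:

```python
import itertools

def sliding_windows(img_w, img_h, scales=(64, 128, 256),
                    ratios=(0.5, 1.0, 2.0), stride=32):
    """Enumerate candidate windows (x, y, w, h) over an image.

    Each (scale, ratio) pair defines a window shape with area close to
    scale**2, where ratio = height / width.
    """
    windows = []
    for scale, ratio in itertools.product(scales, ratios):
        w = int(scale / ratio ** 0.5)
        h = int(scale * ratio ** 0.5)
        # slide the window over every valid position inside the image
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                windows.append((x, y, w, h))
    return windows

cands = sliding_windows(512, 512)
print(len(cands))  # over a thousand windows even for a modest image
```

Even for a modest 512 × 512 image this enumeration already yields over a thousand candidate windows, which illustrates the redundancy and cost noted above.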
Although object detection methods based on traditional manual features are mature, they still face two problems: first, the region-selection strategy based on a sliding window easily generates redundant windows and is time-consuming; second, hand-crafted features are not robust enough to handle object diversity. With the rapid development of deep-learning techniques, object detection algorithms based on deep learning have taken an important place. They can be classified into two major categories: one-stage methods and two-stage methods [3]. Methods which consist of three steps (candidate region proposal, feature extraction and classification) are known as two-stage methods, such as the series of methods based on region convolutional neural networks (RCNN [7], fast-RCNN [8], faster-RCNN [9], and feature pyramid networks [10]). In contrast, methods which do not need any additional operation for region proposal, such as the YOLO series [11], SSD [12] and Retina-Net [3], are one-stage methods.
Small object detection is a branch of object detection that is important for various applications, e.g., traffic management, urban planning, and parking-lot utilization. Detection of ground vehicles or pedestrians by unmanned aerial vehicles (UAVs) and detection of ground objects in remote-sensing images have been intensively explored by relevant researchers. The definition of a small object usually differs depending on the specific application. Bell et al. propose an inside-outside net (ION) structure and define a small object as a target with a size of 32 × 32 pixels or less in a 1024 × 1024 image (COCO dataset [13]), while Maenpaa et al. define a small object as one of approximately 20 × 20 pixels in a 512 × 512 image [14].
In this paper, we focus on vehicle detection in aerial images and propose a feature-balanced pyramid network (FBPN) for better feature extraction. The main contributions of this paper are as follows: (1) a specialized framework which combines FBPN with faster-RCNN is proposed and applied to vehicle detection in aerial images; (2) an annotation method is designed to be better suited to the proposed framework; (3) data augmentation is shown to be effective in the proposed network.
2. Related Work
Prior to the development of deep learning, sliding-window detectors [8] were widely used in object detection. Sliding-window methods utilize hand-crafted feature representations such as HOG together with classifiers such as a support vector machine (SVM) to independently classify every sub-window of an image as belonging to an object or to the background [15,16]. Even though these methods have brought some improvements, hand-crafted features are insufficient to separate vehicles from complex backgrounds. Compared with sliding-window methods, region proposal [9] can determine in advance the locations where targets may appear in the image, which reduces the computational overhead and improves the quality of candidate regions. The series of methods based on region convolutional neural networks (RCNN) uses region proposals for object detection, and the results prove that they perform well on object detection tasks.
RCNN was proposed by Girshick et al. in 2014. The algorithm has three main steps. First, it extracts object proposals from the image. Then, the proposals are warped to the same size and features are extracted using an AlexNet network trained on the ImageNet dataset. Finally, it uses SVM classifiers for false-alarm elimination and category judgment. RCNN achieved good results on the VOC07 dataset, with mAP increasing from 33.7% (DPM-v5 [17]) to 58.5%. Although RCNN has made great progress, its defects are also obvious. First, the training process of RCNN is multi-stage, which is cumbersome and time-consuming. Second, due to repeated feature extraction on high-density candidate regions, its detection speed is relatively slow (40 s per image on a graphics processing unit (GPU), 640 × 480 pixels).
In 2015, Girshick et al. proposed the fast-RCNN detector based on their previous work. The main achievement of fast-RCNN is a multi-task learning method which simultaneously trains the target classification network and the bounding-box regression network during network fine-tuning. On the VOC2007 dataset, fast-RCNN achieves an mAP of 70%, compared with the 58.5% achieved by RCNN. However, because an external algorithm is still needed to extract candidate boxes in advance, it cannot achieve end-to-end processing.
Faster-RCNN is an end-to-end deep learning detection algorithm with a fast processing speed (17 FPS on 640 × 480 images). Its main innovation is the region proposal network (RPN), which uses “multi-reference windows” (anchors) to fold the function of external object proposal algorithms (such as selective search or edge boxes) into the same deep network. From RCNN to faster-RCNN, candidate region generation, feature extraction, candidate target validation, and bounding-box regression are gradually unified into one framework. The detection accuracy is increased from the 58.5% achieved by RCNN to 78.8%, and the detection speed is also increased.
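The “multi-reference window” idea can be illustrated with a minimal sketch of anchor generation. The stride, scales, and aspect ratios below are the commonly cited faster-RCNN defaults, and the function name is ours:

```python
import numpy as np

def make_anchors(feat_w, feat_h, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate reference windows (anchors) for every feature-map cell.

    Returns a (feat_h * feat_w * 9, 4) array of (x1, y1, x2, y2) boxes
    in image coordinates, 9 anchors (3 scales x 3 ratios) per cell.
    """
    base = []
    for scale in scales:
        for ratio in ratios:
            # ratio = height / width; keep the anchor area near scale**2
            w = scale * np.sqrt(1.0 / ratio)
            h = scale * np.sqrt(ratio)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                    # (9, 4) anchors around the origin
    xs = (np.arange(feat_w) + 0.5) * stride  # cell centres in image coordinates
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (shifts + base).reshape(-1, 4)

anchors = make_anchors(38, 50)  # e.g. a 608 x 800 input with stride 16
print(anchors.shape)            # (17100, 4): 38 * 50 * 9 anchors
```

The RPN then scores each anchor as object/background and regresses box offsets, which is what removes the need for an external proposal algorithm.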
In the specific domain of small object detection, such as vehicle detection in aerial images, the algorithms mentioned above are not directly applicable, because the vehicles in these images have special characteristics, e.g., small size, low resolution, and inconspicuous features. Small object detection is still one of the problems computer vision urgently needs to overcome; in other words, more targeted networks should be designed for small object detection. Although some datasets have been built for small object detection, the number of samples in these datasets is simply not comparable to that of conventional datasets. For example, the ImageNet [18] dataset contains 1,034,908 images with bounding-box annotations, while a specially made small-object dataset (Vehicle Detection in Aerial Imagery, VEDAI) has only 1210 images [19]. This also brings a challenge to small object detection: the performance of the detector must be improved on the basis of a small amount of training data. In the following, some specially designed object detection algorithms for aerial images are systematically introduced.
In 2015, Razakarivony et al. put forward VEDAI, a new database of aerial images [19]. They compared several object detection algorithms and found that most of them are not suitable for small object detection.
In 2017, the Lawrence Livermore National Laboratory of the United States [1] proposed an algorithm which modifies faster-RCNN to train a model for locating small vehicles in VEDAI. The algorithm modified the anchors used in the RPN module of faster-RCNN and adjusted the input of the RPN. The experiments showed that the modified faster-RCNN achieved substantial improvements in mAP compared to template-based sliding-window methods.
In 2018, Koga et al. applied hard example mining (HEM) to the training process of a convolutional neural network for vehicle detection in aerial images [2]. They combined a sliding-window method with a CNN architecture: candidate bounding boxes were scattered densely over the entire image, and those containing no vehicle were screened out, with HEM applied to the training of the CNN used for the screening. The proposed method successfully promoted the learning of finer features and improved accuracy.
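A minimal sketch of such hard-negative mining is shown below; the 3:1 negative-to-positive ratio and the function name are illustrative choices, not values from the cited work:

```python
import numpy as np

def mine_hard_negatives(losses, labels, neg_pos_ratio=3):
    """Online hard example mining over one batch of candidate boxes.

    losses: per-candidate classification loss; labels: 1 = vehicle,
    0 = background. Keeps all positives and only the highest-loss
    negatives, at a fixed negative-to-positive ratio.
    """
    losses = np.asarray(losses)
    labels = np.asarray(labels)
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    n_keep = max(1, neg_pos_ratio * len(pos_idx))
    # sort negatives by loss, descending, and keep only the hardest ones
    hard_neg = neg_idx[np.argsort(losses[neg_idx])[::-1][:n_keep]]
    return np.concatenate([pos_idx, hard_neg])

keep = mine_hard_negatives(
    losses=[0.1, 2.0, 0.05, 1.5, 0.2], labels=[1, 0, 0, 0, 0])
print(sorted(keep.tolist()))  # [0, 1, 3, 4]: the positive plus 3 hardest negatives
```

Training on the retained subset concentrates gradient updates on confusing background windows, which is what drives the finer feature learning reported above.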
Yang et al. proposed a novel double focal loss convolutional neural network framework (DFL-CNN) in 2018, which is also an improved version of faster-RCNN [20]. DFL-CNN uses skip connections to combine features (conv5-3 and conv5-5) of faster-RCNN, which enhances the network’s ability to distinguish individual vehicles in a crowded scene. To address the imbalance between classes and between easy and hard examples, it adopts the focal loss function instead of the cross-entropy function in both the region proposal stage and the classification stage. The proposed network outperforms many comparable detectors.
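The focal loss replaces the cross-entropy term −log(p_t) with −α_t(1 − p_t)^γ log(p_t), so that well-classified examples are down-weighted and hard examples dominate the loss. A minimal NumPy sketch, using the α = 0.25, γ = 2 defaults from the RetinaNet paper rather than values from DFL-CNN, might look as follows:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p: predicted foreground probability; y: label in {0, 1}.
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    p_t = np.where(y == 1, p, 1.0 - p)          # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# an easy, confidently correct negative contributes almost nothing...
easy = float(focal_loss(0.01, 0))
# ...while a hard, misclassified positive keeps a large loss
hard = float(focal_loss(0.01, 1))
print(easy, hard)
```

With γ = 0 and α = 0.5 the expression reduces (up to a constant factor) to ordinary cross-entropy, which makes the down-weighting effect of γ easy to see.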
Ding et al. proposed a region of interest (RoI) transformer to solve the mismatches between RoIs and objects [21], which arise when small objects are densely packed in aerial images. The experimental results demonstrated that, by utilizing a rotated position-sensitive RoI transformer based on a rotated RoI learner, the proposed algorithm achieves better performance than the deformable position-sensitive RoI pooling method.
In FPN [22], the path between low-level and high-level features is long, which makes it difficult for the levels responsible for large targets to access the accurate positioning information carried by low-level features. In order to shorten this information path and enhance the feature pyramid with accurate low-level positioning information, PANet [22] adds bottom-up path augmentation on top of FPN, thus improving the ability to detect small objects.
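The two pathways can be sketched at the level of feature-map shapes. This toy version uses single-channel maps and element-wise addition, omitting the 1 × 1 lateral and 3 × 3 smoothing convolutions of the real FPN/PANet:

```python
import numpy as np

def top_down(features):
    """FPN-style top-down pathway: upsample the coarser map and add it in.

    features: list of (H, W) maps ordered from low level (large map) to
    high level (small map), assumed to share one channel (a simplification).
    """
    out = [features[-1]]
    for f in reversed(features[:-1]):
        up = out[0].repeat(2, axis=0).repeat(2, axis=1)  # nearest 2x upsample
        out.insert(0, f + up)
    return out

def bottom_up(features):
    """PANet-style bottom-up augmentation: downsample the finer map and add it."""
    out = [features[0]]
    for f in features[1:]:
        down = out[-1][::2, ::2]  # stride-2 downsample
        out.append(f + down)
    return out

pyramid = [np.ones((32, 32)), np.ones((16, 16)), np.ones((8, 8))]
fused = bottom_up(top_down(pyramid))
print([f.shape for f in fused])  # [(32, 32), (16, 16), (8, 8)]
```

The point of the second pass is that low-level localization detail now reaches the coarser levels through a short path instead of traversing the whole backbone.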
In 2019, Cheng et al. proposed a CNN model based on rotation invariance and Fisher discrimination [23]. The model introduces an objective function which can be optimized to impose rotation-invariance constraints and Fisher discrimination on the generated CNN features. Hu et al. focused on the large variance of scales and designed a scale-insensitive convolutional neural network, which is accomplished by a context-aware RoI pooling and a multi-branch decision network [24]. Ju et al. proposed a specially designed network for small object detection [25]. This network combines a ‘dilated module’ with feature fusion and a ‘pass-through module’, and performs at the same level as YOLO V3 with a much higher processing speed. Mandal et al. also designed a fast small object detector named SSSDET (simple short and shallow network) [26]. According to their tests, this algorithm outperforms YOLO V3 while running at the same speed as YOLO V3-Tiny. A new airborne image dataset named ABD was also proposed in that paper.
In 2020, Feng et al. focused on vehicle trajectory data under mixed traffic conditions [27]. By detecting vehicles from UAV videos under mixed traffic conditions, they designed a novel framework for accurate vehicle trajectory construction. Zhou et al. focused on detecting vehicle logos affected by motion blur, and designed a Filter-DeblurGAN which possesses a judgment mechanism to decide whether an image needs to be deblurred [28]. Moreover, a new vehicle logo dataset named LOGO-17 was released. Mandal et al. proposed a one-stage vehicle detection network named AVDNet, which adopts specially designed ConvRes residual blocks and enlarged output feature maps. According to the experiments, the proposed algorithm outperforms YOLO V3, faster-RCNN, and RetinaNet. Liao et al. were aware of the mismatching problem when detecting dense and small objects in aerial images, and designed a local-aware region convolutional neural network (LRCNN) to solve it [29]. Rabbi et al. designed an edge-enhanced super-resolution GAN (EESRGAN) to enhance the quality of remote-sensing images [30]. The experimental results demonstrate that EESRGAN and the edge-enhanced network can improve the performance of some object detectors, e.g., faster-RCNN and the single-shot multibox detector.