1. Introduction
The task of target detection includes predicting both the location and the category of targets. It is one of the most important components of many applications, such as pedestrian detection [1,2], defect detection [3,4] and obstacle detection in autonomous driving [5,6]. Within target detection research, remote sensing target detection is a key topic because of its important military applications, such as the location and identification of military targets [7], military reconnaissance [8] and military navigation [9]. With the maturity of satellite and aerial photography technology, high-quality remote sensing images can now be generated, which is of far-reaching significance to remote sensing target detection. Remote sensing images usually have lower resolution than ordinary images. Due to their special method of acquisition, remote sensing images have several features that ordinary images do not have:
- (1) Scale diversity: remote sensing images are taken at altitudes ranging from hundreds of meters to dozens of kilometers, so even instances of the same target can differ greatly in scale. By contrast, the scales of targets in ordinary images differ little.
- (2) Particularity of perspective: most remote sensing images are taken from a top-down view, while ordinary images are usually taken from a horizontal view.
- (3) Small target size: remote sensing images are acquired at high altitude and contain a large number of small targets that would be of large size in ordinary images. For example, in remote sensing images, aircraft may occupy only 10 × 10 to 20 × 20 pixels, and cars even fewer.
- (4) Diversity of directions: the orientations of a given kind of target in remote sensing images often vary widely, while target orientation in ordinary images is largely fixed.
Due to the particularities of remote sensing targets compared with ordinary targets, remote sensing target detection is still a challenge for researchers.
Before the great development of deep learning technology, the mainstream target detection algorithms could be divided into three steps: (1) search for the areas where targets may lie (regions of interest, i.e., ROIs); (2) extract the features of those regions; (3) feed the features to a classifier. In order to detect remote sensing targets efficiently, many outstanding algorithms have been proposed. A research group led by Fumio Yamazaki proposed an algorithm based on target motion contour extraction for QuickBird remote sensing images and aerial images, in which a region correlation method was introduced to find the best matching position in multispectral images using the vehicle results extracted from panchromatic images. Considering the dense distribution of vehicle targets, J. Leitloff et al. adopted Geographic Information System (GIS) information to help determine the location and direction of the road, and then extracted the areas of interest and the lanes of concern; methods of target contour extraction, shape function filtering and minimum variance optimization were then used to extract vehicles in aggregated states. For ship target detection, the Integrated Receiver Decoder (IRD) research group divided target detection and recognition into three key steps: pre-detection, feature extraction and selection, and classification and decision, realizing the detection of ship targets in optical remote sensing images. Despite the tireless efforts of researchers, accuracy remains unsatisfactory.
Within the past ten years, the rapid development of deep learning, represented by Convolutional Neural Networks (CNNs) [10,11,12], has brought new light to the field of target detection. The Region-based CNN (R-CNN) [13], proposed in 2014, was a milestone in target detection based on convolutional neural networks. It treated target detection as a classification problem. Based on R-CNN, a series of improved algorithms, such as Fast R-CNN [14] and Faster R-CNN [15], were subsequently proposed. Since R-CNN was proposed, target detection algorithms based on CNN have developed rapidly.
Generally speaking, these algorithms can be divided into two categories, i.e., one-stage and two-stage. The two-stage ones are represented by the milestone of convolutional neural networks mentioned above, i.e., R-CNN. They also include Fast R-CNN, Faster R-CNN, Mask R-CNN [16], SPP-net [17,18,19], FPN (Feature Pyramid Network) [20,21], R-FCN (Region-based Fully Convolutional Network) [22], SNIP [23] and SNIPER [24]. Just as the name implies, the two-stage algorithms first search for regions of interest with a Region Proposal Network (RPN), and then generate category and location information for each proposal. The others are one-stage algorithms, represented by the YOLO series, such as YOLO [25], YOLO-V2 [26], YOLO-V3 [27], YOLO-V3 tiny [28] and YOLO-V4 [29], as well as the SSD series, such as SSD (Single Shot Multibox Detector) [30], DSSD (Deconvolutional Single Shot Multibox Detector) [31], FSSD (Feature Fusion Single Shot Multibox Detector) [32] and LRF [33]. Different from the two-stage algorithms, the one-stage algorithms treat target detection directly as a regression problem: the image to be detected is fed into the network, and the output layers directly produce the location and category information of the targets. Both one-stage and two-stage detectors have advantages. Because of the adoption of the RPN, the two-stage detectors usually have higher precision, while the advantage of the one-stage detectors is higher speed.
In the past few decades, a variety of aeronautical and astronautical remote sensing detectors have been developed and used for obtaining information rapidly and efficiently about the ground. The world's major space powers have accelerated the development of remote sensing technology and successively launched a variety of detection equipment into space. The development of remote sensing technology has greatly alleviated the shortcomings of traditional ground surveys, including small coverage and insufficient data acquisition. As one of the most important tasks in the remote sensing domain, remote sensing target detection is becoming more and more active among researchers. Remote sensing target detection is of great significance to both civilian and military technologies. In the civilian field, high-precision target detection can help cities with traffic management [34] and autonomous driving [35], whereas in the military field, it is widely used in target positioning [36], missile guidance [37] and other aspects.
Compared with shallow machine learning, a Deep Convolutional Neural Network (DCNN) can automatically extract deeper features of the targets. Using a DCNN for target detection avoids the process, required in shallow machine learning, of manually designing specific feature extraction algorithms for different targets; therefore, DCNNs have stronger adaptability. Due to these advantages and the popularity of CNN-based target detection algorithms, CNNs are the first choice for remote sensing target detection. Unlike conventional images, targets in remote sensing images vary greatly in size: smaller ones, like aircraft and cars, may occupy only 10 × 10 pixels, while larger ones, like playgrounds and overpasses, can occupy 300 × 300 pixels. Owing to differences in imaging devices, lighting conditions and weather conditions, remote sensing images also span a wide range of spectral characteristics. To obtain better detection performance, improvements to the network are needed.
Currently, the classical Convolutional Neural Networks have limited performance in detecting remote sensing targets, especially small targets against complex backgrounds. The two-stage detectors are usually slow, while the one-stage ones are not good at dealing with small targets. Aiming to realize real-time remote sensing target detection, adapt to remote sensing target detection under different backgrounds, and effectively improve the accuracy of detecting small remote sensing targets, we designed a new one-stage detector named Feature Enhancement YOLO (FE-YOLO). Our proposed FE-YOLO is based on YOLO-V3, one of the most popular networks for target detection. Experiments on aerial datasets show that FE-YOLO significantly improves the accuracy of detecting remote sensing targets while maintaining real-time performance.
In the rest of this paper, Section 2 introduces the principles of the YOLO series in detail, Section 3 details the improvement strategies of FE-YOLO, Section 4 presents the experimental studies on remote sensing datasets that verify the superiority of our FE-YOLO and Section 5 gives the conclusions of this paper.
2. The Principle of YOLO Series
2.1. The Evolutionary Process of YOLO
The one-stage algorithms are represented by YOLO (You Only Look Once), an end-to-end model based on CNN. Compared with Faster R-CNN, YOLO does not have a cumbersome region proposal stage, so it is extremely popular due to its conciseness. YOLO (Figure 1a) was first proposed by Joseph Redmon et al. in 2015. When performing target detection, the feature extraction network divides the input image into $S \times S$ grid cells, and for each grid cell the network predicts $B$ bounding boxes. Each grid cell yields three kinds of predictive values: (1) the probability that a target lies in the grid cell; (2) the coordinates of the bounding boxes and (3) the category of the target and its probability. For each bounding box, the predictive values include five parameters: $x$, $y$, $w$, $h$ and the confidence. Among them, $(x, y)$ represent the coordinates of the center point of the bounding box, and $w$, $h$ represent its width and height. The confidence of the bounding box is defined as $\text{confidence} = \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$. If the center of a target lies in the grid cell, then $\Pr(\text{Object}) = 1$; otherwise, $\Pr(\text{Object}) = 0$. The network also predicts $C$ class probabilities for each grid cell. Ultimately, the tensor size of the output of the network is $S \times S \times (B \times 5 + C)$. The class-specific confidence can be expressed in Equation (1):

$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \times \text{IOU}^{\text{truth}}_{\text{pred}} \quad (1)$$
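As a quick check (illustrative values, not from this paper), the following snippet computes the output tensor size for the classic YOLO-V1 configuration:

```python
# Output tensor size S x S x (B x 5 + C); S = 7, B = 2, C = 20 are the
# classic YOLO-V1 settings, used here only as an example.
S, B, C = 7, 2, 20
print((S, S, B * 5 + C))   # -> (7, 7, 30)
```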
YOLO-V1 has stunning detection speed compared to Faster R-CNN. However, its accuracy is lower than that of other one-stage detectors such as SSD, and each grid cell can detect only one target. To remedy these deficiencies of YOLO-V1, an advanced version of YOLO (i.e., YOLO-V2) was proposed in 2017. YOLO-V2 (Figure 1b) offered several improvements over the first version. Firstly, instead of fully connected (FC) layers, YOLO-V2 adopted fully convolutional layers, and the feature extraction network was updated to Darknet19. Secondly, borrowing the idea of Faster R-CNN, YOLO-V2 introduced the concept of anchor boxes to match targets of different shapes and sizes. Moreover, Batch Normalization was also adopted. As a result, the accuracy of YOLO-V2 was drastically improved compared to YOLO-V1.
In order to improve the accuracy and enhance the performance of detecting multi-scale targets, the most classic version, YOLO-V3 (Figure 1c), was then introduced. Different from YOLO-V1 and YOLO-V2, which only have one detection layer, YOLO-V3 detects at three different scales. This design was inspired by the FPN (Feature Pyramid Network) and allows smaller targets to be detected. The structures of the original YOLO-V3 and Darknet53 are illustrated in Figure 2 and Table 1, respectively.
The loss function of YOLO-V3 can be divided into three parts: (1) coordinate prediction error, (2) IOU (Intersection over Union) error and (3) classification error.
Coordinate prediction error indicates the accuracy of the position of the bounding box and is defined as:

$$Err_{coord} = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2 \right] \quad (2)$$

In Equation (2), $\lambda_{coord}$ refers to the weight of the coordinate error; we select a fixed value for it in this work. $S^2$ is the number of grid cells of each detection layer, while $B$ is the number of bounding boxes in each grid cell. $I_{ij}^{obj}$ indicates whether a target lies in the $j$-th bounding box of the $i$-th grid cell. $(x_i, y_i, w_i, h_i)$ indicate the center coordinates, width and height of the ground truth box, while $(\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i)$ indicate those of the predicted box.
Intersection over Union (IOU) error indicates the degree of overlap between the ground truth and the predicted box. It can be defined as in Equation (3):

$$Err_{IOU} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} (C_i - \hat{C}_i)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{noobj} (C_i - \hat{C}_i)^2 \quad (3)$$

In Equation (3), $\lambda_{noobj}$ refers to the confidence penalty when there is no object; we select a fixed value for it in this work. $C_i$ and $\hat{C}_i$ represent the confidence of the ground truth and the prediction, respectively.
Classification error represents the accuracy of classification and is defined as:

$$Err_{cls} = \sum_{i=0}^{S^2} I_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2 \quad (4)$$

In Equation (4), $c$ refers to the class the detected target belongs to, $p_i(c)$ refers to the true probability that the target belongs to class $c$ and $\hat{p}_i(c)$ refers to the predicted probability that the target belongs to class $c$.
So, the loss function of YOLO-V3 is as follows:

$$Loss = Err_{coord} + Err_{IOU} + Err_{cls} \quad (5)$$
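For illustration, a minimal NumPy sketch of a loss of this form is given below; it assumes predictions have already been matched to targets, and the weights lambda_coord and lambda_noobj are common default values used only as placeholders here.

```python
import numpy as np

# Toy sketch of the sum-squared loss in Equations (2)-(5); the weights are
# assumed defaults, not necessarily the values chosen in this paper.
def yolo_loss(pred, truth, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    # pred, truth: arrays of shape (N, 5 + num_classes) -> x, y, w, h, conf, classes...
    # obj_mask: boolean array of shape (N,), True where a target is assigned.
    noobj_mask = ~obj_mask
    coord_err = lambda_coord * np.sum(
        (pred[obj_mask, :4] - truth[obj_mask, :4]) ** 2)                 # Equation (2)
    iou_err = (np.sum((pred[obj_mask, 4] - truth[obj_mask, 4]) ** 2)
               + lambda_noobj * np.sum(
                   (pred[noobj_mask, 4] - truth[noobj_mask, 4]) ** 2))   # Equation (3)
    cls_err = np.sum((pred[obj_mask, 5:] - truth[obj_mask, 5:]) ** 2)    # Equation (4)
    return coord_err + iou_err + cls_err                                 # Equation (5)
```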
2.2. K-Means for Appropriate Anchor Boxes
The concept of the anchor box was first proposed in Faster R-CNN. An anchor box is used to assist the network with appropriate bounding boxes. Inspired by Faster R-CNN, YOLO introduced the idea of anchor boxes starting from YOLO-V2. The core idea of the anchor box is to make the bounding boxes match the sizes of the targets before detection. Unlike Faster R-CNN, YOLO-V3 runs K-means to obtain appropriate anchor boxes instead of setting them manually. In order to acquire optimal sizes of anchor boxes, they ought to be as close as possible to the ground truth boxes of the targets, so the IOU (intersection over union) values between them must be as large as possible. Each ground truth box of a target is denoted as $(x_j, y_j, w_j, h_j)$, where $(x_j, y_j)$ refers to the center of the target and $(w_j, h_j)$ refers to its width and height. We then introduce the concept of distance. The distance between a ground truth box and an anchor box (cluster centroid) can be expressed as:

$$d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid}) \quad (6)$$

The IOU between them is defined as follows:

$$\text{IOU} = \frac{S_{\text{overlap}}}{S_{\text{union}}} \quad (7)$$

where $S_{\text{overlap}}$ refers to the overlap area between the two boxes, while $S_{\text{union}}$ refers to the union area between them. The larger the IOU, the smaller the distance between the ground truth box and the anchor box.
Table 2 gives the pseudo code of K-means:
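As an illustration of the clustering described above (independent of the exact pseudo code in Table 2), a minimal NumPy sketch using the 1 − IOU distance of Equation (6) might look as follows; the function and variable names are ours, not from the paper.

```python
import numpy as np

# K-means anchor clustering over (width, height) pairs with the 1 - IOU distance.
def wh_iou(boxes, centroids):
    # IOU between boxes and centroids when both are aligned at the origin.
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0])
             * np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None]
             + (centroids[:, 0] * centroids[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - wh_iou(boxes, centroids), axis=1)   # Equation (6)
        new_centroids = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids  # k anchor (width, height) pairs

# Example: cluster 9 anchors from random ground-truth box sizes.
anchors = kmeans_anchors(np.abs(np.random.randn(500, 2)) * 50 + 10)
```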
When detecting targets, we need to obtain the bounding box values from the predicted values. The process is shown in Equation (8):

$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h} \quad (8)$$

In Equation (8), $(t_x, t_y, t_w, t_h)$ refer to the predicted values of the network, $(c_x, c_y)$ refer to the offset of the grid cell relative to the upper left corner of the feature map, $(p_w, p_h)$ refer to the width and height of the anchor box and $\sigma$ is the sigmoid function. In addition, the correspondence between the bounding box and the output values is shown in Figure 3.
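A minimal sketch of the decoding in Equation (8) is given below; the assumption that offsets and anchor sizes are expressed in feature-map units is ours.

```python
import numpy as np

# Decode network outputs (t_x, t_y, t_w, t_h) into a box (b_x, b_y, b_w, b_h)
# following Equation (8); (c_x, c_y) is the grid cell offset, (p_w, p_h) the anchor size.
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(t, cell_xy, anchor_wh):
    tx, ty, tw, th = t
    cx, cy = cell_xy
    pw, ph = anchor_wh
    bx = sigmoid(tx) + cx          # center x, offset within the cell
    by = sigmoid(ty) + cy          # center y
    bw = pw * np.exp(tw)           # width scaled from the anchor
    bh = ph * np.exp(th)           # height scaled from the anchor
    return bx, by, bw, bh

# Example: raw prediction for the cell at (5, 7) with a 3.0 x 4.5 anchor.
print(decode_box((0.2, -0.1, 0.3, 0.1), (5, 7), (3.0, 4.5)))
```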
2.3. NMS for Eliminating Superfluous Bounding Boxes
After the decoding process, the network may produce a series of bounding boxes, several of which may correspond to the same target. To eliminate superfluous bounding boxes and retain the most precise ones, NMS (Non-Maximum Suppression) is adopted. NMS consists mainly of three steps:
- (1) Step 1: for the bounding boxes belonging to each category, we compute the IOU between each bounding box and the others.
- (2) Step 2: if the IOU is larger than the threshold we set, the two boxes are considered to correspond to the same target and only the bounding box with the higher confidence is retained. In this work, we set a fixed threshold of 0.5.
- (3) Step 3: repeat step 1 and step 2 until all remaining boxes have been processed.
Algorithm 1 gives the pseudo code of NMS in this paper:
Algorithm 1. The pseudo code of Non-Maximum Suppression (NMS) for our approach.
Input:
B: the list of bounding boxes generated by the network.
C: the list of confidences corresponding to the bounding boxes in B.
threshold: the IOU threshold (0.5 in this work).
Output:
F: the list of final bounding boxes.
1: F ← []
2: while B ≠ [] do:
3:   k ← arg max C
4:   F ← F.append(b_k); B ← B.delete(b_k); C ← C.delete(c_k)
5:   for b_i ∈ B do:
6:     if IOU(b_k, b_i) ≥ threshold:
7:       B ← B.delete(b_i); C ← C.delete(c_i)
8:     end
9:   end
10: end
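For reference, a compact NumPy version of Algorithm 1 (greedy NMS) could look as follows; the corner box format and helper names are illustrative assumptions.

```python
import numpy as np

# Greedy NMS over boxes given as (x1, y1, x2, y2) corners; threshold 0.5
# matches the value used in this work.
def iou(box, boxes):
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_box + area_boxes - inter)

def nms(boxes, scores, threshold=0.5):
    order = np.argsort(scores)[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        k = order[0]
        keep.append(k)                        # retain the best box
        rest = order[1:]
        # drop boxes overlapping the kept box by more than the threshold
        order = rest[iou(boxes[k], boxes[rest]) < threshold]
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 58, 62], [100, 100, 150, 150]], float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))   # -> [0, 2]
```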
3. Related Work
The improved models based on YOLO-V3 mentioned above perform poorly in either accuracy or real-time performance, and their performance is even less satisfactory when facing remote sensing images with a large number of small or densely distributed targets. There is still considerable room for improvement in the YOLO-V3 model. Our approach aims to achieve three objectives: (1) detect small targets more effectively than other state-of-the-art one-stage detectors; (2) detect densely distributed targets reliably and (3) achieve real-time performance. To achieve these three purposes, our improvements are described as follows.
3.1. The Lightweight Feature Extraction Network
YOLO-V3 employs Darknet53, which deepens the network while remaining appealing for its speed; however, its accuracy on small targets is limited, which presents a new problem. For an input image of size 416 × 416, the sizes of the three detection layers are 52 × 52, 26 × 26 and 13 × 13, i.e., the feature maps of the three detection layers are down-sampled by 8×, 16× and 32×, respectively. That is to say, if the size of a target is less than 8 × 8 pixels, its feature map may occupy less than one pixel after being processed by the feature extraction network, so the small target is hard to detect. Moreover, if the distance between the centers of two targets is less than eight pixels, their feature maps must lie in the same grid cell, which makes it impossible for the network to distinguish between the two targets. The multi-scale design of YOLO-V3 helps in detecting small targets, but it cannot satisfy the need for detecting small remote sensing targets. In view of the vast computing redundancy and the demand for detecting small targets, we adopted two improvements to the feature extraction network (a size calculation illustrating this effect is sketched below). First, we simplified the feature extraction network to some extent by removing several residual blocks. Second, we replaced the coarsest detection layer with a finer one. These improvements allow the network to detect targets of even smaller size; in addition, densely distributed targets can be better differentiated. The structures of the residual unit and the simplified feature extraction network are shown in Figure 4 and Figure 5, respectively.
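The following small calculation (ours, not from the paper) illustrates the relation between input size, stride and grid size discussed above:

```python
# Relate input size, stride and grid size; a stride-8 layer cannot separate
# targets whose centers are closer than 8 pixels.
def grid_size(input_size, stride):
    return input_size // stride

for stride in (8, 16, 32):
    cells = grid_size(416, stride)
    print(f"stride {stride:2d}: {cells} x {cells} grid; "
          f"targets smaller than {stride} x {stride} px span < 1 cell")
```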
3.2. Feature Enhancement Modules for Feature Extraction Network
The proposed feature extraction network reduces the computation tremendously, but at the same time its capacity for feature extraction is greatly limited. Many approaches can be adopted to improve the performance of the network, such as hardware upgrades and larger-scale datasets. When objective conditions are limited, the most direct way to improve performance is to increase the depth or width of the network. The depth refers to the number of layers of the network; the width refers to the number of channels per layer. However, this approach may lead to two problems:
- (1) With the increase of depth and width, the number of parameters of the network greatly increases, which may easily result in overfitting.
- (2) Increasing the size of the network results in an increase in computation time.
To address these problems, we focused on the skip connection, which has been widely used in Deep Convolutional Neural Networks. During forward propagation, a skip connection enables a very top layer to obtain information from the bottom layers; during backward propagation, it facilitates gradient back-propagation to the bottom layers without diminishing magnitude, which effectively alleviates the gradient vanishing problem and eases optimization. ResNet was one of the first networks to adopt skip connections.
Darknet53 in YOLO-V3 employs ResNet [38], which consists of several residual units (see Figure 4). Nevertheless, too many residual units lead to computational redundancy, while the performance of a lightweight feature extraction network is limited. For this reason, we needed to specifically design feature enhancement modules.
ResNet, which is employed in the feature extraction network of YOLO-V3 (i.e., Darknet53), deepens the network and avoids gradient fading simultaneously. It adds the input directly to the output of the convolutional layers (see Figure 4). The output of a residual unit can be expressed as in Equation (9):

$$x_l = F(x_{l-1}) + x_{l-1} \quad (9)$$

In Equation (9), $x_l$ refers to the output of the $l$-th residual unit of the network and $F$ refers to the transfer function (i.e., the Conv (1 × 1)-BN-ReLU-Conv (3 × 3)-BN-ReLU modules in Figure 4).
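A PyTorch-style sketch of such a residual unit, under our own assumptions about channel widths, is shown below:

```python
import torch
import torch.nn as nn

# Residual unit of Equation (9): a 1x1 then 3x3 convolution (each with BN and
# ReLU) whose output is added to the input. Channel sizes are illustrative.
class ResidualUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.transfer = nn.Sequential(                    # F(x_{l-1})
            nn.Conv2d(channels, channels // 2, 1, bias=False),
            nn.BatchNorm2d(channels // 2), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.transfer(x) + x                       # x_l = F(x_{l-1}) + x_{l-1}

out = ResidualUnit(64)(torch.randn(1, 64, 52, 52))        # shape preserved
```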
In 2017, another classic network, DenseNet [39], was proposed by Huang et al. Different from ResNet, DenseNet concatenates the output of each layer onto the input of every subsequent layer. Although the width of the densely connected network increases linearly with depth, DenseNet provides higher parameter efficiency than ResNet. The output of the $l$-th layer can be expressed as in Equation (10):

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]) \quad (10)$$

In Equation (10), $x_l$ refers to the output of the $l$-th layer of DenseNet, $[x_0, x_1, \ldots, x_{l-1}]$ refers to the concatenation of the feature maps of layers $0$ to $l-1$ and $H_l$ refers to the transfer function. For a network with $L$ layers, there are $L(L+1)/2$ connections. The architecture comparison between ResNet and DenseNet is exhibited in Figure 6.
From the architecture comparison in Figure 6 together with Equations (9) and (10), we can see that ResNet reuses the features that previous layers have already extracted, while the new features produced by each convolutional layer are basically features that have not been extracted before; thus, the redundancy of the features extracted by ResNet is low. By contrast, in DenseNet the features extracted by previous layers are no longer simply reused by later layers; instead, each layer creates entirely new features, although the features extracted in a later layer are likely to overlap with those extracted by previous layers. Summing up, ResNet has a higher reuse rate but a lower redundancy rate of features, whereas DenseNet keeps creating new features but has a higher redundancy rate. Combining these two structures should therefore yield an even more powerful network, and this is how DPN came into being. The Dual Path Network (DPN), proposed by Chen et al., is a strong network incorporating both ResNet and DenseNet. Its architecture is shown in Figure 7.
For each unit in DPN, the feature map that goes through the convolutions is divided into two parts (see Structure 1 and Structure 2 in Figure 7). The two parts serve as inputs of the ResNet path and the DenseNet path, respectively. Then, DPN concatenates the outputs of the two paths as the input of the next unit. By combining the core ideas of the two networks, DPN allows the model to make better use of features. Inspired by DPN, we improved the original feature extraction network (see Figure 5) by adding two feature enhancement modules after 'Residual block 2' and 'Residual block 3', respectively. Because these feature enhancement modules follow the principles of DPN, we call them Dual Path Feature Enhancement (DPFE) modules. Their structures are shown in Figure 8, and a simplified sketch of the dual-path idea is given below.
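To make the dual-path idea concrete, the following is a simplified PyTorch sketch; the channel splits and widths are our own assumptions, not the exact DPFE configuration.

```python
import torch
import torch.nn as nn

# A simplified dual-path block in the spirit of DPN: one slice of the output is
# added to a residual path (ResNet behavior, Equation (9)) and another slice is
# concatenated to a dense path (DenseNet behavior, Equation (10)).
class DualPathBlock(nn.Module):
    def __init__(self, res_channels, dense_channels, growth):
        super().__init__()
        in_ch = res_channels + dense_channels
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, res_channels + growth, 1, bias=False),
            nn.BatchNorm2d(res_channels + growth), nn.ReLU(inplace=True),
        )
        self.res_channels = res_channels

    def forward(self, res_path, dense_path):
        out = self.body(torch.cat([res_path, dense_path], dim=1))
        res_out, dense_new = out[:, :self.res_channels], out[:, self.res_channels:]
        res_path = res_path + res_out                            # residual (additive) path
        dense_path = torch.cat([dense_path, dense_new], dim=1)   # dense (concatenative) path
        return res_path, dense_path

res, dense = torch.randn(1, 64, 26, 26), torch.randn(1, 32, 26, 26)
res, dense = DualPathBlock(64, 32, growth=16)(res, dense)        # dense path widens by 16
```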
3.3. Feature Enhancement Modules for Detection Layers
In YOLO-V3, six convolutional layers are appended to each detection layer. Across the entire operation of YOLO-V3, the convolutional layers of the three detection layers create a great deal of computational redundancy; in addition, many stacked convolutional layers may cause gradient fading. In order to solve this problem, another kind of feature enhancement module is proposed for the detection layers. Inspired by the structures of Inception and ResNet, our proposed feature enhancement modules are called Inception and ResNet Feature Enhancement (IRFE) modules. Their structures are shown in Figure 9.
In Figure 9, the modules adopt 1 × 1, 3 × 3 and 5 × 5 kernels to obtain receptive fields of different sizes, and we merge the resulting features to realize fusion of features at different scales. As the network deepens, the features become more abstract and the receptive field involved in each feature becomes larger. In addition, the 1 × 1, 3 × 3 and 5 × 5 kernels enable the network to learn more nonlinear relations, and the combination with the residual connection enables the network to capture more characteristic information. In order to save the time consumed by the module, the 3 × 3 and 5 × 5 kernels are decomposed into 1 × 3, 3 × 1 and 1 × 5, 5 × 1 kernels, respectively.
Different from the convolutional layers in the detection layers of YOLO-V3, our proposed IRFE modules improve the performance of the network by increasing its breadth. In each branch of IRFE, we adopt convolutional kernels of different sizes to gain different receptive fields, and we concatenate the branches to enrich the information of each layer. A simplified sketch of such a module is given below.
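The following PyTorch sketch illustrates an Inception-ResNet style module with the factorized kernels described above; branch widths are our assumptions, and the actual IRFE layout follows Figure 9.

```python
import torch
import torch.nn as nn

# Multi-branch module with a 1x1 branch and factorized 3x3 (1x3 + 3x1) and
# 5x5 (1x5 + 5x1) branches, concatenated, fused and added back to the input.
def conv_bn(in_ch, out_ch, kernel, padding):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel, padding=padding, bias=False),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class IRFEBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        branch = channels // 3
        self.b1 = conv_bn(channels, branch, 1, 0)                         # 1x1 branch
        self.b3 = nn.Sequential(conv_bn(channels, branch, (1, 3), (0, 1)),
                                conv_bn(branch, branch, (3, 1), (1, 0)))  # factorized 3x3
        self.b5 = nn.Sequential(conv_bn(channels, branch, (1, 5), (0, 2)),
                                conv_bn(branch, branch, (5, 1), (2, 0)))  # factorized 5x5
        self.fuse = conv_bn(3 * branch, channels, 1, 0)                   # merge branches

    def forward(self, x):
        merged = self.fuse(torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1))
        return merged + x                                                 # residual connection

out = IRFEBlock(96)(torch.randn(1, 96, 26, 26))
```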
3.4. The New Residual Blocks Based on Res2Net for Feature Extraction Network
YOLO-V3 relies heavily on the Residual Network (ResNet) in its feature extraction network and achieves good performance. However, it still represents features by hierarchical multi-scale representation, which makes insufficient use of the inner features of a single layer. At present, most existing feature extraction methods represent multi-scale features in a hierarchical way: either multi-scale convolutional kernels are adopted to extract features for each layer (as in SPPNet), or features extracted from different layers are fused (as in FPN). In 2019, a new kind of connection for the residual unit was proposed by Gao et al., and the improved network was termed Res2Net. In this method, several small residual blocks are added inside the original residual unit. The essence of Res2Net is to construct hierarchical residual connections within a single residual unit. Compared with ResNet, it represents multi-scale features at a granular level and increases the range of receptive fields of each layer. Utilizing the idea of Res2Net, we propose the 'Res2 unit' for our feature extraction network.
Figure 10 shows the structure of our proposed 'Res2 unit'; a 'Res2 block' is made up of several 'Res2 units'.
In each 'Res2 unit', the input feature maps are divided evenly into $n$ sub-feature maps (we select $n = 4$ in this paper). Each sub-feature map is represented as $x_i$ ($i \in \{1, 2, \ldots, n\}$). Compared with the feature maps of ResNet, each sub-feature map has the same spatial size but contains only $1/n$ of the channels. $K_i$ refers to the $i$-th 3 × 3 convolutional layer, while $y_i$ represents the output of $K_i$, and $y_i$ can be represented as:

$$y_i = \begin{cases} x_i, & i = 1 \\ K_i * x_i, & i = 2 \\ K_i * (x_i + y_{i-1}), & 2 < i \le n \end{cases} \quad (11)$$

In this paper, with $n = 4$, $y_1$, $y_2$, $y_3$, $y_4$ can be represented as ($*$ refers to convolution):

$$y_1 = x_1, \quad y_2 = K_2 * x_2, \quad y_3 = K_3 * (x_3 + y_2), \quad y_4 = K_4 * (x_4 + y_3) \quad (12)$$

In this paper, we set $n$ as the controlling parameter, which refers to the number of groups into which the input channels are divided on average. The larger $n$ is, the stronger the multi-scale capability. In this way, we obtain outputs with receptive fields of different sizes.
Compared with the residual unit, the improved 'Res2 unit' makes better use of contextual information and helps the detector find small targets, as well as targets subject to environmental interference, more easily. In addition, the extraction of features at multiple scales enhances the semantic representation of the network.
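To illustrate Equations (11) and (12), a PyTorch-style sketch of a Res2Net-like unit with n = 4 is given below; the channel widths and the surrounding connections are our own assumptions, not the exact 'Res2 unit' configuration.

```python
import torch
import torch.nn as nn

# The input is split into n = 4 channel groups; each group after the first
# passes through its own 3x3 convolution, summed with the previous group's output.
class Res2Unit(nn.Module):
    def __init__(self, channels, n=4):
        super().__init__()
        assert channels % n == 0
        self.n, width = n, channels // n
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(width, width, 3, padding=1, bias=False),
                          nn.BatchNorm2d(width), nn.ReLU(inplace=True))
            for _ in range(n - 1)])                         # K_2 ... K_n

    def forward(self, x):
        xs = torch.chunk(x, self.n, dim=1)                  # x_1 ... x_n
        ys = [xs[0]]                                        # y_1 = x_1
        for i in range(1, self.n):
            inp = xs[i] if i == 1 else xs[i] + ys[-1]       # y_2 = K_2*x_2; y_i = K_i*(x_i + y_{i-1})
            ys.append(self.convs[i - 1](inp))
        out = torch.cat(ys, dim=1)
        return out + x                                      # residual connection around the unit

out = Res2Unit(128)(torch.randn(1, 128, 52, 52))            # shape preserved
```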
3.5. Our Proposed Model
Based on what has been discussed in Section 3.1, Section 3.2, Section 3.3 and Section 3.4, our proposed FE-YOLO introduces the lightweight feature extraction network, the feature enhancement modules (DPFE and IRFE) and Res2Net. The structure of FE-YOLO is shown in Figure 11.
Viewed as a whole, the structure of our proposed FE-YOLO is relatively concise. Compared with YOLO-V3, the number of residual blocks in FE-YOLO is drastically reduced. In addition, the sizes of the detection layers are changed from 13 × 13, 26 × 26 and 52 × 52 to 26 × 26, 52 × 52 and 104 × 104, respectively. Besides making the feature extraction network lightweight, FE-YOLO introduces 'Res2 blocks' in place of the last two residual blocks to increase the range of receptive fields. In addition, feature enhancement modules are adopted in the network for better feature extraction. To show the internal structures more clearly, we compare the parameters of the feature extraction networks of YOLO-V3 and our FE-YOLO in Table 3 and Table 4, respectively.
Compared with the 'residual block', the number of parameters of the 'Res2 block' is not increased. Although our proposed FE-YOLO is not as deep as YOLO-V3, it adopts feature enhancement modules and multi-receptive-field blocks. In addition, the lightweight feature extraction network makes it more efficient at detecting small targets. The superiority of our approach is demonstrated in Section 4.