1. Introduction
With the rapidly developing technology applied in the remote sensing field, target detection in remote sensing images has been a popular area of research because of its wide applications [
1,
2,
3,
4,
5]. Meanwhile, object detection in optical remote sensing is faced with various challenges, including changes in the appearance of objects caused by illumination, observation angle, shadow, weather, and partial shelter. These challenges result in a relatively lower accuracy and higher computational complexity compared with general object detection methods. In recent years, deep convolution neural networks (CNNs) have been used by researchers to detect objects, and a variety of methods have been proposed to achieve better performance [
6,
7,
8,
9,
10,
11,
12,
13,
14]. The excellent performance of CNN arouses the research interest of scholars. Detection methods based on CNN have been applied in fields such as image denoising, object detection, semanteme division, pose estimation, and object tracking in natural scene images [
15,
16,
17,
18]. Meanwhile, some improved CNN methods have been applied in optical remote sensing images to promote performance and applicability [
1,
19,
20,
21]. Besides, a new set of methods based on Capsule network are proposed for object detection [
22,
23,
24,
25,
26], which may offer another alternative solution for optical remote sensing images as an emerging research area.
In general, typical deep convolutional neural networks can be divided into anchor-based and anchor-free according to the regression of bounding box. For the anchor-based framework, the regression of the bounding box relies on a set of pre-defined anchor boxes with various sizes and aspect ratios. Detect networks such as Faster R-CNN [
27], SSD [
14], YOLOv2 [
28] and YOLOv3 [
29] are of the anchor-based framework. This type of method needs to select a certain number of anchor boxes on the feature layer, then it calculates the confidence of objects and refines bounding boxes according to the intersection over union (IoU), and finally draws detection results including these refined bounding boxes and confidences on original images. However, the use of the anchor box suffers from two drawbacks. First, the network needs a certain number of preset anchor boxes to match different objects with various scales and aspect ratios, while most of the preset boxes are invalid that would increase the training time of the model [
30]. Second, the anchor-based method introduces many hyper-parameters that need to be optimized [
14,
30], which increases the difficulty of model design.
Due to the emergence of Feature Pyramid Networks [
31] and Focal Loss [
30], some scholarly research has been performed to investigate anchor-free network frameworks. The networks directly detect objects without presetting anchor boxes, which is conducive to the design of the network and the improvement of training speed. Anchor-free detectors can be divided into two different types. The first type delineates the bounding box by merging a pair of corner points of the target. This kind of anchor-free detector, such as CornerNet [
32,
33] and MatrixNets [
34], is classified as the keypoint-based network. The other type locates the center point of the targets and then predicts the width and height of the bounding box. This kind of anchor-free detector is called a centerpoint-based network [
35,
36]. The CenterNet [
35] doesn’t need to preset a large number of anchor boxes and has achieved similar or even better performance compared with anchor-based detectors. Centerpoints appear inside the object while corners are outside, so the extraction of centerpoints is more accurate. The centerpoint-based detectors have more advantages in object detection.
Focal Loss is used to solve the class imbalance between foreground and background as well as between easy and hard examples. However, Li et al. [
37] points out that Focal Loss defines a static ratio coefficient to increase attention to hard examples during training, which does not adapt the change of data distribution. In this context, the article propose a gradient harmonizing mechanism (GHM) to restrict the disharmony. In the field of remote sensing, there is a large number of hard examples in remote sensing images because of illumination, shadow, shelter, haze and so on. Referring to the experimental results, we define these indistinguishable examples as outliers because they exist stably even after the model is converged. The schematic diagram of outliers is shown in
Figure 1. We introduce the gradient harmonizing method into the Centernet to improve the performance of the model. However, the gradient harmonizing mechanism doesn’t consider the imbalance between positive examples and negative examples, and the inhibition mechanism of example imbalance needs to be further improved.
We propose a dynamic gradient adjustment anchor-free object detection method used for optical remote sensing field. The model is based on CenterNet [
36], and the deep layer aggregation [
38] combined with deformable convolution layers [
39] was chosen as the backbone network to extract effective features of objects. In order to settle the problem of example imbalance and improve the performance of the detector, we propose a dynamic gradient adjustment network which embeds the gradient density into the updated Focal Loss function. Furthermore, the Complete-IoU loss function [
40] is used for the regression of bounding box, which can harmonize the coupling characteristic of border size and its location. Contrast experiments were conducted to compare the performance of the proposed method with other advanced detectors on dataset UCAS-AOD and LEVIR. The experimental results show that the proposed network significantly improves the accuracy of object detection.
The main contributions of this article are listed as follows:
We propose a dynamic gradient adjustment method to harmonize the quantity imbalance between easy and hard examples as well as between positive and negative examples.
For the detection of small and dense objects in optical remote sensing images, we select a deep layer aggregation combined with deformable convolution as the backbone network, which has a better prediction performance.
We select the Complete-IoU as regression loss function instead of the sum of dimension loss and position offset loss to improve learning efficiency.
The object detection precision and recall in optical remote sensing images demonstrate the significant improvement of our method.
The rest of the article is organized as follows.
Section 2 reviews relevant works in remote sensing object detection and describes basic principles of CenterNet. We expound the proposed improved method in
Section 3. The effectiveness of the proposed method is verified and comparison experiments with the existing detectors are conducted in
Section 4. Finally,
Section 5 gives the conclusion of this article.
3. Proposed Approach
In our proposed approach, the Centernet [
36] is used as our baseline, and the framework of the dynamic adjustment network is shown in
Figure 2. The detection regards the object as a point and then regress the size and offset of bounding box. The deep convolution network extracts features of the input image and generates a heatmap. Peaks in the heatmap indicate detected objects. Due to the influence of light, obscuration and sharpness, images contain a few objects that are difficult to distinguish. We call them outliers. Dynamic gradient adjustment module is proposed to dynamically refine the loss weight depend on density distribution and minimize the effect of these outliers.
Deep Layer Aggregation (DLA34) [
38] is used as the backbone network of the model, which is illustrated in
Figure 3. The upgraded deformable convolutional net (DCN) is used as up sampling to connect features between layers in the network. The shallow layer has spatial-rich information and is more advantageous to detect small targets, while deep layer with semantic-rich information has better robustness and is more suitable for detecting large targets. This approach realizes multi-scale feature fusion and effectively extracts the feature information of objects at different scales. Besides, we introduce the CIoU to improve the convergence speed and accuracy of the model.
3.1. Dynamic Gradient Adjustment for Classification Function
This region is focused on classification in anchor-free object detection where the class (foreground and background) of images, as well as easy and hard examples, are quite imbalanced. Concerning this issue, we propose a dynamic refinement mechanism to inhibit these two disharmonies based on updated Focal Loss and GHM. Both of them originate from the multi-classification cross entropy loss:
In Equation (1), denotes the ground-truth label for a certain class and is the probability predicted by the detector.
Experiments show that the large class imbalance, as well as disharmony between easy and hard examples, limits the improvement of detector performance. We propose the dynamic gradient adjustment focal loss to settle the problem:
The input image
is defined as
with width
and height
. The detector extracts the targets as points, and then maps all ground truth keypoints onto a heatmap
.
indicates the number of classes, while
is the down-sampling factor of the network. The heatmap
is expected to be equal to 1 at target centers and adjacent positions are decreasing in order. The values on the heatmap are not less than 0. Refering to the practice of CornerNet [
32] and CenterNet [
36], the ground truth center points of targets are mapped using a 2D Gaussian kernel
, where
and
are the target center points after down-sampled, while
is a standard deviation which changes with the size of the object. In Equation (2),
is used to harmonize positive and negative examples, while
and
are hyper parameters of the updated Focal Loss, respectively.
N is the number of center-points in an image, and it is applied to normalize all positive focal losses instance to 1.
Meanwhile,
in Equation (2) is introduced to dynamically refine the loss function by reducing weight of easy examples and hard examples. GHM adds the
in Equation (1) to improve the performance of one-stage detectors. Gradient density
and gradient norm
are defined as follows:
In Equation (3), is the gradient norm of the k-th example, and is a function indicating whether is located in the interval centered at with a valid length .
Thus, gradient density GD(g) represents the ratio of examples in the specific interval to all examples. It is a dynamically varying value determined by predictions. is used to down-weights the contribution of outliers as well as easy examples.
3.2. United Bounding Regression Function
The regression loss function of the bounding box in CenterNet is defined as
, where
denotes the loss function of the bounding box size by using the L1 loss and
is introduced to inhibit the discretization error of the center-point caused by the output stride. However, there are two drawbacks to this approach. First, the size and location of the bounding box are regarded as two independent variables, which violated the fact that the size and location of an object are highly correlated. Second, this method introduces weight parameters
and
, which need to be optimized. We select the CIoU as the bounding box regression function and its schematic diagram is shown in
Figure 4.
The united regression function is defined as follows:
where
and
denote the center-points of predicted bounding box and ground truth bounding box, respectively.
is the maximal overlap between the predicted bounding box and the ground truth box.
is the Euclidean distance of two points and
is the diagonal length of the smallest enclosing box that covers the predicted box and ground truth box.
is a positive trade-off parameter and
measures the consistency of box aspect ratio. Besides, Distance IoU (DIoU) is defined similarly to CIoU, which doesn’t consider the effect of aspect ratio. Since the experiment part involves the comparison of both of the methods, the expression of DIoU is given here:
After a series of optimizations to the classification function and regression function, the overall training objective is:
and
are equal to 1 in our experiments. DGANet is used to predict the center-point, offset, and size of objects. All output branches share a common fully convolution backbone network as shown in
Figure 2.
4. Experiment and Discussion
In the experimental section, we describe a series of experiments conducted to demonstrate the effectiveness of the proposed network in optical remote sensing images. First, we introduce datasets used and evaluation metrics for the performance. Second, we describe four groups of experiments conducted to evaluate the performance of the proposed DGANet. Finally, we compare the method with published state-of-the-art approaches to adequately prove the advantages of our DGANet framework.
4.1. Datasets and Evaluation Metrics
UCAS-AOD: Ablation experiments are conducted on an dataset named UCAS-AOD [
48] with two commonly used categories: car and plane. The airplane category is composed of 1000 optical remote sensing images with 7482 objects in total. In contrast, the car category is composed of 510 optical remote sensing images with 7114 items in total. The dataset sequence is randomly arranged. For the training set 80% of figures were randomly chosen, while the rest were adopted as the validation set.
LEVIR: The LEVIR dataset [
49] is selected from high-resolution Google Earth images with more than 22 thousand images. The dataset background covers most types of geomorphic features such as city, country, mountain area, and ocean. There are three categories of objects: airplane, ship (inshore and offshore ships), and oil-tank, respectively. It contains 11 thousand labeled images, including 4724 airplanes, 3025 ships, and 3279 oil-tanks for training. In contrast, 6585 images are set as the validation set.
Experiments were performed on a laptop with an Intel single Core i7 Central Processing Unit (CPU), NVIDIA RTX-2070 GPU (8 GB memory). The deep learning operating environment is Pytorch 1.2.0, CUDA 10.0, Python version 3.6. The initial learning rate was set to 1.25e-4, and the batch size was set to 8.
Evaluation metrics: We selected two widely used indicators mean Average Precision (mAP) and mean Recall (mRecall) to measure the performance of different frameworks of detectors. The evaluation method [
50] is described as follows:
IoU: Intersection-over-union (IoU) is used to describe the matching rate between the ground truth box and the predicted box. If the calculated value of IoU is higher than the threshold, the predicted result is labeled as true positive (TP); otherwise it is annotated as false positive (FP). If a ground truth box does not been matched by any predicted box, it is regarded as an false negative (FN).
Precision and recall: precision and recall are obtained by calculating the number of TP, FP, and FN.
Precision and recall are important indexes of model evaluation, and they can be used to evaluate a certain category.
- 3.
Average precision: Average precision (AP) corresponds to a certain category to calculate the mean value of precision for each recall rates, and it is a comprehensive index sensitive to sort. mAP is the calculated average value of each AP. Meanwhile, mRecall is the mean value of the recall for all classes. We adopted the mAP and mRecall at the IoU threshold of 0.5 to evaluate the performance of the detector. In addition, the objects are divided into three categories by area. Intervals of instance area are (1 × 1, 32 × 32), (32 × 32, 96 × 96), and (96 × 96, inf), corresponding to “small”, “medium”, and “large” respectively. The mean average precisions from IoU = 0.5 to IoU = 0.95 with step size of 0.05 for each group were defined as APS, APM, and APL.
4.2. Ablation Experiments
A series of contrast experiments were conducted to verify the superiority of our proposed approaches. In order to make it more intuitive, we defined ORIG to denote the baseline model CenterNet [
36]. The ablation experiments cover the evaluation of the dynamic gradient adjustment (DGA) and the united bounding regression (UBR), as well as their combination DGANet.
4.2.1. Evaluation of Dynamic Gradient Adjustment
In order to verify the superiority of DGA module, we conducted multiple sets of experiments to compare the performance among the original CenterNet and different DGA frameworks.
Table 1 shows some of the main contrast results of different classification modules. First, we introduced an equilibrium coefficient
to address the imbalance between positive and negative examples. The parameter
was introduced to avoid overlearning easy negative examples that contribute little to the detector, which was set to 0.25 after a series of experiments. A comparison of the results between ORIG and DGA-1 revealed that mAP increased from 0.942 to 0.961 with an increase of 1.9 percent.
Second, gradient density was used to dynamically refine the Focal Loss function to harmonize the imbalance easy and hard examples. DGA-2 stands for the updated gradient harmonizing mechanism, which introduces two hyper-parameters and to improve the recognition ability of the model to real samples. We have designed verification experiments and obtained positive results. When hyper-parameters were equal to 0, the performance of the detector degraded, and the prediction accuracy of GDA-3 deteriorated to 0.927. DGA-4 defines as equal to 4, and the mAP reached 0.952, which achieves 2.5% higher than DGA-3 and 1% higher than ORIG. DGA-5 set the value of parameter to 2, and the accuracy of the model reached 0.956.
Finally, the dynamic gradient adjustment algorithm contains all three optimized parameters , and the framework is abbreviated with DGA. The DGA framework showed the best result with mAP of 0.966 and mRecall of 0.987 among all models in the contrast experiments. Especially for small and dense vehicle objects, the accuracy increased by 2.4% compared with the Focal Loss scheme, and the mean recall increased by 1.9%.
The precision-recall curve is an important indicator of the performance of the CNN network. The large the area between the P-R curve and coordinate axeses, the better the class prediction performance of the mode. In
Figure 5, we have plotted P-R curves of different classification networks on the UCAS-AOD dataset. By comparing the two pictures, it can be found that the AP of the car dataset is lower. This is because the size of a car is less than that of plane. The precision and recall were significantly improved by introducing the DGA module. Our improvement scheme resulted in an improvement of 3.2% and 1.6% on AP for the car and plane. The results, as shown in
Figure 5, indicate that our proposed DGA classification module obviously achieves the best object prediction performance.
Some qualitative results between the original network and our proposed DGA module are illustrated in
Figure 6. Both models can realize the detection of car and plane objects in complex backgrounds. However, there are some differences between these two networks. On the one hand, the prediction confidence of the propose model for significant objects is higher. On the other hand, there were more mispredictions and omissions for the original network. Some car objects in the shadow could not be predicted accurately. Compared with the original network, the proposed method could better overcome the interference of similar objects in the background.
4.2.2. Evaluation of United Bounding Regression
The original bounding box regression directly estimate the center point offset coordinating with height and width, which treats these points as independent variables. However, this method ignores the integrity of the bounding box itself. In order to improve the convergence speed and accuracy of regression, united bounding regression based on IoU was adopted instead of the original independent regression of size and offset. Focal Loss was selected as the classification module, which is same as the original network. At the same time, IoU, DioU, and CIoU were adopted for the regression of borders to qualitative analyze the influence of the bounding box regression on the prediction network.
In
Table 2 it can be seen that the prediction network reached the best performance when CIoU was adopted. This is because CioU loss simultaneously considers the overlapping area, the distance between center points, and the aspect ratio. From the data, we can see that the mAP improved from 0.942 (ORIG) to 0.964 (CIOU) and mRecall improved from 0.968 to 0.985.
4.2.3. Evaluation of Combining DGA and UBR
The proposed DGANet network combines DGA with the UBR module to improve detector performance in remote sensing images. The comparison results are listed in
Table 3.
Compared with a single module improvement, the DGANet network performance is more advantageous. The mean average precision rose further to 0.971 and the mean recall reached 0.987. Compared with the original network, the accuracy increased by 2.9% and the mRecall increased by 1.9%. We improved the Centernet with a dynamic gradient adjustment mechanism, since it significantly increased the ability of the network to learn effective features and almost no reduction of the network operation speed. DGANet achieved a better convergence speed compared with the original network.
4.2.4. Evaluation of Backbone Model
Table 4 compares the performance of the updated anchor-free detector based on different backbones: ResNet101, Hourglass, and DLA34, and the effect of the deformable convolution layer was also experimentally compared. From the table, it can be seen that the performance of Hourglass or DLA34 is obviously better than that of ResNet101. The mAP of ResNet101 can reach 0.891, which decreases about six percentage points compared with either of the other two backbones. We added the DCN block over the ResNet101 and DLA34, since it adaptively adjusts according to the scale and shape of the objects. This module resulted in a promotion of 1.9% and 1% on mAP for the ReseNet101 and DLA34, respectively. P-R curves of different backbone models are shown in
Figure 7, our studies demonstrate that the DLA34 combined with DCN is considerably better compared with other backbone models.
4.3. Comparison with the State-of-the-Art Methods
We compared the proposed DGANet with the state-of-the-art algorithms on two datasets: UCAS-AOD and LEVIR. Experimental results demonstrate that the performance of DGANet exceeds all other published methods.
Results on UCAS-AOD. We performed extensive comparison between the proposed detector with other state-of-the-art methods used on the UCAS-AOD dataset. Among the comparison experiments, we chose some well-known frameworks, such as Faster RCNN, YOLOv2, SSD, RFCN, and YOLOv4. In addition, many deep CNNs such as VGG-Net, ResNet50, ResNet101, GooLeNet, and DenseNet65 were involved. The overall comparison performance is listed in
Table 5.
In this section, we mainly compare the average precision among multiple frameworks and baselines. From the table, it can be noticed that our method achieved 0.956 for the average precision of the car and 0.987 for the average precision of the plane, and both results are highly competitive with the existing published methods. The mAP of our proposed method reached 0.971. From the results, we note that the precision of our approach is considerably better than that of other methods.
Figure 8 shows typical detection results on UCAS-AOD dataset by the proposed method. It can be seen that DGANet can detect objects accurately in a variety of challenging backgrounds, and it performs well in a densely arranged scene.
Results on LEVIR. We compare DGANet predicted results with other models on the LEVIR dataset. LEVIR is a new remote sensing images dataset, which has three categories and is used to realize horizontal bounding box target detection. We also compare our network with other competitive methods, such as Faster RCNN, Sig-NMS, SSD300, RetinaNet500, and Centernet.
Table 6 shows the comparison results among different models on the LEVIR dataset. From the data in
Table 6 it is apparent that DGANet achievds 0.863 for the mAP, which pulls away from other existing published methods. Compared with the well-known methods Faster RCNN and SSD, DGANet outperformed them with a 6.1% and 7% promotion on mAP.
Compared with the original networks, the proposed detector results in improvements of 1.9%, 4.3%, and 4.4% respectively on AP for the plane, ship, and oil-tank dataset. The mAP is improved from 0.828 to 0.863 with a increasement of 3.5% by using dynamic gradient adjustment mechanism and CioU regression module.
Figure 9 shows some detection results on the LEVIR by using the proposed DGANet method.