Improved YOLO-V3 with DenseNet for Multi-Scale Remote Sensing Target Detection

Remote sensing targets appear at different scales, are densely distributed, and lie against complex backgrounds, which makes remote sensing target detection difficult. Aiming at detecting remote sensing targets at different scales, a new You Only Look Once (YOLO)-V3-based model is proposed. To remedy the poor performance of YOLO-V3 on remote sensing targets, we adopted DenseNet (Densely Connected Network) to enhance its feature extraction capability. Moreover, the number of detection scales was increased from three to four relative to the original YOLO-V3. Experiments on the RSOD (Remote Sensing Object Detection) dataset and the UCS-AOD (Dataset of Object Detection in Aerial Images) dataset showed that our approach outperformed Faster-RCNN, SSD (Single Shot Multibox Detector), YOLO-V3, and YOLO-V3 tiny in terms of accuracy. Compared with the original YOLO-V3, the mAP (mean Average Precision) of our approach increased from 77.10% to 88.73% on the RSOD dataset. In particular, the mAP for target classes such as aircraft, which consist mainly of small targets, increased by 12.12%. In addition, the detection speed was not significantly reduced. In summary, our approach achieves higher accuracy while retaining real-time performance for remote sensing target detection.


Introduction
Recently, remote sensing images [1][2][3][4] have attracted more research in the field of computer vision (CV) with the rapid development of satellite and imaging technology. Extracting information from remote sensing images is of significant value. Remote sensing target detection [5][6][7] has important and extensive applications in the military, navigation, salvage, and other fields, which require target detection algorithms with high speed and accuracy.
The rapid development of computer technology makes applications of the convolutional neural network (CNN) [8][9][10][11], which requires high computing power, possible. Compared with traditional target detection algorithms such as HOG-SVM (Histogram of Oriented Gradients-Support Vector Machine) [12,13], DPM (Deformable Parts Model) [14,15], and HOG-Cascade [16,17], CNN-based target detection algorithms have great advantages in many aspects, including speed and accuracy. A convolutional neural network is a kind of feed-forward neural network with convolutional computation, usually with a deep structure, and it is one of the most important components of deep learning [18][19][20]. Recently, research on deep learning for target detection has become a hot spot. CNN-based target detection models can mainly be divided into two categories: two-stage models and one-stage models. These models achieve a good detection effect on conventional datasets. However, remote sensing images have large resolution, the scales of remote sensing targets are small, and the backgrounds are complex, so algorithms with excellent performance on routine datasets are not directly suitable for remote sensing target detection. Therefore, we need to design the feature extraction network and detection networks of our proposed algorithm elaborately.
According to the characteristics of remote sensing targets, the proposed method improves on the YOLO-V3 model. The main contributions of this paper are as follows. (1) In order to reduce the reliance on ResNet and enhance the ability to extract feature information, improved densely connected units, inspired by DenseNet, are proposed to replace some of the residual units of Darknet53. (2) To further improve the ability to detect multi-scale remote sensing targets, we extend the original three output layers of YOLO-V3 to four. (3) In order to avoid gradient vanishing, three residual units are adopted in each detection layer instead of five convolutional layers. The experimental results on remote sensing images show that the proposed method not only performs well in accuracy but also attends to real-time performance for remote sensing target detection.
The rest of this paper is organized as follows. In Section 2, we introduce the theory of YOLO and the framework of YOLO-V3. In Section 3, we describe our improved method in detail. Section 4 presents the experiments of the proposed algorithm on the RSOD dataset and compares the performance of our approach with other classical algorithms. Lastly, the conclusion is given in Section 5.

The Theory of YOLO
YOLO (You Only Look Once) is a one-stage algorithm, which transforms target detection into a regression problem. Compared with Faster R-CNN, YOLO obtains the predicted locations and categories directly, without a region proposal network (RPN). After continuous development, YOLO has evolved from YOLO-V1 to YOLO-V2 and the latest YOLO-V3.

The Principle of YOLO
At the beginning, the network divides each input image into S × S grid cells. The grid cell in which the center of the ground truth (GT) of a target falls is responsible for detecting it. Each grid cell predicts B bounding boxes as well as their corresponding confidence scores, and each grid cell also predicts C conditional class probabilities, denoted P(Class_i | Object). If the center of a target falls in the grid cell, then P(Object) = 1; otherwise, P(Object) = 0. The confidence score is defined as P(Object) × IOU_pred^truth. It reflects the probability that the grid cell contains a target and how accurately the bounding box is predicted. IOU represents the overlap ratio between the bounding box and the ground truth (GT). The class-specific score is given in Equation (1):

P(Class_i | Object) × P(Object) × IOU_pred^truth = P(Class_i) × IOU_pred^truth    (1)

YOLO made greater achievements than Faster R-CNN in terms of speed, but at the cost of lower detection accuracy. On the basis of YOLO-V1, YOLO-V2 introduces the concept of the anchor box and runs k-means on the dataset beforehand to generate appropriate prior boxes. Instead of fully connected (FC) layers, YOLO-V2 uses convolutional layers at the output end. In addition, YOLO-V2 adopts Batch Normalization and a new feature extraction network (Darknet19), which greatly improves the performance compared with YOLO-V1.
YOLO-V3 is a further improved version based on YOLO-V2 by upgrading the original Darknet19 to Darknet53 and adopts multi-scale detection layers (three scales) to detect the targets. This allows YOLO-V3 to detect small targets more effectively.

The Network of YOLO-V3
YOLO-V3 adopts Darknet53 as its feature extraction network. In order to prevent information loss caused by pooling layers, Darknet53 adopts a full convolutional network (FCN). The network is basically made up of convolutional kernels of 1 × 1 or 3 × 3. Since it contains 53 convolutional layers, it is called Darknet53. In order to extract deeper features and avoid gradient fading by drawing on the residual network, Darknet53 added five residual modules to the network in which each was composed of one or multiple residual units.
YOLO-V3 borrows the idea of the feature pyramid network (FPN). The network carries out down-sampling five times on each input image, so the output feature map of the feature extraction network is down-sampled by 32×, i.e., it is 1/32 of the size of the input image. YOLO-V3 then transmits the last three down-sampled layers to the detection layers for target detection, predicting at three scales. The sizes of the three scales are 13 × 13, 26 × 26, and 52 × 52, which are responsible for detecting big targets, medium-sized targets, and small targets, respectively. The deep-level feature maps contain a mass of semantic information while the shallow-level feature maps contain a mass of fine-grained information. Therefore, to carry out feature fusion, the network up-samples the feature map down-sampled by 32× so that its size is consistent with the feature map down-sampled by 16×, and then merges the feature maps by concatenation. The same is done for the feature maps down-sampled by 16× and 8×. The structure of YOLO-V3 and its feature extraction network are shown in Figure 1 and Table 1, respectively.
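The up-sample-and-concatenate fusion described above can be sketched with plain NumPy arrays; the channel counts below are illustrative, not the exact widths of Darknet53:

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbor 2x up-sampling of an (H, W, C) feature map."""
    return fmap.repeat(2, axis=0).repeat(2, axis=1)

def fuse(deep, shallow):
    """Up-sample the deeper (smaller) map and concatenate along channels."""
    return np.concatenate([upsample2x(deep), shallow], axis=-1)

deep = np.zeros((13, 13, 256))     # feature map down-sampled by 32x
shallow = np.zeros((26, 26, 128))  # feature map down-sampled by 16x
fused = fuse(deep, shallow)
print(fused.shape)                 # (26, 26, 384)
```

The same `fuse` step would be repeated once more for the 16× and 8× maps.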

Improved Densely Connected Network
The improvement of You Only Look Once (YOLO)-V3 is mainly based on the concept of a residual network. Darknet53 uses several residual units, and the ResNet made up of these residual units contains a large number of parameters and is responsible for the main calculations of the YOLO-V3 network. Unlike ResNet, which adds the values of subsequent layers by constructing an identity mapping, DenseNet [53] connects all the layers by channel merging to achieve feature reuse. Compared with ResNet, the back propagation of the gradient is enhanced, which makes better use of feature information and improves the transmission of information between layers. The structure of DenseNet is shown in Figure 2.
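The difference between residual addition and dense channel concatenation can be illustrated with a toy sketch; the transport layer `H` and the growth rate `k` below are placeholders, not the paper's DCBL modules:

```python
import numpy as np

def residual_unit(x, f):
    """ResNet-style unit: element-wise addition of the layer output to its input."""
    return x + f(x)

def dense_block(x, layers):
    """DenseNet-style block: each layer H_l sees the channel-wise
    concatenation of the input and all earlier layer outputs."""
    features = [x]
    for H in layers:
        out = H(np.concatenate(features, axis=-1))
        features.append(out)
    return np.concatenate(features, axis=-1)

# stand-in transport layer: always emits a fixed growth of k = 2 channels
k = 2
H = lambda t: t[..., :k] * 0.0 + 1.0

x = np.zeros((4, 4, 3))
y = dense_block(x, [H, H, H, H])
print(y.shape)   # (4, 4, 11): 3 input channels + 4 layers x growth 2
```

Note how the dense block's channel count grows with depth, which is why the number of layers per module has to be limited.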

Suppose a dense block contains l layers; each layer is connected to all the others, so each layer receives the feature maps of all the preceding (l − 1) layers. The feature map of the l-th layer can be expressed in Equation (2):

x_l = H_l([x_0, x_1, ..., x_{l−1}])    (2)

The proposed densely connected network borrows the idea of the residual units in Figure 1. A convolution, Batch Normalization, and Leaky-ReLU make up a CBL module, and two cascaded CBL modules form a Double-CBL (DCBL) module. We use the DCBL module as the transport layer H_l: Conv (1 × 1 × 32)-BN-ReLU-Conv (3 × 3 × 64)-BN-ReLU and Conv (1 × 1 × 64)-BN-ReLU-Conv (3 × 3 × 128)-BN-ReLU. Since too many layers in DenseNet would make the feature maps redundant and decrease the detection speed, we set four layers for each module. The increment of feature maps per layer in module 'DENSE 1st' is 64, while that in module 'DENSE 2nd' is 128.

With the aim of reducing the network's dependence on residual units, part of the lower-resolution layers of the feature extraction network is replaced by the improved densely connected network. The structure diagram of the proposed feature extraction network is shown in Figure 3.

To show the structure of our approach in detail, Table 2 gives the feature extraction network of our approach.

The Proposed Algorithm with Multi-Scale Detection
For an input image of 416 × 416, the sizes of the feature maps of the three detection layers are 13 × 13, 26 × 26, and 52 × 52, respectively. The smaller the feature map, the larger the area of the input image to which each grid cell corresponds; conversely, the larger the feature map, the smaller that area. This means the 13 × 13 detection layer is suitable for detecting large targets, while the 52 × 52 detection layer is suitable for detecting small targets. Generally speaking, remote sensing images contain a large number of small targets. In order to further enhance the detection performance on remote sensing targets, we therefore add a larger detection scale of size 104 × 104. Compared with the original three scales, the four-scale detection strategy is better suited to detecting smaller targets.
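For a 416 × 416 input, the grid sizes and the per-cell coverage of the four scales follow directly from the down-sampling factors:

```python
input_size = 416
strides = [32, 16, 8, 4]                  # down-sampling factor of each detection layer
sizes = [input_size // s for s in strides]
print(sizes)                              # [13, 26, 52, 104]

# area of the input image (in pixels) covered by one grid cell at each scale
print([s * s for s in strides])           # [1024, 256, 64, 16]
```

The new 104 × 104 layer maps each grid cell to only a 4 × 4 patch of the input, which is why it helps with the smallest targets.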
Furthermore, in order to avoid gradient fading, we replace the five convolutional layers in front of each detection layer with three residual units. The structure of the residual units and the proposed network are shown in Table 3 and Figure 4, respectively. Tables 1 and 2 show the structure of the feature extraction networks of YOLO-V3 and our approach, respectively. Table 3 shows the structure of the residual units at the end of the four detection layers of our proposed network. Lastly, Figure 4 exhibits the overall structure of our proposed network.

K-Means for Anchor Boxes
Inspired by Faster-RCNN, YOLO-V2 and YOLO-V3 introduced the idea of the anchor box to predict the bounding boxes more accurately. In our approach, we ran K-means to generate the anchor boxes. The K-means algorithm clusters over the box dimensions so that the anchor boxes and nearby ground truths have larger IOU values, which is not directly related to the absolute size of the anchor boxes.
IOU refers to the intersection-over-union ratio and it is defined in Equation (4):

IOU = S_overlap / S_union    (4)
S_overlap refers to the overlap area between the predicted box and the ground truth, and S_union refers to the union area between them. The pseudocode of K-means in this paper is shown in Algorithm 1.

Algorithm 1. K-means for anchor boxes.
1: Randomly select k ground-truth boxes as the initial cluster centers.
2: Assign each ground-truth box to the cluster whose center gives the smallest distance d = 1 − IOU.
3: Recalculate the cluster center for each cluster.
4: Repeat step 2 and step 3 until the clusters converge.

We ran the K-means algorithm to get the anchor boxes. Figure 5 shows the average IOU for different numbers of clusters; the curve flattens as the number increases. Since there are four detection layers in the network of our approach, we selected 12 clusters (anchor boxes). The sizes of the anchor boxes are as follows: (21,24), (25,31), (33,41), (51,54)
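A minimal NumPy sketch of K-means clustering over box dimensions with the d = 1 − IOU distance; the toy boxes and the convergence settings are illustrative, not the RSOD statistics:

```python
import numpy as np

def iou_wh(boxes, centers):
    """IOU between (N, 2) box sizes and (K, 2) cluster centers, both as (w, h),
    assuming the boxes share a common corner (only the shape matters)."""
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # minimizing d = 1 - IOU is the same as maximizing IOU
        assign = np.argmax(iou_wh(boxes, centers), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers

boxes = np.array([[20., 25.], [22., 23.], [60., 50.], [55., 58.]])
print(np.sort(kmeans_anchors(boxes, 2), axis=0))
```

With real data, `k` would be 12 (three anchors per detection scale).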

Relative to the Grid Cell
When detecting targets, we need to obtain the values of the bounding boxes from the predicted values. The process is shown in Figure 6, in which t_x, t_y, t_w, and t_h represent the predicted values of the network, and c_x and c_y represent the offset of the grid cell relative to the upper-left corner. The values of the bounding boxes can be represented as:

b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w e^{t_w}, b_h = p_h e^{t_h}

where p_w and p_h are the width and height of the anchor box.
Figure 5. The relationship between the number of clusters and the average IOU by K-means clustering.
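Assuming the standard YOLO-V3 decoding equations with sigmoid offsets and anchor priors, the step can be sketched as:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """YOLO-V3 box decoding, in grid-cell units:
    b_x = sigma(t_x) + c_x, b_y = sigma(t_y) + c_y,
    b_w = p_w * exp(t_w),   b_h = p_h * exp(t_h)."""
    return sigmoid(tx) + cx, sigmoid(ty) + cy, pw * np.exp(tw), ph * np.exp(th)

# zero raw predictions land in the center of cell (3, 4) with the anchor's shape
bx, by, bw, bh = decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=4, pw=2.5, ph=3.0)
print(bx, by, bw, bh)   # 3.5 4.5 2.5 3.0
```

The sigmoid keeps the predicted center inside its grid cell, which stabilizes training.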


The NMS Algorithm for Merging Bounding Boxes
Since there may be several bounding boxes corresponding to one target, the last step of our approach is to conduct non-maximum suppression (NMS) of the bounding boxes, which is aimed at eliminating unnecessary boxes. The steps of NMS are below.

• Step 1: Take the bounding box with the highest confidence as the reference and compute the IOU between it and each of the remaining boxes.
• Step 2: If the IOU is larger than the threshold we set, remove that bounding box from the remaining bounding boxes.
• Step 3: Take the bounding box with the next highest confidence as the new reference and repeat Step 1 and Step 2 until all the bounding boxes have been processed.
The pseudocode of the algorithm is summarized in Algorithm 2.

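The three steps above can be sketched as follows; this is a plain NumPy version for illustration, not the paper's implementation:

```python
import numpy as np

def iou(box, boxes):
    """IOU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, thresh=0.5):
    order = np.argsort(scores)[::-1]        # Step 1: highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Step 2: drop boxes whose IOU with the kept box exceeds the threshold
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]
    return keep                             # Step 3: loop to next best box

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30.]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]
```

The second box overlaps the first heavily (IOU ≈ 0.68) and is suppressed, while the distant third box survives.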

Experiment and Results
In order to verify the validity of our improved YOLO-V3 for remote sensing target detection, we compared our approach with the original YOLO-V3, YOLO-V3 tiny, and other state-of-the-art algorithms on the RSOD and UCS-AOD datasets. The experimental environment is as follows: framework: Python 3.6.5 and TensorFlow 1.13.1; operating system: Windows 10; CPU: i7-7700k; GPU: NVIDIA GeForce RTX 2070. We set 50,000 training steps in this experiment. The learning rate of the model decreased from 0.001 to 0.0001 after 30,000 steps and to 0.00001 after 40,000 steps. We set the same parameters for the other comparison algorithms. The initialization parameters of training are listed in Table 4.

Loss Function
When training the network, the loss function is used to measure the error between the predicted and true values. The loss function of the network is defined in Equation (6):

Loss = Error_coord + Error_iou + Error_cls    (6)

Error_coord refers to the coordinate prediction error and is defined as:

Error_coord = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²] + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]    (7)

In Equation (7), λ_coord refers to the weight of the coordinate error, and we selected λ_coord = 5 in our model. Error_iou refers to the IOU (confidence) error and is defined as:

Error_iou = Σ_{i=0}^{S²} Σ_{j=0}^{B} I_{ij}^{obj} (C_i − Ĉ_i)² + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} I_{ij}^{noobj} (C_i − Ĉ_i)²    (8)

λ_noobj refers to the confidence penalty when there is no object, and we selected λ_noobj = 0.5 in our model. C_i and Ĉ_i refer to the true and predicted confidence, respectively.
Error_cls refers to the classification error and is defined as:

Error_cls = Σ_{i=0}^{S²} I_i^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²    (9)

where classes refers to the set of target categories.
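A toy evaluation of the three error terms for a single grid cell, assuming the standard YOLO-style sum-of-squares loss with λ_coord = 5 and λ_noobj = 0.5; all of the ground-truth and predicted numbers below are illustrative:

```python
import numpy as np

l_coord, l_noobj = 5.0, 0.5
# toy case: one grid cell, one box, two classes; the cell contains an object
obj = 1.0
x, y, w, h = 0.5, 0.5, 0.2, 0.3           # ground truth (normalized)
xp, yp, wp, hp = 0.6, 0.4, 0.25, 0.25     # prediction
C, Cp = 1.0, 0.8                          # true / predicted confidence
p, pp = np.array([1.0, 0.0]), np.array([0.7, 0.3])  # class probabilities

# Equation (7): square roots damp the penalty on large boxes
err_coord = l_coord * obj * ((x - xp) ** 2 + (y - yp) ** 2
            + (np.sqrt(w) - np.sqrt(wp)) ** 2 + (np.sqrt(h) - np.sqrt(hp)) ** 2)
# Equation (8): confidence error, with the no-object term weighted down
err_iou = obj * (C - Cp) ** 2 + l_noobj * (1 - obj) * (C - Cp) ** 2
# Equation (9): classification error
err_cls = obj * np.sum((p - pp) ** 2)

loss = err_coord + err_iou + err_cls
print(round(float(loss), 4))   # 0.3453
```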

The Evaluation Indicators
Based on the classification and prediction results, the samples can be divided into four categories: TP (true positive), FP (false positive), TN (true negative), and FN (false negative). Precision and recall are defined in Equations (10) and (11):

Precision = TP / (TP + FP)    (10)
Recall = TP / (TP + FN)    (11)
Mean average precision (mAP) is a performance metric for predicting target locations and categories. Precision and recall are mutually restricted in practice, and comparing either one alone is ambiguous. Therefore, in our experiment, we adopted mAP, one of the most important metrics for evaluating the performance of target detection algorithms.
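A minimal sketch of precision, recall, and average precision under these definitions; the TP/FP counts and detection scores are illustrative, and the AP here uses simple all-point integration rather than any particular benchmark's interpolation:

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Equations (10) and (11)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(scores, is_tp, num_gt):
    """AP: area under the precision-recall curve as detections are
    accumulated in order of decreasing confidence."""
    order = np.argsort(scores)[::-1]
    tps = np.cumsum(np.asarray(is_tp)[order])
    fps = np.cumsum(1 - np.asarray(is_tp)[order])
    prec = tps / (tps + fps)
    rec = tps / num_gt
    ap, prev_r = 0.0, 0.0
    for p_, r_ in zip(prec, rec):
        ap += p_ * (r_ - prev_r)   # rectangle between successive recall points
        prev_r = r_
    return ap

print(precision_recall(tp=8, fp=2, fn=4))               # (0.8, ~0.667)
print(average_precision([0.9, 0.8, 0.7], [1, 0, 1], 2))
```

mAP is then the mean of the per-class AP values.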

Experiment on Remote Sensing Target Detection
The classifier trained based on a conventional dataset is not good at detecting remote sensing targets since remote sensing images have their particularities.

1. Scale diversity. Remote sensing images can be taken from hundreds of meters to nearly 10,000 meters in height, and ground targets may differ in size even within the same category. For example, ships in ports may range from tens of meters to more than 300 meters in length.
2. Perspective particularity. Remote sensing images are basically taken from overhead, while most conventional datasets are taken at ground level, so the appearance of the same kind of target usually differs. A detector trained well on conventional datasets may therefore perform poorly on remote sensing images.
3. Small targets. Most remote sensing targets are small, so the target information is limited, and part of it is lost in the down-sampling layers of the Convolutional Neural Network (CNN). After four rounds of down-sampling, a target of 24 × 24 pixels may occupy only about 1 pixel in the feature map.
4. Multiple directions. The viewing angle of remote sensing images is usually overhead, so the orientations of the targets are arbitrary, whereas in conventional datasets they are largely fixed.
5. High background complexity. The field of view of a remote sensing image is relatively large (usually covering several square kilometers) and may contain various backgrounds, which strongly interferes with target detection.
Based on the above reasons, it is often difficult to train an ideal target detector from conventional datasets for target detection tasks of remote sensing images. A special remote sensing image database is needed.

Dataset Analysis
Taking everything into consideration, we selected the RSOD and UCS-AOD datasets for the experiment. RSOD is a dataset of aerial images containing targets of four categories: aircraft, playground, overpass, and oil tank. UCS-AOD is a dataset for target detection in aerial images. We generally consider a target whose ground truth takes up less than 0.12% of the whole image as a small target, one whose ground truth takes up 0.12-0.5% as a medium target, and one whose ground truth takes up more than 0.5% as a large target. Of the four categories, the aircraft targets are mostly small, the oil tank targets are mostly small or medium, and the playground and overpass targets are large. The dataset includes targets under different lighting conditions and at different heights, and the shooting angles of the targets also differ. Tables 5 and 6 show the statistics of our remote sensing datasets. The targets in the datasets are mainly small or medium in size, and their distribution is relatively dense, which increases the difficulty of detection. Figure 7 shows eight samples from the datasets. The targets in these samples lie against a complex background; after a series of convolutional and down-sampling layers, they occupy even fewer pixels, which makes them difficult to detect.

Experimental Results and Analysis in RSOD and UCS-AOD Dataset
In order to compare the accuracy and real-time performance of the algorithms, the mAP and speed of our approach are evaluated. We compared our approach with the state-of-the-art target detection models in the RSOD dataset, and the comparison results are shown in Table 7. Furthermore, the comparison results of the targets with different sizes are shown in Table 8.

Table 7 shows that our approach is superior to the other classical algorithms in terms of mAP, and the detection speed is not significantly reduced relative to YOLO-V3. For aircraft and oil tanks, which are mainly small and medium-sized targets, our approach clearly improves the detection accuracy compared to YOLO-V3. The experimental results show that our improved YOLO-V3 can effectively detect remote sensing targets under complex backgrounds while remaining real-time. In Table 8, we divide the target categories by size; our approach has a greater advantage than YOLO-V3 in detecting small-sized targets.
For the universality of our algorithm, we also ran the experiment on the UCS-AOD dataset; the comparison results are shown in Table 9. In addition, from Tables 8 and 9, we can see that the missed detection rate of our approach is significantly lower than that of YOLO-V3 and the other state-of-the-art algorithms. Partial detection results of our approach on the RSOD and UCS-AOD datasets under different backgrounds are shown in Figure 8. Under different illumination, target distributions, and target sizes, our approach detects the targets accurately, which demonstrates excellent detection performance for multi-scale remote sensing targets.

Ablation Experiments
In this section, we verify the effectiveness of each improved module we proposed. In order to analyze the influence of module 'DENSE 1st' and module 'DENSE 2nd' (Figure 3) on the detection accuracy, different combinations were set up in the experiment under the condition of three detection scales. The experimental results of each combination on the RSOD dataset are shown in Table 10. Comparing the first and fourth experiments, the feature extraction network of the fourth experiment introduced dense connection modules based on Darknet53, and the mAP of the model improved from 77.10% to 84.38%. Moreover, the detection speed of the fourth experiment increased from 29.7 FPS to 32.3 FPS compared to the first. The experimental results show that our proposed feature extraction network improves the accuracy of remote sensing target detection and also has an advantage in detection speed.
In addition, Table 11 compares the experimental results of each module at the detection end based on the improved feature extraction network. Comparing the first experiment with the second, and the third with the fourth, the added fourth detection scale improved mAP by 5.95% and 5.78%, respectively. Among them, for small-sized targets such as aircraft, the accuracy improved by 8.72% and 7.04%, respectively. This shows that the additional detection scale can effectively improve the detection accuracy of small targets. Compared with the original convolutional layers, the 'Res 3' module avoids gradient fading and reduces the number of parameters. The comparison of experiments 1 and 3, and of experiments 2 and 4, shows that the 'Res 3' module slightly increases the detection speed. The ablation results in Tables 10 and 11 prove that the improved feature extraction network and detection end we proposed improve the feature extraction ability of the network and enhance the detection accuracy of multi-scale remote sensing targets, especially small-sized ones. In addition, the detection speed of our approach is not significantly reduced compared to YOLO-V3 and meets the real-time requirements.

Expansion Experiment
In order to verify the effectiveness of our approach more intuitively, we selected several images and compared the detection results with those of YOLO-V3 and Faster RCNN. The comparison is shown in Figure 9, where a total of 24 detection results in eight groups were chosen from the RSOD and UCS-AOD datasets to demonstrate the superiority of the improved YOLO-V3. The pictures in the first column are the detection results of the YOLO-V3 network, those in the second column are the detection results of Faster RCNN, and those in the third column are the detection results of our approach. It can be clearly seen that several small targets are missed by YOLO-V3. Although Faster RCNN performed better than YOLO-V3, missed detections still exist. In contrast, all the targets were detected by our approach. The contrast experiments of the eight groups and the ablation experiments show that, by improving the feature extraction network and adding the fourth detection scale, our approach enhances the performance of detecting small targets under complex background conditions in remote sensing images.

Conclusions
In practical engineering applications, both detection accuracy and speed must be considered, and existing remote sensing target detection algorithms often fail to balance the two. In this paper, we proposed an improved YOLO-V3-based model for multi-scale remote sensing target detection. The complex backgrounds of remote sensing targets place higher demands on the network's feature extraction ability, so we focused on improving the original feature extraction network. Several improvements were introduced to the original YOLO-V3 network. First, in order to extract feature information more effectively, a densely connected network (DenseNet) was introduced into the feature extraction network. Second, to enhance the performance of detecting small targets, we extended the number of detection scales from three to four. Third, in each detection layer we replaced three residual units with five convolutional layers to avoid gradient fading. The ablation experiments show that each improved module we proposed is effective in raising detection accuracy, and experiments on the RSOD and UCS-AOD datasets show that our approach achieves better performance on multi-scale remote sensing target detection.
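To illustrate the DenseNet connectivity pattern adopted in the feature extraction network, the following is a minimal NumPy sketch, not the paper's actual implementation: each layer receives the channel-wise concatenation of all preceding feature maps, and the convolutional transform is replaced here by a placeholder (channel mean tiled to the growth rate) purely to show how channel counts grow.

```python
import numpy as np

def dense_block(x, num_layers=4, growth_rate=32):
    """Sketch of DenseNet-style connectivity on a (C, H, W) feature map.

    Every layer consumes the concatenation of all earlier outputs; the
    real BN-ReLU-Conv unit is stood in for by a channel-mean tiled to
    `growth_rate` channels, so only the wiring is illustrated.
    """
    features = [x]  # all feature maps produced so far
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=0)          # concat along channels
        new = np.tile(inp.mean(axis=0, keepdims=True),  # placeholder transform
                      (growth_rate, 1, 1))
        features.append(new)                            # reused by later layers
    return np.concatenate(features, axis=0)

# 64 input channels + 4 layers x growth rate 32 -> 192 output channels
out = dense_block(np.zeros((64, 13, 13)), num_layers=4, growth_rate=32)
```

The key property shown is feature reuse: early feature maps are forwarded unchanged to every later layer, which shortens gradient paths and is what motivates using dense blocks in place of plain residual stacks.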
The improved feature extraction network greatly strengthens the ability to extract target features, and the additional fourth detection scale strengthens the detection of small targets. At the cost of a small loss in detection speed, the accuracy is greatly improved compared with YOLO-V3, especially for small remote sensing targets. Although numerous improved networks based on YOLO-V3 have been proposed, they were usually designed to detect targets in routine images and perform poorly on complex remote sensing images. In contrast, with the above measures adopted, our proposed algorithm is better suited to remote sensing target detection than other state-of-the-art target detection algorithms. In future work, multi-receptive fields for the feature extraction network will be researched to further boost the performance of remote sensing target detection. In addition, the latest version of YOLO, YOLO-V4 [55], has been proposed and will also be investigated in future work.
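As a rough illustration of the extended detection head, the snippet below computes the output grid sizes for four detection scales, assuming a 416 x 416 input and the standard YOLO-V3 strides of 32, 16, and 8 plus an added stride-4 head for small targets (the 416 input size and the stride-4 choice are assumptions for illustration, not taken from the paper's configuration).

```python
# Grid sizes for four detection scales on a square input image.
# YOLO-V3 uses strides 32/16/8 (grids 13/26/52 at 416x416); a fourth,
# finer stride-4 scale yields a 104x104 grid for small targets.
input_size = 416
strides = [32, 16, 8, 4]
grids = [input_size // s for s in strides]
print(grids)  # [13, 26, 52, 104]
```

The finer 104 x 104 grid assigns more anchor positions per unit of image area, which is why the fourth scale helps with densely packed small targets such as aircraft.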
Author Contributions: D.X. provided the original idea, conducted the experiments, wrote this paper, and collected the dataset. Y.W. contributed modifications and suggestions to the paper. All authors have read and agreed to the published version of the manuscript.