Research on Airplane and Ship Detection of Aerial Remote Sensing Images Based on Convolutional Neural Network

The wide range, complex background, and small target size of aerial remote sensing images results in the low detection accuracy of remote sensing target detection algorithms. Traditional detection algorithms have low accuracy and slow speed, making it difficult to achieve the precise positioning of small targets. This paper proposes an improved algorithm based on You Only Look Once (YOLO)-v3 for target detection of remote sensing images. Due to the difficulty in obtaining the datasets, research on small targets for complex images, such as airplanes and ships, is the focus of research. To make up for the problem of insufficient data, we screen specific types of training samples from the DOTA (Dataset of Object Detection in Aerial Images) dataset and select small targets in two different complex backgrounds of airplanes and ships to jointly evaluate the optimization degree of the improved network. We compare the improved algorithm with other state-of-the-art target detection algorithms. The results show that the performance indexes of both datasets are ameliorated by 1–3%, effectively verifying the superiority of the improved algorithm.


Introduction
With the advent of the information age, remote sensing technology has remedied the problem of the limited coverage of traditional ground detection and serious lack of related detection data through rapid air-to-ground information acquisition and target detection. Due to its significant advantages of flexible maneuverability, high resolution, and optional observation range, aerial remote sensing provides a new method for target detection. Airplanes and ships are vital strategic resources and means of transportation in both military and civilian fields. Therefore, they are of great significance to the research of remote sensing image detection. Compared with road vehicle detection, remote sensing images of airplanes and ships have complex backgrounds with diverse targets and small sizes, so target detection is more challenging in this context.
Before the rise of convolutional neural networks, aerial remote sensing images relied on traditional algorithms for target detection. Traditional target detection algorithms mainly include edge detection algorithms [1], such as the Roberts algorithm [2]; threshold segmentation methods [3], such as the Otsu threshold segmentation algorithm [4]; visual saliency detection algorithms, such as the ITTI algorithm [5]. The first two algorithms complete the detection task by detecting the strong contrast and the difference in gray value between the target and the image background, respectively, and the

YOLOv3 Network Analysis
The YOLO series algorithm is a typical end-to-end network, which can directly obtain the target's bounding box information and class probability through a forward operation. Compared with the R-CNN series of the two-stage algorithms, the candidate region mechanism and the detection layer are integrated in the convolutional network, which can greatly improve the detection speed.

Network Structure
The target detection algorithm based on YOLO defines the detection problem as a single regression problem, that is, a single neural network can realize the prediction of multiple bounding boxes and class probabilities on the entire image at the same time through a forward operation. Based on the previous detection algorithm, YOLOv3 was proposed in 2018 with the introduction of multi-scale target detection through the feature pyramid network (FPN) [23], and proposed a new feature extraction network, Darknet-53, after borrowing the structure of the residual network (ResNet) [24].
The Darknet-53 network includes 53 convolutional layers and five residual modules. Each residual module is composed of 1 × 1 and 3 × 3 convolutional layers and a shortcut connection. The convolutional structure consists of convolutional layers, batch normalization, and nonlinear activation function-Leaky Relu. Compared with the Darknet-19 network, the pooling layer is replaced by a 3 × 3 convolutional layer with a step size of 2, which retains more feature information during the down-sampling process. The residual module references the network connection mode of the residual skip layer in ResNet to solve the gradient disappearance or explosion caused by the deep network structure and promotes YOLOv3 to use deeper feature extraction layers to extract feature information with a wider range and higher accuracy for targets detection. At the same time, YOLOv3 introduces multi-scale prediction through the FPN to enhance the sensitivity of small targets. The structure diagram of multi-scale prediction is shown in Figure 1.

YOLOv3 Network Analysis
The YOLO series algorithm is a typical end-to-end network, which can directly obtain the target's bounding box information and class probability through a forward operation. Compared with the R-CNN series of the two-stage algorithms, the candidate region mechanism and the detection layer are integrated in the convolutional network, which can greatly improve the detection speed.

Network Structure
The target detection algorithm based on YOLO defines the detection problem as a single regression problem, that is, a single neural network can realize the prediction of multiple bounding boxes and class probabilities on the entire image at the same time through a forward operation. Based on the previous detection algorithm, YOLOv3 was proposed in 2018 with the introduction of multiscale target detection through the feature pyramid network (FPN) [23], and proposed a new feature extraction network, Darknet-53, after borrowing the structure of the residual network (ResNet) [24].
The Darknet-53 network includes 53 convolutional layers and five residual modules. Each residual module is composed of 1 × 1 and 3 × 3 convolutional layers and a shortcut connection. The convolutional structure consists of convolutional layers, batch normalization, and nonlinear activation function-Leaky Relu. Compared with the Darknet-19 network, the pooling layer is replaced by a 3 × 3 convolutional layer with a step size of 2, which retains more feature information during the down-sampling process. The residual module references the network connection mode of the residual skip layer in ResNet to solve the gradient disappearance or explosion caused by the deep network structure and promotes YOLOv3 to use deeper feature extraction layers to extract feature information with a wider range and higher accuracy for targets detection. At the same time, YOLOv3 introduces multi-scale prediction through the FPN to enhance the sensitivity of small targets. The structure diagram of multi-scale prediction is shown in Figure 1.  Figure 1 shows that the FPN uses feature maps of different sizes for fusion to achieve target detection. With the deepening of the network, the size of the output feature map of the convolution at each layer is gradually reduced, and the target detection work is completed after the fusion of feature maps in different sizes. The network up-samples each layer of the feature pyramid from top to bottom, and then connects it with the feature map of the next layer horizontally to obtain the new enhanced feature map. The new feature map contains not only the rich semantic information output by the deep network, but also the accurate target position information output by the shallow network.  Figure 1 shows that the FPN uses feature maps of different sizes for fusion to achieve target detection. With the deepening of the network, the size of the output feature map of the convolution at each layer is gradually reduced, and the target detection work is completed after the fusion of feature maps in different sizes. The network up-samples each layer of the feature pyramid from top to bottom, and then connects it with the feature map of the next layer horizontally to obtain the new Sensors 2020, 20, 4696 4 of 16 enhanced feature map. The new feature map contains not only the rich semantic information output by the deep network, but also the accurate target position information output by the shallow network. Separate target detection is performed on these newly acquired feature maps. In YOLOv3, feature maps with three different sizes of 13 × 13, 26 × 26, and 52 × 52 are applied to construct feature pyramids, which are used to detect large, medium, and small-sized targets, respectively. Multi-scale prediction makes YOLOv3 more sensitive to weak targets and significantly boosts its detection ability. The input images undergo a series of operations to finally obtain three output feature maps of different sizes, namely, 13 × 13 × 255, 26 × 26 × 255, and 52 × 52 × 255. Each grid cell contains three anchor boxes. Therefore, the output tensor 255 is 3 × (4 + 1 + 80), which corresponds to four bounding box coordinates, one confidence prediction and 80 class predictions of the object. The structure diagram of YOLOv3 is shown in Figure 2. Separate target detection is performed on these newly acquired feature maps. In YOLOv3, feature maps with three different sizes of 13 × 13, 26 × 26, and 52 × 52 are applied to construct feature pyramids, which are used to detect large, medium, and small-sized targets, respectively. Multi-scale prediction makes YOLOv3 more sensitive to weak targets and significantly boosts its detection ability. The input images undergo a series of operations to finally obtain three output feature maps of different sizes, namely, 13 × 13 × 255, 26 × 26 × 255, and 52 × 52 × 255. Each grid cell contains three anchor boxes. Therefore, the output tensor 255 is 3 × (4 + 1 + 80), which corresponds to four bounding box coordinates, one confidence prediction and 80 class predictions of the object. The structure diagram of YOLOv3 is shown in Figure 2.

Testing Process
The YOLOv3 network resizes the input image to a size of 416 × 416 before feature extraction, and then divides it into S × S grids for prediction. The center of the target will fall within the grid responsible for detecting the target. Each grid can predict B bounding boxes, and each bounding box contains information for five parameters. The network generates S × S × B bounding boxes on the feature map and directly performs a regression on the generated bounding boxes. The modified predicted bounding box coordinates of the YOLOv3 network can be calculated by Equation (1) and the schematic diagram is shown in Figure 3.

Testing Process
The YOLOv3 network resizes the input image to a size of 416 × 416 before feature extraction, and then divides it into S × S grids for prediction. The center of the target will fall within the grid responsible for detecting the target. Each grid can predict B bounding boxes, and each bounding box contains information for five parameters. The network generates S × S × B bounding boxes on the feature map and directly performs a regression on the generated bounding boxes. The modified predicted bounding box coordinates of the YOLOv3 network can be calculated by Equation (1) and the schematic diagram is shown in Figure 3.
where b x , b y is the coordinate of the center point of the modified bounding box, respectively; b w , b h is width and height of the modified bounding box, respectively; t x , t y is the offset of the target center point relative to the upper left corner of the grid where the point is located; t w , t h is the width and height of the predicted bounding box, respectively; c x , c y is the offset of the gird relative to the upper left; p w , p h is the width and height of the anchor box, respectively. The sigmoid function is used to set the value range between (0, 1) and control the offset of the target center to be within the corresponding grid unit to ensure that there is no over-offset.  (t ) is used to set the value range between (0, 1) and control the offset of the target center to be within the corresponding grid unit to ensure that there is no over-offset.
To prevent the redundancy of the predicted bounding box, it is necessary to perform confidence calculation and set the confidence threshold on each predicted bounding box. When the confidence score is above the threshold, it is reserved for regression and otherwise discarded. The confidence is composed of two parts: one is the probability of the existence of the target in the grid, expressed by (object) r P ; when there is a target to be detected in the grid, part refers to the accuracy of the predicted bounding box, which is defined as follows: where conf C refers to the class-specific confidence scores for each box; (class | object) r i P refers to the probabilities of predicting C conditional class in each grid cell (i = 1, 2,..., C); (object) IOU truth r pred P × refers to the confidence score; and Intersection over Union ( IOU truth pred ) refers to the intersection ratio of the bounding box area and the ground truth box area.
The non-maximum suppression (NMS) algorithm can determine the degree of coincidence of the remaining boxes, and the predicted box with higher confidence is retained as the target detection box. The predicted bounding box is composed of three loss functions: To prevent the redundancy of the predicted bounding box, it is necessary to perform confidence calculation and set the confidence threshold on each predicted bounding box. When the confidence score is above the threshold, it is reserved for regression and otherwise discarded. The confidence is composed of two parts: one is the probability of the existence of the target in the grid, expressed by P r (object); when there is a target to be detected in the grid, P r (object) = 1, otherwise it is 0. The other part refers to the accuracy of the predicted bounding box, which is defined as follows: where C con f refers to the class-specific confidence scores for each box; P r (class i object) refers to the probabilities of predicting C conditional class in each grid cell (i = 1, 2,..., C); P r (object) × IOU truth pred refers to the confidence score; and Intersection over Union (IOU truth pred ) refers to the intersection ratio of the bounding box area and the ground truth box area.
The non-maximum suppression (NMS) algorithm can determine the degree of coincidence of the remaining boxes, and the predicted box with higher confidence is retained as the target detection box. The predicted bounding box is composed of three loss functions: where loss 1 is the loss of the predicted bounding box; loss 2 is the loss of predicted confidence; loss 3 is the loss of the predicted class; ∧ indicates the true value, otherwise it is the predicted value; x i , y i , w i , and h i refer to the four coordinate values of the i-th bounding box; λ refers to the weight of the loss; C i refers to the confidence of the i-th bounding box; p i (c) refers to the class probability of the i-th bounding box.

Detection Scale
The YOLOv3 network introduces a feature pyramid structure to perform multi-scale prediction on the target, effectively combining semantic information from the deep network with feature information from the shallow network to enhance the target detection accuracy. However, it still experiences difficulties in detecting medium and small targets. Here, the MyPlane and MyShip datasets used for target detection are aerial remote sensing image datasets. The targets to be detected, namely, airplanes and ships, are weak targets based on the entire picture. Therefore, we subjoined a 104 × 104 detection scale to the improved model to enhance the sensitivity and detection ability of the network to small targets. This operation can make the receptive field of the network model smaller, which is conducive to the detection of small airplane and ship targets in the background of aerial remote sensing. However, the height of aerial remote sensing imagery is not fixed, and the existing targets have relatively large targets to be detected. Therefore, the improved model in this paper continues to retain the detection work at the scale of 13 × 13. The receptive fields at the scale of 13 × 13, 26 × 26, 52 × 52, and 104 × 104 on the same aerial remote sensing image are shown in Figure 4.

Detection Scale
The YOLOv3 network introduces a feature pyramid structure to perform multi-scale prediction on the target, effectively combining semantic information from the deep network with feature information from the shallow network to enhance the target detection accuracy. However, it still experiences difficulties in detecting medium and small targets. Here, the MyPlane and MyShip datasets used for target detection are aerial remote sensing image datasets. The targets to be detected, namely, airplanes and ships, are weak targets based on the entire picture. Therefore, we subjoined a 104 × 104 detection scale to the improved model to enhance the sensitivity and detection ability of the network to small targets. This operation can make the receptive field of the network model smaller, which is conducive to the detection of small airplane and ship targets in the background of aerial remote sensing. However, the height of aerial remote sensing imagery is not fixed, and the existing targets have relatively large targets to be detected. Therefore, the improved model in this paper continues to retain the detection work at the scale of 13 × 13. The receptive fields at the scale of 13 × 13, 26 × 26, 52 × 52, and 104 × 104 on the same aerial remote sensing image are shown in Figure 4. The improved network model increased the output of 104 × 104 × 255 with a smaller receptive field to enhance the detection ability using small targets. The improved network model graph is shown in Figure 5. The improved network model increased the output of 104 × 104 × 255 with a smaller receptive field to enhance the detection ability using small targets. The improved network model graph is shown in Figure 5.

Loss Function
In general, the mathematical model parameters obtained by the network fitting are kept to the minimum. When the new sample data processed by the network destroys the previous sample distribution with the increase of sample data, the existing mathematical model needs to be corrected. However, due to the relatively small number of parameters, the extent of modification of the entire

Loss Function
In general, the mathematical model parameters obtained by the network fitting are kept to the minimum. When the new sample data processed by the network destroys the previous sample distribution with the increase of sample data, the existing mathematical model needs to be corrected. However, due to the relatively small number of parameters, the extent of modification of the entire network is actually not large and the whole network has a strong anti-disturbance capability. L2 regularization is used to avoid the occurrence of overfitting by obtaining smaller parameter weights. The expression of L2 regularization is as follows: where J 0 represents the initial loss function, λ is the regularization parameter, and ω is the weight vector. L2 regularization is equivalent to adding a penalty term to the initial loss function. Assuming that the parameter to be sought is θ, the loss function in linear regression is as follows: The gradient descent method is adopted to reduce the entire loss function, that is, to adjust the parameter θ in the opposite direction of the gradient, as follows: where Equation (9) is the iterative equation of the initial loss function, and Equation (10) is the iterative equation after adding L2 regularization. By comparing the two equations, θ, after increasing L2 regularization, will first be multiplied by (1 − aλ/m) during each iteration. Since this factor is less than 1, the parameter θ j continuously decreases. Based on the above theory, adding the L2 regularization term to the loss function of YOLOv3 can enhance the interference ability of the network and avoid overfitting.
where loss L2 refers to the improved loss function, loss refers to the loss function of the original YOLOv3 network, and (λ/2N) ω ω 2 represents the L2 regularization term, where N is a constant.

Network Training
The background of aerial remote sensing images is complex, which makes data acquisition extremely difficult. Training based on the convolutional neural network requires a large number of samples, but the data of aerial remote sensing images is insufficient. It is difficult to directly detect images using existing datasets. Due to the specificity of aerial remote sensing images, the detection accuracy of small targets in the network model is relatively high. We chose small targets in two different complex backgrounds (airplane and ship) to assess the optimization degree of the network model and improve the accuracy requirements of the network model. Currently, the publicly known datasets of aerial remote sensing images are as follows: the NWPU VHR-10 dataset produced by Northwestern Polytechnical University, and the RSOD dataset and DOTA dataset produced by Wuhan University. Among them, the DOTA dataset has the best sample image quality at present, and the image size is generally around 4000-5000 pixels. At the same time, the DOTA dataset has 15 types of targets, including airplanes and ships, totaling more than 180,000 specific targets. This paper used the DOTA dataset to make MyPlane and MyShip datasets. Since the production process of the two sets is similar, we took the MyPlane dataset as an example. The main process was as follows: • We used the code to automatically traverse all sample images in the DOTA dataset (containing two or more airplane targets) and filtered out the .txt annotation files.

•
The excessively large size of sample images from the DOTA dataset leads to excessive compression and a large amount of information will be lost when they are sent directly to the network. Therefore, each sample image obtained was cropped into several 1024 × 1024 size images, and the corresponding .txt annotation file was updated. If the intersection ratio between the divided target and the original target was greater than 0.7, the current target was retained; otherwise it was discarded.

•
We repeated the above steps to reduce the dataset until the required sample set was obtained and the preliminary preprocessing operation was completed.
The annotation file type of images in DOTA dataset is the .txt file type, where the data annotation format is OBB, which is determined by the coordinate position: (x 1 , y 1 ), (x 2 , y 2 ), (x 3 , y 3 ), (x 4 , y 4 ). It is visualized as an ordinary quadrilateral in Figure 6. The final annotation file of the MyPlane datasets in this paper is the common VOC type, that is, the annotation file is in .xml format. The target position coordinates are in HBB format, which are determined by the value: x max , y max , x min , y min . It is visualized as a rectangular box in Figure 6. We used the code to convert the .txt file in OBB format into the .xml file in HBB format and deleted the annotation information of other non-airplane targets. Finally, the dataset was amplified by means of rotation angle and mirror flipping to increase the number of training samples and avoid overfitting. It is visualized as an ordinary quadrilateral in Figure 6. The final annotation file of the MyPlane datasets in this paper is the common VOC type, that is, the annotation file is in .xml format. The target position coordinates are in HBB format, which are determined by the value: max x , max y , min x , min y .
It is visualized as a rectangular box in Figure 6. We used the code to convert the .txt file in OBB format into the .xml file in HBB format and deleted the annotation information of other non-airplane targets. Finally, the dataset was amplified by means of rotation angle and mirror flipping to increase the number of training samples and avoid overfitting.  Figure 6 clearly shows the annotation file of DOTA dataset and the dataset used in this paper. The production process of the entire dataset was based on the step-by-step implementation of Python code, which fully utilized the accuracy and effectiveness of the information marked by the DOTA dataset.  The production process of the entire dataset was based on the step-by-step implementation of Python code, which fully utilized the accuracy and effectiveness of the information marked by the DOTA dataset.

Reset Anchor Box Parameters
An airplane target tends to be square while a ship target is elongated, which makes the shape of the bounding box of the two quite different. The original prior frame size of YOLOv3 is relatively fixed, mainly used for general target detection. Therefore, when using the aerial remote sensing datasets MyPlane and MyShip to detect small targets, the number of anchor boxes needed to be readjusted. The K-means algorithm [25] is a typical distance-based clustering algorithm, which can determine the similarity of samples based on the distance between each sample, gathering similar samples to form a cluster as the final goal, and using distance to measure similarity-the shorter the distance, the higher the similarity. The traditional K-means algorithm generally uses Euclidean distance to measure the distance, which can directly cluster the width and height of the predicted bounding box and obtain K anchor boxes at the same time. However, when the size of the bounding box is large, significant errors occur. Therefore, YOLOv3 [9] introduces the IOU value of the predicted box and the anchor box into the K-means algorithm to determine the anchor box that is most suitable for the existing dataset to improve the detection accuracy of the target. By calculating the IOU values of the two boxes, the degree of similarity between the two boxes can be obtained. In order to better represent the error value, the formula of distance parameter d can be defined as follows: where box represents the sample and centroid represents the center of the cluster. The smaller the d, the more similar the two boxes. We set the range of K from 1 to 12. The IOU values corresponding to the different cluster numbers of the two datasets are shown in Figure 7.  (14,12), (20,16). Similarly, on the MyShip dataset, the sizes of the anchor boxes obtained by clustering are (126, 306), (86, 174), (58, 120), (44, 92), (28, 54), (24,42), (26, 10), (16,12). As the size of the scale is inversely proportional to the range of the receptive field, the receptive field of the 13 × 13 scale is the most sensitive to large targets.
As the size of the scale is inversely proportional to the range of the receptive field, the receptive field of the 13 × 13 scale is the most sensitive to large targets.
The final calibration and clustering results are shown in Figure 8. It can be seen that the size of the airplane targets in the MyPlane dataset images varies, and the general size of the prior box tends to be square; while the size of the MyShip dataset tends to be more elongated. The results are consistent with the actual situation, and it can be seen that a reasonable choice of the corresponding number and the size of anchor boxes helps improve the precise positioning and detection of specific targets. When K = 8, the curves of both datasets tend to be stable. Considering the adjustment of the prediction scale of the network, each anchor box can be equally assigned two prior anchor boxes. The final sizes of the anchor boxes for the MyPlane dataset are (132, 238), (244, 296), (118, 110), (84, 102), (54, 46), (80, 74), (14,12), (20,16). Similarly, on the MyShip dataset, the sizes of the anchor boxes obtained by clustering are (126, 306), (86, 174), (58, 120), (44, 92), (28, 54), (24,42), (26, 10), (16,12). As the size of the scale is inversely proportional to the range of the receptive field, the receptive field of the 13 × 13 scale is the most sensitive to large targets.
The final calibration and clustering results are shown in Figure 8. It can be seen that the size of the airplane targets in the MyPlane dataset images varies, and the general size of the prior box tends to be square; while the size of the MyShip dataset tends to be more elongated. The results are consistent with the actual situation, and it can be seen that a reasonable choice of the corresponding number and the size of anchor boxes helps improve the precise positioning and detection of specific targets.

Model Training
We used four different network models for the training of the two specific target datasets. Since it is only a difference in dataset types, the MyPlane dataset will be used as an example to introduce the experimental process in detail. The four experimental procedures on the MyShip dataset are similar. The training process for the four models is roughly the same, and here we will mainly introduce the training process of the improved model.
The programming language of the algorithm in this paper was Python 3.5.2, the deep learning framework was TensorFlow 1.11.0, and the operating system was Ubuntu 16.04. The hardware platform was Intel(R) Core(TM) i7-7700K CPU at 4.20 GHz and NVIDIA GTX 1060Ti GPU for accelerating model training. In the self-made datasets, there were 1264 training samples, 368 validation samples, and 736 test samples in the MyPlane datasets; while, there were 1754 training samples, 608 validation samples, and 655 test samples in the MyShip dataset. The model training process was used to conduct preliminary training of remote sensing images using the two self-made datasets and the validation set was used for further validation. The preset training parameters were as follows: the momentum was 0.9, the weight decay was 0.0005, the initial learning rate was 0.001, and the maximum number of iterations was 50,200. The learning rate dropped to 0.0001 after 5500

Model Training
We used four different network models for the training of the two specific target datasets. Since it is only a difference in dataset types, the MyPlane dataset will be used as an example to introduce the experimental process in detail. The four experimental procedures on the MyShip dataset are similar. The training process for the four models is roughly the same, and here we will mainly introduce the training process of the improved model.
The programming language of the algorithm in this paper was Python 3.5.2, the deep learning framework was TensorFlow 1.11.0, and the operating system was Ubuntu 16.04. The hardware platform was Intel(R) Core(TM) i7-7700K CPU at 4.20 GHz and NVIDIA GTX 1060Ti GPU for accelerating model training. In the self-made datasets, there were 1264 training samples, 368 validation samples, and 736 test samples in the MyPlane datasets; while, there were 1754 training samples, 608 validation samples, and 655 test samples in the MyShip dataset. The model training process was used to conduct preliminary training of remote sensing images using the two self-made datasets and the validation set was used for further validation. The preset training parameters were as follows: the momentum was 0.9, the weight decay was 0.0005, the initial learning rate was 0.001, and the maximum number of iterations was 50,200. The learning rate dropped to 0.0001 after 5500 iterations, and to 0.00001 after 20,000 iterations. We set the batch of the improved model to 3; the training epoch was 50, with 20 rounds in the first phase and 30 rounds in the second phase. The graph of the loss function of the improved model on the MyPlane dataset is shown in Figure 9. similar. The training process for the four models is roughly the same, and here we will mainly introduce the training process of the improved model.
The programming language of the algorithm in this paper was Python 3.5.2, the deep learning framework was TensorFlow 1.11.0, and the operating system was Ubuntu 16.04. The hardware platform was Intel(R) Core(TM) i7-7700K CPU at 4.20 GHz and NVIDIA GTX 1060Ti GPU for accelerating model training. In the self-made datasets, there were 1264 training samples, 368 validation samples, and 736 test samples in the MyPlane datasets; while, there were 1754 training samples, 608 validation samples, and 655 test samples in the MyShip dataset. The model training process was used to conduct preliminary training of remote sensing images using the two self-made datasets and the validation set was used for further validation. The preset training parameters were as follows: the momentum was 0.9, the weight decay was 0.0005, the initial learning rate was 0.001, and the maximum number of iterations was 50,200. The learning rate dropped to 0.0001 after 5500 iterations, and to 0.00001 after 20,000 iterations. We set the batch of the improved model to 3; the training epoch was 50, with 20 rounds in the first phase and 30 rounds in the second phase. The graph of the loss function of the improved model on the MyPlane dataset is shown in Figure 9.   Figure 9 shows that the loss function of the improved model drops significantly when the number of iterations reaches about 5500. After the current 20 rounds of training were completed, the entire network was still underfitting and the loss function was relatively large; the subsequent 30 rounds adjusted the learning rate in time to make the network model fit better to continuously reduce the loss function. By drawing the loss curve, the training process of the network can be observed intuitively.

Experiments and Analysis
The YOLOv3 network is presented in this paper to test the target detection effect of the improved model. The experiment uses Faster R-CNN+Resnet101, Faster R-CNN+VGG16, YOLOv3, and the improved model based on YOLOv3 Network to train and verify the two aerial remote sensing datasets MyPlane and MyShip and consider the detection capabilities of the four network models from the detection result map and multiple evaluation indexes. The background of the original airplane images in the remote sensing images is dim and the number of interfering objects is large, which presents a challenge to the detection work. To effectively evaluate the performance of the network model, the accuracy rate P, the recall rate R, the false alarm rate F, the miss alarm rate M, and the average precision (AP) were selected to evaluate the detection capability of the network model. The formulas are as follows: There was a total of 736 images in the test set of the MyPlane dataset, including 7436 real targets of the airplane. There was a total of 655 images in the test set of the MyShip dataset, including 68,054 ship targets. For the four different network models, the test results of the two datasets are shown in Tables 1 and 2. The data results in Tables 1 and 2 prove that the improved algorithm is more sensitive to small targets. We also compared the performance of the four models on two datasets. The results are shown in Tables 3 and 4. It can be seen from Tables 3 and 4 that the evaluation indicators of the improved model are significantly better than the other three network models. At the same time, the AP comprehensively reflects the detection capability of the model from the two dimensions of accuracy and recall. The larger the AP value, the better the detection. The AP value of the improved model using the MyPlane dataset is 94.29%, compared to the other three network models, which have AP values 12.65%, 13%, and 2.69% lower; the AP value for MyShip dataset is 93.13%, which is 39.75%, 41.83%, and 1.81% higher than the other three network models.
Compared with the improved algorithm, the false detection rate and the missed detection rate of the first three models are higher. This is because the Faster R-CNN+Resnet101 and Faster R-CNN+VGG16 network models do not import multi-scale detection, which leads to lower sensitivity to small targets. By introducing multi-scale detection in the YOLOv3 model, the detection accuracy is higher. However, the scale size and the number of anchor boxes selected in the YOLOv3 model are based on the COCO dataset, which is not very consistent with the data situation of the aerial remote sensing dataset with the majority of small targets. The number of YOLOv3 false detections has increased. We compared the performance of each part of the improved model on two datasets. The results are shown in Tables 5 and 6. The results of Tables 5 and 6 show that the addition of the 104 × 104 detection scale enables the network to have smaller receptive field feature information. In addition, the overall loss function increases the L2 regularization constraint, which effectively avoids the occurrence of overfitting. Therefore, these two improvements have a certain promotion effect on the detection accuracy of small targets, and the AP and Recall values are improved. At the same time, a reasonable number of prior anchor boxes are set to enhance the ability of network model detection for specific small targets. It can be seen from the data results that the detection effect is slightly improved.
Through the above experiments, the results of the algorithm mentioned in this article are shown in Figure 10.  The results of Tables 5 and 6 show that the addition of the 104 × 104 detection scale enables the network to have smaller receptive field feature information. In addition, the overall loss function increases the L2 regularization constraint, which effectively avoids the occurrence of overfitting. Therefore, these two improvements have a certain promotion effect on the detection accuracy of small targets, and the AP and Recall values are improved. At the same time, a reasonable number of prior anchor boxes are set to enhance the ability of network model detection for specific small targets. It can be seen from the data results that the detection effect is slightly improved.
Through the above experiments, the results of the algorithm mentioned in this article are shown in Figure 10.  Whether from the index parameters or the actual detection result diagram, the network comprehensive detection capability of the improved model proposed by us is improved. The detection sensitivity of the model to dim small targets is also enhanced. Especially in the MyShip dataset, which had mostly small targets, the advantage and characteristic of the improved model with strong detection sensitivity to small targets was perfectly demonstrated.

Conclusions
As important strategic resources and means of transportation, airplanes and ships have a practical value that cannot be ignored in the study of aerial remote sensing images. The background of aerial remote sensing images is complex, and the target sizes are diverse. Traditional target detection algorithms cannot accurately extract target features for these objects. We proposed an improved detection algorithm based on the YOLOv3 network, which acquires the specific target samples from the DOTA dataset and resets the number of corresponding anchor boxes to enhance the detection ability. Then, the detection scale was increased to obtain a smaller receptive field and improve the positioning accuracy of small targets. To avoid model overfitting, an L2 regularization term was added to the loss function to reduce the false detection rate. The experimental results show that the proposed model has an improvement of about 1-3% in the accuracy of evaluation indexes in small target detection tasks compared with other algorithms, which effectively improves the ability of the network. By evaluating the optimization of the proposed network model of small targets under two different complex backgrounds (i.e., airplanes and ships), this study has practical significance for the research of aerial remote sensing image detection technology.