The Application of Improved YOLO V3 in Multi-Scale Target Detection

Abstract: Target detection is one of the most important research directions in computer vision. Recently, a variety of target detection algorithms have been proposed. Since targets have varying sizes in a scene, it is essential to detect them at different scales. To improve the detection performance for targets of different sizes, a multi-scale target detection algorithm based on an improved YOLO (You Only Look Once) V3 is proposed. The main contributions of our work include: (1) a mathematical derivation method based on Intersection over Union (IOU) is proposed to select the number and the aspect ratio dimensions of the candidate anchor boxes for each scale of the improved YOLO V3; (2) to further improve the detection performance of the network, the detection scales of YOLO V3 are extended from 3 to 4, and a feature fusion target detection layer downsampled by 4× is established to detect small targets; (3) to avoid gradient fading and enhance the reuse of features, the six convolutional layers in front of each output detection layer are transformed into two residual units. Experimental results on the PASCAL VOC and KITTI datasets show that the proposed method obtains better performance than other state-of-the-art target detection algorithms.


Introduction
Target detection is one of the research hotspots in the field of computer vision. The location and the category of the targets can be determined by using target detection. Nowadays, target detection has been applied in many fields such as military and civil areas, including image segmentation [1,2], intelligent surveillance [3][4][5], autonomous driving [6,7], and intelligent transportation [8,9].
With the rapid development of graphics processing unit (GPU) hardware, deep learning has made significant progress [10] and a lot of target detection algorithms based on deep learning have been used in our daily life [11], such as pedestrian detection [12], face detection [13][14][15], and vehicle detection [16,17]. The convolutional neural network (CNN) is a kind of network with many layers used to extract features based on invariance of regional statistics in images [18]. By training on the dataset, CNN can learn the features of the targets that need to be detected autonomously and the performance of the model can be improved gradually. With the improvement of computer hardware, the structure of CNN becomes much deeper and the features of the targets can be better learned. The conventional LeNet [19] consisted of five layers and the number of the layers in Highway Networks [20] and Residual Networks [21] has surpassed 100. DenseNet [22] connects layers in a feed-forward fashion, which is useful for reusing the features and avoiding gradient fading.
The state-of-the-art target detection algorithms based on CNN can be divided into two categories. The first category is two-stage target detection algorithms, such as R-CNN [2], Fast R-CNN [23], Faster R-CNN [24], Mask R-CNN [25], etc. These algorithms divide the target detection process into two phases: first, a Region Proposal Network (RPN) generates a sparse set of candidate anchor boxes; then the detection network predicts the location and identifies the category of the candidate targets. These algorithms achieve excellent detection performance, but they are not end-to-end target detection algorithms. The second category is one-stage target detection algorithms, such as OverFeat [26], SSD [27], YOLO [28], YOLO 9000 [29], YOLO V3 [30], You Only Look Twice [31], etc. There is no need to use an RPN to generate candidate targets; these algorithms predict the target location and category directly through the network, making them end-to-end target detection algorithms. Therefore, one-stage target detection algorithms are faster.
As a representative algorithm of one stage target detection, YOLO V3 [30] predicts targets of different size at 3 different scales, which uses a similar concept to feature pyramid networks [32]. Instead of choosing anchor boxes by hand, YOLO V3 runs K-means clustering on the dataset to find good priors automatically. However, it divides up the priors evenly across scales, which may cause some anchor boxes to be placed at inappropriate scales because the clusters are arbitrarily allocated.
In this paper, a multi-scale target detection approach based on YOLO V3 is proposed. To improve the detection performance for targets of different sizes, we extend the detection scales of YOLO V3 from 3 to 4. To avoid gradient fading and enhance the reuse of features, the six convolutional layers in front of the output detection layer of each scale are transformed into two residual units. To select the appropriate size of anchor boxes for each scale, a mathematical derivation method based on Intersection over Union (IOU) is proposed.
The main contributions of our work are as follows: (1) a mathematical derivation method based on Intersection over Union (IOU) is proposed to select the number and the aspect ratio dimensions of the candidate anchor boxes for each scale of the improved YOLO V3; (2) to further improve the detection performance of the network, the detection scales of YOLO V3 are extended from 3 to 4, and a feature fusion target detection layer downsampled by 4× is established to detect small targets; (3) to avoid gradient fading and enhance the reuse of features, the six convolutional layers in front of each output detection layer are transformed into two residual units; (4) we compare our approach with state-of-the-art target detection algorithms on both the PASCAL VOC and KITTI datasets to evaluate the performance of the improved network, and we provide quantitative and qualitative comparisons between our algorithm and other target detection algorithms.
The rest of this paper is organized as follows. We introduce the framework of YOLO V3 in Section 2. In Section 3, we describe the details of our approach about the main framework of improved YOLO V3 network and the mathematical derivation method to select the appropriate candidate anchor boxes for each scale. The comparative experiments and results on PASCAL VOC dataset and KITTI dataset are presented in Section 4. Finally, the conclusion of this paper is drawn in Section 5.

Brief Introduction to YOLO V3
YOLO (You Only Look Once) is a one-stage target detection algorithm, which has been developed into its third generation, YOLO V3. YOLO V3 is an end-to-end target detection algorithm: it predicts the category and the location of targets directly, so its detection speed is fast.
To perform feature extraction, YOLO V3 uses successive 3 × 3 and 1 × 1 convolutional layers and draws on the idea of Residual Networks [21]. There are 5 residual blocks in YOLO V3. Each residual block is composed of multiple residual units. The residual unit is shown in Figure 1. With the residual unit, the depth of the network can be increased and gradient fading can be avoided.
The input image is downsampled five times. YOLO V3 predicts targets in the last 3 downsampled layers, so there are 3 scales at which YOLO V3 detects targets. At scale 3, the feature map downsampled by 8× is used to detect small targets. At scale 2, the feature map downsampled by 16× is used to detect medium-sized targets. At scale 1, the feature map downsampled by 32× is used to detect big targets. Feature fusion is used to detect targets because the small feature map provides deep semantic information, while the large feature map provides finer-grained information about the targets. To perform feature fusion, YOLO V3 resizes the feature maps of the deeper layer by upsampling, so that the feature maps at different scales have the same size, and then merges the features from the earlier layer with the features from the deeper layer by concatenation. As a result, YOLO V3 performs well at detecting both large and small targets. The network of YOLO V3 is shown in Figure 2.
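The role of the residual unit can be illustrated with a minimal, framework-free sketch: the output is the input plus a learned transformation, so the identity shortcut gives gradients a direct path to earlier layers. Here `conv_stack` is a hypothetical stand-in for the unit's 1 × 1 and 3 × 3 convolutions, not the real operation:

```python
def residual_unit(x, f):
    """Residual unit: output = input + F(input).
    The identity shortcut lets gradients flow directly to earlier
    layers, which mitigates gradient fading in deep networks."""
    return [xi + fi for xi, fi in zip(x, f(x))]

# Hypothetical stand-in for the 1x1 + 3x3 convolution stack.
def conv_stack(x):
    return [0.1 * xi for xi in x]

features = [1.0, 2.0, 3.0]
out = residual_unit(features, conv_stack)
print(out)
```

Even if `conv_stack` produced near-zero outputs, the unit would still pass its input through unchanged, which is why stacking such units does not degrade the signal.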
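For a concrete sense of the three scales, the grid sizes follow directly from the strides; a small sketch for the common 416 × 416 input of YOLO V3:

```python
def grid_sizes(input_size, strides=(32, 16, 8)):
    # Each detection scale predicts on a feature map downsampled by its
    # stride: coarse grids (13x13) see big targets, fine grids (52x52)
    # see small ones.
    return {s: input_size // s for s in strides}

print(grid_sizes(416))  # {32: 13, 16: 26, 8: 52}
```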

Our Approach
We ran the K-means clustering algorithm to perform clustering analysis on the PASCAL VOC and KITTI datasets. We then propose a mathematical derivation method based on IOU to determine the number and the aspect ratio dimensions of the candidate anchor boxes for each scale of the improved network. Finally, to enhance the detection performance of YOLO V3, we improve its structure.

Appropriate Size for Anchor Boxes
YOLO V3 introduced the idea of anchor boxes used in Faster R-CNN. Anchor boxes are a set of initial candidate boxes with fixed widths and heights. The choice of the initial anchor boxes directly affects both detection accuracy and detection speed. Instead of choosing anchor boxes by hand, YOLO V3 runs K-means clustering on the dataset to find good priors automatically. The clusters generated by K-means reflect the distribution of the samples in each dataset, which makes it easier for the network to produce good predictions. In this paper, Avg IOU is used as the metric for target clustering analysis. The objective function of clustering Avg IOU is as follows:

f = argmax (1/k) ∑_{j=1}^{n} ∑_{i=1}^{n_j} IOU(box_i, centroid_j), (1)

where box is a sample, namely the ground truth of a target; centroid is the center of a cluster; n_j is the number of samples in the jth cluster; k is the total number of samples; n is the number of clusters; and IOU(box, centroid) is the Intersection over Union of the cluster center and the sample. We applied K-means clustering to the PASCAL VOC and KITTI datasets, respectively. Figure 3 shows the average IOU obtained with different values of k. As k increases, the change in the objective function becomes more and more stable. Considering the Avg IOU and the number of detection layers, we selected 12 anchor boxes. The widths and heights of the corresponding clusters on the PASCAL VOC and KITTI datasets are shown in Table 1.
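The clustering step above can be sketched as follows. This is a simplified, plain-Python illustration of K-means with the 1 − IOU distance, where boxes are (width, height) pairs compared with aligned top-left corners (the standard simplification for anchor clustering); the toy boxes are made up for demonstration:

```python
import random

def iou_wh(box, centroid):
    # Boxes are (w, h); corners are aligned, so the intersection is
    # simply min(w) * min(h).
    w, h = box
    cw, ch = centroid
    inter = min(w, cw) * min(h, ch)
    return inter / (w * h + cw * ch - inter)

def kmeans_iou(boxes, k, iters=50, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # Assign each ground-truth box to the centroid with the
            # highest IOU, i.e. the smallest distance 1 - IOU.
            best = max(range(k), key=lambda j: iou_wh(b, centroids[j]))
            clusters[best].append(b)
        # Move each centroid to the mean (w, h) of its cluster;
        # empty clusters keep their old centroid.
        centroids = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    avg_iou = sum(max(iou_wh(b, c) for c in centroids) for b in boxes) / len(boxes)
    return centroids, avg_iou

# Toy dataset with three obvious size groups (hypothetical values).
boxes = [(10, 12), (11, 13), (60, 30), (64, 34), (200, 180), (190, 170)]
cents, avg = kmeans_iou(boxes, k=3)
```

Running this on a real annotation set would reproduce the Avg IOU-vs-k curve of Figure 3: as k grows, the gain in `avg` flattens out.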

After K-means clustering on the dataset, cluster centers are generated. YOLO V3 divides the clusters evenly across scales, which may cause some clusters to be placed at inappropriate scales because the clusters are allocated arbitrarily. It is therefore essential to determine what size of anchor box is suitable for each scale of the network. Inspired by the method of proposal generation [33], we used a mathematical derivation based on IOU to help select the appropriate size of the anchor boxes for each scale.
There are two extreme cases, as shown in Figure 4. The red box represents the anchor box, the black box represents the ground truth box, and the green box is the grid cell of the feature map. Let S_a be the side length of an anchor box, S_g the side length of the ground truth box, and 2^d the side length of the grid cell in the feature map, where d is the number of downsampling operations.
Consider the extreme case in Figure 4a: we assume the anchor box and the ground truth box are quadrate and the anchor box is bigger than the ground truth box (S_a ≥ S_g). The IOU between the anchor box and the ground truth box can be defined as

IOU(anchor, GT) = S_g^2 / S_a^2, (3)

where GT is the ground truth box. The common criterion to decide a prediction result is whether the IOU is greater than 0.5 (λ ≥ 0.5). Then we obtain

S_g ≤ S_a ≤ √2 S_g. (4)

Consider the extreme case in Figure 4b: the center of the anchor box is in the upper left corner of the grid cell of the feature map and the center of the ground truth box is in the bottom right corner of the grid cell. We assume that half the side length of both the anchor box and the ground truth box is longer than the side length of the grid cell. The IOU between the anchor box and the ground truth box can then be expressed by

IOU(anchor, GT) = (S_a/2 + S_g/2 − 2^d)^2 / (S_a^2 + S_g^2 − (S_a/2 + S_g/2 − 2^d)^2). (6)

When S_a = S_g, the equal sign in inequality (4) holds and, from (3), the IOU between the anchor box and the ground truth box in Figure 4a can be 1. We require the IOU in (6) to be greater than 0.5, so with S_a = S_g we get the inequality

(S_a − 2^d)^2 / (2 S_a^2 − (S_a − 2^d)^2) ≥ 0.5, (7)

which yields

S_a ≥ 2^d / (1 − √(2/3)) ≈ 5.45 × 2^d. (8)

Then we can obtain the side length S_a and the area of an anchor box for each number of downsamplings d; this is shown in Table 2. With the results in Table 2, we obtained the appropriate size and area of the anchor boxes for each scale of the output detection layer. However, the clusters generated by K-means on each dataset are not quadrate. Take the cluster (63, 32) generated by K-means on the KITTI dataset: the height of the anchor box is much larger than its width. What is more, according to Table 2, the height of this anchor meets the demand of the output detection layer downsampled by 8×, while its width meets the demand of the detection layer downsampled by 4×. To solve this problem, supposing the height and the width of the anchor box are h and w, we compare the value of √(h × w) with S_a to determine which scale is suitable for this anchor. The cluster (63, 32) is suitable for the output detection layer downsampled by 8× (43.6 ≤ √(63 × 32) ≤ 87.2). So the principle for selecting suitable anchor boxes for each scale of the output detection layer can be summarized as follows: • Apply K-means clustering to get the candidate anchor boxes on each dataset.
• Calculate the value of √(h × w) for each candidate anchor box.
• Compare with the size ranges in Table 2 and select the appropriate scale for each anchor box.
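The principle above can be expressed directly in code. A small sketch under the derived bound S_a ≥ 2^d / (1 − √(2/3)) ≈ 5.45 × 2^d, with the four scales of the improved network (downsampled by 4×, 8×, 16×, and 32×):

```python
import math

def side_bound(d):
    # Lower bound on the anchor side length for a layer downsampled by
    # 2**d, from the IOU >= 0.5 condition:
    # S_a >= 2**d / (1 - sqrt(2/3)) ~= 5.45 * 2**d.
    return 2 ** d / (1 - math.sqrt(2 / 3))

def assign_scale(w, h, depths=(2, 3, 4, 5)):
    # depths 2..5 correspond to the 4x, 8x, 16x, and 32x detection scales.
    s = math.sqrt(w * h)
    for d in depths:
        if side_bound(d) <= s <= side_bound(d + 1):
            return 2 ** d
    # Fall back to the nearest end of the covered range.
    return 2 ** depths[0] if s < side_bound(depths[0]) else 2 ** depths[-1]

# The KITTI cluster (63, 32): sqrt(63 * 32) ~= 44.9 lies in [43.6, 87.2],
# so it belongs to the scale downsampled by 8x.
print(assign_scale(63, 32))  # 8
```

Note that the bound for d = 3 evaluates to about 43.6 and for d = 4 to about 87.2, matching the range quoted in the text for the 8× layer.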
According to the principle above, we can allocate the clusters on the PASCAL VOC dataset to each suitable scale, as shown in Table 3. The allocation of each cluster on the KITTI dataset is shown in Table 4.

Improving YOLO V3 with More Scales
The network of the improved YOLO V3 is shown in Figure 5. We add scale 4 (red box) to improve the detection performance of the network.

As shown in Figure 5, to improve the performance of YOLO V3, we propose an enhanced object detection algorithm. YOLO V3 uses three scales to detect targets of different sizes and detects small targets with the feature map downsampled by 8×. To get more fine-grained features and location information for small targets, we exploit the feature map downsampled by 4× in the original network, because it contains finer-grained information about small targets. To concatenate the feature map in the earlier layer with the feature map in the deeper layer, we upsample the feature map downsampled by 8× and concatenate it with the output of the second residual block of YOLO V3. The feature fusion target detection layer downsampled by 4× is thus established to detect small targets, and the three detection scales of the original YOLO V3 are extended to four. As a result, the improved YOLO V3 performs better at detecting targets of different sizes.
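The fusion step can be sketched as follows. This is a toy illustration with nested lists standing in for [C][H][W] tensors (the channel counts and spatial sizes below are hypothetical, chosen only to show the shape arithmetic):

```python
def upsample2x(fmap):
    # Nearest-neighbor 2x upsampling of a [C][H][W] feature map,
    # matching the earlier layer's spatial resolution.
    out = []
    for ch in fmap:
        rows = []
        for row in ch:
            wide = [v for v in row for _ in (0, 1)]  # double the width
            rows.append(wide)
            rows.append(list(wide))                  # double the height
        out.append(rows)
    return out

def concat_channels(a, b):
    # Merge earlier-layer and upsampled deeper-layer features along
    # the channel axis, as a route/concatenation layer does.
    return a + b

# Hypothetical shapes: deep map with 2 channels at 4x4 (8x downsampled),
# shallow map with 1 channel at 8x8 (4x downsampled).
deep = [[[float(c)] * 4 for _ in range(4)] for c in range(2)]
shallow = [[[0.5] * 8 for _ in range(8)]]
fused = concat_channels(shallow, upsample2x(deep))
print(len(fused), len(fused[0]), len(fused[0][0]))  # 3 8 8
```

After upsampling, both maps share the same spatial size, so concatenation simply stacks their channels; the fused map feeds the new 4× detection layer.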
In front of the target detection layer of the YOLO V3 network, there are six convolutional layers. Inspired by DSSD [34], to avoid gradient fading and enhance the reuse of features, we transformed these six convolutional layers into two residual units, as shown in Figure 6.

Experiments and Results
To evaluate the performance of the improved network and the proposed method, comparative experiments between our approach and other target detection algorithms were conducted on the PASCAL VOC and KITTI datasets, respectively. The experimental conditions are as follows. Operating system: Ubuntu 14.04. Deep learning framework: Darknet. CPU: i7-5930k. Memory: 64 GB. GPU: NVIDIA GeForce GTX TITAN X.

Experiment on PASCAL VOC
We combined the PASCAL VOC 2007 training set and the PASCAL VOC 2012 training set as the training set of the improved network, and the PASCAL VOC 2007 test set was taken as the test set. We evaluated with the commonly used performance metric in object detection: a detection is correct if its IOU with the ground truth bounding box is greater than 0.5. In the training stage, the initial learning rate is 0.001 and the weight decay is 0.0005. At 60,000 and 70,000 training batches, the learning rate was reduced to 0.0001 and 0.00001, respectively, which allows the loss function to converge further. To improve training, we use data augmentation methods such as rotating the images by different angles and changing the saturation, exposure, and hue of the images.
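The learning-rate schedule described above can be written as a small step-decay function (the parameter names are ours, chosen for illustration):

```python
def learning_rate(batch, base=1e-3, steps=(60_000, 70_000), scale=0.1):
    # Step decay used in training: 1e-3 until 60k batches,
    # then 1e-4 until 70k, then 1e-5 afterwards.
    lr = base
    for s in steps:
        if batch >= s:
            lr *= scale
    return lr

assert learning_rate(10_000) == 1e-3
```

The same function covers the KITTI schedule by passing `steps=(50_000, 55_000)`.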
The improved YOLO V3 network was used to detect the twenty categories of targets in the dataset. We calculate the average precision (AP) of each category and the mAP over the twenty categories. The test results of the different target detection algorithms are shown in Table 5. The results of YOLO [28], YOLO V2 [29], Faster RCNN [24], SSD [27], and DSSD [34] are taken from their homepages or corresponding papers. The mAP of our proposed network for the 416 × 416 and 512 × 512 input models was 82.88% and 86.73%, respectively. Compared with the state-of-the-art object detection algorithms, the mAP is clearly enhanced by the improved network: it is improved by 3.62% and 7.47% over YOLO V3.
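The evaluation criterion behind these AP numbers can be sketched as follows: a simplified matcher that counts a detection as a true positive when it overlaps an unused ground-truth box with IOU above 0.5, taking detections in descending confidence order. The boxes below are made up for illustration:

```python
def iou_xyxy(a, b):
    # IOU of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def match_detections(dets, gts, thr=0.5):
    # A detection is a true positive if it overlaps a not-yet-matched
    # ground-truth box with IOU > thr. dets: list of (confidence, box).
    used, tps = set(), 0
    for conf, box in sorted(dets, reverse=True):
        best, best_iou = None, thr
        for i, gt in enumerate(gts):
            if i in used:
                continue
            v = iou_xyxy(box, gt)
            if v > best_iou:
                best, best_iou = i, v
        if best is not None:
            used.add(best)
            tps += 1
    return tps, len(dets) - tps  # (TP, FP)

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (1, 1, 10, 10)), (0.8, (21, 20, 30, 30)), (0.3, (50, 50, 60, 60))]
print(match_detections(dets, gts))  # (2, 1)
```

Sweeping the confidence threshold over such TP/FP counts yields the precision-recall curve from which each category's AP, and then the mAP, is computed.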

The detection results of the improved YOLO V3 network on the PASCAL VOC dataset are shown in Figure 7. The dark blue box is the ground truth and the green box is the prediction box. It is clear that both big targets and small targets can be detected by the improved YOLO V3 network.

Experiment on KITTI Dataset
We used the 2012 2D left color images of the object detection data in the KITTI dataset to train and test our model. We combined the 8 categories of targets into 3 categories, namely 'car', 'cyclist', and 'person'. In the training stage, the initial learning rate is 0.001 and the weight decay is 0.0005. At 50,000 and 55,000 training batches, the learning rate was reduced to 0.0001 and 0.00001, respectively, which allows the loss function to converge further. We use data augmentation by rotating the images and shifting the color.
The mAP comparison of each algorithm on the KITTI dataset is shown in Table 6. Compared with the state-of-the-art object detection algorithms, the mAP is enhanced by the improved network. The mAP of our proposed network for the 512 × 512 input model was 84.72%, an improvement of 1.77% over the conventional YOLO V3. The average precision of each category of target and the mAP over the three categories are shown in Table 7. The AP of each category is improved by our proposed approach.

The detection results of the improved YOLO V3 network on the KITTI dataset are shown in Figure 8. The dark blue box is the ground truth and the green box is the prediction box. It is clear that both big targets and small targets can be detected by the improved YOLO V3 network.

Experiment on VEDAI Dataset
To evaluate the performance of our network for detecting small targets, we conducted comparative experiments on the VEDAI dataset. VEDAI is a dataset for vehicle detection in aerial imagery [36]. There is an average of 5.5 vehicles per image, and they occupy about 0.7% of the total pixels of the image, making it a typical small-vehicle dataset. The experimental results show that YOLO V3 achieved an mAP of 55.81% while our network achieved 62.36%. Our network improved the mAP by 6.55%, which clearly demonstrates that its performance for detecting small targets is improved.



Quantitative and Qualitative Evaluation
Quantitative evaluation: as shown in Tables 5 and 6, the experimental results on the PASCAL VOC and KITTI datasets show that our network achieves better performance than other state-of-the-art detection algorithms. Compared with YOLO V3, our proposed network improves the mAP by 7.47% on the PASCAL VOC dataset and by 1.77% on the KITTI dataset. The improvement is much larger on the PASCAL VOC dataset than on the KITTI dataset. That is because PASCAL VOC contains 20 categories and every category may include small targets, whereas the KITTI dataset has only 3 categories, namely 'car', 'cyclist', and 'person'; compared with PASCAL VOC, there are fewer small targets in the KITTI dataset. By extending the scales from 3 to 4, our proposed network outperforms YOLO V3, especially for detecting small targets. What is more, our proposed mathematical derivation method based on IOU can select the appropriate size of the anchor boxes for each scale of the output detection layer.
The comparative results of model size and frames per second (FPS) between YOLO V3 and the improved YOLO V3 are shown in Table 8. The model size of our network is 249.1 MB, which is slightly larger than that of YOLO V3, and its FPS is lower than that of YOLO V3. That is because our proposed network extends the scales of YOLO V3 from 3 to 4.
Qualitative evaluation: Figures 7 and 8 show that our proposed network performs well for detecting both big and small targets. A qualitative comparison of our network with YOLO V3 is also provided. The comparative results between YOLO V3 and the improved YOLO V3 on the KITTI dataset and the PASCAL VOC dataset are shown in Figures 9 and 10. The light blue box is the ground truth, the green box is the prediction box, and the red box marks falsely detected targets. It is obvious that our proposed network achieved better performance than YOLO V3. In Figures 9 and 10, many targets are missed or falsely detected by YOLO V3, especially small targets. The improved network avoids these problems because it detects targets at 4 scales, and the feature fusion target detection layer downsampled by 4× is helpful for detecting small targets. What is more, with the mathematical derivation method based on IOU, targets of different sizes can be accurately assigned to their corresponding scales.
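The IOU-based assignment of targets to scales can be sketched as follows: each ground-truth box is matched, by shape alone, against the anchor boxes of every scale, and is assigned to the scale whose anchor overlaps it most. The anchor shapes below are illustrative placeholders, not the anchors actually derived by our method:

```python
def iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def assign_scale(gt_wh, scale_anchors):
    """Pick the scale whose anchors best match a ground-truth box of
    width/height gt_wh; boxes share a corner at the origin, so only
    shape matters (the usual anchor-matching convention)."""
    best_scale, best_iou = None, 0.0
    w, h = gt_wh
    for scale, anchors in scale_anchors.items():
        for aw, ah in anchors:
            overlap = iou((0, 0, w, h), (0, 0, aw, ah))
            if overlap > best_iou:
                best_scale, best_iou = scale, overlap
    return best_scale

# Hypothetical anchor sets for the four detection scales (values illustrative)
scale_anchors = {
    "4x":  [(10, 13), (16, 30)],
    "8x":  [(33, 23), (30, 61)],
    "16x": [(62, 45), (59, 119)],
    "32x": [(116, 90), (373, 326)],
}
print(assign_scale((12, 14), scale_anchors))  # a small target → "4x"
```

A small 12 × 14 ground-truth box lands on the finest (4× downsampled) scale, which is exactly why adding that scale helps the small targets discussed above.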

Conclusions
In this paper, a multi-scale target detection approach based on YOLO V3 is proposed. The main contributions of the proposed network are as follows: First, a mathematical derivation method based on IOU was proposed to select the number and the aspect ratio dimensions of the candidate anchor boxes for each scale of the improved YOLO V3. Second, to further improve the detection performance of the network, the detection scales of YOLO V3 were extended from 3 to 4 and the feature fusion target detection layer downsampled by 4× was established to detect small targets. Third, to avoid gradient fading and enhance the reuse of the features, the six convolutional layers in front of the output detection layer were transformed into two residual units. Finally, we conducted comparative experiments on the PASCAL VOC dataset and the KITTI dataset. Compared with state-of-the-art target detection algorithms, our experimental results show that the mean average precision is enhanced by the improved network and that the proposed method is effective for selecting suitable anchors for each scale. We also provided quantitative and qualitative comparisons of our approach with YOLO V3. Ways to simplify the network and reduce the computational cost without reducing the detection performance will be our main future research direction.
Author Contributions: M.J. contributed to this work by setting up the experimental environment, designing the algorithms, designing and performing the experiments, analyzing the data, and writing the paper. H.L., Z.W., Z.C. and B.H. contributed through research supervision and review, and by amending the paper.
Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.