Object Detection Using Multi-Scale Balanced Sampling

Abstract: Detecting small objects and objects with large scale variation is a persistent challenge for deep learning based object detection approaches. Many efforts have been made to address these problems, such as adopting more effective network structures, image features, and loss functions. However, for both small object detection and the detection of objects at various scales in a single image, the first problem to solve is the matching mechanism between anchor boxes and ground-truths. In this paper, an approach based on multi-scale balanced sampling (MB-RPN) is proposed to address the difficult matching of small objects and the detection of multi-scale objects. Depending on the scale of the anchor boxes, different IOU discrimination thresholds for positive and negative samples are adopted to improve the probability of matching small object areas with anchor boxes, so that more small object samples are included in the training process. Moreover, a balanced sampling method is proposed for the collected samples: the samples are further divided into intervals and sampled uniformly to ensure sample diversity during training. MB-RPN is evaluated on several datasets, and the experimental results show that, compared with similar approaches, it improves detection performance effectively.


Introduction
Object detection refers to approaches for localizing objects and classifying their categories in digital images, and it is one of the most challenging branches of computer vision.
Early approaches are based on handcrafted image features. In [1], a pedestrian detection system is proposed that combines the histogram of oriented gradients (HOG) feature with a support vector machine (SVM). In [2], the deformable part-based model (DPM) is proposed, which enhances detection accuracy by using HOG features of both the whole object and its parts. Although DPM is the peak of handcrafted-feature-based detection, its detection performance is still not ideal because the features lack effective representation ability. Besides, since handcrafted feature extractors are usually designed for specific object types, they often show low robustness when dealing with different categories of objects.
In recent years, several object detection approaches based on convolutional neural networks (CNN) have been proposed [3][4][5]. Image features are learned in a supervised manner by measuring the error between predictions and annotated ground-truths on large-scale object detection datasets. Compared with handcrafted features, CNN features are significantly stronger in both representation ability and robustness across object types, so detection performance is greatly improved. Although deep learning object detection approaches have shown state-of-the-art performance for general object detection, they are still limited in detecting small objects, and their performance in detecting objects of various scales in a single input image is also not ideal. The reasons for the low detection performance are as follows: 1. The proportion of small objects in an image is usually low, which means they might be excluded from the training process by improper network hyperparameter settings. Moreover, small objects often have low resolution and carry less image information, so they are harder to train on than general objects, and the lack of samples further degrades the quality of the features extracted by the deep neural network. 2. In natural scenes, object scales are distributed stochastically, so a single image may contain objects with large scale variation, and training tends to be dominated by the majority of samples while the other objects are ignored.
Overall, to further improve detection performance, especially for the objects mentioned above, both the number of small objects and the proportion of objects at various scales in the training samples are important. In this paper, an end-to-end object detection approach with multi-scale balanced sampling is proposed to improve the matching mechanism and ensure scale diversity in the training samples. The key contributions of the approach are summarized as follows: 1. The matching conditions of the samples are adjusted according to object scale so that small objects are matched more easily, which increases the number of training samples for small objects. 2. The sample set is divided into multiple intervals according to the scale of the samples and their corresponding discrimination difficulty. In addition, each interval is sampled in a balanced fashion to preserve the diversity of sample types during the training of the classification and localization networks and to ensure that the algorithm does not favor a specific type of sample while ignoring the others. 3. The proposed approach is compared with other approaches on several benchmarks, and the experimental results show that its detection performance is better than that of similar approaches.
The remainder of this paper is organized as follows. In Section 2, background and related works are introduced. In Section 3, the framework and implementation details of the proposed approach are introduced. Section 4 presents the experimental results and comparisons with other similar approaches. Finally, Section 5 concludes the paper.

Related Works
In [6], a deep neural network based object detection approach called RCNN is first introduced. It is composed of three parts. First, by adopting the selective search algorithm, RCNN generates a series of candidate regions, each of which may be responsible for detecting a specific object in the image. Second, features of the candidate regions are extracted by a CNN, which is trained in a supervised manner by measuring the error between predictions and ground-truths. Finally, an SVM and bounding box regression are adopted to fine-tune the predicted results. The framework of RCNN is similar to traditional approaches, except that a CNN is adopted to extract image features instead of handcrafted features. In [7], an approach called Fast RCNN is proposed. It takes the whole image as input, crops the features corresponding to each candidate region, maps them to a uniform size by region of interest (ROI) pooling, and finally feeds the mapped features into the classification and regression network to obtain the category and localization. Fast RCNN integrates the coarse and fine-tuning processes of RCNN, which improves both detection performance and efficiency. In [5], an end-to-end CNN based object detection framework called Faster RCNN is introduced. Instead of generating candidates by selective search, this approach proposes the region proposal network (RPN). It first generates a series of anchor boxes with different scales and aspect ratios; whether each of them is responsible for detecting an object depends on the intersection over union (IOU) between its coordinates and the annotated ground-truth. As the first end-to-end object detection approach, Faster RCNN laid the foundation for subsequent deep neural network based object detection approaches. In [3], an approach called SSD is introduced, which integrates the functions of the RPN and the classification/regression network into a single series of convolution layers.
At present, approaches whose frameworks are similar to Faster RCNN are referred to as two-stage approaches, whereas approaches like SSD are called one-stage approaches.
Based on backbones such as Faster RCNN and SSD, a variety of approaches have been proposed to enhance detection performance. For example, Feature Pyramid Networks (FPN) propose a pyramid-architecture image feature extractor in which objects of different resolutions are assigned to corresponding layers; compared with the original Faster RCNN, FPN provides more suitable features for a given object [8]. Cascade RCNN proposes a training process in which the IOU discrimination threshold is gradually increased, so that the classification and regression networks are trained in an easy-to-hard way [9]. RetinaNet proposes the focal loss, which enlarges the weight of hard samples so that the training process focuses on them [10]. Libra RCNN proposes balanced sampling, balanced feature extraction, and a balanced loss function for the training process [11]. SRetinaNet proposes an anchor optimization method that helps detect small objects with specific parameter settings [12]. GA-RPN proposes an anchor optimization method that combines anchor boxes with semantic features [13].
Regardless of whether an approach is one-stage or two-stage, the foundation of object detection is the matching mechanism between anchors and ground-truths, which determines how many samples can be included in network training. Therefore, a proper matching mechanism is important for enhancing detection performance. However, the stochastic distribution of object scales poses challenges to the matching mechanism [12,13].

Framework Overview
The overview of the proposed approach is shown in Figure 1, where the green cubes denote the image feature extraction process, the pink cube denotes the MB-RPN module, and the purple cube denotes the classification/regression networks used for fine-tuning. Table 1 shows the details of the network. The specific steps are as follows:
1. A digital image is taken as input and fed into the pre-trained ResNet-101 network so that image features are extracted from shallow to deep using Conv1∼Conv5 [14].
2. By adopting MB-RPN, samples are dynamically selected according to scale and proportion. MB-RPN can be further decomposed into two parts: multi-scale sampling and balanced sampling.
3. The MB-RPN outputs are then transformed to the same size through ROI pooling.
4. The candidate boxes are then sent to the classification and regression network to obtain the object categories and locations.

Multi-Scale Sample Discrimination
The main factors affecting sample matching in the training process are the scale of the anchor boxes and the IOU discrimination threshold applied against the labelled regions. The large difference between the scale of a small object and that of the anchor boxes makes matching difficult under the existing discrimination conditions.
For the sampling of positive samples, Figure 2a shows the matching results when the default FPN anchor box shapes are adopted with a discrimination threshold of IOU > 0.7, where the red rectangle indicates the manually marked area containing the object. As can be seen, none of the anchor boxes is successfully matched with the object area. The image therefore cannot guide the training of the network parameters, because it contributes no positive sample during the training process. Figure 2b shows the matching results when the scale of the anchor boxes is reduced by half. The green rectangle indicates the labelled samples in the training set, and the red rectangle indicates the corresponding matched anchor box. The sample matching results are still far from ideal.
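The matching failure described above can be illustrated with a small IOU computation. The following is a minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the 12 × 12 object is a hypothetical example and is not taken from the figures.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so that disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical 12 x 12 object, perfectly centered in each anchor.
small_object = (10, 10, 22, 22)
default_anchor = (0, 0, 32, 32)   # smallest default anchor, 32 x 32
halved_anchor = (8, 8, 24, 24)    # anchor scale reduced by half, 16 x 16

print(iou(default_anchor, small_object))  # 0.140625
print(iou(halved_anchor, small_object))   # 0.5625
```

Even with perfect centering, both values stay below 0.7, which is consistent with the observation that halving the anchor scale alone does not make small objects match.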
For small objects, the default IOU threshold of FPN is a stringent condition, resulting in poor matching even when the anchor box scale is reduced. In addition, the design of the anchor boxes should account for objects of all sizes in the image dataset, so simply reducing the anchor box scale might, in turn, cause matching failures for normal-sized objects. Hence, it is hard to improve the matching probability of small objects solely by reducing the anchor box scale. To address this issue, a multi-scale positive sampling approach with a dynamic IOU discrimination threshold is proposed. The FPN method defines anchor boxes at five scale levels, A1∼A5, ordered by size. According to the scale of the anchor boxes, the approach divides the positive samples into three intervals, from the smallest anchor size to the largest. The criteria for the intervals are given in Equation (1), where a_i represents the area of an anchor box and A1∼A5 denote the areas of the anchor boxes at the five levels. The IOU discrimination thresholds of small and medium anchor boxes are decreased to 0.5 and 0.6, respectively, so that more small and medium boxes are matched. For large anchor boxes, the default discrimination threshold is kept. With a lower positive sample discrimination threshold, anchor boxes match small object areas more easily, and the number of positive samples covering small object areas therefore increases.
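The scale-dependent positive threshold can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the text states only that medium anchors correspond to {A2, A3}, so the grouping small = {A1} and large = {A4, A5} is an assumption, and the function names are hypothetical.

```python
# Anchor side lengths for levels A1..A5, as listed in the implementation details.
ANCHOR_SIZES = [32, 64, 128, 256, 512]
LEVEL_AREAS = [s * s for s in ANCHOR_SIZES]

# Thresholds from the text: 0.5 (small), 0.6 (medium), default 0.7 (large).
POSITIVE_IOU_THRESHOLD = {"small": 0.5, "medium": 0.6, "large": 0.7}


def scale_group(anchor_area):
    """Map an anchor area to a scale group via the nearest level area.

    ASSUMPTION: small = {A1}, medium = {A2, A3}, large = {A4, A5}.
    Only the medium group is stated explicitly in the text.
    """
    level = min(range(len(LEVEL_AREAS)),
                key=lambda k: abs(LEVEL_AREAS[k] - anchor_area))
    if level == 0:
        return "small"
    if level <= 2:
        return "medium"
    return "large"


def is_positive(anchor_area, iou_with_gt):
    """An anchor is positive if its IOU exceeds the threshold of its group."""
    return iou_with_gt > POSITIVE_IOU_THRESHOLD[scale_group(anchor_area)]


# The same IOU of 0.55 is accepted for an A1 anchor but rejected for an A2 anchor.
print(is_positive(32 * 32, 0.55), is_positive(64 * 64, 0.55))
```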
Theoretically, lowering the matching threshold for large anchor boxes would also increase the number of matched anchor boxes. However, large objects differ from small objects in two respects: 1. Large objects satisfy the IOU discrimination threshold much more easily. As shown in Figure 3, the large object in the image already matches a large number of anchor boxes without any reduction of the threshold, which suggests that further lowering the threshold would add few positive samples. 2. Since a large object area contains richer image feature information than a small one, it is easier to obtain a set of valid discrimination and bounding box regression parameters during network training. Therefore, reducing the IOU discrimination threshold yields only a very limited improvement in detection performance for large objects. From the network training perspective, object detection approaches are limited by computing resources and often need to set an upper limit on the number of samples. When too many samples are collected, part of them are discarded at random. Taking FPN as an example, the upper limit on the total number of samples is usually 256, with 128 allotted to positive samples and 128 to negative samples, and the redundant samples are discarded. In this paper, we argue that samples of small object areas should have higher priority than those of large objects. When an image contains a combination of small and large objects, the first priority is to ensure that a sufficient number of small object samples are selected. Therefore, the IOU threshold of the original FPN method is maintained for large object areas, and their number of positive samples is not increased.
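The random discarding under a fixed budget can be sketched as below. This is a simplified illustration of the 128/128 split described in the text, not the FPN reference code; the function name and the `rng` parameter are hypothetical.

```python
import random


def cap_samples(positives, negatives, per_class_limit=128, rng=None):
    """Keep at most `per_class_limit` positives and negatives each.

    Surplus samples are discarded uniformly at random, so nothing here
    protects samples of small objects from being dropped.
    """
    rng = rng or random.Random()

    def cap(items):
        items = list(items)
        if len(items) <= per_class_limit:
            return items
        return rng.sample(items, per_class_limit)

    return cap(positives), cap(negatives)


pos, neg = cap_samples(range(10), range(1000), rng=random.Random(0))
print(len(pos), len(neg))  # 10 128
```

Because the discarding is uniform, an image dominated by anchors matched to large objects would keep mostly those anchors, which is the situation the priority argument above is meant to avoid.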
For the sampling of negative samples, besides the match between the anchor box and the object size, it is also necessary to consider how different discrimination difficulties affect the accuracy of the algorithm. The proposed approach divides the negative samples into easy and hard negative samples depending on their IOU. Easy negative samples help the network converge quickly, whereas the detection accuracy depends mainly on hard negative samples. Therefore, when collecting negative samples, the ratio of hard to easy negative samples is balanced. Figure 4 shows an example of the collected negative samples, where blue, green and red rectangles denote small, medium and big negative samples, most of which are easy, small samples. To address this problem, Libra RCNN proposes a balanced sampling method to ensure the diversity of negative samples. First, the negative samples are divided into different intervals according to the IOU between anchor boxes and ground-truths. Second, the number of negative samples is divided equally among the intervals, and balanced sampling is performed within each interval. In the FPN method, negative samples are defined as the anchors whose IOU with the ground-truth is lower than 0.3; Libra RCNN further divides them into easy, medium and hard negative intervals, as defined in Equation (2). Based on Libra RCNN, a balanced negative sampling method that combines the scale and the difficulty of the samples is proposed. Negative samples are divided into 8 intervals, as shown in Equation (3). For medium and big negative samples, this approach adopts the same difficulty division as Libra RCNN; for instance, easy_negative_medium denotes the negative samples whose IOU ∈ [0, 0.1) and whose scale a_i ∈ {A2, A3}.
For small negative samples, since the IOU discrimination threshold of positive samples is adjusted to 0.5, the division used by Libra RCNN easily causes confusion between positive and negative samples. This approach therefore adjusts the division accordingly and splits the small negative samples into only two intervals.
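The eight negative intervals can be sketched as a lookup on scale group and IOU. Only the easy range [0, 0.1) for medium anchors and the 0.3 upper bound are stated in the text; the medium/hard boundaries at 0.2 and the split point for small anchors (`small_split`) are assumptions introduced purely for illustration.

```python
NEG_IOU_MAX = 0.3  # anchors with IOU below this value are negatives (from the text)


def negative_interval(scale, iou, small_split=0.15):
    """Return the interval name for a negative anchor, or None if not negative.

    `scale` is one of "small", "medium", "big".
    ASSUMPTIONS: medium difficulty is [0.1, 0.2) and hard is [0.2, 0.3);
    `small_split` is a hypothetical boundary between the two small intervals.
    """
    if not 0.0 <= iou < NEG_IOU_MAX:
        return None
    if scale == "small":
        difficulty = "easy" if iou < small_split else "hard"
    elif iou < 0.1:
        difficulty = "easy"
    elif iou < 0.2:
        difficulty = "medium"
    else:
        difficulty = "hard"
    return f"{difficulty}_negative_{scale}"


# Enumerate the distinct interval names produced over a grid of IOU values.
names = {negative_interval(s, k / 100)
         for s in ("small", "medium", "big") for k in range(30)}
print(len(names), negative_interval("medium", 0.05))
```

With two intervals for small anchors and three each for medium and big anchors, the sketch yields the eight intervals mentioned above.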

Balanced Sampling
Based on the division described above, a balanced sampling approach is proposed. The positive samples are balanced according to scale to form the positive sample set. For the negative samples determined by the anchor boxes, the negative sample set is formed by balanced sampling that takes both difficulty and scale into account. For a sample set with an upper limit of N, the sample collection method designed in this paper is shown in Algorithm 1.

Inputs: Positive/Negative Sample Sets; Number of Select Samples N
Outputs: Sample Set U
    divide_num = N / set_num
    U = []
    sort(Sets)
    for set in Sets:
        if n_set > divide_num:
            U.append(sample(n_set, divide_num))
        else:
            U.append(n_set)
            reshape(divide_num)
    return U

Ideally, the number of samples drawn from each interval should be equal; therefore this approach initializes divide_num to the average N / set_num, where set_num is the number of sample intervals. If the number of samples in every interval satisfies n_set > divide_num, it is sufficient to randomly sample divide_num samples from each interval to generate the set U. However, this ideal condition rarely holds in practice, so the imbalanced case must be handled. If the total number of samples is less than the upper sampling limit N, all samples are included in the sample set. Otherwise, the number of uniformly sampled objects in each interval, divide_num, is calculated from the number of intervals, set_num. Sampling is then carried out in ascending order of the number of samples in each interval. If the number of samples in the current interval satisfies n_set > divide_num, then divide_num samples are randomly selected from the current interval; otherwise, all n_set samples are included in the sample collection, and divide_num is adjusted for the subsequent intervals using the reshape method.
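A runnable reading of Algorithm 1 is sketched below. It is an interpretation under stated assumptions, not the authors' code: the reshape step is folded into the loop by recomputing divide_num from the remaining budget, integer division is used for the per-interval quota, and the names `balanced_sample` and `rng` are hypothetical.

```python
import random


def balanced_sample(intervals, n, rng=None):
    """Draw at most n samples, spread as evenly as possible over the intervals.

    `intervals` is a list of lists, one per scale/difficulty interval.
    """
    rng = rng or random.Random()
    ordered = sorted(intervals, key=len)  # fewest samples first
    if sum(len(s) for s in ordered) <= n:
        # Fewer samples than the budget: keep everything.
        return [x for s in ordered for x in s]

    selected = []
    remaining = n
    for idx, samples in enumerate(ordered):
        sets_left = len(ordered) - idx
        # "reshape": share the remaining budget among the remaining intervals.
        divide_num = remaining // sets_left  # ASSUMPTION: integer quota
        if len(samples) > divide_num:
            chosen = rng.sample(samples, divide_num)
        else:
            chosen = list(samples)  # scarce interval: keep all of it
        selected.extend(chosen)
        remaining -= len(chosen)
    return selected


# Hypothetical interval sizes: 20 small, 500 medium, 800 big positives.
intervals = [[("small", i) for i in range(20)],
             [("medium", i) for i in range(500)],
             [("big", i) for i in range(800)]]
picked = balanced_sample(intervals, 128, random.Random(0))
print(len(picked))  # 128
```

In this example the small interval contributes all 20 of its samples, and the remaining budget of 108 is split into 54 medium and 54 big samples.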
The key point of the balanced sampling method is the reshape method for the case n_set < divide_num. All samples in such an interval should be retained, since the demanded number of samples exceeds the number actually collected. Because the sampling order depends on the number of samples in each interval, all of the subsequent intervals contain at least as many samples; that is, the subsequent intervals satisfy the condition in Equation (4), where j represents the index of the current interval among all sorted intervals. Since the shortfall can be made up from the surplus in the subsequent sampling process, a sufficient number of samples can still be collected. Therefore, as many samples as possible should be collected from the remaining intervals while maintaining the balance. The reshape method for updating divide_num is defined in Equation (5), where set_num_left represents the number of remaining intervals, and divide_num is updated accordingly for each subsequent interval. Take the collection of positive samples as an example, and suppose that the number of samples in the small_positive interval, n_small, is the lowest and is less than divide_num. Then divide_num is updated to (N − n_small)/2 for the sampling process in the subsequent intervals. If the numbers of samples in the medium_positive and big_positive intervals are greater than the updated value of divide_num, they are uniformly sampled. Through the balanced sampling method, factors such as scale and difficulty are fully considered when generating the sample set, which effectively increases the number of small object samples and ensures sample diversity.
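The reshape update can be written out explicitly. The form below is a reconstruction that agrees with the (N − n_small)/2 example in the text, not a verbatim copy of Equation (5); n_k denotes the number of samples already taken from the k-th sorted interval, and the numbers in the worked example are hypothetical.

```latex
% Reconstructed reshape update (assumed form, consistent with the example in the text)
\[
  \mathit{divide\_num} \;\leftarrow\;
  \frac{N - \sum_{k \le j} n_k}{\mathit{set\_num}_{\mathit{left}}}
\]
% Worked example with hypothetical values N = 128, n_small = 20, three positive intervals:
\[
  \mathit{divide\_num}_{\text{initial}} = \frac{128}{3} \approx 42.7,
  \qquad
  \mathit{divide\_num}_{\text{updated}} = \frac{128 - 20}{2} = 54 .
\]
```

Since 20 is below the initial quota, all small samples are kept and the medium and big intervals each receive a larger share of the budget.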

Loss Function
Similar to other two-stage methods, the loss function is defined as the sum of the classification and regression losses, as given in Equation (6), where L_cls and L_bbox denote the classification and regression losses of the MB-RPN and fine-tuning stages, respectively. Cross entropy is adopted to measure the classification loss, where y_i and y*_i denote the predicted and annotated categories, respectively: in MB-RPN, y*_i is 1 if the anchor is positive, and in the fine-tuning stage y*_i is 1 at the dimension of the label vector that represents the object's category. L_reg denotes the smooth L1 regression loss [7], where t_i and t*_i denote the predicted and annotated coordinate and scale transforms, respectively.
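The individual loss terms can be sketched in a few lines. The exact equations are not reproduced in this text, so the sketch assumes the standard forms used by Fast RCNN and Faster RCNN: cross entropy on the annotated class, smooth L1 with a transition at 1, and the usual center/size box parameterization. All function names are hypothetical.

```python
import math


def cross_entropy(probs, label_index):
    """Cross entropy for a one-hot label: -log of the probability of the true class."""
    return -math.log(probs[label_index])


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def encode_box(box, anchor):
    """Coordinate and scale transform of a (cx, cy, w, h) box relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def regression_loss(pred_t, target_t):
    """Sum of smooth L1 over the four transformed coordinates."""
    return sum(smooth_l1(p - t) for p, t in zip(pred_t, target_t))


anchor = (16.0, 16.0, 32.0, 32.0)
target = encode_box((20.0, 18.0, 24.0, 24.0), anchor)
print(regression_loss((0.0, 0.0, 0.0, 0.0), target) + cross_entropy([0.2, 0.8], 1))
```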

Discussion
In this section, both the framework and the details of the proposed approach are introduced. The overall architecture is similar to FPN, except that the positive/negative candidate sampling method is adjusted. First, the framework is introduced, including the network architecture, pre-trained backbone, object detection pipeline and network details. Second, the matching mechanism for multi-scale objects is presented: the IOU thresholds for small and medium anchors are reduced so that more small objects are matched successfully. Third, a sampling algorithm is introduced to ensure the diversity of the sampling results: all samples are divided into different intervals, and the algorithm tries to draw a balanced number of samples from each interval. Finally, the loss function of the proposed approach is introduced, including the cross entropy loss for classification and the smooth L1 loss for localization.

Benchmarks
The proposed approach is evaluated on two datasets: Object Detection in Aerial Images (DOTA) and the Unmanned Aerial Vehicle Benchmark (UAVB) [15,16]. Their details are as follows: 1. DOTA contains over 2000 remote sensing images. All of the images are large, about 4000 × 4000 pixels. The images are annotated by experts in aerial and remote image interpretation using 15 common object categories, such as plane, ship, and harbor. The object distribution of each category is shown in Figure 5a, and the abbreviation of each category is given in Table 2. 2. UAVB is an unmanned aerial vehicle dataset in which each frame is about 560 × 1000 pixels and contains a high density of small objects. The vehicle categories include car, truck and bus. The object distribution of each category is shown in Figure 5b. Both DOTA and UAVB contain objects at all scales, and small objects account for a large proportion. This means that the ability to detect small objects and the ability to detect objects at all scales are both important.

Implementation Detail
The network is built on TensorFlow and trained end-to-end. The MB-RPN loss and the detection loss are optimized simultaneously on an Nvidia 1080Ti under the Ubuntu operating system [17]. The pre-trained ResNet-101 network is adopted to extract image features, and the other convolution layers are initialized randomly [14]. To optimize the network parameters, the Adam optimizer with lr = 10^−6, β1 = 0.9 and β2 = 0.999 is adopted [18].
The input image size is set to 600 × 600. For the DOTA dataset, since the training and testing images are much larger than the input size, the images are cropped into input-sized tiles with a stride of 300 for training and testing to reduce the loss of image resolution, and the test results are merged back to the original image shape. For the UAVB dataset, since the training and testing images are similar in size to the input, the images only need to be resized to the input size. The sizes of the anchor boxes for layers Conv1∼Conv5 are [32, 64, 128, 256, 512], which is consistent with the default values of the FPN method. For the DOTA dataset, the aspect ratios of the anchor boxes are [1/7, 1/5, 1/3, 1/2, 1, 2, 3, 5, 7] to accommodate categories with both normal and slender shapes, such as bridge. For the UAVB dataset, since all categories have normal shapes, the aspect ratios are the same as in the default FPN method. Mean average precision (mAP) is adopted to evaluate the proposed approach [19].
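The cropping of large DOTA images into overlapping tiles can be sketched as follows. The tile size of 600 and the stride of 300 come from the text; the handling of the image border (adding one extra tile flush with the edge) is an assumption, and the function names are hypothetical.

```python
def tile_origins(length, tile=600, stride=300):
    """Start offsets of tiles along one image axis."""
    if length <= tile:
        return [0]  # image not larger than one tile; padding or resizing is left to the caller
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] + tile < length:
        # ASSUMPTION: add a final tile flush with the border so no pixels are skipped.
        origins.append(length - tile)
    return origins


def crop_windows(width, height, tile=600, stride=300):
    """All (x1, y1, x2, y2) crop windows covering a width x height image."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile, stride)
            for x in tile_origins(width, tile, stride)]


windows = crop_windows(4000, 4000)
print(len(windows))  # 169
```

Detections from each window would then be shifted by the window origin (x1, y1) before being merged in the original image coordinates; the merging rule itself is not specified in the text.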

Effectiveness of Multi-Scale Sampling on Positive Sample
To evaluate the effectiveness of multi-scale sampling for positive samples, the distribution of positive samples over different scales is compared between the multi-scale and the original FPN sampling methods on the DOTA dataset. The results are shown in Figure 6. Since the IOU discrimination thresholds of small and medium objects are reduced, the number of positive samples increases markedly, which means that more small anchor boxes are matched. The matching results are also visualized in Figure 7; compared with the original FPN sampling, the matching results are effectively improved.

Effectiveness of Multi-Scale Balanced Sampling on Negative Samples
To evaluate the effectiveness of multi-scale balanced sampling, the distribution of negative samples over different scales and IOU intervals is compared between the MB-RPN, FPN and Libra RCNN sampling methods on the DOTA dataset. The results are shown in Figure 8, where b1∼b8 denote the total numbers of hard_small to easy_big negative samples. In the FPN method, most of the samples are easy_small. Libra RCNN alleviates this situation significantly, but the majority are still easy samples (b1, b3 and b6), especially easy_small samples. The multi-scale balanced sampling further improves the sample distribution. The matching results are also visualized in Figure 9; compared with FPN and Libra RCNN, the matching results are effectively improved.

Performance Comparison with Other Methods
Several one-stage and two-stage object detection methods are adopted to evaluate the effectiveness of MB-RPN: SSD [3], RetinaNet [10], FPN [8], Libra RCNN [11], GA-RPN [13] and SRetinaNet [12]. All of these methods except SRetinaNet are implemented with the source code provided by their authors. SRetinaNet is implemented by adjusting the hyperparameters of RetinaNet. Table 2 shows the quantitative results on DOTA, with the best performance for each category colored in red. MB-RPN achieves an mAP of 68.5%, which outperforms the other methods. Since the one-stage approaches lack a positive/negative discrimination process, their detection accuracy is clearly lower than that of all the two-stage approaches. Compared with the original FPN, the mAP is 3% higher, and the AP of each category is also higher. Compared with GA-RPN and Libra RCNN, the mAP is increased by about 1.7% and 0.9%, respectively, and the AP of most categories is increased. The visual comparison between MB-RPN and Libra RCNN is shown in Figure 10. MB-RPN detects small objects more accurately in various challenging cases, e.g., the small vehicle objects at the bottom left of Figure 10a and in the middle of Figure 10b. At the same time, the performance on medium and big objects is not decreased. These results demonstrate the effectiveness of MB-RPN in enhancing detection performance for images with small objects and large scale variation.
Considering the performance gap between one-stage and two-stage approaches, the comparison on the UAVB dataset is carried out only between FPN, GA-RPN, Libra RCNN and MB-RPN. Table 3 shows the quantitative evaluation of these approaches, with the best performance for each category colored in red. None of the mAP values obtained by these approaches is ideal, which may be due to the imbalance of samples. However, MB-RPN still largely outperforms the other approaches in both mAP and the AP of each category. Figure 11 provides a visual comparison of our approach and Libra RCNN. Since the performance of both approaches is not ideal, the figure shows only the localization results without categories. It can be seen that MB-RPN detects more small objects, such as the cars in the upper part of the input image, which corresponds to the sampling results and distributions shown above. These results demonstrate the effectiveness of MB-RPN in enhancing detection performance.

Discussion
In this section, the experimental settings and results are introduced. First, the adopted DOTA and UAVB datasets are introduced, including their numbers of training/testing images and the sample distribution of each category. Second, the implementation details are presented, including the input size, the cropping mechanism for large images, the parameter initialization and optimization method, and the hardware/software platform. Finally, both quantitative and visual comparisons are presented. The results show that, under equal conditions, MB-RPN outperforms other similar methods.

Conclusions
In this paper, a multi-scale balanced sampling approach for detecting small objects in complex scenes is proposed. With the multi-scale positive sampling method, more small objects can be included in the network training process. With the balanced negative sampling method, the diversity of negative samples is ensured. Experimental results show that, compared with other similar methods, this approach achieves better performance on images with small objects and objects with large scale variation.

Conflicts of Interest:
The authors declare no conflict of interest.