As Figure 1 shows, we apply a coarse-to-fine strategy to detect small objects in 4K-resolution images. In the coarse stage, an Adaptive Sub-Region Select Block (ARSB) finds rough areas of the 4K image that are likely to contain objects. We then zoom into these sub-regions of the original image while maintaining the size of the area bounding box. In the fine stage, we apply a state-of-the-art object detection backbone such as YOLOv3 [39] to detect the objects. Finally, we combine the bounding boxes found in the sub-regions and map them back to the coordinates of the original image.
4.1. Adaptive Sub-Region Select Block (ARSB)
One challenging task in object detection for high-resolution images is finding small objects correctly. A straightforward method is to split the high-resolution image evenly. However, this simple cropping can be inefficient when the objects are sparse: as Figure 2a shows, more than half of the evenly cropped image clips contain no objects, yet these additional clips increase the processing time dramatically. A characteristic of the objects in these images is that they usually cluster in a few sub-regions. We therefore propose an adaptive sub-region select block (ARSB) to extract such sub-regions. Our idea is inspired by region proposal networks (RPN) [11], which generate candidate proposals from the feature maps. Similarly, we develop a mechanism for the coarse detector to select rough regions containing objects. Because objects in aerial images are sparse, the number of predicted sub-regions is far smaller than the number of evenly cropped regions: ARSB adaptively selects regions and feeds the resulting clips to the fine detector, which predicts precise bounding-box positions. Although the idea is similar to RPN, the method differs in that we crop the sub-regions from the original image and then zoom into these cropped regions for fine detection. As Figure 2b illustrates, significantly fewer image clips are selected than with even cropping.
Extracting the sub-regions can be formulated as a supervised learning problem. However, no current dataset provides ground-truth annotations for rough sub-regions containing objects. In this work, we propose an algorithm called Iterative Bounding-Box Merge (IBBM) to merge the small objects into ground truths for the sub-regions. IBBM can also be applied to other public datasets to merge and generate large regions for coarse detection. The idea of IBBM comes from non-maximum suppression (NMS) [40]. NMS compares the predicted bounding boxes with the ground truth, keeping the box with the largest confidence and suppressing the remaining boxes whose IoU scores exceed a predefined threshold. We modify this algorithm and use a predefined large bounding box, denoted as an anchor, to merge the ground-truth boxes whose IoU scores with the anchor exceed the threshold ${\tau}_{IBBM}$. The pseudocode of IBBM is shown in Algorithm 1; the function RECENTER relocates the center of ${B}_{i}$ to ensure that the rescaled ${B}_{i}$, with new height ${h}_{b}$ and width ${w}_{b}$, still lies within the image.
Algorithm 1 Iterative Bounding-Box Merge (IBBM)

Input: Bounding boxes of an image $\mathcal{B}={\left\{{B}_{i}\right\}}_{i=1}^{{N}_{\mathcal{B}}}$, classes of the bounding boxes $\mathcal{C}={\left\{{C}_{i}\right\}}_{i=1}^{{N}_{\mathcal{B}}}$, desired bounding box height ${h}_{b}$ and width ${w}_{b}$, non-max merge threshold ${\tau}_{IBBM}$
Output: Merged bounding boxes ${\mathcal{B}}^{\prime}$

1: ${\mathcal{B}}^{\prime}\leftarrow \left\{\right\}$
2: for $i\leftarrow 1$ to ${N}_{\mathcal{B}}$ do
3:     if ${B}_{i}$ is visited then
4:         continue
5:     Flag ${B}_{i}$ as visited
6:     ${B}_{i}^{\dagger}\leftarrow \mathrm{RECENTER}({B}_{i},{h}_{b},{w}_{b})$
7:     ${\mathcal{B}}^{\prime}\leftarrow {\mathcal{B}}^{\prime}\cup \left\{{B}_{i}^{\dagger}\right\}$
8:     for $j\leftarrow i+1$ to ${N}_{\mathcal{B}}$ do
9:         if ${C}_{i}\ne {C}_{j}$ then
10:            continue
11:        if $\mathrm{IoU}({B}_{i}^{\dagger},{B}_{j})>{\tau}_{IBBM}$ then
12:            Flag ${B}_{j}$ as visited
13: return ${\mathcal{B}}^{\prime}$
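To make the procedure concrete, here is a minimal sketch of IBBM in Python. It assumes boxes are $(x_1, y_1, x_2, y_2)$ corner tuples in pixel coordinates; the box format, the clamping inside RECENTER, and the reading that each cluster contributes its recentered anchor to the output are our assumptions, not the paper's implementation.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def recenter(box, w_b, h_b, img_w, img_h):
    """Place a w_b x h_b anchor at the box center, clamped inside the image."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    cx = min(max(cx, w_b / 2.0), img_w - w_b / 2.0)
    cy = min(max(cy, h_b / 2.0), img_h - h_b / 2.0)
    return (cx - w_b / 2.0, cy - h_b / 2.0, cx + w_b / 2.0, cy + h_b / 2.0)

def ibbm(boxes, classes, w_b, h_b, img_w, img_h, tau=0.005):
    """Iterative Bounding-Box Merge: one fixed-size anchor per object cluster."""
    merged, visited = [], [False] * len(boxes)
    for i, box in enumerate(boxes):
        if visited[i]:
            continue
        visited[i] = True
        anchor = recenter(box, w_b, h_b, img_w, img_h)
        merged.append(anchor)
        # Absorb every unvisited same-class box that overlaps this anchor.
        for j in range(i + 1, len(boxes)):
            if visited[j] or classes[i] != classes[j]:
                continue
            if iou(anchor, boxes[j]) > tau:
                visited[j] = True
    return merged
```

Note that when the anchor is much larger than the object boxes, the IoU of the anchor with a fully contained small box is roughly the small box's area divided by the anchor's area, so ${\tau}_{IBBM}$ must be chosen correspondingly small.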

Given the generated ground truth, we follow the ideas in YOLOv3 [39] and treat sub-region extraction as a regression problem. In the extraction phase we only care whether a region contains objects, so we predict just two class scores $({p}_{obj}$ and ${p}_{noobj})$, along with the bounding-box prediction $(x,y,w,h)$, where $(x,y)$ is the center of the bounding box and $(w,h)$ is the box size. Since the width and height of the generated ground truth are fixed to a certain size, we can fix the $(w,h)$ output in this training phase and regress only $(x,y)$. The overall cost function is a weighted sum of the confidence loss ${L}_{conf}$ and the bounding-box regression loss ${L}_{reg}$:
where ${p}_{obj}$ and ${p}_{noobj}$ represent the predicted probabilities that the anchor in a cell contains an object or not, and ${g}_{obj}$ and ${g}_{noobj}$ indicate whether the cell actually contains objects. In the implementation, we use an IoU threshold $\tau =0.5$ to obtain the values of ${g}_{obj}$ and ${g}_{noobj}$. ${p}_{x}$ and ${p}_{y}$ are the predicted center of the bounding box, while ${g}_{x}$ and ${g}_{y}$ are the center of the ground-truth box. ${L}_{bce}$ is the binary cross-entropy loss and ${L}_{mse}$ is the mean squared error loss, which can be represented as follows:
where ${\mathbb{I}}_{ij}^{c}$ is an indicator function matching the $i$th prediction box with the $j$th ground-truth box of category $c$. In this phase, $c$ is always the same because we only have two classes: object and no object. $n$ is the number of prediction boxes, and $m$ is the number of ground-truth boxes.
In most cases, a cell will not contain any objects, so the negative samples greatly outnumber the positive ones. We therefore add ${L}_{noobj}$ as a penalty term to regularize this imbalance, weighted by a scalar hyperparameter ${\lambda}_{1}$.
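Assembling the terms defined above, one consistent form of the coarse-stage objective is the following sketch; the normalization by $n$ and the placement of ${\lambda}_{1}$ on the no-object term are our assumptions:

```latex
L_{coarse} = L_{conf} + L_{reg}, \qquad
L_{conf} = L_{bce}\!\left(p_{obj}, g_{obj}\right)
          + \lambda_{1}\, L_{bce}\!\left(p_{noobj}, g_{noobj}\right), \qquad
L_{reg} = L_{mse}\!\left(p_{x}, g_{x}\right) + L_{mse}\!\left(p_{y}, g_{y}\right)

L_{bce}(p, g) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m}
  \mathbb{I}_{ij}^{c} \left[ g_{j} \log p_{i}
  + \left(1 - g_{j}\right) \log\left(1 - p_{i}\right) \right]

L_{mse}(p, g) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m}
  \mathbb{I}_{ij}^{c} \left( p_{i} - g_{j} \right)^{2}
```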
4.2. Fine Object Detection
Because we fix the width and height of the coarse detector's output, fine object detection can directly use that output to crop from the original image and feed the crop into the network for final detection. Note that in the fine detection phase we resize the image from $1080\times 1080$ to $416\times 416$, instead of scaling the full $3840\times 2160$ image to $416\times 416$. As shown in YOLOv2 [17], prior information about the anchor-box sizes helps the model train better. We follow the modified k-means algorithm to generate the anchor-size priors, using the distance defined in YOLOv2 [17] instead of the standard Euclidean distance, $d(\text{box}, \text{centroid}) = 1 - \mathrm{IoU}(\text{box}, \text{centroid})$,
where we set the number of centroids to 9. $\mathrm{IoU}$ denotes the Intersection over Union of two boxes, a criterion measuring their similarity. In our proposed dataset, the width-to-height ratio varies from extremely short, wide boxes (w/h ratio = 5) to extremely tall, thin boxes (w/h ratio = 0.2). In addition, about 70% of the boxes are smaller than $50\times 50$. When we apply k-means directly to the ground truth, most of the top-9 boxes are small. To provide prior information for larger objects as well, we set thresholds to label the boxes into three types: small, medium, and large. We then run k-means separately for each type with three centroids and combine the nine resulting centroids as the priors for the anchors.
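The grouped anchor clustering can be sketched as follows, assuming ground-truth boxes are given as $(w, h)$ pairs; the size thresholds (50 and 150 pixels) and the plain-Python k-means loop are illustrative choices, not the paper's code.

```python
import random

def wh_iou(a, b):
    """IoU of two boxes aligned at a common corner, each given as (w, h)."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """k-means on (w, h) pairs under the YOLOv2 distance d = 1 - IoU."""
    random.seed(seed)
    centroids = random.sample(boxes, k)
    for _ in range(iters):
        # Assign each box to the centroid with the highest IoU (lowest d).
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda c: wh_iou(box, centroids[c]))
            clusters[best].append(box)
        # Recompute each centroid as the mean (w, h) of its cluster.
        for c, cl in enumerate(clusters):
            if cl:
                centroids[c] = (sum(b[0] for b in cl) / len(cl),
                                sum(b[1] for b in cl) / len(cl))
    return sorted(centroids)

def grouped_anchors(boxes, small=50, large=150):
    """Run k-means (k = 3) per size group and pool the nine priors."""
    groups = {"small": [], "medium": [], "large": []}
    for w, h in boxes:
        size = max(w, h)
        key = "small" if size < small else ("medium" if size < large else "large")
        groups[key].append((w, h))
    anchors = []
    for cl in groups.values():
        if len(cl) >= 3:
            anchors.extend(kmeans_anchors(cl, 3))
    return anchors
```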
Prediction is similar to the process in the coarse detection phase. The differences are that the width and height of the prediction boxes are not fixed in the fine phase, so the network must learn the box size, and that we predict multiple classes rather than just two. We can thus derive the loss function for fine detection. To distinguish these terms from the coarse-phase notation, we denote the bounding-box confidence loss as the fine confidence loss ${L}_{fconf}$ and the regression loss as the fine regression loss ${L}_{freg}$. In addition, since we also predict the class of each bounding box, we have one more loss term for classification, ${L}_{cls}$:
where $C$ is the number of classes, ${p}_{c}^{k}$ is the predicted probability that the cell contains an object of a particular class $k$, and ${g}_{c}^{k}$ is the class of the best-matching ground-truth box. Because we use one-hot encoding for the class prediction, the values of ${p}_{c}^{k}$ and ${g}_{c}^{k}$ will be either 1 or 0. Thus, we use the binary cross-entropy loss for ${L}_{cls}$.
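For reference, the fine-stage loss assembled from these terms presumably takes the form below; summing the class term over the $C$ classes with the same indicator and binary cross-entropy as in the coarse phase is our assumption:

```latex
L_{fine} = L_{fconf} + L_{freg} + L_{cls}, \qquad
L_{cls} = \sum_{k=1}^{C} L_{bce}\!\left(p_{c}^{k},\, g_{c}^{k}\right)
```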
Finally, we combine the coarse-phase and fine-phase losses, giving the overall loss $L = {\lambda}_{c}{L}_{coarse} + {\lambda}_{f}{L}_{fine}$, where ${\lambda}_{c}$ and ${\lambda}_{f}$ are weighting scalars. In the implementation, we first set ${\lambda}_{c}=1,{\lambda}_{f}=0$ to fix the parameters of the fine detector and train the coarse detector. Then we fix the coarse detector and train the fine detector by setting ${\lambda}_{c}=0,{\lambda}_{f}=1$.
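The two-stage schedule can be summarized in a small sketch; the function names are placeholders, and the actual parameter freezing would be done in the chosen training framework rather than through these flags alone.

```python
def overall_loss(l_coarse, l_fine, lambda_c, lambda_f):
    """Weighted sum of the coarse and fine losses: L = lc*Lc + lf*Lf."""
    return lambda_c * l_coarse + lambda_f * l_fine

def stage_settings(stage):
    """Stage 1 trains the coarse detector (fine frozen); stage 2 the reverse."""
    if stage == 1:
        return {"lambda_c": 1.0, "lambda_f": 0.0, "frozen": "fine"}
    return {"lambda_c": 0.0, "lambda_f": 1.0, "frozen": "coarse"}
```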