Efficient Hybrid Supervision for Instance Segmentation in Aerial Images

Abstract: Instance segmentation in aerial images is of great significance for remote sensing applications, and it is inherently more challenging than in natural images because of cluttered backgrounds, extremely dense and small objects, and objects with arbitrary orientations. Besides, current mainstream CNN-based methods often suffer from a trade-off between labeling cost and performance. To address these problems, we present a pipeline of hybrid supervision. In the pipeline, we design an ancillary segmentation model with a bounding box attention module and a bounding box filter module. It is able to generate accurate pseudo pixel-wise labels from real-world aerial images for training any instance segmentation model. Specifically, the bounding box attention module suppresses the noise in cluttered backgrounds and improves the capability of segmenting small objects. The bounding box filter module removes the false positives caused by cluttered backgrounds and densely distributed objects. Our ancillary segmentation model locates objects pixel-wise instead of relying on horizontal bounding box prediction, which gives it better adaptability to arbitrarily oriented objects. Furthermore, oriented bounding box labels are utilized for handling arbitrarily oriented objects. Experiments on the iSAID dataset show that, when using only 10% pixel-wise labels, the proposed method achieves performance (32.1 AP) comparable to fully supervised methods (33.9 AP) and clearly higher than the weakly supervised setting (26.5 AP).


Introduction
Instance segmentation in aerial images is an important task that benefits various applications, e.g., monitoring of land changes [1], urban management [2] and traffic monitoring [3]. With the fast development of deep convolutional neural networks (CNNs), CNN-based instance segmentation methods are able to reach higher performance. However, the prerequisite for their performance is the availability of large-scale image datasets with accurate manually annotated labels. Current mainstream fully supervised instance segmentation methods [4][5][6][7][8][9] need instance-level pixel-wise labels for training, which are expensive to obtain. Specifically, labeling a bounding box on an object takes 10.2 s on average while labeling a segmentation annotation takes 79 s, which is about 8× slower [10]. Furthermore, aerial images usually have a wide field of view, which means they contain many more objects of interest than natural images. This leads to higher labeling cost. Taking the iSAID dataset [11] as an example, it is a large-scale aerial image dataset where each image has 233.6 instances on average. In comparison, there are only 2.8 instances per image in the natural image dataset PASCAL VOC 2012 [12]. This means the labeling cost of an aerial image is nearly two orders of magnitude higher than that of a natural image.
Although fully supervised instance segmentation methods [4][5][6][7][8][9] for natural images can be extended to aerial images, their labeling cost is unacceptable. Currently, weakly supervised instance segmentation methods that use economical image-level labels [13][14][15] or bounding box labels [16][17][18] have been proposed for natural scenes. To achieve better performance, some of them [16,17] use bounding box labels and rely on hand-crafted heuristics (e.g., GrabCut [19] and MCG [20]) to infer an object mask inside a bounding box as the pixel-wise label for training. Nonetheless, their performance is far behind that of fully supervised methods due to inaccurate labels, and they cannot work well with aerial images because they do not consider challenges in aerial images that do not exist in natural scenes. Specifically, there are three key challenges in aerial images, as illustrated in Figure 1. First, cluttered backgrounds result in severe false positives, especially for small objects. Second, objects (e.g., vehicles) can be extremely small and dense, while huge objects (e.g., ground track fields) can cover a large area. Third, while objects often appear with horizontal orientation in natural images, they can have arbitrary orientations in aerial images. These challenges make it hard for hand-crafted heuristics (e.g., GrabCut [19] and MCG [20]) to obtain accurate object masks from aerial images for training.
In this paper, we aim to achieve satisfactory performance while keeping low labeling cost for aerial image instance segmentation. To this end, we present a pipeline of hybrid supervision that takes advantage of low labeling cost from bounding box labels and high accuracy from pixel-wise labels. It only uses 5-20% of images with instance-level pixel-wise labels and the rest of training images only have bounding box labels. The pipeline consists of an ancillary segmentation model and a primary instance segmentation model. The ancillary segmentation model is designed to generate accurate pseudo pixel-wise labels from real-world aerial images. A small portion of pixel-wise labels is enough for it to learn to recognize the shape of objects. After obtaining accurate pseudo pixel-wise labels with the help of the ancillary segmentation model, the primary instance segmentation model can be easily trained with large amounts of pseudo pixel-wise labels as well as a small portion of pixel-wise labels in a hybrid way and reach similar performance to fully supervised setting.
To address the three key challenges in aerial images, we add two simple but effective modules to the ancillary segmentation model. The bounding box attention module is proposed to suppress the noise in cluttered backgrounds and to deal with densely distributed small objects, and the bounding box filter module is designed to suppress false positives caused by cluttered backgrounds. To handle the arbitrarily oriented objects in aerial images, we adopt oriented bounding boxes instead of horizontal boxes, which helps these two modules work better when arbitrarily oriented objects show up.
In summary, the main contributions of this work are as follows:
• The proposed method balances performance and labeling cost well for instance segmentation in aerial images.
• We propose a pipeline that consists of an ancillary segmentation model and a primary instance segmentation model. The ancillary segmentation model with the bounding box attention module, the bounding box filter module and oriented bounding box labels can effectively address the specific challenges in aerial images, i.e., cluttered backgrounds, extremely dense and small objects, and objects with arbitrary orientations.
• We evaluate our method and achieve 32.1 AP on the challenging iSAID dataset [11] using 10% pixel-wise labels, which is comparable to the fully supervised method (33.9 AP) and much better than the weakly supervised setting (26.5 AP).

Related Work
In this section, we review segmentation methods in natural images and the segmentation methods in aerial images, and the progress of aerial image datasets.
Compared with semantic segmentation, instance segmentation further identifies each object instance in the same semantic category. Most mainstream methods use object proposals, where objects are detected with candidate bounding boxes and then segmented with a binary mask. These methods can be further divided into two categories in terms of the proposal method. One is based on the two-stage object detection framework [4][5][6][31], the other is based on the one-stage object detection framework [7][8][9][32]. The two-stage methods rely on two-stage detectors [33,34], which usually first employ region proposal techniques to obtain regions of interest, and then extract the features of the regions and obtain the predictions of categories, bounding boxes and shapes. The one-stage methods rely on one-stage detectors [35][36][37][38], which require only a single pass through the neural network and directly obtain the predictions.
Annotating instance-level pixel-wise labels for segmentation is labor intensive. Therefore, some recent works exploit weakly supervised methods, which only require image-level labels [13][14][15] or bounding box labels [16][17][18]. Khoreva et al. [16] and Li et al. [17] use GrabCut [19] and MCG [20] to propose the pseudo pixel-wise labels of objects, and then refine them with an iterative label refinement mechanism. Hsu et al. [18] learn a CNN-based model in an end-to-end fashion by using the bounding box tightness prior and multiple-instance learning. Although these weakly supervised methods require less labeling cost, their performance is much inferior to that of fully supervised methods [13][14][15][16][18].
Segmentation in aerial images. The state-of-the-art end-to-end aerial image semantic segmentation models are mostly inspired by the idea of fully convolutional networks [21], which generally consist of an encoder-decoder architecture [21]. Sherrah et al. [39] utilize a recurrent network in fully convolutional network which fuses multi-level features with boundary-aware features to achieve better inferences. Ghosh et al. [40] stack U-Nets architecture to merge high-resolution details and long distance context information at low-resolution image. Hamaguchi et al. [41] introduce local feature extraction module to aggregate local features with decreasing dilation factor.
As for instance segmentation, there are fewer relevant datasets for aerial images than for natural images, and related research mainly focuses on segmenting one particular type of object, e.g., vehicles [42] or ships [43]. Mou et al. [42] introduce a unified multi-task learning network that can simultaneously segment vehicle regions and detect semantic boundaries. Feng et al. [43] address the dense object detection issue by applying a sequence of dilated convolution blocks to progressively learn multi-scale context information and avoid confusion between objects of the same class. To the best of our knowledge, there are still no existing studies on weakly supervised aerial image instance segmentation for multiple types of objects.

Aerial image datasets. Recently, some well-annotated aerial image datasets for object detection [44,45] and semantic segmentation [46,47] have been introduced, which encourage advancements in aerial image analysis for earth observation. However, these datasets do not provide accurate pixel-wise labels for each object instance in an aerial image, so they are not suitable for the instance segmentation task. There do exist a few publicly available instance segmentation datasets [11,48], but some of them focus on a single object category, e.g., [48] only labels building footprints. Currently, the only aerial image dataset with instance-level pixel-wise labels of multiple categories is iSAID [11]. It contains annotations for 655,451 instances of 15 important categories in 2806 high spatial resolution images. Moreover, the iSAID dataset [11] exhibits the following distinctive characteristics: (1) images were collected from multiple sensors and platforms, so the scenes vary and have complex contextual information; (2) it has huge object scale variation, with small, medium and large objects often appearing in the same image; (3) it depicts real-life aerial conditions, in which the distribution of objects is imbalanced and uneven and the orientation of objects is arbitrary. All of these characteristics make the instance segmentation task on the iSAID dataset [11] challenging.

Hybrid Supervision for Instance Segmentation in Aerial Images
In this section, we first formulate the instance segmentation task in aerial images and provide more insights on the choice of label types and three challenges. Then, our proposed method and the implementation details are introduced.

Motivation
In this paper, we focus on instance segmentation in aerial images that locates the objects of interest (Figure 1a-d, e.g., aeroplanes, vehicles and harbors, etc.) with pixel-level accuracy.
Our goal is to balance performance and labeling cost. To this end, we analyze three types of commonly used labels as well as the key challenges in aerial images that do not exist in natural images.
Figure caption fragment: … [49], Cityscapes [50] and COCO [51], respectively, where objects are far fewer and have a similar size.
Choice of label types. To reduce the labeling cost of instance segmentation in aerial images, it is natural to use economical image-level labels or bounding box labels to replace expensive instance-level pixel-wise labels. However, the uncertain locations of objects in image-level labels and the uncertain shapes of objects in bounding box labels harm the learning of the instance segmentation model and lead to inferior performance. To address this problem, as shown in Figure 2, we analyze the three types of labels that are commonly used.
• Image-level labels only provide information about the categories of objects in images and cannot indicate the specific location of each object. They are usually used for image classification [52] but can hardly be used for the instance segmentation task in aerial images, especially when objects of interest are small and densely distributed.
• Bounding box labels provide information about the categories and locations of objects. They are usually used for object detection [12]. Nonetheless, they do not contain information about the shapes of objects, which is important for the instance segmentation task.
• Instance-level pixel-wise labels contain rich information about the category, location and shape of each object of interest, but they are expensive to obtain. They are usually used for instance segmentation [12,50,51].
According to these analyses, combining a few pixel-wise labeled samples with a dominant majority of bounding box labeled samples is a sensible way to reduce labeling cost while keeping satisfactory performance. For the bounding box labeled samples, the per-instance labeling time is roughly 7–8× lower than with pixel-wise labels (10.2 s versus 79 s [10]), and the few pixel-wise labeled samples provide knowledge about the shapes of objects, which helps to recover shapes from bounding box labels and achieve better performance.
Key challenges. While weakly supervised instance segmentation in natural images has been well explored [13][14][15][16][18], those methods cannot be naively applied to aerial images, due to three challenges of weakly supervised instance segmentation in aerial images.
• Cluttered background. Aerial images can cover various scenes rather than specific scenes, e.g., cities, oceans and fields. Furthermore, other factors like trees and shadows of buildings can also disturb detection and segmentation. Therefore, the background (the area without objects of interest) can be highly diverse and easily causes false positives. Taking Figure 1a as an example, a line of cars in the shadow of the buildings is easy to miss, and the shapes of white cars surrounded by zebra crossings are difficult to obtain accurately.
• Extremely dense and small objects. Aerial images are taken from a much longer distance than natural images, which results in an extremely dense distribution of small objects. For example, as shown in Figure 1b, many small vehicles are concentrated in a specific area, and the sizes of these objects are smaller than 10 pixels in the aerial image. At the same time, there also exist extremely large objects, as shown in Figure 1c, making object detection more complex and challenging.
• Arbitrary object orientation. In contrast to conventional datasets for instance segmentation [12,50,51], where objects are generally oriented upward due to gravity, the orientation of objects in aerial images is arbitrary, as shown in Figure 1d.
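To make the cost argument concrete, the following sketch computes the average labeling time per instance under hybrid supervision from the two timings quoted from [10]. It is an illustration only: the function names are hypothetical, it assumes that instances are spread evenly across images (so that a fraction of images corresponds to the same fraction of instances), and it assumes that a pixel-wise labeled instance needs no separate box annotation.

```python
BOX_SECONDS = 10.2   # average time to label one bounding box, from [10]
MASK_SECONDS = 79.0  # average time to label one segmentation mask, from [10]


def hybrid_cost_per_instance(mask_fraction: float) -> float:
    """Average labeling time per instance when `mask_fraction` of the
    instances get pixel-wise labels and the rest get bounding boxes only.

    Assumption: the box of a pixel-wise labeled instance is derived from
    its mask and therefore costs nothing extra.
    """
    if not 0.0 <= mask_fraction <= 1.0:
        raise ValueError("mask_fraction must be in [0, 1]")
    return mask_fraction * MASK_SECONDS + (1.0 - mask_fraction) * BOX_SECONDS


def saving_factor(mask_fraction: float) -> float:
    """How many times cheaper the hybrid setting is than full supervision."""
    return MASK_SECONDS / hybrid_cost_per_instance(mask_fraction)


if __name__ == "__main__":
    for fraction in (0.0, 0.05, 0.10, 0.20, 1.0):
        print(f"{fraction:>4.0%} pixel-wise: "
              f"{hybrid_cost_per_instance(fraction):5.1f} s/instance, "
              f"{saving_factor(fraction):.1f}x cheaper than full supervision")
```

Under these assumptions the saving shrinks as the pixel-wise fraction grows, which is why only a small portion of fully labeled images is attractive.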

Formulation
The overview of our proposed method is illustrated in Figure 3. It adopts a small portion of pixel-wise labeled samples (i.e., fully labeled images) and a dominant amount of bounding box labeled samples (i.e., weakly labeled images).
In this part, we first introduce the idea of hybrid supervision, and then describe the design of the ancillary segmentation model and how we address the challenges in aerial images. As for the instance segmentation model in our pipeline, we use vanilla Mask R-CNN [4] and CenterMask [8] separately in our experiments; please refer to [4,8] for more details. Notice that the instance segmentation model works independently after training and can be replaced by any other instance segmentation model [5,6].

Figure 3. The overview of our pipeline. The ancillary segmentation model is first trained with a small portion of pixel-wise labeled samples, so as to learn to predict high quality pseudo pixel-wise labels on the weakly labeled samples. Then, the instance segmentation model is optimized with the combination of pseudo labels generated by the ancillary segmentation model and the pixel-wise labeled samples. Notice that the instance segmentation model works independently after training.
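The data flow of the pipeline can be summarized in a few lines of Python. This is only a structural sketch: `hybrid_supervision`, `train_ancillary` and `train_primary` are hypothetical names introduced here for illustration, and the two training callables stand in for whatever training code is actually used.

```python
from typing import Any, Callable, List, Sequence, Tuple

Full = Tuple[Any, Any, Any]   # (image, boxes, pixel-wise masks)
Weak = Tuple[Any, Any]        # (image, boxes)


def hybrid_supervision(
    full: Sequence[Full],
    weak: Sequence[Weak],
    train_ancillary: Callable[[Sequence[Full]], Callable[[Any, Any], Any]],
    train_primary: Callable[[List[Full]], Any],
) -> Any:
    """Train the primary model on real plus pseudo pixel-wise labels."""
    # Step 1: fit the ancillary model on the small fully labeled subset.
    ancillary = train_ancillary(full)
    # Step 2: turn every weakly labeled image into a pseudo-labeled sample.
    # The pseudo masks are treated as fixed targets (no gradient flows back).
    pseudo = [(image, boxes, ancillary(image, boxes)) for image, boxes in weak]
    # Step 3: train the primary instance segmentation model on both sets.
    return train_primary(list(full) + pseudo)
```

After step 3 the ancillary model is no longer needed, which matches the note that the instance segmentation model works independently after training.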
Hybrid supervision. As shown in Figure 3, our method uses both fully labeled and weakly labeled images for learning in a hybrid way. Our work differs from previous methods [16][17][18][53] in four significant aspects. Firstly, some existing works [17,54,55] report experimental results on utilizing both fully labeled and weakly labeled images for learning. Nevertheless, [54,55] can only deal with the semantic segmentation task, and [17] simply uses GrabCut [19] and MCG [20] to obtain pseudo pixel-wise labels from bounding boxes and cannot utilize fully labeled images to refine the pseudo labels, which limits its performance. We replace the hand-crafted methods [19,20] and the iterative mechanism [16,17] with a CNN-based ancillary segmentation model for extracting high quality pseudo instance-level pixel-wise labels on weakly labeled images. The pseudo labels we obtain can be seen as a hybrid of knowledge about the shapes of objects from fully labeled images and ground truth information about the locations of objects from weakly labeled images. They provide richer information than the original bounding box labels and help to reach better performance.
Secondly, weak supervision [16,18] uses only weakly labeled images but suffers from uncertain shapes of objects in labels. We utilize a small portion of fully labeled images to provide the knowledge about shapes of objects, avoiding the problem of uncertain shapes in weak supervision [16,18].
Thirdly, semi-supervision [53] utilizes fully labeled images and predicts pseudo labels on unlabeled images for learning. However, noise in the pseudo labels (e.g., false positives in cluttered backgrounds, objects with wrong category labels) harms the learning of models. In our pipeline, the ancillary segmentation model is able to fully use the bounding box labels to suppress the noise in pseudo labels.
Finally, previous works [16][17][18][53] only consider natural images, whereas our method is carefully designed to deal with the challenges in aerial image instance segmentation, i.e., cluttered backgrounds, extremely dense and small objects, and objects with arbitrary orientations, which do not exist in natural images.
Ancillary segmentation model. To implement the method of hybrid supervision, we design an ancillary segmentation model based on DeepLabv3+ [22], which is a reliable semantic segmentation network. The spatial-invariant property of fully convolutional networks makes DeepLabv3+ [22] unable to distinguish different instances located at different positions. To adapt it to instance segmentation, we modify DeepLabv3+ [22] to have two decoder branches and utilize the spatial-embedding loss introduced by Neven et al. [56] for training. The spatial-embedding loss [56] avoids the problem of spatial invariance by assigning each pixel a spatial coordinate and learning position-relative offset vectors. Therefore, the resulting combination of pixel coordinate and offset vector points to the corresponding instance center.
As shown in Figure 4a, after features are extracted by the Xception-65 backbone of DeepLabv3+ [22], two decoder branches follow. The confidence branch decoder consists of a deconvolutional layer, a residual block [57], a deconvolutional layer and a sigmoid layer. The instance branch decoder is similar, except that the margin maps need no sigmoid layer and the pixel offset maps need a tanh layer to predict offset values in [−1, 1]. Notice that the difference in coordinate between two neighboring pixels is 1/800, both in x and y direction, so each pixel can point at most 800 pixels away. The deconvolutional layers in the decoders perform 2× upsampling, and the first deconvolutional layer is followed by a batch normalization layer and a ReLU layer.
The confidence branch is used to obtain a confidence map for each category. The pixel $i$ is more likely to be foreground if its confidence value $d_i$ is close to 1, and a pixel with $d_i \le 0.5$ is regarded as background. The instance branch predicts pixel offset maps and margin maps. The pixel offset value $(o_{ix}, o_{iy})$ in the pixel offset maps is used to obtain the predicted center $(c_{ix}, c_{iy})$:

$$c_{ix} = e_{ix} + o_{ix}, \qquad c_{iy} = e_{iy} + o_{iy},$$

where $(e_{ix}, e_{iy})$ are the coordinates of the pixel. The predicted centers of pixels that belong to the same instance should be close to each other, which can be recognized by clustering. Considering that different objects have different sizes and shapes, they need an object-specific clustering margin. Otherwise, if the clustering margin is kept the same for all objects, two small objects that are next to each other may be clustered into one object, since the margin is relatively large for them. Furthermore, a big object may be clustered into more than one object, because pixels far away from the center may not be able to point into the small region around the center. To handle this, margin maps are learnt for predicting instance margin values $(\sigma_{ix}, \sigma_{iy})$ in the $(x, y)$ directions for each object, which adapt to the size and shape of each object.
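The center computation described above can be sketched with NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function name and the array layout `(2, H, W)` are assumptions made here, while the tanh squashing and the coordinate spacing of 1/800 come from the text.

```python
import numpy as np


def predicted_centers(raw_offsets: np.ndarray, spacing: float = 1.0 / 800) -> np.ndarray:
    """Map each pixel to its predicted instance center, c_i = e_i + o_i.

    raw_offsets: array of shape (2, H, W) with unbounded (x, y) offset logits;
        tanh squashes them into [-1, 1].
    spacing: coordinate difference between neighboring pixels.
    Returns an array of shape (2, H, W) holding (c_x, c_y) for every pixel.
    """
    _, height, width = raw_offsets.shape
    xs = np.arange(width) * spacing     # e_x of each column
    ys = np.arange(height) * spacing    # e_y of each row
    e_y, e_x = np.meshgrid(ys, xs, indexing="ij")   # both of shape (H, W)
    coords = np.stack([e_x, e_y])       # pixel coordinates e_i, shape (2, H, W)
    offsets = np.tanh(raw_offsets)      # offsets o_i in [-1, 1]
    return coords + offsets
```

With zero offsets every pixel points at itself. An offset of 1 corresponds to 800 pixel steps at a spacing of 1/800, which is the reach mentioned in the text.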
The confidence branch and the instance branch are optimized jointly to achieve the best performance, as described in Section 3.3. During inference, we sequentially cluster the foreground pixels (whose confidence value $d_i > 0.5$) into different instances for each category-specific confidence map. The clustering procedure first chooses the pixel with the highest confidence value in the confidence map and then uses the corresponding predicted center as the center $(\hat{C}_{kx}, \hat{C}_{ky})$ of instance $S_k$. The corresponding instance margin values $(\sigma_{kx}, \sigma_{ky})$ are also kept. By using this center and the accompanying margin, we cluster the $i$-th foreground pixel into instance $S_k$ if the $i$-th predicted center $(c_{ix}, c_{iy})$ is close to $S_k$, where the distance is measured by a Gaussian function,

$$\phi_k(c_{ix}, c_{iy}) = \exp\left(-\frac{(c_{ix} - \hat{C}_{kx})^2}{2\sigma_{kx}^2} - \frac{(c_{iy} - \hat{C}_{ky})^2}{2\sigma_{ky}^2}\right),$$

and the pixel is assigned to $S_k$ when $\phi_k$ is sufficiently high (a threshold of 0.5 is used in [56]).
The spatial-embedding loss [56] was designed for predicting high resolution instance segmentation results in urban street scenes. We utilize it to help our ancillary segmentation model predict high resolution, instance-level pseudo pixel-wise labels, which are important for the learning of the primary instance segmentation model. However, it is not enough on its own to deal with the challenges in aerial images and the task of generating pseudo pixel-wise labels. To make our ancillary segmentation model more robust for extracting pseudo labels from aerial images, we add a bounding box attention module and a bounding box filter module to it, and the usage of oriented bounding boxes helps these two modules work better when arbitrarily oriented objects show up.
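The sequential clustering used at inference can be sketched as below. It is a simplified illustration under stated assumptions: the function name and array layout are hypothetical, the Gaussian form and the assignment threshold of 0.5 follow the spatial-embedding formulation of [56] rather than anything specified in this section, and the predicted margins are assumed to be strictly positive.

```python
import numpy as np


def cluster_instances(confidence, centers, margins, fg_thresh=0.5, assign_thresh=0.5):
    """Greedy sequential clustering of foreground pixels into instances.

    confidence: (H, W) confidence map d_i for one category.
    centers:    (2, H, W) predicted centers (c_x, c_y) of every pixel.
    margins:    (2, H, W) predicted margins (sigma_x, sigma_y), all > 0.
    Returns an (H, W) int map: 0 is background, k >= 1 is instance k.
    """
    labels = np.zeros(confidence.shape, dtype=np.int32)
    unassigned = confidence > fg_thresh
    next_id = 1
    while unassigned.any():
        # Seed: the unassigned foreground pixel with the highest confidence.
        masked = np.where(unassigned, confidence, -np.inf)
        sy, sx = np.unravel_index(np.argmax(masked), masked.shape)
        c_hat = centers[:, sy, sx]
        s_hat = margins[:, sy, sx]
        # Gaussian closeness of every predicted center to the seed's center.
        phi = np.exp(-((centers[0] - c_hat[0]) ** 2 / (2 * s_hat[0] ** 2)
                       + (centers[1] - c_hat[1]) ** 2 / (2 * s_hat[1] ** 2)))
        member = unassigned & (phi > assign_thresh)
        member[sy, sx] = True          # the seed always joins its own instance
        labels[member] = next_id
        unassigned &= ~member          # each pixel is assigned at most once
        next_id += 1
    return labels
```

Because every iteration removes at least the seed pixel from the unassigned set, the loop always terminates.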

Design and Learning Details
Bounding box attention module. In our analysis, there are two main obstacles to predicting accurate pseudo pixel-wise labels for small objects: insufficient object feature information and cluttered backgrounds. Small objects lose most of their feature information in deep layers due to the use of pooling layers, which makes it hard to localize them. Meanwhile, cluttered backgrounds may introduce false positives due to their similarity with foreground objects.
We notice that bounding box labels contain the information about locations of small objects, and they also provide the information about the background. Therefore, we regard the bounding box labels as feature maps containing localization information and present a bounding box attention module for encoding bounding boxes to attention maps, as shown in Figure 4b. Though its structure is very simple, it can fully utilize the information of bounding boxes to adjust the feature maps of backbone, e.g., highlight the features of small objects within bounding boxes and decay the features in the area of background.
In detail, the bounding box attention module first converts the bounding box labels to feature maps with N + 1 channels, where N is the number of categories and 1 represents the background. If a given pixel belongs to a bounding box of a specific class, its corresponding category channel is set to 1 and the background channel is set to 0. Please notice that a pixel can belong to multiple bounding boxes. If a given pixel does not belong to any bounding box, the background channel is set to 1 and the other channels are set to 0. Then the bounding box attention module converts the bounding box feature maps to attention maps, which are fused with the feature maps of the Xception-65 backbone [22] using element-wise multiplication. Moreover, the attention maps are generated at 4× and 16× downsampling scales, so that they are adapted to both small objects and large objects.
Figure 5 visualizes a confidence map for small vehicles. Due to the complexity of real-world aerial images, excessive noise can overwhelm the object information. As shown in Figure 5b, some small objects have relatively low confidence scores and the cluttered background introduces severe false positives. After we add the bounding box attention module to the ancillary segmentation model, as shown in Figure 5c, most of the small objects are clearly located and the noise caused by the cluttered background is largely suppressed.
Bounding box filter module. The original clustering procedure in [56] easily leads to inaccurate segmentation results when objects are densely distributed or the background is cluttered. For example, two objects that are very close can be clustered as one object, and a cluttered background may lead to false positives. To address this problem and further improve the quality of the generated pseudo pixel-wise labels, we introduce the bounding box filter module. As shown in Figure 4c, it utilizes the bounding box labels to output a confidence map for each object sequentially before the clustering procedure.
For each object, it separates objects and filters out noise of confidence map in area outside the bounding box, so as to prevent the situation that two objects are clustered as one. Furthermore, background area outside bounding box is not involved in clustering procedure, so the false positives are avoided. After clustering procedure, we choose the instance mask that fits the bounding box best as optimal pseudo pixel-wise label of object. We view the pseudo pixel-wise label along with corresponding bounding box label as ground truth for the training of primary instance segmentation model.
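The box-driven steps of the two modules, rasterizing box labels into N + 1 channel maps for the attention module and restricting the confidence map to one box for the filter module, can be sketched as follows. The sketch makes several assumptions that are not in the text: the function names are hypothetical, boxes are horizontal `(x0, y0, x1, y1)` tuples with exclusive upper bounds (oriented boxes would need a polygon fill instead), and the "best fit" between a mask and its box is scored by the IoU of the labeled box with the tight box around the mask.

```python
import numpy as np


def boxes_to_maps(boxes, labels, num_classes, height, width):
    """Rasterize boxes into an (N + 1, H, W) binary map; the last channel
    is the background channel."""
    maps = np.zeros((num_classes + 1, height, width), dtype=np.float32)
    for (x0, y0, x1, y1), cls in zip(boxes, labels):
        maps[cls, y0:y1, x0:x1] = 1.0   # a pixel may lie in several boxes
    # Background is 1 only where no box of any category covers the pixel.
    maps[num_classes] = 1.0 - maps[:num_classes].max(axis=0)
    return maps


def filter_confidence(confidence, box):
    """Zero the confidence map outside one box, so that clustering for this
    object cannot absorb neighboring objects or background clutter."""
    x0, y0, x1, y1 = box
    filtered = np.zeros_like(confidence)
    filtered[y0:y1, x0:x1] = confidence[y0:y1, x0:x1]
    return filtered


def box_fit(mask, box):
    """IoU between the tight box around `mask` and the labeled box.
    This fit score is an assumption; the text does not define one."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return 0.0
    mx0, my0, mx1, my1 = xs.min(), ys.min(), xs.max() + 1, ys.max() + 1
    x0, y0, x1, y1 = box
    inter_w = max(0, min(mx1, x1) - max(mx0, x0))
    inter_h = max(0, min(my1, y1) - max(my0, y0))
    inter = inter_w * inter_h
    union = (mx1 - mx0) * (my1 - my0) + (x1 - x0) * (y1 - y0) - inter
    return float(inter / union)
```

In such a setup, each object would be processed in turn: filter the confidence map with its box, cluster the remaining foreground pixels, and keep the candidate mask with the highest `box_fit` score as the pseudo label.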

Usage of oriented bounding box. Our ancillary segmentation model is segmentation-based and does not rely on horizontal bounding box prediction to locate objects. Therefore, it has better adaptability to arbitrarily oriented objects. To better deal with arbitrarily oriented objects, we adopt oriented bounding boxes, inspired by [44,58,59], which tackle objects with arbitrary orientations in the detection task. Notice that we use them for solving instance segmentation in aerial images with hybrid supervision, which is more challenging and not a naive extension. Moreover, we do not use oriented bounding box labels directly for training; they are utilized for obtaining pseudo pixel-wise labels with our ancillary segmentation model. Compared with a horizontal bounding box, an oriented bounding box fits an object with arbitrary orientation better. It helps the bounding box attention module generate more accurate attention maps (see Figure 6), which leads to better results. Furthermore, the bounding box filter module can better separate objects whose orientations are arbitrary. Therefore, our ancillary segmentation model is able to obtain more accurate object shapes, which leads to better performance.
Loss function. During training, we utilize the aforementioned spatial-embedding loss [56] for learning. Specifically, the confidence branch is optimized with a pixel-wise L2 loss $L_{conf}$,

$$L_{conf} = \frac{1}{M}\sum_{i=1}^{M}\left(\mathbb{1}_{\{i \in S_k\}}\,\big(d_i - \phi_k(c_{ix}, c_{iy})\big)^2 + \mathbb{1}_{\{i \in bg\}}\,(d_i - 0)^2\right),$$

and $\phi_k$ is a Gaussian function representing the distance to instance $S_k$ as a probability,

$$\phi_k(c_{ix}, c_{iy}) = \exp\left(-\frac{(c_{ix} - C_{kx})^2}{2\sigma_{kx}^2} - \frac{(c_{iy} - C_{ky})^2}{2\sigma_{ky}^2}\right),$$

where $M$ is the number of pixels, $\mathbb{1}$ is the indicator function, $d_i$ is the confidence value and $bg$ represents the background. $\phi_k(c_{ix}, c_{iy})$ has a higher value if pixel $i$ points close to the center of the object. Therefore, a higher value in the confidence maps indicates that the corresponding pixel is closer to the center of an object.
Here $(C_{kx}, C_{ky})$ is the center of instance $S_k$, and $(\sigma_{kx}, \sigma_{ky})$ is defined, as in [56], as the mean of the margin values predicted inside the instance:

$$\sigma_{kx} = \frac{1}{|S_k|}\sum_{i \in S_k}\sigma_{ix}, \qquad \sigma_{ky} = \frac{1}{|S_k|}\sum_{i \in S_k}\sigma_{iy}.$$

The instance branch can be optimized with a cross-entropy loss $L_{inst}$ between $\phi_k(c_{ix}, c_{iy})$ and the ground-truth membership of pixel $i$ in $S_k$. Furthermore, to ensure that $(\sigma_{ix}, \sigma_{iy})$ is close to $(\sigma_{kx}, \sigma_{ky})$, a smoothness term is added:

$$L_{smooth} = \frac{1}{|S_k|}\sum_{i \in S_k}\left[(\sigma_{ix} - \sigma_{kx})^2 + (\sigma_{iy} - \sigma_{ky})^2\right].$$

So the total loss function is

$$L = \lambda_{conf} L_{conf} + \lambda_{inst} L_{inst} + \lambda_{smooth} L_{smooth},$$

where $\lambda_{conf}$, $\lambda_{inst}$ and $\lambda_{smooth}$ are constant coefficients. We choose $\lambda_{conf} = 1$, $\lambda_{inst} = 10$ and $\lambda_{smooth} = 1$ to balance the three losses.
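A NumPy sketch of the three loss terms may help to see how they interact. It is written under assumptions that go beyond the text: the Gaussian form of φ_k and the averaged margin follow [56], the instance center C_k is taken to be the centroid of the ground-truth pixel coordinates, L_inst is a plain binary cross-entropy averaged over pixels and instances, and all function and argument names are hypothetical. The experiments in the paper use the Lovász-hinge loss for L_inst instead.

```python
import numpy as np


def spatial_embedding_losses(conf, coords, centers, margins, instance_ids, eps=1e-7):
    """Return (L_conf, L_inst, L_smooth) for one category.

    conf:         (H, W) confidence values d_i.
    coords:       (2, H, W) pixel coordinates e_i.
    centers:      (2, H, W) predicted centers c_i = e_i + o_i.
    margins:      (2, H, W) predicted margins sigma_i, all > 0.
    instance_ids: (H, W) ints, 0 = background, k >= 1 = instance k.
    """
    l_conf = np.sum(conf[instance_ids == 0] ** 2)     # background target is 0
    l_inst, l_smooth = 0.0, 0.0
    ids = [k for k in np.unique(instance_ids) if k != 0]
    for k in ids:
        mask = instance_ids == k
        c_k = coords[:, mask].mean(axis=1)            # assumed: ground-truth centroid
        s_k = margins[:, mask].mean(axis=1)           # mean margin of the instance
        phi = np.exp(-((centers[0] - c_k[0]) ** 2 / (2 * s_k[0] ** 2)
                       + (centers[1] - c_k[1]) ** 2 / (2 * s_k[1] ** 2)))
        # Confidence of instance pixels should regress to phi_k.
        l_conf += np.sum((conf[mask] - phi[mask]) ** 2)
        # Binary cross-entropy between phi_k and instance membership.
        target = mask.astype(np.float64)
        phi_c = np.clip(phi, eps, 1 - eps)
        l_inst += -np.mean(target * np.log(phi_c) + (1 - target) * np.log(1 - phi_c))
        # Margins inside the instance should agree with their mean.
        l_smooth += np.mean(np.sum((margins[:, mask] - s_k[:, None]) ** 2, axis=0))
    n = max(len(ids), 1)
    return l_conf / conf.size, l_inst / n, l_smooth / n


def total_loss(l_conf, l_inst, l_smooth, w_conf=1.0, w_inst=10.0, w_smooth=1.0):
    """Weighted sum with the coefficients chosen in the text (1, 10, 1)."""
    return w_conf * l_conf + w_inst * l_inst + w_smooth * l_smooth
```

For a perfect prediction (all instance pixels pointing at the centroid, confidence equal to φ_k, constant margins) all three terms are close to zero.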

Experimental Results
Our method is systematically evaluated on the challenging iSAID dataset [11]. In this section, the dataset and evaluation metric are first introduced. Then, we describe the implementation details. After that, we quantitatively and qualitatively evaluate our method. Finally, ablation studies are performed to analyze the proposed modules.

Dataset and Evaluation Metric
Dataset. The iSAID dataset [11] contains 2806 original high spatial resolution images. These images are collected from multiple sensors and platforms with multiple resolutions. The original image size ranges from ∼800 × 800 to ∼4000 × 13,000 pixels. The predefined training set consists of 1411 images, while the validation set contains 458 images and the test set has 937 images. For the instance segmentation task, the iSAID dataset provides 655,451 instance annotations over 15 different object categories, making it the largest dataset for instance segmentation in high spatial resolution remote sensing images. As the official evaluation server of iSAID is still being improved, results on the test set are unavailable. We therefore evaluate our method on the iSAID validation set in the following.

Implementation Details
There are three steps in the whole training procedure. First, we train the ancillary segmentation model with a small portion of images with instance-level pixel-wise labels. Notice that both horizontal and oriented bounding box labels can be obtained from pixelwise labels by using extreme points or algorithm of finding minimum area rectangle, which needs no extra labeling cost. We input both bounding box labels and images to ancillary segmentation model and take pixel-wise labels as the supervisory signal. Second, the ancillary segmentation model generates high quality pseudo instance-level pixel-wise labels from bounding box labels on weakly labeled images. Third, the primary instance segmentation model can be trained with a combination of images with instance-level pixel-wise labels and images with generated pseudo instance-level pixel-wise labels. Here, we describe the implementation details of ancillary segmentation model and instance segmentation model in Figure 3.
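Deriving both box types from a pixel-wise mask can be sketched with NumPy alone. The horizontal box follows directly from the extreme points. For the oriented box, the sketch below swaps in a brute-force search over rotation angles as an approximation of a true minimum-area rectangle algorithm (in practice a library routine such as OpenCV's `cv2.minAreaRect` would normally be used); the function names and the 1° step are assumptions made for illustration.

```python
import numpy as np


def horizontal_box(mask):
    """Tight axis-aligned box (x_min, y_min, x_max, y_max), inclusive,
    from the extreme foreground points of a binary mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def oriented_box(mask, angle_step_deg=1.0):
    """Approximate minimum-area rectangle by trying rotation angles in [0, 90).

    Returns (area, angle_deg, corners), where corners is a (4, 2) array of
    (x, y) points in image coordinates.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(np.float64)
    best = None
    for angle in np.arange(0.0, 90.0, angle_step_deg):
        t = np.deg2rad(angle)
        rot = np.array([[np.cos(t), np.sin(t)], [-np.sin(t), np.cos(t)]])
        p = pts @ rot.T                      # points in the rotated frame
        lo, hi = p.min(axis=0), p.max(axis=0)
        area = (hi[0] - lo[0]) * (hi[1] - lo[1])
        if best is None or area < best[0]:
            # Rotate the axis-aligned corners back to image coordinates.
            corners = np.array([[lo[0], lo[1]], [hi[0], lo[1]],
                                [hi[0], hi[1]], [lo[0], hi[1]]]) @ rot
            best = (float(area), float(angle), corners)
    return best
```

For a diagonal, elongated object the oriented box is far tighter than the horizontal one, which is the property the attention and filter modules benefit from.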
Training procedure of ancillary segmentation model. For optimizing intersection-overunion of each instance, we follow [56] to use Lovász-hinge loss [60] rather than standard cross entropy loss in L inst . The backbone of the ancillary segmentation model is based on Xception-65 [22], which is pre-trained on the ImageNet dataset [52]. We first pre-train our model on 448 × 448 crops, which are taken out of the original 800 × 800 training images. Notice that each 448 × 448 image patch is centered around an object. In this way, we avoid spending too much computation time on background images patches without any instances, which leads to shorter training time. The training iterations and batch-size for pre-training are 40 k and 8, respectively.
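Object-centered cropping can be sketched as follows. The text only states that each 448 × 448 patch is centered around an object; shifting the window back inside the image when the object lies near the border is an assumption of this sketch, as is the function name.

```python
import numpy as np


def object_centered_crop(image, center_xy, crop=448):
    """Cut a crop x crop patch centered on an object.

    image:     array of shape (H, W) or (H, W, C).
    center_xy: (x, y) pixel position of the object center.
    Returns the patch and the (x0, y0) offset of its top-left corner.
    """
    height, width = image.shape[:2]
    cx, cy = center_xy
    # Clamp the window so that it stays inside the image (assumed behavior).
    x0 = int(np.clip(cx - crop // 2, 0, max(width - crop, 0)))
    y0 = int(np.clip(cy - crop // 2, 0, max(height - crop, 0)))
    return image[y0:y0 + crop, x0:x0 + crop], (x0, y0)
```

The returned offset is what would be needed to shift the box and mask annotations into the coordinate frame of the patch.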
Then, we finetune the ancillary segmentation model for another 20 k iterations on 800 × 800 crops with a batch size of 2 to increase performance on bigger objects that cannot fit completely within a 448 × 448 image patch. The batch normalization statistics are kept fixed during this stage for better convergence. We use the ADAM optimizer [61] and polynomial learning rate decay $\left(1 - \frac{\mathrm{iter}}{\mathrm{max\_iter}}\right)^{0.9}$. The initial learning rate is $5 \times 10^{-4}$, which is later decreased to $5 \times 10^{-5}$ for finetuning. The ancillary segmentation model is optimized on two NVIDIA GeForce GTX 1080 Ti GPUs for roughly two days. In addition to random cropping, we also apply random horizontal and vertical mirroring as data augmentation.
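The polynomial decay schedule can be written out as a small function. This is a sketch under stated assumptions: the function name `poly_lr` is hypothetical, and restarting the schedule at the lower initial rate for the finetuning stage is an assumption, since the text does not say whether the decay restarts.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / max_iter) ** power."""
    iteration = min(iteration, max_iter)  # guard against overshooting max_iter
    return base_lr * (1 - iteration / max_iter) ** power

# Pre-training stage: 40 k iterations starting from 5e-4.
pretrain_start = poly_lr(5e-4, 0, 40_000)
pretrain_half = poly_lr(5e-4, 20_000, 40_000)
pretrain_end = poly_lr(5e-4, 40_000, 40_000)

# Finetuning stage (assumed to restart the schedule): 20 k iterations from 5e-5.
finetune_start = poly_lr(5e-5, 0, 20_000)
```

With a power of 0.9 the curve is close to linear, so the rate at the halfway point is slightly above half of the initial value.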
Training procedure of instance segmentation model. We use off-the-shelf Mask R-CNN [4] and CenterMask [8] as our primary instance segmentation models, and the implementation is based on Detectron2. The training data consist of all bounding box labels, a small portion of pixel-wise labels and the pseudo labels generated by our ancillary segmentation model.

Methods for comparison.
To verify the effectiveness of the proposed method, we provide results under different label settings as follows:
• Weak supervision. Since no weakly supervised method exists for instance segmentation in aerial images, we use the bounding box labels as pseudo instance-level pixel-wise labels for training, and report the outcome as the weakly supervised results.
• Full supervision. With all pixel-wise labels available, full supervision can easily reach the best result. For a comprehensive comparison, we provide the fully supervised results with different percentages of pixel-wise labels available (i.e., 5%, 10%, 20% and 100%).
• Weak and full supervision. For a fair comparison, we provide the results of weak and full supervision. Under this setting, the instance segmentation model is trained with both instance-level pixel-wise labels and bounding box labels. Notice that the bounding box labels are used to train the bounding box branch and the classification branch only.
Overall performance. As shown in Tables 1 and 2, with 10% of pixel-wise labels and Mask R-CNN [4] as the instance segmentation model, our hybrid supervision largely surpasses weak supervision (31.2 AP vs. 13.3 AP, 32.1 AP vs. 26.5 AP) and full supervision with 10% of pixel-wise labels available (32.1 AP vs. 22.3 AP). Compared with the weak and full supervision setting, which uses the same percentage of pixel-wise labels and bounding box labels, we achieve a 3.5 AP higher result (32.1 AP vs. 28.6 AP), which is strong evidence of the effectiveness of our proposed method. Furthermore, the AP 50 result of our method is very close to full supervision with 100% of pixel-wise labels available (55.0 AP 50 vs. 56.4 AP 50, i.e., 97.5%). Even with the strictest metric, AP, our hybrid supervision achieves 94.7% of the performance of full supervision (32.1 AP vs. 33.9 AP). This means that our method can achieve performance similar to the 100% fully supervised method with only 10% of pixel-wise labels available. Besides, we provide the results of Mask R-CNN [4] with 5% and 20% of pixel-wise labels. As shown in Table 1, our method still performs well with only 5% of pixel-wise labels, and when the pixel-wise labels increase to 20%, our result is even closer to full supervision (33.3 AP vs. 33.9 AP). These results show that our hybrid supervision adapts well to different percentages of pixel-wise labels.
Table 1. AP, AP 50, AP 75, AP S, AP M and AP L results on iSAID dataset [11].
Table 2. AP results for each category on iSAID dataset [11]. The asterisk "*" indicates using horizontal bounding box labels instead of oriented bounding box labels.
When we switch the instance segmentation model to CenterMask [8], our method with 5%, 10% and 20% of pixel-wise labels also outperforms weak supervision and achieves satisfactory performance compared with 100% full supervision, which shows the stability of our method.
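The relative figures quoted for Mask R-CNN with 10% of pixel-wise labels follow directly from the reported values. The short check below only redoes that arithmetic; the helper name `relative_performance` is hypothetical and no numbers beyond those stated in the text are used.

```python
def relative_performance(ours, reference):
    """Return `ours` as a percentage of `reference`."""
    return 100.0 * ours / reference

# Hybrid supervision (10% pixel-wise labels) vs. 100% full supervision.
ap_ratio = round(relative_performance(32.1, 33.9), 1)    # overall AP
ap50_ratio = round(relative_performance(55.0, 56.4), 1)  # AP 50

# Gain over the weak and full supervision setting with the same labels.
gain_over_weak_and_full = round(32.1 - 28.6, 1)
```

Here `ap_ratio` evaluates to 94.7, `ap50_ratio` to 97.5 and `gain_over_weak_and_full` to 3.5, matching the percentages and the AP gap given in the paragraph on overall performance.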

Performance on small and dense objects. We further evaluate performance on small and dense objects, as they are a key challenge in aerial images. With only 10% of pixel-wise labels, our method with Mask R-CNN [4] effectively saves labeling cost and achieves accuracy similar to the 100% fully supervised result in AP S, i.e., 18.5 AP S vs. 18.8 AP S, while the weakly supervised method only achieves 14.8 AP S. These results show that our method works well with small objects, and this conclusion still holds when we switch the instance segmentation model to CenterMask [8] (17.0 AP S vs. 18.0 AP S).

Qualitative Evaluation
Pseudo pixel-wise labels. Figure 7 shows some pseudo labels generated by our ancillary segmentation model when it is trained with 5% and 10% of pixel-wise labels. It can be seen that cluttered background is handled well, and that densely distributed small objects and arbitrarily oriented objects are clearly identified in our pseudo labels. They are very similar to the ground truth in the training set of the iSAID dataset [11], which explains why our method can achieve similar performance with far fewer pixel-wise labels.
Instance segmentation results. Figure 8 shows some representative instance segmentation results for comparison, and Figure 9 shows more examples. It can be seen that our approach is able to produce high-quality instance segmentation results, even in challenging scenarios, and that the visual quality is very close to that of the 100% fully supervised method. In Figure 8, the example in the first row shows that our method performs well on an image with cluttered background and is able to detect the vehicles in the shadow of buildings. In the second row, although some small objects are missed because of the limitations of the primary instance segmentation model, the recall of dense and small objects is essentially the same as in the fully supervised setting. The example in the third row shows that our method can segment objects with arbitrary orientation well. The example in the fourth row shows that the mask prediction of our method outperforms the weakly supervised setting when segmenting objects with complex shapes such as planes. In short, the overall performance of our method is quite close to the fully supervised setting.

Ablation Study
The quality of the pseudo instance-level pixel-wise labels generated by the ancillary segmentation model significantly affects the final results. In this section, we conduct an ablation study on the bounding box attention module, the bounding box filter module and the oriented bounding box labels to investigate their contribution to the quality of pseudo labels and their effectiveness in addressing the challenges of aerial images. We compare the pseudo labels with ground truth labels and evaluate their quality with the same metrics used in the main experiments, i.e., AP, AP 50, AP 75, AP S, AP M and AP L. We use 10% of pixel-wise labels with 90% of bounding box labels for the experiments and report the results on the validation set of the iSAID dataset [11]. The quantitative results are given in Table 3; to illustrate them more clearly, Figure 10 presents the content of Table 3 as a bar chart. Some examples of pseudo labels are shown in Figure 11 as qualitative results.
Baseline. As shown in Table 3, without bounding box attention modules, bounding box filter modules and bounding box labels, the pseudo labels generated by the ancillary segmentation model have low accuracy. In Figure 11b we can see that there is severe noise in the pseudo labels, e.g., false positives caused by a cluttered background, false positives of small objects and incomplete shapes of objects with arbitrary orientations.
Bounding box attention module. Although the structure of the bounding box attention module is simple, Table 3 and Figure 10 show that it significantly improves the overall results. Specifically, the result on AP S increases from 6.3 to 32.9, which demonstrates its effectiveness in segmenting densely distributed small objects. As for cluttered backgrounds, comparing Figure 11b,c, we can see that the false positives in the background area are largely removed by the bounding box attention module. Furthermore, the shapes of objects in the pseudo labels are more complete. Both the quantitative and the qualitative results show its effectiveness in addressing the challenge of cluttered background.
Bounding box filter module. As shown in Table 3 and Figure 10, the bounding box filter module further improves the overall result, especially on AP 50, which rises from 70.4 to 77.2. In Figure 11, we can see that the false positives in the background area in Figure 11c are removed by the bounding box filter module in Figure 11d.
In short, these results show that the bounding box filter module improves the performance of the ancillary segmentation model by simply removing the false positives outside the bounding box. In this way, it largely reduces the negative effects caused by cluttered background.
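The idea of discarding predictions that fall outside a bounding box can be sketched in a few lines. This is an illustration of the principle only, not the module as implemented in the paper: the names `inside_convex_polygon` and `filter_outside_box` are hypothetical, the box is passed as an ordered list of corner points so that both horizontal and oriented boxes are covered, and pixels are tested at their integer coordinates.

```python
def inside_convex_polygon(point, corners):
    """True if `point` lies inside or on a convex polygon with ordered corners."""
    px, py = point
    crosses = []
    n = len(corners)
    for i in range(n):
        ax, ay = corners[i]
        bx, by = corners[(i + 1) % n]
        # Sign tells on which side of edge (a -> b) the point lies.
        crosses.append((bx - ax) * (py - ay) - (by - ay) * (px - ax))
    return all(c >= 0 for c in crosses) or all(c <= 0 for c in crosses)

def filter_outside_box(pred_mask, corners):
    """Zero out every predicted pixel that lies outside the given box."""
    return [
        [v if inside_convex_polygon((x, y), corners) else 0
         for x, v in enumerate(row)]
        for y, row in enumerate(pred_mask)
    ]

# A noisy prediction that fires on the whole 5 x 5 patch.
noisy = [[1] * 5 for _ in range(5)]

# Oriented box (a diamond) and a horizontal box, both as corner lists.
oriented = [(2, 0), (4, 2), (2, 4), (0, 2)]
horizontal = [(1, 1), (3, 1), (3, 2), (1, 2)]

kept_oriented = sum(map(sum, filter_outside_box(noisy, oriented)))
kept_horizontal = sum(map(sum, filter_outside_box(noisy, horizontal)))
```

In this toy case 13 of the 25 predicted pixels survive the oriented box and 6 survive the horizontal one; everything else is treated as a false positive and removed.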
Horizontal bounding box vs. oriented bounding box. Table 3 shows that using oriented bounding boxes improves the result on AP 75 from 38.8 to 44.5, which means that oriented bounding box labels help the ancillary segmentation model to obtain more accurate segmentation results in aerial images. As shown in Figure 11e, after replacing horizontal bounding box labels with oriented bounding box labels, our ancillary segmentation model is able to generate more accurate pseudo labels for both small objects and objects with arbitrary orientations. This leads to an improvement of about 0.5-0.9 AP for the primary instance segmentation model, as shown in Table 1, which verifies its effectiveness in addressing the challenge of arbitrary orientations in aerial images. More importantly, our pseudo labels in Figure 11e are very close to the ground truth pixel-wise labels, which explains why the proposed method can achieve results similar to the 100% fully supervised setting.
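To see why a gain on AP 75 indicates tighter masks, recall that under the COCO-style convention (assumed here for the iSAID metrics) a predicted mask counts as correct at AP 50 or AP 75 only if its intersection-over-union with the ground truth reaches 0.5 or 0.75, respectively. The sketch below computes mask IoU for two toy pseudo labels; the function name `mask_iou` and the example masks are illustrative only.

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks of equal size."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return intersection / union if union else 0.0

ground_truth = [[1, 1, 1, 1]] * 4                         # 16 foreground pixels
loose_label = [[1, 1, 1, 1]] * 2 + [[1, 1, 0, 0], [0, 0, 0, 0]]  # covers 10 of them
tight_label = [[1, 1, 1, 1]] * 3 + [[1, 1, 0, 0]]                # covers 14 of them

loose_iou = mask_iou(loose_label, ground_truth)   # 10 / 16
tight_iou = mask_iou(tight_label, ground_truth)   # 14 / 16
```

The loose label (IoU 0.625) would count as a match at the 0.5 threshold but not at 0.75, whereas the tight label (IoU 0.875) passes both, so improvements in mask accuracy show up most clearly in AP 75.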

Conclusions
In this paper, we present a pipeline of hybrid supervision for instance segmentation in aerial images. We design an ancillary segmentation model that generates accurate pseudo pixel-wise labels from real-world aerial images and needs only a small portion of pixel-wise labels for training. It largely reduces the labeling cost while helping the primary instance segmentation model to achieve satisfactory performance. The proposed bounding box attention module effectively suppresses the noise from cluttered background in aerial images and improves the capability of segmenting small objects, addressing the key challenges of cluttered background and small objects. The proposed bounding box filter module removes the false positives caused by cluttered background and densely distributed objects, addressing the key challenge of cluttered background. Besides, we replace horizontal bounding box labels with oriented bounding box labels to further improve the ability of the ancillary segmentation model to generate high-quality pseudo pixel-wise labels. On iSAID [11], a recent large-scale instance segmentation dataset for aerial images, we achieve performance (32.1 AP) comparable to the fully supervised setting (33.9 AP) and clearly higher than the weakly supervised setting (26.5 AP).
In the future, it is worth investigating how to design and jointly optimize the primary instance segmentation model to achieve better performance on aerial images.

Conflicts of Interest:
The authors declare no conflict of interest.