JMLNet: Joint Multi-Label Learning Network for Weakly Supervised Semantic Segmentation in Aerial Images

Weakly supervised semantic segmentation in aerial images has attracted growing research attention due to the significant savings in annotation cost. Most current approaches rely on one specific pseudo label. Such methods easily overfit the wrongly labeled pixels of the noisy label, which limits the performance and generalization of the segmentation model. To tackle these problems, we propose a novel joint multi-label learning network (JMLNet) that helps the model learn common knowledge from multiple noisy labels and prevents it from overfitting any single label. Our strategy for combining multiple proposals is to regard them all as ground truth, and we propose three new multi-label losses that let the multiple labels jointly guide the segmentation model during training. JMLNet also contains two methods for generating high-quality proposals, which further improve segmentation performance. First, we propose a detection-based GradCAM (GradCAM_D) to generate segmentation proposals from object detectors. Then we use GradCAM_D to adjust the GrabCut algorithm and generate refined segmentation proposals (GrabCut_C). We report state-of-the-art results on the semantic segmentation tasks of the iSAID and mapping challenge datasets when training with bounding box annotations.

boxes supervision. Several weakly supervised segmentation methods [23][24][25][26][27] explore closing the gap between pixel-level supervision and bounding box supervision. These methods mainly refine segmentation proposals from bounding box supervision, then take the segmentation proposals as pixel-level supervision and train a deep FCN model. They mainly use traditional proposals such as CRF [25], MCG [28], and GrabCut [29]. CRF [25] has been broadly used in semantic segmentation; it models the relationship between pixels and enforces more consistent predictions for pixels with similar visual appearance. MCG [28] is a unified approach for bottom-up hierarchical image segmentation and object proposal generation. GrabCut [29] is an image segmentation method based on graph cuts; it requires a bounding box around the object and estimates the color distributions of the target object and background with Gaussian mixture models. BoxSup [23] takes MCG [28] as initial segmentation proposals and updates them iteratively. SDI [30] takes the intersection of MCG [28] and GrabCut [29] as segmentation proposals. Song et al. [27] use dense CRF [25] proposals. All these methods feed one specific proposal to the segmentation model, which easily overfits the wrongly labeled pixels of the noisy label and limits the performance and generalization of the segmentation model. It is therefore natural to tackle these problems by exploiting multiple proposals in the training process.
To train with multiple proposals, traditional combining methods take the intersection [30] of two kinds of segmentation proposals as supervision to reduce the noise. Pixels outside the intersection are ignored in training. In difficult cases these ignored pixels make up the main part of the box area, which reduces the available semantic information and limits segmentation performance. We propose a joint multi-label learning network (JMLNet) to address this issue. The overall pipeline of JMLNet is shown in Figure 1. Instead of using only the intersection of two proposals or one specific proposal, we regard multiple proposals as multi-label supervision and let all noisy proposals contribute to the training process. Specifically, we propose three multi-label losses for training: the multi-label average loss (MA-Loss), the multi-label minimum loss (MM-Loss), and the box-wise multi-label minimum loss (BMM-Loss). These loss functions help the segmentation model learn common knowledge from multiple noisy labels and prevent it from overfitting any single label.
The quality of proposals is vital to weakly supervised semantic segmentation. Previous approaches train the models with MCG, GrabCut, or CRF proposals based on box supervision. Lacking high-level semantic knowledge, these proposals are easily confused in complicated scenes. As shown in Figure 2c, GrabCut confuses a building and a plane because of their similar colors. The low quality of traditional proposals damages the performance of the segmentation model. We address this problem by proposing GradCAM_D and GrabCut_C, which generate high-quality pixel-level proposals. First, GradCAM_D generates visual explanations, and from them proper proposals, from object detectors. Its proposals are reliable because detection networks learn precise semantic information, as shown in Figure 2d. Second, we use GradCAM_D to adjust the GrabCut algorithm and generate proposals denoted GrabCut_C, where C indicates the GradCAM prior; GrabCut_C can simply be seen as GradCAM + GrabCut. As shown in Figure 2e, GrabCut_C proposals are both reliable in the distinguished semantic areas and detailed at instance edges. Our method improves the quality of the segmentation proposals, which further improves the segmentation performance of JMLNet.
We summarize our contributions as follows:
• We propose a novel joint multi-label learning network (JMLNet), which is the first to regard multiple proposals as multi-label supervision for training a weakly supervised semantic segmentation model. JMLNet learns common knowledge from multiple noisy labels and prevents the model from overfitting any single label.
• The GradCAM_D and GrabCut_C methods are proposed to generate high-quality segmentation proposals, which further improve the segmentation performance of JMLNet. These proposals are both reliable in the distinguished semantic areas and detailed at instance edges.

• We report state-of-the-art results on the semantic segmentation tasks of the iSAID and mapping challenge datasets when training with bounding box supervision, reaching quality comparable to the fully supervised model.

Figure 1. The overall pipeline of previous weakly supervised semantic segmentation methods (top) and our proposed JMLNet (bottom). Previous methods generate one specific proposal and use it in the training process. In contrast, we generate multiple proposals as multi-label supervision and use a multi-label loss to train the segmentation model.

Figure 2 (partial caption): ... [29] proposals. (e) We propose GrabCut_C proposals, which perform better than traditional proposals.

Related Work
We introduce the work related to ours: weakly supervised semantic segmentation methods for natural images and for remote sensing and aerial images, region proposal from box supervision, and learning semantic knowledge with noisy labels.

Weakly Supervised Semantic Segmentation of Remote Sensing and Aerial Images
Weakly supervised semantic segmentation methods for remote sensing and aerial images can also be classified into four categories: image-label methods [4,38], point-label methods [39], scribble-label methods [40], and bounding-box-label methods [41]. WSF-NET [4] introduces a feature-fusion network that fuses different-level features of FCN [1] and increases the representational ability of the features. SPMF-Net [38] combines superpixel pooling with segmentation methods and uses low-level features to obtain detailed predictions. Wang et al. [39] use CAM [31] proposals as ground truth and train an FCN-based [1] model. Wu et al. [40] propose an adversarial-architecture-based model for segmentation. Rafique et al. [41] convert the bounding box into probabilistic masks and propose a boundary-based loss function that constrains the edges of the predicted map to stay close to the bounding box. We divide weakly supervised semantic segmentation into two aspects: region proposal from box supervision and learning semantic knowledge with noisy labels.

Region Proposal from Box Supervision
Without proper pixel-level supervision, weakly supervised methods extract region proposals from box supervision; [25,28,29] are the most popular region proposal methods. BoxSup [23] takes MCG [28] as initial segmentation proposals and updates them iteratively. SDI [30] takes the intersection of MCG [28] and GrabCut [29] as segmentation proposals. Song et al. [27] use dense CRF [25] proposals. These region proposal methods extract proposals from class-agnostic low-level features, which leads to confusing proposals in complicated scenes due to the lack of high-level semantic information. To this end, we propose the GradCAM_D method, which generates visual explanations from object detectors and derives proper proposals by thresholding. GradCAM_D generates reliable proposals because the detection network learns precise semantic information. We then use GradCAM_D to adjust the GrabCut algorithm and generate training labels that are both reliable in the distinguished semantic areas and detailed at instance edges.

Learning Semantic Knowledge with Noisy Labels
Although we can use [25,28,29] to generate proposals from bounding box annotations, the proposals are still quite noisy compared with fully supervised labels. How to learn with noisy labels therefore becomes a key problem of weakly supervised semantic segmentation. SDI [30] directly uses the intersection of two kinds of segmentation proposals to reduce the noise. Song et al. [27] use different filling rates as priors to help train the model. These methods all use one specific pseudo label. We are the first to propose combining multiple noisy labels in the training process with JMLNet, which helps the model learn common knowledge from multiple noisy labels and prevents it from overfitting any single label.

Overview
In this section, we introduce the general pipeline of JMLNet. As shown in Figure 3, we collect multiple proposals, such as GrabCut, GradCAM_D, and GrabCut_C proposals, as multi-label supervision and train the segmentation model with the proposed multi-label losses.
Generating pseudo supervision. Besides the popular segmentation proposals derived from bounding box labels, we generate GradCAM_D and GrabCut_C proposals as pseudo supervision. GradCAM_D is a detection-based GradCAM. The first step in generating GradCAM_D proposals is to train an object detector; we choose Faster R-CNN [42], a classical object detector, in our experiments. We then compute GradCAM_D on the feature maps of Faster R-CNN and generate the pixel-level proposals. GradCAM_D is also used to adjust the GrabCut algorithm and generate GrabCut_C proposals. All these proposals contribute to the training process.
Model training with multiple noisy labels. As shown in Figure 1, we choose the popular Deeplab v3 [43] as the semantic segmentation model. Since we collect multiple proposals {CRF, GrabCut, GradCAM_D, GrabCut_C} for a single input image, we propose the multi-label average loss (MA-Loss), the multi-label minimum loss (MM-Loss), and the box-wise multi-label minimum loss (BMM-Loss) to help the model learn common knowledge from multiple noisy labels and prevent it from overfitting any single label.

Multi-Label Losses for Multiple Proposals
Most semantic segmentation methods use the pixel-wise cross entropy loss:

L_CE = −(1/N) Σ_{i=1..N} Σ_{c=1..C} y_{ic} log p_{ic},

where N is the number of pixels, C is the number of classes, y_{ic} ∈ {0, 1} is the ground truth, and p_{ic} ∈ [0, 1] is the estimated probability. Our pseudo proposals derived from bounding box annotations are all noisy, and no single proposal type performs best on all images. Based on this analysis, we propose three multi-label losses that help the model learn common knowledge from multiple noisy labels and prevent it from overfitting any single label: the multi-label average loss (MA-Loss), the multi-label minimum loss (MM-Loss), and the box-wise multi-label minimum loss (BMM-Loss).
Dealing with multiple noisy labels, an intuitive idea is to average the cross entropy losses of the multiple proposals. We denote this as the multi-label average loss (MA-Loss):

L_MA = (1/Z) Σ_{z=1..Z} L_CE(Y_z),

where Y = {Y_1, ..., Y_Z} denotes the set of pseudo labels and Z is the number of proposal types. Alternatively, we compute the cross entropy losses of the multiple proposals and back-propagate only the minimum. We denote this as the multi-label minimum loss (MM-Loss):

L_MM = min_{z∈[1,Z]} L_CE(Y_z).

In weakly supervised segmentation, a set of box-level labeled data D = {(I, B)} is given, where I and B denote an image and its box-level ground truth respectively. Pixels outside B belong to the background class according to the ground truth, so the pixels inside B are the key problem in our case. We categorize the image pixels into two sets, P+ and P−, according to their coordinates:

P+ = {(i, j) | (i, j) inside some box of B},   P− = {(i, j) | (i, j) outside every box of B}.

We compute the per-pixel minimum of the cross entropy losses over the proposals in P+:

L+ = −(1/n+) Σ_{(i,j)∈P+} min_{z∈[1,Z]} Σ_{c=1..C} y^z_{ijc} log p_{ijc},

where y^z_{ijc} denotes the pseudo label of the z-th proposal and n+ is the number of pixels in P+. For all coordinates (i, j) in P−, the label is background, and we use the cross entropy loss in P−:

L− = −(1/n−) Σ_{(i,j)∈P−} log p^b_{ij},

where p^b_{ij} is the estimated probability of the background class and n− is the number of pixels in P−. L+ and L− together make up the box-wise multi-label minimum loss (BMM-Loss):

L_BMM = L+ + L−.

Our proposed MA-Loss, MM-Loss, and BMM-Loss help the model learn common knowledge from multiple noisy labels and prevent the model from overfitting any single label.
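As a minimal sketch of the three losses (assuming dense per-pixel probability maps and one-hot pseudo labels; the function and variable names here are illustrative, not from the paper's code):

```python
import numpy as np

def cross_entropy_map(p, y, eps=1e-12):
    """Per-pixel cross entropy. p: (H, W, C) predicted probabilities,
    y: (H, W, C) one-hot pseudo label. Returns an (H, W) loss map."""
    return -np.sum(y * np.log(p + eps), axis=-1)

def ma_loss(p, labels):
    """MA-Loss: average of the CE losses over all Z proposals."""
    return float(np.mean([cross_entropy_map(p, y).mean() for y in labels]))

def mm_loss(p, labels):
    """MM-Loss: back-propagate only the smallest of the Z CE losses."""
    return float(min(cross_entropy_map(p, y).mean() for y in labels))

def bmm_loss(p, labels, box_mask, bg_index=0, eps=1e-12):
    """BMM-Loss. box_mask: (H, W) bool, True inside a ground-truth box (P+).
    Inside boxes: per-pixel minimum CE over proposals (L+);
    outside boxes: background cross entropy (L-)."""
    ce_maps = np.stack([cross_entropy_map(p, y) for y in labels])  # (Z, H, W)
    l_pos = ce_maps.min(axis=0)[box_mask].mean()                   # L+ over P+
    l_neg = -np.log(p[..., bg_index] + eps)[~box_mask].mean()      # L- over P-
    return float(l_pos + l_neg)
```

Since the minimum of the per-proposal losses can never exceed their average, `mm_loss` is always bounded above by `ma_loss` for the same inputs.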

Pseudo Label Generation by GradCAM_D and GrabCut_C
The GradCAM_D procedure of our approach is shown in Figure 4 and Algorithm 1. To obtain the GradCAM_D map D ∈ R^{u×v} of width u and height v for a target class, we first compute the gradient of the target detection score s with respect to the feature maps M^k, i.e. ∂s/∂M^k_{ij}, where k ∈ [1, K] and K is the number of feature map channels. Global-average-pooling these back-propagated gradients gives the weight α_k, which represents the importance of feature map M^k for the target class:

α_k = (1/(u·v)) Σ_{i=1..u} Σ_{j=1..v} ∂s/∂M^k_{ij}.

We then compute a weighted combination of the feature maps, followed by a ReLU:

D = ReLU(Σ_{k=1..K} α_k M^k).
As shown in Figure 5, GradCAM_D explains why the detector classifies a specific area as a specific class, and it covers the instance region well. Based on this observation, we generate a high GradCAM_D proposal and a low GradCAM_D proposal by applying high and low thresholds to D, as shown in Figure 4 and Algorithm 1. The low GradCAM_D proposal D_l is closer to the ground truth, and we use it as the pseudo label to train the segmentation model. The high GradCAM_D proposal D_h cannot cover all positive pixels of the ground truth but contains fewer false-positive pixels. Unlike CAM [31] and GradCAM [32], which generate visual explanations from a classification network, GradCAM_D generates visual explanations from an object detector. The box supervision is fully used, and the detector learns precise semantic information, which improves proposal quality.
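The weighted combination and the two thresholds can be sketched as follows (a minimal numpy version assuming the detector's activations and score gradients have already been extracted; autograd plumbing is omitted, and the names are ours):

```python
import numpy as np

def gradcam_map(feature_maps, gradients):
    """feature_maps, gradients: (K, u, v) arrays holding the activations M^k
    and the gradients of the target detection score s with respect to them.
    alpha_k is the global average of the gradients (the Grad-CAM weighting)."""
    alpha = gradients.mean(axis=(1, 2))                  # (K,)
    cam = np.einsum('k,kuv->uv', alpha, feature_maps)    # weighted combination
    cam = np.maximum(cam, 0.0)                           # ReLU
    if cam.max() > 0:
        cam = cam / cam.max()                            # normalize to [0, 1]
    return cam

def gradcam_proposals(cam, tau_low=0.15, tau_high=0.8):
    """Low proposal D_l (used as the pseudo label) and high proposal D_h
    (used later as the confident-foreground prior for GrabCut_C)."""
    return cam >= tau_low, cam >= tau_high
```

By construction D_h is a subset of D_l, since any pixel above the high threshold is also above the low one.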
As shown in Figure 6, we take GradCAM_D as a prior and categorize pixels into three sets, D+, D−, and D_u:

D+ = {(i, j) | D(i, j) ≥ τ_h},   D− = {(i, j) | (i, j) outside every box},   D_u = all remaining pixels,

where (i, j) is the pixel coordinate. Pixels in D+ are fixed to foreground, pixels in D− are fixed to background, and pixels in D_u remain uncertain. GrabCut updates the proposals using this foreground and background information; the updated proposals are denoted GrabCut_C proposals. As shown in Figure 2e, GrabCut_C generates proposals that are both reliable in the distinguished semantic areas and detailed at instance edges.
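A sketch of how the GradCAM_D prior could seed GrabCut's initialization mask (the mask label values match OpenCV's cv2.GC_BGD/GC_FGD/GC_PR_FGD constants; the helper name and the exact seeding rule are our assumptions, not the paper's released code):

```python
import numpy as np

# OpenCV's GrabCut mask labels: cv2.GC_BGD, cv2.GC_FGD, cv2.GC_PR_BGD, cv2.GC_PR_FGD
GC_BGD, GC_FGD, GC_PR_BGD, GC_PR_FGD = 0, 1, 2, 3

def grabcut_c_mask(cam, box_mask, tau_high=0.8):
    """Build a GrabCut initialization mask from the GradCAM_D prior.
    cam: (H, W) GradCAM_D map in [0, 1]; box_mask: (H, W) bool, True inside a box.
    D+  (cam >= tau_high inside the box) -> definite foreground,
    D-  (outside every box)              -> definite background,
    D_u (the rest of the box)            -> left uncertain (probable foreground).
    The result could be passed to cv2.grabCut(img, mask, None, bgd, fgd,
    n_iter, cv2.GC_INIT_WITH_MASK) to obtain the refined GrabCut_C proposal."""
    mask = np.full(cam.shape, GC_PR_FGD, dtype=np.uint8)
    mask[~box_mask] = GC_BGD                      # D-: fixed background
    mask[box_mask & (cam >= tau_high)] = GC_FGD   # D+: fixed foreground
    return mask
```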

Figure 5. Visualization of GradCAM_D (examples: ship, roundabout, plane). It shows why the detector classifies a specific area as a specific class and covers the instance region well.
Figure 6. Overview of GrabCut_C. We use GradCAM_D as a prior to GrabCut and generate GrabCut_C. In the GradCAM_D prior, green pixels represent foreground, black pixels represent background, and gray pixels represent the uncertain area. GrabCut takes this information as input and further refines the proposal.

Experiments
In the experiments, we first introduce the experimental setup, then conduct ablation studies on different hyper-parameters, and finally compare our method with state-of-the-art methods.

Experimental Setup
In the experimental setup, we introduce the datasets, evaluation metrics, and implementation details of our experiments.
Dataset: Two aerial image datasets are used in our experiments: the iSAID dataset [44] and the mapping challenge dataset [45]. The iSAID dataset [44] is a semantically labeled extension of the DOTA dataset [46]. It contains 15 object classes and 1 background class. The spatial resolution of its images ranges from 800 to 13,000 pixels, far exceeding that of natural images. We train our method with 1,411 high-resolution images and evaluate on 458 high-resolution images. The mapping challenge dataset [45] contains 1 building class and 1 background class. We train our method with 280,741 images and evaluate on 60,317 images of size 300 × 300 pixels. Although both datasets contain pixel-level semantic segmentation labels, we exploit only the bounding box annotations during training.
Evaluation: To evaluate the performance of our method and compare it with other state-of-the-art methods, we report the mean Intersection-over-Union (mIoU), overall accuracy (OA), true positive rate (TPR), and true negative rate (TNR), following common practice [22,47]:

IoU = TP / (TP + FP + FN),
mIoU = (1/C) Σ_{c=1..C} IoU_c,
OA = (TP + TN) / (TP + FP + TN + FN),
TPR = TP / (TP + FN),
TNR = TN / (TN + FP),

where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, and C is the number of classes.
Implementation Details: For the iSAID dataset, we crop the high-resolution images into 512 × 512 patches. We adopt the classical Deeplab v3 [43] model, which uses the widely used ResNet-50 [48] as backbone. First, we train a Faster R-CNN [42] detection model with the box-level labels of iSAID [44] and use the proposed GradCAM_D and GrabCut_C methods to generate pseudo segmentation proposals for the training set. Second, we train the Deeplab v3 model with GrabCut_C supervision for 50k iterations, then fine-tune it with the proposed loss functions for 10k iterations. We use SGD as the default optimizer with a mini-batch size of 20. We set the initial learning rate to 0.007 and multiply it by (1 − step/max_step)^power with power = 0.9. We apply random horizontal flipping and random cropping to augment the dataset. Our method is implemented in the PyTorch [49] framework. For the mapping challenge dataset, we follow the same basic setting as Rafique et al. [41] for a fair comparison: the Adam optimizer with a learning rate of 5e-4, β1 = 0.9, and β2 = 0.999, a mini-batch size of 16, and 3 training epochs.
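The metrics above reduce to per-class confusion counts. A minimal sketch (function names are ours; the paper does not publish its evaluation code):

```python
import numpy as np

def confusion_counts(pred, gt, cls):
    """TP, FP, TN, FN for one class from flat integer label arrays."""
    tp = int(np.sum((pred == cls) & (gt == cls)))
    fp = int(np.sum((pred == cls) & (gt != cls)))
    fn = int(np.sum((pred != cls) & (gt == cls)))
    tn = int(np.sum((pred != cls) & (gt != cls)))
    return tp, fp, tn, fn

def mean_iou(pred, gt, num_classes):
    """mIoU: mean over classes of TP / (TP + FP + FN),
    skipping classes that never appear in either map."""
    ious = []
    for c in range(num_classes):
        tp, fp, tn, fn = confusion_counts(pred, gt, c)
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return float(np.mean(ious))

def overall_accuracy(pred, gt):
    """OA: fraction of correctly labeled pixels, (TP + TN) / total."""
    return float(np.mean(pred == gt))
```

For example, with predictions [0, 0, 1, 1] against ground truth [0, 1, 1, 1], class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mIoU is 7/12.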

Ablation Study
We conduct two types of ablation studies, including the analysis of the contribution of proposed loss functions and the performance of the proposal with different thresholds.
Proposal quality. We conduct experiments with different proposals and loss functions. We train the Deeplab v3 model with different proposals as pseudo labels, including rectangle proposals, CRF proposals, GrabCut proposals, and our proposed GradCAM_D and GrabCut_C proposals. As shown in Table 1, our GradCAM_D and GrabCut_C proposals achieve 53.88% and 54.24% mIoU respectively, outperforming all the compared proposals. As shown in Figure 7, the main difference between GrabCut and our proposed GradCAM_D and GrabCut_C proposals lies in the edge predictions. Trained with GrabCut labels, the segmentation model tends to make predictions based on low-level features such as color and edges. In hard cases, low-level features cannot represent the target precisely, which leads to wrong predictions. GradCAM_D and GrabCut_C proposals perform better because they are derived from the high-level features learned by the object detector. As shown in Table 2, we also evaluate the effectiveness of GradCAM_D proposals on the iSAID validation set. Since GradCAM_D can be seen as a detection-based GradCAM, we compare it with standard GradCAM proposals within the bounding boxes; the results show that GradCAM_D outperforms standard GradCAM.
Loss selection. As shown in Table 1, our proposed MA-Loss, MM-Loss, and BMM-Loss all improve the segmentation results, with BMM-Loss performing best. We combine different proposals and train the Deeplab v3 model with the proposed loss functions, which improves the segmentation results significantly. In particular, the combination of {GrabCut, GradCAM_D, GrabCut_C} with BMM-Loss achieves the best performance, 55.34% mIoU.
We attribute the superiority of BMM-Loss to the fact that it considers the similarity between the predictions and the multiple proposals pixel-wise within the boxes, whereas MA-Loss and MM-Loss only consider the loss of the whole image. Segmentation performance is not improved by adding rectangle proposals or CRF proposals to {GrabCut, GradCAM_D, GrabCut_C}: compared with these proposals, rectangle and CRF proposals are quite rough and introduce more wrongly labeled pixels, and low-quality proposals hurt the segmentation model. We deal with noisy labels by automatically selecting high-quality labels during training. This partly solves the noisy label problem, but adding bad pseudo labels still hurts performance in practice, and handling the negative influence of noisy labels remains future work.
Threshold τ_l of the low GradCAM_D proposals D_l. The low GradCAM_D proposal D_l depends on one key hyper-parameter, the threshold τ_l. Since D_l is used as the pseudo label to train the segmentation model, τ_l is vital to the final performance. It balances the foreground and background pixels within the box annotations: if τ_l is set to 0, all pixels within the boxes are treated as proposals; as τ_l increases, the proposal area shrinks and only the most distinguished part of GradCAM_D remains. Table 3 shows the influence of τ_l. Because foreground pixels usually occupy most of the area within the boxes, the best τ_l lies among small values. With τ_l = 0.15, using D_l proposals as ground truth achieves the best performance; Table 1 indicates that D_l reaches 53.88% mIoU on the iSAID validation set. We also fix τ_l = 0.15 when generating GrabCut_C proposals.
Threshold τ_h of the high GradCAM_D proposals D_h. The high GradCAM_D proposal D_h depends on the threshold τ_h. We use D_h as the foreground prior to adjust the GrabCut algorithm and generate GrabCut_C. Table 4 shows the influence of τ_h: with τ_h = 0.8, GrabCut_C achieves the best performance, and Table 1 indicates that GrabCut_C reaches 54.24% mIoU on the iSAID validation set.

Comparison with the State-Of-The-Art Methods
For the comparison with state-of-the-art methods, we mainly compare with SDI [30], Song et al. [27], and Rafique et al. [41].
Results of weakly supervised semantic segmentation on the iSAID dataset. As shown in Table 5, our method achieves 55.34% mIoU, 98.58% OA, 61.75% TPR, and 99.63% TNR on the iSAID validation set; per-category IoU is reported in Table 6. Figure 8 shows qualitative segmentation results. Our method outperforms all compared weakly supervised semantic segmentation approaches.
The results indicate that our proposed method effectively learns common knowledge from multiple noisy labels.
Results of weakly supervised semantic segmentation on the mapping challenge dataset. We compare our method with existing state-of-the-art weakly supervised semantic segmentation approaches on the mapping challenge dataset. As shown in Table 7, our method achieves 75.65% mIoU on its validation set, outperforming Rafique et al. [41] by about 1.31% in mIoU, 0.64% in OA, 0.84% in TPR, and 0.33% in TNR. Figure 9 shows qualitative segmentation results. These results indicate that our proposed method is effective across different datasets.
Results of semi-supervised semantic segmentation on the iSAID dataset. We also conduct semi-supervised semantic segmentation experiments and compare with state-of-the-art approaches. In the semi-supervised task, 141 pixel-level labels, 1/10 of the training set, are added for training. As shown in Table 5, our method outperforms all compared methods, achieving 56.76% mIoU, 98.62% OA, 63.25% TPR, and 99.64% TNR; per-category IoU is reported in Table 6. The results indicate that our method remains effective in the semi-supervised setting, with performance very close to the fully supervised model.

Discussion
In this section, we further discuss: (1) the advantages of our method compared with traditional methods, (2) the limits of our method, and (3) potential improvements to the framework.
(1) The advantages of our method. The learning strategy for noisy labels and the quality of proposals are two key problems of weakly supervised semantic segmentation in aerial images. We tackle these problems by exploiting multiple proposals in the training process and by proposing two kinds of high-quality proposals, GradCAM_D and GrabCut_C. The experimental results in Sections 4.2 and 4.3 show that the proposed method effectively improves the performance of weakly supervised semantic segmentation in aerial images.
(2) The limits of our method. Our method needs bounding box annotations, which have two weaknesses for aerial images. On the one hand, bounding box annotations are somewhat more expensive than image-level and point-level annotations. On the other hand, they are not suitable for all semantic segmentation tasks in aerial images: bounding boxes represent airplanes, cars, and buildings well but cannot represent roads, which are closer to lines.
(3) Potential improvements to the framework. As shown in Table 1, although our method improves segmentation results significantly by combining different proposals, performance does not keep increasing as more kinds of proposals are added. In particular, adding rectangle proposals or CRF proposals to {GrabCut, GradCAM_D, GrabCut_C} hurts performance; we attribute this to low-quality proposals damaging the segmentation model. Ideally, our method would ignore most of the noise coming from the noisy labels; handling the negative influence of noisy labels remains an open direction for future work.
Our multi-label combination strategy is simple and could be improved by introducing more advanced statistical methods. Expectation-Maximization is an elegant candidate that we expect to benefit the experiments, and we will explore it in future research.

Conclusions
In this paper, we propose a novel JMLNet, which is the first to regard multiple proposals as multi-label supervision for training a weakly supervised semantic segmentation model. JMLNet learns common knowledge from multiple noisy labels and prevents the model from overfitting any single label. The GradCAM_D and GrabCut_C methods are proposed to generate high-quality segmentation proposals, which further improve the segmentation performance of JMLNet; these proposals are both reliable in the distinguished semantic areas and detailed at instance edges. We report state-of-the-art results on the semantic segmentation tasks of the iSAID and mapping challenge datasets when training with bounding box supervision, reaching quality comparable to the fully supervised model.