Dual Complementary Prototype Learning for Few-Shot Segmentation †

Abstract: Few-shot semantic segmentation aims to transfer knowledge from base classes with sufficient data to represent novel classes with only a few samples. Recent methods follow a metric learning framework with prototypes for foreground representation. However, they still struggle to segment novel classes because of inadequate foreground representation and a lack of discriminability between foreground and background. To address this problem, we propose the Dual Complementary prototype Network (DCNet). First, we design a training-free Complementary Prototype Generation (CPG) module to extract comprehensive information from the mask region in the support image. Second, we design Background Guided Learning (BGL) as a complementary branch of the foreground segmentation branch, which enlarges the difference between the foreground and its corresponding background so that the representation of novel classes in the foreground becomes more discriminative. Extensive experiments on PASCAL-5$^i$ and COCO-20$^i$ demonstrate that our DCNet achieves state-of-the-art results.


Introduction
Owing to the strong representation ability of convolutional neural networks (CNNs) and access to large-scale datasets, semantic segmentation and object detection have developed tremendously. However, it is worth pointing out that annotating a large number of object masks is time-consuming, expensive, and sometimes infeasible in scenarios such as computer-aided diagnosis systems. Moreover, without massive annotated data, the performance of deep learning models drops dramatically on classes that do not appear in the training dataset. Few-shot segmentation (FSS) is a promising field to tackle this issue. Unlike conventional semantic segmentation, which merely segments the classes appearing in the training set, few-shot segmentation uses one or a few annotated samples to segment new classes.
Recent few-shot segmentation methods follow a metric learning framework. They first extract features from both query and support images, and then the support features and their masks are encoded into a single prototype [1] to represent foreground semantics or a pair of prototypes [2,3] to represent the foreground and background. Finally, they conduct dense comparison between the prototype(s) and the query feature. Feature comparison is usually performed in one of two ways: with an explicit metric function (e.g., cosine similarity [3]) or with an implicit metric function (e.g., RelationNet [4]).
As shown in Figure 1a, it is widely recognized [2,5,6] that a single prototype generated by masked average pooling cannot carry sufficient information. Specifically, because of varying appearance and poses, masked average pooling retains only the information of discriminative pixels and ignores the information of plain pixels. To overcome this problem, multi-prototype strategies [2,5,6] have been proposed, which divide the foreground region into several pieces.

Figure 1. (a) Single-prototype methods [1,7] tend to lose the information of plain pixels. (b) Multi-prototype methods [2,5,8] based on regional division may damage the representation of the whole object. (c) Our Complementary Prototype Generation module retains the information of discriminative pixels and plain pixels adaptively.
However, as shown in Figure 1b, these multi-prototype methods still suffer from two drawbacks. First, the representation of the foreground region as a whole is weakened, since existing methods split the region into several pieces and damage the correlation among the generated prototypes. Second, current methods often ignore the inter-class similarity between foreground and background, and their training strategy, which focuses on segmenting the main foreground objects, underestimates the discrimination between the foreground and background. As a result, existing multi-prototype methods tend to misclassify background pixels as foreground.
In this paper, we propose a simple yet effective method, called Dual Complementary prototype Network (DCNet), to overcome the above-mentioned drawbacks. Specifically, it is composed of two branches that segment the foreground and background in a complementary manner, and both segmentation branches rely on our proposed Complementary Prototype Generation (CPG) module. The CPG module is designed to extract comprehensive information from the support set. We first extract the average prototype through global average pooling with the support mask, and then obtain its attention weight on the support image by iteratively calculating the cosine similarity between the foreground feature and the average prototype. In this way, we can easily determine which part of the information is focused on and which part is ignored, without segmenting the support image. We then use this attention weight to generate a pair of prototypes that represent the focused and the ignored regions. By using a weight map to generate the prototypes for comparison, we preserve the correlation among the generated prototypes and avoid information loss to a certain extent.
Furthermore, we introduce background guided learning to pay additional attention to the inter-class similarity between the foreground and background. Considering that the background in support images is not always the same as that in a query image, we adopt a training manner different from that of foreground segmentation, in which the query background mask is used as guidance for background segmentation of the query image. In this way, our model can learn a more discriminative representation for distinguishing foreground from background. The proposed method effectively and efficiently improves the performance on FSS benchmarks without extra inference cost.
The main contributions of this work are summarized as follows.

1. We propose Complementary Prototype Generation (CPG) to learn a powerful prototype representation without extra parameter cost;
2. We propose Background Guided Learning (BGL) to increase the feature discrimination between foreground and background. Besides, BGL is applied only in the training phase, so it does not increase the inference time;
3. Our approach achieves state-of-the-art results on both the PASCAL-5$^i$ and COCO-20$^i$ datasets and improves the performance of the baseline model by 9.1% and 12.6% for the 1-shot and 5-shot settings on COCO-20$^i$.

Semantic Segmentation
Semantic segmentation, which aims to classify each pixel, has been extensively investigated. Following the Fully Convolutional Network (FCN) [9], which uses fully convolutional layers instead of fully connected layers as a classifier for semantic segmentation, a large number of network frameworks have been designed. For example, U-Net adopted a multi-scale strategy and an encoder-decoder architecture to improve the performance of FCN, and PSPNet was proposed to use the pyramid pooling module (PPM) to generate object details. DeepLab [10,11] added an Atrous Spatial Pyramid Pooling (ASPP) module, a conditional random field (CRF) module, and dilated convolution to the FCN architecture. Recently, attention mechanisms have been introduced: PSANet [12] was proposed to use point-wise spatial attention with a bi-directional information propagation paradigm. Channel-wise attention [13] and non-local attention [14-17] are also effective for segmentation. These methods have succeeded on large-scale datasets, but they are not designed to deal with rare and unseen classes and cannot accommodate them without fine-tuning.

Few-Shot Learning
Few-shot learning focuses on the generalization ability of models, so that they can learn to predict novel classes from a few annotated examples [4,18-21]. Matching networks [19] were proposed for 1-shot learning and exploit a special kind of mini-batch called an episode to match the training and testing environments, enhancing generalization to novel classes. The prototypical network [20] was introduced to compute the distances between representation cluster centers for few-shot classification. Finn et al. [21] proposed a model-agnostic algorithm for meta-learning. Even though few-shot learning has been extensively studied for the classification task, it is still hard to apply few-shot learning directly to segmentation because of the dense prediction involved.

Few-Shot Segmentation
As an extension of few-shot learning, few-shot semantic segmentation has also received considerable attention recently. Shaban et al. first proposed the few-shot segmentation problem with a two-branch conditional network that learned the parameters on support images. Different from [22], later works [1-3,23,24] follow the idea of metric learning. Zhang et al. generate the foreground object segmentation of the support class by measuring the embedding similarity between the query and the supports, where the embeddings are extracted by the same backbone model. Generally, metric learning based methods can be divided into two groups. One group is inspired by ProtoNet [20]; e.g., PANet [3] first embeds different foreground objects and the background into different prototypes via a shared feature extractor, and then measures the similarity between the query and the prototypes. The other group is inspired by RelationNet [4], which learns a metric function to measure the similarity; e.g., Refs. [1,7,8] use an FPN-like structure to perform dense comparison with affinity alignment. Then, considering the incomplete representation of a single prototype, Li et al. [5] divide the masked region into pieces, the number of which is decided by the area of the masked region, and then conduct masked average pooling on each piece to generate multiple prototypes. Zhang et al. [6] use the uncovered and covered foreground regions, obtained through segmentation of the support images, to generate a pair of prototypes that recover the lost information. However, compared to the self-segmentation mechanism [6], our CPG does not need to segment the support images and obtains competitive performance at little cost. Compared to clustering methods [5,8], the experiments in the ablation study show that our method can avoid over-fitting and yields stable performance in each setting.
Moreover, recent methods such as MLC [25] and SCNet [26] start to make use of knowledge hidden in the background. By exploiting the pre-training knowledge for the discovery of the latent novel class in the background, their methods bring huge improvements to the few-shot segmentation task. However, we argue that such a method is difficult to apply in realistic scenarios, since a novel class object is not only unlabelled but also unseen in the training set. Instead, we propose background guided learning to enhance the feature discriminability between the foreground and the background, which also improves the performance of the model.

Problem Setting
The aim of few-shot segmentation is to obtain a model that can learn to perform segmentation from only a few annotated support images of novel classes. The few-shot segmentation model is trained on a dataset $D_{train}$ and evaluated on a dataset $D_{test}$. Following a previous definition [22], we divide the images into two non-overlapping sets of classes, $C_{train}$ and $C_{test}$, i.e., $C_{train} \cap C_{test} = \emptyset$. The training set $D_{train}$ is built on $C_{train}$ and the test set $D_{test}$ is built on $C_{test}$. We adopt the episodic training strategy, which has been demonstrated to be an effective approach for few-shot recognition. Each episode is composed of a K-shot support set $S = \{(I_s^k, M_s^k)\}_{k=1}^{K}$ and a query set $Q = \{I_q, M_q\}$, which form a K-shot episode $\{S, I_q\}$, where $I_*$ and $M_*$ are an image and its corresponding mask label, respectively. Then, the training set and test set are denoted by $D_{train} = \{S\}^{N_{train}}$ and $D_{test} = \{Q\}^{N_{test}}$, where $N_{train}$ and $N_{test}$ are the numbers of episodes for the training and test sets. Note that both the support mask $M_s$ and the query mask $M_q$ are provided in the training phase, but only the support mask $M_s$ is available in the test phase.
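The episode structure described above can be illustrated with a minimal sketch. The names `Sample`, `Episode`, and `sample_episode`, as well as the string image identifiers, are hypothetical and only stand in for real images and masks; the sampling procedure itself is an assumption, not the authors' data loader.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    image_id: str  # hypothetical identifier standing in for an image I
    class_id: int  # class whose binary mask M accompanies the image


@dataclass
class Episode:
    support: List[Sample]  # S = {(I_s^k, M_s^k)}, k = 1..K
    query: Sample          # Q = {I_q, M_q}; M_q is withheld from the model at test time


def sample_episode(samples_by_class: Dict[int, List[Sample]],
                   k_shot: int, rng: random.Random) -> Episode:
    """Draw one K-shot episode: pick a class, then K supports and one query."""
    class_id = rng.choice(sorted(samples_by_class))
    picked = rng.sample(samples_by_class[class_id], k_shot + 1)
    return Episode(support=picked[:k_shot], query=picked[k_shot])


# Toy data: two training classes with four images each.
toy = {c: [Sample(f"img_{c}_{i}", c) for i in range(4)] for c in (1, 2)}
episode = sample_episode(toy, k_shot=1, rng=random.Random(0))
```

Drawing the support and query samples without replacement from the same class keeps the query image out of its own support set.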

Overview
As shown in Figure 2, our Dual Complementary prototype Network (DCNet) is trained via the episodic scheme on support-query pairs. In episodic training, the support images and a query image are fed into the shared-weight encoder for feature extraction. Then, the query feature is compared with the prototypes of the current support class to generate a foreground segmentation mask via an FPN-like decoder. In addition, we propose an auxiliary supervision, named Background Guided Learning (BGL), through which our network learns a robust prototype representation for the class-agnostic background in the embedding space. In this supervision, the query feature is compared with the prototypes of the query background to predict its own background. With this joint training strategy, our model can learn a discriminative representation for foreground and background.
Thus, the overall optimization target can be briefly formulated as:

$$L = L_{fg} + \gamma L_{bg} \quad (1)$$

where $L_{fg}$ and $L_{bg}$ denote the foreground segmentation loss and the background segmentation loss, respectively, and $\gamma$ is the balance weight, which is simply set to 1.
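As a rough numerical illustration of this joint objective, the sketch below combines two pixel-wise cross-entropy terms with the balance weight. It assumes a binary cross-entropy over per-pixel probabilities and assumes that the query background mask is the complement of the query foreground mask; the helper names `cross_entropy` and `total_loss` are hypothetical.

```python
import numpy as np


def cross_entropy(prob, target, eps=1e-7):
    """Mean pixel-wise binary cross-entropy between probabilities and a 0/1 mask."""
    p = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))


def total_loss(loss_fg, loss_bg, gamma=1.0):
    """Overall target: L = L_fg + gamma * L_bg."""
    return loss_fg + gamma * loss_bg


mask_q = np.array([[1.0, 1.0], [0.0, 0.0]])   # query foreground mask M_q
mask_bq = 1.0 - mask_q                        # assumed query background mask M_bq
pred_fg = np.array([[0.9, 0.8], [0.2, 0.1]])  # toy foreground-branch probabilities
pred_bg = np.array([[0.1, 0.3], [0.7, 0.9]])  # toy background-branch probabilities

loss = total_loss(cross_entropy(pred_fg, mask_q), cross_entropy(pred_bg, mask_bq))
```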

Figure 2. The framework of the proposed DCNet for 1-shot segmentation. At first, the encoder generates feature maps $F_s$ and $F_q$ from the support images and query images. Then, the support image masks $M_s$ and related features are fed into CPG to generate a pair of foreground prototypes $P_s$. Finally, $P_s$ is expanded and concatenated with the query feature $F_q$ as an input to the decoder to predict the foreground in the query image. In the meantime, in BGL, the query feature $F_q$ and its background mask $M_{bq}$ are fed into CPG to generate a pair of background prototypes $P_{bq}$. $P_{bq}$ is expanded and concatenated with the query feature $F_q$ as an input to the decoder to predict the background in the query image.
In the following subsections, we first elaborate on our prototype generation algorithm. Then, background guided learning in the 1-shot setting is introduced, followed by inference.

Complementary Prototypes Generation
Inspired by SCL [6], we propose a simple and effective algorithm, named Complementary Prototypes Generation (CPG), as shown in Figure 3. The CPG algorithm generates a pair of complementary prototypes and aggregates the information hidden in the features based on cosine similarity. Specifically, given the support feature $F \in \mathbb{R}^{H \times W \times C}$ and the mask region $M \in \mathbb{R}^{H \times W}$, we extract a pair of prototypes to fully represent the information in the mask region.
As the first step, we extract the targeted feature $F' \in \mathbb{R}^{H \times W \times C}$ by filtering $F$ through the mask $M$, as in Equation (2):

$$F' = F \odot M \quad (2)$$

where $\odot$ represents element-wise multiplication. Then, we initialize the prototype $P_0$ by masked average pooling, as in Equation (3):

$$P_0 = \frac{\sum_{i,j} F'_{i,j}}{\sum_{i,j} M_{i,j}} \quad (3)$$

where $i, j$ denote the coordinates of each pixel, and $H$ and $W$ denote the height and width of the feature $F'$, respectively. Since $M_{i,j} \in \{0, 1\}$, the sum of $M$ represents the area of the foreground region. In the next step, we aggregate the foreground features into two complementary clusters. For each iteration $t$, we first compute the cosine similarity matrix $S^t \in \mathbb{R}^{H \times W}$ between the prototype $P_0^{t-1}$ and the targeted feature $F'$ as follows:

$$S^t_{i,j} = \frac{P_0^{t-1} \cdot F'_{i,j}}{\| P_0^{t-1} \| \, \| F'_{i,j} \|} \quad (4)$$

Because we keep the ReLU layer in the encoder, the features are non-negative and the cosine similarity is limited to [0, 1]. To calculate the weight with which each targeted feature contributes to $P_0^t$, we normalize the matrix $S^t$. Then, after the final iteration, we aggregate the features into two complementary prototypes based on the matrix $S^t$. It is worth noting that, unlike in prior methods, these prototypes are not generated from separate regions: the CPG algorithm uses a weight map to generate a pair of complementary prototypes. In this way, we retain the correlation between the prototypes. The whole CPG procedure is delineated in Algorithm 1.

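Algorithm 1 Complementary Prototypes Generation (CPG).
Input: targeted feature $F'$, corresponding mask $M$, number of iterations $T$.
Initialize the prototype $P_0^0$ by masked average pooling over $F'$.
for iteration $t$ in {1, ..., T} do
    Compute the association matrix $S^t$ between the targeted feature $F'$ and the prototype $P_0^{t-1}$;
    Normalize $S^t$ to obtain the weight of each targeted feature and update the prototype $P_0^t$;
end for
Aggregate the targeted features into a pair of complementary prototypes based on $S^T$.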
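The procedure can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the authors' implementation: the normalization (weights that sum to one over the mask region) and the complementary split (the similarity map weights the focused prototype and its complement weights the ignored one) are assumed forms, and the function name `cpg` and the small constant `eps` are hypothetical.

```python
import numpy as np


def cpg(feature, mask, num_iters=3, eps=1e-8):
    """Sketch of Complementary Prototype Generation.

    feature: (H, W, C) non-negative support feature F (post-ReLU).
    mask:    (H, W) binary mask M.
    Returns (p1, p2): prototypes for the focused and the ignored regions.
    """
    mask = mask.astype(feature.dtype)
    target = feature * mask[..., None]                     # F' = F (element-wise) M
    proto = target.sum(axis=(0, 1)) / (mask.sum() + eps)   # masked average pooling

    sim = mask
    for _ in range(num_iters):
        # Cosine similarity between the prototype and every targeted pixel,
        # forced to zero outside the mask region.
        norms = np.linalg.norm(target, axis=-1) * np.linalg.norm(proto) + eps
        sim = (target @ proto) / norms * mask
        # Assumed normalization: weights sum to one over the mask region.
        weights = sim / (sim.sum() + eps)
        proto = (weights[..., None] * target).sum(axis=(0, 1))

    # Assumed complementary split: sim weights the focused pixels,
    # (1 - sim) weights the ignored pixels inside the mask.
    ignored = (1.0 - sim) * mask
    p1 = (sim[..., None] * target).sum(axis=(0, 1)) / (sim.sum() + eps)
    p2 = (ignored[..., None] * target).sum(axis=(0, 1)) / (ignored.sum() + eps)
    return p1, p2


rng = np.random.default_rng(0)
toy_feature = rng.random((4, 4, 8))
toy_mask = np.zeros((4, 4))
toy_mask[1:3, 1:4] = 1
p1, p2 = cpg(toy_feature, toy_mask)
```

Because both prototypes are weighted sums over the same masked pixels, no pixel is assigned exclusively to one region, which is the property the text describes as retaining the correlation between the prototypes.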

Background Guided Learning
In previous works [1,5,6], the background information has not been adequately exploited for few-shot learning. In particular, these methods use only foreground prototypes to make the final prediction on the query image during training. As a result, the representation of the class-agnostic background lacks discriminability. To solve this problem, Background Guided Learning (BGL) is proposed via a joint training strategy.
As shown in Figure 2, BGL segments the background of the query image based on the query background mask $M_{bq}$. As the first step, the query feature $F_q$ and its background mask $M_{bq}$ are fed into the CPG module to generate a pair of complementary prototypes $P_{bq} = \{P_1, P_2\}$, following Algorithm 1. Next, we concatenate the complementary prototypes $P_{bq}$ with every spatial location of the query feature map $F_q$, as in Equation (8):

$$F_m = \mathcal{E}(P_1) \oplus \mathcal{E}(P_2) \oplus F_q \quad (8)$$

where $\mathcal{E}$ denotes the expansion operation, $\oplus$ denotes the concatenation operation, $P_1$ and $P_2$ are the complementary prototypes in $P_{bq}$, and $F_m$ denotes the concatenated feature. Then, the concatenated feature $F_m$ is fed into the decoder to generate the final prediction, as shown in Equation (9):

$$\hat{M} = D(F_m) \quad (9)$$

where $\hat{M}$ is the prediction of the model and $D$ is the decoder. The loss $L_{bg}$ is computed by:

$$L_{bg} = CE(\hat{M}_{bq}, M_{bq}) \quad (10)$$

where $\hat{M}_{bq}$ denotes the background prediction on the query image and CE denotes the cross-entropy loss.
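The expansion and concatenation step can be illustrated with a short NumPy sketch. The channel order of the concatenation (prototypes first, query feature last) and the function name `expand_and_concat` are assumptions made for illustration; the decoder is not modeled here.

```python
import numpy as np


def expand_and_concat(query_feature, p1, p2):
    """Tile two (C,) prototypes over the (H, W) grid and concatenate them
    with the (H, W, C) query feature along the channel axis."""
    h, w, _ = query_feature.shape
    tiled = [np.broadcast_to(p, (h, w, p.shape[0])) for p in (p1, p2)]
    return np.concatenate(tiled + [query_feature], axis=-1)


f_q = np.ones((3, 4, 8))                                    # toy query feature F_q
f_m = expand_and_concat(f_q, np.zeros(8), np.full(8, 2.0))  # toy prototypes P_1, P_2
print(f_m.shape)  # (3, 4, 24)
```

Every spatial location of the result carries both prototypes next to its own query feature, so a decoder can compare them pixel by pixel.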
Intuitively, if the model can predict a good segmentation mask for the foreground using a prototype extracted from the foreground mask region, the prototype learned from the background mask region should be able to segment itself well. Thus, our BGL encourages the model to distinguish the background from the foreground better.

Inference
In the inference phase, we keep only the foreground segmentation branch for the final prediction. For the K-shot setting, we follow previous works and use the average over the K support samples to generate a pair of complementary prototypes.

Datasets
We evaluate our algorithm on two public few-shot datasets: PASCAL-5$^i$ [22] and COCO-20$^i$ [27]. PASCAL-5$^i$ is built from the PASCAL VOC 2012 and SBD datasets, and COCO-20$^i$ is built from the MS-COCO dataset. In PASCAL-5$^i$, the 20 object classes of PASCAL VOC are split into 4 groups of 5 categories each. Similarly, in COCO-20$^i$, we divide MS-COCO into 4 groups of 20 categories each. For both PASCAL-5$^i$ and COCO-20$^i$, we evaluate our approach based on PFENet: we use the same category division and, as in PFENet, randomly sample 20,000 support-query pairs for evaluation.
For both datasets, we adopt 4-fold cross-validation, i.e., the model is trained on three folds (base classes) and tested on the remaining one (novel classes). The experimental results are reported on each test fold, and we also report the average performance over all four test folds.
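The cross-validation protocol can be sketched as a simple class split. The sketch assumes that classes are numbered from 1 and grouped consecutively into folds; the actual category division follows PFENet and may differ, especially for COCO-20$^i$. The function name `fold_classes` is hypothetical.

```python
def fold_classes(fold, num_folds=4, classes_per_fold=5):
    """Return (train_classes, test_classes) for one cross-validation fold.

    Assumes classes are numbered 1..N and grouped consecutively.
    """
    if not 0 <= fold < num_folds:
        raise ValueError("fold index out of range")
    all_classes = list(range(1, num_folds * classes_per_fold + 1))
    start = fold * classes_per_fold
    test = all_classes[start:start + classes_per_fold]
    train = [c for c in all_classes if c not in test]
    return train, test


# PASCAL-style split: 4 folds of 5 classes.
train_classes, test_classes = fold_classes(fold=0)
print(test_classes)  # [1, 2, 3, 4, 5]

# COCO-style split: 4 folds of 20 classes.
coco_train, coco_test = fold_classes(fold=3, classes_per_fold=20)
```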

Evaluation Metrics
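Following previous work [7,27], we use the widely adopted class mean intersection over union (mIoU) as our major evaluation metric for the ablation study, since the class mIoU is more reasonable than the foreground-background IoU (FB-IoU), as stated in [7]. For each class, the IoU is calculated by

$$IoU = \frac{TP}{TP + FP + FN}$$

where $TP$, $FP$, and $FN$ denote the numbers of true positive, false positive, and false negative pixels, respectively. The class mIoU is the average of the IoUs over all classes.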
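As a minimal sketch of the class mIoU computation, the code below accumulates true positive, false positive, and false negative pixel counts per class and then averages the per-class IoU. Accumulating the counts over all test episodes of a class (rather than averaging per-image IoUs) is an assumption, and the function name `class_miou` is hypothetical.

```python
import numpy as np


def class_miou(records):
    """records: iterable of (class_id, pred_mask, gt_mask) with binary masks.

    Accumulates TP/FP/FN per class, then averages the per-class IoU.
    """
    stats = {}
    for class_id, pred, gt in records:
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        tp, fp, fn = stats.get(class_id, (0, 0, 0))
        stats[class_id] = (tp + int((pred & gt).sum()),
                           fp + int((pred & ~gt).sum()),
                           fn + int((~pred & gt).sum()))
    ious = [tp / (tp + fp + fn) for tp, fp, fn in stats.values() if tp + fp + fn > 0]
    if not ious:
        raise ValueError("no class with foreground pixels")
    return sum(ious) / len(ious)


records = [
    (1, [[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # class 1: TP=1, FP=1, FN=0 -> IoU 0.5
    (2, [[0, 1], [0, 1]], [[0, 1], [0, 1]]),  # class 2: perfect prediction -> IoU 1.0
]
miou = class_miou(records)
print(miou)  # 0.75
```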

Implementation Details
Our approach is based on PFENet [1] with ResNet-50 as the backbone to allow a fair comparison with the other methods. Following previous work [1,5,6], the parameters of the backbone are initialized with ImageNet pre-trained weights and kept fixed during training. Other layers are initialized with the default settings of PyTorch. For PASCAL-5$^i$, the network is trained for only 100 epochs with an initial learning rate of 2.5 × 10$^{-3}$, a weight decay of 1 × 10$^{-4}$, and a momentum of 0.9. The batch size is 4. For COCO-20$^i$, the network is trained for 50 epochs with a learning rate of 0.005 and a batch size of 8. We use data augmentation during training. Specifically, input images are randomly scaled, horizontally flipped, and rotated within [−10, 10] degrees, and then all images are cropped to 473 × 473 (for PASCAL and COCO) or 641 × 641 (for COCO) as the training samples, for fair comparison. We implemented our model on four RTX 2080 Ti GPUs.
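For reference, the stated hyperparameters can be collected in a small configuration sketch. The class and field names are hypothetical; the weight decay and momentum are stated only for PASCAL-5$^i$ and are assumed to carry over to COCO-20$^i$, and the crop size is set to 473 by default even though 641 is also mentioned for COCO.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainConfig:
    epochs: int
    learning_rate: float
    batch_size: int
    weight_decay: float = 1e-4               # stated for PASCAL; assumed for COCO
    momentum: float = 0.9                    # stated for PASCAL; assumed for COCO
    rotate_range: Tuple[int, int] = (-10, 10)
    crop_size: int = 473                     # 641 is also mentioned for COCO


PASCAL = TrainConfig(epochs=100, learning_rate=2.5e-3, batch_size=4)
COCO = TrainConfig(epochs=50, learning_rate=5e-3, batch_size=8)
```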

COCO-20$^i$ Result
COCO-20$^i$ is a very challenging dataset that contains a large number of objects in realistic scene images. We compare our approach with others on this dataset, and our approach outperforms the other approaches, as shown in Table 1. It can be seen that our approach achieves state-of-the-art performance in both the 1-shot and 5-shot settings, with mIoU gains of 0.3% and 0.5%, respectively. Furthermore, compared to our baseline (PFENet with ResNet101), our approach (with ResNet101) obtains mIoU increases of 9.1% and 12.6% in the 1-shot and 5-shot settings. In Table 2, our method obtains a top-performing 1-shot result and a competitive 5-shot result with respect to FB-IoU. Once again, these results demonstrate that the proposed method is able to deal with more complex cases, since MS-COCO is a much more challenging dataset with diverse samples and categories.

PASCAL-5$^i$ Result
In Table 3, we compare our method with other state-of-the-art methods on PASCAL-5$^i$. It can be seen that our method achieves performance on par with the state of the art in both the 1-shot and 5-shot settings. Additionally, our method significantly improves the performance of PFENet in the 1-shot and 5-shot segmentation settings, with mIoU increases of 1.6% and 4%, respectively. In Table 4, our method obtains competitive 1-shot results and top-performing 5-shot results with respect to FB-IoU. In Figure 4, we report some qualitative results generated by our approach with PFENet [1] as the baseline. Our method is capable of making correct predictions, and each part of our method can independently improve the performance of the model.

Table 3. Comparison with state-of-the-art methods on PASCAL-5$^i$ for 1-shot and 5-shot settings. For fair comparison, all methods are evaluated with backbone ResNet50 and tested on labels with original sizes. Bold denotes the best performance and red denotes the second best performance.

Table 4. Comparison of FB-IoU on PASCAL-5$^i$ for 1-shot and 5-shot settings. We used ResNet50 as the backbone.

Ablation Study
To verify the effectiveness of our proposed methods, we conduct extensive ablation studies with a ResNet-50 backbone on PASCAL-5$^i$.

The Effectiveness of CPG
To verify the effectiveness of CPG, we conduct several experiments on prototype generation and compare it with other prototype generation algorithms. Since CPG is a kind of soft clustering algorithm, we first compare it with the Adaptive K-means Algorithm (AK) provided by ASGNet [5] and a traditional algorithm, the Expectation-Maximization Algorithm (EM), as shown in Table 5. Compared to the baseline, both AK and EM degrade the segmentation performance in the 1-shot setting, while our CPG offers a 0.6% improvement over the baseline. Compared to SCL [6], which needs to segment both support images and query images, our approach requires less computation and inference time (Table 6) with competitive results in both the 1-shot and 5-shot settings. These results indicate the superiority of CPG on the few-shot segmentation task.

The Effectiveness of BGL
To demonstrate the effectiveness of our proposed BGL, we conduct both qualitative and quantitative analyses of BGL. We assume that BGL could affect the feature representation in two ways. The first is an enhancement of the feature representation for the novel classes, and the second is better discrimination between the class-specific (foreground) feature and the class-agnostic (background) feature. Following [28], we measure the inter-class variance, the intra-class variance, and the discriminative function φ. Here, φ is defined as the inter-class variance divided by the intra-class variance.
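The discriminative function φ can be illustrated with a short sketch. The exact variance definitions of [28] are not given here, so the code assumes common ones: the inter-class variance is the mean squared distance of the class centers to the global center, and the intra-class variance is the mean squared distance of the samples to their own class center. The function name `discriminative_function` is hypothetical.

```python
import numpy as np


def discriminative_function(features, labels, eps=1e-12):
    """Return (inter_var, intra_var, phi) with phi = inter_var / intra_var.

    features: (N, C) feature vectors; labels: (N,) class labels.
    The variance definitions are assumed, see the accompanying text.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    global_center = features.mean(axis=0)
    inter, intra = [], []
    for c in np.unique(labels):
        members = features[labels == c]
        center = members.mean(axis=0)
        inter.append(np.sum((center - global_center) ** 2))
        intra.append(np.mean(np.sum((members - center) ** 2, axis=1)))
    inter_var, intra_var = float(np.mean(inter)), float(np.mean(intra))
    return inter_var, intra_var, inter_var / (intra_var + eps)


# Two well-separated toy classes (e.g., foreground vs. background features).
toy_features = [[0, 0], [0, 2], [10, 0], [10, 2]]
toy_labels = [0, 0, 1, 1]
inter_var, intra_var, phi = discriminative_function(toy_features, toy_labels)
```

In this toy case the class centers lie 5 units from the global center and every sample lies 1 unit from its class center, so φ is close to 25; a larger φ indicates features that are easier to separate.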
As shown in Figure 5a,b,d, BGL not only enlarges the inter-class variance for novel classes but also increases intra-class variance for novel classes. In other words, BGL does not improve the representation discriminability for novel classes. However, as shown in Figure 5c,e, BGL enlarges the inter-class distance and increases the discriminative function φ between the foreground and the background. Therefore, the effectiveness of BGL is in the promotion of discrimination between the foreground and background.

The Effectiveness of BGL and CPG
To demonstrate the effectiveness of both CPG and BGL, ablation studies are conducted on PASCAL-5$^i$, as shown in Table 6. Compared with the baseline, using CPG or BGL alone improves the performance by a large margin, 1.7% and 2.6% mIoU in the 5-shot setting, respectively. In addition, we show that using CPG alone achieves the current SOTA performance provided by SCL [6], and that using BGL alone surpasses the state-of-the-art performance by 2.2% mIoU. Combining CPG and BGL achieves higher performance than either alone, with a 4% improvement in total. In Figure 4, we show that using CPG or BGL alone may generate wrong segmentations on the background, but a combination of them improves the results. In Figure 6, we show some representative heatmap examples, which further show how the combination of CPG and BGL helps the model segment precisely and accurately.

Conclusions
In this paper, we propose a novel few-shot semantic segmentation method named DCNet, which is composed of CPG and BGL. Our approach is able to extract comprehensive support information through our proposed CPG module and generate a discriminative feature representation for background pixels through BGL. Extensive experiments demonstrate the effectiveness of our proposed method.

Institutional Review Board Statement: Ethical review and approval were waived for this study because no humans or animals were involved.