Guided Networks for Few-Shot Image Segmentation and Fully Connected CRFs

Abstract: The goal of few-shot learning is to learn quickly in a low-data regime. Structured output tasks such as segmentation are challenging for few-shot learning because their outputs are high-dimensional and statistically dependent. For this problem, we propose improved guided networks and combine them with a fully connected conditional random field (CRF). The guided network extracts task representations from annotated support images through feature fusion to perform fast, accurate inference on new unannotated query images. By bringing together few-shot learning methods and fully connected CRFs, our method achieves accurate object segmentation by overcoming the poor localization properties of deep convolutional neural networks, and it can quickly update to new tasks, without further optimization, when faced with new data. Our guided network is at the forefront of accuracy in terms of annotation volume and time.


Introduction
In deep learning, each class requires at least thousands of training samples for a convolutional neural network to saturate its performance on known categories. In addition, the generalization ability of neural networks is weak. When a novel class arrives, it is difficult for the model to learn the new concept from a small number of labeled samples. Humans, however, can learn quickly from a few samples, or even a single one: people can often identify an object in a picture after seeing just one example. Inspired by this rapid learning ability, researchers hope that a machine learning model, after learning from a large amount of data for certain categories, will need only a few samples to learn a new category. This motivated the emergence of few-shot learning methods [1][2][3]. Current few-shot learning methods mainly rely on meta-learning to adapt to new tasks. However, these methods focus on classification rather than structured output tasks.
Image segmentation is a core task of visual recognition, and end-to-end systems have achieved strong performance on it. Although deep convolutional neural networks (DCNNs) have made great progress in image segmentation, there is evidence that the responses of the last layer of a DCNN are not sufficient to locate target boundaries accurately [4]. Convolutional neural network models are poor at capturing fine edge details and do not adapt to long-range dependencies. To address limited training data and precise segmentation at the same time, we propose combining the few-shot learning method with the fully connected pairwise conditional random fields (CRFs) proposed by Krähenbühl and Koltun [5], chosen for their efficient computation and localization performance.
Specifically, we address the following few-shot segmentation problem: given only a few support images with sparse pixelwise annotations that indicate the task, segment unannotated images accordingly. Our framework operates at the pixel level; that is, both the input and the output are pixelwise. Pixel annotations are therefore propagated within and across images to unannotated pixels for inference. By optimizing the guided network, we can infer the latent task representation defined by the sparse pixelwise annotations, and the new query image, which has no pixel annotations, is then segmented according to this representation. Our guided network requires as few as two annotated pixels per concept (one positive and one negative) to segment a new concept, and it can incorporate further annotations to update and improve inference. Unlike some existing methods, which may fail to segment specific tasks in very sparse regimes, our method covers the spectrum from a single annotated pixel to dense full masks.
In this paper, we propose a new class of guided networks combined with fully connected CRFs (see Figure 1). Our model is composed of three well-established components: a guided branch, a segmentation branch, and fully connected CRFs. Given an annotated support set, the guide, g, extracts a latent task representation, R, and uses it to direct the segmentation of query images. We introduce a new mechanism for merging images and annotations when encoding the support, which greatly improves learning time and inference accuracy. For the segmentation branch, we design a small convolutional network, which can be understood as a learned distance metric from support to query; under the guidance of the task representation, R, the segmentation branch extracts the foreground object of the image and generates a coarse segmentation. Once trained, our model needs no further optimization to handle new few-shot tasks. Finally, we use fully connected CRFs to refine the details of the output and localize boundaries precisely. The main contributions of this paper are threefold: (1) we implement an image segmentation algorithm that combines the few-shot learning method with the fully connected conditional random field (CRF) and obtains relatively good segmentation results; (2) we introduce a new mechanism for merging images and annotations, which improves learning time and inference accuracy and propagates pixel annotations across different images; and (3) we append the fully connected CRF to the guided network, which improves the network's ability to capture detailed features and achieves accurate segmentation of objects.

Related Work
Our framework achieves accurate image segmentation with few training samples. Deep learning has made tremendous progress in image segmentation [6][7][8][9]. However, its need for a large amount of labeled data is a bottleneck, which has led to the exploration of few-shot learning methods. We concentrate on one-shot, semi-supervised, and interactive methods; we also review the relationship between few-shot learning methods and structured output.

Few-Shot Learning
Few-shot learning addresses the problem of limited labeled datasets, which generally contain only a few training samples of the target class [10]. Although interest in few-shot learning is increasing, most current research focuses on classification [11][12][13] rather than structured output, and little attention is paid to sparse and imbalanced supervision. Shaban et al. [14] were the first to apply one-shot learning to semantic image segmentation; their method requires only one image and its corresponding pixel-level annotation per class. Few-shot learning is data-efficient; at the extreme, one-shot learning requires only a single annotation of a new concept.
To situate our study, we review few-shot methods with respect to visual structured output tasks such as segmentation. Some few-shot learning methods perform optimization during inference, by gradients or by a learned recurrent optimizer [15,16]. Notably, most of these methods make few assumptions about task and architecture, but they have not been validated on the skewed distributions and high dimensionality of segmentation. Motivated by the Siamese networks [17] used for metric learning [18,19], few-shot learning as embedding learns a metric and retrieves the nearest target from the support. Although these methods are fast and fairly simple [20] on small datasets, they struggle with higher shot and way. Methods that regress model parameters from the support are difficult to extend from one-shot to few-shot.

Segmentation
There are many types of segmentation, e.g., semantic segmentation, instance segmentation, and panoptic segmentation. We take semantic and interactive segmentation as our main challenges (see Figure 2). The fully convolutional network (FCN) [6] is a pioneering work that applies a convolutional neural network (CNN) structure to semantic image segmentation and achieves outstanding results. However, its results are not precise enough to delineate the details of the target. For semantic segmentation, Shaban et al. [14] designed a segmentation architecture for one-shot learning, which needs only a few training images but requires dense annotations and supervision at training time. Our guided architecture only requires a few randomly chosen positive and negative points (foreground and background points) on each training image. We mainly draw on feedforward guidance for few-shot learning, which makes our method faster and more accurate. For interactive segmentation, Xu et al. [21] introduced a state-of-the-art method: they feed the original image, together with Euclidean distance maps computed from the foreground and background annotations, into a fully convolutional network to generate a probability map. However, this method cannot propagate pixel annotations across different images, which is a bottleneck for annotation efficiency. In contrast, our approach can segment new inputs independently, so we can achieve accurate segmentation even when the support and query images are different. This is an advantage over the interactive segmentation of Xu et al. [21], and we regard interactive segmentation as a special case of few-shot segmentation.

Fully Connected CRFs
Structured prediction tasks such as image segmentation benefit from conditional random fields and other probabilistic graphical models. CRFs are often used for pixel-level label prediction. Traditionally, a CRF was used to smooth noisy segmentation maps [22,23]. Generally, these models contain energy terms that couple adjacent nodes, so that the same label is assigned to spatially proximal pixels. The basic CRF model is a graphical model composed of unary potentials and pairwise potentials defined over adjacent elements. A disadvantage of the basic CRF model in image tasks is that it considers only the adjacent neighborhood rather than the whole image, so it loses some context information. This led to a further idea: connect each pixel by an edge to every other pixel, yielding a dense, fully connected model, that is, fully connected CRFs [5]. The fully connected CRF gathers information from all nodes rather than only from neighbors, thereby obtaining more accurate segmentation results [9,24,25].

Few-Shot Segmentation
In few-shot learning, the training set contains many categories, with multiple samples in each category. In the training phase, N classes are randomly selected from the training set, and K samples of each class (N × K samples in total) are assembled into a meta-task as the support set input of the model. Then, a batch of samples is drawn from the remaining data of these N classes as the query set. That is, the model is required to learn to distinguish these N classes from N × K samples. Such a task is called the N-way, K-shot problem [10,15,26]. For few-shot image segmentation, we also need to add a pixel dimension to this setting, as annotations may be spatially dense or sparse. We have to consider both the number of support images and the number of annotated pixels per support image. We denote the number of pixel annotations per support image by P and consider (K, P)-shot learning for different K and P. We pay particular attention to sparse annotation, that is, the case where P is very small, because it reduces the cost of annotation and is more practical to collect. More importantly, it only asks the user to point to the segment of interest. Furthermore, we deal with mixed-shot learning, where the amount of annotation varies across classes and tasks.
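The N-way, K-shot episode construction described above can be sketched in a few lines. This is a minimal illustration only: the function name `sample_episode`, the `images_by_class` dictionary, and the `n_query` parameter are hypothetical names introduced here, and the pixel dimension P of the (K, P)-shot setting is not modeled.

```python
import random


def sample_episode(images_by_class, n_way, k_shot, n_query, rng=None):
    """Draw one N-way, K-shot episode (illustrative helper, not from the paper).

    images_by_class: dict mapping a class name to a list of image ids.
    Returns (support, query), each a list of (class, image_id) pairs.
    """
    rng = rng or random.Random()
    # Pick the N classes that define this meta-task.
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for cls in classes:
        # Sample without replacement so support and query never overlap.
        picked = rng.sample(images_by_class[cls], k_shot + n_query)
        support += [(cls, img) for img in picked[:k_shot]]
        query += [(cls, img) for img in picked[k_shot:]]
    return support, query
```

With `n_way=2` and `k_shot=3`, the support holds 2 × 3 = 6 samples, matching the N × K count in the text.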
We follow and extend the notation of Chen et al. [27]. We describe the support set and the query set of a task in terms of images and annotations: $I^i$ represents the $i$-th original image, $L^i$ are the corresponding annotations, the indexes $s$ and $q$ denote the support set and the query set, $N$ is the number of images in each set, and $l$ is the semantic class of the dataset. In general, we regard each segmentation task as binary, with N = 2, or $L = (L_+, L_-)$, where every task defines its own positive, and the negative is its complement (that is, the background in image segmentation). Note that the binary task is a natural one for interactive segmentation, where a task consists of a single object to be segmented. The binary task can be extended to higher-way tasks. Because the inference for every query image is independent in our mechanism, we fix the number of unannotated query images at one.
To solve the few-shot image segmentation problem, our model is divided into three parts: (1) extracting from the semi-supervised support images a task representation that expresses the segmentation task with high quality; (2) segmenting query images according to the task representation extracted in the previous step; and (3) introducing the fully connected conditional random fields [4] to take global location information into account and further refine the segmentation results. We express the task representation as $R = g(I_s, L_+, L_-)$ and the query segmentation as $y = f(I_q, R)$. The choice of the task representation, R, and its encoder, g, is important for few-shot segmentation, which must handle the hierarchical structure of images and their pixelwise annotations. We discuss this issue in Section 4. Given the task representation, we integrate few-shot methods into dense pixelwise inference through a fully convolutional network. Compared with other few-shot methods, our evaluation emphasizes the limits of shot and efficiency.

Methods
Our model can make predictions independently, while also accepting guidance on the task and correcting errors under the direction of users. Unlike static model parameters, our guidance is dynamic: it can be extended or corrected as directed by an annotator. It is worth noting that, when the support image and query image are the same, prediction reduces to interactive segmentation, which can be seen as a special case of few-shot segmentation. Specifically, we use a guide, $R = g(I_s, L_s)$, to extract a latent task representation, R, from the support through the guided branch. Subsequently, the segmentation branch combines the task representation, R, with the query features to make joint predictions, $y = f(I_q, R)$. We discuss the design of these two functions in the following sections.
Our model uses VGG-16 [28] as a feature extractor, pre-trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [29] and converted into fully convolutional form.

Guided Branch: Extracting Task Representation from Support
To segment query images, the task representation, R, has to fuse the pixel annotations with the support image features. Because the pixels are semi-supervised and spatially correlated, the support is statistically dependent. In addition, full supervision is costly to annotate because scenes are high-dimensional and class-skewed. For simplicity, we first consider (1, P)-shot support and then extend it to (K, P)-shot support. We express the guidance process as $R = g(I_s, L_+, L_-) = \mu(\lambda(I_s), m(L_+), m(L_-))$, which induces structure architecturally; R includes both foreground object features and background features. Inspired by the method of Rakelly et al. [30,31], we first map the positive and negative annotated pixels to the same coordinate scale as the support image, $I_s$. We record their positions, setting each click position to 1 and all others to 0. This yields two annotation masks, $L_+$ and $L_-$, with $L \in \{0, 1\}$. Then, we use the pre-trained VGG-16 model as a feature extractor, λ, to extract visual features from the support image alone. Since the VGG-16 model contains five pooling operations, the extracted support feature map is downsampled by a factor of 32. To match this size, the positive and negative masks are both down-sampled with a bilinear interpolation kernel, giving $m(L_+)$ and $m(L_-)$. Finally, we use the element-wise product, µ, to fuse the support features with the positive mask and the negative mask. In this way, we can update the task representation quickly by recomputing only the masking to incorporate new positive and negative annotations, which greatly reduces inference time. Additionally, the support and query share one feature extractor, which significantly improves learning efficiency. An overview is shown in Figure 3.
Previous work merely concatenates the image and annotations. Xu et al. [21] proposed such a method, which lets end-to-end learning fully control how to fuse. However, because concatenating the image and annotations changes the number of input channels, the original VGG-16 feature extractor cannot handle the new input. This breaks the input structure of the network and prevents the use of a unified network. Shaban et al. [14] fuse by directly multiplying the support image with the dense label annotation, which discards all background information. Our method preserves the background information of the support. Moreover, factorizing into feature and annotation branches better captures the spatial dependency between the annotations and the support. The previous methods have inherent problems: the support features are inconsistent with the query features, and the fusion is slow.
When there are multiple foreground objects in the image, we want to segment all of the target objects, not just one of them, as shown in Figure 4. In addition, if the support and query images are entirely different, the spatial correspondence between the two is unknown, and the support and query can only be related through features. We therefore apply global pooling, which merges the local task representations over all positions and discards the spatial dimensions. For the pooling step, we choose global average pooling. However, if the support is the same as the query image (e.g., in interactive segmentation), feature location is informative, and the global pooling step can be skipped.
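The masking and pooling steps of the guided branch can be illustrated on a toy feature map. The sketch below assumes the features have already been extracted and the masks already down-sampled to the same spatial size; it uses plain nested lists in place of tensors. The function names are hypothetical, and concatenating the foreground and background vectors into one representation is an assumption of this sketch, since the text does not state how the two are combined.

```python
def masked_global_average(features, mask):
    """Element-wise product with a mask, then global average pooling.

    features: list of C channels, each an H x W nested list.
    mask:     H x W nested list with values in [0, 1].
    Returns a length-C vector (mean over all H*W positions).
    """
    h, w = len(mask), len(mask[0])
    pooled = []
    for channel in features:
        total = sum(channel[y][x] * mask[y][x]
                    for y in range(h) for x in range(w))
        pooled.append(total / (h * w))
    return pooled


def task_representation(features, pos_mask, neg_mask):
    """Foreground and background guidance vectors, joined into one list.

    Assumption: the two pooled vectors are simply concatenated.
    """
    return (masked_global_average(features, pos_mask)
            + masked_global_average(features, neg_mask))
```

Because only the masks change when a new click arrives, the expensive feature extraction can be reused and only these cheap products and sums are recomputed.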

Segmentation Branch: Feature Fusion
In an ordinary fixed segmentation model, inference has the form $y = f_\theta(I_q)$ for input image $I_q$, parameters θ, and output y. In our method, guided inference is instead the function $y = f_\theta(I_q, R)$, where R is the guidance extracted from the support. We use a fusion operation, C, to concatenate the guided task representation with the query features. The segmentation process can be defined as $y = f_\theta(C)$ with $C = \lambda(I_q) \otimes \mathrm{tile}(r)$, where λ is the same fully convolutional encoder as the support uses, r is the task representation after global pooling obtained by the guided branch, the tile function copies the original matrix horizontally and vertically, and ⊗ denotes stacking along the channel dimension. See Figure 5 for a schematic illustration. We repeat the guidance vector, r, until it matches the spatial dimensions of the query features, $\lambda(I_q)$, so that the two can be stacked. Note that the method of Yoon et al. [32] is similar to our instantiation, but it has difficulty with sparse pixel settings and needs to optimize during inference for few-shot usage. We then decode the fused support-query features into a binary predicted segmentation through a small convolutional network, $f_\theta$, which can be understood as a learned distance metric for retrieval from the query to the support. Specifically, the $f_\theta$ network consists of two parts. The first part uses a convolutional layer (1 × 1 kernel), a rectified linear unit (ReLU), and a dropout layer to fuse the support-query features; the parameters of the convolutional layer are used to compute the distance of pixels from the support to the query. The second part consists of a single convolutional layer (1 × 1 kernel) with two output channels, which predicts the foreground and background scores on the coarse distance metric maps.
Finally, the predictions are restored to the original size by bilinear interpolation, and the network is trained end-to-end by back-propagation from the pixelwise loss.
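The tile-and-stack fusion can be shown on a toy example. The following sketch uses nested lists in a channels × height × width layout; the names `tile` and `fuse` and this layout are illustrative assumptions, not the authors' implementation.

```python
def tile(vector, height, width):
    """Repeat a length-D guidance vector at every position -> D x H x W."""
    return [[[v] * width for _ in range(height)] for v in vector]


def fuse(query_features, r):
    """Stack the tiled guidance channels after the query feature channels.

    query_features: list of C channels, each an H x W nested list.
    r:              length-D guidance vector from the guided branch.
    Returns a (C + D) x H x W nested list.
    """
    h, w = len(query_features[0]), len(query_features[0][0])
    return query_features + tile(r, h, w)
```

After fusion, every spatial position of the query carries the same copy of the guidance, so a 1 × 1 convolution can compare each query pixel with the task representation independently.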
We adopt the same episodic training as Rakelly et al. [30]. We first sample a task, and then we sample a subset of images containing the task, which is divided into support and query. Given inputs and targets, we train the network with the cross-entropy loss between the predictions and the target labels, where $y_i$ is the dense label (ground truth) of image $i$ and $p_i$ is the corresponding predicted segmentation. After learning, our few-shot method operates purely through guidance and guided inference. As described in Section 4.1, we first train with K = 1 for efficiency. Once learned, our network can operate under different (K, P)-shot settings, handling sparse and dense pixelwise annotations in the same model.
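For reference, a pixelwise cross-entropy over two-class score maps can be computed as below. This is a generic sketch of the standard softmax cross-entropy, averaged over pixels; the averaging and the function name are assumptions, and the exact normalization used in training is not specified in the text.

```python
import math


def pixelwise_cross_entropy(scores, labels):
    """Mean softmax cross-entropy over pixels.

    scores: list of per-pixel [background_score, foreground_score].
    labels: list of per-pixel class ids (0 = background, 1 = foreground).
    """
    total = 0.0
    for s, y in zip(scores, labels):
        m = max(s)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in s))
        total += log_z - s[y]  # equals -log softmax(s)[y]
    return total / len(labels)
```

A pixel with equal scores for both classes contributes log 2 to the loss, and the loss shrinks as the score of the correct class grows relative to the other.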

Fully Connected CRFs for Accurate Localization
As illustrated in Figure 6, the score maps of the few-shot guided network can reliably infer the rough position of the target object in an image, but they cannot delineate its precise outline. For example, some of the pixels between the legs of the horse in the image are grass, but the segmentation labels that patch of grass as horse. Moreover, the horse's ears are not correctly identified. This is caused by the invariance of convolutional networks to spatial transformations. This invariance enhances the ability to learn hierarchical abstractions of data, but it may hinder low-level vision tasks [9] such as image segmentation. Eigen et al. [33] and Long et al. [6] use information from multiple layers of convolutional networks to better estimate target borders. Mostajabi et al. [34] take a completely different approach, using a super-pixel representation to solve this problem. We address the challenge of accurate localization by coupling the few-shot segmentation method proposed by Rakelly et al. [30] with the fine-grained localization accuracy of fully connected CRFs. Our work shows that this combination achieves strong results.
Traditionally, conditional random fields (CRFs) have been widely used in image segmentation [22,35,36]. Generally, these methods include energy terms that couple adjacent nodes, favoring the assignment of the same label to spatially proximal pixels. However, these basic short-range CRFs consider only the adjacent neighborhood, not the whole image, and lose some context information. The fully connected conditional random field addresses this: each pixel is connected by an edge to every other pixel, yielding a dense, fully connected model. We integrate the fully connected CRF model of Krähenbühl and Koltun [5] into our few-shot guided network for its efficient computation and its ability to capture fine edge details while also accommodating long-range dependencies.
The energy function used in the fully connected CRF model is $E(x) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j)$, where $\theta_i(x_i)$ and $\theta_{ij}(x_i, x_j)$ represent the unary and pairwise potential functions, respectively, and $x$ is the label assignment for the pixels. The unary potential function is $\theta_i(x_i) = -\log P(x_i)$, where $P(x_i)$ is the label assignment probability at pixel $i$. Specifically, what we want is the probability of the label $x_i$ given the observed pixel color $y_i$. Intuitively, in a picture of a black dog standing in the grass, an observed black pixel most likely belongs to the dog. Here, we take the output of the few-shot guided network as the unary potential. The pairwise potential function is built from a weighted sum of Gaussian kernels, $\sum_M \omega_M k_M(f_i, f_j)$. Because the model's factor graph is fully connected, there is a pairwise term for every pair of pixels $i$ and $j$ in the image. Here, $k_M(f_i, f_j)$ is a Gaussian kernel on $f_i$ and $f_j$, $f_i$ is the feature vector of pixel $i$, and $\omega_M$ is the corresponding weight. The pairwise potential measures how likely two label assignments are to occur together; in other words, it describes the relationship between pixels. If two pixels are similar, they are likely to belong to the same class; otherwise, they are likely to be separated.
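To make the energy concrete, the sketch below evaluates it by brute force for one labeling of a tiny image. It assumes the common formulation from Krähenbühl and Koltun [5]: a Potts label compatibility (only pairs with different labels are penalized) and two Gaussian kernels, an appearance kernel on position and color and a smoothness kernel on position alone. The weights and bandwidths are placeholder values, each unordered pair is counted once, and practical implementations use efficient mean-field inference rather than this quadratic loop.

```python
import math


def crf_energy(labels, probs, positions, colors,
               w_app=1.0, w_smooth=1.0,
               sigma_pos=3.0, sigma_col=10.0, sigma_smooth=1.0):
    """Brute-force energy of one labeling under a fully connected CRF.

    labels:    list of label ids, one per pixel.
    probs:     probs[i][l] = P(x_i = l), e.g. from the segmentation network.
    positions: list of (row, col) per pixel.
    colors:    list of (r, g, b) per pixel.
    All weights and bandwidths are illustrative placeholders.
    """
    n = len(labels)
    # Unary term: negative log-probability of each assigned label.
    energy = sum(-math.log(probs[i][labels[i]]) for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                continue  # Potts compatibility: agreeing pairs cost nothing
            d_pos = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
            d_col = sum((a - b) ** 2 for a, b in zip(colors[i], colors[j]))
            appearance = math.exp(-d_pos / (2 * sigma_pos ** 2)
                                  - d_col / (2 * sigma_col ** 2))
            smoothness = math.exp(-d_pos / (2 * sigma_smooth ** 2))
            energy += w_app * appearance + w_smooth * smoothness
    return energy
```

Under these assumptions, giving different labels to two nearby pixels of the same color raises the energy, while a label change across a strong color edge is penalized far less, which is what lets the CRF snap the coarse network output to object boundaries.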

Dataset and Evaluation Metrics
We test our model on the augmented PASCAL VOC 2012 dataset, the so-called SBD (Semantic Boundaries Dataset and Benchmark) [37]. It is usually placed in the benchmark release folder. At present, the SBD includes annotations for 11,355 images obtained from the PASCAL VOC 2011 dataset. Of these, 8498 images are used for training and 2857 for testing. The images were annotated on Amazon Mechanical Turk, and conflicts between the segmentations were resolved by hand. The dataset provides both category-level and instance-level segmentations and boundaries for every image, covering the 20 foreground object classes and one background class.
To standardize the evaluation, we report four metrics for all tasks, which are derived from pixel accuracy and intersection-over-union (IU) of the positives. These are common metric choices for many segmentation and scene parsing evaluations. Let $n_{ij}$ be the number of pixels that belong to class $i$ but are predicted to be class $j$, and let there be $N + 1$ classes in total (including the background). Following [6], the metrics include:
• Intersection-over-union (IU): $\mathrm{IU} = \frac{1}{N+1} \sum_i \frac{n_{ii}}{\sum_j n_{ij} + \sum_j n_{ji} - n_{ii}}$. This is a standard metric for semantic segmentation, which represents the ratio between the intersection and the union of the two sets, ground truth and predicted segmentation.
• Frequency weighted intersection-over-union (FWIU): $\mathrm{FWIU} = \frac{1}{\sum_i \sum_j n_{ij}} \sum_i \frac{\left(\sum_j n_{ij}\right) n_{ii}}{\sum_j n_{ij} + \sum_j n_{ji} - n_{ii}}$. This metric weights each class by its frequency of occurrence.
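These metrics can all be read off a confusion matrix. The sketch below follows the standard definitions from the FCN paper [6], with pixel accuracy and mean accuracy included as the usual companions of the two IU metrics; the function name and dictionary keys are illustrative, and the code assumes every class occurs at least once in the ground truth.

```python
def segmentation_metrics(n):
    """Compute four segmentation metrics from a confusion matrix.

    n[i][j] = number of pixels of true class i predicted as class j.
    Assumes each class has at least one ground-truth pixel.
    """
    k = len(n)
    true_count = [sum(row) for row in n]                       # sum_j n_ij
    pred_count = [sum(n[i][j] for i in range(k)) for j in range(k)]  # sum_j n_ji
    total = sum(true_count)
    # Per-class IU: intersection / (true + predicted - intersection).
    iu = [n[i][i] / (true_count[i] + pred_count[i] - n[i][i]) for i in range(k)]
    return {
        "pixel_acc": sum(n[i][i] for i in range(k)) / total,
        "mean_acc": sum(n[i][i] / true_count[i] for i in range(k)) / k,
        "mean_iu": sum(iu) / k,
        "fw_iu": sum(true_count[i] * iu[i] for i in range(k)) / total,
    }
```

For a binary task, `n` is a 2 × 2 matrix whose rows are the true background and foreground pixels and whose columns are the predicted ones.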

Experiments
We employ the simplest form of piecewise training, decoupling the training stages of the few-shot guided network and the CRF, and assuming that the unary potentials provided by the few-shot guided network are fixed during CRF training.
We evaluate our few-shot guided network on two kinds of problems: interactive segmentation and semantic segmentation. We take fine-tuning and foreground-background segmentation as baselines for all problems. Fine-tuning simply attempts to optimize the model on the support, as Caelles et al. [38] did. The foreground-background baseline checks that few-shot methods actually learn, that is, that their output changes with the support. We train for binary segmentation on the training classes of each split.
Turning to qualitative results, Figure 7 shows the segmentation results of our model with and without the fully connected CRF. Our few-shot guided network can already predict the target with high accuracy before the CRF. Employing a fully connected CRF improves the prediction along target boundaries and allows the model to capture fine edge details of the object. The proposed model also has limitations. It needs to extract the semantic information of the foreground and background from the support, and it determines the category of each pixel in the query by computing its distance to the foreground and the background. Therefore, when the foreground and background have similar representations, the model makes mistakes. In the last row of Figure 7, the snow on the motorcycle's wheel has the same color as the snowy background, so parts of the wheel are wrongly labeled as background after CRF post-processing. This problem could be addressed with high-resolution feature maps [9,39]; we leave it as future work.

Interactive Segmentation
As mentioned in the previous section, we treat interactive segmentation as a special case of few-shot segmentation in which the support and query images are the same. We mainly compare our method with that of Xu et al. [21], because it is state-of-the-art, and we pay particular attention to the efficiency and generality of learning from labels. Our method differs from theirs in how the support is encoded: they fuse by simply stacking, whereas our fusion factorizes into images and annotations, and we fuse globally. As a result, our approach is more accurate with sparse annotations and faster to update, because previous methods require a full forward pass (see Figure 8). We use late-global guidance throughout.

Semantic Segmentation
Because of the high intra-class variance in the appearance of each task, applying few-shot learning to semantic segmentation is not simple. For this problem, we follow the experimental protocol of Shaban et al. [14] and define four class-wise splits [30]. We build the training set from all images that contain non-held-out classes. There are 21 classes in total (including the background).
We evaluate both dense and sparse annotations of the support, using full masks and a single point per positive/negative, respectively (see Table 1). We achieve state-of-the-art results in both the dense setting and the sparse setting with just two annotated pixels. The earlier method of Shaban et al. [14] is incompatible with missing annotations, as is that of Xu et al. [21]; both are defined only for binary annotation.
In the semantic segmentation problem, we also observed an unexpected phenomenon: our method is not sensitive to the number of annotations. As shown in Table 2, the IU increases only slowly and marginally as the number of annotations increases. This is because a single shot cannot cover all the visual variation within a category. Consider segmenting a black long-haired dog when the guidance is obtained from a white short-haired dog: the color and shape of the two are quite different, resulting in inaccurate guidance. We therefore considered addressing this problem by moving from one-shot to few-shot. We increased the number of shots from one to five, while keeping the two annotated pixels (one positive and one negative) unchanged, and found that the IU increased by about 6%, a substantial improvement.
Table 1. Few-shot semantic segmentation evaluation on SBD with the IU (%) metric over binary tasks. As shown in the table, our approach is much better than previous methods. Note that foreground-background (FG-BG) is a strong baseline and rivals fine-tuning.
At the end of the experiment, we evaluated interactive segmentation and few-shot semantic segmentation together, comparing the methods using the four metrics mentioned above. We report the evaluation results in Table 3. Our method achieved the best results for both semantic segmentation and interactive segmentation. We then added the fully connected CRF to our model in each setting, which produced a significant performance boost of about 4%, as shown in Table 4.
Table 3. Evaluation results under the (1,3)-shot setting. For interactive segmentation, our interactive-late-global model shows a significant improvement in all four metrics, especially in IU. For semantic segmentation, our approach also shows about a 10% improvement in IU.

Discussion
Our work combines few-shot segmentation with the fully connected CRF to solve image segmentation in low-data settings, producing accurate segmentation predictions and recovering object boundaries as far as possible, while maintaining high computational efficiency. In the few-shot segmentation component, few-shot guided networks extract the latent task representation from any amount of supervision given on the support and use it for interactive inference. Once learned, the network can segment new inputs without further supervision, while maintaining its accuracy and efficiency. Our experimental results show that the proposed method achieves good results on the augmented PASCAL VOC 2012 image segmentation dataset, the so-called SBD.
Although we have achieved good results by integrating the fully connected CRF into our networks, some limitations remain. For example, our model is not an end-to-end system: the CRF merely uses the output of the few-shot network as its unary potential function. Therefore, we plan to fully integrate the two major parts (few-shot networks and fully connected CRFs) and train the whole system in an end-to-end fashion. In addition, we intend to experiment with more datasets. We think this area is full of challenges, and we hope to make continuous improvements in future work.