Interactive Trimap Generation for Digital Matting Based on Single-Sample Learning

Abstract: Image matting refers to the task of estimating the foreground of images, which is an important problem in image processing. Recently, trimap generation has attracted considerable attention because designing a trimap for every image is labor-intensive. In this paper, a two-step algorithm is proposed to generate trimaps. To use the proposed algorithm, users must only provide a few clicks (foreground clicks and background clicks), which are employed as the input to generate a binary mask. One-shot learning techniques have achieved remarkable progress on semantic segmentation, and we extend this technique to the binary mask prediction task. The mask is further used to predict the trimap using image dilation. Extensive experiments were performed to evaluate the proposed algorithm. Experimental results show that the trimaps generated using the proposed algorithm are visually similar to user-annotated ones. Compared with interactive matting algorithms, the proposed algorithm is less labor-intensive than trimap-based matting algorithms and achieves more accurate results than scribble-based matting algorithms.


Introduction
Matting is the process of extracting a foreground object image F along with its opacity mask α (typically called the alpha matte) from a given digital photograph I, and it plays an important role in video editing and image processing. Specifically, the image matting problem is modelled as a convex combination of a foreground image F and a background image B, as given in Equation (1):

I_i = α_i F_i + (1 − α_i) B_i, (1)
where i = (x, y) indexes the image lattice. The value of α lies between 0 and 1; if α_i = 1 or 0, then the pixel at location i belongs to the definite foreground or definite background, respectively. Otherwise, the pixel belongs to the opacity (mixed) region. This is the main difference from image segmentation, which has no pixels between the foreground and the background. Image matting is ill-posed and under-constrained since F, B, and α are all unknown. Note that in Equation (1), if we consider a full-color (RGB) image, there are seven unknown parameters per pixel (F and B for each channel, plus α) but only three observed values, so user interaction is required to constrain the problem. Two common forms of interaction are scribbles and trimaps, in which some pixels are annotated as belonging to the definite foreground or definite background. Scribbles provide only a small amount of interaction information, whereas trimaps provide more complete interaction information. However, drawing such precise trimaps requires considerable human effort, which is often undesirable, particularly in the case of opaque objects. For example, drawing a trimap like the one in the first row of Figure 1e takes about 50 s. Scribbles are easy to obtain, but they are error-prone and inaccurate (Figure 1c): when the foreground scribble pixels are mixed with background pixels, the generated alpha matte is inaccurate or even wrong. In comparison, user clicks are more robust than scribbles. To fully extract meaningful foreground objects and minimize the user's annotation workload, our algorithm takes advantage of a few user-provided clicks and directly generates trimaps for image matting (Figure 1d).
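For intuition, Equation (1) can be applied directly once F, B, and α are known; the following minimal sketch (NumPy, with hypothetical toy data) composites a foreground over a background according to the equation.

```python
import numpy as np

def composite(fg, bg, alpha):
    """Composite a foreground over a background following Equation (1).

    fg, bg : float arrays of shape (H, W, 3), values in [0, 1]
    alpha  : float array of shape (H, W), values in [0, 1]
    """
    a = alpha[..., None]            # broadcast alpha over the color channels
    return a * fg + (1.0 - a) * bg  # I_i = alpha_i * F_i + (1 - alpha_i) * B_i

# Toy example: a 2x2 image whose top-right pixel is semi-transparent foreground.
fg = np.ones((2, 2, 3))             # white foreground
bg = np.zeros((2, 2, 3))            # black background
alpha = np.array([[1.0, 0.5],
                  [1.0, 0.0]])
print(composite(fg, bg, alpha)[0, 1])  # -> [0.5 0.5 0.5], a mixed pixel
```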
The spectral matting algorithm [3] and the automatic trimap generation algorithm [4] extract the matte from the input image without any user intervention. However, the limitation of these methods is that they assume a single object is present in the given scene. When multiple semantic targets appear in a scene, the generated result is not the user's region of interest. For example, cats and dogs could appear in the same image; if we want to obtain the alpha matte for the cats, the dogs should be treated as background. The generated result cannot correctly reflect the user's interest in these cases. Hsieh et al. [5] proposed a method to automatically obtain trimaps, but their algorithm takes the original image and a segmentation result as input, and this segmentation result is itself obtained through user interaction. Therefore, ref. [5] requires constraint information similar to that of scribble-based algorithms. Computers cannot replace us when deciding which parts to target, so providing interactive information is a prerequisite for running the algorithm.
In actual image matting applications, we aim to obtain the alpha mattes of a large number of images. Many of these images belong to the same semantic classes, and images of the same semantic class share similar characteristics. Previous methods need a trimap designed for every image, and collecting such dense constraints for every image is time consuming, tedious, and error-prone.
Based on the above analysis, we aimed to reduce the amount of user interaction while maintaining alpha matte accuracy. Our goal was to obtain the trimap for any image in an unseen class given only one image of that class and its corresponding pixel-level click annotation. Inspired by one-shot learning, we propose a three-branch model to generate trimaps. Our model consists of three branches: the guidance branch extracts guidance from the annotated image, the inference (segmentation) branch obtains the segmentation result of the unlabeled image given that guidance, and the generation branch converts the segmentation result into a trimap (Figure 2). To summarize, the main contributions of our work are three-fold: (1) To the best of our knowledge, this is the first algorithm to connect one-shot learning with trimap generation, achieving relatively accurate results. (2) We use an algorithm to generate trimaps, which further reduces user workload; with the proposed algorithm, users do not need to design trimaps, and only a few clicks are needed. (3) Our algorithm can generate guidance information for the segmentation process, which means that, based on the interactive information provided by the user for one single image, we can obtain trimaps for any images from unseen semantic classes.

Related Work
Various digital matting approaches, such as Bayesian matting [1], learning-based matting [6], and closed-form matting [7], require trimaps to be specified by the user. The GrabCut [8] and Lazy Snapping [9] algorithms use graph-cut-based optimization to extract foregrounds from images according to a small amount of user input, such as a few strokes or a bounding box. Wang et al. [2] combined segmentation and digital matting and presented a unified optimization approach based on belief propagation. Wang et al. [10] designed an interactive tool (Soft Scissors) to obtain high-quality image matting. Cho et al. [11] built a deep neural network to learn the mapping between the inputs and the predicted alpha mattes (obtained by [7,12]). Xu et al. [13] proposed a deep-learning-based model that improves matting performance by exploiting colors and textures. Since digital matting is an under-constrained problem, all matting algorithms require user interaction. However, designing such a relatively accurate trimap is time consuming, which is unsuitable for practical applications, while the other interactive approach, the scribble-based approach, lacks robustness. To fill this gap, we propose a method to generate trimaps from user clicks; the final, accurate alpha matte is then computed from the generated trimap.
A series of trimap generation studies [4,14–18] obtained trimaps through user interaction or automatically. Rhemann et al. [14] presented a new approach using parametric max-flow to generate trimaps; the final alpha matte was obtained with a gradient-preserving prior. Shahrian et al. [15] presented a new sampling strategy that constructs a comprehensive set of known samples by sampling the entire color distribution in the known region. Cho et al. [16] presented a trimap generation approach for light-field images: they took advantage of a binary segmentation result to obtain coarse trimaps and then optimized the trimaps by analyzing the color distribution along the boundary of the segmentation result. Gupta et al. [4] employed superpixels to over-segment the image and then used a saliency map and local feature descriptors to automatically generate trimaps. Juan et al. [17] proposed a segmentation algorithm to extract a relatively accurate trimap from a coarse indication and then used the generated trimap in an improved matting method to produce a better alpha matte. Gastal et al. [19] presented a real-time matting algorithm for natural images and videos based on a sampling technique that makes full use of the similarity between adjacent pixels and exploits the inherent parallelism of the algorithm. The fundamental difference between our method and previous trimap generation algorithms is the guidance ability: for any image of the same semantic class, we can obtain its alpha matte by providing only one annotated image. A series of studies [20–22] refined the trimaps as a preprocessing step by expanding the foreground and background from their boundaries with the unknown region, as proposed by Shahrian et al. [15]. Shen et al. [18] proposed a deep automatic matting method that can generate trimaps for portrait images with a deep network; however, their system is unable to handle other semantic classes.
Recently, digital image processing technology has developed rapidly, and many image processing algorithms have emerged. Zhu et al. [23] proposed a novel hashing approach for scalable image retrieval. Fang et al. [24] solved imagery de-noising problems using discriminative representations. Zong et al. [25] proposed a novel multiple-description image coding model to improve coding efficiency. Jiang et al. [26] proposed a matching-based method for aligning multimodal images. Zhao et al. [27] built an efficient image feature representation method (ASD) for image detection and presented another multi-trend structure descriptor [28], built on local and multi-trend structures, to further improve detection accuracy. Liu et al. [29] presented a computer-aided design system based on a computational approach to producing 3D images for stimulating the creativity of designers. Wang et al. [30] presented a novel just-noticeable-difference (JND) model that satisfies the visual perception characteristics of human eyes and matches the spread transform dither modulation (STDM) watermarking framework. Deng et al. [31] presented a graph-cut-based method for automated aorta segmentation.
Learning-based algorithms have achieved great success in the field of image processing. Li et al. [32] applied kernel learning to face recognition. Learning-based segmentation algorithms, such as U-Net [33], FCN [34], Mask R-CNN [35], and DeepLab [36], can accurately separate foreground objects from the background. However, these algorithms are trained with full annotations and require expensive labeling effort. To reduce the annotation workload, a promising alternative is to learn from weak annotations, e.g., bounding boxes [37] and points [38]. The main disadvantage of weakly supervised methods is that they lack the ability to generalize to unseen classes. For example, if a network is trained to segment cats using many images containing various breeds of cats, it will not be able to segment vehicles without fine-tuning on many images containing vehicles. Therefore, researchers have focused extensively on generalizing to new object classes in order to minimize labeling costs and improve the efficiency with which labeled samples are used.
Humans have relatively good cognitive abilities and can recognize objects with little guidance. For example, a child can easily identify a new dog species from a single image, even if they have never seen that species before. Inspired by this, one-shot learning focuses on imitating this ability. The goal of one-shot segmentation is to obtain the object regions of a query image given only one support image, where both the support image and the query image are sampled from the same unseen class. Sampling from an unknown class is the main difference between one-shot segmentation and traditional semantic segmentation. If we want to segment an unseen class of objects using a traditional learning-based algorithm, we need at least hundreds of labeled images and multiple training iterations to achieve a good segmentation result. In contrast, one-shot-based algorithms take only one image-label pair as guidance and require no optimization during segmentation. One-shot segmentation has two major advantages: (1) the annotation effort is minimized, and (2) there is no need to fine-tune the model, since the parameters are fixed after training, which reduces time and computation costs.
One-shot semantic segmentation recognizes object regions from unseen categories with only one annotated sample serving as supervision. Shaban et al. [39] pioneered the application of one-shot learning to semantic segmentation: they segmented new semantic classes given only an image and its corresponding densely annotated label. They constructed the two-branch model OSLSM, which is based on a Siamese network. The network is divided into a conditioning branch and a segmentation branch, where the conditioning branch takes support image-label pairs and produces dynamic parameters for the segmentation branch, which then performs dense pixel-level prediction on a test image for the new semantic class. This process adds a convolutional layer after the FCN [34], and the parameters of this convolutional layer are the generated dynamic parameters. However, OSLSM still needs users to provide pixel-level annotation information for the support images. It is also unstable during optimization, since different support image-label pairs for the same task should produce the same task parameters. Rakelly et al. [40] proposed the REVOLVER model, which requires only an image and sparse pixel-level annotations, minimizing the user interaction workload: users need to provide only a few clicks on the support image (clicks located in the definite foreground and the definite background). Unlike OSLSM, which generates dynamic parameters, REVOLVER adopts a distance metric method: it computes the distance between the query features and the foreground and background representations. REVOLVER achieves results similar to those of OSLSM with only a few pixel-level annotations. Xu et al. [41] introduced the state-of-the-art deep interactive object segmentation (DIOS) method, transforming users' positive and negative clicks into Euclidean distance maps and training a fully convolutional neural network to recognize "object" and "background" from training samples. However, DIOS was not designed to generate trimaps. In particular, DIOS cannot propagate annotations across different images (Figure 3), which is a bottleneck for annotation efficiency, since it requires at least two annotations for every input, whereas our method can segment new inputs independently.

Problem Setup
Suppose we have two datasets: a query set L_q(l) = {(I_i^q, Y_i^q)}_{i=1}^N and a support set L_s(l) = {(I_i^s, Y_i^s)}_{i=1}^N, where I_i represents the original image; Y_i represents the corresponding groundtruth mask; N represents the number of images in each set; the superscripts s and q denote the support set and query set, respectively; and l represents the semantic class. Our goal was to learn a model f_θ(L_s, L_q) that can precisely predict the binary masks Y_q using the support set L_s as a reference, where θ represents the network parameters.
We cannot apply this model to generate trimaps directly because previous algorithms focused on semantic segmentation, which classifies each pixel as foreground or background, whereas trimap generation classifies pixels into foreground, background, and unknown (opacity) regions. Generating trimaps directly would require groundtruth trimaps to optimize the network, but existing dataset labels (e.g., PASCAL VOC [42]) only include the foreground and background. We therefore trained the model with binary masks to generate binary segmentation results and, inspired by [5], used dilation to produce an initial estimate of the trimap, with dynamic-width and voting steps to optimize it. The implementation details are provided in Section 4.
During the training process, the support image is fed into the network with its sparse pixel-level annotations, which are obtained from its corresponding groundtruth mask; we simulate manual labeling by randomly selecting some points from the foreground and background to guide network training. The query image is fed into the network with its dense mask, which is used for loss calculation and parameter optimization. In the test process, there is no label to exploit; only the sparse annotations collected from user interaction are available. Notably, L_q and L_s share the same types of objects, but no categories are shared between the training set and the test set: {l_train} ∩ {l_test} = ∅. This is the main difference between one-shot segmentation and traditional image segmentation. The traditional training process splits the dataset into a training set and a test set whose images never overlap, but the two sets share the same categories. Therefore, when the training data are processed, we turn a target into the background if its category appears in the test set.
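To make the episodic setup concrete, the sketch below illustrates how training episodes with disjoint class splits and simulated clicks could be sampled; `dataset.sample_pair` and the fold layout are hypothetical stand-ins, not the authors' exact implementation.

```python
import random
import numpy as np

# PASCAL-5i style split: 20 object classes divided into 4 folds of 5 classes each.
ALL_CLASSES = list(range(1, 21))                      # class ids 1..20 (0 = background)
FOLDS = [ALL_CLASSES[i * 5:(i + 1) * 5] for i in range(4)]

def split_classes(test_fold):
    """Return disjoint train/test class lists, so {l_train} ∩ {l_test} = ∅."""
    test_classes = FOLDS[test_fold]
    train_classes = [c for c in ALL_CLASSES if c not in test_classes]
    return train_classes, test_classes

def random_points(region_mask, k):
    """Pick up to k pixel coordinates uniformly from a boolean region mask."""
    ys, xs = np.nonzero(region_mask)
    idx = np.random.choice(len(ys), size=min(k, len(ys)), replace=False)
    return list(zip(ys[idx], xs[idx]))

def sample_episode(dataset, classes, n_clicks=3):
    """Sample one training episode: a support pair and a query pair sharing a class.

    `dataset.sample_pair(c)` is a hypothetical helper returning (image, dense_mask)
    for an instance of class c; the dense query mask is used only for the loss.
    """
    c = random.choice(classes)
    support_img, support_mask = dataset.sample_pair(c)
    query_img, query_mask = dataset.sample_pair(c)
    # Simulate user interaction: a few positive/negative clicks drawn from the
    # dense support mask (foreground pixels vs. background pixels).
    pos_clicks = random_points(support_mask == 1, n_clicks)
    neg_clicks = random_points(support_mask == 0, n_clicks)
    return (support_img, pos_clicks, neg_clicks), (query_img, query_mask)
```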
State-of-the-art algorithms for image segmentation [36] use networks pre-trained on ILSVRC; we follow the same practice for our feature extractor (Section 4).

Proposed Method
The network in this paper consists of three branches: the guidance branch generates a task representation and guides the segmentation branch, which produces the segmentation result; finally, the generation branch converts the segmentation result into a trimap. Our model is able to make predictions on its own and, with expert guidance, can be directed to a task or have its errors corrected. Self-prediction can be regarded as interactive segmentation when the support image and query image are the same; note that interactive segmentation is a special case of one-shot segmentation. In particular, we use the guidance branch to extract guidance z from the support set, z = g(I_s, Y_s). Afterward, the segmentation branch combines the guidance z with the query image features to jointly predict the output result, ŷ = f(I_q, z). We discuss how to design g(I_s, Y_s) and f(I_q, z) in the following sections.

Guidance Branch
The guidance branch fuses the user-provided clicks with the support image and generates the guidance information z_s. The guidance process can be expressed as z_s = g(I_s, Y_s^+, Y_s^-), where z_s includes target features and background features. We assume that the user provides positive clicks (+) and negative clicks (−). We rasterize the pixel-level clicks at the same coordinate scale as the support image I_s ∈ R^{3×w×h}, where w and h represent the width and height of the support image, respectively; the pixels under the clicks are set to 1 and all other pixels to 0. Then, we obtain two annotation maps, Y^+ and Y^-, with Y ∈ {0, 1}^{w×h}. We use a fully convolutional network λ as the feature extractor to extract visual features from the support image I_s. The feature extractor λ of our model is VGG-16 [43], pre-trained on ILSVRC [44] and converted into fully convolutional form [34].
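As a minimal sketch of the rasterization of clicks into the annotation maps Y^+ and Y^- (PyTorch, with click coordinates assumed to be in image pixel space):

```python
import torch

def click_maps(clicks_pos, clicks_neg, h, w):
    """Rasterize user clicks into binary annotation maps Y+ and Y-.

    clicks_* : iterable of (row, col) pixel coordinates
    Returns two tensors of shape (h, w) with 1 at clicked pixels, 0 elsewhere.
    """
    y_pos = torch.zeros(h, w)
    y_neg = torch.zeros(h, w)
    for r, c in clicks_pos:
        y_pos[r, c] = 1.0
    for r, c in clicks_neg:
        y_neg[r, c] = 1.0
    return y_pos, y_neg
```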
λ(I_s) ∈ R^{c×w'×h'}, where c, w', and h' represent the number of channels, width, and height of the feature maps, respectively.
To ensure that they have the same scale, the positive map and negative map are down-sampled with a bilinear kernel, m(Y^+) and m(Y^-), to the scale of the feature maps. Then, we fuse the support features with the positive map and negative map using the element-wise product µ. The spatial relationship between the support image and the annotations is well defined by fusing features from the visual and annotation branches. Using m to interpolate and µ to fuse, the visual representations of the support and query can be obtained using a unified feature extractor λ. The guidance process is shown in Figure 4c. In contrast to Shaban et al. [39], where an element-wise multiplication is applied directly to the support image and the dense label annotation (foreground value 1, background value 0), so that the background is discarded (Figure 4a), our method preserves the background information of the support image. In addition, by integrating the annotations with feature-level information through this factored fusion, the spatial dependency between them is more clearly defined. Xu et al. [41] proposed an early-fusion method that concatenates the positive map and negative map with the support image to form five channels, as shown in Figure 4b. However, the disadvantage of this method is that concatenating the support image with the maps breaks the input structure of the network, which also prevents the use of a unified network: the features of the support image and query image need to be extracted by different network structures. The model proposed in this paper (Figure 4c) maintains the original input structure of the network, which enables us to process both the support and query images within a unified network.
When multiple foreground objects appear in the image (Figure 5a), we need to obtain the segmentation mask for all objects in the image (Figure 5d) instead of a single object (Figure 5c). In addition, when the support and query images are completely different, no spatial correspondence is available, and the only mapping between support and query must be achieved through the feature representation. We therefore apply global pooling g_z to merge the local task representations and discard the spatial dimensions, which can be represented as z_s ∈ R^{c×w'×h'} → v_s ∈ R^{c×1×1}. However, when the support and query images are the same (e.g., interactive segmentation), location information can be used and the global pooling g_z can be omitted.
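The fusion and pooling described above might look as follows; this is a sketch under the assumption that `backbone` is a fully convolutional feature extractor (e.g., the VGG-16 trunk), and details such as concatenating the positive and negative cues along the channel axis are illustrative rather than the exact architecture.

```python
import torch
import torch.nn.functional as F

def guidance(backbone, support_img, y_pos, y_neg, pool=True):
    """Fuse click maps with support features to obtain guidance z_s (or v_s after pooling).

    backbone    : fully convolutional feature extractor lambda (e.g., VGG-16 trunk)
    support_img : (1, 3, H, W) tensor
    y_pos, y_neg: (H, W) binary click maps built from user clicks
    """
    feats = backbone(support_img)                       # lambda(I_s): (1, C, h', w')
    _, _, hp, wp = feats.shape
    # m(.): bilinear resampling of the annotation maps to the feature resolution
    m_pos = F.interpolate(y_pos[None, None], size=(hp, wp), mode='bilinear',
                          align_corners=False)
    m_neg = F.interpolate(y_neg[None, None], size=(hp, wp), mode='bilinear',
                          align_corners=False)
    # mu(.): element-wise product fuses visual features with the annotations
    z_pos = feats * m_pos                               # target (foreground) cues
    z_neg = feats * m_neg                               # background cues
    z_s = torch.cat([z_pos, z_neg], dim=1)              # guidance with both cues
    if pool:                                            # g_z: discard spatial dims
        return z_s.mean(dim=(2, 3), keepdim=True)       # v_s: (1, 2C, 1, 1)
    return z_s
```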

Segmentation Branch
The segmentation branch obtains the foreground target of the query image and generates a rough segmentation result. The segmentation model is defined as ŷ = f_θ(λ(I_q) ⊕ v_s). As in the guidance branch, we extract visual features from the query image I_q with the deep convolutional network λ(·), where I_q ∈ R^{3×w×h} and w, h represent the width and height of the query image, respectively. v_s is the globalized task representation obtained by the guidance branch, and ⊕ represents stacking along the channel dimension. We repeat the guidance vector v_s until its spatial dimensions equal those of the query feature maps λ(I_q) so that the two tensors can be concatenated. f_θ is a small convolutional network that fuses the query and support features and decodes them into a binary segmentation prediction; it can be interpreted as a learned distance metric for retrieval from support to query. The distance metric part consists of two components. The first component fuses the query-support features through one combination of a convolution layer (1 × 1 kernel), rectified linear units (ReLU), and drop-out; the parameters of the convolution layer are used to compute the distance between pixels from support to query, and the outputs of this component are coarse distance metric maps. The second component is a single convolution layer (1 × 1 kernel) with a channel dimension of 2 that predicts foreground and background scores at each location of the coarse distance metric maps. It is followed by bilinear upsampling, enabling end-to-end learning by back-propagation of the pixel-wise loss. The segmentation process is shown in Figure 6, where z_s represents the guidance, g_z the global pooling, and v_s the pooled guidance vector.
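A rough sketch of this fusion-and-scoring head is shown below (PyTorch); the hidden width and channel sizes are assumptions chosen for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    """Fuses query features with the pooled guidance vector and predicts
    a 2-class (foreground/background) score map; channel sizes are assumptions."""

    def __init__(self, query_channels=512, guidance_channels=1024, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(                 # 1x1 conv + ReLU + dropout
            nn.Conv2d(query_channels + guidance_channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Dropout2d(0.5),
        )
        self.score = nn.Conv2d(hidden, 2, kernel_size=1)   # fg/bg scores

    def forward(self, query_feats, v_s, out_size):
        # Tile the guidance vector so it matches the query feature map spatially.
        v_tiled = v_s.expand(-1, -1, query_feats.size(2), query_feats.size(3))
        x = torch.cat([query_feats, v_tiled], dim=1)       # channel-wise stack (⊕)
        x = self.fuse(x)                                   # coarse distance metric maps
        x = self.score(x)
        # Bilinear upsampling back to image resolution for the pixel-wise loss.
        return F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)
```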
Inspired by REVOLVER's [40] training episodes, we first sample a task and then sample a subset of images containing that task, which we divide into support and query. Given the inputs and targets, we train the network with the pixel-wise cross-entropy loss L = −Σ_i [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)], where ŷ denotes the predicted segmentation result and y the corresponding dense label. Notably, the optimization process during training is different from the one-shot inference process: the parameters are not optimized during one-shot inference, which is achieved purely via guidance and guided inference.
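A minimal training-step sketch with this pixel-wise cross-entropy loss might look like the following; `model` is a hypothetical wrapper around the guidance and segmentation branches, and the episode format follows the sampling sketch above.

```python
import torch.nn.functional as F

def train_step(model, optimizer, support, query):
    """One episodic training step (illustrative)."""
    (s_img, pos_clicks, neg_clicks), (q_img, q_mask) = support, query
    logits = model(s_img, pos_clicks, neg_clicks, q_img)   # (1, 2, H, W) fg/bg scores
    # q_mask: (1, H, W) tensor of {0, 1} dense labels for the query image
    loss = F.cross_entropy(logits, q_mask.long())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```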

Generation Branch
After training, the model parameters are fixed, and the model predicts segmentation results from several clicks. However, the predicted segmentation results have relatively rough edges, which do not satisfy the requirements of digital matting. Therefore, we further process the segmentation results in three main steps. First, we refine the segmentation result with a conditional random field (CRF) to increase the precision of the target edge region. Second, inspired by [5], we obtain the initial trimap by dilating and eroding the binary segmentation result: the foreground region of the eroded image is set as the foreground region of the trimap, the background region of the dilated image is set as the background region of the trimap, and the rest of the image is set as the unknown region. Finally, we use a deep matting model, trained on the Adobe image matting dataset [13], to obtain the final alpha matte.
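The erosion/dilation step could be sketched as follows (OpenCV); the fixed structuring-element size here is an assumption standing in for the dynamic width described above.

```python
import cv2
import numpy as np

def mask_to_trimap(mask, kernel_size=15):
    """Convert a binary segmentation mask into a trimap by erosion/dilation.

    mask : uint8 array (H, W), 1 for foreground, 0 for background
    Returns a trimap with 255 = foreground, 0 = background, 128 = unknown.
    """
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    eroded = cv2.erode(mask, kernel)      # shrunken mask: definite foreground
    dilated = cv2.dilate(mask, kernel)    # grown mask: outside it is definite background
    trimap = np.full(mask.shape, 128, dtype=np.uint8)
    trimap[eroded == 1] = 255
    trimap[dilated == 0] = 0
    return trimap
```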

Dataset
We chose the PASCAL VOC 2012 [42] dataset D_train to train the model. The PASCAL VOC 2012 dataset includes 21 classes (aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, tv/monitor, and background). In the first experiment, the performance of the proposed matting method was evaluated on the standard benchmark [45]. This dataset consists of 27 images with corresponding groundtruth alpha mattes and trimaps; the trimaps were generated by an experienced user with a paint tool using brushes of different sizes, and two versions are provided, denoted trimap-1 and trimap-2. In the second experiment, to verify generalization to unseen classes, we selected 20 images from the Adobe image matting dataset [13] with their corresponding groundtruth alpha mattes, as shown in Figure 7, where each row of pictures represents the same category. The criteria of our experiments include the mean absolute error, MAE = (1/n) Σ_{i=1}^{n} |x_i − y_i|, where n represents the number of pixels and x_i and y_i represent the predicted and groundtruth alpha values, respectively, and the intersection-over-union of the unknown regions of the trimaps, IoU = |U_pred ∩ U_gt| / |U_pred ∪ U_gt|, where U_pred and U_gt represent the unknown-region areas of the predicted and user-input trimaps, respectively. We split the training dataset into four parts, as shown in Table 1. We trained the network with D_train(l) separately, where l represents the semantic categories, and we use l_split to denote the partition of the dataset; e.g., l_0 means we use l_{i=1,2,3} for training and l_{i=0} for testing. For comparison with the standard benchmark and to obtain more accurate results, in the first experiment we selected all categories for training the model, D_train(l_oracle). To verify the guidance ability of the model, in the second and third experiments we selected only part of the categories for training the models, D_train(l_split). Table 1. Classes for each fold of PASCAL-5i. To ensure that the classes are disjoint, {l_train} ∩ {l_test} = ∅, we split the dataset into four parts; we used three of them for training and the remainder for testing.
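For reference, the two evaluation criteria can be computed as in the sketch below (NumPy); the 0/128/255 trimap encoding is an assumption about how the unknown region is stored.

```python
import numpy as np

def mae(alpha_pred, alpha_gt):
    """Mean absolute error between predicted and groundtruth alpha mattes."""
    return np.abs(alpha_pred.astype(float) - alpha_gt.astype(float)).mean()

def unknown_region_iou(trimap_pred, trimap_gt, unknown_value=128):
    """IoU between the unknown regions of a generated and a user-input trimap."""
    u_pred = trimap_pred == unknown_value
    u_gt = trimap_gt == unknown_value
    inter = np.logical_and(u_pred, u_gt).sum()
    union = np.logical_or(u_pred, u_gt).sum()
    return inter / union if union > 0 else 1.0
```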

Experiment 1:
We compare trimap precision directly in this section. We selected all images from the alpha matting dataset [45] to test our model. To obtain more accurate results, we set the support image and query image to be the same, so the trimap generation process can be understood as one-shot interactive segmentation. We used all classes, D_train(l_oracle), to train our model. A partial visualization of the experimental results is provided in Figure 8. We estimated the performance of the trimap generation method using two criteria. First, we performed deep matting [13] with the different inputs (the proposed method, the user-input trimap, and scribbles) and compared the matting results. As shown in Figure 8, the generated trimaps are visually similar to the user-input trimaps. However, there is still a performance gap between the generated trimap and the user-input trimap: since the user-input trimaps, selected from the benchmark matting dataset, were drawn by professional users, it is difficult for a generated trimap to reach this accuracy. The advantages of our algorithm lie in processing time and interaction workload. Moreover, the proposed algorithm is much more accurate than user scribbles (Table 2). We consider the MAE comparison between the proposed method and the scribble-based input fair, because our method takes images and user clicks as inputs, and the labeling time for clicks and scribbles is essentially the same. Second, we compared the unknown-region intersection-over-union (IoU) between the generated trimap and the user-input trimap. As shown in Table 3, the mean unknown-region IoU was close to 50% for both trimap-1 and trimap-2. These experiments demonstrate the feasibility and robustness of our algorithm.

Figure 8. (…[45]). The last column represents the user-input trimaps (selected from [45]); the second column, the trimaps generated by the proposed method; the third column, the optimized trimaps (by trimap trimming [15]).

Table 2. Mean absolute error (MAE) statistics of alpha matte computation on [45] (using [13]). The proposed method obtains an MAE close to that of trimap-based image matting and is much more accurate than scribble-based image matting.

Experiment 2:
We verify the guidance ability of our proposed method in this section. We used part of the classes, D_train(l_split), to train our model, which gave us four models with different parameters. Following the principle of category disjointness ({l_train} ∩ {l_test} = ∅), we used different rows of Figure 7 to verify the different models; e.g., we used the first row (person semantic class) to verify the model trained on D_train(l_0) (without the person semantic class). To obtain more accurate results, we again set the support image and query image to be the same. As in Experiment 1, we compared the MAE and IoU for the 20 test images. The experimental results are partially depicted in Figure 9. Although not every detail at the object margin is captured by the proposed algorithm (third column in Figure 9), the generated trimap is visually similar to the user-input trimap (fourth column in Figure 9). The statistical results are shown in Tables 4 and 5. The experimental results show that, although the test classes do not appear in the training classes, our model can still identify foreground targets from pixel-level annotations and generate relatively accurate trimaps. Thus, our model has guidance ability and the potential to generalize to unknown semantic classes. However, the main problem with the proposed approach is that one-shot segmentation computes a distance metric from every pixel in the query image to the foreground and background using the support image as guidance; the model will therefore misjudge when the foreground and background have similar representations. The fourth row in Figure 9b,c shows that the white hair of the dog is similar in color to the beach, causing the model to classify part of the white hair belonging to the foreground as background.

Experiment 3:

In this experiment, we compared the time consumption of the proposed method with that of interactive matting algorithms, closed-form matting [7] and Soft Scissors [10]. The experimental results are shown in Figure 10a. Our method is the fastest, mainly for two reasons. First, our system generates trimaps using a few clicks; compared with previous algorithms that require designing a complete trimap, the clicking time is negligible. Second, we use deep convolutional neural networks to obtain the final matting results; compared with sample-based or propagation-based algorithms that need many iterations, our algorithm requires only one forward pass of the model.

Experiment 4:

At first glance, the proposed method is similar to interactive matting algorithms such as graph cut. In this experiment, we therefore demonstrate the differences and further verify the model's ability to transmit guidance within the same semantic class. We selected more complex foreground images from [42] and set the support image and query image to be completely different (while keeping the semantic category consistent), as shown in Figure 11a. The first to last rows show different spatial positions, different foreground objects, and multiple foreground objects, respectively. The query image is overlaid with our predicted mask in green (Figure 11b). The experimental results show that even when the foreground target differs greatly in shape and spatial position between the query image and the support image, the task representation can still guide the query branch to a relatively accurate result. In contrast, graph-based methods cannot transmit a task representation between different images as our proposed method does.
To verify the advantages of our method over previous methods, we compare our system with interactive trimap segmentation methods (e.g., [14,17]); a partial visualization of the experimental results is provided in Figure 12, with the trimaps overlaid in RGB color. Our results (second row in Figure 12) are visually similar to those of the interactive segmentation methods (first row in Figure 12), since our deep-learning-based model better combines the deep features and the CRF further refines the segmentation results. Moreover, the results further demonstrate the guidance ability of our model, which transfers the semantic representation between different query and support images.

Figure 11. (a) The support set on the left with annotations and the query set on the right; (b) one-shot semantic segmentation result, where the query image is overlaid with our predicted result in green; (c) generated trimaps; and (d) our predicted alpha matte computed from (c) using [13]. First row: different spatial positions. Second row: different foreground objects. Last row: multiple foreground objects.
One question arises: how many pixel-level clicks are needed? Our experiments showed that the method is insensitive to the number of annotations, as shown in Figure 11b; the IoU accuracy does not increase beyond three clicks. Generalizing to an unseen class from one shot remains an ill-posed problem, since one-shot learning cannot cover the complete visual information of an unseen class. For example, matting a black, long-haired dog, as in Figure 7(16), cannot be achieved given a brown, short-haired dog, as in Figure 7(18), as guidance; the guidance struggles to make judgments when the color, texture, and pose are very different. A possible solution is to increase the number of shots, extending the method from one-shot to few-shot.

Figure 12. Comparison with the interactive trimap segmentation methods [14,17]. The experimental results show that our model transfers the semantic representation between query and support images, and the generated trimaps are sufficiently accurate.

Conclusions
In this paper, to reduce the user interaction workload, we adopted a one-shot-learning-based segmentation algorithm to generate trimaps for image matting. Our model needs only a few clicks from the user to generate a high-precision trimap. Compared with scribble-based image matting, our method is more robust; compared with trimap-based algorithms, it reduces the required user interaction time. At the same time, our model turns user-provided clicks into guidance, which enables interactive information to be shared within the same semantic class and minimizes the user interaction workload. However, the proposed method still has limitations when the foreground and background have similar features in the query image.