Article

Multilevel Features-Guided Network for Few-Shot Segmentation

Chenjing Xin, Xinfu Li and Yunfeng Yuan
1 School of Cyber Security and Computer, Hebei University, Baoding 071002, China
2 Machine Vision Engineering Research Center, Hebei University, Baoding 071002, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(19), 3195; https://doi.org/10.3390/electronics11193195
Submission received: 5 September 2022 / Revised: 28 September 2022 / Accepted: 28 September 2022 / Published: 5 October 2022

Abstract

The purpose of few-shot semantic segmentation is to segment unseen classes with only a few labeled samples. However, most methods ignore the guidance that low-level features can provide for segmentation, leading to unsatisfactory results. We therefore propose a multilevel features-guided network based on convolutional neural networks, which fully utilizes the features from every level. It includes two novel designs: (1) a similarity-guided feature reinforcement module (SRM) that uses features from different levels to provide sufficient guidance from the support set to the query set, avoiding the loss of feature information that occurs in deep network computation, and (2) a method that bridges the query features at each level to the decoder to guide segmentation, making full use of local and edge information to improve model performance. Experiments on the PASCAL-5 i and COCO-20 i datasets demonstrate the effectiveness of the model: in the 1-shot and 5-shot settings on PASCAL-5 i, our method reaches 64.7% and 68.0% mIoU, which are 3.9% and 6.1% higher than the baseline model, respectively, and the results on COCO-20 i are also improved.

1. Introduction

With the rapid development of deep neural networks [1,2,3,4,5,6,7], the research on semantic segmentation has made great progress [8,9,10,11]. These achievements can be largely attributed to the datasets with pixel-level manual annotation [12,13]. However, obtaining such annotated data is time-consuming and labor-intensive, and the performance of the model will drop significantly in the face of unseen classes. To alleviate the above problems and achieve considerable predicted segmentation results under the condition of giving only a few labeled samples, the few-shot semantic segmentation (FSS) method [14] was proposed. Many researchers have conducted a lot of work to improve the model’s performance, and few-shot semantic segmentation has become an active topic in the field of computer vision.
Few-shot semantic segmentation is similar to the few-shot learning approaches used for image classification and recognition [15,16,17]. The network is fed two sets of images, called the support set and the query set; the support set contains the corresponding ground-truth segmentation masks, and the network must predict masks for the query set. The role of the support set is to guide the prediction of the query set. In few-shot learning, the goal of few-shot image classification and recognition is to predict the class of an image or the location of the target class in the image (usually with bounding boxes), both of which require image-level prediction. Unlike these tasks, few-shot semantic segmentation requires pixel-level predictions, which is more complex and challenging. Early research [14,18] laid the basic model framework for FSS, and a large number of follow-up studies [19,20,21,22] built on these foundations with innovative improvements. To achieve fine pixel-level prediction, the network needs to be trained on a relatively large dataset to obtain its weights, while the data in the test phase contain classes that do not appear in the training phase. Since the network is more inclined to segment objects of known classes, obtaining fine segmentation predictions for unknown classes is not trivial. Recent studies have introduced meta-classes [23], global and local contrastive learning [24], cross-guidance between support and query [25], and, in the test phase, methods such as updating pixel classifiers for different unknown classes [26] or mining latent classes [27]. All of the above methods improve model performance. However, they make inadequate use of the multilevel features obtained in the encoding stage, thus losing part of the detailed information and limiting segmentation prediction accuracy.
Based on the above problem, we propose a few-shot semantic segmentation network that uses multilevel features to guide segmentation, making full use of every level's features from the support branch and the query branch to obtain more detailed segmentation predictions. Specifically, we use a novel module in which the support features at each level guide the query features at the same level, and we cross-connect the query features of each level to the decoder to guide segmentation. The motivations of our method are as follows: (1) During downsampling through the convolutional network, the deeper the network goes, the more abstract the features become and the more global the information they contain. Although the network can use these features to make segmentation predictions, some local and edge information is inevitably lost, so the support features cannot provide sufficient guidance for the query features, resulting in unsatisfactory results. Inspired by HSNet [28], we considered that if the multilevel query features participate in the segmentation prediction process under the guidance of the support features, the different information contained in features at different levels can be used more fully to improve accuracy. (2) For the abstracted high-level query features, the common approach is to send them directly to the decoder to obtain the predicted segmentation, but upsampling interpolation then introduces uncertainty and leads to segmentation errors.
The semantic information contained in the features obtained in the encoding stage can guide segmentation, and a cross-connection approach can be used to guide the decoder upsampling process to reduce uncertainty and improve segmentation quality. The main contributions of this paper are as follows:
  • We propose a similarity-guided feature reinforcement module (SRM) to enrich the query features and strengthen the guidance from the support features to the query features, so that more semantic information is available when predicting the segmentation of the query image.
  • Connecting the multilevel features of the encoder to the decoder provides more accurate segmentation guidance for the decoder upsampling, which effectively alleviates the accuracy loss caused by the deep convolution process; a reasonable approach that makes limited use of the cross-connected features also avoids the negative impact of non-critical information on segmentation.
  • Compared to previous work, our proposed method has improved performance on the PASCAL-5 i dataset and also performs well on the COCO-20 i dataset.

2. Related Work

2.1. Semantic Segmentation

The purpose of image semantic segmentation is to understand an image at the pixel level and assign an object class to each pixel. The introduction of the Fully Convolutional Network [10] marked the application of deep learning to image semantic segmentation. Many subsequent studies, such as SegNet [8], U-Net [9], PSPNet [11], DeepLab [29,30], DFANet [31], CRGNet [32], and ENet [33], have improved model performance and made many contributions to the field of image semantic segmentation. Although remarkable achievements have been made, current conventional semantic segmentation methods remain unsatisfactory when faced with unseen classes.

2.2. Few-Shot Learning

Few-shot learning refers to obtaining appropriate predictions when only a few labeled samples are given. It was originally used for image classification and image recognition. According to the approach taken, few-shot learning methods can be divided into metric-learning methods [17,34,35,36] and meta-learning-based methods [16,37,38,39,40,41]; graph neural network-based methods also have many applications in few-shot learning [15,42]. The goal of few-shot learning is to make image-level predictions, whereas the goal of few-shot semantic segmentation is to make pixel-level predictions, which is more challenging. Our work is closer to a meta-learning-based approach, but it also applies the idea of metric learning.

2.3. Few-Shot Semantic Segmentation

OSLSM [14] applied the few-shot learning method to the field of semantic segmentation for the first time. Since then, many studies [19,43,44,45,46] have added different structures to the basic model, making few-shot segmentation methods more diverse. SG-One [47] proposed masked global average pooling to extract target features. CANet [48] focused on mid-level features to obtain the hidden local features of the target. FWB [49] enhanced the segmentation ability by activating foreground features. DAN [50] introduced a democratized graph attention mechanism to activate more pixels on the object. BriNet [51] reduced intra-class variance by making the support and query features interact. SAGNN [52] designed a scale-aware graph neural network to capture cross-scale structural relationships. PFENet [53] proposed a prior mask generation method and constructed a feature enrichment module, which significantly improves few-shot segmentation. MANet [54] used a mask aggregation network to simultaneously generate a fixed number of masks and their probabilities of being targets. DCP [55] leveraged masked average pooling operations to derive a series of support-induced proxies, with different proxies conquering different challenges. Our method is constructed on the framework of PFENet [53]. Different from PFENet [53], our method pays more attention to the different information contained in features at all levels and their relationships, especially the local and edge information contained in low-level features.

3. Material and Methods

3.1. Problem Definition

The goal of few-shot semantic segmentation is to obtain suitable segmentation results from a trained network, even for unseen classes with only a small amount of annotated data. In few-shot semantic segmentation, the data of the training set $D_{train}$ and the test set $D_{test}$ are isolated from each other; that is, the classes of the training set $C_{train}$ and the classes of the test set $C_{test}$ are disjoint ($C_{train} \cap C_{test} = \emptyset$). The training set $D_{train}$ and the test set $D_{test}$ are each divided into a support set $S$ and a query set $Q$. For $k$-shot segmentation, $S = \{x_i^s, m_i^s\}_{i=1}^{k}$ contains $k$ support images, where $x_i^s \in \mathbb{R}^{3 \times H \times W}$ is the RGB information of a support image and $m_i^s \in \{0,1\}^{H \times W}$ is the corresponding binary segmentation mask; $Q = \{x^q, m^q\}$ usually contains one query image, where $x^q \in \mathbb{R}^{3 \times H \times W}$ is the RGB information of the query image and $m^q \in \{0,1\}^{H \times W}$ is the corresponding binary segmentation mask.
The main process of a few-shot semantic segmentation network is as follows: first, the support-query image pair $(x^s, x^q)$ is sent to the encoder to extract the support features $F_i^s$ and the query feature $F^q$; then, the support features $F_i^s$ are combined with the corresponding masks $m_i^s$ to guide the query feature $F^q$ and obtain the predicted mask $\hat{m}^q$; finally, the predicted mask $\hat{m}^q$ is compared with the ground-truth mask $m^q$. After the network is trained on the training set $D_{train}$ for several epochs, its weights are frozen, and the network is then evaluated on the test set $D_{test}$ to judge its performance.
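To make this episodic setup concrete, the following minimal sketch (our illustration, not the authors' released code; names such as `Episode`, `sample_episode`, and the `dataset.sample` helper are hypothetical) shows one way a k-shot support set and a query pair could be represented and sampled from a class-disjoint split.

```python
# Minimal, hypothetical sketch of the episodic data used in few-shot segmentation;
# tensor shapes follow the definitions above (k-shot support set S and query Q).
from dataclasses import dataclass
import random
import torch

@dataclass
class Episode:
    support_images: torch.Tensor  # k x 3 x H x W  (x_i^s)
    support_masks: torch.Tensor   # k x H x W, binary  (m_i^s)
    query_image: torch.Tensor     # 3 x H x W  (x^q)
    query_mask: torch.Tensor      # H x W, used for training and evaluation only

def sample_episode(dataset, classes, k=1):
    """Sample one episode: k support image/mask pairs and one query pair that
    all contain the same randomly chosen target class (C_train and C_test are disjoint)."""
    c = random.choice(classes)
    support = [dataset.sample(c) for _ in range(k)]   # dataset.sample is a hypothetical helper
    query_image, query_mask = dataset.sample(c)
    return Episode(
        support_images=torch.stack([img for img, _ in support]),
        support_masks=torch.stack([msk for _, msk in support]),
        query_image=query_image,
        query_mask=query_mask,
    )
```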

3.2. The Proposed Model

We propose a few-shot semantic segmentation network that uses multilevel features to guide segmentation prediction. By performing similarity-guided processing of features at different levels, the local information of low-level features is retained while the abstract information of high-level features is obtained. In the decoding process, query features at all levels are cross-connected to the decoder, and limited use is made of them when they guide each decoding stage, which reduces the adverse effects of non-critical information, makes full use of edge and local information, and enables the network to better predict segmentation. The main framework of the network is shown in Figure 1.
First, the support-query image pair $\{x^s, x^q\}$ is processed by the backbone network to obtain the corresponding features at three levels, $\{F_1^s, F_2^s, F_3^s\}$ and $\{F_1^q, F_2^q, F_3^q\}$:
$$F_1^s, F_2^s, F_3^s = F_b(x^s), \qquad F_1^q, F_2^q, F_3^q = F_b(x^q),$$
where $F_b(\cdot)$ denotes feature extraction by the backbone network; the encoders of the support branch and the query branch share the same weights. Then, to obtain rich semantic information, the extracted support features of each level $\{F_1^s, F_2^s, F_3^s\}$ and their masks $m^s$, combined with the query features $\{F_1^q, F_2^q, F_3^q\}$, pass through three independent similarity-guided feature reinforcement modules (SRMs) to produce the query feature $F_{SRM}$ containing rich information:
$$F_{SRM}^3 = F_s^3(F_3^s, F_3^q, m^s), \quad F_{SRM}^2 = F_s^2(F_2^s, F_2^q, m^s, F_{SRM}^3), \quad F_{SRM} = F_{SRM}^1 = F_s^1(F_1^s, F_1^q, m^s, F_{SRM}^2),$$
where $F_s^i(\cdot), i \in \{1,2,3\}$, denotes the computation of the $i$-th SRM, which we describe in detail in Section 3.3. After that, $F_{SRM}$ is fed into the Atrous Spatial Pyramid Pooling (ASPP) module, as described in Section 3.4, to obtain $Y^q$. Finally, $Y^q$ is fed into the decoder under the guidance of $\{F_1^q, F_2^q, F_3^q\}$ to obtain the predicted segmentation mask $\hat{m}^q$:
$$\hat{m}^q = F_d(F_d(F_d(Y^q, F_3^q), F_2^q), F_1^q),$$
where $F_d(\cdot)$ denotes the computation of the decoder with limited use of cross-connected feature guidance, which we describe in detail in Section 3.4. The main process of network training is summarized in Algorithm 1. The number of iterations and the pre-trained weights are introduced in Section 4.2.
Algorithm 1 The main process of training the multilevel features-guided network
Input: Support set $S = \{x^s, m^s\}$, Query set $Q = \{x^q, m^q\}$
1: Initialize the network nodes (the backbone is initialized with pre-trained weights)
2: for each iteration do
3:     Feed $x^s$ and $x^q$ to the backbone to get $F_i^s$ and $F_i^q$.
4:     Feed $(F_i^s, m^s)$ and $F_i^q$ to the SRMs to get $F_{SRM}^i$.
5:     Feed $(F_2^s, m^s)$ and $F_{SRM}^1$ to the ASPP to get $Y^q$.
6:     Feed $Y^q$ to the decoder to get the prediction $\hat{m}^q$.
7:     Compare $\hat{m}^q$ with $m^q$ and update the network nodes except the backbone.
8: end for
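As a rough illustration of Algorithm 1 and Equations (1)-(3), the sketch below shows one possible shape of the forward pass in PyTorch. The module names (backbone, srm1-srm3, aspp, decoder) are placeholders for the components described in Sections 3.3 and 3.4, and their exact interfaces are our assumptions, not the authors' implementation.

```python
# Illustrative skeleton of the forward pass in Algorithm 1 / Equations (1)-(3);
# the sub-modules are placeholders for the components of Sections 3.3 and 3.4.
import torch.nn as nn

class MFGNet(nn.Module):
    def __init__(self, backbone, srm1, srm2, srm3, aspp, decoder):
        super().__init__()
        self.backbone = backbone                 # frozen, pre-trained encoder F_b
        self.srm1, self.srm2, self.srm3 = srm1, srm2, srm3
        self.aspp, self.decoder = aspp, decoder

    def forward(self, x_s, m_s, x_q):
        # Equation (1): three-level features from the shared backbone
        f1_s, f2_s, f3_s = self.backbone(x_s)
        f1_q, f2_q, f3_q = self.backbone(x_q)

        # Equation (2): SRMs run from the high level down to the low level
        f_srm3 = self.srm3(f3_s, f3_q, m_s)
        f_srm2 = self.srm2(f2_s, f2_q, m_s, higher=f_srm3)
        f_srm1 = self.srm1(f1_s, f1_q, m_s, higher=f_srm2)

        # ASPP on the enriched query feature (see Section 3.4)
        y_q = self.aspp(f_srm1, f2_s, m_s)

        # Equation (3): decode under multilevel query-feature guidance
        return self.decoder(y_q, [f3_q, f2_q, f1_q])
```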

3.3. Similarity-Guided Feature Reinforcement Module (SRM)

As mentioned earlier, using only mid-level and high-level features when the support features guide the query features loses some local and edge information, and adding low-level features can alleviate this problem. However, simply adding low-level features introduces redundant non-critical information, which is instead detrimental to the effective guidance of the query features by the support features. Therefore, we use the support features with mask information for guidance while processing the query features of the three levels in turn, and we use the processed higher-level features to continue to guide the lower-level features, thus obtaining features with multiple levels of information. Based on this idea, we design the similarity-guided feature reinforcement module (SRM), whose structure is shown in Figure 2. First, the support features and their masks $\{F_i^s, m_i^s\}_{i=1}^{3}$ are fused to obtain the support features with mask information $\dot{F}_i^s$:
$$\dot{F}_i^s = F_i^s \odot m_i^s,$$
where $\odot$ is the Hadamard product. Then, the similarity $Sim$ between the support-query feature pair $\{\dot{F}_i^s, F_i^q\}_{i=1}^{3}$ is calculated using the cosine similarity principle:
$$Sim = \frac{\dot{F}_i^s \cdot F_i^q}{\|\dot{F}_i^s\|\,\|F_i^q\|}.$$
Finally, $Sim$ is combined with the support-query feature pair $\{\dot{F}_i^s, F_i^q\}_{i=1}^{3}$ to obtain the output feature $F_{SRM}^i$:
$$F_{SRM}^i = F_i^q \oplus ((Sim \otimes F_i^q) \oplus F_{SRM}^{higher}),$$
where $F_{SRM}^{higher}$ denotes the higher-level feature introduced in the first two SRMs (the last SRM does not have this input), $\otimes$ denotes matrix multiplication, and $\oplus$ denotes concatenation along the channel dimension.
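A compact sketch of the SRM computation in Equations (4)-(6) is given below. It reflects our reading of the module: the mask is resized to the feature resolution, the cosine similarity is computed per spatial position, and the similarity-weighted query feature is concatenated with the query feature and the higher-level SRM output. The similarity weighting is written as an element-wise broadcast and the 1 x 1 fusion convolution is an assumed detail, so this is illustrative rather than the authors' code.

```python
# Illustrative SRM sketch following Equations (4)-(6); fusion details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRM(nn.Module):
    def __init__(self, channels, has_higher=True):
        super().__init__()
        in_ch = channels * (3 if has_higher else 2)   # F_q, Sim-weighted F_q, (higher SRM output)
        self.fuse = nn.Sequential(nn.Conv2d(in_ch, channels, kernel_size=1),
                                  nn.ReLU(inplace=True))

    def forward(self, f_s, f_q, m_s, higher=None):
        # Equation (4): support feature with mask information (Hadamard product)
        mask = F.interpolate(m_s.unsqueeze(1).float(), size=f_s.shape[-2:], mode='nearest')
        f_s_dot = f_s * mask

        # Equation (5): cosine similarity between masked support and query features
        sim = F.cosine_similarity(f_s_dot, f_q, dim=1, eps=1e-7).unsqueeze(1)  # B x 1 x H x W

        # Equation (6): reinforce the query feature and concatenate along channels
        parts = [f_q, sim * f_q]
        if higher is not None:
            parts.append(F.interpolate(higher, size=f_q.shape[-2:],
                                       mode='bilinear', align_corners=True))
        return self.fuse(torch.cat(parts, dim=1))
```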
We designed this branched structure following SENet [2] so that the support features can provide guidance while the query features are processed. There are three independent SRMs in our model, corresponding to the low-, mid-, and high-level features of the support and query sets, respectively. According to CANet [48], using mid-level features can already yield satisfactory results, so the effects of different numbers of SRMs and of different ways of using the features on the experimental results are discussed in Section 5.2.

3.4. Multilevel Features-Guided Decoder

Following PFENet [53], the feature $F_{SRM}$ obtained after processing by the SRM modules is fed into the Atrous Spatial Pyramid Pooling (ASPP) module together with the pooled mid-level support feature with mask information $\dot{F}_2^s$ to obtain the new feature $Y^q$:
$$Y^q = F_a(F_{SRM} \oplus Pooling(\dot{F}_2^s)),$$
where $F_a(\cdot)$ denotes the ASPP processing, $F_{SRM}$ is calculated by Equation (2), and $\dot{F}_2^s$ is calculated by Equation (4). Here we replace the FEM of PFENet [53] with ASPP to reduce computational complexity.
The predicted segmentation can be obtained by decoding $Y^q$, but to take full advantage of the rich semantic information contained in each level of the query features, we combine the multilevel query features $\{F_1^q, F_2^q, F_3^q\}$ with the new feature $Y^q$ during decoding; used wisely, these features enable more detailed segmentation. However, if the query features are directly concatenated with $Y^q$, the decoding result deteriorates because too much non-critical information is introduced, so the useless information in the query features should be filtered out before they are combined with $Y^q$ in the decoder. Therefore, we make limited use of the introduced query features: we combine $Y^q$ with the query features and then add the result back to $Y^q$, as shown in Figure 3. Specifically, the query features $(F_i^q)_{i=1}^{3}$ are guided by $Y^q$ to obtain $(\dot{F}_i^q)_{i=1}^{3}$, which are then combined with $Y^q$ to update it:
$$\dot{F}_i^q = avgPooling(Y^q) \otimes F_i^q, \qquad Y^q \mathrel{+}= \dot{F}_i^q.$$
$Y^q$ then passes through the query high-level features $F_3^q$, mid-level features $F_2^q$, and low-level features $F_1^q$ in turn to guide segmentation in the decoder. According to previous work [23,53], decoding without multilevel feature guidance can already yield the expected results; therefore, the effect of different numbers of feature-guided stages in the decoder, and of different feature-guidance methods, is discussed in Section 5.2.
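The limited use of cross-connected query features described above (Figure 3) could be sketched as one decoder stage like the following; treating the globally average-pooled $Y^q$ as a channel-wise weight on the query feature and adding a 3 x 3 refinement convolution are our assumptions, not the authors' exact layers.

```python
# Illustrative sketch of one feature-guided decoder stage with "limited use" of F_i^q.
import torch.nn as nn
import torch.nn.functional as F

class GuidedDecoderStage(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.refine = nn.Sequential(nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                                    nn.ReLU(inplace=True))

    def forward(self, y_q, f_q):
        # Upsample Y^q to the resolution of the guiding query feature
        y_q = F.interpolate(y_q, size=f_q.shape[-2:], mode='bilinear', align_corners=True)
        # avgPooling(Y^q) acts as a channel-wise weight that filters the query feature
        weight = F.adaptive_avg_pool2d(y_q, output_size=1)   # B x C x 1 x 1
        f_q_dot = weight * f_q                               # limited, filtered guidance
        # Add the filtered feature back onto Y^q, then refine
        return self.refine(y_q + f_q_dot)
```

In the full decoder, three such stages would be applied in turn with $F_3^q$, $F_2^q$, and $F_1^q$ before the final classification layer.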

3.5. Loss Function

We use the cross-entropy loss $L = -\frac{1}{N}\sum_{j=1}^{N}\left[y_j \log p_j + (1 - y_j)\log(1 - p_j)\right]$ as our loss function, where $N$ is the total number of pixels, $y_j \in \{0,1\}$ is the label of pixel $j$ (0 for background, 1 for foreground), and $p_j$ is the predicted probability that the pixel is foreground. Following PFENet [53], an auxiliary loss is also used to improve the performance of the model. The auxiliary loss $L_{aux}$ is obtained through intermediate supervision between the coarse segmentation prediction decoded from $Y^q$ and the ground-truth binary segmentation mask of the query image. The final prediction of the network produces the final loss $L_{final}$, so the overall loss $L_{all}$ is expressed as:
$$L_{all} = \lambda L_{aux} + L_{final},$$
where λ represents the balance weight of the auxiliary loss, which we set to 1.0 in all experiments.
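Written with PyTorch's built-in two-class cross-entropy (equivalent to the binary form above for a foreground/background mask), the overall loss can be sketched as follows; the variable names are ours.

```python
# Sketch of the overall loss L_all = lambda * L_aux + L_final with pixel-wise cross-entropy.
import torch.nn.functional as F

def overall_loss(final_logits, aux_logits, query_mask, lam=1.0):
    """final_logits and aux_logits: B x 2 x H x W (background/foreground scores);
    query_mask: B x H x W with values in {0, 1}."""
    l_final = F.cross_entropy(final_logits, query_mask.long())
    l_aux = F.cross_entropy(aux_logits, query_mask.long())
    return lam * l_aux + l_final
```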

4. Implementation Details

4.1. Datasets

We evaluate the proposed model performance using two public datasets, PASCAL-5 i [14] and COCO-20 i [49], which are widely used in the field of few-shot semantic segmentation.
PASCAL-5 i was first used by Shaban et al. [14]; it is derived from the PASCAL VOC [12] dataset and augmented with SDS [56] annotations. The PASCAL VOC dataset contains images of 20 categories, which are evenly divided into 4 subsets; the i-th subset is called PASCAL-5 i, and each subset contains 5 categories.
COCO-20 i was first produced and used by Nguyen et al. [49] and is derived from MSCOCO [13]. COCO-20 i includes a total of 80 categories, which, similarly to PASCAL-5 i, are divided into 4 subsets; the i-th subset is called COCO-20 i, and each subset contains 20 categories. Compared with PASCAL-5 i, COCO-20 i has more categories and more images, so experiments on COCO-20 i are more challenging than those on PASCAL-5 i.
Following OSLSM [14] and FWB [49], for the four subsets of PASCAL-5 i and COCO-20 i we use a cross-validation scheme: three subsets are used as the training set to train the model, the remaining subset is used as the test set, and 1000 support-query pairs are randomly sampled from the test set for evaluation.
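For concreteness, the class split behind this cross-validation could look like the sketch below; the helper name and the contiguous, 0-based class indexing are assumptions rather than the exact split files used in the experiments.

```python
# Hypothetical helper for the fold splits (0-based, contiguous class indices assumed).
def fold_classes(fold, num_classes=20, classes_per_fold=5):
    """Return (train_classes, test_classes) for a PASCAL-5i style split.
    For COCO-20i, use num_classes=80 and classes_per_fold=20."""
    test = list(range(fold * classes_per_fold, (fold + 1) * classes_per_fold))
    train = [c for c in range(num_classes) if c not in test]
    return train, test

# Example: fold 0 of PASCAL-5i holds out the first 5 classes for testing
train_cls, test_cls = fold_classes(0)
```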

4.2. Experimental Setting

All experiments are conducted with the PyTorch framework. We select ResNet-50 and ResNet-101 [1] as backbones and initialize them with weights pre-trained on ImageNet [57]. During training, the ResNet weights remain frozen, and the input images are all cropped to 473 × 473. Following previous work [23,53,58], the batch sizes for PASCAL-5 i and COCO-20 i are 4 and 8, respectively, both with the SGD optimizer; the numbers of training epochs are 200 and 50, respectively, and the learning rates are 0.0025 and 0.005. The poly strategy is used to adjust the learning rate, i.e., $lr_{new} = lr_{base} \times \left(1 - \frac{current\ epoch}{max\ epoch}\right)^{power}$, where $power = 0.9$. All experiments are performed on an NVIDIA RTX 3090 GPU.
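The poly schedule in the formula above can be realized, for example, with PyTorch's LambdaLR; a small sketch under the PASCAL-5 i settings quoted above (base learning rate 0.0025, 200 epochs, power 0.9) follows, with a placeholder module standing in for the trainable parts of the network.

```python
# Poly learning-rate schedule: lr_new = lr_base * (1 - current_epoch / max_epoch) ** power
import torch
from torch.optim.lr_scheduler import LambdaLR

def poly_scheduler(optimizer, max_epoch, power=0.9):
    return LambdaLR(optimizer, lr_lambda=lambda epoch: (1 - epoch / max_epoch) ** power)

model = torch.nn.Conv2d(3, 2, kernel_size=1)   # placeholder standing in for the trainable layers
optimizer = torch.optim.SGD(model.parameters(), lr=0.0025)
scheduler = poly_scheduler(optimizer, max_epoch=200)
# call scheduler.step() once per training epoch
```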

4.3. Evaluation Metrics

Following CANet [48] and BriNet [51], we adopt the mean intersection over union (mIoU) as our major evaluation metric, because the foreground-background intersection over union (FB-IoU) does not reflect model capability well; we nevertheless also report FB-IoU for comparison. The mIoU is calculated as $mIoU = \frac{1}{C}\sum_{i=1}^{C} IoU_i$, where $C$ is the number of classes in each fold and $IoU_i = \frac{TP_i}{TP_i + FP_i + FN_i}$, with $TP$, $FP$, and $FN$ denoting the counts of true positives, false positives, and false negatives, respectively.
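A straightforward way to accumulate the per-class IoU and the resulting mIoU over the test episodes is sketched below; these helpers are our own illustration, not the evaluation code of the cited works.

```python
# Sketch of mIoU accumulation: IoU_i = TP_i / (TP_i + FP_i + FN_i), mIoU = mean over classes.
import torch

def update_iou_stats(pred, target, cls, stats):
    """pred and target are H x W binary masks for one episode; stats maps class -> [TP, FP, FN]."""
    tp = ((pred == 1) & (target == 1)).sum().item()
    fp = ((pred == 1) & (target == 0)).sum().item()
    fn = ((pred == 0) & (target == 1)).sum().item()
    acc = stats.setdefault(cls, [0, 0, 0])
    acc[0] += tp; acc[1] += fp; acc[2] += fn

def mean_iou(stats):
    ious = [tp / (tp + fp + fn + 1e-10) for tp, fp, fn in stats.values()]
    return sum(ious) / len(ious)
```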

5. Results

5.1. Comparison with Other Methods

We compared our experimental results with few-shot semantic segmentation methods from recent years.
Table 1 presents the results of training and testing the model on the PASCAL-5 i dataset, including the mIoU of each subset and the overall mean mIoU and FB-IoU in the 1-shot and 5-shot settings. Our method achieves considerable results when the backbone networks are the same, especially in the 1-shot setting with ResNet-50 as the backbone, where it is significantly better than the other methods: a 3.9% improvement over PFENet [53] and a 0.7% improvement over HSNet [28].
Table 2 presents the results of training and testing the model on the COCO-20 i dataset, again including the mIoU of each subset and the overall mean mIoU and FB-IoU in the 1-shot and 5-shot settings. Our method shows a slight performance improvement over some previous work and is competitive in the 1-shot setting: a 0.7% improvement over HSNet [28] with ResNet-50 as the backbone, a 3.7% improvement over PFENet [53] with ResNet-101 as the backbone, and a 1.0% improvement over HSNet [28] with ResNet-101 as the backbone.
However, our model does not perform well in the 5-shot setting, especially on COCO-20 i with ResNet-101 as the backbone, where it even shows degraded performance. One reason may be the limitations of the hardware used, which cannot support higher-resolution inputs (images are cropped to 473 × 473) or more than 50 training epochs, so those experimental results are not satisfactory.
The qualitative results are shown in Figure 4, where we use PFENet [53] as the baseline. From the figure, we can see that our method is more accurate in detail while segmenting the target effectively.

5.2. Ablation Study

To verify the effectiveness of the proposed method, we conducted experiments on Fold-0, a subset of PASCAL-5 i, using ResNet-50 as the backbone, and compared the impact of the different components of our approach on the experimental results.

5.2.1. Number of SRMs

As shown in Table 3, we tested the model performance with different numbers of SRMs and compared the results. When the high-level SRM is used, the results are close to those obtained with SRMs at all levels and are significantly better than the results obtained after removing the high-level SRM. This may be because high-level features typically contain highly abstracted image information that plays a critical role in computing the prediction. With only a single low-level SRM, the support features can provide some guidance to the query features, but the effect is minimal. When the high-level SRM is added, model performance improves significantly and surpasses that of the high-level SRM alone, which demonstrates the relevance of using SRMs with multilevel features.

5.2.2. Number of Features Cross-Connect to the Decoder

As shown in Table 4, we tested and compared the results with different numbers of features cross-connected to the decoder. When only the high-level features are cross-connected to the decoder for segmentation prediction, the performance is even worse than without any cross-connected features. This may be because the high-level features are already global and abstract enough, so cross-connecting them to the decoder does not provide more segmentation information but instead worsens the result by introducing non-critical information. Although the low-level features contain some interference from non-target information, the limited use of features described in Section 3.4 effectively filters out most of the useless information and provides more accurate segmentation guidance, and performance is improved by using multiple levels of cross-connected features.

5.2.3. Different Methods of Using Multilevel Features to Guide Segmentation

As shown in Table 5, we tested and compared different feature-guidance methods in the decoder. Directly concatenating the features along the channel dimension performs worse than decoding without cross-connected feature guidance, and element-wise multiplication and matrix multiplication bring no significant improvement, whereas the limited use of features described in Section 3.4 significantly improves model performance. Because of the good performance of the low-level features in Section 5.2.2, we add a set of experiments that use only low-level features to guide decoding for comparison; the results are shown in the last column of Table 5. When only low-level features are used, if the features are not used in a limited manner, the results are also worse than those with limited use of the features.

6. Conclusions

We propose a multilevel features-guided network for few-shot semantic segmentation. In this network, we design a similarity-guided feature reinforcement module to better implement the guidance of the support features to the query features while processing query features at all levels. In the decoder, we cross-connect the multilevel features of the query image to guide the segmentation prediction and make limited use of these cross-connected features, so that the prediction contains sufficient local and edge information. Our method is trained and tested on the PASCAL-5 i and COCO-20 i datasets and obtains good results. Our method also has shortcomings. During the guidance of the support features to the query features, only the foreground information is used; reasonable processing of the background information could further improve the generalization ability of the model. In addition, the network structure is not light enough: for example, the lower-level SRMs are guided by both the support features and the output of the higher-level SRM, and the redundant information in these two guiding features may lead to heavy computation. We hope to address these issues in future work.

Author Contributions

Conceptualization, C.X. and X.L.; methodology, C.X. and X.L.; software, C.X. and Y.Y.; validation, C.X.; formal analysis, C.X. and X.L.; investigation, C.X.; resources, C.X. and Y.Y.; data curation, C.X.; writing—original draft preparation, C.X.; writing—review and editing, C.X., X.L. and Y.Y.; visualization, C.X.; supervision, X.L.; project administration, C.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; IEEE Computer Society: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar]
  2. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef] [Green Version]
  3. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef] [Green Version]
  4. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; Bengio, Y., LeCun, Y., Eds.; Conference Track Proceedings. [Google Scholar]
  5. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 7–12 June 2015; IEEE Computer Society: Piscataway, NJ, USA, 2015; pp. 1–9. [Google Scholar]
  6. Xu, Y.; Du, B.; Zhang, L. Robust Self-Ensembling Network for Hyperspectral Image Classification. IEEE Trans. Neural Netw. Learn. Syst. 2022; 1–14, in press. [Google Scholar] [CrossRef]
  7. Xu, Y.; Du, B.; Zhang, L. Self-Attention Context Network: Addressing the Threat of Adversarial Attacks for Hyperspectral Image Classification. IEEE Trans. Image Process. 2021, 30, 8671–8685. [Google Scholar] [CrossRef]
  8. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  9. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., III, Wells, W.M., Frangi, A.F., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2015; Volume 9351, pp. 234–241. [Google Scholar]
  10. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 7–12 June 2015; IEEE Computer Society: Piscataway, NJ, USA, 2015; pp. 3431–3440. [Google Scholar]
  11. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Piscataway, NJ, USA, 2017; pp. 6230–6239. [Google Scholar]
  12. Everingham, M.; Gool, L.V.; Williams, C.K.I.; Winn, J.M.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef] [Green Version]
  13. Lin, T.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision—ECCV 2014; Fleet, D.J., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2014; Volume 8693, pp. 740–755. [Google Scholar]
  14. Shaban, A.; Bansal, S.; Liu, Z.; Essa, I.; Boots, B. One-Shot Learning for Semantic Segmentation. In Proceedings of the British Machine Vision Conference 2017, London, UK, 4–7 September 2017. [Google Scholar]
  15. Satorras, V.G.; Estrach, J.B. Few-Shot Learning with Graph Neural Networks. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. Conference Track Proceedings. [Google Scholar]
  16. Graves, A.; Wayne, G.; Danihelka, I. Neural Turing Machines. arXiv 2014, arXiv:1410.5401. [Google Scholar]
  17. Wang, P.; Liu, L.; Shen, C.; Huang, Z.; van den Hengel, A.; Shen, H.T. Multi-attention Network for One Shot Learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Piscataway, NJ, USA, 2017; pp. 6212–6220. [Google Scholar]
  18. Rakelly, K.; Shelhamer, E.; Darrell, T.; Efros, A.A.; Levine, S. Conditional Networks for Few-Shot Semantic Segmentation. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. Workshop Track Proceedings. [Google Scholar]
  19. Bhunia, A.K.; Bhunia, A.K.; Ghose, S.; Das, A.; Roy, P.P.; Pal, U. A deep one-shot network for query-based logo retrieval. Pattern Recognit. 2019, 96, 106965. [Google Scholar] [CrossRef] [Green Version]
  20. Park, Y.; Seo, J.; Moon, J. CAFENet: Class-Agnostic Few-Shot Edge Detection Network. In Proceedings of the 32nd British Machine Vision Conference 2021, Online, 22–25 November 2021; p. 275. [Google Scholar]
  21. Yang, Y.; Meng, F.; Li, H.; Ngan, K.N.; Wu, Q. A New Few-shot Segmentation Network Based on Class Representation. In Proceedings of the 2019 IEEE Visual Communications and Image Processing (VCIP 2019), Sydney, Australia, 1–4 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4. [Google Scholar]
  22. Zhu, K.; Zhai, W.; Cao, Y. Self-Supervised Tuning for Few-Shot Segmentation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, 11–17 July 2020; Bessiere, C., Ed.; pp. 1019–1025. [Google Scholar]
  23. Wu, Z.; Shi, X.; Lin, G.; Cai, J. Learning Meta-class Memory for Few-Shot Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 497–506. [Google Scholar]
  24. Liu, W.; Wu, Z.; Ding, H.; Liu, F.; Lin, J.; Lin, G. Few-Shot Segmentation with Global and Local Contrastive Learning. arXiv 2021, arXiv:2108.05293. [Google Scholar]
  25. Zhang, B.; Xiao, J.; Qin, T. Self-Guided and Cross-Guided Learning for Few-Shot Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2021), Virtual, 19–25 June 2021; pp. 8312–8321. [Google Scholar]
  26. Lu, Z.; He, S.; Zhu, X.; Zhang, L.; Song, Y.; Xiang, T. Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 8721–8730. [Google Scholar]
  27. Yang, L.; Zhuo, W.; Qi, L.; Shi, Y.; Gao, Y. Mining Latent Classes for Few-shot Segmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 8701–8710. [Google Scholar]
  28. Min, J.; Kang, D.; Cho, M. Hypercorrelation Squeeze for Few-Shot Segmentation. arXiv 2021, arXiv:2104.01538. [Google Scholar]
  29. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef] [Green Version]
  30. Chen, L.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; Volume 11211, pp. 833–851. [Google Scholar]
  31. Li, H.; Xiong, P.; Fan, H.; Sun, J. DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 16–20 June 2019; pp. 9522–9531. [Google Scholar]
  32. Xu, Y.; Ghamisi, P. Consistency-Regularized Region-Growing Network for Semantic Segmentation of Urban Scenes With Point-Level Annotations. IEEE Trans. Image Process. 2022, 31, 5038–5051. [Google Scholar] [CrossRef]
  33. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv 2016, arXiv:1606.02147. [Google Scholar]
  34. Snell, J.; Swersky, K.; Zemel, R.S. Prototypical Networks for Few-shot Learning. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R., Eds.; pp. 4077–4087. [Google Scholar]
  35. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.S.; Hospedales, T.M. Learning to Compare: Relation Network for Few-Shot Learning. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1199–1208. [Google Scholar]
  36. Li, W.; Wang, L.; Xu, J.; Huo, J.; Gao, Y.; Luo, J. Revisiting Local Descriptor Based Image-To-Class Measure for Few-Shot Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 7260–7268. [Google Scholar]
  37. Santoro, A.; Bartunov, S.; Botvinick, M.M.; Wierstra, D.; Lillicrap, T.P. Meta-Learning with Memory-Augmented Neural Networks. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1842–1850. [Google Scholar]
  38. Thrun, S.; Pratt, L.Y. Learning to Learn: Introduction and Overview. In Learning to Learn; Thrun, S., Pratt, L.Y., Eds.; Springer: Cham, Switzerland, 1998; pp. 3–17. [Google Scholar]
  39. Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Precup, D., Teh, Y.W., Eds.; Volume 70, pp. 1126–1135. [Google Scholar]
  40. Wang, X.; Yu, F.; Wang, R.; Darrell, T.; Gonzalez, J.E. TAFE-Net: Task-Aware Feature Embeddings for Low Shot Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1831–1840. [Google Scholar]
  41. Yu, M.; Guo, X.; Yi, J.; Chang, S.; Potdar, S.; Cheng, Y.; Tesauro, G.; Wang, H.; Zhou, B. Diverse Few-Shot Text Classification with Multiple Metrics. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Volume 1 (Long Papers), pp. 1206–1215. [Google Scholar]
  42. Gidaris, S.; Komodakis, N. Dynamic Few-Shot Visual Learning Without Forgetting. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4367–4375. [Google Scholar]
  43. Siam, M.; Oreshkin, B.N.; Jägersand, M. AMP: Adaptive Masked Proxies for Few-Shot Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 5248–5257. [Google Scholar]
  44. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. PANet: Few-Shot Image Semantic Segmentation With Prototype Alignment. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9196–9205. [Google Scholar]
  45. Zhang, C.; Lin, G.; Liu, F.; Guo, J.; Wu, Q.; Yao, R. Pyramid Graph Networks With Connection Attentions for Region-Based One-Shot Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9586–9594. [Google Scholar]
  46. Yang, B.; Liu, C.; Li, B.; Jiao, J.; Ye, Q. Prototype Mixture Models for Few-Shot Semantic Segmentation. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; Volume 12353, pp. 763–778. [Google Scholar]
  47. Zhang, X.; Wei, Y.; Yang, Y.; Huang, T.S. SG-One: Similarity Guidance Network for One-Shot Semantic Segmentation. IEEE Trans. Cybern. 2020, 50, 3855–3865. [Google Scholar] [CrossRef] [PubMed]
  48. Zhang, C.; Lin, G.; Liu, F.; Yao, R.; Shen, C. CANet: Class-Agnostic Segmentation Networks With Iterative Refinement and Attentive Few-Shot Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5217–5226. [Google Scholar]
  49. Nguyen, K.; Todorovic, S. Feature Weighting and Boosting for Few-Shot Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 622–631. [Google Scholar]
  50. Wang, H.; Zhang, X.; Hu, Y.; Yang, Y.; Cao, X.; Zhen, X. Few-Shot Semantic Segmentation with Democratic Attention Networks. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; Volume 12358, pp. 730–746. [Google Scholar]
  51. Yang, X.; Wang, B.; Zhou, X.; Chen, K.; Yi, S.; Ouyang, W.; Zhou, L. BriNet: Towards Bridging the Intra-class and Inter-class Gaps in One-Shot Segmentation. In Proceedings of the 31st British Machine Vision Conference 2020, Virtual Event, UK, 7–10 September 2020. [Google Scholar]
  52. Xie, G.; Liu, J.; Xiong, H.; Shao, L. Scale-Aware Graph Neural Network for Few-Shot Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 5475–5484. [Google Scholar]
  53. Tian, Z.; Zhao, H.; Shu, M.; Yang, Z.; Li, R.; Jia, J. Prior Guided Feature Enrichment Network for Few-Shot Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1050–1065. [Google Scholar] [CrossRef]
  54. Ao, W.; Zheng, S.; Meng, Y. Few-shot semantic segmentation via mask aggregation. arXiv 2022, arXiv:2202.07231. [Google Scholar]
  55. Lang, C.; Tu, B.; Cheng, G.; Han, J. Beyond the Prototype: Divide-and-conquer Proxies for Few-shot Segmentation. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, Vienna, Austria, 23–29 July 2022; pp. 1024–1030. [Google Scholar]
  56. Hariharan, B.; Arbelaez, P.; Bourdev, L.D.; Maji, S.; Malik, J. Semantic contours from inverse detectors. In Proceedings of the IEEE International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 991–998. [Google Scholar]
  57. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.S.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  58. Xie, G.; Xiong, H.; Liu, J.; Yao, Y.; Shao, L. Few-Shot Semantic Segmentation with Cyclic Memory Network. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7273–7282. [Google Scholar]
Figure 1. The framework of our multilevel features-guided network. The query image and the support image are fed into the backbone (shared weights) to obtain three-level features (inside the larger dotted box), blue/yellow parts represent query/support features. There are three similarity-guided feature reinforcement modules (SRM) that sequentially process three-level query/support feature pairs. The three-level query features (small blue circles) are cross-connected to the decoder part for segmentation guidance. GAP and ASPP are abbreviations for Global Average Pooling and Atrous Spatial Pyramid Pooling respectively.
Figure 2. Similarity-Guided Feature Reinforcement Module (SRM). The dotted line on the "Higher SRM Feature" means that the last SRM module does not have this structure.
Figure 3. Method by which the multilevel features guide segmentation in the decoder.
Figure 4. Qualitative results of our approach on Pascal-5 i in 1-shot setting. From top to bottom: (a) support images with ground truth masks, (b) query images with ground truth masks, (c) predictions of baseline, (d) predictions of our approach.
Table 1. Results and comparison of mIoU and FB-IoU on the four folds of PASCAL-5 i. Bold numbers represent the best performance.
Backbone | Method | 1-Shot: Fold-0 | Fold-1 | Fold-2 | Fold-3 | Mean | FB-IoU | 5-Shot: Fold-0 | Fold-1 | Fold-2 | Fold-3 | Mean | FB-IoU
ResNet50 | CANet 2019 [48] | 52.5 | 65.9 | 51.3 | 51.9 | 55.4 | 66.2 | 55.5 | 67.8 | 51.9 | 53.2 | 57.1 | 69.6
ResNet50 | BriNet 2020 [51] | 56.5 | 67.2 | 51.6 | 53.0 | 57.1 | - | - | - | - | - | - | -
ResNet50 | PFENet 2020 [53] | 61.7 | 69.5 | 55.4 | 56.3 | 60.8 | 73.3 | 63.1 | 70.7 | 55.8 | 57.9 | 61.9 | 73.9
ResNet50 | CMN 2021 [58] | 64.3 | 70.0 | 57.4 | 59.4 | 62.8 | 72.3 | 65.8 | 70.4 | 57.6 | 60.8 | 63.7 | 72.8
ResNet50 | HSNet 2021 [28] | 64.3 | 70.7 | 60.3 | 60.5 | 64.0 | 76.7 | 70.3 | 73.2 | 67.4 | 67.1 | 69.5 | 80.6
ResNet50 | MANet 2022 [54] | 62.0 | 69.4 | 51.8 | 58.2 | 60.3 | 71.4 | 66.0 | 71.6 | 55.1 | 64.5 | 64.3 | 75.2
ResNet50 | DCP 2022 [55] | 63.8 | 70.5 | 61.2 | 55.7 | 62.8 | 75.6 | 67.2 | 73.2 | 66.4 | 64.5 | 67.8 | 79.7
ResNet50 | Ours | 64.4 | 70.8 | 63.4 | 60.3 | 64.7 | 76.4 | 67.3 | 73.7 | 66.2 | 64.9 | 68.0 | 79.3
ResNet101 | FWB 2019 [49] | 51.3 | 64.5 | 56.7 | 52.2 | 56.2 | - | 54.8 | 67.4 | 62.2 | 55.3 | 59.9 | -
ResNet101 | PFENet 2020 [53] | 60.5 | 69.4 | 54.4 | 55.9 | 60.1 | 72.9 | 62.8 | 70.4 | 54.9 | 57.6 | 61.4 | 73.5
ResNet101 | DAN 2020 [50] | 54.7 | 68.6 | 57.8 | 51.6 | 58.2 | 71.9 | 57.9 | 69.0 | 60.1 | 54.9 | 60.5 | 72.3
ResNet101 | HSNet 2021 [28] | 67.3 | 72.3 | 62.0 | 63.1 | 66.2 | 77.6 | 71.8 | 74.4 | 67.0 | 68.3 | 70.4 | 80.6
ResNet101 | MANet 2022 [54] | 63.9 | 69.2 | 52.5 | 59.1 | 61.2 | 71.4 | 67.0 | 70.8 | 54.8 | 65.5 | 64.5 | 74.1
ResNet101 | Ours | 66.1 | 72.8 | 64.9 | 62.0 | 66.5 | 76.8 | 67.6 | 74.5 | 67.2 | 65.4 | 68.7 | 79.6
Table 2. Results and comparison of mIoU and FB-IoU on the four folds of COCO-20 i. Bold numbers represent the best performance.
Backbone | Method | 1-Shot: Fold-0 | Fold-1 | Fold-2 | Fold-3 | Mean | FB-IoU | 5-Shot: Fold-0 | Fold-1 | Fold-2 | Fold-3 | Mean | FB-IoU
ResNet50 | BriNet 2020 [51] | 32.9 | 36.2 | 37.4 | 30.9 | 34.4 | - | - | - | - | - | - | -
ResNet50 | CMN 2021 [58] | 37.9 | 44.8 | 38.7 | 35.6 | 39.3 | 61.7 | 42.0 | 50.5 | 41.0 | 38.9 | 43.1 | 63.3
ResNet50 | HSNet 2021 [28] | 36.3 | 43.1 | 38.7 | 38.7 | 39.2 | 68.2 | 43.3 | 51.3 | 48.2 | 45.0 | 46.9 | 70.7
ResNet50 | MANet 2022 [54] | 33.9 | 40.6 | 35.7 | 35.2 | 36.4 | - | 41.9 | 49.1 | 43.2 | 42.7 | 44.2 | -
ResNet50 | DCP 2022 [55] | 40.9 | 43.8 | 42.6 | 38.3 | 41.4 | - | 45.9 | 49.7 | 43.7 | 46.7 | 46.5 | -
ResNet50 | Ours | 40.8 | 45.5 | 41.1 | 39.1 | 41.6 | 65.2 | 46.1 | 52.3 | 46.2 | 44.3 | 47.2 | 69.1
ResNet101 | FWB 2019 [49] | 17.0 | 18.0 | 21.0 | 28.9 | 21.2 | - | 19.1 | 21.5 | 23.9 | 30.1 | 23.7 | -
ResNet101 | DAN 2020 [50] | - | - | - | - | 24.4 | 62.3 | - | - | - | - | 29.6 | 63.9
ResNet101 | PFENet 2020 [53] | 36.8 | 41.8 | 38.7 | 36.7 | 38.5 | 63.0 | 40.4 | 46.8 | 43.2 | 40.5 | 42.7 | 65.8
ResNet101 | HSNet 2021 [28] | 37.2 | 44.1 | 32.4 | 41.3 | 41.2 | 69.1 | 45.9 | 53.0 | 51.8 | 47.1 | 49.5 | 72.4
ResNet101 | Ours | 41.0 | 45.6 | 40.6 | 39.6 | 41.7 | 65.6 | 46.5 | 53.1 | 45.6 | 43.2 | 47.1 | 69.1
Table 3. Comparison of the results of ablation experiments at different numbers of SRMs on PASCAL-5 i.
SRM (Low-Level | Mid-Level | High-Level) | Mean-IoU
---57.5
-63.1
-64.4
-58.5
--58.2
--63.6
65.7
Table 4. Comparison of experimental results of ablation with different numbers of features cross-connected to the decoder on PASCAL-5 i.
Query Features (Low-Level | Mid-Level | High-Level) | Mean-IoU
---64.2
-63.9
-64.4
--62.8
--64.9
65.7
Table 5. Experimental results of ablation with different feature-guided methods on the decoder.
Feature-Guided Method | Mean-IoU (Three-Level) | Mean-IoU (Low-Level)
None (no feature guidance) | 64.2 | 64.2
Concatenate by channel | 61.1 | 62.5
Element-wise product | 63.6 | 63.3
Matmul product | 64.4 | 63.1
Limited use of features | 65.7 | 64.9
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
