Article

Symmetry-Aware Superpixel-Enhanced Few-Shot Semantic Segmentation

by
Lan Guo
1,†,
Xuyang Li
1,†,
Jinqiang Wang
1,
Yuqi Tong
1,
Jie Xiao
1,
Rui Zhou
1,
Ling-Huey Li
2,
Qingguo Zhou
1,* and
Kuan-Ching Li
2,*
1
School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China
2
Department of Computer Science and Information Engineering, Providence University, Taichung 43301, Taiwan
*
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Symmetry 2025, 17(10), 1726; https://doi.org/10.3390/sym17101726
Submission received: 5 September 2025 / Revised: 29 September 2025 / Accepted: 9 October 2025 / Published: 14 October 2025
(This article belongs to the Special Issue Symmetry in Process Optimization)

Abstract

Few-Shot Semantic Segmentation (FSS) faces significant challenges in modeling complex backgrounds and maintaining prediction consistency due to limited training samples. Existing methods oversimplify backgrounds as single negative classes and rely solely on pixel-level alignments. To address these issues, we propose a symmetry-aware superpixel-enhanced FSS framework with a symmetric dual-branch architecture that explicitly models the superpixel region-graph in both the support and query branches. First, top–down cross-layer fusion injects low-level edge and texture cues into high-level semantics to build a more complete representation of complex backgrounds, improving foreground–background separability and boundary quality. Second, images are partitioned into superpixels and aggregated into “superpixel tokens” to construct a Region Adjacency Graph (RAG). Support-set prototypes are used to initialize query-pixel predictions, which are then projected into the superpixel space for cross-image prototype alignment with support superpixels. We further perform message passing/energy minimization on the RAG to enhance intra-region consistency and boundary adherence, and finally back-project the predictions to the pixel space. Lastly, by aggregating homogeneous semantic information, we construct robust foreground and background prototype representations, enhancing the model’s ability to perceive both seen and novel targets. Extensive experiments on the PASCAL-5^i and COCO-20^i benchmarks demonstrate that our proposed model achieves superior segmentation performance over the baseline and remains competitive with existing FSS methods.

1. Introduction

Semantic segmentation, a core task in computer vision, has broad applicability across medical image analysis, autonomous driving, robotic vision, and remote sensing image interpretation [1,2]. However, the stringent demand for large-scale, pixel-level annotations in conventional segmentation methods makes data collection and labeling prohibitively expensive, substantially limiting their practicality in real-world scenarios. To alleviate this issue, few-shot semantic segmentation (FSS) has emerged, which aims to achieve rapid adaptation and accurate segmentation of novel classes under conditions of extreme annotation scarcity.
Few-shot segmentation (FSS) typically uses a dual-branch architecture (support and query). The goal is to learn generalizable representations from the support set and quickly adapt them for segmenting the query image. Prototype learning, a common approach, extracts class prototypes from the support set and uses pixel–prototype matching on the query image for fast adaptation. However, FSS faces challenges due to limited supervision, distribution shifts, and noise in pixel-level decisions, leading to issues such as foreground–background confusion, inconsistent intra-region predictions, and inaccurate boundaries, especially in complex scenes with varying object scales [3,4,5].
Most existing prototype-based methods rely on pixel-level prototype alignment and tend to oversimplify the background into a single negative prototype, which induces foreground–background confusion and biased matching in complex scenes. Simultaneously, the absence of explicit regional topological constraints in pixel-level decisions often results in discontinuities within objects and boundary bleeding. To mitigate these limitations, recent studies have approached the problem from multiple angles: for example, cross-domain FSS explores automatically generated prompts and interactions with large models to boost generalization and transferability; intrinsic feature enhancement and consistency modeling strive to extract more discriminative and alignable semantic representations from limited samples, thereby stabilizing pixel-level decisions [6]; in addition, boundary- and structure-aware enhancements have been shown to significantly reduce misclassification and omission around object edges [7]. Despite these efforts, several critical gaps remain insufficiently addressed in current few-shot semantic segmentation research. First, background oversimplification persists, as most existing methods treat diverse backgrounds as a uniform negative class, failing to capture the inherent complexity and heterogeneity of real-world background distributions. This oversimplification introduces systematic bias during prototype matching and undermines the model’s ability to distinguish subtle foreground–background boundaries. Second, representations of complex backgrounds remain incomplete—relying solely on high-level semantics may overlook fine-grained low-level cues such as edges and textures, which are crucial for accurate boundary localization and foreground–background separability. Third, a lack of region-level consistency and boundary handling arises from purely pixel-level prototype alignment, which lacks explicit constraints on regional topology and adjacency, making it difficult to guarantee intra-region consistency and precise boundary alignment. Finally, the need for multi-prototype alignment becomes evident, as single-prototype strategies fail to capture intra-class variation and cannot adequately handle diverse object appearances within the same semantic category.
To this end, we propose a symmetry-aware superpixel-enhanced few-shot semantic segmentation method. The core idea is to explicitly perform symmetric superpixel region-graph modeling in both the support and query branches. First, through top–down cross-layer fusion, we inject fine-grained low-level cues (e.g., edges and textures) into high-level semantics to build a more complete representation of complex backgrounds, thereby improving foreground–background separability and boundary quality. At the structural level, we partition the query image into superpixels and construct a Region Adjacency Graph (RAG): we initialize query-pixel predictions using support prototypes, project them into the superpixel space for cross-image prototype alignment with support superpixels, and then perform message passing and energy minimization on the RAG to enhance intra-region consistency and boundary adherence; finally, we back-project the predictions to the pixel space. At the representational level, we aggregate homogeneous semantic information to construct robust foreground and background prototype representations, compensating for the mismatch caused by oversimplifying the negative class.
Our contributions are as follows:
1. We design a top–down cross-layer fusion strategy that effectively injects low-level edge and texture cues into high-level semantics, yielding a more complete and separable representation in complex backgrounds. This alleviates the foreground–background confusion caused by oversimplifying the negative class and enhances the model’s sensitivity to seen and unseen targets.
2. We propose a symmetry-aware framework that explicitly performs region-graph modeling and cross-image prototype alignment in both the support and query branches. This design systematically improves intra-region consistency and boundary quality.
3. On the PASCAL-5^i and COCO-20^i benchmarks, our method surpasses the baseline in segmentation accuracy and delivers competitive results compared with existing FSS approaches, validating its effectiveness and generalizability.

2. Related Work

2.1. Few-Shot Learning

Few-shot learning (FSL) aims to build machine learning systems that can learn new tasks from very limited labeled samples, and has become a key route to addressing data scarcity [8,9,10]. Research in this area largely follows two directions: optimization-based meta-learning and non-parametric metric learning. Meta-learning approaches learn favorable model initializations or optimization strategies so that the model can quickly adapt to new tasks with few samples [11,12,13]. These methods typically optimize model parameters via meta-learning and often require fine-tuning during testing. In contrast, metric-learning methods keep parameters fixed at test time and perform inference by computing similarities between query and support samples, offering fast inference and obviating parameter updates [14,15,16,17,18,19]. Concretely, the mean of support features serves as a class-specific representation, and distance-based matching is used to classify the query set.
The rise of vision–language models has opened new opportunities for FSL: recent work explores leveraging pre-trained vision–language models to enhance tasks such as few-shot object detection [20,21,22]. In addition, cross-domain few-shot learning employs techniques such as adaptive transformer networks to bridge distribution gaps between the source and target domains [23,24]. These advances suggest that FSL is progressing toward more practical and versatile solutions.

2.2. Semantic Segmentation

Semantic segmentation assigns a semantic label to every pixel in an image and has wide applications in autonomous driving, medical image analysis, and remote sensing image understanding [25,26]. Inspired by the success of fully convolutional networks (FCNs), numerous FCN-based architectures have been proposed for semantic segmentation, such as UNet [27], DeepLab [28], SegNet [29], and PSPNet [30]. Traditional methods typically adopt an encoder–decoder design to support end-to-end learning from feature extraction to dense prediction. Over time, the field has evolved from conventional FCNs to architectures that integrate multiple advanced mechanisms. Current research hotspots focus on effectively incorporating attention mechanisms and multi-scale feature fusion strategies into FCN backbones [31,32,33,34]. Notably, despite the strong performance of deep learning under full supervision, the heavy reliance on large-scale labeled datasets limits applicability in label-scarce scenarios—motivating the rise of few-shot semantic segmentation research.

2.3. Few-Shot Semantic Segmentation

Few-shot semantic segmentation (FSS) combines the strengths of FSL and semantic segmentation, aiming to achieve accurate pixel-level segmentation of novel classes with only a few labeled examples [35,36]. Distinct from image classification, which emphasizes global feature matching, FSS requires dense, pixel-level prediction—making learning under few-shot conditions particularly challenging.
Shaban et al. [37] first proposed an FSS model based on a dual-branch architecture. Subsequent works have introduced many deep learning-based FSS methods [13,31,38,39,40]. Recent surveys indicate that existing methods can be broadly grouped into three categories: prototype metric-based, meta-learning-based, and conditional parameterization-based approaches [41]. In recent studies [42,43,44,45,46], the averaged deep features extracted by a backbone are used as class-specific representations—an archetypal use of metric learning. Prototype metric-based methods, which extract class prototypes from the support set and perform similarity matching to segment the query image, have become mainstream. However, these methods typically focus on foreground regions in the current support images and seldom exploit background regions; moreover, because each class representation is learned independently per support set, substantial contextual information is lost. Some meta-learning approaches [47,48,49,50] have shown promising results in FSS, but they often introduce many hyperparameters and face optimization challenges [6,19]. Subsequent research [51,52,53] has enhanced prototype representations by incorporating deep descriptors and pixel-level metric learning, leveraging richer, transferable semantics in fine-grained features [54,55]. Furthermore, Vision Transformers (ViTs) have demonstrated exceptional performance in the field of few-shot semantic segmentation. Dos Santos et al. [56] utilized Vision Transformers for multi-scale feature fusion and enhanced few-shot segmentation effectiveness through self-attention mechanisms. This study indicates that ViTs possess strong feature learning capabilities under few-sample conditions. MSDNet [57] incorporates a multi-scale decoder and transformer-guided prototyping approach to improve few-shot semantic segmentation performance, further confirming the potential of ViTs in this task.
Nonetheless, purely pixel-based prototype alignment generally lacks explicit constraints on regional topology and adjacency, making it difficult to guarantee intra-region consistency and precise boundary alignment. Furthermore, existing methods often model the background coarsely as a “single negative class”, which mismatches the diverse and structurally complex background distributions in real scenes and thus induces systematic bias during prototype matching. Consequently, under few-shot settings, jointly addressing the coupled challenges of complex background modeling, robust cross-image matching, and regional structural consistency remains central to advancing FSS performance.
To mitigate these issues, we adopt a dual-branch design. Even under limited samples, we posit that background regions may still contain valuable class cues or structural information. Accordingly, we propose a symmetry-aware superpixel-enhanced FSS method. The core idea is to explicitly construct symmetric superpixel region graphs in both the support and query branches: via top–down cross-hierarchical feature fusion, we inject fine-grained low-level cues (edges and textures) into high-level semantic representations to strengthen the modeling of complex backgrounds, thereby improving the foreground–background separability and boundary quality. Structurally, we first partition the query image into superpixels and build a Region Adjacency Graph (RAG); initialize query predictions at the pixel level using support prototypes and project them into the superpixel space to achieve cross-image prototype alignment between support and query; then perform message passing and energy minimization on the RAG to optimize regional consistency and boundary alignment; and finally, map the optimized results back to the pixel space. At the representational level, we aggregate semantically consistent regional information to construct robust foreground and background prototypes, alleviating the mismatch induced by oversimplifying the negative class.

3. Task Definition

With the rapid development of deep learning, semantic segmentation has achieved remarkable progress in computer vision; however, much of this success hinges on large-scale, pixel-level annotated datasets [6,55]. Acquiring dense pixel annotations is both time-consuming and costly, and is even infeasible in many real-world scenarios [50]. Therefore, we aim to train a model that can learn from a small number of samples and accurately segment unseen categories using only a few annotated examples. More formally, the task can be cast as a C-way K-shot segmentation problem, where there are C different categories and only K labeled images per category available as guidance [58]. In line with prior work, we primarily focus on the 1-way 1-shot and 1-way 5-shot settings.
In few-shot semantic segmentation, the base set $D_{\text{base}}$ and the novel set $D_{\text{novel}}$ are drawn from two disjoint label spaces $C_{\text{base}}$ and $C_{\text{novel}}$, i.e.,
$$C_{\text{base}} \cap C_{\text{novel}} = \varnothing.$$
Following the standard episodic protocol, multiple episodes are sampled from $D_{\text{base}}$ and $D_{\text{novel}}$, namely
$$D_{\text{base}} = \{(S_i, Q_i)\}_{i=1}^{N_{\text{base}}}, \qquad D_{\text{novel}} = \{(S_i, Q_i)\}_{i=1}^{N_{\text{novel}}},$$
where $N_{\text{base}}$ and $N_{\text{novel}}$ denote the numbers of episodes for base and novel classes, respectively.
Each training or testing episode $e_i$ consists of a support set $S_i$ and a query set $Q_i$, i.e.,
$$e_i = (S_i, Q_i).$$
Concretely, the support set $S_i$ contains several semantic classes. For each class $c \in C$, there are $K$ distinct image–mask pairs:
$$S_i^c = \big\{ (I_{c,s}^j, M_{c,s}^j) \big\}_{j=1}^{K}, \qquad I_{c,s}^j \in \mathbb{R}^{3 \times H \times W}, \quad M_{c,s}^j \in \{0,1\}^{H \times W},$$
where $I_{c,s}^j$ denotes the $j$-th support RGB image of class $c$ and $M_{c,s}^j$ is its corresponding binary mask.
Similarly, the query set $Q_i$ contains $l_i$ images from the same classes as the support set:
$$Q_i^c = \big\{ (I_{c,q}^j, M_{c,q}^j) \big\}_{j=1}^{l_i},$$
where $I_{c,q}^j$ denotes the $j$-th query RGB image and $M_{c,q}^j$ is its corresponding ground-truth binary mask. Note that the query masks $M_{c,q}^j$ are used only for evaluation during testing and are not involved in the model's inference. Throughout, $I \in \mathbb{R}^{3 \times H \times W}$ represents an RGB image, and $M \in \{0,1\}^{H \times W}$ is a binary segmentation mask, with 1 indicating foreground pixels and 0 indicating background pixels.
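To make the episodic protocol concrete, the following minimal sketch shows how a 1-way K-shot episode could be sampled; `images_by_class`, the function name, and the data layout are hypothetical placeholders for illustration, not part of any released code.

```python
# Illustrative 1-way K-shot episode sampling (a sketch; data structures are assumed).
import random

def sample_episode(images_by_class, classes, k_shot=1, n_query=1, seed=None):
    """images_by_class: dict mapping class id -> list of (image, mask) pairs."""
    rng = random.Random(seed)
    c = rng.choice(sorted(classes))                    # one class per episode (1-way)
    pairs = rng.sample(images_by_class[c], k_shot + n_query)
    support = pairs[:k_shot]                           # K annotated support pairs (S_i)
    query = pairs[k_shot:]                             # query pairs (Q_i); masks used only for evaluation
    return c, support, query
```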

4. Methodology

4.1. Overview

We propose a symmetry-aware superpixel-enhanced framework for few-shot semantic segmentation, as illustrated in Figure 1, and the algorithm pseudocode of the model is presented in Algorithm 1. The model adopts a symmetric dual-branch architecture that takes a support image and a query image as inputs. The backbone and the Cross-Layer Feature Fusion (CLF) module share weights across branches, enforcing consistent “symmetry awareness”. CLF adaptively fuses high-level semantics with low-level details to produce feature maps Fs and Fq that encode semantic cues and boundary information.
Algorithm 1: Prototype-guided few-shot image segmentation via cross-layer fusion and superpixel-relational matching
To further structure the representation, we introduce a Superpixel–Prototype Relational Matching (SPRM) module. First, superpixels are generated on both the support and query sides, and their internal features are aggregated to obtain structured region representations. Using the support annotations as priors, we then perform relation modeling and noise suppression between regions inside/outside the mask and the class prototypes, strengthening intra-class consistency while mitigating inter-class confusion. After SPRM, the support image undergoes mask-guided aggregation and prototype refinement to produce a set of diverse, boundary-sensitive class prototypes (Prototype Generation) that explicitly capture intra-class multimodality in appearance and deformation. Finally, non-parametric matching measures the similarity between the query features Fq and the refined prototypes, yielding a pixel-wise probability map from which the query mask is predicted. The entire pipeline is trained end-to-end under an episodic setting: symmetry awareness ensures representational consistency across branches, superpixels enhance boundary and region consistency, and the prototype set models intra-class diversity, enabling more accurate and robust segmentation under minimal supervision.
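Since Algorithm 1 appears only as an image in the published version, the following Python sketch paraphrases the episodic forward pass described above; `backbone`, `cff`, and `sprm` are assumed callables, and the mask-guided pooling and cosine matching are simplified stand-ins rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def forward_episode(backbone, cff, sprm, support_img, support_mask, query_img):
    """Sketch of one episode: shared feature extraction, SPRM refinement,
    prototype generation, and non-parametric matching.
    support_mask: (B, 1, H, W) binary mask; features: (B, C, h, w)."""
    # 1. Shared backbone + cross-layer fusion on both branches (weight sharing).
    f_s = cff(backbone(support_img))                      # support features Fs
    f_q = cff(backbone(query_img))                        # query features Fq
    # 2. Superpixel-Prototype Relational Matching refines both branches.
    f_s, f_q = sprm(f_s, f_q, support_mask)
    # 3. Mask-guided average pooling -> foreground / background prototypes.
    m = F.interpolate(support_mask.float(), size=f_s.shape[-2:], mode="nearest")
    fg = (f_s * m).sum(dim=(2, 3)) / m.sum(dim=(2, 3)).clamp(min=1.0)
    bg = (f_s * (1 - m)).sum(dim=(2, 3)) / (1 - m).sum(dim=(2, 3)).clamp(min=1.0)
    protos = torch.stack([bg, fg], dim=1)                 # (B, 2, C)
    # 4. Cosine matching between Fq and the prototypes -> pixel-wise prediction.
    sim = F.cosine_similarity(f_q.unsqueeze(1), protos[..., None, None], dim=2)
    return sim.argmax(dim=1)                              # (B, h, w), 1 = foreground
```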

4.2. Cross-Layer Feature Fusion

In few-shot semantic segmentation, many existing methods treat the background as a single monolithic class, overlooking its rich hierarchical structure and fine-grained details. This simplification hinders the model’s ability to fully understand complex and variable background content, thereby degrading the accuracy of foreground segmentation. As illustrated in Figure 2, a visualization of features from different backbone layers reveals that shallow features retain more detailed edge and texture cues but lack semantic consistency, whereas deep features possess strong semantic expressiveness yet lose spatial details critical to segmentation. This representational mismatch and information gap lead to suboptimal performance when the model encounters backgrounds with diverse appearances and complex structures.
To address these issues, we propose a top–down Cross-Layer Fusion (CLF) mechanism. CLF selectively injects fine-grained cues—such as edges and textures—from low-level features into high-level semantic features, yielding fused representations that combine semantic consistency with spatial detail. This fusion strategy produces a more complete and discriminative multi-level characterization of complex backgrounds, which not only markedly improves the separability between foreground and background regions but also enhances boundary localization and overall segmentation quality. To effectively integrate multi-scale features, we formulate our cross-layer fusion mechanism as follows, where $X_q^l$ denotes the query feature representation extracted from the $l$-th layer and $X_s^l$ represents the corresponding support features at the same level.
In the query branch, given a query image $I_q$, we extract multi-level features using the backbone shared with the support branch:
$$\{X_q^2, X_q^3, X_q^4, X_q^5\}.$$
A 1 × 1 convolution is applied for channel alignment and normalization. Starting from the deepest stage, we perform top–down fusion: the high-level feature is upsampled to provide semantic guidance, while the corresponding low-level feature is selectively injected via a gating mechanism, forming
$$P_q^l = \mathrm{Up}\big(P_q^{l+1}\big) + G_q^l \odot \hat{X}_q^l,$$
where $\mathrm{Up}(\cdot)$ denotes upsampling and $\odot$ is element-wise multiplication.
Finally, features from all levels are converted to a high-resolution scale, concatenated, and compressed to obtain the fused representation:
$$F_q = \psi\Big(\mathrm{Concat}\big[P_q^2,\ \mathrm{Up}(P_q^3),\ \mathrm{Up}^2(P_q^4),\ \mathrm{Up}^3(P_q^5)\big]\Big).$$
This fused feature retains both semantic consistency and boundary details, and is subsequently used to compute similarity with support prototypes for segmentation. The entire procedure shares weights with the support branch and does not rely on class masks.
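A minimal PyTorch sketch of this top-down gated fusion is given below; the exact form of the gate $G_q^l$ is not detailed above, so a 1 × 1 convolution followed by a sigmoid is assumed here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownGatedFusion(nn.Module):
    """Sketch of the top-down cross-layer fusion (assumed gate design, not the exact module)."""
    def __init__(self, in_channels, dim=256):
        super().__init__()
        # 1x1 convolutions for channel alignment of each backbone stage.
        self.align = nn.ModuleList([nn.Conv2d(c, dim, kernel_size=1) for c in in_channels])
        # Gates deciding how much low-level detail to inject at each level.
        self.gate = nn.ModuleList([nn.Conv2d(dim, dim, kernel_size=1) for _ in in_channels])
        self.fuse = nn.Conv2d(dim * len(in_channels), dim, kernel_size=3, padding=1)  # psi

    def forward(self, feats):                        # feats: [X^2, X^3, X^4, X^5], shallow -> deep
        x = [a(f) for a, f in zip(self.align, feats)]
        p = [None] * len(x)
        p[-1] = x[-1]                                # start from the deepest stage
        for l in range(len(x) - 2, -1, -1):          # top-down pass
            up = F.interpolate(p[l + 1], size=x[l].shape[-2:], mode="bilinear", align_corners=False)
            p[l] = up + torch.sigmoid(self.gate[l](x[l])) * x[l]   # P^l = Up(P^{l+1}) + G^l * X^l
        target = p[0].shape[-2:]
        p = [F.interpolate(t, size=target, mode="bilinear", align_corners=False) for t in p]
        return self.fuse(torch.cat(p, dim=1))        # F_q = psi(Concat[...])
```

For a ResNet-50 backbone, `in_channels` would typically be [256, 512, 1024, 2048] for stages 2–5.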
In the support branch, we utilize image masks to obtain the background region of the original image. Given an RGB input image $I \in \mathbb{R}^{3 \times w \times h}$ and a background mask $M \in \{0,1\}^{w \times h}$, where $w$ and $h$ are the image width and height, the background image is computed according to the following equation:
$$I_{BM}^{0} = I \odot M,$$
where $\odot$ denotes the Hadamard product. As shown in Figure 3, the image $I$ and the obtained background image $I_{BM}^{0}$ are fed into the backbone network for feature extraction. Cross-layer background feature fusion is performed at the first two layers and the final output layer of the network. For simplicity, we describe the feature fusion process for the first-layer output of the feature extraction network. After passing through the first layer of the backbone, $I$ and $I_{BM}^{0}$ yield feature maps $F_I \in \mathbb{R}^{c \times w \times h}$ and $F_{BM}^{0} \in \mathbb{R}^{c \times w \times h}$, respectively, where $c$ is the number of channels and $w$ and $h$ are the width and height of the feature maps. Using bilinear interpolation, denoted by the function $\tau(\cdot): \mathbb{R}^{w \times h} \rightarrow \mathbb{R}^{c \times w \times h}$, to match the mask $M$ to the feature dimensions, we then mask $F_I$ to obtain the first-layer background features:
$$F_{BM}^{1} = F_I \odot \tau(M).$$
Finally, fine-grained information from $F_{BM}^{0}$ is supplemented into $F_{BM}^{1}$ to obtain the final background feature $F_{BM}^{0,1}$. As illustrated in Figure 4, to better fuse the feature information of $F_{BM}^{1}$ and $F_{BM}^{0}$, we introduce the masking method proposed in [59] within the Hybrid Background Module (HBM). Specifically, inactive values in the high-level background feature $F_{BM}^{1}$ are replaced by the corresponding values from the low-level background feature $F_{BM}^{0}$, while the other activated values remain unchanged to preserve semantic information. We then input $F_{BM}^{0,1}$ and $F_I$ into the second layer of the backbone network and repeat the above operations.
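The value-replacement step of the HBM can be sketched as follows, under the assumption that "inactive" positions are those with (near-)zero activation in the high-level background feature; the threshold is an illustrative choice.

```python
import torch

def hybrid_background_fusion(f_high, f_low, eps=1e-6):
    """Fill inactive positions of the high-level background feature (F_BM^1)
    with values from the low-level background feature (F_BM^0); a sketch only."""
    inactive = f_high.abs() <= eps                  # positions suppressed by the mask / ReLU
    return torch.where(inactive, f_low, f_high)     # keep activated values, fill the rest
```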

4.3. Superpixel–Prototype Relational Matching

Few-shot semantic segmentation suffers from unstable prototypes and imprecise boundaries. Pixel-wise dense matching is highly susceptible to noise; therefore, we propose a Superpixel–Prototype Relational Matching (SPRM) module. Pixel-level features are aggregated into superpixel-level tokens to construct a region graph, upon which cross-image prototype alignment and message passing are performed at the regional level. The refined representations are then projected back to the pixel space for fine-grained prediction. As illustrated in Figure 5, this design ensures cross-image semantic alignment while leveraging the boundary-awareness of superpixels to enhance regional consistency and boundary adherence. Superpixel segmentation adaptively partitions the image into regions according to visual cues; consistency clustering enforces intra-token feature homogeneity, and, when combined with hierarchical feature fusion, it preserves fine details and edge information more effectively.
Given a query image $I_q$ and its deep feature map $F \in \mathbb{R}^{H \times W \times C}$, we apply SLIC/SEEDS to partition the image domain $\Omega = \{1, \dots, H\} \times \{1, \dots, W\}$ into a set of superpixels $\mathcal{S} = \{S_k\}_{k=1}^{K}$ that are pairwise disjoint and whose union covers the whole image, i.e., $S_i \cap S_j = \varnothing$ for $i \neq j$, and $\bigcup_{k=1}^{K} S_k = \Omega$. Within each superpixel, mean pooling is performed to obtain a region token:
$$t_k = \frac{1}{|S_k|} \sum_{x \in S_k} F(x) \in \mathbb{R}^{C}, \qquad k = 1, \dots, K,$$
where $F(x)$ denotes the $C$-dimensional feature vector at pixel $x$. (We denote region tokens by $t_k$ to distinguish them from the class prototypes $p_{c,k}$ introduced below.) The resulting token set can be written as
$$\mathcal{T} = [t_1, \dots, t_K] \in \mathbb{R}^{K \times C}.$$
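The token pooling step can be implemented with a scatter-style mean over the superpixel label map; the sketch below assumes the labels come from SLIC/SEEDS and are contiguous integers in [0, K).

```python
import torch

def superpixel_tokens(feat, labels, num_superpixels):
    """Mean-pool pixel features into region tokens t_k.
    feat: (C, H, W) feature map; labels: (H, W) superpixel index map."""
    C, H, W = feat.shape
    flat_feat = feat.reshape(C, H * W)                        # (C, HW)
    flat_lbl = labels.reshape(-1).long()                      # (HW,)
    tokens = torch.zeros(num_superpixels, C, dtype=feat.dtype, device=feat.device)
    tokens.index_add_(0, flat_lbl, flat_feat.t())             # sum features per region
    counts = torch.bincount(flat_lbl, minlength=num_superpixels).clamp(min=1)
    return tokens / counts.unsqueeze(1)                       # t_k = mean over S_k
```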
We further construct a Region Adjacency Graph $G = (V, E)$ with $V = \{1, \dots, K\}$. An undirected edge $(i, j) \in E$ is created if the superpixels $S_i$ and $S_j$ are spatially adjacent (touch each other). The edge weight integrates color, spatial, and boundary-strength cues via a weighted combination:
$$w_{ij} = \exp\!\big( -\alpha \|\mu_i - \mu_j\|_2^2 - \beta \|c_i - c_j\|_2^2 \big) \cdot \big(1 + \gamma\, b_{ij}\big),$$
where $\mu_i$ denotes the centroid coordinates, $c_i$ the average color, and $b_{ij}$ the boundary confidence score of the superpixel pair $(i, j)$.
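A direct reading of this edge-weight formula is sketched below; the values of alpha, beta, and gamma are illustrative defaults rather than the paper's settings.

```python
import torch

def rag_edge_weight(mu_i, mu_j, c_i, c_j, b_ij, alpha=1.0, beta=1.0, gamma=1.0):
    """Edge weight w_ij combining centroid distance, mean-color distance,
    and boundary confidence, as in the equation above."""
    spatial = alpha * torch.sum((mu_i - mu_j) ** 2)    # squared centroid distance
    color = beta * torch.sum((c_i - c_j) ** 2)         # squared mean-color distance
    return torch.exp(-(spatial + color)) * (1.0 + gamma * b_ij)
```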
To capture multi-modal semantics within each category, we introduce $K_c$ prototype centers $\{p_{c,k}\}_{k=1}^{K_c}$ for each class $c$. The region–prototype relationship is modeled via a soft assignment as follows:
$$q_{i,k}^{c} = \frac{\exp\!\big(\kappa \langle \hat{t}_i, \hat{p}_{c,k} \rangle\big)}{\sum_{m=1}^{K_c} \exp\!\big(\kappa \langle \hat{t}_i, \hat{p}_{c,m} \rangle\big)}, \qquad \sum_{k} q_{i,k}^{c} = 1,$$
where $\hat{t}_i$ and $\hat{p}_{c,k}$ denote the normalized region token and prototype, and $\kappa$ is a scaling factor.
Building on this, the prototypes are updated via weighted aggregation:
$$p_{c,k} = \frac{\sum_{i=1}^{K} q_{i,k}^{c}\, t_i}{\sum_{i=1}^{K} q_{i,k}^{c}}.$$
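The soft assignment and weighted aggregation can be iterated in an EM-like fashion, as in the sketch below for a single class; kappa and the iteration count are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def update_prototypes(tokens, prototypes, kappa=10.0, iters=3):
    """Alternate soft assignment and weighted aggregation for one class.
    tokens: (K, C) region tokens; prototypes: (K_c, C) class prototypes."""
    for _ in range(iters):
        t = F.normalize(tokens, dim=1)                  # \hat{t}_i
        p = F.normalize(prototypes, dim=1)              # \hat{p}_{c,k}
        q = torch.softmax(kappa * t @ p.t(), dim=1)     # q_{i,k}^c, rows sum to 1
        prototypes = (q.t() @ tokens) / q.sum(dim=0, keepdim=True).t().clamp(min=1e-6)
    return prototypes
```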
After obtaining the multi-prototypes for each class, we define a region-level class distribution $z_i \in \Delta^{|C|}$ (where $\Delta^{|C|}$ denotes the probability simplex over the classes). Its energy function consists of a unary term (prototype similarity) and a pairwise term (graph smoothness/boundary alignment), as given by
$$\min_{z,\, \{\hat{p}_{c,k}\}} E\big(z, \{\hat{p}_{c,k}\}\big) = \sum_{i=1}^{K} \mathrm{CE}\Big(z_i,\ \sigma\big(\max_{k} \langle t_i, \hat{p}_{c,k} \rangle\big)_{c \in C}\Big) + \lambda \sum_{(i,j) \in E} w_{ij}\, \|z_i - z_j\|_2^2,$$
where $\sigma(\cdot)$ denotes the softmax operator. The energy is minimized via a few iterations of message passing with Laplacian regularization to obtain $\hat{z}_i$. The region-level predictions are then back-projected to the pixel domain to produce a fine-grained mask, which is fused with the pixel-level logits to yield the final score
$$\mathrm{score}(x, c) = \beta\, T\big(F(x), p_c^{(0)}\big) + (1 - \beta)\, \hat{z}_{s(x)}(c),$$
where the first term measures the similarity between the pixel feature $F(x)$ and the initial class prototype $p_c^{(0)}$, and the second term back-projects the refined region-level prediction. Here, $s(x)$ returns the superpixel index to which pixel $x$ belongs, and $\beta \in [0, 1]$ controls the fusion weight.
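The smoothing and fusion steps admit a simple iterative approximation; the sketch below performs neighbor averaging weighted by $w_{ij}$ (an approximation of the Laplacian-regularized minimization, not an exact solver) and then fuses the back-projected region scores with the pixel-level similarities.

```python
import torch

def refine_region_scores(unary, weights, edges, lam=0.5, iters=10):
    """Approximate the pairwise (smoothness) term by weighted neighbor averaging.
    unary: (K, |C|) per-region class scores; edges: list of (i, j) index pairs;
    weights: list of edge weights w_ij aligned with edges."""
    z = torch.softmax(unary, dim=1)
    for _ in range(iters):
        agg = torch.zeros_like(z)
        deg = torch.zeros(z.shape[0], 1, dtype=z.dtype, device=z.device)
        for (i, j), w in zip(edges, weights):           # symmetric message passing
            agg[i] += w * z[j]; agg[j] += w * z[i]
            deg[i] += w; deg[j] += w
        z = (z + lam * agg / deg.clamp(min=1e-6)) / (1.0 + lam)
    return z                                            # refined \hat{z}_i

def fuse_pixel_and_region(pixel_sim, region_scores, sp_labels, beta=0.5):
    """Back-project region predictions to pixels and fuse with pixel-level similarity.
    pixel_sim: (|C|, H, W); region_scores: (K, |C|); sp_labels: (H, W) long tensor."""
    region_map = region_scores[sp_labels].permute(2, 0, 1)   # (|C|, H, W)
    return beta * pixel_sim + (1.0 - beta) * region_map
```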

4.4. Non-Parametric Metric Learning

The semantic segmentation task aims to learn target object segmentation patterns from support images and generalize them to the corresponding targets in query images. Specifically, this task can be formulated as a pixel-wise category inference problem in the spatial domain. As illustrated in Figure 6, this work leverages a non-parametric metric learning mechanism to compute the semantic similarity $\beta_{c,q}(x, y)$ between the feature vector $F_q(x, y)$ at each spatial location $(x, y)$ of the query feature map $F_q \in \mathbb{R}^{d \times h \times w}$, obtained by encoding the query image, and each class prototype $p_c \in \mathbb{R}^{d \times 1 \times 1}$, thereby segmenting unknown categories based on this similarity. The similarity is computed as
$$\beta_{c,q}(x, y) = \frac{F_q(x, y) \cdot p_c}{\|F_q(x, y)\|\, \|p_c\|}, \qquad p_c \in P,\ c \in C,$$
where $P \in \mathbb{R}^{2 \times d \times 1 \times 1}$ encompasses all class prototypes for both foreground and background. Subsequently, the class prototype index $\hat{c}_q(x, y)$ that best matches the feature vector at each location is obtained through the $\arg\max(\cdot)$ operation as shown in the following equation:
$$\hat{c}_q(x, y) = \arg\max_{c}\, \beta_{c,q}(x, y).$$
Based on this, the class prototypes corresponding to all spatial locations are integrated to generate a semantic response map $G \in \mathbb{R}^{h \times w}$. Subsequently, bilinear interpolation is employed to resize $G$ to the original image dimensions, yielding $G' \in \mathbb{R}^{H \times W}$, which is concatenated with the query features $E_Q$ along the channel dimension to construct an enhanced representation that better transfers segmentation information from the support set. Finally, the model performs pixel-wise classification to determine whether each pixel belongs to the target class or the background in the query image. The overall pipeline structure is illustrated in Figure 6. During training, a cross-entropy loss function is utilized to optimize the prototype representations in an end-to-end manner according to the equation
$$\mathcal{L}_{sq} = -\frac{1}{hw} \sum_{x,y} \sum_{p_c \in P} \mathbb{1}\big[ M_{c,q}(x, y) = c \big] \log Z_{c,q}(x, y),$$
where $M_{c,q}(x, y)$ is the ground-truth mask of the query image, $Z_{c,q}(x, y)$ is the predicted probability of pixel $(x, y)$ belonging to class $c$, and $\mathbb{1}[\cdot]$ is the indicator function. Optimizing $\mathcal{L}_{sq}$ derives a suitable class-specific prototype for each class. Note that during testing, $M_{c,q}(x, y)$ is used only for evaluation.
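The cosine matching and the cross-entropy objective can be sketched as follows for a single query image; the temperature used to scale the cosine scores before the softmax is a common choice in prototype-based FSS and is an assumption here, not the paper's value.

```python
import torch
import torch.nn.functional as F

def cosine_score_map(f_q, prototypes):
    """Cosine similarity between query features (d, h, w) and prototypes (|C|, d);
    returns a (|C|, h, w) score map following the metric above."""
    fq = F.normalize(f_q, dim=0)
    p = F.normalize(prototypes, dim=1)
    return torch.einsum("chw,kc->khw", fq, p)

def segmentation_loss(score_map, gt_mask, scale=20.0):
    """Cross-entropy over the (temperature-scaled) cosine scores.
    gt_mask: (h, w) with class indices in [0, |C|)."""
    logits = (scale * score_map).unsqueeze(0)            # (1, |C|, h, w)
    return F.cross_entropy(logits, gt_mask.unsqueeze(0).long())
```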

5. Experimental Design

5.1. Datasets and Evaluation Protocol

Datasets: We assess our method on two canonical few-shot semantic segmentation benchmarks, PASCAL-5^i and COCO-20^i.
PASCAL-5^i is constructed from PASCAL VOC 2012 [60] and the SBD dataset [61], covering 20 object categories. It uses 10,582 training and 1449 validation images as the base and novel pools, respectively.
COCO-20^i [62], derived from COCO-2014, is larger and more challenging, spanning 80 categories with 82,783 training and 40,504 validation images serving as base and novel pools. For both datasets, we follow the four-fold protocol in [37]: the categories are split evenly into four folds; in each run, three folds are used for training, and the remaining fold is reserved for testing.
Evaluation metrics: Following common practice [37,47], we report the mean Intersection over Union (mIoU) and foreground–background IoU (FB-IoU). For class c, the IoU is
$$\mathrm{IoU}_c = \frac{\mathrm{tp}_c}{\mathrm{tp}_c + \mathrm{fp}_c + \mathrm{fn}_c},$$
where $\mathrm{tp}_c$, $\mathrm{fp}_c$, and $\mathrm{fn}_c$ denote the counts of true-positive, false-positive, and false-negative pixels for that class. The mean over the $C$ foreground classes gives
$$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{IoU}_c,$$
with $C = 5$ for PASCAL-5^i and $C = 20$ for COCO-20^i. FB-IoU disregards class labels and averages IoU over foreground and background, i.e., $C = 2$. All numbers are averaged across four trials.
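The metric computation is straightforward to reproduce; the sketch below accumulates per-class counts from flattened prediction and ground-truth label arrays.

```python
import numpy as np

def iou_per_class(pred, gt, num_classes):
    """IoU_c = tp / (tp + fp + fn) for each class, from flat integer label arrays."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else float("nan"))
    return ious

def mean_iou(pred, gt, num_classes):
    """mIoU = average of per-class IoU, ignoring classes absent from both maps."""
    return np.nanmean(iou_per_class(pred, gt, num_classes))
```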
Test protocol: Because randomly sampled test episodes can vary in difficulty, we evaluate the baseline and our method on the same test episodes to ensure fair comparison.
Baseline: As our approach is grounded in metric learning, we adopt PANet [39] as the baseline, consistent with prior work [63], and denote it as PANet. For fairness, both our method and the baseline share the same backbone feature extractor.

5.2. Implementation Details

We adopt ResNet [64] (ResNet-50 and ResNet-101) and VGG [65], pre-trained on ImageNet, as our backbone networks. We discard the last backbone stage and the final ReLU for better generalization. We use SGD to optimize our model with a momentum of 0.9 and an initial learning rate of 1 × 10⁻⁴, decayed by a factor of 2 every 2000 iterations. The model is trained for 24,000 iterations. The batch size is set to 4 for training, limited by GPU memory constraints when processing 473 × 473 resolution images with superpixel computation. During evaluation, we use a batch size of 1, following standard FSS protocols. Both images and masks are resized and cropped to 473 × 473 and augmented with random horizontal flipping. Evaluation is performed on the original images.
To balance region-level semantic coherence and computational efficiency, we set the number of superpixels to 300 for all query images by default. For datasets with significantly different image resolutions, we further adjust the superpixel count according to the image area as N = (H × W) / 600, where H and W denote the image height and width, respectively. We use the SLIC algorithm with a compactness parameter set to 10 to generate superpixels, ensuring each superpixel conveys sufficient local context without sacrificing boundary precision. The number of support branch multi-prototypes per class is set to 3, empirically determined to balance alignment robustness and computational efficiency. The loss function comprises a binary cross-entropy loss and a Dice loss with weights of 1.0 and 1.0, respectively, and the region consistency term is weighted by 0.5.
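The superpixel generation step can be reproduced with scikit-image's SLIC as sketched below; the resolution-adaptive variant follows the N = (H × W)/600 rule stated above, and the function name is a placeholder.

```python
from skimage.segmentation import slic

def generate_superpixels(image, default_n=300, compactness=10, adaptive=False):
    """image: (H, W, 3) RGB array; returns an (H, W) map of superpixel indices."""
    h, w = image.shape[:2]
    n_segments = max(1, (h * w) // 600) if adaptive else default_n
    return slic(image, n_segments=n_segments, compactness=compactness, start_label=0)
```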
For evaluation, we conduct 1000 episodes per class per fold, with each episode randomly sampling one support–query pair, ensuring statistical significance of the reported mIoU and FB-IoU metrics. Regarding seed policy, we employ a fixed random seed (seed = 1234) for dataset splitting and episode sampling to ensure reproducible results. All experiments are implemented on PyTorch 1.13.1 and conducted on a server composed of one Intel(R) Xeon(R) Gold 6242 CPU with 256 GB memory and one NVIDIA A100 40 GB GPU accelerator card.

5.3. Experimental Results

We benchmark our approach against state-of-the-art methods on PASCAL-5^i and COCO-20^i using two evaluation metrics. Experiments cover the 1-way 1-shot and 1-way 5-shot settings with three backbones—VGG-16, ResNet-50, and ResNet-101—and we report both mIoU and FB-IoU.
PASCAL-5^i. From Table 1, we can observe that our SSENet method demonstrates consistent superiority across different backbone networks. For VGG-16, our method achieves 65.4 mIoU and 77.2 FB-IoU in the 1-shot segmentation task, outperforming the previous best method DCP by 4.47% and 2.12%, respectively. In the 5-shot task, our method obtains 68.3 mIoU and 81.3 FB-IoU, which surpasses DCP by 0.74% and 0.74%, respectively, demonstrating the effectiveness of our approach even with limited backbone capacity. For ResNet-50, our method achieves 67.4 mIoU and 78.9 FB-IoU in the 1-shot segmentation task, which outperforms the state-of-the-art method DCP by 1.97% and 1.68%, respectively. In the 5-shot task, our method reaches 71.0 mIoU and 81.3 FB-IoU, exceeding DCP by 1.00% and 3.57%, respectively. Furthermore, our method significantly outperforms PANet by 38.4% and 28.6% mIoU under the 1-shot and 5-shot segmentation tasks, respectively, along with 17.9% and 13.9% higher FB-IoU performance, demonstrating the effectiveness of our approach in leveraging support features for enhanced segmentation. For ResNet-101, our method achieves 68.2 mIoU and 78.3 FB-IoU in the 1-shot task, outperforming the previous best method DCP by 1.34% mIoU while showing a slight decrease of 0.25% in FB-IoU. In the 5-shot task, our method obtains 72.5 mIoU and 81.6 FB-IoU, which surpasses DCP by 1.40% mIoU while showing a decrease of 1.33% in FB-IoU. Compared with PANet*, our method shows substantial improvements of 33.2% and 26.1% mIoU under the 1-shot and 5-shot segmentation tasks, respectively, along with 11.4% and 13.3% higher FB-IoU performance. MSDNet achieves a marginally higher FB-IoU in the ResNet-101 configuration (e.g., 85.0 vs. 81.6 in the 5-shot setting), as our model prioritizes semantic accuracy through region-level consistency enforcement rather than aggressive binary foreground–background separation. Overall, our SSENet consistently delivers superior segmentation performance across different experimental settings, validating the robustness and effectiveness of our proposed approach.
To further verify the segmentation effectiveness of our proposed method, we evaluate our method and PANet on 2-way 1-shot and 5-shot segmentation performance as shown in Table 2. We observe that our method also performs favorably in both metrics across all backbone networks. Specifically, in the 1-shot setting, our method obtains 62.7, 63.2, and 67.9 mIoU and 73.8, 74.1, and 77.5 FB-IoU with VGG-16, ResNet-50 and ResNet-101, respectively, which significantly outperforms PANet by 39.0%, 39.5%, and 36.3% mIoU and 15.0%, 15.1%, and 13.0% FB-IoU, respectively. In the 5-shot setting, our method achieves 59.4, 60.2, and 65.3 mIoU and 78.2, 79.5, and 80.9 FB-IoU with VGG-16, ResNet-50, and ResNet-101 respectively, surpassing PANet by 23.2%, 23.4%, and 20.0% mIoU and 16.0%, 15.6%, and 10.7% FB-IoU, respectively. These substantial improvements across different backbone architectures demonstrate that the proposed method has superior generalization performance and robustness in more challenging 2-way segmentation scenarios.
Regarding the COCO-20^i dataset, from Table 3 we can observe that our SSENet method achieves state-of-the-art performance across different backbone networks on this more challenging benchmark. For VGG-16, our method achieves 43.1 mIoU and 65.8 FB-IoU in the 1-shot segmentation task, outperforming the previous best method SAGNN by 15.5% and 7.5%, respectively. In the 5-shot task, our method obtains 45.6 mIoU and 66.8 FB-IoU, which surpasses SAGNN by 12.0% and 5.9%, respectively, demonstrating significant improvements even with limited backbone capacity. For ResNet-50, our method achieves 47.0 mIoU and 70.1 FB-IoU in the 1-shot segmentation task, outperforming the state-of-the-art method DCP by 3.3% mIoU. Notably, our FB-IoU performance substantially exceeds other methods, with a remarkable improvement of 11.3% over ASGNet. In the 5-shot task, our method reaches 53.8 mIoU and 73.9 FB-IoU, exceeding DCP by 5.7% mIoU and surpassing ASGNet by 26.6% mIoU and 10.1% FB-IoU, respectively. These results demonstrate the effectiveness of our approach in leveraging multiple support examples for enhanced performance. For ResNet-101, our method achieves 47.2 mIoU and 68.9 FB-IoU in the 1-shot task, outperforming the previous best method DCP by 5.8% mIoU and exceeding NTRENet by 20.7% mIoU and 2.1% FB-IoU. In the 5-shot task, our method obtains 53.6 mIoU and 72.4 FB-IoU, which surpasses DCP by 8.5% mIoU and outperforms NTRENet by 24.1% mIoU and 4.0% FB-IoU. The consistent improvements across different backbone architectures validate that our method can effectively handle the increased complexity and diversity of the COCO-20^i dataset. Overall, our SSENet demonstrates superior segmentation performance and excellent generalization capability on this challenging dataset.
To further verify the segmentation effectiveness of our method, we compare our proposed method and PANet on 2-way 1-shot and 5-shot segmentation performance on the COCO-20^i dataset, as shown in Table 4. Our method also performs favorably in both metrics across all backbone networks. Specifically, in the 1-shot setting, our method obtains 39.7, 41.0, and 45.8 mIoU and 64.3, 69.2, and 64.9 FB-IoU with VGG-16, ResNet-50, and ResNet-101, respectively, which significantly outperforms PANet by 93.7%, 83.9%, and 33.9% mIoU and 9.5%, 16.1%, and 2.4% FB-IoU, respectively. In the 5-shot setting, our method achieves 48.2, 46.9, and 51.2 mIoU and 66.0, 71.4, and 70.6 FB-IoU with VGG-16, ResNet-50, and ResNet-101, respectively, surpassing PANet by 47.4%, 44.3%, and 27.7% mIoU and 7.8%, 14.8%, and 7.3% FB-IoU, respectively. These substantial improvements across different backbone architectures demonstrate that the proposed method has superior generalization performance and robustness in more challenging 2-way segmentation scenarios on the COCO-20^i dataset.

5.4. Ablation Studies

Here, we conduct ablation studies on the PASCAL-5^i and COCO-20^i datasets using ResNet-50, and report the average performance in terms of mIoU and FB-IoU. The quantitative results are shown in Table 5, and the qualitative results are shown in Figure 7.
Effect of different components. In this section, we analyze the impact of different components on the performance of our method with ResNet-50 in the 1-way 1-shot setting as shown in Table 5. Fs+Fq&CLF represents a variant that uses the CLF module to obtain fine-grained foreground regions (Fs) and query-specific features (Fq) based on extracted features, while excluding other components. Similarly, Fs&CLF+Fq&CLF represents our design where both support and query features are processed through the CLF module to enhance feature fusion. Fs&CLF+Fq&CLF+FG+BG&SPRM denotes a variant of our method where SPRM is used in conjunction with all other components to extract the fine-grained relational features of background and foreground regions.
In Table 5, we first analyze the impact of the main components of our method, namely the fine-grained feature extraction modules (Fs, Fq), Cross-Layer Feature Fusion (CLF), and the Superpixel–Prototype Relational Matching (SPRM) module. We can see that each component plays a vital role in performance improvement. Starting from the baseline FG + BG (48.7% mIoU), the introduction of Fs + Fq&CLF improves the performance to 50.3%, demonstrating the effectiveness of fine-grained feature learning. Furthermore, we observe that when CLF is applied to both support and query features simultaneously (Fs&CLF + Fq&CLF), the performance further improves to 52.1%, which provides better feature alignment and fusion effects. Moreover, the integration of another module, SPRM, has also brought significant improvements. For example, Fs + Fq&CLF + FG&SPRM + BG achieves 56.0%, while our complete model Fs&CLF + Fq&CLF + FG&SPRM + BG&SPRM reaches the best performance of 67.4% mIoU on PASCAL-5^i and 47.0% on COCO-20^i. This further confirms the aforementioned hypothesis that relational matching between superpixels and prototypes can effectively capture fine-grained spatial correspondences, and explicit relational modeling can enhance feature discrimination capability.
Visualization analysis. Figure 7 compares our method with the baseline on 1-way 1-shot qualitative results. We observe that our method gives satisfying segmentation results that separate unseen classes from the background with only the guidance of the support images, even when the support and query images do not share much appearance similarity.
Compared to the baseline, our SSENet demonstrates significant improvements in background consistency, with the most notable enhancements evident in boundary quality and object completeness. In the PASCAL-5^i examples shown in Figure 7, specifically in the train scene (leftmost), our method produces more accurate and complete segmentation of the red locomotive, while the baseline shows fragmented predictions with missing regions in the central part of the train. Similarly, in the construction equipment scene (second from left), our approach achieves better boundary adherence and reduces the obvious false-negative regions present in the baseline results. The animal segmentation examples (sheep and cat in columns 3–4) particularly highlight our method’s superior capability in capturing fine-grained details and maintaining object completeness, whereas the baseline exhibits incomplete segmentations with significant missing portions of the target objects.
This demonstrates that our method effectively suppresses background noise through the Cross-Layer Feature Fusion (CLF) module, which integrates fine-grained edge and texture information with high-level semantics, resulting in cleaner and more coherent background predictions. The most striking improvement lies in boundary precision, where our Superpixel–Prototype Relational Matching (SPRM) module enforces regional consistency by operating on superpixel-level tokens rather than individual pixels, leading to markedly sharper and more accurate object boundaries. In challenging scenarios involving objects at different scales or with varying contextual backgrounds, our method demonstrates enhanced robustness through the CLF module’s ability to inject low-level cues into high-level representations, enabling the better handling of fine-grained details while maintaining semantic understanding for larger structures.

6. Conclusions and Future Work

In this work, we propose a symmetry-aware superpixel-enhanced few-shot semantic segmentation method that, through explicit superpixel region-graph modeling, effectively addresses two critical limitations of existing approaches: insufficient modeling of complex backgrounds and poor regional prediction consistency. Our main contributions include three aspects: first, we design a symmetric dual-branch architecture that leverages explicit superpixel region-graph modeling to enhance background representation and prediction consistency; second, we propose a top–down cross-layer fusion mechanism that effectively integrates fine-grained edge and texture information into high-level semantic features; finally, we construct a cross-image prototype alignment strategy based on Region Adjacency Graphs (RAG) with message passing optimization to obtain more robust foreground and background prototype representations. Extensive experiments and ablation studies on the PASCAL-5^i and COCO-20^i benchmarks demonstrate the effectiveness and superiority of our proposed method, achieving consistent improvements across multiple backbone architectures. Despite achieving good results, our framework still has certain limitations. The method exhibits superpixel granularity sensitivity, where inappropriate superpixel counts can affect segmentation quality, and shows cross-domain generalization challenges when substantial domain gaps exist between training and testing scenarios. While these limitations do not undermine the core contributions, they highlight important areas for future improvement and represent valuable directions for advancing few-shot semantic segmentation research.
Adaptive superpixel optimization represents a key direction, and we will explore dynamic superpixel granularity selection mechanisms and learnable superpixel generation networks to automatically adapt to different scene complexities and object scales. We plan to investigate methods for predicting optimal superpixel parameters based on image characteristics and object complexity, potentially through reinforcement learning or adaptive sampling strategies. We will pursue enhanced cross-domain generalization capabilities by incorporating domain adaptation techniques into our cross-layer fusion mechanism and exploring meta-learning approaches to improve adaptation to new domains with minimal fine-tuning. We will develop domain-aware feature alignment strategies and investigate adversarial training methods to learn domain-invariant representations that maintain segmentation accuracy across diverse visual contexts.

Author Contributions

Conceptualization, Q.Z.; Methodology, L.G. and K.-C.L.; Software, J.W.; Validation, J.W. and J.X.; Formal analysis, X.L.; Investigation, Y.T.; Resources, X.L., R.Z. and Q.Z.; Data curation, L.-H.L.; Writing—original draft, L.G.; Writing—review & editing, K.-C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China, under Grant Nos. 2023YFB4503903 and 2020YFC0832500; the National Natural Science Foundation of China, under Grant Nos. U22A20261 and 61402210; the Gansu Province Science and Technology Major Project—Industrial Project, under Grant Nos. 22ZD6GA048 and 23ZDGA006; the HY-Project, under Grant No. 4E49EFF3; the Gansu Province Key Research and Development Plan—Industrial Project, under Grant No. 22YF7GA004; the Gansu Provincial Science and Technology Major Special Innovation Consortium Project, under Grant No. 21ZD3GA002; the Fundamental Research Funds for the Central Universities, under Grant Nos. lzujbky-2024-jdzx15, lzujbky-2022-kb12, lzujbky-2021-sp43, lzujbky-2020-sp02, and lzujbky-2019-kb51; the Open Project of Gansu Provincial Key Laboratory of Intelligent Transportation, under Grant No. GJJ-ZH-2024-002; the Science and Technology Plan of Qinghai Province, under Grant No. 2020-GX-164; 2023 China Higher Education Institutions Industry-Academia-Research Innovation Fund for Digital Intelligence and Educational Projects No. 2023RY020.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Han, D.; Shi, J.; Zhao, J.; Wu, H.; Zhou, Y.; Li, L.H.; Khan, M.K.; Li, K.C. LRCN: Layer-residual Co-Attention Networks for visual question answering. Expert Syst. Appl. 2025, 263, 125658. [Google Scholar] [CrossRef]
  2. Xia, C.; Li, X.; Gao, X.; Ge, B.; Li, K.C.; Fang, X.; Zhang, Y.; Yang, K. PCDR-DFF: Multi-modal 3D object detection based on point cloud diversity representation and dual feature fusion. Neural Comput. Appl. 2024, 36, 9329–9346. [Google Scholar] [CrossRef]
  3. He, W.; Zhang, Y.; Zhuo, W.; Shen, L.; Yang, J.; Deng, S.; Sun, L. APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 23762–23772. [Google Scholar]
  4. Jin, K.; Du, W.; Tang, M.; Liang, W.; Li, K.; Pathan, A.S.K. LSODNet: A Lightweight and Efficient Detector for Small Object Detection in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2025, 18, 24816–24828. [Google Scholar] [CrossRef]
  5. Shen, W.; Ma, A.; Wang, J.; Zheng, Z.; Zhong, Y. Adaptive Self-Supporting Prototype Learning for Remote Sensing Few-Shot Semantic Segmentation. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 5634116. [Google Scholar] [CrossRef]
  6. Johnander, J.; Edstedt, J.; Felsberg, M.; Khan, F.S.; Danelljan, M. Dense Gaussian Processes for Few-Shot Segmentation. In Lecture Notes in Computer Science, Proceedings of the Computer Vision, ECCV 2022, PT XXIX, 17th European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cisse, M., Farinella, G., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; Volume 13689, pp. 217–234. [Google Scholar] [CrossRef]
  7. Ma, J.; Bai, S.; Pan, W. Boosting Few-Shot Semantic Segmentation with Prior-Driven Edge Feature Enhancement Network. IEEE Trans. Artif. Intell. 2025, 6, 211–220. [Google Scholar] [CrossRef]
  8. Zhang, H.; Xu, J.; Jiang, S.; He, Z. Simple Semantic-Aided Few-Shot Learning. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 28588–28597. [Google Scholar] [CrossRef]
  9. McCall, A. Few-Shot Learning in Computer Vision: Overcoming Data Scarcity. ResearchGate 2022. Available online: https://www.researchgate.net/publication/390542684_Few-Shot_Learning_in_Computer_Vision_Overcoming_Data_Scarcity (accessed on 8 October 2025).
  10. Zhao, J.; Kong, L.; Lv, J. An Overview of Deep Neural Networks for Few-Shot Learning. Big Data Min. Anal. 2025, 8, 145–188. [Google Scholar] [CrossRef]
  11. Zhang, J.; Zhao, C.; Ni, B.; Xu, M.; Yang, X. Variational Few-Shot Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  12. Dvornik, N.; Schmid, C.; Mairal, J. Diversity with Cooperation: Ensemble Methods for Few-Shot Classification. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3722–3730. [Google Scholar] [CrossRef]
  13. Schwartz, E.; Karlinsky, L.; Shtok, J.; Harary, S.; Marder, M.; Kumar, A.; Feris, R.; Giryes, R.; Bronstein, A. Delta-encoder: An effective sample synthesis method for few-shot object recognition. In Advances in Neural Information Processing Systems; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31. [Google Scholar]
  14. Snell, J.; Swersky, K.; Zemel, R. Prototypical Networks for Few-shot Learning. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar] [CrossRef]
  15. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.S.; Hospedales, T.M. Learning to Compare: Relation Network for Few-Shot Learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 1199–1208. [Google Scholar] [CrossRef]
  16. Guo, Y.; Codella, N.C.; Karlinsky, L.; Codella, J.V.; Smith, J.R.; Saenko, K.; Rosing, T.; Feris, R. A Broader Study of Cross-Domain Few-Shot Learning. In Image Processing Computer Vision Pattern Recognition and Graphics, Proceedings of the 16th European Conference on Computer Vision-ECCV-Biennial, Electrical Network, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J., Eds.; PT XXVII; Springer International Publishing: Cham, Switzerland, 2020; Volume 12372, pp. 124–141. [Google Scholar] [CrossRef]
  17. Wang, Y.; Lee, D.; Heo, J.; Park, J. One-Shot Summary Prototypical Network Toward Accurate Unpaved Road Semantic Segmentation. IEEE Signal Process. Lett. 2021, 28, 1200–1204. [Google Scholar] [CrossRef]
  18. Zhang, X.; Wei, Y.; Yang, Y.; Huang, T.S. SG-One: Similarity Guidance Network for One-Shot Semantic Segmentation. IEEE Trans. Cybern. 2020, 50, 3855–3865. [Google Scholar] [CrossRef]
  19. Tian, P.; Wu, Z.; Qi, L.; Wang, L.; Shi, Y.; Gao, Y. Differentiable meta-learning model for few-shot semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12087–12094. [Google Scholar]
  20. Askari, F.; Fateh, A.; Mohammadi, M.R. Enhancing few-shot image classification through learnable multi-scale embedding and attention mechanisms. Neural Netw. 2025, 187, 107339. [Google Scholar] [CrossRef]
  21. Ren, G.; Liu, J.; Wang, M.; Guan, P.; Cao, Z.; Yu, J. Few-Shot Object Detection via Dual-Domain Feature Fusion and Patch-Level Attention. Tsinghua Sci. Technol. 2025, 30, 1237–1250. [Google Scholar] [CrossRef]
  22. Liu, T.; Sun, F. Self-Aligning Multi-Modal Transformer for Oropharyngeal Swab Point Localization. Tsinghua Sci. Technol. 2024, 29, 1082–1091. [Google Scholar] [CrossRef]
  23. Paeedeh, N.; Pratama, M.; Ma’sum, M.A.; Mayer, W.; Cao, Z.; Kowlczyk, R. Cross-domain few-shot learning via adaptive transformer networks. Knowl.-Based Syst. 2024, 288, 111458. [Google Scholar] [CrossRef]
  24. Liu, Y.; Sun, Y.; Chen, Z.; Feng, C.; Zhu, K. Global Spatial-Temporal Information Encoder-Decoder Based Action Segmentation in Untrimmed Video. Tsinghua Sci. Technol. 2025, 30, 290–302. [Google Scholar] [CrossRef]
  25. Zhang, J.; Chen, X.; Yang, B.; Guan, Q.; Chen, Q.; Chen, J.; Wu, Q.; Xie, Y.; Xia, Y. Advances in attention mechanisms for medical image segmentation. Comput. Sci. Rev. 2025, 56, 100721. [Google Scholar] [CrossRef]
  26. Zhi, P.; Jiang, L.; Yang, X.; Wang, X.; Li, H.W.; Zhou, Q.; Li, K.C.; Ivanović, M. Cross-Domain Generalization for LiDAR-Based 3D Object Detection in Infrastructure and Vehicle Environments. Sensors 2025, 25, 767. [Google Scholar] [CrossRef]
  27. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Lecture Notes in Computer Science, Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W., Frangi, A., Eds.; PT III; Tech Univ Munich: Munich, Germany; Friedrich Alexander Univ Erlangen Nuremberg: Erlangen, Germany, 2015; Volume 9351, pp. 234–241. [Google Scholar] [CrossRef]
  28. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
  29. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  30. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  31. He, J.; Deng, Z.; Zhou, L.; Wang, Y.; Qiao, Y. Adaptive Pyramid Context Network for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  32. Choi, S.; Kim, J.T.; Choo, J. Cars Can’t Fly Up in the Sky: Improving Urban-Scene Segmentation via Height-Driven Attention Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  33. Elhassan, M.A.; Huang, C.; Yang, C.; Munea, T.L. DSANet: Dilated spatial attention for real-time semantic segmentation in urban street scenes. Expert Syst. Appl. 2021, 183, 115090. [Google Scholar] [CrossRef]
  34. Guo, L.; Li, X.; Wang, J.; Xiao, J.; Hou, Y.; Zhi, P.; Yong, B.; Li, L.; Zhou, Q.; Li, K. EdgeVidCap: A Channel-Spatial Dual-Branch Lightweight Video Captioning Model for IoT Edge Cameras. Sensors 2025, 25, 4897. [Google Scholar] [CrossRef]
  35. Catalano, N.; Matteucci, M. Few Shot Semantic Segmentation: A review of methodologies, benchmarks, and open challenges. arXiv 2024, arXiv:2304.05832. [Google Scholar] [CrossRef]
  36. Tang, S.; Yan, S.; Qi, X.; Gao, J.; Ye, M.; Zhang, J.; Zhu, X. Few-shot medical image segmentation with high-fidelity prototypes. Med. Image Anal. 2025, 100, 103412. [Google Scholar] [CrossRef]
  37. Shaban, A.; Bansal, S.; Liu, Z.; Essa, I.; Boots, B. One-Shot Learning for Semantic Segmentation. arXiv 2017, arXiv:1709.03410. [Google Scholar] [CrossRef]
  38. Li, A.; Luo, T.; Xiang, T.; Huang, W.; Wang, L. Few-Shot Learning with Global Class Representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  39. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. PANet: Few-Shot Image Semantic Segmentation with Prototype Alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  40. Luo, X.; Wu, H.; Zhang, J.; Gao, L.; Xu, J.; Song, J. A Closer Look at Few-shot Classification Again. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; Proceedings of Machine Learning Research; PMLR: Cambridge, MA, USA, 2023; Volume 202, pp. 23103–23123. [Google Scholar]
  41. Liu, Y.; Zhu, Y.; Chong, H.; Yu, M. Few-shot image semantic segmentation based on meta-learning: A review. J. Intell. Fuzzy Syst. 2024, 47, 351–367. [Google Scholar] [CrossRef]
  42. Dong, N.; Xing, E.P. Few-shot semantic segmentation with prototype learning. In Proceedings of the BMVC, Newcastle, UK, 3–6 September 2018; Volume 3, p. 4. [Google Scholar]
  43. Zhang, C.; Lin, G.; Liu, F.; Yao, R.; Shen, C. CANet: Class-Agnostic Segmentation Networks with Iterative Refinement and Attentive Few-Shot Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
44. Rakelly, K.; Shelhamer, E.; Darrell, T. Conditional Networks for Few-Shot Semantic Segmentation. ICLR Workshop, 2018. Available online: https://openreview.net/forum?id=SkMjFKJwG (accessed on 8 October 2025).
  45. Siam, M.; Oreshkin, B.N.; Jagersand, M. AMP: Adaptive Masked Proxies for Few-Shot Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  46. He, Z.; Li, L.; Wang, H. Symmetry-Guided Dual-Branch Network with Adaptive Feature Fusion and Edge-Aware Attention for Image Tampering Localization. Symmetry 2025, 17, 1150. [Google Scholar] [CrossRef]
47. Pambala, A.K.; Dutta, T.; Biswas, S. SML: Semantic meta-learning for few-shot semantic segmentation. Pattern Recognit. Lett. 2021, 147, 93–99. [Google Scholar] [CrossRef]
  48. Zhang, B.; Xiao, J.; Qin, T. Self-Guided and Cross-Guided Learning for Few-Shot Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 8312–8321. [Google Scholar]
49. Yang, B.; Liu, C.; Li, B.; Jiao, J.; Ye, Q. Prototype Mixture Models for Few-Shot Semantic Segmentation. In Proceedings of the 16th European Conference on Computer Vision (ECCV 2020), Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 12353, pp. 763–778. [Google Scholar] [CrossRef]
  50. Lu, Z.; He, S.; Zhu, X.; Zhang, L.; Song, Y.Z.; Xiang, T. Simpler Is Better: Few-Shot Semantic Segmentation with Classifier Weight Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 8741–8750. [Google Scholar]
  51. Li, G.; Jampani, V.; Sevilla-Lara, L.; Sun, D.; Kim, J.; Kim, J. Adaptive Prototype Learning and Allocation for Few-Shot Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 8334–8343. [Google Scholar]
52. Liu, Y.; Zhang, X.; Zhang, S.; He, X. Part-Aware Prototype Network for Few-Shot Semantic Segmentation. In Proceedings of the 16th European Conference on Computer Vision (ECCV 2020), Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 12354, pp. 142–158. [Google Scholar] [CrossRef]
  53. Li, W.; Xu, J.; Huo, J.; Wang, L.; Gao, Y.; Luo, J. Distribution consistency based covariance metric networks for few-shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8642–8649. [Google Scholar]
  54. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 7262–7272. [Google Scholar]
55. Chen, H.; Li, H.; Li, Y.; Chen, C. Sparse spatial transformers for few-shot learning. Sci. China Inf. Sci. 2023, 66, 210102. [Google Scholar] [CrossRef]
56. Dos Santos, M.E.; Guimarães, S.J.F.; Patrocínio, Z.K.G. Cross-Attention Vision Transformer for Few-Shot Semantic Segmentation. In Proceedings of the 2023 IEEE Ninth International Conference on Multimedia Big Data (BigMM), Laguna Hills, CA, USA, 11–13 December 2023; pp. 64–71. [Google Scholar] [CrossRef]
  57. Fateh, A.; Mohammadi, M.R.; Jahed-Motlagh, M.R. MSDNet: Multi-scale decoder for few-shot semantic segmentation via transformer-guided prototyping. Image Vis. Comput. 2025, 162, 105672. [Google Scholar] [CrossRef]
  58. Shao, J.; Gong, B.; Dai, K.; Li, D.; Jing, L.; Chen, Y. Query-support semantic correlation mining for few-shot segmentation. Eng. Appl. Artif. Intell. 2023, 126, 106797. [Google Scholar] [CrossRef]
  59. Moon, S.; Sohn, S.S.; Zhou, H.; Yoon, S.; Pavlovic, V.; Khan, M.H.; Kapadia, M. HM: Hybrid Masking for Few-Shot Segmentation. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 506–523. [Google Scholar] [CrossRef]
  60. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  61. Hariharan, B.; Arbeláez, P.; Bourdev, L.; Maji, S.; Malik, J. Semantic contours from inverse detectors. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 991–998. [Google Scholar] [CrossRef]
62. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the 13th European Conference on Computer Vision (ECCV 2014), Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; Volume 8693, pp. 740–755. [Google Scholar] [CrossRef]
  63. Yang, L.; Zhuo, W.; Qi, L.; Shi, Y.; Gao, Y. Mining Latent Classes for Few-Shot Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 8721–8730. [Google Scholar]
  64. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  65. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar] [CrossRef]
  66. Chang, Z.; Lu, Y.; Ran, X.; Gao, X.; Zhao, H. Simple yet effective joint guidance learning for few-shot semantic segmentation. Appl. Intell. 2023, 53, 26603–26621. [Google Scholar] [CrossRef]
  67. Gao, G.; Fang, Z.; Han, C.; Wei, Y.; Liu, C.H.; Yan, S. DRNet: Double Recalibration Network for Few-Shot Semantic Segmentation. IEEE Trans. Image Process. 2022, 31, 6733–6746. [Google Scholar] [CrossRef]
  68. Tian, Z.; Zhao, H.; Shu, M.; Yang, Z.; Li, R.; Jia, J. Prior Guided Feature Enrichment Network for Few-Shot Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1050–1065. [Google Scholar] [CrossRef]
  69. Chang, Z.; Lu, Y.; Wang, X.; Ran, X. MGNet: Mutual-guidance network for few-shot semantic segmentation. Eng. Appl. Artif. Intell. 2022, 116, 105431. [Google Scholar] [CrossRef]
  70. Chen, Y.; Chen, S.; Yang, Z.X.; Wu, E. Learning self-target knowledge for few-shot segmentation. Pattern Recognit. 2024, 149, 110266. [Google Scholar] [CrossRef]
  71. Lang, C.; Cheng, G.; Tu, B.; Han, J. Few-Shot Segmentation via Divide-and-Conquer Proxies. Int. J. Comput. Vis. 2024, 132, 261–283. [Google Scholar] [CrossRef]
  72. Wang, H.; Yang, Y.; Jiang, X.; Cao, X.; Zhen, X. You only need the image: Unsupervised few-shot semantic segmentation with co-guidance network. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Virtual, 25–28 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1496–1500. [Google Scholar]
  73. Hu, T.; Yang, P.; Zhang, C.; Yu, G.; Mu, Y.; Snoek, C.G. Attention-based multi-context guiding for few-shot semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8441–8448. [Google Scholar]
  74. Nguyen, K.; Todorovic, S. Feature Weighting and Boosting for Few-Shot Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 622–631. [Google Scholar] [CrossRef]
75. Wang, H.; Zhang, X.; Hu, Y.; Yang, Y.; Cao, X.; Zhen, X. Few-Shot Semantic Segmentation with Democratic Attention Networks. In Proceedings of the 16th European Conference on Computer Vision (ECCV 2020), Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 12358, pp. 730–746. [Google Scholar] [CrossRef]
76. Wang, H.; Yang, Y.; Cao, X.; Zhen, X.; Snoek, C.; Shao, L. Variational Prototype Inference for Few-Shot Semantic Segmentation. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Virtual, 5–9 January 2021; pp. 525–534. [Google Scholar] [CrossRef]
  77. Zhang, L.; Zhang, X.; Wang, Q.; Wu, W.; Chang, X.; Liu, J. RPMG-FSS: Robust Prior Mask Guided Few-Shot Semantic Segmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6609–6621. [Google Scholar] [CrossRef]
  78. Hu, Y.; Huang, X.; Luo, X.; Han, J.; Cao, X.; Zhang, J. Learning Foreground Information Bottleneck for few-shot semantic segmentation. Pattern Recognit. 2024, 146, 109993. [Google Scholar] [CrossRef]
  79. Xie, G.S.; Liu, J.; Xiong, H.; Shao, L. Scale-Aware Graph Neural Network for Few-Shot Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 5475–5484. [Google Scholar]
  80. Liu, B.; Ding, Y.; Jiao, J.; Ji, X.; Ye, Q. Anti-Aliasing Semantic Reconstruction for Few-Shot Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 9747–9756. [Google Scholar]
81. Fan, Q.; Pei, W.; Tai, Y.W.; Tang, C.K. Self-support Few-Shot Semantic Segmentation. In Proceedings of the 17th European Conference on Computer Vision (ECCV 2022), Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cisse, M., Farinella, G., Hassner, T., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2022; Volume 13679, pp. 701–719. [Google Scholar] [CrossRef]
  82. Guan, H.; Spratling, M. Query semantic reconstruction for background in few-shot segmentation. Vis. Comput. 2024, 40, 799–810. [Google Scholar] [CrossRef]
  83. Liu, Y.; Liu, N.; Cao, Q.; Yao, X.; Han, J.; Shao, L. Learning Non-Target Knowledge for Few-Shot Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11573–11582. [Google Scholar]
Figure 1. Overall architecture of the Symmetry-Aware Superpixel-Enhanced Model for few-shot semantic segmentation in a 1-way 1-shot setting.
Figure 2. Cross-Layer Feature Visualization. From left to right, the figure shows feature heatmaps from different layers of the backbone network: shallow layers highlight edges and texture details, while deeper layers produce responses that progressively focus on semantic regions and become more abstract. This visualization indicates that cross-layer features are complementary.
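For readers who wish to reproduce this kind of cross-layer visualization, the following is a minimal sketch and not code from the paper: it hooks the four stages of an ImageNet-pretrained ResNet-50, averages each stage's feature map over channels, and saves the normalized heatmaps. The image path and the choice of ImageNet weights are illustrative assumptions.

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet50
import matplotlib.pyplot as plt
from PIL import Image

# Hypothetical input path; any RGB image works for this illustration.
img = Image.open("query.jpg").convert("RGB")
x = T.Compose([T.Resize((473, 473)), T.ToTensor(),
               T.Normalize(mean=[0.485, 0.456, 0.406],
                           std=[0.229, 0.224, 0.225])])(img).unsqueeze(0)

model = resnet50(weights="IMAGENET1K_V1").eval()
feats = {}

def hook(name):
    def fn(module, inputs, output):
        feats[name] = output.detach()
    return fn

# Capture the outputs of the four residual stages.
for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(hook(name))

with torch.no_grad():
    model(x)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, name in zip(axes, feats):
    # Channel-wise mean gives a coarse activation heatmap per stage.
    hm = feats[name][0].mean(0)
    hm = (hm - hm.min()) / (hm.max() - hm.min() + 1e-8)
    ax.imshow(hm.numpy(), cmap="jet")
    ax.set_title(name)
    ax.axis("off")
plt.savefig("cross_layer_heatmaps.png", bbox_inches="tight")
```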
Figure 3. Cross-Layer Feature Fusion. Through a cross-hierarchical feature integration mechanism, fine-grained low-level cues are injected into high-level semantics, yielding optimized background representations in complex scenes.
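As an illustration of the general idea only, the sketch below implements a generic FPN-style top-down fusion in PyTorch: the high-level map is projected and upsampled, added to a projected low-level map, and smoothed by a 3 × 3 convolution. The channel widths and layer choices are assumptions, not the paper's exact CLF configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Generic top-down fusion: inject low-level edge/texture cues
    into high-level semantics (a sketch of cross-layer fusion)."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.lateral = nn.Conv2d(low_ch, out_ch, kernel_size=1)   # project low-level map
        self.reduce = nn.Conv2d(high_ch, out_ch, kernel_size=1)   # project high-level map
        self.smooth = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, f_low, f_high):
        # Upsample the high-level map to the low-level resolution, then fuse.
        f_high = F.interpolate(self.reduce(f_high), size=f_low.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.smooth(self.lateral(f_low) + f_high)

# Example: fuse ResNet-50 layer2 (512 ch) with layer4 (2048 ch) features.
fuse = TopDownFusion(low_ch=512, high_ch=2048, out_ch=256)
f_low, f_high = torch.randn(1, 512, 60, 60), torch.randn(1, 2048, 15, 15)
print(fuse(f_low, f_high).shape)  # torch.Size([1, 256, 60, 60])
```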
Figure 4. Hybrid Background Feature Module (HBM). The input feature $F_I$ is masked by the background mask $M$ to obtain the background feature $F_{BM_1}$. Then, $F_{BM_1}$ is combined with the background feature $F_{BM_0}$ extracted from the previous convolutional layer to produce the hybrid background feature $F_{BM_{0,1}}$.
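A minimal PyTorch sketch of the masking-and-merging step described in the caption is given below; it assumes both background features share the same channel width and uses a 1 × 1 convolution for the merge, which may differ from the actual HBM design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridBackgroundModule(nn.Module):
    """Sketch of a hybrid background feature: mask the current feature map
    with the background mask and merge it with the background feature
    taken from the previous convolutional stage."""
    def __init__(self, ch):
        super().__init__()
        self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=1)

    def forward(self, f_in, bg_mask, f_bg_prev):
        # Resize the binary background mask to the feature resolution.
        m = F.interpolate(bg_mask.float(), size=f_in.shape[-2:], mode="nearest")
        f_bg = f_in * m                                   # current-layer background feature
        f_bg_prev = F.interpolate(f_bg_prev, size=f_in.shape[-2:],
                                  mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([f_bg, f_bg_prev], dim=1))  # hybrid background feature

hbm = HybridBackgroundModule(ch=256)
f_in = torch.randn(1, 256, 60, 60)
bg_mask = (torch.rand(1, 1, 473, 473) > 0.5)
f_bg_prev = torch.randn(1, 256, 120, 120)
print(hbm(f_in, bg_mask, f_bg_prev).shape)  # torch.Size([1, 256, 60, 60])
```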
Figure 5. Superpixel–Prototype Relational Matching. The image feature map is fed into a superpixel segmentation module, which groups spatially adjacent and visually similar pixels into superpixels. For each superpixel, mean pooling is performed in the feature space to obtain its feature centroid, which serves as a fine-grained pseudo-prototype. The collection of these pseudo-prototypes compactly yet sufficiently characterizes the semantic distribution of the entire query image.
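The per-superpixel pooling can be sketched as follows (an illustration under stated assumptions, not the paper's implementation): SLIC superpixels are computed with scikit-image, the backbone feature map is upsampled to image resolution, and features are mean-pooled inside each superpixel to yield one pseudo-prototype per region. The n_segments and compactness values are placeholders.

```python
import numpy as np
import torch
import torch.nn.functional as F
from skimage.segmentation import slic

def superpixel_pseudo_prototypes(image_np, feat, n_segments=100):
    """Group pixels into superpixels (SLIC) and mean-pool the feature map
    inside each superpixel to obtain fine-grained pseudo-prototypes.

    image_np: HxWx3 RGB image (uint8)
    feat:     1xCxhxw backbone feature map
    returns:  (K, C) pseudo-prototypes and the HxW superpixel label map
    """
    labels = slic(image_np, n_segments=n_segments, compactness=10, start_label=0)
    H, W = labels.shape
    # Upsample features to image resolution so labels and features align.
    f = F.interpolate(feat, size=(H, W), mode="bilinear", align_corners=False)
    f = f[0].permute(1, 2, 0).reshape(-1, feat.shape[1])       # (H*W, C)
    lab = torch.from_numpy(labels).reshape(-1)                  # (H*W,)
    K = int(lab.max()) + 1
    protos = torch.zeros(K, f.shape[1]).index_add_(0, lab, f)   # sum per superpixel
    counts = torch.bincount(lab, minlength=K).clamp(min=1).unsqueeze(1)
    return protos / counts, labels                              # mean per superpixel

# Toy usage with random data (in practice: a query image and its backbone features).
img = (np.random.rand(120, 120, 3) * 255).astype(np.uint8)
feat = torch.randn(1, 256, 15, 15)
protos, labels = superpixel_pseudo_prototypes(img, feat)
print(protos.shape)  # roughly (n_segments, 256)
```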
Figure 6. Illustration of the non-parametric metric learning framework. The pipeline begins with prototypes extracted from support images, which are compared with the query features $F_q$ via cosine similarity ($1 \times 1 \times C$). At each spatial location, the argmax operation selects the most similar prototype to generate a guide map. This guide map is concatenated with the original query features $F_q$ along the channel dimension (denoted by ⊕) to produce enhanced features for the final segmentation prediction. The framework enables effective knowledge transfer from support samples to query images through prototype-based similarity matching.
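A compact sketch of this prototype-matching step is shown below, assuming the prototypes are stored as a (K, C) matrix; it normalizes features and prototypes, computes cosine similarities, takes the per-pixel best match as a guide channel, and concatenates it with the query features. The exact form of the guide map in the paper may differ.

```python
import torch
import torch.nn.functional as F

def prototype_guide_map(prototypes, f_q):
    """Non-parametric matching sketch.

    prototypes: (K, C) support prototypes
    f_q:        (B, C, H, W) query features
    returns:    (B, C+1, H, W) enhanced features and the (B, H, W) assignment map
    """
    B, C, H, W = f_q.shape
    q = F.normalize(f_q, dim=1).flatten(2)              # (B, C, H*W)
    p = F.normalize(prototypes, dim=1)                  # (K, C)
    sim = torch.einsum("kc,bcn->bkn", p, q)             # cosine similarity (B, K, H*W)
    best_sim, best_idx = sim.max(dim=1)                 # argmax over prototypes per pixel
    guide = best_sim.view(B, 1, H, W)                   # similarity of the best prototype
    enhanced = torch.cat([f_q, guide], dim=1)           # channel-wise concatenation
    return enhanced, best_idx.view(B, H, W)

protos = torch.randn(2, 256)                            # e.g., foreground + background
f_q = torch.randn(1, 256, 60, 60)
enhanced, assignment = prototype_guide_map(protos, f_q)
print(enhanced.shape, assignment.shape)                 # (1, 257, 60, 60) (1, 60, 60)
```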
Figure 7. Qualitative results of the 1-way 1-shot setting on PASCAL-$5^i$.
Table 1. Results of 1-way 1-shot and 5-shot segmentation on the PASCAL-$5^i$ dataset using the mIoU and FB-IoU metrics. The best results are in bold. * indicates the results we replicated ourselves.

| Methods | Backbone | 1-Shot mIoU | 1-Shot FB-IoU | 5-Shot mIoU | 5-Shot FB-IoU |
|---|---|---|---|---|---|
| OSLSM [37] | VGG-16 | 40.8 | 61.3 | 44.0 | 61.5 |
| co-FCN [44] | VGG-16 | 41.1 | 60.1 | 41.4 | 60.2 |
| PL [42] | VGG-16 | 42.7 | 61.2 | 43.7 | 62.3 |
| AMP [45] | VGG-16 | 43.4 | 62.2 | 46.9 | 63.8 |
| PANet [39] | VGG-16 | 48.1 | 66.5 | 55.7 | 68.4 |
| SG-One [18] | VGG-16 | 46.3 | 63.1 | 47.1 | 65.9 |
| JGLNet [66] | VGG-16 | 49.3 | 68.3 | 55.6 | 70.6 |
| DRNet [67] | VGG-16 | 52.4 | 67.5 | 55.2 | 70.0 |
| PFENet [68] | VGG-16 | 58.0 | – | 59.0 | – |
| MGNet [69] | VGG-16 | 43.9 | 67.8 | 50.3 | 50.3 |
| LSTNet [70] | VGG-16 | 58.5 | – | 60.4 | – |
| DCP [71] | VGG-16 | 62.6 | 75.6 | 67.8 | 80.6 |
| SSENet (Ours) | VGG-16 | 65.4 | 77.2 | 68.3 | 81.3 |
| PANet * [39] | ResNet-50 | 48.7 | 66.9 | 55.6 | 71.4 |
| CGNet [72] | ResNet-50 | 47.6 | 64.1 | 49.5 | 66.2 |
| PPNet [52] | ResNet-50 | 52.9 | – | 63.0 | – |
| SML [47] | ResNet-50 | 51.3 | 67.1 | 60.0 | 72.2 |
| PFENet [68] | ResNet-50 | 60.8 | 73.3 | 61.9 | 73.9 |
| ASGNet [51] | ResNet-50 | 59.3 | 69.2 | 63.9 | 74.2 |
| DRNet [67] | ResNet-50 | 58.6 | 71.4 | 61.7 | 73.7 |
| DGPNet [6] | ResNet-50 | 63.2 | – | 73.1 | – |
| MSDNet [57] | ResNet-50 | 64.3 | 77.1 | 68.7 | 82.1 |
| DCP [71] | ResNet-50 | 66.1 | 77.6 | 70.3 | 78.5 |
| SSENet (Ours) | ResNet-50 | 67.4 | 78.9 | 71.0 | 81.3 |
| PANet * [39] | ResNet-101 | 51.2 | 70.3 | 57.5 | 72.0 |
| A-MCG [73] | ResNet-101 | – | 61.2 | – | 62.2 |
| PPNet [52] | ResNet-101 | 55.2 | 70.9 | 65.1 | 77.5 |
| FWB [74] | ResNet-101 | 56.2 | – | 59.9 | – |
| DAN [75] | ResNet-101 | 58.2 | 71.9 | 60.5 | 72.3 |
| VPI [76] | ResNet-101 | 57.3 | – | 60.4 | – |
| ASGNet [51] | ResNet-101 | 59.3 | 71.7 | 64.4 | 75.2 |
| LSTNet [70] | ResNet-101 | 61.8 | – | 64.2 | – |
| PRMG [77] | ResNet-101 | 62.6 | – | 65.7 | – |
| PFENet+ [78] | ResNet-101 | 62.6 | 75.1 | 64.0 | 76.6 |
| MSDNet [57] | ResNet-101 | 64.7 | 77.3 | 70.8 | 85.0 |
| DCP [71] | ResNet-101 | 67.3 | 78.5 | 71.5 | 82.7 |
| SSENet (Ours) | ResNet-101 | 68.2 | 78.3 | 72.5 | 81.6 |
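For reference, the two metrics reported in Tables 1–4 are typically computed as sketched below (standard FSS evaluation practice, not the exact evaluation script of this paper): mIoU averages per-class IoU over the novel classes of a fold, while FB-IoU averages the class-agnostic foreground and background IoUs.

```python
import numpy as np

def miou_and_fbiou(preds, gts, num_classes):
    """Sketch of mIoU and FB-IoU over a list of episodes.

    preds, gts: lists of HxW integer masks (0 = background, 1..num_classes = class id)
    """
    inter = np.zeros(num_classes + 1)
    union = np.zeros(num_classes + 1)
    fb_inter = np.zeros(2)
    fb_union = np.zeros(2)
    for p, g in zip(preds, gts):
        for c in range(num_classes + 1):
            inter[c] += np.logical_and(p == c, g == c).sum()
            union[c] += np.logical_or(p == c, g == c).sum()
        pf, gf = (p > 0), (g > 0)                       # collapse to foreground/background
        for b, (pb, gb) in enumerate([(~pf, ~gf), (pf, gf)]):
            fb_inter[b] += np.logical_and(pb, gb).sum()
            fb_union[b] += np.logical_or(pb, gb).sum()
    miou = np.mean(inter[1:] / np.maximum(union[1:], 1))    # ignore background class 0
    fb_iou = np.mean(fb_inter / np.maximum(fb_union, 1))    # mean of BG and FG IoU
    return miou, fb_iou

# Toy example with two 4x4 episodes and a single novel class.
pred = [np.array([[0, 1, 1, 0]] * 4), np.array([[1, 1, 0, 0]] * 4)]
gt   = [np.array([[0, 1, 1, 1]] * 4), np.array([[1, 0, 0, 0]] * 4)]
print(miou_and_fbiou(pred, gt, num_classes=1))
```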
Table 2. Results of 2-way 1-shot and 5-shot tasks on the PASCAL-$5^i$ dataset.

| Methods | Task | mIoU (VGG-16) | mIoU (ResNet-50) | mIoU (ResNet-101) | FB-IoU (VGG-16) | FB-IoU (ResNet-50) | FB-IoU (ResNet-101) |
|---|---|---|---|---|---|---|---|
| PANet (Baseline) | 1-shot | 45.1 | 45.3 | 49.8 | 64.2 | 64.4 | 68.6 |
| SSENet (Ours) | 1-shot | 62.7 | 63.2 | 67.9 | 73.8 | 74.1 | 77.5 |
| PANet (Baseline) | 5-shot | 48.2 | 48.8 | 54.4 | 67.4 | 68.6 | 73.1 |
| SSENet (Ours) | 5-shot | 59.4 | 60.2 | 65.3 | 78.2 | 79.5 | 80.9 |
Table 3. Results of 1-way 1-shot and 5-shot segmentation on the COCO-$20^i$ dataset. * denotes the results implemented by ourselves. The best results are in bold.

| Methods | Backbone | 1-Shot mIoU | 1-Shot FB-IoU | 5-Shot mIoU | 5-Shot FB-IoU |
|---|---|---|---|---|---|
| PANet [39] | VGG-16 | 20.9 | 59.2 | 29.7 | 63.5 |
| DRNet [67] | VGG-16 | 18.5 | 58.3 | 25.2 | 62.6 |
| MGNet [69] | VGG-16 | 27.8 | 61.1 | 35.6 | 63.8 |
| JGLNet [66] | VGG-16 | 25.3 | 61.8 | 34.7 | 63.6 |
| LSTNet [70] | VGG-16 | 35.8 | – | 37.5 | – |
| PFENet [68] | VGG-16 | 34.1 | 60.0 | 37.7 | 61.6 |
| SML [47] | VGG-16 | 22.6 | 59.3 | – | – |
| SAGNN [79] | VGG-16 | 37.3 | 61.2 | 40.7 | 63.1 |
| SSENet (Ours) | VGG-16 | 43.1 | 65.8 | 45.6 | 66.8 |
| RPMM [49] | ResNet-50 | 30.6 | 60.4 | 42.5 | 67.0 |
| PANet * [39] | ResNet-50 | 23.6 | 63.0 | 34.2 | 64.1 |
| PPNet [52] | ResNet-50 | 29.0 | – | 38.5 | – |
| SML [47] | ResNet-50 | 23.3 | 59.5 | – | – |
| ASR [80] | ResNet-50 | 33.8 | – | 36.7 | – |
| MLC [63] | ResNet-50 | 33.9 | – | 40.6 | – |
| ASGNet [51] | ResNet-50 | 34.6 | 60.4 | 42.5 | 67.1 |
| CWT [50] | ResNet-50 | 32.9 | – | 41.3 | – |
| DRNet [67] | ResNet-50 | 23.3 | 61.4 | 32.2 | 64.8 |
| SSP [81] | ResNet-50 | 33.6 | – | 41.3 | – |
| QSCMNet [58] | ResNet-50 | 36.4 | 60.7 | 42.8 | 64.8 |
| LSTNet [70] | ResNet-50 | 36.6 | – | 38.0 | – |
| PFENet + QSR [82] | ResNet-50 | 35.1 | – | 38.2 | – |
| DCP [71] | ResNet-50 | 45.5 | – | 50.9 | – |
| SSENet (Ours) | ResNet-50 | 47.0 | 70.1 | 53.8 | 73.9 |
| FWB [74] | ResNet-101 | 21.2 | – | 23.7 | – |
| A-MCG [73] | ResNet-101 | – | 52.0 | – | 64.7 |
| PANet * [39] | ResNet-101 | 35.1 | 63.7 | 41.4 | 66.5 |
| PMMs [49] | ResNet-101 | 29.6 | – | 34.3 | – |
| DAN [75] | ResNet-101 | 24.4 | 62.3 | 29.6 | 63.9 |
| PFENet [68] | ResNet-101 | 38.5 | 63.0 | 42.7 | 65.8 |
| VPI [76] | ResNet-101 | 23.4 | – | 27.8 | – |
| SAGNN [79] | ResNet-101 | 37.2 | 60.9 | 42.7 | 63.4 |
| CWT [50] | ResNet-101 | 32.4 | – | 42.0 | – |
| NTRENet [83] | ResNet-101 | 39.1 | 67.5 | 43.2 | 69.6 |
| PFENet+ [78] | ResNet-101 | 38.2 | 61.8 | 39.9 | 63.4 |
| LSTNet [70] | ResNet-101 | 38.2 | – | 38.2 | – |
| PFENet + QSR [82] | ResNet-101 | 36.9 | – | 41.2 | – |
| DCP [71] | ResNet-101 | 44.6 | – | 49.4 | – |
| SSENet (Ours) | ResNet-101 | 47.2 | 68.9 | 53.6 | 72.4 |
Table 4. Results of 2-way 1-shot and 5-shot tasks on the COCO-$20^i$ dataset.

| Methods | Task | mIoU (VGG-16) | mIoU (ResNet-50) | mIoU (ResNet-101) | FB-IoU (VGG-16) | FB-IoU (ResNet-50) | FB-IoU (ResNet-101) |
|---|---|---|---|---|---|---|---|
| PANet (Baseline) | 1-shot | 20.5 | 22.3 | 34.2 | 58.7 | 59.6 | 63.4 |
| SSENet (Ours) | 1-shot | 39.7 | 41.0 | 45.8 | 64.3 | 69.2 | 64.9 |
| PANet (Baseline) | 5-shot | 32.7 | 32.5 | 40.1 | 61.2 | 62.2 | 65.8 |
| SSENet (Ours) | 5-shot | 48.2 | 46.9 | 51.2 | 66.0 | 71.4 | 70.6 |
Table 5. Ablation studies on the effect of different components. F: Foreground, B: Background, C: CLF Module, S: SPRM Module.

| Variants | PASCAL-$5^i$ mIoU | PASCAL-$5^i$ FB-IoU | COCO-$20^i$ mIoU | COCO-$20^i$ FB-IoU | Speed (FPS) |
|---|---|---|---|---|---|
| F + B (Baseline) | 48.7 | 66.9 | 23.6 | 63.0 | 17.4 |
| F_s + F_q^C | 50.3 | 67.8 | 25.8 | 63.7 | 17.2 |
| F_s^C + F_q^C | 52.1 | 68.6 | 28.2 | 64.3 | 17.2 |
| F_s^C + F_q^C + F + B | 54.2 | 69.5 | 33.8 | 64.9 | 17.2 |
| F_s + F_q^C + F^S + B | 56.0 | 70.3 | 35.1 | 65.6 | 16.8 |
| F_s + F_q^C + F + B^S | 57.8 | 71.2 | 36.4 | 66.2 | 16.8 |
| F_s^C + F_q + F^S + B | 59.4 | 72.0 | 38.6 | 66.8 | 16.8 |
| F_s^C + F_q + F + B^S | 61.1 | 72.9 | 40.7 | 67.4 | 16.7 |
| F_s^C + F_q^C + F + B^S | 62.6 | 73.7 | 42.8 | 68.1 | 16.7 |
| F_s^C + F_q^C + F^S + B | 64.2 | 74.6 | 44.7 | 68.7 | 16.5 |
| F_s^C + F_q^C + F + B^S | 65.8 | 75.9 | 46.2 | 69.3 | 16.5 |
| F_s^C + F_q^C + F^S + B^S (Ours) | 67.4 | 78.9 | 47.0 | 70.1 | 16.1 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
