Abstract
Few-shot classification aims to generalize from very limited samples, providing an effective solution for data-scarce scenarios. From a symmetry viewpoint, an ideal Few-Shot classifier should be invariant to class permutations and treat support and query features in a balanced manner, preserving intra-class cohesion while enlarging inter-class separation in the embedding space. However, existing methods often violate this symmetry because prototypes are estimated from few noisy samples, which induces asymmetric representations and task-dependent biases under complex inter-class relations. To address this, we propose FC-SBAAT (Feature Collaboration and Sparse Bias-Aware Attention Transformer), a framework that explicitly leverages symmetry in feature collaboration and prototype construction. First, we enhance symmetric interactions between support and query samples in both attention and contrastive subspaces and adaptively fuse these complementary representations via learned weights. Second, we refine prototypes by symmetrically aggregating intra-class features with learned importance weights, improving prototype quality while maintaining intra-class symmetry and increasing inter-class discrepancy. For matching, we introduce a Sparse Bias-Aware Attention Transformer that corrects asymmetric task bias through bias-aware attention with low computational overhead. Extensive experiments show that FC-SBAAT achieves 55.71% and 73.87% accuracy for 1-shot and 5-shot tasks on MiniImageNet and 70.37% and 83.86% on CUB, outperforming prior methods.
1. Introduction
In supervised learning models, classification performance typically depends on large amounts of accurately labeled data. However, obtaining large-scale annotated datasets is both time-intensive and costly, especially in domains like medical imaging and remote sensing, where the annotation process is not only complex but also requires the involvement of domain experts. As a result, in scenarios where labeled data is scarce, the model’s robustness might be impacted, potentially leading to issues such as overfitting or bias. To tackle the challenges posed by data scarcity, Few-Shot Learning (FSL) is recognized as a critical focus in current research efforts. This learning paradigm enables models to be trained with only a few labeled samples, simulating the human ability to learn from limited data. It thus reduces reliance on large-scale annotations and enhances model performance in data-constrained environments. From a broader pattern-recognition perspective, symmetry is a fundamental concept for designing robust classifiers. In Few-Shot Learning, an ideal model should preserve permutation symmetry with respect to class labels and treat support and query samples in a symmetric way so that similar objects are mapped to compact, approximately symmetric clusters in the embedding space. Such geometric and relational symmetries are crucial for constructing stable prototypes from only a few labeled samples.
A variety of effective strategies have been proposed in the area of Few-Shot classification. Metric-based methods have become a central focus in research because of their broad applicability and robustness in real-world applications. These methods are primarily characterized by making classification decisions through the construction of similarity metric functions in the feature space. Prototypical Networks [1] and Relation Networks [2] establish the foundational paradigm of metric learning through generative class prototype construction and discriminative relational modeling, respectively. The former pioneered the use of class-wise mean feature representations as prototypes, defining geometric similarity metrics in the feature space. The latter broke away from traditional fixed metric patterns and introduced an end-to-end learnable relational function model for the first time. Together, these two innovations provide a methodological cornerstone for subsequent research in Few-Shot classification. Matching Networks [3] proposed a prototype matching method based on the attention mechanism, while Neural-Network-based Metric Learning [4] explores optimization strategies for deep feature transformations.
However, prototype-based methods still face certain limitations in Few-Shot classification tasks. These include the difficulty of adapting fixed metric functions to intra-class distribution shifts, the accumulation of metric errors among heterogeneous samples, sensitivity to outliers, and the inability to model complex inter-class relationships—all of which may result in degraded classification performance. To mitigate the constraints of Few-Shot classification tasks, researchers have adopted Graph Neural Networks (GNNs) to enhance learning by modeling relationships between samples, as in Few-Shot Learning with Graph Neural Networks [5]. Building on this, Prototype Rectification for Few-Shot Learning [6] improves classification accuracy by correcting prototype vectors using information from neighboring classes. Another class of methods focuses on incorporating external information to improve model learning. For example, Learning with Side Information for Few-Shot Learning [7] introduces contextual data such as labels or class attributes to help the model better capture inter-class similarities and differences. AMRN [8] improves classification performance by learning multiple prototypes per class to handle intra-class diversity. In terms of task optimization, Task-Aware Prototype Refinement (TAPR) [9] dynamically adjusts prototype vectors during training based on relationships between task-relevant samples. Meanwhile, distributed Few-Shot Learning methods, such as Distributed Few-Shot Learning with Prototype Distribution Correction, further optimize intra-class relationships by introducing distribution correction techniques.
For relation modeling, SaberNet [10] introduces a self-attention mechanism to better model relationships between samples. PSANet [11] applies a salient attention mechanism to focus on key image regions and improve segmentation performance. SCL [12] leverages base class information to train multiple self-supervised objectives and enhance classification. Although significant advances have been made in Few-Shot classification, most approaches handle features within each class separately and do not adequately capture relationships between classes, feature interactions, and multi-level semantic correlations. Moreover, task adaptability remains limited. From the symmetry perspective, the above limitations can be summarized as breaking under low-data regimes. Noisy supports distort class centers and break intra-class exchangeability, which yields unstable prototypes. Meanwhile, insufficient modeling of support–query interactions may introduce asymmetric feature representations and task-dependent biases. Therefore, restoring symmetry in prototype construction and matching becomes a direct path to improving robustness and generalization in Few-Shot tasks. These constraints reduce the expressive capacity of models when dealing with complex Few-Shot classification tasks, hindering further performance improvements.
To address prototype instability and inefficient matching in Few-Shot classification, we propose FC-SBAAT, a unified framework that jointly designs feature collaboration, prototype rectification, and attention-based matching. The method first improves the reliability of class representations by rectifying prototypes under scarce and noisy support sets and then leverages the rectified prototypes to guide the subsequent matching process. Specifically, FC-SBAAT performs fine-grained reinforcement in two complementary subspaces and builds intra-class relations to estimate the reliability of support instances. An MLP is then employed to adaptively fuse the enhanced representations, followed by an intra-class-relation-guided refinement that conducts instance-level symmetric weighted aggregation to obtain rectified prototypes, as illustrated in Figure 1. Notably, this rectification differs from common prototype rectification strategies that rely on cross-class neighborhoods or query-dependent heuristics; instead, it exploits intra-class consistency to suppress outliers and mitigate prototype shift. In novel-class Few-Shot scenarios, cross-class or query-driven rectification may introduce unreliable external cues and amplify the shift; hence, we adopt intra-class-consistency-driven symmetric aggregation to improve prototype stability.
Figure 1.
Prototype generation process (ProtoNet: the class mean is used as the prototype; Ours: prototype correction is used to refine the prototype). Each colored circle represents a sample of a class, the dashed circle represents an uncorrected prototype, and the star represents a corrected prototype.
Furthermore, we introduce a Sparse Bias-Aware Attention mechanism in the Transformer decoder to address two key challenges in Few-Shot matching: the high time complexity of dense cross-attention and the susceptibility to task-irrelevant keys in low-data regimes. Concretely, a relevance-driven bias term is injected into the attention logits, and sparse selection is performed on the biased logits so that normalization and value aggregation are computed only over the most task-relevant connections. Regarding information-flow coupling, the class representations produced by prototype rectification are directly used to construct the relevance bias for matching, which in turn determines sparse selection and aggregation, enabling coupled optimization between prototype construction and relation matching. This design improves semantic focus while reducing the time complexity of attention computation, leading to better accuracy, robustness, and generalization.
The principal contributions presented in this study include the following:
- We propose FC-SBAAT, a unified framework that jointly designs feature collaboration, prototype rectification, and attention-based matching and establishes an information-flow coupling from prototype rectification to relevance-bias construction and sparse selection. The rectified prototypes directly shape the matching bias and determine the sparse aggregation scope, enabling coupled optimization between prototype construction and relation matching and improving accuracy, robustness, and generalization under noisy Few-Shot prototypes and complex inter-class relations.
- We introduce a two-subspace feature enhancement mechanism that strengthens fine-grained representations in two complementary subspaces and builds intra-class relations to characterize the reliability of support samples. An MLP is then employed to adaptively fuse the enhanced features, producing more informative task representations for prototype generation.
- We develop an intra-class-consistency-based prototype rectification strategy that suppresses the influence of outliers and noisy samples on mean prototypes through intra-class relation guided symmetric weighted aggregation, thereby mitigating prototype shift. This strategy avoids cross-class neighborhood or query-driven heuristics and is particularly suitable for novel-class Few-Shot scenarios.
- We introduce a Sparse Bias-Aware Attention mechanism in the Transformer decoder for matching. A relevance-driven bias term is injected into attention logits, and sparse selection is performed on the biased logits so that normalization and value aggregation are computed only over the most task-relevant subset, reducing the time complexity of attention computation while improving semantic focus.
2. Related Work
2.1. Few-Shot Learning
Few-Shot Learning (FSL) tackles the difficulty of training models with limited labeled data. Meta-Learning (ML) [13], a significant area of research, is generally classified into three primary methodologies: metric-based, optimization-based, and model-based approaches.
Metric-based methods concentrate on establishing an optimal embedding space by reducing distances between same-category instances and enlarging the gap between different-category instances. The Siamese Network [14], an early approach, utilized a twin network structure to tackle Few-Shot classification by learning pairwise sample similarity. ProtoNet introduced a simpler approach, using the mean of the support set samples as the class prototype and classifying query samples by quantifying their dissimilarity to the class prototypes. Relation Network further improved metric learning by employing a learnable relation module to measure the similarity between query samples and support set samples.
Recent studies, like Matching Networks, employed an attention mechanism with soft attention weights to model the relationship between support samples and query samples. Optimization-based approaches focus on learning an effective optimization strategy that allows the model to rapidly adjust its parameters for new tasks. MAML [15] was a classic representative of such methods, which trained model parameters across multiple tasks to enable fast adaptation to novel tasks following a limited number of gradient updates. Subsequent studies such as Meta-SGD [16] further improved the MAML model by simultaneously learning the model’s parameters and the learning rate to improve adaptability. The Reptile method [17] incorporated inter-task variation information into the model parameters through an optimization strategy of multiple task iterations. In addition, MetaOptNet [18] enhanced the model’s adaptability and generalization by addressing the quadratic constraint problem on the support set and solving the linear classifier for the query set.
Model-based approaches achieved Few-Shot Learning by constructing models with fast learning capabilities [19]. The LSTM Meta-Learner proposed by Ravi and Larochelle was one of the early representatives of this approach, which simulated the updating process of an optimization algorithm by training an LSTM, facilitating rapid adaptation of the model to previously unencountered tasks. MetaNet [20] introduced an external memory module, which stored task-related information in external memory and allowed for fast retrieval and updating of new tasks. The recent SNAIL [21] model enabled the model to efficiently acquire historical information at each step by introducing temporal convolution and attentional mechanisms to better adapt to new tasks.
Hybrid methods integrated the strengths of metric, optimization, and model learning to enhance Few-Shot Learning performance through multi-strategy fusion. Meta-Critic Networks [22] proposed a hybrid meta-learning approach based on reinforcement learning, which dynamically selected the optimal policy to update the model parameters. pMAML [23] improved the traditional MAML method by combining probabilistic modeling with MAML.
Although the aforementioned methods have made certain progress in Few-Shot classification, there are still several unresolved challenges. Metric-based approaches exhibit limitations and instability when dealing with complex class distributions and diverse sample features. For example, the attention mechanism in Matching Networks mainly focuses on local similarities, i.e., local relationships between support and query samples. However, it lacks the ability to process global contextual information across classes or tasks, making it ineffective in leveraging task-relevant global features. Optimization-driven approaches seek to acquire a generic optimization strategy or parameter initialization via multi-task learning, empowering the model to rapidly learn new tasks using only a few gradient iterations. However, they often overlook the stability of the gradient update process. When there is a large distributional gap between tasks, the model may fail to find a stable optimization direction within a limited number of updates. Model-based approaches attempt to construct models with rapid learning and adaptation capabilities so that they are able to acquire new tasks from a limited number of samples or training iterations. However, these methods typically rely on fixed feature extraction and representation mechanisms, making it difficult to adapt flexibly across different tasks.
To tackle the issues mentioned above, we present a unified framework that jointly designs feature collaboration, prototype rectification, and Transformer-based matching. We enhance support and query representations through feature collaboration in two complementary subspaces, where global contextual cues and inter-class contrast are jointly exploited to improve discriminability. Based on intra-class relations, we estimate the reliability of support samples and employ an MLP [24] to adaptively fuse the enhanced features, producing task-aware representations for prototype construction. The fused representations are further used to rectify class prototypes via intra-class consistency driven symmetric weighted aggregation, which suppresses noisy or atypical support samples and mitigates prototype shift. To improve task adaptability and matching efficiency, we design a Top-t Sparse Bias-Aware Attention module in the Transformer decoder. A similarity-derived bias is injected into the attention logits, and sparse selection is performed on the biased logits so that normalization and value aggregation are computed only over the selected subset. This information-flow coupling allows the rectified prototypes to shape the relevance bias and sparsification during matching, improving semantic focus while reducing the time complexity of attention computation.
2.2. Contrastive Learning in Few-Shot Learning
The Contrastive Learning framework improves feature robustness by aligning positive sample pairs more closely and separating negative pairs. This method not only strengthens feature representation but also alleviates overfitting issues in data-scarce scenarios.
Self-Supervised Learning for Few-Shot Image Classification [25] introduced the integration of self-supervised Learning with Few-Shot Learning by designing auxiliary tasks, such as rotation prediction and color recovery. Global Class Representations [26] addressed inter-class separation challenges by constructing global representation vectors for each class and incorporating contrastive loss at the global level. Building on these approaches, we integrate Contrastive Learning into the feature co-reinforcement process, enhancing intra-class feature similarity by expanding the feature subspace, thereby improving the effectiveness and stability of feature representations.
2.3. Multi-Head Attention
Attention mechanisms [27] were designed to mimic the ability of humans to pay selective attention when processing information, giving models the flexibility to automatically and selectively focus on important information while ignoring irrelevant or minor parts, thereby improving the models' ability to understand and process complex data. Multi-Head Attention was further introduced to address the information redundancy that a single Self-Attention head may overlook when processing sequences. By using multiple attentional "heads" in parallel, Multi-Head Attention enables the model to learn varied data representations within different subspaces, thus capturing the fused global information. For a given sample matrix $X \in \mathbb{R}^{n \times d}$, where n represents the number of features and d represents the feature dimensionality, the Multi-Head Attention mechanism aims to acquire diverse feature representations and capture global contextual information through different subspaces, each of which represents a Self-Attention module. The Self-Attention is computed by acquiring the Query, Key, and Value for the attention computation, which requires three trainable parameter matrices. For the three trainable parameter matrices $W^{Q}$, $W^{K}$, and $W^{V}$, their dimensions are $d \times d_k$, $d \times d_k$, and $d \times d_v$, respectively; usually, $d_k = d_v = d/c$, where c is the number of heads. The matrices are obtained by the following formulas:

$$Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V}.$$
The output of the Self-Attention mechanism is calculated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V.$$
The final output of Multi-Head Attention is obtained by concatenating the outputs of all heads and mapping the result with a linear transformation $W^{O}$:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\!\left(\mathrm{head}_1, \dots, \mathrm{head}_c\right) W^{O},$$
where the output dimension of each head is $d_v$, and the total dimension after concatenation is $c \cdot d_v$. In this way, the Multi-Head Attention mechanism is able to capture more diverse and detailed feature representations in different subspaces. As proposed in [28], MTAN introduced the Multi-Head Attention mechanism in the feature extraction and classification stages to enhance feature representation and generalization performance of Few-Shot Learning models. SPAENet [29] enhances feature representation by attentively weighting the input image. Drawing inspiration from MTAN, we integrate the Multi-Head Attention mechanism into feature co-enhancement, where the heads focus on global information and intra-class relationships. By feeding all support features into the Multi-Head Attention module, the feature representation is strengthened during context processing. Multi-Head Attention improves feature representation while simultaneously modeling relationships between classes, thus generating an inter-class relationship matrix, which provides more accurate relationship weights for subsequent prototype generation. Compared with traditional methods, our approach not only has a certain advantage in feature enhancement but also is more flexible in modeling inter-class relationships, which enhances the representational ability of prototypes by explicitly constructing inter-class relationship matrices and further enhances the model's generalization capability and robustness.
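To make the computation concrete, the sketch below implements the scaled dot-product Multi-Head Self-Attention described by the equations above in PyTorch; the class name, feature dimension, and head count are illustrative choices rather than the configuration used in this paper.

```python
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention over a set of n feature vectors (n x d)."""

    def __init__(self, d: int, num_heads: int):
        super().__init__()
        assert d % num_heads == 0, "feature dim must be divisible by the head count"
        self.num_heads = num_heads
        self.d_head = d // num_heads
        # Trainable projections W^Q, W^K, W^V and the output map W^O
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.w_o = nn.Linear(d, d, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, d = x.shape
        # Project, then split into heads: (num_heads, n, d_head)
        q = self.w_q(x).view(n, self.num_heads, self.d_head).transpose(0, 1)
        k = self.w_k(x).view(n, self.num_heads, self.d_head).transpose(0, 1)
        v = self.w_v(x).view(n, self.num_heads, self.d_head).transpose(0, 1)
        # Scaled dot-product attention per head
        weights = (q @ k.transpose(-2, -1) / self.d_head ** 0.5).softmax(dim=-1)
        heads = weights @ v
        # Concatenate heads and apply the output projection
        return self.w_o(heads.transpose(0, 1).reshape(n, d))


# Example: 5 support features of dimension 64, 4 heads
enhanced = MultiHeadSelfAttention(d=64, num_heads=4)(torch.randn(5, 64))
```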
2.4. Transformer-Based Methods
Since its inception, the Transformer architecture has found extensive applications in diverse sequence modeling tasks, spanning natural language processing, image recognition, and multimodal fusion. In particular, the cross-attention mechanism within Transformer has been extensively used for modeling information interactions across different modalities [30,31]. These methods utilize fully connected attention to capture global relationships between modalities. Although effective, many improved Transformer variants have been proposed to better suit Few-Shot Learning scenarios. For instance, Swin Transformer [32] introduces local window attention to mitigate overfitting risks. AdapterFusion [33] incorporates lightweight adaptation layers into pre-trained Transformers, enabling state-of-the-art performance in Few-Shot tasks with minimal fine-tuning. SLTRN [34] enhances Few-Shot image classification by introducing Transformer-based context modeling both within the support set and between support–query pairs. Inspired by these Transformer-based Few-Shot efforts, we propose a Sparse Bias-Aware Attention module that revisits the support–query interaction beyond conventional cross-attention. In contrast to SLTRN, which improves performance mainly through Transformer context modeling over support samples and support–query pairs, our design introduces task-conditioned bias derived from rectified prototypes and applies Top-t sparsification on the biased attention logits. This prototype-guided bias and sparsification are coupled in the matching stage so that attention is encouraged to concentrate on semantically relevant connections while suppressing task-irrelevant ones that may cause overfitting under scarce supervision. Meanwhile, the sparse computation reduces the time complexity of normalization and value aggregation, improving efficiency without sacrificing discriminative focus. As a result, the proposed module improves robustness and generalization in Few-Shot regimes.
3. Methodology
3.1. Notations and Problem Definition
To formalize the subsequent formulation, we consider a standard Few-Shot classification setting with a base-class dataset $\mathcal{D}_{\mathrm{base}}$ and a novel-class dataset $\mathcal{D}_{\mathrm{novel}}$, whose class sets are disjoint. Following the episodic meta-learning protocol, the overall procedure is divided into a meta-training phase and a meta-testing phase, and the model is optimized/evaluated over a sequence of Few-Shot tasks (episodes).
In the meta-training phase, each training episode is constructed by first sampling K classes from $\mathcal{D}_{\mathrm{base}}$ to form a K-way N-shot task. The support set is defined as

$$\mathcal{S} = \left\{\left(x^{k}_{i},\, y^{k}\right)\right\}_{k=1,\dots,K;\; i=1,\dots,N},$$
where $x^{k}_{i}$ denotes the i-th labeled support sample from the k-th sampled class, and $y^{k}$ denotes the class label for the k-th class in the episode. In addition, we sample query instances from the remaining examples of the same K classes to form the query set

$$\mathcal{Q} = \left\{\left(x^{q}_{m},\, y^{q}_{m}\right)\right\}_{m=1}^{M}.$$
N is typically very small (e.g., 1 or 5). Importantly, in our method, the query labels in $\mathcal{Q}$ are available during meta-training and are used to compute the training objectives (e.g., relation-based supervision and Contrastive Learning terms) together with the support set, while the model is still required to predict query classes conditioned on the support set.
In the meta-testing phase, we evaluate the trained model on test episodes sampled from $\mathcal{D}_{\mathrm{novel}}$ under the same K-way N-shot protocol.
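For concreteness, the following sketch shows one way a K-way N-shot episode could be sampled under this protocol; the `sample_episode` function and the `(sample, label)` dataset format are hypothetical and not part of our implementation.

```python
import random
from collections import defaultdict


def sample_episode(dataset, k_way=5, n_shot=1, n_query=16, rng=random):
    """Draw one K-way N-shot episode: support and query lists of (sample, episode_label) pairs."""
    by_class = defaultdict(list)
    for sample, label in dataset:
        by_class[label].append(sample)

    # Sample K classes, then split each class into N support and n_query query samples
    classes = rng.sample(list(by_class.keys()), k_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        chosen = rng.sample(by_class[cls], n_shot + n_query)
        support += [(s, episode_label) for s in chosen[:n_shot]]
        query += [(s, episode_label) for s in chosen[n_shot:]]
    return support, query
```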
3.2. Framework
Our framework is composed of four core modules and one data augmentation preprocessing step, as shown in Figure 2. Specifically, the data augmentation module implements task-adaptive sample enhancement to mitigate the problem of data scarcity; the FCM module conducts feature collaboration through two spaces (Attention Space and Contrast Space) to strengthen intra-class correlations and improve inter-class discriminability; the AFM module performs dynamic weighted fusion on dual-subspace features to avoid information redundancy and enhance model stability; the IPRM module refines prototypes under the guidance of intra-class relations to boost the representativeness of prototypes, and the SBAATM adopts a Sparse Bias-Aware Attention mechanism to reduce time complexity and achieve an accurate classification matching process.
Figure 2.
System architecture diagram. Note: The symbol ⨂ denotes element-wise multiplication, and the symbol ⨁ denotes concatenation.
3.3. FCM
We treat support and query samples as unordered sets and enhance them through relation-based operators. Multi-Head Attention computes interactions based on pairwise similarities, so re-ordering inputs only permutes the attention matrix accordingly and preserves permutation equivariance. The contrastive subspace further enforces intra-class cohesion and inter-class separation, which can be viewed as a symmetry-preserving regularization in the embedding space. By fusing two complementary subspaces with learned weights, FCM produces symmetry-consistent representations for subsequent prototype construction. The specific procedure is described as follows:
In the feature reinforcement stage, we perform feature collaborative enhancement through the parallel Multi-Head Attention and Contrastive Learning methods within FCM, thereby addressing the intra-class heterogeneity issue. Specifically, during the data augmentation process, we adopt the mixing enhancement scheme corresponding to the ImageNet Enhancement data augmentation strategy, as shown in Figure 2. If the number of support samples is equal to or less than 1 (1-shot scenario), we apply traditional data augmentation operations, including color perturbation, Gaussian noise, blurring, and other transforms. These enhancements are applied multiple times in sequence, with intensities set to 10, 20, and 30, respectively. When the number of support samples is greater than 1, we adopt a sample mixing strategy, randomly selecting two samples from the same class and mixing them according to a certain ratio to enhance sample diversity. This strategy helps improve the model's ability to learn from different samples, thereby enhancing classification performance. The specific formula is as follows:
$$\tilde{x} = \lambda\, x_a + \left(1 - \lambda\right) x_b,$$

where $x_a$ and $x_b$ are different support samples of the same category, and $\lambda$ is a random mixing proportion (a random value between 0 and 1). The full augmentation process, which defines the enhancement strategy based on the number of support samples $N$, can therefore be summarized as

$$x_{\mathrm{aug}} =
\begin{cases}
\mathrm{Aug}(x), & N \le 1,\\
\lambda\, x_a + (1-\lambda)\, x_b, & N > 1,
\end{cases}$$

where $\mathrm{Aug}(\cdot)$ denotes the traditional augmentation operations described above.
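A minimal sketch of this shot-dependent scheme is given below; the `traditional_aug` callable stands in for the color-perturbation/noise/blur pipeline, and the tensor layout and number of views are assumptions for illustration.

```python
import random
import torch


def augment_support_class(samples: torch.Tensor, traditional_aug, n_views: int = 3) -> torch.Tensor:
    """Augment the support samples of one class, shaped (N, C, H, W).

    1-shot: apply the traditional augmentation pipeline several times.
    N > 1 : mix two random same-class samples with a random ratio lambda in (0, 1).
    """
    n = samples.shape[0]
    if n <= 1:
        return torch.stack([traditional_aug(samples[0]) for _ in range(n_views)])
    i, j = random.sample(range(n), 2)
    lam = random.random()
    mixed = lam * samples[i] + (1.0 - lam) * samples[j]
    return mixed.unsqueeze(0)
```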
Subsequently, the feature-enhanced support samples are passed through the backbone network for feature extraction, and the feature extractor produces the feature representations of the support and query samples, denoted as $F_s$ and $F_q$, respectively.
For the Multi-Head Attention enhancement component of the Feature Co-enhancement Module (FCM), we first extract features of all support and query samples via the backbone network, as illustrated in Figure 2, yielding $F_s$ and $F_q$. Linear transformations are then applied to generate the query, key, and value matrices Q, K, and V, respectively.
Here, $F_s \in \mathbb{R}^{S \times D}$ denotes the support feature matrix (with S samples in the class and feature dimension D). Multi-Head Attention uses c heads indexed by $h \in \{1, \dots, c\}$. For the h-th head, the query, key, and value are obtained by head-specific linear projections:
$$Q_h = F_s W^{Q}_{h}, \quad K_h = F_s W^{K}_{h}, \quad V_h = F_s W^{V}_{h},$$

where $W^{Q}_{h}, W^{K}_{h}, W^{V}_{h} \in \mathbb{R}^{D \times d_h}$ are trainable parameter matrices, and $d_h = D / c$. Equivalently, one may first compute $Q = F_s W^{Q}$, $K = F_s W^{K}$, and $V = F_s W^{V}$ and then split them into c head-wise slices.
For each head h, the dot-product similarity between $Q_h$ and $K_h$ is computed and normalized by $\sqrt{d_h}$ to produce attention weights, which are then used to aggregate $V_h$:

$$\mathrm{head}_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_h}}\right)V_h.$$
Finally, the outputs of all heads are concatenated and projected back to the input feature space by a learnable output matrix $W^{O}$:

$$\mathrm{MultiHead}(F_s) = \mathrm{Concat}\!\left(\mathrm{head}_1, \dots, \mathrm{head}_c\right) W^{O},$$
where $\mathrm{head}_h$ denotes the attentional output of the h-th head, and $W^{O} \in \mathbb{R}^{c d_h \times D}$ is the learnable output mapping matrix. The support feature, enhanced by the Multi-Head Attention mechanism, is represented as $\hat{F}_s$, as described by the following equation:

$$\hat{F}_s = \mathrm{MultiHead}(F_s).$$
To facilitate a clearer elaboration of the algorithm details, in the following, we adopt a more refined index notation to supplement the definition of relevant symbols so as to precisely map to the feature information of different classes and samples. This Multi-Head Attention module enables the model to capture fine-grained intra-support feature dependencies more accurately in different subspaces by computing multiple attention heads in parallel. For each class $k$, we construct an intra-class similarity matrix $M^{k} \in \mathbb{R}^{N \times N}$ based on the N attention-enhanced support features of that class, where $M^{k}_{ij}$ denotes the $(i, j)$-th entry of $M^{k}$:

$$M^{k}_{ij} = \frac{\hat{f}^{k}_{i} \cdot \hat{f}^{k}_{j}}{\left\|\hat{f}^{k}_{i}\right\| \left\|\hat{f}^{k}_{j}\right\|},$$
where $K$ denotes the number of classes in an episode, N is the number of support samples per class, and $\hat{f}^{k}_{i}$ is the enhanced feature of the i-th support sample in the k-th class.
For the contrastive enhancement part of FCM, which is illustrated in the Contrast Space of FCM (denoted as FCM: Contrast Space in Figure 2), positive and negative pairs are constructed from samples of the same and different classes, respectively. Given query features $F_q$ and support features $F_s$, we map each feature $f$ to a contrastive embedding $z$ via a shared projection head followed by normalization:

$$z = \frac{g(f)}{\left\|g(f)\right\|_{2}},$$
where $g(\cdot)$ denotes a shared projection head (e.g., a lightweight MLP) applied to both support and query features, and $\left\|\cdot\right\|_{2}$ performs $\ell_2$ normalization, i.e., $\left\|z\right\|_{2} = 1$. This operation projects heterogeneous features into a unified embedding space and makes $z$ comparable across samples.
For an anchor embedding $z^{k}_{i}$, we construct the positive sample from another support feature of the same class:

$$z^{+}_{k,i} = z^{k}_{j}, \quad j \neq i,$$
where $k$ indexes the class in an episode, and $i$ and $j$ index the N support samples of class k. When $N = 1$, $z^{+}_{k,i}$ is obtained by using another augmented view of the same support sample to ensure a valid positive pair.
We construct negative samples from query features belonging to different classes:

$$\mathcal{N}_{k,i} = \left\{ z^{q}_{m} \;\middle|\; y^{q}_{m} \neq k \right\},$$
where m indexes a query sample, and $y^{q}_{m}$ denotes its class label. The similarity function is defined as $\mathrm{sim}(u, v) = u^{\top} v$, which is equivalent to cosine similarity after $\ell_2$ normalization. Let $\mathcal{N}_{k,i}$ denote the set of negatives for anchor $z^{k}_{i}$.
The InfoNCE loss is formulated as

$$\mathcal{L}_{\mathrm{CL}} = -\frac{1}{KN} \sum_{k=1}^{K} \sum_{i=1}^{N} \log \frac{\exp\!\left(\mathrm{sim}\!\left(z^{k}_{i}, z^{+}_{k,i}\right)/\tau\right)}{\exp\!\left(\mathrm{sim}\!\left(z^{k}_{i}, z^{+}_{k,i}\right)/\tau\right) + \sum_{z^{-} \in \mathcal{N}_{k,i}} \exp\!\left(\mathrm{sim}\!\left(z^{k}_{i}, z^{-}\right)/\tau\right)},$$
where $\tau$ is a temperature parameter controlling the sharpness of the softmax distribution. The numerator encourages the anchor to be close to its positive sample, while the denominator contrasts the anchor against all negatives in $\mathcal{N}_{k,i}$. The factor $\frac{1}{KN}$ averages the loss over all anchors in the support set within each episode.
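The loss above can be computed with a few lines of PyTorch, as in the sketch below; it assumes the anchor, positive, and per-anchor negative embeddings have already been produced by the shared projection head, and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(anchors, positives, negatives, tau=0.1):
    """Episode-level InfoNCE over L2-normalized embeddings.

    anchors, positives: (A, d) matched row by row; negatives: (A, M, d) per-anchor negatives.
    """
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = (anchors * positives).sum(dim=-1) / tau                # (A,)
    neg_sim = torch.einsum("ad,amd->am", anchors, negatives) / tau   # (A, M)

    # Cross-entropy with the positive placed at index 0 reproduces the InfoNCE ratio
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)
    targets = torch.zeros(anchors.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, targets)                          # mean over all anchors


loss_cl = info_nce_loss(torch.randn(10, 64), torch.randn(10, 64), torch.randn(10, 32, 64))
```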
3.4. AFM
To produce more distinctive and discriminative features, the MLP-driven Adaptive Fusion Module (AFM) performs a weighted fusion of the contrastive-learning reinforced features and the Multi-Head-Attention-enhanced features. Specifically, for each support sample indexed by the class label $k$ and the intra-class index $i$, we concatenate the attention-enhanced feature $\hat{f}^{k}_{i}$ and the contrastive reinforced feature $\bar{f}^{k}_{i}$ to form a fused representation:

$$c^{k}_{i} = \left[\hat{f}^{k}_{i}\,;\, \bar{f}^{k}_{i}\right].$$
We use an MLP, as shown in the AFM module of Figure 2, to generate dynamic weights. For each support sample indexed by $(k, i)$, the MLP takes the concatenated feature $c^{k}_{i}$ as input and outputs a scalar weight $\alpha^{k}_{i}$ through a linear transformation followed by a nonlinear activation function so as to dynamically control the fusion ratio of the features. The weight generator is formulated as

$$\alpha^{k}_{i} = \sigma\!\left(W_{2}\, \phi\!\left(W_{1} c^{k}_{i} + b_{1}\right) + b_{2}\right),$$
where $W_{1}$ and $W_{2}$ are the learnable weight matrices, and $b_{1}$ and $b_{2}$ are the corresponding bias terms. $\phi(\cdot)$ is a nonlinear activation function used to enhance the expressive capacity of the model. $\sigma(\cdot)$ denotes the sigmoid function, which constrains the generated weight to the range $(0, 1)$.
After obtaining the dynamic weights $\alpha^{k}_{i}$, we perform a weighted fusion between the Contrastive-Learning-reinforced feature $\bar{f}^{k}_{i}$ and the Multi-Head-Attention-enhanced feature $\hat{f}^{k}_{i}$. The fusion is formulated as

$$f^{k}_{i} = \alpha^{k}_{i}\, \bar{f}^{k}_{i} + \left(1 - \alpha^{k}_{i}\right) \hat{f}^{k}_{i}.$$
The weight $\alpha^{k}_{i}$ controls the relative contribution of the two features in the final fused representation. As $\alpha^{k}_{i}$ varies, the model can be dynamically adjusted between the Contrastive-Learning-reinforced features $\bar{f}^{k}_{i}$ and the Multi-Head-Attention-enhanced features $\hat{f}^{k}_{i}$, thereby improving the model's adaptability.
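The gating computation can be sketched as follows, assuming a two-layer MLP with a ReLU hidden layer and a sigmoid output as formulated above; the hidden width is an arbitrary illustrative choice.

```python
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    """AFM sketch: an MLP gate mixes the contrast- and attention-enhanced features."""

    def __init__(self, d: int, hidden: int = 128):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * d, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # constrains the fusion weight to (0, 1)
        )

    def forward(self, contrast_feat: torch.Tensor, attn_feat: torch.Tensor) -> torch.Tensor:
        alpha = self.gate(torch.cat([attn_feat, contrast_feat], dim=-1))  # (..., 1)
        # Convex combination of the two enhanced representations
        return alpha * contrast_feat + (1.0 - alpha) * attn_feat


fused = AdaptiveFusion(d=64)(torch.randn(5, 64), torch.randn(5, 64))  # (5, 64)
```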
3.5. IPRM
In addition to addressing the problem of uneven distribution within classes, and to obtain more accurate feature aggregation, we use intra-class weighting for prototype generation, as shown in the IPRM component of Figure 2. Unlike the direct class-mean approach, this strategy uses the intra-class similarity matrix generated by the Multi-Head Attention mechanism to aggregate the different features of each class. Specifically, for the fused support features $f^{k}_{i}$ and the intra-class matrix $M^{k}$ generated by Multi-Head Attention, each $f^{k}_{i}$ is weighted using the similarity matrix:

$$\tilde{f}^{k}_{i} = \sum_{j=1}^{N} M^{k}_{ij}\, f^{k}_{j},$$
where $\tilde{f}^{k}_{i}$ denotes the weighted feature of the i-th fused support feature within the k-th class, and $M^{k}_{ij}$ denotes the intra-class similarity between the i-th and j-th support features within the k-th class. The final weighted feature matrix $\tilde{F}^{k}$ represents the features after adding the similarity weights within the class. The weighted features are averaged to obtain the prototype feature of the class:

$$p^{k} = \frac{1}{N} \sum_{i=1}^{N} \tilde{f}^{k}_{i}.$$
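The two aggregation steps above can be sketched as follows; here the intra-class similarity matrix is recomputed from cosine similarities purely for illustration, whereas in FC-SBAAT it is produced by the Multi-Head Attention stage of FCM.

```python
import torch
import torch.nn.functional as F


def rectified_prototype(class_feats: torch.Tensor, intra_sim: torch.Tensor) -> torch.Tensor:
    """IPRM sketch: relation-weighted aggregation of one class's fused support features.

    class_feats: (N, d) fused support features; intra_sim: (N, N) intra-class similarity matrix M^k.
    """
    weighted = intra_sim @ class_feats   # relation-weighted features, one row per support sample
    return weighted.mean(dim=0)          # average the weighted features into the class prototype


feats = torch.randn(5, 64)
sim = F.cosine_similarity(feats.unsqueeze(1), feats.unsqueeze(0), dim=-1)  # illustrative M^k
proto = rectified_prototype(feats, sim)                                    # (64,) rectified prototype
```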
3.6. SBAATM
In the classification stage, we adopt a novel classification architecture based on the Sparse Bias-Aware Attention Transformer, which reconstructs the conventional classification paradigm by incorporating similarity as a bias mechanism. The overall framework is depicted in Figure 2, while the detailed structure is presented in Figure 3. Unlike traditional CNN-based Relation Networks that rely on local feature interactions, we design a Transformer with the capability of global relational modeling. This module replaces the cross-attention mechanism in the Transformer decoder with a Top-t sparse attention strategy and integrates inter-class similarity bias. By constructing an implicit relational graph along the feature channel dimension, the classifier adaptively captures semantic associations across regions. Specifically, during the encoding phase, we concatenate the query features with the class prototypes to form a sequence $Z = \left[F_q\,;\, P\right]$, where $P = \left\{p^{1}, \dots, p^{K}\right\}$ denotes the set of rectified prototypes. Through a Multi-Head Attention mechanism, we obtain inter-class relations, and the entire process can be formulated as follows:

$$\tilde{Z} = \mathrm{Encoder}(Z).$$
Figure 3.
SBAATM architecture.
The feature sequence $\tilde{Z}$, which contains inter-class relationships, is obtained through processing by the encoder.
During the decoding phase, as shown in Figure 3, the decoder first applies masked multi-head self-attention to the query feature sequence to form an intermediate decoder representation, where the mask adopts a diagonal masking scheme. The resulting output is then passed through an Add & Norm layer to obtain the decoder query representation $Q_d$, while the key and value matrices are derived from the encoder output $\tilde{Z}$. $Q_d$ is fed into the proposed Sparse Bias-Aware Attention module, in which the sparse attention mechanism replaces the conventional cross-attention. A bias term B is introduced during attention computation, which is calculated based on the cosine similarity between the query features and the prototypes:

$$B_{m,k} = \frac{q_{m} \cdot p^{k}}{\left\|q_{m}\right\| \left\|p^{k}\right\|},$$
where $q_{m}$ represents the query feature, and $p^{k}$ denotes the prototype feature. The cosine similarity is used as the bias term B, which is integrated into the sparse attention computation to dynamically adjust the similarity between the query and the prototype. This enhances the model's ability to focus on important features. The sparse attention mechanism selects the most relevant prototypes, thereby reducing computational cost and improving classification performance. The final sparse attention computation is defined as follows:
We employ a Top-t sparsification strategy that retains only the t highest-value prototype connections for each query feature, where $t \leq K$ and $K$ denotes the number of classes (K-way) in an episode. Let $H$ denote the number of attention heads. For the h-th head, the per-head dimension is defined as

$$d_{h} = D / H.$$
The attention logits are computed by

$$E_{h} = \frac{Q_{h} K_{h}^{\top}}{\sqrt{d_{h}}} + B,$$
where B is a cosine-similarity-based bias term.
For each query index $m$, we select the Top-t indices from the m-th row of $E_{h}$:

$$\mathcal{T}_{m} = \operatorname{Top\text{-}t}\!\left(\left(E_{h}\right)_{m,:}\right),$$
where $\left|\mathcal{T}_{m}\right| = t$. The attention weights are then computed by normalizing the selected logits:

$$A_{m,k} = \frac{\exp\!\left(\left(E_{h}\right)_{m,k}\right)}{\sum_{j \in \mathcal{T}_{m}} \exp\!\left(\left(E_{h}\right)_{m,j}\right)}, \quad k \in \mathcal{T}_{m},$$
and $A_{m,k} = 0$ for $k \notin \mathcal{T}_{m}$. The head output for the m-th query is obtained by aggregating the corresponding values over the selected indices:

$$o^{h}_{m} = \sum_{k \in \mathcal{T}_{m}} A_{m,k}\, v^{h}_{k},$$
Finally, the multi-head output is produced by concatenation and linear projection:

$$o_{m} = \mathrm{Concat}\!\left(o^{1}_{m}, \dots, o^{H}_{m}\right) W^{O}.$$
We realize the Top-t selection and subsequent computations via top-k selection and index-gathering operations, so the normalization and value aggregation are carried out only on the selected entries, avoiding explicit materialization of a dense attention matrix.
By applying the aforementioned Top-t sparsification strategy to the attention logits integrated with this similarity bias, this module retains only the t most relevant prototype connections for each query. It then aggregates the values and outputs the Multi-Head Attention result.
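The biased Top-t computation can be illustrated with the single-head PyTorch sketch below; head splitting, masking, and the surrounding decoder layers are omitted, and the tensor shapes and `top_t` value are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F


def sparse_bias_aware_attention(q, k, v, protos, top_t=3):
    """Single-head sketch: cosine-similarity bias plus Top-t sparse aggregation.

    q: (M, d) decoder query states; k, v: (K, d) prototype-side keys/values;
    protos: (K, d) rectified prototypes used to build the bias term B.
    """
    d = q.shape[-1]
    # Bias B from cosine similarity between queries and rectified prototypes
    bias = F.cosine_similarity(q.unsqueeze(1), protos.unsqueeze(0), dim=-1)  # (M, K)
    logits = q @ k.t() / d ** 0.5 + bias                                     # biased logits E

    # Keep only the t most relevant prototype connections per query
    top_vals, top_idx = logits.topk(top_t, dim=-1)                           # (M, t)
    weights = top_vals.softmax(dim=-1)                                       # normalize selected logits only
    selected_v = v[top_idx]                                                  # (M, t, d) gathered values
    return (weights.unsqueeze(-1) * selected_v).sum(dim=1)                   # (M, d) head output


out = sparse_bias_aware_attention(torch.randn(16, 64), torch.randn(5, 64),
                                  torch.randn(5, 64), torch.randn(5, 64), top_t=3)
```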
The output of the sparse attention module then passes sequentially through the subsequent Add & Norm layer, feed-forward network, and final Add & Norm layer of the decoder. These steps further refine the relational features between queries and prototypes, which are then processed through ReLU and Sigmoid activation functions to generate the class-level matching score matrix R. The computation can be expressed as

$$R_{m,k} = \mathrm{Sigmoid}\!\left(\mathrm{ReLU}\!\left(\hat{o}_{m}\right)\right)_{k},$$

where $\hat{o}_{m}$ denotes the refined decoder output for the m-th query after the final Add & Norm layer.
Here, k denotes the k-th class, and $R_{m,k}$ indicates the matching score between the query sample $x^{q}_{m}$ and the prototype class $p^{k}$. The complete network is ultimately trained end to end, employing MSE as the loss function. This procedure is mathematically expressed as

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{MK} \sum_{m=1}^{M} \sum_{k=1}^{K} \left(R_{m,k} - \mathbb{1}\!\left(y^{q}_{m} = y^{k}\right)\right)^{2},$$
where $\mathbb{1}(\cdot)$ is an indicator function. Here, $y^{q}_{m}$ denotes the ground-truth label of the m-th query sample $x^{q}_{m}$, and $y^{k}$ denotes the class label associated with the k-th prototype $p^{k}$. The indicator $\mathbb{1}\!\left(y^{q}_{m} = y^{k}\right)$ returns 1 if $y^{q}_{m} = y^{k}$ and 0 otherwise. Therefore, a matched (query and prototype class) pair has a target value of 1, while an unmatched pair has a target value of 0.
The overall training objective combines the matching loss with the contrastive term,

$$\mathcal{L} = \lambda_{1}\, \mathcal{L}_{\mathrm{MSE}} + \lambda_{2}\, \mathcal{L}_{\mathrm{CL}},$$

where $\lambda_{1}$ and $\lambda_{2}$ are hyperparameters. Finally, the model's accuracy on the test tasks is calculated as

$$\mathrm{Acc} = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\!\left(\hat{y}^{q}_{m} = y^{q}_{m}\right),$$
where $\hat{y}^{q}_{m}$ denotes the predicted class label of the m-th query sample, which is determined by the decoder-produced matching score matrix R as

$$\hat{y}^{q}_{m} = \arg\max_{k} R_{m,k}.$$
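A compact sketch of the episode-level objective and accuracy is shown below; it assumes the contrastive term has already been computed, and the default weights mirror the $\lambda_1$/$\lambda_2$ values reported in Section 4.2.

```python
import torch
import torch.nn.functional as F


def episode_loss_and_accuracy(scores, query_labels, loss_cl, lam1=0.9, lam2=0.1):
    """MSE matching loss against one-hot (query, class) targets plus the contrastive term.

    scores: (M, K) matching score matrix R; query_labels: (M,) episode labels in [0, K).
    """
    targets = F.one_hot(query_labels, num_classes=scores.shape[1]).float()
    loss = lam1 * F.mse_loss(scores, targets) + lam2 * loss_cl
    preds = scores.argmax(dim=-1)                     # predicted label = argmax_k R[m, k]
    accuracy = (preds == query_labels).float().mean()
    return loss, accuracy


loss, acc = episode_loss_and_accuracy(torch.rand(80, 5), torch.randint(0, 5, (80,)),
                                      loss_cl=torch.tensor(0.5))
```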
Finally, we summarize the overall proposed method in Algorithm 1.
Algorithm 1. The training procedure of FC-SBAAT.
4. Experimental Results and Discussion
This section begins by detailing the experimental setup, including the dataset and implementation specifics (Sections 4.1 and 4.2). Subsequently, we evaluate the proposed method through comparisons with contemporary state-of-the-art approaches to demonstrate its performance merits (Section 4.3). Ablation analysis is carried out to examine the influence of individual modules on the model's overall results (Section 4.4). Finally, t-SNE visualization is employed to illustrate the model's performance across various tasks, providing deeper insights into its feature space distribution and classification capabilities (Section 4.5).
4.1. Datasets and Experimental Environment
The MiniImageNet and CUB datasets were chosen for this study to systematically assess the performance of the proposed FC-SBAAT method in Few-Shot Learning tasks. A brief overview of the two datasets is provided below:
MiniImageNet: The MiniImageNet dataset [3] is a widely used benchmark for Few-Shot Learning. It is derived from ImageNet and consists of 100 randomly selected categories, each with 600 images resized to 84 × 84 pixels. The dataset is split into three parts: 64 categories for training, 16 for validation, and 20 for testing.
CUB (Caltech-UCSD Birds 200): The CUB-200-2011 dataset [35] is a standard benchmark for fine-grained image categorization and contains 11,788 images of 200 bird categories, with approximately 30 to 60 images per category. This dataset presents a high challenge for Few-Shot Learning due to the high inter-category similarity and intra-category variation. In this study, 200 categories are randomly divided into 100 for training, 50 for validation, and the remaining 50 for testing. The standard training-validation-testing split and the corresponding evaluation scheme specified for CUB-200-2011 are adopted throughout all experiments.
Experimental configuration environment: Intel Core i9-13900F CPU (Intel Corporation, Santa Clara, CA, USA), 32 GB RAM, NVIDIA RTX 4060 GPU (NVIDIA, Santa Clara, CA, USA), Windows 11 (64-bit, Microsoft, Redmond, WA, USA), and Python 3.11.8 (https://www.python.org, accessed on 4 January 2026).
4.2. Experimental Details
This study adopts the episode training paradigm for Few-Shot Learning. All key training parameters are explicitly defined through configuration files and code to ensure full reproducibility of the experiments. The training phase includes 700,000 training episodes, providing sufficient training volume for the model to fully learn stable feature representations and robust relation matching capabilities.
The Adam optimizer is selected as it effectively balances convergence speed and gradient stability in Few-Shot Learning tasks. The initial learning rate is set to , with an L2 weight decay coefficient of introduced to alleviate overfitting during model training. For the loss function, the hyperparameters $\lambda_{1}$ and $\lambda_{2}$ are set to 0.9 and 0.1, respectively, to optimize the trade-off between intra-class cohesion and inter-class separation in feature learning [36,37].
The sample composition of each training episode follows the K-way N-shot setting for Few-Shot tasks: a single task contains five classes, each with one or five support samples corresponding to the two core task scenarios of 5-way 1-shot and 5-way 5-shot. Each class is paired with 16 query samples for model optimization. In the Contrastive Learning module, the temperature coefficient is set to 0.1. This parameter adjusts the discrimination scale of feature similarity, enhancing the model’s ability to distinguish fine-grained similar features and improving the discriminability of feature representations.
The validation phase includes 500 validation episodes, with model performance evaluated every 5000 training episodes. The final experimental results are calculated as the average of the evaluation data from these 500 validation episodes, and the 95% confidence interval is used to measure the statistical reliability of the results.
4.3. Comparison with State of the Arts
In this section, we conduct a comprehensive evaluation of the proposed FC-SBAAT method using the MiniImageNet and CUB datasets, comparing it with existing state-of-the-art techniques. As indicated in Table 1, FC-SBAAT shows notable improvements in both 1-shot and 5-shot tasks on MiniImageNet and CUB, surpassing other methods. These results highlight the enhanced performance and efficiency of the proposed approach.
Table 1.
Mean classification accuracy (%) on MiniImageNet and CUB (5-way 1-shot/5-shot).
MAML and MetaOptNet are optimization-based meta-learning methods that achieve fast adaptation mainly through within-task parameter updates or explicit optimization. Under Few-Shot conditions, these methods can be sensitive to the optimization trajectory and initialization, and they often fail to fully exploit fine-grained relationships among samples, which may lead to performance bottlenecks when the class structure is complex. As shown in Table 1, under the Conv4 setting, FC-SBAAT achieves 55.71% and 73.87% accuracy on MiniImageNet for 1-shot and 5-shot, respectively, which is clearly higher than MAML with 48.70% and 63.11% and MetaOptNet with 51.28% and 69.71%. This indicates that relying solely on within-task optimization is insufficient to stably capture fine-grained discriminative cues in Few-Shot regimes, whereas explicit relation modeling and more robust class-level representations are more critical [44]. On CUB, FC-SBAAT further reaches 70.37% and 83.86% accuracy for 1-shot and 5-shot, respectively, again outperforming the aforementioned optimization-based baselines. This result further suggests that, in fine-grained scenarios with larger intra-class variations, optimization-driven adaptation alone is more vulnerable to noisy support samples and prototype shift and thus requires a more robust class representation construction strategy. When adopting ResNet-12 as the backbone, FC-SBAAT still maintains a consistent advantage over optimization-based baselines. Table 1 shows that, on MiniImageNet, FC-SBAAT achieves 67.42% and 85.71% accuracy for 1-shot and 5-shot, respectively, which is higher than MetaOptNet with 62.64% and 78.63% and also exceeds the cosine classifier baseline with 55.43% and 77.18%. This implies that even with a stronger feature extractor, optimization-based methods may not sufficiently convert the representational gains of the backbone into stable task-level discrimination, while relation enhancement and prototype refinement are more effective for task generalization [45]. On CUB, FC-SBAAT attains 77.13% and 91.48% accuracy for 1-shot and 5-shot, respectively, surpassing MetaOptNet with 72.00% and 84.20% and the cosine classifier with 67.30% and 84.75%. This demonstrates that, for fine-grained datasets, relying solely on decision-level optimization or fast adaptation is still inadequate to suppress representation shift caused by intra-class variations and noise-driven prototype bias, whereas FC-SBAAT can produce more robust class representations and more reliable matching decisions through effective relation modeling and prototype generation refinement, thereby achieving consistent performance gains across different backbones [46,47].
Compared with metric-learning methods, the advantages of FC-SBAAT are mainly reflected in two aspects: feature representation quality and the stability of prototypes and matching. ProtoNet forms class prototypes by aggregating support samples within each class. RelationNet computes similarities between the support set and the query set via a relation module. DeepEMD [41] measures distances using local matching based on Earth Mover’s Distance (EMD). SLTRN enhances inter-sample relation modeling by combining the Transformer self-attention mechanism with relational reasoning. TADAM [39] improves cross-class generalization by strengthening the task adaptivity of the metric space. However, under extremely Few-Shot settings, these methods are still susceptible to noisy samples, intra-class variations, and prototype shift. In particular, although SLTRN mainly improves the expressiveness of inter-sample relations, its class representation construction inevitably relies on holistic aggregation of support features, and it lacks an explicit mechanism to suppress prototype bias caused by uneven support quality or noisy samples. As a result, when the support set contains abnormal samples or exhibits quality fluctuations, class-level representations are more likely to drift, thereby undermining the stability and discriminability of the matching process. As reported in Table 1, under the Conv4 backbone, FC-SBAAT improves upon DeepEMD on MiniImageNet by increasing the 1-shot accuracy from 53.81% to 55.71% and the 5-shot accuracy from 70.56% to 73.87%, and it further boosts the 5-shot accuracy on CUB from 83.58% to 83.86%. Meanwhile, FC-SBAAT also achieves more consistent gains over SLTRN: on MiniImageNet, the 1-shot accuracy increases from 52.11% to 55.71% and the 5-shot accuracy increases from 66.54% to 73.87%; on CUB, the 1-shot accuracy increases from 67.55% to 70.37%, and the 5-shot accuracy increases from 80.07% to 83.86%. Under the ResNet-12 backbone, FC-SBAAT further surpasses the strong baseline BML, improving the 1-shot accuracy on MiniImageNet from 67.04% to 67.42% and the 5-shot accuracy from 83.63% to 85.71% and also improving the 1-shot accuracy on CUB from 76.21% to 77.13% and the 5-shot accuracy from 90.45% to 91.48%. These results indicate that FC-SBAAT can still deliver stable improvements under a stronger backbone.
For FC-SBAAT, we adopt a feature co-enhancement strategy to optimize the prototype generation process. By modeling inter-sample relations in a more fine-grained manner, FC-SBAAT produces more representative class prototypes and better captures subtle inter-class differences. This design enables FC-SBAAT to consistently outperform ProtoNet, which mainly relies on simple prototype estimation, under different backbones. Under Conv4, ProtoNet achieves 49.42% and 68.20% accuracy on MiniImageNet for 1-shot and 5-shot and 54.52% and 73.30% accuracy on CUB, whereas FC-SBAAT achieves 55.71% and 73.87% on MiniImageNet and 70.37% and 83.86% on CUB, corresponding to relative improvements of 12.72% and 8.31% on MiniImageNet and 29.07% and 14.41% on CUB. Under ResNet-12, ProtoNet achieves 62.39% and 80.53% on MiniImageNet and 71.88% and 87.42% on CUB, while FC-SBAAT achieves 67.42% and 85.71% on MiniImageNet and 77.13% and 91.48% on CUB, yielding relative improvements of about 8.06% and 6.43% on MiniImageNet and 7.30% and 4.64% on CUB. Overall, the results in Table 1 demonstrate that FC-SBAAT achieves superior Few-Shot recognition performance under both Conv4 and ResNet-12, and the improvement is more evident on the fine-grained CUB dataset, which aligns with the design goal of relation enhancement and more robust prototypes.
4.4. Ablation Study
4.4.1. The Impact of Different Structures
To evaluate the impact of individual sub-modules in the prototype rectification process of FC-SBAAT, we conduct ablation studies (see Table 2) to systematically quantify the contribution of each component and its role in both feature enhancement and prototype generation. Specifically, we decompose the Feature Cooperative Enhancement Module (FCM) into two components: the Multi-Head Attention enhancement module (MHA) and the Contrastive Learning enhancement module (CL). The former strengthens feature interactions via a Multi-Head Attention mechanism, while the latter improves discriminability by pulling samples from the same class closer and pushing samples from different classes apart through Contrastive Learning. In addition, we adopt a Relation Network paradigm as the baseline, where a Sparse Bias-Aware Transformer is used as the relation function. For fair comparison, all methods are evaluated on MiniImageNet and CUB under both 1-shot and 5-shot settings. The baseline model employs Conv-4 as the backbone, and all variants use the same feature enhancement strategy described above, forming the variants Base-MHA, Base-FCM, Base-FCM-AFM, and Base-FCM-AFM+IPRM to progressively verify the effectiveness and synergy of the proposed modules.
Table 2.
Ablation study on FC-SBAAT: classification accuracy (%) on MiniImageNet and CUB.
Table 2 reports the mean accuracy with 95% confidence intervals on MiniImageNet and CUB under different sub-module combinations, aiming to quantify how the key components in FC-SBAAT contribute to feature enhancement and prototype refinement. Overall, the results exhibit a clear monotonic and steady improvement as MHA, FCM, AFM, and IPRM are progressively incorporated. This consistent trend across both datasets and both shot settings indicates that the proposed components provide complementary benefits along two core paths, namely, representation reinforcement and prototype rectification, with their cumulative effects driving gradual performance optimization.
Base serves as the prototype-based reference. After introducing the Multi-Head Attention module, Base-MHA yields steady incremental gains on both datasets, improving MiniImageNet from 52.42% to 53.11% in 1-shot and from 67.10% to 71.37% in 5-shot while improving CUB from 67.86% to 68.92% in 1-shot and from 80.36% to 81.64% in 5-shot. The more noticeable incremental improvement on MiniImageNet 5-shot suggests that Multi-Head Attention effectively captures fine-grained cross-sample dependencies and stabilizes feature interactions when more support evidence is available, which translates to more tangible gains in scenarios with relatively sufficient samples.
Replacing MHA with the full feature co-enhancement strategy further achieves steady performance gains. Base-FCM consistently outperforms both Base and Base-MHA, reaching 53.88% and 71.81% on MiniImageNet and 69.17% and 82.38% on CUB under 1-shot and 5-shot, respectively. These results confirm that co-enhancement provides richer and more discriminative representations than attention-only reinforcement, thereby yielding more consistent incremental gains for metric-based classification. The steady performance lift demonstrates that the synergistic effect of MHA and CL is conducive to refining feature quality without introducing redundant information.
When AFM is incorporated on top of Base-FCM, performance continues to show steady incremental improvements in all settings, with Base-FCM-AFM achieving 54.95% and 72.62% on MiniImageNet and 69.81% and 83.03% on CUB. This observation highlights that adaptive fusion is critical for converting multi-source enhanced features into a unified and stable representation—even with modest accuracy gains, the functional value of AFM is evident. By dynamically reweighting reinforced cues, AFM reduces information cancellation and suppresses noise accumulation, which directly improves the reliability of support features and thus the quality of prototype estimation.
Finally, adding IPRM leads to further steady incremental improvements, and the complete model attains the best results in every setting, reaching 55.71% and 73.87% on MiniImageNet and 70.37% and 83.86% on CUB. The additional modest yet meaningful gains over Base-FCM-AFM indicate that the remaining bottleneck lies not only in representation quality but also in the sensitivity of prototype construction under limited support. IPRM explicitly exploits intra-class structural cues to reweight and constrain prototype formation, mitigating prototype shift induced by atypical samples and improving decision stability, an important merit beyond mere incremental performance gains, especially in small-sample scenarios where result reliability is crucial.
In the ablation analysis of the classification module (see Table 3), we adopt the vanilla Transformer architecture as the baseline and construct an improved classification module by introducing a bias-aware mechanism, thereby systematically validating its effectiveness on representative Few-Shot classification benchmarks. To ensure a fair comparison, all methods consistently employ Conv-4 as the feature extraction backbone and share the same prototype construction pipeline, i.e., the framework composed of FCM, AFM, and IRPM, and we vary only the classification module to eliminate confounding factors. The experimental results show that the improved classification module delivers stable performance gains under all task configurations, with more pronounced improvements when labeled samples are extremely limited. This advantage primarily stems from the bias-aware mechanism, which directionally strengthens semantic correlations. Compared with the global unbiased similarity computation in conventional attention, the bias-guided strategy better focuses on task-relevant discriminative features.
Table 3.
Ablation study on classification module: classification accuracy (%) comparison on MiniImageNet and CUB datasets.
4.4.2. The Impact of Different Hyper-Parameters
This experiment involved a comprehensive evaluation of the FC-SBAAT model on the MiniImageNet and CUB datasets for 1-shot and 5-shot tasks, systematically varying the number of query images and recording the classification accuracies for each configuration. The results are presented in Figure 4, where Figure 4a shows the 1-shot setting, and Figure 4b shows the 5-shot setting.
Figure 4.
Impact of query sample size ((a): impact on 1-shot task; (b): impact on 5-shot task).
The classification accuracy of FC-SBAAT improves noticeably as the number of query images increases in both 1-shot and 5-shot tasks. Performance shows a strong positive correlation with the query set size: a larger query set exposes the model to more task-level context, improving its ability to discriminate subtle category differences. In other words, more query images provide richer information about the task from different perspectives, which ultimately enhances classification accuracy.
However, beyond a certain point the improvement plateaus, indicating a saturation effect in the impact of query set size on performance. Further increasing the number of query images yields only marginal gains while consuming additional computational resources. In practice, therefore, a sensible choice of query set size is crucial for model efficiency.
In this experiment, performance clearly levels off once the number of query images reaches 15, and adding further query images no longer yields a significant accuracy gain. To exploit the model's potential while avoiding unnecessary computational overhead, we therefore chose 16 query images as the final setting. This choice balances computational cost against classification accuracy, ensuring efficient training without sacrificing performance.
Figure 5a,b show the performance of FC-SBAAT in 1-shot and 5-shot tasks on the MiniImageNet and CUB datasets. As the number of heads in the Multi-Head Attention mechanism varies, classification accuracy remains stable and does not exhibit notable fluctuations, indicating that FC-SBAAT is robust to this hyperparameter choice.
Figure 5.
Impact of the number of attention heads ((a): on MiniImageNet dataset; (b): on CUB dataset).
Specifically, although accuracy varies slightly across configurations as the number of attention heads increases, it remains within a narrow range without significant degradation. On both MiniImageNet and CUB, FC-SBAAT improves steadily as heads are added and eventually plateaus as the number continues to grow.
This suggests that FC-SBAAT maintains its performance under different hyperparameter settings and adapts well to changes in the number of attention heads, reflecting both strong generalization ability and flexibility in configuring the Multi-Head Attention mechanism.
4.4.3. The Impact of Different Feature Enhancement Strategies
To assess the effectiveness of the proposed feature enhancement approach, we conducted experiments on the MiniImageNet and CUB datasets under both 1-shot and 5-shot settings. The results are shown in Table 4, where our feature enhancement scheme achieves consistent improvements across the task settings.
Table 4.
Comparison of feature enhancement strategies: classification accuracy (%) on MiniImageNet and CUB datasets.
Our feature enhancement scheme adopts a stepwise strategy during data preprocessing: the enhancement magnitude applied to each sample is increased gradually, so every sample is transformed at several intensities and the resulting data are more diverse. This stepwise strategy increases data diversity and helps the model capture fine-grained differences among samples, allowing it to extract more reliable features from the enriched training data and improving its adaptability when only a few labeled samples are available.
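As an illustration of this stepwise idea, the sketch below builds several augmentation pipelines of gradually increasing intensity using standard torchvision transforms. The specific transforms, the number of levels, and the strength schedule are assumptions made for illustration, not the exact scheme used in the paper.

```python
from torchvision import transforms

def progressive_augmentations(n_levels: int = 3, max_jitter: float = 0.4):
    """Illustrative stepwise augmentation: each level applies the same
    transforms at a gradually increasing intensity."""
    pipelines = []
    for level in range(1, n_levels + 1):
        strength = level / n_levels  # e.g., 1/3, 2/3, 1.0 for three levels
        pipelines.append(transforms.Compose([
            transforms.RandomResizedCrop(84, scale=(1.0 - 0.5 * strength, 1.0)),
            transforms.ColorJitter(*(max_jitter * strength,) * 3),
            transforms.RandomHorizontalFlip(p=0.5 * strength),
            transforms.ToTensor(),
        ]))
    return pipelines

# Example: generate three progressively stronger views of a PIL image `img`
# (hypothetical input, loaded elsewhere):
# views = [aug(img) for aug in progressive_augmentations()]
```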
The experimental findings validate the efficacy of the proposed scheme. On MiniImageNet, feature enhancement improves accuracy by 0.22% and 0.53% in the 1-shot and 5-shot tasks, respectively, compared with the other strategies. On CUB, the corresponding improvements are 0.45% and 0.11%. These results suggest that progressively adjusting the enhancement strength and exposing the model to subtle sample variations consistently benefits Few-Shot Learning performance.
4.4.4. The Impact of Top-t Sparsification on Computational Complexity and Inference Time
To quantitatively evaluate the impact of the Top-t sparsification mechanism on the computational complexity and inference efficiency of the Transformer-based relation network, we conduct an ablation study by comparing the Vanilla Transformer relation network with its Top-t-equipped variant and report FLOPs, parameter count, and processing time, as summarized in Table 5.
Table 5.
Model complexity and processing time comparison of the Transformer relation network.
The results show that the number of parameters is identical for both variants (Table 5), since Top-t sparsification only alters which attention connections are retained and does not change the dimensions or the number of weights in the query, key, value, and output projection matrices. Meanwhile, the FLOPs decrease only slightly (Table 5). This is mainly because the scoring stage still computes attention logits against all prototypes/keys to enable Top-t selection, so the dominant computation is preserved, which bounds the attainable FLOP reduction. The observed decrease comes primarily from the post-selection stages, where softmax normalization and value-weighted aggregation are carried out over the selected Top-t subset rather than the full key set, reducing the non-dominant exponential and multiply–accumulate operations.
In contrast, the processing time decreases noticeably (Table 5), corresponding to about a 1.39× speedup, i.e., roughly a 28% reduction in latency. This gain mainly arises after Top-t selection: the attention weights are normalized only over the selected t entries, and the output is aggregated from the corresponding t value vectors, which reduces practical memory traffic and cache pressure. Overall, the ablation study indicates that Top-t sparsification improves inference efficiency while maintaining global scoring over all prototypes/keys in the Transformer relation network.
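The mechanism described above can be summarized in a short sketch: the full logits are still computed against all keys, but normalization and aggregation run only over the Top-t entries. The function name, tensor shapes, and the single-head form are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def top_t_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, t: int) -> torch.Tensor:
    """Illustrative Top-t sparsified attention.

    q: [Q, D] queries; k, v: [N, D] keys and values; t: entries kept per query.
    Scoring is still done against all N keys (the dominant QK^T cost is
    unchanged); only the softmax and value aggregation run on the Top-t subset.
    """
    logits = (q @ k.t()) / (q.size(-1) ** 0.5)          # [Q, N] full scoring stage
    top_logits, top_idx = logits.topk(t, dim=-1)        # keep the t strongest connections
    weights = F.softmax(top_logits, dim=-1)             # normalise over t entries only
    top_v = v[top_idx]                                  # [Q, t, D] gather selected values
    return (weights.unsqueeze(-1) * top_v).sum(dim=1)   # [Q, D] sparse aggregation


# Example: 75 query features attending over 5 prototypes, keeping the top 3.
out = top_t_attention(torch.randn(75, 64), torch.randn(5, 64), torch.randn(5, 64), t=3)
print(out.shape)  # torch.Size([75, 64])
```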
4.5. Visualization
Beyond the quantitative analysis, we also visualize the extracted features in low dimension using t-SNE to further examine the internal states of the FC-SBAAT model, as shown in Figure 6 and Figure 7. In these experiments, we randomly selected five novel classes, each with 21 embedded features, and present the t-SNE results before and after feature enhancement and prototype optimization for the 1-shot and 5-shot tasks on the MiniImageNet and CUB datasets. In each subplot, the left side shows the original feature distribution without model processing, while the right side shows the distribution after feature enhancement and prototype calibration.
Figure 6.
t-SNE visualization for the 1-shot and 5-shot tasks on MiniImageNet (left of each subplot: before; right: after feature enhancement and prototype optimization).
Figure 7.
t-SNE visualization for the 1-shot and 5-shot tasks on CUB (left of each subplot: before; right: after feature enhancement and prototype optimization).
In the visualizations presented in Figure 6 and Figure 7, the left subfigures illustrate the results without task-related feature enhancement and prototype optimization. It can be observed that the prototypes of each class are scattered, failing to effectively capture class centers or distinguish inter-class differences—indicating that raw features lack sufficient discriminative power for class separation. In contrast, the right subfigures demonstrate the outcomes after applying the proposed feature enhancement and prototype optimization modules. Notably, the prototypes become more compact and representative, with significantly improved separability between distinct classes. This visualization evidence confirms that FC-SBAAT not only enhances inter-class separability and prototype representativeness but also strengthens the model’s adaptability to task-specific structures. Consequently, the model achieves more accurate and robust generalization when classifying novel categories.
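For reference, a minimal sketch of this kind of inspection is shown below: class embeddings are projected to two dimensions with t-SNE and colored by class. The function name, the t-SNE hyperparameters, and the random features used as a placeholder input are assumptions for illustration only.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features: np.ndarray, labels: np.ndarray, title: str) -> None:
    """Project per-class embeddings to 2-D with t-SNE and colour points by class."""
    emb = TSNE(n_components=2, perplexity=15, init="pca", random_state=0).fit_transform(features)
    for c in np.unique(labels):
        pts = emb[labels == c]
        plt.scatter(pts[:, 0], pts[:, 1], s=12, label=f"class {c}")
    plt.title(title)
    plt.legend()
    plt.show()

# Example: 5 novel classes x 21 embeddings each, matching the visualization setup.
feats = np.random.randn(105, 64).astype(np.float32)   # placeholder for extracted features
labels = np.repeat(np.arange(5), 21)
plot_tsne(feats, labels, "before enhancement")
```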
5. Conclusions
We propose a novel Few-Shot classification method, FC-SBAAT, built on a Feature-Cooperative Prototype Optimization Strategy and a Task-Adaptive Sparse Bias-Aware Transformer Strategy. By integrating contextual information into intra-class relationship modeling, strengthening intra-class similarity, and maximizing inter-class differences, the method effectively optimizes class prototypes and yields more representative and discriminative prototype representations. In addition, it uses Bias-Aware Attention to guide the model toward task-relevant features, thereby enhancing classification performance.
We performed extensive experiments on the widely adopted MiniImageNet and CUB datasets. The results show that FC-SBAAT outperforms existing state-of-the-art methods on both 1-shot and 5-shot tasks. These findings highlight FC-SBAAT’s significant advances in prototype quality, model performance, and robustness. Notably, FC-SBAAT demonstrates strong inter-class separability even with extremely limited labeled data.
Furthermore, we conducted detailed ablation studies to evaluate the contribution of each module. The results indicate that both the Multi-Head Attention mechanism and contrastive learning play key roles in feature enhancement and prototype refinement. The adaptive feature fusion mechanism dynamically adjusts feature weights, further improving model flexibility and robustness, which leads to significantly enhanced task performance.
Author Contributions
M.W.: conceptualization, methodology, and writing—original draft; C.Y.: conceptualization, methodology, supervision, writing—review and editing, literature review, resources, formal analysis, data analysis, and interpretation; L.S.: conceptualization and research framework design; J.L.: data collection, data curation (image preparation), and methodology (optimization); S.T.: formal analysis, investigation, and validation. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding. The APC was funded by the authors.
Data Availability Statement
The experiments use publicly available datasets. The MiniImageNet dataset is available at https://image-net.org/download.php (accessed on 1 November 2025). The CUB-200-2011 dataset is available at https://www.vision.caltech.edu/datasets/cub_200_2011/ (accessed on 1 November 2025).
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for Few-Shot learning. In Proceedings of the Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3588–3597. [Google Scholar]
- Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching networks for one shot learning. In Proceedings of the Advances in Neural Information Processing Systems 29 (NeurIPS 2016), Barcelona, Spain, 5–10 December 2016; pp. 3630–3638. [Google Scholar]
- Cheng, Y.; Yu, M.; Guo, X.; Zhou, B. Few-shot Learning with Meta Metric Learners. arXiv 2019, arXiv:1901.09890. [Google Scholar] [CrossRef]
- Garcia, V.; Bruna, J. Few-shot learning with graph neural networks. arXiv 2017, arXiv:1711.04043. [Google Scholar] [CrossRef]
- Liu, J.; Song, L.; Qin, Y. Prototype rectification for Few-Shot learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part I 16; Springer: Cham, Switzerland, 2020; pp. 741–756. [Google Scholar] [CrossRef]
- Tsai, Y.-H.H.; Salakhutdinov, R. Improving one-shot learning through fusing side information. arXiv 2017, arXiv:1710.08347. [Google Scholar] [CrossRef]
- Li, X.; Tian, T.; Liu, Y.; Yu, H.; Cao, J.; Ma, Z. Adaptive multi-prototype relation network. In Proceedings of the 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand, 7–10 December 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1707–1712. [Google Scholar]
- Zhang, W.; Gu, X. Task-aware prototype refinement for improved Few-Shot learning. Neural Comput. Appl. 2023, 35, 17899–17913. [Google Scholar] [CrossRef]
- Li, Z.; Hu, Z.; Luo, W.; Hu, X. SaberNet: Self-attention based effective relation network for few-shot learning. Pattern Recognit. 2023, 133, 109024. [Google Scholar] [CrossRef]
- Li, H.; Huang, G.; Yuan, X.; Zheng, Z.; Chen, X.; Zhong, G.; Pun, C.-M. PSANet: Prototype-guided salient attention for few-shot segmentation. Vis. Comput. 2025, 41, 2987–3001. [Google Scholar] [CrossRef]
- Lim, J.Y.; Lim, K.M.; Lee, C.P.; Tan, Y.X. SCL: Self-supervised Contrastive Learning for few-shot image classification. Neural Netw. 2023, 165, 19–30. [Google Scholar] [CrossRef] [PubMed]
- Hospedales, T.; Antoniou, A.; Micaelli, P.; Storkey, A. Meta-learning in neural networks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5149–5169. [Google Scholar] [CrossRef] [PubMed]
- Koch, G.; Zemel, R.; Salakhutdinov, R. Siamese Neural networks for one-shot image recognition. In Proceedings of the ICML Deep Learning Workshop, Lille, France, 6–11 July 2015; Volume 2, pp. 1–30. [Google Scholar]
- Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, 6–11 August 2017; PMLR: Cambridge, MA, USA, 2017; Volume 70, pp. 1126–1135. [Google Scholar]
- Li, Z.; Zhou, F.; Chen, F.; Li, H. Meta-SGD: Learning to Learn quickly for few-shot learning. arXiv 2017, arXiv:1707.09835. [Google Scholar]
- Nichol, A.; Schulman, J. Reptile: A scalable metalearning algorithm. arXiv 2018, arXiv:1803.02999. [Google Scholar]
- Gong, Y. Meta-Learning with Differentiable Convex Optimization. Technical Report. EasyChair. 2023. Available online: https://easychair.org/publications/preprint/RJVM (accessed on 2 January 2026).
- Ravi, S.; Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017. [Google Scholar]
- Munkhdalai, T.; Yu, H. Meta networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, 6–11 August 2017; PMLR: Cambridge, MA, USA, 2017; pp. 2554–2563. [Google Scholar]
- Mishra, N.; Rohaninejad, M.; Chen, X.; Abbeel, P. A Simple Neural attentive meta-learner. arXiv 2017, arXiv:1707.03141. [Google Scholar]
- Sung, F.; Zhang, L.; Xiang, T.; Hospedales, T.; Yang, Y. Learning to learn: Meta-critic networks for sample efficient learning. arXiv 2017, arXiv:1706.09529. [Google Scholar] [CrossRef]
- Finn, C.; Xu, K.; Levine, S. Probabilistic model-agnostic meta-learning. In Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018; pp. 9516–9527. [Google Scholar]
- Taud, H.; Mas, J.-F. Multilayer perceptron (MLP). In Geomatic Approaches for Modeling Land Change Scenarios; Springer: Cham, Switzerland, 2018; pp. 451–455. [Google Scholar] [CrossRef]
- Chen, D.; Chen, Y.; Li, Y.; Mao, F.; He, Y.; Xue, H. Self-supervised learning for few-shot image classification. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1745–1749. [Google Scholar] [CrossRef]
- Li, A.; Luo, T.; Xiang, T.; Huang, W.; Wang, L. Few-Shot learning with global class representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9714–9723. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Liu, S.; Johns, E.; Davison, A.J. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 15–20 June 2019; pp. 1871–1880. [Google Scholar] [CrossRef]
- Huang, X.; Choi, S.H. SAPENet: Self-attention based prototype enhancement network for few-shot learning. Pattern Recognit. 2023, 135, 109170. [Google Scholar] [CrossRef]
- Lin, H.; Cheng, X.; Wu, X.; Shen, D. Cat: Cross attention in vision transformer. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar] [CrossRef]
- Hou, R.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Cross attention network for few-Shot classification. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 4003–4014. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2021), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar] [CrossRef]
- Pfeiffer, J.; Kamath, A.; Rücklé, A.; Cho, K.; Gurevych, I. Adapterfusion: Non-destructive task composition for transfer learning. arXiv 2020, arXiv:2005.00247. [Google Scholar] [CrossRef]
- Sun, Z.; Zheng, W.; Wang, M. SLTRN: Sample-level transformer-based relation network for few-shot classification. Neural Netw. 2024, 176, 106344. [Google Scholar] [CrossRef] [PubMed]
- Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset; Technical Report CNS-TR-2011-001; California Institute of Technology: Pasadena, CA, USA, 2011. [Google Scholar]
- Li, H.; Sun, Y.; Qiao, S. Enhanced lithology identification with few-Shot well-logging data using a confidence-enhanced semi-supervised meta-learning approach. Measurement 2025, 247, 116762. [Google Scholar] [CrossRef]
- Li, H.; Qiao, S.; Sun, Y. A depth graph attention-based multi-channel transfer learning network for fluid classification from logging data. Phys. Fluids 2024, 36, 106623. [Google Scholar] [CrossRef]
- Chen, W.-Y.; Liu, Y.-C.; Kira, Z.; Wang, Y.-C.F.; Huang, J.-B. A closer look at few-shot classification. arXiv 2019, arXiv:1904.04232. [Google Scholar] [CrossRef]
- Oreshkin, B.; Rodríguez López, P.; Lacoste, A. TADAM: Task dependent adaptive metric for improved few-shot learning. In Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018; pp. 721–731. [Google Scholar]
- Hao, F.; He, F.; Cheng, J.; Wang, L.; Cao, J.; Tao, D. Collect and select: Semantic alignment metric learning for few-Shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8460–8469. Available online: https://github.com/haofusheng/saml (accessed on 2 January 2026).
- Zhang, C.; Cai, Y.; Lin, G.; Shen, C. DeepEMD: Differentiable Earth Mover’s distance for few-shot learning. arXiv 2020, arXiv:2003.06777. [Google Scholar] [CrossRef]
- Ye, H.-J.; Hu, H.; Zhan, D.-C.; Sha, F. Few-shot learning via embedding adaptation with set-to-set functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA, 13–19 June 2020; pp. 8808–8817. Available online: https://openaccess.thecvf.com/content_CVPR_2020/html/Ye_Few-Shot_Learning_via_Embedding_Adaptation_With_Set-to-Set_Functions_CVPR_2020_paper.html (accessed on 2 January 2026).
- Zhou, Z.; Qiu, X.; Xie, J.; Wu, J.; Zhang, C. Binocular mutual learning for improving few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2021), Montreal, QC, Canada, 10–17 October 2021; pp. 8402–8411. [Google Scholar] [CrossRef]
- Qiao, S.; Huang, M.; Li, H.; Wang, L.; Wenjing, Y.; Sun, Y.; Zhao, Z. FedSSH: A consumer-oriented federated semi-supervised heterogeneous IoMT framework. IEEE Trans. Consum. Electron. 2025, 71, 8465–8476. [Google Scholar] [CrossRef]
- Qiao, S.; Guo, Q.; Wang, M.; Zhu, H.; Rodrigues, J.J.P.C.; Lyu, Z. FRW-TRACE: Forensic-ready watermarking framework for tamper-resistant biometric data and attack traceability in consumer electronics. IEEE Trans. Consum. Electron. 2025, 71, 8234–8245. [Google Scholar] [CrossRef]
- Qiao, S.; Zhu, H.; Sha, L.; Wang, M.; Guo, Q. DynMark: A dynamic packet counting watermarking scheme for robust traffic tracing in network flows. Comput. Secur. 2025, 157, 104571. [Google Scholar] [CrossRef]
- Qiao, S.; Guo, Q.; Wang, M.; Zhu, H.; Rodrigues, J.J.P.C.; Lyu, Z. Advances in network flow watermarking: A survey. Comput. Secur. 2025, 159, 104653. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.